Exam Guide
Domain 1: Collection
Task Statement 1.1: Determine the operational characteristics of the collection system
Task Statement 1.2: Select a collection system that handles the frequency, volume, and source of data
- Describe and characterize the volume and flow characteristics of incoming data (streaming, transactional, batch)
  - Streaming data: Kinesis Data Streams, Amazon MSK
  - Batch data: Glue, Transfer Family, DataSync, Storage Gateway
  - Transactional data: DMS
Task Statement 1.3: Select a collection system that addresses the key properties of data, such as order, format, and compression
- Describe how to capture data changes at the source
  - DMS supports Change Data Capture (CDC).
- Describe how to transform and filter data during the collection process
  - Glue for batch ETL
  - Kinesis Data Analytics for streaming ETL
  - Kinesis Data Firehose and Lambda for streaming ETL
Domain 2: Storage and Data Management
Task Statement 2.1: Determine the operational characteristics of the storage solution for analytics
Services
Analytics
Athena
- In-place querying of data in S3
- Serverless
- PB scale
- Presto under the hood
- Supports SQL and Apache Spark
- Analyzes unstructured, semi-structured, and structured data stored in S3, in formats including CSV, JSON, Apache Parquet, and Apache ORC
- Only successful or canceled queries are billed; failed queries are not billed.
- No charge for DDL statements
- Use cases
  - Ad-hoc queries of web logs and CloudTrail/CloudFront/VPC/ELB logs
  - Querying staging data before loading to Redshift
  - Integration with Jupyter, Zeppelin, RStudio notebooks
  - Integration with QuickSight for visualization
  - Integration via JDBC/ODBC for BI tools
- SQL
  - MSCK REPAIR TABLE updates the metastore with the current partitions of an external table. (AWS Docs - Repairing partitions manually using MSCK repair)
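A minimal boto3 sketch of running the statement above through the Athena API; the table name, database, and output location are illustrative assumptions, not values from the guide:

```python
import boto3

athena = boto3.client("athena")

# Run MSCK REPAIR TABLE so the metastore picks up newly added partitions.
response = athena.start_query_execution(
    QueryString="MSCK REPAIR TABLE web_logs",
    QueryExecutionContext={"Database": "default"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
print(response["QueryExecutionId"])  # poll get_query_execution() for status
```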
Athena - Workgroup
- A workgroup is a collection of users, queries, and results that are associated with a specific data catalog and workgroup settings. Workgroup settings include the S3 location where query results are stored, the encryption configuration, and the data usage control settings.
Athena - Federated Query
- AWS Docs - Using Amazon Athena Federated Query
- Query the data in place, or build pipelines that extract data from multiple data sources and store them in S3.
- Uses data source connectors that run on Lambda to run federated queries.
CloudSearch
- AWS Docs - CloudSearch
- Managed search service, based on Apache Solr
OpenSearch
- AWS Docs - OpenSearch
- Operational analytics and log analytics
- Forked from Elasticsearch
- Number of Shards = Index Size / 30 GB
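A tiny worked example of the sizing rule above (the index size is an assumed figure for illustration):

```python
import math

index_size_gb = 250          # assumed index size
target_shard_size_gb = 30    # rule of thumb from the formula above
print(math.ceil(index_size_gb / target_shard_size_gb))  # -> 9 shards
```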
EMR
- AWS Docs - EMR
- Managed cluster platform that simplifies running big data frameworks, such as Hadoop and Spark, on AWS
- PB scale
- The RunJobFlow API action creates and starts running a new cluster (job flow). The cluster runs the specified steps; after the steps complete, the cluster stops and the HDFS partition is lost.
- An EMR cluster with multiple primary nodes can reside only in one AZ.
- HDFS
  - A typical block size used by HDFS is 128 MB. Thus, an HDFS file is chopped up into 128 MB chunks, and if possible, each chunk will reside on a different DataNode.
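A hedged boto3 sketch of RunJobFlow creating a transient cluster that stops after its steps complete; the names, instance types, and log bucket are illustrative assumptions:

```python
import boto3

emr = boto3.client("emr")

response = emr.run_job_flow(
    Name="transient-spark-cluster",
    ReleaseLabel="emr-6.15.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        # Transient behavior: the cluster terminates once all steps finish,
        # and the HDFS partition is lost, as noted above.
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    LogUri="s3://my-emr-logs/",
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])
```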
EMR cluster - Node types
- AWS Docs - EMR cluster - Node types
- Primary node / Master node
  - Manages the cluster and typically runs primary components of distributed applications
  - Tracks the status of jobs submitted to the cluster
  - Monitors the health of the instance groups
  - HA support from 5.23.0+
- Core nodes
  - Store HDFS data and run tasks
  - Multi-node clusters have at least one core node.
  - One core instance group or instance fleet per cluster, but there can be multiple nodes running on multiple EC2 instances in the instance group or instance fleet.
- Task nodes
  - No HDFS data, therefore no data loss if terminated
  - Can use Spot instances
  - Optional in a cluster
EMR - EC2
- AWS Docs - EMR - EC2

EMR - EKS
- AWS Docs - EMR - EKS

EMR - Serverless
- AWS Docs - EMR - Serverless
EMR - EMRFS
- Direct access to S3 from the EMR cluster
- Persistent storage for EMR clusters
- Benefit from S3 features
- For clusters with multiple users who need different levels of access to data in S3 through EMRFS
  - EMRFS can assume a different service role for cluster EC2 instances based on the user or group making the request, or based on the location of data in S3.
  - Each IAM role for EMRFS can have different permissions for data access in S3.
  - When EMRFS makes a request to S3 that matches the users, groups, or locations that you specify, the cluster uses the corresponding role that you specify instead of the EMR role for EC2.
  - AWS Docs - Authorizing access to EMRFS data in Amazon S3
EMR - S3DistCp
- An extension to DistCp (Apache Hadoop Distributed Copy)
- Uses MapReduce to efficiently copy large amounts of data in a distributed manner
- Copy data from S3 to HDFS
- Copy data from HDFS to S3
- Copy data between S3 buckets
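A hedged sketch of submitting an S3DistCp copy as a step on a running cluster via command-runner.jar; the cluster ID and paths are illustrative assumptions:

```python
import boto3

emr = boto3.client("emr")

emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",  # an existing cluster
    Steps=[{
        "Name": "copy-logs-from-s3-to-hdfs",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            # S3DistCp runs as a MapReduce job, copying in a distributed manner.
            "Args": [
                "s3-dist-cp",
                "--src", "s3://my-bucket/logs/",
                "--dest", "hdfs:///logs/",
            ],
        },
    }],
)
```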
EMR - Security - Apache Ranger
- An RBAC framework to enable, monitor, and manage comprehensive data security across the Hadoop ecosystem
- Centralized security administration and auditing
- Fine-grained authorization across many Hadoop components (Hadoop, Hive, HBase, Storm, Knox, Solr, Kafka, and YARN)
- Syncs policies and users by using agents and plugins that run within the same process as the Hadoop component
- Supports row-level authorization and auditing capabilities with embedded search
EMR - Security Configuration
- AWS Docs - Security Configuration
- At-rest data encryption
  - For WAL with EMR
    - Open-source HDFS encryption
    - LUKS encryption
  - For cluster node local disks (EC2 instance volumes)
    - Open-source HDFS encryption
    - Instance store encryption
      - NVMe encryption
      - LUKS encryption
    - EBS volume encryption
      - EBS encryption (recommended, EMR 5.24.0+)
      - LUKS encryption (only applies to attached storage volumes, not to the root device volume)
  - For EMRFS on S3
    - SSE-S3
    - SSE-KMS
    - SSE-C
    - CSE-KMS: objects are encrypted before being uploaded to S3, and the client uses keys provided by KMS.
    - CSE-Custom: objects are encrypted before being uploaded to S3, and the client uses a custom Java class that provides the client-side master key.
- In-transit data encryption
  - For EMR traffic between cluster nodes, and for WAL with EMR
  - For distributed applications
  - For EMRFS traffic between S3 and cluster nodes
EMR Studio
- Web IDE for fully managed Jupyter notebooks running on EMR clusters
- EMR Notebooks
EMR - Open Source Ecosystem
Delta Lake
- EMR - Delta Lake
- Storage layer framework for lakehouse architectures
Ganglia
- EMR - Ganglia
- Performance monitoring tool for Hadoop and HBase clusters
- Not included in EMR after 6.15.0
Apache HBase
- EMR - Apache HBase
- Wide-column NoSQL store running on HDFS
- EMR WAL support: you can restore an existing WAL, retained for 30 days starting from the time the cluster was created, with a new cluster from the same S3 root directory.
Apache HCatalog
- EMR - Apache HCatalog
- HCatalog is a tool that allows you to access Hive metastore tables within Pig, Spark SQL, and/or custom MapReduce applications.
Apache Hudi
- EMR - Apache Hudi
- Open Table Format
- Brings database and data warehouse capabilities to the data lake
Hue
- EMR - Hue
- Web GUI for Hadoop
Apache Iceberg
- EMR - Apache Iceberg
- Open Table Format
Apache Livy
- EMR - Apache Livy
- A REST interface for Apache Spark
Apache MXNet
- EMR - Apache MXNet
- Retired
- A deep learning framework for building neural networks and other deep learning applications
Apache Oozie
- EMR - Apache Oozie
- Workflow scheduler system to manage Hadoop jobs
- Apache Airflow is a modern alternative.
Apache Phoenix
- EMR - Apache Phoenix
- OLTP and operational analytics in Hadoop for low-latency applications
Apache Pig
- EMR - Apache Pig
- SQL-like commands written in Pig Latin, converted into DAG-based Tez jobs or MapReduce programs
Apache Sqoop
- EMR - Apache Sqoop
- Retired
- A tool for transferring data between S3, Hadoop, HDFS, and RDBMS databases
Apache Tez
- EMR - Apache Tez
- An application execution framework for complex DAGs of tasks, similar to Apache Spark
Glue
- Basically serverless Spark ETL with a Hive metastore
- Ingests data in batch while performing transformations
- Either provide a script to perform the ETL job, or Glue can generate a script automatically.
- The data source can be an AWS service, such as RDS, S3, DynamoDB, or Kinesis Data Streams, as well as a third-party JDBC-accessible database.
- The data target can be an AWS service, such as S3, RDS, or DocumentDB, as well as a third-party JDBC-accessible database.
- Charged at an hourly rate based on the number of Data Processing Units (DPUs) used to run your ETL job
Orchestration
- Triggers (see the sketch after this list)
  - When fired, a trigger can start specified jobs and crawlers.
  - A trigger fires on demand, based on a schedule, or based on a combination of events.
  - Trigger types
    - Scheduled
    - Conditional: fires if the watched jobs or crawlers end with the specified statuses
    - On-demand
- Workflows
  - Create and visualize complex ETL activities involving multiple crawlers, jobs, and triggers.
- Blueprints
  - Glue blueprints provide a way to create and share Glue workflows. When a complex ETL process could be reused for similar use cases, rather than creating a Glue workflow for each use case, you can create a single blueprint.
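A hedged boto3 sketch of a conditional trigger that starts a load job once a transform job succeeds; the job names are illustrative assumptions:

```python
import boto3

glue = boto3.client("glue")

glue.create_trigger(
    Name="start-load-after-transform",
    Type="CONDITIONAL",
    Predicate={
        "Logical": "AND",
        "Conditions": [{
            "LogicalOperator": "EQUALS",
            "JobName": "transform-job",  # watched job
            "State": "SUCCEEDED",        # fire on this terminal status
        }],
    },
    Actions=[{"JobName": "load-job"}],   # job to start when fired
    StartOnCreation=True,
)
```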
Glue - Job
- Job type
  - Apache Spark
    - Minimum 2 DPU; 10 DPU (default)
  - Apache Spark Streaming
    - Minimum 2 DPU (default)
  - Python Shell
    - 1 DPU or 0.0625 DPU (default)
- Job bookmarks
  - Glue tracks data that has already been processed during a previous run of an ETL job by persisting state information from the job run. This persisted state information is called a job bookmark.
Glue - Worker
- A single DPU is also called a worker.
- A DPU is a relative measure of processing power that consists of 4 vCPU of compute capacity and 16 GB of memory.
- An M-DPU is a DPU with 4 vCPU and 32 GB of memory.
- Worker type
  - For Glue 2.0+, specify Worker type and Number of workers (see the sketch after this list)
  - G.1X (default)
    - 1 DPU
    - Recommended for most workloads with cost-effective performance
  - G.2X
    - 2 DPU
    - Recommended for most workloads with cost-effective performance
  - G.4X
    - 4 DPU
    - Glue 3.0+ Spark jobs
  - G.8X
    - 8 DPU
    - Recommended for workloads with the most demanding transformations, aggregations, and joins
    - Glue 3.0+ Spark jobs
  - G.025X
    - 0.25 DPU
    - Recommended for low-volume streaming jobs
    - Glue 3.0+ Spark Streaming jobs
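A minimal boto3 sketch of creating a Spark ETL job with an explicit worker type and count; the role name, script location, and sizing are illustrative assumptions:

```python
import boto3

glue = boto3.client("glue")

glue.create_job(
    Name="nightly-etl",
    Role="GlueServiceRole",  # assumed IAM role name
    Command={
        "Name": "glueetl",   # Spark ETL job type
        "ScriptLocation": "s3://my-glue-scripts/nightly_etl.py",
        "PythonVersion": "3",
    },
    GlueVersion="4.0",
    WorkerType="G.1X",       # 1 DPU per worker (the default type)
    NumberOfWorkers=10,
)
```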
Glue - Data Catalog
- AWS Docs - Glue Data Catalog
- Functionally similar to a schema registry
- API-compatible with the Hive Metastore
- lakeFS - Metadata Management: Hive Metastore vs AWS Glue

Glue Data Catalog - database & table
- Databases and tables are objects in the Data Catalog that contain metadata definitions.
- The schema of your data is represented in your Glue table definition; the actual data remains in its original data store.
Glue - Data Catalog - crawler
- AWS Docs - Glue Data Catalog - crawler
- A crawler can crawl multiple data stores in a single run. Upon completion, the crawler creates or updates one or more tables in your Data Catalog. ETL jobs that you define in AWS Glue use these Data Catalog tables as sources and targets.
- A classifier recognizes the format of your data and generates a schema. It returns a certainty number between 0.0 and 1.0, where 1.0 means 100% certainty. Glue uses the output of the classifier that has the highest certainty.
- If no classifier returns a certainty greater than 0.0, Glue returns the default classification string of UNKNOWN.
- Steps (see the crawler sketch after this section)
  - A crawler runs any custom classifiers that you choose to infer the format and schema of your data.
  - The crawler connects to the data store.
  - The inferred schema is created for your data.
  - The crawler writes metadata to the Glue Data Catalog.
- Partition threshold
  - If the following conditions are met, the schemas are denoted as partitions of a table:
    - The partition threshold is higher than 0.7 (70%).
    - The maximum number of different schemas doesn't exceed 5.
  - Learn how AWS Glue crawler detects the schema | AWS re:Post
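A minimal boto3 sketch of defining and starting a crawler over an S3 prefix; the bucket, database, and role names are illustrative assumptions:

```python
import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="raw-logs-crawler",
    Role="GlueServiceRole",
    DatabaseName="raw_logs_db",  # inferred tables land in this database
    Targets={"S3Targets": [{"Path": "s3://my-data-lake/raw/logs/"}]},
)
glue.start_crawler(Name="raw-logs-crawler")
```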
Glue - Data Catalog - security
- You can only use a Glue resource policy to manage permissions for Data Catalog resources.
- You can't attach it to any other Glue resources such as jobs, triggers, development endpoints, crawlers, or classifiers.
- Only one resource policy is allowed per catalog, and its size is limited to 10 KB.
- Each AWS account has one single catalog per Region, whose catalog ID is the same as the AWS account ID.
- You cannot delete or modify a catalog.
Glue Data Quality
- AWS Docs - Glue Data Quality

Glue Streaming
- AWS Docs - Glue Streaming
- Uses the Apache Spark Streaming framework
- Near-real-time data processing
Glue DataBrew
- AWS Docs - Glue DataBrew
- Visual data preparation tool that enables users to clean and normalize data without writing any code
Kinesis Data Streams
- AWS Docs - Kinesis Data Streams
- Key points
  - Equivalent to Kafka: for event streaming, no ETL
  - On-demand or provisioned mode
  - PaaS with API access for developers
  - The maximum size of a record's data payload before base64-encoding is 1 MB.
  - The retention period is 1 day by default, and can be up to 365 days.
- A stream is composed of one or more shards.
- A shard is a uniquely identified sequence of data records in a stream.
- All the data in a shard is sent to the same worker that is processing the shard.
- Read throughput
  - Up to 2 MB/second/shard, 5 transactions (API calls)/second/shard, or 2,000 records/second/shard, shared across all the consumers that are reading from a given shard
  - Each call to GetRecords is counted as 1 read transaction.
  - GetRecords can retrieve up to 10 MB per transaction per shard, and up to 10,000 records per call. If a call to GetRecords returns 10 MB, subsequent calls made within the next 5 seconds throw an exception.
  - Up to 20 registered consumers (enhanced fan-out limit) for each data stream
- Write throughput
  - Up to 1 MB/second/shard or 1,000 records/second/shard
- Record aggregation allows customers to combine multiple records into a single record, improving per-shard throughput.
- Not scaled automatically; resharding is the process used to scale your data stream, using a series of shard splits or merges.
- Partition key
  - All data records with the same partition key map to the same shard.
  - The number of partition keys should typically be much greater than the number of shards, i.e., a many-to-one relationship (see the sketch below).
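A minimal boto3 sketch of how the partition key drives shard placement; the stream name and payload are illustrative assumptions:

```python
import boto3

kinesis = boto3.client("kinesis")

# Records sharing a partition key land on the same shard, which preserves
# their relative order within that shard.
kinesis.put_record(
    StreamName="clickstream",
    Data=b'{"user_id": "u-123", "action": "click"}',
    PartitionKey="u-123",
)
```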
- Idempotency
  - 2 primary causes of duplicate records: producer retries and consumer retries
  - Consumer retries are more common than producer retries.
- Kinesis Data Streams does not guarantee the order of records across shards in a stream.
- Kinesis Data Streams does guarantee the order of records within a shard.
Kinesis Data Streams - Producers

Kinesis Data Streams - Producers - KPL
- AWS Docs - KPL
- Batching
  - Aggregation: storing multiple payloads within a single Kinesis Data Streams record
  - Collection: batching multiple Kinesis Data Streams records into a single HTTP request (PutRecords) to reduce request overhead
- Rate limiting
  - Limits per-shard throughput sent from a single producer
  - Implemented using a token bucket algorithm, with separate buckets for Kinesis Data Streams records and bytes
- Pros
  - Higher throughput due to aggregation and compression of records
  - An abstraction that simplifies coding compared with using the low-level API directly
- Cons
  - Higher latency, due to an additional processing delay of up to the configurable RecordMaxBufferedTime
  - Only supports AWS SDK v1
Kinesis Data Streams - Producers - Kinesis Agent
- AWS Docs - Kinesis Agent
- Stand-alone Java application running as a daemon
- Continuously monitors a set of files and sends new data to your stream
Kinesis Data Streams - Consumers

Kinesis Data Streams - Consumers - KCL
- Tasks performed by the KCL
  - Connects to the data stream
  - Enumerates the shards within the data stream
  - Uses leases to coordinate shard associations with its workers
  - Instantiates a record processor for every shard it manages
  - Pulls data records from the data stream
  - Pushes the records to the corresponding record processor
  - Checkpoints processed records
  - Balances shard-worker associations (leases) when the worker instance count changes or when the data stream is resharded (shards are split or merged)
- Versions
  - KCL 1.x is based on AWS SDK v1, and KCL 2.x is based on AWS SDK v2.
  - KCL 1.x: Java, Node.js, .NET, Python, Ruby
  - KCL 2.x: Java, Python
  - Multistream processing is only supported in KCL 2.3+ for Java.
- You should ensure that the number of KCL instances does not exceed the number of shards (except for failure standby purposes).
- One KCL worker is deployed on one EC2 instance and can process multiple shards.
- A KCL worker runs as a process and has multiple record processors running as threads within it.
- One shard has exactly one corresponding record processor.
- KCL 2.x enables you to create KCL consumer applications that can process more than one data stream at the same time.
- Checkpoints the sequence number for the shard in the lease table to track processed records
- Enhanced fan-out
  - Consumers using enhanced fan-out have dedicated throughput of up to 2 MB/second/shard; without enhanced fan-out, all consumers reading from the same shard share a throughput of 2 MB/second/shard.
  - Requires KCL 2.x and a Kinesis data stream with enhanced fan-out enabled
  - You can register up to 20 consumers per stream to use enhanced fan-out.
  - Kinesis Data Streams pushes data records from the stream to consumers using enhanced fan-out over HTTP/2, therefore no polling and lower latency.
- Lease table
  - At any given time, each shard of data records is bound to a particular KCL worker by a lease, identified by the leaseKey variable.
  - By default, a worker can hold one or more leases (subject to the value of the maxLeasesForWorker variable) at the same time.
  - A DynamoDB table keeps track of the shards in a Kinesis data stream that are being leased and processed by the workers of the KCL consumer application.
  - Each row in the lease table represents a shard that is being processed by the workers of your consumer application.
  - KCL uses the name of the consumer application to create the name of the lease table, therefore each consumer application name must be unique.
  - One lease table per KCL consumer application
  - KCL creates the lease table with a provisioned throughput of 10 reads/second and 10 writes/second.
- Troubleshooting
  - Consumer read API calls throttled (ReadProvisionedThroughputExceeded)
    - Occurs when the GetRecords quota is exceeded; subsequent GetRecords calls are throttled.
    - Quota
      - 5 GetRecords API calls/second/shard
      - 2 MB/second/shard
      - Up to 10 MiB of data per API call per shard
      - Up to 10,000 records per API call
  - Slow record processing
Kinesis Data Firehose
- AWS Docs - Kinesis Data Firehose
- Serverless, fully managed, with automatic scaling
- Pre-built connectors for various sources and destinations, without coding, similar to Kafka Connect
- For data latency of 60 seconds or higher
- Sources
  - Kinesis Data Streams
  - MSK
    - Can only use S3 as the destination
  - Direct PUT (see the sketch after this list)
    - Use the PutRecord and PutRecordBatch APIs to send data
    - If the source is Direct PUT, Firehose will retain data for 24 hours.
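A minimal boto3 sketch of Direct PUT with PutRecordBatch; the delivery stream name and payloads are illustrative assumptions:

```python
import boto3

firehose = boto3.client("firehose")

firehose.put_record_batch(
    DeliveryStreamName="clickstream-to-s3",
    Records=[
        {"Data": b'{"user_id": "u-123", "action": "click"}\n'},
        {"Data": b'{"user_id": "u-456", "action": "view"}\n'},
    ],
)
```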
- Destinations (AWS Docs - Data Firehose - Source, Destination, and Name)
  - S3
  - Redshift
  - OpenSearch
  - Custom HTTP endpoint
  - 3rd-party services
- Buffering hints (AWS Docs - Buffering hints)
  - Buffering size in MBs (destination specific)
    - 1 MB to 128 MB; default is 5 MB
  - Buffering interval in seconds (destination specific)
    - 0 to 900; default is 60 or 300 seconds
- Data transformation is not natively supported; use Lambda
  - Lambda
    - Custom transformation logic
    - Lambda invocation time of up to 5 min
    - Error handling
      - Retries the invocation 3 times by default
      - After retries, if the invocation is unsuccessful, Firehose skips that batch of records. The skipped records are treated as unsuccessfully processed records.
      - The unsuccessfully processed records are delivered to your S3 bucket.
- Data format conversion
  - Firehose can convert JSON records using a schema from a table defined in Glue.
  - Non-JSON data must be converted by invoking a Lambda function.
- Dynamic partitioning
  - Dynamic partitioning enables you to continuously partition streaming data in Firehose by using keys within the data (for example, customer_id or transaction_id) and then deliver the data grouped by these keys into corresponding S3 prefixes.
  - Partitioning your data minimizes the amount of data scanned, optimizes performance, and reduces the cost of your analytics queries on S3.
- Server-side encryption
  - Can be enabled depending on the source of data
  - With Kinesis Data Streams as the source
    - Kinesis Data Streams first decrypts the data and then sends it to Firehose. Firehose buffers the data in memory and delivers it without storing the unencrypted data at rest.
  - With Direct PUT or other data sources
    - Can be turned on by using the StartDeliveryStreamEncryption operation
    - To stop server-side encryption, use the StopDeliveryStreamEncryption operation
- Troubleshooting
Kinesis Data Analytics
- Input
  - Kinesis Data Streams
  - Kinesis Data Firehose
- Output
  - Kinesis Data Streams
  - Kinesis Data Firehose
  - Lambda

Kinesis Data Analytics - Managed Apache Flink
- Equivalent to Kafka Streams; uses Apache Flink behind the scenes for online stream processing, namely streaming ETL
- Managed Service for Apache Flink
- Streaming applications built with APIs in Java, Scala, Python, and SQL
- Interactive notebooks
Kinesis Data Analytics - SQL (legacy)
- Kinesis Data Analytics for SQL Applications

Kinesis Data Analytics Studio
- Based on
  - Apache Zeppelin
  - Apache Flink

Kinesis Video Streams
- Out of scope for the exam
Lake Formation
- Simplify security management and governance at scale
- Centrally manage permissions across your users
- Comprehensive data access and audit logging
- Provides an RDBMS-style permissions model to grant or revoke access to databases, tables, and columns in the Data Catalog, with the underlying data in S3
Lake Formation - Ingestion
- Blueprints are predefined ETL workflows that you can use to ingest data into your data lake.
- Workflows are instances of ingestion blueprints in Lake Formation.
- Blueprint types
  - Database snapshot
    - Loads or reloads data from all tables into the data lake from a JDBC source.
    - You can exclude some data from the source based on an exclude pattern.
    - When to use:
      - Schema evolution is flexible. (Columns are re-named, previous columns are deleted, and new columns are added in their place.)
      - Complete consistency is needed between the source and the destination.
  - Incremental database
    - Loads only new data into the data lake from a JDBC source, based on previously set bookmarks.
    - You specify the individual tables in the JDBC source database to include.
    - When to use:
      - Schema evolution is incremental. (There is only successive addition of columns.)
      - Only new rows are added; previous rows are not updated.
  - Log file
    - Bulk loads data from log file sources
Lake Formation - Permission management
- Permission types
  - Metadata access
    - Data Catalog permissions
  - Underlying data access
    - Data access permissions
      - Data access permissions (SELECT, INSERT, and DELETE) on Data Catalog tables that point to that location
    - Data location permissions
      - The ability to create Data Catalog resources that point to particular S3 locations
      - When you grant the CREATE_TABLE or ALTER permission to a principal, you also grant data location permissions to limit the locations for which the principal can create or alter metadata tables.
- Permission model (see the sketch after this section)
  - The IAM permissions model consists of IAM policies.
  - The Lake Formation permissions model is implemented as DBMS-style GRANT/REVOKE commands.
- Principals
  - IAM users and roles
  - SAML users and groups
  - IAM Identity Center
  - External accounts
- Resources
  - LF-Tags
    - Using LF-Tags can greatly simplify the number of grants compared with using named resource policies.
  - Named data catalog resources
- Permissions
  - Database permissions
  - Table permissions
- Data filtering
  - Column-level security allows users to view only the specific columns and nested columns that they have access to in the table.
  - Row-level security allows users to view only the specific rows of data that they have access to in the table.
  - Cell-level security uses both row filtering and column filtering at the same time, and can restrict access to different columns depending on the row.
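A hedged boto3 sketch of a DBMS-style grant: giving an analyst role SELECT on a Data Catalog table. The role ARN, database, and table names are illustrative assumptions:

```python
import boto3

lf = boto3.client("lakeformation")

lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:role/analyst"},
    Resource={"Table": {"DatabaseName": "sales_db", "Name": "orders"}},
    Permissions=["SELECT"],
)
```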
QuickSight
- AWS Docs - QuickSight
- Key points
  - Ad-hoc, interactive query, analysis, and visualization
  - For formatted canned reports, use Paginated Reports
- Editions
  - Standard edition
  - Enterprise edition
    - VPC connection: access data sources in your VPC without the need for a public IP address
    - Microsoft Active Directory SSO
    - Data stored in SPICE is encrypted at rest.
    - For SQL-based data sources, such as Redshift, Athena, PostgreSQL, or Snowflake, you can refresh your data incrementally within a look-back window of time.
- Datasets
  - Dataset refresh can be scheduled or on-demand.
  - SPICE (in-memory)
  - Direct SQL query
- Visual types
  - Combo chart: combines line and column (bar) charts
  - Scatter plot: used to observe relationships between variables
  - Heat map
    - Use heat maps to show a measure for the intersection of two dimensions, with color-coding to easily differentiate where values fall in the range.
    - Use a heat map if you want to identify trends and outliers, because the use of color makes these easier to spot.
  - Pivot table
    - Use a pivot table if you want to analyze data on the visual.
  - AutoGraph
    - You create a visual by choosing AutoGraph and then selecting fields.
Migration and Transfer
DMS (Database Migration Service)
- AWS Docs - DMS (Database Migration Service)
- Migrates RDBMS, data warehouses, NoSQL databases, and other types of data stores
- One-time migration
- Used for CDC (Change Data Capture) for real-time ongoing replication
DataSync
- AWS Docs - DataSync
- One-off online data transfer; transferring hot or cold data should not impede your business.
- Move data from on-premises to AWS
- Move data between AWS services
- Requires DataSync agent installation on-premises
Snow Family
- AWS Snow Family service models - Feature comparison matrix
- AWS Docs - AWS Snowball Edge Device Specifications
- Snowball Edge Storage Optimized
  - For large-scale data migrations and recurring transfer workflows, as well as local computing with higher capacity needs
- Snowball Edge Compute Optimized
  - For use cases such as machine learning, full motion video analysis, analytics, and local computing stacks
- Snowmobile
  - Intended for more-than-10 PB data migrations from a single location
  - Up to 100 PB per Snowmobile
  - Shipping container
Transfer Family
- Transfer flat files using SFTP, FTPS, and FTP
Storage Gateway
- Hybrid cloud storage solution that enables on-premises applications to use cloud storage without impacting existing applications
- On-premises data is synced to cloud storage, while applications continue to access the cached data locally
- AWS Storage Blog - Cloud storage in minutes with AWS Storage Gateway (updated)
Storage Gateway - Amazon S3 File Gateway
- AWS Docs - Amazon S3 File Gateway User Guide
- Uses file protocols such as NFS and SMB to access files stored as objects in S3
Storage Gateway - Amazon FSx File Gateway
- AWS Docs - Amazon FSx File Gateway User Guide
- Low-latency and efficient access to in-cloud FSx for Windows File Server file shares from your on-premises facility
Storage Gateway - Tape Gateway
- Cloud-backed virtual tape storage
- Tape Gateway presents an iSCSI-based virtual tape library (VTL) of virtual tape drives and a virtual media changer to your on-premises backup application.
- Tape Gateway stores your virtual tapes in S3 and creates new ones automatically.
Storage Gateway - Volume Gateway
- Stored volumes
  - The entire dataset is stored locally and is asynchronously backed up to S3 as point-in-time snapshots.
- Cached volumes
  - The entire dataset is stored in S3, and the most frequently accessed data is cached locally.
Data Pipeline
- Define data-driven workflows for moving and transforming data from various sources to destinations, similar to Airflow
- Managed ETL service using EMR under the hood
- Batch processing
- In maintenance mode; workloads can be migrated to
  - AWS Glue
  - AWS Step Functions
  - Amazon MWAA (Amazon Managed Workflows for Apache Airflow)
Application Integration
Step Functions
- General-purpose serverless workflow orchestration service
- Able to run EMR workloads on a schedule
- AWS Docs - AWS Step Functions - Manage an Amazon EMR Job
- AWS Big Data Blog - Automating EMR workloads using AWS Step Functions
SQS
Kinesis Data Streams vs SQS
| | Kinesis Data Streams / MSK | SQS |
|---|---|---|
| Consumption | Can be consumed many times | Records are deleted after being consumed |
| Retention | Retention period configurable from 24 hours to 1 year | Retention period configurable from 1 min to 14 days |
| Ordering | Ordering of records is preserved within a shard | FIFO queues preserve ordering of records |
| Scaling | Manual resharding | Auto scaling |
| Delivery | At-least-once delivery | Exactly-once delivery (FIFO queues) |
| Replay | Can replay | Cannot replay |
| Max payload size | 1 MB | 256 KB |
Database
Redshift
- PB scale, columnar storage
- MPP architecture
- Loading data (see the sketch after this list)
  - AWS Docs - Amazon Redshift best practices for loading data
  - COPY command
    - Can load data from multiple files in parallel
    - Use the COPY command with COMPUPDATE set to ON to analyze and apply compression automatically
    - Split your data into files so that the number of files is a multiple of the number of slices in your cluster
- AWS Big Data Blog - JOIN Amazon Redshift AND Amazon RDS PostgreSQL WITH dblink
  - The dblink function allows the entire query to be pushed to Amazon Redshift. This lets Amazon Redshift do what it does best (query large quantities of data efficiently) and return the results to PostgreSQL for further processing.
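A hedged sketch of running COPY through the Redshift Data API; the cluster, database, table, bucket, and IAM role are illustrative assumptions:

```python
import boto3

rsd = boto3.client("redshift-data")

rsd.execute_statement(
    ClusterIdentifier="analytics-cluster",
    Database="dev",
    DbUser="awsuser",
    # COPY loads the split files in parallel; COMPUPDATE ON applies
    # compression encodings automatically on an empty table.
    Sql="""
        COPY sales
        FROM 's3://my-bucket/sales/part-'
        IAM_ROLE 'arn:aws:iam::111122223333:role/RedshiftCopyRole'
        FORMAT AS CSV
        COMPUPDATE ON;
    """,
)
```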
Redshift - Cluster
- Every cluster can have up to 32 nodes.
- The leader node and compute nodes have the same specs; the leader node is chosen automatically.
- Leader node (present in clusters with 2+ nodes)
  - Parses queries, builds an optimal execution plan, and compiles code from the execution plan
  - Receives and aggregates the results from the compute nodes
- Compute node
  - Node slices
    - A compute node is partitioned into slices. Each slice is allocated a portion of the node's memory and disk space, where it processes a portion of the workload assigned to the node.
- Enhanced VPC routing
  - Routes network traffic between your cluster and data repositories through a VPC, instead of through the internet
Redshift - Cluster - Elastic resize
- Preferred method
- Resizes in an incremental manner
- Add or remove nodes (same node type) to a cluster in minutes, aka in-place resize (see the sketch below)
- Change the node type of an existing cluster
  - A snapshot is created, and a new cluster is created from the snapshot with the new node type.
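A hedged boto3 sketch of requesting an elastic resize that adds same-type nodes; the cluster identifier and node count are illustrative assumptions:

```python
import boto3

redshift = boto3.client("redshift")

redshift.resize_cluster(
    ClusterIdentifier="analytics-cluster",
    NumberOfNodes=6,
    Classic=False,  # False requests an elastic (in-place) resize
)
```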
Redshift - Cluster - Classic resize
- Use when you need to change the cluster size or the node type of an existing cluster and elastic resize is not supported
- Takes much longer than elastic resize
Redshift - Cluster - Node type
- RA3

  | Node size | vCPU | RAM (GiB) | Managed storage quota per node | Nodes per cluster |
  |---|---|---|---|---|
  | ra3.xlplus | 4 | 32 | 32 TB | 1 to 32 |
  | ra3.4xlarge | 12 | 96 | 128 TB | 2 to 32 |
  | ra3.16xlarge | 48 | 384 | 128 TB | 2 to 128 |

  - Pay for managed storage and compute separately
  - Managed storage uses SSD for local cache and S3 for persistence, and automatically scales based on the workload
  - Compute is separated from managed storage and can be scaled independently
  - RA3 nodes are optimized for performance and cost-effectiveness
- DC2 (Dense Compute)

  | Node size | vCPU | RAM (GiB) | Local SSD storage per node | Nodes per cluster |
  |---|---|---|---|---|
  | dc2.large | 2 | 15 | 160 GB | 1 to 32 |
  | dc2.8xlarge | 32 | 244 | 2.6 TB | 2 to 128 |

  - Compute-intensive, with local SSD storage for high performance and low latency
  - Recommended for datasets under 10 TB
- DS2 (Dense Storage)
  - Deprecated legacy option for low-cost workloads with HDD storage; RA3 is recommended for new workloads
Redshift - Cluster - HA
- Single-AZ by default
- Multi-AZ is supported
Redshift - Distribution style
- AWS Docs - Distribution style
- Redshift distribution keys (DIST keys) determine where data is stored in Redshift. Clusters store data fundamentally across the compute nodes. Query performance suffers when a large amount of data is stored on a single node.
- Uneven distribution of data across compute nodes leads to skew in the work each node has to do, and you don't want an under-utilised compute node, so the distribution of the data should be uniform.
- Data should be distributed so that rows that participate in joins are already on the same node as their joining rows in other tables. This is called collocation, which reduces data movement and improves query performance.
- Distribution is configurable per table, so you can select a different distribution style for each table.
- AUTO distribution
  - Redshift automatically distributes the data across the nodes in the cluster.
  - Redshift initially assigns ALL distribution to a small table, then changes to EVEN distribution when the table grows larger.
  - Suitable for small tables or tables with a small number of distinct values
- EVEN distribution
  - The leader node distributes the rows across the slices in a round-robin fashion, regardless of the values in any particular column.
  - Appropriate when a table doesn't participate in joins, or when there isn't a clear choice between KEY distribution and ALL distribution.
- KEY distribution
  - The rows are distributed according to the values in one column.
  - All the entries with the same value in the column end up on the same slice.
  - Suitable for large tables or tables with a large number of distinct values
- ALL distribution
  - A copy of the entire table is distributed to every node.
  - Suitable for small tables or tables that are not joined with other tables
Redshift - Concurrency scaling
- Concurrency scaling adds transient clusters to your cluster to handle concurrent requests with consistently fast performance, in a matter of seconds.
Redshift - Sort keys
- AWS Docs - Sort keys
- When you create a table, you can define one or more of its columns as sort keys.
- A sort key is either compound or interleaved.
Redshift - Compression
- AWS Docs - Working with column compression
- You can manually apply a compression type, or encoding, to the columns in a table when you create the table.
- The COPY command analyzes your data and applies compression encodings to an empty table automatically as part of the load operation.
- Split your load data files so that the files are about equal size, between 1 MB and 1 GB after compression. For optimum parallelism, the ideal size is between 1 MB and 125 MB after compression.
- The number of files should be a multiple of the number of slices in your cluster.
- When loading large datasets, it is strongly recommended that you individually compress your load files using gzip, lzop, bzip2, or Zstandard.
Redshift - Workload Management (WLM)
- AWS Docs - Amazon Redshift - Implementing workload management
- Automatic WLM
- Manual WLM
  - Query groups
    - Memory percentage and concurrency settings
  - Query monitoring rules
    - Define the conditions under which a query is monitored and the action to take when the conditions are met
  - Query queues
    - Memory percentage and concurrency settings
- Concurrency scaling
  - Automatically adds and removes transient compute capacity to handle spikes in demand
  - Enabled by default for all RA3 and DC2 clusters; not available for DS2 clusters
- Short query acceleration (SQA)
  - SQA prioritizes selected short-running queries ahead of longer-running queries.
  - SQA runs short-running queries in a dedicated space, so that SQA queries aren't forced to wait in queues behind longer queries.
  - CREATE TABLE AS (CTAS) statements and read-only queries, such as SELECT statements, are eligible for SQA.
Redshift - User-Defined Functions (UDF)
- User-defined functions (UDFs) are functions that you define to extend the capabilities of the SQL language in Redshift.
- AWS Lambda UDFs
  - Useful for accessing other AWS services
Redshift - Snapshots
- Automated snapshots
  - Enabled by default when you create a cluster
  - By default, taken about every 8 hours or following every 5 GB per node of data changes, whichever comes first
  - Default retention period is 1 day
  - Cannot be deleted manually
- Manual snapshots
  - By default, manual snapshots are retained indefinitely, even after you delete your cluster.
  - AWS CLI: create-cluster-snapshot
  - AWS API: CreateClusterSnapshot
- Backup
  - You can configure Redshift to automatically copy snapshots (automated or manual) for a cluster to another Region.
  - Only one destination Region at a time can be configured for automatic snapshot copy.
  - To change the destination Region that you copy snapshots to, first disable the automatic copy feature, then re-enable it, specifying the new destination Region.
- Snapshot copy grant
  - KMS keys are specific to a Region. If you enable copying of Redshift snapshots to another Region, and the source cluster and its snapshots are encrypted using a master key from KMS, you need to configure a grant for Redshift to use a master key in the destination Region.
  - The grant is created in the destination Region and allows Redshift to use the master key in the destination Region.
Redshift - Query
- Data API (see the sketch below)
  - Available in the AWS SDK; based on HTTP and JSON, no persistent connection needed
  - API calls are asynchronous
  - Uses either credentials stored in Secrets Manager or temporary database credentials; no need to pass passwords in the API calls
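A minimal sketch of the asynchronous Data API flow; the cluster, database, secret ARN, and query are illustrative assumptions:

```python
import boto3

rsd = boto3.client("redshift-data")

resp = rsd.execute_statement(
    ClusterIdentifier="analytics-cluster",
    Database="dev",
    SecretArn="arn:aws:secretsmanager:us-east-1:111122223333:secret:redshift-creds",
    Sql="SELECT COUNT(*) FROM sales;",
)

# The call returns immediately; poll describe_statement() until the status
# is FINISHED, then fetch rows with get_statement_result(Id=resp["Id"]).
```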
Redshift - Security
- IAM
  - Does not support resource-based policies
- Access control
  - Cluster level
    - IAM policies for Redshift API actions
    - Security groups for network connectivity
  - Database level
    - Database user accounts
- Data protection
  - DB encryption using KMS
    - KMS key hierarchy
      - Root key
      - Cluster encryption key (CEK)
      - Database encryption key (DEK)
      - Data encryption keys
- Auditing
  - Logs
    - Connection log: logs authentication attempts, connections, and disconnections
    - User log: logs information about changes to database user definitions
    - User activity log: logs each query before it's run on the database
Redshift Spectrum
- In-place querying of data in S3, without having to load the data into Redshift tables
- EB scale
- Pay for the number of bytes scanned
- Federated query across operational databases, data warehouses, and data lakes
Redshift Spectrum - Comparison with Athena
| | Redshift Spectrum | Athena |
|---|---|---|
| Performance | Greater control over performance; uses dedicated resources of your Redshift cluster | Not configurable; uses shared resources managed by AWS |
| Query results | Stored in Redshift | Stored in S3 |
Redshift Serverless
- Redshift Serverless measures data warehouse capacity in Redshift Processing Units (RPUs). You pay for the workloads you run in RPU-hours on a per-second basis (with a 60-second minimum charge), including queries that access data in open file formats in S3.
Keyspaces
- AWS Docs - Keyspaces
- Based on Apache Cassandra

Neptune
- AWS Docs - Neptune
- Serverless graph database
Amazon Security Lake
- Fully managed security data lake service
- Centralizes security data from AWS environments, SaaS providers, on-premises, cloud, and third-party sources into a purpose-built data lake stored in your AWS account
Elemental MediaStore
- High performance for streaming media delivery