Exam Guide
Domain 1: Collection
Task Statement 1.1: Determine the operational characteristics of the collection system
Task Statement 1.2: Select a collection system that handles the frequency, volume, and source of data
- Describe and characterize the volume and flow characteristics of incoming data (streaming, transactional, batch)
  - Streaming data: Kinesis Data Streams, Amazon MSK
  - Batch data: Glue, Transfer Family, DataSync, Storage Gateway
  - Transactional data: DMS
Task Statement 1.3: Select a collection system that addresses the key properties of data, such as order, format, and compression
- Describe how to capture data changes at the source
  - DMS supports Change Data Capture (CDC).
- Describe how to transform and filter data during the collection process
  - Glue for batch ETL
  - Kinesis Data Analytics for streaming ETL
  - Kinesis Data Firehose and Lambda for streaming ETL
Domain 2: Storage and Data Management
Task Statement 2.1: Determine the operational characteristics of the storage solution for analytics
Services
Analytics
Athena
- In-place querying of data in S3
- Serverless
- PB scale
- Presto under the hood
- Supports SQL and Apache Spark
- Analyzes unstructured, semi-structured, and structured data stored in S3, in formats including CSV, JSON, Apache Parquet, and Apache ORC
- Only successful or canceled queries are billed; failed queries are not billed.
- No charge for DDL statements
- Use cases
  - Ad-hoc queries of web logs and CloudTrail/CloudFront/VPC/ELB logs
  - Querying staging data before loading to Redshift
  - Integration with Jupyter, Zeppelin, RStudio notebooks
  - Integration with QuickSight for visualization
  - Integration via JDBC/ODBC for BI tools
- SQL
  - MSCK REPAIR TABLE updates the metastore with the current partitions of an external table. (AWS Docs - Repairing partitions manually using MSCK repair)
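A minimal boto3 sketch of running the statement above through the Athena API; the table name, database, and output location are illustrative assumptions, not values from the guide:

```python
import boto3

athena = boto3.client("athena")

# Run MSCK REPAIR TABLE so the metastore picks up newly added partitions.
response = athena.start_query_execution(
    QueryString="MSCK REPAIR TABLE web_logs",
    QueryExecutionContext={"Database": "default"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
print(response["QueryExecutionId"])  # poll get_query_execution() for status
```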
Athena - Workgroup
- A workgroup is a collection of users, queries, and results that are associated with a specific data catalog and workgroup settings. Workgroup settings include the S3 location where query results are stored, the encryption configuration, and the data usage control settings.
Athena - Federated Query
- AWS Docs - Using Amazon Athena Federated Query
- Query the data in place, or build pipelines that extract data from multiple data sources and store them in S3.
- Uses data source connectors that run on Lambda to run federated queries.
CloudSearch
- AWS Docs - CloudSearch
- Managed search service, based on Apache Solr
OpenSearch
- AWS Docs - OpenSearch
- Operational analytics and log analytics
- Forked from Elasticsearch
- Number of Shards = Index Size / 30 GB
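A tiny worked example of the sizing rule above (the index size is an assumed figure for illustration):

```python
import math

index_size_gb = 250          # assumed index size
target_shard_size_gb = 30    # rule of thumb from the formula above
print(math.ceil(index_size_gb / target_shard_size_gb))  # -> 9 shards
```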
EMR
- AWS Docs - EMR
- Managed cluster platform that simplifies running big data frameworks, such as Hadoop and Spark, on AWS
- PB scale
- The RunJobFlow API action creates and starts running a new cluster (job flow). The cluster runs the specified steps; after the steps complete, the cluster stops and the HDFS partition is lost.
- An EMR cluster with multiple primary nodes can reside only in one AZ.
- HDFS
  - A typical block size used by HDFS is 128 MB. Thus, an HDFS file is chopped up into 128 MB chunks, and if possible, each chunk will reside on a different DataNode.
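A hedged boto3 sketch of RunJobFlow creating a transient cluster that stops after its steps complete; the names, instance types, and log bucket are illustrative assumptions:

```python
import boto3

emr = boto3.client("emr")

response = emr.run_job_flow(
    Name="transient-spark-cluster",
    ReleaseLabel="emr-6.15.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        # Transient behavior: the cluster terminates once all steps finish,
        # and the HDFS partition is lost, as noted above.
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    LogUri="s3://my-emr-logs/",
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])
```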
EMR cluster - Node types
- AWS Docs - EMR cluster - Node types
- Primary node / Master node
  - Manages the cluster and typically runs primary components of distributed applications
  - Tracks the status of jobs submitted to the cluster
  - Monitors the health of the instance groups
  - HA support from 5.23.0+
- Core nodes
  - Store HDFS data and run tasks
  - Multi-node clusters have at least one core node.
  - One core instance group or instance fleet per cluster, but there can be multiple nodes running on multiple EC2 instances in the instance group or instance fleet.
- Task nodes
  - No HDFS data, therefore no data loss if terminated
  - Can use Spot instances
  - Optional in a cluster
EMR - EC2
- AWS Docs - EMR - EC2

EMR - EKS
- AWS Docs - EMR - EKS

EMR - Serverless
- AWS Docs - EMR - Serverless
EMR - EMRFS
- Direct access to S3 from the EMR cluster
- Persistent storage for EMR clusters
- Benefit from S3 features
- For clusters with multiple users who need different levels of access to data in S3 through EMRFS
  - EMRFS can assume a different service role for cluster EC2 instances based on the user or group making the request, or based on the location of data in S3.
  - Each IAM role for EMRFS can have different permissions for data access in S3.
  - When EMRFS makes a request to S3 that matches the users, groups, or locations that you specify, the cluster uses the corresponding role that you specify instead of the EMR role for EC2.
  - AWS Docs - Authorizing access to EMRFS data in Amazon S3
EMR - S3DistCp
- An extension to DistCp (Apache Hadoop Distributed Copy)
- Uses MapReduce to efficiently copy large amounts of data in a distributed manner
- Copy data from S3 to HDFS
- Copy data from HDFS to S3
- Copy data between S3 buckets
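A hedged sketch of submitting an S3DistCp copy as a step on a running cluster via command-runner.jar; the cluster ID and paths are illustrative assumptions:

```python
import boto3

emr = boto3.client("emr")

emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",  # an existing cluster
    Steps=[{
        "Name": "copy-logs-from-s3-to-hdfs",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            # S3DistCp runs as a MapReduce job, copying in a distributed manner.
            "Args": [
                "s3-dist-cp",
                "--src", "s3://my-bucket/logs/",
                "--dest", "hdfs:///logs/",
            ],
        },
    }],
)
```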
EMR - Security - Apache Ranger
- An RBAC framework to enable, monitor, and manage comprehensive data security across the Hadoop ecosystem
- Centralized security administration and auditing
- Fine-grained authorization across many Hadoop components (Hadoop, Hive, HBase, Storm, Knox, Solr, Kafka, and YARN)
- Syncs policies and users by using agents and plugins that run within the same process as the Hadoop component
- Supports row-level authorization and auditing capabilities with embedded search
EMR - Security Configuration
- AWS Docs - Security Configuration
- At-rest data encryption
  - For WAL with EMR
    - Open-source HDFS encryption
    - LUKS encryption
  - For cluster node local disks (EC2 instance volumes)
    - Open-source HDFS encryption
    - Instance store encryption
      - NVMe encryption
      - LUKS encryption
    - EBS volume encryption
      - EBS encryption (recommended, EMR 5.24.0+)
      - LUKS encryption (only applies to attached storage volumes, not to the root device volume)
  - For EMRFS on S3
    - SSE-S3
    - SSE-KMS
    - SSE-C
    - CSE-KMS: objects are encrypted before being uploaded to S3, and the client uses keys provided by KMS.
    - CSE-Custom: objects are encrypted before being uploaded to S3, and the client uses a custom Java class that provides the client-side master key.
- In-transit data encryption
  - For EMR traffic between cluster nodes, and for WAL with EMR
  - For distributed applications
  - For EMRFS traffic between S3 and cluster nodes
EMR Studio
- Web IDE for fully managed Jupyter notebooks running on EMR clusters
- EMR Notebooks
EMR - Open Source Ecosystem
Delta Lake
- EMR - Delta Lake
- Storage layer framework for lakehouse architectures
Ganglia
- EMR - Ganglia
- Performance monitoring tool for Hadoop and HBase clusters
- Not included in EMR after 6.15.0
Apache HBase
- EMR - Apache HBase
- Wide-column NoSQL store running on HDFS
- EMR WAL support: you can restore an existing WAL, retained for 30 days starting from the time the cluster was created, with a new cluster from the same S3 root directory.
Apache HCatalog
- EMR - Apache HCatalog
- HCatalog is a tool that allows you to access Hive metastore tables within Pig, Spark SQL, and/or custom MapReduce applications.
Apache Hudi
- EMR - Apache Hudi
- Open Table Format
- Brings database and data warehouse capabilities to the data lake
Hue
- EMR - Hue
- Web GUI for Hadoop
Apache Iceberg
- EMR - Apache Iceberg
- Open Table Format
Apache Livy
- EMR - Apache Livy
- A REST interface for Apache Spark
Apache MXNet
- EMR - Apache MXNet
- Retired
- A deep learning framework for building neural networks and other deep learning applications
Apache Oozie
- EMR - Apache Oozie
- Workflow scheduler system to manage Hadoop jobs
- Apache Airflow is a modern alternative.
Apache Phoenix
- EMR - Apache Phoenix
- OLTP and operational analytics in Hadoop for low-latency applications
Apache Pig
- EMR - Apache Pig
- SQL-like commands written in Pig Latin, converted into DAG-based Tez jobs or MapReduce programs
Apache Sqoop
- EMR - Apache Sqoop
- Retired
- A tool for transferring data between S3, Hadoop, HDFS, and RDBMS databases
Apache Tez
- EMR - Apache Tez
- An application execution framework for complex DAGs of tasks, similar to Apache Spark
Glue
- Basically serverless Spark ETL with a Hive metastore
- Ingests data in batch while performing transformations
- Either provide a script to perform the ETL job, or Glue can generate a script automatically.
- The data source can be an AWS service, such as RDS, S3, DynamoDB, or Kinesis Data Streams, as well as a third-party JDBC-accessible database.
- The data target can be an AWS service, such as S3, RDS, or DocumentDB, as well as a third-party JDBC-accessible database.
- Charged at an hourly rate based on the number of Data Processing Units (DPUs) used to run your ETL job
Orchestration
- Triggers (see the sketch after this list)
  - When fired, a trigger can start specified jobs and crawlers.
  - A trigger fires on demand, based on a schedule, or based on a combination of events.
  - Trigger types
    - Scheduled
    - Conditional: fires if the watched jobs or crawlers end with the specified statuses
    - On-demand
- Workflows
  - Create and visualize complex ETL activities involving multiple crawlers, jobs, and triggers.
- Blueprints
  - Glue blueprints provide a way to create and share Glue workflows. When a complex ETL process could be reused for similar use cases, rather than creating a Glue workflow for each use case, you can create a single blueprint.
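A hedged boto3 sketch of a conditional trigger that starts a load job once a transform job succeeds; the job names are illustrative assumptions:

```python
import boto3

glue = boto3.client("glue")

glue.create_trigger(
    Name="start-load-after-transform",
    Type="CONDITIONAL",
    Predicate={
        "Logical": "AND",
        "Conditions": [{
            "LogicalOperator": "EQUALS",
            "JobName": "transform-job",  # watched job
            "State": "SUCCEEDED",        # fire on this terminal status
        }],
    },
    Actions=[{"JobName": "load-job"}],   # job to start when fired
    StartOnCreation=True,
)
```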
Glue - Job
- Job type
  - Apache Spark
    - Minimum 2 DPU; 10 DPU (default)
  - Apache Spark Streaming
    - Minimum 2 DPU (default)
  - Python Shell
    - 1 DPU or 0.0625 DPU (default)
- Job bookmarks
  - Glue tracks data that has already been processed during a previous run of an ETL job by persisting state information from the job run. This persisted state information is called a job bookmark.
Glue - Worker
- A single DPU is also called a worker.
- A DPU is a relative measure of processing power that consists of 4 vCPU of compute capacity and 16 GB of memory.
- An M-DPU is a DPU with 4 vCPU and 32 GB of memory.
- Worker type
  - For Glue 2.0+, specify Worker type and Number of workers (see the sketch after this list)
  - G.1X (default)
    - 1 DPU
    - Recommended for most workloads with cost-effective performance
  - G.2X
    - 2 DPU
    - Recommended for most workloads with cost-effective performance
  - G.4X
    - 4 DPU
    - Glue 3.0+ Spark jobs
  - G.8X
    - 8 DPU
    - Recommended for workloads with the most demanding transformations, aggregations, and joins
    - Glue 3.0+ Spark jobs
  - G.025X
    - 0.25 DPU
    - Recommended for low-volume streaming jobs
    - Glue 3.0+ Spark Streaming jobs
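A minimal boto3 sketch of creating a Spark ETL job with an explicit worker type and count; the role name, script location, and sizing are illustrative assumptions:

```python
import boto3

glue = boto3.client("glue")

glue.create_job(
    Name="nightly-etl",
    Role="GlueServiceRole",  # assumed IAM role name
    Command={
        "Name": "glueetl",   # Spark ETL job type
        "ScriptLocation": "s3://my-glue-scripts/nightly_etl.py",
        "PythonVersion": "3",
    },
    GlueVersion="4.0",
    WorkerType="G.1X",       # 1 DPU per worker (the default type)
    NumberOfWorkers=10,
)
```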
Glue - Data Catalog
- AWS Docs - Glue Data Catalog
- Functionally similar to a schema registry
- API-compatible with the Hive Metastore
- lakeFS - Metadata Management: Hive Metastore vs AWS Glue

Glue Data Catalog - database & table
- Databases and tables are objects in the Data Catalog that contain metadata definitions.
- The schema of your data is represented in your Glue table definition; the actual data remains in its original data store.
Glue - Data Catalog - crawler
- AWS Docs - Glue Data Catalog - crawler
- A crawler can crawl multiple data stores in a single run. Upon completion, the crawler creates or updates one or more tables in your Data Catalog. ETL jobs that you define in AWS Glue use these Data Catalog tables as sources and targets.
- A classifier recognizes the format of your data and generates a schema. It returns a certainty number between 0.0 and 1.0, where 1.0 means 100% certainty. Glue uses the output of the classifier that has the highest certainty.
- If no classifier returns a certainty greater than 0.0, Glue returns the default classification string of UNKNOWN.
- Steps (see the crawler sketch after this section)
  - A crawler runs any custom classifiers that you choose to infer the format and schema of your data.
  - The crawler connects to the data store.
  - The inferred schema is created for your data.
  - The crawler writes metadata to the Glue Data Catalog.
- Partition threshold
  - If the following conditions are met, the schemas are denoted as partitions of a table:
    - The partition threshold is higher than 0.7 (70%).
    - The maximum number of different schemas doesn't exceed 5.
  - Learn how AWS Glue crawler detects the schema | AWS re:Post
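A minimal boto3 sketch of defining and starting a crawler over an S3 prefix; the bucket, database, and role names are illustrative assumptions:

```python
import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="raw-logs-crawler",
    Role="GlueServiceRole",
    DatabaseName="raw_logs_db",  # inferred tables land in this database
    Targets={"S3Targets": [{"Path": "s3://my-data-lake/raw/logs/"}]},
)
glue.start_crawler(Name="raw-logs-crawler")
```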
Glue - Data Catalog - security
- You can only use a Glue resource policy to manage permissions for Data Catalog resources.
- You can't attach it to any other Glue resources such as jobs, triggers, development endpoints, crawlers, or classifiers.
- Only one resource policy is allowed per catalog, and its size is limited to 10 KB.
- Each AWS account has one single catalog per Region, whose catalog ID is the same as the AWS account ID.
- You cannot delete or modify a catalog.
Glue Data Quality
- AWS Docs - Glue Data Quality

Glue Streaming
- AWS Docs - Glue Streaming
- Uses the Apache Spark Streaming framework
- Near-real-time data processing
Glue DataBrew
- AWS Docs - Glue DataBrew
- Visual data preparation tool that enables users to clean and normalize data without writing any code
Kinesis Data Streams
- AWS Docs - Kinesis Data Streams
- Key points
  - Equivalent to Kafka: for event streaming, no ETL
  - On-demand or provisioned mode
  - PaaS with API access for developers
  - The maximum size of a record's data payload before base64-encoding is 1 MB.
  - The retention period is 1 day by default, and can be up to 365 days.
- A stream is composed of one or more shards.
- A shard is a uniquely identified sequence of data records in a stream.
- All the data in a shard is sent to the same worker that is processing the shard.
- Read throughput
  - Up to 2 MB/second/shard, 5 transactions (API calls)/second/shard, or 2,000 records/second/shard, shared across all the consumers that are reading from a given shard
  - Each call to GetRecords is counted as 1 read transaction.
  - GetRecords can retrieve up to 10 MB per transaction per shard, and up to 10,000 records per call. If a call to GetRecords returns 10 MB, subsequent calls made within the next 5 seconds throw an exception.
  - Up to 20 registered consumers (enhanced fan-out limit) for each data stream
- Write throughput
  - Up to 1 MB/second/shard or 1,000 records/second/shard
- Record aggregation allows customers to combine multiple records into a single record, improving per-shard throughput.
- Not scaled automatically; resharding is the process used to scale your data stream, using a series of shard splits or merges.
- Partition key
  - All data records with the same partition key map to the same shard.
  - The number of partition keys should typically be much greater than the number of shards, i.e., a many-to-one relationship (see the sketch below).
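A minimal boto3 sketch of how the partition key drives shard placement; the stream name and payload are illustrative assumptions:

```python
import boto3

kinesis = boto3.client("kinesis")

# Records sharing a partition key land on the same shard, which preserves
# their relative order within that shard.
kinesis.put_record(
    StreamName="clickstream",
    Data=b'{"user_id": "u-123", "action": "click"}',
    PartitionKey="u-123",
)
```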
- Idempotency
  - 2 primary causes of duplicate records: producer retries and consumer retries
  - Consumer retries are more common than producer retries.
- Kinesis Data Streams does not guarantee the order of records across shards in a stream.
- Kinesis Data Streams does guarantee the order of records within a shard.
Kinesis Data Streams - Producers

Kinesis Data Streams - Producers - KPL
- AWS Docs - KPL
- Batching
  - Aggregation: storing multiple payloads within a single Kinesis Data Streams record
  - Collection: batching multiple Kinesis Data Streams records into a single HTTP request (PutRecords) to reduce request overhead
- Rate limiting
  - Limits per-shard throughput sent from a single producer
  - Implemented using a token bucket algorithm, with separate buckets for Kinesis Data Streams records and bytes
- Pros
  - Higher throughput due to aggregation and compression of records
  - An abstraction that simplifies coding compared with using the low-level API directly
- Cons
  - Higher latency, due to an additional processing delay of up to the configurable RecordMaxBufferedTime
  - Only supports AWS SDK v1
Kinesis Data Streams - Producers - Kinesis Agent
- AWS Docs - Kinesis Agent
- Stand-alone Java application running as a daemon
- Continuously monitors a set of files and sends new data to your stream
Kinesis Data Streams - Consumers

Kinesis Data Streams - Consumers - KCL
- Tasks performed by the KCL
  - Connects to the data stream
  - Enumerates the shards within the data stream
  - Uses leases to coordinate shard associations with its workers
  - Instantiates a record processor for every shard it manages
  - Pulls data records from the data stream
  - Pushes the records to the corresponding record processor
  - Checkpoints processed records
  - Balances shard-worker associations (leases) when the worker instance count changes or when the data stream is resharded (shards are split or merged)
- Versions
  - KCL 1.x is based on AWS SDK v1, and KCL 2.x is based on AWS SDK v2.
  - KCL 1.x: Java, Node.js, .NET, Python, Ruby
  - KCL 2.x: Java, Python
  - Multistream processing is only supported in KCL 2.3+ for Java.
- You should ensure that the number of KCL instances does not exceed the number of shards (except for failure standby purposes).
- One KCL worker is deployed on one EC2 instance and can process multiple shards.
- A KCL worker runs as a process and has multiple record processors running as threads within it.
- One shard has exactly one corresponding record processor.
- KCL 2.x enables you to create KCL consumer applications that can process more than one data stream at the same time.
- Checkpoints the sequence number for the shard in the lease table to track processed records
- Enhanced fan-out
  - Consumers using enhanced fan-out have dedicated throughput of up to 2 MB/second/shard; without enhanced fan-out, all consumers reading from the same shard share a throughput of 2 MB/second/shard.
  - Requires KCL 2.x and a Kinesis data stream with enhanced fan-out enabled
  - You can register up to 20 consumers per stream to use enhanced fan-out.
  - Kinesis Data Streams pushes data records from the stream to consumers using enhanced fan-out over HTTP/2, therefore no polling and lower latency.
- Lease table
  - At any given time, each shard of data records is bound to a particular KCL worker by a lease, identified by the leaseKey variable.
  - By default, a worker can hold one or more leases (subject to the value of the maxLeasesForWorker variable) at the same time.
  - A DynamoDB table keeps track of the shards in a Kinesis data stream that are being leased and processed by the workers of the KCL consumer application.
  - Each row in the lease table represents a shard that is being processed by the workers of your consumer application.
  - KCL uses the name of the consumer application to create the name of the lease table, therefore each consumer application name must be unique.
  - One lease table per KCL consumer application
  - KCL creates the lease table with a provisioned throughput of 10 reads/second and 10 writes/second.
- Troubleshooting
  - Consumer read API calls throttled (ReadProvisionedThroughputExceeded)
    - Occurs when the GetRecords quota is exceeded; subsequent GetRecords calls are throttled.
    - Quota
      - 5 GetRecords API calls/second/shard
      - 2 MB/second/shard
      - Up to 10 MiB of data per API call per shard
      - Up to 10,000 records per API call
  - Slow record processing
Kinesis Data Firehose
- AWS Docs - Kinesis Data Firehose
- Serverless, fully managed, with automatic scaling
- Pre-built connectors for various sources and destinations, without coding, similar to Kafka Connect
- For data latency of 60 seconds or higher
- Sources
  - Kinesis Data Streams
  - MSK
    - Can only use S3 as the destination
  - Direct PUT (see the sketch after this list)
    - Use the PutRecord and PutRecordBatch APIs to send data
    - If the source is Direct PUT, Firehose will retain data for 24 hours.
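A minimal boto3 sketch of Direct PUT with PutRecordBatch; the delivery stream name and payloads are illustrative assumptions:

```python
import boto3

firehose = boto3.client("firehose")

firehose.put_record_batch(
    DeliveryStreamName="clickstream-to-s3",
    Records=[
        {"Data": b'{"user_id": "u-123", "action": "click"}\n'},
        {"Data": b'{"user_id": "u-456", "action": "view"}\n'},
    ],
)
```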
- Destinations (AWS Docs - Data Firehose - Source, Destination, and Name)
  - S3
  - Redshift
  - OpenSearch
  - Custom HTTP endpoint
  - 3rd-party services
- Buffering hints (AWS Docs - Buffering hints)
  - Buffering size in MBs (destination specific)
    - 1 MB to 128 MB; default is 5 MB
  - Buffering interval in seconds (destination specific)
    - 0 to 900; default is 60 or 300 seconds
- Data transformation is not natively supported; use Lambda
  - Lambda
    - Custom transformation logic
    - Lambda invocation time of up to 5 min
    - Error handling
      - Retries the invocation 3 times by default
      - After retries, if the invocation is unsuccessful, Firehose skips that batch of records. The skipped records are treated as unsuccessfully processed records.
      - The unsuccessfully processed records are delivered to your S3 bucket.
- Data format conversion
  - Firehose can convert JSON records using a schema from a table defined in Glue.
  - Non-JSON data must be converted by invoking a Lambda function.
- Dynamic partitioning
  - Dynamic partitioning enables you to continuously partition streaming data in Firehose by using keys within the data (for example, customer_id or transaction_id) and then deliver the data grouped by these keys into corresponding S3 prefixes.
  - Partitioning your data minimizes the amount of data scanned, optimizes performance, and reduces the cost of your analytics queries on S3.
- Server-side encryption
  - Can be enabled depending on the source of data
  - With Kinesis Data Streams as the source
    - Kinesis Data Streams first decrypts the data and then sends it to Firehose. Firehose buffers the data in memory and delivers it without storing the unencrypted data at rest.
  - With Direct PUT or other data sources
    - Can be turned on by using the StartDeliveryStreamEncryption operation
    - To stop server-side encryption, use the StopDeliveryStreamEncryption operation
- Troubleshooting
Kinesis Data Analytics
- Input
  - Kinesis Data Streams
  - Kinesis Data Firehose
- Output
  - Kinesis Data Streams
  - Kinesis Data Firehose
  - Lambda

Kinesis Data Analytics - Managed Apache Flink
- Equivalent to Kafka Streams; uses Apache Flink behind the scenes for online stream processing, namely streaming ETL
- Managed Service for Apache Flink
- Streaming applications built with APIs in Java, Scala, Python, and SQL
- Interactive notebooks
Kinesis Data Analytics - SQL (legacy)
- Kinesis Data Analytics for SQL Applications

Kinesis Data Analytics Studio
- Based on
  - Apache Zeppelin
  - Apache Flink

Kinesis Video Streams
- Out of scope for the exam
Lake Formation
- Simplify security management and governance at scale
- Centrally manage permissions across your users
- Comprehensive data access and audit logging
- Provides an RDBMS-style permissions model to grant or revoke access to databases, tables, and columns in the Data Catalog, with the underlying data in S3
Lake Formation - Ingestion
- Blueprints are predefined ETL workflows that you can use to ingest data into your data lake.
- Workflows are instances of ingestion blueprints in Lake Formation.
- Blueprint types
  - Database snapshot
    - Loads or reloads data from all tables into the data lake from a JDBC source.
    - You can exclude some data from the source based on an exclude pattern.
    - When to use:
      - Schema evolution is flexible. (Columns are re-named, previous columns are deleted, and new columns are added in their place.)
      - Complete consistency is needed between the source and the destination.
  - Incremental database
    - Loads only new data into the data lake from a JDBC source, based on previously set bookmarks.
    - You specify the individual tables in the JDBC source database to include.
    - When to use:
      - Schema evolution is incremental. (There is only successive addition of columns.)
      - Only new rows are added; previous rows are not updated.
  - Log file
    - Bulk loads data from log file sources
Lake Formation - Permission management
- Permission types
  - Metadata access
    - Data Catalog permissions
  - Underlying data access
    - Data access permissions
      - Data access permissions (SELECT, INSERT, and DELETE) on Data Catalog tables that point to that location
    - Data location permissions
      - The ability to create Data Catalog resources that point to particular S3 locations
      - When you grant the CREATE_TABLE or ALTER permission to a principal, you also grant data location permissions to limit the locations for which the principal can create or alter metadata tables.
- Permission model (see the sketch after this section)
  - The IAM permissions model consists of IAM policies.
  - The Lake Formation permissions model is implemented as DBMS-style GRANT/REVOKE commands.
- Principals
  - IAM users and roles
  - SAML users and groups
  - IAM Identity Center
  - External accounts
- Resources
  - LF-Tags
    - Using LF-Tags can greatly simplify the number of grants compared with using named resource policies.
  - Named data catalog resources
- Permissions
  - Database permissions
  - Table permissions
- Data filtering
  - Column-level security allows users to view only the specific columns and nested columns that they have access to in the table.
  - Row-level security allows users to view only the specific rows of data that they have access to in the table.
  - Cell-level security uses both row filtering and column filtering at the same time, and can restrict access to different columns depending on the row.
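A hedged boto3 sketch of a DBMS-style grant: giving an analyst role SELECT on a Data Catalog table. The role ARN, database, and table names are illustrative assumptions:

```python
import boto3

lf = boto3.client("lakeformation")

lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:role/analyst"},
    Resource={"Table": {"DatabaseName": "sales_db", "Name": "orders"}},
    Permissions=["SELECT"],
)
```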
QuickSight
- AWS Docs - QuickSight
- Key points
  - Ad-hoc, interactive query, analysis, and visualization
  - For formatted canned reports, use Paginated Reports
- Editions
  - Standard edition
  - Enterprise edition
    - VPC connection: access data sources in your VPC without the need for a public IP address
    - Microsoft Active Directory SSO
    - Data stored in SPICE is encrypted at rest.
    - For SQL-based data sources, such as Redshift, Athena, PostgreSQL, or Snowflake, you can refresh your data incrementally within a look-back window of time.
- Datasets
  - Dataset refresh can be scheduled or on-demand.
  - SPICE (in-memory)
  - Direct SQL query
- Visual types
  - Combo chart: combines line and column (bar) charts
  - Scatter plot: used to observe relationships between variables
  - Heat map
    - Use heat maps to show a measure for the intersection of two dimensions, with color-coding to easily differentiate where values fall in the range.
    - Use a heat map if you want to identify trends and outliers, because the use of color makes these easier to spot.
  - Pivot table
    - Use a pivot table if you want to analyze data on the visual.
  - AutoGraph
    - You create a visual by choosing AutoGraph and then selecting fields.
Migration and Transfer
DMS (Database Migration Service)
- AWS Docs - DMS (Database Migration Service)
- Migrates RDBMS, data warehouses, NoSQL databases, and other types of data stores
- One-time migration
- Used for CDC (Change Data Capture) for real-time ongoing replication
DataSync
- AWS Docs - DataSync
- One-off online data transfer; transferring hot or cold data should not impede your business.
- Move data from on-premises to AWS
- Move data between AWS services
- Requires DataSync agent installation on-premises
Snow Family
- AWS Snow Family service models - Feature comparison matrix
- AWS Docs - AWS Snowball Edge Device Specifications
- Snowball Edge Storage Optimized
  - For large-scale data migrations and recurring transfer workflows, as well as local computing with higher capacity needs
- Snowball Edge Compute Optimized
  - For use cases such as machine learning, full motion video analysis, analytics, and local computing stacks
- Snowmobile
  - Intended for more-than-10 PB data migrations from a single location
  - Up to 100 PB per Snowmobile
  - Shipping container
Transfer Family
- Transfer flat files using SFTP, FTPS, and FTP
Storage Gateway
- Hybrid cloud storage solution that enables on-premises applications to use cloud storage without impacting existing applications
- On-premises data is synced to cloud storage, while applications continue to access the cached data locally
- AWS Storage Blog - Cloud storage in minutes with AWS Storage Gateway (updated)
Storage Gateway - Amazon S3 File Gateway
- AWS Docs - Amazon S3 File Gateway User Guide
- Uses file protocols such as NFS and SMB to access files stored as objects in S3
Storage Gateway - Amazon FSx File Gateway
- AWS Docs - Amazon FSx File Gateway User Guide
- Low-latency and efficient access to in-cloud FSx for Windows File Server file shares from your on-premises facility
Storage Gateway - Tape Gateway
- Cloud-backed virtual tape storage
- Tape Gateway presents an iSCSI-based virtual tape library (VTL) of virtual tape drives and a virtual media changer to your on-premises backup application.
- Tape Gateway stores your virtual tapes in S3 and creates new ones automatically.
Storage Gateway - Volume Gateway
- Stored volumes
  - The entire dataset is stored locally and is asynchronously backed up to S3 as point-in-time snapshots.
- Cached volumes
  - The entire dataset is stored in S3, and the most frequently accessed data is cached locally.
Data Pipeline
- Define data-driven workflows for moving and transforming data from various sources to destinations, similar to Airflow
- Managed ETL service using EMR under the hood
- Batch processing
- In maintenance mode; workloads can be migrated to
  - AWS Glue
  - AWS Step Functions
  - Amazon MWAA (Amazon Managed Workflows for Apache Airflow)
Application Integration
Step Functions
- General-purpose serverless workflow orchestration service
- Able to run EMR workloads on a schedule
- AWS Docs - AWS Step Functions - Manage an Amazon EMR Job
- AWS Big Data Blog - Automating EMR workloads using AWS Step Functions
SQS
Kinesis Data Streams vs SQS
| | Kinesis Data Streams / MSK | SQS |
|---|---|---|
| Consumption | Can be consumed many times | Records are deleted after being consumed |
| Retention | Retention period configurable from 24 hours to 1 year | Retention period configurable from 1 min to 14 days |
| Ordering | Ordering of records is preserved within a shard | FIFO queues preserve ordering of records |
| Scaling | Manual resharding | Auto scaling |
| Delivery | At-least-once delivery | Exactly-once delivery (FIFO queues) |
| Replay | Can replay | Cannot replay |
| Max payload size | 1 MB | 256 KB |
Database
Redshift
- PB scale, columnar storage
- MPP architecture
- Loading data (see the sketch after this list)
  - AWS Docs - Amazon Redshift best practices for loading data
  - COPY command
    - Can load data from multiple files in parallel
    - Use the COPY command with COMPUPDATE set to ON to analyze and apply compression automatically
    - Split your data into files so that the number of files is a multiple of the number of slices in your cluster
- AWS Big Data Blog - JOIN Amazon Redshift AND Amazon RDS PostgreSQL WITH dblink
  - The dblink function allows the entire query to be pushed to Amazon Redshift. This lets Amazon Redshift do what it does best (query large quantities of data efficiently) and return the results to PostgreSQL for further processing.
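A hedged sketch of running COPY through the Redshift Data API; the cluster, database, table, bucket, and IAM role are illustrative assumptions:

```python
import boto3

rsd = boto3.client("redshift-data")

rsd.execute_statement(
    ClusterIdentifier="analytics-cluster",
    Database="dev",
    DbUser="awsuser",
    # COPY loads the split files in parallel; COMPUPDATE ON applies
    # compression encodings automatically on an empty table.
    Sql="""
        COPY sales
        FROM 's3://my-bucket/sales/part-'
        IAM_ROLE 'arn:aws:iam::111122223333:role/RedshiftCopyRole'
        FORMAT AS CSV
        COMPUPDATE ON;
    """,
)
```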
Redshift - Cluster
- Every cluster can have up to 32 nodes.
- The leader node and compute nodes have the same specs; the leader node is chosen automatically.
- Leader node (present in clusters with 2+ nodes)
  - Parses queries, builds an optimal execution plan, and compiles code from the execution plan
  - Receives and aggregates the results from the compute nodes
- Compute node
  - Node slices
    - A compute node is partitioned into slices. Each slice is allocated a portion of the node's memory and disk space, where it processes a portion of the workload assigned to the node.
- Enhanced VPC routing
  - Routes network traffic between your cluster and data repositories through a VPC, instead of through the internet
Redshift - Cluster - Elastic resize
- Preferred method
- Resizes in an incremental manner
- Add or remove nodes (same node type) to a cluster in minutes, aka in-place resize (see the sketch below)
- Change the node type of an existing cluster
  - A snapshot is created, and a new cluster is created from the snapshot with the new node type.
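A hedged boto3 sketch of requesting an elastic resize that adds same-type nodes; the cluster identifier and node count are illustrative assumptions:

```python
import boto3

redshift = boto3.client("redshift")

redshift.resize_cluster(
    ClusterIdentifier="analytics-cluster",
    NumberOfNodes=6,
    Classic=False,  # False requests an elastic (in-place) resize
)
```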
Redshift - Cluster - Classic resize
- Use when you need to change the cluster size or the node type of an existing cluster and elastic resize is not supported
- Takes much longer than elastic resize
Redshift - Cluster - Node type
- RA3

  | Node size | vCPU | RAM (GiB) | Managed storage quota per node | Nodes per cluster |
  |---|---|---|---|---|
  | ra3.xlplus | 4 | 32 | 32 TB | 1 to 32 |
  | ra3.4xlarge | 12 | 96 | 128 TB | 2 to 32 |
  | ra3.16xlarge | 48 | 384 | 128 TB | 2 to 128 |

  - Pay for managed storage and compute separately
  - Managed storage uses SSD for local cache and S3 for persistence, and automatically scales based on the workload
  - Compute is separated from managed storage and can be scaled independently
  - RA3 nodes are optimized for performance and cost-effectiveness
- DC2 (Dense Compute)

  | Node size | vCPU | RAM (GiB) | Local SSD storage per node | Nodes per cluster |
  |---|---|---|---|---|
  | dc2.large | 2 | 15 | 160 GB | 1 to 32 |
  | dc2.8xlarge | 32 | 244 | 2.6 TB | 2 to 128 |

  - Compute-intensive, with local SSD storage for high performance and low latency
  - Recommended for datasets under 10 TB
- DS2 (Dense Storage)
  - Deprecated legacy option for low-cost workloads with HDD storage; RA3 is recommended for new workloads
Redshift - Cluster - HA
- Single-AZ by default
- Multi-AZ is supported
Redshift - Distribution style
- AWS Docs - Distribution style
- Redshift distribution keys (DIST keys) determine where data is stored in Redshift. Clusters store data fundamentally across the compute nodes. Query performance suffers when a large amount of data is stored on a single node.
- Uneven distribution of data across compute nodes leads to skew in the work each node has to do, and you don't want an under-utilised compute node, so the distribution of the data should be uniform.
- Data should be distributed so that rows that participate in joins are already on the same node as their joining rows in other tables. This is called collocation, which reduces data movement and improves query performance.
- Distribution is configurable per table, so you can select a different distribution style for each table.
- AUTO distribution
  - Redshift automatically distributes the data across the nodes in the cluster.
  - Redshift initially assigns ALL distribution to a small table, then changes to EVEN distribution when the table grows larger.
  - Suitable for small tables or tables with a small number of distinct values
- EVEN distribution
  - The leader node distributes the rows across the slices in a round-robin fashion, regardless of the values in any particular column.
  - Appropriate when a table doesn't participate in joins, or when there isn't a clear choice between KEY distribution and ALL distribution.
- KEY distribution
  - The rows are distributed according to the values in one column.
  - All the entries with the same value in the column end up on the same slice.
  - Suitable for large tables or tables with a large number of distinct values
- ALL distribution
  - A copy of the entire table is distributed to every node.
  - Suitable for small tables or tables that are not joined with other tables
Redshift - Concurrency scaling
- Concurrency scaling adds transient clusters to your cluster to handle concurrent requests with consistently fast performance, in a matter of seconds.
Redshift - Sort keys
- AWS Docs - Sort keys
- When you create a table, you can define one or more of its columns as sort keys.
- A sort key is either compound or interleaved.
Redshift - Compression
- AWS Docs - Working with column compression
- You can manually apply a compression type, or encoding, to the columns in a table when you create the table.
- The COPY command analyzes your data and applies compression encodings to an empty table automatically as part of the load operation.
- Split your load data files so that the files are about equal size, between 1 MB and 1 GB after compression. For optimum parallelism, the ideal size is between 1 MB and 125 MB after compression.
- The number of files should be a multiple of the number of slices in your cluster.
- When loading large datasets, it is strongly recommended that you individually compress your load files using gzip, lzop, bzip2, or Zstandard.
Redshift - Workload Management (WLM)
- AWS Docs - Amazon Redshift - Implementing workload management
- Automatic WLM
- Manual WLM
  - Query groups
    - Memory percentage and concurrency settings
  - Query monitoring rules
    - Define the conditions under which a query is monitored and the action to take when the conditions are met
  - Query queues
    - Memory percentage and concurrency settings
- Concurrency scaling
  - Automatically adds and removes transient compute capacity to handle spikes in demand
  - Enabled by default for all RA3 and DC2 clusters; not available for DS2 clusters
- Short query acceleration (SQA)
  - SQA prioritizes selected short-running queries ahead of longer-running queries.
  - SQA runs short-running queries in a dedicated space, so that SQA queries aren't forced to wait in queues behind longer queries.
  - CREATE TABLE AS (CTAS) statements and read-only queries, such as SELECT statements, are eligible for SQA.
Redshift - User-Defined Functions (UDF)
- User-defined functions (UDFs) are functions that you define to extend the capabilities of the SQL language in Redshift.
- AWS Lambda UDFs
  - Useful for accessing other AWS services
Redshift - Snapshots
- Automated snapshots
  - Enabled by default when you create a cluster
  - By default, taken about every 8 hours or following every 5 GB per node of data changes, whichever comes first
  - Default retention period is 1 day
  - Cannot be deleted manually
- Manual snapshots
  - By default, manual snapshots are retained indefinitely, even after you delete your cluster.
  - AWS CLI: create-cluster-snapshot
  - AWS API: CreateClusterSnapshot
- Backup
  - You can configure Redshift to automatically copy snapshots (automated or manual) for a cluster to another Region.
  - Only one destination Region at a time can be configured for automatic snapshot copy.
  - To change the destination Region that you copy snapshots to, first disable the automatic copy feature, then re-enable it, specifying the new destination Region.
- Snapshot copy grant
  - KMS keys are specific to a Region. If you enable copying of Redshift snapshots to another Region, and the source cluster and its snapshots are encrypted using a master key from KMS, you need to configure a grant for Redshift to use a master key in the destination Region.
  - The grant is created in the destination Region and allows Redshift to use the master key in the destination Region.
Redshift - Query
- Data API (see the sketch below)
  - Available in the AWS SDK; based on HTTP and JSON, no persistent connection needed
  - API calls are asynchronous
  - Uses either credentials stored in Secrets Manager or temporary database credentials; no need to pass passwords in the API calls
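A minimal sketch of the asynchronous Data API flow; the cluster, database, secret ARN, and query are illustrative assumptions:

```python
import boto3

rsd = boto3.client("redshift-data")

resp = rsd.execute_statement(
    ClusterIdentifier="analytics-cluster",
    Database="dev",
    SecretArn="arn:aws:secretsmanager:us-east-1:111122223333:secret:redshift-creds",
    Sql="SELECT COUNT(*) FROM sales;",
)

# The call returns immediately; poll describe_statement() until the status
# is FINISHED, then fetch rows with get_statement_result(Id=resp["Id"]).
```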
Redshift - Security
- IAM
  - Does not support resource-based policies
- Access control
  - Cluster level
    - IAM policies for Redshift API actions
    - Security groups for network connectivity
  - Database level
    - Database user accounts
- Data protection
  - DB encryption using KMS
    - KMS key hierarchy
      - Root key
      - Cluster encryption key (CEK)
      - Database encryption key (DEK)
      - Data encryption keys
- Auditing
  - Logs
    - Connection log: logs authentication attempts, connections, and disconnections
    - User log: logs information about changes to database user definitions
    - User activity log: logs each query before it's run on the database
Redshift Spectrum
- In-place querying of data in S3, without having to load the data into Redshift tables
- EB scale
- Pay for the number of bytes scanned
- Federated query across operational databases, data warehouses, and data lakes
Redshift Spectrum - Comparison with Athena
| | Redshift Spectrum | Athena |
|---|---|---|
| Performance | Greater control over performance; uses dedicated resources of your Redshift cluster | Not configurable; uses shared resources managed by AWS |
| Query results | Stored in Redshift | Stored in S3 |
Redshift Serverless
- Redshift Serverless measures data warehouse capacity in Redshift Processing Units (RPUs). You pay for the workloads you run in RPU-hours on a per-second basis (with a 60-second minimum charge), including queries that access data in open file formats in S3.
Keyspaces
- AWS Docs - Keyspaces
- Based on Apache Cassandra

Neptune
- AWS Docs - Neptune
- Serverless graph database
Amazon Security Lake
- Fully managed security data lake service
- Centralizes security data from AWS environments, SaaS providers, on-premises, cloud, and third-party sources into a purpose-built data lake stored in your AWS account
Elemental MediaStore
- High performance for streaming media delivery