Exam Guide
Domain 1: Collection
Task Statement 1.1: Determine the operational characteristics of the collection system
Task Statement 1.2: Select a collection system that handles the frequency, volume, and source of data
- Describe and characterize the volume and flow characteristics of incoming data (streaming, transactional, batch)
  - Streaming data: `Kinesis Data Streams`, `Amazon MSK`
  - Batch data: `Glue`, `Transfer Family`, `DataSync`, `Storage Gateway`
  - Transactional data: `DMS`
- Resources
Task Statement 1.3: Select a collection system that addresses the key properties of data, such as order, format, and compression
- Describe how to capture data changes at the source
  - `DMS` supports `Change Data Capture` (CDC).
- Describe how to transform and filter data during the collection process
  - `Glue` for batch `ETL`
  - `Kinesis Data Analytics` for streaming `ETL`
  - `Kinesis Data Firehose` and `Lambda` for streaming `ETL`
Domain 2: Storage and Data Management
Task Statement 2.1: Determine the operational characteristics of the storage solution for analytics
Services
Analytics
Athena
- `In-place querying` of data in `S3`
- `Serverless`
- `PB` scale
- `Presto` under the hood
- Supports `SQL` and `Apache Spark`
- Analyze unstructured, semi-structured, and structured data stored in `S3`, in formats including `CSV`, `JSON`, `Apache Parquet` and `Apache ORC`
- Only successful or canceled queries are billed; failed queries are not billed.
- No charge for `DDL` statements
- Use cases
  - Ad-hoc queries of web logs, `CloudTrail`/`CloudFront`/`VPC`/`ELB` logs
  - Querying staging data before loading to `Redshift`
  - Integration with `Jupyter`, `Zeppelin`, `RStudio` notebooks
  - Integration with `QuickSight` for visualization
  - Integration via `JDBC`/`ODBC` for `BI` tools
- SQL
  - `MSCK REPAIR TABLE` updates the metastore with the current partitions of an external table. See: Repairing partitions manually using MSCK repair. A sketch of running it through the Athena API follows below.
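A minimal sketch (the table, database, and results bucket names are hypothetical) of running `MSCK REPAIR TABLE` through the Athena API with `boto3`:

```python
# Minimal sketch: run MSCK REPAIR TABLE via the Athena API.
# All names below are placeholders, not from the notes.
import boto3

athena = boto3.client("athena", region_name="us-east-1")

response = athena.start_query_execution(
    QueryString="MSCK REPAIR TABLE web_logs",            # hypothetical partitioned external table
    QueryExecutionContext={"Database": "analytics_db"},  # hypothetical Data Catalog database
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},  # placeholder bucket
)
print(response["QueryExecutionId"])  # Athena queries run asynchronously
```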
Athena - Workgroup
- A `workgroup` is a collection of users, queries, and results that are associated with a specific data catalog and workgroup settings.
- `Workgroup` settings include the location in `S3` where query results are stored, the encryption configuration, and the data usage control settings.
Athena - Federated Query
AWS Docs - Using Amazon Athena Federated Query
- Query the data in place, or build pipelines that extract data from multiple data sources and store them in `S3`.
- Uses `data source connectors` that run on `Lambda` to run federated queries.
CloudSearch
AWS Docs - CloudSearch
- Managed search service, based on `Apache Solr`
OpenSearch
AWS Docs - OpenSearch
- Operational analytics and log analytics
- Forked version of `Elasticsearch`
- Shard sizing rule of thumb: `Number of Shards = Index Size / 30GB` (worked example below)
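A quick worked example of the sizing rule of thumb above (the 30 GB divisor is the guideline quoted here, not a hard limit; the index size is hypothetical):

```python
# Worked example of: Number of Shards = Index Size / 30 GB
import math

index_size_gb = 250                      # hypothetical index size
shards = math.ceil(index_size_gb / 30)   # round up so no shard exceeds ~30 GB
print(shards)                            # -> 9
```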
EMR
AWS Docs - EMR
- Managed cluster platform that simplifies running big data frameworks, such as `Hadoop` and `Spark`, on `AWS`
- `PB` scale
- The `RunJobFlow` API action creates and starts running a new `cluster` (job flow). The `cluster` runs the `steps` specified. After the `steps` complete, the `cluster` stops and the `HDFS` partition is lost. (See the sketch below.)
- An `EMR` cluster with multiple `primary nodes` can reside only in one `AZ`.
- `HDFS`
  - A typical block size used by `HDFS` is `128 MB`. Thus, an `HDFS` file is chopped up into `128 MB` chunks, and if possible, each chunk will reside on a different `DataNode`.
EMR cluster - Node types
AWS Docs - EMR cluster - Node types
- `Primary node` / `Master node`
  - Manages the cluster and typically runs primary components of distributed applications.
  - Tracks the status of jobs submitted to the cluster
  - Monitors the health of the instance groups
  - `HA` support from `5.23.0+`
- `Core nodes`
  - Store `HDFS` data and run tasks
  - Multi-node clusters have at least one `core node`.
  - One `core` `instance group` or `instance fleet` per cluster, but there can be multiple nodes running on multiple `EC2` instances in the `instance group` or `instance fleet`.
- `Task nodes`
  - No `HDFS` data, therefore no data loss if terminated
  - Can use `Spot instances`
  - `Optional` in a cluster
EMR - EC2
AWS Docs - EMR - EC2
EMR - EKS
AWS Docs - EMR - EKS
EMR - Serverless
AWS Docs - EMR - Serverless
EMR - EMRFS
- Direct access to `S3` from the `EMR` cluster
- Persistent storage for `EMR` clusters
- Benefit from `S3` features
- For clusters with multiple users who need different levels of access to data in `S3` through `EMRFS`:
  - `EMRFS` can assume a different service role for cluster `EC2` instances based on the user or group making the request, or based on the location of data in `S3`.
  - Each `IAM role` for `EMRFS` can have different permissions for data access in `S3`.
  - When `EMRFS` makes a request to `S3` that matches users, groups, or the locations that you specify, the cluster uses the corresponding role that you specify instead of the `EMR role for EC2`.
- AWS Docs - Authorizing access to EMRFS data in Amazon S3
EMR - S3DistCp
- An extension to `DistCp` (Apache Hadoop Distributed Copy)
- Uses `MapReduce` to efficiently copy large amounts of data in a distributed manner
- Copy data from `S3` to `HDFS`
- Copy data from `HDFS` to `S3`
- Copy data between `S3` buckets
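A minimal sketch (placeholder cluster ID and paths) of submitting an `S3DistCp` copy as an `EMR` step with `boto3`:

```python
# Minimal sketch: run s3-dist-cp on a running EMR cluster as a step.
import boto3

emr = boto3.client("emr")

emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",  # placeholder cluster ID
    Steps=[{
        "Name": "copy-s3-to-hdfs",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "s3-dist-cp",
                "--src", "s3://my-bucket/raw/",  # placeholder source
                "--dest", "hdfs:///data/raw/",
            ],
        },
    }],
)
```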
EMR - Security - Apache Ranger
- An `RBAC` framework to enable, monitor and manage comprehensive `data security` across the `Hadoop` ecosystem
- Centralized security administration and auditing
- Fine-grained authorization across many `Hadoop` components (`Hadoop`, `Hive`, `HBase`, `Storm`, `Knox`, `Solr`, `Kafka`, and `YARN`)
- Syncs policies and users by using `agents` and `plugins` that run within the same process as the Hadoop component
- Supports row-level authorization and auditing capabilities with embedded search
EMR - Security Configuration
AWS Docs - Security Configuration
- `At-rest` data encryption
  - For `WAL` with `EMR`
    - Open-source `HDFS` encryption
    - `LUKS` encryption
  - For cluster nodes' local disks (`EC2` instance volumes)
    - Open-source `HDFS` encryption
    - `Instance store` encryption
      - `NVMe` encryption
      - `LUKS` encryption
    - `EBS volume` encryption
      - `EBS` encryption
        - Recommended
        - EMR 5.24.0+
      - `LUKS` encryption
        - Only applies to attached storage volumes, not to the root device volume
  - For `EMRFS` on `S3`
    - `SSE-S3`
    - `SSE-KMS`
    - `SSE-C`
    - `CSE-KMS`
      - Objects are encrypted before being uploaded to `S3`, and the client uses keys provided by `KMS`.
    - `CSE-Custom`
      - Objects are encrypted before being uploaded to `S3`, and the client uses a custom Java class that provides the client-side master key.
- `In-transit` data encryption
  - For `EMR` traffic between cluster nodes, and for `WAL` with `EMR`
  - For distributed applications
  - For `EMRFS` traffic between `S3` and cluster nodes
- Resources
EMR Studio
- Web IDE for fully managed Jupyter notebooks running on EMR clusters
EMR Notebooks
EMR - Open Source Ecosystem
Delta Lake
EMR - Delta Lake
- Storage layer framework for lakehouse architectures
Ganglia
EMR - Ganglia
- Performance monitoring tool for `Hadoop` and `HBase` clusters
- Not included in `EMR` after `6.15.0`
Apache HBase
EMR - Apache HBase
- `Wide-column store NoSQL` running on `HDFS`
- `EMR WAL` support: able to restore an existing `WAL` retained for `30 days` starting from the time when the cluster was created, with a `new cluster` from the same `S3` root directory.
Apache HCatalog
EMR - Apache HCatalog
- `HCatalog` is a tool that allows you to access `Hive` `metastore` tables within `Pig`, `Spark SQL`, and/or custom `MapReduce` applications.
Apache Hudi
EMR - Apache Hudi
- `Open Table Format`
- Brings `database` and `data warehouse` capabilities to the `data lake`
Hue
EMR - Hue
- `Web GUI` for `Hadoop`
Apache Iceberg
EMR - Apache Iceberg
- `Open Table Format`
Apache Livy
EMR - Apache Livy
- A `REST` interface for `Apache Spark`
Apache MXNet
EMR - Apache MXNet
- Retired
- A `deep learning` framework for building neural networks and other deep learning applications
Apache Oozie
EMR - Apache Oozie
- `Workflow scheduler system` to manage `Hadoop` jobs
- `Apache Airflow` is a modern alternative.
Apache Phoenix
EMR - Apache Phoenix
- `OLTP` and `operational analytics` in `Hadoop` for `low latency` applications
Apache Pig
EMR - Apache Pig
- `SQL-like` commands written in `Pig Latin`, converted into `DAG`-based `Tez` jobs or `MapReduce` programs
Apache Sqoop
EMR - Apache Sqoop
- Retired
- A tool for transferring data between `S3`, `Hadoop`, `HDFS`, and `RDBMS` databases
Apache Tez
EMR - Apache Tez
- An application execution framework for complex `DAGs` of tasks, similar to `Apache Spark`
Glue
- Basically serverless `Spark` `ETL` with a `Hive metastore`
- Ingests data in batch while performing `transformations`
- Either provide a script to perform the `ETL` job, or `Glue` can generate a script automatically.
- `Data source` can be an `AWS` service, such as `RDS`, `S3`, `DynamoDB`, or `Kinesis Data Streams`, as well as a third-party `JDBC`-accessible database.
- `Data target` can be an `AWS` service, such as `S3`, `RDS`, and `DocumentDB`, as well as a third-party `JDBC`-accessible database.
- Hourly rate charged based on the number of `Data Processing Units (DPUs)` used to run your `ETL` job.
- Resources
- Orchestration
  - `triggers` (see the sketch after this list)
    - When fired, a trigger can start specified `jobs` and `crawlers`.
    - A trigger fires `on demand`, `based on a schedule`, or `based on a combination of events`.
    - Trigger types
      - `Scheduled`
      - `Conditional`: the trigger fires if the watched jobs or crawlers end with the specified statuses.
      - `On-demand`
  - `workflows`
    - Create and visualize complex `ETL` activities involving multiple `crawlers`, `jobs`, and `triggers`.
  - `blueprints`
    - `Glue blueprints` provide a way to create and share `Glue workflows`. When there is a complex `ETL` process that could be used for similar use cases, rather than creating a `Glue workflow` for each use case, you can create a single `blueprint`.
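As referenced in the `triggers` item above, a minimal sketch (the job and trigger names are hypothetical) of a `Conditional` trigger created with `boto3`, which starts one job only after a watched job succeeds:

```python
# Minimal sketch: a conditional Glue trigger that chains two jobs.
import boto3

glue = boto3.client("glue")

glue.create_trigger(
    Name="start-load-after-extract",  # hypothetical trigger name
    Type="CONDITIONAL",
    StartOnCreation=True,
    Predicate={
        "Conditions": [{
            "LogicalOperator": "EQUALS",
            "JobName": "extract-job",  # hypothetical upstream job
            "State": "SUCCEEDED",
        }]
    },
    Actions=[{"JobName": "load-job"}],  # hypothetical downstream job
)
```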
Glue - Job
- Job type
  - `Apache Spark`
    - Minimum `2 DPU`
    - `10 DPU` (default)
  - `Apache Spark Streaming`
    - Minimum `2 DPU` (default)
  - `Python Shell`
    - `1 DPU` or `0.0625 DPU` (default)
- Job Bookmarks
  - `Glue` tracks data that has already been processed during a previous run of an `ETL` job by persisting state information from the job run. This persisted state information is called a `job bookmark`. (See the sketch below.)
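A minimal sketch of enabling a `job bookmark` for a run by passing the documented `--job-bookmark-option` argument (the job name is hypothetical):

```python
# Minimal sketch: start a Glue job run with job bookmarks enabled.
import boto3

glue = boto3.client("glue")

glue.start_job_run(
    JobName="daily-etl",  # hypothetical job
    Arguments={"--job-bookmark-option": "job-bookmark-enable"},
)
```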
Glue - Worker
- A single `DPU` is also called a `worker`.
- A `DPU` is a relative measure of processing power that consists of `4 vCPUs` of compute capacity and `16 GB` of memory.
- An `M-DPU` is a `DPU` with `4 vCPUs` and `32 GB` of memory.
- Worker type
  - For Glue 2.0+, specify `Worker type` and `Number of workers`
  - `G.1X` (default)
    - 1 DPU
    - Recommended for most workloads with cost-effective performance
  - `G.2X`
    - 2 DPU
    - Recommended for most workloads with cost-effective performance
  - `G.4X`
    - 4 DPU
    - Glue 3.0+ `Spark` jobs
  - `G.8X`
    - 8 DPU
    - Recommended for workloads with the most demanding transformations, aggregations, and joins
    - Glue 3.0+ `Spark` jobs
  - `G.025X`
    - 0.25 DPU
    - Recommended for `low volume streaming` jobs
    - Glue 3.0+ `Spark Streaming` jobs
Glue - Data Catalog
AWS Docs - Glue Data Catalog
- Functionally similar to a `schema registry`
- `API`-compatible with `Hive Metastore`
- lakeFS - Metadata Management: Hive Metastore vs AWS Glue
- Glue Data Catalog - database & table
  - `Databases` and `tables` are objects in the `Data Catalog` that contain metadata definitions.
  - The schema of your data is represented in your `Glue` table definition. The actual data remains in its original data store.
Glue - Data Catalog - crawler
AWS Docs - Glue Data Catalog - crawler
- A crawler can crawl multiple data stores in a single run. Upon completion, the crawler creates or updates one or more tables in your `Data Catalog`. `ETL` jobs that you define in `AWS Glue` use these `Data Catalog` tables as sources and targets.
- A `classifier` recognizes the format of your data and generates a schema. It returns a `certainty number` between `0.0` and `1.0`, where `1.0` means `100%` certainty. `Glue` uses the output of the `classifier` that has the `highest certainty`.
- If no `classifier` returns a `certainty` greater than `0.0`, `Glue` returns the default classification string of `UNKNOWN`.
- Steps (see the sketch after this list)
  - A `crawler` runs any `custom classifiers` that you choose to infer the format and schema of your data.
  - The crawler connects to the `data store`.
  - The `inferred schema` is created for your data.
  - The crawler writes `metadata` to the `Glue Data Catalog`.
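A minimal sketch of the crawler flow above with `boto3` (the crawler, role, database, and path are all hypothetical): define a crawler over an `S3` target, then start it so it infers the schema and writes tables to the `Data Catalog`:

```python
# Minimal sketch: define and start a Glue crawler over an S3 path.
import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="raw-logs-crawler",      # hypothetical crawler name
    Role="GlueCrawlerRole",       # hypothetical IAM role
    DatabaseName="analytics_db",  # hypothetical catalog database
    Targets={"S3Targets": [{"Path": "s3://my-bucket/raw-logs/"}]},  # placeholder path
)
glue.start_crawler(Name="raw-logs-crawler")
```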
- Partition threshold
  - If the following conditions are met, the `schemas` are denoted as `partitions` of a `table`:
    - The `partition threshold` is higher than `0.7 (70%)`.
    - The maximum number of different `schemas` doesn't exceed `5`.
  - Learn how AWS Glue crawler detects the schema | AWS re:Post
Glue - Data Catalog - security
- You can only use a `Glue` resource policy to manage permissions for `Data Catalog resources`.
- You can't attach it to any other `Glue resources` such as `jobs`, `triggers`, `development endpoints`, `crawlers`, or `classifiers`.
- Only one `resource policy` is allowed per `catalog`, and its size is limited to `10 KB`.
- Each AWS account has one single catalog per Region whose `catalog ID` is the same as the `AWS account ID`.
- You cannot delete or modify a `catalog`.
Glue Data Quality
AWS Docs - Glue Data Quality
Glue Streaming
AWS Docs - Glue Streaming
- Uses the `Apache Spark Streaming` framework
- `Near-real-time` data processing
Glue DataBrew
AWS Docs - Glue DataBrew
- Visual data preparation tool that enables users to clean and normalize data without writing any code
Kinesis Data Streams
AWS Docs - Kinesis Data Streams
- Key points
  - Equivalent to `Kafka`, for event streaming; no `ETL`
  - `On-demand` or `provisioned` mode
  - `PaaS` with `API` access for developers
  - The maximum size of the data payload of a record before base64 encoding is `1 MB`.
  - `Retention period` by default is `1 day`, and can be up to `365 days`.
- A `stream` is composed of one or more `shards`.
- A `shard` is a uniquely identified sequence of `data records` in a `stream`.
- All the data in the `shard` is sent to the same worker that is processing the `shard`.
- Read throughput
  - Up to `2 MB/second/shard`, or `5 transactions (API calls)/second/shard`, or `2000 records/second/shard`, shared across all the consumers that are reading from a given `shard`
  - Each call to `GetRecords` is counted as `1 read transaction`.
  - `GetRecords` can retrieve up to `10 MB/transaction/shard`, and up to `10000 records per call`. If a call to `GetRecords` returns `10 MB`, subsequent calls made within the next 5 seconds throw an exception.
  - Up to `20 registered consumers` (`Enhanced Fan-out` limit) for each data stream
- Write throughput
  - Up to `1 MB/second/shard` or `1000 records/second/shard`
- `Record aggregation` allows customers to combine multiple records into a single record, improving per-`shard` throughput.
- Not scaled automatically; `resharding` is the process used to scale your data stream using a series of shard splits or merges.
- `Partition key` (see the sketch after this list)
  - All `data records` with the same `partition key` map to the same `shard`.
  - The number of `partition keys` should typically be much greater than the number of `shards` (a many-to-one relationship).
- Idempotency
  - `2` primary causes of duplicate records: `producer` retries and `consumer` retries
  - Consumer retries are more common than producer retries
- `Kinesis Data Streams` does not guarantee the order of records across `shards` in a `stream`.
- `Kinesis Data Streams` does guarantee the order of records within a `shard`.
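As referenced in the `Partition key` item above, a minimal sketch (hypothetical stream name) of the key-to-shard behavior with `boto3`: records sharing a partition key land on the same shard, so their relative order is preserved:

```python
# Minimal sketch: write records with a fixed partition key.
import json
import boto3

kinesis = boto3.client("kinesis")

for i in range(3):
    kinesis.put_record(
        StreamName="clickstream",                  # hypothetical stream
        Data=json.dumps({"event_id": i}).encode(),
        PartitionKey="customer-42",                # same key -> same shard, ordered
    )
```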
Kinesis Data Streams - Producers
Kinesis Data Streams - Producers - KPL
AWS Docs - KPL
- Batching
  - Aggregation
    - Storing multiple payloads within a single `Kinesis Data Streams` record
  - Collection
    - Batching multiple `Kinesis Data Streams` records into a single HTTP request (via the `PutRecords` API) to reduce HTTP requests
- Rate Limiting
  - Limits `per-shard throughput` sent from a single producer
  - Implemented using a `token bucket algorithm` with separate buckets for both `Kinesis Data Streams` records and bytes
- Pros
  - Higher throughput due to `aggregation` and `compression` of records
  - Abstraction over the `API` to simplify coding, compared with using the low-level `API` directly
- Cons
  - Higher latency due to additional processing delay of up to the configurable `RecordMaxBufferedTime`
  - Only supports `AWS SDK v1`
Kinesis Data Streams - Producers - Kinesis Agent
AWS Docs - Kinesis Agent
- Stand-alone `Java` application running as a `daemon`
- Continuously monitors a set of files and sends new data to your stream
Kinesis Data Streams - Consumers
Kinesis Data Streams - Consumers - KCL
- Tasks
  - Connects to the data stream
  - Enumerates the shards within the data stream
  - Uses leases to coordinate shard associations with its workers
  - Instantiates a record processor for every shard it manages
  - Pulls data records from the data stream
  - Pushes the records to the corresponding record processor
  - Checkpoints processed records
  - Balances shard-worker associations (leases) when the worker instance count changes or when the data stream is resharded (shards are split or merged)
- Versions
  - `KCL 1.x` is based on `AWS SDK v1`, and `KCL 2.x` is based on `AWS SDK v2`
  - `KCL 1.x`: Java, Node.js, .NET, Python, Ruby
  - `KCL 2.x`: Java, Python
- `Multistream processing` is only supported in `KCL 2.x for Java` (`KCL 2.3+ for Java`)
- You should ensure that the number of KCL instances does not exceed the number of `shards` (except for failure standby purposes)
- One `KCL worker` is deployed on one `EC2` instance and is able to process multiple `shards`.
- A `KCL worker` runs as a `process`, and it has multiple `record processors` running as `threads` within it.
- One `shard` has exactly one corresponding `record processor`.
- `KCL 2.x` enables you to create `KCL` consumer applications that can process more than one data stream at the same time.
- Checkpoints the sequence number for the `shard` in the `lease table` to track processed records
- Enhanced fan-out
  - `Consumers` using `enhanced fan-out` have dedicated throughput of up to `2 MB/second/shard`; without `enhanced fan-out`, all `consumers` reading from the same shard share the throughput of `2 MB/second/shard`.
  - Requires `KCL 2.x` and `Kinesis Data Streams` with `enhanced fan-out` enabled
  - You can register up to `20 consumers per stream` to use `enhanced fan-out`.
  - `Kinesis Data Streams` pushes data records from the stream to `consumers using enhanced fan-out` over `HTTP/2`, therefore no polling and lower latency.
- Lease Table
  - At any given time, each `shard` of data records is bound to a particular `KCL worker` by a `lease`, identified by the `leaseKey` variable.
  - By default, a `worker` can hold one or more `leases` (subject to the value of the `maxLeasesForWorker` variable) at the same time.
  - A `DynamoDB` table keeps track of the `shards` in a `Kinesis Data Stream` that are being leased and processed by the `workers` of the `KCL consumer application`.
  - Each row in the lease table represents a shard that is being processed by the workers of your consumer application.
  - `KCL` uses the name of the consumer application to create the name of the lease table that this consumer application uses, therefore each consumer application name must be unique.
  - One lease table for one `KCL consumer application`
  - `KCL` creates the `lease table` with a provisioned throughput of `10 reads/second` and `10 writes/second`
- Troubleshooting
  - Consumer read API calls throttled (`ReadProvisionedThroughputExceeded`)
    - Occurs when the `GetRecords` quota is exceeded; subsequent `GetRecords` calls are throttled.
    - Quota
      - 5 `GetRecords` API calls / second / shard
      - 2 MB / second / shard
      - Up to 10 MiB of data / API call / shard
      - Up to 10000 records / API call
  - Slow record processing
Kinesis Data Firehose
AWS Docs - Kinesis Data Firehose
- `Serverless`, fully managed, with automatic scaling
- Pre-built `Kinesis Data Streams` connectors for various sources and destinations, without coding, similar to `Kafka Connect`
- For data `latency` of `60 seconds or higher`
- Sources
  - `Kinesis Data Streams`
  - `MSK`
    - Can only use `S3` as destination
  - `Direct PUT`
    - Use the `PutRecord` and `PutRecordBatch` APIs to send data
    - If the source is `Direct PUT`, `Firehose` will retain data for `24 hours`.
- Destinations - AWS Docs - Data Firehose - Source, Destination, and Name
  - `S3`
  - `Redshift`
  - `OpenSearch`
  - Custom `HTTP` endpoint
  - `3rd-party` service
- Buffering hints
  - AWS Docs - Buffering hints
  - `Buffering size` in `MBs` (destination specific)
    - `1 MB` to `128 MB`
    - Default is `5 MB`
  - `Buffering interval` in `seconds` (destination specific)
    - `0` to `900`
    - Default is `60/300` seconds
- Data transformation is not natively supported; use `Lambda` (see the sketch after this list)
  - `Lambda`
    - Custom transformation logic
    - `Lambda` invocation time of up to `5 min`
    - Error handling
      - Retry the invocation `3` times by default.
      - After retries, if the invocation is unsuccessful, `Firehose` skips that batch of records. The skipped records are treated as unsuccessfully processed records.
      - The unsuccessfully processed records are delivered to your `S3 bucket`.
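As referenced above, a minimal sketch of a `Firehose` transformation `Lambda` in Python, following the documented record contract (each output record echoes `recordId` and sets `result` to `Ok`, `Dropped`, or `ProcessingFailed`); the transformation itself is a hypothetical example:

```python
# Minimal sketch: Firehose record-transformation Lambda handler.
import base64
import json

def handler(event, context):
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        payload["processed"] = True  # hypothetical transformation
        output.append({
            "recordId": record["recordId"],         # must echo the incoming recordId
            "result": "Ok",                          # Ok | Dropped | ProcessingFailed
            "data": base64.b64encode(
                (json.dumps(payload) + "\n").encode()  # newline-delimited JSON output
            ).decode(),
        })
    return {"records": output}
```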
- Data format conversion
  - `Firehose` can convert `JSON` records using a schema from a table defined in `Glue`
  - Non-`JSON` data must be converted by invoking a `Lambda` function
- Dynamic Partitioning
  - `Dynamic partitioning` enables you to continuously partition streaming data in `Firehose` by using `keys` within data (for example, `customer_id` or `transaction_id`) and then deliver the data grouped by these `keys` into corresponding `S3` prefixes.
  - Partitioning your data minimizes the amount of data scanned, optimizes performance, and reduces costs of your analytics queries on `S3`.
- Server-side encryption
  - Can be enabled depending on the source of data
  - With `Kinesis Data Streams` as the source
    - `Kinesis Data Streams` first decrypts the data and then sends it to `Firehose`.
    - `Firehose` buffers the data in memory and then delivers it without storing the unencrypted data at rest.
  - With `Direct PUT` or other data sources
    - Can be turned on by using the `StartDeliveryStreamEncryption` operation.
    - To stop `server-side encryption`, use the `StopDeliveryStreamEncryption` operation.
- Troubleshooting
Kinesis Data Analytics
- Input
  - `Kinesis Data Streams`
  - `Kinesis Data Firehose`
- Output
  - `Kinesis Data Streams`
  - `Kinesis Data Firehose`
  - `Lambda`
Kinesis Data Analytics - Managed Apache Flink
Managed Service for Apache Flink
- Equivalent to `Kafka Streams`, using `Apache Flink` behind the scenes; online stream processing, namely `streaming ETL`
- Streaming applications built with `APIs` in `Java`, `Scala`, `Python` and `SQL`
- Interactive notebooks
Kinesis Data Analytics - SQL (legacy)
Kinesis Data Analytics for SQL Applications
Kinesis Data Analytics Studio
Kinesis Data Analytics Studio
- Based on
  - `Apache Zeppelin`
  - `Apache Flink`
Kinesis Video Streams
Kinesis Video Streams
- Out of scope for the exam
Lake Formation
- Simplify security management and governance at scale
- Centrally manage permissions across your users
- Comprehensive data access and audit logging
- Provides an `RDBMS` permissions model to grant or revoke access to databases, tables, and columns in the `Data Catalog` with underlying data in `S3`.
- Resources
Lake Formation - Ingestion
- `Blueprints` are predefined `ETL` workflows that you can use to ingest data into your `Data Lake`.
- `Workflows` are instances of ingestion `blueprints` in `Lake Formation`.
- `Blueprint` types
  - `Database snapshot`
    - Loads or reloads data from all tables into the data lake from a JDBC source.
    - You can exclude some data from the source based on an exclude pattern.
    - When to use:
      - Schema evolution is flexible. (Columns are re-named, previous columns are deleted, and new columns are added in their place.)
      - Complete consistency is needed between the source and the destination.
  - `Incremental database`
    - Loads only new data into the data lake from a JDBC source, based on previously set bookmarks.
    - You specify the individual tables in the JDBC source database to include.
    - When to use:
      - Schema evolution is incremental. (There is only successive addition of columns.)
      - Only new rows are added; previous rows are not updated.
  - `Log file`
    - Bulk loads data from log file sources
Lake Formation - Permission management
- Permission types
  - `Metadata access`
    - `Data Catalog` permissions
  - `Underlying data access`
    - `Data access permissions`
      - Data access permissions (`SELECT`, `INSERT`, and `DELETE`) on `Data Catalog` tables that point to that location.
    - `Data location permissions`
      - The ability to create `Data Catalog` resources that point to particular `S3` locations.
      - When you grant the `CREATE_TABLE` or `ALTER` permission to a principal, you also grant data location permissions to limit the locations for which the principal can create or alter metadata tables.
- Permission model (see the sketch after this section)
  - `IAM` permissions model consists of `IAM policies`.
  - `Lake Formation` permissions model is implemented as `DBMS`-style `GRANT`/`REVOKE` commands.
- Principals
  - `IAM` users and roles
  - `SAML` users and groups
  - `IAM Identity Center`
  - External accounts
- Resources
  - `LF-Tags`
    - Using `LF-Tags` can greatly simplify the number of grants over using `Named Resource` policies.
  - `Named data catalog resources`
- Permissions
  - Database permissions
  - Table permissions
- Data filtering
  - `Column-level` security
    - Allows users to view only specific columns and nested columns that they have access to in the table.
  - `Row-level` security
    - Allows users to view only specific rows of data that they have access to in the table.
  - `Cell-level` security
    - Uses both `row filtering` and `column filtering` at the same time
    - Can restrict access to different columns depending on the row.
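As referenced in the permission model above, a minimal sketch (the role, database, and table names are hypothetical) of a `DBMS`-style `GRANT` issued through the `Lake Formation` API with `boto3`:

```python
# Minimal sketch: grant an IAM role SELECT on a Data Catalog table.
import boto3

lf = boto3.client("lakeformation")

lf.grant_permissions(
    Principal={
        "DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:role/analyst"  # hypothetical role
    },
    Resource={"Table": {"DatabaseName": "sales_db", "Name": "orders"}},  # hypothetical table
    Permissions=["SELECT"],
)
```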
QuickSight
AWS Docs - QuickSight
- Key points
  - Ad-hoc, interactive query, analysis and visualization
  - For formatted canned reports, use Paginated Reports
- Editions
  - `Standard` edition
  - `Enterprise` edition
    - `VPC` connection
      - Access to data sources in your `VPC` without the need for a public `IP` address
    - Microsoft `Active Directory` `SSO`
    - Data stored in `SPICE` is encrypted at rest.
    - For `SQL`-based data sources, such as `Redshift`, `Athena`, `PostgreSQL`, or `Snowflake`, you can refresh your data incrementally within a look-back window of time.
- Datasets
  - Dataset refresh can be scheduled or on-demand.
  - `SPICE`
    - In-memory
  - Direct SQL query
- Visual types
  - Combo chart
    - Aka line and column (bar) charts
  - Scatter plot
    - Used to observe relationships between variables
  - Heat map
    - Use heat maps to show a measure for the intersection of two dimensions, with color-coding to easily differentiate where values fall in the range.
    - Use a heat map if you want to identify trends and outliers, because the use of color makes these easier to spot.
  - Pivot table
    - Use a pivot table if you want to analyze data on the visual.
  - AutoGraph
    - You create a visual by choosing `AutoGraph` and then selecting fields.
Migration and Transfer
DMS (Database Migration Service)
- AWS Docs - DMS (Database Migration Service)
- Migrate `RDBMS`, `data warehouses`, `NoSQL databases`, and other types of data stores
- `One-time migration`
- Used for `CDC` (Change Data Capture) for `real-time ongoing replication`
DataSync
- AWS Docs - DataSync
- One-off online data transfer; transferring hot or cold data should not impede your business
- Move data from `on-premises` to `AWS`
- Move data between `AWS` services
- Requires `DataSync agent` installation `on-premises`
Snow Family
- AWS Snow Family service models - Feature comparison matrix
- AWS Docs - AWS Snowball Edge Device Specifications
Snowball Edge Storage Optimized
- For `large-scale data migrations` and `recurring transfer workflows`, as well as `local computing with higher capacity needs`
Snowball Edge Compute Optimized
- For use cases such as `machine learning`, `full motion video analysis`, `analytics`, and `local computing stacks`
Snowmobile
- Intended for `more than 10 PB` data migrations from a single location
- Up to `100 PB` per `Snowmobile`
- Shipping container
Transfer Family
- Transfer flat files using `SFTP`, `FTPS`, and `FTP`
Storage Gateway
- Hybrid cloud storage solution that enables `on-premises` applications to use `cloud storage` without impact to existing applications
- `On-premises` data is synced up in cloud storage, while applications continue to access the cached data locally
- AWS Storage Blog - Cloud storage in minutes with AWS Storage Gateway (updated)
Storage Gateway - Amazon S3 File Gateway
- AWS Docs - Amazon S3 File Gateway User Guide
- Uses file protocols such as `NFS` and `SMB` to access files stored as objects in `S3`
Storage Gateway - Amazon FSx File Gateway
- AWS Docs - Amazon FSx File Gateway User Guide
- Low-latency and efficient access to in-cloud `FSx for Windows File Server` file shares from your `on-premises` facility
Storage Gateway - Tape Gateway
- Cloud-backed virtual tape storage
- `Tape Gateway` presents an `iSCSI`-based `virtual tape library (VTL)` of virtual tape drives and a virtual media changer to your on-premises backup application.
- `Tape Gateway` stores your virtual tapes in `S3` and creates new ones automatically.
Storage Gateway - Volume Gateway
- `Stored volumes`
  - The `entire dataset` is stored `locally` and is asynchronously backed up to `S3` as point-in-time snapshots.
- `Cached volumes`
  - The `entire dataset` is stored in `S3`, and `the most frequently accessed` data is cached locally.
Data Pipeline
- Define data-driven workflows for moving and transforming data from various sources to destinations, similar to `Airflow`
- Managed `ETL` service using `EMR` under the hood
- Batch processing
- In maintenance mode; workloads can be migrated to
  - `AWS Glue`
  - `AWS Step Functions`
  - `Amazon MWAA (Amazon Managed Workflows for Apache Airflow)`
Application Integration
Step Functions
- General-purpose serverless workflow orchestration service
- Able to run `EMR` workloads on a schedule
- AWS Docs - AWS Step Functions - Manage an Amazon EMR Job
- AWS Big Data Blog - Automating EMR workloads using AWS Step Functions
SQS
Kinesis Data Streams vs SQS
| | Kinesis Data Streams | SQS |
|---|---|---|
| Consumption | Can be consumed many times | Records are deleted after being consumed |
| Retention | Retention period configurable from 24 hours to 1 year | Retention period configurable from 1 min to 14 days |
| Ordering | Ordering of records is preserved at the same shard | FIFO queues preserve ordering of records |
| Scaling | Manual resharding | Auto scaling |
| Delivery | At least once delivery | Exactly once delivery (FIFO queues) |
| Replay | Can replay | Cannot replay |
| Payload size | 1 MB | 256 KB |
Database
Redshift
- `PB` scale, columnar storage
- `MPP` architecture
- Loading data
  - AWS Docs - Amazon Redshift best practices for loading data
  - `COPY` command (see the sketch after this list)
    - Can load data from multiple files in parallel.
    - Use the `COPY` command with `COMPUPDATE` set to `ON` to analyze and apply compression automatically
    - Split your data into files so that the number of files is a multiple of the number of slices in your cluster.
  - AWS Big Data Blog - JOIN Amazon Redshift AND Amazon RDS PostgreSQL WITH dblink
    - The dblink function allows the entire query to be pushed to Amazon Redshift. This lets Amazon Redshift do what it does best: query large quantities of data efficiently and return the results to PostgreSQL for further processing.
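As referenced in the `COPY` item above, a minimal sketch (the cluster, table, bucket, and role are hypothetical) of issuing a parallel `COPY` from `S3` through the `Redshift Data API` with `boto3`:

```python
# Minimal sketch: run a COPY from S3 via the Redshift Data API.
import boto3

rsd = boto3.client("redshift-data")

rsd.execute_statement(
    ClusterIdentifier="analytics-cluster",  # hypothetical cluster
    Database="dev",
    DbUser="awsuser",                       # temporary database credentials
    Sql="""
        COPY sales FROM 's3://my-bucket/sales/part-'
        IAM_ROLE 'arn:aws:iam::111122223333:role/RedshiftCopyRole'
        GZIP COMPUPDATE ON;
    """,  # key prefix matches multiple split files, loaded in parallel
)
```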
Redshift - Cluster
- A `cluster` can have up to `128 nodes`, depending on the `node type`.
- `Leader node` and `compute nodes` have the same specs, and the `leader node` is chosen automatically.
- `Leader node` (present in clusters with 2+ nodes)
  - Parses the query, builds an optimal execution plan, and compiles code from the execution plan
  - Receives and aggregates the results from the `compute nodes`
- `Compute node`
  - Node `slices`
    - A `compute node` is partitioned into `slices`. Each `slice` is allocated a portion of the node's memory and disk space, where it processes a portion of the workload assigned to the node.
- Enhanced `VPC` routing
  - Routes network traffic between your cluster and data repositories through a `VPC`, instead of through the internet.
Redshift - Cluster - Elastic resize
- Preferred method
- Works in an incremental manner
- Add or remove `nodes` (same `node type`) to a `cluster` in minutes, aka `in-place` resize
- Change the `node type` of an existing `cluster`
  - A snapshot is created, and a new `cluster` is created from the snapshot with the new `node type`.
- Resources
Redshift - Cluster - Classic resize
- When you need to change the `cluster size` or the `node type` of an existing `cluster` and elastic resize is not supported
- Takes much longer than `elastic resize`
Redshift - Cluster - Node type
- `RA3`

  | Node size | vCPU | RAM (GiB) | Managed storage quota per node | Number of nodes per cluster |
  |---|---|---|---|---|
  | ra3.xlplus | 4 | 32 | 32 TB | 1 to 32 |
  | ra3.4xlarge | 12 | 96 | 128 TB | 2 to 32 |
  | ra3.16xlarge | 48 | 384 | 128 TB | 2 to 128 |

  - Pay for the `managed storage` and `compute` separately
  - `Managed storage` uses `SSD` for local cache and `S3` for persistence, and automatically scales based on the workload
  - `Compute` is separated from `managed storage` and can be scaled independently
  - `RA3` nodes are optimized for performance and cost-effectiveness
- `DC2` (Dense Compute)

  | Node size | vCPU | RAM (GiB) | Storage per node | Number of nodes per cluster |
  |---|---|---|---|---|
  | dc2.large | 2 | 15 | 160 GB | 1 to 32 |
  | dc2.8xlarge | 32 | 244 | 2.6 TB | 2 to 128 |

  - `Compute-intensive` local `SSD` storage for `high performance` and `low latency`
  - Recommended for datasets under `10 TB`
- `DS2` (Dense Storage)
  - Deprecated legacy option for `low-cost` workloads with `HDD` storage; `RA3` is recommended for new workloads
- Resources
Redshift - Cluster - HA
- AZ
  - `Single AZ` by default
  - `Multi AZ` is supported
Redshift - Distribution style
Distribution style
- `Redshift distribution keys (DIST keys)` determine where data is stored in `Redshift`. Clusters store data fundamentally across the compute nodes. Query performance suffers when a large amount of data is stored on a single node.
- Uneven distribution of data across compute nodes leads to skew in the work each node has to do, and you don't want an under-utilized compute node, so the distribution of the data should be uniform.
- Data should be distributed in such a way that the rows that participate in joins are already on the same node as their joining rows in other tables. This is called collocation, which reduces data movement and improves query performance.
- Distribution is configurable per table, so you can select a different `distribution style` for each table.
- `AUTO distribution`
  - `Redshift` automatically distributes the data across the nodes in the cluster
  - `Redshift` initially assigns `ALL` distribution to a small table, then changes to `EVEN` distribution when the table grows larger.
  - Suitable for small tables or tables with a small number of distinct values
- `EVEN distribution`
  - The `leader node` distributes the rows across the slices in a round-robin fashion, regardless of the values in any particular column.
  - Appropriate when a table doesn't participate in joins. It's also appropriate when there isn't a clear choice between `KEY` distribution and `ALL` distribution.
- `KEY distribution`
  - The rows are distributed according to the values in one column.
  - All the entries with the same value in the column end up on the same slice.
  - Suitable for large tables or tables with a large number of distinct values
- `ALL distribution`
  - A copy of the entire table is distributed to every node.
  - Suitable for small tables or tables that are not joined with other tables
Redshift - Concurrency scaling
- `Concurrency scaling` adds `transient clusters` to your cluster, to handle concurrent requests with consistent fast performance, in a matter of seconds.
Redshift - Sort keys
Sort keys
- When you create a table, you can define one or more of its columns as `sort keys`.
- Either a `compound` or `interleaved` sort key.
Redshift - Compression
AWS Docs - Working with column compression
- You can apply a compression type, or encoding, to the columns in a table manually when you create the table.
- The `COPY` command analyzes your data and applies compression encodings to an empty table automatically as part of the load operation.
- Split your load data files so that the files are about equal size, between `1 MB` and `1 GB` after compression. For optimum parallelism, the ideal size is between `1 MB` and `125 MB` after compression.
- The number of files should be a multiple of the number of slices in your cluster.
- When loading data, it is strongly recommended that you individually compress your load files using `gzip`, `lzop`, `bzip2`, or `Zstandard` when you have large datasets.
Redshift - Workload Management (WLM)
- AWS Docs - Amazon Redshift - Implementing workload management
- Automatic WLM
- Manual WLM
  - `Query groups`
    - `Memory percentage` and `concurrency` settings
  - `Query monitoring rules`
    - `Query monitoring rules` define the conditions under which a query is monitored and the action to take when a query matches those conditions.
  - `Query queues`
    - `Memory percentage` and `concurrency` settings
- Concurrency scaling
  - `Concurrency scaling` automatically adds and removes `compute nodes` to handle spikes in demand.
  - `Concurrency scaling` is enabled by default for all `RA3` and `DC2` clusters.
  - `Concurrency scaling` is not available for `DS2` clusters.
- Short query acceleration
  - `Short query acceleration (SQA)` prioritizes selected short-running queries ahead of longer-running queries.
  - `SQA` runs short-running queries in a dedicated space, so that `SQA` queries aren't forced to wait in queues behind longer queries.
  - `CREATE TABLE AS` (CTAS) statements and read-only queries, such as `SELECT` statements, are eligible for `SQA`.
Redshift - User-Defined Functions (UDF)
- `User-defined functions (UDFs)` are functions that you define to extend the capabilities of the `SQL` language in `Redshift`.
- `AWS Lambda` UDFs
  - Useful for accessing other `AWS` services
Redshift - Snapshots
- Automated
  - Enabled by default when you create a cluster.
  - By default, taken about every `8 hours` or following every `5 GB per node` of data changes, whichever comes first.
  - Default `retention period` is `1 day`
  - Cannot be deleted manually
- Manual
  - By default, manual snapshots are retained indefinitely, even after you delete your cluster.
  - AWS CLI: `create-cluster-snapshot`
  - AWS API: `CreateClusterSnapshot`
- Backup
  - You can configure `Redshift` to automatically copy snapshots (automated or manual) for a cluster to another `Region`.
  - Only one destination `Region` at a time can be configured for automatic snapshot copy.
  - To change the destination `Region` that you copy snapshots to, first disable the automatic copy feature. Then re-enable it, specifying the new destination `Region`.
- Snapshot copy grant
  - `KMS` keys are specific to a `Region`. If you enable copying of `Redshift` snapshots to another `Region`, and the source cluster and its snapshots are encrypted using a master key from `KMS`, you need to configure a grant for `Redshift` to use a master key in the destination `Region`.
  - The grant is created in the destination `Region` and allows `Redshift` to use the master key in the destination `Region`.
Redshift - Query
- Data API
  - Available in the `AWS SDK`; based on `HTTP` and `JSON`; no persistent connection needed
  - `API calls` are asynchronous (see the sketch below)
  - Uses either credentials stored in `Secrets Manager` or temporary database credentials; no need to pass passwords in the API calls
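A minimal sketch of the asynchronous pattern above with `boto3`: submit SQL with `execute_statement`, poll `describe_statement`, then fetch results (the cluster, secret, and table names are hypothetical):

```python
# Minimal sketch: asynchronous query via the Redshift Data API.
import time
import boto3

rsd = boto3.client("redshift-data")

stmt = rsd.execute_statement(
    ClusterIdentifier="analytics-cluster",  # hypothetical cluster
    Database="dev",
    SecretArn="arn:aws:secretsmanager:us-east-1:111122223333:secret:redshift-creds",  # placeholder
    Sql="SELECT COUNT(*) FROM sales;",      # hypothetical table
)

# Poll until the statement reaches a terminal state.
while True:
    desc = rsd.describe_statement(Id=stmt["Id"])
    if desc["Status"] in ("FINISHED", "FAILED", "ABORTED"):
        break
    time.sleep(1)

if desc["Status"] == "FINISHED":
    print(rsd.get_statement_result(Id=stmt["Id"])["Records"])
```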
Redshift - Security
- IAM
  - Does not support `resource-based policies`.
- Access control
  - Cluster level
    - `IAM policies`
      - For Redshift API actions
    - `Security groups`
      - For network connectivity
  - Database level
    - Database user accounts
- Data protection
  - DB encryption using `KMS`
    - `KMS` key hierarchy
      - Root key
      - Cluster encryption key (CEK)
      - Database encryption key (DEK)
      - Data encryption keys
- Auditing
  - Logs
    - Connection log
      - Logs authentication attempts, connections, and disconnections.
    - User log
      - Logs information about changes to database user definitions.
    - User activity log
      - Logs each query before it's run on the database.
Redshift Spectrum
- `In-place querying` of data in `S3`, without having to load the data into `Redshift` tables
- `EB` scale
- Pay for the number of bytes scanned
- `Federated query` across `operational databases`, `data warehouses`, and `data lakes`
Redshift Spectrum - Comparison with Athena

| | Redshift Spectrum | Athena |
|---|---|---|
| Performance | Greater control over performance; uses dedicated resources of your Redshift cluster | Not configurable; uses shared resources managed by AWS |
| Query results | Stored in Redshift | Stored in S3 |
Redshift Serverless
- `Redshift` measures data warehouse capacity in `Redshift Processing Units (RPUs)`. You pay for the workloads you run in `RPU-hours` on a per-second basis (with a `60-second` minimum charge), including queries that access data in open file formats in `S3`.
Keyspaces
- AWS Docs - Keyspaces
- Based on `Apache Cassandra`
Neptune
- AWS Docs - Neptune
- Graph database, with a serverless option
Amazon Security Lake
- `Fully managed` security data lake service
- `Centralize security data` from AWS environments, SaaS providers, on-premises, cloud sources, and third-party sources into a purpose-built data lake that's stored in your AWS account.
Elemental MediaStore
- High performance for streaming media delivery