Services
Analytics
Athena
- In-place querying of data in S3
- Serverless
- PB scale
- Presto under the hood
- Supports SQL and Apache Spark
- Analyzes unstructured, semi-structured, and structured data stored in S3, in formats including CSV, JSON, Apache Parquet, and Apache ORC
- Only successful or canceled queries are billed; failed queries are not billed.
- No charge for DDL statements
- Use cases
  - Ad-hoc queries of web logs, CloudTrail/CloudFront/VPC/ELB logs
  - Querying staging data before loading to Redshift
  - Integration with Jupyter, Zeppelin, RStudio notebooks
  - Integration with QuickSight for visualization
  - Integration via JDBC/ODBC for BI tools
-
Resources
Athena - Athena SQL
- MSCK REPAIR TABLE refreshes partition metadata with the current partitions of an external table.
  - Repairing partitions manually using MSCK repair
- When you specify ROW FORMAT DELIMITED, Athena uses the LazySimpleSerDe by default.

      -- e.g. ROW FORMAT DELIMITED
      ROW FORMAT DELIMITED
      FIELDS TERMINATED BY ','
      ESCAPED BY '\\'
      COLLECTION ITEMS TERMINATED BY '|'
      MAP KEYS TERMINATED BY ':'

- Use ROW FORMAT SERDE to explicitly specify the type of SerDe that Athena should use when it reads and writes data to the table.

      -- e.g. ROW FORMAT SERDE
      ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
      WITH SERDEPROPERTIES (
        'serialization.format' = ',',
        'field.delim' = ',',
        'collection.delim' = '|',
        'mapkey.delim' = ':',
        'escape.delim' = '\\'
      )

- UNLOAD writes query results from a SELECT statement to the specified data format. Supported formats for UNLOAD include Parquet, ORC, Avro, and JSON.
- The UNLOAD statement is useful when you want to output the results of a SELECT query in a non-CSV format but do not require the associated table.
Athena - Workgroup
- A workgroup is a collection of users, queries, and results that are associated with a specific data catalog and workgroup settings.
- Workgroup settings include the location in S3 where query results are stored, the encryption configuration, and the data usage control settings.
Athena - Federated Query
AWS Docs - Using Amazon Athena Federated Query
- Query the data in place, or build pipelines that extract data from multiple data sources and store them in S3.
- Uses data source connectors that run on Lambda to run federated queries.
OpenSearch
AWS Docs - OpenSearch
- Operational analytics and log analytics
- Forked version of Elasticsearch
- Number of Shards = Index Size / 30GB
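The shard rule of thumb above can be sketched as a small helper. This is only an illustration: the function name `estimate_shard_count` is hypothetical, and rounding up to a whole shard is an assumption that the notes do not state.

```python
import math


def estimate_shard_count(index_size_gb: float, target_shard_size_gb: float = 30.0) -> int:
    """Apply the rule of thumb: number of shards = index size / 30 GB."""
    if index_size_gb <= 0 or target_shard_size_gb <= 0:
        raise ValueError("sizes must be positive")
    # Rounding up (an assumption) keeps every shard at or below the target size.
    return max(1, math.ceil(index_size_gb / target_shard_size_gb))


if __name__ == "__main__":
    print(estimate_shard_count(450))  # 450 / 30 -> 15
```

Small indexes still get one shard, which is why the result is clamped to a minimum of 1.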
OpenSearch - Index
- Storage
  - UltraWarm: to store large amounts of read-only data
  - Cold: if you need to do periodic research or forensic analysis on your older data
  - Hot
    - OR1 (OpenSearch Optimized Instance): a domain with OR1 instances uses EBS gp3 or io1 volumes for primary storage, with data copied synchronously to S3 as it arrives.
- Cross-cluster replication
  - Cross-cluster replication follows a pull model: you initiate connections from the follower domain.
EMR
AWS Docs - EMR
- Managed cluster platform that simplifies running big data frameworks, such as Hadoop and Spark, on AWS.
- PB scale
- The RunJobFlow API creates and starts running a new cluster (job flow). The cluster runs the steps specified. After the steps complete, the cluster stops and the HDFS partition is lost.
- An EMR cluster with multiple primary nodes can reside only in one AZ.
- HDFS
  - A typical block size used by HDFS is 128 MB. Thus, an HDFS file is chopped up into 128 MB chunks, and if possible, each chunk will reside on a different DataNode.
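To make the 128 MB chunking concrete, here is a minimal sketch of how a file size maps to block sizes. The helper `split_into_blocks` is hypothetical and works in whole megabytes; it does not model replication or DataNode placement.

```python
from typing import List

BLOCK_SIZE_MB = 128  # typical HDFS block size from the notes above


def split_into_blocks(file_size_mb: int, block_size_mb: int = BLOCK_SIZE_MB) -> List[int]:
    """Return the size in MB of each block that a file is chopped into."""
    if file_size_mb < 0 or block_size_mb <= 0:
        raise ValueError("file size must be >= 0 and block size must be > 0")
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block only holds the leftover data
    return sizes


if __name__ == "__main__":
    print(split_into_blocks(300))  # two full blocks plus a 44 MB tail
```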
EMR cluster - Node types
AWS Docs - EMR cluster - Node types
- Primary node / Master node
  - Manages the cluster and typically runs primary components of distributed applications.
  - Tracks the status of jobs submitted to the cluster
  - Monitors the health of the instance groups
  - HA support from 5.23.0+
- Core nodes
  - Store HDFS data and run tasks
  - Multi-node clusters have at least one core node.
  - One core instance group or instance fleet per cluster, but there can be multiple nodes running on multiple EC2 instances in the instance group or instance fleet.
- Task nodes
  - No HDFS data, therefore no data loss if terminated
  - Can use Spot instances
  - Optional in a cluster
EMR - EC2
AWS Docs - EMR - EC2
EMR - EKS
AWS Docs - EMR - EKS
EMR - Serverless
AWS Docs - EMR - Serverless
EMR - EMRFS
- Direct access to S3 from EMR cluster
- Persistent storage for EMR clusters
- Benefit from S3 features
- For clusters with multiple users who need different levels of access to data in S3 through EMRFS
  - EMRFS can assume a different service role for cluster EC2 instances based on the user or group making the request, or based on the location of data in S3.
  - Each IAM role for EMRFS can have different permissions for data access in S3.
  - When EMRFS makes a request to S3 that matches users, groups, or the locations that you specify, the cluster uses the corresponding role that you specify instead of the EMR role for EC2.
  - AWS Docs - Authorizing access to EMRFS data in Amazon S3
- S3DistCp
  - An extension to DistCp (Apache Hadoop Distributed Copy)
  - Uses MapReduce to efficiently copy large amounts of data in a distributed manner
  - Copy data from S3 to HDFS
  - Copy data from HDFS to S3
  - Copy data between S3 buckets
EMR - Security - Apache Ranger
- An RBAC framework to enable, monitor, and manage comprehensive data security across the Hadoop ecosystem.
- Centralized security administration and auditing
- Fine-grained authorization across many Hadoop components (Hadoop, Hive, HBase, Storm, Knox, Solr, Kafka, and YARN)
- Syncs policies and users by using agents and plugins that run within the same process as the Hadoop component.
- Supports row-level authentication and auditing capabilities with embedded search.
EMR - Security Configuration
AWS Docs - Security Configuration
- At-rest data encryption
  - For WAL with EMR
    - Open-source HDFS encryption
    - LUKS encryption
  - For cluster nodes local disks (EC2 instance volumes)
    - Open-source HDFS encryption
    - Instance store encryption
      - NVMe encryption
      - LUKS encryption
    - EBS volume encryption
      - EBS encryption
        - Recommended
        - EMR 5.24.0+
      - LUKS encryption
        - Only applies to attached storage volumes, not to the root device volume
  - For EMRFS on S3
    - SSE-S3
    - SSE-KMS
    - SSE-C
    - CSE-KMS: objects are encrypted before being uploaded to S3, and the client uses keys provided by KMS.
    - CSE-Custom: objects are encrypted before being uploaded to S3, and the client uses a custom Java class that provides the client-side master key.
- In-transit data encryption
  - For EMR traffic between cluster nodes and WAL with EMR
  - For distributed applications
  - For EMRFS traffic between S3 and cluster nodes
-
Resources
EMR Studio
- Web IDE for fully managed Jupyter notebooks running on EMR clusters
EMR Notebooks
EMR - Open Source Ecosystem
Delta Lake
EMR - Delta Lake
- Storage layer framework for lakehouse architectures
Ganglia
EMR - Ganglia
- Performance monitoring tool for Hadoop and HBase clusters
- Not included in EMR after 6.15.0
Apache HBase
EMR - Apache HBase
- Wide-column store NoSQL running on HDFS
- EMR WAL support: able to restore an existing WAL, retained for 30 days starting from the time when the cluster was created, with a new cluster from the same S3 root directory.
Apache HCatalog
EMR - Apache HCatalog
- HCatalog is a tool that allows you to access Hive metastore tables within Pig, Spark SQL, and/or custom MapReduce applications.
Apache Hudi
EMR - Apache Hudi
- Open Table Format
- Brings database and data warehouse capabilities to the data lake
Hue
EMR - Hue
- Web GUI for Hadoop
Apache Iceberg
EMR - Apache Iceberg
- Open Table Format
Apache Livy
EMR - Apache Livy
- A REST interface for Apache Spark
Apache MXNet
EMR - Apache MXNet
- Retired
- A deep learning framework for building neural networks and other deep learning applications.
Apache Oozie
EMR - Apache Oozie
- Workflow scheduler system to manage Hadoop jobs
- Apache Airflow is a modern alternative.
Apache Phoenix
EMR - Apache Phoenix
- OLTP and operational analytics in Hadoop for low latency applications
Apache Pig
EMR - Apache Pig
- SQL-like commands are written in Pig Latin, and Pig converts those commands into Tez jobs based on DAG or MapReduce programs.
Apache Sqoop
EMR - Apache Sqoop
- Retired
- A tool for transferring data between S3, Hadoop, HDFS, and RDBMS databases.
Apache Tez
EMR - Apache Tez
- An application execution framework for complex DAG of tasks, similar to Apache Spark
Glue
- Serverless Spark ETL with a Hive metastore
- Ingests data in batch while performing transformations
- Either provide a script to perform the ETL job, or Glue can generate a script automatically.
- Data source can be an AWS service, such as RDS, S3, DynamoDB, or Kinesis Data Streams, as well as a third-party JDBC-accessible database.
- Data target can be an AWS service, such as S3, RDS, and DocumentDB, as well as a third-party JDBC-accessible database.
- Jobs are billed per second
- Resources
  - AWS Glue Web API
  - AWS Whitepapers - AWS Glue Best Practices: Building a Performant and Cost Optimized Data Pipeline
  - AWS Prescriptive Guidance - Best practices for performance tuning AWS Glue for Apache Spark jobs
  - GitHub - aws-samples/aws-glue-samples
Glue - Orchestration - Triggers
- When fired, a trigger can start specified jobs and crawlers.
- A trigger fires on demand, based on a schedule, or based on a combination of events.
- Trigger types
  - Scheduled
  - Conditional: the trigger fires if the watched jobs or crawlers end with the specified statuses.
  - On-demand
Glue - Orchestration - Workflows
- Create and visualize complex ETL activities involving multiple crawlers, jobs, and triggers.
- Can be created from a Glue blueprint, or manually.
Glue - Orchestration - Blueprints
- Glue blueprints provide a way to create and share Glue workflows. When there is a complex ETL process that could be used for similar use cases, rather than creating a Glue workflow for each use case, you can create a single blueprint.
Glue - Job
Glue - Job type
- Apache Spark ETL
  - Minimum 2 DPU
  - 10 DPU (default)
  - Job command: glueetl
- Apache Spark streaming ETL
  - Minimum 2 DPU (default)
  - Job command: gluestreaming
- Python shell
  - 1 DPU or 0.0625 DPU (default)
  - Job command: pythonshell
  - Mainly intended for ad-hoc tasks such as data retrieval
  - Much faster startup times than Spark jobs
  - Compared to AWS Lambda
    - Can run much longer
    - Can be equipped with more CPU and memory
    - Lower per-unit-time cost
    - Higher minimum billing time (at least 1 minute)
    - Seamlessly integrates with Glue Workflow
- Ray
  - Job command: glueray
Glue - Job - Job Bookmark
- Glue tracks data that has already been processed during a previous run of an ETL job by persisting state information from the job run. This persisted state information is called a job bookmark.
- Options
  - Enabled: the job updates the state after a job run. The job keeps track of processed data, and when a job runs, it processes new data since the last checkpoint.
  - Disabled: job bookmarks are not used, and the job always processes the entire dataset. This is the default setting.
  - Pause: the job bookmark state is not updated when this option is specified. The job processes incremental data since the last successful job run. You are responsible for managing the output from previous job runs.
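As a rough illustration, the three bookmark options map to values of the Glue job parameter `--job-bookmark-option`. The helper name `bookmark_arguments` and the short mode names are hypothetical; only the parameter name and its three values come from Glue itself.

```python
from typing import Dict

# Values accepted by the Glue job parameter "--job-bookmark-option".
BOOKMARK_OPTIONS = {
    "enabled": "job-bookmark-enable",
    "disabled": "job-bookmark-disable",
    "pause": "job-bookmark-pause",
}


def bookmark_arguments(mode: str) -> Dict[str, str]:
    """Build the job arguments for one of the bookmark options (hypothetical helper)."""
    try:
        return {"--job-bookmark-option": BOOKMARK_OPTIONS[mode.lower()]}
    except KeyError:
        raise ValueError(f"unknown bookmark mode: {mode!r}") from None


if __name__ == "__main__":
    print(bookmark_arguments("Enabled"))
```

A dictionary like this would typically be passed as the `Arguments` of a job run, for example through the `start_job_run` call in boto3; that call is not made here so the sketch stays runnable offline.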
Glue - Job run insights (Monitoring)
- Job debugging and optimization
- Insights are available using 2 new log streams in the CloudWatch logs
Glue - Job - Worker
- A single DPU is also called a worker.
- A DPU is a relative measure of processing power that consists of 4 vCPU of compute capacity and 16 GB of memory.
- An M-DPU is a DPU with 4 vCPU and 32 GB of memory.
- Auto Scaling
  - Available for Glue jobs with G.1X, G.2X, G.4X, G.8X, or G.025X (only for Streaming jobs) worker types. Standard DPUs are not supported.
  - Requires Glue 3.0+.
- Worker type
  - For Glue 2.0+, specify Worker type and Number of workers
  - Standard
    - 1 DPU - 4 vCPU and 16 GB of memory
    - 50 GB of attached storage
  - G.1X (default)
    - 1 DPU - 4 vCPU and 16 GB of memory
    - 64 GB of attached storage
    - Recommended for most workloads with cost effective performance
    - Recommended for jobs authored in Glue 2.0+
  - G.2X
    - 2 DPU - 8 vCPU and 32 GB of memory
    - 128 GB of attached storage
    - Recommended for most workloads with cost effective performance
    - Recommended for jobs authored in Glue 2.0+
  - G.4X
    - 4 DPU - 16 vCPU and 64 GB of memory
    - Glue 3.0+ Spark jobs
  - G.8X
    - 8 DPU - 32 vCPU and 128 GB of memory
    - Recommended for workloads with most demanding transformations, aggregations, and joins
    - Glue 3.0+ Spark jobs
  - G.025X
    - 0.25 DPU
    - Recommended for low volume streaming jobs
    - Glue 3.0+ Spark Streaming jobs only
- Execution class
  - Standard
    - Ideal for time-sensitive workloads that require fast job startup and dedicated resources
  - Flexible
    - Appropriate for time-insensitive jobs whose start and completion times may vary
    - Only jobs with Glue 3.0+ and command type glueetl will be allowed to set ExecutionClass to FLEX.
    - The flexible execution class is available for Spark jobs.
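Because jobs are billed per second and each worker type corresponds to a fixed number of DPUs, the DPU-hours consumed by a run can be estimated with simple arithmetic. The sketch below uses the DPU figures listed above; the function name `dpu_hours` is hypothetical, and it deliberately ignores any minimum billing duration and the per-DPU-hour price, neither of which is stated in these notes.

```python
# DPUs per worker, taken from the worker type list above.
DPU_PER_WORKER = {
    "Standard": 1.0,
    "G.025X": 0.25,
    "G.1X": 1.0,
    "G.2X": 2.0,
    "G.4X": 4.0,
    "G.8X": 8.0,
}


def dpu_hours(worker_type: str, number_of_workers: int, runtime_seconds: float) -> float:
    """Estimate DPU-hours for one job run (assumes pure per-second billing)."""
    if number_of_workers < 1 or runtime_seconds < 0:
        raise ValueError("need at least one worker and a non-negative runtime")
    total_dpus = DPU_PER_WORKER[worker_type] * number_of_workers
    return total_dpus * runtime_seconds / 3600


if __name__ == "__main__":
    # 10 G.2X workers (20 DPUs) running for 30 minutes.
    print(dpu_hours("G.2X", 10, 1800))
```

Multiplying the result by the current regional price per DPU-hour gives a cost estimate.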
Glue - Programming
- Comparing available AWS Glue ETL programming languages
- Developing and testing AWS Glue job scripts locally - AWS Glue
  - PySpark REPL

        docker run -it --rm \
          -v ~/.aws:/home/hadoop/.aws \
          -v /tmp/glue/workspace:/home/hadoop/workspace/ \
          -e AWS_PROFILE=$AWS_PROFILE \
          --name glue5_pyspark \
          public.ecr.aws/glue/aws-glue-libs:5 \
          pyspark

  - Spark

        docker run -it --rm \
          -v ~/.aws:/home/hadoop/.aws \
          -v /tmp/glue/workspace:/home/hadoop/workspace/ \
          -e AWS_PROFILE=$AWS_PROFILE \
          --name glue5_spark_submit \
          public.ecr.aws/glue/aws-glue-libs:5 \
          spark-submit $SCRIPT_FILE
Glue - Programming - Python
Glue - Programming - Scala
Glue - Studio
- GUI for Glue jobs
- Visual ETL: visually compose ETL workflows
- Notebook
- Script editor
  - Create and edit Python or Scala scripts
  - Automatically generate ETL scripts
Glue - Data Catalog
AWS Docs - Glue Data Catalog
- Functionally similar to a schema registry
- API compatible with Hive Metastore
- lakeFS - Metadata Management: Hive Metastore vs AWS Glue
- Databases & Tables
  - Databases and tables are objects in the Data Catalog that contain metadata definitions.
  - The schema of your data is represented in your Glue table definition. The actual data remains in its original data store.
Glue - Data Catalog - crawler
AWS Docs - Glue Data Catalog - crawler
- A crawler can crawl multiple data stores in a single run. Upon completion, the crawler creates or updates one or more tables in your Data Catalog. ETL jobs that you define in AWS Glue use these Data Catalog tables as sources and targets.
- A classifier recognizes the format of your data and generates a schema. It returns a certainty number between 0.0 and 1.0, where 1.0 means 100% certainty. Glue uses the output of the classifier that has the highest certainty.
- If no classifier returns a certainty greater than 0.0, Glue returns the default classification string of UNKNOWN.
- Once attached to a crawler, a custom classifier is executed before the built-in classifiers.
- Steps
  - A crawler runs any custom classifiers that you choose to infer the format and schema of your data.
  - The crawler connects to the data store.
  - The inferred schema is created for your data.
  - The crawler writes metadata to the Glue Data Catalog.
- Partition threshold
  - If the following conditions are met, the schemas are denoted as partitions of a table.
    - The partition threshold is higher than 0.7 (70%).
    - The maximum number of different schemas doesn't exceed 5.
  - Learn how AWS Glue crawler detects the schema | AWS re:Post
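The classifier selection rule described above (highest certainty wins, `UNKNOWN` when nothing scores above 0.0) can be sketched as follows. This only models the rule as written in these notes; `choose_classification` and the classifier names are hypothetical, and the real crawler's ordering of custom versus built-in classifiers is not modeled.

```python
from typing import Dict


def choose_classification(certainties: Dict[str, float]) -> str:
    """Pick the classifier with the highest certainty, or UNKNOWN if none exceeds 0.0."""
    best_name, best_score = "UNKNOWN", 0.0
    for name, score in certainties.items():
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"certainty for {name!r} must be between 0.0 and 1.0")
        if score > best_score:  # strictly greater: a 0.0 score never wins
            best_name, best_score = name, score
    return best_name


if __name__ == "__main__":
    print(choose_classification({"json": 0.8, "csv": 1.0, "grok-custom": 0.0}))
```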
Glue - Data Catalog - security
- You can only use a Glue resource policy to manage permissions for Data Catalog resources.
- You can't attach it to any other Glue resources such as jobs, triggers, development endpoints, crawlers, or classifiers.
- Only one resource policy is allowed per catalog, and its size is limited to 10 KB.
- Each AWS account has one single catalog per Region whose catalog ID is the same as the AWS account ID.
- You cannot delete or modify a catalog.
Glue - Data Quality
AWS Docs - Glue Data Quality
- Based on the DeeQu framework, using the DSL Data Quality Definition Language (DQDL) to define data quality rules
- Entry Point
  - Data quality for the Data Catalog
  - Data quality for ETL jobs
Glue - Streaming
AWS Docs - Glue Streaming
- Using the Apache Spark Streaming framework
- Near-real-time data processing
Glue DataBrew
AWS Docs - Glue DataBrew
Visual data preparation tool that enables users to clean and normalize data without writing any code.
- Job type
  - Recipe job: data transformation
  - Profile job: analyzing a dataset to create a comprehensive profile of the data
Glue DataBrew - Recipes
-
AWS Docs - Recipe step and function reference
-
Categories
-
Basic column recipe steps
- Filter
- Column
-
Data cleaning recipe steps
- Format
- Clean
- Extract
-
Data quality recipe steps
- Missing
- Invalid
- Duplicates
- Outliers
-
Personally identifiable information (PII) recipe steps
- Mask personal information
- Replace personal information
- Encrypt personal information
- Shuffle rows
-
Column structure recipe steps
- Split
- Merge
- Create
-
Column formatting recipe steps
- Decimal precision
- Thousands separator
- Abbreviate numbers
-
Data structure recipe steps
- Nest-Unnest
- Pivot
- Group
- Join
- Union
-
Data science recipe steps
- Text
- Scale
- Mapping
- Encode
-
Functions
- Mathematical functions
- Aggregate functions
- Text functions
- Date and time functions
- Window functions
- Web functions
- Other functions
-
Kinesis Data Streams
AWS Docs - Kinesis Data Streams
- Key points
  - Equivalent to Kafka, for event streaming, no ETL
  - On-demand or provisioned mode
  - PaaS with API access for developers
  - The maximum size of the data payload of a record before base64-encoding is up to 1 MB.
  - Retention Period by default is 1 day, and can be up to 365 days.
- A stream is composed of one or more shards.
- A shard is a uniquely identified sequence of data records in a stream.
- All the data in the shard is sent to the same worker that is processing the shard.
- A shard iterator is a pointer to a position in the shard from which to start reading data records sequentially. A shard iterator specifies this position using the sequence number of a data record in a shard.
- number_of_shards = max(incoming_write_bandwidth_in_KiB/1024, outgoing_read_bandwidth_in_KiB/2048)
- Read throughput
  - Up to 2 MB/second/shard, or 5 transactions (API calls)/second/shard, or 2000 records/second/shard, shared across all the consumers that are reading from a given shard
  - Each call to GetRecords is counted as 1 read transaction.
  - GetRecords can retrieve up to 10 MB / transaction / shard, and up to 10000 records per call. If a call to GetRecords returns 10 MB, subsequent calls made within the next 5 seconds throw an exception.
  - Up to 20 registered consumers (Enhanced Fan-out Limit) for each data stream.
- Write throughput
  - Up to 1 MB/second/shard or 1000 records/second/shard
  - Record aggregation allows customers to combine multiple records into a single record. This allows customers to improve their per-shard throughput.
- Data stream capacity mode
  - On-demand mode: automatically manages the shards
  - Provisioned mode: you must specify the number of shards for the data stream upfront, but you can also change the number of shards later by resharding the stream.
- Scaling
  - Resharding: adjust the number of shards in your stream
    - Shard split
    - Shard merge
    - UpdateShardCount API
- Partition key
  - All data records with the same partition key map to the same shard.
  - The number of partition keys should typically be much greater than the number of shards, that is, a many-to-one relationship.
- Idempotency
  - 2 primary causes for duplicate records
    - Producer retries
    - Consumer retries
  - Consumer retries are more common than producer retries
- Kinesis Data Streams does not guarantee the order of records across shards in a stream.
- Kinesis Data Streams does guarantee the order of records within a shard.
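The shard sizing formula above can be turned into a small calculator. This is a minimal sketch: the function name `required_shards` is hypothetical, and rounding up to a whole shard (with a minimum of one) is an assumption added on top of the formula.

```python
import math


def required_shards(incoming_write_kib_per_s: float, outgoing_read_kib_per_s: float) -> int:
    """number_of_shards = max(write_KiB / 1024, read_KiB / 2048), rounded up."""
    if incoming_write_kib_per_s < 0 or outgoing_read_kib_per_s < 0:
        raise ValueError("bandwidth must be non-negative")
    write_shards = incoming_write_kib_per_s / 1024  # each shard ingests about 1 MiB/s
    read_shards = outgoing_read_kib_per_s / 2048    # each shard serves about 2 MiB/s
    return max(1, math.ceil(max(write_shards, read_shards)))


if __name__ == "__main__":
    # 5000 KiB/s written, 15000 KiB/s read: the read side dominates.
    print(required_shards(5000, 15000))
```

In the example the write side needs about 4.9 shards and the read side about 7.3, so the stream is sized by reads.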
Kinesis Data Streams - Producers
Kinesis Data Streams - Producers - KPL
AWS Docs - KPL
- Batching
  - Aggregation: storing multiple payloads within a single Kinesis Data Streams record.
  - Collection: batching multiple Kinesis Data Streams records and sending them in a single HTTP request (a PutRecords call) instead of one request per record, to reduce HTTP requests.
- Rate Limiting
  - Limits per-shard throughput sent from a single producer
  - Implemented using a token bucket algorithm with separate buckets for both Kinesis Data Streams records and bytes.
- Pros
  - Higher throughput due to aggregation and compression of records
  - Abstraction over the API, which simplifies coding compared with using the low-level API directly
- Cons
  - Higher latency due to additional processing delay of up to configurable RecordMaxBufferedTime
  - Only supports AWS SDK v1
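The rate limiting idea above (a token bucket with separate buckets for records and bytes) can be sketched as follows. This is not the KPL's actual implementation; the class names are hypothetical, the defaults reuse the per-shard write limits quoted in these notes (1000 records/second and 1 MB/second, taken here as 1024 * 1024 bytes), and time is passed in explicitly so the behavior is deterministic.

```python
class TokenBucket:
    """A bucket that refills continuously up to a fixed capacity."""

    def __init__(self, capacity: float, refill_per_second: float) -> None:
        self.capacity = float(capacity)
        self.refill_per_second = float(refill_per_second)
        self.tokens = float(capacity)  # start full
        self.updated_at = 0.0

    def refill(self, now: float) -> None:
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.updated_at = now


class ShardRateLimiter:
    """Allows a put only if both the record bucket and the byte bucket have room."""

    def __init__(self, records_per_second: float = 1000,
                 bytes_per_second: float = 1024 * 1024) -> None:
        self.record_bucket = TokenBucket(records_per_second, records_per_second)
        self.byte_bucket = TokenBucket(bytes_per_second, bytes_per_second)

    def allow(self, record_bytes: int, now: float) -> bool:
        self.record_bucket.refill(now)
        self.byte_bucket.refill(now)
        if self.record_bucket.tokens >= 1 and self.byte_bucket.tokens >= record_bytes:
            self.record_bucket.tokens -= 1
            self.byte_bucket.tokens -= record_bytes
            return True
        return False  # caller should buffer or retry later


if __name__ == "__main__":
    limiter = ShardRateLimiter(records_per_second=2, bytes_per_second=100)
    print([limiter.allow(40, now=0.0) for _ in range(3)])
```

Checking both buckets before spending from either keeps a rejected record from consuming tokens.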
Kinesis Data Streams - Producers - Kinesis Agent
AWS Docs - Kinesis Agent
- Stand-alone Java application running as a daemon
- Continuously monitors a set of files and sends new data to your stream.
Kinesis Data Streams - Consumers
- Differences between shared throughput consumer and enhanced fan-out consumer
- Types
  - Shared throughput consumers without enhanced fan-out
  - Enhanced fan-out consumers
- KCL
  - Tasks
    - Connects to the data stream
    - Enumerates the shards within the data stream
    - Uses leases to coordinate shard associations with its workers
    - Instantiates a record processor for every shard it manages
    - Pulls data records from the data stream
    - Pushes the records to the corresponding record processor
    - Checkpoints processed records
    - Balances shard-worker associations (leases) when the worker instance count changes or when the data stream is resharded (shards are split or merged)
  - Versions
    - KCL 1.x is based on AWS SDK v1, and KCL 2.x is based on AWS SDK v2
    - KCL 1.x
      - Java
      - Node.js
      - .NET
      - Python
      - Ruby
    - KCL 2.x
      - Java
      - Python
  - Multistream processing is only supported in KCL 2.x for Java (KCL 2.3+ for Java)
  - You should ensure that the number of KCL instances does not exceed the number of shards (except for failure standby purposes)
  - One KCL worker is deployed on one EC2 instance, able to process multiple shards.
  - A KCL worker runs as a process and it has multiple record processors running as threads within it.
  - One shard has exactly one corresponding record processor.
  - KCL 2.x enables you to create KCL consumer applications that can process more than one data stream at the same time.
  - Checkpoints the sequence number for the shard in the lease table to track processed records
  - Enhanced fan-out
    - Consumers using enhanced fan-out have dedicated throughput of up to 2 MB/second/shard; without enhanced fan-out, all consumers reading from the same shard share the throughput of 2 MB/second/shard.
    - Requires KCL 2.x and Kinesis Data Streams with enhanced fan-out enabled
    - You can register up to 20 consumers per stream to use enhanced fan-out.
    - Kinesis Data Streams pushes data records from the stream to consumers using enhanced fan-out over HTTP/2, therefore no polling and lower latency.
  - Lease Table
    - At any given time, each shard of data records is bound to a particular KCL worker by a lease identified by the leaseKey variable.
    - By default, a worker can hold one or more leases (subject to the value of the maxLeasesForWorker variable) at the same time.
    - A DynamoDB table to keep track of the shards in a Kinesis Data Stream that are being leased and processed by the workers of the KCL consumer application.
    - Each row in the lease table represents a shard that is being processed by the workers of your consumer application.
    - KCL uses the name of the consumer application to create the name of the lease table that this consumer application uses, therefore each consumer application name must be unique.
    - One lease table for one KCL consumer application
    - KCL creates the lease table with a provisioned throughput of 10 reads / second and 10 writes / second
- Developing Consumers Using Amazon Data Firehose
- Developing Consumers Using Amazon Managed Service for Apache Flink
- Troubleshooting
  - Consumer read API calls throttled (ReadProvisionedThroughputExceeded)
    - Occurs when the GetRecords quota is exceeded; subsequent GetRecords calls are throttled.
    - Quota
      - 5 GetRecords API calls / second / each shard
      - 2 MB / second / each shard
      - Up to 10 MiB of data / API call / each shard
      - Up to 10000 records / API call
  - Slow record processing
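The difference between shared throughput and enhanced fan-out described above comes down to simple arithmetic per shard. The sketch below assumes the 2 MB/second/shard figure and the limit of 20 registered consumers from these notes, and assumes shared consumers split the shard's throughput evenly; the function name is hypothetical.

```python
SHARD_READ_MB_PER_SECOND = 2.0   # per-shard read throughput from the notes
MAX_REGISTERED_CONSUMERS = 20    # enhanced fan-out limit per stream


def per_consumer_read_mb_per_second(consumers: int, enhanced_fan_out: bool) -> float:
    """Estimate the read throughput each consumer gets from a single shard."""
    if consumers < 1:
        raise ValueError("need at least one consumer")
    if enhanced_fan_out:
        if consumers > MAX_REGISTERED_CONSUMERS:
            raise ValueError("at most 20 consumers can be registered per stream")
        return SHARD_READ_MB_PER_SECOND  # dedicated throughput per consumer
    # Assumption: shared-throughput consumers divide the shard's budget evenly.
    return SHARD_READ_MB_PER_SECOND / consumers


if __name__ == "__main__":
    print(per_consumer_read_mb_per_second(4, enhanced_fan_out=False))
    print(per_consumer_read_mb_per_second(4, enhanced_fan_out=True))
```

With four shared consumers, each one is left with roughly a quarter of the shard's read budget, which is one common source of `ReadProvisionedThroughputExceeded` errors.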
Kinesis Data Firehose
AWS Docs - Kinesis Data Firehose
- Serverless, fully managed with automatic scaling
- Pre-built Kinesis Data Streams connectors for various sources and destinations, without coding, similar to Kafka Connect
- For data latency of 60 seconds or higher.
- Troubleshooting
Kinesis Data Firehose - Sources & Destinations
- Sources
  - Kinesis Data Streams
  - MSK
    - Can only use S3 as destination
  - Direct PUT
    - Use PutRecord and PutRecordBatch API to send data
    - If the source is Direct PUT, Firehose will retain data for 24 hours.
- Destinations - AWS Docs - Data Firehose - Source, Destination, and Name
  - S3
  - Redshift
  - OpenSearch
  - Custom HTTP Endpoint
  - 3rd-party service
Kinesis Data Firehose - Buffering hints
AWS Docs - Buffering hints
- Buffering size in MBs (destination specific)
  - 1 MB to 128 MB
  - Default is 5 MB
- Buffering interval in seconds (destination specific)
  - 60 to 900 seconds
  - Default is 60/300 seconds
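The two buffering hints work together: a batch is delivered once either the size or the interval is reached, whichever comes first. The following sketch simulates that behavior with a hypothetical `DeliveryBuffer` class. It is a simplification, since the interval is only checked when a new record arrives, and it is not how Firehose is implemented internally.

```python
from typing import List, Optional


class DeliveryBuffer:
    """Toy model of buffering hints: flush on size or interval, whichever comes first."""

    def __init__(self, size_mb: int = 5, interval_seconds: int = 300) -> None:
        self.size_bytes = size_mb * 1024 * 1024
        self.interval_seconds = interval_seconds
        self.records: List[bytes] = []
        self.buffered_bytes = 0
        self.opened_at: Optional[float] = None

    def add(self, record: bytes, now: float) -> Optional[List[bytes]]:
        """Buffer a record; return the batch if a hint is reached, else None."""
        if self.opened_at is None:
            self.opened_at = now  # the interval starts with the first buffered record
        self.records.append(record)
        self.buffered_bytes += len(record)
        size_reached = self.buffered_bytes >= self.size_bytes
        interval_reached = now - self.opened_at >= self.interval_seconds
        if size_reached or interval_reached:
            return self.flush()
        return None

    def flush(self) -> List[bytes]:
        batch, self.records = self.records, []
        self.buffered_bytes = 0
        self.opened_at = None
        return batch


if __name__ == "__main__":
    buf = DeliveryBuffer(size_mb=1, interval_seconds=60)
    print(buf.add(b"first", now=0.0))        # still buffering
    print(buf.add(b"second", now=61.0))      # interval reached, batch delivered
```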
Kinesis Data Firehose - Data Transformation
Data Transformation is not natively supported; you need to use Lambda.
- Lambda
  - Custom transformation logic
  - Lambda invocation time of up to 5 min.
  - Error handling
    - Retry the invocation 3 times by default.
    - After retries, if the invocation is unsuccessful, Firehose then skips that batch of records. The skipped records are treated as unsuccessfully processed records.
    - The unsuccessfully processed records are delivered to your S3 bucket.
- Data format conversion
  - Firehose can convert JSON records using a schema from a table defined in Glue
  - Non-JSON data must be converted by invoking a Lambda function
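A transformation Lambda receives a batch of base64-encoded records and must return each one with its `recordId`, a `result` of `Ok`, `Dropped`, or `ProcessingFailed`, and the (re-encoded) `data`. The handler below is a minimal sketch of that contract; the `processed` flag it adds is a made-up transformation chosen only for illustration.

```python
import base64
import json


def lambda_handler(event, context):
    """Transform each Firehose record and report a per-record result."""
    output = []
    for record in event["records"]:
        try:
            payload = json.loads(base64.b64decode(record["data"]))
            payload["processed"] = True  # hypothetical transformation
            data = base64.b64encode(json.dumps(payload).encode("utf-8")).decode("utf-8")
            output.append({"recordId": record["recordId"], "result": "Ok", "data": data})
        except (ValueError, TypeError):
            # Not valid base64/JSON (or not a JSON object): hand the record back unchanged
            # so Firehose treats it as unsuccessfully processed.
            output.append({
                "recordId": record["recordId"],
                "result": "ProcessingFailed",
                "data": record["data"],
            })
    return {"records": output}


if __name__ == "__main__":
    sample = {"records": [
        {"recordId": "1", "data": base64.b64encode(b'{"a": 1}').decode("utf-8")},
        {"recordId": "2", "data": base64.b64encode(b"not json").decode("utf-8")},
    ]}
    for item in lambda_handler(sample, None)["records"]:
        print(item["recordId"], item["result"])
```

Every incoming `recordId` appears exactly once in the response, which is what lets Firehose match results back to the original batch.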
Kinesis Data Firehose - Dynamic Partitioning
- Dynamic partitioning enables you to continuously partition streaming data in Firehose by using keys within data (for example, customer_id or transaction_id) and then deliver the data grouped by these keys into corresponding S3 prefixes.
- Partitioning your data minimizes the amount of data scanned, optimizes performance, and reduces costs of your analytics queries on S3.
- Methods of creating partitioning keys
  - Inline parsing: map each parameter to a valid jq expression, suitable for JSON format
  - Lambda function: for compressed or encrypted data records, or data that is in any file format other than JSON
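To illustrate what grouping by partitioning keys produces, the sketch below emulates the idea locally: it extracts keys from JSON records and groups the records under a prefix per key combination. Firehose does this itself once dynamic partitioning is configured; the function names and the Hive-style `key=value/` prefix layout are assumptions made for this example.

```python
import json
from collections import defaultdict
from typing import Dict, Iterable, List, Sequence


def partition_prefix(record: dict, keys: Sequence[str]) -> str:
    """Build a Hive-style prefix such as 'customer_id=42/' (layout is an assumption)."""
    return "/".join(f"{key}={record[key]}" for key in keys) + "/"


def group_by_prefix(raw_records: Iterable[str], keys: Sequence[str]) -> Dict[str, List[dict]]:
    """Group JSON records by the prefix derived from their partitioning keys."""
    groups: Dict[str, List[dict]] = defaultdict(list)
    for raw in raw_records:
        record = json.loads(raw)
        groups[partition_prefix(record, keys)].append(record)
    return dict(groups)


if __name__ == "__main__":
    stream = [
        '{"customer_id": 1, "amount": 10}',
        '{"customer_id": 2, "amount": 5}',
        '{"customer_id": 1, "amount": 7}',
    ]
    for prefix, records in group_by_prefix(stream, ["customer_id"]).items():
        print(prefix, len(records))
```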
-
-
Resources
Kinesis Data Firehose - Server-side Encryption
Server-side encryption
- Can be enabled depending on the source of data
- Kinesis Data Streams as the Source
  - Kinesis Data Streams first decrypts the data and then sends it to Firehose.
  - Firehose buffers the data in memory and then delivers without storing the unencrypted data at rest.
- With Direct PUT or other Data Sources
  - Can be turned on by using the StartDeliveryStreamEncryption operation.
  - To stop server-side-encryption, use the StopDeliveryStreamEncryption operation.
Kinesis Data Firehose - Custom S3 Prefixes
Custom Prefixes for Amazon S3 Objects
- Ability to specify a custom S3 prefix to be compatible with Hive naming conventions, for data to be easily cataloged by Glue.
Resources
Kinesis Data Analytics
- Input
  - Kinesis Data Streams
  - Kinesis Data Firehose
- Output
  - Kinesis Data Streams
  - Kinesis Data Firehose
  - Lambda
Kinesis Data Analytics - Managed Apache Flink
Managed Service for Apache Flink
- Equivalent to Kafka Streams, using Apache Flink behind the scenes for online stream processing, namely streaming ETL
- Managed Service for Apache Flink is for building streaming applications with APIs in Java, Scala, Python, and SQL
- Managed Service for Apache Flink Studio is for ad-hoc interactive data exploration.
- Flink API
  - Flink offers 4 levels of API abstraction: Flink SQL, Table API, DataStream API, and Process Function, which is used in conjunction with the DataStream API.
  - SQL can be embedded in any Flink application, regardless of the programming language chosen.
  - If you are planning to use the DataStream API, not all connectors are supported in Python.
  - If you need low-latency/high-throughput, you should consider Java/Scala regardless of the API.
  - If you plan to use Async IO in the Process Functions API, you will need to use Java.
Kinesis Data Analytics Studio
Kinesis Data Analytics Studio
- Based on
  - Apache Zeppelin
  - Apache Flink
Lake Formation
-
- Simplify security management and governance at scale
- Centrally manage permissions across your users
- Comprehensive data access and audit logging
- Provides an RDBMS permissions model to grant or revoke access to databases, tables, and columns in the Data Catalog with underlying data in S3.
-
Resources
Lake Formation - Ingestion
- Blueprints are predefined ETL workflows that you can use to ingest data into your Data Lake.
- Workflows are instances of ingestion blueprints in Lake Formation.
- Blueprint types
  - Database snapshot
    - Loads or reloads data from all tables into the data lake from a JDBC source.
    - You can exclude some data from the source based on an exclude pattern.
    - When to use:
      - Schema evolution is flexible. (Columns are re-named, previous columns are deleted, and new columns are added in their place.)
      - Complete consistency is needed between the source and the destination.
  - Incremental database
    - Loads only new data into the data lake from a JDBC source, based on previously set bookmarks.
    - You specify the individual tables in the JDBC source database to include.
    - When to use:
      - Schema evolution is incremental. (There is only successive addition of columns.)
      - Only new rows are added; previous rows are not updated.
  - Log file: bulk loads data from log file sources
Lake Formation - Permission management
The security policies in Lake Formation use two layers of permissions, each resource is protected by
-
Lake Formation permissions (which control access to Data Catalog resources and S3 locations)
-
IAM permissions (which control access to Lake Formation and AWS Glue API resources)
-
Permission types
-
Metadata access
Data Catalog permissions -
Underlying data access
-
Data access permissionsData access permissions (
SELECT,INSERT, andDELETE) onData Catalogtables that point to that location. -
Data location permissionsThe ability to create
Data Catalogresources that point to particularS3locations.When you grant the
CREATE_TABLEorALTERpermission to a principal, you also grant data location permissions to limit the locations for which the principal can create or alter metadata tables.
-
-
-
Permission model
-
IAMpermissions model consists ofIAM policies. -
Lake Formationpermissions model is implemented asDBMS-styleGRANT/REVOKEcommands. -
Principals
IAMusers and rolesSAMLusers and groupsIAM Identity Center- External accounts
-
Resources
-
LF-TagsUsing
LF-Tagscan greatly simplify the number of grants over usingNamed Resource policies. -
Named data catalog resources
-
-
Permissions
- Database permissions
- Table permissions
-
Data filtering
-
Column-level security
Allows users to view only specific columns and nested columns that they have access to in the table.
-
Row-level security
Allows users to view only specific rows of data that they have access to in the table.
-
Cell-level security
Uses row filtering and column filtering at the same time, so access to different columns can be restricted depending on the row.
-
-
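As a rough illustration of how row filtering and column filtering combine, the following Python sketch applies both to in-memory rows. It is only a conceptual model, not the Lake Formation API: `apply_data_filter`, the row shape, and the column names are hypothetical.

```python
from typing import Callable, Dict, Iterable, List

Row = Dict[str, object]


def apply_data_filter(rows: Iterable[Row],
                      row_predicate: Callable[[Row], bool],
                      allowed_columns: List[str]) -> List[Row]:
    """Keep rows that pass the row filter, then keep only the allowed columns."""
    return [
        {col: row[col] for col in allowed_columns if col in row}
        for row in rows
        if row_predicate(row)
    ]


if __name__ == "__main__":
    table = [
        {"region": "EU", "name": "a", "ssn": "1"},
        {"region": "US", "name": "b", "ssn": "2"},
    ]
    # One filter: EU rows only, without the sensitive column.
    print(apply_data_filter(table, lambda r: r["region"] == "EU", ["region", "name"]))
```

Granting a principal several such filters, each with its own row condition and column list, is what lets the visible columns differ from row to row.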
QuickSight
AWS Docs - QuickSight
-
Key points
- Ad-hoc, interactive query, analysis and visualization
- For formatted canned reports, use Paginated Reports
-
Editions
-
Standard edition
-
Enterprise edition
-
VPC connection
Access to data sources in your VPC without the need for a public IP address
-
Microsoft Active Directory SSO
-
Data stored in SPICE is encrypted at rest.
-
For SQL-based data sources, such as Redshift, Athena, PostgreSQL, or Snowflake, you can refresh your data incrementally within a look-back window of time.
-
-
-
Datasets
-
Dataset refresh can be scheduled or on-demand.
-
SPICE
- In-memory
-
Direct SQL query
-
-
Visual types
-
Combo chart
Aka line and column (bar) charts
-
Scatter plot
Used to observe relationships between variables
-
Heat map
- Use heat maps to show a measure for the intersection of two dimensions, with color-coding to easily differentiate where values fall in the range.
- Use a heat map if you want to identify trends and outliers, because the use of color makes these easier to spot.
-
Pivot table
- Use a pivot table if you want to analyze data on the visual.
-
-
AutoGraph
You create a visual by choosing AutoGraph and then selecting fields.
Migration and Transfer
Application Discovery Service
AWS Application Discovery Service
-
Collect usage and configuration data about your on-premises servers and databases.
-
All discovered data is stored in your AWS Migration Hub home Region.
-
2 ways of performing discovery and collecting data
-
Agentless discovery
Identifies VMs and hosts associated with VMware vCenter
-
Agent-based discovery
The agent runs in your local environment and requires root privileges.
-
Application Migration Service
- Automated lift-and-shift (rehost) migration of VMs to AWS
- Migrate your applications to AWS from physical servers, VMware vSphere, Microsoft Hyper-V, and other cloud providers.
- Migrate EC2 instances between AWS Regions or between AWS accounts, and migrate from EC2-Classic to a VPC.
DMS
AWS Docs - DMS (Database Migration Service)
-
Migrate RDBMS, data warehouses, NoSQL databases, and other types of data stores
-
One-time migration or continuous replication
-
Used for CDC (Change Data Capture) for real-time ongoing replication
-
DMS uses a replication instance to connect to your source data store, read the source data, and format the data for consumption by the target data store.
-
DMS Serverless eliminates replication instance management tasks.
DMS - Fleet Advisor
- Automatically inventories and assesses on-premises database and analytics server fleet and identifies potential migration paths
DMS - Data Transformations
DataSync
AWS Docs - DataSync
-
One-off online data transfer; transferring hot or cold data should not impede your business.
-
Move data between on-premises and AWS
-
Move data between AWS services
-
Requires DataSync agent installation on-premises
-
Move files or objects, not databases
-
On-premises
- Network File System (NFS)
- Server Message Block (SMB)
- Hadoop Distributed File Systems (HDFS)
- Object storage
Snow Family
-
AWS Snow Family service models - Feature comparison matrix
-
AWS Docs - AWS Snowball Edge Device Specifications
Snowball Edge Storage Optimized
- For large-scale data migrations and recurring transfer workflows, as well as local computing with higher capacity needs
Snowball Edge Compute Optimized
- For use cases such as machine learning, full motion video analysis, analytics, and local computing stacks
Snowmobile
- Intended for data migrations of more than 10 PB from a single location
- Up to 100 PB per Snowmobile
- Shipping container
Transfer Family
-
Transfer flat files using the following protocols:
- Secure File Transfer Protocol (SFTP): version 3
- File Transfer Protocol Secure (FTPS)
- File Transfer Protocol (FTP)
- Applicability Statement 2 (AS2)
- Browser-based transfers
Application Integration
AppFlow
EventBridge
-
Rules
Routing rules for events
-
API Destinations
API destinations are HTTP endpoints that you can invoke as the target of a rule, similar to how you invoke an AWS service or resource as a target.
MWAA
- Auto scaling
- Automatic Airflow setup based on Airflow version
- Streamlined upgrades and patches
- Workflow monitoring in CloudWatch
SNS
Step Functions
Step Functions
-
Able to run EMR workloads on a schedule
-
General-purpose serverless workflow orchestration service
-
Resources
- AWS Docs - AWS Step Functions - Manage an Amazon EMR Job
- AWS Big Data Blog - Automating EMR workloads using AWS Step Functions
- AWS Big Data Blog - Orchestrate Amazon EMR Serverless jobs with AWS Step functions
- InfoQ - Building Workflows with AWS Step Functions
- LocalStack - Step Functions
Step Functions - State Machine
-
Types
-
Standard workflows
-
Standard workflows are ideal for long-running, durable, and auditable workflows.
-
Standard workflows can support an execution start rate of over 2,000 executions per second.
-
They can run for up to a year, and you can retrieve the full execution history using the Step Functions API.
-
Standard workflows employ an exactly-once execution model, where your tasks and states are never started more than once unless you have specified the Retry behavior in your state machine.
-
Suited to orchestrating non-idempotent actions, such as starting an EMR cluster or processing payments.
-
Standard workflow executions are billed according to the number of state transitions processed.
-
Using callbacks or the .sync service integration will most likely reduce the number of state transitions and cost.
-
-
Express workflows
The Express type is used for high-volume, event-processing workloads and can run for up to 5 minutes.
-
Synchronous Express Workflows
Start a workflow, wait until it completes, and then return the result
-
Asynchronous Express Workflows
Return confirmation that the workflow was started, but don't wait for the workflow to complete
-
-
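The choice between the two workflow types can be summarised as a small decision helper. This is only a sketch of the rules of thumb in these notes (5-minute limit for Express, exactly-once semantics for Standard); the function name and its parameters are hypothetical, and real decisions also depend on cost and throughput.

```python
def choose_workflow_type(max_duration_seconds: float, idempotent: bool) -> str:
    """Pick a Step Functions workflow type from the rules of thumb above."""
    express_max_seconds = 5 * 60  # Express workflows can run for up to 5 minutes
    if max_duration_seconds <= express_max_seconds and idempotent:
        return "EXPRESS"
    # Long-running or non-idempotent work relies on exactly-once Standard workflows.
    return "STANDARD"


if __name__ == "__main__":
    print(choose_workflow_type(30, idempotent=True))
    print(choose_workflow_type(30, idempotent=False))
```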
Step Functions - States
AWS Docs - Step Functions - States
-
Defined with Amazon States Language in the State Machine definition
-
Types
-
Task
-
Choice
-
Map
-
Use the Map state to run a set of workflow steps for each item in a dataset. The Map state's iterations run in parallel, which makes it possible to process a dataset quickly.
-
Map state processing modes
-
Inline mode
Limited concurrency. Each iteration of the Map state runs in the context of the workflow that contains the Map state. In this mode, the Map state accepts only a JSON array as input and supports up to 40 concurrent iterations.
-
Distributed mode
-
High concurrency. Each iteration of the Map state runs in its own execution context.
-
When you run a Map state in Distributed mode, Step Functions creates a Map Run resource.
-
A tolerated failure threshold can be set for a Map Run, and the Map Run fails automatically if it exceeds the threshold.
-
Use cases
- The size of your dataset exceeds 256 KB.
- The workflow's execution event history exceeds 25,000 entries.
- You need a concurrency of more than 40 parallel iterations.
-
-
-
-
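The thresholds above can be turned into a small helper that picks a Map state processing mode and builds a matching Amazon States Language fragment. This is a minimal sketch: the helper names and the `ProcessItem` placeholder state are hypothetical, and a real Distributed Map usually also needs an item source that is not shown here.

```python
import json
from typing import Any, Dict

INLINE_MAX_CONCURRENCY = 40             # Inline mode concurrency limit
INLINE_MAX_DATASET_BYTES = 256 * 1024   # dataset size threshold (256 KB)
INLINE_MAX_HISTORY_EVENTS = 25_000      # execution event history threshold


def choose_map_mode(dataset_bytes: int, history_events: int, concurrency: int) -> str:
    """Return DISTRIBUTED when any Inline-mode threshold is exceeded."""
    if (dataset_bytes > INLINE_MAX_DATASET_BYTES
            or history_events > INLINE_MAX_HISTORY_EVENTS
            or concurrency > INLINE_MAX_CONCURRENCY):
        return "DISTRIBUTED"
    return "INLINE"


def build_map_state(mode: str, max_concurrency: int,
                    tolerated_failure_percentage: int = 0) -> Dict[str, Any]:
    """Build a Map state definition with a placeholder Pass state per item."""
    state: Dict[str, Any] = {
        "Type": "Map",
        "MaxConcurrency": max_concurrency,
        "ItemProcessor": {
            "ProcessorConfig": {"Mode": mode},
            "StartAt": "ProcessItem",  # hypothetical state name
            "States": {"ProcessItem": {"Type": "Pass", "End": True}},
        },
        "End": True,
    }
    if mode == "DISTRIBUTED":
        # Each iteration runs as its own child workflow execution.
        state["ItemProcessor"]["ProcessorConfig"]["ExecutionType"] = "STANDARD"
        state["ToleratedFailurePercentage"] = tolerated_failure_percentage
    return state


if __name__ == "__main__":
    mode = choose_map_mode(dataset_bytes=5_000_000, history_events=100, concurrency=500)
    print(json.dumps(build_map_state(mode, 500, tolerated_failure_percentage=5), indent=2))
```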
SQS
Kinesis Data Streams vs SQS
| | MSK | Kinesis Data Streams | SQS |
|---|---|---|---|
| Consumption | | Can be consumed many times | Records are deleted after being consumed |
| Retention | | Retention period configurable from 24 hours to 1 year | Retention period configurable from 1 min to 14 days |
| Ordering | | Ordering of records is preserved within the same shard | FIFO queues preserve ordering of records |
| Scaling | | Manual resharding | Auto scaling |
| Delivery | | At least once | At least once (standard queues); exactly once (FIFO queues) |
| Replay | Can replay | Can replay | Cannot replay |
| Payload size | 1 MB | 1 MB | 256 KB |
-
Resources
Cloud Financial Management
AWS Budgets
AWS Cost Explorer
Compute
AWS Batch
AWS Batch
- AWS Batch handles job execution and compute resource management, allowing you to focus on developing your applications rather than managing infrastructure.
- A batch job is deployed as a Docker container.
EC2
Lambda
AWS SAM
Containers
ECR
ECS
EKS
Database
DocumentDB
DynamoDB
RDS
Redshift
-
PB scale, columnar storage
-
MPP architecture
-
Loading data
-
AWS Docs - Amazon Redshift best practices for loading data
-
COPY command
- Can load data from multiple files in parallel.
- Use the COPY command with COMPUPDATE set to ON to analyze and apply compression automatically.
- Split your data into files so that the number of files is a multiple of the number of slices in your cluster.
-
-
AWS Big Data Blog - JOIN Amazon Redshift AND Amazon RDS PostgreSQL WITH dblink
- The dblink function allows the entire query to be pushed to Redshift. This lets Redshift do what it does best: query large quantities of data efficiently and return the results to PostgreSQL for further processing.
Redshift - Cluster
-
A cluster can have up to 32 nodes for most node types; ra3.16xlarge and dc2.8xlarge clusters can have up to 128.
-
Leader node and compute nodes have the same specs, and the leader node is chosen automatically.
-
Leader node (2+ nodes)
- Parses the query, builds an optimal execution plan, and compiles code from the execution plan
- Receives and aggregates the results from the compute nodes
-
Compute node
-
Node slices
A compute node is partitioned into slices. Each slice is allocated a portion of the node's memory and disk space, where it processes a portion of the workload assigned to the node.
-
-
Enhanced VPC routing
- Routes network traffic between your cluster and data repositories through a VPC, instead of through the internet.
Redshift - Cluster - Elastic resize
-
Preferred method
-
In an incremental manner
-
Add or remove nodes (same node type) in a cluster in minutes, aka in-place resize
-
Change the node type of an existing cluster
A snapshot is created, and a new cluster is created from the snapshot with the new node type.
-
Resources
Redshift - Cluster - Classic resize
- When you need to change the cluster size or the node type of an existing cluster and elastic resize is not supported
- Takes much longer than elastic resize
Redshift - Cluster - Node type
-
RA3
| Node Size | vCPU | RAM (GiB) | Managed Storage quota per node | Number of nodes per cluster |
|---|---|---|---|---|
| ra3.xlplus | 4 | 32 | 32 TB | 1 to 32 |
| ra3.4xlarge | 12 | 96 | 128 TB | 2 to 32 |
| ra3.16xlarge | 48 | 384 | 128 TB | 2 to 128 |
- Pay for the managed storage and compute separately
- Managed storage uses SSD for local cache and S3 for persistence and automatically scales based on the workload
- Compute is separated from managed storage and can be scaled independently
- RA3 nodes are optimized for performance and cost-effectiveness
-
DC2 (Dense Compute)
| Node Size | vCPU | RAM (GiB) | Storage per node | Number of nodes per cluster |
|---|---|---|---|---|
| dc2.large | 2 | 15 | 160 GB | 1 to 32 |
| dc2.8xlarge | 32 | 244 | 2.6 TB | 2 to 128 |
- Compute-intensive, with local SSD storage for high performance and low latency
- Recommended for datasets under 10 TB
-
DS2 (Dense Storage)
Deprecated legacy option for low-cost workloads with HDD storage; RA3 is recommended for new workloads.
-
Resources
Redshift - Cluster - HA
-
AZ
- Single AZ by default
- Multi AZ is supported
Redshift - Distribution style
Distribution style
-
Redshift distribution keys (DIST keys) determine where data is stored in Redshift. Clusters store data across the compute nodes. Query performance suffers when a large amount of data is stored on a single node.
-
Uneven distribution of data across compute nodes skews the work each node has to do and leaves some compute nodes under-utilised, so the data should be distributed uniformly.
-
Data should be distributed in such a way that the rows that participate in joins are already on the same node with their joining rows in other tables. This is called collocation, which reduces data movement and improves query performance.
-
Distribution is configurable per table, so you can select a different distribution style for each table.
-
AUTO distribution
- Redshift automatically distributes the data across the nodes in the cluster.
- Redshift initially assigns ALL distribution to a small table, then changes to KEY or EVEN distribution as the table grows larger.
- Suitable for small tables or tables with a small number of distinct values
-
EVEN distribution
- The leader node distributes the rows across the slices in a round-robin fashion, regardless of the values in any particular column.
- Appropriate when a table doesn't participate in joins. It's also appropriate when there isn't a clear choice between KEY distribution and ALL distribution.
-
KEY distribution
- The rows are distributed according to the values in one column.
- All the entries with the same value in the column end up on the same slice.
- Suitable for large tables or tables with a large number of distinct values
-
ALL distribution
- A copy of the entire table is distributed to every node.
- Suitable for small, slowly changing tables that are frequently joined with other tables
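The difference between EVEN and KEY distribution can be simulated in plain Python. This is an illustrative sketch only: `zlib.crc32` stands in for Redshift's internal hash function, which these notes do not specify, and the helper names and the `customer_id` column are hypothetical. ALL distribution is omitted because it simply copies every row to every node.

```python
import zlib
from collections import Counter
from typing import Dict, Hashable, List

Row = Dict[str, Hashable]


def distribute_even(rows: List[Row], num_slices: int) -> List[int]:
    """Round-robin assignment, ignoring column values."""
    return [i % num_slices for i in range(len(rows))]


def distribute_key(rows: List[Row], dist_key: str, num_slices: int) -> List[int]:
    """Hash the distribution key so equal values land on the same slice.
    crc32 is an assumption standing in for the real hash."""
    return [zlib.crc32(str(row[dist_key]).encode("utf-8")) % num_slices for row in rows]


def skew(assignments: List[int], num_slices: int) -> int:
    """Difference between the fullest and emptiest slice."""
    counts = Counter(assignments)
    per_slice = [counts.get(s, 0) for s in range(num_slices)]
    return max(per_slice) - min(per_slice)


if __name__ == "__main__":
    rows = [{"customer_id": i % 3} for i in range(12)]
    print("EVEN skew:", skew(distribute_even(rows, 4), 4))
    print("KEY skew:", skew(distribute_key(rows, "customer_id", 4), 4))
```

With few distinct key values, KEY distribution leaves some slices empty, which is the skew the notes above warn about; the benefit is that matching join keys are collocated.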
Redshift - Concurrency scaling
Concurrency scaling adds transient clusters to your cluster within seconds to handle concurrent requests with consistently fast performance, mainly for bursty workloads.
Redshift - Sort keys
Sort keys
- When you create a table, you can define one or more of its columns as sort keys.
- Either a compound or interleaved sort key.
Redshift - Compression
AWS Docs - Working with column compression
- You can apply a compression type, or encoding, to the columns in a table manually when you create the table.
- The COPY command analyzes your data and applies compression encodings to an empty table automatically as part of the load operation.
- Split your load data files so that the files are about equal size, between 1 MB and 1 GB after compression. For optimum parallelism, the ideal size is between 1 MB and 125 MB after compression.
- The number of files should be a multiple of the number of slices in your cluster.
- When loading data, it is strongly recommended that you individually compress your load files using gzip, lzop, bzip2, or Zstandard when you have large datasets.
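The file-splitting guidance above reduces to a little arithmetic. The following sketch assumes equal-sized files and a target of 125 MB per compressed file; `plan_load_files` is a hypothetical helper, and it does not enforce the 1 MB lower bound, so very small datasets may produce files below it.

```python
import math

MB = 1024 * 1024


def plan_load_files(total_compressed_bytes: int, num_slices: int,
                    target_file_bytes: int = 125 * MB) -> int:
    """Return a file count that is a multiple of num_slices and keeps each
    file at or below target_file_bytes, assuming equal-sized files."""
    if total_compressed_bytes <= 0 or num_slices <= 0:
        raise ValueError("sizes and slice count must be positive")
    min_files = math.ceil(total_compressed_bytes / target_file_bytes)
    # Round up to the next multiple of the slice count.
    return max(1, math.ceil(min_files / num_slices)) * num_slices


if __name__ == "__main__":
    files = plan_load_files(1100 * MB, num_slices=4)
    print(files, "files of about", round(1100 / files, 1), "MB each")
```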
Redshift - Workload Management (WLM)
-
AWS Docs - Amazon Redshift - Implementing workload management
-
Automatic WLM
-
Manual WLM
-
Query Groups
Memory percentage and concurrency settings
-
Query Monitoring Rules
Query monitoring rules define metrics-based performance boundaries for queries and the action to take when a query goes beyond those boundaries.
-
Query Queues
Memory percentage and concurrency settings
-
-
Concurrency scaling
- Concurrency scaling automatically adds and removes cluster capacity to handle spikes in demand.
- Concurrency scaling is available for RA3 and DC2 clusters and is turned on per WLM queue.
- Concurrency scaling is not available for DS2 clusters.
-
Short query acceleration
- Short query acceleration (SQA) prioritizes selected short-running queries ahead of longer-running queries.
- SQA runs short-running queries in a dedicated space, so that SQA queries aren't forced to wait in queues behind longer queries.
- CREATE TABLE AS (CTAS) statements and read-only queries, such as SELECT statements, are eligible for SQA.
Redshift - User-Defined Functions (UDF)
-
User-Defined Functions (UDFs) are functions that you define to extend the capabilities of the SQL language in Redshift.
-
AWS Lambda
- Accessing other AWS services
Redshift - Snapshots
-
Automated
- Enabled by default when you create a cluster.
- By default, taken about every 8 hours or following every 5 GB per node of data changes, whichever comes first.
- Default retention period is 1 day
- Cannot be deleted manually
-
Manual
- By default, manual snapshots are retained indefinitely, even after you delete your cluster.
- AWS CLI: create-cluster-snapshot
- AWS API: CreateClusterSnapshot
-
Backup
- You can configure Redshift to automatically copy snapshots (automated or manual) for a cluster to another Region.
- Only one destination Region at a time can be configured for automatic snapshot copy.
- To change the destination Region that you copy snapshots to, first disable the automatic copy feature. Then re-enable it, specifying the new destination Region.
-
Snapshot copy grant
- KMS keys are specific to a Region. If you enable copying of Redshift snapshots to another Region, and the source cluster and its snapshots are encrypted using a master key from KMS, you need to configure a grant for Redshift to use a master key in the destination Region.
- The grant is created in the destination Region and allows Redshift to use the master key in the destination Region.
Redshift - SQL
-
VACUUM
-
FULL
- VACUUM FULL is the default.
- By default, VACUUM FULL skips the sort phase for any table that is already at least 95% sorted.
-
SORT ONLY
- Sorts the specified table (or all tables in the current database) without reclaiming space freed by deleted rows.
-
DELETE ONLY
Redshift automatically performs a DELETE ONLY vacuum in the background.
-
REINDEX
-
Redshift - Query
-
Data API
- Available in AWS SDK, based on HTTP and JSON, no persistent connection needed
- API calls are asynchronous
- Uses either credentials stored in Secrets Manager or temporary database credentials, no need to pass passwords in the API calls
-
System tables and views
-
SVV views contain information about database objects with references to transient STV tables.
-
SYS views
- These are system monitoring views used to monitor query and workload usage for provisioned clusters and serverless workgroups.
- These views are located in the pg_catalog schema.
-
STL views
- Information retrieved from all Redshift log history across nodes
- Retain 7 days of log history
-
STV tables
Virtual system tables (in-memory) that contain snapshots of the current system state
-
SVCS views provide details about queries on both the main and concurrency scaling clusters.
-
SVL views provide details about queries on main clusters.
-
-
Federated Query
- Query and analyze data across operational databases, data warehouses, and data lakes.
- AWS Docs - Considerations when accessing federated data with Amazon Redshift
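Because the asynchronous, HTTP-based query API described in this section returns before the statement finishes, callers typically submit a statement and then poll for its status. The sketch below assumes a client object shaped like the boto3 `redshift-data` client (`execute_statement` and `describe_statement`); the `run_statement` helper, its parameters, and the polling limits are hypothetical.

```python
import time
from typing import Any, Dict

TERMINAL_STATUSES = {"FINISHED", "FAILED", "ABORTED"}


def run_statement(client: Any, sql: str, *, cluster_id: str, database: str,
                  secret_arn: str, poll_seconds: float = 1.0,
                  max_polls: int = 60) -> Dict[str, Any]:
    """Submit a statement and poll until it reaches a terminal status."""
    submitted = client.execute_statement(
        ClusterIdentifier=cluster_id,
        Database=database,
        SecretArn=secret_arn,  # credentials come from Secrets Manager, not a password
        Sql=sql,
    )
    statement_id = submitted["Id"]
    for _ in range(max_polls):
        description = client.describe_statement(Id=statement_id)
        if description["Status"] in TERMINAL_STATUSES:
            return description
        time.sleep(poll_seconds)
    raise TimeoutError(f"statement {statement_id} did not finish in time")
```

In real use the client would be created with `boto3.client("redshift-data")`; passing it in as an argument keeps the helper easy to exercise with a stub.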
Redshift - Data Sharing
-
With data sharing, you can securely and easily share live data across Redshift clusters without the need to copy or move the data.
-
Lends itself to a multi-warehouse architecture, where you can scale each data warehouse for various types of workloads.
-
Only RA3 and serverless clusters are supported.
-
Live and transactionally consistent views of data across all consumers
-
Secure and governed collaboration within and across organizations
-
Sharing data with external parties to monetize your data
Redshift - Data Sharing - Datashare
Working with datashares
-
You can share data at different levels
- Databases
- Schemas
- Tables
- Views (including regular, late-binding, and materialized views)
- SQL user-defined functions (UDFs)
Redshift - Security
-
IAM
- Does not support resource-based policies.
-
Access control
-
Cluster level
-
IAM policies
For Redshift API actions
-
Security groups
For network connectivity
-
-
Database level
Database user accounts
-
-
Data protection
-
DB encryption using KMS
-
KMS key hierarchy
- Root key
- Cluster encryption key (CEK)
- Database encryption key (DEK)
- Data encryption keys
-
-
-
Auditing
-
Logs
-
Connection log
Logs authentication attempts, connections, and disconnections.
-
User log
Logs information about changes to database user definitions.
-
User activity log
Logs each query before it's run on the database.
-
-
Redshift Spectrum
- In-place querying of data in S3, without having to load the data into Redshift tables
- EB scale
- Pay for the number of bytes scanned
- Federated query across operational databases, data warehouses, and data lakes
Redshift Spectrum - Comparison with Athena
| | Redshift Spectrum | Athena |
|---|---|---|
| Performance | Greater control over performance; use dedicated resources of your Redshift cluster | Not configurable; use shared resources managed by AWS |
| Query Results | Stored in Redshift | Stored in S3 |
Redshift Serverless
- Redshift measures data warehouse capacity in Redshift Processing Units (RPUs). You pay for the workloads you run in RPU-hours on a per-second basis (with a 60-second minimum charge), including queries that access data in open file formats in S3.
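The per-second billing with a 60-second minimum can be worked through as a short calculation. This is a sketch under stated assumptions: the helper names are hypothetical, the minimum is applied per workload run, and the price per RPU-hour is a placeholder argument rather than an actual AWS price.

```python
MIN_BILLABLE_SECONDS = 60  # 60-second minimum charge


def billable_rpu_hours(rpus: int, runtime_seconds: float) -> float:
    """RPU-hours billed for one workload run, on a per-second basis."""
    if runtime_seconds <= 0:
        return 0.0
    billable_seconds = max(runtime_seconds, MIN_BILLABLE_SECONDS)
    return rpus * billable_seconds / 3600


def estimate_cost(rpus: int, runtime_seconds: float, price_per_rpu_hour: float) -> float:
    """Cost estimate; price_per_rpu_hour is a caller-supplied assumption."""
    return billable_rpu_hours(rpus, runtime_seconds) * price_per_rpu_hour


if __name__ == "__main__":
    # A 30-second query on 8 RPUs is billed as 60 seconds.
    print(billable_rpu_hours(8, 30))
```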
Redshift Streaming Ingestion
-
Input
- Kinesis Data Streams
- MSK
Keyspaces
- AWS Docs - Keyspaces
- Based on Apache Cassandra
Neptune
- AWS Docs - Neptune
- Serverless graph database
MemoryDB
- More expensive than ElastiCache
- Intended as a Redis-compatible, fully managed, in-memory database service instead of a caching service
MemoryDB - Valkey
- Open source Redis-compatible alternative, as it is forked from Redis
MemoryDB - Redis
Management and Governance
CloudFormation
CloudTrail
CloudWatch
CloudWatch Logs
-
Subscription
Real-time processing of log data
AWS Config
AWS Config - Config rules
-
AWS Config Managed Rules
Predefined, customizable rules created by AWS Config
-
AWS Config Custom Rules
There are 2 ways to create AWS Config custom rules
- Lambda functions
- Guard (policy as code DSL)
Managed Grafana
Systems Manager
Well-Architected Tool
Networking and Content Delivery
CloudFront
PrivateLink
Route 53
VPC
Security, Identity, and Compliance
IAM
KMS
Macie
Secrets Manager
AWS Shield
- Protection against DDoS attacks
AWS Shield Standard
- Operates at L3 and L4.
AWS Shield Advanced
-
Operates at L7
-
Includes Shield Standard
- Certain AWS WAF usage for Shield protected resources
-
DDoS cost protection
- If any of these protected resources scale up in response to a DDoS attack, you can request Shield Advanced service credits through your regular AWS Support channel.
AWS WAF
-
L7 (HTTP) application-level firewall
-
Define a Web ACL and then associate it with one or more web application resources that you want to protect.