Services
Analytics
Athena
- In-place querying of data in S3
- Serverless
- PB scale
- Presto under the hood
- Supports SQL and Apache Spark
- Analyze unstructured, semi-structured, and structured data stored in S3, in formats including CSV, JSON, Apache Parquet, and Apache ORC
- Only successful or canceled queries are billed; failed queries are not billed.
- No charge for DDL statements
- Use cases
  - Ad-hoc queries of web logs, CloudTrail/CloudFront/VPC/ELB logs (see the boto3 sketch after this list)
  - Querying staging data before loading to Redshift
  - Integration with Jupyter, Zeppelin, RStudio notebooks
  - Integration with QuickSight for visualization
  - Integration via JDBC/ODBC for BI tools
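
The whole ad-hoc flow (submit a query, wait for it to finish, read the result set) can be scripted with boto3; a minimal sketch, assuming a hypothetical `logs_db` database, `web_logs` table, and results bucket:

```python
"""Minimal sketch: run an ad-hoc Athena query with boto3.
Database, table, and bucket names are hypothetical placeholders."""
import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Start the query; results land in the (hypothetical) S3 output location.
response = athena.start_query_execution(
    QueryString="SELECT status, COUNT(*) AS hits FROM web_logs GROUP BY status",
    QueryExecutionContext={"Database": "logs_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/adhoc/"},
    WorkGroup="primary",
)
query_id = response["QueryExecutionId"]

# Poll until the query reaches a terminal state.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    for row in rows:
        print([col.get("VarCharValue") for col in row["Data"]])
```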
- Resources

Athena - Athena SQL
- `MSCK REPAIR TABLE` refreshes partition metadata with the current partitions of an external table. (Repairing partitions manually using MSCK repair)
- When you specify `ROW FORMAT DELIMITED`, Athena uses the `LazySimpleSerDe` by default.

      -- e.g.
      ROW FORMAT DELIMITED
      FIELDS TERMINATED BY ','
      ESCAPED BY '\\'
      COLLECTION ITEMS TERMINATED BY '|'
      MAP KEYS TERMINATED BY ':'

- Use `ROW FORMAT SERDE` to explicitly specify the type of SerDe that Athena should use when it reads and writes data to the table.

      -- e.g.
      ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
      WITH SERDEPROPERTIES (
        'serialization.format' = ',',
        'field.delim' = ',',
        'collection.delim' = '|',
        'mapkey.delim' = ':',
        'escape.delim' = '\\'
      )
- `UNLOAD` writes query results from a SELECT statement to the specified data format. Supported formats for UNLOAD include Parquet, ORC, Avro, and JSON.
- The UNLOAD statement is useful when you want to output the results of a SELECT query in a non-CSV format but do not require the associated table.
Athena - Workgroup
- A workgroup is a collection of users, queries, and results that are associated with a specific data catalog and workgroup settings. Workgroup settings include the location in S3 where query results are stored, the encryption configuration, and the data usage control settings.
Athena - Federated Query
AWS Docs - Using Amazon Athena Federated Query
- Query the data in place, or build pipelines that extract data from multiple data sources and store them in S3.
- Uses data source connectors that run on Lambda to run federated queries.
OpenSearch
AWS Docs - OpenSearch
- Operational analytics and log analytics
- Forked version of Elasticsearch
- Number of shards = index size / 30 GB
OpenSearch - Index
- Storage
  - UltraWarm
    - To store large amounts of read-only data
  - Cold
    - If you need to do periodic research or forensic analysis on your older data
  - Hot
    - OR1 (OpenSearch Optimized Instances): a domain with OR1 instances uses EBS gp3 or io1 volumes for primary storage, with data copied synchronously to S3 as it arrives.
- Cross-cluster replication
  - Cross-cluster replication follows a pull model; you initiate connections from the follower domain.
EMR
AWS Docs - EMR
- Managed cluster platform that simplifies running big data frameworks, such as Hadoop and Spark, on AWS
- PB scale
- The `RunJobFlow` API creates and starts running a new cluster (job flow). The cluster runs the specified steps; after the steps complete, the cluster stops and the HDFS partition is lost. (See the boto3 sketch after this list.)
- An EMR cluster with multiple primary nodes can reside only in one AZ.
- HDFS
  - A typical block size used by HDFS is 128 MB. Thus, an HDFS file is chopped up into 128 MB chunks, and if possible, each chunk will reside on a different DataNode.
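
A minimal `RunJobFlow` sketch using boto3's `run_job_flow`, showing a transient cluster that runs one Spark step and then terminates. The release label, instance types, roles, and script path are assumptions, not values from these notes.

```python
"""Minimal sketch of the RunJobFlow API via boto3: a transient EMR cluster
that runs one Spark step and terminates. Names and ARNs are hypothetical."""
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="transient-spark-job",
    ReleaseLabel="emr-7.1.0",          # assumed release label
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": False,   # cluster stops after the steps complete
    },
    Steps=[
        {
            "Name": "spark-etl",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "s3://my-bucket/jobs/etl.py"],
            },
        }
    ],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print("Cluster ID:", response["JobFlowId"])
```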
EMR cluster - Node types
AWS Docs - EMR cluster - Node types
- Primary node / Master node
  - Manages the cluster and typically runs primary components of distributed applications
  - Tracks the status of jobs submitted to the cluster
  - Monitors the health of the instance groups
  - HA support from 5.23.0+
- Core nodes
  - Store HDFS data and run tasks
  - Multi-node clusters have at least one core node.
  - One core instance group or instance fleet per cluster, but there can be multiple nodes running on multiple EC2 instances in the instance group or instance fleet.
- Task nodes
  - No HDFS data, therefore no data loss if terminated
  - Can use Spot instances
  - Optional in a cluster
EMR - EC2
AWS Docs - EMR - EC2
EMR - EKS
AWS Docs - EMR - EKS
EMR - Serverless
AWS Docs - EMR - Serverless
EMR - EMRFS
- Direct access to S3 from the EMR cluster
- Persistent storage for EMR clusters
- Benefit from S3 features
- For clusters with multiple users who need different levels of access to data in S3 through EMRFS
  - EMRFS can assume a different service role for cluster EC2 instances based on the user or group making the request, or based on the location of data in S3.
  - Each IAM role for EMRFS can have different permissions for data access in S3.
  - When EMRFS makes a request to S3 that matches users, groups, or the locations that you specify, the cluster uses the corresponding role that you specify instead of the EMR role for EC2.
  - AWS Docs - Authorizing access to EMRFS data in Amazon S3
EMR - S3DistCp
- An extension to DistCp (Apache Hadoop Distributed Copy)
- Uses MapReduce to efficiently copy large amounts of data in a distributed manner
- Copy data from S3 to HDFS
- Copy data from HDFS to S3
- Copy data between S3 buckets
EMR - Security - Apache Ranger
- An RBAC framework to enable, monitor, and manage comprehensive data security across the Hadoop ecosystem
- Centralized security administration and auditing
- Fine-grained authorization across many Hadoop components (Hadoop, Hive, HBase, Storm, Knox, Solr, Kafka, and YARN)
- Syncs policies and users by using agents and plugins that run within the same process as the Hadoop component
- Supports row-level authentication and auditing capabilities with embedded search
EMR - Security Configuration
AWS Docs - Security Configuration
- At-rest data encryption
  - For WAL with EMR
    - Open-source HDFS encryption
    - LUKS encryption
  - For cluster nodes' local disks (EC2 instance volumes)
    - Open-source HDFS encryption
    - Instance store encryption
      - NVMe encryption
      - LUKS encryption
    - EBS volume encryption
      - EBS encryption
        - Recommended
        - EMR 5.24.0+
      - LUKS encryption
        - Only applies to attached storage volumes, not to the root device volume
  - For EMRFS on S3
    - SSE-S3
    - SSE-KMS
    - SSE-C
    - CSE-KMS
      - Objects are encrypted before being uploaded to S3 and the client uses keys provided by KMS.
    - CSE-Custom
      - Objects are encrypted before being uploaded to S3 and the client uses a custom Java class that provides the client-side master key.
- In-transit data encryption
  - For EMR traffic between cluster nodes and WAL with EMR
  - For distributed applications
  - For EMRFS traffic between S3 and cluster nodes
- Resources
EMR Studio
- Web IDE for fully managed Jupyter notebooks running on EMR clusters
EMR Notebooks
EMR - Open Source Ecosystem
Delta Lake
EMR - Delta Lake
- Storage layer framework for lakehouse architectures
Ganglia
EMR - Ganglia
- Performance monitoring tool for Hadoop and HBase clusters
- Not included in EMR after 6.15.0
Apache HBase
EMR - Apache HBase
- Wide-column store NoSQL running on HDFS
- EMR WAL support: able to restore an existing WAL, retained for 30 days starting from the time when the cluster was created, with a new cluster from the same S3 root directory.
Apache HCatalog
EMR - Apache HCatalog
- HCatalog is a tool that allows you to access Hive metastore tables within Pig, Spark SQL, and/or custom MapReduce applications.
Apache Hudi
EMR - Apache Hudi
- Open Table Format
- Brings database and data warehouse capabilities to the data lake
Hue
EMR - Hue
- Web GUI for Hadoop
Apache Iceberg
EMR - Apache Iceberg
- Open Table Format
Apache Livy
EMR - Apache Livy
- A REST interface for Apache Spark
Apache MXNet
EMR - Apache MXNet
- Retired
- A deep learning framework for building neural networks and other deep learning applications
Apache Oozie
EMR - Apache Oozie
- Workflow scheduler system to manage Hadoop jobs
- Apache Airflow is a modern alternative.
Apache Phoenix
EMR - Apache Phoenix
- OLTP and operational analytics in Hadoop for low-latency applications
Apache Pig
EMR - Apache Pig
- SQL-like commands written in Pig Latin, converted into Tez jobs based on a DAG or into MapReduce programs
Apache Sqoop
EMR - Apache Sqoop
- Retired
- A tool for transferring data between S3, Hadoop, HDFS, and RDBMS databases
Apache Tez
EMR - Apache Tez
- An application execution framework for a complex DAG of tasks, similar to Apache Spark
Glue
- Serverless Spark ETL with a Hive metastore
- Ingests data in batch while performing transformations
- Either provide a script to perform the ETL job, or Glue can generate a script automatically.
- The data source can be an AWS service, such as RDS, S3, DynamoDB, or Kinesis Data Streams, as well as a third-party JDBC-accessible database.
- The data target can be an AWS service, such as S3, RDS, and DocumentDB, as well as a third-party JDBC-accessible database.
- Jobs are billed per second
- Resources
  - AWS Glue Web API
  - AWS Whitepapers - AWS Glue Best Practices: Building a Performant and Cost Optimized Data Pipeline
  - AWS Prescriptive Guidance - Best practices for performance tuning AWS Glue for Apache Spark jobs
  - GitHub - aws-samples/aws-glue-samples
Glue - Orchestration - Triggers
- When fired, a trigger can start specified jobs and crawlers.
- A trigger fires on demand, based on a schedule, or based on a combination of events.
- Trigger types (see the sketch after this list)
  - Scheduled
  - Conditional
    - The trigger fires if the watched jobs or crawlers end with the specified statuses.
  - On-demand
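
A minimal sketch of a conditional trigger created with boto3; the `transform-job` and `load-job` names are hypothetical.

```python
"""Minimal sketch: a conditional Glue trigger that starts a downstream job
once an upstream job succeeds. Job names are hypothetical."""
import boto3

glue = boto3.client("glue")

glue.create_trigger(
    Name="start-load-after-transform",
    Type="CONDITIONAL",
    StartOnCreation=True,          # activate the trigger immediately
    Predicate={
        "Logical": "AND",
        "Conditions": [
            {
                "LogicalOperator": "EQUALS",
                "JobName": "transform-job",
                "State": "SUCCEEDED",
            }
        ],
    },
    Actions=[{"JobName": "load-job"}],
)
```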
Glue - Orchestration - Workflows
- Create and visualize complex ETL activities involving multiple crawlers, jobs, and triggers.
- Can be created from a Glue Blueprint, or manually.
Glue - Orchestration - Blueprints
- Glue blueprints provide a way to create and share Glue workflows. When there is a complex ETL process that could be used for similar use cases, rather than creating a Glue workflow for each use case, you can create a single blueprint.
Glue - Job
Glue - Job type
- Apache Spark ETL
  - Minimum 2 DPU; 10 DPU (default)
  - Job command: `glueetl`
- Apache Spark streaming ETL
  - Minimum 2 DPU (default)
  - Job command: `gluestreaming`
- Python shell
  - 1 DPU or 0.0625 DPU (default)
  - Job command: `pythonshell`
  - Mainly intended for ad-hoc tasks such as data retrieval
  - Much faster startup times than Spark jobs
  - Compared to AWS Lambda
    - Can run much longer
    - Can be equipped with more CPU and memory
    - Lower per-unit-time cost
    - Higher minimum billing time (at least 1 minute)
    - Seamlessly integrates with Glue Workflow
- Ray
  - Job command: `glueray`
Glue - Job - Job Bookmark
- Glue tracks data that has already been processed during a previous run of an ETL job by persisting state information from the job run. This persisted state information is called a job bookmark.
- Options (see the sketch after this list)
  - Enabled
    - The job updates the state after a job run. The job keeps track of processed data, and when a job runs, it processes new data since the last checkpoint.
  - Disabled
    - Job bookmarks are not used, and the job always processes the entire dataset. This is the default setting.
  - Pause
    - The job bookmark state is not updated when this option is specified. The job processes incremental data since the last successful job run. You are responsible for managing the output from previous job runs.
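
Bookmarks are controlled per run through the `--job-bookmark-option` special argument; a minimal boto3 sketch with a hypothetical job name:

```python
"""Minimal sketch: start a Glue job run with bookmarks enabled so only new
data since the last checkpoint is processed. The job name is hypothetical."""
import boto3

glue = boto3.client("glue")

run = glue.start_job_run(
    JobName="daily-ingest-job",
    Arguments={
        # Other values: "job-bookmark-disable", "job-bookmark-pause"
        "--job-bookmark-option": "job-bookmark-enable",
    },
)
print("JobRunId:", run["JobRunId"])
```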
Glue - Job run insights (Monitoring)
- Job debugging and optimization
- Insights are available using 2 new log streams in CloudWatch Logs
Glue - Job - Worker
- A single DPU is also called a worker.
- A DPU is a relative measure of processing power that consists of 4 vCPU of compute capacity and 16 GB of memory.
- An M-DPU is a DPU with 4 vCPU and 32 GB of memory.
- Auto Scaling
  - Available for Glue jobs with G.1X, G.2X, G.4X, G.8X, or G.025X (only for Streaming jobs) worker types. Standard DPUs are not supported.
  - Requires Glue 3.0+.
- Worker type
  - For Glue 2.0+, specify Worker type and Number of workers
  - Standard
    - 1 DPU - 4 vCPU and 16 GB of memory
    - 50 GB of attached storage
  - G.1X (default)
    - 1 DPU - 4 vCPU and 16 GB of memory
    - 64 GB of attached storage
    - Recommended for most workloads with cost-effective performance
    - Recommended for jobs authored in Glue 2.0+
  - G.2X
    - 2 DPU - 8 vCPU and 32 GB of memory
    - 128 GB of attached storage
    - Recommended for most workloads with cost-effective performance
    - Recommended for jobs authored in Glue 2.0+
  - G.4X
    - 4 DPU - 16 vCPU and 64 GB of memory
    - Glue 3.0+ Spark jobs
  - G.8X
    - 8 DPU - 32 vCPU and 128 GB of memory
    - Recommended for workloads with the most demanding transformations, aggregations, and joins
    - Glue 3.0+ Spark jobs
  - G.025X
    - 0.25 DPU
    - Recommended for low-volume streaming jobs
    - Glue 3.0+ Spark Streaming jobs only
- Execution class
  - Standard
    - Ideal for time-sensitive workloads that require fast job startup and dedicated resources
  - Flexible
    - Appropriate for time-insensitive jobs whose start and completion times may vary
    - Only jobs with Glue 3.0+ and command type `glueetl` are allowed to set ExecutionClass to FLEX.
    - The flexible execution class is available for Spark jobs.
Glue - Programming
- Comparing available AWS Glue ETL programming languages
- Developing and testing AWS Glue job scripts locally - AWS Glue
- PySpark REPL

      docker run -it --rm \
        -v ~/.aws:/home/hadoop/.aws \
        -v /tmp/glue/workspace:/home/hadoop/workspace/ \
        -e AWS_PROFILE=$AWS_PROFILE \
        --name glue5_pyspark \
        public.ecr.aws/glue/aws-glue-libs:5 \
        pyspark

- Spark

      docker run -it --rm \
        -v ~/.aws:/home/hadoop/.aws \
        -v /tmp/glue/workspace:/home/hadoop/workspace/ \
        -e AWS_PROFILE=$AWS_PROFILE \
        --name glue5_spark_submit \
        public.ecr.aws/glue/aws-glue-libs:5 \
        spark-submit $SCRIPT_FILE
Glue - Programming - Python
Glue - Programming - Scala
Glue - Studio
- GUI for Glue jobs
- Visual ETL
  - Visually compose ETL workflows
- Notebook
- Script editor
  - Create and edit Python or Scala scripts
  - Automatically generate ETL scripts
Glue - Data Catalog
AWS Docs - Glue Data Catalog
- Functionally similar to a schema registry
- API compatible with the Hive Metastore
  - lakeFS - Metadata Management: Hive Metastore vs AWS Glue
- Databases & Tables
  - Databases and tables are objects in the Data Catalog that contain metadata definitions.
  - The schema of your data is represented in your Glue table definition. The actual data remains in its original data store.
Glue - Data Catalog - crawler
AWS Docs - Glue Data Catalog - crawler
- A crawler can crawl multiple data stores in a single run. Upon completion, the crawler creates or updates one or more tables in your Data Catalog. ETL jobs that you define in AWS Glue use these Data Catalog tables as sources and targets.
- A classifier recognizes the format of your data and generates a schema. It returns a certainty number between 0.0 and 1.0, where 1.0 means 100% certainty. Glue uses the output of the classifier that has the highest certainty.
- If no classifier returns a certainty greater than 0.0, Glue returns the default classification string of UNKNOWN.
- Once attached to a crawler, a custom classifier is executed before the built-in classifiers.
- Steps (see the sketch after this list)
  - A crawler runs any custom classifiers that you choose to infer the format and schema of your data.
  - The crawler connects to the data store.
  - The inferred schema is created for your data.
  - The crawler writes metadata to the Glue Data Catalog.
- Partition threshold
  - If the following conditions are met, the schemas are denoted as partitions of a table:
    - The partition threshold is higher than 0.7 (70%).
    - The maximum number of different schemas doesn't exceed 5.
  - Learn how AWS Glue crawler detects the schema | AWS re:Post
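
A minimal boto3 sketch that creates and starts a crawler over an S3 prefix; the bucket, IAM role, and database names are hypothetical.

```python
"""Minimal sketch: register a crawler over an S3 prefix so its inferred
schema lands in the Data Catalog. Names and ARNs are hypothetical."""
import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="raw-events-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="raw_db",
    Targets={"S3Targets": [{"Path": "s3://my-data-lake/raw/events/"}]},
    # Run daily at 02:00 UTC; omit Schedule to run on demand only.
    Schedule="cron(0 2 * * ? *)",
)
glue.start_crawler(Name="raw-events-crawler")
```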
Glue - Data Catalog - security
- You can only use a Glue resource policy to manage permissions for Data Catalog resources.
- You can't attach it to any other Glue resources such as jobs, triggers, development endpoints, crawlers, or classifiers.
- Only one resource policy is allowed per catalog, and its size is limited to 10 KB.
- Each AWS account has one single catalog per Region whose catalog ID is the same as the AWS account ID.
- You cannot delete or modify a catalog.
Glue - Data Quality
AWS Docs - Glue Data Quality
- Based on the DeeQu framework, using the Data Quality Definition Language (DQDL) DSL to define data quality rules
- Entry points
  - Data quality for the Data Catalog
  - Data quality for ETL jobs
Glue - Streaming
AWS Docs - Glue Streaming
- Uses the Apache Spark Structured Streaming framework
- Near-real-time data processing
Glue DataBrew
AWS Docs - Glue DataBrew
- Visual data preparation tool that enables users to clean and normalize data without writing any code
- Job types
  - Recipe job
    - Data transformation
  - Profile job
    - Analyzes a dataset to create a comprehensive profile of the data
Glue DataBrew - Recipes
- AWS Docs - Recipe step and function reference
- Categories
  - Basic column recipe steps
    - Filter
    - Column
  - Data cleaning recipe steps
    - Format
    - Clean
    - Extract
  - Data quality recipe steps
    - Missing
    - Invalid
    - Duplicates
    - Outliers
  - Personally identifiable information (PII) recipe steps
    - Mask personal information
    - Replace personal information
    - Encrypt personal information
    - Shuffle rows
  - Column structure recipe steps
    - Split
    - Merge
    - Create
  - Column formatting recipe steps
    - Decimal precision
    - Thousands separator
    - Abbreviate numbers
  - Data structure recipe steps
    - Nest-Unnest
    - Pivot
    - Group
    - Join
    - Union
  - Data science recipe steps
    - Text
    - Scale
    - Mapping
    - Encode
  - Functions
    - Mathematical functions
    - Aggregate functions
    - Text functions
    - Date and time functions
    - Window functions
    - Web functions
    - Other functions
Kinesis Data Streams
AWS Docs - Kinesis Data Streams
- Key points
  - Equivalent to Kafka, for event streaming, no ETL
  - On-demand or provisioned mode
  - PaaS with API access for developers
  - The maximum size of the data payload of a record before base64-encoding is up to 1 MB.
  - Retention period is 1 day by default, and can be up to 365 days.
- Shards
  - A stream is composed of one or more shards.
  - A shard is a uniquely identified sequence of data records in a stream.
  - All the data in a shard is sent to the same worker that is processing the shard.
  - A shard iterator is a pointer to a position in the shard from which to start reading data records sequentially. A shard iterator specifies this position using the sequence number of a data record in a shard.
  - `number_of_shards = max(incoming_write_bandwidth_in_KiB/1024, outgoing_read_bandwidth_in_KiB/2048)` (worked example after this list)
  - Read throughput
    - Up to 2 MB/second/shard, or 5 transactions (API calls)/second/shard, or 2000 records/second/shard, shared across all the consumers that are reading from a given shard
    - Each call to `GetRecords` is counted as 1 read transaction.
    - `GetRecords` can retrieve up to 10 MB/transaction/shard, and up to 10000 records per call. If a call to `GetRecords` returns 10 MB, subsequent calls made within the next 5 seconds throw an exception.
    - Up to 20 registered consumers (Enhanced Fan-out limit) for each data stream
  - Write throughput
    - Up to 1 MB/second/shard or 1000 records/second/shard
    - Record aggregation allows customers to combine multiple records into a single record. This allows customers to improve their per-shard throughput.
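
A worked example of the shard-count formula above, with made-up traffic numbers:

```python
"""Worked example of the shard-count formula, assuming 4 MiB/s of writes and
6 MiB/s of reads by a single (shared-throughput) consumer."""
import math

incoming_write_bandwidth_in_kib = 4 * 1024   # 4 MiB/s written by producers
outgoing_read_bandwidth_in_kib = 6 * 1024    # 6 MiB/s read by consumers

number_of_shards = math.ceil(
    max(
        incoming_write_bandwidth_in_kib / 1024,   # 1 MiB/s write limit per shard
        outgoing_read_bandwidth_in_kib / 2048,    # 2 MiB/s read limit per shard
    )
)
print(number_of_shards)  # -> 4 shards (write-bound in this example)
```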
- Data stream capacity mode
  - On-demand mode
    - Automatically manages the shards
  - Provisioned mode
    - Must specify the number of shards for the data stream upfront, but you can also change the number of shards later by resharding the stream.
- Scaling
  - Resharding
    - Adjust the number of shards in your stream
    - Shard split
    - Shard merge
    - `UpdateShardCount` API
- Partition key
  - All data records with the same partition key map to the same shard. (See the sketch after this list.)
  - The number of partition keys should typically be much greater than the number of shards, namely a many-to-one relationship.
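
A minimal producer-side sketch showing how the partition key pins related records to one shard; the stream name and event shape are hypothetical.

```python
"""Minimal sketch: write a record with an explicit partition key so that all
events for the same user land on the same shard. Stream name is hypothetical."""
import json
import boto3

kinesis = boto3.client("kinesis")

event = {"user_id": "u-42", "action": "click", "ts": 1700000000}

kinesis.put_record(
    StreamName="clickstream",
    Data=json.dumps(event).encode("utf-8"),
    PartitionKey=event["user_id"],   # same key -> same shard, so per-user ordering holds
)
```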
- Idempotency
  - 2 primary causes of duplicate records
    - Producer retries
    - Consumer retries
  - Consumer retries are more common than producer retries.
- Ordering
  - Kinesis Data Streams does not guarantee the order of records across shards in a stream.
  - Kinesis Data Streams does guarantee the order of records within a shard.
Kinesis Data Streams - Producers
Kinesis Data Streams - Producers - KPL
AWS Docs - KPL
- Batching
  - Aggregation
    - Storing multiple payloads within a single Kinesis Data Streams record
  - Collection
    - Consolidates multiple Kinesis Data Streams records into a single HTTP request (`PutRecords`) to reduce the number of HTTP requests.
- Rate limiting
  - Limits per-shard throughput sent from a single producer
  - Implemented using a token bucket algorithm with separate buckets for both Kinesis Data Streams records and bytes
- Pros
  - Higher throughput due to aggregation and compression of records
  - Abstraction over the API to simplify coding compared with using the low-level API directly
- Cons
  - Higher latency due to an additional processing delay of up to the configurable `RecordMaxBufferedTime`
  - Only supports AWS SDK v1
Kinesis Data Streams - Producers - Kinesis Agent
AWS Docs - Kinesis Agent
- Stand-alone Java application running as a daemon
- Continuously monitors a set of files and sends new data to your stream.
Kinesis Data Streams - Consumers
- Differences between shared throughput consumers and enhanced fan-out consumers
- Types
  - Shared throughput consumers without enhanced fan-out
  - Enhanced fan-out consumers
Kinesis Data Streams - Consumers - KCL
- Tasks
  - Connects to the data stream
  - Enumerates the shards within the data stream
  - Uses leases to coordinate shard associations with its workers
  - Instantiates a record processor for every shard it manages
  - Pulls data records from the data stream
  - Pushes the records to the corresponding record processor
  - Checkpoints processed records
  - Balances shard-worker associations (leases) when the worker instance count changes or when the data stream is resharded (shards are split or merged)
- Versions
  - KCL 1.x is based on AWS SDK v1, and KCL 2.x is based on AWS SDK v2
  - KCL 1.x
    - Java
    - Node.js
    - .NET
    - Python
    - Ruby
  - KCL 2.x
    - Java
    - Python
- Multistream processing is only supported in KCL 2.x for Java (KCL 2.3+ for Java)
- You should ensure that the number of KCL instances does not exceed the number of shards (except for failure standby purposes)
- One KCL worker is deployed on one EC2 instance, able to process multiple shards.
- A KCL worker runs as a process and it has multiple record processors running as threads within it.
- One shard has exactly one corresponding record processor.
- KCL 2.x enables you to create KCL consumer applications that can process more than one data stream at the same time.
- Checkpoints the sequence number for the shard in the lease table to track processed records
- Enhanced fan-out
  - Consumers using enhanced fan-out have dedicated throughput of up to 2 MB/second/shard; without enhanced fan-out, all consumers reading from the same shard share the throughput of 2 MB/second/shard.
  - Requires KCL 2.x and a Kinesis Data Stream with enhanced fan-out enabled
  - You can register up to 20 consumers per stream to use enhanced fan-out. (See the sketch after this list.)
  - Kinesis Data Streams pushes data records from the stream to consumers using enhanced fan-out over HTTP/2, therefore no polling and lower latency.
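
A minimal sketch of registering an enhanced fan-out consumer with boto3; the stream ARN and consumer name are hypothetical.

```python
"""Minimal sketch: register an enhanced fan-out consumer for a stream.
A KCL 2.x application (or SubscribeToShard) can then read from the
registered consumer over HTTP/2. The ARN and name are hypothetical."""
import boto3

kinesis = boto3.client("kinesis")

response = kinesis.register_stream_consumer(
    StreamARN="arn:aws:kinesis:us-east-1:123456789012:stream/clickstream",
    ConsumerName="analytics-app",
)
# Pass this consumer ARN to SubscribeToShard / the KCL configuration.
print(response["Consumer"]["ConsumerARN"])
```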
- Lease table
  - At any given time, each shard of data records is bound to a particular KCL worker by a lease identified by the `leaseKey` variable.
  - By default, a worker can hold one or more leases (subject to the value of the `maxLeasesForWorker` variable) at the same time.
  - A DynamoDB table keeps track of the shards in a Kinesis Data Stream that are being leased and processed by the workers of the KCL consumer application.
  - Each row in the lease table represents a shard that is being processed by the workers of your consumer application.
  - KCL uses the name of the consumer application to create the name of the lease table that this consumer application uses, therefore each consumer application name must be unique.
  - One lease table for one KCL consumer application
  - KCL creates the lease table with a provisioned throughput of 10 reads/second and 10 writes/second
- Developing Consumers Using Amazon Data Firehose
- Developing Consumers Using Amazon Managed Service for Apache Flink
- Troubleshooting
  - Consumer read API calls throttled (`ReadProvisionedThroughputExceeded`)
    - Occurs when the `GetRecords` quota is exceeded; subsequent `GetRecords` calls are throttled.
    - Quota
      - 5 `GetRecords` API calls / second / shard
      - 2 MB / second / shard
      - Up to 10 MiB of data / API call / shard
      - Up to 10000 records / API call
  - Slow record processing
Kinesis Data Firehose
AWS Docs - Kinesis Data Firehose
- Serverless, fully managed, with automatic scaling
- Pre-built connectors for various sources and destinations, without coding, similar to Kafka Connect
- For data latency of 60 seconds or higher
- Troubleshooting
Kinesis Data Firehose - Sources & Destinations
- Sources
  - Kinesis Data Streams
  - MSK
    - Can only use S3 as the destination
  - Direct PUT
    - Use the `PutRecord` and `PutRecordBatch` APIs to send data (see the sketch after this list)
    - If the source is Direct PUT, Firehose will retain data for 24 hours.
- Destinations (AWS Docs - Data Firehose - Source, Destination, and Name)
  - S3
  - Redshift
  - OpenSearch
  - Custom HTTP endpoint
  - 3rd-party service
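
A minimal Direct PUT sketch using `PutRecordBatch` via boto3; the delivery stream name and record payloads are hypothetical.

```python
"""Minimal sketch of the Direct PUT source: batch records into a Firehose
delivery stream with PutRecordBatch. The stream name is hypothetical."""
import json
import boto3

firehose = boto3.client("firehose")

events = [{"sensor": i, "reading": 20.5 + i} for i in range(3)]

response = firehose.put_record_batch(
    DeliveryStreamName="sensor-delivery-stream",
    Records=[{"Data": (json.dumps(e) + "\n").encode("utf-8")} for e in events],
)
# PutRecordBatch is not all-or-nothing: retry only the failed records.
print("Failed records:", response["FailedPutCount"])
```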
Kinesis Data Firehose - Buffering hints
AWS Docs - Buffering hints
- Buffering size in MBs (destination specific)
  - 1 MB to 128 MB
  - Default is 5 MB
- Buffering interval in seconds (destination specific)
  - 60 to 900 seconds
  - Default is 60 / 300 seconds
Kinesis Data Firehose - Data Transformation
- Data transformation is not natively supported; you need to use Lambda.
- Lambda
  - Custom transformation logic
  - Lambda invocation time of up to 5 min
  - Error handling
    - Retries the invocation 3 times by default.
    - After retries, if the invocation is unsuccessful, Firehose then skips that batch of records. The skipped records are treated as unsuccessfully processed records.
    - The unsuccessfully processed records are delivered to your S3 bucket.
- Data format conversion
  - Firehose can convert JSON records using a schema from a table defined in Glue.
  - Non-JSON data must be converted by invoking a Lambda function.
Kinesis Data Firehose - Dynamic Partitioning
- Dynamic partitioning enables you to continuously partition streaming data in Firehose by using keys within the data (for example, customer_id or transaction_id) and then deliver the data grouped by these keys into corresponding S3 prefixes.
- Partitioning your data minimizes the amount of data scanned, optimizes performance, and reduces the cost of your analytics queries on S3.
- Methods of creating partitioning keys
  - Inline parsing
    - Map each parameter to a valid jq expression; suitable for the JSON format
  - Lambda function
    - For compressed or encrypted data records, or data that is in any file format other than JSON
- Resources
Kinesis Data Firehose - Server-side Encryption
Server-side encryption
- Can be enabled depending on the source of data
  - Kinesis Data Streams as the source
    - Kinesis Data Streams first decrypts the data and then sends it to Firehose. Firehose buffers the data in memory and then delivers it without storing the unencrypted data at rest.
  - With Direct PUT or other data sources
    - Can be turned on by using the `StartDeliveryStreamEncryption` operation.
    - To stop server-side encryption, use the `StopDeliveryStreamEncryption` operation.
Kinesis Data Firehose - Custom S3 Prefixes
Custom Prefixes for Amazon S3 Objects
- Ability to specify a custom S3 prefix to be compatible with Hive naming conventions, for data to be easily cataloged by Glue
- Resources
Kinesis Data Analytics
- Input
  - Kinesis Data Streams
  - Kinesis Data Firehose
- Output
  - Kinesis Data Streams
  - Kinesis Data Firehose
  - Lambda
Kinesis Data Analytics - Managed Apache Flink
Managed Service for Apache Flink
- Equivalent to Kafka Streams, using Apache Flink behind the scenes; online stream processing, namely streaming ETL
- Managed Service for Apache Flink is for building streaming applications with APIs in Java, Scala, Python, and SQL.
- Managed Service for Apache Flink Studio is for ad-hoc interactive data exploration.
- Flink API
  - Flink offers 4 levels of API abstraction: Flink SQL, Table API, DataStream API, and Process Function, which is used in conjunction with the DataStream API.
  - SQL can be embedded in any Flink application, regardless of the programming language chosen.
  - If you are planning to use the DataStream API, not all connectors are supported in Python.
  - If you need low latency/high throughput, you should consider Java/Scala regardless of the API.
  - If you plan to use Async IO in the Process Functions API, you will need to use Java.
Kinesis Data Analytics Studio
- Based on
  - Apache Zeppelin
  - Apache Flink
Lake Formation
- Simplify security management and governance at scale
- Centrally manage permissions across your users
- Comprehensive data access and audit logging
- Provides an RDBMS permissions model to grant or revoke access to databases, tables, and columns in the Data Catalog, with the underlying data in S3.
- Resources
Lake Formation - Ingestion
- Blueprints are predefined ETL workflows that you can use to ingest data into your data lake.
- Workflows are instances of ingestion blueprints in Lake Formation.
- Blueprint types
  - Database snapshot
    - Loads or reloads data from all tables into the data lake from a JDBC source.
    - You can exclude some data from the source based on an exclude pattern.
    - When to use:
      - Schema evolution is flexible. (Columns are re-named, previous columns are deleted, and new columns are added in their place.)
      - Complete consistency is needed between the source and the destination.
  - Incremental database
    - Loads only new data into the data lake from a JDBC source, based on previously set bookmarks.
    - You specify the individual tables in the JDBC source database to include.
    - When to use:
      - Schema evolution is incremental. (There is only successive addition of columns.)
      - Only new rows are added; previous rows are not updated.
  - Log file
    - Bulk loads data from log file sources
Lake Formation - Permission management
- The security policies in Lake Formation use two layers of permissions; each resource is protected by
  - Lake Formation permissions (which control access to Data Catalog resources and S3 locations)
  - IAM permissions (which control access to Lake Formation and AWS Glue API resources)
- Permission types
  - Metadata access
    - Data Catalog permissions
  - Underlying data access
    - Data access permissions
      - Data access permissions (SELECT, INSERT, and DELETE) on Data Catalog tables that point to that location
    - Data location permissions
      - The ability to create Data Catalog resources that point to particular S3 locations
      - When you grant the CREATE_TABLE or ALTER permission to a principal, you also grant data location permissions to limit the locations for which the principal can create or alter metadata tables.
- Permission model
  - The IAM permissions model consists of IAM policies.
  - The Lake Formation permissions model is implemented as DBMS-style GRANT/REVOKE commands.
  - Principals
    - IAM users and roles
    - SAML users and groups
    - IAM Identity Center
    - External accounts
  - Resources
    - LF-Tags
      - Using LF-Tags can greatly simplify the number of grants compared with using named resource policies.
    - Named Data Catalog resources
  - Permissions
    - Database permissions
    - Table permissions
- Data filtering
  - Column-level security
    - Allows users to view only the specific columns and nested columns that they have access to in the table (see the sketch after this list)
  - Row-level security
    - Allows users to view only the specific rows of data that they have access to in the table
  - Cell-level security
    - By using both row filtering and column filtering at the same time
    - Can restrict access to different columns depending on the row
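
A minimal sketch of granting column-level access with the Lake Formation `GrantPermissions` API via boto3; the account, role, database, table, and column names are hypothetical.

```python
"""Minimal sketch: grant column-level SELECT on a Data Catalog table to a
role via Lake Formation. All names and the account ID are hypothetical."""
import boto3

lf = boto3.client("lakeformation")

lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/AnalystRole"},
    Resource={
        "TableWithColumns": {
            "DatabaseName": "sales_db",
            "Name": "orders",
            "ColumnNames": ["order_id", "order_date", "total"],  # PII columns omitted
        }
    },
    Permissions=["SELECT"],
)
```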
QuickSight
AWS Docs - QuickSight
- Key points
  - Ad-hoc, interactive query, analysis, and visualization
  - For formatted canned reports, use Paginated Reports
- Editions
  - Standard edition
  - Enterprise edition
    - VPC connection
      - Access to data sources in your VPC without the need for a public IP address
    - Microsoft Active Directory SSO
    - Data stored in SPICE is encrypted at rest.
    - For SQL-based data sources, such as Redshift, Athena, PostgreSQL, or Snowflake, you can refresh your data incrementally within a look-back window of time.
- Datasets
  - Dataset refresh can be scheduled or on-demand.
  - SPICE
    - In-memory
  - Direct SQL query
- Visual types
  - Combo chart
    - Aka line and column (bar) charts
  - Scatter plot
    - Used to observe relationships between variables
  - Heat map
    - Use heat maps to show a measure for the intersection of two dimensions, with color-coding to easily differentiate where values fall in the range.
    - Use a heat map if you want to identify trends and outliers, because the use of color makes these easier to spot.
  - Pivot table
    - Use a pivot table if you want to analyze data on the visual.
  - AutoGraph
    - You create a visual by choosing AutoGraph and then selecting fields.
Migration and Transfer
Application Discovery Service
AWS Application Discovery Service
- Collects usage and configuration data about your on-premises servers and databases.
- All discovered data is stored in your AWS Migration Hub home Region.
- 2 ways of performing discovery and collecting data
  - Agentless discovery
    - Identifies VMs and hosts associated with VMware vCenter
  - Agent-based discovery
    - The agent runs in your local environment and requires root privileges.
Application Migration Service
- Automated lift-and-shift (rehost) migration of VMs to AWS
- Migrate your applications to AWS from physical servers, VMware vSphere, Microsoft Hyper-V, and other cloud providers.
- Migrate EC2 instances between AWS Regions or between AWS accounts, and migrate from EC2-Classic to a VPC.
DMS
AWS Docs - DMS (Database Migration Service)
- Migrate RDBMS, data warehouses, NoSQL databases, and other types of data stores
- One-time migration or continuous replication
- Used for CDC (Change Data Capture) for real-time ongoing replication
- DMS uses a replication instance to connect to your source data store, read the source data, and format the data for consumption by the target data store.
- DMS Serverless eliminates replication instance management tasks.
DMS - Fleet Advisor
- Automatically inventories and assesses your on-premises database and analytics server fleet and identifies potential migration paths
DMS - Data Transformations
DataSync
AWS Docs - DataSync
- One-off online data transfer; transferring hot or cold data should not impede your business.
- Move data between on-premises and AWS
- Move data between AWS services
- Requires DataSync agent installation on-premises
- Moves files or objects, not databases
- On-premises
  - Network File System (NFS)
  - Server Message Block (SMB)
  - Hadoop Distributed File System (HDFS)
  - Object storage
Snow Family
- AWS Snow Family service models - Feature comparison matrix
- AWS Docs - AWS Snowball Edge Device Specifications
Snowball Edge Storage Optimized
- For large-scale data migrations and recurring transfer workflows, as well as local computing with higher capacity needs
Snowball Edge Compute Optimized
- For use cases such as machine learning, full motion video analysis, analytics, and local computing stacks
Snowmobile
- Intended for more than 10 PB data migrations from a single location
- Up to 100 PB per Snowmobile
- Shipping container
Transfer Family
- Transfer flat files using the following protocols:
  - Secure File Transfer Protocol (SFTP): version 3
  - File Transfer Protocol Secure (FTPS)
  - File Transfer Protocol (FTP)
  - Applicability Statement 2 (AS2)
  - Browser-based transfers
Application Integration
AppFlow
EventBridge
- Rules
  - Routing rules for events
- API Destinations
  - API destinations are HTTP endpoints that you can invoke as the target of a rule, similar to how you invoke an AWS service or resource as a target.
MWAA
- Auto scaling
- Automatic Airflow setup based on the Airflow version
- Streamlined upgrades and patches
- Workflow monitoring in CloudWatch
SNS
Step Functions
- Able to run EMR workloads on a schedule
- General-purpose serverless workflow orchestration service
- Resources
  - AWS Docs - AWS Step Functions - Manage an Amazon EMR Job
  - AWS Big Data Blog - Automating EMR workloads using AWS Step Functions
  - AWS Big Data Blog - Orchestrate Amazon EMR Serverless jobs with AWS Step Functions
  - InfoQ - Building Workflows with AWS Step Functions
  - LocalStack - Step Functions
Step Functions - State Machine
- Types
  - Standard workflows
    - Standard workflows are ideal for long-running, durable, and auditable workflows.
    - Standard workflows can support an execution start rate of over 2000 executions / second.
    - They can run for up to a year and you can retrieve the full execution history using the Step Functions API.
    - Standard workflows employ an exactly-once execution model, where your tasks and states are never started more than once unless you have specified Retry behavior in your state machine.
    - Suited to orchestrating non-idempotent actions, such as starting an EMR cluster or processing payments.
    - Standard workflow executions are billed according to the number of state transitions processed.
    - Using Callbacks or the `.sync` service integration will most likely reduce the number of state transitions and cost.
  - Express workflows
    - The Express type is used for high-volume, event-processing workloads and can run for up to 5 minutes.
    - Synchronous Express workflows
      - Start a workflow, wait until it completes, and then return the result
    - Asynchronous Express workflows
      - Return confirmation that the workflow was started, but don't wait for the workflow to complete
Step Functions - States
AWS Docs - Step Functions - States
- Defined with the Amazon States Language in the state machine definition
- Types
  - Task
  - Choice
  - Map
    - Use the Map state to run a set of workflow steps for each item in a dataset. The Map state's iterations run in parallel, which makes it possible to process a dataset quickly. (See the sketch after this list.)
    - Map state processing modes
      - Inline mode
        - Limited concurrency; each iteration of the Map state runs in the context of the workflow that contains the Map state.
        - The Map state accepts only a JSON array as input. This mode supports up to 40 concurrent iterations.
      - Distributed mode
        - High concurrency; each iteration of the Map state runs in its own execution context.
        - When you run a Map state in Distributed mode, Step Functions creates a Map Run resource.
        - A tolerated failure threshold can be set for a Map Run, and the Map Run fails automatically if it exceeds the threshold.
        - Use cases
          - The size of your dataset exceeds 256 KB.
          - The workflow's execution event history exceeds 25000 entries.
          - You need a concurrency of more than 40 parallel iterations.
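
A minimal sketch of an Inline-mode Map state, expressed as an Amazon States Language definition and registered with boto3; the Lambda and IAM role ARNs are hypothetical.

```python
"""Minimal sketch: an Inline-mode Map state that iterates over the `items`
array and invokes a Lambda function per element (up to 40 concurrent
iterations in this mode). ARNs are hypothetical."""
import json
import boto3

definition = {
    "StartAt": "ProcessItems",
    "States": {
        "ProcessItems": {
            "Type": "Map",
            "ItemsPath": "$.items",
            "MaxConcurrency": 10,
            "Iterator": {
                "StartAt": "ProcessOne",
                "States": {
                    "ProcessOne": {
                        "Type": "Task",
                        "Resource": "arn:aws:lambda:us-east-1:123456789012:function:process-item",
                        "End": True,
                    }
                },
            },
            "End": True,
        }
    },
}

sfn = boto3.client("stepfunctions")
sfn.create_state_machine(
    name="inline-map-demo",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/StepFunctionsExecutionRole",
    type="STANDARD",
)
```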
SQS
Kinesis Data Streams vs SQS
| | MSK | Kinesis Data Streams | SQS |
|---|---|---|---|
| Consumption | | Can be consumed many times | Records are deleted after being consumed |
| Retention | | Retention period configurable from 24 hours to 1 year | Retention period configurable from 1 min to 14 days |
| Ordering | | Ordering of records is preserved at the same shard | FIFO queues preserve ordering of records |
| Scaling | | Manual resharding | Auto scaling |
| Delivery | | At least once | Exactly once |
| Replay | Can replay | Can replay | Cannot replay |
| Payload size | 1 MB | 1 MB | 256 KB |
- Resources
Cloud Financial Management
AWS Budgets
AWS Cost Explorer
Compute
AWS Batch
AWS Batch
- AWS Batch handles job execution and compute resource management, allowing you to focus on developing your applications rather than managing infrastructure.
- A batch job is deployed as a Docker container.
EC2
Lambda
AWS SAM
Containers
ECR
ECS
EKS
Database
DocumentDB
DynamoDB
RDS
Redshift
- PB scale, columnar storage
- MPP architecture
- Loading data
  - AWS Docs - Amazon Redshift best practices for loading data
  - `COPY` command
    - Can load data from multiple files in parallel.
    - Use the `COPY` command with `COMPUPDATE` set to `ON` to analyze and apply compression automatically.
    - Split your data into files so that the number of files is a multiple of the number of slices in your cluster.
  - AWS Big Data Blog - JOIN Amazon Redshift AND Amazon RDS PostgreSQL WITH dblink
    - The `dblink` function allows the entire query to be pushed to Redshift. This lets Redshift do what it does best: query large quantities of data efficiently and return the results to PostgreSQL for further processing.
Redshift - Cluster
- Every cluster can have up to 32 nodes.
- The leader node and compute nodes have the same specs, and the leader node is chosen automatically.
- Leader node (clusters with 2+ nodes)
  - Parses the query, builds an optimal execution plan, and compiles code from the execution plan
  - Receives and aggregates the results from the compute nodes
- Compute node
  - Node slices
    - A compute node is partitioned into slices. Each slice is allocated a portion of the node's memory and disk space, where it processes a portion of the workload assigned to the node.
- Enhanced VPC routing
  - Routes network traffic between your cluster and data repositories through a VPC, instead of through the internet.
Redshift - Cluster - Elastic resize
- Preferred method
- Resizes in an incremental manner
- Add or remove nodes (same node type) to a cluster in minutes, aka in-place resize
- Change the node type of an existing cluster
  - A snapshot is created, and a new cluster is created from the snapshot with the new node type.
- Resources
Redshift - Cluster - Classic resize
- When you need to change the cluster size or the node type of an existing cluster and elastic resize is not supported
- Takes much longer than elastic resize
Redshift - Cluster - Node type
- RA3

| Node size | vCPU | RAM (GiB) | Managed storage quota per node | Number of nodes per cluster |
|---|---|---|---|---|
| ra3.xlplus | 4 | 32 | 32 TB | 1 to 32 |
| ra3.4xlarge | 12 | 96 | 128 TB | 2 to 32 |
| ra3.16xlarge | 48 | 384 | 128 TB | 2 to 128 |

  - Pay for the managed storage and compute separately
  - Managed storage uses SSD for local cache and S3 for persistence, and automatically scales based on the workload
  - Compute is separated from managed storage and can be scaled independently
  - RA3 nodes are optimized for performance and cost-effectiveness
- DC2 (Dense Compute)

| Node size | vCPU | RAM (GiB) | Managed storage quota per node | Number of nodes per cluster |
|---|---|---|---|---|
| dc2.large | 2 | 15 | 160 GB | 1 to 32 |
| dc2.8xlarge | 32 | 244 | 2.6 TB | 2 to 128 |

  - Compute-intensive
  - Local SSD storage for high performance and low latency
  - Recommended for datasets under 10 TB
- DS2 (Dense Storage)
  - Deprecated legacy option, for low-cost workloads with HDD storage; RA3 is recommended for new workloads
- Resources
Redshift - Cluster - HA
- AZ
  - Single-AZ by default
  - Multi-AZ is supported
Redshift - Distribution style
Distribution style
- Redshift distribution keys (DIST keys) determine where data is stored in Redshift. Clusters store data fundamentally across the compute nodes. Query performance suffers when a large amount of data is stored on a single node.
- Uneven distribution of data across compute nodes leads to skew in the work each node has to do, and you don't want an under-utilised compute node, so the distribution of the data should be uniform.
- Data should be distributed in such a way that the rows that participate in joins are already on the same node as their joining rows in other tables. This is called collocation, which reduces data movement and improves query performance.
- Distribution is configurable per table, so you can select a different distribution style for each table.
- AUTO distribution
  - Redshift automatically distributes the data across the nodes in the cluster.
  - Redshift initially assigns ALL distribution to a small table, then changes to EVEN distribution when the table grows larger.
  - Suitable for small tables or tables with a small number of distinct values
- EVEN distribution
  - The leader node distributes the rows across the slices in a round-robin fashion, regardless of the values in any particular column.
  - Appropriate when a table doesn't participate in joins. It's also appropriate when there isn't a clear choice between KEY distribution and ALL distribution.
- KEY distribution
  - The rows are distributed according to the values in one column.
  - All the entries with the same value in the column end up on the same slice.
  - Suitable for large tables or tables with a large number of distinct values
- ALL distribution
  - A copy of the entire table is distributed to every node.
  - Suitable for small tables or tables that are not joined with other tables
Redshift - Concurrency scaling
- Concurrency scaling adds transient clusters to your cluster to handle concurrent requests with consistent, fast performance in a matter of seconds, mainly for bursty workloads.
Redshift - Sort keys
Sort keys
- When you create a table, you can define one or more of its columns as sort keys.
- Either a compound or interleaved sort key.
Redshift - Compression
AWS Docs - Working with column compression
- You can apply a compression type, or encoding, to the columns in a table manually when you create the table.
- The `COPY` command analyzes your data and applies compression encodings to an empty table automatically as part of the load operation.
- Split your load data files so that the files are about equal size, between 1 MB and 1 GB after compression. For optimum parallelism, the ideal size is between 1 MB and 125 MB after compression.
- The number of files should be a multiple of the number of slices in your cluster.
- When loading data, it is strongly recommended that you individually compress your load files using gzip, lzop, bzip2, or Zstandard when you have large datasets.
Redshift - Workload Management (WLM)
- AWS Docs - Amazon Redshift - Implementing workload management
- Automatic WLM
- Manual WLM
  - Query groups
    - Memory percentage and concurrency settings
  - Query monitoring rules
    - Query monitoring rules define the conditions under which a query is monitored and the action to take when a query matches those conditions.
  - Query queues
    - Memory percentage and concurrency settings
- Concurrency scaling
  - Concurrency scaling automatically adds and removes compute nodes to handle spikes in demand.
  - Concurrency scaling is enabled by default for all RA3 and DC2 clusters.
  - Concurrency scaling is not available for DS2 clusters.
- Short query acceleration
  - Short query acceleration (SQA) prioritizes selected short-running queries ahead of longer-running queries.
  - SQA runs short-running queries in a dedicated space, so that SQA queries aren't forced to wait in queues behind longer queries.
  - `CREATE TABLE AS` (CTAS) statements and read-only queries, such as `SELECT` statements, are eligible for SQA.
Redshift - User-Defined Functions (UDF)
- User-Defined Functions (UDFs) are functions that you define to extend the capabilities of the SQL language in Redshift.
- AWS Lambda UDFs
  - Accessing other AWS services
Redshift - Snapshots
- Automated
  - Enabled by default when you create a cluster.
  - By default, a snapshot is taken about every 8 hours or following every 5 GB per node of data changes, whichever comes first.
  - The default retention period is 1 day.
  - Cannot be deleted manually
- Manual
  - By default, manual snapshots are retained indefinitely, even after you delete your cluster.
  - AWS CLI: `create-cluster-snapshot`
  - AWS API: `CreateClusterSnapshot`
- Backup
  - You can configure Redshift to automatically copy snapshots (automated or manual) for a cluster to another Region.
  - Only one destination Region at a time can be configured for automatic snapshot copy.
  - To change the destination Region that you copy snapshots to, first disable the automatic copy feature. Then re-enable it, specifying the new destination Region.
- Snapshot copy grant
  - KMS keys are specific to a Region. If you enable copying of Redshift snapshots to another Region, and the source cluster and its snapshots are encrypted using a master key from KMS, you need to configure a grant for Redshift to use a master key in the destination Region.
  - The grant is created in the destination Region and allows Redshift to use the master key in the destination Region.
Redshift - SQL
- VACUUM
  - FULL
    - `VACUUM FULL` is the default.
    - By default, `VACUUM FULL` skips the sort phase for any table that is already at least 95% sorted.
  - SORT ONLY
    - Sorts the specified table (or all tables in the current database) without reclaiming space freed by deleted rows.
  - DELETE ONLY
    - Redshift automatically performs a `DELETE ONLY` vacuum in the background.
  - REINDEX
Redshift - Query
- Data API
  - Available in the AWS SDK, based on HTTP and JSON; no persistent connection needed (see the sketch after this list)
  - API calls are asynchronous
  - Uses either credentials stored in Secrets Manager or temporary database credentials; no need to pass passwords in the API calls
System tables and views
-
SVV
views
contain information about database objects with references to transientSTV
tables
. -
SYS
views
- These are system monitoring views used to monitor query and workload usage for provisioned clusters and serverless workgroups.
- These views are located in the
pg_catalog
schema.
-
STL
views
- Information retrieved from all Redshift log history across nodes
- Retain
7
days of log history
-
STV
tables
Virtual system tables
(in-memory) that containsnapshots of the current system state
-
SVCS
views
provide details about queries on both the main and concurrency scaling clusters. -
SVL
views
provide details about queries on main clusters.
-
-
Federated Query
- Query and analyze data across
operational databases
,data warehouses
, anddata lakes
. - AWS Docs - Considerations when accessing federated data with Amazon Redshift (opens in a new tab)
- Query and analyze data across
Redshift - Data Sharing
- With data sharing, you can securely and easily share live data across Redshift clusters without the need to copy or move the data.
- Lends itself to a multi-warehouse architecture, where you can scale each data warehouse for various types of workloads.
- Only RA3 and serverless clusters are supported.
- Live and transactionally consistent views of data across all consumers
- Secure and governed collaboration within and across organizations
- Sharing data with external parties to monetize your data
Redshift - Data Sharing - Datashare
Working with datashares
- You can share data at different levels
  - Databases
  - Schemas
  - Tables
  - Views (including regular, late-binding, and materialized views)
  - SQL user-defined functions (UDFs)
Redshift - Security
- IAM
  - Does not support resource-based policies.
- Access control
  - Cluster level
    - IAM policies
      - For Redshift API actions
    - Security groups
      - For network connectivity
  - Database level
    - Database user accounts
- Data protection
  - DB encryption using KMS
    - KMS key hierarchy
      - Root key
      - Cluster encryption key (CEK)
      - Database encryption key (DEK)
      - Data encryption keys
- Auditing
  - Logs
    - Connection log
      - Logs authentication attempts, connections, and disconnections.
    - User log
      - Logs information about changes to database user definitions.
    - User activity log
      - Logs each query before it's run on the database.
Redshift Spectrum
- In-place querying of data in S3, without having to load the data into Redshift tables
- EB scale
- Pay for the number of bytes scanned
- Federated query across operational databases, data warehouses, and data lakes
Redshift Spectrum - Comparison with Athena

| | Redshift Spectrum | Athena |
|---|---|---|
| Performance | Greater control over performance; uses dedicated resources of your Redshift cluster | Not configurable; uses shared resources managed by AWS |
| Query results | Stored in Redshift | Stored in S3 |
Redshift Serverless
- Redshift measures data warehouse capacity in Redshift Processing Units (RPUs). You pay for the workloads you run in RPU-hours on a per-second basis (with a 60-second minimum charge), including queries that access data in open file formats in S3.
Redshift Streaming Ingestion
- Input
  - Kinesis Data Streams
  - MSK
Keyspaces
- AWS Docs - Keyspaces
- Based on Apache Cassandra
Neptune
- AWS Docs - Neptune
- Serverless graph database
MemoryDB
- More expensive than ElastiCache
- Intended as a Redis-compatible, fully managed, in-memory database service instead of a caching service
MemoryDB - Valkey
- Open-source Redis-compatible alternative, forked from Redis
MemoryDB - Redis
Management and Governance
CloudFormation
CloudTrail
CloudWatch
CloudWatch Logs
- Subscription
  - Real-time processing of log data
AWS Config
AWS Config - Config rules
- AWS Config Managed Rules
  - Predefined, customizable rules created by AWS Config
- AWS Config Custom Rules
  - There are 2 ways to create AWS Config custom rules:
    - Lambda functions
    - Guard (policy-as-code DSL)
Managed Grafana
Systems Manager
Well-Architected Tool
Networking and Content Delivery
CloudFront
PrivateLink
Route 53
VPC
Security, Identity, and Compliance
IAM
KMS
Macie
Secrets Manager
AWS Shield
- Protection against DDoS attacks
AWS Shield Standard
- Operates at L3 and L4.
AWS Shield Advanced
- Operates at L7
- Includes Shield Standard
- Certain AWS WAF usage for Shield-protected resources
- DDoS cost protection
  - If any of these protected resources scale up in response to a DDoS attack, you can request Shield Advanced service credits through your regular AWS Support channel.
AWS WAF
- L7 (HTTP) application-level firewall
- Define a Web ACL and then associate it with one or more web application resources that you want to protect.