AWS-DEA-C01

Services

Analytics

Athena

Athena (opens in a new tab)

  • In-place querying of data in S3

  • Serverless

  • PB scale

  • Presto (opens in a new tab) under the hood

  • Supports SQL and Apache Spark

  • Analyze unstructured, semi-structured, and structured data stored in S3, formats including CSV, JSON, Apache Parquet and Apache ORC

  • Only successful or canceled queries are billed, while failed queries are not billed.

  • No charge for DDL statements

  • Use cases

    • Ad-hoc queries of web logs, CloudTrail / CloudFront / VPC / ELB logs
    • Querying staging data before loading to Redshift
    • Integration with Jupyter, Zeppelin, RStudio notebooks
    • Integration with QuickSight for visualization
    • Integration via JDBC / ODBC for BI tools
  • Resources

Athena - Athena SQL
  • MSCK REPAIR TABLE refreshes partition metadata with the current partitions of an external table.

    Repairing partitions manually using MSCK repair (opens in a new tab)

  • SerDe (opens in a new tab)

    • When you specify ROW FORMAT DELIMITED, Athena uses the LazySimpleSerDe by default.

      -- e.g. ROW FORMAT DELIMITED
      ROW FORMAT DELIMITED
      FIELDS TERMINATED BY ','
      ESCAPED BY '\\'
      COLLECTION ITEMS TERMINATED BY '|'
      MAP KEYS TERMINATED BY ':'
    • Use ROW FORMAT SERDE to explicitly specify the type of SerDe that Athena should use when it reads and writes data to the table.

      -- e.g. ROW FORMAT SERDE
      ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
      WITH SERDEPROPERTIES (
      'serialization.format' = ',',
      'field.delim' = ',',
      'collection.delim' = '|',
      'mapkey.delim' = ':',
      'escape.delim' = '\\'
      )
  • UNLOAD (opens in a new tab)

    • Writes query results from a SELECT statement to the specified data format. Supported formats for UNLOAD include Parquet, ORC, Avro, and JSON.
    • The UNLOAD statement is useful when you want to output the results of a SELECT query in a non-CSV format but do not require the associated table.
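
A minimal boto3 sketch of running the UNLOAD statement described above; the database, table, bucket names, and workgroup are hypothetical placeholders, not values from these notes.

    import boto3

    athena = boto3.client("athena")

    # Write query results to S3 as Parquet instead of the default CSV result file.
    unload_sql = """
    UNLOAD (SELECT request_ip, status FROM web_logs WHERE year = '2024')
    TO 's3://example-results-bucket/unload/web_logs/'
    WITH (format = 'PARQUET', compression = 'SNAPPY')
    """

    response = athena.start_query_execution(
        QueryString=unload_sql,
        QueryExecutionContext={"Database": "example_db"},
        ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
        WorkGroup="primary",
    )
    print(response["QueryExecutionId"])
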
Athena - Workgroup
  • A workgroup is a collection of users, queries, and results that are associated with a specific data catalog and workgroup settings.
  • Workgroup settings include the location in S3 where query results are stored, the encryption configuration, and the data usage control settings.
Athena - Federated Query

AWS Docs - Using Amazon Athena Federated Query (opens in a new tab)

  • Query the data in place or build pipelines that extract data from multiple data sources and store them in S3.
  • Uses data source connectors that run on Lambda to run federated queries.

OpenSearch

AWS Docs - OpenSearch (opens in a new tab)

  • Operational analytics and log analytics
  • Forked version of Elasticsearch
  • Number of Shards = Index Size / 30GB
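
A quick sketch of the shard-count guideline above; it is a sizing rule of thumb, not an exact OpenSearch requirement, and the 30 GB target shard size is the figure quoted in these notes.

    import math

    def estimated_shards(index_size_gb: float, target_shard_size_gb: float = 30) -> int:
        # Number of Shards ~= Index Size / 30 GB, rounded up and at least 1
        return max(1, math.ceil(index_size_gb / target_shard_size_gb))

    print(estimated_shards(300))  # a 300 GB index -> about 10 primary shards
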
OpenSearch - Index
  • Storage

    • UltraWarm

      To store large amounts of read-only data

    • Cold

      If you need to do periodic research or forensic analysis on your older data.

    • Hot

    • OR1 (OpenSearch Optimized Instance)

      A domain with OR1 instances uses EBS gp3 or io1 volumes for primary storage, with data copied synchronously to S3 as it arrives.

  • Cross-cluster replication (opens in a new tab)

    • Cross-cluster replication follows a pull model; you initiate connections from the follower domain.

EMR

AWS Docs - EMR (opens in a new tab)

  • Managed cluster platform that simplifies running big data frameworks, such as Hadoop and Spark, on AWS.

  • PB scale

  • RunJobFlow API creates and starts running a new cluster (job flow). The cluster runs the steps specified. After the steps complete, the cluster stops and the HDFS partition is lost.

  • An EMR cluster with multiple primary nodes can reside only in one AZ.

  • HDFS

    • A typical block size used by HDFS is 128 MB. Thus, an HDFS file is chopped up into 128 MB chunks, and if possible, each chunk will reside on a different DataNode.
  • EMR metrics (opens in a new tab)
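
A hedged boto3 sketch of RunJobFlow for a transient cluster that terminates once its step completes; the cluster name, release label, instance types, roles, and script location are illustrative placeholders.

    import boto3

    emr = boto3.client("emr")

    response = emr.run_job_flow(
        Name="example-transient-cluster",
        ReleaseLabel="emr-7.1.0",
        Applications=[{"Name": "Spark"}],
        Instances={
            "InstanceGroups": [
                {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
            ],
            # Terminate the cluster when there are no more steps to run
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        Steps=[
            {
                "Name": "spark-etl",
                "ActionOnFailure": "TERMINATE_CLUSTER",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": ["spark-submit", "s3://example-bucket/jobs/etl.py"],
                },
            }
        ],
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )
    print(response["JobFlowId"])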

EMR cluster - Node types

AWS Docs - EMR cluster - Node types (opens in a new tab)

  • Primary node / Master node

    • Manages the cluster and typically runs primary components of distributed applications.
    • Tracks the status of jobs submitted to the cluster
    • Monitors the health of the instance groups
    • HA support from 5.23.0+
  • Core nodes

    • Store HDFS data and run tasks
    • Multi-node clusters have at least one core node.
    • There is one core instance group or instance fleet per cluster, but the group or fleet can contain multiple nodes running on multiple EC2 instances.
  • Task nodes

    • No HDFS data, therefore no data loss if terminated
    • Can use Spot instances
    • Optional in a cluster
EMR - EC2

AWS Docs - EMR - EC2 (opens in a new tab)

EMR - EKS

AWS Docs - EMR - EKS (opens in a new tab)

EMR - Serverless

AWS Docs - EMR - Serverless (opens in a new tab)

EMR - EMRFS
EMR - Security - Apache Ranger
  • An RBAC framework to enable, monitor, and manage comprehensive data security across the Hadoop ecosystem.
  • Centralized security administration and auditing
  • Fine-grained authorization across many Hadoop components (Hadoop, Hive, HBase, Storm, Knox, Solr, Kafka, and YARN)
  • Syncs policies and users by using agents and plugins that run within the same process as the Hadoop component.
  • Supports row-level authorization and auditing capabilities with embedded search.
EMR - Security Configuration

AWS Docs - Security Configuration (opens in a new tab)

EMR Studio
  • Web IDE for fully managed Jupyter notebooks running on EMR clusters
EMR Notebooks (opens in a new tab)
EMR - Open Source Ecosystem
Delta Lake

EMR - Delta Lake (opens in a new tab)

  • Storage layer framework for lakehouse architectures
Ganglia

EMR - Ganglia (opens in a new tab)

  • Performance monitoring tool for Hadoop and HBase clusters
  • Not included in EMR after 6.15.0
Apache HBase

EMR - Apache HBase (opens in a new tab)

  • Wide-column store NoSQL running on HDFS
  • EMR WAL support: WALs are retained for 30 days from the time the cluster was created and can be restored by a new cluster that uses the same S3 root directory.
Apache HCatalog

EMR - Apache HCatalog (opens in a new tab)

  • HCatalog is a tool that allows you to access Hive metastore tables within Pig, Spark SQL, and/or custom MapReduce applications.
Apache Hudi

EMR - Apache Hudi (opens in a new tab)

  • Open Table Format
  • Brings database and data warehouse capabilities to the data lake
Hue

EMR - Hue (opens in a new tab)

  • Web GUI for Hadoop
Apache Iceberg

EMR - Apache Iceberg (opens in a new tab)

  • Open Table Format
Apache Livy

EMR - Apache Livy (opens in a new tab)

  • A REST interface for Apache Spark
Apache MXNet

EMR - Apache MXNet (opens in a new tab)

  • Retired
  • A deep learning framework for building neural networks and other deep learning applications.
Apache Oozie

EMR - Apache Oozie (opens in a new tab)

  • Workflow scheduler system to manage Hadoop jobs
  • Apache Airflow is a modern alternative.
Apache Phoenix

EMR - Apache Phoenix (opens in a new tab)

  • OLTP and operational analytics in Hadoop for low latency applications
Apache Pig

EMR - Apache Pig (opens in a new tab)

  • Uses SQL-like commands written in Pig Latin and converts those commands into DAG-based Tez jobs or MapReduce programs.
Apache Sqoop

EMR - Apache Sqoop (opens in a new tab)

  • Retired
  • A tool for transferring data between S3, Hadoop, HDFS, and RDBMS databases.
Apache Tez

EMR - Apache Tez (opens in a new tab)

  • An application execution framework for complex DAG of tasks, similar to Apache Spark

Glue

Glue (opens in a new tab)

Glue - Orchestration - Triggers
  • When fired, a trigger can start specified jobs and crawlers.

  • A trigger fires on demand, based on a schedule, or based on a combination of events.

  • trigger types

    • Scheduled

    • Conditional

      The trigger fires if the watched jobs or crawlers end with the specified statuses.

    • On-demand
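
A small boto3 sketch of the conditional trigger type described above; the job names are hypothetical, and the trigger starts the downstream job only after the upstream job succeeds.

    import boto3

    glue = boto3.client("glue")

    glue.create_trigger(
        Name="start-load-after-extract",
        Type="CONDITIONAL",
        StartOnCreation=True,
        Predicate={
            "Logical": "AND",
            "Conditions": [
                # Fire only when the watched job ends with the specified status
                {"LogicalOperator": "EQUALS", "JobName": "extract-job", "State": "SUCCEEDED"}
            ],
        },
        Actions=[{"JobName": "load-job"}],
    )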

Glue - Orchestration - Workflows
Glue - Orchestration - Blueprints
  • AWS Docs - Glue - blueprints (opens in a new tab)

  • Glue blueprints provide a way to create and share Glue workflows. When a complex ETL process could be reused for similar use cases, you can create a single blueprint rather than creating a Glue workflow for each use case.

Glue - Job
Glue - Job type
  • Apache Spark ETL

    • Minimum 2 DPU
    • 10 DPU (default)
    • Job command: glueetl
  • Apache Spark streaming ETL

    • Minimum 2 DPU (default)
    • Job command: gluestreaming
  • Python shell

    • 1 DPU or 0.0625 DPU (default)

    • Job command: pythonshell

    • Mainly intended for ad-hoc tasks such as data retrieval

    • Much faster startup times than Spark jobs

    • Compared to AWS Lambda

      • Can run much longer
      • Can be equipped with more CPU and memory
      • Lower per-unit-time cost
      • Higher minimum billing time (at least 1 minute)
      • Seamlessly integrates with Glue Workflow
    • AWS Glue Python Shell Jobs (opens in a new tab)

  • Ray

    • Job command: glueray
Glue - Job - Job Bookmark
  • Glue tracks data that has already been processed during a previous run of an ETL job by persisting state information from the job run. This persisted state information is called a job bookmark.

  • Options

    • Enabled

      The job updates the state after a job run. The job keeps track of processed data, and when a job runs, it processes new data since the last checkpoint.

    • Disabled

      Job bookmarks are not used, and the job always processes the entire dataset. This is the default setting.

    • Pause

      The job bookmark state is not updated when this option is specified. The job processes incremental data since the last successful job run. You are responsible for managing the output from previous job runs.
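
As a sketch of the options above, the bookmark behavior is selected per run through the special --job-bookmark-option argument; the job name here is a placeholder.

    import boto3

    glue = boto3.client("glue")

    # "job-bookmark-enable", "job-bookmark-disable", or "job-bookmark-pause"
    glue.start_job_run(
        JobName="example-etl-job",
        Arguments={"--job-bookmark-option": "job-bookmark-enable"},
    )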

Glue - Job run insights (Monitoring)
  • Job debugging and optimization
  • Insights are available using 2 new log streams in the CloudWatch logs
Glue - Job - Worker
  • A single DPU is also called a worker.

  • A DPU is a relative measure of processing power that consists of 4 vCPU of compute capacity and 16 GB of memory.

  • An M-DPU is a DPU with 4 vCPU and 32 GB of memory.

  • Auto Scaling

    • Available for Glue jobs with G.1X, G.2X, G.4X, G.8X, or G.025X (only for Streaming jobs) worker types. Standard DPUs are not supported.
    • Requires Glue 3.0+.
  • Worker type

    • For Glue 2.0+

      Specify Worker type and Number of workers

      • Standard

        • 1 DPU
        • 4 vCPU and 16 GB of memory
        • 50 GB of attached storage
      • G.1X (default)

        • 1 DPU - 4 vCPU and 16 GB of memory
        • 64 GB of attached storage
        • Recommended for most workloads with cost effective performance
        • Recommended for jobs authored in Glue 2.0+
      • G.2X

        • 2 DPU - 8 vCPU and 32 GB of memory
        • 128 GB of attached storage
        • Recommended for most workloads with cost effective performance
        • Recommended for jobs authored in Glue 2.0+
      • G.4X

        • 4 DPU - 16 vCPU and 64 GB of memory
        • Glue 3.0+ Spark jobs
      • G.8X

        • 8 DPU - 32 vCPU and 128 GB of memory
        • Recommended for workloads with most demanding transformations, aggregations, and joins
        • Glue 3.0+ Spark jobs
      • G.025X

        • 0.25 DPU
        • Recommended for low volume streaming jobs
        • Glue 3.0+ Spark Streaming jobs only
    • Execution class

      • Standard

        • Ideal for time-sensitive workloads that require fast job startup and dedicated resources
      • Flexible

        • Appropriate for time-insensitive jobs whose start and completion times may vary
        • Only jobs with Glue 3.0+ and command type glueetl will be allowed to set ExecutionClass to FLEX.
        • The flexible execution class is available for Spark jobs.
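
A hedged boto3 sketch tying together the worker type, auto scaling, and execution class settings above; the role, script location, and job name are placeholders.

    import boto3

    glue = boto3.client("glue")

    glue.create_job(
        Name="example-spark-etl",
        Role="arn:aws:iam::123456789012:role/example-glue-role",
        GlueVersion="4.0",
        Command={
            "Name": "glueetl",
            "ScriptLocation": "s3://example-bucket/scripts/etl.py",
            "PythonVersion": "3",
        },
        WorkerType="G.1X",
        NumberOfWorkers=10,
        ExecutionClass="FLEX",  # flexible execution class; Glue 3.0+ glueetl jobs only
        DefaultArguments={"--enable-auto-scaling": "true"},  # scale workers up to NumberOfWorkers
    )
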
Glue - Programming
Glue - Programming - Python
Glue - Programming - Scala
Glue - Studio
  • GUI for Glue jobs

  • Visual ETL

    Visually compose ETL workflows

  • Notebook

  • Script editor

    • Create and edit Python or Scala scripts
    • Automatically generate ETL scripts
Glue - Data Catalog

AWS Docs - Glue Data Catalog (opens in a new tab)

Glue - Data Catalog - crawler

AWS Docs - Glue Data Catalog - crawler (opens in a new tab)

  • A crawler can crawl multiple data stores in a single run. Upon completion, the crawler creates or updates one or more tables in your Data Catalog. ETL jobs that you define in AWS Glue use these Data Catalog tables as sources and targets.

  • A classifier recognizes the format of your data and generates a schema. It returns a certainty number between 0.0 and 1.0, where 1.0 means 100% certainty. Glue uses the output of the classifier that has the highest certainty.

  • If no classifier returns a certainty greater than 0.0, Glue returns the default classification string of UNKNOWN.

  • Once attached to a crawler, a custom classifier is executed before the built-in classifiers.

  • Steps

    • A crawler runs any custom classifiers that you choose to infer the format and schema of your data.
    • The crawler connects to the data store.
    • The inferred schema is created for your data.
    • The crawler writes metadata to the Glue Data Catalog.
  • Partition threshold

    If the following conditions are met, the schemas are denoted as partitions of a table.

    • The partition threshold is higher than 0.7 (70%).
    • The maximum number of different schemas doesn't exceed 5.

    Learn how AWS Glue crawler detects the schema | AWS re:Post (opens in a new tab)
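
A boto3 sketch of defining and starting a crawler as described above; the bucket, database, role, and custom classifier name are hypothetical.

    import boto3

    glue = boto3.client("glue")

    glue.create_crawler(
        Name="example-logs-crawler",
        Role="arn:aws:iam::123456789012:role/example-glue-role",
        DatabaseName="example_db",
        # Custom classifiers are tried before the built-in classifiers
        Classifiers=["example-custom-classifier"],
        Targets={"S3Targets": [{"Path": "s3://example-bucket/raw/logs/"}]},
        SchemaChangePolicy={"UpdateBehavior": "UPDATE_IN_DATABASE", "DeleteBehavior": "LOG"},
    )
    glue.start_crawler(Name="example-logs-crawler")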

Glue - Data Catalog - security
  • You can only use a Glue resource policy to manage permissions for Data Catalog resources.
  • You can't attach it to any other Glue resources such as jobs, triggers, development endpoints, crawlers, or classifiers.
  • Only one resource policy is allowed per catalog, and its size is limited to 10 KB.
  • Each AWS account has one single catalog per Region whose catalog ID is the same as the AWS account ID.
  • You cannot delete or modify a catalog.
Glue - Data Quality

AWS Docs - Glue Data Quality (opens in a new tab)

  • Based on the Deequ (opens in a new tab) framework, using the Data Quality Definition Language (DQDL), a DSL for defining data quality rules

  • Entry Point

    • Data quality for the Data Catalog
    • Data quality for ETL jobs
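
A minimal sketch of a DQDL ruleset attached to a Data Catalog table; the database, table, and rule choices are illustrative, and the rule types (IsComplete, RowCount) come from the DQDL reference.

    import boto3

    glue = boto3.client("glue")

    dqdl = 'Rules = [ IsComplete "order_id", RowCount > 1000 ]'

    glue.create_data_quality_ruleset(
        Name="example-orders-ruleset",
        Ruleset=dqdl,
        TargetTable={"DatabaseName": "example_db", "TableName": "orders"},
    )
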
Glue - Streaming

AWS Docs - Glue Streaming (opens in a new tab)

  • Using the Apache Spark Streaming framework
  • Near-real-time data processing

Glue DataBrew

AWS Docs - Glue DataBrew (opens in a new tab)

Visual data preparation tool that enables users to clean and normalize data without writing any code.

  • Job type

    • Recipe job

      Data transformation

    • Profile job

      Analyzing a dataset to create a comprehensive profile of the data

Glue DataBrew - Recipes
  • AWS Docs - Recipe step and function reference (opens in a new tab)

  • Categories

    • Basic column recipe steps

      • Filter
      • Column
    • Data cleaning recipe steps

      • Format
      • Clean
      • Extract
    • Data quality recipe steps

      • Missing
      • Invalid
      • Duplicates
      • Outliers
    • Personally identifiable information (PII) recipe steps

      • Mask personal information
      • Replace personal information
      • Encrypt personal information
      • Shuffle rows
    • Column structure recipe steps

      • Split
      • Merge
      • Create
    • Column formatting recipe steps

      • Decimal precision
      • Thousands separator
      • Abbreviate numbers
    • Data structure recipe steps

      • Nest-Unnest
      • Pivot
      • Group
      • Join
      • Union
    • Data science recipe steps

      • Text
      • Scale
      • Mapping
      • Encode
    • Functions

      • Mathematical functions
      • Aggregate functions
      • Text functions
      • Date and time functions
      • Window functions
      • Web functions
      • Other functions

Kinesis Data Streams

AWS Docs - Kinesis Data Streams (opens in a new tab)

  • Key points

    • Equivalent to Kafka, for event streaming, no ETL
    • On-demand or provisioned mode
    • PaaS with API access for developers
    • The maximum size of a record's data payload (before base64 encoding) is 1 MB.
    • Retention Period by default is 1 day, and can be up to 365 days.
  • Shard (opens in a new tab)

    • A stream is composed of one or more shards.

    • A shard is a uniquely identified sequence of data records in a stream.

    • All the data in the shard is sent to the same worker that is processing the shard.

    • A shard iterator is a pointer to a position in the shard from which to start reading data records sequentially. A shard iterator specifies this position using the sequence number of a data record in a shard.

    • number_of_shards = max(incoming_write_bandwidth_in_KiB/1024, outgoing_read_bandwidth_in_KiB/2048)

    • Read throughput

      Up to 2 MB/second/shard

      or

      5 transactions (API calls) /second/shard

      or

      2000 records/second/shard

      shared across all the consumers that are reading from a given shard

      Each call to GetRecords is counted as 1 read transaction.

      GetRecords can retrieve up to 10 MB / transaction / shard, and up to 10000 records per call. If a call to GetRecords returns 10 MB, subsequent calls made within the next 5 seconds throw an exception.

      up to 20 registered consumers (Enhanced Fan-out Limit) for each data stream.

    • Write throughput

      Up to 1 MB/second/shard

      or

      1000 records/second/shard

    • Record aggregation allows producers to combine multiple records into a single record, improving per-shard throughput.

  • Data stream capacity mode

    • On-demand mode

      Automatically manages the shards

    • Provisioned mode

      Must specify the number of shards for the data stream upfront, but you can also change the number of shards later by resharding the stream.

  • Scaling

  • Partition key (opens in a new tab)

    • All data records with the same partition key map to the same shard.
    • The number of partition keys should typically be much greater than the number of shards, i.e., a many-to-one relationship of partition keys to shards.
  • Idempotency

    • 2 primary causes for duplicate records

      • Producer retries
      • Consumer retries
    • Consumer retries are more common than producer retries

      • Kinesis Data Streams does not guarantee the order of records across shards in a stream.
      • Kinesis Data Streams does guarantee the order of records within a shard.
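
A hedged boto3 sketch pulling together the shard-count formula, partition-key routing, and provisioned-mode resharding described above; the stream name and bandwidth figures are hypothetical.

    import json
    import math

    import boto3

    kinesis = boto3.client("kinesis")

    def estimated_shards(write_kib_per_sec: float, read_kib_per_sec: float) -> int:
        # number_of_shards = max(incoming_write_bandwidth_in_KiB/1024, outgoing_read_bandwidth_in_KiB/2048)
        return max(math.ceil(write_kib_per_sec / 1024), math.ceil(read_kib_per_sec / 2048))

    # Records with the same partition key always map to the same shard
    kinesis.put_record(
        StreamName="example-stream",
        Data=json.dumps({"customer_id": "c-42", "amount": 31.5}).encode("utf-8"),
        PartitionKey="c-42",
    )

    # Provisioned mode only: reshard by changing the shard count
    kinesis.update_shard_count(
        StreamName="example-stream",
        TargetShardCount=estimated_shards(4096, 6144),
        ScalingType="UNIFORM_SCALING",
    )
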
Kinesis Data Streams - Producers
Kinesis Data Streams - Producers - KPL

AWS Docs - KPL (opens in a new tab)

  • Batching

    • Aggregation

      Storing multiple payloads within a single Kinesis Data Streams record.

    • Collection

      Consolidating multiple Kinesis Data Streams records into a single HTTP request (a PutRecords call) to reduce request overhead.

  • Rate Limiting

    • Limits per-shard throughput sent from a single producer
    • Implemented using a token bucket algorithm with separate buckets for both Kinesis Data Streams records and bytes.
  • Pros

    • Higher throughput due to aggregation and compression of records
    • Abstraction over API to simplify coding using low level API directly
  • Cons

    • Higher latency due to an additional processing delay of up to RecordMaxBufferedTime (configurable)
    • Only supports AWS SDK v1
Kinesis Data Streams - Producers - Kinesis Agent

AWS Docs - Kinesis Agent (opens in a new tab)

  • Stand-alone Java application running as a daemon
  • Continuously monitors a set of files and sends new data to your stream.
Kinesis Data Streams - Consumers
  • Differences between shared throughput consumers and enhanced fan-out consumers (opens in a new tab)

  • Types

    • Shared throughput consumers without enhanced fan-out
    • Enhanced fan-out consumers
  • KCL (opens in a new tab)

    • Tasks

      • Connects to the data stream
      • Enumerates the shards within the data stream
      • Uses leases to coordinate shard associations with its workers
      • Instantiates a record processor for every shard it manages
      • Pulls data records from the data stream
      • Pushes the records to the corresponding record processor
      • Checkpoints processed records
      • Balances shard-worker associations (leases) when the worker instance count changes or when the data stream is resharded (shards are split or merged)
    • Versions

      • KCL 1.x is based on AWS SDK v1, and KCL 2.x is based on AWS SDK v2

      • KCL 1.x

        • Java
        • Node.js
        • .NET
        • Python
        • Ruby
      • KCL 2.x

        • Java
        • Python
    • Multistream processing is only supported in KCL 2.x for Java (KCL 2.3+ for Java)

    • You should ensure that the number of KCL instances does not exceed the number of shards (except for failure standby purposes)

    • One KCL worker is deployed on one EC2 instance, able to process multiple shards.

    • A KCL worker runs as a process and it has multiple record processors running as threads within it.

    • One Shard has exactly one corresponding record processor.

    • KCL 2.x enables you to create KCL consumer applications that can process more than one data stream at the same time.

    • Checkpoints sequence number for the shard in the lease table to track processed records

    • Enhanced fan-out (opens in a new tab)

      • Consumers using enhanced fan-out have dedicated throughput of up to 2 MB/second/shard, and without enhanced fan-out, all consumers reading from the same shard share the throughput of 2 MB/second/shard.
      • Requires KCL 2.x and Kinesis Data Streams with enhanced fan-out enabled
      • You can register up to 20 consumers per stream to use enhanced fan-out.
      • Kinesis Data Streams pushes data records from the stream to consumers using enhanced fan-out over HTTP/2, therefore no polling and lower latency.
    • Lease Table (opens in a new tab)

      • At any given time, each shard of data records is bound to a particular KCL worker by a lease identified by the leaseKey variable.
      • By default, a worker can hold one or more leases (subject to the value of the maxLeasesForWorker variable) at the same time.
      • A DynamoDB table to keep track of the shards in a Kinesis Data Stream that are being leased and processed by the workers of the KCL consumer application.
      • Each row in the lease table represents a shard that is being processed by the workers of your consumer application.
      • KCL uses the name of the consumer application to create the name of the lease table that this consumer application uses, therefore each consumer application name must be unique.
      • One lease table for one KCL consumer application
      • KCL creates the lease table with a provisioned throughput of 10 reads / second and 10 writes / second
  • Developing Consumers Using Amazon Data Firehose (opens in a new tab)

  • Developing Consumers Using Amazon Managed Service for Apache Flink (opens in a new tab)

  • Troubleshooting (opens in a new tab)
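
For the enhanced fan-out consumers described above, a minimal boto3 sketch of registering a dedicated-throughput consumer; the stream ARN and consumer name are placeholders.

    import boto3

    kinesis = boto3.client("kinesis")

    consumer = kinesis.register_stream_consumer(
        StreamARN="arn:aws:kinesis:us-east-1:123456789012:stream/example-stream",
        ConsumerName="example-efo-consumer",
    )
    # The consumer ARN is what SubscribeToShard / KCL 2.x uses for HTTP/2 push delivery
    print(consumer["Consumer"]["ConsumerARN"])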

Kinesis Data Firehose

AWS Docs - Kinesis Data Firehose (opens in a new tab)

Kinesis Data Firehose - Sources & Destinations
Kinesis Data Firehose - Buffering hints

AWS Docs - Buffering hints (opens in a new tab)

  • Buffering size in MBs (destination specific)

    • 1 MB to 128 MB
    • Default is 5 MB
  • Buffering interval in seconds (destination specific)

    • 60 to 900 seconds
    • Default is 60 / 300 seconds
Kinesis Data Firehose - Data Transformation

Data Transformation (opens in a new tab) is not natively supported; you need to use Lambda

  • Lambda

    • Custom transformation logic

    • Lambda invocation time of up to 5 min.

    • Error handling

      • Retry the invocation 3 times by default.
      • After retries, if the invocation is unsuccessful, Firehose then skips that batch of records. The skipped records are treated as unsuccessfully processed records.
      • The unsuccessfully processed records are delivered to your S3 bucket.
  • Data format conversion

    • Firehose can convert JSON records using a schema from a table defined in Glue
    • Non-JSON data must be converted by invoking Lambda function
Kinesis Data Firehose - Dynamic Partitioning
  • Dynamic partitioning enables you to continuously partition streaming data in Firehose by using keys within data (for example, customer_id or transaction_id) and then deliver the data grouped by these keys into corresponding S3 prefixes.

  • Partitioning your data minimizes the amount of data scanned, optimizes performance, and reduces costs of your analytics queries on S3.

  • Methods of creating partitioning keys

    • Inline parsing

      Map each parameter to a valid jq expression, suitable for JSON format

    • Lambda function

      For compressed or encrypted data records, or data that is in any file format other than JSON

  • Resources
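
A hedged boto3 sketch of the inline (jq) parsing method above, assuming a Direct PUT stream delivering JSON to S3; the names, ARNs, and the customer_id key are illustrative placeholders.

    import boto3

    firehose = boto3.client("firehose")

    firehose.create_delivery_stream(
        DeliveryStreamName="example-partitioned-stream",
        DeliveryStreamType="DirectPut",
        ExtendedS3DestinationConfiguration={
            "RoleARN": "arn:aws:iam::123456789012:role/example-firehose-role",
            "BucketARN": "arn:aws:s3:::example-delivery-bucket",
            # Group delivered objects into S3 prefixes by the extracted key
            "Prefix": "events/customer_id=!{partitionKeyFromQuery:customer_id}/",
            "ErrorOutputPrefix": "errors/",
            "DynamicPartitioningConfiguration": {"Enabled": True},
            "ProcessingConfiguration": {
                "Enabled": True,
                "Processors": [
                    {
                        "Type": "MetadataExtraction",
                        "Parameters": [
                            {"ParameterName": "MetadataExtractionQuery",
                             "ParameterValue": "{customer_id: .customer_id}"},
                            {"ParameterName": "JsonParsingEngine",
                             "ParameterValue": "JQ-1.6"},
                        ],
                    }
                ],
            },
        },
    )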

Kinesis Data Firehose - Server-side Encryption

Server-side encryption (opens in a new tab)

  • Can be enabled depending on the source of data

  • Kinesis Data Streams as the Source

    • Kinesis Data Streams first decrypts the data and then sends it to Firehose.
    • Firehose buffers the data in memory and then delivers without storing the unencrypted data at rest.
  • With Direct PUT or other Data Sources

    • Can be turned on by using the StartDeliveryStreamEncryption operation.
    • To stop server-side-encryption, use the StopDeliveryStreamEncryption operation.
Kinesis Data Firehose - Custom S3 Prefixes

Custom Prefixes for Amazon S3 Objects (opens in a new tab)

Kinesis Data Analytics

  • Input

    • Kinesis Data Streams
    • Kinesis Data Firehose
  • Output

    • Kinesis Data Streams
    • Kinesis Data Firehose
    • Lambda
Kinesis Data Analytics - Managed Apache Flink

Managed Service for Apache Flink (opens in a new tab)

  • Equivalent to Kafka Streams, using Apache Flink behind the scenes, for online stream processing, namely streaming ETL

  • Managed Service for Apache Flink is for building streaming applications with APIs in Java, Scala, Python, and SQL

  • Managed Service for Apache Flink Studio is for ad-hoc interactive data exploration.

  • Flink API

    • Flink offers 4 levels of API abstraction: Flink SQL, Table API, DataStream API, and Process Function, which is used in conjunction with the DataStream API.
    • SQL can be embedded in any Flink application, regardless of the programming language chosen.
    • If you are planning to use the DataStream API, note that not all connectors are supported in Python.
    • If you need low latency / high throughput, you should consider Java/Scala regardless of the API.
    • If you plan to use Async I/O in the Process Function API, you will need to use Java.
Kinesis Data Analytics Studio

Kinesis Data Analytics Studio (opens in a new tab)

  • Based on

    • Apache Zeppelin

    • Apache Flink

Lake Formation

Lake Formation - Ingestion
  • Blueprints are predefined ETL workflows that you can use to ingest data into your Data Lake.

  • Workflows are instances of ingestion blueprints in Lake Formation.

  • Blueprint types

    • Database snapshot

      • Loads or reloads data from all tables into the data lake from a JDBC source.

      • You can exclude some data from the source based on an exclude pattern.

      • When to use:

        • Schema evolution is flexible. (Columns are re-named, previous columns are deleted, and new columns are added in their place.)
        • Complete consistency is needed between the source and the destination.
    • Incremental database

      • Loads only new data into the data lake from a JDBC source, based on previously set bookmarks.

      • You specify the individual tables in the JDBC source database to include.

      • When to use:

        • Schema evolution is incremental. (There is only successive addition of columns.)
        • Only new rows are added; previous rows are not updated.
    • Log file

      Bulk loads data from log file sources

Lake Formation - Permission management

The security policies in Lake Formation use two layers of permissions; each resource is protected by:

  • Lake Formation permissions (which control access to Data Catalog resources and S3 locations)

  • IAM permissions (which control access to Lake Formation and AWS Glue API resources)

  • Permission types

    • Metadata access

      Data Catalog permissions

    • Underlying data access

      • Data access permissions

        Data access permissions (SELECT, INSERT, and DELETE) on Data Catalog tables that point to that location.

      • Data location permissions

        The ability to create Data Catalog resources that point to particular S3 locations.

        When you grant the CREATE_TABLE or ALTER permission to a principal, you also grant data location permissions to limit the locations for which the principal can create or alter metadata tables.

  • Permission model

    • IAM permissions model consists of IAM policies.

    • Lake Formation permissions model is implemented as DBMS-style GRANT/REVOKE commands.

    • Principals

      • IAM users and roles
      • SAML users and groups
      • IAM Identity Center
      • External accounts
    • Resources

      • LF-Tags

        Using LF-Tags can greatly simplify the number of grants over using Named Resource policies.

      • Named data catalog resources

    • Permissions

      • Database permissions
      • Table permissions
    • Data filtering

      • Column-level security

        Allows users to view only specific columns and nested columns that they have access to in the table.

      • Row-level security

        Allows users to view only specific rows of data that they have access to in the table.

      • Cell-level security

        By using both row filtering and column filtering at the same time

        Can restrict access to different columns depending on the row.
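
A small boto3 sketch of the column-level security described above, granting SELECT on a subset of columns; the role, database, table, and column names are hypothetical.

    import boto3

    lf = boto3.client("lakeformation")

    lf.grant_permissions(
        Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/example-analyst"},
        Resource={
            "TableWithColumns": {
                "DatabaseName": "example_db",
                "Name": "orders",
                # Only these columns are visible to the principal
                "ColumnNames": ["order_id", "order_date"],
            }
        },
        Permissions=["SELECT"],
    )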

QuickSight

AWS Docs - QuickSight (opens in a new tab)

  • Key points

  • Editions

    • Standard edition

    • Enterprise edition

      • VPC connection

        Access to data source in your VPC without the need for a public IP address

      • Microsoft Active Directory SSO

      • Data stored in SPICE is encrypted at rest.

      • For SQL-based data sources, such as Redshift, Athena, PostgreSQL, or Snowflake, you can refresh your data incrementally within a look-back window of time.

  • Datasets

    • Dataset refresh can be scheduled or on-demand.

    • SPICE

      • In-memory
    • Direct SQL query

  • Visual types

  • AutoGraph (opens in a new tab)

    You create a visual by choosing AutoGraph and then selecting fields.

Migration and Transfer

Application Discovery Service

AWS Application Discovery Service (opens in a new tab)

  • Collect usage and configuration data about your on-premises servers and databases.

  • All discovered data is stored in your AWS Migration Hub home Region.

  • 2 ways of performing discovery and collecting data

    • Agentless discovery

      Identifies VMs and hosts associated with VMware vCenter

    • Agent-based discovery

      The agent runs in your local environment and requires root privileges.

Application Migration Service

  • Automated lift-and-shift (rehost) migration of VMs to AWS
  • Migrate your applications to AWS from physical servers, VMware vSphere, Microsoft Hyper-V, and other cloud providers.
  • Migrate EC2 instances between AWS Regions or between AWS accounts, and to migrate from EC2-Classic to a VPC.

DMS

AWS Docs - DMS (Database Migration Service) (opens in a new tab)

  • Migrate RDBMS, data warehouses, NoSQL databases, and other types of data stores

  • One-time migration or continuous replication

  • Used for CDC (Change Data Capture) for real-time ongoing replication

    AWS Big Data Blog - Stream change data to Amazon Kinesis Data Streams with AWS DMS (opens in a new tab)

  • DMS uses a replication instance to connect to your source data store, read the source data, and format the data for consumption by the target data store.

  • DMS Serverless eliminates replication instance management tasks.

DMS - Fleet Advisor
  • Automatically inventories and assesses your on-premises database and analytics server fleet and identifies potential migration paths
DMS - Data Transformations

DataSync

AWS Docs - DataSync (opens in a new tab)

  • One-off or scheduled online data transfer; transferring hot or cold data should not impede your business.

  • Move data between on-premises and AWS

  • Move data between AWS services

  • Requires DataSync agent installation on-premises

  • Move files or objects, not databases

  • On-premises

    • Network File System (NFS)
    • Server Message Block (SMB)
    • Hadoop Distributed File Systems (HDFS)
    • Object storage

Snow Family

Snowball Edge Storage Optimized
  • For large-scale data migrations and recurring transfer workflows, as well as local computing with higher capacity needs
Snowball Edge Compute Optimized
  • For use cases such as machine learning, full motion video analysis, analytics, and local computing stacks
Snowmobile
  • Intended for more than 10 PB data migrations from a single location
  • Up to 100 PB per Snowmobile
  • Shipping container

Transfer Family

  • Transfer flat files using the following protocols:

    • Secure File Transfer Protocol (SFTP): version 3
    • File Transfer Protocol Secure (FTPS)
    • File Transfer Protocol (FTP)
    • Applicability Statement 2 (AS2)
    • Browser-based transfers

Application Integration

AppFlow

EventBridge

  • Rules

    Routing rules for events

  • API Destinations

    API destinations are HTTP endpoints that you can invoke as the target of a rule, similar to how you invoke an AWS service or resource as a target.

MWAA

  • Auto scaling
  • Automatic Airflow setup based on Airflow version
  • Streamlined upgrades and patches
  • Workflow monitoring in CloudWatch

SNS

Step Functions

Step Functions (opens in a new tab)

Step Functions - State Machine
  • Types

    • Standard workflows

      • Standard workflows are ideal for long-running, durable, and auditable workflows.

      • Standard workflows can support an execution start rate of over 2000 executions / second.

      • They can run for up to a year and you can retrieve the full execution history using the Step Functions API.

      • Standard Workflows employ an exactly-once execution model, where your tasks and states are never started more than once unless you have specified the Retry behavior in your state machine.

      • Suited to orchestrating non-idempotent actions, such as starting an EMR cluster or processing payments.

      • Standard Workflow executions are billed according to the number of state transitions processed.

      • Using Callbacks or the .sync Service integration will most likely reduce the number of state transitions and cost.

    • Express workflows

      The Express type is used for high-volume, event-processing workloads and can run for up to 5 minutes.

      • Synchronous Express Workflows

        Start a workflow, wait until it completes, and then return the result

      • Asynchronous Express Workflows

        Return confirmation that the workflow was started, but don't wait for the workflow to complete

Step Functions - States

AWS Docs - Step Functions - States (opens in a new tab)

  • Defined with Amazon States Language in State Machine definition

  • Types

    • Task

    • Choice

    • Map

      • Use the Map state to run a set of workflow steps for each item in a dataset. The Map state's iterations run in parallel, which makes it possible to process a dataset quickly.

      • Map state processing modes (opens in a new tab)

        • Inline mode

          • Limited concurrency, each iteration of the Map state runs in the context of the workflow that contains the Map state.
          • Map state accepts only a JSON array as input. Also, this mode supports up to 40 concurrent iterations.
        • Distributed mode

          • High concurrency, each iteration of the Map state runs in its own execution context.

          • When you run a Map state in Distributed mode, Step Functions creates a Map Run resource.

          • Tolerated failure threshold can be set for a Map Run, and Map Run fails automatically if it exceeds the threshold.

          • Use cases

            • The size of your dataset exceeds 256 KB.
            • The workflow's execution event history exceeds 25000 entries.
            • You need a concurrency of more than 40 parallel iterations.
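
A hedged sketch of a Distributed-mode Map state that reads its dataset from S3; the bucket, key, role, and concurrency/failure settings are illustrative placeholders.

    import json

    import boto3

    sfn = boto3.client("stepfunctions")

    definition = {
        "StartAt": "ProcessItems",
        "States": {
            "ProcessItems": {
                "Type": "Map",
                # Read a dataset larger than the 256 KB payload limit directly from S3
                "ItemReader": {
                    "Resource": "arn:aws:states:::s3:getObject",
                    "ReaderConfig": {"InputType": "JSON"},
                    "Parameters": {"Bucket": "example-bucket", "Key": "input/items.json"},
                },
                "ItemProcessor": {
                    "ProcessorConfig": {"Mode": "DISTRIBUTED", "ExecutionType": "EXPRESS"},
                    "StartAt": "HandleItem",
                    "States": {"HandleItem": {"Type": "Pass", "End": True}},
                },
                "MaxConcurrency": 500,
                "ToleratedFailurePercentage": 5,  # the Map Run fails beyond this threshold
                "End": True,
            }
        },
    }

    sfn.create_state_machine(
        name="example-distributed-map",
        definition=json.dumps(definition),
        roleArn="arn:aws:iam::123456789012:role/example-sfn-role",
    )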

SQS

Kinesis Data Streams vs SQS
|              | MSK                                | Kinesis Data Streams                  | SQS                                      |
| Consumption  | Can be consumed many times         | Can be consumed many times            | Records are deleted after being consumed |
| Retention    | Configurable (default 7 days)      | Configurable from 24 hours to 1 year  | Configurable from 1 min to 14 days       |
| Ordering     | Preserved within a partition       | Preserved within a shard              | FIFO queues preserve ordering            |
| Scaling      | Manual (add brokers / partitions)  | Manual resharding                     | Auto scaling                             |
| Delivery     | At least once                      | At least once                         | Exactly once (FIFO queues)               |
| Replay       | Can replay                         | Can replay                            | Cannot replay                            |
| Payload size | 1 MB                               | 1 MB                                  | 256 KB                                   |

Cloud Financial Management

AWS Budgets

AWS Cost Explorer

Compute

AWS Batch

AWS Batch (opens in a new tab)

  • AWS Batch handles job execution and compute resource management, allowing you to focus on developing your applications rather than managing infrastructure.
  • A batch job is deployed as a Docker container.

EC2

Lambda

AWS SAM

Containers

ECR

ECS

EKS

Database

DocumentDB

DynamoDB

RDS

Redshift

Redshift - Cluster
  • The maximum number of nodes per cluster depends on the node type: up to 32 for most node sizes, and up to 128 for ra3.16xlarge and dc2.8xlarge (see the node type tables below).

  • Leader node and compute nodes have the same specs, and leader node is chosen automatically.

  • Leader node (present when the cluster has 2+ nodes)

    • Parses the query, builds an optimal execution plan, and compiles code from the execution plan
    • Receives and aggregates the results from the compute nodes
  • Compute node

    • Node slices

      A compute node is partitioned into slices. Each slice is allocated a portion of the node's memory and disk space, where it processes a portion of the workload assigned to the node.

  • Enhanced VPC routing

    • Routes network traffic between your cluster and data repositories through a VPC, instead of through the internet.
Redshift - Cluster - Elastic resize
Redshift - Cluster - Classic resize
  • When you need to change the cluster size or the node type of an existing cluster and elastic resize is not supported
  • Takes much longer than elastic resize
Redshift - Cluster - Node type
  • RA3

    | Node Size     | vCPU | RAM (GiB) | Managed Storage quota per node | Number of nodes per cluster |
    | ra3.xlplus    | 4    | 32        | 32 TB                          | 1 to 32                     |
    | ra3.4xlarge   | 12   | 96        | 128 TB                         | 2 to 32                     |
    | ra3.16xlarge  | 48   | 384       | 128 TB                         | 2 to 128                    |
    • Pay for the managed storage and compute separately
    • Managed storage uses SSD for local cache and S3 for persistence and automatically scales based on the workload
    • Compute is separated from managed storage and can be scaled independently
    • RA3 nodes are optimized for performance and cost-effectiveness
  • DC2 (Dense Compute)

    | Node Size   | vCPU | RAM (GiB) | Managed Storage quota per node | Number of nodes per cluster |
    | dc2.large   | 2    | 15        | 160 GB                         | 1 to 32                     |
    | dc2.8xlarge | 32   | 244       | 2.6 TB                         | 2 to 128                    |
    • Compute-intensive local SSD storage for high performance and low latency
    • Recommended for datasets under 10 TB
  • DS2 (Dense Storage)

    • Deprecated legacy option, for low-cost workloads with HDD storage, RA3 is recommended for new workloads
  • Resources

Redshift - Cluster - HA
  • AZ

    • Single AZ by default
    • Multi AZ is supported
Redshift - Distribution style

Distribution style (opens in a new tab)

  • Redshift Distribution Keys (DIST Keys) determine where data is stored in Redshift. Clusters store data fundamentally across the compute nodes. Query performance suffers when a large amount of data is stored on a single node.

  • Uneven distribution of data across compute nodes leads to skew in the work each node has to do, and you don't want an under-utilized compute node, so the distribution of the data should be uniform.

  • Data should be distributed in such a way that the rows that participate in joins are already on the same node with their joining rows in other tables. This is called collocation, which reduces data movement and improves query performance.

  • Distribution is configurable per table. So you can select a different distribution style for each of the table.

  • AUTO distribution

    • Redshift automatically distributes the data across the nodes in the cluster
    • Redshift initially assigns ALL distribution to a small table, then changes to EVEN distribution when the table grows larger.
    • Suitable for small tables or tables with a small number of distinct values
  • EVEN distribution

    • The leader node distributes the rows across the slices in a round-robin fashion, regardless of the values in any particular column.
    • Appropriate when a table doesn't participate in joins. It's also appropriate when there isn't a clear choice between KEY distribution and ALL distribution.
  • KEY distribution

    • The rows are distributed according to the values in one column.
    • All the entries with the same value in the column end up on the same slice.
    • Suitable for large tables or tables with a large number of distinct values
  • ALL distribution

    • A copy of the entire table is distributed to every node.
    • Suitable for small, slow-moving tables (such as dimension tables), so that joins against them can be performed locally on every node
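
A sketch of choosing a distribution style and sort key in DDL, submitted here through the Redshift Data API; the cluster, secret, database, and table definition are hypothetical.

    import boto3

    redshift_data = boto3.client("redshift-data")

    # KEY distribution on customer_id collocates joining rows on the same slice;
    # the compound sort key speeds range-restricted scans on sale_date.
    ddl = """
    CREATE TABLE sales (
        sale_id     BIGINT,
        customer_id BIGINT,
        sale_date   DATE,
        amount      DECIMAL(12, 2)
    )
    DISTSTYLE KEY
    DISTKEY (customer_id)
    COMPOUND SORTKEY (sale_date);
    """

    redshift_data.execute_statement(
        ClusterIdentifier="example-cluster",
        Database="dev",
        SecretArn="arn:aws:secretsmanager:us-east-1:123456789012:secret:example-redshift",
        Sql=ddl,
    )
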
Redshift - Concurrency scaling
  • Concurrency scaling adds transient clusters to your cluster to handle concurrent requests with consistency and fast performance in a matter of seconds, mainly for bursty workloads.
Redshift - Sort keys

Sort keys (opens in a new tab)

  • When you create a table, you can define one or more of its columns as sort keys.
  • Either a compound or interleaved sort key.
Redshift - Compression

AWS Docs - Working with column compression (opens in a new tab)

  • You can apply a compression type, or encoding, to the columns in a table manually when you create the table.
  • The COPY command analyzes your data and applies compression encodings to an empty table automatically as part of the load operation.
  • Split your load data files so that the files are about equal size, between 1 MB and 1 GB after compression. For optimum parallelism, the ideal size is between 1 MB and 125 MB after compression.
  • The number of files should be a multiple of the number of slices in your cluster.
  • When loading data, it is strongly recommended that you individually compress your load files using gzip, lzop, bzip2, or Zstandard when you have large datasets.
Redshift - Workload Management (WLM)
  • AWS Docs - Amazon Redshift - Implementing workload management (opens in a new tab)

  • Automatic WLM

  • Manual WLM

    • Query Groups

      • Memory percentage and concurrency settings
    • Query Monitoring Rules

      • Query monitoring rules define metrics-based performance boundaries for queries and the action to take (for example, log, hop, or abort) when a query crosses those boundaries.
    • Query Queues

      • Memory percentage and concurrency settings
  • Concurrency scaling

    • Concurrency scaling automatically adds and removes transient cluster capacity to handle spikes in demand.
    • Concurrency scaling is enabled by default for all RA3 and DC2 clusters.
    • Concurrency scaling is not available for DS2 clusters.
  • Short query acceleration (opens in a new tab)

    • Short query acceleration (SQA) prioritizes selected short-running queries ahead of longer-running queries.
    • SQA runs short-running queries in a dedicated space, so that SQA queries aren't forced to wait in queues behind longer queries.
    • CREATE TABLE AS (CTAS) statements and read-only queries, such as SELECT statements, are eligible for SQA.
Redshift - User-Defined Functions (UDF)
Redshift - Snapshots
  • Automated (opens in a new tab)

    • Enabled by default when you create a cluster.
    • By default, about every 8 hours or following every 5 GB per node of data changes, whichever comes first.
    • Default retention period is 1 day
    • Cannot be deleted manually
  • Manual (opens in a new tab)

    • By default, manual snapshots are retained indefinitely, even after you delete your cluster.
    • AWS CLI: create-cluster-snapshot
    • AWS API: CreateClusterSnapshot
  • Backup

    • You can configure Redshift to automatically copy snapshots (automated or manual) for a cluster to another Region.
    • Only one destination Region at a time can be configured for automatic snapshot copy.
    • To change the destination Region that you copy snapshots to, first disable the automatic copy feature. Then re-enable it, specifying the new destination Region.
  • Snapshot copy grant

    • KMS keys are specific to a Region. If you enable copying of Redshift snapshots to another Region, and the source cluster and its snapshots are encrypted using a master key from KMS, you need to configure a grant for Redshift to use a master key in the destination Region.
    • The grant is created in the destination Region and allows Redshift to use the master key in the destination Region.
Redshift - SQL
  • VACUUM

    • FULL

      • VACUUM FULL is the default.
      • By default, VACUUM FULL skips the sort phase for any table that is already at least 95% sorted.
    • SORT ONLY

      • Sorts the specified table (or all tables in the current database) without reclaiming space freed by deleted rows.
    • DELETE ONLY

      Redshift automatically performs a DELETE ONLY vacuum in the background.

    • REINDEX

Redshift - Query
  • Data API (opens in a new tab)

    • Available in AWS SDK, based on HTTP and JSON, no persistent connection needed
    • API calls are asynchronous
    • Uses either credentials stored in Secrets Manager or temporary database credentials, no need to pass passwords in the API calls
  • System tables and views

    • SVV views contain information about database objects with references to transient STV tables.

    • SYS views

      • These are system monitoring views used to monitor query and workload usage for provisioned clusters and serverless workgroups.
      • These views are located in the pg_catalog schema.
    • STL views

      • Information retrieved from all Redshift log history across nodes
      • Retain 7 days of log history
    • STV tables

      • Virtual system tables (in-memory) that contain snapshots of the current system state
    • SVCS views provide details about queries on both the main and concurrency scaling clusters.

    • SVL views provide details about queries on main clusters.

  • Federated Query
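
A minimal sketch of the asynchronous Data API flow described above, assuming a Redshift Serverless workgroup; the workgroup, database, and query are placeholders.

    import time

    import boto3

    redshift_data = boto3.client("redshift-data")

    stmt = redshift_data.execute_statement(
        WorkgroupName="example-workgroup",
        Database="dev",
        Sql="SELECT COUNT(*) FROM sales WHERE sale_date >= '2024-01-01'",
    )

    # Calls are asynchronous: poll for completion, then fetch the result set
    while True:
        status = redshift_data.describe_statement(Id=stmt["Id"])
        if status["Status"] in ("FINISHED", "FAILED", "ABORTED"):
            break
        time.sleep(1)

    if status["Status"] == "FINISHED":
        print(redshift_data.get_statement_result(Id=stmt["Id"])["Records"])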

Redshift - Data Sharing
  • With data sharing, you can securely and easily share live data across Redshift clusters without the need to copy or move the data.

  • Lends itself to a multi-warehouse architecture, where you can scale each data warehouse for various types of workloads.

  • Only RA3 and serverless clusters are supported.

  • Live and transactionally consistent views of data across all consumers

  • Secure and governed collaboration within and across organizations

  • Sharing data with external parties to monetize your data

Redshift - Data Sharing - Datashare

Working with datashares (opens in a new tab)

  • You can share data at different levels

    • Databases
    • Schemas
    • Tables
    • Views (including regular, late-binding, and materialized views)
    • SQL user-defined functions (UDFs)
Redshift - Security
  • IAM

    • Does not support resource-based policies.
  • Access control

    • Cluster level

      • IAM policies

        For Redshift API actions

      • Security groups

        For network connectivity

    • Database level

      • Database user accounts
  • Data protection

  • Auditing

    • Logs

      • Connection log

        Logs authentication attempts, connections, and disconnections.

      • User log

        Logs information about changes to database user definitions.

      • User activity log

        Logs each query before it's run on the database.

Redshift Spectrum
  • In-place querying of data in S3, without having to load the data into Redshift tables
  • EB scale
  • Pay for the number of bytes scanned
  • Federated query across operational databases, data warehouses, and data lakes
Redshift Spectrum - Comparison with Athena
|               | Redshift Spectrum                                                                  | Athena                                            |
| Performance   | Greater control over performance; uses dedicated resources of your Redshift cluster | Not configurable; uses shared resources managed by AWS |
| Query results | Stored in Redshift                                                                 | Stored in S3                                      |
Redshift Serverless
  • Redshift measures data warehouse capacity in Redshift Processing Units (RPUs). You pay for the workloads you run in RPU-hours on a per-second basis (with a 60-second minimum charge), including queries that access data in open file formats in S3
Redshift Streaming Ingestion
  • Input

    • Kinesis Data Streams
    • MSK

Keyspaces

Neptune

MemoryDB

  • More expensive than ElastiCache
  • Intended as a Redis-compatible, fully managed, in-memory database service instead of a caching service
MemoryDB - Valkey
  • Open source Redis-compatible alternative, as it is forked from Redis
MemoryDB - Redis

Management and Governance

CloudFormation

CloudTrail

CloudWatch

CloudWatch Logs

  • Subscription

    Real-time processing of log data

AWS Config

AWS Config - Config rules

Managed Grafana

Systems Manager

Well-Architected Tool

Networking and Content Delivery

CloudFront

PrivateLink

Route 53

VPC

Security, Identity, and Compliance

IAM

KMS

Macie

Secrets Manager

AWS Shield (opens in a new tab)

  • Protection against DDoS attacks
AWS Shield Standard (opens in a new tab)
  • Operates at L3 and L4.
AWS Shield Advanced (opens in a new tab)
  • Operates at L7

  • Include

    • Shield Standard
    • Certain AWS WAF usage for Shield protected resources
  • DDoS cost protection

    • If any of these protected resources scale up in response to a DDoS attack, you can request Shield Advanced service credits through your regular AWS Support channel.

AWS WAF (opens in a new tab)

  • L7 (HTTP) application level firewall

  • Define a web ACL and then associate it with one or more web application resources that you want to protect.

Storage

AWS Backup

EBS

EFS

S3

S3 Glacier

References