AWS-DAS-C01

Exam Guide

Domain 1: Collection

Task Statement 1.1: Determine the operational characteristics of the collection system

Task Statement 1.2: Select a collection system that handles the frequency, volume, and source of data

Task Statement 1.3: Select a collection system that addresses the key properties of data, such as order, format, and compression

Domain 2: Storage and Data Management

Task Statement 2.1: Determine the operational characteristics of the storage solution for analytics

Services

Analytics

Athena


  • In-place querying of data in S3

  • Serverless

  • PB scale

  • Presto under the hood

  • Supports SQL and Apache Spark

  • Analyze unstructured, semi-structured, and structured data stored in S3, formats including CSV, JSON, Apache Parquet and Apache ORC

  • Successful and canceled queries are billed; failed queries are not.

  • No charge for DDL statements

  • Use cases

    • Ad-hoc queries of web logs, CloudTrail / CloudFront / VPC / ELB logs
    • Querying staging data before loading to Redshift
    • Integration with Jupyter, Zeppelin, RStudio notebooks
    • Integration with QuickSight for visualization
    • Integration via JDBC / ODBC for BI tools
  • SQL
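
A minimal boto3 sketch of the ad-hoc query flow described above. The database, table, result bucket, and workgroup names are placeholders, not anything defined in these notes.

```python
import time
import boto3

athena = boto3.client("athena")

# Hypothetical database, table, and results bucket.
query = athena.start_query_execution(
    QueryString="SELECT status, COUNT(*) FROM web_logs GROUP BY status",
    QueryExecutionContext={"Database": "my_logs_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
    WorkGroup="primary",
)
qid = query["QueryExecutionId"]

# Poll until the query finishes; only successful or canceled queries are billed.
while True:
    state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]
    print(rows[:5])
```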

Athena - Workgroup
  • A workgroup is a collection of users, queries, and results that are associated with a specific data catalog and workgroup settings.
  • Workgroup settings include the location in S3 where query results are stored, the encryption configuration, and the data usage control settings.
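
A hedged sketch of creating a workgroup with the settings listed above (result location, encryption, and a per-query data usage control); the workgroup name and bucket are placeholders.

```python
import boto3

athena = boto3.client("athena")

# Hypothetical workgroup for an analytics team.
athena.create_work_group(
    Name="analytics-team",
    Description="Ad-hoc queries for the analytics team",
    Configuration={
        "ResultConfiguration": {
            "OutputLocation": "s3://my-athena-results/analytics-team/",
            "EncryptionConfiguration": {"EncryptionOption": "SSE_S3"},
        },
        "EnforceWorkGroupConfiguration": True,       # override client-side settings
        "PublishCloudWatchMetricsEnabled": True,
        "BytesScannedCutoffPerQuery": 10 * 1024**3,  # data usage control: 10 GB per query
    },
)
```
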
Athena - Federated Query

AWS Docs - Using Amazon Athena Federated Query

  • Query the data in place or build pipelines that extract data from multiple data sources and store them in S3.
  • Uses data source connectors that run on Lambda to run federated queries.

CloudSearch

AWS Docs - CloudSearch

  • Managed search service, based on Apache Solr

OpenSearch

AWS Docs - OpenSearch

  • Operational analytics and log analytics
  • Forked version of Elasticsearch
  • Sizing rule of thumb: Number of Shards = Index Size / 30 GB
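
A tiny illustration of the sizing rule of thumb above; the 30 GB target shard size is a guideline, not a hard limit.

```python
import math

def shard_count(index_size_gb: float, target_shard_size_gb: float = 30) -> int:
    """Rule-of-thumb primary shard count: index size divided by ~30 GB per shard."""
    return max(1, math.ceil(index_size_gb / target_shard_size_gb))

print(shard_count(600))  # a 600 GB index -> 20 primary shards
```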

EMR

AWS Docs - EMR

  • Managed cluster platform that simplifies running big data frameworks, such as Hadoop and Spark, on AWS.

  • PB scale

  • The RunJobFlow API action creates and starts running a new cluster (job flow). The cluster runs the specified steps; after the steps complete, the cluster stops and the HDFS partition is lost (see the sketch after this list).

  • An EMR cluster with multiple primary nodes can reside only in one AZ.

  • HDFS

    • A typical block size used by HDFS is 128 MB. Thus, an HDFS file is chopped up into 128 MB chunks, and if possible, each chunk will reside on a different DataNode.
  • EMR metrics
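
A minimal boto3 sketch of the RunJobFlow call referenced above: a transient Spark cluster that runs one step and then terminates. The instance types, script location, and the default EMR role names are assumptions.

```python
import boto3

emr = boto3.client("emr")

# Transient cluster: runs the step, then terminates and the HDFS data is lost.
response = emr.run_job_flow(
    Name="nightly-etl",
    ReleaseLabel="emr-6.15.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": False,   # auto-terminate after the steps finish
    },
    Steps=[{
        "Name": "spark-etl",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://my-bucket/jobs/etl.py"],  # placeholder script
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])
```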

EMR cluster - Node types

AWS Docs - EMR cluster - Node types

  • Primary node / Master node

    • Manages the cluster and typically runs primary components of distributed applications.
    • Tracks the status of jobs submitted to the cluster
    • Monitors the health of the instance groups
    • HA support from 5.23.0+
  • Core nodes

    • Store HDFS data and run tasks
    • Multi-node clusters have at least one core node.
    • One core instance group or instance fleet per cluster, which can contain multiple core nodes, each running on its own EC2 instance.
  • Task nodes

    • No HDFS data, therefore no data loss if terminated
    • Can use Spot instances
    • Optional in a cluster
EMR - EC2

AWS Docs - EMR - EC2

EMR - EKS

AWS Docs - EMR - EKS

EMR - Serverless

AWS Docs - EMR - Serverless

EMR - EMRFS
EMR - Security - Apache Ranger
  • An RBAC framework to enable, monitor, and manage comprehensive data security across the Hadoop ecosystem.
  • Centralized security administration and auditing
  • Fine-grained authorization across many Hadoop components (Hadoop, Hive, HBase, Storm, Knox, Solr, Kafka, and YARN)
  • Syncs policies and users by using agents and plugins that run within the same process as the Hadoop component.
  • Supports row-level authorization and auditing capabilities with embedded search.
EMR - Security Configuration

AWS Docs - Security Configuration

EMR Studio
  • Web IDE for fully managed Jupyter notebooks running on EMR clusters
EMR Notebooks
EMR - Open Source Ecosystem
Delta Lake

EMR - Delta Lake

  • Storage layer framework for lakehouse architectures
Ganglia

EMR - Ganglia

  • Performance monitoring tool for Hadoop and HBase clusters
  • Not included in EMR after 6.15.0
Apache HBase

EMR - Apache HBase

  • Wide-column store NoSQL running on HDFS
  • EMR WAL support: write-ahead logs are retained for 30 days from the time the cluster was created, and a new cluster launched against the same S3 root directory can restore from the existing WAL.
Apache HCatalog

EMR - Apache HCatalog

  • HCatalog is a tool that allows you to access Hive metastore tables within Pig, Spark SQL, and/or custom MapReduce applications.
Apache Hudi

EMR - Apache Hudi

  • Open Table Format
  • Brings database and data warehouse capabilities to the data lake
Hue

EMR - Hue

  • Web GUI for Hadoop
Apache Iceberg

EMR - Apache Iceberg

  • Open Table Format
Apache Livy

EMR - Apache Livy

  • A REST interface for Apache Spark
Apache MXNet

EMR - Apache MXNet

  • Retired
  • A deep learning framework for building neural networks and other deep learning applications.
Apache Oozie

EMR - Apache Oozie

  • Workflow scheduler system to manage Hadoop jobs
  • Apache Airflow is a modern alternative.
Apache Phoenix

EMR - Apache Phoenix

  • OLTP and operational analytics in Hadoop for low latency applications
Apache Pig

EMR - Apache Pig

  • Provides a SQL-like scripting language, Pig Latin, and converts those scripts into DAG-based Tez jobs or MapReduce programs.
Apache Sqoop

EMR - Apache Sqoop

  • Retired
  • A tool for transferring data between S3, Hadoop, HDFS, and RDBMS databases.
Apache Tez

EMR - Apache Tez

  • An application execution framework for complex DAGs of tasks, similar to Apache Spark

Glue

Glue

  • Essentially serverless Spark ETL with a Hive-compatible metastore

  • Ingests data in batch while performing transformations

  • Either provide a script to perform the ETL job, or Glue can generate a script automatically.

  • Data source can be an AWS service, such as RDS, S3, DynamoDB, or Kinesis Data Streams, as well as a third-party JDBC-accessible database.

  • Data target can be an AWS service, such as S3, RDS, and DocumentDB, as well as a third-party JDBC-accessible database.

  • Hourly rate charge based on the number of Data Processing Units (or DPUs) used to run your ETL job.

  • Resources

  • Orchestration

    • triggers

      • When fired, a trigger can start specified jobs and crawlers.

      • A trigger fires on demand, based on a schedule, or based on a combination of events.

      • trigger types

        • Scheduled

        • Conditional

          The trigger fires if the watched jobs or crawlers end with the specified statuses.

        • On-demand

    • workflows

    • blueprints

      • AWS Docs - Glue - blueprints

      • Glue blueprints provide a way to create and share Glue workflows. When a complex ETL process could be reused for similar use cases, you can create a single blueprint rather than a separate Glue workflow for each use case.
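
A hedged sketch of the conditional trigger described above, assuming two hypothetical jobs named ingest-job and transform-job.

```python
import boto3

glue = boto3.client("glue")

# Conditional trigger: when the (hypothetical) ingest job succeeds, start the transform job.
glue.create_trigger(
    Name="start-transform-after-ingest",
    Type="CONDITIONAL",
    StartOnCreation=True,
    Predicate={
        "Logical": "AND",
        "Conditions": [
            {"LogicalOperator": "EQUALS", "JobName": "ingest-job", "State": "SUCCEEDED"},
        ],
    },
    Actions=[{"JobName": "transform-job"}],
)
```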

Glue - Job
  • Job type

    • Apache Spark

      • Minimum 2 DPU
      • 10 DPU (default)
    • Apache Spark Streaming

      • Minimum 2 DPU (default)
    • Python Shell

      • 1 DPU or 0.0625 DPU (default)
  • Job Bookmarks

    • Glue tracks data that has already been processed during a previous run of an ETL job by persisting state information from the job run. This persisted state information is called a job bookmark.
Glue - Worker
  • A single DPU is also called a worker.

  • A DPU is a relative measure of processing power that consists of 4 vCPU of compute capacity and 16 GB of memory.

  • An M-DPU is a DPU with 4 vCPU and 32 GB of memory.

  • Worker type

    • For Glue 2.0+

      Specify Worker type and Number of workers

      • G.1X (default)

        • 1 DPU
        • Recommended for most workloads with cost effective performance
      • G.2X

        • 2 DPU
        • Recommended for most workloads with cost effective performance
      • G.4X

        • 4 DPU
        • Glue 3.0+ Spark jobs
      • G.8X

        • 8 DPU
        • Recommended for workloads with the most demanding transformations, aggregations, and joins
        • Glue 3.0+ Spark jobs
      • G.025X

        • 0.25 DPU
        • Recommended for low volume streaming jobs
        • Glue 3.0+ Spark Streaming jobs
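
A minimal boto3 sketch that ties together the job type, worker type, and job bookmark settings above; the job name, IAM role ARN, and script location are placeholders.

```python
import boto3

glue = boto3.client("glue")

# Spark ETL job definition with placeholder names.
glue.create_job(
    Name="orders-etl",
    Role="arn:aws:iam::123456789012:role/GlueJobRole",
    GlueVersion="4.0",
    Command={
        "Name": "glueetl",                     # Spark job type ("pythonshell" for Python Shell)
        "ScriptLocation": "s3://my-bucket/scripts/orders_etl.py",
        "PythonVersion": "3",
    },
    WorkerType="G.1X",                         # 1 DPU per worker
    NumberOfWorkers=10,
    DefaultArguments={
        "--job-bookmark-option": "job-bookmark-enable",  # only process new data on each run
    },
)

glue.start_job_run(JobName="orders-etl")
```
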
Glue - Data Catalog

AWS Docs - Glue Data Catalog

  • Functionally similar to a schema registry

  • API compatible with Hive Metastore

  • lakeFS - Metadata Management: Hive Metastore vs AWS Glue

  • Glue Data Catalog - database & table

    • Databases and tables are objects in the Data Catalog that contain metadata definitions.
    • The schema of your data is represented in your Glue table definition. The actual data remains in its original data store
Glue - Data Catalog - crawler

AWS Docs - Glue Data Catalog - crawler

  • A crawler can crawl multiple data stores in a single run. Upon completion, the crawler creates or updates one or more tables in your Data Catalog. ETL jobs that you define in AWS Glue use these Data Catalog tables as sources and targets.

  • A classifier recognizes the format of your data and generates a schema. It returns a certainty number between 0.0 and 1.0, where 1.0 means 100% certainty. Glue uses the output of the classifier that has the highest certainty.

  • If no classifier returns a certainty greater than 0.0, Glue returns the default classification string of UNKNOWN.

  • Steps

    • A crawler runs any custom classifiers that you choose to infer the format and schema of your data.
    • The crawler connects to the data store.
    • The inferred schema is created for your data.
    • The crawler writes metadata to the Glue Data Catalog.
  • Partition threshold

    If the following conditions are met, the schemas are denoted as partitions of a table.

    • The partition threshold is higher than 0.7 (70%).
    • The maximum number of different schemas doesn't exceed 5.

    Learn how AWS Glue crawler detects the schema | AWS re:Post
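
A hedged sketch of defining and starting the crawler described above; the crawler name, role, database, S3 path, and schedule are placeholders.

```python
import boto3

glue = boto3.client("glue")

# Crawl a (placeholder) S3 prefix nightly and register the tables in the Data Catalog.
glue.create_crawler(
    Name="web-logs-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="my_logs_db",
    Targets={"S3Targets": [{"Path": "s3://my-bucket/web-logs/"}]},
    Schedule="cron(0 2 * * ? *)",              # every day at 02:00 UTC
    SchemaChangePolicy={
        "UpdateBehavior": "UPDATE_IN_DATABASE",
        "DeleteBehavior": "LOG",
    },
)

glue.start_crawler(Name="web-logs-crawler")
```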

Glue - Data Catalog - security
  • You can only use a Glue resource policy to manage permissions for Data Catalog resources.
  • You can't attach it to any other Glue resources such as jobs, triggers, development endpoints, crawlers, or classifiers.
  • Only one resource policy is allowed per catalog, and its size is limited to 10 KB.
  • Each AWS account has one single catalog per Region whose catalog ID is the same as the AWS account ID.
  • You cannot delete or modify a catalog.
Glue Data Quality

AWS Docs - Glue Data Quality

Glue Streaming

AWS Docs - Glue Streaming

  • Using the Apache Spark Streaming framework
  • Near-real-time data processing

Glue DataBrew

AWS Docs - Glue DataBrew

Visual data preparation tool that enables users to clean and normalize data without writing any code.

Kinesis Data Streams

AWS Docs - Kinesis Data Streams

  • Key points

    • Equivalent to Kafka, for event streaming, no ETL
    • On-demand or provisioned mode
    • PaaS with API access for developers
    • The maximum size of a record's data payload, before base64 encoding, is 1 MB.
    • Retention Period by default is 1 day, and can be up to 365 days.
  • Shard

    • A stream is composed of one or more shards.

    • A shard is a uniquely identified sequence of data records in a stream.

    • All the data in the shard is sent to the same worker that is processing the shard.

    • Read throughput

      • Up to 2 MB/second/shard, 5 transactions (API calls)/second/shard, or 2,000 records/second/shard, shared across all the consumers reading from a given shard
      • Each call to GetRecords counts as 1 read transaction.
      • GetRecords can retrieve up to 10 MB per transaction per shard and up to 10,000 records per call. If a call to GetRecords returns 10 MB, subsequent calls made within the next 5 seconds throw an exception.
      • Up to 20 registered consumers (enhanced fan-out limit) per data stream

    • Write throughput

      • Up to 1 MB/second/shard or 1,000 records/second/shard

    • Record aggregation combines multiple records into a single Kinesis Data Streams record, improving per-shard throughput.

    • Not scaled automatically; resharding is the process used to scale your data stream through a series of shard splits or merges.

  • Partition key

    • All data records with the same partition key map to the same shard.
    • The number of partition keys should typically be much greater than the number of shards (a many-to-one mapping of partition keys to shards). See the producer sketch after this list.
  • Idempotency

    • 2 primary causes for duplicate records

      • Producer retries
      • Consumer retries
    • Consumer retries are more common than producer retries.

    • Ordering

      • Kinesis Data Streams does not guarantee the order of records across shards in a stream.
      • Kinesis Data Streams does guarantee the order of records within a shard.
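
A minimal producer sketch illustrating partition keys and explicit resharding, as referenced above; the stream name and payload are placeholders.

```python
import json
import boto3

kinesis = boto3.client("kinesis")

# Records with the same partition key land on the same shard, preserving their order.
event = {"customer_id": "c-42", "action": "click"}
kinesis.put_record(
    StreamName="clickstream",                   # placeholder stream name
    Data=json.dumps(event).encode("utf-8"),     # max 1 MB before base64 encoding
    PartitionKey=event["customer_id"],
)

# Streams are not scaled automatically; resharding is an explicit operation.
kinesis.update_shard_count(
    StreamName="clickstream",
    TargetShardCount=4,
    ScalingType="UNIFORM_SCALING",
)
```
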
Kinesis Data Streams - Producers
Kinesis Data Streams - Producers - KPL

AWS Docs - KPL

  • Batching

    • Aggregation

      Storing multiple payloads within a single Kinesis Data Streams record.

    • Collection

      Batching multiple Kinesis Data Streams records into a single PutRecords API call to reduce HTTP requests.

  • Rate Limiting

    • Limits per-shard throughput sent from a single producer
    • Implemented using a token bucket algorithm with separate buckets for both Kinesis Data Streams records and bytes.
  • Pros

    • Higher throughput due to aggregation and compression of records
    • Abstraction over the low-level API to simplify producer code
  • Cons

    • Higher latency due to additional processing delay, up to the configurable RecordMaxBufferedTime
    • Only supports AWS SDK v1
Kinesis Data Streams - Producers - Kinesis Agent

AWS Docs - Kinesis Agent

  • Stand-alone Java application running as a daemon
  • Continuously monitors a set of files and sends new data to your stream.
Kinesis Data Streams - Consumers
  • KCL

    • Tasks

      • Connects to the data stream
      • Enumerates the shards within the data stream
      • Uses leases to coordinate shard associations with its workers
      • Instantiates a record processor for every shard it manages
      • Pulls data records from the data stream
      • Pushes the records to the corresponding record processor
      • Checkpoints processed records
      • Balances shard-worker associations (leases) when the worker instance count changes or when the data stream is resharded (shards are split or merged)
    • Versions

      • KCL 1.x is based on AWS SDK v1, and KCL 2.x is based on AWS SDK v2

      • KCL 1.x

        • Java
        • Node.js
        • .NET
        • Python
        • Ruby
      • KCL 2.x

        • Java
        • Python
    • Multistream processing is only supported in KCL 2.x for Java (KCL 2.3+ for Java)

    • You should ensure that the number of KCL instances does not exceed the number of shards (except for failure standby purposes)

    • One KCL worker is deployed on one EC2 instance, able to process multiple shards.

    • A KCL worker runs as a process and it has multiple record processors running as threads within it.

    • One Shard has exactly one corresponding record processor.

    • KCL 2.x enables you to create KCL consumer applications that can process more than one data stream at the same time.

    • Checkpoints sequence number for the shard in the lease table to track processed records

    • Enhanced fan-out

      • Consumers using enhanced fan-out have dedicated throughput of up to 2 MB/second/shard, and without enhanced fan-out, all consumers reading from the same shard share the throughput of 2 MB/second/shard.
      • Requires KCL 2.x and Kinesis Data Streams with enhanced fan-out enabled
      • You can register up to 20 consumers per stream to use enhanced fan-out.
      • Kinesis Data Streams pushes data records from the stream to consumers using enhanced fan-out over HTTP/2, therefore no polling and lower latency.
    • Lease Table

      • At any given time, each shard of data records is bound to a particular KCL worker by a lease identified by the leaseKey variable.
      • By default, a worker can hold one or more leases (subject to the value of the maxLeasesForWorker variable) at the same time.
      • A DynamoDB table to keep track of the shards in a Kinesis Data Stream that are being leased and processed by the workers of the KCL consumer application.
      • Each row in the lease table represents a shard that is being processed by the workers of your consumer application.
      • KCL uses the name of the consumer application to create the name of the lease table that this consumer application uses, therefore each consumer application name must be unique.
      • One lease table for one KCL consumer application
      • KCL creates the lease table with a provisioned throughput of 10 reads / second and 10 writes / second
  • Kinesis Data Firehose

  • Managed Apache Flink

  • Troubleshooting
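
For contrast with the KCL, a minimal sketch of the low-level polling path (GetShardIterator / GetRecords) that the shared per-shard read limits above apply to; the stream name is a placeholder, and error handling and checkpointing are omitted.

```python
import time
import boto3

kinesis = boto3.client("kinesis")
stream = "clickstream"   # placeholder stream name

# Read one shard with the shared-throughput (polling) consumer model:
# GetRecords counts against the 2 MB/s and 5 calls/s per-shard read limits.
shard_id = kinesis.list_shards(StreamName=stream)["Shards"][0]["ShardId"]
iterator = kinesis.get_shard_iterator(
    StreamName=stream,
    ShardId=shard_id,
    ShardIteratorType="TRIM_HORIZON",   # start from the oldest record in the shard
)["ShardIterator"]

while iterator:
    response = kinesis.get_records(ShardIterator=iterator, Limit=1000)
    for record in response["Records"]:
        print(record["SequenceNumber"], record["Data"])
    iterator = response["NextShardIterator"]
    time.sleep(1)                       # stay under 5 GetRecords calls/second/shard
```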

Kinesis Data Firehose

AWS Docs - Kinesis Data Firehose

  • Serverless, fully managed with automatic scaling

  • Pre-built connectors for various sources and destinations, without coding, similar to Kafka Connect

  • For data latency of 60 seconds or higher.

  • Sources

    • Kinesis Data Streams

    • MSK

      • Can only use S3 as destination
    • Direct PUT

      • Use PutRecord and PutRecordBatch API to send data
      • If the source is Direct PUT, Firehose will retain data for 24 hours.
  • Destinations - AWS Docs - Data Firehose - Source, Destination, and Name

    • S3
    • Redshift
    • OpenSearch
    • Custom HTTP Endpoint
    • 3rd-party service
  • Buffering hints

    AWS Docs - Buffering hints

    • Buffering size in MBs (destination specific)

      • 1 MB to 128 MB
      • Default is 5 MB
    • Buffering interval in seconds (destination specific)

      • 0 to 900
      • Default is 60 / 300 seconds
  • Data transformation is not natively supported; use a Lambda function (see the Lambda sketch after this list)

    • Lambda

      • Custom transformation logic

      • Lambda invocation time of up to 5 min.

      • Error handling

        • Retry the invocation 3 times by default.
        • After retries, if the invocation is unsuccessful, Firehose then skips that batch of records. The skipped records are treated as unsuccessfully processed records.
        • The unsuccessfully processed records are delivered to your S3 bucket.
    • Data format conversion

      • Firehose can convert JSON records to Apache Parquet or ORC using a schema from a table defined in Glue
      • Non-JSON data must first be converted to JSON by invoking a Lambda function
  • Dynamic Partitioning

    • Dynamic partitioning enables you to continuously partition streaming data in Firehose by using keys within data (for example, customer_id or transaction_id) and then deliver the data grouped by these keys into corresponding S3 prefixes.
    • Partitioning your data minimizes the amount of data scanned, optimizes performance, and reduces costs of your analytics queries on S3.
  • Server-side encryption

    • Can be enabled depending on the source of data

    • Kinesis Data Streams as the Source

      • Kinesis Data Streams first decrypts the data and then sends it to Firehose.
      • Firehose buffers the data in memory and then delivers without storing the unencrypted data at rest.
    • With Direct PUT or other Data Sources

      • Can be turned on by using the StartDeliveryStreamEncryption operation.
      • To stop server-side-encryption, use the StopDeliveryStreamEncryption operation.
  • Troubleshooting
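
A minimal sketch of a transformation Lambda following the record format Firehose expects back from a data-transformation function (recordId, result, base64-encoded data); the enrichment itself is hypothetical.

```python
import base64
import json

def lambda_handler(event, context):
    """Firehose data-transformation Lambda: enrich each record and re-emit it."""
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        payload["processed"] = True                      # hypothetical enrichment
        output.append({
            "recordId": record["recordId"],              # must echo the incoming recordId
            "result": "Ok",                              # Ok | Dropped | ProcessingFailed
            "data": base64.b64encode(
                (json.dumps(payload) + "\n").encode("utf-8")
            ).decode("utf-8"),
        })
    return {"records": output}
```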

Kinesis Data Analytics

  • Input

    • Kinesis Data Streams
    • Kinesis Data Firehose
  • Output

    • Kinesis Data Streams
    • Kinesis Data Firehose
    • Lambda
Kinesis Data Analytics - Managed Apache Flink
  • Equivalent to Kafka Streams, using Apache Flink behind the scenes; online stream processing, namely streaming ETL

Managed Service for Apache Flink

  • Streaming applications built with APIs in Java, Scala, Python, and SQL

  • Interactive notebooks

Kinesis Data Analytics - SQL (legacy)

Kinesis Data Analytics for SQL Applications

Kinesis Data Analytics Studio

Kinesis Data Analytics Studio

  • Based on

    • Apache Zeppelin

    • Apache Flink

Kinesis Video Streams

Kinesis Video Streams

  • Out of scope for the exam

Lake Formation

Lake Formation - Ingestion
  • Blueprints are predefined ETL workflows that you can use to ingest data into your Data Lake.

  • Workflows are instances of ingestion blueprints in Lake Formation.

  • Blueprint types

    • Database snapshot

      • Loads or reloads data from all tables into the data lake from a JDBC source.

      • You can exclude some data from the source based on an exclude pattern.

      • When to use:

        • Schema evolution is flexible. (Columns are re-named, previous columns are deleted, and new columns are added in their place.)
        • Complete consistency is needed between the source and the destination.
    • Incremental database

      • Loads only new data into the data lake from a JDBC source, based on previously set bookmarks.

      • You specify the individual tables in the JDBC source database to include.

      • When to use:

        • Schema evolution is incremental. (There is only successive addition of columns.)
        • Only new rows are added; previous rows are not updated.
    • Log file

      Bulk loads data from log file sources

Lake Formation - Permission management
  • Permission types

    • Metadata access

      Data Catalog permissions

    • Underlying data access

      • Data access permissions

        Data access permissions (SELECT, INSERT, and DELETE) on Data Catalog tables that point to that location.

      • Data location permissions

        The ability to create Data Catalog resources that point to particular S3 locations.

        When you grant the CREATE_TABLE or ALTER permission to a principal, you also grant data location permissions to limit the locations for which the principal can create or alter metadata tables.

  • Permission model

    • IAM permissions model consists of IAM policies.

    • Lake Formation permissions model is implemented as DBMS-style GRANT/REVOKE commands.

    • Principals

      • IAM users and roles
      • SAML users and groups
      • IAM Identity Center
      • External accounts
    • Resources

      • LF-Tags

        Using LF-Tags can greatly reduce the number of grants compared with granting on named resources individually.

      • Named data catalog resources

    • Permissions

      • Database permissions
      • Table permissions
    • Data filtering

      • Column-level security

        Allows users to view only specific columns and nested columns that they have access to in the table.

      • Row-level security

        Allows users to view only specific rows of data that they have access to in the table.

      • Cell-level security

        By using both row filtering and column filtering at the same time

        Can restrict access to different columns depending on the row.
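
A hedged sketch of the DBMS-style GRANT model above using the Lake Formation API; the principals, database, table, and S3 location are placeholders.

```python
import boto3

lf = boto3.client("lakeformation")

# Grant SELECT on a Data Catalog table to a (placeholder) analyst role.
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/AnalystRole"},
    Resource={"Table": {"DatabaseName": "sales_db", "Name": "orders"}},
    Permissions=["SELECT"],
    PermissionsWithGrantOption=[],
)

# Data location permission: allow a (placeholder) ETL role to create tables pointing at this prefix.
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/EtlRole"},
    Resource={"DataLocation": {"ResourceArn": "arn:aws:s3:::my-data-lake/curated"}},
    Permissions=["DATA_LOCATION_ACCESS"],
)
```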

QuickSight

AWS Docs - QuickSight

  • Key points

  • Editions

    • Standard edition

    • Enterprise edition

      • VPC connection

        Access to data source in your VPC without the need for a public IP address

      • Microsoft Active Directory SSO

      • Data stored in SPICE is encrypted at rest.

      • For SQL-based data sources, such as Redshift, Athena, PostgreSQL, or Snowflake, you can refresh your data incrementally within a look-back window of time.

  • Datasets

    • Dataset refresh can be scheduled or on-demand.

    • SPICE

      • In-memory
    • Direct SQL query

  • Visual types

  • AutoGraph

    You create a visual by choosing AutoGraph and then selecting fields.

Migration and Transfer

DMS (Database Migration Service)

DataSync

  • AWS Docs - DataSync
  • One-off or scheduled online data transfer; moving hot or cold data should not impede your business.
  • Move data from on-premises to AWS
  • Move data between AWS services
  • Requires DataSync agent installation on-premises

Snow Family

Snowball Edge Storage Optimized
  • For large-scale data migrations and recurring transfer workflows, as well as local computing with higher capacity needs
Snowball Edge Compute Optimized
  • For use cases such as machine learning, full motion video analysis, analytics, and local computing stacks
Snowmobile
  • Intended for more than 10 PB data migrations from a single location
  • Up to 100 PB per Snowmobile
  • Shipping container

Transfer Family

  • Transfer flat files using SFTP, FTPS, and FTP (SFTP runs over SSH)

Storage Gateway

Storage Gateway - Amazon S3 File Gateway
Storage Gateway - Amazon FSx File Gateway
Storage Gateway - Tape Gateway
  • AWS Docs - Tape Gateway User Guide

  • Cloud-backed virtual tape storage

  • Tape Gateway presents an iSCSI-based virtual tape library (VTL) of virtual tape drives and a virtual media changer to your on-premises backup application.

  • Tape Gateway stores your virtual tapes in S3 and creates new ones automatically.

Storage Gateway - Volume Gateway

Data Pipeline

  • Define data-driven workflows for moving and transforming data from various sources to destinations, similar to Airflow

  • Managed ETL service using EMR under the hood

  • Batch processing

  • Maintenance mode, workloads can be migrated to

    • AWS Glue
    • AWS Step Functions
    • Amazon MWAA (Amazon Managed Workflows for Apache Airflow)

Application Integration

Step Functions

SQS

Kinesis Data Streams vs SQS
| | MSK / Kinesis Data Streams | SQS |
| --- | --- | --- |
| Consumption | Can be consumed many times | Records are deleted after being consumed |
| Retention | Retention period configurable from 24 hours to 1 year | Retention period configurable from 1 minute to 14 days |
| Ordering | Ordering of records is preserved within the same shard | FIFO queues preserve ordering of records |
| Scaling | Manual resharding | Auto scaling |
| Delivery | At-least-once delivery | Exactly-once processing (FIFO queues) |
| Replay | Can replay | Cannot replay |
| Payload size | 1 MB | 256 KB |

Database

Redshift

Redshift - Cluster
  • The maximum number of nodes per cluster depends on the node type (up to 32 or 128 nodes; see the node type tables below).

  • Leader node and compute nodes use the same node type, and the leader node is provisioned automatically.

  • Leader node (in multi-node clusters)

    • Parses queries, builds the optimal execution plan, and compiles code from the execution plan
    • Receives and aggregates the results from the compute nodes
  • Compute node

    • Node slices

      A compute node is partitioned into slices. Each slice is allocated a portion of the node's memory and disk space, where it processes a portion of the workload assigned to the node.

  • Enhanced VPC routing

    • Routes network traffic between your cluster and data repositories through a VPC, instead of through the internet.
Redshift - Cluster - Elastic resize
Redshift - Cluster - Classic resize
  • When you need to change the cluster size or the node type of an existing cluster and elastic resize is not supported
  • Takes much longer than elastic resize
Redshift - Cluster - Node type
  • RA3

    | Node Size | vCPU | RAM (GiB) | Managed Storage quota per node | Number of nodes per cluster |
    | --- | --- | --- | --- | --- |
    | ra3.xlplus | 4 | 32 | 32 TB | 1 to 32 |
    | ra3.4xlarge | 12 | 96 | 128 TB | 2 to 32 |
    | ra3.16xlarge | 48 | 384 | 128 TB | 2 to 128 |
    • Pay for the managed storage and compute separately
    • Managed storage uses SSD for local cache and S3 for persistence and automatically scales based on the workload
    • Compute is separated from managed storage and can be scaled independently
    • RA3 nodes are optimized for performance and cost-effectiveness
  • DC2 (Dense Compute)

    | Node Size | vCPU | RAM (GiB) | Storage per node | Number of nodes per cluster |
    | --- | --- | --- | --- | --- |
    | dc2.large | 2 | 15 | 160 GB | 1 to 32 |
    | dc2.8xlarge | 32 | 244 | 2.6 TB | 2 to 128 |
    • Compute-intensive local SSD storage for high performance and low latency
    • Recommended for datasets under 10 TB
  • DS2 (Dense Storage)

    • Deprecated legacy option, for low-cost workloads with HDD storage, RA3 is recommended for new workloads
  • Resources

Redshift - Cluster - HA
  • AZ

    • Single AZ by default
    • Multi AZ is supported
Redshift - Distribution style

Distribution style

  • Redshift distribution keys (DISTKEY) determine where data is stored in Redshift. A cluster stores table data across its compute nodes, and query performance suffers when a large amount of data is concentrated on a single node.

  • Uneven distribution of data across compute nodes skews the work each node has to do and leaves some nodes under-utilized, so data should be distributed uniformly.

  • Data should be distributed in such a way that the rows that participate in joins are already on the same node with their joining rows in other tables. This is called collocation, which reduces data movement and improves query performance.

  • Distribution style is configured per table, so you can select a different distribution style for each table.

  • AUTO distribution

    • Redshift automatically distributes the data across the nodes in the cluster
    • Redshift initially assigns ALL distribution to a small table, then changes to EVEN distribution when the table grows larger.
    • The default when no distribution style is specified; appropriate when there is no clear choice among the other styles
  • EVEN distribution

    • The leader node distributes the rows across the slices in a round-robin fashion, regardless of the values in any particular column.
    • Appropriate when a table doesn't participate in joins. It's also appropriate when there isn't a clear choice between KEY distribution and ALL distribution.
  • KEY distribution

    • The rows are distributed according to the values in one column.
    • All the entries with the same value in the column end up on the same slice.
    • Suitable for large tables or tables with a large number of distinct values
  • ALL distribution

    • A copy of the entire table is distributed to every node.
    • Suitable for relatively small, slow-moving tables that participate in joins
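
A minimal sketch, via the Redshift Data API, of choosing KEY distribution and a sort key at table creation; the cluster, database, user, and table are placeholders.

```python
import boto3

rsd = boto3.client("redshift-data")

# KEY distribution collocates rows that join on customer_id; the sort key speeds
# range-restricted scans on order_date. All identifiers are placeholders.
ddl = """
CREATE TABLE orders (
    order_id     BIGINT,
    customer_id  BIGINT,
    order_date   DATE,
    total        DECIMAL(12,2)
)
DISTSTYLE KEY
DISTKEY (customer_id)
COMPOUND SORTKEY (order_date);
"""

rsd.execute_statement(
    ClusterIdentifier="analytics-cluster",
    Database="dev",
    DbUser="admin",
    Sql=ddl,
)
```
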
Redshift - Concurrency scaling
  • Concurrency scaling adds transient clusters to your cluster to handle concurrent requests with consistency and fast performance in a matter of seconds.
Redshift - Sort keys

Sort keys

  • When you create a table, you can define one or more of its columns as sort keys.
  • Either a compound or interleaved sort key.
Redshift - Compression

AWS Docs - Working with column compression

  • You can apply a compression type, or encoding, to the columns in a table manually when you create the table.
  • The COPY command analyzes your data and applies compression encodings to an empty table automatically as part of the load operation.
  • Split your load data files so that the files are about equal size, between 1 MB and 1 GB after compression. For optimum parallelism, the ideal size is between 1 MB and 125 MB after compression.
  • The number of files should be a multiple of the number of slices in your cluster.
  • When loading data, it is strongly recommended that you individually compress your load files using gzip, lzop, bzip2, or Zstandard when you have large datasets.
Redshift - Workload Management (WLM)
  • AWS Docs - Amazon Redshift - Implementing workload management

  • Automatic WLM

  • Manual WLM

    • Query Groups

      • Memory percentage and concurrency settings
    • Query Monitoring Rules

      • Query Monitoring Rules define metrics-based boundaries for queries in a queue and the action to take (log, hop, or abort) when a query exceeds those boundaries.
    • Query Queues

      • Memory percentage and concurrency settings
  • Concurrency scaling

    • Concurrency scaling automatically adds and removes transient cluster capacity to handle spikes in demand.
    • Concurrency scaling is enabled by default for all RA3 and DC2 clusters.
    • Concurrency scaling is not available for DS2 clusters.
  • Short query acceleration

    • Short query acceleration (SQA) prioritizes selected short-running queries ahead of longer-running queries.
    • SQA runs short-running queries in a dedicated space, so that SQA queries aren't forced to wait in queues behind longer queries.
    • CREATE TABLE AS (CTAS) statements and read-only queries, such as SELECT statements, are eligible for SQA.
Redshift - User-Defined Functions (UDF)
Redshift - Snapshots
  • Automated

    • Enabled by default when you create a cluster.
    • By default, about every 8 hours or following every 5 GB per node of data changes, whichever comes first.
    • Default retention period is 1 day
    • Cannot be deleted manually
  • Manual

    • By default, manual snapshots are retained indefinitely, even after you delete your cluster.
    • AWS CLI: create-cluster-snapshot
    • AWS API: CreateClusterSnapshot
  • Backup

    • You can configure Redshift to automatically copy snapshots (automated or manual) for a cluster to another Region.
    • Only one destination Region at a time can be configured for automatic snapshot copy.
    • To change the destination Region that you copy snapshots to, first disable the automatic copy feature. Then re-enable it, specifying the new destination Region.
  • Snapshot copy grant

    • KMS keys are specific to a Region. If you enable copying of Redshift snapshots to another Region, and the source cluster and its snapshots are encrypted using a master key from KMS, you need to configure a grant for Redshift to use a master key in the destination Region.
    • The grant is created in the destination Region and allows Redshift to use the master key in the destination Region.
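
A minimal sketch of taking a manual snapshot and enabling cross-Region snapshot copy, as described above; the identifiers, destination Region, and grant name are placeholders (the grant is only needed for KMS-encrypted clusters).

```python
import boto3

redshift = boto3.client("redshift")

# Manual snapshot (retained indefinitely unless a retention period is set).
redshift.create_cluster_snapshot(
    SnapshotIdentifier="analytics-cluster-2024-01-01",
    ClusterIdentifier="analytics-cluster",
)

# Cross-Region snapshot copy to a single destination Region.
redshift.enable_snapshot_copy(
    ClusterIdentifier="analytics-cluster",
    DestinationRegion="us-west-2",
    RetentionPeriod=7,
    SnapshotCopyGrantName="my-snapshot-copy-grant",
)
```
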
Redshift - Query
  • Data API

    • Available in AWS SDK, based on HTTP and JSON, no persistent connection needed
    • API calls are asynchronous
    • Uses either credentials stored in Secrets Manager or temporary database credentials, no need to pass passwords in the API calls
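
A minimal sketch of the asynchronous Data API flow described above, using a Secrets Manager secret for credentials; the cluster, database, secret ARN, and query are placeholders.

```python
import time
import boto3

rsd = boto3.client("redshift-data")

# Asynchronous query through the Data API: no persistent JDBC/ODBC connection to manage.
stmt = rsd.execute_statement(
    ClusterIdentifier="analytics-cluster",
    Database="dev",
    SecretArn="arn:aws:secretsmanager:us-east-1:123456789012:secret:redshift-admin",
    Sql="SELECT order_date, SUM(total) FROM orders GROUP BY order_date",
)

# Poll for completion, then fetch the result set.
while True:
    desc = rsd.describe_statement(Id=stmt["Id"])
    if desc["Status"] in ("FINISHED", "FAILED", "ABORTED"):
        break
    time.sleep(1)

if desc["Status"] == "FINISHED":
    result = rsd.get_statement_result(Id=stmt["Id"])
    print(result["Records"][:5])
```
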
Redshift - Security
  • IAM

    • Does not support resource-based policies.
  • Access control

    • Cluster level

      • IAM policies

        For Redshift API actions

      • Security groups

        For network connectivity

    • Database level

      • Database user accounts
  • Data protection

  • Auditing

    • Logs

      • Connection log

        Logs authentication attempts, connections, and disconnections.

      • User log

        Logs information about changes to database user definitions.

      • User activity log

        Logs each query before it's run on the database.

Redshift Spectrum
  • In-place querying of data in S3, without having to load the data into Redshift tables
  • EB scale
  • Pay for the number of bytes scanned
  • Federated query across operational databases, data warehouses, and data lakes
Redshift Spectrum - Comparison with Athena
| | Redshift Spectrum | Athena |
| --- | --- | --- |
| Performance | Greater control over performance; uses dedicated resources of your Redshift cluster | Not configurable; uses shared resources managed by AWS |
| Query Results | Stored in Redshift | Stored in S3 |
Redshift Serverless
  • Redshift Serverless measures data warehouse capacity in Redshift Processing Units (RPUs). You pay for the workloads you run in RPU-hours on a per-second basis (with a 60-second minimum charge), including queries that access data in open file formats in S3.

Keyspaces

Neptune

Amazon Security Lake

  • Fully managed security data lake service
  • Centralize security data from AWS environments, SaaS providers, on premises, cloud sources, and third-party sources into a purpose-built data lake that's stored in your AWS account.

Elemental MediaStore

  • High performance for streaming media delivery

References