AWS-DEA-C01

Services

Analytics

Athena

Athena (opens in a new tab)

  • In-place querying of data in S3

  • Serverless

  • PB scale

  • Presto (opens in a new tab) under the hood

  • Supports SQL and Apache Spark

  • Analyze unstructured, semi-structured, and structured data stored in S3, formats including CSV, JSON, Apache Parquet and Apache ORC

  • Only successful or canceled queries are billed, while failed queries are not billed.

  • No charge for DDL statements

  • Use cases

    • Ad-hoc queries of web logs, CloudTrail / CloudFront / VPC / ELB logs
    • Querying staging data before loading to Redshift
    • Integration with Jupyter, Zeppelin, RStudio notebooks
    • Integration with QuickSight for visualization
    • Integration via JDBC / ODBC for BI tools
  • Resources

Athena - Athena SQL
  • MSCK REPAIR TABLE refreshes partition metadata with the current partitions of an external table.

    Repairing partitions manually using MSCK repair (opens in a new tab)

  • SerDe (opens in a new tab)

    • When you specify ROW FORMAT DELIMITED, Athena uses the LazySimpleSerDe by default.

      -- e.g. ROW FORMAT DELIMITED
      ROW FORMAT DELIMITED
      FIELDS TERMINATED BY ','
      ESCAPED BY '\\'
      COLLECTION ITEMS TERMINATED BY '|'
      MAP KEYS TERMINATED BY ':'
    • Use ROW FORMAT SERDE to explicitly specify the type of SerDe that Athena should use when it reads and writes data to the table.

      -- e.g. ROW FORMAT SERDE
      ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
      WITH SERDEPROPERTIES (
      'serialization.format' = ',',
      'field.delim' = ',',
      'collection.delim' = '|',
      'mapkey.delim' = ':',
      'escape.delim' = '\\'
      )
  • UNLOAD (opens in a new tab)

    • Writes query results from a SELECT statement to the specified data format. Supported formats for UNLOAD include Parquet, ORC, Avro, and JSON.
    • The UNLOAD statement is useful when you want to output the results of a SELECT query in a non-CSV format but do not require the associated table.
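
A minimal boto3 sketch of running the UNLOAD statement described above; the database, table, bucket names, and workgroup are hypothetical placeholders, not values from these notes.

    import boto3

    athena = boto3.client("athena")

    # Write query results to S3 as Parquet instead of the default CSV result file.
    unload_sql = """
    UNLOAD (SELECT request_ip, status FROM web_logs WHERE year = '2024')
    TO 's3://example-results-bucket/unload/web_logs/'
    WITH (format = 'PARQUET', compression = 'SNAPPY')
    """

    response = athena.start_query_execution(
        QueryString=unload_sql,
        QueryExecutionContext={"Database": "example_db"},
        ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
        WorkGroup="primary",
    )
    print(response["QueryExecutionId"])
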
Athena - Workgroup
  • A workgroup is a collection of users, queries, and results that are associated with a specific data catalog and workgroup settings.
  • Workgroup settings include the location in S3 where query results are stored, the encryption configuration, and the data usage control settings.
Athena - Federated Query

AWS Docs - Using Amazon Athena Federated Query (opens in a new tab)

  • Query the data in place or build pipelines that extract data from multiple data sources and store them in S3.
  • Uses data source connectors that run on Lambda to run federated queries.

OpenSearch

AWS Docs - OpenSearch (opens in a new tab)

  • Operational analytics and log analytics
  • Forked version of Elasticsearch
  • Number of Shards = Index Size / 30GB
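
A quick sketch of the shard-count guideline above; it is a sizing rule of thumb, not an exact OpenSearch requirement, and the 30 GB target shard size is the figure quoted in these notes.

    import math

    def estimated_shards(index_size_gb: float, target_shard_size_gb: float = 30) -> int:
        # Number of Shards ~= Index Size / 30 GB, rounded up and at least 1
        return max(1, math.ceil(index_size_gb / target_shard_size_gb))

    print(estimated_shards(300))  # a 300 GB index -> about 10 primary shards
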
OpenSearch - Index
  • Storage

    • UltraWarm

      To store large amounts of read-only data

    • Cold

      If you need to do periodic research or forensic analysis on your older data.

    • Hot

    • OR1 (OpenSearch Optimized Instance)

      A domain with OR1 instances uses EBS gp3 or io1 volumes for primary storage, with data copied synchronously to S3 as it arrives.

  • Cross-cluster replication (opens in a new tab)

    • Cross-cluster replication follows a pull model; you initiate connections from the follower domain.

EMR

AWS Docs - EMR (opens in a new tab)

  • Managed cluster platform that simplifies running big data frameworks, such as Hadoop and Spark, on AWS.

  • PB scale

  • RunJobFlow API creates and starts running a new cluster (job flow). The cluster runs the steps specified. After the steps complete, the cluster stops and the HDFS partition is lost.

  • An EMR cluster with multiple primary nodes can reside only in one AZ.

  • HDFS

    • A typical block size used by HDFS is 128 MB. Thus, an HDFS file is chopped up into 128 MB chunks, and if possible, each chunk will reside on a different DataNode.
  • EMR metrics (opens in a new tab)
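
A hedged boto3 sketch of RunJobFlow for a transient cluster that terminates once its step completes; the cluster name, release label, instance types, roles, and script location are illustrative placeholders.

    import boto3

    emr = boto3.client("emr")

    response = emr.run_job_flow(
        Name="example-transient-cluster",
        ReleaseLabel="emr-7.1.0",
        Applications=[{"Name": "Spark"}],
        Instances={
            "InstanceGroups": [
                {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
            ],
            # Terminate the cluster when there are no more steps to run
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        Steps=[
            {
                "Name": "spark-etl",
                "ActionOnFailure": "TERMINATE_CLUSTER",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": ["spark-submit", "s3://example-bucket/jobs/etl.py"],
                },
            }
        ],
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )
    print(response["JobFlowId"])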

EMR cluster - Node types

AWS Docs - EMR cluster - Node types (opens in a new tab)

  • Primary node / Master node

    • Manages the cluster and typically runs primary components of distributed applications.
    • Tracks the status of jobs submitted to the cluster
    • Monitors the health of the instance groups
    • HA support from 5.23.0+
  • Core nodes

    • Store HDFS data and run tasks
    • Multi-node clusters have at least one core node.
    • There is one core instance group or instance fleet per cluster, but the group or fleet can contain multiple nodes running on multiple EC2 instances.
  • Task nodes

    • No HDFS data, therefore no data loss if terminated
    • Can use Spot instances
    • Optional in a cluster
EMR - EC2

AWS Docs - EMR - EC2 (opens in a new tab)

EMR - EKS

AWS Docs - EMR - EKS (opens in a new tab)

EMR - Serverless

AWS Docs - EMR - Serverless (opens in a new tab)

EMR - EMRFS
EMR - Security - Apache Ranger
  • An RBAC framework to enable, monitor, and manage comprehensive data security across the Hadoop ecosystem.
  • Centralized security administration and auditing
  • Fine-grained authorization across many Hadoop components (Hadoop, Hive, HBase, Storm, Knox, Solr, Kafka, and YARN)
  • Syncs policies and users by using agents and plugins that run within the same process as the Hadoop component.
  • Supports row-level authorization and auditing capabilities with embedded search.
EMR - Security Configuration

AWS Docs - Security Configuration (opens in a new tab)

EMR Studio
  • Web IDE for fully managed Jupyter notebooks running on EMR clusters
EMR Notebooks (opens in a new tab)
EMR - Open Source Ecosystem
Delta Lake

EMR - Delta Lake (opens in a new tab)

  • Storage layer framework for lakehouse architectures
Ganglia

EMR - Ganglia (opens in a new tab)

  • Performance monitoring tool for Hadoop and HBase clusters
  • Not included in EMR after 6.15.0
Apache HBase

EMR - Apache HBase (opens in a new tab)

  • Wide-column store NoSQL running on HDFS
  • EMR WAL support: WALs are retained for 30 days from the time the cluster was created and can be restored by a new cluster that uses the same S3 root directory.
Apache HCatalog

EMR - Apache HCatalog (opens in a new tab)

  • HCatalog is a tool that allows you to access Hive metastore tables within Pig, Spark SQL, and/or custom MapReduce applications.
Apache Hudi

EMR - Apache Hudi (opens in a new tab)

  • Open Table Format
  • Brings database and data warehouse capabilities to the data lake
Hue

EMR - Hue (opens in a new tab)

  • Web GUI for Hadoop
Apache Iceberg

EMR - Apache Iceberg (opens in a new tab)

  • Open Table Format
Apache Livy

EMR - Apache Livy (opens in a new tab)

  • A REST interface for Apache Spark
Apache MXNet

EMR - Apache MXNet (opens in a new tab)

  • Retired
  • A deep learning framework for building neural networks and other deep learning applications.
Apache Oozie

EMR - Apache Oozie (opens in a new tab)

  • Workflow scheduler system to manage Hadoop jobs
  • Apache Airflow is a modern alternative.
Apache Phoenix

EMR - Apache Phoenix (opens in a new tab)

  • OLTP and operational analytics in Hadoop for low latency applications
Apache Pig

EMR - Apache Pig (opens in a new tab)

  • Uses SQL-like commands written in Pig Latin and converts those commands into DAG-based Tez jobs or MapReduce programs.
Apache Sqoop

EMR - Apache Sqoop (opens in a new tab)

  • Retired
  • A tool for transferring data between S3, Hadoop, HDFS, and RDBMS databases.
Apache Tez

EMR - Apache Tez (opens in a new tab)

  • An application execution framework for complex DAG of tasks, similar to Apache Spark

Glue

Glue (opens in a new tab)

Glue - Orchestration - Triggers
  • When fired, a trigger can start specified jobs and crawlers.

  • A trigger fires on demand, based on a schedule, or based on a combination of events.

  • trigger types

    • Scheduled

    • Conditional

      The trigger fires if the watched jobs or crawlers end with the specified statuses.

    • On-demand
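
A small boto3 sketch of the conditional trigger type described above; the job names are hypothetical, and the trigger starts the downstream job only after the upstream job succeeds.

    import boto3

    glue = boto3.client("glue")

    glue.create_trigger(
        Name="start-load-after-extract",
        Type="CONDITIONAL",
        StartOnCreation=True,
        Predicate={
            "Logical": "AND",
            "Conditions": [
                # Fire only when the watched job ends with the specified status
                {"LogicalOperator": "EQUALS", "JobName": "extract-job", "State": "SUCCEEDED"}
            ],
        },
        Actions=[{"JobName": "load-job"}],
    )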

Glue - Orchestration - Workflows
Glue - Orchestration - Blueprints
  • AWS Docs - Glue - blueprints (opens in a new tab)

  • Glue blueprints provide a way to create and share Glue workflows. When a complex ETL process could be reused for similar use cases, you can create a single blueprint rather than creating a Glue workflow for each use case.

Glue - Job
Glue - Job type
  • Apache Spark ETL

    • Minimum 2 DPU
    • 10 DPU (default)
    • Job command: glueetl
  • Apache Spark streaming ETL

    • Minimum 2 DPU (default)
    • Job command: gluestreaming
  • Python shell

    • 1 DPU or 0.0625 DPU (default)

    • Job command: pythonshell

    • Mainly intended for ad-hoc tasks such as data retrieval

    • Much faster startup times than Spark jobs

    • Compared to AWS Lambda

      • Can run much longer
      • Can be equipped with more CPU and memory
      • Lower per-unit-time cost
      • Higher minimum billing time (at least 1 minute)
      • Seamlessly integrates with Glue Workflow
    • AWS Glue Python Shell Jobs (opens in a new tab)

  • Ray

    • Job command: glueray
Glue - Job - Job Bookmark
  • Glue tracks data that has already been processed during a previous run of an ETL job by persisting state information from the job run. This persisted state information is called a job bookmark.

  • Options

    • Enabled

      The job updates the state after a job run. The job keeps track of processed data, and when a job runs, it processes new data since the last checkpoint.

    • Disabled

      Job bookmarks are not used, and the job always processes the entire dataset. This is the default setting.

    • Pause

      The job bookmark state is not updated when this option is specified. The job processes incremental data since the last successful job run. You are responsible for managing the output from previous job runs.
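
As a sketch of the options above, the bookmark behavior is selected per run through the special --job-bookmark-option argument; the job name here is a placeholder.

    import boto3

    glue = boto3.client("glue")

    # "job-bookmark-enable", "job-bookmark-disable", or "job-bookmark-pause"
    glue.start_job_run(
        JobName="example-etl-job",
        Arguments={"--job-bookmark-option": "job-bookmark-enable"},
    )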

Glue - Job run insights (Monitoring)
  • Job debugging and optimization
  • Insights are available using 2 new log streams in the CloudWatch logs
Glue - Job - Worker
  • A single DPU is also called a worker.

  • A DPU is a relative measure of processing power that consists of 4 vCPU of compute capacity and 16 GB of memory.

  • An M-DPU is a DPU with 4 vCPU and 32 GB of memory.

  • Auto Scaling

    • Available for Glue jobs with G.1X, G.2X, G.4X, G.8X, or G.025X (only for Streaming jobs) worker types. Standard DPUs are not supported.
    • Requires Glue 3.0+.
  • Worker type

    • For Glue 2.0+

      Specify Worker type and Number of workers

      • Standard

        • 1 DPU
        • 4 vCPU and 16 GB of memory
        • 50 GB of attached storage
      • G.1X (default)

        • 1 DPU - 4 vCPU and 16 GB of memory
        • 64 GB of attached storage
        • Recommended for most workloads with cost effective performance
        • Recommended for jobs authored in Glue 2.0+
      • G.2X

        • 2 DPU - 8 vCPU and 32 GB of memory
        • 128 GB of attached storage
        • Recommended for most workloads with cost effective performance
        • Recommended for jobs authored in Glue 2.0+
      • G.4X

        • 4 DPU - 16 vCPU and 64 GB of memory
        • Glue 3.0+ Spark jobs
      • G.8X

        • 8 DPU - 32 vCPU and 128 GB of memory
        • Recommended for workloads with most demanding transformations, aggregations, and joins
        • Glue 3.0+ Spark jobs
      • G.025X

        • 0.25 DPU
        • Recommended for low volume streaming jobs
        • Glue 3.0+ Spark Streaming jobs only
    • Execution class

      • Standard

        • Ideal for time-sensitive workloads that require fast job startup and dedicated resources
      • Flexible

        • Appropriate for time-insensitive jobs whose start and completion times may vary
        • Only jobs with Glue 3.0+ and command type glueetl will be allowed to set ExecutionClass to FLEX.
        • The flexible execution class is available for Spark jobs.
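
A hedged boto3 sketch tying together the worker type, auto scaling, and execution class settings above; the role, script location, and job name are placeholders.

    import boto3

    glue = boto3.client("glue")

    glue.create_job(
        Name="example-spark-etl",
        Role="arn:aws:iam::123456789012:role/example-glue-role",
        GlueVersion="4.0",
        Command={
            "Name": "glueetl",
            "ScriptLocation": "s3://example-bucket/scripts/etl.py",
            "PythonVersion": "3",
        },
        WorkerType="G.1X",
        NumberOfWorkers=10,
        ExecutionClass="FLEX",  # flexible execution class; Glue 3.0+ glueetl jobs only
        DefaultArguments={"--enable-auto-scaling": "true"},  # scale workers up to NumberOfWorkers
    )
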
Glue - Programming
Glue - Programming - Python
Glue - Programming - Scala
Glue - Studio
  • GUI for Glue jobs

  • Visual ETL

    Visually compose ETL workflows

  • Notebook

  • Script editor

    • Create and edit Python or Scala scripts
    • Automatically generate ETL scripts
Glue - Data Catalog

AWS Docs - Glue Data Catalog (opens in a new tab)

Glue - Data Catalog - crawler

AWS Docs - Glue Data Catalog - crawler (opens in a new tab)

  • A crawler can crawl multiple data stores in a single run. Upon completion, the crawler creates or updates one or more tables in your Data Catalog. ETL jobs that you define in AWS Glue use these Data Catalog tables as sources and targets.

  • A classifier recognizes the format of your data and generates a schema. It returns a certainty number between 0.0 and 1.0, where 1.0 means 100% certainty. Glue uses the output of the classifier that has the highest certainty.

  • If no classifier returns a certainty greater than 0.0, Glue returns the default classification string of UNKNOWN.

  • Once attached to a crawler, a custom classifier is executed before the built-in classifiers.

  • Steps

    • A crawler runs any custom classifiers that you choose to infer the format and schema of your data.
    • The crawler connects to the data store.
    • The inferred schema is created for your data.
    • The crawler writes metadata to the Glue Data Catalog.
  • Partition threshold

    If the following conditions are met, the schemas are denoted as partitions of a table.

    • The partition threshold is higher than 0.7 (70%).
    • The maximum number of different schemas doesn't exceed 5.

    Learn how AWS Glue crawler detects the schema | AWS re:Post (opens in a new tab)
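
A boto3 sketch of defining and starting a crawler as described above; the bucket, database, role, and custom classifier name are hypothetical.

    import boto3

    glue = boto3.client("glue")

    glue.create_crawler(
        Name="example-logs-crawler",
        Role="arn:aws:iam::123456789012:role/example-glue-role",
        DatabaseName="example_db",
        # Custom classifiers are tried before the built-in classifiers
        Classifiers=["example-custom-classifier"],
        Targets={"S3Targets": [{"Path": "s3://example-bucket/raw/logs/"}]},
        SchemaChangePolicy={"UpdateBehavior": "UPDATE_IN_DATABASE", "DeleteBehavior": "LOG"},
    )
    glue.start_crawler(Name="example-logs-crawler")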

Glue - Data Catalog - security
  • You can only use a Glue resource policy to manage permissions for Data Catalog resources.
  • You can't attach it to any other Glue resources such as jobs, triggers, development endpoints, crawlers, or classifiers.
  • Only one resource policy is allowed per catalog, and its size is limited to 10 KB.
  • Each AWS account has one single catalog per Region whose catalog ID is the same as the AWS account ID.
  • You cannot delete or modify a catalog.
Glue - Data Quality

AWS Docs - Glue Data Quality (opens in a new tab)

  • Based on the Deequ (opens in a new tab) framework, using the Data Quality Definition Language (DQDL), a DSL for defining data quality rules

  • Entry Point

    • Data quality for the Data Catalog
    • Data quality for ETL jobs
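
A minimal sketch of a DQDL ruleset attached to a Data Catalog table; the database, table, and rule choices are illustrative, and the rule types (IsComplete, RowCount) come from the DQDL reference.

    import boto3

    glue = boto3.client("glue")

    dqdl = 'Rules = [ IsComplete "order_id", RowCount > 1000 ]'

    glue.create_data_quality_ruleset(
        Name="example-orders-ruleset",
        Ruleset=dqdl,
        TargetTable={"DatabaseName": "example_db", "TableName": "orders"},
    )
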
Glue - Streaming

AWS Docs - Glue Streaming (opens in a new tab)

  • Using the Apache Spark Streaming framework
  • Near-real-time data processing

Glue DataBrew

AWS Docs - Glue DataBrew (opens in a new tab)

Visual data preparation tool that enables users to clean and normalize data without writing any code.

  • Job type

    • Recipe job

      Data transformation

    • Profile job

      Analyzing a dataset to create a comprehensive profile of the data

Glue DataBrew - Recipes
  • AWS Docs - Recipe step and function reference (opens in a new tab)

  • Categories

    • Basic column recipe steps

      • Filter
      • Column
    • Data cleaning recipe steps

      • Format
      • Clean
      • Extract
    • Data quality recipe steps

      • Missing
      • Invalid
      • Duplicates
      • Outliers
    • Personally identifiable information (PII) recipe steps

      • Mask personal information
      • Replace personal information
      • Encrypt personal information
      • Shuffle rows
    • Column structure recipe steps

      • Split
      • Merge
      • Create
    • Column formatting recipe steps

      • Decimal precision
      • Thousands separator
      • Abbreviate numbers
    • Data structure recipe steps

      • Nest-Unnest
      • Pivot
      • Group
      • Join
      • Union
    • Data science recipe steps

      • Text
      • Scale
      • Mapping
      • Encode
    • Functions

      • Mathematical functions
      • Aggregate functions
      • Text functions
      • Date and time functions
      • Window functions
      • Web functions
      • Other functions

Kinesis Data Streams

AWS Docs - Kinesis Data Streams (opens in a new tab)

  • Key points

    • Equivalent to Kafka, for event streaming, no ETL
    • On-demand or provisioned mode
    • PaaS with API access for developers
    • The maximum size of a record's data payload (before base64 encoding) is 1 MB.
    • Retention Period by default is 1 day, and can be up to 365 days.
  • Shard (opens in a new tab)

    • A stream is composed of one or more shards.

    • A shard is a uniquely identified sequence of data records in a stream.

    • All the data in the shard is sent to the same worker that is processing the shard.

    • A shard iterator is a pointer to a position in the shard from which to start reading data records sequentially. A shard iterator specifies this position using the sequence number of a data record in a shard.

    • number_of_shards = max(incoming_write_bandwidth_in_KiB/1024, outgoing_read_bandwidth_in_KiB/2048)

    • Read throughput

      Up to 2 MB/second/shard

      or

      5 transactions (API calls) /second/shard

      or

      2000 records/second/shard

      shared across all the consumers that are reading from a given shard

      Each call to GetRecords is counted as 1 read transaction.

      GetRecords can retrieve up to 10 MB / transaction / shard, and up to 10000 records per call. If a call to GetRecords returns 10 MB, subsequent calls made within the next 5 seconds throw an exception.

      up to 20 registered consumers (Enhanced Fan-out Limit) for each data stream.

    • Write throughput

      Up to 1 MB/second/shard

      or

      1000 records/second/shard

    • Record aggregation allows producers to combine multiple records into a single record, improving per-shard throughput.

  • Data stream capacity mode

    • On-demand mode

      Automatically manages the shards

    • Provisioned mode

      Must specify the number of shards for the data stream upfront, but you can also change the number of shards later by resharding the stream.

  • Scaling

  • Partition key (opens in a new tab)

    • All data records with the same partition key map to the same shard.
    • The number of partition keys should typically be much greater than the number of shards, i.e., a many-to-one relationship of partition keys to shards.
  • Idempotency

    • 2 primary causes for duplicate records

      • Producer retries
      • Consumer retries
    • Consumer retries are more common than producer retries

      • Kinesis Data Streams does not guarantee the order of records across shards in a stream.
      • Kinesis Data Streams does guarantee the order of records within a shard.
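
A hedged boto3 sketch pulling together the shard-count formula, partition-key routing, and provisioned-mode resharding described above; the stream name and bandwidth figures are hypothetical.

    import json
    import math

    import boto3

    kinesis = boto3.client("kinesis")

    def estimated_shards(write_kib_per_sec: float, read_kib_per_sec: float) -> int:
        # number_of_shards = max(incoming_write_bandwidth_in_KiB/1024, outgoing_read_bandwidth_in_KiB/2048)
        return max(math.ceil(write_kib_per_sec / 1024), math.ceil(read_kib_per_sec / 2048))

    # Records with the same partition key always map to the same shard
    kinesis.put_record(
        StreamName="example-stream",
        Data=json.dumps({"customer_id": "c-42", "amount": 31.5}).encode("utf-8"),
        PartitionKey="c-42",
    )

    # Provisioned mode only: reshard by changing the shard count
    kinesis.update_shard_count(
        StreamName="example-stream",
        TargetShardCount=estimated_shards(4096, 6144),
        ScalingType="UNIFORM_SCALING",
    )
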
Kinesis Data Streams - Producers
Kinesis Data Streams - Producers - KPL

AWS Docs - KPL (opens in a new tab)

  • Batching

    • Aggregation

      Storing multiple payloads within a single Kinesis Data Streams record.

    • Collection

      Consolidating multiple Kinesis Data Streams records into a single HTTP request (a PutRecords call) to reduce request overhead.

  • Rate Limiting

    • Limits per-shard throughput sent from a single producer
    • Implemented using a token bucket algorithm with separate buckets for both Kinesis Data Streams records and bytes.
  • Pros

    • Higher throughput due to aggregation and compression of records
    • Abstraction over API to simplify coding using low level API directly
  • Cons

    • Higher latency due to an additional processing delay of up to RecordMaxBufferedTime (configurable)
    • Only supports AWS SDK v1
Kinesis Data Streams - Producers - Kinesis Agent

AWS Docs - Kinesis Agent (opens in a new tab)

  • Stand-alone Java application running as a daemon
  • Continuously monitors a set of files and sends new data to your stream.
Kinesis Data Streams - Consumers
  • Differences between shared throughput consumers and enhanced fan-out consumers (opens in a new tab)

  • Types

    • Shared throughput consumers without enhanced fan-out
    • Enhanced fan-out consumers
  • KCL (opens in a new tab)

    • Tasks

      • Connects to the data stream
      • Enumerates the shards within the data stream
      • Uses leases to coordinate shard associations with its workers
      • Instantiates a record processor for every shard it manages
      • Pulls data records from the data stream
      • Pushes the records to the corresponding record processor
      • Checkpoints processed records
      • Balances shard-worker associations (leases) when the worker instance count changes or when the data stream is resharded (shards are split or merged)
    • Versions

      • KCL 1.x is based on AWS SDK v1, and KCL 2.x is based on AWS SDK v2

      • KCL 1.x

        • Java
        • Node.js
        • .NET
        • Python
        • Ruby
      • KCL 2.x

        • Java
        • Python
    • Multistream processing is only supported in KCL 2.x for Java (KCL 2.3+ for Java)

    • You should ensure that the number of KCL instances does not exceed the number of shards (except for failure standby purposes)

    • One KCL worker is deployed on one EC2 instance, able to process multiple shards.

    • A KCL worker runs as a process and it has multiple record processors running as threads within it.

    • One Shard has exactly one corresponding record processor.

    • KCL 2.x enables you to create KCL consumer applications that can process more than one data stream at the same time.

    • Checkpoints sequence number for the shard in the lease table to track processed records

    • Enhanced fan-out (opens in a new tab)

      • Consumers using enhanced fan-out have dedicated throughput of up to 2 MB/second/shard, and without enhanced fan-out, all consumers reading from the same shard share the throughput of 2 MB/second/shard.
      • Requires KCL 2.x and Kinesis Data Streams with enhanced fan-out enabled
      • You can register up to 20 consumers per stream to use enhanced fan-out.
      • Kinesis Data Streams pushes data records from the stream to consumers using enhanced fan-out over HTTP/2, therefore no polling and lower latency.
    • Lease Table (opens in a new tab)

      • At any given time, each shard of data records is bound to a particular KCL worker by a lease identified by the leaseKey variable.
      • By default, a worker can hold one or more leases (subject to the value of the maxLeasesForWorker variable) at the same time.
      • A DynamoDB table to keep track of the shards in a Kinesis Data Stream that are being leased and processed by the workers of the KCL consumer application.
      • Each row in the lease table represents a shard that is being processed by the workers of your consumer application.
      • KCL uses the name of the consumer application to create the name of the lease table that this consumer application uses, therefore each consumer application name must be unique.
      • One lease table for one KCL consumer application
      • KCL creates the lease table with a provisioned throughput of 10 reads / second and 10 writes / second
  • Developing Consumers Using Amazon Data Firehose (opens in a new tab)

  • Developing Consumers Using Amazon Managed Service for Apache Flink (opens in a new tab)

  • Troubleshooting (opens in a new tab)
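
For the enhanced fan-out consumers described above, a minimal boto3 sketch of registering a dedicated-throughput consumer; the stream ARN and consumer name are placeholders.

    import boto3

    kinesis = boto3.client("kinesis")

    consumer = kinesis.register_stream_consumer(
        StreamARN="arn:aws:kinesis:us-east-1:123456789012:stream/example-stream",
        ConsumerName="example-efo-consumer",
    )
    # The consumer ARN is what SubscribeToShard / KCL 2.x uses for HTTP/2 push delivery
    print(consumer["Consumer"]["ConsumerARN"])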

Kinesis Data Firehose

AWS Docs - Kinesis Data Firehose (opens in a new tab)

Kinesis Data Firehose - Sources & Destinations
Kinesis Data Firehose - Buffering hints

AWS Docs - Buffering hints (opens in a new tab)

  • Buffering size in MBs (destination specific)

    • 1 MB to 128 MB
    • Default is 5 MB
  • Buffering interval in seconds (destination specific)

    • 60 to 900 seconds
    • Default is 60 / 300 seconds
Kinesis Data Firehose - Data Transformation

Data Transformation (opens in a new tab) is not natively supported; you need to use Lambda

  • Lambda

    • Custom transformation logic

    • Lambda invocation time of up to 5 min.

    • Error handling

      • Retry the invocation 3 times by default.
      • After retries, if the invocation is unsuccessful, Firehose then skips that batch of records. The skipped records are treated as unsuccessfully processed records.
      • The unsuccessfully processed records are delivered to your S3 bucket.
  • Data format conversion

    • Firehose can convert JSON records using a schema from a table defined in Glue
    • Non-JSON data must be converted by invoking Lambda function
Kinesis Data Firehose - Dynamic Partitioning
  • Dynamic partitioning enables you to continuously partition streaming data in Firehose by using keys within data (for example, customer_id or transaction_id) and then deliver the data grouped by these keys into corresponding S3 prefixes.

  • Partitioning your data minimizes the amount of data scanned, optimizes performance, and reduces costs of your analytics queries on S3.

  • Methods of creating partitioning keys

    • Inline parsing

      Map each parameter to a valid jq expression, suitable for JSON format

    • Lambda function

      For compressed or encrypted data records, or data that is in any file format other than JSON

  • Resources
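
A hedged boto3 sketch of the inline (jq) parsing method above, assuming a Direct PUT stream delivering JSON to S3; the names, ARNs, and the customer_id key are illustrative placeholders.

    import boto3

    firehose = boto3.client("firehose")

    firehose.create_delivery_stream(
        DeliveryStreamName="example-partitioned-stream",
        DeliveryStreamType="DirectPut",
        ExtendedS3DestinationConfiguration={
            "RoleARN": "arn:aws:iam::123456789012:role/example-firehose-role",
            "BucketARN": "arn:aws:s3:::example-delivery-bucket",
            # Group delivered objects into S3 prefixes by the extracted key
            "Prefix": "events/customer_id=!{partitionKeyFromQuery:customer_id}/",
            "ErrorOutputPrefix": "errors/",
            "DynamicPartitioningConfiguration": {"Enabled": True},
            "ProcessingConfiguration": {
                "Enabled": True,
                "Processors": [
                    {
                        "Type": "MetadataExtraction",
                        "Parameters": [
                            {"ParameterName": "MetadataExtractionQuery",
                             "ParameterValue": "{customer_id: .customer_id}"},
                            {"ParameterName": "JsonParsingEngine",
                             "ParameterValue": "JQ-1.6"},
                        ],
                    }
                ],
            },
        },
    )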

Kinesis Data Firehose - Server-side Encryption

Server-side encryption (opens in a new tab)

  • Can be enabled depending on the source of data

  • Kinesis Data Streams as the Source

    • Kinesis Data Streams first decrypts the data and then sends it to Firehose.
    • Firehose buffers the data in memory and then delivers without storing the unencrypted data at rest.
  • With Direct PUT or other Data Sources

    • Can be turned on by using the StartDeliveryStreamEncryption operation.
    • To stop server-side-encryption, use the StopDeliveryStreamEncryption operation.
Kinesis Data Firehose - Custom S3 Prefixes

Custom Prefixes for Amazon S3 Objects (opens in a new tab)

Kinesis Data Analytics

  • Input

    • Kinesis Data Streams
    • Kinesis Data Firehose
  • Output

    • Kinesis Data Streams
    • Kinesis Data Firehose
    • Lambda
Kinesis Data Analytics - Managed Apache Flink

Managed Service for Apache Flink (opens in a new tab)

  • Equivalent to Kafka Streams, using Apache Flink behind the scenes, for online stream processing, namely streaming ETL

  • Managed Service for Apache Flink is for building streaming applications with APIs in Java, Scala, Python, and SQL

  • Managed Service for Apache Flink Studio is for ad-hoc interactive data exploration.

  • Flink API

    • Flink offers 4 levels of API abstraction: Flink SQL, Table API, DataStream API, and Process Function, which is used in conjunction with the DataStream API.
    • SQL can be embedded in any Flink application, regardless of the programming language chosen.
    • If you are planning to use the DataStream API, note that not all connectors are supported in Python.
    • If you need low latency / high throughput, you should consider Java/Scala regardless of the API.
    • If you plan to use Async I/O in the Process Function API, you will need to use Java.
Kinesis Data Analytics Studio

Kinesis Data Analytics Studio (opens in a new tab)

  • Based on

    • Apache Zeppelin

    • Apache Flink

Lake Formation

Lake Formation - Ingestion
  • Blueprints are predefined ETL workflows that you can use to ingest data into your Data Lake.

  • Workflows are instances of ingestion blueprints in Lake Formation.

  • Blueprint types

    • Database snapshot

      • Loads or reloads data from all tables into the data lake from a JDBC source.

      • You can exclude some data from the source based on an exclude pattern.

      • When to use:

        • Schema evolution is flexible. (Columns are re-named, previous columns are deleted, and new columns are added in their place.)
        • Complete consistency is needed between the source and the destination.
    • Incremental database

      • Loads only new data into the data lake from a JDBC source, based on previously set bookmarks.

      • You specify the individual tables in the JDBC source database to include.

      • When to use:

        • Schema evolution is incremental. (There is only successive addition of columns.)
        • Only new rows are added; previous rows are not updated.
    • Log file

      Bulk loads data from log file sources

Lake Formation - Permission management

The security policies in Lake Formation use two layers of permissions; each resource is protected by:

  • Lake Formation permissions (which control access to Data Catalog resources and S3 locations)

  • IAM permissions (which control access to Lake Formation and AWS Glue API resources)

  • Permission types

    • Metadata access

      Data Catalog permissions

    • Underlying data access

      • Data access permissions

        Data access permissions (SELECT, INSERT, and DELETE) on Data Catalog tables that point to that location.

      • Data location permissions

        The ability to create Data Catalog resources that point to particular S3 locations.

        When you grant the CREATE_TABLE or ALTER permission to a principal, you also grant data location permissions to limit the locations for which the principal can create or alter metadata tables.

  • Permission model

    • IAM permissions model consists of IAM policies.

    • Lake Formation permissions model is implemented as DBMS-style GRANT/REVOKE commands.

    • Principals

      • IAM users and roles
      • SAML users and groups
      • IAM Identity Center
      • External accounts
    • Resources

      • LF-Tags

        Using LF-Tags can greatly simplify the number of grants over using Named Resource policies.

      • Named data catalog resources

    • Permissions

      • Database permissions
      • Table permissions
    • Data filtering

      • Column-level security

        Allows users to view only specific columns and nested columns that they have access to in the table.

      • Row-level security

        Allows users to view only specific rows of data that they have access to in the table.

      • Cell-level security

        By using both row filtering and column filtering at the same time

        Can restrict access to different columns depending on the row.
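
A small boto3 sketch of the column-level security described above, granting SELECT on a subset of columns; the role, database, table, and column names are hypothetical.

    import boto3

    lf = boto3.client("lakeformation")

    lf.grant_permissions(
        Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/example-analyst"},
        Resource={
            "TableWithColumns": {
                "DatabaseName": "example_db",
                "Name": "orders",
                # Only these columns are visible to the principal
                "ColumnNames": ["order_id", "order_date"],
            }
        },
        Permissions=["SELECT"],
    )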

QuickSight

AWS Docs - QuickSight (opens in a new tab)

  • Key points

  • Editions

    • Standard edition

    • Enterprise edition

      • VPC connection

        Access to data source in your VPC without the need for a public IP address

      • Microsoft Active Directory SSO

      • Data stored in SPICE is encrypted at rest.

      • For SQL-based data sources, such as Redshift, Athena, PostgreSQL, or Snowflake, you can refresh your data incrementally within a look-back window of time.

  • Datasets

    • Dataset refresh can be scheduled or on-demand.

    • SPICE

      • In-memory
    • Direct SQL query

  • Visual types

  • AutoGraph (opens in a new tab)

    You create a visual by choosing AutoGraph and then selecting fields.

Migration and Transfer

Application Discovery Service

AWS Application Discovery Service (opens in a new tab)

  • Collect usage and configuration data about your on-premises servers and databases.

  • All discovered data is stored in your AWS Migration Hub home Region.

  • 2 ways of performing discovery and collecting data

    • Agentless discovery

      Identifies VMs and hosts associated with VMware vCenter

    • Agent-based discovery

      The agent runs in your local environment and requires root privileges.

Application Migration Service

  • Automated lift-and-shift (rehost) migration of VMs to AWS
  • Migrate your applications to AWS from physical servers, VMware vSphere, Microsoft Hyper-V, and other cloud providers.
  • Migrate EC2 instances between AWS Regions or between AWS accounts, and to migrate from EC2-Classic to a VPC.

DMS

AWS Docs - DMS (Database Migration Service) (opens in a new tab)

  • Migrate RDBMS, data warehouses, NoSQL databases, and other types of data stores

  • One-time migration or continuous replication

  • Used for CDC (Change Data Capture) for real-time ongoing replication

    AWS Big Data Blog - Stream change data to Amazon Kinesis Data Streams with AWS DMS (opens in a new tab)

  • DMS uses a replication instance to connect to your source data store, read the source data, and format the data for consumption by the target data store.

  • DMS Serverless eliminates replication instance management tasks.

DMS - Fleet Advisor
  • Automatically inventories and assesses your on-premises database and analytics server fleet and identifies potential migration paths
DMS - Data Transformations

DataSync

AWS Docs - DataSync (opens in a new tab)

  • One-off or scheduled online data transfer; transferring hot or cold data should not impede your business.

  • Move data between on-premises and AWS

  • Move data between AWS services

  • Requires DataSync agent installation on-premises

  • Move files or objects, not databases

  • On-premises

    • Network File System (NFS)
    • Server Message Block (SMB)
    • Hadoop Distributed File Systems (HDFS)
    • Object storage

Snow Family

Snowball Edge Storage Optimized
  • For large-scale data migrations and recurring transfer workflows, as well as local computing with higher capacity needs
Snowball Edge Compute Optimized
  • For use cases such as machine learning, full motion video analysis, analytics, and local computing stacks
Snowmobile
  • Intended for more than 10 PB data migrations from a single location
  • Up to 100 PB per Snowmobile
  • Shipping container

Transfer Family

  • Transfer flat files using the following protocols:

    • Secure File Transfer Protocol (SFTP): version 3
    • File Transfer Protocol Secure (FTPS)
    • File Transfer Protocol (FTP)
    • Applicability Statement 2 (AS2)
    • Browser-based transfers

Application Integration

AppFlow

EventBridge

  • Rules

    Routing rules for events

  • API Destinations

    API destinations are HTTP endpoints that you can invoke as the target of a rule, similar to how you invoke an AWS service or resource as a target.

MWAA

  • Auto scaling
  • Automatic Airflow setup based on Airflow version
  • Streamlined upgrades and patches
  • Workflow monitoring in CloudWatch

SNS

Step Functions

Step Functions (opens in a new tab)

Step Functions - State Machine
  • Types

    • Standard workflows

      • Standard workflows are ideal for long-running, durable, and auditable workflows.

      • Standard workflows can support an execution start rate of over 2000 executions / second.

      • They can run for up to a year and you can retrieve the full execution history using the Step Functions API.

      • Standard Workflows employ an exactly-once execution model, where your tasks and states are never started more than once unless you have specified the Retry behavior in your state machine.

      • Suited to orchestrating non-idempotent actions, such as starting an EMR cluster or processing payments.

      • Standard Workflow executions are billed according to the number of state transitions processed.

      • Using Callbacks or the .sync Service integration will most likely reduce the number of state transitions and cost.

    • Express workflows

      The Express type is used for high-volume, event-processing workloads and can run for up to 5 minutes.

      • Synchronous Express Workflows

        Start a workflow, wait until it completes, and then return the result

      • Asynchronous Express Workflows

        Return confirmation that the workflow was started, but don't wait for the workflow to complete

Step Functions - States

AWS Docs - Step Functions - States (opens in a new tab)

  • Defined with Amazon States Language in State Machine definition

  • Types

    • Task

    • Choice

    • Map

      • Use the Map state to run a set of workflow steps for each item in a dataset. The Map state's iterations run in parallel, which makes it possible to process a dataset quickly.

      • Map state processing modes (opens in a new tab)

        • Inline mode

          • Limited concurrency, each iteration of the Map state runs in the context of the workflow that contains the Map state.
          • Map state accepts only a JSON array as input. Also, this mode supports up to 40 concurrent iterations.
        • Distributed mode

          • High concurrency, each iteration of the Map state runs in its own execution context.

          • When you run a Map state in Distributed mode, Step Functions creates a Map Run resource.

          • Tolerated failure threshold can be set for a Map Run, and Map Run fails automatically if it exceeds the threshold.

          • Use cases

            • The size of your dataset exceeds 256 KB.
            • The workflow's execution event history exceeds 25000 entries.
            • You need a concurrency of more than 40 parallel iterations.
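
A hedged sketch of a Distributed-mode Map state that reads its dataset from S3; the bucket, key, role, and concurrency/failure settings are illustrative placeholders.

    import json

    import boto3

    sfn = boto3.client("stepfunctions")

    definition = {
        "StartAt": "ProcessItems",
        "States": {
            "ProcessItems": {
                "Type": "Map",
                # Read a dataset larger than the 256 KB payload limit directly from S3
                "ItemReader": {
                    "Resource": "arn:aws:states:::s3:getObject",
                    "ReaderConfig": {"InputType": "JSON"},
                    "Parameters": {"Bucket": "example-bucket", "Key": "input/items.json"},
                },
                "ItemProcessor": {
                    "ProcessorConfig": {"Mode": "DISTRIBUTED", "ExecutionType": "EXPRESS"},
                    "StartAt": "HandleItem",
                    "States": {"HandleItem": {"Type": "Pass", "End": True}},
                },
                "MaxConcurrency": 500,
                "ToleratedFailurePercentage": 5,  # the Map Run fails beyond this threshold
                "End": True,
            }
        },
    }

    sfn.create_state_machine(
        name="example-distributed-map",
        definition=json.dumps(definition),
        roleArn="arn:aws:iam::123456789012:role/example-sfn-role",
    )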

SQS

Kinesis Data Streams vs SQS
|              | MSK                                | Kinesis Data Streams                  | SQS                                      |
| Consumption  | Can be consumed many times         | Can be consumed many times            | Records are deleted after being consumed |
| Retention    | Configurable (default 7 days)      | Configurable from 24 hours to 1 year  | Configurable from 1 min to 14 days       |
| Ordering     | Preserved within a partition       | Preserved within a shard              | FIFO queues preserve ordering            |
| Scaling      | Manual (add brokers / partitions)  | Manual resharding                     | Auto scaling                             |
| Delivery     | At least once                      | At least once                         | Exactly once (FIFO queues)               |
| Replay       | Can replay                         | Can replay                            | Cannot replay                            |
| Payload size | 1 MB                               | 1 MB                                  | 256 KB                                   |

Cloud Financial Management

AWS Budgets

AWS Cost Explorer

Compute

AWS Batch

AWS Batch (opens in a new tab)

  • AWS Batch handles job execution and compute resource management, allowing you to focus on developing your applications rather than managing infrastructure.
  • A batch job is deployed as a Docker container.

EC2

Lambda

AWS SAM

Containers

ECR

ECS

EKS

Database

DocumentDB

DynamoDB

RDS

Redshift

Redshift - Cluster
  • The maximum number of nodes per cluster depends on the node type: up to 32 for most node sizes, and up to 128 for ra3.16xlarge and dc2.8xlarge (see the node type tables below).

  • Leader node and compute nodes have the same specs, and leader node is chosen automatically.

  • Leader node (present when the cluster has 2+ nodes)

    • Parses the query, builds an optimal execution plan, and compiles code from the execution plan
    • Receives and aggregates the results from the compute nodes
  • Compute node

    • Node slices

      A compute node is partitioned into slices. Each slice is allocated a portion of the node's memory and disk space, where it processes a portion of the workload assigned to the node.

  • Enhanced VPC routing

    • Routes network traffic between your cluster and data repositories through a VPC, instead of through the internet.
Redshift - Cluster - Elastic resize
Redshift - Cluster - Classic resize
  • When you need to change the cluster size or the node type of an existing cluster and elastic resize is not supported
  • Takes much longer than elastic resize
Redshift - Cluster - Node type
  • RA3

    | Node Size     | vCPU | RAM (GiB) | Managed Storage quota per node | Number of nodes per cluster |
    | ra3.xlplus    | 4    | 32        | 32 TB                          | 1 to 32                     |
    | ra3.4xlarge   | 12   | 96        | 128 TB                         | 2 to 32                     |
    | ra3.16xlarge  | 48   | 384       | 128 TB                         | 2 to 128                    |
    • Pay for the managed storage and compute separately
    • Managed storage uses SSD for local cache and S3 for persistence and automatically scales based on the workload
    • Compute is separated from managed storage and can be scaled independently
    • RA3 nodes are optimized for performance and cost-effectiveness
  • DC2 (Dense Compute)

    | Node Size   | vCPU | RAM (GiB) | Managed Storage quota per node | Number of nodes per cluster |
    | dc2.large   | 2    | 15        | 160 GB                         | 1 to 32                     |
    | dc2.8xlarge | 32   | 244       | 2.6 TB                         | 2 to 128                    |
    • Compute-intensive local SSD storage for high performance and low latency
    • Recommended for datasets under 10 TB
  • DS2 (Dense Storage)

    • Deprecated legacy option, for low-cost workloads with HDD storage, RA3 is recommended for new workloads
  • Resources

Redshift - Cluster - HA
  • AZ

    • Single AZ by default
    • Multi AZ is supported
Redshift - Distribution style

Distribution style (opens in a new tab)

  • Redshift Distribution Keys (DIST Keys) determine where data is stored in Redshift. Clusters store data fundamentally across the compute nodes. Query performance suffers when a large amount of data is stored on a single node.

  • Uneven distribution of data across compute nodes leads to skew in the work each node has to do, and you don't want an under-utilized compute node, so the distribution of the data should be uniform.

  • Data should be distributed in such a way that the rows that participate in joins are already on the same node with their joining rows in other tables. This is called collocation, which reduces data movement and improves query performance.

  • Distribution is configurable per table. So you can select a different distribution style for each of the table.

  • AUTO distribution

    • Redshift automatically distributes the data across the nodes in the cluster
    • Redshift initially assigns ALL distribution to a small table, then changes to EVEN distribution when the table grows larger.
    • Suitable for small tables or tables with a small number of distinct values
  • EVEN distribution

    • The leader node distributes the rows across the slices in a round-robin fashion, regardless of the values in any particular column.
    • Appropriate when a table doesn't participate in joins. It's also appropriate when there isn't a clear choice between KEY distribution and ALL distribution.
  • KEY distribution

    • The rows are distributed according to the values in one column.
    • All the entries with the same value in the column end up on the same slice.
    • Suitable for large tables or tables with a large number of distinct values
  • ALL distribution

    • A copy of the entire table is distributed to every node.
    • Suitable for small, slow-moving tables (such as dimension tables), so that joins against them can be performed locally on every node
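
A sketch of choosing a distribution style and sort key in DDL, submitted here through the Redshift Data API; the cluster, secret, database, and table definition are hypothetical.

    import boto3

    redshift_data = boto3.client("redshift-data")

    # KEY distribution on customer_id collocates joining rows on the same slice;
    # the compound sort key speeds range-restricted scans on sale_date.
    ddl = """
    CREATE TABLE sales (
        sale_id     BIGINT,
        customer_id BIGINT,
        sale_date   DATE,
        amount      DECIMAL(12, 2)
    )
    DISTSTYLE KEY
    DISTKEY (customer_id)
    COMPOUND SORTKEY (sale_date);
    """

    redshift_data.execute_statement(
        ClusterIdentifier="example-cluster",
        Database="dev",
        SecretArn="arn:aws:secretsmanager:us-east-1:123456789012:secret:example-redshift",
        Sql=ddl,
    )
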
Redshift - Concurrency scaling
  • Concurrency scaling adds transient clusters to your cluster to handle concurrent requests with consistency and fast performance in a matter of seconds, mainly for bursty workloads.
Redshift - Sort keys

Sort keys (opens in a new tab)

  • When you create a table, you can define one or more of its columns as sort keys.
  • Either a compound or interleaved sort key.
Redshift - Compression

AWS Docs - Working with column compression (opens in a new tab)

  • You can apply a compression type, or encoding, to the columns in a table manually when you create the table.
  • The COPY command analyzes your data and applies compression encodings to an empty table automatically as part of the load operation.
  • Split your load data files so that the files are about equal size, between 1 MB and 1 GB after compression. For optimum parallelism, the ideal size is between 1 MB and 125 MB after compression.
  • The number of files should be a multiple of the number of slices in your cluster.
  • When loading data, it is strongly recommended that you individually compress your load files using gzip, lzop, bzip2, or Zstandard when you have large datasets.
Redshift - Workload Management (WLM)
  • AWS Docs - Amazon Redshift - Implementing workload management (opens in a new tab)

  • Automatic WLM

  • Manual WLM

    • Query Groups

      • Memory percentage and concurrency settings
    • Query Monitoring Rules

      • Query monitoring rules define metrics-based performance boundaries for queries and the action to take (for example, log, hop, or abort) when a query crosses those boundaries.
    • Query Queues

      • Memory percentage and concurrency settings
  • Concurrency scaling

    • Concurrency scaling automatically adds and removes transient cluster capacity to handle spikes in demand.
    • Concurrency scaling is enabled by default for all RA3 and DC2 clusters.
    • Concurrency scaling is not available for DS2 clusters.
  • Short query acceleration (opens in a new tab)

    • Short query acceleration (SQA) prioritizes selected short-running queries ahead of longer-running queries.
    • SQA runs short-running queries in a dedicated space, so that SQA queries aren't forced to wait in queues behind longer queries.
    • CREATE TABLE AS (CTAS) statements and read-only queries, such as SELECT statements, are eligible for SQA.
Redshift - User-Defined Functions (UDF)
Redshift - Snapshots
  • Automated (opens in a new tab)

    • Enabled by default when you create a cluster.
    • By default, about every 8 hours or following every 5 GB per node of data changes, whichever comes first.
    • Default retention period is 1 day
    • Cannot be deleted manually
  • Manual (opens in a new tab)

    • By default, manual snapshots are retained indefinitely, even after you delete your cluster.
    • AWS CLI: create-cluster-snapshot
    • AWS API: CreateClusterSnapshot
  • Backup

    • You can configure Redshift to automatically copy snapshots (automated or manual) for a cluster to another Region.
    • Only one destination Region at a time can be configured for automatic snapshot copy.
    • To change the destination Region that you copy snapshots to, first disable the automatic copy feature. Then re-enable it, specifying the new destination Region.
  • Snapshot copy grant

    • KMS keys are specific to a Region. If you enable copying of Redshift snapshots to another Region, and the source cluster and its snapshots are encrypted using a master key from KMS, you need to configure a grant for Redshift to use a master key in the destination Region.
    • The grant is created in the destination Region and allows Redshift to use the master key in the destination Region.
Redshift - SQL
  • VACUUM

    • FULL

      • VACUUM FULL is the default.
      • By default, VACUUM FULL skips the sort phase for any table that is already at least 95% sorted.
    • SORT ONLY

      • Sorts the specified table (or all tables in the current database) without reclaiming space freed by deleted rows.
    • DELETE ONLY

      Redshift automatically performs a DELETE ONLY vacuum in the background.

    • REINDEX

Redshift - Query
  • Data API (opens in a new tab)

    • Available in AWS SDK, based on HTTP and JSON, no persistent connection needed
    • API calls are asynchronous
    • Uses either credentials stored in Secrets Manager or temporary database credentials, no need to pass passwords in the API calls
  • System tables and views

    • SVV views contain information about database objects with references to transient STV tables.

    • SYS views

      • These are system monitoring views used to monitor query and workload usage for provisioned clusters and serverless workgroups.
      • These views are located in the pg_catalog schema.
    • STL views

      • Information retrieved from all Redshift log history across nodes
      • Retain 7 days of log history
    • STV tables

      • Virtual system tables (in-memory) that contain snapshots of the current system state
    • SVCS views provide details about queries on both the main and concurrency scaling clusters.

    • SVL views provide details about queries on main clusters.

  • Federated Query
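
A minimal sketch of the asynchronous Data API flow described above, assuming a Redshift Serverless workgroup; the workgroup, database, and query are placeholders.

    import time

    import boto3

    redshift_data = boto3.client("redshift-data")

    stmt = redshift_data.execute_statement(
        WorkgroupName="example-workgroup",
        Database="dev",
        Sql="SELECT COUNT(*) FROM sales WHERE sale_date >= '2024-01-01'",
    )

    # Calls are asynchronous: poll for completion, then fetch the result set
    while True:
        status = redshift_data.describe_statement(Id=stmt["Id"])
        if status["Status"] in ("FINISHED", "FAILED", "ABORTED"):
            break
        time.sleep(1)

    if status["Status"] == "FINISHED":
        print(redshift_data.get_statement_result(Id=stmt["Id"])["Records"])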

Redshift - Data Sharing
  • With data sharing, you can securely and easily share live data across Redshift clusters without the need to copy or move the data.

  • Lends itself to a multi-warehouse architecture, where you can scale each data warehouse for various types of workloads.

  • Only RA3 and serverless clusters are supported.

  • Live and transactionally consistent views of data across all consumers

  • Secure and governed collaboration within and across organizations

  • Sharing data with external parties to monetize your data

Redshift - Data Sharing - Datashare

Working with datashares (opens in a new tab)

  • You can share data at different levels

    • Databases
    • Schemas
    • Tables
    • Views (including regular, late-binding, and materialized views)
    • SQL user-defined functions (UDFs)
Redshift - Security
  • IAM

    • Does not support resource-based policies.
  • Access control

    • Cluster level

      • IAM policies

        For Redshift API actions

      • Security groups

        For network connectivity

    • Database level

      • Database user accounts
  • Data protection

  • Auditing

    • Logs

      • Connection log

        Logs authentication attempts, connections, and disconnections.

      • User log

        Logs information about changes to database user definitions.

      • User activity log

        Logs each query before it's run on the database.

Redshift Spectrum
  • In-place querying of data in S3, without having to load the data into Redshift tables
  • EB scale
  • Pay for the number of bytes scanned
  • Federated query across operational databases, data warehouses, and data lakes
Redshift Spectrum - Comparison with Athena
|               | Redshift Spectrum                                                                  | Athena                                            |
| Performance   | Greater control over performance; uses dedicated resources of your Redshift cluster | Not configurable; uses shared resources managed by AWS |
| Query results | Stored in Redshift                                                                 | Stored in S3                                      |
Redshift Serverless
  • Redshift measures data warehouse capacity in Redshift Processing Units (RPUs). You pay for the workloads you run in RPU-hours on a per-second basis (with a 60-second minimum charge), including queries that access data in open file formats in S3
Redshift Streaming Ingestion
  • Input

    • Kinesis Data Streams
    • MSK

Keyspaces

Neptune

MemoryDB

  • More expensive than ElastiCache
  • Intended as a Redis-compatible, fully managed, in-memory database service instead of a caching service
MemoryDB - Valkey
  • Open source Redis-compatible alternative, as it is forked from Redis
MemoryDB - Redis

Management and Governance

CloudFormation

CloudTrail

CloudWatch

CloudWatch Logs

  • Subscription

    Real-time processing of log data

AWS Config

AWS Config - Config rules

Managed Grafana

Systems Manager

Well-Architected Tool

Networking and Content Delivery

CloudFront

PrivateLink

Route 53

VPC

Security, Identity, and Compliance

IAM

KMS

Macie

Secrets Manager

AWS Shield (opens in a new tab)

  • Protection against DDoS attacks
AWS Shield Standard (opens in a new tab)
  • Operates at L3 and L4.
AWS Shield Advanced (opens in a new tab)
  • Operates at L7

  • Include

    • Shield Standard
    • Certain AWS WAF usage for Shield protected resources
  • DDoS cost protection

    • If any of these protected resources scale up in response to a DDoS attack, you can request Shield Advanced service credits through your regular AWS Support channel.

AWS WAF (opens in a new tab)

  • L7 (HTTP) application level firewall

  • Define a web ACL and then associate it with one or more web application resources that you want to protect.

Storage

AWS Backup

EBS

EFS

S3

S3 Glacier

References