Amazon RedShift - Big Data in AWS - AWS Certification Cheat Sheet


Let’s get a quick overview of Amazon RedShift from an AWS certification perspective. We will look at important certification questions regarding Amazon RedShift.

You will learn

  • What is Amazon RedShift?
  • Why do we need Amazon RedShift?
  • When do we use Amazon RedShift?
  • How do you do Redshift Workload Management?
  • How do you design Redshift Tables?
  • How do you load data into Amazon Redshift?

Table of Contents

Amazon Redshift

Let’s look at a few key highlights about Amazon Redshift:

  • Redshift is a relational database ( tables and relationships)
  • What is the need for another relational database?
    • RDS is optimized for online transaction processing
    • It is optimized to provide a balance between both reads and write operations
  • OLAP workloads have exponentially larger reads on the databases compared to writes:
    • Can we use a different approach to design the database?
    • How about creating a cluster and splitting the execution of the same query across several nodes?
  • Redshift is a petabyte-scale distributed data ware house based on PostgreSQL
  • Three important characteristics of Redshift:
    • Massively parallel processing (MPP) - storage and processing can be split across multiple nodes
    • Columnar data storage
    • High data compression
  • As a result
    • A single row of data might be stored across multiple nodes
    • A query to Redshift leader node is distributed to multiple compute nodes for execution
  • Start with a single node configuration and scale to multi node configuration
  • You can dynamically add and remove nodes
  • Used for traditional ETL(Extract, Transform, Load), OLAP and Business Intelligence (BI) use cases
    • Optimized for high-performance analysis and reporting of very large datasets
  • Supports standard SQL
  • Integration with data loading, reporting, mining and analytics tools
  • Provides high availability and durability:
    • Automatic replication (maintains 3 copies of your data)
    • Automated backups (to S3. Default retention - 1 day. Max - 35 days)
    • Automatic recovery from any node failures

Redshift Cluster

You can create Redshift Clusters with following characteristics:

  • One leader node and multiple compute nodes
    • Add compute nodes for more performance
    • Create a cluster subnet group to use a VPC
  • One or more databases in a cluster
  • Clients communicate with leader node
    • Leader node divides the query execution between compute nodes
    • No direct access to compute nodes

Redshift - Designing Tables

Here are some of the important things that you need to take into consideration while designing Redshift tables:

Compression Encoding (optional)

- Let Redshift choose or configure for each column
	- Examples : Raw, Bytedict, LZO, Runlength, Text255, Text32K
- Find the right compression encoding by running tests

Sort Keys (optional)

- Data is stored in sorted order (using sort key)
- Increase efficiency of your queries
- Example 1 : Columns used frequently in range (year > 1995 and year < 2005) or equal (year = 2015) conditions
- Example 2 : Join columns with other tables 
- Example 3 : Timestamp columns if you use the most recent data frequently

Distribution Strategy

  • How are the rows of the table distributed across compute nodes?
    • Aim to distribute data equally across nodes and minimize data movement during query execution
  • EVEN (default) - data is uniformly distributed
  • KEY - based on values of one column
    • Matching values are stored close together
    • Use join columns as KEY if you want matching columns to be co-located
  • ALL - entire table on all nodes
    • Used for lookup tables

Loading Data into Amazon Redshift

Here are the important options for Loading Data into Amazon Redshift:

Scenario Solution
Simple Use SQL insert queries using ODBC or JDBC
Efficient Use Amazon Redshift COPY command to load data from Amazon S3, Amazon DynamoDB, Amazon EMR etc
Data Pipelines Load using AWS Data Pipeline
On-premises data Use Storage Gateway or Import/Export to import data into S3. COPY data from S3
Other databases AWS Database Migration Service : RDS, DynamoDB or another Amazon Redshift Database
Recommendation Prefer COPY over INSERT for bulk operations as COPY is done in parallel
Recommendation Prefer COPY from multiple files. Split large files into multiple small input files

Redshift Workload Management

You can use Redshift Workload Management to prioritize your query workloads.

Here are some of the important characteristics:

  • WLM can be configured to prioritize queues
  • Create multiple queues with different concurrency level for different purposes
  • One queue for long running queries with low concurrency
  • One queue for short running queries with high concurrency (upto 50 concurrent queries)

Redshift Security

Here are some of the important things you need to remember about Redshift Security:

  • Uses 4-tier, key-based architecture for encryption
    • master key (chosen from keys in KMS)
    • a cluster encryption key (CEK)
    • a database encryption key (DEK)
    • and data encryption keys
  • Manage keys using AWS KMS or AWS Cloud HSM
  • IAM to manage user permissions for cluster operations
    • Grant permissions on a per cluster basis instead of per table basis

Redshift Operations

Here are some of the important things you need to remember about Operations with Redshift:

  • Add new columns by using ALTER TABLE
    • Existing columns cannot be modified
  • SQL operations are logged
    • Use SQL queries to query against system tables or download to S3
  • Monitor performance & queries with Cloud Watch and Redshift web console
  • When deleting a Redshift cluster, take a final snapshot to Amazon S3

Certification - Recommended Reading

Cloud Certifications - AWS, Azure and Google Cloud - Top 8 FAQ For Me

Google Cloud For Beginners - How to choose a Database Service?

Teaching Cloud Certifications - Top 6 Learnings

Google Cloud For Beginners - How to choose a Compute Service?

Important Kubernetes Concepts Made Easy

Introduction to Google Cloud - For AWS Professionals

Getting Started with Docker - 5 Easy Steps

GCP PubSub - GCP Certification Cheat Sheet

GCP IAM - GCP Certification Cheat Sheet

GCP App Engine - GCP Certification Cheat Sheet

WHAT NEXT?

Congratulations on reading this article!

Wondering what to learn next?

MY RECOMMENDATIONS

Keep Learning Every Day

Check Out Our Amazing ROADMAPS