Databases
AWS Databases
- Relational Database Service (RDS)
- Oracle, SQL Server, MySQL, PostgreSQL, Aurora, MariaDB
- Multi-AZ - for disaster recovery - exact copy kept in sync in another AZ; during failover AWS flips the DNS record to the standby - no intervention needed (not available for Aurora b/c it’s fault tolerant by design). No charge for data transfer. Can use Reserved DB Instances
- Read Replicas - for performance - async replication from the primary. A RR can be promoted to a standalone DB (breaks replication) - see the boto3 sketch after these RDS notes
- Not for DR
- Must have automatic backups turned on
- Up to 5 read replicas (and you can create a RR of a RR, though that adds replication lag)
- Each RR has its own DNS endpoint
- RR can be Multi-AZ
- Can be in a second region
- Not serverless (except Aurora Serverless)
- Runs on VMs - can’t log in - Amazon responsible for patching
- Automated Backups are enabled by default
- Point in time recovery w/in retention period (1-35 days)
- Full daily snapshots plus transaction logs throughout day
- May experience elevated latency during the backup window
- Automated backups are deleted when you delete the RDS instance
- DB Snapshots are done manually - stored even after you delete original RDS db
- Restoring a backup or snapshot creates a new RDS instance with a new DNS endpoint (update your apps)
- Encryption at rest via KMS - applies to backups, read replicas, and snapshots as well
- RDS for SQL Server max is 16TB when using Provisioned IOPS (SSD) and General Purpose (SSD) storage types
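A minimal boto3 sketch of the read-replica flow above - the instance identifiers, region, and instance class are hypothetical placeholders:

```python
import boto3

rds = boto3.client("rds", region_name="us-east-1")

# Create a read replica from an existing source instance.
# Requires automated backups to be enabled on the source.
rds.create_db_instance_read_replica(
    DBInstanceIdentifier="myapp-replica-1",
    SourceDBInstanceIdentifier="myapp-primary",
    DBInstanceClass="db.t3.medium",
    MultiAZ=True,  # a replica can itself be Multi-AZ
)

# Each replica gets its own DNS endpoint once available.
waiter = rds.get_waiter("db_instance_available")
waiter.wait(DBInstanceIdentifier="myapp-replica-1")
desc = rds.describe_db_instances(DBInstanceIdentifier="myapp-replica-1")
print(desc["DBInstances"][0]["Endpoint"]["Address"])

# Promoting the replica makes it standalone and breaks replication:
# rds.promote_read_replica(DBInstanceIdentifier="myapp-replica-1")
```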
- Non-relational (NoSQL) - DynamoDB
- Collection = table; Document = row; Key Value Pairs = fields
- Data Warehouse - Redshift
- OLTP - online transaction processing
- OLAP - online analytical processing
- ElastiCache - in memory cache
- Redis
- Memcached
DynamoDB
- SSD storage - single digit millisecond latency - document and k-v data models
- Data spread across 3 geographically distinct data centers
- Eventual consistent reads (default = ~1sec) vs. strongly consistent reads
- Accelerator - DAX: in-memory cache, operates as a write-through cache
- Fully managed, HA. 10x performance improvement
- Request time down to microseconds
- Transactions -
- Multiple “all or nothing” ops
- 2 underlying reads/writes per item - one to prepare, one to commit
- Up to 25 items or 4MB of data per transaction (see the sketch after these DynamoDB notes)
- On-Demand Capacity - pay per request pricing (balance cost and performance) - pay more per request than provisioned
- On-Demand Backup and Restore - full backup at any time, zero impact on perf or availability. Consistent w/in seconds and retained until deleted
- Point in Time Recovery (PITR) - protect against accidental writes or deletes, up to 35 days, incremental backups, not enabled by default. Latest restorable = 5 mins in past
- Streams - time-ordered sequence of item-level changes - stored 24 hrs. Inserts, updates, deletes. Combine w/ Lambdas to simulate stored procedures
- Global Tables - managed multi-master, multi-region replication - based on streams, multi-region redundancy for DR or HA. No application rewrites required. Replication latency under one second.
- Database Migration Service supports DynamoDB as a target (not a source)
- Encrypted at rest using KMS. Site to Site VPN, Direct Connect (DX), IAM, Fine-grained access, VPC endpoints, Cloudwatch and CloudTrail
- Strongly Consistent Reads can be used but may have higher latency
- DynamoDB can store large text and binary attributes, but items are limited to 400 KB.
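A minimal boto3 sketch of a DynamoDB transaction plus a strongly consistent read, per the notes above - the table names, keys, and attributes are hypothetical:

```python
import boto3

dynamodb = boto3.client("dynamodb", region_name="us-east-1")

# All-or-nothing write across two tables; a transaction can touch up to
# 25 items / 4MB, and each item costs two underlying writes (prepare, commit).
dynamodb.transact_write_items(
    TransactItems=[
        {
            "Put": {
                "TableName": "Orders",
                "Item": {
                    "OrderId": {"S": "o-1001"},
                    "OrderStatus": {"S": "PLACED"},
                },
            }
        },
        {
            "Update": {
                "TableName": "Inventory",
                "Key": {"Sku": {"S": "widget-42"}},
                "UpdateExpression": "SET #s = #s - :n",
                "ConditionExpression": "#s >= :n",  # fails the whole txn if out of stock
                "ExpressionAttributeNames": {"#s": "stock"},
                "ExpressionAttributeValues": {":n": {"N": "1"}},
            }
        },
    ]
)

# Strongly consistent read - higher latency than the eventually
# consistent default.
item = dynamodb.get_item(
    TableName="Orders",
    Key={"OrderId": {"S": "o-1001"}},
    ConsistentRead=True,
)
```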
Redshift
- Fully managed petabyte-scale data warehouse
- no commitments/upfront costs - start small at $0.25/hr, scale to ~$1,000 per TB per year
- Single Node (160GB)
- Multi-Node
- Leader Node - manages client conx and receives queries
- Compute Node - store data and perform queries and computations (up to 128 compute nodes)
- Advanced Compression - columnar data stores can be compressed much more than row-based. doesn’t require materialized views, etc
- Massively parallel processing
- Backups - enabled by default w/ a 1 day retention (max 35)
- attempts to maintain at least 3 copies of data (original, replica on compute nodes, and backup in s3)
- async replicate your snapshots to s3 in another region for DR
- Charges - compute nodes only (no charge for the leader node) - 1 unit per node per hour + backup + data transfer w/in VPC
- Security - encrypted in transit (SSL) and at rest (KMS by default, or can use HSM)
- Currently only available in 1 AZ
- Can restore snapshots in another AZ
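A minimal boto3 sketch of the cross-region snapshot copy and cross-AZ restore notes above - cluster names, snapshot ID, regions, and AZ are hypothetical placeholders:

```python
import boto3

redshift = boto3.client("redshift", region_name="us-east-1")

# Asynchronously copy snapshots to another region for DR.
redshift.enable_snapshot_copy(
    ClusterIdentifier="analytics-cluster",
    DestinationRegion="us-west-2",
    RetentionPeriod=7,  # days to keep copied snapshots
)

# Since the cluster lives in a single AZ, recover from an AZ outage
# by restoring a snapshot into a different AZ.
redshift.restore_from_cluster_snapshot(
    ClusterIdentifier="analytics-cluster-restored",
    SnapshotIdentifier="rs:analytics-cluster-2024-01-01-00-00",  # placeholder
    AvailabilityZone="us-east-1b",
)
```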
Aurora
- MySQL & PostgreSQL compatible
- Up to 5x perf over MySQL; 3x over PGSQL
- Starts at 10GB, scales in 10GB increments up to 64TB
- Compute can scale up to 32 vCPUs and 244GB of memory
- 2 copies of data in each AZ, minimum of 3 AZs = 6 copies of your data
- Transparently handles the loss of up to 2 copies w/o affecting write availability, and up to 3 copies w/o affecting read availability
- Self healing - data blocks and disks are scanned for errors and repaired automatically
- Aurora Replicas - max 15 - in-region replication
- MySQL Read Replicas - max 5
- PostgreSQL Read Replicas - max 1
- Backups are always enabled. Snapshots can be shared w/ other AWS accounts.
- Aurora Serverless - automatically starts/stops/scales - for infrequent, intermittent or unpredictable workloads
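A minimal boto3 sketch of launching an Aurora Serverless (v1-style API) cluster for an intermittent workload - the identifier, credentials, and capacity bounds are hypothetical:

```python
import boto3

rds = boto3.client("rds", region_name="us-east-1")

# The cluster starts, stops, and scales itself between the capacity
# bounds below - suited to infrequent/unpredictable workloads.
rds.create_db_cluster(
    DBClusterIdentifier="intermittent-workload",
    Engine="aurora-mysql",
    EngineMode="serverless",
    MasterUsername="admin",
    MasterUserPassword="change-me",  # use Secrets Manager in practice
    ScalingConfiguration={
        "MinCapacity": 1,
        "MaxCapacity": 8,
        "AutoPause": True,             # pause when idle...
        "SecondsUntilAutoPause": 300,  # ...after 5 minutes
    },
)
```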
ElastiCache
- In memory cache in the cloud - increase DB and web application performance
- memcached - simple
- Redis - pub/sub, complex data types, backups and restores, Multi-AZ
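A minimal cache-aside (lazy-loading) sketch using the redis-py client against a hypothetical ElastiCache Redis endpoint - query_database stands in for the real RDS/DynamoDB lookup:

```python
import json
import redis  # pip install redis

# Hypothetical ElastiCache Redis endpoint.
cache = redis.Redis(host="my-cluster.abc123.use1.cache.amazonaws.com", port=6379)

def query_database(user_id):
    ...  # expensive RDS/DynamoDB lookup goes here (stand-in)

def get_user(user_id):
    """Cache-aside: serve from cache, fall back to the DB on a miss."""
    key = f"user:{user_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)            # cache hit
    user = query_database(user_id)           # cache miss -> hit the DB
    cache.setex(key, 300, json.dumps(user))  # cache for 5 minutes
    return user
```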
Database Migration Service (DMS)
- migrate to the cloud, between on-premises DBs, or any combination
- Server that runs replication software - source/target
- DMS creates target tables and primary keys automatically, or you can pre-create them manually; the AWS Schema Conversion Tool (SCT) is needed for heterogeneous migrations
- Supports homogenous / heterogeneous migrations
- Sources - usual DBs, Azure SQL, RDS, S3
- Targets - usual DBs, RDS, Redshift, DynamoDB, S3, Elasticsearch, Kinesis data streams, DocumentDB
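A minimal boto3 sketch of kicking off a DMS replication task - all ARNs and names are hypothetical, and for a heterogeneous migration SCT would convert the schema first:

```python
import json
import boto3

dms = boto3.client("dms", region_name="us-east-1")

# Endpoints and the replication instance must already exist.
dms.create_replication_task(
    ReplicationTaskIdentifier="onprem-oracle-to-rds-pg",
    SourceEndpointArn="arn:aws:dms:us-east-1:123456789012:endpoint:SRC",
    TargetEndpointArn="arn:aws:dms:us-east-1:123456789012:endpoint:TGT",
    ReplicationInstanceArn="arn:aws:dms:us-east-1:123456789012:rep:INST",
    MigrationType="full-load-and-cdc",  # full load, then ongoing replication
    TableMappings=json.dumps({
        "rules": [{
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "include-app-schema",
            "object-locator": {"schema-name": "app", "table-name": "%"},
            "rule-action": "include",
        }]
    }),
)
```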
Caching Services
- CloudFront
- API Gateway
- ElastiCache - Redis / memcached
- DynamoDB Accelerator - DAX
Amazon EMR
- Elastic Map Reduce
- Spark, Hive, HBase, Flink, Hudi, Presto
- Petabyte-scale analysis at 1/2 the cost of traditional on-prem solutions and 3x faster than standard Apache Spark
- Cluster - nodes w/ roles
- Master node - status of tasks & monitors health - every cluster has a master node
- Core node - runs tasks and stores data in Hadoop Distributed File System (HDFS) - multi-node clusters have at least one core node
- Task node - only runs task and does NOT store data in HDFS - optional
- Log files stored on master node at /mnt/var/log
- can configure the cluster to archive logs to S3 at 5-minute intervals in case of normal shutdown or error
- must be configured when the cluster is first created
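A minimal boto3 sketch of launching an EMR cluster with master/core/task instance groups and S3 log archiving (which, per the notes above, can only be set at creation time) - names, release label, and bucket are hypothetical:

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

emr.run_job_flow(
    Name="spark-analytics",
    ReleaseLabel="emr-6.15.0",  # example release label
    Applications=[{"Name": "Spark"}, {"Name": "Hive"}],
    LogUri="s3://my-bucket/emr-logs/",  # archived at 5-minute intervals
    ServiceRole="EMR_DefaultRole",
    JobFlowRole="EMR_EC2_DefaultRole",
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE",   "InstanceType": "m5.xlarge", "InstanceCount": 2},  # HDFS lives here
            {"InstanceRole": "TASK",   "InstanceType": "m5.xlarge", "InstanceCount": 2},  # compute only, no HDFS
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
    },
)
```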