Multi-Region Architecture Deep Dive

Multi-Region Foundation (30min)

오준석 (Junseok Oh), Sr. Solutions Architect, AWS

Agenda

150-minute session in total

1. Multi-Region Foundation (30 min)
2. Data Sync & Replication (30 min)
Break (5 min)
3. Traffic Routing & Edge (30 min)
4. DR & Failover Automation (30 min)
Break (5 min)
5. Observability & Operations (30 min)
Questions will be taken at the end of each block.

Why Multi-Region?

Business Drivers


Scalability (10x Traffic Spikes)

  • Traffic surges on the scale of Black Friday or Prime Day
  • Overcoming single-region AZ capacity limits
  • Spreading peak-time load across regions

Active-Active vs Active-Passive

Active-Active

  • Traffic: both regions serve traffic simultaneously
  • Latency: users are routed to the nearest region (optimal)
  • Cost: 2x resources running at all times (high)
  • Data: bidirectional replication; conflict resolution required
  • Failover: immediate (both regions already serving traffic)
  • Complexity: high; requires data synchronization and conflict-resolution logic
  • Use Case: global services, real-time collaboration, gaming

Active-Passive

  • Traffic: only the Primary serves; the Secondary stands by
  • Latency: anchored to the Primary region (varies by geography)
  • Cost: only standby resources to maintain (low)
  • Data: one-way replication, Primary → Secondary
  • Failover: RTO-bound (minutes to tens of minutes)
  • Complexity: low; only one-way replication to manage
  • Use Case: region-local services, cost-sensitive workloads, DR-centric designs

CAP Theorem in Practice

C = Consistency · A = Availability · P = Partition Tolerance

CP: Aurora DSQL
Strong consistency; may reject writes during a partition

AP: DynamoDB Global Tables
Eventually consistent, always available, last-writer-wins

AP: ElastiCache Global Datastore
Async replication, < 1s lag, read replicas

Key point: network partitions are unavoidable → tolerating P is mandatory → the real choice is C vs A

Consistency Models

Strong Consistency

Definition: every read returns the result of the latest write

AWS services: Aurora DSQL, DynamoDB (strongly consistent read)

Latency: high (requires a cross-region quorum)

Use Case: financial transactions, inventory management, payment processing

Write(x=5) → [Sync to all regions] → Read(x) = 5 ✓

Bounded Staleness

Definition: reads return data at most N seconds or N versions old

AWS services: none managed; Cosmos DB (Azure) offers it, otherwise a custom implementation is required

Latency: medium (flexible within the configured bound)

Use Case: leaderboards, analytics dashboards, near-real-time data

Write(x=5) → [Within 5 seconds] → Read(x) = 5 or recent value

Read-Your-Writes (Session)

Definition: within the same session, you always read your own writes

AWS services: DynamoDB (with session affinity)

Latency: low (local reads possible)

Use Case: user profiles, shopping carts, personal settings

Session A: Write(x=5) → Read(x) = 5 ✓ | Session B: Read(x) = 3 (previous value, updated shortly)

Eventual Consistency

Definition: given enough time, all replicas converge to the same state

AWS services: S3, DynamoDB Global Tables, ElastiCache

Latency: lowest (local reads, async replication)

Use Case: logs, metrics, content caches, non-critical data

Write(x=5) → Read(x) = 3 (stale) → [Eventually] → Read(x) = 5
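The gap between eventual and read-your-writes consistency can be sketched with a toy replica. This is illustrative Python, not an AWS API; the `Replica` class and the sequence numbers are invented for this example:

```python
class Replica:
    """Toy secondary that applies replicated writes lazily (illustrative only)."""
    def __init__(self):
        self.data = {}
        self.log = []          # pending (seq, key, value) entries
        self.applied_seq = 0

    def apply_up_to(self, seq):
        # Apply every pending write with sequence number <= seq.
        for s, k, v in [e for e in self.log if e[0] <= seq]:
            self.data[k] = v
            self.applied_seq = max(self.applied_seq, s)
        self.log = [e for e in self.log if e[0] > seq]

primary, secondary = {}, Replica()

def write(key, value, seq):
    primary[key] = value                      # committed on the primary at once
    secondary.log.append((seq, key, value))   # replicates asynchronously

def read_eventual(key):
    return secondary.data.get(key)            # may return stale data

def read_your_writes(key, session_seq):
    # Wait (here: force-apply) until the replica has caught up to our session.
    if secondary.applied_seq < session_seq:
        secondary.apply_up_to(session_seq)
    return secondary.data.get(key)
```

An eventual read right after a write can miss it, while a session-aware read never does; that is the whole trade-off in a few lines.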

Architecture at a Glance

Edge Layer
CloudFront
WAF
Route53
PRIMARY us-east-1
NLB
EKS
20 MSA | v1.35 | Karpenter v1.9
SECONDARY us-west-2
NLB
EKS
20 MSA | v1.35 | Karpenter v1.9
Data Layer (6 Stores)
Aurora DSQL
DocumentDB
ElastiCache
MSK
OpenSearch
S3

Write-Primary / Read-Local Pattern

us-east-1 (Primary)
Write Path All writes processed here
Read Path (Local) US East users read locally
Primary Writer
Async Replication
< 1s lag
Write Forwarding
Transparent to app
us-west-2 (Secondary)
Write Forwarding Writes forwarded to Primary
Read Path (Local) US West users read locally
Read Replica
No Conflict
Single write point → no conflicts
Low Read Latency
Local reads → optimal latency
Simple Code
Aurora handles the forwarding
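Aurora's write forwarding makes the routing transparent to the application, but the rule itself is simple enough to sketch explicitly. The hostnames below are hypothetical placeholders, not real cluster endpoints:

```python
# Hypothetical endpoints; real clusters expose their own writer/reader DNS names.
ENDPOINTS = {
    "primary_writer": "aurora-global.cluster-primary.us-east-1.example",
    "local_reader":   "aurora-global.cluster-ro.us-west-2.example",
}

def pick_endpoint(sql: str) -> str:
    """Write-Primary / Read-Local: SELECTs stay local, everything else goes to the writer."""
    verb = sql.lstrip().split()[0].upper()
    return ENDPOINTS["local_reader"] if verb == "SELECT" else ENDPOINTS["primary_writer"]
```

With write forwarding enabled, the driver can point everything at the local endpoint and Aurora applies this split for you; the sketch just makes the policy visible.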

Regional Deployment

us-east-1 (Primary)

us-west-2 (Secondary)

VPC Design

us-east-1 VPC

10.0.0.0/16
• 65,536 IP addresses
• 3 Availability Zones
• Non-overlapping with us-west-2

us-west-2 VPC

10.1.0.0/16
• 65,536 IP addresses
• 3 Availability Zones
• Non-overlapping with us-east-1
Critical: CIDRs must not overlap; this is a hard requirement for Transit Gateway peering

3-Tier Subnet Architecture

Public Tier

10.x.0.0/20 (AZ-a)
10.x.16.0/20 (AZ-b)
10.x.32.0/20 (AZ-c)
Resources:
• ALB / NLB
• NAT Gateway
• Bastion (if any)

Private Tier

10.x.48.0/20 (AZ-a)
10.x.64.0/20 (AZ-b)
10.x.80.0/20 (AZ-c)
Resources:
• EKS Worker Nodes
• Application Pods
• Internal services

Data Tier

10.x.96.0/20 (AZ-a)
10.x.112.0/20 (AZ-b)
10.x.128.0/20 (AZ-c)
Resources:
• Aurora, DocumentDB
• ElastiCache, MSK
• OpenSearch
3 AZ Distribution (per region)
AZ-a
Public + Private + Data
AZ-b
Public + Private + Data
AZ-c
Public + Private + Data
Total: 9 subnets per region × 2 regions = 18 subnets
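The /20 carve-up above can be verified with Python's `ipaddress` module. The tier-to-offset mapping mirrors the slide: public at offsets 0/16/32, private at 48/64/80, data at 96/112/128.

```python
import ipaddress

def tier_subnets(vpc_cidr: str) -> dict:
    """Split a /16 VPC into /20 subnets and assign 3 per tier (one per AZ)."""
    vpc = ipaddress.ip_network(vpc_cidr)
    s = list(vpc.subnets(new_prefix=20))   # 16 contiguous /20 blocks in a /16
    return {"public": s[0:3], "private": s[3:6], "data": s[6:9]}

tiers = tier_subnets("10.0.0.0/16")
```

The same function run on 10.1.0.0/16 yields the us-west-2 plan, and `overlaps()` confirms the two VPC ranges are disjoint, which is the Transit Gateway peering precondition.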

Transit Gateway Peering

TGW (us-east-1)
VPC Attachment
+ Peering Attachment
AWS Backbone
Encrypted, ECMP enabled
TGW (us-west-2)
VPC Attachment
+ Peering Attachment

Benefits

ECMP: Equal-Cost Multi-Path routing for bandwidth
Centralized: Single routing table, easier management
Scalable: Add more VPCs without mesh complexity
Encrypted: All traffic encrypted on AWS backbone

Route Table Example

# us-east-1 TGW Route Table
10.0.0.0/16 → VPC Attachment (local)
10.1.0.0/16 → Peering Attachment (us-west-2)

# us-west-2 TGW Route Table
10.1.0.0/16 → VPC Attachment (local)
10.0.0.0/16 → Peering Attachment (us-east-1)

VPC Endpoints Strategy

Gateway Endpoints (Free)

  • S3 (no hourly or data-processing charge)

Interface Endpoints (PrivateLink)

  • ECR, STS, CloudWatch Logs (hourly + per-GB, but traffic bypasses the NAT Gateway)

Architecture Decisions Summary

Decision | Choice | Rationale
Data Pattern | Write-Primary / Read-Local | No conflict, Aurora handles forwarding
Cross-region | Transit Gateway | ECMP, centralized routing, scalability
Ingress | CloudFront-only | No direct ALB, WAF integration, global edge
IAM | IRSA everywhere | Least privilege, no shared node role
Node Provisioning | Karpenter 6-pool | Workload-specific, cost/availability balance
Service Discovery | DNS-based | Simpler ops, sufficient for current scale
Autoscaling | KEDA + HPA dual | Event-driven + metric-driven scaling
GitOps | ArgoCD App-of-apps | Declarative, audit trail, multi-cluster
Encryption | Per-service KMS | Isolation, granular rotation, blast radius
Network | 3-tier subnets | Security segmentation, compliance

Key Takeaways

Multi-Region Fundamentals

  • Business Drivers: Latency reduction, 99.99%+ availability, 10x scalability, data residency
  • Pattern Choice: Active-Active vs Active-Passive → Write-Primary/Read-Local as hybrid
  • CAP Trade-off: Aurora DSQL (CP) for transactions, DynamoDB/ElastiCache (AP) for caching

Network Foundation

  • Non-overlapping CIDR: us-east-1 = 10.0.0.0/16, us-west-2 = 10.1.0.0/16
  • 3-Tier Subnets: Public (ALB/NAT) / Private (EKS) / Data (Aurora/Cache) across 3 AZs
  • Transit Gateway Peering: ECMP enabled, encrypted AWS backbone, centralized routing

Cost & Security Optimization

  • VPC Endpoints: S3 Gateway (free) + ECR/STS/CW Logs Interface → 82% NAT cost savings
  • Security Layers: CloudFront-only ingress, prefix-list SGs, IRSA, per-service KMS

Next: Data Sync & Replication

  • Aurora DSQL distributed transactions
  • DocumentDB Global Clusters
  • ElastiCache Global Datastore
  • MSK Cross-region Replication

Knowledge Check

Q1: In the Write-Primary / Read-Local pattern, how are write requests from the Secondary region handled?
Q2: What was the main reason for choosing Transit Gateway Peering over VPC Peering?
Q3: Why must the two regions' VPC CIDRs not overlap?
Q4: Which services drove the ~82% NAT Gateway cost savings via VPC Endpoints?

Thank You

Thank you for your hard work!

Multi-Region Architecture Deep Dive

Data Sync & Replication (30min)

오준석 (Junseok Oh), Sr. Solutions Architect, AWS

Polyglot Persistence Strategy

Workload Type | Data Store | Replication | Services
ACID transactions (strong consistency) | Aurora PostgreSQL Global DB | ≤1s async | order, payment, inventory, user-account, shipping
Flexible schema (document model) | DocumentDB Global Cluster | ≤2s async | product-catalog, user-profile, wishlist, review
Real-time cache (millisecond responses) | ElastiCache Valkey Global | <1s async | cart, session, rate-limiting, leaderboard
Full-text search (Korean nori analyzer) | OpenSearch 2.17 | Cross-cluster | search, analytics, notification-logs
Event streaming (async) | MSK Kafka 3.6 | MSK Replicator | event-bus, saga orchestration
Static assets (objects) | S3 + CRR | Async | CDN assets, Tempo traces

Principle: use each data store for what its workload characteristics fit best. Don't put everything in one DB.

Cross-Region Replication Patterns

Aurora PostgreSQL Global Database

Cluster Topology

Global Cluster: aurora-global

┌─────────────────────────┐
│ us-east-1               │
│ (PRIMARY CLUSTER)       │
│                         │
│ Writer (r6g.2xlarge)    │
│ Reader 1 (r6g.xlarge)   │
│ Reader 2 (r6g.xlarge)   │
└───────────┬─────────────┘
            │ Storage-level
            │ Replication ≤1s
            ▼
┌─────────────────────────┐
│ us-west-2               │
│ (SECONDARY CLUSTER)     │
│                         │
│ Reader 1 (r6g.xlarge)   │
│ Reader 2 (r6g.xlarge)   │
└─────────────────────────┘

Key Specifications

Parameter | Value
Engine | PostgreSQL 15.4
Instance Class | r6g.2xlarge (Writer)
Replication Lag | ≤1 second (typical)
RPO | ~1 second
Failover RTO | <1 minute (planned)
Max Secondary Regions | 5
Storage | Auto-scaling, encrypted

Services Using Aurora

  • order-service: order CRUD
  • payment-service: payment transactions
  • inventory-service: stock management
  • user-account-service: user accounts
  • shipping-service: shipment tracking
  • returns-service: returns processing

Aurora Global Write Forwarding

Write Forwarding Latency Analysis

Operation | Local Region | With Write Forwarding | Overhead
Simple INSERT | 5-10ms | 65-90ms | +60-80ms (RTT)
Batch INSERT (100 rows) | 50-100ms | 110-180ms | +60-80ms
UPDATE with index | 3-8ms | 63-88ms | +60-80ms
Transaction (3 statements) | 15-30ms | 195-270ms | +180-240ms (3 RTT)
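The overhead column follows a simple model: each forwarded statement pays roughly one cross-region round trip. A sketch with an assumed RTT of ~70ms (the table's 60-80ms band):

```python
def forwarded_latency_ms(local_ms: float, statements: int = 1, rtt_ms: float = 70) -> float:
    """Estimated latency of a write issued via write forwarding from the Secondary.

    Model: local execution cost plus one cross-region RTT per forwarded statement.
    """
    return local_ms + statements * rtt_ms
```

This is why the 3-statement transaction lands in the 195-270ms range: three RTTs stack on top of the 15-30ms local cost, and why long transactions should run in the Primary region.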

Key Constraints

Constraint | Detail
Read-after-write | must wait out the replication lag (≤1s) to see the latest data
Transaction isolation | each statement adds an RTT; long transactions are inefficient
No DDL | CREATE TABLE, ALTER TABLE, etc. must run directly on the Primary
No temp tables | temporary tables are not supported

Best Practice: place write-heavy services in the Primary region; deploy read-heavy services in both regions.

DocumentDB Global Cluster

::: tab Primary Cluster (us-east-1)

Writer + 2 Readers

{
  "cluster": "docdb-global-us-east-1",
  "role": "PRIMARY",
  "engine": "docdb 5.0 (MongoDB 5.0 compatible)",
  "instances": [
    { "id": "writer",   "class": "db.r6g.2xlarge", "role": "writer" },
    { "id": "reader-1", "class": "db.r6g.xlarge",  "role": "reader" },
    { "id": "reader-2", "class": "db.r6g.xlarge",  "role": "reader" }
  ],
  "encryption": "at-rest (KMS) + in-transit (TLS)",
  "backup": "continuous, 35-day retention"
}

Collections

  • products (150 items, 10 categories): product catalog
  • user_profiles: profile + preferences + shipping addresses
  • wishlists: wish lists
  • reviews: product reviews + ratings
  • notifications: notification history

:::

::: tab Secondary Cluster (us-west-2)

2 Readers (Read-Only)

{
  "cluster": "docdb-global-us-west-2",
  "role": "SECONDARY",
  "engine": "docdb 5.0",
  "instances": [
    { "id": "reader-1", "class": "db.r6g.xlarge", "role": "reader" },
    { "id": "reader-2", "class": "db.r6g.xlarge", "role": "reader" }
  ],
  "replication_lag": "≤ 2 seconds (oplog-based)",
  "read_preference": "secondaryPreferred"
}

Replication Mechanism

  • oplog-based asynchronous replication
  • Secondary is read-only (Write Forwarding not supported)
  • On failover, a Secondary is promoted to Primary (manual)
  • RPO: ~2 seconds

:::

DocumentDB Schema Design

// products collection: product catalog
{
  "productId": "PROD-001",
  "name": "Samsung Galaxy S25 Ultra",
  "brand": "Samsung",
  "category": { "id": "CAT-01", "name": "Electronics", "slug": "electronics" },
  "price": 1799000,
  "salePrice": 1439200,
  "discount": 20,
  "currency": "KRW",
  "rating": 4.5,
  "reviewCount": 342,
  "tags": ["electronics", "samsung", "best-seller"],
  "attributes": { "weight": "0.5kg", "origin": "Korea" },
  "stock": { "available": 250, "warehouse": "WH-EAST-1" },
  "status": "active"
}

Index Strategy

Collection | Index | Purpose
products | { productId: 1 } unique | PK lookup
products | { "category.slug": 1 } | category filter
products | { brand: 1 }, { rating: -1 } | brand filter, rating sort
user_profiles | { userId: 1 } unique | user lookup
reviews | { productId: 1 }, { rating: -1 } | reviews per product, sorted by rating
notifications | { userId: 1, sentAt: -1 } | recent notifications

ElastiCache Valkey Global Datastore

Cluster Configuration

Parameter | Value
Engine | Valkey 7.2
Node Type | cache.r7g.xlarge
Shards | 3 (num_node_groups)
Replicas/Shard | 2
Cross-region lag | < 1 second
Encryption | At-rest (KMS) + In-transit (TLS)

Cache Patterns & TTL Strategy

Key Pattern | TTL | Data Type | Access Pattern
product:{id} | 1h | Hash | Cache-Aside: read DB, then cache
cache:categories | 24h | String (JSON) | Refresh-Ahead
cart:{userId} | 7d | Hash | Write-Through: updated immediately
session:{sessionId} | 2h | Hash | Write-Through
ratelimit:api:{userId} | 60s | String (counter) | Increment + EXPIRE
stock:{productId} | none | String (counter) | DECR on purchase
leaderboard:popular | none | Sorted Set | ZINCRBY on view
search-history:{userId} | 30d | List | LPUSH + LTRIM(50)
promo:flash-sale | 24h | Hash | Write-Through
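The `ratelimit:api:{userId}` row (Increment + EXPIRE) is a fixed-window counter. With a plain dict standing in for Valkey, the window bucket plays the role EXPIRE plays on the server:

```python
def allow_request(counters: dict, user: str, now_s: int,
                  limit: int = 100, window_s: int = 60) -> bool:
    """Fixed-window rate limit: INCR a per-user counter that resets each window.

    In Valkey this is INCR ratelimit:api:{user} followed by EXPIRE 60 on first hit;
    here the (user, window) bucket key stands in for the expiring key.
    """
    bucket = (user, now_s // window_s)
    counters[bucket] = counters.get(bucket, 0) + 1
    return counters[bucket] <= limit
```

Old buckets simply stop being referenced once the window rolls over, which is exactly what EXPIRE achieves server-side without client bookkeeping.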

Cache Invalidation Strategy

Pattern | When to Use | Implementation
Cache-Aside | read-heavy, write-light data | app reads DB on cache miss, then SETs
Write-Through | immediate consistency needed | app updates DB and cache together
Event-Driven | async consistency acceptable | invalidate cache via Kafka events
TTL-Based | eventual consistency acceptable | natural expiry, then reload from DB
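Cache-Aside from the first row, sketched with an in-memory dict standing in for Valkey and another for the DB. Real code would use a cluster client and SET with a TTL; the data here is a made-up sample row:

```python
cache: dict = {}   # stand-in for Valkey; real code: RedisCluster + SET ... EX 3600
db = {"product:1": {"name": "Galaxy S25", "price": 1799000}}  # stand-in for DocumentDB

def get_product(key: str):
    """Cache-Aside: serve hits locally; on a miss, read the DB and populate the cache."""
    if key in cache:
        return cache[key]             # hit: no DB round trip
    value = db.get(key)               # miss: fall through to the database
    if value is not None:
        cache[key] = value            # populate so the next read is a hit
    return value
```

The first call pays the DB round trip and every later call within the TTL is served from the cache, which is why this pattern fits read-heavy, write-light keys like `product:{id}`.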

MSK (Kafka) Event Architecture

Topic Summary

Category | Topics | Partitions | Retention
Order | order.created/confirmed/shipped/delivered | 6 each | 7d
Payment | payment.completed/refunded/failed | 6 each | 7d
Inventory | inventory.reserved/released/restocked | 4 each | 7d
Notification | notification.email/push/sms | 3 each | 3d
Infrastructure | dlq.all, saga.orchestrator | 1, 6 | 30d, 7d

MSK Replicator: Cross-Region Topic Replication

::: option-a MSK Replicator Enabled

Active Configuration

[us-east-1 MSK] ─── MSK Replicator ──→ [us-west-2 MSK] (IAM Auth, async)

Replication scope: all topics (regex: .*)

Consumer offsets: can be synchronized

Compression: GZIP

Latency: hundreds of ms to a few seconds

Pros

  • On DR, the Secondary can consume immediately
  • Consumer-offset sync minimizes event loss
  • Topic configuration (partitions, configs) synced automatically

Cost

  • Data transfer: $0.02/GB (cross-region)
  • Plus hourly Replicator charges

:::

::: option-b MSK Replicator Disabled

Passive Configuration

[us-east-1 MSK] ─── (no replication) ──→ [us-west-2 MSK]

Secondary MSK: independent cluster (empty topics)

DR scenario: producers switch to the Secondary

Pros

  • No cross-region data-transfer cost
  • Simpler configuration

Cons

  • Event loss is unavoidable on DR (in-flight events)
  • Consumer offsets must be reset
  • Longer failover (topics need no re-creation, but hold no data)

:::

Consistency vs Latency Trade-offs

RPO & Replication Lag Matrix

Data Store | Replication Method | Typical Lag | RPO | Risk Level
Aurora Global DB | Storage-level async | ≤ 1s | ~1s | 🟢 Low
DocumentDB Global | Oplog-based async | ≤ 2s | ~2s | 🟡 Medium
ElastiCache Global | Async replication | < 1s | ~1s | 🟢 Low
MSK Replicator | Topic-level async | 100ms ~ 5s | ~5s | 🟡 Medium
OpenSearch | Cross-cluster (manual) | Minutes | Minutes | 🔴 High
S3 CRR | Object-level async | ≤ 15min | ~15min | 🔴 High

Data Loss Scenarios During Regional Failover

Scenario | Uncommitted Data | Impact | Mitigation
Aurora Failover | ≤1s of writes | recent orders/payments lost | Idempotent retry + DLQ
DocumentDB Failover | ≤2s of writes | profile/review updates lost | Event sourcing + replay
ElastiCache Failover | ≤1s of writes | sessions/carts lost | Session reconstruction
MSK Failover | In-flight events | event ordering not guaranteed | Consumer idempotency

Key point: every service must be designed to be idempotent so retries never cause duplicate processing.
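A minimal sketch of that idempotent-consumer rule. The in-memory dedup set stands in for a durable store (a Valkey SETNX or a conditional DynamoDB put), and `event_id` is an assumed envelope field:

```python
processed_ids: set = set()   # production: durable dedup store, not process memory

def handle_event(event: dict) -> str:
    """Process each event_id at most once, so redelivery after failover is harmless."""
    eid = event["event_id"]
    if eid in processed_ids:
        return "duplicate-skipped"   # retry or replicated redelivery: no double side effect
    processed_ids.add(eid)
    # ... side effects here: decrement stock, record payment, emit next saga step ...
    return "processed"
```

With this in place, at-least-once delivery from MSK plus replays after a regional failover degrade to effectively-once processing.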

Key Takeaways

:::card-grid

:::card highlight

1. Polyglot Persistence

Choose each data store for its workload: Aurora (ACID), DocumentDB (documents), ElastiCache (cache), MSK (events).

:::

:::card

2. Write-Primary / Read-Local

All writes route to the Primary region; reads are served from both regions. Aurora Write Forwarding even allows writes issued from the Secondary.

:::

:::card

3. Async Replication Trade-offs

All cross-region replication is asynchronous. RPO ranges from 1 second (Aurora) to 15 minutes (S3). If you need strong consistency, handle it in the Primary.

:::

:::card highlight

4. Idempotency is King

Failover, replication lag, network partitions: staying safe in all of them requires idempotent service design.

:::

:::

Block 2 Quiz

Traffic Routing & Edge

Global Traffic Management with Route53, CloudFront & WAF

End-to-End Traffic Flow

👤 mall.example.com
Route53 CNAME → CloudFront
Edge Layer
CloudFront
WAF
/static/*
S3 + OAC
/api/*
api-internal.example.com
Route53 Latency-based
us-east-1 / us-west-2
NLB Prefix-list SG
api-gateway
Backend MSA

Route53 Latency-Based Routing

How It Works

  • Latency Measurement
    • AWS measures RTT from 20+ edge locations
    • Updates every ~24 hours
    • NOT real-time user latency
  • DNS Resolution
    • User DNS query → nearest resolver
    • Route53 returns lowest-latency region IP
    • TTL: 60 seconds (balance: freshness vs load)
  • Automatic Failover
    • Health check fails → remove from rotation
    • Traffic shifts to healthy region
    • Recovery → automatic re-addition

Health Check Configuration

Type: HTTP
Endpoint: /health
Port: 443
Protocol: HTTPS
Interval: 30 seconds
Failure Threshold: 3
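The /health endpoint that Route53 polls typically aggregates dependency probes; a minimal sketch (the probe names and the 200/503 mapping are this example's convention, not a prescribed AWS format):

```python
def health_status(probes: dict) -> tuple:
    """Run each dependency probe; report 200 only when every probe passes, else 503.

    Route53 counts anything other than 2xx/3xx as a failed check, so returning 503
    when a dependency is down is what triggers failover after the threshold.
    """
    results = {name: bool(probe()) for name, probe in probes.items()}
    return (200 if all(results.values()) else 503), results
```

A web handler would wire this to real probes (DB ping, cache ping) and serialize `results` as the response body for operator visibility.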

DNS Records

Record | Type | Routing
mall.example.com | CNAME | → CloudFront
api-internal.example.com | A | Latency (2 regions)

Latency Targets

  • us-east-1 → NLB IP (Primary)
  • us-west-2 → NLB IP (Secondary)

Route53 Health Checks

Public DNS

mall.example.com
Type: CNAME
Target: d1234567890.cloudfront.net
Health Check: None (CloudFront handles)

Internal API DNS

api-internal.example.com
Type: A Record (Latency-based)
Targets:
• us-east-1 NLB IP
• us-west-2 NLB IP
Health Check: HTTP 200 on /health

Health Check Parameters

Endpoint: /health
Protocol: HTTPS:443
Interval: 30s
Failure Threshold: 3
Regions: us-east-1, us-west-2, eu-west-1

CloudFront Distribution

Path Pattern: /api/*
Origin: api-internal.example.com (NLB)
Cache Policy: CachingDisabled                      # No caching for dynamic API responses
Origin Request Policy: AllViewerExceptHostHeader   # Forward all headers except Host
Allowed Methods: GET, HEAD, OPTIONS, PUT, POST, PATCH, DELETE   # All HTTP methods for REST API
Viewer Protocol: HTTPS only
Compress: true

Path Pattern: /static/*
Origin: mall-static-assets.s3.amazonaws.com
Cache Policy: CachingOptimized
TTL: 86400 (24 hours)                              # Long cache for immutable assets
Origin Access: OAC (Origin Access Control)         # S3 bucket not publicly accessible
Allowed Methods: GET, HEAD
Viewer Protocol: HTTPS only
Compress: true

Path Pattern: /* (Default)
Origin: mall-static-assets.s3.amazonaws.com
Cache Policy: CachingOptimized
Custom Error Response:
  Error Code: 403, 404
  Response Page: /index.html
  Response Code: 200                               # SPA client-side routing support

CloudFront Origin Shield

What is Origin Shield?

  • Additional caching layer at regional edge
  • Positioned between edge locations and origin
  • Single point of origin contact per region

Configuration

Origin Shield Region: us-east-1   # Closest to primary origin
Enabled: true

How It Works

User → Edge Location (400+ PoPs) → Origin Shield (1 region) → Origin (NLB)

Benefits

CloudFront Origin Failover

Origin Group: api-origin-group

Primary Origin
api-internal.example.com
us-east-1 NLB (via Route53)
failover →
Secondary Origin
api-secondary.example.com
us-west-2 NLB (direct)

Failover Triggers

  • HTTP 500, 502, 503, 504
  • Connection timeout
  • Origin not reachable

Failover Speed

  • Sub-second switching
  • No DNS propagation delay
  • Automatic recovery on primary health
Note: Origin Failover operates at CloudFront level, independent of Route53 health checks. Both layers provide redundancy.

WAF Rule Stack

Priority 1

GeoRestriction

Allow: KR, US, JP
Block: All others
Priority 2

RateLimit

2,000 requests / 5 min / IP
Action: Block
Priority 3

AWSManagedRulesKnownBadInputsRuleSet

Log4j, SSRF, etc.
Priority 4

AWSManagedRulesSQLiRuleSet

SQL Injection patterns
Priority 5

AWSManagedRulesCommonRuleSet

OWASP Top 10

WAF Custom Rules — Terraform

# WAF ACL with rate limiting and geo restriction
resource "aws_wafv2_web_acl" "mall_waf" {
  name  = "mall-waf-acl"
  scope = "CLOUDFRONT"

  # Requests matching no rule are allowed through (required block)
  default_action {
    allow {}
  }

  # Geo restriction: block everything outside KR, US, JP
  rule {
    name     = "GeoRestriction"
    priority = 1
    action {
      block {}
    }
    statement {
      not_statement {
        statement {
          geo_match_statement {
            country_codes = ["KR", "US", "JP"]
          }
        }
      }
    }
    visibility_config {
      sampled_requests_enabled   = true
      cloudwatch_metrics_enabled = true
      metric_name                = "GeoRestriction"
    }
  }

  # Rate limiting: 2,000 requests / 5 min / IP
  rule {
    name     = "RateLimitRule"
    priority = 2
    action {
      block {}
    }
    statement {
      rate_based_statement {
        limit              = 2000
        aggregate_key_type = "IP"
      }
    }
    visibility_config {
      sampled_requests_enabled   = true
      cloudwatch_metrics_enabled = true
      metric_name                = "RateLimitRule"
    }
  }

  visibility_config {
    sampled_requests_enabled   = true
    cloudwatch_metrics_enabled = true
    metric_name                = "mall-waf-acl"
  }
}

Block everything else


Prefix-List Security

CloudFront
400+ Edge Locations
Security Group
Prefix-list ONLY
NLB

Allowed (Prefix-List)

Inbound Rule:
  Type: HTTPS (443)
  Source: com.amazonaws.global.cloudfront.origin-facing
  # AWS-managed prefix list
  # Auto-updates with CloudFront IPs

Blocked

0.0.0.0/0 (Any IP)
Direct ALB/NLB access
WAF bypass attempts
Zero 0.0.0.0/0 Policy
All traffic MUST pass through CloudFront + WAF. Direct access to NLB bypasses all security controls and is strictly prohibited.
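The zero-0.0.0.0/0 policy is easy to audit mechanically. A sketch over simplified rule dicts (this rule shape is invented for the example, not the EC2 API response format):

```python
def open_to_world(inbound_rules: list) -> list:
    """Return inbound rules that allow any-IP access, violating the CloudFront-only policy."""
    return [r for r in inbound_rules
            if "0.0.0.0/0" in r.get("cidrs", []) or "::/0" in r.get("cidrs", [])]
```

Run as a periodic compliance check (or a CI gate on Terraform plans), an empty result confirms every NLB inbound rule references the managed prefix list rather than the open internet.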

Global Accelerator vs CloudFront

Global Accelerator

  • Anycast IP (2 static IPs globally)
  • Protocols: TCP, UDP (Layer 4)
  • Routing: Consistent endpoint routing
  • Use Case: Non-HTTP, gaming, IoT
  • Pricing: Per-flow ($0.025/hr + data)
  • Caching: None (passthrough)
  • WAF Integration: No

CloudFront

  • DNS-based routing
  • Protocols: HTTP/HTTPS only (Layer 7)
  • Routing: Latency-based to edge
  • Use Case: Web apps, APIs, static content
  • Pricing: Per-request ($0.0085/10K) + data
  • Caching: Yes (edge caching)
  • WAF Integration: Yes (native)

When to Use Each

Need Global Traffic Distribution?
↓ Yes
Need Edge Caching?
↓ Yes
CloudFront
Need TCP/UDP?
↓ Yes
Global Accelerator
Need Both?
↓ Yes
CF + GA Combined
CloudFront
Web apps
REST APIs
Static assets
Video streaming
Global Accelerator
Gaming
VoIP/Real-time
IoT devices
Non-HTTP APIs
Combined
Web + Gaming
HTTP + WebSocket
Multi-protocol apps

Edge Network Cost Analysis

Service | Monthly Cost | Pricing Model | Notes
CloudFront | $50 - $500 | Requests + Data transfer | Traffic-dependent
WAF | $0 | ACL + Rules + Requests | Currently disabled in prod
NLB (×2) | $36 + LCU | $18/NLB + LCU charges | 2 regions × $18 base
Route53 | ~$5 | Hosted zones + Queries | $0.50/zone + $0.40/M queries

Cost Optimization

  • Origin Shield: 30-60% origin request reduction
  • Cache optimization: Higher TTL for static
  • Compression: Brotli/Gzip enabled

Cost Drivers

  • Data transfer out (largest component)
  • HTTPS request count
  • Origin requests (cache miss)

Key Takeaways

Traffic Routing

  • Route53 Latency-based for regional failover
  • CloudFront for global edge caching
  • Origin Shield: 50-90% origin reduction
  • Origin Failover: sub-second switching

Edge Security

  • WAF rule stack: Geo → Rate → Managed
  • Bot Control: Count mode first, then Block
  • Prefix-list SG: Zero 0.0.0.0/0 policy
  • All traffic MUST pass CloudFront + WAF

Architecture Decisions

  • CloudFront over GA: HTTP + caching + WAF
  • OAC for S3: No public bucket access
  • NLB over ALB: Lower latency, prefix-list support
  • DNS TTL 60s: Balance freshness vs load

Cost Optimization

  • CloudFront: $50-500/mo (traffic-based)
  • Origin Shield ROI: 30-60% savings
  • Cache TTL tuning for static assets
  • Compression: Brotli + Gzip enabled

Knowledge Check

Q1: How often does Route53 Latency-based Routing refresh its latency measurements?
Q2: How fast does CloudFront Origin Failover switch?
Q3: What is the core purpose of the prefix-list Security Group?
Q4: What is the recommended setting when first enabling WAF Bot Control?

Thank You

Thank you for your hard work!

DR & Failover Automation

Current-State Disaster Recovery Analysis and Automation Strategy (30min)

DR Current State — Risk Assessment

Component | DR Capability | Failover Type | Risk Level
Traffic Routing | Auto failover | Auto | LOW
Aurora DSQL | NO DR (single region) | None | CRITICAL
DocumentDB | Global Cluster | Manual CLI | HIGH
ElastiCache | Global Datastore | Manual CLI | MEDIUM
MSK | NO DR (per-region) | None | CRITICAL
OpenSearch | Per-region (rebuild) | Hours | HIGH
S3 | Cross-Region Replication | Auto CRR | LOW
EKS | Karpenter auto-provision | Auto | LOW

CRITICAL: DSQL + MSK mean complete data loss in a region failure

RTO/RPO Matrix

Component | RTO | RPO | Recovery Notes
Traffic Routing | 1-2 min | 0 | Health check interval + DNS TTL
Aurora DSQL | N/A | Total Loss | No cross-region replica exists
DocumentDB | 5-15 min | < 1 min | Manual switchover-global-cluster
ElastiCache | 1-5 min | < 1 sec | Manual promote secondary
MSK | N/A | Total Loss | Replicator disabled, no offset sync
S3 | Instant | < 15 min | CRR replication lag
EKS | 2-5 min | 0 | Karpenter node replacement
Tiers: RTO < 5 min & RPO < 1 min | RTO 5-15 min | Unrecoverable
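The traffic-routing RTO falls straight out of the health-check and DNS parameters; a rough worst-case model using the health-check configuration from the routing block (actual failover is usually faster, since detection rarely takes the full window):

```python
def dns_failover_rto_s(interval_s: int = 30, failure_threshold: int = 3,
                       dns_ttl_s: int = 60) -> int:
    """Worst-case DNS failover time: consecutive-failure detection plus resolver cache expiry."""
    detection_s = interval_s * failure_threshold   # e.g. 3 failed checks in a row
    return detection_s + dns_ttl_s                 # seconds until cached answers age out
```

With 30s checks, a threshold of 3, and a 60s TTL, the bound is 150 seconds; tightening any of the three parameters trades faster failover against more health-check traffic and DNS query load.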

Scenario 1: Full Region Failure (us-east-1)

Auto Recovery (30s - 2min)

Route53 Health Check
30s detect → traffic switch to us-west-2
CloudFront Origin
Follows Route53, no config change
S3 CRR
Already replicated, RPO <15min

Manual Intervention (5-15min)

DocumentDB Failover
switchover-global-cluster CLI required
ElastiCache Promotion
failover-global-replication-group CLI

DATA LOSS (Unrecoverable)

DSQL: 6 services lose ALL transaction data
order, payment, inventory, user-account, shipping, warehouse
MSK: Unbounded event loss + offset desync
notification, analytics, recommendation consumers affected

Scenario 2: Individual DB Failure

Impact: 6 services affected

  • inventory, shipping, order, payment, user-account, warehouse

Current Behavior:

  • Services fail with connection timeout
  • No automatic fallback

Graceful Degradation (recommended):

// Mock fallback mode
if err != nil && config.MockFallbackEnabled {
    return mockResponse(), nil
}

Impact: 7 services affected

  • product-catalog, recommendation, review, wishlist, notification, analytics, user-profile

Manual Failover Steps:

aws docdb failover-global-cluster \
  --global-cluster-identifier production-docdb-global \
  --target-db-cluster-identifier production-docdb-global-us-west-2

RTO: 5-15 minutes

RPO: < 1 minute

Impact: cart service (session + cache)

Manual Promotion:

aws elasticache failover-global-replication-group \
  --global-replication-group-id production-elasticache \
  --primary-region us-west-2 \
  --primary-replication-group-id production-elasticache-us-west-2

RTO: 1-5 minutes

Cart data: Temporary loss (rebuild from DB)

Missing PDB

Auto Recovery (Karpenter)

  • Node replacement: 2-5 minutes
  • ArgoCD: Auto-redeploy workloads
  • Karpenter: Provisions new nodes on demand

Workload Recovery

  • Deployment rollout triggered automatically
  • Pod scheduling to new nodes
  • Service endpoints updated

Current Gap

DSQL Single-Region Risk — CRITICAL

Aurora DSQL

us-east-1 ONLY
SINGLE REGION

Region failure = Total data loss for 6 core services

Affected Services:
order payment inventory user-account shipping warehouse

Business Impact

Order Processing
All active orders lost, no recovery
Payment Records
Transaction history unavailable
Inventory State
Stock levels unknown, oversell risk
User Accounts
Authentication data lost

DSQL Linked Clusters — Solution

Future State: Multi-Region Active-Active

Interim Mitigation

Automated Failover Pipeline

Trigger
CloudWatch Alarm
HealthCheckStatus = UNHEALTHY
Orchestration
EventBridge Rule
Pattern: cloudwatch.alarm.state_change
Execution (Parallel)
DocumentDB Failover
ElastiCache Promotion
Route53 Update
(if manual override needed)
SNS Notification
Ops team alert
RTO Reduction: Manual 15min → Automated 2min

Lambda Failover Code

DocumentDB Failover Lambda

import boto3
import os

def handler(event, context):
    # Safety flag check
    if not os.environ.get('ENABLE_AUTO_FAILOVER'):
        return {'status': 'skipped'}

    docdb = boto3.client('docdb')
    response = docdb.switchover_global_cluster(
        GlobalClusterIdentifier='production-docdb-global',
        TargetDbClusterIdentifier='production-docdb-global-us-west-2'
    )
    return {'status': 'success'}

ElastiCache Failover Lambda

import boto3
import os

def handler(event, context):
    # Safety flag check
    if not os.environ.get('ENABLE_AUTO_FAILOVER'):
        return {'status': 'skipped'}

    elasticache = boto3.client('elasticache')
    response = elasticache.failover_global_replication_group(
        GlobalReplicationGroupId='production-elasticache',
        PrimaryRegion='us-west-2',
        # Required by the API; regional group name assumed here
        PrimaryReplicationGroupId='production-elasticache-us-west-2'
    )
    return {'status': 'success'}
⚠️
ENABLE_AUTO_FAILOVER Safety Flag
An environment variable toggles automatic failover on and off. Keep it disabled in test environments; enable it only in production.

DocumentDB Failover — Step by Step

CLI Command

aws docdb switchover-global-cluster \
  --global-cluster-identifier production-docdb-global \
  --target-db-cluster-identifier production-docdb-global-us-west-2
30s
Health Check Detect
Route53 detects failure
10s
Lambda Trigger
EventBridge → Lambda
5-15m
Failover Execution
switchover-global-cluster
30s
DNS Propagation
New endpoint active

Endpoint Update

Before (us-east-1):
production-docdb-global.cluster-xxx.us-east-1.docdb.amazonaws.com
After (us-west-2):
production-docdb-global.cluster-yyy.us-west-2.docdb.amazonaws.com

ElastiCache Promotion — Step by Step

CLI Command

aws elasticache failover-global-replication-group \
  --global-replication-group-id production-elasticache \
  --primary-region us-west-2 \
  --primary-replication-group-id production-elasticache-us-west-2
30s
Detect
Health check fail
1-5m
Promote
Secondary → Primary
Ready
Writes Enabled
us-west-2 is primary

Cart Service Impact

During Failover (1-5min)
  • Cart reads: degraded (stale data)
  • Cart writes: fail (read-only)
  • Session data: may be lost
After Promotion
  • Full read/write restored
  • RPO: < 1 second (async lag)
  • Cart rebuild from session cookie

DR Improvement Roadmap

1
P0 — Critical (Week 1-2)
Set up DSQL Linked Clusters; enable MSK Replicator; make all 6 services multi-region writable
2
P1 — High (Week 3-4)
Build the auto-failover Lambda + EventBridge pipeline; apply PodDisruptionBudgets to every service; reach RTO 15min → 2min
3
P2 — Medium (Month 2)
Configure CloudFront Origin Failover Group; switch Route53 health checks to /health/ready; evaluate OpenSearch cross-cluster search

FIS Experiment Template

Drill Frequency

Tools & Automation

Key Takeaways

Critical Risks

  • DSQL: Single region, total data loss risk
  • MSK: No replication, event loss
  • DocumentDB/ElastiCache: Manual CLI failover
  • Current RTO: 15min manual

Solutions

  • P0: DSQL Linked Clusters + MSK Replicator
  • P1: Lambda + EventBridge automation
  • P1: PDB for all 20 services
  • Target RTO: 2min automated

Automation Pipeline

  • CloudWatch Alarm → EventBridge → Lambda
  • Parallel DB failover execution
  • SNS notification to Ops team
  • ENABLE_AUTO_FAILOVER safety flag

DR Drill Strategy

  • Quarterly: Full region failure simulation
  • Monthly: Individual DB failover test
  • Weekly: Replication lag + health checks
  • Tool: AWS Fault Injection Simulator

Knowledge Check

Q1: What is the most severe DR risk in the current Multi-Region Mall?
Q2: Which CLI command fails over the DocumentDB Global Cluster?
Q3: What is the purpose of the ENABLE_AUTO_FAILOVER flag in the automation pipeline?
Q4: What is the recommended cadence for the Full Region Failure DR drill?

Thank You

Thank you for your hard work!

Observability & Operations

Cross-Region Monitoring, Cost Optimization, Production Readiness

Observability Stack Overview

App Pods (20 svc)
OTLP export
OTel Collector
DaemonSet
ADOT 0.40.0
tail_sampling
Traces
X-Ray
Tempo→S3
Metrics
Prometheus
kube-prom 68.4.0
Logs
CloudWatch
Visualization
Grafana

Dual export: Tempo (long-term storage in S3) + X-Ray (managed, service map)

OTel Collector Configuration

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
    spike_limit_mib: 128
  batch: {}
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors-policy
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow-policy
        type: latency
        latency: { threshold_ms: 500 }
      - name: default-policy
        type: probabilistic
        probabilistic: { sampling_percentage: 10 }

exporters:
  otlp/tempo:
    endpoint: tempo.observability:4317
    tls:
      insecure: true
  awsxray:
    region: ${AWS_REGION}
    index_all_attributes: true
  prometheusremotewrite:
    endpoint: http://prometheus.observability:9090/api/v1/write  # assumed in-cluster endpoint

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, tail_sampling, batch]
      exporters: [otlp/tempo, awsxray]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheusremotewrite]

Tail-Based Sampling Strategy

Why Tail > Head Sampling?

Head Sampling: sampling decision at request start

Problem: decided before any error or latency occurs → important traces get dropped

Tail Sampling: decision after the full trace completes

Advantage: keep traces selectively once their error/latency status is known

Sampling Policies

Policy | Rate | Condition
errors | 100% | status_code = ERROR
slow | 100% | latency > 500ms
default | 10% | probabilistic
Storage Savings: ~90% while keeping 100% of errors & slow traces
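The three policies reduce to a short decision function over a completed trace. The span dicts here (with `status` and `duration_ms` fields) are this example's shape, not the Collector's internal representation:

```python
import random

def keep_trace(spans: list, default_rate: float = 0.10) -> bool:
    """Tail-sampling decision: always keep errors and slow traces, sample the rest."""
    if any(s.get("status") == "ERROR" for s in spans):
        return True                                   # errors-policy: keep 100%
    if max(s["duration_ms"] for s in spans) > 500:
        return True                                   # slow-policy: keep 100%
    return random.random() < default_rate             # default-policy: ~10%
```

Because the decision runs after `decision_wait`, the function sees the whole trace, which is exactly what head sampling cannot do.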

Grafana Tempo Integration

Architecture

OTel Collector
Tempo (Monolithic)
S3 Backend
Parquet blocks

Retention: 30 days hot, 90 days warm

Compression: zstd (60% savings)

S3 Lifecycle Policy

Tier | Days | Cost/GB
S3 Standard | 0-30 | $0.023
S3 IA | 30-90 | $0.0125
Glacier IR | 90-365 | $0.004
Delete | 365+ | -
Est. Cost: ~$185/mo per region

Cross-Region Trace Correlation

TraceQL Examples

# Error traces for a given service
{ resource.service.name = "order-service" && status = error }

# Payment requests slower than 500ms
{ span.http.route = "/api/payments" && duration > 500ms }

# Cross-region traces (us-east → us-west)
{ resource.cloud.region = "us-east-1" } >> { resource.cloud.region = "us-west-2" }

Service Map Generation

Tempo's metrics_generator builds the service dependency graph automatically from span data.

Logs-to-Traces Correlation

1. TraceID Injection

# Application log format {"level":"error", "traceId":"abc123...", "spanId":"def456...", "msg":"payment failed"}

2. Grafana Derived Fields

# Loki datasource config
derivedFields:
  - name: TraceID
    matcherRegex: "traceId\":\"([^\"]+)"
    url: "$${__value.raw}"
    datasourceUid: tempo

Cost Analysis — $9,600 ~ $12,400/month

$3,186-4,586
Compute (33-37%)
EKS + EC2 (Karpenter)
$2,560
Database (21-27%)
Aurora + DocDB + ElastiCache
$1,620
Messaging (13-17%)
MSK kafka.m5.large x6
$1,560
Search (13-16%)
OpenSearch r6g.large x4
$452+
Networking
NAT+TGW+NLB
$50-500
Edge
CloudFront+WAF
$150-400
Observability
CW+X-Ray+Prom
$29
Security
KMS+Secrets Mgr

Cost Breakdown by Service

Category | Service | Spec | Cost/mo
Compute | EKS Control Plane | x2 regions | $146
Compute | Karpenter EC2 | m6i/c6i.xlarge variable | $2,800-4,200
Compute | Bootstrap Node Group | t3.medium x2 regions | $240
Database | Aurora DSQL | us-east-1 only (serverless) | $200-500
Database | DocumentDB | r6g.large x4 (Global) | $1,480
Database | ElastiCache | r6g.large x4 (Valkey) | $880
Search | OpenSearch | r6g.large x4 + master x3 | $1,560
Messaging | MSK | kafka.m5.large x3 x2 regions | $1,620
Network | NAT Gateway | x4 (2 per region) | $270
Network | Transit Gateway | x2 regions + attachments | $146
Network | NLB | x2 regions | $36+

Cost Optimization Opportunities

Optimization Savings Effort Notes
Karpenter Spot Expansion 30-40% EC2 L Prefer Spot in worker-tier and batch-tier NodePools
DocumentDB Downsize $370/mo L r6g.large → r6g.medium (current CPU <20%)
MSK Serverless 50-70% M Favorable at low traffic (provisioned cluster currently idle)
OpenSearch Scale-down $520/mo L Remove dedicated masters (search currently unused)
Reserved Instances 1yr 30-40% DB L DocumentDB, ElastiCache, OpenSearch
NAT → NAT Instance $200/mo M t4g.micro + ASG (HA setup)
Total Savings Potential: $2,500 - $3,500 / month (25-35% reduction)
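For the Spot expansion row, the change is a capacity-type requirement on the affected NodePools. A sketch against the karpenter.sh/v1 API (the NodePool name and instance types are assumptions):

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: worker-tier
spec:
  template:
    spec:
      requirements:
        # With both values allowed, Karpenter prefers Spot capacity when available
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["m6i.xlarge", "c6i.xlarge"]
```

Keeping on-demand in the list lets workloads fall back automatically when Spot capacity is reclaimed.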

Performance Bottlenecks

P0 — Critical Bugs

1. Python Valkey MOVED error
Redis() must be RedisCluster()
Impact: 7 Python services
2. DSQL connection pool default
Default of 4 per CPU → pool exhaustion
Impact: 6 Go services
3. DocumentDB Motor maxPoolSize
Default 100 → exceeded when Pods scale out
Impact: 7 Python services

P1 — Improvements

4. API GW MaxIdleConnsPerHost = 2
Backend connection bottleneck → raise to 100
5. Java SimpleClientHttpRequestFactory
No connection pooling → RestTemplate with a pooled request factory

Quick Wins

Product cache TTL 5 min → reduced DocumentDB load
CloudFront API cache 1 min → 50% fewer origin requests

Load Test Strategy (k6)

// Profile 1: Baseline (100 VUs, 10min)
// Measure the current-state baseline
export const options = {
  vus: 100,
  duration: '10m',
  thresholds: {
    http_req_duration: ['p(95)<500'],
    http_req_failed: ['rate<0.01'],
  },
};

export default function () {
  // 60% Browse, 15% Purchase, 20% Search, 5% Seller
  const scenario = weightedRandom([
    { weight: 60, fn: browseProducts },
    { weight: 15, fn: purchaseFlow },
    { weight: 20, fn: searchProducts },
    { weight: 5, fn: sellerDashboard },
  ]);
  scenario();
}
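weightedRandom is not a k6 builtin; a minimal sketch of the helper the profile assumes:

```javascript
// Pick one entry's fn, with probability proportional to its weight.
function weightedRandom(entries) {
  const total = entries.reduce((sum, e) => sum + e.weight, 0);
  let r = Math.random() * total;
  for (const e of entries) {
    r -= e.weight;
    if (r < 0) return e.fn;
  }
  return entries[entries.length - 1].fn; // guard against float rounding
}
```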
// Profile 2: Ramp-up (100→1000 VUs, 30min)
// Increase load gradually to find the breaking point
export const options = {
  stages: [
    { duration: '5m', target: 100 },
    { duration: '10m', target: 500 },
    { duration: '10m', target: 1000 },
    { duration: '5m', target: 0 },
  ],
  thresholds: {
    http_req_duration: ['p(95)<500', 'p(99)<1000'],
  },
};
// Profile 4: Spike (500→2000→500 VUs, 15min)
// Test auto-scaling response to a sudden load surge
export const options = {
  stages: [
    { duration: '2m', target: 500 },
    { duration: '1m', target: 2000 },  // spike
    { duration: '5m', target: 2000 },  // sustain
    { duration: '2m', target: 500 },   // recovery
    { duration: '5m', target: 500 },
  ],
};
// Measure Karpenter scaling response time
// Goal: nodes provisioned within 2 minutes
// Profile 5: Soak Test (300 VUs, 4hr)
// Detect memory leaks and connection leaks
export const options = {
  vus: 300,
  duration: '4h',
  thresholds: {
    http_req_duration: ['p(95)<300'],
    http_req_failed: ['rate<0.001'],
  },
};
// Items to monitor:
// - Pod memory usage trend
// - DB connection count
// - Goroutine/Thread count
// - File descriptor usage

Target Metrics (SLOs)

Latency SLOs

Percentile Target Status
p50 < 100ms TBD
p95 < 300ms TBD
p99 < 500ms TBD

Throughput & Availability

Sustained Throughput 5,000 RPS
Error Rate < 0.1%
Availability 99.9%
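The 99.9% availability target implies a concrete monthly downtime budget, worth keeping next to the SLO (a 30-day month assumed):

```javascript
// Downtime budget implied by a 99.9% availability SLO over a 30-day month
const minutesPerMonth = 30 * 24 * 60;               // 43,200 minutes
const budgetMinutes = minutesPerMonth * (1 - 0.999); // the 0.1% we may be down
console.log(`${budgetMinutes.toFixed(1)} minutes/month`); // 43.2 minutes/month
```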

Per-Endpoint Targets

Endpoint p95
GET /products < 200ms
GET /products/:id < 150ms
POST /orders < 500ms
POST /payments < 500ms
GET /search < 300ms
GET /seller/dashboard < 400ms

Gap Analysis Summary

P0 — Production Blockers
Authentication
No authentication middleware
Event Bus → MSK
Mock publish only
P1 — High Risk
WAF Reactivation
Bot Control currently blocking traffic
Search → OpenSearch
strings.Contains mock
DSQL Multi-Region
us-east-1 only
P2 — Improvements
MSK Replicator
Cross-region sync
CI/CD Automation
Still manual deploys

7 gaps | Production is not possible without resolving P0

Production Readiness Roadmap

1
Phase 1 (2 weeks)
Resolve P0 gaps: Authentication + Event Bus MSK integration
2
Phase 2 (4 weeks)
P1 gaps + performance: reactivate WAF, Search → OpenSearch, DSQL Multi-Region; fix performance bugs (Valkey, connection pools)
3
Phase 3 (2 weeks)
P2 gaps + DR: MSK Replicator, CI/CD pipeline, DR automation (EventBridge + Lambda)
4
Phase 4 (ongoing)
Load test + optimization: run all k6 profiles, apply cost optimizations

Key Takeaways

Observability Stack

  • OTel Collector + Tail Sampling
  • 90% storage savings, 100% of errors preserved
  • Tempo + X-Ray dual export
  • Logs-to-Traces correlation

Cost Optimization

  • Total cost: $9,600-12,400/month
  • $2,500-3,500 in savings possible (25-35%)
  • Quick wins: Spot, DB downsize
  • Medium-term: MSK Serverless, RI

Production Blockers

  • P0: Authentication, Event Bus
  • P1: WAF, Search, DSQL Multi-Region
  • Production is not possible without P0 fixes
  • Addressed by the 8-week roadmap

Performance & Testing

  • P0 bugs: Valkey, DSQL pool, Motor
  • SLO: p95 <300ms, 5K RPS
  • Validated with 4 k6 profiles
  • Leak detection via soak test
1. P0 Gaps → 2. Performance Fix → 3. Load Test → 4. Cost Optimize

Knowledge Check

Q1: What is the key advantage of tail-based sampling?
Q2: What does decision_wait do in the OTel Collector's tail_sampling processor?
Q3: Which cost optimization can be applied the fastest?
Q4: Which gaps are Production Blockers (P0)?

Thank You


Thank you for your hard work!
