Project Overview
Production-grade API gateway service providing a unified entry point for 25+ microservices. Handles authentication, rate limiting, request routing, and transformation for a high-traffic enterprise application. Built with Node.js and deployed on Kubernetes for high availability and scalability.
Timeline
September 2023 - January 2024
Type
Infrastructure / Gateway
Tech Stack
Node.js 20
Express.js
Redis 7
Docker
Kubernetes
OAuth 2.0
JWT
Azure
Key Features
- Intelligent Routing: Service discovery with health checks and automatic failover
- Authentication Hub: OAuth 2.0 and JWT token validation with token caching
- Rate Limiting: Distributed rate limiting using Redis with per-user and per-endpoint limits
- Circuit Breaker: Automatic fallback when downstream services fail
- Request Transformation: Header injection, body transformation, and response filtering
- Comprehensive Logging: Structured logging with correlation IDs for request tracing
- Metrics & Monitoring: Prometheus metrics for request rates, latencies, and errors
- API Versioning: Support for multiple API versions with gradual migration
Architecture
Request Flow
- TLS Termination: HTTPS connections terminated at load balancer
- Authentication: JWT validation with Redis-cached public keys
- Rate Limiting: Check user/IP rate limits in Redis
- Service Discovery: Consul lookup for healthy service instances
- Circuit Breaker Check: Verify downstream service health
- Request Forwarding: Proxy to target service with added headers
- Response Processing: Transform and return to client
- Logging: Async logging to Elasticsearch
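The ordered stages above can be modeled as a composable async pipeline. The sketch below is illustrative only (the real gateway uses Express middleware); the stage bodies and values are placeholders.

```javascript
// Minimal async pipeline: runs each stage in order against a shared
// request context. A stage can throw to short-circuit the flow
// (e.g. 401 on auth failure, 429 on rate-limit rejection).
function pipeline(stages) {
  return async (ctx) => {
    for (const stage of stages) {
      await stage(ctx);
    }
    return ctx;
  };
}

// Hypothetical stages mirroring the request flow above.
const handle = pipeline([
  async (ctx) => { ctx.user = "user-123"; },          // authentication
  async (ctx) => { ctx.withinLimit = true; },          // rate limiting
  async (ctx) => { ctx.target = "orders-svc:8080"; },  // service discovery
]);
```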
High Availability Setup
- 3+ gateway instances behind Kubernetes service
- Auto-scaling based on CPU and request rate metrics
- Redis cluster for distributed state
- Health checks with automatic pod replacement
- Zero-downtime rolling updates
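A HorizontalPodAutoscaler covering the scaling behavior above might look like the following. This manifest is an assumption for illustration (names, replica counts, and thresholds are not the project's actual values); the request-rate metric presumes a Prometheus metrics adapter is installed.

```yaml
# Illustrative HPA: 3-instance floor, scaling on CPU and request rate.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-gateway
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-gateway
  minReplicas: 3          # matches the 3+ instance floor
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
    - type: Pods          # request-rate metric via a Prometheus adapter
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "150"
```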
Technical Highlights
Smart Token Caching
JWT signature validation is CPU-intensive. Solution: cache validation results in Redis with a TTL matching the token's remaining lifetime. Result: 70% reduction in authentication overhead.
// Simplified sketch of the caching layer
async function validateToken(token) {
  const key = `token:${hash(token)}`;
  const cached = await redis.get(key);
  if (cached) return JSON.parse(cached);

  const claims = await jwt.verify(token, publicKey); // throws if invalid
  // claims.exp is an absolute Unix timestamp, not a duration: cache only
  // for the token's remaining lifetime so entries expire with the token.
  const ttl = claims.exp - Math.floor(Date.now() / 1000);
  if (ttl > 0) await redis.setex(key, ttl, JSON.stringify(claims));
  return claims;
}
Circuit Breaker Pattern
Prevents cascading failures when downstream services are unhealthy. After 5 consecutive failures, gateway returns cached response or friendly error for 30 seconds before retrying.
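A minimal in-memory version of this policy (5 consecutive failures open the circuit for 30 seconds) can be sketched as below. This is an illustration, not the gateway's actual implementation; the `now` parameter is injectable purely to make the timing testable.

```javascript
// Circuit breaker: after `threshold` consecutive failures, short-circuit
// to the fallback for `cooldownMs`, then allow one trial request through.
class CircuitBreaker {
  constructor({ threshold = 5, cooldownMs = 30_000, now = Date.now } = {}) {
    this.threshold = threshold;
    this.cooldownMs = cooldownMs;
    this.now = now;
    this.failures = 0;
    this.openedAt = null;
  }

  async call(fn, fallback) {
    if (this.openedAt !== null) {
      if (this.now() - this.openedAt < this.cooldownMs) {
        return fallback(); // open: serve cached response or friendly error
      }
      this.openedAt = null; // half-open: let one trial request through
    }
    try {
      const result = await fn();
      this.failures = 0; // success closes the circuit
      return result;
    } catch (err) {
      if (++this.failures >= this.threshold) this.openedAt = this.now();
      return fallback(err);
    }
  }
}
```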
Distributed Rate Limiting
Uses atomic Redis operations, wrapped in Lua scripts, to enforce limits consistently across all gateway instances. A sliding-window algorithm smooths the rate limiting and avoids the edge-of-window bursts that a plain fixed-window counter (INCR with EXPIRE) would allow.
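The sliding-window check can be sketched in-process as below; in production the same logic runs atomically inside a Redis Lua script over shared state, so all instances see one counter. Class and parameter names here are illustrative, not the project's actual code.

```javascript
// Sliding-window log limiter: keeps recent request timestamps per key,
// drops those outside the window, and rejects once `limit` is reached.
class SlidingWindowLimiter {
  constructor({ limit, windowMs, now = Date.now }) {
    this.limit = limit;
    this.windowMs = windowMs;
    this.now = now; // injectable clock for deterministic tests
    this.hits = new Map(); // key -> array of request timestamps
  }

  allow(key) {
    const t = this.now();
    // Drop timestamps that have slid out of the window.
    const recent = (this.hits.get(key) || []).filter(
      (ts) => ts > t - this.windowMs
    );
    if (recent.length >= this.limit) {
      this.hits.set(key, recent);
      return false; // over the limit for this window
    }
    recent.push(t);
    this.hits.set(key, recent);
    return true;
  }
}
```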
Production Incident Example
In November 2023, a downstream payment service had a database deadlock issue causing 30-second response times. The circuit breaker detected this and started returning cached responses, preventing user-facing timeouts. The issue was isolated to the payment service while other features remained functional. Total user impact: <2 minutes vs. potential 45+ minutes of site-wide slowness.
Challenges & Solutions
Challenge: Token Validation Performance
Problem: JWT signature verification was the bottleneck at 10k+ req/min.
Solution: Implemented a Redis caching layer, moved to RS256 (asymmetric crypto), and cached the public keys. Reduced CPU usage by 65%.
Challenge: Rate Limit Synchronization
Problem: Multiple gateway instances had inconsistent rate limit counts.
Solution: Centralized rate limiting state in Redis cluster. Used Lua scripts for atomic increment+check operations.
Challenge: Service Discovery Latency
Problem: Looking up service endpoints on every request added 5-10ms latency.
Solution: Local caching with a background refresh every 30s; active health checks update the cache immediately when service membership changes.
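The cache-with-background-refresh idea can be sketched as follows. This is an assumption-laden illustration: `lookupFn` stands in for the Consul query, and class and method names are invented for the example.

```javascript
// Local endpoint cache: lookups on the hot path are O(1) map reads;
// a background timer refreshes entries from the discovery backend.
class EndpointCache {
  constructor(lookupFn, refreshMs = 30_000) {
    this.lookupFn = lookupFn; // e.g. async (name) => Consul query
    this.refreshMs = refreshMs;
    this.cache = new Map(); // service name -> list of healthy instances
  }

  async start(services) {
    await this.refresh(services); // warm the cache before serving traffic
    this.timer = setInterval(() => this.refresh(services), this.refreshMs);
    this.timer.unref(); // don't keep the process alive just to refresh
  }

  async refresh(services) {
    for (const name of services) {
      this.cache.set(name, await this.lookupFn(name));
    }
  }

  // Hot path: no per-request round trip to the discovery service.
  get(name) {
    return this.cache.get(name) || [];
  }
}
```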
Lessons Learned
- Cache Everything Safely: Caching dramatically improved performance, but cache invalidation is hard. Use TTLs aggressively.
- Fail Fast: Circuit breakers prevent cascade failures. Set thresholds conservatively.
- Monitor Extensively: Can't fix what you can't measure. Added metrics for every decision point.
- Test Failure Scenarios: Chaos engineering revealed issues that unit tests missed. Kill random pods regularly.
- Keep It Simple: Resisted adding complex features. Gateway should route fast, not do business logic.
Performance Metrics
- Throughput: 10,000+ requests/minute sustained
- Latency: P50: 8ms, P95: 25ms, P99: 45ms (overhead only, excluding downstream)
- Availability: 99.95% uptime (4 hours downtime in 12 months)
- Resource Usage: 200MB RAM, 15% CPU per instance at peak load
- Error Rate: <0.01% (mostly client errors, not gateway issues)