Project Overview
Production-grade API gateway service providing a unified entry point for 25+ microservices. Handles authentication, rate limiting, request routing, and transformation for a high-traffic enterprise application. Built with Node.js and deployed on Kubernetes for high availability and scalability.
Timeline
September 2023 - January 2024
Type
Infrastructure / Gateway
Tech Stack
Node.js 20
Express.js
Redis 7
Docker
Kubernetes
OAuth 2.0
JWT
Azure
Key Features
- Intelligent Routing: Service discovery with health checks and automatic failover
- Authentication Hub: OAuth 2.0 and JWT token validation with token caching
- Rate Limiting: Distributed rate limiting using Redis with per-user and per-endpoint limits
- Circuit Breaker: Automatic fallback when downstream services fail
- Request Transformation: Header injection, body transformation, and response filtering
- Comprehensive Logging: Structured logging with correlation IDs for request tracing
- Metrics & Monitoring: Prometheus metrics for request rates, latencies, and errors
- API Versioning: Support for multiple API versions with gradual migration
Architecture
Request Flow
- TLS Termination: HTTPS connections terminated at load balancer
- Authentication: JWT validation with Redis-cached public keys
- Rate Limiting: Check user/IP rate limits in Redis
- Service Discovery: Consul lookup for healthy service instances
- Circuit Breaker Check: Verify downstream service health
- Request Forwarding: Proxy to target service with added headers
- Response Processing: Transform and return to client
- Logging: Async logging to Elasticsearch
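The ordered stages above can be modeled as a composable async pipeline. The sketch below is illustrative only (the real gateway uses Express middleware); the stage bodies and values are placeholders.

```javascript
// Minimal async pipeline: runs each stage in order against a shared
// request context. A stage can throw to short-circuit the flow
// (e.g. 401 on auth failure, 429 on rate-limit rejection).
function pipeline(stages) {
  return async (ctx) => {
    for (const stage of stages) {
      await stage(ctx);
    }
    return ctx;
  };
}

// Hypothetical stages mirroring the request flow above.
const handle = pipeline([
  async (ctx) => { ctx.user = "user-123"; },          // authentication
  async (ctx) => { ctx.withinLimit = true; },          // rate limiting
  async (ctx) => { ctx.target = "orders-svc:8080"; },  // service discovery
]);
```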
High Availability Setup
- 3+ gateway instances behind Kubernetes service
- Auto-scaling based on CPU and request rate metrics
- Redis cluster for distributed state
- Health checks with automatic pod replacement
- Zero-downtime rolling updates
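A HorizontalPodAutoscaler covering the scaling behavior above might look like the following. This manifest is an assumption for illustration (names, replica counts, and thresholds are not the project's actual values); the request-rate metric presumes a Prometheus metrics adapter is installed.

```yaml
# Illustrative HPA: 3-instance floor, scaling on CPU and request rate.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-gateway
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-gateway
  minReplicas: 3          # matches the 3+ instance floor
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
    - type: Pods          # request-rate metric via a Prometheus adapter
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "150"
```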
Technical Highlights
Smart Token Caching
JWT signature validation is CPU-intensive. Solution: cache validation results in Redis with a TTL matching the token's remaining lifetime. Result: 70% reduction in authentication overhead.
// Simplified sketch of the caching layer
async function validateToken(token) {
  const key = `token:${hash(token)}`;
  const cached = await redis.get(key);
  if (cached) return JSON.parse(cached);

  const claims = await jwt.verify(token, publicKey); // throws if invalid
  // claims.exp is an absolute Unix timestamp, not a duration: cache only
  // for the token's remaining lifetime so entries expire with the token.
  const ttl = claims.exp - Math.floor(Date.now() / 1000);
  if (ttl > 0) await redis.setex(key, ttl, JSON.stringify(claims));
  return claims;
}
Circuit Breaker Pattern
Prevents cascading failures when downstream services are unhealthy. After 5 consecutive failures, gateway returns cached response or friendly error for 30 seconds before retrying.
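A minimal in-memory version of this policy (5 consecutive failures open the circuit for 30 seconds) can be sketched as below. This is an illustration, not the gateway's actual implementation; the `now` parameter is injectable purely to make the timing testable.

```javascript
// Circuit breaker: after `threshold` consecutive failures, short-circuit
// to the fallback for `cooldownMs`, then allow one trial request through.
class CircuitBreaker {
  constructor({ threshold = 5, cooldownMs = 30_000, now = Date.now } = {}) {
    this.threshold = threshold;
    this.cooldownMs = cooldownMs;
    this.now = now;
    this.failures = 0;
    this.openedAt = null;
  }

  async call(fn, fallback) {
    if (this.openedAt !== null) {
      if (this.now() - this.openedAt < this.cooldownMs) {
        return fallback(); // open: serve cached response or friendly error
      }
      this.openedAt = null; // half-open: let one trial request through
    }
    try {
      const result = await fn();
      this.failures = 0; // success closes the circuit
      return result;
    } catch (err) {
      if (++this.failures >= this.threshold) this.openedAt = this.now();
      return fallback(err);
    }
  }
}
```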
Distributed Rate Limiting
Uses atomic Redis operations, wrapped in Lua scripts, to enforce limits consistently across all gateway instances. A sliding-window algorithm smooths the rate limiting and avoids the edge-of-window bursts that a plain fixed-window counter (INCR with EXPIRE) would allow.
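The sliding-window check can be sketched in-process as below; in production the same logic runs atomically inside a Redis Lua script over shared state, so all instances see one counter. Class and parameter names here are illustrative, not the project's actual code.

```javascript
// Sliding-window log limiter: keeps recent request timestamps per key,
// drops those outside the window, and rejects once `limit` is reached.
class SlidingWindowLimiter {
  constructor({ limit, windowMs, now = Date.now }) {
    this.limit = limit;
    this.windowMs = windowMs;
    this.now = now; // injectable clock for deterministic tests
    this.hits = new Map(); // key -> array of request timestamps
  }

  allow(key) {
    const t = this.now();
    // Drop timestamps that have slid out of the window.
    const recent = (this.hits.get(key) || []).filter(
      (ts) => ts > t - this.windowMs
    );
    if (recent.length >= this.limit) {
      this.hits.set(key, recent);
      return false; // over the limit for this window
    }
    recent.push(t);
    this.hits.set(key, recent);
    return true;
  }
}
```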
Production Incident Example
In November 2023, a downstream payment service had a database deadlock issue causing 30-second response times. The circuit breaker detected this and started returning cached responses, preventing user-facing timeouts. The issue was isolated to the payment service while other features remained functional. Total user impact: <2 minutes vs. potential 45+ minutes of site-wide slowness.
Challenges & Solutions
Challenge: Token Validation Performance
Problem: JWT signature verification was the bottleneck at 10k+ req/min.
Solution: Implemented a Redis caching layer, moved to RS256 (asymmetric crypto), and cached the public keys. Reduced CPU usage by 65%.
Challenge: Rate Limit Synchronization
Problem: Multiple gateway instances had inconsistent rate limit counts.
Solution: Centralized rate limiting state in Redis cluster. Used Lua scripts for atomic increment+check operations.
Challenge: Service Discovery Latency
Problem: Looking up service endpoints on every request added 5-10ms latency.
Solution: Local caching with a background refresh every 30s; active health checks update the cache immediately when service membership changes.
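The cache-with-background-refresh idea can be sketched as follows. This is an assumption-laden illustration: `lookupFn` stands in for the Consul query, and class and method names are invented for the example.

```javascript
// Local endpoint cache: lookups on the hot path are O(1) map reads;
// a background timer refreshes entries from the discovery backend.
class EndpointCache {
  constructor(lookupFn, refreshMs = 30_000) {
    this.lookupFn = lookupFn; // e.g. async (name) => Consul query
    this.refreshMs = refreshMs;
    this.cache = new Map(); // service name -> list of healthy instances
  }

  async start(services) {
    await this.refresh(services); // warm the cache before serving traffic
    this.timer = setInterval(() => this.refresh(services), this.refreshMs);
    this.timer.unref(); // don't keep the process alive just to refresh
  }

  async refresh(services) {
    for (const name of services) {
      this.cache.set(name, await this.lookupFn(name));
    }
  }

  // Hot path: no per-request round trip to the discovery service.
  get(name) {
    return this.cache.get(name) || [];
  }
}
```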
Lessons Learned
- Cache Everything Safely: Caching dramatically improved performance, but cache invalidation is hard. Use TTLs aggressively.
- Fail Fast: Circuit breakers prevent cascade failures. Set thresholds conservatively.
- Monitor Extensively: Can't fix what you can't measure. Added metrics for every decision point.
- Test Failure Scenarios: Chaos engineering revealed issues that unit tests missed. Kill random pods regularly.
- Keep It Simple: Resisted adding complex features. Gateway should route fast, not do business logic.
Performance Metrics
- Throughput: 10,000+ requests/minute sustained
- Latency: P50: 8ms, P95: 25ms, P99: 45ms (overhead only, excluding downstream)
- Availability: 99.95% uptime (4 hours downtime in 12 months)
- Resource Usage: 200MB RAM, 15% CPU per instance at peak load
- Error Rate: <0.01% (mostly client errors, not gateway issues)