Admission Control
Admission control provides automatic back-pressure management for the Kafka linearizer to prevent system overload during high-traffic periods or when batch processing falls behind.
Overview
The admission control system uses per-database semaphores with dynamically adjustable capacity. When Kafka consumer lag increases, the system automatically reduces capacity to slow down incoming requests and allow the batch processor to catch up.
Key Features
- Per-Database Isolation: Each database has its own semaphore, preventing one busy database from affecting others
- Lag-Based Adaptation: Capacity adjusts automatically based on Kafka consumer lag
- Graceful Degradation: System remains responsive even under heavy load
- Observable: Metrics for monitoring capacity and wait times
How It Works
Architecture
graph TB
  subgraph "API Server"
    A[Client Request] --> B[Admission Control]
    B --> C[Acquire Permit]
    C --> D[Publish to Kafka]
    D --> E[Release Permit]
    E --> F[Return Response]
  end
  subgraph "Background Monitor"
    G[Lag Monitor] --> H[Check Consumer Lag]
    H --> I{Lag Level?}
    I -->|Low| J[Max Capacity]
    I -->|Medium| K[Interpolate]
    I -->|High| L[Min Capacity]
    J --> M[Adjust Semaphores]
    K --> M
    L --> M
  end
  M -.Updates.-> B
Capacity Calculation
The system uses three lag thresholds to determine semaphore capacity:
- Below target_lag (default: 10,000 messages)
  - Use maximum capacity (default: 1,000 permits)
  - System operating normally
- Between target_lag and critical_lag
  - Linear interpolation between max and min capacity
  - Gradually reduce pressure as lag increases
- Above critical_lag (default: 100,000 messages)
  - Use minimum capacity (default: 10 permits)
  - System under heavy load, maximum back-pressure
The interpolation formula:
capacity = min_capacity + (max_capacity - min_capacity) × (1 - lag_fraction)
where: lag_fraction = (lag - target_lag) / (critical_lag - target_lag)
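As a concrete sketch, the piecewise calculation above can be written out as follows (the `capacity` function and its parameter names are illustrative, not the actual EvidentDB API):

```rust
// Hypothetical sketch of the capacity calculation; parameter names mirror the
// configuration defaults, but this is not the actual EvidentDB code.
fn capacity(lag: u64, min_cap: u64, max_cap: u64, target_lag: u64, critical_lag: u64) -> u64 {
    if lag <= target_lag {
        max_cap // below target: operate at full capacity
    } else if lag >= critical_lag {
        min_cap // above critical: maximum back-pressure
    } else {
        // Linear interpolation between max and min capacity.
        let lag_fraction = (lag - target_lag) as f64 / (critical_lag - target_lag) as f64;
        min_cap + ((max_cap - min_cap) as f64 * (1.0 - lag_fraction)).round() as u64
    }
}

fn main() {
    // Defaults: min=10, max=1000, target=10_000, critical=100_000.
    assert_eq!(capacity(5_000, 10, 1_000, 10_000, 100_000), 1_000);
    assert_eq!(capacity(55_000, 10, 1_000, 10_000, 100_000), 505);
    assert_eq!(capacity(200_000, 10, 1_000, 10_000, 100_000), 10);
}
```

With the default thresholds, a lag of 55,000 messages sits halfway between target and critical, yielding a capacity of 505 permits.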
Lag Monitoring
A background task periodically queries Kafka for consumer group lag:
- Fetch topic metadata to discover partitions
- Query committed offsets for the consumer group
- Query high water marks for each partition
- Calculate lag = high_water_mark - committed_offset
- Aggregate total lag across all partitions
- Adjust capacity for all database semaphores
By default, lag is checked every 5 seconds.
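The per-partition arithmetic behind these steps can be sketched as follows (plain numbers stand in for the offsets the monitor would fetch from Kafka; `total_lag` is an illustrative name):

```rust
// Illustrative sketch: each tuple is (high_water_mark, committed_offset) for
// one partition; the real monitor fetches both from Kafka.
fn total_lag(partitions: &[(i64, i64)]) -> i64 {
    partitions
        .iter()
        // lag = high_water_mark - committed_offset, floored at zero per partition
        .map(|(hwm, committed)| (hwm - committed).max(0))
        .sum()
}

fn main() {
    let partitions = [(1_500, 1_000), (2_000, 1_800), (900, 900)];
    assert_eq!(total_lag(&partitions), 700); // 500 + 200 + 0
}
```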
Configuration
Admission control is configured in the Kafka linearizer:
use std::time::Duration;

let config = AdmissionControlConfig::new(
    true,                    // enabled
    1000,                    // max_capacity
    10,                      // min_capacity
    10_000,                  // target_lag (messages)
    100_000,                 // critical_lag (messages)
    Duration::from_secs(5),  // adjustment_interval
);
Configuration Parameters
| Parameter | Default | Description |
|---|---|---|
| enabled | true | Enable/disable admission control |
| max_capacity | 1000 | Maximum permits per database (low lag) |
| min_capacity | 10 | Minimum permits per database (high lag) |
| target_lag | 10000 | Target lag in messages (below = max capacity) |
| critical_lag | 100000 | Critical lag in messages (above = min capacity) |
| adjustment_interval | 5s | How often to check lag and adjust capacity |
Configuration Validation
The system validates configuration at startup:
- max_capacity >= min_capacity
- min_capacity > 0
- critical_lag > target_lag
- target_lag >= 0
Invalid configurations will cause startup failure with a descriptive error message.
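A minimal sketch of those checks (the struct mirrors the configuration table, but the field names and error type here are assumptions):

```rust
// Sketch of the startup validation rules; not the actual EvidentDB types.
struct AdmissionControlConfig {
    max_capacity: u64,
    min_capacity: u64,
    target_lag: u64,
    critical_lag: u64,
}

fn validate(cfg: &AdmissionControlConfig) -> Result<(), String> {
    if cfg.min_capacity == 0 {
        return Err("min_capacity must be > 0".into());
    }
    if cfg.max_capacity < cfg.min_capacity {
        return Err("max_capacity must be >= min_capacity".into());
    }
    // target_lag >= 0 holds trivially for an unsigned type.
    if cfg.critical_lag <= cfg.target_lag {
        return Err("critical_lag must be > target_lag".into());
    }
    Ok(())
}

fn main() {
    let defaults = AdmissionControlConfig {
        max_capacity: 1_000,
        min_capacity: 10,
        target_lag: 10_000,
        critical_lag: 100_000,
    };
    assert!(validate(&defaults).is_ok());

    let bad = AdmissionControlConfig { min_capacity: 0, ..defaults };
    assert!(validate(&bad).is_err());
}
```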
Monitoring
OpenTelemetry Metrics
The system exposes these metrics:
evidentdb.admission_control.wait
- Type: Histogram
- Unit: seconds
- Labels: database
- Description: Time spent waiting for admission control permits
evidentdb.admission_control.capacity
- Type: Histogram
- Unit: permits
- Labels: database
- Buckets: 10, 25, 50, 100, 250, 500, 750, 1000
- Description: Current admission control capacity per database
evidentdb.kafka.consumer_lag
- Type: Gauge
- Unit: messages
- Description: Total Kafka consumer lag across all partitions
Observing System Behavior
Use these queries to monitor admission control:
# Average admission control wait time per database
rate(evidentdb_admission_control_wait_sum[5m]) / rate(evidentdb_admission_control_wait_count[5m])

# Current capacity distribution
histogram_quantile(0.95, evidentdb_admission_control_capacity_bucket)

# Kafka consumer lag
evidentdb_kafka_consumer_lag

# Requests being throttled (fires when fewer than 95% of requests wait under 100ms)
rate(evidentdb_admission_control_wait_bucket{le="0.1"}[5m]) / rate(evidentdb_admission_control_wait_count[5m]) < 0.95
Log Messages
The system logs capacity adjustments:
INFO  Adjusted capacity for database my_database from 1000 to 505 (lag=55000)
WARN  Decreased semaphore capacity from 505 to 250 (-255 permits)
DEBUG Admission control wait for database my_database: 125ms
Tuning Guide
Section titled “Tuning Guide”Symptoms and Solutions
Symptom: Frequent capacity reductions
Cause: System can’t keep up with write load
Solutions:
- Increase batch processor resources (CPU/memory)
- Scale horizontally (add more batch processors)
- Reduce batch coalescer timeout for faster commits
- Increase DynamoDB write capacity
Symptom: Long admission control wait times
Cause: Capacity too restricted
Solutions:
- Increase min_capacity to allow more concurrency under load
- Adjust the critical_lag threshold higher
- Investigate why lag is high
Symptom: Consumer lag grows unbounded
Cause: Admission control not restrictive enough
Solutions:
- Decrease max_capacity to reduce peak load
- Decrease target_lag to trigger back-pressure earlier
- Increase min_capacity if it’s preventing any progress
Recommended Configurations
High-throughput, low-latency
AdmissionControlConfig::new(
    true,
    2000,                    // Higher max for peak throughput
    50,                      // Higher min to maintain progress
    5_000,                   // Aggressive target
    50_000,                  // Lower critical threshold
    Duration::from_secs(3),  // Faster adjustment
)
Burst handling, cost-sensitive
AdmissionControlConfig::new(
    true,
    500,                      // Lower max to control costs
    5,                        // Lower min for tighter control
    20_000,                   // Allow more lag before throttling
    200_000,                  // High threshold for rare events
    Duration::from_secs(10),  // Slower adjustment, more stable
)
Development/Testing
AdmissionControlConfig::disabled() // No back-pressure
Architectural Decisions
Why Per-Database Semaphores?
Per-database isolation ensures that one busy database doesn’t block requests for other databases. This provides better fairness and prevents cascading failures.
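As a toy illustration of that isolation (plain counters stand in for the real async semaphores; all names here are hypothetical):

```rust
use std::collections::HashMap;

// Toy registry: each database gets its own capacity entry, so a busy database
// never consumes another database's permits. Counters stand in for semaphores.
struct SemaphoreRegistry {
    capacities: HashMap<String, usize>,
}

impl SemaphoreRegistry {
    fn new() -> Self {
        Self { capacities: HashMap::new() }
    }

    // Lazily create a per-database entry at the configured maximum capacity.
    fn capacity(&mut self, db: &str, max_capacity: usize) -> usize {
        *self.capacities.entry(db.to_string()).or_insert(max_capacity)
    }

    // The background monitor adjusts every known database's capacity in one pass.
    fn adjust_all(&mut self, new_capacity: usize) {
        for cap in self.capacities.values_mut() {
            *cap = new_capacity;
        }
    }
}

fn main() {
    let mut registry = SemaphoreRegistry::new();
    registry.capacity("orders", 1_000);
    registry.capacity("inventory", 1_000);
    registry.adjust_all(505); // lag rose: shrink all pools, but they stay separate
    assert_eq!(registry.capacity("orders", 1_000), 505);
    assert_eq!(registry.capacity("inventory", 1_000), 505);
}
```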
Why Lag-Based Adaptation?
Consumer lag is a direct indicator of system health. High lag means the batch processor is falling behind, and reducing incoming load helps it catch up. This creates a natural feedback loop that stabilizes the system.
Why Linear Interpolation?
Linear interpolation provides smooth, predictable capacity changes. More complex algorithms (exponential, PID controllers) were considered, but linear proved sufficient for most workloads while being easier to reason about.
Why Background Task for Adjustment?
Decoupling lag monitoring from request handling keeps the critical path fast. Lag only needs to be checked periodically (every few seconds), not on every request.
gRPC Result Delivery
Admission control works in conjunction with direct gRPC result delivery. When the linearizer receives a batch:
- Acquire permit: Block until admission control grants a permit
- Publish to Kafka: Send batch to the prospective batches topic
- Wait for result: Result delivered directly via gRPC from processor to originator
- Release permit: Permit automatically released when response received
This point-to-point result delivery eliminates the need for a separate Kafka result topic, reducing latency and operational complexity.
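The acquire/release lifecycle above can be sketched with an RAII-style guard (a synchronous toy with a non-blocking `try_acquire`; the real system acquires asynchronously and waits instead of failing):

```rust
use std::sync::{Arc, Mutex};

// Toy permit pool: the guard holds one permit and returns it when dropped,
// mirroring "permit automatically released when response received".
struct Permits {
    inner: Arc<Mutex<usize>>, // number of available permits
}

struct PermitGuard {
    inner: Arc<Mutex<usize>>,
}

impl Permits {
    fn new(capacity: usize) -> Self {
        Self { inner: Arc::new(Mutex::new(capacity)) }
    }

    // Non-blocking stand-in for the real async acquire.
    fn try_acquire(&self) -> Option<PermitGuard> {
        let mut available = self.inner.lock().unwrap();
        if *available == 0 {
            return None; // the real system would wait here (back-pressure)
        }
        *available -= 1;
        Some(PermitGuard { inner: Arc::clone(&self.inner) })
    }
}

impl Drop for PermitGuard {
    fn drop(&mut self) {
        // Release the permit when the guard goes out of scope.
        *self.inner.lock().unwrap() += 1;
    }
}

fn main() {
    let permits = Permits::new(1);
    let guard = permits.try_acquire().expect("one permit free");
    assert!(permits.try_acquire().is_none()); // request in flight: capacity exhausted
    drop(guard); // response received: permit returned
    assert!(permits.try_acquire().is_some());
}
```

Tying the release to the guard's lifetime means a permit can never leak, even if the request path returns early on an error.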
Future Enhancements
Potential improvements under consideration:
- Per-database configuration: Different thresholds per database
- CLI configuration flags: Make settings configurable at startup
- Advanced algorithms: Exponential smoothing, PID control
- Admission rejection: Return errors instead of blocking when capacity exhausted
- Priority queues: Different permit pools for different request priorities
- External metrics: Adapt based on DynamoDB throttling, CPU usage, etc.
Related Documentation
- Storage Adapters - Underlying storage architecture
- Kafka Linearizer - Kafka-based linearizer deployment
- Monitoring Guide - Comprehensive monitoring setup