Admission Control

Admission control provides automatic back-pressure management for the Kafka linearizer to prevent system overload during high-traffic periods or when batch processing falls behind.

The admission control system uses per-database semaphores with dynamically adjustable capacity. When Kafka consumer lag increases, the system automatically reduces capacity to slow down incoming requests and allow the batch processor to catch up.

  • Per-Database Isolation: Each database has its own semaphore, preventing one busy database from affecting others
  • Lag-Based Adaptation: Capacity adjusts automatically based on Kafka consumer lag
  • Graceful Degradation: System remains responsive even under heavy load
  • Observable: Metrics for monitoring capacity and wait times
The following diagram (Mermaid source) shows the request path and the background capacity monitor:

graph TB
    subgraph "API Server"
        A[Client Request] --> B[Admission Control]
        B --> C[Acquire Permit]
        C --> D[Publish to Kafka]
        D --> E[Release Permit]
        E --> F[Return Response]
    end
    subgraph "Background Monitor"
        G[Lag Monitor] --> H[Check Consumer Lag]
        H --> I{Lag Level?}
        I -->|Low| J[Max Capacity]
        I -->|Medium| K[Interpolate]
        I -->|High| L[Min Capacity]
        J --> M[Adjust Semaphores]
        K --> M
        L --> M
    end
    M -.Updates.-> B
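
As a rough sketch of the per-database isolation described above, the registry might be a map from database name to a shared tokio semaphore. All type and method names here are illustrative, not the actual EvidentDB identifiers:

use std::collections::HashMap;
use std::sync::Arc;
use tokio::sync::{RwLock, Semaphore};

// Hypothetical per-database registry: one semaphore per database,
// created on first use with the configured maximum capacity.
struct AdmissionController {
    semaphores: RwLock<HashMap<String, Arc<Semaphore>>>,
    max_capacity: usize,
}

impl AdmissionController {
    async fn semaphore_for(&self, database: &str) -> Arc<Semaphore> {
        // Fast path: the semaphore already exists.
        if let Some(sem) = self.semaphores.read().await.get(database) {
            return sem.clone();
        }
        // Slow path: create it under the write lock.
        let mut map = self.semaphores.write().await;
        map.entry(database.to_string())
            .or_insert_with(|| Arc::new(Semaphore::new(self.max_capacity)))
            .clone()
    }
}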

The system compares consumer lag against two thresholds, target_lag and critical_lag, which divide operation into three capacity regimes:

  1. Below target_lag (default: 10,000 messages)
    • Use maximum capacity (default: 1,000 permits)
    • System operating normally
  2. Between target_lag and critical_lag
    • Linear interpolation between max and min capacity
    • Gradually reduce pressure as lag increases
  3. Above critical_lag (default: 100,000 messages)
    • Use minimum capacity (default: 10 permits)
    • System under heavy load, maximum back-pressure

The interpolation formula:

capacity = min_capacity + (max_capacity - min_capacity) × (1 - lag_fraction)
where:
lag_fraction = (lag - target_lag) / (critical_lag - target_lag)
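
As a worked example with the defaults, a lag of 55,000 messages gives lag_fraction = (55,000 − 10,000) / (100,000 − 10,000) = 0.5, so capacity = 10 + 990 × 0.5 = 505 permits, which matches the capacity-adjustment log shown later. A minimal Rust sketch of the full policy, including the clamping at both thresholds (the function name is illustrative):

// Sketch of the capacity policy: max capacity below target_lag,
// min capacity above critical_lag, linear interpolation in between.
fn compute_capacity(
    lag: u64,
    target_lag: u64,
    critical_lag: u64,
    min_capacity: u64,
    max_capacity: u64,
) -> u64 {
    if lag <= target_lag {
        max_capacity
    } else if lag >= critical_lag {
        min_capacity
    } else {
        let lag_fraction = (lag - target_lag) as f64 / (critical_lag - target_lag) as f64;
        let range = (max_capacity - min_capacity) as f64;
        min_capacity + (range * (1.0 - lag_fraction)).round() as u64
    }
}

// With the defaults: compute_capacity(55_000, 10_000, 100_000, 10, 1_000) == 505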

A background task periodically queries Kafka for consumer group lag:

  1. Fetch topic metadata to discover partitions
  2. Query committed offsets for the consumer group
  3. Query high water marks for each partition
  4. Calculate lag = high_water_mark - committed_offset
  5. Aggregate total lag across all partitions
  6. Adjust capacity for all database semaphores

By default, lag is checked every 5 seconds.
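
The first three steps use the Kafka client's metadata, committed-offset, and watermark queries; steps 4 and 5 are plain arithmetic. A sketch of that aggregation, assuming the per-partition offsets have already been fetched:

use std::collections::HashMap;

// Sketch: total lag across partitions, given offsets already fetched.
// A partition with no committed offset is counted from the beginning
// here; real code might treat that case differently.
fn total_lag(
    high_water_marks: &HashMap<i32, i64>,
    committed_offsets: &HashMap<i32, i64>,
) -> i64 {
    high_water_marks
        .iter()
        .map(|(partition, high)| {
            let committed = committed_offsets.get(partition).copied().unwrap_or(0);
            (high - committed).max(0)
        })
        .sum()
}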

Admission control is configured in the Kafka linearizer:

use std::time::Duration;

let config = AdmissionControlConfig::new(
    true,                   // enabled
    1000,                   // max_capacity
    10,                     // min_capacity
    10_000,                 // target_lag (messages)
    100_000,                // critical_lag (messages)
    Duration::from_secs(5), // adjustment_interval
);
Parameter           | Default | Description
--------------------|---------|-----------------------------------------------
enabled             | true    | Enable/disable admission control
max_capacity        | 1000    | Maximum permits per database (low lag)
min_capacity        | 10      | Minimum permits per database (high lag)
target_lag          | 10000   | Target lag in messages (below = max capacity)
critical_lag        | 100000  | Critical lag in messages (above = min capacity)
adjustment_interval | 5s      | How often to check lag and adjust capacity

The system validates configuration at startup:

  • max_capacity >= min_capacity
  • min_capacity > 0
  • critical_lag > target_lag
  • target_lag >= 0

Invalid configurations will cause startup failure with a descriptive error message.
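
A sketch of what these checks might look like. The struct fields follow the parameter table above, but the field types, method name, and error type are assumptions:

use std::time::Duration;

// Field layout assumed from the parameter table; the real struct may differ.
struct AdmissionControlConfig {
    enabled: bool,
    max_capacity: u64,
    min_capacity: u64,
    target_lag: u64,
    critical_lag: u64,
    adjustment_interval: Duration,
}

impl AdmissionControlConfig {
    // Illustrative validation mirroring the rules listed above.
    // With unsigned fields, target_lag >= 0 holds by construction.
    fn validate(&self) -> Result<(), String> {
        if self.min_capacity == 0 {
            return Err("min_capacity must be greater than 0".into());
        }
        if self.max_capacity < self.min_capacity {
            return Err(format!(
                "max_capacity ({}) must be >= min_capacity ({})",
                self.max_capacity, self.min_capacity
            ));
        }
        if self.critical_lag <= self.target_lag {
            return Err(format!(
                "critical_lag ({}) must be > target_lag ({})",
                self.critical_lag, self.target_lag
            ));
        }
        Ok(())
    }
}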

The system exposes these metrics:

evidentdb_admission_control_wait
  • Type: Histogram
  • Unit: seconds
  • Labels: database
  • Description: Time spent waiting for admission control permits

evidentdb_admission_control_capacity
  • Type: Histogram
  • Unit: permits
  • Labels: database
  • Buckets: 10, 25, 50, 100, 250, 500, 750, 1000
  • Description: Current admission control capacity per database

evidentdb_kafka_consumer_lag
  • Type: Gauge
  • Unit: messages
  • Description: Total Kafka consumer lag across all partitions
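
As an illustration only, registering metrics with these names and buckets might look like the following, assuming the Rust prometheus crate (EvidentDB may use a different metrics library):

use prometheus::{register_histogram_vec, register_int_gauge, HistogramVec, IntGauge};

// Sketch: names, labels, and buckets follow the descriptions above.
fn register_metrics() -> prometheus::Result<(HistogramVec, HistogramVec, IntGauge)> {
    let wait = register_histogram_vec!(
        "evidentdb_admission_control_wait",
        "Time spent waiting for admission control permits (seconds)",
        &["database"]
    )?;
    let capacity = register_histogram_vec!(
        "evidentdb_admission_control_capacity",
        "Current admission control capacity per database (permits)",
        &["database"],
        vec![10.0, 25.0, 50.0, 100.0, 250.0, 500.0, 750.0, 1000.0]
    )?;
    let lag = register_int_gauge!(
        "evidentdb_kafka_consumer_lag",
        "Total Kafka consumer lag across all partitions (messages)"
    )?;
    Ok((wait, capacity, lag))
}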

Use these queries to monitor admission control:

# Average admission control wait time per database
rate(evidentdb_admission_control_wait_sum[5m]) / rate(evidentdb_admission_control_wait_count[5m])
# Current capacity distribution
histogram_quantile(0.95, evidentdb_admission_control_capacity_bucket)
# Kafka consumer lag
evidentdb_kafka_consumer_lag
# Fraction of requests admitted within 100ms
# (alert when below 0.95, i.e. more than 5% of requests wait longer than 100ms)
rate(evidentdb_admission_control_wait_bucket{le="0.1"}[5m]) / rate(evidentdb_admission_control_wait_count[5m]) < 0.95

The system logs capacity adjustments:

INFO Adjusted capacity for database my_database from 1000 to 505 (lag=55000)
WARN Decreased semaphore capacity from 505 to 250 (-255 permits)
DEBUG Admission control wait for database my_database: 125ms

Symptom: Kafka consumer lag keeps growing

Cause: System can’t keep up with write load

Solutions:

  • Increase batch processor resources (CPU/memory)
  • Scale horizontally (add more batch processors)
  • Reduce batch coalescer timeout for faster commits
  • Increase DynamoDB write capacity

Symptom: Long admission control wait times


Cause: Capacity too restricted

Solutions:

  • Increase min_capacity to allow more concurrency under load
  • Adjust critical_lag threshold higher
  • Investigate why lag is high

Symptom: System remains overloaded

Cause: Admission control not restrictive enough

Solutions:

  • Decrease max_capacity to reduce peak load
  • Decrease target_lag to trigger back-pressure earlier
  • Increase min_capacity if it’s preventing any progress

Example tuning configurations:

High-throughput workloads:

AdmissionControlConfig::new(
    true,
    2000,                   // Higher max for peak throughput
    50,                     // Higher min to maintain progress
    5_000,                  // Aggressive target
    50_000,                 // Lower critical threshold
    Duration::from_secs(3), // Faster adjustment
)

Conservative, cost-controlled workloads:

AdmissionControlConfig::new(
    true,
    500,                     // Lower max to control costs
    5,                       // Lower min for tighter control
    20_000,                  // Allow more lag before throttling
    200_000,                 // High threshold for rare events
    Duration::from_secs(10), // Slower adjustment, more stable
)

Disabling admission control:

AdmissionControlConfig::disabled() // No back-pressure

Per-database isolation ensures that one busy database doesn’t block requests for other databases. This provides better fairness and prevents cascading failures.

Consumer lag is a direct indicator of system health. High lag means the batch processor is falling behind, and reducing incoming load helps it catch up. This creates a natural feedback loop that stabilizes the system.

Linear interpolation provides smooth, predictable capacity changes. More complex algorithms (exponential, PID controllers) were considered but linear proved sufficient for most workloads while being easier to reason about.

Decoupling lag monitoring from request handling keeps the critical path fast. Lag only needs to be checked periodically (every few seconds), not on every request.
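
A sketch of such a decoupled monitor task, assuming tokio. Here fetch_total_lag stands in for the Kafka lag query described earlier, compute_capacity is the illustrative policy function from above, and the acquire-and-forget pattern is one common way to shrink a tokio semaphore:

use std::sync::Arc;
use std::time::Duration;
use tokio::sync::Semaphore;

// Placeholder for the Kafka lag query (metadata, committed offsets,
// and high water marks, as described above).
async fn fetch_total_lag() -> u64 {
    0
}

// Illustrative monitor loop for a single database's semaphore.
async fn run_lag_monitor(semaphore: Arc<Semaphore>, mut current: usize) {
    let mut ticker = tokio::time::interval(Duration::from_secs(5));
    loop {
        ticker.tick().await;
        let lag = fetch_total_lag().await;
        let target = compute_capacity(lag, 10_000, 100_000, 10, 1_000) as usize;
        if target > current {
            // Growing capacity is cheap: release extra permits into the pool.
            semaphore.add_permits(target - current);
        } else if target < current {
            // Shrinking: acquire the surplus permits and forget them so they
            // never return to the pool. This waits for in-flight requests to
            // release enough permits, which is acceptable off the hot path.
            if let Ok(surplus) = semaphore.acquire_many((current - target) as u32).await {
                surplus.forget();
            }
        }
        current = target;
    }
}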

Admission control works in conjunction with direct gRPC result delivery. When the linearizer receives a batch:

  1. Acquire permit: Block until admission control grants a permit
  2. Publish to Kafka: Send batch to the prospective batches topic
  3. Wait for result: Result delivered directly via gRPC from processor to originator
  4. Release permit: Permit automatically released when response received

This point-to-point result delivery eliminates the need for a separate Kafka result topic, reducing latency and operational complexity.
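
On the request path, holding the permit as an RAII guard makes step 4 automatic: the permit is released when it is dropped, i.e. once the response has been returned. A sketch assuming tokio semaphores, with placeholder publish and result types:

use std::sync::Arc;
use tokio::sync::{AcquireError, Semaphore};

struct Response;
struct Error;

impl From<AcquireError> for Error {
    fn from(_: AcquireError) -> Self {
        Error
    }
}

// Placeholders for the Kafka publish and the direct gRPC result delivery.
async fn publish_to_kafka() -> Result<(), Error> {
    Ok(())
}
async fn await_grpc_result() -> Result<Response, Error> {
    Ok(Response)
}

// The OwnedSemaphorePermit is dropped when this function returns,
// releasing the permit exactly when the response goes back to the caller.
async fn handle_batch(semaphore: Arc<Semaphore>) -> Result<Response, Error> {
    let _permit = semaphore.acquire_owned().await?; // 1. acquire (may block)
    publish_to_kafka().await?;                      // 2. publish batch
    let response = await_grpc_result().await?;      // 3. wait for direct gRPC result
    Ok(response)                                    // 4. permit released on drop
}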

Potential improvements under consideration:

  • Per-database configuration: Different thresholds per database
  • CLI configuration flags: Make settings configurable at startup
  • Advanced algorithms: Exponential smoothing, PID control
  • Admission rejection: Return errors instead of blocking when capacity exhausted
  • Priority queues: Different permit pools for different request priorities
  • External metrics: Adapt based on DynamoDB throttling, CPU usage, etc.