Admission Control

Admission control provides automatic back-pressure management for the Kafka linearizer to prevent system overload during high-traffic periods or when batch processing falls behind.

The admission control system uses per-database semaphores with dynamically adjustable capacity. When Kafka consumer lag increases, the system automatically reduces capacity to slow down incoming requests and allow the batch processor to catch up.

  • Per-Database Isolation: Each database has its own semaphore, preventing one busy database from affecting others
  • Lag-Based Adaptation: Capacity adjusts automatically based on Kafka consumer lag
  • Graceful Degradation: System remains responsive even under heavy load
  • Observable: Metrics for monitoring capacity and wait times
The following diagram (Mermaid source) shows the request path and the background capacity monitor:

graph TB
    subgraph "API Server"
        A[Client Request] --> B[Admission Control]
        B --> C[Acquire Permit]
        C --> D[Publish to Kafka]
        D --> E[Release Permit]
        E --> F[Return Response]
    end
    subgraph "Background Monitor"
        G[Lag Monitor] --> H[Check Consumer Lag]
        H --> I{Lag Level?}
        I -->|Low| J[Max Capacity]
        I -->|Medium| K[Interpolate]
        I -->|High| L[Min Capacity]
        J --> M[Adjust Semaphores]
        K --> M
        L --> M
    end
    M -.Updates.-> B
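
As a rough sketch of the per-database isolation described above, the registry might be a map from database name to a shared tokio semaphore. All type and method names here are illustrative, not the actual EvidentDB identifiers:

use std::collections::HashMap;
use std::sync::Arc;
use tokio::sync::{RwLock, Semaphore};

// Hypothetical per-database registry: one semaphore per database,
// created on first use with the configured maximum capacity.
struct AdmissionController {
    semaphores: RwLock<HashMap<String, Arc<Semaphore>>>,
    max_capacity: usize,
}

impl AdmissionController {
    async fn semaphore_for(&self, database: &str) -> Arc<Semaphore> {
        // Fast path: the semaphore already exists.
        if let Some(sem) = self.semaphores.read().await.get(database) {
            return sem.clone();
        }
        // Slow path: create it under the write lock.
        let mut map = self.semaphores.write().await;
        map.entry(database.to_string())
            .or_insert_with(|| Arc::new(Semaphore::new(self.max_capacity)))
            .clone()
    }
}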

The system compares consumer lag against two thresholds, target_lag and critical_lag, which divide operation into three capacity regimes:

  1. Below target_lag (default: 10,000 messages)
    • Use maximum capacity (default: 1,000 permits)
    • System operating normally
  2. Between target_lag and critical_lag
    • Linear interpolation between max and min capacity
    • Gradually reduce pressure as lag increases
  3. Above critical_lag (default: 100,000 messages)
    • Use minimum capacity (default: 10 permits)
    • System under heavy load, maximum back-pressure

The interpolation formula:

capacity = min_capacity + (max_capacity - min_capacity) × (1 - lag_fraction)
where:
lag_fraction = (lag - target_lag) / (critical_lag - target_lag)
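
As a worked example with the defaults, a lag of 55,000 messages gives lag_fraction = (55,000 − 10,000) / (100,000 − 10,000) = 0.5, so capacity = 10 + 990 × 0.5 = 505 permits, which matches the capacity-adjustment log shown later. A minimal Rust sketch of the full policy, including the clamping at both thresholds (the function name is illustrative):

// Sketch of the capacity policy: max capacity below target_lag,
// min capacity above critical_lag, linear interpolation in between.
fn compute_capacity(
    lag: u64,
    target_lag: u64,
    critical_lag: u64,
    min_capacity: u64,
    max_capacity: u64,
) -> u64 {
    if lag <= target_lag {
        max_capacity
    } else if lag >= critical_lag {
        min_capacity
    } else {
        let lag_fraction = (lag - target_lag) as f64 / (critical_lag - target_lag) as f64;
        let range = (max_capacity - min_capacity) as f64;
        min_capacity + (range * (1.0 - lag_fraction)).round() as u64
    }
}

// With the defaults: compute_capacity(55_000, 10_000, 100_000, 10, 1_000) == 505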

A background task periodically queries Kafka for consumer group lag:

  1. Fetch topic metadata to discover partitions
  2. Query committed offsets for the consumer group
  3. Query high water marks for each partition
  4. Calculate lag = high_water_mark - committed_offset
  5. Aggregate total lag across all partitions
  6. Adjust capacity for all database semaphores

By default, lag is checked every 5 seconds.
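
The first three steps use the Kafka client's metadata, committed-offset, and watermark queries; steps 4 and 5 are plain arithmetic. A sketch of that aggregation, assuming the per-partition offsets have already been fetched:

use std::collections::HashMap;

// Sketch: total lag across partitions, given offsets already fetched.
// A partition with no committed offset is counted from the beginning
// here; real code might treat that case differently.
fn total_lag(
    high_water_marks: &HashMap<i32, i64>,
    committed_offsets: &HashMap<i32, i64>,
) -> i64 {
    high_water_marks
        .iter()
        .map(|(partition, high)| {
            let committed = committed_offsets.get(partition).copied().unwrap_or(0);
            (high - committed).max(0)
        })
        .sum()
}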

Admission control is configured in the Kafka linearizer:

use std::time::Duration;

let config = AdmissionControlConfig::new(
    true,                   // enabled
    1000,                   // max_capacity
    10,                     // min_capacity
    10_000,                 // target_lag (messages)
    100_000,                // critical_lag (messages)
    Duration::from_secs(5), // adjustment_interval
);
Parameter           | Default | Description
--------------------|---------|-----------------------------------------------
enabled             | true    | Enable/disable admission control
max_capacity        | 1000    | Maximum permits per database (low lag)
min_capacity        | 10      | Minimum permits per database (high lag)
target_lag          | 10000   | Target lag in messages (below = max capacity)
critical_lag        | 100000  | Critical lag in messages (above = min capacity)
adjustment_interval | 5s      | How often to check lag and adjust capacity

The system validates configuration at startup:

  • max_capacity >= min_capacity
  • min_capacity > 0
  • critical_lag > target_lag
  • target_lag >= 0

Invalid configurations will cause startup failure with a descriptive error message.
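
A sketch of what these checks might look like. The struct fields follow the parameter table above, but the field types, method name, and error type are assumptions:

use std::time::Duration;

// Field layout assumed from the parameter table; the real struct may differ.
struct AdmissionControlConfig {
    enabled: bool,
    max_capacity: u64,
    min_capacity: u64,
    target_lag: u64,
    critical_lag: u64,
    adjustment_interval: Duration,
}

impl AdmissionControlConfig {
    // Illustrative validation mirroring the rules listed above.
    // With unsigned fields, target_lag >= 0 holds by construction.
    fn validate(&self) -> Result<(), String> {
        if self.min_capacity == 0 {
            return Err("min_capacity must be greater than 0".into());
        }
        if self.max_capacity < self.min_capacity {
            return Err(format!(
                "max_capacity ({}) must be >= min_capacity ({})",
                self.max_capacity, self.min_capacity
            ));
        }
        if self.critical_lag <= self.target_lag {
            return Err(format!(
                "critical_lag ({}) must be > target_lag ({})",
                self.critical_lag, self.target_lag
            ));
        }
        Ok(())
    }
}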

The system exposes these metrics:

evidentdb_admission_control_wait
  • Type: Histogram
  • Unit: seconds
  • Labels: database
  • Description: Time spent waiting for admission control permits

evidentdb_admission_control_capacity
  • Type: Histogram
  • Unit: permits
  • Labels: database
  • Buckets: 10, 25, 50, 100, 250, 500, 750, 1000
  • Description: Current admission control capacity per database

evidentdb_kafka_consumer_lag
  • Type: Gauge
  • Unit: messages
  • Description: Total Kafka consumer lag across all partitions
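
As an illustration only, registering metrics with these names and buckets might look like the following, assuming the Rust prometheus crate (EvidentDB may use a different metrics library):

use prometheus::{register_histogram_vec, register_int_gauge, HistogramVec, IntGauge};

// Sketch: names, labels, and buckets follow the descriptions above.
fn register_metrics() -> prometheus::Result<(HistogramVec, HistogramVec, IntGauge)> {
    let wait = register_histogram_vec!(
        "evidentdb_admission_control_wait",
        "Time spent waiting for admission control permits (seconds)",
        &["database"]
    )?;
    let capacity = register_histogram_vec!(
        "evidentdb_admission_control_capacity",
        "Current admission control capacity per database (permits)",
        &["database"],
        vec![10.0, 25.0, 50.0, 100.0, 250.0, 500.0, 750.0, 1000.0]
    )?;
    let lag = register_int_gauge!(
        "evidentdb_kafka_consumer_lag",
        "Total Kafka consumer lag across all partitions (messages)"
    )?;
    Ok((wait, capacity, lag))
}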

Use these queries to monitor admission control:

# Average admission control wait time per database
rate(evidentdb_admission_control_wait_sum[5m]) / rate(evidentdb_admission_control_wait_count[5m])
# Current capacity distribution
histogram_quantile(0.95, evidentdb_admission_control_capacity_bucket)
# Kafka consumer lag
evidentdb_kafka_consumer_lag
# Fraction of requests admitted within 100ms
# (alert when below 0.95, i.e. more than 5% of requests wait longer than 100ms)
rate(evidentdb_admission_control_wait_bucket{le="0.1"}[5m]) / rate(evidentdb_admission_control_wait_count[5m]) < 0.95

The system logs capacity adjustments:

INFO Adjusted capacity for database my_database from 1000 to 505 (lag=55000)
WARN Decreased semaphore capacity from 505 to 250 (-255 permits)
DEBUG Admission control wait for database my_database: 125ms

Symptom: Kafka consumer lag keeps growing

Cause: System can’t keep up with write load

Solutions:

  • Increase batch processor resources (CPU/memory)
  • Scale horizontally (add more batch processors)
  • Reduce batch coalescer timeout for faster commits
  • Increase DynamoDB write capacity

Symptom: Long admission control wait times


Cause: Capacity too restricted

Solutions:

  • Increase min_capacity to allow more concurrency under load
  • Adjust critical_lag threshold higher
  • Investigate why lag is high

Symptom: System remains overloaded

Cause: Admission control not restrictive enough

Solutions:

  • Decrease max_capacity to reduce peak load
  • Decrease target_lag to trigger back-pressure earlier
  • Increase min_capacity if it’s preventing any progress

Example tuning configurations:

High-throughput workloads:

AdmissionControlConfig::new(
    true,
    2000,                   // Higher max for peak throughput
    50,                     // Higher min to maintain progress
    5_000,                  // Aggressive target
    50_000,                 // Lower critical threshold
    Duration::from_secs(3), // Faster adjustment
)

Conservative, cost-controlled workloads:

AdmissionControlConfig::new(
    true,
    500,                     // Lower max to control costs
    5,                       // Lower min for tighter control
    20_000,                  // Allow more lag before throttling
    200_000,                 // High threshold for rare events
    Duration::from_secs(10), // Slower adjustment, more stable
)

Disabling admission control:

AdmissionControlConfig::disabled() // No back-pressure

Per-database isolation ensures that one busy database doesn’t block requests for other databases. This provides better fairness and prevents cascading failures.

Consumer lag is a direct indicator of system health. High lag means the batch processor is falling behind, and reducing incoming load helps it catch up. This creates a natural feedback loop that stabilizes the system.

Linear interpolation provides smooth, predictable capacity changes. More complex algorithms (exponential, PID controllers) were considered but linear proved sufficient for most workloads while being easier to reason about.

Decoupling lag monitoring from request handling keeps the critical path fast. Lag only needs to be checked periodically (every few seconds), not on every request.
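
A sketch of such a decoupled monitor task, assuming tokio. Here fetch_total_lag stands in for the Kafka lag query described earlier, compute_capacity is the illustrative policy function from above, and the acquire-and-forget pattern is one common way to shrink a tokio semaphore:

use std::sync::Arc;
use std::time::Duration;
use tokio::sync::Semaphore;

// Placeholder for the Kafka lag query (metadata, committed offsets,
// and high water marks, as described above).
async fn fetch_total_lag() -> u64 {
    0
}

// Illustrative monitor loop for a single database's semaphore.
async fn run_lag_monitor(semaphore: Arc<Semaphore>, mut current: usize) {
    let mut ticker = tokio::time::interval(Duration::from_secs(5));
    loop {
        ticker.tick().await;
        let lag = fetch_total_lag().await;
        let target = compute_capacity(lag, 10_000, 100_000, 10, 1_000) as usize;
        if target > current {
            // Growing capacity is cheap: release extra permits into the pool.
            semaphore.add_permits(target - current);
        } else if target < current {
            // Shrinking: acquire the surplus permits and forget them so they
            // never return to the pool. This waits for in-flight requests to
            // release enough permits, which is acceptable off the hot path.
            if let Ok(surplus) = semaphore.acquire_many((current - target) as u32).await {
                surplus.forget();
            }
        }
        current = target;
    }
}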

Admission control works in conjunction with direct gRPC result delivery. When the linearizer receives a batch:

  1. Acquire permit: Block until admission control grants a permit
  2. Publish to Kafka: Send batch to the prospective batches topic
  3. Wait for result: Result delivered directly via gRPC from processor to originator
  4. Release permit: Permit automatically released when response received

This point-to-point result delivery eliminates the need for a separate Kafka result topic, reducing latency and operational complexity.
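
On the request path, holding the permit as an RAII guard makes step 4 automatic: the permit is released when it is dropped, i.e. once the response has been returned. A sketch assuming tokio semaphores, with placeholder publish and result types:

use std::sync::Arc;
use tokio::sync::{AcquireError, Semaphore};

struct Response;
struct Error;

impl From<AcquireError> for Error {
    fn from(_: AcquireError) -> Self {
        Error
    }
}

// Placeholders for the Kafka publish and the direct gRPC result delivery.
async fn publish_to_kafka() -> Result<(), Error> {
    Ok(())
}
async fn await_grpc_result() -> Result<Response, Error> {
    Ok(Response)
}

// The OwnedSemaphorePermit is dropped when this function returns,
// releasing the permit exactly when the response goes back to the caller.
async fn handle_batch(semaphore: Arc<Semaphore>) -> Result<Response, Error> {
    let _permit = semaphore.acquire_owned().await?; // 1. acquire (may block)
    publish_to_kafka().await?;                      // 2. publish batch
    let response = await_grpc_result().await?;      // 3. wait for direct gRPC result
    Ok(response)                                    // 4. permit released on drop
}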

Potential improvements under consideration:

  • Per-database configuration: Different thresholds per database
  • CLI configuration flags: Make settings configurable at startup
  • Advanced algorithms: Exponential smoothing, PID control
  • Admission rejection: Return errors instead of blocking when capacity exhausted
  • Priority queues: Different permit pools for different request priorities
  • External metrics: Adapt based on DynamoDB throttling, CPU usage, etc.