Connection Pool Limits Cause Sequential Processing Instead of Concurrent Execution
Summary
BAML appears to have connection pool limits that cause high-concurrency requests to be processed sequentially rather than concurrently, despite correct usage of asyncio.gather(). This manifests as a distinctive timing pattern where requests complete in sequential batches rather than truly in parallel.
Environment
- BAML Version: 0.208.5 (latest as of issue creation: 0.211.0)
- Python Version: 3.12.5
- OS: macOS
- Usage Pattern: 20+ concurrent requests via asyncio.gather()
Issue Details
Expected Behavior
When making multiple concurrent BAML calls with asyncio.gather(), requests should execute in parallel with completion times distributed based on actual API response times.
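For comparison, here is a minimal timing harness showing what true concurrency should look like: all completions cluster within roughly one request-duration of the start. The asyncio.sleep is a stand-in for the BAML call (swap in the real function to measure actual behavior):

import asyncio
import time

async def fake_call(i: int, duration: float = 1.7) -> float:
    # Stand-in for simplified_baml_qa_response(...); replace the sleep
    # with the real call to measure actual behavior.
    await asyncio.sleep(duration)
    return time.monotonic()

async def main():
    start = time.monotonic()
    completions = await asyncio.gather(*(fake_call(i) for i in range(20)))
    # Under true concurrency all 20 completions land near +1.7s;
    # sequential batching instead spaces them ~1.7s apart.
    for i, t in enumerate(sorted(completions)):
        print(f"request {i}: completed at +{t - start:.3f}s")

asyncio.run(main())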
Actual Behavior
Requests are processed in sequential batches (~6 at a time), creating this pattern:
- First ~6 requests: Complete sequentially with 1.5-2s gaps between each
- Sudden burst: 6+ requests complete within milliseconds of each other
- Pattern repeats: Indicating connection pool cycling rather than true concurrency
Evidence from Production Logs
Sequential Processing Phase:
2025-10-08 17:22:26,888 INFO httpx HTTP Request: POST http://localhost:8080/v1/chat/completions "HTTP/1.1 200 OK"
2025-10-08 17:22:28,519 INFO httpx HTTP Request: POST http://localhost:8080/v1/chat/completions "HTTP/1.1 200 OK" [Gap: 1.631s]
2025-10-08 17:22:30,228 INFO httpx HTTP Request: POST http://localhost:8080/v1/chat/completions "HTTP/1.1 200 OK" [Gap: 1.709s]
2025-10-08 17:22:31,689 INFO httpx HTTP Request: POST http://localhost:8080/v1/chat/completions "HTTP/1.1 200 OK" [Gap: 1.461s]
2025-10-08 17:22:33,466 INFO httpx HTTP Request: POST http://localhost:8080/v1/chat/completions "HTTP/1.1 200 OK" [Gap: 1.777s]
2025-10-08 17:22:35,298 INFO httpx HTTP Request: POST http://localhost:8080/v1/chat/completions "HTTP/1.1 200 OK" [Gap: 1.832s]
Then Sudden Concurrent Burst:
2025-10-08 17:22:47,930 INFO httpx HTTP Request: POST http://localhost:8080/v1/chat/completions "HTTP/1.1 200 OK"
2025-10-08 17:22:47,930 INFO httpx HTTP Request: POST http://localhost:8080/v1/chat/completions "HTTP/1.1 200 OK" [Gap: 0ms]
2025-10-08 17:22:47,931 INFO httpx HTTP Request: POST http://localhost:8080/v1/chat/completions "HTTP/1.1 200 OK" [Gap: 1ms]
2025-10-08 17:22:47,931 INFO httpx HTTP Request: POST http://localhost:8080/v1/chat/completions "HTTP/1.1 200 OK" [Gap: 0ms]
2025-10-08 17:22:47,932 INFO httpx HTTP Request: POST http://localhost:8080/v1/chat/completions "HTTP/1.1 200 OK" [Gap: 1ms]
2025-10-08 17:22:47,933 INFO httpx HTTP Request: POST http://localhost:8080/v1/chat/completions "HTTP/1.1 200 OK" [Gap: 1ms]
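The [Gap: ...] annotations above were computed from the httpx log timestamps. A sketch of that computation (the log file name is an assumption):

from datetime import datetime

# Parse the httpx timestamp prefix ("2025-10-08 17:22:26,888", 23 chars)
# from each completion line and print the gap between consecutive ones.
with open("production.log") as f:  # assumed file name
    stamps = [
        datetime.strptime(line[:23], "%Y-%m-%d %H:%M:%S,%f")
        for line in f
        if "HTTP Request: POST" in line
    ]
for prev, cur in zip(stamps, stamps[1:]):
    print(f"{cur.time()}  [Gap: {(cur - prev).total_seconds():.3f}s]")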
User Code (Correctly Implemented)
import asyncio

async def concurrent_simplified_generation(queries, context_chunks_list, baml_options):
    """From backend/backend/core/agents/helpers.py - correctly uses asyncio.gather"""
    tasks = []
    for query, context_chunks in zip(queries, context_chunks_list, strict=True):
        task = simplified_baml_qa_response(query, ..., baml_options=baml_options)
        tasks.append(task)
    # This should enable true concurrency, but BAML appears to serialize internally
    return await asyncio.gather(*tasks)

Relationship to Previous Work
Acknowledgment: The BAML team has already addressed several connection pool issues:
- PRs #1027 (Add limit on connection pool to prevent stalling issues in pyo3 and other ffi boundaries) and #1028 (set pool timeout for all clients): Fixed idle connection stalling in FFI boundaries
- PR #2205 (Add a pool timeout to try to fix open File descriptor issue like deno): Fixed file descriptor leaks with pool timeouts
This issue is different:
- Previous fixes addressed idle connections and resource leaks
- This issue is about active connection limits preventing true concurrency
- The distinctive timing pattern suggests connection pool exhaustion rather than idle timeouts
Root Cause Analysis
BAML uses requests/httpx internally but appears to have connection pool limits that aren't suitable for high-concurrency scenarios. The current configuration likely allows ~6 concurrent connections, causing additional requests to queue rather than execute in parallel.
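For illustration, this is how pool limits behave in httpx; whatever HTTP client BAML uses internally, the queueing mechanics are analogous. The limit values below are assumptions chosen to match the observed pattern:

import asyncio
import httpx

# A pool capped at 6 connections: the 7th and later requests wait in the
# pool queue instead of opening new connections, reproducing the observed
# "sequential batches" completion pattern.
limits = httpx.Limits(max_connections=6, max_keepalive_connections=6)

async def main():
    async with httpx.AsyncClient(limits=limits, timeout=30.0) as client:
        # 20 concurrent POSTs, but at most 6 in flight at once;
        # the rest queue until a pooled connection frees up.
        tasks = [
            client.post("http://localhost:8080/v1/chat/completions", json={})
            for _ in range(20)
        ]
        await asyncio.gather(*tasks, return_exceptions=True)

asyncio.run(main())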
Impact
- Performance degradation: 20 concurrent requests that should complete in ~3-5s instead take 30-50s (at the observed ~1.7s gap per completion, 20 effectively serialized requests take ~34s, squarely in that range)
- Poor resource utilization: CPU and network remain idle while requests queue
- Unpredictable latency: Request completion depends on queue position, not actual processing
Proposed Solutions
- Expose connection pool configuration in BAML client options (a hypothetical shape is sketched below)
- Increase default connection limits for modern high-concurrency use cases
- Add configuration similar to the existing timeout proposal in #1630 (Feature Proposal: Configurable LLM Client Timeouts)
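A hypothetical sketch of what the first proposal could look like from the caller's side. Neither the http_pool key nor its sub-keys exist in BAML today; the names are invented purely to illustrate the shape:

# Hypothetical API sketch -- the "http_pool" option and its keys do not
# exist in BAML today; they illustrate what the proposal might expose.
baml_options = {
    "http_pool": {                # hypothetical option
        "max_connections": 50,    # hypothetical: cap on in-flight connections
        "pool_timeout_s": 30,     # hypothetical: max wait for a free pool slot
    },
}
# ...which would then be threaded through exactly as baml_options is today, e.g.:
# result = await simplified_baml_qa_response(query, ..., baml_options=baml_options)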
Additional Context
- Issue becomes pronounced with 10+ concurrent requests
- Observed on BAML 0.208.5; reviewing releases through 0.211.0 shows no related fixes
- This significantly impacts batch processing and parallel generation workflows
- Related to #1630 (Feature Proposal: Configurable LLM Client Timeouts) but specifically about connection limits
Reproducible: Yes, consistently observed across multiple test runs and production usage