Commit 68da2b6

Fix download speed, add download speed monitor (#2)

* improve download
* fix code
* optimize speed
* fix failing test

Co-authored-by: minguyen9988 <[email protected]>
1 parent f0bc1ea commit 68da2b6

14 files changed: +1241 −25 lines

.github/workflows/build.yaml

Lines changed: 38 additions & 4 deletions
```diff
@@ -49,8 +49,16 @@ jobs:
       - name: Build clickhouse-backup binary
         id: make-race
         env:
-          GOROOT: ${{ env.GOROOT_1_24_X64 }}
+          GOROOT: ${{ env.GOROOT_1_25_X64 }}
         run: |
+          echo "=== Go Environment Diagnostics ==="
+          echo "Expected Go version: ${{ matrix.golang-version }}"
+          echo "GOROOT environment: ${{ env.GOROOT_1_25_X64 }}"
+          echo "Actual go version:"
+          go version
+          echo "Actual GOROOT:"
+          go env GOROOT
+          echo "================================"
           make build/linux/amd64/clickhouse-backup build/linux/arm64/clickhouse-backup
           make build/linux/amd64/clickhouse-backup-fips build/linux/arm64/clickhouse-backup-fips
           make build-race build-race-fips config test
@@ -170,8 +178,16 @@ jobs:
           sudo chmod -Rv +rx test/testflows/clickhouse_backup/_instances
       - name: Format testflows coverage
         env:
-          GOROOT: ${{ env.GOROOT_1_24_X64 }}
+          GOROOT: ${{ env.GOROOT_1_25_X64 }}
         run: |
+          echo "=== TestFlows Coverage Go Environment ==="
+          echo "Expected Go version: ${{ matrix.golang-version }}"
+          echo "GOROOT environment: ${{ env.GOROOT_1_25_X64 }}"
+          echo "Actual go version:"
+          go version
+          echo "Actual GOROOT:"
+          go env GOROOT
+          echo "========================================"
           sudo chmod -Rv a+rw test/testflows/_coverage_/
           ls -la test/testflows/_coverage_
           go env
@@ -252,7 +268,7 @@ jobs:
       - name: Running integration tests
         env:
           RUN_PARALLEL: 4
-          GOROOT: ${{ env.GOROOT_1_24_X64 }}
+          GOROOT: ${{ env.GOROOT_1_25_X64 }}
           CLICKHOUSE_VERSION: ${{ matrix.clickhouse }}
           # options for advanced debug CI/CD
           # RUN_TESTS: "^TestSkipDisk$"
@@ -278,8 +294,18 @@ jobs:
           QA_GCS_OVER_S3_BUCKET: ${{ secrets.QA_GCS_OVER_S3_BUCKET }}
         run: |
           set -xe
+          echo "=== Integration Test Environment Diagnostics ==="
+          echo "Expected Go version: ${{ matrix.golang-version }}"
+          echo "GOROOT environment: ${{ env.GOROOT_1_25_X64 }}"
+          echo "Actual go version:"
+          go version
+          echo "Actual GOROOT:"
+          go env GOROOT
+          echo "Go environment variables:"
+          go env | grep -E "GOROOT|GOCACHE|GOPATH|GOMODCACHE"
           echo "CLICKHOUSE_VERSION=${CLICKHOUSE_VERSION}"
           echo "GCS_TESTS=${GCS_TESTS}"
+          echo "================================================="

           chmod +x $(pwd)/clickhouse-backup/clickhouse-backup*
@@ -338,8 +364,16 @@ jobs:

       - name: Format integration coverage
         env:
-          GOROOT: ${{ env.GOROOT_1_24_X64 }}
+          GOROOT: ${{ env.GOROOT_1_25_X64 }}
         run: |
+          echo "=== Coverage Format Go Environment ==="
+          echo "Expected Go version: ${{ matrix.golang-version }}"
+          echo "GOROOT environment: ${{ env.GOROOT_1_25_X64 }}"
+          echo "Actual go version:"
+          go version
+          echo "Actual GOROOT:"
+          go env GOROOT
+          echo "====================================="
           sudo chmod -Rv a+rw test/integration/_coverage_/
           ls -la test/integration/_coverage_
           go tool covdata textfmt -i test/integration/_coverage_/ -o test/integration/_coverage_/coverage.out
```

PERFORMANCE_IMPROVEMENTS.md

Lines changed: 175 additions & 0 deletions
# ClickHouse Backup Performance Improvements

## Overview

This document outlines the performance improvements made to address inconsistent download performance in clickhouse-backup. The improvements target the issue where downloads "start strong but become bumpy and progressively slower towards completion."

## Key Problems Identified

1. **Multi-level concurrency conflicts**: File download and object disk operations used separate, conflicting concurrency limits
2. **Fixed buffer sizes**: Static buffer sizes didn't adapt to file size or network conditions
3. **Client pool exhaustion**: GCS client pools could become exhausted under high concurrency
4. **Storage-specific bottlenecks**: Each storage backend had a different optimal concurrency pattern
5. **Lack of performance monitoring**: No visibility into download performance degradation

## Implemented Solutions

### 1. Adaptive Concurrency Management (`pkg/config/config.go`)

**New Methods:**

- `GetOptimalDownloadConcurrency()`: Calculates optimal download concurrency based on storage type
- `GetOptimalUploadConcurrency()`: Storage-aware upload concurrency
- `GetOptimalObjectDiskConcurrency()`: Unified object disk concurrency
- `CalculateOptimalBufferSize()`: Dynamic buffer sizing based on file size and concurrency
- `GetOptimalClientPoolSize()`: Adaptive GCS client pool sizing

**Storage-Specific Concurrency:**

- **GCS**: 2x CPU cores, max 50 (conservative due to client pool limits)
- **S3**: 3x CPU cores, max 100 (handles higher concurrency well)
- **Azure Blob**: 2x CPU cores, max 25 (more conservative)
- **Other storage**: 2x CPU cores, max 20
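As a rough illustration, the storage-specific rules above could map to code like this. It is a sketch only: the `Config`/`GeneralConfig` stand-in types and the `RemoteStorage` field name are assumptions, and only the multipliers and caps come from the list above.

```go
import "runtime"

// Minimal stand-in types; the real config structs live in pkg/config.
type GeneralConfig struct{ RemoteStorage string }
type Config struct{ General GeneralConfig }

// GetOptimalDownloadConcurrency sketches the per-backend rule:
// scale with CPU count, then clamp to a storage-specific cap.
func (cfg *Config) GetOptimalDownloadConcurrency() int {
	cpus := runtime.NumCPU()
	var n, limit int
	switch cfg.General.RemoteStorage {
	case "gcs":
		n, limit = 2*cpus, 50 // conservative due to client pool limits
	case "s3":
		n, limit = 3*cpus, 100 // S3 handles higher concurrency well
	case "azblob":
		n, limit = 2*cpus, 25 // more conservative for Azure Blob
	default:
		n, limit = 2*cpus, 20
	}
	if n > limit {
		n = limit
	}
	return n
}
```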
### 2. Enhanced Storage Backends

#### GCS Storage (`pkg/storage/gcs.go`)

- Added a config reference for adaptive operations
- Dynamic client pool sizing: `gcs.cfg.GetOptimalClientPoolSize()`
- Adaptive buffer sizing for uploads
- Dynamic chunk sizing (256KB to 100MB, based on file size)
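The chunk-sizing rule could look roughly like the following. The function name and the target-chunk-count heuristic are assumptions for illustration; only the 256KB floor and 100MB ceiling come from the bullet above.

```go
// calcChunkSize is an illustrative sketch of the "256KB to 100MB" rule:
// aim for a fixed number of chunks per file, clamped to both bounds.
func calcChunkSize(fileSize int64) int64 {
	const (
		minChunk     = 256 * 1024        // 256KB floor
		maxChunk     = 100 * 1024 * 1024 // 100MB ceiling
		targetChunks = 32                // assumed heuristic, not the real code
	)
	chunk := fileSize / targetChunks
	if chunk < minChunk {
		return minChunk
	}
	if chunk > maxChunk {
		return maxChunk
	}
	return chunk
}
```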
#### S3 Storage (`pkg/storage/s3.go`)

- Adaptive buffer sizing in `GetFileReaderWithLocalPath()` and `PutFileAbsolute()`
- Buffer size calculation: `config.CalculateOptimalBufferSize(remoteSize, s.Concurrency)`
- Applied to both download and upload operations

#### Azure Blob Storage (`pkg/storage/azblob.go`)

- Adaptive buffer sizing in `PutFileAbsolute()`
- Dynamic buffer calculation based on file size and max buffers

### 3. Unified Concurrency Limits (`pkg/backup/restore.go` & `pkg/backup/create.go`)

**Before**: Separate concurrency limits caused bottlenecks

```go
// Old: Multiple conflicting limits
restoreBackupWorkingGroup.SetLimit(b.cfg.General.DownloadConcurrency)
downloadObjectDiskPartsWorkingGroup.SetLimit(b.cfg.General.ObjectDiskServerSideCopyConcurrency)
```

**After**: Unified, adaptive concurrency

```go
// New: Unified optimal concurrency
optimalConcurrency := b.cfg.GetOptimalDownloadConcurrency()
restoreBackupWorkingGroup.SetLimit(max(optimalConcurrency, 1))

objectDiskConcurrency := b.cfg.GetOptimalObjectDiskConcurrency()
downloadObjectDiskPartsWorkingGroup.SetLimit(objectDiskConcurrency)
```
### 4. Real-time Performance Monitoring (`pkg/backup/performance_monitor.go`)

**New Performance Monitor Features:**

- **Real-time speed tracking**: Monitors download speed every 5 seconds
- **Performance degradation detection**: Identifies 30%+ performance drops from peak
- **Adaptive concurrency adjustment**: Automatically reduces or increases concurrency based on performance
- **Comprehensive metrics**: Tracks current, average, and peak speeds
- **Event callbacks**: Notifies on performance changes and concurrency adjustments

**Integration Points:**

- Integrated into object disk download operations
- Tracks bytes downloaded and automatically adjusts concurrency
- Provides detailed performance logging with metrics
### 5. Smart Buffer Sizing Algorithm
```go
func CalculateOptimalBufferSize(fileSize int64, concurrency int) int64 {
	// Base buffer size
	baseBuffer := int64(64 * 1024) // 64KB

	// Scale with file size (logarithmic scaling)
	if fileSize > 0 {
		// More aggressive scaling for larger files
		fileSizeFactor := math.Log10(float64(fileSize)/1024/1024 + 1) // Log of MB + 1
		baseBuffer = int64(float64(baseBuffer) * (1 + fileSizeFactor))
	}

	// Reduce buffer size for higher concurrency to manage memory
	if concurrency > 1 {
		concurrencyFactor := math.Sqrt(float64(concurrency))
		baseBuffer = int64(float64(baseBuffer) / concurrencyFactor)
	}

	// Clamp to reasonable bounds
	minBuffer := int64(32 * 1024)        // 32KB minimum
	maxBuffer := int64(32 * 1024 * 1024) // 32MB maximum

	if baseBuffer < minBuffer {
		return minBuffer
	}
	if baseBuffer > maxBuffer {
		return maxBuffer
	}

	return baseBuffer
}
```
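To make the behavior concrete, the function can be exercised directly. The snippet repeats the function so it runs standalone; the printed values follow from the formula itself.

```go
import (
	"fmt"
	"math"
)

// Repeated from the section above so this snippet is self-contained.
func CalculateOptimalBufferSize(fileSize int64, concurrency int) int64 {
	baseBuffer := int64(64 * 1024) // 64KB base
	if fileSize > 0 {
		// Logarithmic scaling with file size in MB
		fileSizeFactor := math.Log10(float64(fileSize)/1024/1024 + 1)
		baseBuffer = int64(float64(baseBuffer) * (1 + fileSizeFactor))
	}
	if concurrency > 1 {
		// Shrink buffers as concurrency rises to bound total memory
		baseBuffer = int64(float64(baseBuffer) / math.Sqrt(float64(concurrency)))
	}
	minBuffer := int64(32 * 1024)        // 32KB minimum
	maxBuffer := int64(32 * 1024 * 1024) // 32MB maximum
	if baseBuffer < minBuffer {
		return minBuffer
	}
	if baseBuffer > maxBuffer {
		return maxBuffer
	}
	return baseBuffer
}

func main() {
	// Tiny file, single stream: neither scaling branch fires, so the 64KB base wins.
	fmt.Println(CalculateOptimalBufferSize(0, 1)) // 65536

	// A 10GB file raises the buffer and 64-way concurrency shrinks it again;
	// either way the result stays inside the [32KB, 32MB] clamp.
	b := CalculateOptimalBufferSize(10*1024*1024*1024, 64)
	fmt.Println(b >= 32*1024 && b <= 32*1024*1024) // true
}
```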
## Performance Benefits

### Expected Improvements

1. **Consistent Throughput**: Eliminates performance degradation towards completion
2. **Better Resource Utilization**: Adaptive concurrency matches system and network capabilities
3. **Reduced Memory Pressure**: Smart buffer sizing prevents memory exhaustion
4. **Storage-Optimized Operations**: Each backend uses optimal concurrency patterns
5. **Self-Healing Performance**: Automatic adjustment when performance degrades

### Monitoring and Observability

**New Log Messages:**

```
# Performance degradation detection
WARN performance degradation detected current_speed_mbps=45.2 average_speed_mbps=67.8 peak_speed_mbps=89.1 concurrency=32

# Concurrency adjustments
INFO concurrency adjusted for better performance concurrency=24 speed_mbps=52.3

# Final performance metrics
INFO object_disk data downloaded with performance metrics disk=s3_disk duration=2m34s size=15.2GB avg_speed_mbps=67.8 peak_speed_mbps=89.1 final_concurrency=28 degradation_detected=false
```
## Configuration

The improvements work automatically with existing configurations. For fine-tuning:

```yaml
general:
  # These settings now adapt automatically, but can still be overridden
  download_concurrency: 0   # 0 = auto-detect optimal
  upload_concurrency: 0     # 0 = auto-detect optimal

  # Object disk operations now unified with main concurrency
  object_disk_server_side_copy_concurrency: 0  # 0 = auto-detect optimal
```

## Backwards Compatibility

- All existing configurations continue to work
- New adaptive features activate when concurrency is set to 0 (auto-detect)
- Manual concurrency settings still override adaptive behavior
- No breaking changes to existing APIs or configurations
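The override contract in the list above fits in a few lines. The helper name is illustrative; the rule itself (an explicit setting always wins, 0 falls back to the adaptive value) is exactly what the bullets describe.

```go
// effectiveConcurrency captures the "0 = auto-detect" contract:
// a manual setting overrides adaptive behavior; 0 defers to it.
// The helper name is illustrative, not the actual API.
func effectiveConcurrency(configured, adaptive int) int {
	if configured > 0 {
		return configured // explicit setting wins
	}
	return adaptive // 0 means auto-detect
}
```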
## Testing Recommendations

1. **Monitor logs** for performance degradation warnings
2. **Compare download times** before and after the improvements
3. **Watch memory usage** during large downloads
4. **Test with different storage backends** to verify the optimizations
5. **Verify consistent performance** throughout the entire download process

## Future Enhancements

1. **Upload performance monitoring**: Extend monitoring to upload operations
2. **Network condition detection**: Adapt to changing network conditions
3. **Historical performance learning**: Remember optimal settings per backup
4. **Cross-table optimization**: Coordinate concurrency across multiple table downloads
5. **Bandwidth throttling**: Respect network bandwidth limits

config-performance-monitoring.yaml

Lines changed: 19 additions & 0 deletions
```yaml
# Enhanced logging configuration for performance monitoring

general:
  # Enable debug logging for performance details
  log_level: "debug"

  # Optional: set concurrency to 0 for auto-detection
  download_concurrency: 0
  upload_concurrency: 0
  object_disk_server_side_copy_concurrency: 0

# Log format configuration
log:
  # Use JSON format for easier parsing
  format: "json"
  # Enable timestamps
  timestamp: true
  # Include caller information
  caller: true
```
