Commit b8f71ee

Merge branch 'master' into minguyen/restore_remote
2 parents: 4c26024 + 68da2b6

43 files changed (+14298, -124 lines)

.github/workflows/build.yaml

Lines changed: 24 additions & 0 deletions
```diff
@@ -51,6 +51,14 @@ jobs:
       env:
         GOROOT: ${{ env.GOROOT_1_25_X64 }}
       run: |
+        echo "=== Go Environment Diagnostics ==="
+        echo "Expected Go version: ${{ matrix.golang-version }}"
+        echo "GOROOT environment: ${{ env.GOROOT_1_25_X64 }}"
+        echo "Actual go version:"
+        go version
+        echo "Actual GOROOT:"
+        go env GOROOT
+        echo "================================"
         make build/linux/amd64/clickhouse-backup build/linux/arm64/clickhouse-backup
         make build/linux/amd64/clickhouse-backup-fips build/linux/arm64/clickhouse-backup-fips
         make build-race build-race-fips config test
@@ -170,6 +178,14 @@ jobs:
       env:
         GOROOT: ${{ env.GOROOT_1_25_X64 }}
       run: |
+        echo "=== TestFlows Coverage Go Environment ==="
+        echo "Expected Go version: ${{ matrix.golang-version }}"
+        echo "GOROOT environment: ${{ env.GOROOT_1_25_X64 }}"
+        echo "Actual go version:"
+        go version
+        echo "Actual GOROOT:"
+        go env GOROOT
+        echo "========================================"
         sudo chmod -Rv a+rw test/testflows/_coverage_/
         ls -la test/testflows/_coverage_
         go env
@@ -340,6 +356,14 @@ jobs:
       env:
         GOROOT: ${{ env.GOROOT_1_25_X64 }}
       run: |
+        echo "=== Coverage Format Go Environment ==="
+        echo "Expected Go version: ${{ matrix.golang-version }}"
+        echo "GOROOT environment: ${{ env.GOROOT_1_25_X64 }}"
+        echo "Actual go version:"
+        go version
+        echo "Actual GOROOT:"
+        go env GOROOT
+        echo "====================================="
         sudo chmod -Rv a+rw test/integration/_coverage_/
         ls -la test/integration/_coverage_
         go tool covdata textfmt -i test/integration/_coverage_/ -o test/integration/_coverage_/coverage.out
```

PERFORMANCE_IMPROVEMENTS.md

Lines changed: 175 additions & 0 deletions
# ClickHouse Backup Performance Improvements

## Overview

This document outlines the comprehensive performance improvements made to address inconsistent download performance in clickhouse-backup. The improvements target the issue where downloads "start strong but become bumpy and progressively slower towards completion."

## Key Problems Identified

1. **Multi-level concurrency conflicts**: Download and object disk operations used separate, conflicting concurrency limits
2. **Fixed buffer sizes**: Static buffer sizes didn't adapt to file size or network conditions
3. **Client pool exhaustion**: GCS client pools could become exhausted under high concurrency
4. **Storage-specific bottlenecks**: Each storage backend had different optimal concurrency patterns
5. **Lack of performance monitoring**: No visibility into download performance degradation

## Implemented Solutions

### 1. Adaptive Concurrency Management (`pkg/config/config.go`)

**New Methods:**
- `GetOptimalDownloadConcurrency()`: Calculates optimal concurrency based on storage type
- `GetOptimalUploadConcurrency()`: Storage-aware upload concurrency
- `GetOptimalObjectDiskConcurrency()`: Unified object disk concurrency
- `CalculateOptimalBufferSize()`: Dynamic buffer sizing based on file size and concurrency
- `GetOptimalClientPoolSize()`: Adaptive GCS client pool sizing

**Storage-Specific Concurrency:**
- **GCS**: 2x CPU cores, max 50 (conservative due to client pool limits)
- **S3**: 3x CPU cores, max 100 (handles higher concurrency well)
- **Azure Blob**: 2x CPU cores, max 25 (more conservative)
- **Other storage**: 2x CPU cores, max 20

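A minimal sketch of what this storage-aware calculation could look like, assuming `runtime.NumCPU()`, a `General.RemoteStorage` field holding the storage type, and Go 1.21+'s built-in `min`; the storage-type strings and the manual-override check are illustrative, only the per-storage limits come from the list above:

```go
// Illustrative sketch, not the actual implementation in pkg/config/config.go.
// Assumes: import "runtime"; cfg.General.RemoteStorage and
// cfg.General.DownloadConcurrency exist on the config struct.
func (cfg *Config) GetOptimalDownloadConcurrency() int {
	// A non-zero manual setting still wins (see Backwards Compatibility below).
	if cfg.General.DownloadConcurrency > 0 {
		return int(cfg.General.DownloadConcurrency)
	}
	cpus := runtime.NumCPU()
	switch cfg.General.RemoteStorage {
	case "gcs":
		return min(2*cpus, 50) // conservative: bounded by the GCS client pool
	case "s3":
		return min(3*cpus, 100) // S3 handles higher concurrency well
	case "azblob":
		return min(2*cpus, 25) // more conservative for Azure Blob
	default:
		return min(2*cpus, 20)
	}
}
```
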
### 2. Enhanced Storage Backends

#### GCS Storage (`pkg/storage/gcs.go`)
- Added config reference for adaptive operations
- Dynamic client pool sizing: `gcs.cfg.GetOptimalClientPoolSize()`
- Adaptive buffer sizing for uploads
- Dynamic chunk sizing (256KB to 100MB based on file size)

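The chunk-sizing code itself is not reproduced in this document; the following is a rough sketch of the idea. The helper name `calculateChunkSize` and the chunks-per-file target are assumptions, only the 256KB–100MB bounds come from the bullet above:

```go
// calculateChunkSize: hypothetical illustration of dynamic GCS chunk sizing.
// Larger files get larger upload chunks, clamped to the documented bounds.
func calculateChunkSize(fileSize int64) int64 {
	const (
		minChunk = 256 * 1024        // 256KB lower bound
		maxChunk = 100 * 1024 * 1024 // 100MB upper bound
	)
	// Illustrative heuristic: aim for roughly 100 chunks per file.
	chunk := fileSize / 100
	if chunk < minChunk {
		return minChunk
	}
	if chunk > maxChunk {
		return maxChunk
	}
	return chunk
}
```
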
#### S3 Storage (`pkg/storage/s3.go`)
- Adaptive buffer sizing in `GetFileReaderWithLocalPath()` and `PutFileAbsolute()`
- Buffer size calculation: `config.CalculateOptimalBufferSize(remoteSize, s.Concurrency)`
- Applied to both download and upload operations

#### Azure Blob Storage (`pkg/storage/azblob.go`)
- Adaptive buffer sizing in `PutFileAbsolute()`
- Dynamic buffer calculation based on file size and max buffers

### 3. Unified Concurrency Limits (`pkg/backup/restore.go` & `pkg/backup/create.go`)

**Before**: Separate concurrency limits caused bottlenecks
```go
// Old: Multiple conflicting limits
restoreBackupWorkingGroup.SetLimit(b.cfg.General.DownloadConcurrency)
downloadObjectDiskPartsWorkingGroup.SetLimit(b.cfg.General.ObjectDiskServerSideCopyConcurrency)
```

**After**: Unified, adaptive concurrency
```go
// New: Unified optimal concurrency
optimalConcurrency := b.cfg.GetOptimalDownloadConcurrency()
restoreBackupWorkingGroup.SetLimit(max(optimalConcurrency, 1))
objectDiskConcurrency := b.cfg.GetOptimalObjectDiskConcurrency()
downloadObjectDiskPartsWorkingGroup.SetLimit(objectDiskConcurrency)
```

### 4. Real-time Performance Monitoring (`pkg/backup/performance_monitor.go`)

**New Performance Monitor Features:**
- **Real-time speed tracking**: Monitors download speed every 5 seconds
- **Performance degradation detection**: Identifies 30%+ performance drops from peak
- **Adaptive concurrency adjustment**: Automatically reduces/increases concurrency based on performance
- **Comprehensive metrics**: Tracks current, average, and peak speeds
- **Event callbacks**: Notifies on performance changes and concurrency adjustments

**Integration Points:**
- Integrated into object disk download operations
- Tracks bytes downloaded and automatically adjusts concurrency
- Provides detailed performance logging with metrics

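The monitor's API isn't shown in this document; below is a minimal sketch, under stated assumptions, of the degradation check it performs. The type, fields, and method names are illustrative (only the 5-second sampling cadence and the 30% threshold come from the feature list above), and it uses only the standard `sync` and `time` packages:

```go
// PerformanceMonitor sketch: accumulate downloaded bytes, sample the speed
// periodically (intended cadence: every 5 seconds), and flag a drop of 30%+
// from the observed peak. Names and structure are assumptions.
type PerformanceMonitor struct {
	mu           sync.Mutex
	bytesTotal   int64
	lastBytes    int64
	lastSample   time.Time
	peakSpeedMBs float64
}

// AddBytes records bytes completed by download workers.
func (m *PerformanceMonitor) AddBytes(n int64) {
	m.mu.Lock()
	m.bytesTotal += n
	m.mu.Unlock()
}

// Sample computes the speed since the previous sample and reports whether
// it dropped 30% or more below the peak seen so far.
func (m *PerformanceMonitor) Sample(now time.Time) (speedMBs float64, degraded bool) {
	m.mu.Lock()
	defer m.mu.Unlock()
	elapsed := now.Sub(m.lastSample).Seconds()
	if elapsed <= 0 {
		return 0, false
	}
	speedMBs = float64(m.bytesTotal-m.lastBytes) / elapsed / (1024 * 1024)
	m.lastBytes, m.lastSample = m.bytesTotal, now
	if speedMBs > m.peakSpeedMBs {
		m.peakSpeedMBs = speedMBs
	}
	degraded = m.peakSpeedMBs > 0 && speedMBs < 0.7*m.peakSpeedMBs
	return speedMBs, degraded
}
```

A caller would feed `AddBytes()` from the download workers and, on a degraded sample, lower the worker limit (raising it again once speed recovers), which is the adaptive adjustment described above.
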
### 5. Smart Buffer Sizing Algorithm

```go
func CalculateOptimalBufferSize(fileSize int64, concurrency int) int64 {
	// Base buffer size
	baseBuffer := int64(64 * 1024) // 64KB

	// Scale with file size (logarithmic scaling)
	if fileSize > 0 {
		// More aggressive scaling for larger files
		fileSizeFactor := math.Log10(float64(fileSize)/1024/1024 + 1) // Log of MB + 1
		baseBuffer = int64(float64(baseBuffer) * (1 + fileSizeFactor))
	}

	// Reduce buffer size for higher concurrency to manage memory
	if concurrency > 1 {
		concurrencyFactor := math.Sqrt(float64(concurrency))
		baseBuffer = int64(float64(baseBuffer) / concurrencyFactor)
	}

	// Clamp to reasonable bounds
	minBuffer := int64(32 * 1024)        // 32KB minimum
	maxBuffer := int64(32 * 1024 * 1024) // 32MB maximum

	if baseBuffer < minBuffer {
		return minBuffer
	}
	if baseBuffer > maxBuffer {
		return maxBuffer
	}

	return baseBuffer
}
```

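As a worked example of the formula above: a 1 GiB file downloaded with concurrency 16 gives a file-size factor of log10(1024 + 1) ≈ 3.01, so the base buffer grows to 64KB × 4.01 ≈ 257KB, and dividing by √16 = 4 brings it back to roughly 64KB, comfortably within the 32KB–32MB bounds. At high concurrency the 32KB floor is what typically binds (e.g. 64KB / √100 ≈ 6.4KB is raised to 32KB).
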
## Performance Benefits

### Expected Improvements

1. **Consistent Throughput**: Eliminates performance degradation towards completion
2. **Better Resource Utilization**: Adaptive concurrency matches system and network capabilities
3. **Reduced Memory Pressure**: Smart buffer sizing prevents memory exhaustion
4. **Storage-Optimized Operations**: Each backend uses optimal concurrency patterns
5. **Self-Healing Performance**: Automatic adjustment when performance degrades

### Monitoring and Observability

**New Log Messages:**
```
# Performance degradation detection
WARN performance degradation detected current_speed_mbps=45.2 average_speed_mbps=67.8 peak_speed_mbps=89.1 concurrency=32

# Concurrency adjustments
INFO concurrency adjusted for better performance concurrency=24 speed_mbps=52.3

# Final performance metrics
INFO object_disk data downloaded with performance metrics disk=s3_disk duration=2m34s size=15.2GB avg_speed_mbps=67.8 peak_speed_mbps=89.1 final_concurrency=28 degradation_detected=false
```

## Configuration

The improvements work automatically with existing configurations. For fine-tuning:

```yaml
general:
  # These settings now adapt automatically, but can still be overridden
  download_concurrency: 0  # 0 = auto-detect optimal
  upload_concurrency: 0    # 0 = auto-detect optimal

  # Object disk operations now unified with main concurrency
  object_disk_server_side_copy_concurrency: 0  # 0 = auto-detect optimal
```

## Backwards Compatibility

- All existing configurations continue to work
- New adaptive features activate when concurrency is set to 0 (auto-detect)
- Manual concurrency settings still override adaptive behavior
- No breaking changes to existing APIs or configurations

## Testing Recommendations

1. **Monitor logs** for performance degradation warnings
2. **Compare download times** before and after the improvements
3. **Watch memory usage** during large downloads
4. **Test with different storage backends** to verify the optimizations
5. **Verify consistent performance** throughout the entire download process

## Future Enhancements

1. **Upload performance monitoring**: Extend monitoring to upload operations
2. **Network condition detection**: Adapt to changing network conditions
3. **Historical performance learning**: Remember optimal settings per backup
4. **Cross-table optimization**: Coordinate concurrency across multiple table downloads
5. **Bandwidth throttling**: Respect network bandwidth limits

config-performance-monitoring.yaml

Lines changed: 19 additions & 0 deletions
```yaml
# Enhanced logging configuration for performance monitoring

general:
  # Enable debug logging for performance details
  log_level: "debug"

  # Optional: Set concurrency to 0 for auto-detection
  download_concurrency: 0
  upload_concurrency: 0
  object_disk_server_side_copy_concurrency: 0

# Log format configuration
log:
  # Use JSON format for easier parsing
  format: "json"
  # Enable timestamps
  timestamp: true
  # Include caller information
  caller: true
```
