A powerful Python application that extracts text from various file types and converts text to speech using AI services. Available as both a command-line tool and a FastAPI web service. It automatically routes different file types to the most appropriate AI service for optimal text extraction and provides professional-quality text-to-speech generation for podcast creation.
- Multi-format Support: Extract text from PDFs, documents, spreadsheets, videos, and audio files
- AI-Powered: Uses OpenAI GPT for documents and Google Gemini for multimedia
- Smart Routing: Automatically selects the best AI service for each file type
- YouTube Support: Extract transcripts and audio from YouTube videos
- YouTube Transcripts: Fast transcript extraction using existing captions
- Batch Processing: Handle multiple files or URLs at once
- Multiple Voices: 6 different AI voices (alloy, echo, fable, onyx, nova, shimmer)
- Podcast Mode: Multi-voice support perfect for podcast creation with different speakers
- High Quality: Two quality levels (tts-1 and tts-1-hd) with multiple audio formats
- Speed Control: Adjustable speech speed from 0.25x to 4.0x
- Smart Chunking: Automatically handles long text with natural break points
- Speaker Management: Organize audio by speaker names for easy podcast assembly
- 47+ Languages: Support for major world languages including English, Spanish, French, German, Chinese, Japanese, and more
- Dual AI Support: Choose between OpenAI GPT and Google Gemini for translation
- Auto-Detection: Automatically detect source language when unknown
- Batch Translation: Translate multiple texts efficiently in a single request
- Language Detection: Identify the language of any text with confidence scores
- High Quality: Professional-grade translations preserving meaning and tone
- Dual Interface: Command-line tool and REST API server
- Parallel Processing: Process multiple files simultaneously for faster results
- Comprehensive Output: Detailed results with processing time, file info, and extraction statistics
- RESTful API: Complete FastAPI server with auto-generated documentation
- PDF files (.pdf)
- Text files (.txt)
- Word documents (.doc, .docx)
- Excel spreadsheets (.xls, .xlsx)
- Video files (.mp4, .avi, .mov, .mkv)
- Audio files (.mp3, .wav, .m4a, .webm, .ogg)
- YouTube videos and audio content
- Supports various YouTube URL formats
- Two methods available:
- Fast Transcript Extraction: Uses existing captions/subtitles (3-5 seconds)
- Audio Processing: Downloads and transcribes audio using AI (30-60 seconds)
- alloy: A balanced voice, suitable for most content
- echo: A warm, friendly voice
- fable: A storytelling voice with character
- onyx: A deep, authoritative voice
- nova: A bright, energetic voice
- shimmer: A soft, gentle voice
- MP3: Good balance of quality and size (recommended)
- FLAC: Highest quality, larger files
- Opus: Good for web streaming
- AAC: Good for mobile apps
- tts-1: Fast processing, good quality
- tts-1-hd: High-definition quality, slightly slower
- Multi-speaker support: Use different voices for different speakers
- Speaker names: Organize audio files by speaker
- Natural conversations: Create realistic dialogues
- Professional quality: Broadcast-ready audio output
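As a rough illustration of how these settings combine, here is a minimal sketch of a single speech request using the official openai Python client (the tts-1/tts-1-hd model names above suggest the service wraps this API; the filename and input text are placeholders):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# One request combining the options above: model (tts-1 / tts-1-hd),
# voice (alloy/echo/fable/onyx/nova/shimmer), audio format, and speed.
response = client.audio.speech.create(
    model="tts-1-hd",
    voice="nova",
    input="Welcome to our podcast!",
    response_format="mp3",  # or flac, opus, aac
    speed=1.0,              # 0.25 to 4.0
)

with open("host_segment.mp3", "wb") as f:
    f.write(response.content)
```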
For PDF files larger than 30MB, the system automatically switches to resume-only mode for efficient processing:
- Smart Sampling: Extracts text from up to 40 strategically selected pages (beginning, middle, end); see the sketch after this list
- AI-Generated Analysis: Creates a comprehensive 1200-2000 word analysis using OpenAI
- Fast Processing: Significantly faster than full document analysis
- Clear Indication: Output clearly shows this is an analysis from a large file
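The README doesn't show the project's actual sampling code, but the strategy above can be sketched roughly. Below is a hypothetical helper using pypdf; the function name and the exact split between beginning, middle, and end are assumptions:

```python
from pypdf import PdfReader

def sample_pages(pdf_path: str, max_pages: int = 40) -> list[int]:
    """Pick up to max_pages page indices weighted toward the
    beginning, middle, and end of the document (illustrative only)."""
    total = len(PdfReader(pdf_path).pages)
    if total <= max_pages:
        return list(range(total))
    third = max_pages // 3
    head = range(third)                                   # beginning
    mid_start = (total - third) // 2
    middle = range(mid_start, mid_start + third)          # middle
    tail = range(total - (max_pages - 2 * third), total)  # end
    return sorted(set([*head, *middle, *tail]))
```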
# Normal usage (automatic resume for >30MB files)
python main.py large_document.pdf
# Force full processing (overrides resume-only mode)
PROCESS_FULL_DOCUMENT=true python main.py large_document.pdf
# Disable resume-only mode entirely
PDF_RESUME_ONLY_LARGE=false python main.py large_document.pdf
=== DOCUMENT RESUME ===
Document: large_manual.pdf
Total pages: 250
File size: 45.2MB
Processing mode: Resume-only (large file)
Pages sampled: 40
[AI-generated comprehensive analysis follows...]
📝 ANALYSIS NOTE: This comprehensive analysis was generated from 40 strategically
selected pages due to the large file size (>30MB). The sampling provides broad
coverage but may not include every detail. For complete analysis of all 250 pages,
use full document processing mode.
- Setup and Start API Server:

  # Install dependencies
  pip install -r requirements.txt

  # Setup API keys (copy env_example.txt to .env and add your keys)
  cp env_example.txt .env

  # Start server
  python api_server.py
- Test the API:

  # Health check
  curl http://localhost:8000/health

  # Extract YouTube transcript (fast)
  curl -X POST http://localhost:8000/youtube-transcript \
    -H "Content-Type: application/json" \
    -d '{"url": "https://www.youtube.com/watch?v=dQw4w9WgXcQ"}'

  # Upload and extract from file
  curl -X POST -F "file=@document.pdf" http://localhost:8000/extract

  # Convert text to speech (single voice)
  curl -X POST http://localhost:8000/text-to-speech \
    -H "Content-Type: application/json" \
    -d '{"text": "Hello! This is a test of our text-to-speech API.", "voice": "alloy"}'

  # Create podcast with multiple voices
  curl -X POST http://localhost:8000/text-to-speech-podcast \
    -H "Content-Type: application/json" \
    -d '{
      "segments": [
        {"text": "Welcome to our podcast!", "voice": "nova", "speaker_name": "Host"},
        {"text": "Thanks for having me!", "voice": "onyx", "speaker_name": "Guest"}
      ]
    }'
- View Interactive Documentation:
  - Swagger UI: http://localhost:8000/docs
  - ReDoc: http://localhost:8000/redoc
- Setup:

  # Install dependencies
  pip install -r requirements.txt

  # Setup environment variables
  cp env_example.txt .env
  # Edit .env file with your API keys
- Basic Usage:

  # Extract from single file
  python main.py document.pdf

  # Extract from multiple files with parallel processing
  python main.py file1.pdf file2.docx video.mp4 --parallel

  # Extract from YouTube video
  python main.py "https://www.youtube.com/watch?v=dQw4w9WgXcQ"

  # Process directory
  python main.py documents/ --parallel --max-workers 8
The new /youtube-transcript endpoint provides ultra-fast transcript extraction:
- Speed: 3-5 seconds vs 30-60 seconds for audio processing
- Method: Uses YouTube's existing captions/subtitles
- Quality: Original caption quality (manual or auto-generated)
- No Downloads: No audio file downloading required
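Under the hood, this endpoint most likely builds on the pinned youtube-transcript-api package (see the dependency list below). For reference, a minimal standalone sketch of that library's 1.x interface:

```python
from youtube_transcript_api import YouTubeTranscriptApi

# Fetch existing captions for a video ID -- no audio download involved.
api = YouTubeTranscriptApi()
transcript = api.fetch("dQw4w9WgXcQ", languages=["en"])

for snippet in transcript:
    print(f"[{snippet.start:6.1f}s] {snippet.text}")
```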
Basic transcript extraction:
curl -X POST http://localhost:8000/youtube-transcript \
-H "Content-Type: application/json" \
-d '{"url": "https://www.youtube.com/watch?v=VIDEO_ID"}'
With language preferences:
curl -X POST http://localhost:8000/youtube-transcript \
-H "Content-Type: application/json" \
-d '{"url": "https://www.youtube.com/watch?v=VIDEO_ID", "language": "en", "manual_only": true}'
{
  "success": true,
  "extracted_text": "=== YOUTUBE TRANSCRIPT ===\nTitle: Video Title\nChannel: Channel Name\n[00:18] First subtitle text\n[00:22] Second subtitle text\n...",
  "text_length": 2896,
  "processing_time": 4.01,
  "file_info": {
    "name": "Video Title",
    "duration": 213,
    "video_id": "VIDEO_ID",
    "transcript_language": "en"
  }
}
pip install -r requirements.txt
Key packages:
- openai>=1.3.0 - OpenAI API client
- google-generativeai>=0.3.0 - Google Gemini API
- fastapi>=0.104.0 - Web API framework
- youtube-transcript-api==1.2.1 - YouTube transcript extraction (NEW)
- yt-dlp>=2024.1.1 - YouTube video processing
For YouTube video processing:
# macOS
brew install ffmpeg
# Ubuntu/Debian
sudo apt update && sudo apt install ffmpeg
# Windows (using chocolatey)
choco install ffmpeg
Create .env file:
OPENAI_API_KEY=your_openai_api_key_here
GOOGLE_API_KEY=your_gemini_api_key_here
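Assuming the project reads these with python-dotenv (a common pattern for .env files; not confirmed by this README), the keys become available to Python like so:

```python
import os
from dotenv import load_dotenv

load_dotenv()  # loads .env from the working directory
openai_key = os.getenv("OPENAI_API_KEY")
google_key = os.getenv("GOOGLE_API_KEY")
```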
If you encounter ModuleNotFoundError for youtube_transcript_api:
- Check Python version:

  python3 --version
  which python3
- Use explicit Python path if needed:

  # Find Python installation
  ls /opt/homebrew/opt/python@*/bin/python*

  # Use explicit path
  /opt/homebrew/opt/python@3.13/bin/python3.13 api_server.py
- PyDub not available - No module named 'pyaudioop': Audio processing works via other methods
- FastAPI deprecation warnings: Functionality works normally
- SSL/OpenSSL warnings: System-level warnings that don't affect functionality
- No transcript available: Not all videos have captions
  - Try the audio-based /extract-youtube endpoint as fallback
- Rate limiting: YouTube may block requests
  - Use cookie authentication (see setup instructions)
| Method | Speed | Requirements | Best For |
|---|---|---|---|
| /youtube-transcript | 3-5s | Existing captions | Quick extraction |
| /extract-youtube | 30-60s | Audio download + AI | When no captions exist |
| Small PDF processing | 2-10s | File upload | PDFs under 30MB |
| Large PDF resume-only | 25-45s | File upload | PDFs over 30MB (40-page analysis) |
| Large PDF full processing | 60-300s | File upload + Full mode | Complete analysis of large PDFs |
| Video/Audio AI | 30-120s | File upload + AI | Media files |
# Single file processing
python main.py document.pdf
python main.py presentation.pptx
python main.py audio.mp3 audio.webm audio.ogg
# YouTube content
python main.py "https://www.youtube.com/watch?v=dQw4w9WgXcQ"
python main.py "https://youtu.be/9bZkp7q19f0"
# Batch processing with parallel execution
python main.py file1.pdf file2.docx video.mp4 --parallel --max-workers 4
# Directory processing
python main.py documents/ --parallel
# Custom output file
python main.py document.pdf --output results.txt
# Verbose output
python main.py document.pdf --verbose
import requests
# Extract YouTube transcript (fast method)
response = requests.post(
    "http://localhost:8000/youtube-transcript",
    json={"url": "https://www.youtube.com/watch?v=dQw4w9WgXcQ"}
)
result = response.json()
print(f"Extracted {result['text_length']} characters in {result['processing_time']:.1f}s")
# Upload file for processing
with open("document.pdf", "rb") as f:
    files = {"file": ("document.pdf", f, "application/pdf")}
    response = requests.post("http://localhost:8000/extract", files=files)

result = response.json()
print(result['extracted_text'][:200])
# Health check
curl http://localhost:8000/health
# Extract from URL
curl -X POST http://localhost:8000/extract-url \
-H "Content-Type: application/json" \
-d '{"url": "https://example.com/document.pdf"}'
# Batch processing
curl -X POST http://localhost:8000/extract-batch-url \
-H "Content-Type: application/json" \
-d '{"urls": ["https://example.com/doc1.pdf", "https://example.com/doc2.txt"]}'
# Text-to-Speech (single voice)
curl -X POST http://localhost:8000/text-to-speech \
-H "Content-Type: application/json" \
-d '{"text": "Welcome to our AI-powered content processing service!", "voice": "nova"}'
# Text-to-Speech (podcast mode with multiple voices)
curl -X POST http://localhost:8000/text-to-speech-podcast \
-H "Content-Type: application/json" \
  -d '{
    "segments": [
      {"text": "Good morning, listeners!", "voice": "nova", "speaker_name": "Host"},
      {"text": "Thanks for having me on the show.", "voice": "onyx", "speaker_name": "Guest"}
    ]
  }'
# Translation Service (single text)
curl -X POST http://localhost:8000/translate \
-H "Content-Type: application/json" \
-d '{"text": "Hello, how are you?", "source_language": "en", "target_language": "es"}'
# Translation Service (auto-detect language)
curl -X POST http://localhost:8000/translate \
-H "Content-Type: application/json" \
-d '{"text": "¿Cómo está el clima?", "source_language": "auto-detect", "target_language": "en"}'
# Language Detection
curl -X POST http://localhost:8000/detect-language \
-H "Content-Type: application/json" \
-d '{"text": "Bonjour, comment allez-vous?"}'
# Batch Translation
curl -X POST http://localhost:8000/translate-batch \
-H "Content-Type: application/json" \
-d '{"texts": ["Hello", "Goodbye", "Thank you"], "source_language": "en", "target_language": "fr"}'
# Get available TTS voices
curl http://localhost:8000/tts-voices
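The same endpoints are easy to script from Python. For example, the translation calls above rewritten with requests (URL and payload shapes taken directly from the curl examples):

```python
import requests

# Single translation (mirrors the curl example above)
resp = requests.post(
    "http://localhost:8000/translate",
    json={"text": "Hello, how are you?",
          "source_language": "en", "target_language": "es"},
)
print(resp.json())

# Batch translation
resp = requests.post(
    "http://localhost:8000/translate-batch",
    json={"texts": ["Hello", "Goodbye", "Thank you"],
          "source_language": "en", "target_language": "fr"},
)
print(resp.json())
```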
- Modular Design: Separate processors for different file types
- Smart Routing: Automatic processor selection based on file type
- Parallel Processing: Concurrent file processing for better performance
- Error Handling: Graceful handling of individual file failures
- Cleanup: Automatic temporary file cleanup
- API Integration: RESTful API with comprehensive documentation
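Conceptually, the smart routing might look like the sketch below (hypothetical; the real selection logic lives in src/text_extractor.py and the processor modules). The extension-to-service mapping follows the supported formats listed above:

```python
from pathlib import Path

# Documents go to OpenAI, multimedia to Gemini (per the feature list).
OPENAI_EXTENSIONS = {".pdf", ".txt", ".doc", ".docx", ".xls", ".xlsx"}
GEMINI_EXTENSIONS = {".mp4", ".avi", ".mov", ".mkv",
                     ".mp3", ".wav", ".m4a", ".webm", ".ogg"}

def select_processor(file_path: str) -> str:
    """Pick a processor name based on file extension."""
    ext = Path(file_path).suffix.lower()
    if ext in OPENAI_EXTENSIONS:
        return "openai_processor"
    if ext in GEMINI_EXTENSIONS:
        return "gemini_processor"
    raise ValueError(f"Unsupported file type: {ext}")
```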
# Test the API
python api_client_examples.py
# Test YouTube functionality
python youtube_client_examples.py
# Test URL processing
python url_client_examples.py
# Test translation service
python translation_examples.py
# Test translation processor
python test_translation_service.py
ai-content-process/
├── main.py # CLI interface
├── api_server.py # FastAPI server
├── requirements.txt # Dependencies
├── src/
│ ├── config.py # Configuration
│ ├── text_extractor.py # Main orchestrator
│ └── file_processors/ # Processor modules
│ ├── openai_processor.py
│ ├── gemini_processor.py
│ ├── youtube_processor.py
│ ├── youtube_transcript_processor.py
│ ├── tts_processor.py # NEW - Text-to-Speech
│ └── translation_processor.py # NEW - Translation Service
└── examples/ # Usage examples
For detailed TTS usage examples and advanced features, see:
- examples/TTS_CURL_EXAMPLES.md - Comprehensive curl examples for all TTS endpoints
- Voice samples and use cases - Perfect for podcast creation, audiobooks, and voice-overs
- API endpoint reference - Complete parameter documentation and response formats
For comprehensive translation service documentation and examples, see:
- examples/TRANSLATION_GUIDE.md - Complete translation service guide with examples
- examples/TRANSLATION_CURL_EXAMPLES.md - Quick CURL examples for all translation endpoints
- translation_examples.py - Python client examples and usage patterns
- 47+ supported languages - Including auto-detection and batch processing capabilities
- API_README.md - FastAPI server setup and usage
- FASTAPI_README.md - FastAPI quick start guide
- examples/CURL_EXAMPLES.md - Additional API examples
- examples/IMAGE_TRANSCRIPTION_GUIDE.md - Image transcription examples
- examples/YOUTUBE_MP3_SERVICE.md - YouTube MP3 processing guide
- examples/WEBM_OGG_URL_GUIDE.md - WebM/OGG URL processing examples
- examples/DIGITALOCEAN_QUICKSTART.md - DigitalOcean deployment guide
- Interactive Documentation: http://localhost:8000/docs (when server is running)
MIT License - see LICENSE file for details.