A powerful Python application that extracts text from various file types and converts text to speech using AI services. Available as both a command-line tool and a FastAPI web service. It automatically routes different file types to the most appropriate AI service for optimal text extraction and provides professional-quality text-to-speech generation for podcast creation.
- Multi-format Support: Extract text from PDFs, documents, spreadsheets, videos, and audio files
- AI-Powered: Uses OpenAI GPT for documents and Google Gemini for multimedia
- Smart Routing: Automatically selects the best AI service for each file type
- YouTube Support: Extract transcripts and audio from YouTube videos
- YouTube Transcripts: Fast transcript extraction using existing captions
- Batch Processing: Handle multiple files or URLs at once
- Multiple Voices: 6 different AI voices (alloy, echo, fable, onyx, nova, shimmer)
- Podcast Mode: Multi-voice support perfect for podcast creation with different speakers
- High Quality: Two quality levels (tts-1 and tts-1-hd) with multiple audio formats
- Speed Control: Adjustable speech speed from 0.25x to 4.0x
- Smart Chunking: Automatically handles long text with natural break points
- Speaker Management: Organize audio by speaker names for easy podcast assembly
- 47+ Languages: Support for major world languages including English, Spanish, French, German, Chinese, Japanese, and more
- Dual AI Support: Choose between OpenAI GPT and Google Gemini for translation
- Auto-Detection: Automatically detect source language when unknown
- Batch Translation: Translate multiple texts efficiently in a single request
- Language Detection: Identify the language of any text with confidence scores
- High Quality: Professional-grade translations preserving meaning and tone
- Dual Interface: Command-line tool and REST API server
- Parallel Processing: Process multiple files simultaneously for faster results
- Comprehensive Output: Detailed results with processing time, file info, and extraction statistics
- RESTful API: Complete FastAPI server with auto-generated documentation
- PDF files (.pdf)
- Text files (.txt)
- Word documents (.doc, .docx)
- Excel spreadsheets (.xls, .xlsx)
- Video files (.mp4, .avi, .mov, .mkv)
- Audio files (.mp3, .wav, .m4a, .webm, .ogg)
- YouTube videos and audio content
- Supports various YouTube URL formats
- Two methods available:
- Fast Transcript Extraction: Uses existing captions/subtitles (3-5 seconds)
- Audio Processing: Downloads and transcribes audio using AI (30-60 seconds)
- alloy: A balanced voice, suitable for most content
- echo: A warm, friendly voice
- fable: A storytelling voice with character
- onyx: A deep, authoritative voice
- nova: A bright, energetic voice
- shimmer: A soft, gentle voice
- MP3: Good balance of quality and size (recommended)
- FLAC: Highest quality, larger files
- Opus: Good for web streaming
- AAC: Good for mobile apps
- tts-1: Fast processing, good quality
- tts-1-hd: High-definition quality, slightly slower
- Multi-speaker support: Use different voices for different speakers
- Speaker names: Organize audio files by speaker
- Natural conversations: Create realistic dialogues
- Professional quality: Broadcast-ready audio output
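As a rough illustration of how these settings combine, here is a minimal sketch of a single speech request using the official openai Python client (the tts-1/tts-1-hd model names above suggest the service wraps this API; the filename and input text are placeholders):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# One request combining the options above: model (tts-1 / tts-1-hd),
# voice (alloy/echo/fable/onyx/nova/shimmer), audio format, and speed.
response = client.audio.speech.create(
    model="tts-1-hd",
    voice="nova",
    input="Welcome to our podcast!",
    response_format="mp3",  # or flac, opus, aac
    speed=1.0,              # 0.25 to 4.0
)

with open("host_segment.mp3", "wb") as f:
    f.write(response.content)
```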
For PDF files larger than 30MB, the system automatically switches to resume-only mode for efficient processing:
- Smart Sampling: Extracts text from up to 40 strategically selected pages (beginning, middle, end); see the sketch after this list
- AI-Generated Analysis: Creates a comprehensive 1200-2000 word analysis using OpenAI
- Fast Processing: Significantly faster than full document analysis
- Clear Indication: Output clearly shows this is an analysis from a large file
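The README doesn't show the project's actual sampling code, but the strategy above can be sketched roughly. Below is a hypothetical helper using pypdf; the function name and the exact split between beginning, middle, and end are assumptions:

```python
from pypdf import PdfReader

def sample_pages(pdf_path: str, max_pages: int = 40) -> list[int]:
    """Pick up to max_pages page indices weighted toward the
    beginning, middle, and end of the document (illustrative only)."""
    total = len(PdfReader(pdf_path).pages)
    if total <= max_pages:
        return list(range(total))
    third = max_pages // 3
    head = range(third)                                   # beginning
    mid_start = (total - third) // 2
    middle = range(mid_start, mid_start + third)          # middle
    tail = range(total - (max_pages - 2 * third), total)  # end
    return sorted(set([*head, *middle, *tail]))
```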
# Normal usage (automatic resume for >30MB files)
python main.py large_document.pdf
# Force full processing (overrides resume-only mode)
PROCESS_FULL_DOCUMENT=true python main.py large_document.pdf
# Disable resume-only mode entirely
PDF_RESUME_ONLY_LARGE=false python main.py large_document.pdf
=== DOCUMENT RESUME ===
Document: large_manual.pdf
Total pages: 250
File size: 45.2MB
Processing mode: Resume-only (large file)
Pages sampled: 40
[AI-generated comprehensive analysis follows...]
📝 ANALYSIS NOTE: This comprehensive analysis was generated from 40 strategically
selected pages due to the large file size (>30MB). The sampling provides broad
coverage but may not include every detail. For complete analysis of all 250 pages,
use full document processing mode.
- Setup and Start API Server:

  # Install dependencies
  pip install -r requirements.txt

  # Setup API keys (copy env_example.txt to .env and add your keys)
  cp env_example.txt .env

  # Start server
  python api_server.py
- Test the API:

  # Health check
  curl http://localhost:8000/health

  # Extract YouTube transcript (fast)
  curl -X POST http://localhost:8000/youtube-transcript \
    -H "Content-Type: application/json" \
    -d '{"url": "https://www.youtube.com/watch?v=dQw4w9WgXcQ"}'

  # Upload and extract from file
  curl -X POST -F "file=@document.pdf" http://localhost:8000/extract

  # Convert text to speech (single voice)
  curl -X POST http://localhost:8000/text-to-speech \
    -H "Content-Type: application/json" \
    -d '{"text": "Hello! This is a test of our text-to-speech API.", "voice": "alloy"}'

  # Create podcast with multiple voices
  curl -X POST http://localhost:8000/text-to-speech-podcast \
    -H "Content-Type: application/json" \
    -d '{
      "segments": [
        {"text": "Welcome to our podcast!", "voice": "nova", "speaker_name": "Host"},
        {"text": "Thanks for having me!", "voice": "onyx", "speaker_name": "Guest"}
      ]
    }'
- View Interactive Documentation:
  - Swagger UI: http://localhost:8000/docs
  - ReDoc: http://localhost:8000/redoc
- Setup:

  # Install dependencies
  pip install -r requirements.txt

  # Setup environment variables
  cp env_example.txt .env
  # Edit .env file with your API keys
- Basic Usage:

  # Extract from single file
  python main.py document.pdf

  # Extract from multiple files with parallel processing
  python main.py file1.pdf file2.docx video.mp4 --parallel

  # Extract from YouTube video
  python main.py "https://www.youtube.com/watch?v=dQw4w9WgXcQ"

  # Process directory
  python main.py documents/ --parallel --max-workers 8
The new /youtube-transcript endpoint provides ultra-fast transcript extraction:
- Speed: 3-5 seconds vs 30-60 seconds for audio processing
- Method: Uses YouTube's existing captions/subtitles
- Quality: Original caption quality (manual or auto-generated)
- No Downloads: No audio file downloading required
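Under the hood, this endpoint most likely builds on the pinned youtube-transcript-api package (see the dependency list below). For reference, a minimal standalone sketch of that library's 1.x interface:

```python
from youtube_transcript_api import YouTubeTranscriptApi

# Fetch existing captions for a video ID -- no audio download involved.
api = YouTubeTranscriptApi()
transcript = api.fetch("dQw4w9WgXcQ", languages=["en"])

for snippet in transcript:
    print(f"[{snippet.start:6.1f}s] {snippet.text}")
```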
Basic transcript extraction:
curl -X POST http://localhost:8000/youtube-transcript \
-H "Content-Type: application/json" \
-d '{"url": "https://www.youtube.com/watch?v=VIDEO_ID"}'
With language preferences:
curl -X POST http://localhost:8000/youtube-transcript \
-H "Content-Type: application/json" \
-d '{"url": "https://www.youtube.com/watch?v=VIDEO_ID", "language": "en", "manual_only": true}'
{
  "success": true,
  "extracted_text": "=== YOUTUBE TRANSCRIPT ===\nTitle: Video Title\nChannel: Channel Name\n[00:18] First subtitle text\n[00:22] Second subtitle text\n...",
  "text_length": 2896,
  "processing_time": 4.01,
  "file_info": {
    "name": "Video Title",
    "duration": 213,
    "video_id": "VIDEO_ID",
    "transcript_language": "en"
  }
}
pip install -r requirements.txt
Key packages:
- openai>=1.3.0 - OpenAI API client
- google-generativeai>=0.3.0 - Google Gemini API
- fastapi>=0.104.0 - Web API framework
- youtube-transcript-api==1.2.1 - YouTube transcript extraction (NEW)
- yt-dlp>=2024.1.1 - YouTube video processing
For YouTube video processing:
# macOS
brew install ffmpeg
# Ubuntu/Debian
sudo apt update && sudo apt install ffmpeg
# Windows (using chocolatey)
choco install ffmpeg
Create .env file:
OPENAI_API_KEY=your_openai_api_key_here
GOOGLE_API_KEY=your_gemini_api_key_here
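Assuming the project reads these with python-dotenv (a common pattern for .env files; not confirmed by this README), the keys become available to Python like so:

```python
import os
from dotenv import load_dotenv

load_dotenv()  # loads .env from the working directory
openai_key = os.getenv("OPENAI_API_KEY")
google_key = os.getenv("GOOGLE_API_KEY")
```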
If you encounter ModuleNotFoundError for youtube_transcript_api:
- Check Python version:

  python3 --version
  which python3
- Use explicit Python path if needed:

  # Find Python installation
  ls /opt/homebrew/opt/python@*/bin/python*

  # Use explicit path
  /opt/homebrew/opt/python@3.13/bin/python3.13 api_server.py
- PyDub not available - No module named 'pyaudioop': Audio processing works via other methods
- FastAPI deprecation warnings: Functionality works normally
- SSL/OpenSSL warnings: System-level warnings that don't affect functionality
- No transcript available: Not all videos have captions
  - Try the audio-based /extract-youtube endpoint as fallback
- Rate limiting: YouTube may block requests
  - Use cookie authentication (see setup instructions)
| Method | Speed | Requirements | Best For |
|---|---|---|---|
| /youtube-transcript | 3-5s | Existing captions | Quick extraction |
| /extract-youtube | 30-60s | Audio download + AI | When no captions exist |
| Small PDF processing | 2-10s | File upload | PDFs under 30MB |
| Large PDF resume-only | 25-45s | File upload | PDFs over 30MB (40-page analysis) |
| Large PDF full processing | 60-300s | File upload + Full mode | Complete analysis of large PDFs |
| Video/Audio AI | 30-120s | File upload + AI | Media files |
# Single file processing
python main.py document.pdf
python main.py presentation.pptx
python main.py audio.mp3 audio.webm audio.ogg
# YouTube content
python main.py "https://www.youtube.com/watch?v=dQw4w9WgXcQ"
python main.py "https://youtu.be/9bZkp7q19f0"
# Batch processing with parallel execution
python main.py file1.pdf file2.docx video.mp4 --parallel --max-workers 4
# Directory processing
python main.py documents/ --parallel
# Custom output file
python main.py document.pdf --output results.txt
# Verbose output
python main.py document.pdf --verbose
import requests
# Extract YouTube transcript (fast method)
response = requests.post(
    "http://localhost:8000/youtube-transcript",
    json={"url": "https://www.youtube.com/watch?v=dQw4w9WgXcQ"}
)
result = response.json()
print(f"Extracted {result['text_length']} characters in {result['processing_time']:.1f}s")
# Upload file for processing
with open("document.pdf", "rb") as f:
    files = {"file": ("document.pdf", f, "application/pdf")}
    response = requests.post("http://localhost:8000/extract", files=files)

result = response.json()
print(result['extracted_text'][:200])
# Health check
curl http://localhost:8000/health
# Extract from URL
curl -X POST http://localhost:8000/extract-url \
-H "Content-Type: application/json" \
-d '{"url": "https://example.com/document.pdf"}'
# Batch processing
curl -X POST http://localhost:8000/extract-batch-url \
-H "Content-Type: application/json" \
-d '{"urls": ["https://example.com/doc1.pdf", "https://example.com/doc2.txt"]}'
# Text-to-Speech (single voice)
curl -X POST http://localhost:8000/text-to-speech \
-H "Content-Type: application/json" \
-d '{"text": "Welcome to our AI-powered content processing service!", "voice": "nova"}'
# Text-to-Speech (podcast mode with multiple voices)
curl -X POST http://localhost:8000/text-to-speech-podcast \
-H "Content-Type: application/json" \
  -d '{
    "segments": [
      {"text": "Good morning, listeners!", "voice": "nova", "speaker_name": "Host"},
      {"text": "Thanks for having me on the show.", "voice": "onyx", "speaker_name": "Guest"}
    ]
  }'
# Translation Service (single text)
curl -X POST http://localhost:8000/translate \
-H "Content-Type: application/json" \
-d '{"text": "Hello, how are you?", "source_language": "en", "target_language": "es"}'
# Translation Service (auto-detect language)
curl -X POST http://localhost:8000/translate \
-H "Content-Type: application/json" \
-d '{"text": "¿Cómo está el clima?", "source_language": "auto-detect", "target_language": "en"}'
# Language Detection
curl -X POST http://localhost:8000/detect-language \
-H "Content-Type: application/json" \
-d '{"text": "Bonjour, comment allez-vous?"}'
# Batch Translation
curl -X POST http://localhost:8000/translate-batch \
-H "Content-Type: application/json" \
-d '{"texts": ["Hello", "Goodbye", "Thank you"], "source_language": "en", "target_language": "fr"}'
# Get available TTS voices
curl http://localhost:8000/tts-voices
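The same endpoints are easy to script from Python. For example, the translation calls above rewritten with requests (URL and payload shapes taken directly from the curl examples):

```python
import requests

# Single translation (mirrors the curl example above)
resp = requests.post(
    "http://localhost:8000/translate",
    json={"text": "Hello, how are you?",
          "source_language": "en", "target_language": "es"},
)
print(resp.json())

# Batch translation
resp = requests.post(
    "http://localhost:8000/translate-batch",
    json={"texts": ["Hello", "Goodbye", "Thank you"],
          "source_language": "en", "target_language": "fr"},
)
print(resp.json())
```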
- Modular Design: Separate processors for different file types
- Smart Routing: Automatic processor selection based on file type
- Parallel Processing: Concurrent file processing for better performance
- Error Handling: Graceful handling of individual file failures
- Cleanup: Automatic temporary file cleanup
- API Integration: RESTful API with comprehensive documentation
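Conceptually, the smart routing might look like the sketch below (hypothetical; the real selection logic lives in src/text_extractor.py and the processor modules). The extension-to-service mapping follows the supported formats listed above:

```python
from pathlib import Path

# Documents go to OpenAI, multimedia to Gemini (per the feature list).
OPENAI_EXTENSIONS = {".pdf", ".txt", ".doc", ".docx", ".xls", ".xlsx"}
GEMINI_EXTENSIONS = {".mp4", ".avi", ".mov", ".mkv",
                     ".mp3", ".wav", ".m4a", ".webm", ".ogg"}

def select_processor(file_path: str) -> str:
    """Pick a processor name based on file extension."""
    ext = Path(file_path).suffix.lower()
    if ext in OPENAI_EXTENSIONS:
        return "openai_processor"
    if ext in GEMINI_EXTENSIONS:
        return "gemini_processor"
    raise ValueError(f"Unsupported file type: {ext}")
```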
# Test the API
python api_client_examples.py
# Test YouTube functionality
python youtube_client_examples.py
# Test URL processing
python url_client_examples.py
# Test translation service
python translation_examples.py
# Test translation processor
python test_translation_service.py
ai-content-process/
├── main.py # CLI interface
├── api_server.py # FastAPI server
├── requirements.txt # Dependencies
├── src/
│ ├── config.py # Configuration
│ ├── text_extractor.py # Main orchestrator
│ └── file_processors/ # Processor modules
│ ├── openai_processor.py
│ ├── gemini_processor.py
│ ├── youtube_processor.py
│ ├── youtube_transcript_processor.py
│ ├── tts_processor.py # NEW - Text-to-Speech
│ └── translation_processor.py # NEW - Translation Service
└── examples/ # Usage examples
For detailed TTS usage examples and advanced features, see:
- examples/TTS_CURL_EXAMPLES.md - Comprehensive curl examples for all TTS endpoints
- Voice samples and use cases - Perfect for podcast creation, audiobooks, and voice-overs
- API endpoint reference - Complete parameter documentation and response formats
For comprehensive translation service documentation and examples, see:
- examples/TRANSLATION_GUIDE.md - Complete translation service guide with examples
- examples/TRANSLATION_CURL_EXAMPLES.md - Quick CURL examples for all translation endpoints
- translation_examples.py - Python client examples and usage patterns
- 47+ supported languages - Including auto-detection and batch processing capabilities
- API_README.md - FastAPI server setup and usage
- FASTAPI_README.md - FastAPI quick start guide
- examples/CURL_EXAMPLES.md - Additional API examples
- examples/IMAGE_TRANSCRIPTION_GUIDE.md - Image transcription examples
- examples/YOUTUBE_MP3_SERVICE.md - YouTube MP3 processing guide
- examples/WEBM_OGG_URL_GUIDE.md - WebM/OGG URL processing examples
- examples/DIGITALOCEAN_QUICKSTART.md - DigitalOcean deployment guide
- Interactive Documentation: http://localhost:8000/docs (when server is running)
MIT License - see LICENSE file for details.