A comprehensive research data processing and visualization system for the Computer Science Department at Texas State University. This project processes academic research data from OpenAlex, generates semantic embeddings, performs clustering analysis, and creates interactive visualizations to explore research patterns and collaborations.
The CADS Research Visualization System is an end-to-end pipeline that turns raw research metadata into interactive, explorable visualizations. It combines sentence embeddings, UMAP dimensionality reduction, and HDBSCAN clustering with a Deck.gl web dashboard to surface research patterns, collaborations, and thematic clusters.
- Automated Data Processing: Extract and process research data from the OpenAlex API
- Semantic Analysis: Generate embeddings and perform clustering using UMAP/HDBSCAN
- Interactive Visualization: Web-based dashboard with advanced filtering and search
- Semantic Search: Find similar research works using vector similarity
- Researcher Profiles: Detailed views of faculty research and collaborations
- Real-time Analytics: Live statistics and data quality monitoring
CADS Research Visualization System
┌──────────────────┐    ┌──────────────────┐    ┌──────────────────┐
│ Data Sources     │    │ Core Pipeline    │    │ Visualization    │
├──────────────────┤    ├──────────────────┤    ├──────────────────┤
│ • OpenAlex API   │───▶│ • Data Loader    │───▶│ • Web Dashboard  │
│ • Supabase DB    │    │ • Embeddings     │    │ • Search System  │
│ • CADS Faculty   │    │ • UMAP/HDBSCAN   │    │ • Interactive    │
│ • Research Data  │    │ • Theme Gen      │    │   Visualizations │
└──────────────────┘    └──────────────────┘    └──────────────────┘
CADS-Research-Visualization/
├── cads/                              # Core data processing pipeline
│   ├── README.md                      # Pipeline documentation
│   ├── data_loader.py                 # Data loading and embeddings
│   ├── process_data.py                # Main pipeline orchestration
│   ├── requirements.txt               # Python dependencies
│   ├── .env.example                   # Environment template
│   ├── data/                          # Generated data files
│   ├── models/                        # Trained ML models
│   └── tests/                         # Comprehensive test suite
│
├── visuals/                           # Interactive visualization dashboard
│   ├── public/                        # Web interface files
│   │   ├── index.html                 # Main dashboard interface
│   │   ├── app.js                     # Visualization logic
│   │   └── data/                      # Visualization data files
│   ├── data/                          # Raw visualization data
│   ├── models/                        # Visualization ML models
│   └── tests/                         # Visualization tests
│
├── database/                          # Database schema and migrations
│   ├── README.md                      # Database documentation
│   ├── schema/                        # Table definitions
│   │   ├── create_cads_tables.sql     # Complete CADS schema
│   │   └── create_cads_tables_simple.sql
│   └── migrations/                    # Database migrations
│
├── scripts/                           # Organized utility scripts
│   ├── README.md                      # Scripts documentation
│   ├── migration/                     # Database setup scripts
│   │   ├── execute_cads_migration.py  # Main migration script
│   │   └── legacy/                    # Archived migration attempts
│   ├── processing/                    # Data processing scripts
│   │   ├── process_cads_with_openalex_ids.py    # Data collection
│   │   └── migrate_cads_data_to_cads_tables.py  # Data migration
│   └── utilities/                     # Verification and maintenance
│       ├── check_cads_data_location.py
│       └── [other utility scripts]
│
├── docs/                              # Comprehensive documentation
│   ├── README.md                      # Documentation index
│   ├── setup/                         # Installation and configuration
│   ├── pipeline/                      # Technical documentation
│   └── migration/                     # Historical documentation
│
├── data/                              # Centralized data storage
│   ├── README.md                      # Data documentation
│   ├── raw/                           # Original data files
│   ├── processed/                     # Analyzed data
│   │   ├── cluster_themes.json        # AI-generated cluster themes
│   │   ├── clustering_results.json    # HDBSCAN clustering results
│   │   └── visualization-data.json    # Complete visualization dataset
│   └── search/                        # Search indexes
│       └── search-index.json          # Pre-built search index
│
├── README.md                          # This main documentation
├── CADS_REPOSITORY_ANALYSIS.md        # Repository organization analysis
├── .env                               # Environment variables
└── .gitignore                         # Git ignore rules
- Python 3.8+ with pip
- PostgreSQL with vector extension
- Supabase account for database hosting
- OpenAlex API access (free with email registration)
- Modern web browser for visualization
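As a rough illustration of the OpenAlex access listed above, the sketch below builds a request URL for one author's works. The `mailto` parameter is how OpenAlex identifies callers for its polite pool. The function name, the example author ID, and the email address are placeholders, and the project's own collection script may query the API differently.

```python
from urllib.parse import urlencode

OPENALEX_WORKS_URL = "https://api.openalex.org/works"


def build_works_url(author_id: str, email: str, per_page: int = 200, cursor: str = "*") -> str:
    """Build an OpenAlex works query for one author (hypothetical helper)."""
    if not 1 <= per_page <= 200:
        raise ValueError("OpenAlex accepts per-page values from 1 to 200")
    params = {
        "filter": f"authorships.author.id:{author_id}",  # works by this author
        "per-page": per_page,                            # page size
        "cursor": cursor,                                # "*" starts cursor paging
        "mailto": email,                                 # contact email for the polite pool
    }
    return f"{OPENALEX_WORKS_URL}?{urlencode(params)}"


if __name__ == "__main__":
    # Placeholder author ID and email; substitute real values before use.
    print(build_works_url("A0000000000", "you@example.edu"))
```

The helper only constructs the URL, so it can be checked without network access; fetching and paging through results is left to whatever HTTP client the pipeline uses.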
# Clone the repository
git clone [repository-url]
cd cads-research-visualization
# Verify repository structure
ls -la

# Create CADS database tables
python3 scripts/migration/execute_cads_migration.py
# Verify table creation
python3 scripts/utilities/check_cads_data_location.py

# Copy environment template
cp cads/.env.example cads/.env
# Edit with your credentials
nano cads/.env

Required environment variables:
# Database Connection
DATABASE_URL=postgresql://user:pass@host:port/db
# API Configuration
[email protected]
GROQ_API_KEY=your_groq_api_key # Optional for theme generation
# ML Configuration (optional)
EMBEDDING_MODEL=all-MiniLM-L6-v2
UMAP_N_NEIGHBORS=15
HDBSCAN_MIN_CLUSTER_SIZE=5

# Install Python dependencies (run from the repository root)
pip install -r cads/requirements.txt

# Process CADS research data
python3 scripts/processing/process_cads_with_openalex_ids.py
# Migrate data to CADS tables
python3 scripts/processing/migrate_cads_data_to_cads_tables.py
# Run the complete pipeline
python3 cads/process_data.py

# Start local web server
cd visuals/public
python3 -m http.server 8000
# Open in browser
open http://localhost:8000

Purpose (cads/): Core data processing and machine learning pipeline
Key Features:
- Data Loading: Connects to Supabase and loads research data
- Embedding Generation: Creates 384-dimensional semantic vectors
- UMAP Reduction: Projects embeddings to 2D coordinates
- HDBSCAN Clustering: Groups similar research works
- Theme Generation: AI-powered cluster descriptions
- Data Validation: Comprehensive quality checks
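The stages listed above could be chained roughly as follows. This is a sketch, not the code in process_data.py: it assumes the sentence-transformers, umap-learn, and hdbscan packages are installed, and the function name and default parameters (borrowed from the optional ML settings in the environment template) are illustrative.

```python
from typing import List, Tuple


def embed_reduce_cluster(
    texts: List[str],
    model_name: str = "all-MiniLM-L6-v2",
    n_neighbors: int = 15,
    min_cluster_size: int = 5,
) -> Tuple[list, list]:
    """Return (2D coordinates, cluster labels) for a list of texts."""
    # Heavy dependencies are imported lazily so this module loads without them.
    from sentence_transformers import SentenceTransformer
    import umap
    import hdbscan

    # 1. Embed each text; all-MiniLM-L6-v2 produces 384-dimensional vectors.
    embeddings = SentenceTransformer(model_name).encode(texts)

    # 2. Project the embeddings to 2D coordinates for plotting.
    reducer = umap.UMAP(n_neighbors=n_neighbors, n_components=2, random_state=42)
    coords = reducer.fit_transform(embeddings)

    # 3. Cluster the 2D points; HDBSCAN labels noise points as -1.
    labels = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size).fit_predict(coords)
    return coords.tolist(), labels.tolist()
```

Clustering on the 2D projection rather than on the raw embeddings follows the timing notes in this README; the fixed `random_state` is an added assumption to make the layout reproducible between runs.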
Main Files:
- data_loader.py - Database connection and data processing
- process_data.py - Pipeline orchestration and clustering
- tests/ - Complete test suite with 10+ test files
Purpose (visuals/): Interactive web-based research exploration interface
Key Features:
- Interactive Map: Zoomable, pannable research visualization
- Advanced Filtering: Multi-criteria filtering system
- Semantic Search: Find similar research works
- Researcher Profiles: Faculty research overviews
- Real-time Statistics: Live data updates
- Responsive Design: Desktop and mobile optimized
Technologies:
- Deck.gl: WebGL-powered visualization framework
- Vanilla JavaScript: No framework dependencies
- CSS3: Modern styling with custom properties
Purpose (database/): PostgreSQL schema with vector extensions
Key Tables:
- cads_researchers - Faculty information and profiles
- cads_works - Research papers with embeddings
- cads_topics - Research topic classifications
Features:
- Vector Storage: pgvector extension for embeddings
- Full-text Search: Optimized text search indexes
- Relationship Management: Foreign key constraints
- Performance Optimization: Strategic indexing
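For intuition about what a vector similarity lookup computes, here is a pure-Python sketch of cosine-similarity ranking over in-memory embeddings. In the database this ranking would normally be delegated to pgvector (which provides a cosine-distance operator, `<=>`) rather than done in application code. The function names and the toy 3-dimensional vectors are illustrative only.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def most_similar(
    query: Sequence[float],
    embeddings: Dict[str, Sequence[float]],
    k: int = 5,
) -> List[Tuple[str, float]]:
    """Return the k (work_id, similarity) pairs closest to the query."""
    scored = [(work_id, cosine_similarity(query, vec)) for work_id, vec in embeddings.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    # Toy embeddings keyed by hypothetical work IDs.
    toy = {"w1": [1.0, 0.0, 0.0], "w2": [0.9, 0.1, 0.0], "w3": [0.0, 1.0, 0.0]}
    print(most_similar([1.0, 0.0, 0.0], toy, k=2))
```

A linear scan like this is fine for a few thousand works; a database-side index matters once the collection grows.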
Purpose (scripts/): Organized utility scripts for system management
Categories:
- Migration: Database setup and schema creation
- Processing: Data collection and transformation
- Utilities: Verification and maintenance tools
Key Scripts:
- migration/execute_cads_migration.py - Database setup
- processing/process_cads_with_openalex_ids.py - Data collection
- utilities/check_cads_data_location.py - Data verification
- ~32 CADS Researchers: Faculty from CS Department
- ~2,454 Research Works: Academic papers with full metadata
- ~6,834 Research Topics: Hierarchical topic classifications
- 384-dimensional embeddings: Semantic representations for all works
- Data Loading: ~30 seconds for complete dataset
- Embedding Generation: ~2 minutes for missing embeddings
- UMAP Reduction: ~45 seconds for 2,454 works
- HDBSCAN Clustering: ~15 seconds for 2D coordinates
- Complete Pipeline: ~5-10 minutes total processing time
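Per-stage timings like those above can be collected with a small timing helper. The sketch below uses only the standard library; the `StageTimer` class and the stage names are hypothetical and not taken from the project's code.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator


class StageTimer:
    """Accumulates wall-clock durations per named pipeline stage."""

    def __init__(self) -> None:
        self.durations: Dict[str, float] = {}

    @contextmanager
    def stage(self, name: str) -> Iterator[None]:
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the time even if the stage raised an exception.
            elapsed = time.perf_counter() - start
            self.durations[name] = self.durations.get(name, 0.0) + elapsed

    def report(self) -> str:
        lines = [f"{name}: {secs:.2f}s" for name, secs in self.durations.items()]
        lines.append(f"total: {sum(self.durations.values()):.2f}s")
        return "\n".join(lines)


if __name__ == "__main__":
    timer = StageTimer()
    with timer.stage("data_loading"):
        time.sleep(0.01)  # stand-in for real work
    with timer.stage("clustering"):
        time.sleep(0.01)  # stand-in for real work
    print(timer.report())
```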
- 15-25 Research Clusters: Automatically identified themes
- AI-Generated Themes: Descriptive cluster names and summaries
- 2D Coordinates: Optimized for visualization layout
- Quality Metrics: >95% of works successfully clustered
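The cluster count and the share of clustered works can be derived from the label array alone, since HDBSCAN marks unclustered (noise) points with the label -1. A minimal sketch, with a hypothetical function name and toy labels:

```python
from collections import Counter
from typing import Dict, Sequence, Union


def summarize_clusters(labels: Sequence[int]) -> Dict[str, Union[int, float]]:
    """Summarize HDBSCAN-style labels, where -1 means 'not clustered'."""
    if len(labels) == 0:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    noise = counts.pop(-1, 0)  # remove the noise label before counting clusters
    return {
        "n_works": len(labels),
        "n_clusters": len(counts),
        "n_noise": noise,
        "clustered_fraction": (len(labels) - noise) / len(labels),
    }


if __name__ == "__main__":
    # Toy labels: three clusters (0, 1, 2) and one noise point.
    print(summarize_clusters([0, 0, 1, -1, 2, 2, 2, 1]))
```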
# Test repository structure
python3 cads/tests/test_basic_structure.py
# Test database connectivity
python3 cads/tests/test_connection.py
# Test complete pipeline (requires ML dependencies)
python3 cads/tests/test_full_pipeline.py
# Verify data integrity
python3 scripts/utilities/check_cads_data_location.py

- Structure Tests: Repository organization and file presence
- Connection Tests: Database connectivity and basic queries
- Pipeline Tests: End-to-end data processing
- Integration Tests: Component interaction validation
- Performance Tests: Timing and resource usage benchmarks
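A structure test of the kind listed first might simply check that expected paths exist. The sketch below is not the project's test_basic_structure.py; the path list is copied from the repository layout shown earlier, and the function name is hypothetical.

```python
from pathlib import Path
from typing import Iterable, List

# Paths taken from the repository layout in this README.
EXPECTED_PATHS = [
    "cads/data_loader.py",
    "cads/process_data.py",
    "cads/requirements.txt",
    "visuals/public/index.html",
    "visuals/public/app.js",
    "database/schema/create_cads_tables.sql",
    "scripts/migration/execute_cads_migration.py",
]


def missing_paths(repo_root: Path, expected: Iterable[str] = EXPECTED_PATHS) -> List[str]:
    """Return the expected relative paths that do not exist under repo_root."""
    return [rel for rel in expected if not (repo_root / rel).exists()]


if __name__ == "__main__":
    missing = missing_paths(Path.cwd())
    if missing:
        print("Missing paths:")
        for rel in missing:
            print(f"  {rel}")
    else:
        print("All expected paths are present.")
```

Run from the repository root, it reports any missing files without importing the pipeline or its ML dependencies.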
# Test connection
python3 cads/tests/test_connection.py
# Check environment variables
cat cads/.env
# Verify database URL format
echo $DATABASE_URL

# Install all requirements
pip install -r cads/requirements.txt
# Check Python version
python3 --version # Should be 3.8+

# Verify data location
python3 scripts/utilities/check_cads_data_location.py
# Run migration if needed
python3 scripts/processing/migrate_cads_data_to_cads_tables.py

# Check basic structure
python3 cads/tests/test_basic_structure.py
# Run with debug logging
export LOG_LEVEL=DEBUG
python3 cads/process_data.py

- Check Documentation: Review relevant README files
- Run Tests: Use test suite to identify issues
- Check Logs: Enable debug logging for detailed output
- Verify Environment: Ensure all prerequisites are met
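The environment check in the last step can be partly automated. The sketch below validates the Python version and the `DATABASE_URL` format using only the standard library. Which variables the pipeline actually requires is an assumption here (only `DATABASE_URL` is treated as mandatory), and the function name is hypothetical.

```python
import os
import sys
from typing import List, Mapping
from urllib.parse import urlparse


def check_environment(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems: List[str] = []
    if sys.version_info < (3, 8):
        problems.append("Python 3.8+ is required")

    url = env.get("DATABASE_URL", "")
    if not url:
        problems.append("DATABASE_URL is not set")
    else:
        # Expected shape: postgresql://user:pass@host:port/db
        parsed = urlparse(url)
        if parsed.scheme not in ("postgresql", "postgres"):
            problems.append("DATABASE_URL should start with postgresql://")
        if not parsed.hostname:
            problems.append("DATABASE_URL has no host")
        if not parsed.path.strip("/"):
            problems.append("DATABASE_URL has no database name")
    return problems


if __name__ == "__main__":
    issues = check_environment()
    # Only problem descriptions are printed, never the credentials themselves.
    print("\n".join(issues) if issues else "Environment looks OK.")
```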
- Installation Guide - Complete setup instructions from scratch
- User Guide - How to use the visualization system
- Troubleshooting Guide - Common issues and solutions
- CI/CD Pipeline Guide - Automated testing and deployment
- Monitoring Setup - Error tracking and analytics
- Monitoring Interpretation - Understanding metrics and alerts
- CADS Pipeline - Core data processing system
- Database Schema - Table structure and relationships
- Scripts Guide - Utility scripts and workflows
- Data Organization - Data structure and formats
- Testing Guide - Comprehensive test suite documentation
- Test Results - Current test status and coverage
- Repository Analysis - Detailed organization analysis
- Migration Reports - Historical context and migration records
- API Documentation - Function references and examples
We welcome contributions to improve the CADS Research Visualization System!
- Fork the repository
- Create a feature branch: git checkout -b feature/amazing-feature
- Make your changes
- Add tests for new functionality
- Update documentation
- Submit a pull request
- Code Quality: Follow existing patterns and style
- Testing: Add tests for new features
- Documentation: Update relevant README files
- Performance: Consider impact on processing time
- Compatibility: Ensure cross-platform compatibility
This project is part of the Texas State University research infrastructure. See individual component licenses for specific terms.
- CADS Faculty: For providing research data and domain expertise
- Texas State University: For supporting this research visualization initiative
- OpenAlex: For providing open access to scholarly data
- Supabase: For reliable database hosting and vector extensions
- Open Source Community: For the excellent ML and visualization libraries
- Documentation: Check relevant README files first
- Issues: Use GitHub Issues for bug reports and feature requests
- Testing: Run the test suite to diagnose problems
- Community: Join discussions in GitHub Discussions
- Lead Developer: Saksham Adhikari
- Institution: Texas State University
- Email: [contact information]
- Project Repository: [GitHub repository URL]
Complete research data processing and visualization system ready for exploration!
Built with ❤️ for the research community at Texas State University