Tar-ive/CADS-Visualizer

CADS Research Visualization System

A comprehensive research data processing and visualization system for the Computer Science Department at Texas State University. This project processes academic research data from OpenAlex, generates semantic embeddings, performs clustering analysis, and creates interactive visualizations to explore research patterns and collaborations.

🎯 System Overview

The CADS Research Visualization System is a complete end-to-end solution that transforms raw research data into interactive, explorable visualizations. It combines advanced machine learning techniques with modern web technologies to provide insights into research patterns, collaborations, and thematic clusters.

Key Capabilities

  • 🔄 Automated Data Processing: Extract and process research data from the OpenAlex API
  • 🧠 Semantic Analysis: Generate embeddings and perform clustering using UMAP/HDBSCAN
  • 🎨 Interactive Visualization: Web-based dashboard with advanced filtering and search
  • 🔍 Semantic Search: Find similar research works using vector similarity
  • 👥 Researcher Profiles: Detailed views of faculty research and collaborations
  • 📊 Real-time Analytics: Live statistics and data quality monitoring

πŸ—οΈ System Architecture

CADS Research Visualization System
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   Data Sources  │    │   Core Pipeline  │    │   Visualization │
├─────────────────┤    ├──────────────────┤    ├─────────────────┤
│ • OpenAlex API  │───▶│ • Data Loader    │───▶│ • Web Dashboard │
│ • Supabase DB   │    │ • Embeddings     │    │ • Search System │
│ • CADS Faculty  │    │ • UMAP/HDBSCAN   │    │ • Interactive   │
│ • Research Data │    │ • Theme Gen      │    │   Visualizations│
└─────────────────┘    └──────────────────┘    └─────────────────┘

πŸ“ Repository Organization

CADS-Research-Visualization/
├── 📊 cads/                          # Core data processing pipeline
│   ├── README.md                    # Pipeline documentation
│   ├── data_loader.py               # Data loading and embeddings
│   ├── process_data.py              # Main pipeline orchestration
│   ├── requirements.txt             # Python dependencies
│   ├── .env.example                 # Environment template
│   ├── data/                        # Generated data files
│   ├── models/                      # Trained ML models
│   └── tests/                       # Comprehensive test suite
│
├── 🎨 visuals/                       # Interactive visualization dashboard
│   ├── public/                      # Web interface files
│   │   ├── index.html              # Main dashboard interface
│   │   ├── app.js                  # Visualization logic
│   │   └── data/                   # Visualization data files
│   ├── data/                        # Raw visualization data
│   ├── models/                      # Visualization ML models
│   └── tests/                       # Visualization tests
│
├── 🗄️ database/                      # Database schema and migrations
│   ├── README.md                    # Database documentation
│   ├── schema/                      # Table definitions
│   │   ├── create_cads_tables.sql  # Complete CADS schema
│   │   └── create_cads_tables_simple.sql
│   └── migrations/                  # Database migrations
│
├── 🔧 scripts/                       # Organized utility scripts
│   ├── README.md                    # Scripts documentation
│   ├── migration/                   # Database setup scripts
│   │   ├── execute_cads_migration.py   # ✅ Main migration script
│   │   └── legacy/                     # Archived migration attempts
│   ├── processing/                  # Data processing scripts
│   │   ├── process_cads_with_openalex_ids.py  # ✅ Data collection
│   │   └── migrate_cads_data_to_cads_tables.py # ✅ Data migration
│   └── utilities/                   # Verification and maintenance
│       ├── check_cads_data_location.py
│       └── [other utility scripts]
│
├── 📚 docs/                          # Comprehensive documentation
│   ├── README.md                    # Documentation index
│   ├── setup/                       # Installation and configuration
│   ├── pipeline/                    # Technical documentation
│   └── migration/                   # Historical documentation
│
├── 📦 data/                          # Centralized data storage
│   ├── README.md                    # Data documentation
│   ├── raw/                         # Original data files
│   ├── processed/                   # Analyzed data
│   │   ├── cluster_themes.json     # AI-generated cluster themes
│   │   ├── clustering_results.json # HDBSCAN clustering results
│   │   └── visualization-data.json # Complete visualization dataset
│   └── search/                      # Search indexes
│       └── search-index.json       # Pre-built search index
│
├── README.md                        # This main documentation
├── CADS_REPOSITORY_ANALYSIS.md      # Repository organization analysis
├── .env                            # Environment variables
└── .gitignore                      # Git ignore rules

🚀 Quick Start Guide

Prerequisites

  • Python 3.8+ with pip
  • PostgreSQL with the pgvector extension
  • Supabase account for database hosting
  • OpenAlex API access (free with email registration)
  • Modern web browser for visualization

1. Repository Setup

# Clone the repository
git clone [repository-url]
cd cads-research-visualization  # use the directory name that git clone created

# Verify repository structure
ls -la

2. Database Setup

# Create CADS database tables
python3 scripts/migration/execute_cads_migration.py

# Verify table creation
python3 scripts/utilities/check_cads_data_location.py

3. Environment Configuration

# Copy environment template
cp cads/.env.example cads/.env

# Edit with your credentials
nano cads/.env

Required environment variables:

# Database Connection
DATABASE_URL=postgresql://user:pass@host:port/db

# API Configuration
[email protected]
GROQ_API_KEY=your_groq_api_key  # Optional for theme generation

# ML Configuration (optional)
EMBEDDING_MODEL=all-MiniLM-L6-v2
UMAP_N_NEIGHBORS=15
HDBSCAN_MIN_CLUSTER_SIZE=5
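
The variables above could be read in Python roughly as follows. This is a minimal sketch, not the code in `cads/`: the `PipelineConfig` class and `load_config` function are hypothetical names, and the defaults simply mirror the values shown in the template.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineConfig:
    """Hypothetical container for the settings listed in the .env template."""
    database_url: str
    embedding_model: str
    umap_n_neighbors: int
    hdbscan_min_cluster_size: int


def load_config(env=None):
    """Build a PipelineConfig from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    if not env.get("DATABASE_URL"):
        # The database connection is the only setting without a default.
        raise ValueError("DATABASE_URL is required")
    return PipelineConfig(
        database_url=env["DATABASE_URL"],
        embedding_model=env.get("EMBEDDING_MODEL", "all-MiniLM-L6-v2"),
        umap_n_neighbors=int(env.get("UMAP_N_NEIGHBORS", "15")),
        hdbscan_min_cluster_size=int(env.get("HDBSCAN_MIN_CLUSTER_SIZE", "5")),
    )
```

Passing a plain dictionary instead of `os.environ` makes the function easy to exercise in tests without touching the real environment.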

4. Dependencies Installation

# Install Python dependencies (run from the repository root,
# so the script paths in the next steps still resolve)
pip install -r cads/requirements.txt

5. Data Processing

# Process CADS research data
python3 scripts/processing/process_cads_with_openalex_ids.py

# Migrate data to CADS tables
python3 scripts/processing/migrate_cads_data_to_cads_tables.py

# Run the complete pipeline
python3 cads/process_data.py

6. Launch Visualization

# Start local web server
cd visuals/public
python3 -m http.server 8000

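# Open in browser from a second terminal (the server keeps the first one busy).
# "open" is the macOS command; on Linux use xdg-open or paste the URL into a browser.
open http://localhost:8000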

📊 System Components

🔧 CADS Pipeline (cads/)

Purpose: Core data processing and machine learning pipeline

Key Features:

  • Data Loading: Connects to Supabase and loads research data
  • Embedding Generation: Creates 384-dimensional semantic vectors
  • UMAP Reduction: Projects embeddings to 2D coordinates
  • HDBSCAN Clustering: Groups similar research works
  • Theme Generation: AI-powered cluster descriptions
  • Data Validation: Comprehensive quality checks

Main Files:

  • data_loader.py - Database connection and data processing
  • process_data.py - Pipeline orchestration and clustering
  • tests/ - Complete test suite with 10+ test files
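
The embedding, reduction, and clustering steps listed above can be sketched in a few lines. This is an illustration under assumptions, not the contents of `process_data.py`: the function name `embed_reduce_cluster` is hypothetical, and it assumes the `sentence-transformers`, `umap-learn`, and `hdbscan` packages are installed (whether `cads/requirements.txt` pins exactly these is not confirmed here).

```python
def embed_reduce_cluster(texts, model_name="all-MiniLM-L6-v2",
                         n_neighbors=15, min_cluster_size=5):
    """Return (embeddings, coords_2d, labels) for a list of strings."""
    # Imported lazily so this module still loads when the ML stack is absent.
    from sentence_transformers import SentenceTransformer
    import umap
    import hdbscan

    # all-MiniLM-L6-v2 produces one 384-dimensional vector per input text.
    embeddings = SentenceTransformer(model_name).encode(texts)

    # Project to 2D for plotting; random_state keeps the layout reproducible.
    reducer = umap.UMAP(n_neighbors=n_neighbors, n_components=2, random_state=42)
    coords_2d = reducer.fit_transform(embeddings)

    # Cluster in the 2D space; HDBSCAN assigns the label -1 to noise points.
    clusterer = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size)
    labels = clusterer.fit_predict(coords_2d)
    return embeddings, coords_2d, labels
```

Clustering on the 2D coordinates rather than the full embeddings follows the "HDBSCAN Clustering ... for 2D coordinates" wording later in this README; clustering in a higher-dimensional space is a common alternative.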

🎨 Visualization Dashboard (visuals/)

Purpose: Interactive web-based research exploration interface

Key Features:

  • Interactive Map: Zoomable, pannable research visualization
  • Advanced Filtering: Multi-criteria filtering system
  • Semantic Search: Find similar research works
  • Researcher Profiles: Faculty research overviews
  • Real-time Statistics: Live data updates
  • Responsive Design: Desktop and mobile optimized
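
The idea behind the semantic search feature is to rank works by the similarity of their embedding vectors. The dashboard itself is written in JavaScript; the sketch below shows the same ranking in Python with tiny two-dimensional vectors, and the names `cosine_similarity` and `most_similar` are illustrative only.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


def most_similar(query, works, top_k=3):
    """Rank (title, embedding) pairs by similarity to the query vector."""
    scored = [(cosine_similarity(query, emb), title) for title, emb in works]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Toy data standing in for 384-dimensional embeddings.
works = [("A", [1.0, 0.0]), ("B", [0.0, 1.0]), ("C", [1.0, 1.0])]
results = most_similar([1.0, 0.0], works, top_k=2)
```

Here `results` lists work "A" first (identical direction to the query) and "C" second.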

Technologies:

  • Deck.gl: WebGL-powered visualization framework
  • Vanilla JavaScript: No framework dependencies
  • CSS3: Modern styling with custom properties

🗄️ Database Layer (database/)

Purpose: PostgreSQL schema with vector extensions

Key Tables:

  • cads_researchers - Faculty information and profiles
  • cads_works - Research papers with embeddings
  • cads_topics - Research topic classifications

Features:

  • Vector Storage: pgvector extension for embeddings
  • Full-text Search: Optimized text search indexes
  • Relationship Management: Foreign key constraints
  • Performance Optimization: Strategic indexing
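
With pgvector, a nearest-neighbour lookup can be expressed directly in SQL using the `<=>` cosine-distance operator. The sketch below only builds the query and its parameters; the column names `id`, `title`, and `embedding` are assumptions about the `cads_works` table, and the `%s` placeholders follow the psycopg convention.

```python
def to_pgvector_literal(vector):
    """Format a sequence of numbers as a pgvector text literal, e.g. [0.5,1.0]."""
    return "[" + ",".join(repr(float(x)) for x in vector) + "]"


def nearest_works_query(vector, limit=10):
    """Return (sql, params) for the works closest to `vector` by cosine distance."""
    sql = (
        "SELECT id, title, embedding <=> %s::vector AS distance "  # assumed columns
        "FROM cads_works "
        "ORDER BY distance "
        "LIMIT %s"
    )
    return sql, (to_pgvector_literal(vector), limit)
```

The returned pair can be passed to a database cursor as `cursor.execute(sql, params)`; keeping the vector as a bound parameter avoids building SQL by string concatenation.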

🔧 Scripts Collection (scripts/)

Purpose: Organized utility scripts for system management

Categories:

  • Migration: Database setup and schema creation
  • Processing: Data collection and transformation
  • Utilities: Verification and maintenance tools

Key Scripts:

  • migration/execute_cads_migration.py - Database setup
  • processing/process_cads_with_openalex_ids.py - Data collection
  • utilities/check_cads_data_location.py - Data verification

📈 Expected Results

Data Volume

  • ~32 CADS Researchers: Faculty from CS Department
  • ~2,454 Research Works: Academic papers with full metadata
  • ~6,834 Research Topics: Hierarchical topic classifications
  • 384-dimensional embeddings: Semantic representations for all works

Processing Performance

  • Data Loading: ~30 seconds for complete dataset
  • Embedding Generation: ~2 minutes for missing embeddings
  • UMAP Reduction: ~45 seconds for 2,454 works
  • HDBSCAN Clustering: ~15 seconds for 2D coordinates
  • Complete Pipeline: ~5-10 minutes total processing time
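
To compare a local run against the timings above, each stage can be wrapped in a small timer. This is a generic sketch using only the standard library; the `timed` helper and the stage name are hypothetical and are not part of the pipeline code.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(stage, timings):
    """Record the wall-clock duration of a block under timings[stage]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = time.perf_counter() - start


timings = {}
with timed("data_loading", timings):
    sum(range(100_000))  # stand-in for a real pipeline stage

for stage, seconds in timings.items():
    print(f"{stage}: {seconds:.3f} s")
```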

Clustering Results

  • 15-25 Research Clusters: Automatically identified themes
  • AI-Generated Themes: Descriptive cluster names and summaries
  • 2D Coordinates: Optimized for visualization layout
  • Quality Metrics: >95% of works successfully clustered
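
The ">95% clustered" figure can be checked from the cluster labels alone, because HDBSCAN marks unclustered (noise) points with the label -1. A minimal sketch, where `clustering_summary` is a hypothetical helper and the labels are toy data:

```python
from collections import Counter


def clustering_summary(labels):
    """Summarize HDBSCAN-style labels, where -1 marks noise points."""
    counts = Counter(labels)
    noise = counts.pop(-1, 0)  # remove noise so only real clusters remain
    total = len(labels)
    return {
        "n_clusters": len(counts),
        "clustered_fraction": (total - noise) / total if total else 0.0,
        "largest_cluster": max(counts.values(), default=0),
    }


summary = clustering_summary([0, 0, 1, -1, 1, 1, 2, -1])
```

For these eight toy labels, six points belong to three clusters, so the clustered fraction is 0.75.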

🧪 Testing and Validation

Comprehensive Test Suite

# Test repository structure
python3 cads/tests/test_basic_structure.py

# Test database connectivity
python3 cads/tests/test_connection.py

# Test complete pipeline (requires ML dependencies)
python3 cads/tests/test_full_pipeline.py

# Verify data integrity
python3 scripts/utilities/check_cads_data_location.py

Test Categories

  • Structure Tests: Repository organization and file presence
  • Connection Tests: Database connectivity and basic queries
  • Pipeline Tests: End-to-end data processing
  • Integration Tests: Component interaction validation
  • Performance Tests: Timing and resource usage benchmarks
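
A structure test of the kind listed first can be as simple as checking that expected files exist. The sketch below is not the content of `test_basic_structure.py`; the `missing_paths` helper is hypothetical, and the path list is taken from the repository tree shown earlier in this README.

```python
from pathlib import Path

# Paths taken from the repository layout documented in this README.
EXPECTED_PATHS = [
    "cads/data_loader.py",
    "cads/process_data.py",
    "cads/requirements.txt",
    "visuals/public/index.html",
    "visuals/public/app.js",
    "database/schema/create_cads_tables.sql",
    "scripts/migration/execute_cads_migration.py",
]


def missing_paths(root, expected=EXPECTED_PATHS):
    """Return the expected relative paths that do not exist under root."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).exists()]


if __name__ == "__main__":
    missing = missing_paths(".")
    print("All expected files present" if not missing else f"Missing: {missing}")
```

Run it from the repository root; an empty result means every listed file was found.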

🚨 Troubleshooting

Common Issues and Solutions

1. Database Connection Errors

# Test connection
python3 cads/tests/test_connection.py

# Check environment variables
cat cads/.env

# Verify database URL format
echo $DATABASE_URL
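
The URL format can also be checked programmatically before attempting a connection. The sketch below uses only the standard library; `check_database_url` is a hypothetical helper, and the accepted schemes are an assumption based on the `postgresql://` template above.

```python
from urllib.parse import urlsplit


def check_database_url(url):
    """Return a list of problems found in a PostgreSQL connection URL."""
    try:
        parts = urlsplit(url)
        parts.port  # accessing .port raises ValueError for a non-numeric port
    except ValueError as exc:
        return [f"URL could not be parsed: {exc}"]

    problems = []
    if parts.scheme not in ("postgresql", "postgres"):
        problems.append(f"unexpected scheme: {parts.scheme!r}")
    if not parts.hostname:
        problems.append("missing host")
    if not parts.path.strip("/"):
        problems.append("missing database name")
    return problems
```

An empty list means the URL has the expected shape. Note that the unedited template value (`...@host:port/db`) is reported as unparseable because `port` is not a number.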

2. Missing Dependencies

# Install all requirements
pip install -r cads/requirements.txt

# Check Python version
python3 --version  # Should be 3.8+

3. Data Location Issues

# Verify data location
python3 scripts/utilities/check_cads_data_location.py

# Run migration if needed
python3 scripts/processing/migrate_cads_data_to_cads_tables.py

4. Pipeline Failures

# Check basic structure
python3 cads/tests/test_basic_structure.py

# Run with debug logging
export LOG_LEVEL=DEBUG
python3 cads/process_data.py
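
How `process_data.py` consumes `LOG_LEVEL` is not shown in this README. One common pattern, given here as an assumption rather than the project's actual code, is to map the variable onto the standard `logging` levels at startup; `configure_logging` is a hypothetical name.

```python
import logging
import os


def configure_logging(env=None):
    """Set the root logger level from LOG_LEVEL, falling back to INFO."""
    env = os.environ if env is None else env
    level_name = env.get("LOG_LEVEL", "INFO").upper()
    level = getattr(logging, level_name, None)
    if not isinstance(level, int):
        level = logging.INFO  # unknown names fall back to the default
    logging.basicConfig(
        level=level,
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
        force=True,  # replace any handlers configured earlier (Python 3.8+)
    )
    return level
```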

Getting Help

  1. Check Documentation: Review relevant README files
  2. Run Tests: Use test suite to identify issues
  3. Check Logs: Enable debug logging for detailed output
  4. Verify Environment: Ensure all prerequisites are met

📚 Documentation

🚀 Getting Started

🔧 System Operation

📖 Component Documentation

🧪 Testing and Quality

📋 Technical References

🤝 Contributing

We welcome contributions to improve the CADS Research Visualization System!

Development Workflow

  1. Fork the repository
  2. Create a feature branch
    git checkout -b feature/amazing-feature
  3. Make your changes
  4. Add tests for new functionality
  5. Update documentation
  6. Submit a pull request

Contribution Guidelines

  • Code Quality: Follow existing patterns and style
  • Testing: Add tests for new features
  • Documentation: Update relevant README files
  • Performance: Consider impact on processing time
  • Compatibility: Ensure cross-platform compatibility

📄 License

This project is part of the Texas State University research infrastructure. See individual component licenses for specific terms.

πŸ™ Acknowledgments

  • CADS Faculty: For providing research data and domain expertise
  • Texas State University: For supporting this research visualization initiative
  • OpenAlex: For providing open access to scholarly data
  • Supabase: For reliable database hosting and vector extensions
  • Open Source Community: For the excellent ML and visualization libraries

📞 Support and Contact

Getting Support

  • Documentation: Check relevant README files first
  • Issues: Use GitHub Issues for bug reports and feature requests
  • Testing: Run the test suite to diagnose problems
  • Community: Join discussions in GitHub Discussions

Contact Information

  • Lead Developer: Saksham Adhikari
  • Institution: Texas State University
  • Email: [contact information]
  • Project Repository: [GitHub repository URL]

🎉 Complete research data processing and visualization system ready for exploration!

Built with ❤️ for the research community at Texas State University
