
A logical, reasonably standardized, but flexible project structure for conducting, reporting and publishing bioinformatics work.

abhi18av/template-analysis-and-writeup

🔬 Comprehensive Data Science & Analysis Template

A production-ready template for end-to-end data science projects, from exploration to publication


This is a comprehensive template built with Copier for creating professional data science projects that scale from exploration to publication. It provides a complete workflow covering data analysis, machine learning, academic writing, and deployment.

🚀 Quick Start

# Create a new project
copier copy gh:abhi18av/template-analysis-and-writeup my-analysis-project
cd my-analysis-project

# Set up the development environment
just setup

# Start your first analysis
just notebooks new-eda "initial-exploration"

✨ Key Features

  • 📊 10-Stage Analysis Pipeline: From data extraction to deployment
  • 📝 Comprehensive Academic Writing: Complete ecosystem for research dissemination
  • 🔄 Reproducible Workflows: DVC pipelines and environment management
  • 🤖 Multi-Language Support: Python, R, Julia, Clojure, and more
  • 🐳 Infrastructure as Code: Docker, Terraform, and VM provisioning
  • 📈 Experiment Tracking: MLflow, Weights & Biases integration
  • ✅ Data Validation: Great Expectations and Pandera frameworks
  • 🔍 Code Quality: Pre-commit hooks, testing, and linting
  • 📖 Rich Documentation: Quarto-based reports and documentation

🎯 Enhanced Academic Writing System

  • 📄 Presentations: Academic, corporate, and workshop templates with automation
  • 📝 Abstracts: Conference, journal, and symposium abstract management with tracking
  • 💰 Grants: NSF, NIH, DOE, and foundation proposal templates with deadlines
  • 📋 Reports: Technical, executive, project, and grant progress reports
  • 🎨 Posters: Academic, conference, and professional poster templates
  • ⚡ Automation: Just-based commands for creation, rendering, and management
  • 🔄 Workflow: Complete lifecycle from draft to publication with version control

Features

Project structure

It is assumed that most of the work will be done in Jupyter notebooks. However, the template also includes a Python project in which you can put functions and classes shared across notebooks. The repository is set up to use pytest for unit testing this shared code.
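As a rough illustration of the split between notebooks and shared code, the sketch below shows a small helper that several notebooks might import, together with pytest-style tests for it. The function name `zscore` and the idea of keeping it in a shared module are assumptions made for this example; the template does not prescribe them.

```python
# Hypothetical shared helper; the name and location are illustrative only.
from statistics import mean, pstdev


def zscore(values: list[float]) -> list[float]:
    """Standardize values to zero mean and unit (population) standard deviation."""
    if not values:
        return []
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # All values are identical: return zeros rather than dividing by zero.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical tests; pytest collects functions whose names start with test_.
def test_zscore_centers_and_scales():
    result = zscore([1.0, 2.0, 3.0])
    assert abs(sum(result)) < 1e-12
    assert result[0] == -result[2]


def test_zscore_handles_constant_input():
    assert zscore([5.0, 5.0]) == [0.0, 0.0]
```

Keeping logic like this out of notebook cells means it can be tested once and reused, while the notebooks stay focused on narrative and results.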

The template also includes a data directory whose contents are ignored by git. Use this folder to store data that you do not want to commit. You may also add a README file there to document the source datasets you use and how to acquire them.

just is a command runner that lets you easily run project-specific commands. In fact, you can use just to run all the setup commands listed below:

just setup

pre-commit is a tool that runs checks on your files before you commit them with git, thereby helping ensure code quality. Enable it with the following command:

pre-commit install --install-hooks

The configuration is stored in .pre-commit-config.yaml.

GitHub Actions

You may optionally add a GitHub workflow file that performs the following checks:

  • Uses ruff to check that files are formatted and linted
  • Runs unit tests and checks coverage
  • Checks that Markdown files pass markdownlint-cli2
  • Checks that all Jupyter notebooks are clean
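The README does not define what a "clean" notebook is. A common interpretation, and the one assumed in this sketch, is that code cells carry no stored outputs and no execution counts. The function name `notebook_is_clean` is hypothetical, and the workflow may well rely on a dedicated tool instead.

```python
import json


def notebook_is_clean(notebook_json: str) -> bool:
    """Return True if no code cell carries outputs or an execution count."""
    notebook = json.loads(notebook_json)
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue  # Markdown and raw cells have no outputs to check.
        if cell.get("outputs") or cell.get("execution_count") is not None:
            return False
    return True


# Two tiny in-memory notebooks, one stripped and one with stored results.
clean = json.dumps({"cells": [
    {"cell_type": "code", "source": "1 + 1", "outputs": [], "execution_count": None},
]})
dirty = json.dumps({"cells": [
    {"cell_type": "code", "source": "1 + 1",
     "outputs": [{"output_type": "execute_result"}], "execution_count": 1},
]})

print(notebook_is_clean(clean), notebook_is_clean(dirty))  # True False
```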

Typos checks for common typos in code, aiming for a low false positive rate. The repository is configured not to use it for Jupyter notebook files, as it tends to find errors in cell outputs.

📁 Project Structure

my-analysis-project/
├── analysis/                    # Main analysis directory
│   ├── notebooks/              # Organized analysis notebooks
│   │   ├── 00_scratch/         # Quick experiments
│   │   ├── 01-data/            # Data processing pipeline
│   │   ├── 02-exploration/     # Exploratory data analysis
│   │   ├── 03-analysis/        # Statistical analysis
│   │   ├── 04-feat_eng/        # Feature engineering
│   │   ├── 05-models/          # Model development
│   │   ├── 06-interpretation/  # Model interpretation
│   │   ├── 07-reports/         # Result summaries
│   │   ├── 08-deploy/          # Deployment preparation
│   │   ├── 09-governance/      # Model governance
│   │   └── 10-iteration/       # Continuous improvement
│   ├── scripts/                # Production scripts
│   ├── data/                   # Data pipeline structure
│   ├── tests/                  # Testing framework
│   └── infrastructure/         # Infrastructure as code
├── writeup/                    # Academic writing ecosystem
│   ├── manuscript/             # Journal articles & papers
│   ├── presentation/           # Conference presentations & slides
│   │   ├── templates/          # Academic, corporate, workshop templates
│   │   ├── presentations/      # Your presentation projects
│   │   └── _output/            # Rendered presentations (PDF/HTML)
│   ├── abstracts/              # Conference & journal abstracts
│   │   ├── templates/          # Abstract templates by type
│   │   ├── conference/         # Conference abstracts
│   │   ├── journal/            # Journal abstracts
│   │   ├── symposium/          # Workshop & symposium abstracts
│   │   └── tracking/           # Deadline & review tracking
│   ├── grants/                 # Grant applications & management
│   │   ├── templates/          # NSF, NIH, DOE, private foundation templates
│   │   ├── applications/       # Active, submitted, awarded grants
│   │   ├── assets/             # Supporting materials & budgets
│   │   └── tracking/           # Deadlines & progress tracking
│   ├── report/                 # Technical reports & documentation
│   │   ├── templates/          # Technical, executive, project, grant reports
│   │   ├── reports/            # Your report projects
│   │   └── _output/            # Rendered reports (PDF/HTML/DOCX)
│   ├── poster/                 # Academic & professional posters
│   │   ├── templates/          # Academic, conference, professional templates
│   │   ├── posters/            # Your poster projects
│   │   └── _output/            # Rendered posters (PDF/HTML)
│   └── blog/                   # Blog posts & informal writing
├── justfile                    # Task automation
├── pyproject.toml              # Python project configuration
└── README.md                   # Project documentation
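The notebook stage directories in the tree above carry a two-digit prefix that fixes their order. The sketch below shows one way a script might parse and sort such names. The helper names `parse_stage` and `sort_stages` are hypothetical and not part of the template; the pattern accepts both `-` and `_` after the prefix because both appear in the tree.

```python
import re

# Matches names such as "05-models" or "00_scratch".
STAGE_PATTERN = re.compile(r"^(\d{2})[-_](.+)$")


def parse_stage(dirname: str) -> tuple[int, str]:
    """Split a stage directory name into (stage number, stage label)."""
    match = STAGE_PATTERN.match(dirname)
    if match is None:
        raise ValueError(f"not a stage directory: {dirname!r}")
    return int(match.group(1)), match.group(2)


def sort_stages(dirnames: list[str]) -> list[str]:
    """Order stage directories by their numeric prefix."""
    return sorted(dirnames, key=lambda name: parse_stage(name)[0])


print(parse_stage("05-models"))  # (5, 'models')
print(sort_stages(["10-iteration", "00_scratch", "05-models"]))
```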

🔄 Workflow Overview

1. Data Science Pipeline

flowchart TD
    A[Data Extraction] --> B[Data Cleaning]
    B --> C[Exploratory Analysis]
    C --> D[Feature Engineering]
    D --> E[Model Development]
    E --> F[Model Evaluation]
    F --> G[Deployment]
    G --> H[Monitoring]

2. Academic Workflow

flowchart LR
    A[Analysis] --> B[Results]
    B --> C[Manuscript]
    C --> D[Peer Review]
    D --> E[Publication]
    
    A --> F[Grant Proposal]
    F --> G[Funding]
    G --> A

🛠 Technology Stack

Core Technologies

  • Python: Primary programming language
  • R: Statistical computing
  • Quarto: Scientific publishing
  • Just: Task automation
  • uv: Python package management

Data & ML

Infrastructure

Quality & Testing

🎯 Getting Started

Prerequisites

Installation

  1. Install Copier:

    pip install copier

  2. Install Just (choose your platform):

    # macOS
    brew install just

    # Linux (cargo)
    cargo install just

    # Or download from releases

  3. Generate Project:

    copier copy gh:abhi18av/template-analysis-and-writeup my-project
    cd my-project

  4. Initialize Environment:

    just setup

💡 Usage Examples

Creating Experiments

# Create EDA notebook
just notebooks new-eda "customer-segmentation"

# Create modeling experiment
just notebooks new-model "random-forest" --stage="05-models" --type="advanced"

# Create custom experiment
just notebooks new-experiment "hypothesis-test" "03-analysis/031_hypothesis_testing"

Running Analysis

# Run specific notebook
just notebooks run "02-exploration/eda_20241201_customer-segmentation.qmd"

# Run entire stage
just notebooks run-stage "05-models"

# Execute DVC pipeline
just analysis pipeline
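The notebook path in the example above, `eda_20241201_customer-segmentation.qmd`, suggests a `<kind>_<YYYYMMDD>_<name>.qmd` naming pattern. That pattern is inferred from this single example and the real justfile recipe may differ. Under that assumption, a hypothetical helper that builds such names could look like this:

```python
from datetime import date


def notebook_filename(kind: str, name: str, created: date) -> str:
    """Build a notebook file name as <kind>_<YYYYMMDD>_<name>.qmd (assumed pattern)."""
    return f"{kind}_{created:%Y%m%d}_{name}.qmd"


print(notebook_filename("eda", "customer-segmentation", date(2024, 12, 1)))
# eda_20241201_customer-segmentation.qmd
```

Putting the date in the file name keeps notebooks within a stage sorted chronologically when listed alphabetically.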

Managing Experiments

# List all notebooks
just notebooks list

# Search experiments
just notebooks search "machine learning"

# Generate experiment report
just notebooks report "customer-segmentation"

Academic Writing

# Manuscript writing
just writeup manuscript-new "customer-analysis-paper"
just writeup manuscript-render

# Presentations and slides
cd writeup/presentation
just create-academic "research-talk"
just render-presentation "research-talk" format=revealjs

# Abstract management
cd writeup/abstracts
just new-conference "conference-abstract" "ICML-2024"
just list-submitted

# Grant applications
cd writeup/grants
just new-nsf "data-science-grant"
just deadlines

# Reports and documentation
cd writeup/report
just create-technical "analysis-report"
just render-report "analysis-report" format=pdf

# Poster creation
cd writeup/poster
just create-academic "conference-poster"
just render-poster academic "conference-poster"

Infrastructure Management

# Create development VM
just infrastructure vm-create

# Deploy to cloud
just infrastructure deploy-cloud

# Monitor resources
just infrastructure status

⚙️ Configuration

Project Configuration

The template uses several configuration files:

  • copier.yml: Template configuration
  • pyproject.toml: Python project settings
  • justfile: Task automation
  • dvc.yaml: Data pipeline definition
  • .pre-commit-config.yaml: Code quality hooks

Environment Variables

Create a .env file for sensitive configuration:

# Experiment tracking
MLFLOW_TRACKING_URI=https://your-mlflow-server.com
WANDB_API_KEY=your-wandb-key

# Cloud credentials
AWS_ACCESS_KEY_ID=your-aws-key
AWS_SECRET_ACCESS_KEY=your-aws-secret

# Database connections
DATABASE_URL=postgresql://user:pass@localhost:5432/db
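The README does not say how the `.env` file is read. Packages such as python-dotenv are commonly used for this. The standard-library sketch below only illustrates the idea, and the helper names `parse_env` and `load_env` are assumptions for this example.

```python
import os


def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first "=" only, so values such as URLs stay intact.
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def load_env(text: str) -> None:
    """Export parsed values without overriding variables that are already set."""
    for key, value in parse_env(text).items():
        os.environ.setdefault(key, value)


sample = """
# Database connections
DATABASE_URL=postgresql://user:pass@localhost:5432/db
"""
print(parse_env(sample))
```

Using `setdefault` lets values already exported in the shell or CI environment take precedence over the file.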

Customization Options

When generating a project, you can customize:

  • Programming languages (Python, R, both)
  • Experiment tracking platform
  • Documentation format
  • Deployment options
  • Infrastructure components

🔧 Advanced Features

Multi-Language Support

The template supports multiple programming languages with consistent structure:

# Python analysis
just notebooks new-experiment "python-analysis" "02-exploration" --lang=python

# R analysis  
just notebooks new-experiment "r-analysis" "02-exploration" --lang=r

# Julia analysis
just notebooks new-experiment "julia-analysis" "02-exploration" --lang=julia

Data Validation

Integrated data validation with multiple frameworks:

# `df` is assumed to be a pandas DataFrame loaded earlier in the notebook

# Use Great Expectations
from analysis.validation import create_expectation_suite
suite = create_expectation_suite(df, "customer_data")

# Use Pandera schemas
from analysis.validation import validate_dataframe
validated_df = validate_dataframe(df, "customer_schema")
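To make concrete what schema validation checks, here is a dependency-free sketch. It deliberately uses neither Great Expectations nor Pandera, and it is not how the template's `validate_dataframe` helper is implemented. The schema layout, the column names, and `validate_rows` are all hypothetical.

```python
from typing import Any, Callable

# Hypothetical schema: column name -> (expected type, per-value predicate)
Schema = dict[str, tuple[type, Callable[[Any], bool]]]

CUSTOMER_SCHEMA: Schema = {
    "customer_id": (int, lambda value: value > 0),
    "email": (str, lambda value: "@" in value),
}


def validate_rows(rows: list[dict[str, Any]], schema: Schema) -> list[str]:
    """Return human-readable problems; an empty list means every row is valid."""
    problems = []
    for index, row in enumerate(rows):
        for column, (expected_type, predicate) in schema.items():
            if column not in row:
                problems.append(f"row {index}: missing {column!r}")
            elif not isinstance(row[column], expected_type):
                problems.append(f"row {index}: {column!r} is not {expected_type.__name__}")
            elif not predicate(row[column]):
                problems.append(f"row {index}: {column!r} failed its check")
    return problems


rows = [
    {"customer_id": 1, "email": "a@example.com"},
    {"customer_id": -5, "email": "b@example.com"},
    {"customer_id": 3},
]
for problem in validate_rows(rows, CUSTOMER_SCHEMA):
    print(problem)
```

Dedicated libraries add much more on top of this idea, such as DataFrame-aware checks and reporting, which is why the template integrates them rather than hand-rolled checks.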

Experiment Tracking

Automatic experiment logging:

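# MLflow integration
import mlflow
import mlflow.sklearn  # scikit-learn flavor used by log_model below

# `model` is assumed to be a fitted scikit-learn estimator defined earlier
with mlflow.start_run():
    mlflow.log_param("model_type", "random_forest")
    mlflow.log_metric("accuracy", 0.95)  # example value; log your measured metric
    mlflow.sklearn.log_model(model, "model")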

🀝 Contributing

Contributions are welcome! Please see our Contributing Guide for details.

Development Setup

# Clone the template repository
git clone https://github.com/abhi18av/template-analysis-and-writeup.git
cd template-analysis-and-writeup

# Test the template
copier copy . test-project
cd test-project
just setup

Testing

# Run template tests
pytest tests/

# Test with copier-template-tester
ctt test

📚 Resources

🏆 Acknowledgments

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.


Made with ❤️ for the data science community

Test with Copier and copier-template-tester.
