A production-ready template for end-to-end data science projects, from exploration to publication
This is a comprehensive template built with Copier for creating professional data science projects that scale from exploration to publication. It provides a complete workflow covering data analysis, machine learning, academic writing, and deployment.
# Create a new project
copier copy gh:abhi18av/template-analysis-and-writeup my-analysis-project
cd my-analysis-project
# Set up the development environment
just setup
# Start your first analysis
just notebooks new-eda "initial-exploration"
- 10-Stage Analysis Pipeline: From data extraction to deployment
- Comprehensive Academic Writing: Complete ecosystem for research dissemination
- Reproducible Workflows: DVC pipelines and environment management
- Multi-Language Support: Python, R, Julia, Clojure, and more
- Infrastructure as Code: Docker, Terraform, and VM provisioning
- Experiment Tracking: MLflow and Weights & Biases integration
- Data Validation: Great Expectations and Pandera frameworks
- Code Quality: Pre-commit hooks, testing, and linting
- Rich Documentation: Quarto-based reports and documentation
- Presentations: Academic, corporate, and workshop templates with automation
- Abstracts: Conference, journal, and symposium abstract management with tracking
- Grants: NSF, NIH, DOE, and foundation proposal templates with deadlines
- Reports: Technical, executive, project, and grant progress reports
- Posters: Academic, conference, and professional poster templates
- Automation: Just-based commands for creation, rendering, and management
- Workflow: Complete lifecycle from draft to publication with version control
It is assumed that most of the work will be done in Jupyter notebooks. However, the template also includes a Python package in which you can put functions and classes shared across notebooks, and the repository is set up to use pytest for unit testing this module code.
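For example, a shared helper might live in the package with a matching unit test. This is a minimal sketch; the module path, function, and test names here are hypothetical, not part of the template:

# analysis/metrics.py (hypothetical module in the shared package)
import numpy as np

def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between predictions and targets."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

# tests/test_metrics.py (hypothetical test file, discovered by pytest)
def test_mean_absolute_error():
    assert mean_absolute_error([1.0, 2.0], [1.0, 4.0]) == 1.5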
The template also includes a data directory whose contents are ignored by git. You can use this folder to store data that you do not want to commit, and you may also add a README there documenting the source datasets you use and how to acquire them.
just is a command runner that lets you easily run project-specific commands. In fact, you can use just to run all the setup commands listed below:
just setup
pre-commit is a tool that runs checks on your files before you commit them with git, thereby helping ensure code quality. Enable it with the following command:
pre-commit install --install-hooks
The configuration is stored in .pre-commit-config.yaml.
You may optionally add a GitHub workflow file that checks the following:
- Uses Ruff to check that files are formatted and linted
- Runs unit tests and checks coverage
- Checks that any Markdown files are formatted with markdownlint-cli2
- Checks that all Jupyter notebooks are clean
typos checks for common misspellings in code, aiming for a low false-positive rate. The repository is configured not to run it on Jupyter notebook files, as it tends to flag errors in cell outputs.
- Quick Start
- Key Features
- Project Structure
- Workflow Overview
- Technology Stack
- Getting Started
- Usage Examples
- Configuration
- Contributing
- License
my-analysis-project/
├── analysis/                  # Main analysis directory
│   ├── notebooks/             # Organized analysis notebooks
│   │   ├── 00_scratch/        # Quick experiments
│   │   ├── 01-data/           # Data processing pipeline
│   │   ├── 02-exploration/    # Exploratory data analysis
│   │   ├── 03-analysis/       # Statistical analysis
│   │   ├── 04-feat_eng/       # Feature engineering
│   │   ├── 05-models/         # Model development
│   │   ├── 06-interpretation/ # Model interpretation
│   │   ├── 07-reports/        # Result summaries
│   │   ├── 08-deploy/         # Deployment preparation
│   │   ├── 09-governance/     # Model governance
│   │   └── 10-iteration/      # Continuous improvement
│   ├── scripts/               # Production scripts
│   ├── data/                  # Data pipeline structure
│   ├── tests/                 # Testing framework
│   └── infrastructure/        # Infrastructure as code
├── writeup/                   # Academic writing ecosystem
│   ├── manuscript/            # Journal articles & papers
│   ├── presentation/          # Conference presentations & slides
│   │   ├── templates/         # Academic, corporate, workshop templates
│   │   ├── presentations/     # Your presentation projects
│   │   └── _output/           # Rendered presentations (PDF/HTML)
│   ├── abstracts/             # Conference & journal abstracts
│   │   ├── templates/         # Abstract templates by type
│   │   ├── conference/        # Conference abstracts
│   │   ├── journal/           # Journal abstracts
│   │   ├── symposium/         # Workshop & symposium abstracts
│   │   └── tracking/          # Deadline & review tracking
│   ├── grants/                # Grant applications & management
│   │   ├── templates/         # NSF, NIH, DOE, private foundation templates
│   │   ├── applications/      # Active, submitted, awarded grants
│   │   ├── assets/            # Supporting materials & budgets
│   │   └── tracking/          # Deadlines & progress tracking
│   ├── report/                # Technical reports & documentation
│   │   ├── templates/         # Technical, executive, project, grant reports
│   │   ├── reports/           # Your report projects
│   │   └── _output/           # Rendered reports (PDF/HTML/DOCX)
│   ├── poster/                # Academic & professional posters
│   │   ├── templates/         # Academic, conference, professional templates
│   │   ├── posters/           # Your poster projects
│   │   └── _output/           # Rendered posters (PDF/HTML)
│   └── blog/                  # Blog posts & informal writing
├── justfile                   # Task automation
├── pyproject.toml             # Python project configuration
└── README.md                  # Project documentation
flowchart TD
A[Data Extraction] --> B[Data Cleaning]
B --> C[Exploratory Analysis]
C --> D[Feature Engineering]
D --> E[Model Development]
E --> F[Model Evaluation]
F --> G[Deployment]
G --> H[Monitoring]
flowchart LR
A[Analysis] --> B[Results]
B --> C[Manuscript]
C --> D[Peer Review]
D --> E[Publication]
A --> F[Grant Proposal]
F --> G[Funding]
G --> A
- Python: Primary programming language
- R: Statistical computing
- Quarto: Scientific publishing
- Just: Task automation
- uv: Python package management
- DVC: Data version control
- MLflow: Experiment tracking
- Great Expectations: Data validation
- Pandera: Data validation framework
- Ruff: Python linting and formatting
- pre-commit: Git hooks
- pytest: Testing framework
- Install Copier:
  pip install copier
- Install Just (choose your platform):
  # macOS
  brew install just
  # Linux (cargo)
  cargo install just
  # Or download from releases
- Generate Project:
  copier copy gh:abhi18av/template-analysis-and-writeup my-project
  cd my-project
- Initialize Environment:
  just setup
# Create EDA notebook
just notebooks new-eda "customer-segmentation"
# Create modeling experiment
just notebooks new-model "random-forest" --stage="05-models" --type="advanced"
# Create custom experiment
just notebooks new-experiment "hypothesis-test" "03-analysis/031_hypothesis_testing"
# Run specific notebook
just notebooks run "02-exploration/eda_20241201_customer-segmentation.qmd"
# Run entire stage
just notebooks run-stage "05-models"
# Execute DVC pipeline
just analysis pipeline
# List all notebooks
just notebooks list
# Search experiments
just notebooks search "machine learning"
# Generate experiment report
just notebooks report "customer-segmentation"
# Manuscript writing
just writeup manuscript-new "customer-analysis-paper"
just writeup manuscript-render
# Presentations and slides
cd writeup/presentation
just create-academic "research-talk"
just render-presentation "research-talk" format=revealjs
# Abstract management
cd writeup/abstracts
just new-conference "conference-abstract" "ICML-2024"
just list-submitted
# Grant applications
cd writeup/grants
just new-nsf "data-science-grant"
just deadlines
# Reports and documentation
cd writeup/report
just create-technical "analysis-report"
just render-report "analysis-report" format=pdf
# Poster creation
cd writeup/poster
just create-academic "conference-poster"
just render-poster academic "conference-poster"
# Create development VM
just infrastructure vm-create
# Deploy to cloud
just infrastructure deploy-cloud
# Monitor resources
just infrastructure status
The template uses several configuration files:
- copier.yml: Template configuration
- pyproject.toml: Python project settings
- justfile: Task automation
- dvc.yaml: Data pipeline definition
- .pre-commit-config.yaml: Code quality hooks
Create a .env file for sensitive configuration:
# Experiment tracking
MLFLOW_TRACKING_URI=https://your-mlflow-server.com
WANDB_API_KEY=your-wandb-key
# Cloud credentials
AWS_ACCESS_KEY_ID=your-aws-key
AWS_SECRET_ACCESS_KEY=your-aws-secret
# Database connections
DATABASE_URL=postgresql://user:pass@localhost:5432/db
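To make these variables available inside notebooks or scripts, one option is python-dotenv. This is a sketch under the assumption that the python-dotenv package is installed; it is not necessarily bundled with the template:

import os
from dotenv import load_dotenv  # assumes python-dotenv is installed

load_dotenv()  # reads key=value pairs from .env into the process environment
tracking_uri = os.environ["MLFLOW_TRACKING_URI"]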
When generating a project, you can customize:
- Programming languages (Python, R, both)
- Experiment tracking platform
- Documentation format
- Deployment options
- Infrastructure components
The template supports multiple programming languages with consistent structure:
# Python analysis
just notebooks new-experiment "python-analysis" "02-exploration" --lang=python
# R analysis
just notebooks new-experiment "r-analysis" "02-exploration" --lang=r
# Julia analysis
just notebooks new-experiment "julia-analysis" "02-exploration" --lang=julia
Integrated data validation with multiple frameworks:
# Use Great Expectations
from analysis.validation import create_expectation_suite
suite = create_expectation_suite(df, "customer_data")
# Use Pandera schemas
from analysis.validation import validate_dataframe
validated_df = validate_dataframe(df, "customer_schema")
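For context, a schema like the one below could sit behind a helper such as validate_dataframe. This is a hedged sketch using Pandera's public API; the column names and checks are illustrative, not the template's actual "customer_schema":

import pandas as pd
import pandera as pa

# Illustrative schema; null values are skipped by checks when nullable=True.
customer_schema = pa.DataFrameSchema({
    "customer_id": pa.Column(int, pa.Check.ge(0)),
    "spend": pa.Column(float, pa.Check.ge(0), nullable=True),
})

df = pd.DataFrame({"customer_id": [1, 2], "spend": [9.99, None]})
validated = customer_schema.validate(df)  # raises SchemaError on violations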
Automatic experiment logging:
# MLflow integration
import mlflow
with mlflow.start_run():
    mlflow.log_param("model_type", "random_forest")
    mlflow.log_metric("accuracy", 0.95)
    mlflow.sklearn.log_model(model, "model")
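The template also lists Weights & Biases as a tracking option. The equivalent logging calls look like this; a minimal sketch using the standard wandb client API, with an illustrative project name:

import wandb

run = wandb.init(project="my-analysis-project")  # illustrative project name
run.config.update({"model_type": "random_forest"})  # hyperparameters
run.log({"accuracy": 0.95})                         # metrics
run.finish()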
Contributions are welcome! Please see our Contributing Guide for details.
# Clone the template repository
git clone https://github.com/abhi18av/template-analysis-and-writeup.git
cd template-analysis-and-writeup
# Test the template
copier copy . test-project
cd test-project
just setup
# Run template tests
pytest tests/
# Test with copier-template-tester
ctt test
This project is licensed under the MIT License - see the LICENSE file for details.
Made with ❤️ for the data science community