Arabidopsis csRNA-seq pipeline

Author: Benjamin Jean-Marie Tremblay

Requirements

homer (with the tair10 genome; tested with v5.1)
samtools (tested with v1.20)
bfqutils (tested with v1.0)
bwa (tested with v0.7.19)
bedtools (tested with v2.31.1)
R (with packages: rtracklayer, readr, edgeR; tested with v4.4.1)
Basic unix tools: gunzip, sed, grep, awk, cut, tr, etc.

The pipeline

The scripts included in this pipeline go from raw reads to a final set of merged TSSs, alongside quantification data and normalized bigWigs. Every script checks whether an output already exists and only proceeds if it is absent, or if the appropriate option is set to override this behaviour. This makes it easy to add samples to an existing project without having to process different batches in separate directories and manually integrate them every time.
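The skip-if-present behaviour described above can be illustrated with a small guard function. This is only a minimal sketch of the idea, not the code used in the pipeline scripts: the names `run_step` and `FORCE` are hypothetical and stand in for whatever override option each script actually provides.

```shell
#!/usr/bin/env bash
# Sketch of a "skip if the output already exists" guard.
# run_step and FORCE are hypothetical names, not taken from the pipeline scripts.
FORCE="${FORCE:-0}"   # set FORCE=1 to redo a step even if its output exists

# run_step OUTPUT COMMAND [ARGS...]
# Runs COMMAND only when OUTPUT is missing, or when FORCE=1.
run_step() {
    local output="$1"
    shift
    if [ -e "$output" ] && [ "$FORCE" != "1" ]; then
        echo "skip: $output already exists"
        return 0
    fi
    echo "run: creating $output"
    "$@"
}

# Usage: the first call creates the file, the second call is skipped.
workdir="$(mktemp -d)"
run_step "$workdir/sampleA.done" touch "$workdir/sampleA.done"
run_step "$workdir/sampleA.done" touch "$workdir/sampleA.done"
```

With a guard like this, re-running a script after adding new samples only does work for the samples whose outputs are not there yet.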

Part 1: `process_reads.sh`

This first script processes the raw reads and outputs HOMER tag directories, which are used by all downstream scripts. Single-end reads are trimmed with bfqtrimse, and paired-end reads are merged using bfqmerge. The processed reads are then aligned to the genome using bwa aln and filtered, and tag directories are created from the filtered alignments. I recommend using a mapping quality score of at least 30 for filtering the alignments, though a slightly lower threshold of 25 can let in more true positive alignments without too many accompanying false positives. (See this link for a discussion on what these scores mean in this context.) Personally I prefer the stricter score, since false positive alignments tend to pile up at shared single positions and can therefore closely resemble TSSs. Please note that the Arabidopsis chromosome names must be in Ensembl format, i.e. 1, 2, 3, 4, 5, Mt, Pt.
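Two of the points above lend themselves to a short illustration: the Ensembl chromosome naming and the mapping quality cutoff. The sketch below is not part of `process_reads.sh`; the function names `to_ensembl_names` and `filter_mapq` are hypothetical. It assumes the genome FASTA uses TAIR-style names (Chr1 to Chr5, ChrM, ChrC), and it applies the cutoff to SAM text with awk purely to show what the threshold does. On real BAM files, `samtools view -q 30` applies the same cut.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration only; not taken from process_reads.sh.

# Rewrite TAIR-style FASTA header names (Chr1..Chr5, ChrM, ChrC)
# to the Ensembl names the pipeline expects (1..5, Mt, Pt).
to_ensembl_names() {
    sed -E \
        -e 's/^>[Cc]hr([1-5])([[:space:]]|$)/>\1\2/' \
        -e 's/^>[Cc]hrM([[:space:]]|$)/>Mt\1/' \
        -e 's/^>[Cc]hrC([[:space:]]|$)/>Pt\1/'
}

# Keep SAM header lines plus alignments whose MAPQ (column 5)
# is at least the given threshold (default 30).
filter_mapq() {
    local min_mapq="${1:-30}"
    awk -F'\t' -v q="$min_mapq" '/^@/ || $5 + 0 >= q'
}

# Usage with tiny made-up inputs.
printf '>Chr1 chromosome 1\nACGT\n>ChrM\nACGT\n>ChrC\nACGT\n' \
    | to_ensembl_names | grep '^>'

printf '@SQ\tSN:1\tLN:30427671\nr1\t0\t1\t100\t37\t20M\t*\t0\t0\t*\t*\nr2\t0\t1\t200\t12\t20M\t*\t0\t0\t*\t*\n' \
    | filter_mapq 30 | cut -f1
```

In the second example only the header line and read `r1` survive, since `r2` has a mapping quality of 12.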

Part 2: `identify_tss.sh`

This second script goes through all samples, identifies TSSs, and creates raw bedGraph files. It also creates a few additional outputs used by the next step in the pipeline, quality control. These include sRNA peaks from the input samples and quantification data from a combined set of all identified TSSs.

Part 3: `qc_tss.sh`

Using the outputs from the last part of the pipeline, this script calculates key quality control statistics including the percentage of nuclear reads and the fraction of reads in peaks. At the end it will output a final sample quality indicator (OK or FAIL).
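To make the two statistics named above concrete, here is a rough sketch of how they could be computed from read counts. It is not the logic of `qc_tss.sh`. The function name `qc_verdict`, the default thresholds (50% nuclear reads, a fraction of reads in peaks of 0.2), and the choice to compute the fraction of reads in peaks relative to nuclear reads are all assumptions made for illustration. "Nuclear reads" is taken here to mean reads aligned to chromosomes 1 to 5 rather than Mt or Pt.

```shell
#!/usr/bin/env bash
# Hypothetical QC summary; thresholds and definitions are illustrative
# assumptions, not the values used by qc_tss.sh.

# qc_verdict TOTAL_READS NUCLEAR_READS READS_IN_PEAKS [MIN_NUCLEAR_PCT] [MIN_FRIP]
# Prints: nuclear percentage, fraction of reads in peaks, verdict (tab-separated).
qc_verdict() {
    awk -v total="$1" -v nuclear="$2" -v peaks="$3" \
        -v min_nuc="${4:-50}" -v min_frip="${5:-0.2}" 'BEGIN {
        # Guard against division by zero for empty samples.
        if (total <= 0 || nuclear <= 0) { print "NA\tNA\tFAIL"; exit }
        nuc_pct = 100 * nuclear / total
        frip = peaks / nuclear
        verdict = (nuc_pct >= min_nuc && frip >= min_frip) ? "OK" : "FAIL"
        printf "%.1f\t%.3f\t%s\n", nuc_pct, frip, verdict
    }'
}

# Usage with made-up counts.
qc_verdict 1000000 900000 450000
qc_verdict 1000000 300000 150000
```

The first call passes both hypothetical thresholds, while the second fails because only 30% of its reads are nuclear.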

Part 4: `finalize_tss.sh`

After quality control, some samples may need to be discarded before assembling a final set of TSSs and performing quantification. This script takes the selected high quality samples, generates a final set of TSSs, and quantifies them. It then uses this quantification data to create normalized bigWig files.
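The normalization step can be pictured as multiplying each sample's raw signal track by a per-sample scaling factor before converting it to bigWig. The sketch below shows only that scaling on a bedGraph with awk. The function name `scale_bedgraph` is hypothetical, and how the factor is derived from the quantification data (for example from edgeR normalization factors) and how the bigWig is written are assumptions left outside the example.

```shell
#!/usr/bin/env bash
# Hypothetical helper for illustration; not taken from finalize_tss.sh.

# scale_bedgraph FACTOR
# Reads a bedGraph on stdin and multiplies the signal column (column 4)
# by FACTOR. Track, browser, and comment lines are passed through unchanged.
scale_bedgraph() {
    local factor="$1"
    awk -v f="$factor" 'BEGIN { FS = OFS = "\t" }
        /^(track|browser|#)/ { print; next }
        { $4 = sprintf("%.3f", $4 * f); print }'
}

# Usage with a tiny made-up bedGraph and an assumed factor of 0.5.
printf '1\t0\t10\t10\n1\t10\t20\t3\n' | scale_bedgraph 0.5
```

Applying the same kind of factor to every sample puts the tracks on a comparable scale, which is what makes them useful for side-by-side viewing in a genome browser.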
