Directory Structure

When using the PEARG clusters or the Melbourne Bioinformatics clusters, your files and data should be organised according to the convention described here.

Each project should have its own directory located inside your home directory or a shared location. Your project directory name should be meaningful.

Any shared data such as reference genomes and indices should be stored outside your project directory to avoid unnecessary duplication of data.

Check your project directory

At absolute minimum, please check:

  • is your project name sensible?
  • do you have a README file (or equivalent) in the parent directory of your project directory that contains meaningful information?
  • do you have a data directory (containing your raw data) and a results directory (containing processed data) with files organised in a logical manner?

If you’ve answered no to any of these questions, you may get a stern email in the future.

The motivation behind having a strict structure that all members adhere to for project organisation is transparency. Anyone from the lab should be able to look in your project directory and clearly understand what was run and reproduce your analysis. Most of the time, the person trying to decode what analyses were performed and why will be future you. Cooperate with your future self by leaving verbose notes in README files!

While the sysadmin won’t be authoritarian about the precise directory structure, any flagrant disregard for the guidelines (such as dumping all your processing files in the top-level results directory) will be met with consequences.


Filesystem overview

Click on the image to view a larger version of the recommended directory structure or scroll down for another example in text.


Creating a project directory using the template

# TODO: Jess will create a project skeleton in the future

You should rename the directories starting with rename_* into something sensible.


Example directory structure

In this example, our project name is called project_name and is a RAD-seq experiment that was processed using Stacks.

project_name/
├── data/
├── results/
├── scripts/
├── software/
└── README

The directory has four subdirectories: data, results, scripts, and software, and one README file. The README file should be a plain-text file containing basic project information (e.g. what the project is about, what type of data was sequenced).

Data directory

The data directory should contain the raw data received from sequencing. Each library should have it’s own directory containing sequencing files and a text file containing barcodes corresponding to samples. This file is needed for Stacks process_radtags.

project_name/
├── data/
│   ├── library_1_raw_data/
│   │   ├── seqA_R1_001.fastq.gz
│   │   ├── seqA_R2_001.fastq.gz
│   │   └──  library_1_barcodes.txt
│   ├── library_2_raw_data/
│   │   ├── seqB_R1_001.fastq.gz
│   │   ├── seqB_R2_001.fastq.gz
│   │   └── library_2_barcodes.txt
├── results/
├── scripts/
├── software/
└── README

Make files read-only (optional)

The files in your data directory should never be edited.

If you are familiar with UNIX file permissions, you can remove write permissions with the chmod command. For example, the following command removes write permission for all users:

chmod a-w seqA_R1_001.fastq.gz

You can check file permissions with ls -l where the first column represents whether read/write/execute access is avaiable.

$ ls -l
-r--r--r-- 1 jess jess 142870 Aug  8 14:30 seqA_R1_001.fastq.gz
-r--r--r-- 1 jess jess 177552 Aug  8 14:30 seqA_R2_001.fastq.gz

Results directory

The results directory should have one directory for each time you generate a set of results. Using subdirectories inside the main results directory is recommended because often experiments are re-run in the future (e.g. updated software versions, more sequencing data, reanalysis before publication). I recommend naming the directory with a date in YYYY-MM-DD format at the beginning of the name so the directories are sorted chronologically.

project_name/
├── data/
├── results/
│   ├── 2017-08-01_results/
│   │   ├── ...
│   │   ├── ...
│   │   └── ...
│   ├── 2017-11-01_extra_samples/
│   │   ├── ...
│   │   ├── ...
│   │   └── ...
│   ├── 2018-05-01_reanalysis/
│   │   ├── ...
│   │   ├── ...
│   │   └── ...
├── scripts/
├── software/
└── README

Inside each result subdirectory, there should be multiple directories containing output from steps in your workflow. In this example, the directories inside 2017-08-01_results are: demuxed_seq, demuxed_cat, alignments, stacks, and qc. Your directory names may look different depending on what type of analysis you’re performing. The contents of each directory is described below.

Sequencing data

results/
├── 2017-08-01_results/
│   ├── demuxed_seq/
│   │   ├── mozzie-1.1.fq
│   │   ├── mozzie-1.2.fq
│   │   ├── mozzie-1.rem.1.fq
│   │   ├── mozzie-1.rem.2.fq
│   │   ├── mozzie-2.1.fq
│   │   ├── mozzie-2.2.fq
│   │   └── ...
│   ├── demuxed_cat/
│   │   ├── mozzie-1.fq
│   │   ├── mozzie-2.fq
│   │   ├── mozzie-3.fq
│   │   └── ...
│   └── ...

The demuxed_seq directory contains demuxed sequencing data processed by process_ragtags. Stacks should output four files for each sample listed in the barcode file. In this example, mozzie-1.1.fq and mozzie-1.2.fq contain the set forward and reverse reads for the mozzie-1 sample. The mozzie-1.rem.1.fq and mozzie-1.rem.2.fq files contain the remaining reads that are unpaired due to their mate being discarded.

If you’re working with ddRADseq data, Stacks recommends concatenating the four files together. Here, demuxed_cat contains the concatenated files.

Alignment data

results/
├── 2017-08-01_results/
│   ├── demuxed_seq/
│   ├── demuxed_cat/
│   ├── alignments/
│   │   ├── AaegL2/
│   │   │   ├── mozzie-1.AaegL2.sorted.bam
│   │   │   ├── mozzie-1.AaegL2.sorted.bam.bai
│   │   │   ├── mozzie-2.AaegL2.sorted.bam
│   │   │   ├── mozzie-2.AaegL2.sorted.bam.bai
│   │   │   ├── ...
│   │   │   └── AagL2_alignment_code.txt
│   │   ├── AaegL3/
│   │   │   ├── mozzie-1.AaegL3.sorted.bam
│   │   │   ├── mozzie-1.AaegL3.sorted.bam.bai
│   │   │   ├── mozzie-2.AaegL3.sorted.bam
│   │   │   ├── mozzie-2.AaegL3.sorted.bam.bai
│   │   │   ├── ...
│   │   │   └── AagL3_alignment_code.txt
│   └── ...

Alignments should be stored in the alignments directory with a separate directory for each reference genome aligned against. Alignments should be stored as bam files with the .bam suffix and bam index files, if provided, should end with .bai. If alignements are sorted, it’s recommended to include sorted in the filename. Including the reference genome name in the filename is also helpful.

A plain-text file with what commands were run should also be included in the directory (e.g. AagL2_alignment_code.txt) or in the scripts directory.

Stacks data

results/
├── 2017-08-01_results/
│   ├── demuxed_seq/
│   ├── demuxed_cat/
│   ├── alignments/
│   ├── stacks/
│   │   ├── stacks_AaegL2_females/
│   │   │   ├── catalog/
│   │   │   │   ├── mozzie-1.AaegL2.alleles.tsv
│   │   │   │   ├── mozzie-1.AaegL2.matches.tsv
│   │   │   │   ├── mozzie-1.AaegL2.models.tsv
│   │   │   │   ├── mozzie-1.AaegL2.snps.tsv
│   │   │   │   ├── mozzie-2.AaegL2.alleles.tsv
│   │   │   │   ├── ...
│   │   │   │   ├── batch_1.catalog.alleles.tsv
│   │   │   │   ├── batch_1.catalog.snps.tsv
│   │   │   │   ├── batch_1.catalog.tags.tsv
│   │   │   │   └── batch_1.markers.tsv
│   │   │   ├── population_females_filtered/
│   │   │   │   ├── batch_1.vcf
│   │   │   │   ├── batch_1.haplotypes.tsv
│   │   │   │   ├── batch_1.sumstats_summary.tsv
│   │   │   │   ├── batch_1.sumstats.tsv
│   │   │   │   ├── batch_1.hapstats.tsv
│   │   │   │   └── code_females_filtered.txt
│   │   │   ├── population_females_singlesnp/
│   │   │   │   ├── batch_1.vcf
│   │   │   │   ├── batch_1.haplotypes.tsv
│   │   │   │   ├── batch_1.sumstats_summary.tsv
│   │   │   │   ├── batch_1.sumstats.tsv
│   │   │   │   ├── batch_1.hapstats.tsv
│   │   │   │   └── code_females_singlesnp.txt
│   │   ├── stacks_AaegL2_males/
│   │   │   └── ...
│   │   └── ...
│   └── ...

Each run of Stacks to get a catalogue should have its own separate directory in the stacks directory. The output files from ref_map or denovo_map stored in its own directory.

Each time you run Stacks populations with designated filters, you should store the files in a separate directory. You should also include a file containing the code that was used to produce the output.

QC data

results/
├── 2017-08-01_results/
│   ├── demuxed_seq/
│   ├── demuxed_cat/
│   ├── alignments/
│   ├── stacks/
│   ├── qc/
│   │   ├── fastqc/
│   │   │   ├── ...
│   │   │   ├── ...
│   │   │   └── ...
│   │   ├── flagstat/
│   │   │   ├── ...
│   │   │   ├── ...
│   │   │   └── ...
│   │   └── ...
│   └── ...

The qc directory should contain the output of programs run for quality control purposes (e.g. fastQC, samtools flagstat).

Scripts directory

It’s up to you if you want to store your scripts inside the scripts directory or with the output files that were generated. Just make sure you document all the code that was run somewhere sensible.

Software directory

If you have any additional software you compiled specifically for your project, you can store them here.

README files

README files are plain-text files where you should write descriptions of what the directory contains, what analysis was done, why certain parameters were chosen, what results were found, etc. Place a README file in any directory you feel could use one. Documenting your work clearly is good practice and often pays dividends in the future.