Introduction to RAD-Seq

Note: Currently, this tutorial is written to be performed on the mozzie server. In the future, this will be expanded so it can be done on any linux server.

This tutorial goes through the steps of setting up a project directory, demultiplexing RAD-Seq data, aligning RAD-seq samples to a reference genome, building the loci catalogue and calling SNPs with Stacks, and generating a PCA plot of the SNP data.


Pre-requisites

Before beginning this tutorial, you should be familiar with RAD-seq for SNP discovery. If you’re not familiar with RAD-seq, we recommend you read this review paper to help you get started.

This tutorial also assumes you’re familiar with Unix and running commands on the command line interface. If you’re using a shared cluster, you should also be familiar with running jobs through SLURM.

The last section also assumes some basic knowledge of R.


Learning objectives


The data

The data this tutorial uses is from Aedes Aegypti mosquitos collected from Indonesia and Malaysia. The samples were prepared following a protocol adapted from the methods described here. The restriction enzymes used for the double digest were nlaIII and mluCI.

To save time, we’ll be using a subset of the full data in this tutorial. Only reads belonging to Aedes Aegypti Chr1:1-1000000 have been included in the sequencing data.

The size of the tutorial data is around 500 MB. In real situations, you would expect libraries to be in the order of 10s of gigabytes, and have multiple libraries in a project.

Setup the project directory

Let’s begin by setting up a directory for your project. Good project organisation will make your life easier in the long run, so spend some time at the outset to do it properly. A few months after your initial analysis, you may not remember what you did or what your files contain. Detailed documentation and a sensible directory structure can help with this.

Go to the directory where you wish to store this project, and create a new directory for this tutorial. An example project name could be rad_seq_tutorial.

cd /path/to/sensible/project/directory
mkdir rad_seq_tutorial
cd rad_seq_tutorial

# Store the tutorial path in a variable so we can refer to it later
TUTORIAL_DIR=$(pwd)

In this directory, we’ll create some new directories where we’ll store our data and results.

mkdir raw_data demuxed_seq results scripts software

We’ll also create a few sub-directories in results where we’ll store output from different programs that we’ll run. These will be stored in results_2020-02-01 (change the date to your current date). When performing bioinformatics, it’s not unusual to have to re-run analyses again sometime down the track (e.g. the addition of a new sequencing library, changing of parameters for tools, fixing mistakes). Including the date in the directory name is an easy way to identify different sets of analyses in the future.

cd results
mkdir results_2020-02-01
cd results_2020-02-01
mkdir alignments gstacks populations

Lastly, create a README.txt text file with a brief description of what the project directory is about using your favourite text editor. For example:

# RAD-seq Tutorial

This directory contains analysis of RAD-seq data from Indonesia and Malaysia 
from the Intro to RAD-Seq tutorial.

Your project directory should look something like this:

rad_seq_tutorial/
├── demuxed_seq/
├── raw_data/
├── results/
│   ├── results_2020-02-01/
│   │   ├── alignments/
│   │   ├── gstacks/
│   │   └── populations/
├── scripts/
├── software/
└── README.txt

Get the data

If you’re using the mozzie server, copy the files in /mnt/galaxy/shared/tutorial_data/rad_seq_workshop/ into the raw_data directory of your tutorial project directory.

cd $TUTORIAL_DIR
cd raw_data
cp -r /mnt/galaxy/shared/tutorial_data/rad_seq_workshop/* .

Note that our library data is stored in another sub-directory, tute_library. It’s common to have multiple libraries for a single project, therefore a good idea to have a separate sub-directories for each of your libraries.

In the library directory, the paired end sequencing data is in the two FASTQ files:

Also included is:

There are also two files called popmap_01.txt and popmap_02.txt that we’ll discuss later.

Your raw data directory should now look like this:

raw_data/
├── tute_library/
│   ├── barcodes.txt
│   ├── tute_library_R1.fastq.gz
│   └── tute_library_R2.fastq.gz
├── popmap_01.txt
└── popmap_02.txt

Demultiplexing

Examine the data

Multiple samples are sequenced in a single RAD-seq library. Each sample has a unique combination of barcodes that we can use to separate which reads belong to each sample. The program that we’ll use, process_radtags, requires this information to be in a specific form.

Navigate to the raw_data/tute_library directory. Take a look at the barcodes.txt file to see an example of a barcode file.

cat barcodes.txt
CTAGTC	GTACTG	indonesia_01
AGCTGA	GTACTG	indonesia_02
TCGACT	GTACTG	indonesia_03
ATGCTA	GTACTG	indonesia_04
GCATAG	GTACTG	indonesia_05
TGCACA	CATGAC	malaysia_01
CATGTG	CATGAC	malaysia_02
GTACAC	TGCAGT	malaysia_03
ACGTGT	TGCAGT	malaysia_04
AGCTCT	TGCAGT	malaysia_05
GATCTC	TGCAGT	unknown_01
AGCTGA	ACGTCA	unknown_02

The barcode file is a tab-separated file where the first column corresponds to the p1 barcode (the barcode that will be found at the beginning of the read), the second column corresponds to the p2 barcode (the barcode that will be found a the end of the read), and the third column represents the sample name.

From this file, we can see there are 12 samples we’d like to demultiplex from the library: five samples from Indonesia, five from Malaysia, and two samples that have not been labelled with their location of origin.

Let’s now have a look at the raw FASTQ data. If you’re unfamiliar with the FASTQ format, you can read up on it here.

Use less to view the R1 reads (i.e. the forward reads of this library):

less tute_library_R1.fastq.gz
# press 'q' to quit when you're done

Look at the first read of the file which makes up the first four lines:

@SN7001291:345:HFKWVBCXX:2:1101:1829:1987 1:N:0:2
ATCGCTCATGCCAGCTCAAAATAATCAACTTCGAGGTGCACCAATCCTAGCGCTATCAGACCTCCCTCGCTATGAGACGGCTGCGCATTTTCATTCCCTTG
+
GGGGGIIIIIIIIIIIIIIIIIIIIIIIIIIIIGIIIIIIIIIIIIIIIGGGGIIIIIIIGIIIIIIIIIIIIIIIIIIIIIGIGIGGGIIIIIIIIIGGG

Let’s look at the second line (i.e. the sequence) more closely:

ATCGCTCATGCCAGCTCAAAATAATCAACTTCGAGGTGCACCAATCCTAGCGCTATCAGACCTCCCTCGCTATGAGACGGCTGCGCATTTTCATTCCCTTG

The first six bases (e.g. ATCGCT) of each read in this file correspond to the p1 barcode sequence. The next four bases are CATG which corresponds to the restriction site where nlaIII cleaves.

5'-   C A T G | -3'
3'- | G T A C   -5'

All reads in the R1 file should have this restriction site sequence immediately following the barcode sequence.

Now, look at the R2 file (i.e. the reverse reads):

less tute_library_R2.fastq.gz
# press 'q' to quit when you're done

The first read in this file is

@SN7001291:345:HFKWVBCXX:2:1101:1829:1987 2:N:0:2
CATGACAATTACCCTTTGTCAAGAAGGCGAAGATGTCTCCCTTTAGGTGGTAAGTCAACTTACCAGAAAGAAGTTGAGCAATGTGGAAAAATCCACCATTC
+
GGIGGIGGIGIGIGGIIIIIIIIIIGIIGIGIIIGIIGGGIIGGGGGAGGGIIIIIIIIIIGIIIIIIIIIGIIIIIIIIIIIGIIIIIIIIIIIIGGGGG

Like the R1 file, the first six bases (CATGAC) correspond to the p2 barcode sequence. Since the first read of the R1 file and the first read of the R2 file are mate pairs (i.e. they were sequenced from the same DNA fragment), we can use both the p1 and p2 barcodes to determine what sample the sequence belongs to.

The next four bases are AATT which corresponds to the restriction site were mluCI cleaves.

5'- | A A T T   -3'
3'-   T T A A | -5'

These two reads are both sequenced from the same DNA fragment.

Question:
For a specific set of paired-end reads, the p1 barcode is observed to be ATGCTA and the p2 barcode is observed to be GTACTG. What sample does the read belong to?
Answer:
indonesia_04

Run process_radtags

The Stacks suite has a program called process_radtags that demultiplexes RAD-seq data. We’ll run it on our tutorial data and provide it the barcode file like so:

cd $TUTORIAL_DIR

process_radtags \
    -1 raw_data/tute_library/tute_library_R1.fastq.gz \
    -2 raw_data/tute_library/tute_library_R2.fastq.gz \
    -o demuxed_seq \
    -b raw_data/tute_library/barcodes.txt \
    --renz_1 nlaIII \
    --renz_2 mluCI \
    --paired \
    -i gzfastq \
    --inline_inline \
    -c -q -r -t 80 -s 20

Generally, if you’re running commands on a shared server, you should be using SLURM to submit your jobs. However, this tutorial data only takes a few seconds to process, so if you’re using the mozzie server, you can just run it on the master node without SLURM. With real data, jobs can take hours to run, so make sure you don’t run large jobs on the master node.

process_radtags produces four files for each sample. For example, for indonesia_01:

Reads can be orphaned when the mate pair’s read restriction enzyme site doesn’t match or the read is dropped due to low quality.

Question:
Have a look at the log file in process_radtags. How many reads matched the barcode for the indonesia_01 sample? After filtering those without a matching restriction site and low quality, how many reads have been retained for indonesia_01?
Answer:
258052 total sequences matched indonesia_01. 229479 were retained after removing reads with restriction site not matching (23282) and of low quality (5291).

Question:
In the process_radtags command you ran, what does the option -t 80 and -s 20 do? (Hint: check the manual)
Answer:
The “-t 80” option truncates the read to a final length of 80 bp. The “-s 20” option tells the program to discard the read due to low quality if the average quality score drops below 20 in a sliding window (whose size is set by the “-w” option and has a default of 0.15).

In this tutorial, we have trimmed reads to 80 bp, as our sequence reads are 100 bp. If you’re using new sequencing data, your reads may be 150 bp reads, and you should trim these to a longer length such as 130 bp or 140 bp. If you don’t know what length your sequence reads are, make sure you find out before progressing further in this analysis. One method to do this is to run FastQC on your raw FASTQ files.


Alignment

The reference genome

Next we’ll align the demultiplexed reads to the reference Aedes Aegypti reference genome using bowtie2. First, have a look at the reference assembly FASTA file. On the mozzie server:

cd /mnt/galaxy/shared/genomes/GCF_002204515.2_AaegL5.0
less GCF_002204515.2_AaegL5.0_genomic.fna
# press 'q' to quit when you're done

This large file contains the 1.9 gigabases that make up the Aedes Aegypti genome. Most of the bases have been assembled into three chromosomes: NC_035107.1 (chr1), NC_035108.1 (chr2), and NC_035109.1 (chr3). Since the assembly is imperfect, there are also thousands of smaller scaffolds that have yet to be placed to a chromosome.

If you’re working on the mozzie server, the genome has already been indexed and is ready to be used for alignment with bowtie2. If you’re working on another server, you’ll need to download the genomic FASTA file for GenBank assembly GCF_002204515.2 yourself. You’ll also need to build a Bowtie index for the reference genome using bowtie2-build.

Bowtie2

Alignment is the process of mapping reads from your FASTQ files to a location on the reference genome. Bowtie2 is a commonly used aligner that can quickly align short reads to a reference. We’ll now align the demultiplexed reads to the Aegypti reference genome we viewed earlier. For indonesia_01:

cd $TUTORIAL_DIR/results/results_2020-02-01

bowtie2 \
  -p 2 \
  -x /mnt/galaxy/shared/genomes/GCF_002204515.2_AaegL5.0/GCF_002204515.2_AaegL5.0_genomic \
  --very-sensitive \
  -1 ../../demuxed_seq/indonesia_01.1.fq.gz \
  -2 ../../demuxed_seq/indonesia_01.2.fq.gz \
  2> alignments/indonesia_01.bt2.log \
  | samtools view -bS \
  | samtools sort -o alignments/indonesia_01.bam

We’ve chosen the --very-sensitive preset option which is through in trying to align reads to the reference genome, but takes a longer time than other presets. The SAM output from bowtie2 is piped into samtools to convert it to BAM, and then piped again to sort the BAM file. If you’re unfamiliar with the SAM/BAM format, you can read up on it here.

Take a look at the BAM file:

samtools view alignments/indonesia_01.bam | less
# press q to quit when you're done

If you want to have a more in-depth look at the options that bowtie2 has, you can view the help message with the -h flag:

bowtie2 -h

or have a look at the bowtie2 manual.

We can also write a loop to process all samples sequentially. For example, the code below uses the barcode file’s to extract the sample names in the library, and then runs bowtie2 on the sample’s demultiplexed FASTQ file.

while read line; do
    sample=$(echo $line | cut -d " " -f 3)
    echo Aligning sample: $sample
    bowtie2 \
        -p 2 \
        -x /mnt/galaxy/shared/genomes/GCF_002204515.2_AaegL5.0/GCF_002204515.2_AaegL5.0_genomic \
        --very-sensitive \
        -1 ../../demuxed_seq/${sample}.1.fq.gz \
        -2 ../../demuxed_seq/${sample}.2.fq.gz \
        2> alignments/${sample}.bt2.log \
        | samtools view -bS \
        | samtools sort -o alignments/${sample}.bam
done < ../../raw_data/tute_library/barcodes.txt

Question:
Take a look at the log file malaysia_02.bt2.log. What is the overall alignment rate for the sample? Given what you know about this tutorial data, why might this alignment rate be higher than normal?
Answer:
The observed overall alignment rate of this sample is 93.84%. This alignment rate is not representative of a real RAD-seq experiment because the tutorial data is made up of reads that were selected because they were known to align to a defined part of this genome. If we were working with the full set of data, the alignment rate would be reduced.


Stacks

Stacks is a suite of software tools that can processes RAD-seq data, call SNPs, and calculate some popgen statistics. In reference-based RAD-seq analysis, we’ll be using the programs gstacks and populations.

Run gstacks

The next step is to build a catalogue of RAD-tags using gstacks. You can read more about the program and its options here.

There are two valid ways you can define which BAM files you input into gstacks: using a popmap file or listing each bam file in the command. We’ll be using the latter. Here we’ll use the -M argument to give gstacks a popmap file and the -I argument to give it our alignments directory.

The popmap file is a simple tab-delimited file that defines what population each sample is from. The populations program that we’ll run immediately after gstacks can use this information to calculate common population genetics statistics such as FST and nucleotide diversity.

In some cases, you won’t be interested in the statistics calculated by populations (since you’ll be doing your own analyses downstream). If this is the case, you can just define all samples as belonging in the same group. Have a look at the popmap file included in the raw_data directory:

cd $TUTORIAL_DIR/raw_data/
cat popmap_01.txt

In this directory, there’s also another popmap file called popmap_02.txt. You can use this popmap file if you want to define groups in downstream Stacks analysis.

Question:
Have a look at the popmap_02.txt file. How many unique populations are there?
Answer:
There are three unique populations in the file: “indonesia”, “malaysia”, and “unknown”.

Now go back to the results directory and run gstacks to create the catalogue. We’ll use the popmap_01.txt file here, but note that gstacks doesn’t use the population information, only populations does.

cd $TUTORIAL_DIR/results/results_2020-02-01
gstacks \
    -I alignments \
    -M ../../raw_data/popmap_01.txt \
    -O gstacks \
    --min-mapq 20 \
    -t 4

Have a look at the catalog.fa.gz output from gstacks using less.

catalog.fa.gz is a FASTA file that contains the RAD tags that have been detected in the given RAD-seq samples. The header of each sequence contains the position of where the RAD-tag is in the reference genome (e.g. pos=CM008043.1:9352:-) and the number of samples the RAD-tag was found (e.g. NS=1).

catalog.calls is a binary file that contains the genotyping data that populations will use for calling SNPs.

Run populations

Populations can do a number of calculations and statistical tests, but for this tutorial, we’ll only be using it to output a VCF file of our SNP calls. You can read up on populations here.

When we run populations, we’ll be writing to a sub-directory in the populations results directory. It can be useful to have multiple directories for populations as re-running it with different parameters, samples, etc. is common.

Run populations requiring no missing data for any SNPs:

mkdir populations/pop_01
populations \
    -P gstacks \
    -M ../../raw_data/popmap_01.txt \
    -O populations/pop_01 \
    -t 1 \
    -p 1 \
    -r 1 \
    --write_single_snp \
    --vcf

Have a look at the VCF file. If you’re not familiar with the VCF format, you can read up on it here.

Question:
Approximately many SNPs are there in the VCF file?
Answer:
There are 987 SNPs in the VCF file.

Question:
For the first SNP in the VCF file (chromosome NC_035107.1, position 119501), what is the genotype call for indonesia_01? What nucleotide bases would be observed?
Answer:
For indonesia_01, the GT field is “0/1” which means the sample is heterozygous at that position. Since the REF base is C and the ALT base is A, the genotype would be C/A.

Re-run populations in a new directory (pop_02) using less stringent parameters. For this example, we’ll use the popmap_02.txt file to define populations and require SNPs to be called in at least 80% of individuals (-r 0.8) in at least two of the three populations (-p 2). We’ll also include the --fstats flag to enable SNP-based F-statistics.

mkdir populations/pop_02
populations \
    -P gstacks \
    -M ../../raw_data/popmap_02.txt \
    -O populations/pop_02 \
    -t 1 \
    -p 2 \
    -r 0.8 \
    --write-single-snp \
    --vcf \
    --fstats

Since we defined populations and use the --fstats option, we now are given pairwise estimates of FST in our output. Go to the output directory and view some of these files. The files beginning with populations.fst_* contain the SNP-level measures of FST between two populations and populations.fst_summary.tsv contains the pairwise FSTs for the populations.

Additionally, with our less stringent parameters, the new VCF file now has more than twice the number of SNPs included, but some samples have missing genotype calls (i.e. ./.).

Question:
What does the --write-single-snp option do in populations? Why might you want to use this option when you run populations? (Hint: look at the help page or manual)
Answer:
The “--write-single-snp” restricts analysis to only use the first SNP on the RAD tag. Often you want your SNPs to be independent observations (i.e. not under linkage disequilibrium). Using the “--write-single-snp” or “--write-random-snp” flag removes SNPs which are located on the same RAD tag, and therefore very close together in the genome.


Downstream analysis

Generally, the VCF files from populations will form the basis of your downstream analysis of population genetics.

Here, we’ll go through a toy example of using the SNP data to produce a PCA plot and identifying which populations the two unknown samples belong to, but for serious popgen analysis, you’ll need to turn to software packages like adegenet, vegan, and pegas.

We’ll first convert the VCF file into a “012” matrix using VCFtools. This format represents homozygous REF calls as “0”, heterozygous calls as “1” and homozygous ALT calls as “2”.

# Go to the pop_01 directory of populations
cd $TUTORIAL_DIR/results/results_2020-02-01/populations/pop_01

vcftools \
    --vcf populations.snps.vcf \
    --012 \
    --out populations.snps

Have a look at the output files produced by VCFtools.

The next step is to load the files into R. If you’re on the mozzie server, you can just use the RStudio server provided. If you’re not using the mozzie server, download the populations.snps.012 and populations.snps.012.indv files to your local computer.

In R, this code generates a simple PCA plot:

# Load libraries
library(ggplot2)
library(stringr)

# Load 012 files (change the paths)
gt_matrix <- read.table(file="/path/to/rad_seq_tutorial/results/results_2020-02-01/populations/pop_01/populations.snps.012",
                        row.names = 1, header=FALSE)
sample_names <- read.table(file="/path/to/rad_seq_tutorial/results/results_2020-02-01/populations/pop_01/populations.snps.012.indv",
                           header=FALSE)

# View the first bit of the genotype matrix
gt_matrix[1:5,1:5]

# Create a dataframe containing sample info
sample_info <- data.frame(
  sample_id=sample_names[,1],
  group=str_remove(sample_names[,1], "_\\d\\d$"),
  stringsAsFactors=FALSE)
sample_info

# PCA plot
pca <- prcomp(gt_matrix)$x[,1:4]
pca <- cbind(pca, sample_info)
ggplot(pca, aes(x=PC1, y=PC2, color=group)) +
  geom_point()

Question:
There are two samples from unknown origin. After looking at the PCA plot, what country do you think they originate from?
Answer:
The two unknown samples both cluster more closely to the Malaysian samples, so it is likely they are from Malaysia.