From SNPs to STRUCTURE (GBS-SNP-CROP -> Final Genetic Structure Plot)

1/26/2022

Hi fellow scientists!

I have finally pieced together multiple notebooks' worth of notes to write a complete STRUCTURE pipeline! This tutorial is a quick-start guide to generating a STRUCTURE barcode figure from population genetics data. Although the guide includes all the necessary steps, I recommend that you also read the user manual for each of the included programs. These manuals have additional information on the inputs, filters, and flags that may or may not be necessary for your own dataset.

The dataset I used for this pipeline consists of 3-10 individuals collected from each of approximately 20 populations. These individuals were sequenced using Genotyping-By-Sequencing (GBS) procedures, and the sequence data were processed and filtered using GBS-SNP-CROP and TASSEL. Also note that, because this post is a tutorial, I will include hyperlinks to the download/installation pages for each of these programs rather than their article citations.

Creating STRUCTURE (.str) files

TASSEL
Following filtering in GBS-SNP-CROP (step 7), export a TASSEL-compatible data file (step 8). Using TASSEL, complete any additional post-processing filtering steps (e.g., minor allele frequency, heterozygosity) and export the dataset as a .vcf file.

PGDSpider
Convert the .vcf file into a .str file using a program such as PGDSpider. Note: if you are using PGDSpider on macOS, you will need to launch it from the Terminal app; otherwise the program won't recognize your file tree. Using the cd command, navigate into the directory that contains the PGDSpider .jar file and run the following line (for more information, see the PGDSpider ReadMe file):

java -Xmx1024m -Xms512m -jar PGDSpider2.jar

If launching on Windows, run the .exe file normally. In PGDSpider, set the input format to VCF (the output of TASSEL) and the output format to STRUCTURE. Your input file will look something like this:

In the parser questions, the following should be set (of course adjust any settings to match your data):

What is the ploidy of the data? Diploid
Take most likely genotype if "PL" or "GL" is given in the genotype field? No
Do you want to exclude loci with only missing data? Yes (I'm not convinced this works, but it's worth setting)
Do you want to include non-polymorphic SNPs? Yes
Do you want to include a file with population definitions? No (this is something that doesn't work well in my experience, and it's easier and more reliable to assign manually, especially if the population ID is included in the individual ID. I highly recommend this naming scheme to keep your data clean and organized. For example, I use the naming scheme pop#.ind# in the dataset used here).

In the writer questions, the following should be set:

Save more specific fastSTRUCTURE format? No
Do you want to include inter-marker distances? No
Specify what data type should be included in the STRUCTURE file: SNP

Your output file should look something like this:

Using the text editor of your choice (I recommend Notepad++ on Windows or VS Code on macOS), update the population IDs in the second column of the .str output file. In a .str file, the first column is typically the individual ID (can contain letters, numbers, and symbols), the second column is the population ID (should be an integer), and each following column holds the genotype call at one SNP locus. Be sure not to change any of the formatting of the .str file.
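If the population number is already part of each individual ID, as in the pop#.ind# naming scheme mentioned earlier, the second column can also be filled in with a short script instead of by hand. The following is a minimal Python sketch, not part of the workflow described in this post. It assumes IDs such as pop12.ind3 (the first run of digits before the first dot is the population), a first row that holds only marker names, and whitespace-delimited columns. The function name, the ID pattern, and the example rows are assumptions made for illustration.

```python
import re

# Assumed ID layout: optional letters, the population number, then a dot,
# e.g. "pop12.ind3" or "12.3". Change this pattern to match your own IDs.
POP_IN_ID = re.compile(r"^\D*(\d+)\.")
# First field (individual ID), the delimiter after it, and the second field.
FIRST_TWO_FIELDS = re.compile(r"^(\S+)(\s+)(\S+)")


def set_population_ids(lines, has_marker_row=True):
    """Return a copy of `lines` with column 2 set from each individual ID."""
    updated = []
    for index, line in enumerate(lines):
        # Leave the marker-name row and any blank lines untouched.
        if (has_marker_row and index == 0) or not line.strip():
            updated.append(line)
            continue
        fields = FIRST_TWO_FIELDS.match(line)
        if fields is None:
            raise ValueError(f"line {index + 1}: expected at least two columns")
        pop = POP_IN_ID.match(fields.group(1))
        if pop is None:
            raise ValueError(f"line {index + 1}: no population number in the ID")
        pop_id = str(int(pop.group(1)))
        # Swap only the second field so the original delimiters are preserved.
        updated.append(
            fields.group(1) + fields.group(2) + pop_id + line[fields.end():]
        )
    return updated


# Made-up example: two diploid individuals (two rows each), two loci.
example = [
    "SNP_1\tSNP_2\n",
    "pop3.ind1\t1\t1\t2\n",
    "pop3.ind1\t1\t3\t2\n",
    "pop12.ind2\t1\t-9\t4\n",
    "pop12.ind2\t1\t-9\t4\n",
]
fixed = set_population_ids(example)
print("".join(fixed), end="")
```

To apply this to a real file, read it with open(path).readlines(), pass the list to set_population_ids, and write the result to a new file so the PGDSpider output stays untouched.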

Once this .str file is generated, you can use it as an input for adegenet or hierfstat, which are R packages that can calculate pairwise metrics between populations (e.g., FST, chord distance). If you would like to make a STRUCTURE plot, use the tutorial found below.
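The column layout described above also determines two numbers that the STRUCTURE run needs later: the number of individuals and the number of loci. As a rough cross-check, they can be counted directly from the file. This is a minimal Python sketch under the same assumptions as the layout above (one marker-name row, two leading columns for individual and population ID, one row per allele copy); the function name and the example rows are hypothetical.

```python
def str_dimensions(lines, ploidy=2, has_marker_row=True, leading_columns=2):
    """Count individuals and loci in a STRUCTURE-style table.

    Assumes every data row has the same number of whitespace-separated
    columns and that each individual occupies `ploidy` consecutive rows.
    """
    rows = [line.split() for line in lines if line.strip()]
    if has_marker_row:
        rows = rows[1:]
    if not rows:
        raise ValueError("no data rows found")
    widths = {len(row) for row in rows}
    if len(widths) != 1:
        raise ValueError(f"data rows have differing column counts: {sorted(widths)}")
    if len(rows) % ploidy != 0:
        raise ValueError("number of data rows is not a multiple of the ploidy")
    num_inds = len(rows) // ploidy
    num_loci = widths.pop() - leading_columns
    return num_inds, num_loci


# Made-up example: two diploid individuals, two loci.
example_rows = [
    "SNP_1\tSNP_2\n",
    "pop3.ind1\t3\t1\t2\n",
    "pop3.ind1\t3\t3\t2\n",
    "pop12.ind2\t12\t-9\t4\n",
    "pop12.ind2\t12\t-9\t4\n",
]
num_inds, num_loci = str_dimensions(example_rows)
print(num_inds, num_loci)
```

If the counts disagree with what you expect from your sample sheet, the file formatting has probably been disturbed during editing.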

Running STRUCTURE

The next step is to run the STRUCTURE program itself. While it is possible to run this locally on your computer, it's much easier to run STRUCTURE on a computing cluster, if you have access to one! An R package called ParallelStructure runs STRUCTURE jobs in parallel, which greatly reduces the run time. However, you will still need to have the base STRUCTURE program installed. You will need three files:

STR file: this is the output file produced by PGDSpider (and edited by you!).
A parallel jobs .txt file: this file can be a bit of a hassle, and to be honest, I was fortunate enough to inherit a version of this file from a previous grad student, which I just had to edit to match my own data. There is an example supplied in the ParallelStructure package, which can be viewed with data(structure_jobs), or see the image below the R script for the head of my own job list file. You will need to make sure this file matches your population IDs and that the order of populations (boxed in orange in the image below) matches your .str file.
R script: your .R file will contain the parallel_structure command to run, in addition to the path to your local STRUCTURE download. Your script should read something like this:

library(ParallelStructure)
setwd("[your/file/path]")
# Replace every [bracketed] placeholder with your own value.
# Argument names (outpath, printqhat) follow the package manual; confirm with ?parallel_structure.
parallel_structure(infile = "structure_file.str", outpath = "results/", joblist = "ParallelJobs.txt",
  n_cpu = [# requested cores for parallelization], structure_path = "path/to/structure/program/function/",
  numinds = [number of individuals], numloci = [number of loci],
  ploidy = [ploidy; how many rows per individual in str file. Typically 2 for diploid],
  label = [1 if the file has a column of individual labels], popdata = [1 if the file has a column of population data],
  markernames = [1 if the file has a row of loci names], missing = [missing data value; typically -9],
  locprior = 1, printqhat = 1)
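The job list can also be generated instead of edited by hand. The following is a minimal Python sketch that assumes the five whitespace-separated columns used in the package example (job ID, comma-separated population IDs, K, burn-in length, number of iterations); check that layout against data(structure_jobs) before relying on it. The job naming (T1, T2, ...) and the burn-in and iteration values are placeholders for illustration, not recommendations.

```python
def build_joblist(pop_ids, max_k, replicates, burnin, iterations):
    """Return one job row per (K, replicate) combination.

    Assumed column order: job ID, populations, K, burn-in, iterations.
    """
    populations = ",".join(str(pop) for pop in pop_ids)
    rows = []
    job_number = 1
    for k in range(1, max_k + 1):
        for _ in range(replicates):
            rows.append(f"T{job_number} {populations} {k} {burnin} {iterations}")
            job_number += 1
    return rows


# Small made-up example: three populations, K = 1 and 2, two replicates each.
jobs = build_joblist(pop_ids=[1, 2, 3], max_k=2, replicates=2,
                     burnin=1000, iterations=5000)
print("\n".join(jobs))
```

For a design like the one in this post, the call would use the 20 or so population IDs in the same order as the .str file, max_k=23, and replicates=10. The rows can then be written to ParallelJobs.txt, one per line.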

STRUCTURE will write a large number of files to your specified output path. These will be parsed into a more reader-friendly format in the following steps. Use the compress or zip function in Windows or macOS (or the zip command in Linux) to zip your results folder for Structure Harvester.
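If you are already working on a cluster, the zipping step can be scripted as well. This is a minimal Python sketch using only the standard library; the function name, folder name, and file name are hypothetical, and the demonstration runs on a throwaway directory rather than real STRUCTURE output.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path


def zip_results(results_dir, archive_stem):
    """Zip `results_dir` (folder included) into `<archive_stem>.zip`."""
    results_dir = Path(results_dir).resolve()
    if not results_dir.is_dir():
        raise NotADirectoryError(results_dir)
    archive = shutil.make_archive(
        str(archive_stem),
        "zip",
        root_dir=str(results_dir.parent),  # paths in the archive are relative to this
        base_dir=results_dir.name,         # only this folder is archived
    )
    return Path(archive)


# Demonstration on a temporary folder holding one placeholder file.
with tempfile.TemporaryDirectory() as scratch:
    results = Path(scratch) / "results"
    results.mkdir()
    (results / "example_run_f").write_text("placeholder\n")
    archive = zip_results(results, Path(scratch) / "results_for_harvester")
    with zipfile.ZipFile(archive) as zipped:
        names = zipped.namelist()
print(names)
```

On real data, point results_dir at the output folder you gave to ParallelStructure and upload the resulting .zip file.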

Parsing the STRUCTURE output

Structure Harvester
The program Structure Harvester helps you interpret the STRUCTURE output and provides an easy way to identify the ideal number of clusters. Navigate to the Structure Harvester webpage and upload your zipped results file. It’ll take a few minutes to run, and the page will automatically refresh with your results. At the top of the page, there will be a download button. Download the archive and extract it (tar -xvzf archive.tar.gz) for the next few steps.

The main take-aways from Structure Harvester are the plot of Delta K and the Evanno table. Both are part of the downloaded dataset and can also be seen in the web browser once Structure Harvester has processed the data. Scroll down to the Delta K plot, which should look something like this:

The optimal number of clusters for the data set is at the peak of this Delta K plot (see the arrow in the figure above). This data set has an optimum of 2 clusters, which is the smallest value the Delta K method can select (Delta K cannot be calculated for K = 1).
Another view of the cluster patterns for these data is the Evanno table farther down in the Structure Harvester output. It should look something like the table below. Note that I tested from 1 to 23 clusters (specified in the job list text file described above), and 2 clusters was still the best option.
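To make the Delta K column less of a black box, here is a minimal Python sketch of the calculation in its commonly used mean-based form: Delta K = |L(K+1) - 2 L(K) + L(K-1)| / sd(L(K)), where L(K) is the mean log probability of the data across replicate runs at K and sd is the standard deviation across those runs. The log-likelihood values below are invented for illustration, and the sketch uses the sample standard deviation, so small differences from the Structure Harvester table are possible.

```python
from statistics import mean, stdev


def delta_k(lnp_by_k):
    """Compute Delta K from replicate LnP(D) values keyed by K.

    Delta K needs both neighbours (K - 1 and K + 1), so the smallest and
    largest K tested never receive a value.
    """
    means = {k: mean(values) for k, values in lnp_by_k.items()}
    result = {}
    for k in sorted(lnp_by_k):
        if k - 1 not in means or k + 1 not in means:
            continue
        spread = stdev(lnp_by_k[k])
        if spread == 0:
            continue  # undefined when all replicates are identical
        second_difference = abs(means[k + 1] - 2 * means[k] + means[k - 1])
        result[k] = second_difference / spread
    return result


# Invented LnP(D) values: three replicate runs for each K from 1 to 4.
runs = {
    1: [-5000.0, -5002.0, -4998.0],
    2: [-4000.0, -4001.0, -3999.0],
    3: [-3900.0, -3910.0, -3890.0],
    4: [-3850.0, -3870.0, -3830.0],
}
scores = delta_k(runs)
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

In this toy example the likelihood jumps sharply between K = 1 and K = 2 and then levels off, so the second difference, and therefore Delta K, is largest at K = 2.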

CLUMPP
CLUMPP is a program that aligns the cluster assignments from replicate STRUCTURE runs, since cluster labels can be swapped between runs, and combines them into a single set of membership coefficients. From the dataset downloaded from Structure Harvester, move the files corresponding to your optimal number of clusters into the folder containing the CLUMPP executable (I've had trouble getting the macOS executable to work, but the Windows one works just fine). For the dataset above, the ideal number of clusters was two, so I would move the files K2.indfile and K2.popfile. CLUMPP is run by editing the file named paramfile and then executing CLUMPP.exe. You will have to run this program twice: once for individual data and once for population data.

Population data: In a text editor (e.g., Notepad++, TextEdit, or VSCode), modify the following parameters:

DATATYPE: 1
INDFILE: K2.indfile (or whatever your optimal Structure Harvester output is)
POPFILE: K2.popfile
OUTFILE: [pop file prefix].outfile
MISCFILE: [pop file prefix].miscfile
K: # of clusters (identified in Structure Harvester)
C: # of populations
R: number of runs (probably 10)
M: 1 (FullSearch method)
W: 1 (TRUE; weight by the number of ind in each population)
S: 2 (setting to 2 basically just makes sure values are between 0 and 1)
[…] everything else can stay the same until you get to:
PERMUTED_DATAFILE: [pop file prefix].perm_datafile

Individual data: In a text editor, modify the following parameters in the paramfile:

DATATYPE: 0
INDFILE: K2.indfile
POPFILE: K2.popfile
OUTFILE: [ind file prefix].outfile
MISCFILE: [ind file prefix].miscfile
K: # of clusters (identified in Structure Harvester)
C: # of individuals
R: number of runs (probably 10)
M: 1 (FullSearch method)
W: 1 (TRUE; weight by the number of ind in each population)
S: 2 (setting to 2 basically just makes sure values are between 0 and 1)
[…] everything else can stay the same until you get to:
PERMUTED_DATAFILE: [ind file prefix].perm_datafile
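Because the two CLUMPP runs differ in only a handful of keywords, the edits can be scripted so that the two versions of paramfile stay consistent. This is a minimal Python sketch that assumes each setting sits on its own line as a keyword followed by its value and an optional comment; check that against your copy of paramfile. The template excerpt, the popK2 prefix, and the value of C are made up for illustration.

```python
import re

# A keyword at the start of a line, the gap after it, and the current value.
SETTING = re.compile(r"^(\s*)([A-Z_]+)(\s+)(\S+)")


def update_paramfile(text, settings):
    """Replace the value of each keyword in `settings`, keeping comments."""
    remaining = {key: str(value) for key, value in settings.items()}
    lines = []
    for line in text.splitlines(keepends=True):
        found = SETTING.match(line)
        if found and found.group(2) in remaining:
            value = remaining.pop(found.group(2))
            line = (found.group(1) + found.group(2) + found.group(3)
                    + value + line[found.end():])
        lines.append(line)
    if remaining:
        raise KeyError(f"keywords not found in paramfile: {sorted(remaining)}")
    return "".join(lines)


# Hypothetical excerpt standing in for the real paramfile.
template = (
    "DATATYPE 0   # 0 = individual data, 1 = population data\n"
    "INDFILE example.indfile\n"
    "POPFILE example.popfile\n"
    "OUTFILE example.outfile\n"
    "K 4\n"
    "C 100\n"
)
population_run = update_paramfile(template, {
    "DATATYPE": 1,
    "INDFILE": "K2.indfile",
    "POPFILE": "K2.popfile",
    "OUTFILE": "popK2.outfile",
    "K": 2,
    "C": 20,  # number of populations (made-up value)
})
print(population_run, end="")
```

Raising an error for keywords that are not found is deliberate: a misspelled keyword would otherwise leave the old value in place without any warning.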

STRUCTURE visualization

distruct
distruct is a program that makes pretty figures from STRUCTURE results, although it is not easy to use. distruct operates in a very similar way to CLUMPP: you edit a parameter file and run an executable. You will need a program that can open .ps images to view the output. Mac’s Preview application can open a .ps image by converting it to a .pdf; Windows will require an additional program to view these images (I know GIMP works, which is an open-source photo editing program, but it’s a bit clunky so there might be better options out there).

To run, copy the OUTFILEs for both the individual and population runs of CLUMPP into the distruct folder. Rename the extension on the individual file to .indivq (e.g., indK2.indivq) and the extension on the population file to .popq (e.g., popK2.popq).

The INDIVQ file should look something like this, with one line per individual. The number of columns to the right of the “:” symbol corresponds to the number of clusters (I have two clusters, so there are two columns to the right of the colon):

The POPQ file should look like this, with one line per population:

You will also need to make a .names file (e.g., popK2.names) with the names of your populations, if you’d like to include them in your figure:

And finally, you’ll need a .perm (popK2.perm) file with the specified color of each cluster. See the distruct documentation for color options:

Edit the drawparams file in a text editor. Include your .popq, .indivq, .names, and .perm files (note, I cannot get this program to add labels below the figure for whatever reason, but it will happily add them to the top of the figure. You may have to play around a little bit with the settings!). Make sure to also update the parameters for K, NUMPOPS, and NUMINDS to match your data.
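The values for K, NUMPOPS, and NUMINDS can be read off the input files rather than counted by eye. This is a minimal Python sketch that relies only on the layout described above: one line per individual in the .indivq file with the cluster columns to the right of the colon, and one line per population in the .popq file. The function name and the example lines are made up for illustration.

```python
def drawparams_dimensions(indivq_lines, popq_lines):
    """Derive K, NUMPOPS and NUMINDS from .indivq and .popq contents."""
    cluster_counts = set()
    num_inds = 0
    for line in indivq_lines:
        if not line.strip():
            continue
        if ":" not in line:
            raise ValueError(f"no ':' separator in line: {line!r}")
        # Membership coefficients are the columns to the right of the colon.
        coefficients = line.split(":", 1)[1].split()
        cluster_counts.add(len(coefficients))
        num_inds += 1
    if len(cluster_counts) != 1:
        raise ValueError(f"inconsistent cluster counts: {sorted(cluster_counts)}")
    num_pops = sum(1 for line in popq_lines if line.strip())
    return {"K": cluster_counts.pop(), "NUMPOPS": num_pops, "NUMINDS": num_inds}


# Made-up file contents: three individuals, two populations, two clusters.
indivq = [
    "1 1 (0) 1 : 0.950 0.050\n",
    "2 2 (0) 1 : 0.900 0.100\n",
    "3 3 (0) 2 : 0.100 0.900\n",
]
popq = [
    "1: 0.925 0.075 2\n",
    "2: 0.100 0.900 1\n",
]
dimensions = drawparams_dimensions(indivq, popq)
print(dimensions)
```

A mismatch between these numbers and the ones in drawparams is worth ruling out first if the figure comes out wrong or empty.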

The rest of the drawparams file defines the appearance of the figure itself. You may have to run this program multiple times and iterate through the settings until you find parameters that work for your dataset. I found the following to work well for my data:

PRINT_INDIVS: 1
PRINT_LABEL_ATOP: 1
PRINT_LABEL_BELOW: 0
PRINT_SEP: 1
FONTHEIGHT: 8
DIST_ABOVE: 5
DIST_BELOW: -7
BOXHEIGHT: 50
INDIVWIDTH: 1.5
ORIENTATION: 1
XORIGIN: 300
YORIGIN: 50
XSCALE: 2.5
YSCALE: 2.5
[…] defaults on the rest of the settings

I then cropped and rotated my figure with the following result! If any additional formatting is necessary that is not possible within distruct, I recommend a photo-editing software such as GIMP.

The final result:

Unreasonably Detailed Methods:
A Resource for Beginning and Self-Taught Bioinformaticians

Elizabeth Scott (Hendrickson)
