Holobiont Integration ITS2 Pipeline 2022

Holobiont Integration ITS2 Bioinformatics pipeline - Andromeda version

I’m re-doing the pipeline for this so that when we publish this is reproducible. The first time around was done on my local computer in 2019.

Original pipeline on local computer, includes background information on ITS2, NG Sequencing, SymPortal: https://github.com/hputnam/Acclim_Dynamics/blob/master/ITS2_seq/ITS2_bioinformatic_pipeline.md.

This ITS2 on the cluster script edited from K. Wong ITS2 andromeda pipeline and E. Strand ITS2 andromeda pipepline

Set-up

File location

Original on andromeda:

/data/putnamlab/KITT/hputnam/ITS2/FULL_ITS2.

Copy to my folder on andromeda. I copy these files instead of using the originals in case I make an error in coding and don’t accidentally change the original sequence file.

cp -r /data/putnamlab/KITT/hputnam/ITS2/FULL_ITS2/ /data/putnamlab/estrand/HoloInt_ITS2

File path to reference for the below scripts:

/data/putnamlab/estrand/HoloInt_ITS2/FULL_ITS2.

copy metadata file from local computer to andromeda

Outside of andromeda.

scp ~/MyProjects/Acclim_Dynamics/ITS2_seq/R_Input/ITS2_Full.csv emma_strand@ssh3.hac.uri.edu:/data/putnamlab/estrand/HoloInt_ITS2/HoloInt_ITS2_metadata.csv

Scripts for analysis

See E. Strand’s Setting Up Andromeda and Conda Environment, Creating a reference database, and Testing installation and reference databases sections on how to get SymPortal installed.

For Holobiont Integration project - SymPortal and SP reference database downloaded January 18th, 2022. (Include this info in paper).

Loading SymPortal and metadata

nano symportal_load.sh. This takes 8+ hours.

#!/bin/bash
#SBATCH -t 48:00:00 #I ran this on 24:00:00 but just in case this takes longer b/c more files, use 48 next time
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=emma_strand@uri.edu #your email to send notifications
#SBATCH --account=putnamlab                  
#SBATCH --error="script_error_symportal_load" #if your job fails, the error report will be put in this file
#SBATCH --output="output_script_symportal_load" #once your job is completed, any final job report comments will be put in this file

source /usr/share/Modules/init/sh # load the module function

module load Miniconda3/4.9.2

eval "$(conda shell.bash hook)"
conda activate symportal_env

module load SymPortal/0.3.21-foss-2020b
module unload SymPortal/0.3.21-foss-2020b

export PYTHONPATH=/data/putnamlab/estrand/SymPortal/:/data/putnamlab/estrand/SymPortal/lib/python3.7/site-packages:$PYTHONPATH

export PATH=/data/putnamlab/estrand/SymPortal/:/data/putnamlab/estrand/SymPortal/bin:$PATH

main.py --load /data/putnamlab/estrand/HoloInt_ITS2/FULL_ITS2 \
--name HoloInt1 \
--num_proc $SLURM_CPUS_ON_NODE \
--data_sheet /data/putnamlab/estrand/HoloInt_ITS2/HoloInt_ITS2_metadata.csv

less output_script_symportal_load:

# one line per file: only printing sample HP101 here to view
Lat and long are currently nan for HP101. Values will be set to 999

Creating data_set_sample objects
^MCreating data_set_sample HP101

Performing initial mothur QC
HP101: QC started
HP101: data_set_sample_instance_in_q.num_contigs = 63180
HP101: starting fwd PCR. This may take some time.
HP101: starting rev PCR. This may take some time.
HP101: data_set_sample_instance_in_q.post_qc_unique_num_seqs = 278
HP101: data_set_sample_instance_in_q.post_qc_absolute_num_seqs = 19738
HP101: Initial mothur complete

Performing potential sym tax screening QC
HP101: verifying seqs are Symbiodinium and determining clade
HP101: BLAST complete
HP101: 0 sequences thrown out initially due to being too divergent from reference sequences

Performing sym non sym tax screening QC
HP101: 0 sequences thrown out initially due to being too divergent from reference sequences
HP101: non_sym_unique_num_seqs = 0
HP101: non_sym_absolute_num_seqs = 0
HP101: unique_num_sym_seqs = 278
HP101: absolute_num_sym_seqs = 19738
HP101: size_violation_absolute = 0
HP101: size_violation_unique = 0
HP101: pre-med QC complete

HP108: starting MED analysis
HP108: padding sequences
HP108: decomposing
HP108: MED analysis complete
HP108: starting MED analysis
HP108: padding sequences
HP108: decomposing
HP108: MED analysis complete

Creating DataSetSampleSequencePM objects
Populating the consolidated sequence to sample and abundance dictionaryfor pre-MED sequence processing
Processing pre-MED seqs for sample 1 of 255

255 out of 255 samples successfully passed QC.
0 samples produced errors

...

DATA LOADING COMPLETE
DataSet id: 6
DataSet name: HoloInt1
Loading completed in 52766.7060508728s
DataSet loading_complete_time_stamp: 20220706T055541

less script_error_symportal_load:

# series of warning for negative eigenvalues.. come back to this.

/opt/software/scikit-bio/0.5.5-foss-2020b/lib/python3.8/site-packages/skbio/stats/ordination/_principal_coordinate_analysis.py:143: Runt
imeWarning: The result contains negative eigenvalues. Please compare their magnitude with the magnitude of some of the largest positive
eigenvalues. If the negative ones are smaller, it's probably safe to ignore them, but if they are large in magnitude, the results won't
be useful. See the Notes section for more details. The smallest eigenvalue is -0.00932804128381559 and the largest is 5.293133466289359.

file outputs:

  • mothur log file for every sequence file

Moved all mothur output to a new folder for organization.

$ mkdir mothur_output
$ mv mothur* mothur_output

within SymPortal folder on andromeda

$ ls SymPortal

(base) [emma_strand@ssh3]/data/putnamlab/estrand/SymPortal% ls
bin               dbBackUp          django_general.py  lib          manage.py    pop_datasheet_seq_file_cols.py  reference_trees  sp_config.py        temp
data_analysis.py  db.sqlite3        easybuild          lib64        output.py    populate_db_ref_seqs.py         refSeqDB.fa      symbiodiniaceaeDB   tests
data_loading.py   distance.py       exceptions.py      LICENSE.txt  outputs      __pycache__                     seq_match.py     symportal_env.yml   virtual_objects.py
dbApp             distance.py.orig  general.py         main.py      plotting.py  README.md                       settings.py      symportal_utils.py

The output of each SymPortal run will go into outputs > analyses and be assigned a #.

Within my folder analyses #2-5 were run for my Bleaching Pairs project.

Running Analysis

nano symportal_analysis.sh. This takes ~10 minutes.

I used –analyse 6 because 2-5 in my analyses folder was from the Bleaching Pairs project so the next ID will be this one.

#!/bin/bash
#SBATCH -t 24:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=emma_strand@uri.edu #your email to send notifications
#SBATCH --account=putnamlab                  
#SBATCH --error="script_error_run_analysis" #if your job fails, the error report will be put in this file
#SBATCH --output="output_script_run_analysis" #once your job is completed, any final job report comments will be put in this file

source /usr/share/Modules/init/sh # load the module function

module load Miniconda3/4.9.2

eval "$(conda shell.bash hook)"
conda activate symportal_env

module load SymPortal/0.3.21-foss-2020b
module unload SymPortal/0.3.21-foss-2020b

export PYTHONPATH=/data/putnamlab/estrand/SymPortal/:/data/putnamlab/estrand/SymPortal/lib/python3.7/site-packages:$PYTHONPATH

export PATH=/data/putnamlab/estrand/SymPortal/:/data/putnamlab/estrand/SymPortal/bin:$PATH

# Checking dataset number
/data/putnamlab/estrand/SymPortal/main.py --display_data_sets

# Running analysis
/data/putnamlab/estrand/SymPortal/main.py --analyse 6 --name HoloInt1 --num_proc $SLURM_CPUS_ON_NODE

# Checking data analysis instances
/data/putnamlab/estrand/SymPortal/main.py --display_analyses

Analyses info (folder to be copied to local computer):

ANALYSIS COMPLETE: DataAnalysis: name: HoloInt1 UID: 6

DataSet analysis_complete_time_stamp: 20220706T085312

2: BleachingPairs_analysis 20220119T073111 3: BleachingPairs_analysis 20220119T074538 4: BleachingPairs_analysis 20220119T074937 5: BleachingPairs_analysis 20220119T075742 6: HoloInt1 20220706T084046

copying output to local computer for analysis in R

## in terminal window outside of andromeda
$ scp -r emma_strand@ssh3.hac.uri.edu:../../data/putnamlab/estrand/SymPortal/outputs/analyses/6/20220706T084046 ~/MyProjects/Acclim_Dynamics/ITS2_seq/SymPortal_Andromeda_2022_Output/  
Written on July 5, 2022