KBay Bleaching Pairs RNASeq Pipeline Analysis

Contents:

Project Details
Setting Up Andromeda
FastQC
Trim Reads
Align Trimmed Reads to Reference Genome
Assemble aligned reads and quantify transcripts
Create Gene Counts Matrix

KBay Bleaching Pairs project details

Project github: HI_Bleaching_Pairs
Molecular laboratory work spreadsheet: excel doc

40 adult coral biopsies of M. capitata used for molecular analysis from July 2019 and December 2019 time points. Laboratory work for this project found here.

Raw data path: ../../data/putnamlab/KITT/hputnam/20220203_BleachedPairs_RNASeq
Sym linked files in my working directory: ../../data/putnamlab/estrand/BleachingPairs_RNASeq

Script to download raw data found here in github repo.

A. Huffmyer first-pass analysis of these files: /data/putnamlab/ashuffmyer/pairs-rnaseq (done with Version 2 of genome)

M. capitata genome database: http://cyanophora.rutgers.edu/montipora/. I will use version 3 for this analysis.

References:

For detailed explanations of each step, refer to the below pipelines.

RNASeq

Erin Chille RNAseq pipeline: https://github.com/echille/Mcapitata_Developmental_Gene_Expression_Timeseries/blob/master/1-QC-Align-Assemble/mcap_rnaseq_analysis.md
Jill Ashey RNAseq pipeline: https://github.com/JillAshey/SedimentStress/blob/master/Bioinf/RNASeq_pipeline_HI.md
Danielle Becker pipeline: https://github.com/daniellembecker/DanielleBecker_Lab_Notebook/blob/master/_posts/2021-04-14-Molecular-Underpinnings-RNAseq-Workflow.md

TagSeq

Kevin Wong pipeline: https://github.com/kevinhwong1/Porites_Rim_Bleaching_2019/blob/master/scripts/TagSeq/TagSeq_Analysis_HPC.md

Function Annotation pipeline

See here for M. capitata genome version 3 functional annotation: https://github.com/emmastrand/EmmaStrand_Notebook/blob/master/_posts/2022-11-02-M.capitata-Genome-v3-Functional-Annotation.md.

Setting Up Andromeda

Sym link raw files (from HP path above) to raw_data folder within /BleachingPairs_RNASeq.

Make a new directory for scripts, fastqc_results

FastQC

fastqc.sh:

#!/bin/bash
#SBATCH -t 100:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=emma_strand@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/estrand/BleachingPairs_RNASeq                
#SBATCH --error="script_error_fastqc" #if your job fails, the error report will be put in this file
#SBATCH --output="output_script_fastqc" #once your job is completed, any final job report comments will be put in this file

source /usr/share/Modules/init/sh # load the module function (specifc to my computer)

module load FastQC/0.11.9-Java-11
module load MultiQC/1.9-intel-2020a-Python-3.8.2

for file in /data/putnamlab/KITT/hputnam/20220203_BleachedPairs_RNASeq/*fastq.gz
do
fastqc $file --outdir /data/putnamlab/estrand/BleachingPairs_RNASeq/fastqc_results/         
done

multiqc --interactive /data/putnamlab/estrand/BleachingPairs_RNASeq/fastqc_results -o /data/putnamlab/estrand/BleachingPairs_RNASeq/multiqc_results

Outside of Andromeda:

scp emma_strand@ssh3.hac.uri.edu:../../data/putnamlab/estrand/BleachingPairs_RNASeq/multiqc_results/multiqc_report.html /Users/emmastrand/MyProjects/HI_Bleaching_Timeseries/Dec-July-2019-analysis/output/RNASeq/initial_multiqc_report.html

fastqc results

All samples have sequences of a single length (101bp).

Trimming Reads

I use the program fastp, but there are others like Trimmomatic.

More info on Trimmomatic from Data Carpentry.
More info on fastp https://manpages.debian.org/testing/fastp/fastp.1.en.html

fastp_multiqc.sh:

Fill in the trimming parameters based on the initial multiqc report.

#!/bin/bash
#SBATCH -t 100:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/estrand/BleachingPairs_RNASeq/raw_data                
#SBATCH --error=../"%x_error.%j" #if your job fails, the error report will be put in this file
#SBATCH --output=../"%x_output.%j" #once your job is completed, any final job report comments will be put in this file

source /usr/share/Modules/init/sh # load the module function (specific to my computer)

# Load modules needed 
module load fastp/0.19.7-foss-2018b
module load FastQC/0.11.8-Java-1.8
module load MultiQC/1.9-intel-2020a-Python-3.8.2

# Make an array (list) of sequences to trim
# Needs to be in directory above (raw data directory)
array1=($(ls *.fastq.gz))

# fastp and fastqc loop 
for i in ${array1[@]}; do
    fastp --in1 ${i} \
        --in2 $(echo ${i}|sed s/_R1/_R2/)\
        --out1 /data/putnamlab/estrand/BleachingPairs_RNASeq/processed/trimmed.${i} \
        --out2 /data/putnamlab/estrand/BleachingPairs_RNASeq/processed/trimmed.$(echo ${i}|sed s/_R1/_R2/) \
        --qualified_quality_phred 20 \
        --unqualified_percent_limit 10 \
        --length_required 100 \
        --detect_adapter_for_pe \
        --cut_right cut_right_window_size 5 cut_right_mean_quality 20
    fastqc /data/putnamlab/estrand/BleachingPairs_RNASeq/processed/trimmed.${i}
done

echo "Read trimming of adapters complete." $(date)

# Quality Assessment of Trimmed Reads
cd /data/putnamlab/estrand/BleachingPairs_RNASeq/processed/ #The following command will be run in the /clean directory

# Compile MultiQC report from FastQC files 
multiqc --interactive ./  

echo "Cleaned MultiQC report generated." $(date)

Exporting this report (outside of Andromeda)

scp emma_strand@ssh3.hac.uri.edu:../../data/putnamlab/estrand/BleachingPairs_RNASeq/processed/multiqc_report.html /Users/emmastrand/MyProjects/HI_Bleaching_Timeseries/Dec-July-2019-analysis/output/RNASeq/processed_multiqc_report.html

Processed Multiqc report

80 samples had less than 1% of reads made up of overrepresented sequences

No samples found with any adapter contamination > 0.1%

Align Trimmed Reads to Reference Genome

I use the program HISAT2, but other pipelines use STAR.

Index the reference genome
Alignment of clean reads to the reference genome

Reference genome information

I’m using the newest version of the Mcap genome:

wget http://cyanophora.rutgers.edu/montipora/Montipora_capitata_HIv3.assembly.fasta.gz. gunzip this file wget http://cyanophora.rutgers.edu/montipora/Montipora_capitata_HIv3.genes.gff3.gz. gunzip this file

path: /data/putnamlab/estrand/Montipora_capitata_HIv3.assembly.fasta
path: /data/putnamlab/estrand/Montipora_capitata_HIv3.genes.gff3

Creating the reference genome part takes ~5-10 minutes.
Running the hisat2 and samtools parts took under 4 days (the align.sh script at the end of this doc took 2 weeks..)

align2.sh:

#!/bin/bash
#SBATCH -t 500:00:00
#SBATCH --export=NONE
#SBATCH --nodes=1 --ntasks-per-node=15
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/estrand/BleachingPairs_RNASeq/output2/                
#SBATCH --error=../"%x_error.%j" #if your job fails, the error report will be put in this file
#SBATCH --output=../"%x_output.%j" #once your job is completed, any final job report comments will be put in this file

source /usr/share/Modules/init/sh # load the module function (specific need to my computer)

# load modules needed
module load HISAT2/2.2.1-gompi-2021b
module load SAMtools/1.15.1-GCC-11.2.0

# Index the reference genome 
hisat2-build -f /data/putnamlab/estrand/Montipora_capitata_HIv3.assembly.fasta ./Mcap_ref
echo "Reference genome indexed. Starting alignment" $(date)

# Alignment of clean reads to the reference genome
array=($(ls /data/putnamlab/estrand/BleachingPairs_RNASeq/processed/*_R1_001.fastq.gz))

for i in ${array[@]}; do
    hisat2 -p 8 --rna-strandness RF --dta -q -x Mcap_ref -1 ${i} -2 $(echo ${i}|sed s/_R1/_R2/) -S ${i}.sam
    samtools sort -@ 8 -o ${i}.bam ${i}.sam
    	echo "${i} bam-ified!"
    rm ${i}.sam
done

OUTPUT SHOULD BE BOTH R1 AND R2 IN ONE SAM FILE!!

Assemble aligned reads and quantify transcripts

StringTie is a fast and highly efficient assembler of RNA-Seq alignments into potential transcripts.

Reference-guided assembly with novel transcript discovery
Merge output GTF files and assess the assembly performance
Compilation of GTF-files into gene and transcript count matrices

GFF File Fix

My original counts matrix output with names like STRG.6173 instead of the gene name.

Ariana had a solution to this outlined here: https://github.com/Putnam-Lab/Lab_Management/issues/51

Upload new GFF file

scp /Users/emmastrand/MyProjects/HI_Bleaching_Timeseries/Dec-July-2019-analysis/Montipora_capitata_HIv3.genes_fixed.gff3.gz emma_strand@ssh3.hac.uri.edu:../../data/putnamlab/estrand/

Ended up using assemble2.sh below (b/c of a formatting issue in my original gff)

assemble2.sh:

#!/bin/bash
#SBATCH -t 500:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=128GB
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/estrand/BleachingPairs_RNASeq/output2/                
#SBATCH --error=../"%x_error.%j" #if your job fails, the error report will be put in this file
#SBATCH --output=../"%x_output.%j" #once your job is completed, any final job report comments will be put in this file

source /usr/share/Modules/init/sh # load the module function (specific need to my computer)
module load StringTie/2.1.4-GCC-9.3.0

# Assemble and estimate reads
# array 1 ls is every file that ends with .bam in the output2 folder specified above 
array1=($(ls *.bam))

for i in ${array1[@]}; do
    stringtie -p 8 -e -B -G /data/putnamlab/estrand/Montipora_capitata_HIv3.genes_fixed.gff3 -o /data/putnamlab/estrand/BleachingPairs_RNASeq/output2/${i}.gtf /data/putnamlab/estrand/BleachingPairs_RNASeq/output2/${i}
done

# Merge stringTie gtf results

ls trimmed.*.gtf > mergelist.txt
cat mergelist.txt

stringtie --merge -p 8 -G /data/putnamlab/estrand/Montipora_capitata_HIv3.genes_fixed.gff3 -o stringtie_merged.gtf mergelist.txt

echo "Stringtie merge complete" $(date)

head mergelist.txt:

trimmed.16_S121_L003_R1_001.fastq.gz.bam.gtf
trimmed.17_S122_L003_R1_001.fastq.gz.bam.gtf
trimmed.21_S123_L003_R1_001.fastq.gz.bam.gtf
trimmed.22_S124_L003_R1_001.fastq.gz.bam.gtf
trimmed.23_S125_L003_R1_001.fastq.gz.bam.gtf
trimmed.24_S126_L003_R1_001.fastq.gz.bam.gtf
trimmed.25_S127_L003_R1_001.fastq.gz.bam.gtf
trimmed.26_S128_L003_R1_001.fastq.gz.bam.gtf
trimmed.28_S129_L003_R1_001.fastq.gz.bam.gtf

head stringtie_merged.gtf:

# stringtie --merge -p 8 -G /data/putnamlab/estrand/Montipora_capitata_HIv3.genes_fixed.gff3 -o stringtie_merged.gtf mergelist.txt
# StringTie version 2.1.4
Montipora_capitata_HIv3___Scaffold_1    StringTie       transcript      22732   36176   1000    -       .       gene_id "MSTRG.1"; transcript_id "Montipora_capitata_HIv3___RNAseq.g4581.t1"; ref_gene_id "Montipora_capitata_HIv3___RNAseq.g4581.t1"; 
Montipora_capitata_HIv3___Scaffold_1    StringTie       exon    22732   23002   1000    -       .       gene_id "MSTRG.1"; transcript_id "Montipora_capitata_HIv3___RNAseq.g4581.t1"; exon_number "1"; ref_gene_id "Montipora_capitata_HIv3___RNAseq.g4581.t1"; 
Montipora_capitata_HIv3___Scaffold_1    StringTie       exon    27868   27934   1000    -       .       gene_id "MSTRG.1"; transcript_id "Montipora_capitata_HIv3___RNAseq.g4581.t1"; exon_number "2"; ref_gene_id "Montipora_capitata_HIv3___RNAseq.g4581.t1"; 
Montipora_capitata_HIv3___Scaffold_1    StringTie       exon    29683   29722   1000    -       .       gene_id "MSTRG.1"; transcript_id "Montipora_capitata_HIv3___RNAseq.g4581.t1"; exon_number "3"; ref_gene_id "Montipora_capitata_HIv3___RNAseq.g4581.t1"; 
Montipora_capitata_HIv3___Scaffold_1    StringTie       exon    30150   30230   1000    -       .       gene_id "MSTRG.1"; transcript_id "Montipora_capitata_HIv3___RNAseq.g4581.t1"; exon_number "4"; ref_gene_id "Montipora_capitata_HIv3___RNAseq.g4581.t1"; 
Montipora_capitata_HIv3___Scaffold_1    StringTie       exon    31466   31705   1000    -       .       gene_id "MSTRG.1"; transcript_id "Montipora_capitata_HIv3___RNAseq.g4581.t1"; exon_number "5"; ref_gene_id "Montipora_capitata_HIv3___RNAseq.g4581.t1"; 
Montipora_capitata_HIv3___Scaffold_1    StringTie       exon    31890   31988   1000    -       .       gene_id "MSTRG.1"; transcript_id "Montipora_capitata_HIv3___RNAseq.g4581.t1"; exon_number "6"; ref_gene_id "Montipora_capitata_HIv3___RNAseq.g4581.t1"; 
Montipora_capitata_HIv3___Scaffold_1    StringTie       exon    33295   33367   1000    -       .       gene_id "MSTRG.1"; transcript_id "Montipora_capitata_HIv3___RNAseq.g4581.t1"; exon_number "7"; ref_gene_id "Montipora_capitata_HIv3___RNAseq.g4581.t1"; 

Create gene counts matrix

The StringTie program includes a script, prepDE.py that compiles your assembly files into gene and transcript count matrices. This script requires as input the list of sample names and their full file paths, sample_list.txt. This file will live in StringTie program directory.

Copy this file: cp /data/putnamlab/ashuffmyer/pairs-rnaseq/prepDE.py .

Ended up using genecount2.sh b/c of a formatting issue above

genecount2.sh:

#!/bin/bash
#SBATCH -t 60:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=128GB
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/estrand/BleachingPairs_RNASeq/output2/                
#SBATCH --error=../"%x_error.%j" #if your job fails, the error report will be put in this file
#SBATCH --output=../"%x_output.%j" #once your job is completed, any final job report comments will be put in this file

source /usr/share/Modules/init/sh # load the module function (specific need to my computer)
module load StringTie/2.1.4-GCC-9.3.0
module load Python/2.7.15-foss-2018b
module load GffCompare/0.12.1-GCCcore-8.3.0

# Assess the performance of the assembly
gffcompare -r /data/putnamlab/estrand/Montipora_capitata_HIv3.genes_fixed.gff3 -o merged stringtie_merged.gtf

echo "GFFcompare complete, Starting gene count matrix assembly..." $(date)

# Make gtf list text file

for filename in *.bam.gtf; do echo $filename $PWD/$filename; done > listGTF.txt

# Compile the gene count matrix

python ../scripts/prepDE.py -g ../KB_gene_count_matrix2.csv -i listGTF.txt
echo "Gene count matrix compiled." $(date)

head output2/listGTF.txt :

trimmed.16_S121_L003_R1_001.fastq.gz.bam.gtf /glfs/brick01/gv0/putnamlab/estrand/BleachingPairs_RNASeq/output2/trimmed.16_S121_L003_R1_001.fastq.gz.bam.gtf
trimmed.17_S122_L003_R1_001.fastq.gz.bam.gtf /glfs/brick01/gv0/putnamlab/estrand/BleachingPairs_RNASeq/output2/trimmed.17_S122_L003_R1_001.fastq.gz.bam.gtf
trimmed.21_S123_L003_R1_001.fastq.gz.bam.gtf /glfs/brick01/gv0/putnamlab/estrand/BleachingPairs_RNASeq/output2/trimmed.21_S123_L003_R1_001.fastq.gz.bam.gtf
trimmed.22_S124_L003_R1_001.fastq.gz.bam.gtf /glfs/brick01/gv0/putnamlab/estrand/BleachingPairs_RNASeq/output2/trimmed.22_S124_L003_R1_001.fastq.gz.bam.gtf
trimmed.23_S125_L003_R1_001.fastq.gz.bam.gtf /glfs/brick01/gv0/putnamlab/estrand/BleachingPairs_RNASeq/output2/trimmed.23_S125_L003_R1_001.fastq.gz.bam.gtf
trimmed.24_S126_L003_R1_001.fastq.gz.bam.gtf /glfs/brick01/gv0/putnamlab/estrand/BleachingPairs_RNASeq/output2/trimmed.24_S126_L003_R1_001.fastq.gz.bam.gtf
trimmed.25_S127_L003_R1_001.fastq.gz.bam.gtf /glfs/brick01/gv0/putnamlab/estrand/BleachingPairs_RNASeq/output2/trimmed.25_S127_L003_R1_001.fastq.gz.bam.gtf
trimmed.26_S128_L003_R1_001.fastq.gz.bam.gtf /glfs/brick01/gv0/putnamlab/estrand/BleachingPairs_RNASeq/output2/trimmed.26_S128_L003_R1_001.fastq.gz.bam.gtf
trimmed.28_S129_L003_R1_001.fastq.gz.bam.gtf /glfs/brick01/gv0/putnamlab/estrand/BleachingPairs_RNASeq/output2/trimmed.28_S129_L003_R1_001.fastq.gz.bam.gtf
trimmed.29_S130_L003_R1_001.fastq.gz.bam.gtf /glfs/brick01/gv0/putnamlab/estrand/BleachingPairs_RNASeq/output2/trimmed.29_S130_L003_R1_001.fastq.gz.bam.gtf

head KB_gene_count_matrix.csv:

gene_id,trimmed.16_S121_L003_R1_001.fastq.gz.bam.gtf,trimmed.17_S122_L003_R1_001.fastq.gz.bam.gtf,trimmed.21_S123_L003_R1_001.fastq.gz.bam.gtf,trimmed.22_S124_L003_R1_001.fastq.gz.bam.gtf,trimmed.23_S125_L003_R1_001.fastq.gz.bam.gtf,trimmed.24_S126_L003_R1_001.fastq.gz.bam.gtf,trimmed.25_S127_L003_R1_001.fastq.gz.bam.gtf,trimmed.26_S128_L003_R1_001.fastq.gz.bam.gtf,trimmed.28_S129_L003_R1_001.fastq.gz.bam.gtf,trimmed.29_S130_L003_R1_001.fastq.gz.bam.gtf,trimmed.2_S118_L003_R1_001.fastq.gz.bam.gtf,trimmed.30_S131_L003_R1_001.fastq.gz.bam.gtf,trimmed.31_S132_L003_R1_001.fastq.gz.bam.gtf,trimmed.33_S133_L003_R1_001.fastq.gz.bam.gtf,trimmed.37_S134_L003_R1_001.fastq.gz.bam.gtf,trimmed.39_S135_L003_R1_001.fastq.gz.bam.gtf,trimmed.42_S136_L003_R1_001.fastq.gz.bam.gtf,trimmed.43_S137_L003_R1_001.fastq.gz.bam.gtf,trimmed.45_S138_L003_R1_001.fastq.gz.bam.gtf,trimmed.46_S139_L003_R1_001.fastq.gz.bam.gtf,trimmed.47_S140_L003_R1_001.fastq.gz.bam.gtf,trimmed.4_S119_L003_R1_001.fastq.gz.bam.gtf,trimmed.50_S141_L003_R1_001.fastq.gz.bam.gtf,trimmed.51_S142_L003_R1_001.fastq.gz.bam.gtf,trimmed.52_S143_L003_R1_001.fastq.gz.bam.gtf,trimmed.54_S144_L003_R1_001.fastq.gz.bam.gtf,trimmed.55_S145_L003_R1_001.fastq.gz.bam.gtf,trimmed.56_S146_L003_R1_001.fastq.gz.bam.gtf,trimmed.57_S147_L003_R1_001.fastq.gz.bam.gtf,trimmed.59_S148_L003_R1_001.fastq.gz.bam.gtf,trimmed.60_S149_L003_R1_001.fastq.gz.bam.gtf,trimmed.61_S150_L003_R1_001.fastq.gz.bam.gtf,trimmed.62_S151_L003_R1_001.fastq.gz.bam.gtf,trimmed.63_S152_L003_R1_001.fastq.gz.bam.gtf,trimmed.64_S153_L003_R1_001.fastq.gz.bam.gtf,trimmed.65_S154_L003_R1_001.fastq.gz.bam.gtf,trimmed.66_S155_L003_R1_001.fastq.gz.bam.gtf,trimmed.67_S156_L003_R1_001.fastq.gz.bam.gtf,trimmed.68_S157_L003_R1_001.fastq.gz.bam.gtf,trimmed.6_S120_L003_R1_001.fastq.gz.bam.gtf
Montipora_capitata_HIv3___RNAseq.g42319.t1,1,0,0,6,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,2,0,0,0
Montipora_capitata_HIv3___TS.g49315.t1b,42,0,13,115,24,2,130,22,255,191,0,70,113,3,0,6,56,74,306,65,0,178,29,242,9,15,139,3,160,48,114,0,0,27,85,18,88,153,105,116
Montipora_capitata_HIv3___TS.g49315.t1a,325,13,104,1239,669,24,619,56,1441,1052,0,576,825,0,0,19,44,752,1816,1276,0,1018,71,2005,0,79,1047,0,1082,29,1140,0,22,729,1666,96,1287,905,2351,1874
Montipora_capitata_HIv3___RNAseq.g19176.t1,0,0,0,0,0,0,0,0,0,0,0,3,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,5,0,0,0,0,0,0,0,0,0
Montipora_capitata_HIv3___RNAseq.g42173.t1,0,0,6,6,0,0,6,0,2,3,4,0,2,1,3,0,2,1,2,4,6,0,0,2,2,0,3,2,3,3,4,1,4,0,2,2,6,3,2,0
Montipora_capitata_HIv3___TS.g35745.t1,14,8,10,13,11,19,10,17,17,6,0,7,12,16,36,16,12,28,18,9,33,12,7,38,8,9,9,8,6,3,7,17,8,16,19,17,14,1,8,0
Montipora_capitata_HIv3___RNAseq.g8204.t1,30,45,50,250,14,76,698,778,74,317,72,87,69,337,30,152,3428,45,514,122,136,192,217,595,222,158,290,1409,264,60,555,403,300,378,711,46,103,660,1393,51
Montipora_capitata_HIv3___RNAseq.g9577.t1,14,5,15,10,13,20,11,12,20,3,27,11,12,6,20,9,14,15,4,11,12,9,11,21,13,20,18,21,18,15,23,17,21,6,48,20,22,14,2,15
Montipora_capitata_HIv3___RNAseq.g23056.t1,532,606,1067,1139,601,631,655,661,932,676,511,559,642,774,675,459,446,298,472,307,411,618,1025,801,577,684,764,631,671,943,722,1012,405,362,667,741,1189,611,494,443

Error file:

reference transcripts loaded.
duplicate reference transcripts discarded.
query transfrags loaded.

This seems to have fixed the issue with gene naming.

copy gene counts matrix to computer (again)

scp emma_strand@ssh3.hac.uri.edu:../../data/putnamlab/estrand/BleachingPairs_RNASeq/KB_gene_count_matrix2.csv /Users/emmastrand/MyProjects/HI_Bleaching_Timeseries/Dec-July-2019-analysis/output/RNASeq/KB_gene_count_matrix2.csv

less merged.stats:

#= Summary for dataset: stringtie_merged.gtf 
#     Query mRNAs :   54384 in   54185 loci  (36023 multi-exon transcripts)
#            (141 multi-transcript loci, ~1.0 transcripts per locus)
# Reference mRNAs :   54382 in   54185 loci  (36023 multi-exon)
# Super-loci w/ reference transcripts:    54185
#-----------------| Sensitivity | Precision  |
        Base level:   100.0     |   100.0    |
        Exon level:   100.0     |   100.0    |
      Intron level:   100.0     |   100.0    |
Intron chain level:   100.0     |   100.0    |
  Transcript level:   100.0     |   100.0    |
       Locus level:   100.0     |   100.0    |

     Matching intron chains:   36023
       Matching transcripts:   54361
              Matching loci:   54182

          Missed exons:       0/256029  (  0.0%)
           Novel exons:       0/256026  (  0.0%)
        Missed introns:       0/201643  (  0.0%)
         Novel introns:       0/201643  (  0.0%)
           Missed loci:       0/54185   (  0.0%)
            Novel loci:       0/54185   (  0.0%)

 Total union super-loci across all input datasets: 54185 
54384 out of 54384 consensus transcripts written in merged.annotated.gtf (0 discarded as redundant)

Previous versions of scripts

This took two weeks.. seems long but all worked. Later I found out that the below align.sh actually produced a bam file for both R1 and R2. ah!

align.sh:

#!/bin/bash
#SBATCH -t 500:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/estrand/BleachingPairs_RNASeq/output/                
#SBATCH --error=../"%x_error.%j" #if your job fails, the error report will be put in this file
#SBATCH --output=../"%x_output.%j" #once your job is completed, any final job report comments will be put in this file

source /usr/share/Modules/init/sh # load the module function (specific need to my computer)

# load modules needed
module load HISAT2/2.2.1-foss-2019b
module load SAMtools/1.9-foss-2018b

# Index the reference genome 
hisat2-build -f /data/putnamlab/estrand/Montipora_capitata_HIv3.assembly.fasta ./Mcap_ref
echo "Reference genome indexed. Starting alignment" $(date)

# Alignment of clean reads to the reference genome
array=($(ls /data/putnamlab/estrand/BleachingPairs_RNASeq/processed/*.fastq.gz))

for i in ${array[@]}; do
    sample_name=`echo $i| awk -F [.] '{print $2}'`
    hisat2 -p 8 --rna-strandness RF --dta -q -x Mcap_ref -1 ${i} -2 $(echo ${i}|sed s/_R1/_R2/) -S ${sample_name}.sam
    samtools sort -@ 8 -o ${sample_name}.bam ${sample_name}.sam
    	echo "${i} bam-ified!"
    rm ${sample_name}.sam
done

assemble.sh:

#!/bin/bash
#SBATCH -t 500:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=128GB
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/estrand/BleachingPairs_RNASeq/output/                
#SBATCH --error=../"%x_error.%j" #if your job fails, the error report will be put in this file
#SBATCH --output=../"%x_output.%j" #once your job is completed, any final job report comments will be put in this file

source /usr/share/Modules/init/sh # load the module function (specific need to my computer)
module load StringTie/2.1.4-GCC-9.3.0
module load gffcompare/0.11.5-foss-2018b

# Assemble and estimate reads

array1=($(ls *.bam))

for i in ${array1[@]}; do
    sample_name=`echo $i| awk -F [_] '{print $1"_"$2"_"$3}'`
    stringtie -p 8 -e -B -G /data/putnamlab/estrand/Montipora_capitata_HIv3.genes.gff3 -o /data/putnamlab/estrand/BleachingPairs_RNASeq/output/${i}.gtf /data/putnamlab/estrand/BleachingPairs_RNASeq/output/${i}
done

# Merge stringTie gtf results

ls *gtf > mergelist.txt
cat mergelist.txt

stringtie --merge -p 8 -G /data/putnamlab/estrand/Montipora_capitata_HIv3.genes.gff3 -o stringtie_merged.gtf mergelist.txt

echo "Stringtie merge complete" $(date)

genecount.sh:

#!/bin/bash
#SBATCH -t 60:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=128GB
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/estrand/BleachingPairs_RNASeq/output/                
#SBATCH --error=../"%x_error.%j" #if your job fails, the error report will be put in this file
#SBATCH --output=../"%x_output.%j" #once your job is completed, any final job report comments will be put in this file

source /usr/share/Modules/init/sh # load the module function (specific need to my computer)
module load StringTie/2.1.4-GCC-9.3.0
module load Python/2.7.15-foss-2018b
module load GffCompare/0.12.1-GCCcore-8.3.0

# Assess the performance of the assembly
gffcompare -r /data/putnamlab/estrand/Montipora_capitata_HIv3.genes.gff3 -o merged stringtie_merged.gtf

echo "GFFcompare complete, Starting gene count matrix assembly..." $(date)

# Make gtf list text file

for filename in *.bam.gtf; do echo $filename $PWD/$filename; done > listGTF.txt

# Compile the gene count matrix

python ../scripts/prepDE.py -g ../KB_gene_count_matrix.csv -i listGTF.txt
echo "Gene count matrix compiled." $(date)

head output/listGTF.txt :

16_S121_L003_R1_001.bam.gtf /glfs/brick01/gv0/putnamlab/estrand/BleachingPairs_RNASeq/output/16_S121_L003_R1_001.bam.gtf
16_S121_L003_R2_001.bam.gtf /glfs/brick01/gv0/putnamlab/estrand/BleachingPairs_RNASeq/output/16_S121_L003_R2_001.bam.gtf
17_S122_L003_R1_001.bam.gtf /glfs/brick01/gv0/putnamlab/estrand/BleachingPairs_RNASeq/output/17_S122_L003_R1_001.bam.gtf
17_S122_L003_R2_001.bam.gtf /glfs/brick01/gv0/putnamlab/estrand/BleachingPairs_RNASeq/output/17_S122_L003_R2_001.bam.gtf
21_S123_L003_R1_001.bam.gtf /glfs/brick01/gv0/putnamlab/estrand/BleachingPairs_RNASeq/output/21_S123_L003_R1_001.bam.gtf
21_S123_L003_R2_001.bam.gtf /glfs/brick01/gv0/putnamlab/estrand/BleachingPairs_RNASeq/output/21_S123_L003_R2_001.bam.gtf
22_S124_L003_R1_001.bam.gtf /glfs/brick01/gv0/putnamlab/estrand/BleachingPairs_RNASeq/output/22_S124_L003_R1_001.bam.gtf
22_S124_L003_R2_001.bam.gtf /glfs/brick01/gv0/putnamlab/estrand/BleachingPairs_RNASeq/output/22_S124_L003_R2_001.bam.gtf
23_S125_L003_R1_001.bam.gtf /glfs/brick01/gv0/putnamlab/estrand/BleachingPairs_RNASeq/output/23_S125_L003_R1_001.bam.gtf
23_S125_L003_R2_001.bam.gtf /glfs/brick01/gv0/putnamlab/estrand/BleachingPairs_RNASeq/output/23_S125_L003_R2_001.bam.gtf

less less genecount.sh_error.201740:

GCCcore/8.3.0(24):ERROR:150: Module 'GCCcore/8.3.0' conflicts with the currently loaded module(s) 'GCCcore/9.3.0'
GCCcore/8.3.0(24):ERROR:102: Tcl command execution failed: conflict GCCcore

  54384 reference transcripts loaded.
  2 duplicate reference transcripts discarded.
  54384 query transfrags loaded.

head -5 KB_gene_count_matrix.csv:

gene_id,16_S121_L003_R1_001.bam.gtf,16_S121_L003_R2_001.bam.gtf,17_S122_L003_R1_001.bam.gtf,17_S122_L003_R2_001.bam.gtf,21_S123_L003_R1_001.bam.gtf,21_S123_L003_R2_001.bam.gtf,22_S124_L003_R1_001.bam.gtf,22_S124_L003_R2_001.bam.gtf,23_S125_L003_R1_001.bam.gtf,23_S125_L003_R2_001.bam.gtf,24_S126_L003_R1_001.bam.gtf,24_S126_L003_R2_001.bam.gtf,25_S127_L003_R1_001.bam.gtf,25_S127_L003_R2_001.bam.gtf,26_S128_L003_R1_001.bam.gtf,26_S128_L003_R2_001.bam.gtf,28_S129_L003_R1_001.bam.gtf,28_S129_L003_R2_001.bam.gtf,29_S130_L003_R1_001.bam.gtf,29_S130_L003_R2_001.bam.gtf,2_S118_L003_R1_001.bam.gtf,2_S118_L003_R2_001.bam.gtf,30_S131_L003_R1_001.bam.gtf,30_S131_L003_R2_001.bam.gtf,31_S132_L003_R1_001.bam.gtf,31_S132_L003_R2_001.bam.gtf,33_S133_L003_R1_001.bam.gtf,33_S133_L003_R2_001.bam.gtf,37_S134_L003_R1_001.bam.gtf,37_S134_L003_R2_001.bam.gtf,39_S135_L003_R1_001.bam.gtf,39_S135_L003_R2_001.bam.gtf,42_S136_L003_R1_001.bam.gtf,42_S136_L003_R2_001.bam.gtf,43_S137_L003_R1_001.bam.gtf,43_S137_L003_R2_001.bam.gtf,45_S138_L003_R1_001.bam.gtf,45_S138_L003_R2_001.bam.gtf,46_S139_L003_R1_001.bam.gtf,46_S139_L003_R2_001.bam.gtf,47_S140_L003_R1_001.bam.gtf,47_S140_L003_R2_001.bam.gtf,4_S119_L003_R1_001.bam.gtf,4_S119_L003_R2_001.bam.gtf,50_S141_L003_R1_001.bam.gtf,50_S141_L003_R2_001.bam.gtf,51_S142_L003_R1_001.bam.gtf,51_S142_L003_R2_001.bam.gtf,52_S143_L003_R1_001.bam.gtf,52_S143_L003_R2_001.bam.gtf,54_S144_L003_R1_001.bam.gtf,54_S144_L003_R2_001.bam.gtf,55_S145_L003_R1_001.bam.gtf,55_S145_L003_R2_001.bam.gtf,56_S146_L003_R1_001.bam.gtf,56_S146_L003_R2_001.bam.gtf,57_S147_L003_R1_001.bam.gtf,57_S147_L003_R2_001.bam.gtf,59_S148_L003_R1_001.bam.gtf,59_S148_L003_R2_001.bam.gtf,60_S149_L003_R1_001.bam.gtf,60_S149_L003_R2_001.bam.gtf,61_S150_L003_R1_001.bam.gtf,61_S150_L003_R2_001.bam.gtf,62_S151_L003_R1_001.bam.gtf,62_S151_L003_R2_001.bam.gtf,63_S152_L003_R1_001.bam.gtf,63_S152_L003_R2_001.bam.gtf,64_S153_L003_R1_001.bam.gtf,64_S153_L003_R2_001.bam.gtf,65_S154_L003_R1_001.bam.gtf,65_S154_L003_R2_001.bam.gtf,66_S155_L003_R1_001.bam.gtf,66_S155_L003_R2_001.bam.gtf,67_S156_L003_R1_001.bam.gtf,67_S156_L003_R2_001.bam.gtf,68_S157_L003_R1_001.bam.gtf,68_S157_L003_R2_001.bam.gtf,6_S120_L003_R1_001.bam.gtf,6_S120_L003_R2_001.bam.gtf
Montipora_capitata_HIv3___RNAseq.g19176.t1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,5,0,0,0,0,5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Montipora_capitata_HIv3___RNAseq.g42173.t1,0,0,0,6,6,11,6,13,0,3,0,1,6,0,0,10,2,3,3,7,4,82,0,0,2,1,1,0,3,3,0,6,2,5,1,6,2,5,4,4,6,2,0,4,0,7,2,0,2,0,0,3,3,5,2,5,3,5,3,7,4,4,1,0,4,52,0,5,2,4,2,0,6,8,3,11,2,7,0,4
Montipora_capitata_HIv3___RNAseq.g13553.t1,0,0,0,0,0,0,1,1,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,2,0,0,0,0,0,0,10,6,0,0,0,0,0,0,0,0,0,0,0,0,0,0,11,6,0,0,0,0,0,0,0,0,0,2,0,2,0,0,1,2,0,0,0,0,0,0,0,0,0,1
STRG.6173,396,227,1108,635,799,461,924,535,554,312,1959,1154,931,549,2016,1178,1126,663,1014,604,874,516,1169,689,1574,931,1471,867,1542,912,1525,908,1370,815,1739,1011,716,428,826,488,716,415,1109,651,866,519,2429,1434,1124,664,804,487,1575,957,1524,936,1585,934,822,488,1219,720,1646,973,1169,716,1282,764,1598,972,1915,1165,1465,877,1331,799,915,556,1016,590

Column with gene ID contains both gene names from reference genome and IDs from this pipeline (e.g. STRG.6173). Those from this pipeline seem to have the highest counts.. What does this mean? see troubleshooting section below

Copy that output to local computer

``` scp emma_strand@ssh3.hac.uri.edu:../../data/putnamlab/estrand/BleachingPairs_RNASeq/KB_gene_count_matrix.csv /Users/emmastrand/MyProjects/HI_Bleaching_Timeseries/Dec-July-2019-analysis/output/RNASeq/KB_gene_count_matrix.csv

Written on February 3, 2022