Pocillopora acuta v2 Functional Genome Annotation

Contents:

Setting Up Andromeda
Align query protein sequences against databases
BLAST2GO
Merge for Full Annotation

References:

For detailed explanations of each step, refer to the below pipelines.

Functional Annotation

Erin Chille Functional Annotation: https://github.com/echille/Mcapitata_Developmental_Gene_Expression_Timeseries/blob/master/0-BLAST-GO-KO/2020-10-08-M-capitata-functional-annotation-pipeline.md
Jill Ashey Functional Annotation: https://github.com/JillAshey/FunctionalAnnotation/blob/main/FunctionalAnnotation_Worflow.md
Kevin Wong Functional Annotation: https://github.com/hputnam/Past_Genome/blob/master/genome_annotation_pipeline.md#functional-annotation-1
Danielle Becker-Polinski Functional Annotation: https://github.com/daniellembecker/DanielleBecker_Lab_Notebook/blob/master/_posts/2021-12-08-Molecular-Underpinnings-Functional-Annotation-Pipeline.md#overview
Emma Strand Mcap v3 functional annotation: https://github.com/emmastrand/EmmaStrand_Notebook/blob/master/_posts/2022-11-02-M.capitata-Genome-v3-Functional-Annotation.md

Setting Up Andromeda

Shared Protein Databases from the Putnam Lab

Make directory for genomes: mkdir genome_annotation & mkdir genome_annotation/scripts

Trembl path: /data/putnamlab/shared/databases/trembl_db/uniprot_trembl.fasta
Swiss: /data/putnamlab/shared/databases/swiss_db/uniprot_sprot.fasta Uniprot: /data/putnamlab/shared/databases/nr/.

Download the protein fasta files from genome:

wget http://cyanophora.rutgers.edu/Pocillopora_acuta/Pocillopora_acuta_HIv2.genes.pep.faa.gz & gunzip Pocillopora_acuta_HIv2.genes.pep.faa.gz

Count number of protein sequences.

zgrep -c ">" /data/putnamlab/estrand/Pocillopora_acuta_HIv2.genes.pep.faa 

output: 33730

Align query protein sequences against databases

NCBI nr database

Output: .tab and .xml file of aligned sequence info. The .xml file is the important one - it will be used as input for Blast2GO

pacuta_blast_nr.sh:

#!/bin/bash
#SBATCH --job-name="PA-nr"
#SBATCH -t 240:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --mem=128GB
#SBATCH --account=putnamlab
#SBATCH --export=NONE
#SBATCH -D /data/putnamlab/estrand/genome_annotation #### this should be your new output directory so all the outputs ends up here
#SBATCH --error=../"%x_error.%j" #if your job fails, the error report will be put in this file
#SBATCH --output=../"%x_output.%j" #once your job is completed, any final job report comments will be put in this file

# load modules needed (specific need for my computer)
source /usr/share/Modules/init/sh # load the module function

# load modules needed
echo "START" $(date)
module load DIAMOND/2.0.0-GCC-8.3.0 #Load DIAMOND

echo "Updating Mcap annotation" $(date)
diamond blastp -b 2 -d /data/putnamlab/shared/databases/nr -q /data/putnamlab/estrand/Pocillopora_acuta_HIv2.genes.pep.faa -o Pacuta_v2_blastp_annot -f 100 -e 0.00001 -k 1 --threads $SLURM_CPUS_ON_NODE --tmpdir /tmp/

echo "Search complete... converting format to XML and tab"

diamond view -a Pacuta_v2_blastp_annot.daa -o Pacuta_v2_blastp_annot.xml -f 5
diamond view -a Pacuta_v2_blastp_annot.daa -o Pacuta_v2_blastp_annot.tab -f 6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qlen slen

echo "STOP" $(date)

Check how many hits this yielded:

$ wc -l Pacuta_v2_blastp_annot.tab

output: X Pacuta_v2_blastp_annot.tab # ~X hits out of 33730

SwissProt Database

Output: .xml file of aligned sequence info

pacuta_swissprot_blast.sh:

#!/bin/bash
#SBATCH --job-name="PA-swiss"
#SBATCH -t 240:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --mem=128GB
#SBATCH --account=putnamlab
#SBATCH --export=NONE
#SBATCH -D /data/putnamlab/estrand/genome_annotation #### this should be your new output directory so all the outputs ends up here
#SBATCH --error=../"%x_error.%j" #if your job fails, the error report will be put in this file
#SBATCH --output=../"%x_output.%j" #once your job is completed, any final job report comments will be put in this file

# load modules needed (specific need for my computer)
source /usr/share/Modules/init/sh # load the module function

# load modules needed
echo "START" $(date)
module load BLAST+/2.11.0-gompi-2020b #load blast module

echo "Blast against swissprot database" $(date)
blastp -max_target_seqs 5 -num_threads 20 -db /data/putnamlab/shared/databases/swiss_db/swissprot_20220307 -query /data/putnamlab/estrand/Pocillopora_acuta_HIv2.genes.pep.faa -evalue 1e-5 -outfmt '5 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qlen' -out Pacuta_v2_swissprot_protein.out

echo "STOP" $(date)

Trembl database

Split fasta file into six different files to make trembl run faster

Do this in bluewaves:

mkdir ../split_files
cd ../split_files/

module load pyfasta/0.5.2

pyfasta split -n 6 /data/putnamlab/estrand/Pocillopora_acuta_HIv2.genes.pep.faa

output:
creating new files:
/data/putnamlab/estrand/Pocillopora_acuta_HIv2.genes.pep.0.faa
/data/putnamlab/estrand/Pocillopora_acuta_HIv2.genes.pep.1.faa
/data/putnamlab/estrand/Pocillopora_acuta_HIv2.genes.pep.2.faa
/data/putnamlab/estrand/Pocillopora_acuta_HIv2.genes.pep.3.faa
/data/putnamlab/estrand/Pocillopora_acuta_HIv2.genes.pep.4.faa
/data/putnamlab/estrand/Pocillopora_acuta_HIv2.genes.pep.5.faa

Moved the above files to my genome annotation/split_files folder.

pacuta_trembl_blast.sh

#!/bin/bash
#SBATCH --job-name="PA-trem"
#SBATCH -t 30-00:00:00
#SBATCH --mem=128GB
#SBATCH --account=putnamlab
#SBATCH --export=NONE
#SBATCH -D /data/putnamlab/estrand/genome_annotation #### this should be your new output directory so all the outputs ends up here
#SBATCH --error="%x_error.%j" #if your job fails, the error report will be put in this file
#SBATCH --output="%x_output.%j" #once your job is completed, any final job report comments will be put in this file

# load modules needed (specific need for my computer)
source /usr/share/Modules/init/sh # load the module function

echo "START" $(date)

module load BLAST+/2.11.0-gompi-2020b #load blast module

# running the blast against the trembl database 
echo "Blast against trembl database" $(date)

# creating array for the split files 
array1=($(ls /data/putnamlab/estrand/genome_annotation/split_files/Pocillopora_acuta_HIv2.genes.pep.*.faa))

for i in ${array1[${SLURM_ARRAY_TASK_ID}]}
do
blastp -max_target_seqs 1 \
-num_threads $SLURM_CPUS_ON_NODE \
-db /data/putnamlab/shared/databases/trembl_db/trembl_20220307 \
-query ${i} \
-evalue 1e-5 \
-outfmt '5 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qlen' \
-out ${i}.xml
done

echo "STOP" $(date)

InterPro

Remove the * character in some of the protein files.

sed -i s/\*//g Pocillopora_acuta_HIv2.genes.pep.faa

i = in-place (edit file in place) s = substitute /replacement_from_reg_exp/replacement_to_text/ = search and replace statement * = what I want to replace Add nothing for replacement_to_text g = global (replace all occurances in file)

pacuta_interpro.sh:

#!/bin/bash
#SBATCH --job-name="PA-int"
#SBATCH -t 30-00:00:00
#SBATCH --mem=128GB
#SBATCH --export=NONE
#SBATCH -D /data/putnamlab/estrand/genome_annotation #### this should be your new output directory so all the outputs ends up here
#SBATCH --error="%x_error.%j" #if your job fails, the error report will be put in this file
#SBATCH --output="%x_output.%j" #once your job is completed, any final job report comments will be put in this file

# load modules needed (specific need for my computer)
source /usr/share/Modules/init/sh # load the module function

echo "START $(date)"

module load InterProScan/5.59-91.0-foss-2021a
module load Java/11.0.2
java -version

# Run InterProScan
interproscan.sh -version
interproscan.sh -f XML -i /data/putnamlab/estrand/Pocillopora_acuta_HIv2.genes.pep.faa -b pacutav2.interpro -iprlookup -goterms -pa 
interproscan.sh -mode convert -f GFF3 -i pacutav2.interpro.xml -b pacutav2.interpro

echo "DONE $(date)"

Moving and copying the produced XML files to local desk top

## Blastp xml 
scp emma_strand@ssh3.hac.uri.edu:/data/putnamlab/estrand/genome_annotation/Pacuta_v2_blastp_annot.xml ~/MyProjects/Acclim_Dynamics_molecular/data/

## Swissprot 
scp emma_strand@ssh3.hac.uri.edu:/data/putnamlab/estrand/genome_annotation/Pacuta_v2_swissprot_protein.out ~/MyProjects/Acclim_Dynamics_molecular/data/

BLAST2GO

These steps happen on local computer b/c its an interface that doesn’t require code. The following B2G protocol is from Jill Ashey and her functional annotation protocol: https://github.com/JillAshey/FunctionalAnnotation/blob/main/FunctionalAnnotation_Worflow.md

Register for Blast2GO free membership: https://www.blast2go.com/b2g-register-basic.
Downland Blast2GO program: https://www.biobam.com/blast2go-previous-versions/

Load XML files from NCBI, SwissProt, and Trembl databases in B2G

To load the file, go to File<Load<Load Blast results<Load Blast XML (Legacy)

Once the file is loaded, a table loads with info about those BLAST results (nr, Tags, SeqName, Description, Length, Hits, e-Value, and sim mean). All of the cells should be orange with Tags that say BLASTED. This indicates that these sequences have only been blasted, not mapped or annotated.

Only one .xml file can be loaded at here! To analyze all .xml files, go through these steps with each one separately.

Map GO terms

To map results, select the mapping icon (white circle with green circle inside) at the top of the screen. Then select Run Mapping. A box will open up; don’t change anything, click run. Depending on the number of BLAST results, mapping could take hours to days. B2G will continue running if the computer is in sleep mode. Mapping status can be checked under the Progress tab in the lower left box. If mapping retrieved a GO term that may be related to a certain sequence, that sequence row will turn light green. Only move forward when the Progress tab is 100% completed.

Annotate GO terms

To annotate, select the annot icon (white circle with blue circle inside) at the top of the screen. Then select Run Annotation. A box will open up; don’t change anything unless thresholds need to be adjusted. If no changes are necessary, click next through the boxes until the final one and click run. Depending on the mapping results, annotating could take hours to days. B2G will continue running if the computer is in sleep mode. Annotation status can be checked under the Progress tab in the lower left box. If a GO term has been successfully assigned to a sequence, that sequence row will turn blue. Only move forward when the Progress tab is 100% completed.

Merge with InterPro Scan Results

A .xml file was generated during the IPS step. This file can be loaded into B2G and merged with the newly mapped and annotated data from NCBI, SwissProt, or Trembl. Once these files are merged, new columns will appear with IPS GO and enzyme code information.

Export files

Once the mapping, annotating, and IPS merging are complete, download the table as a .txt file (this is the only option to save in B2G, can change to .csv later if needed).

Merge for Full Annotation

These steps happen in an R script. Mine can be found for the Pacuta v2 within Holobiont Integration molecular github page. This was modeled after Jill Ashey’s script: https://github.com/JillAshey/FunctionalAnnotation/blob/main/RAnalysis/annotation_20211102.Rmd

Written on November 17, 2022