HoloInt WGBS Analysis Pipeline

HoloInt WGBS Methylation Analysis Pipeline

Raw data folder path: /data/putnamlab/KITT/hputnam/20211008_HoloInt_WGBS
Processed data folder path: /data/putnamlab/estrand/HoloInt_WGBS

References:

  • Wong and Strand WGBS pipeline info: https://github.com/Putnam-Lab/Lab_Management/blob/master/Bioinformatics_%26_Coding/Workflows/Methylation_QC.md
  • Wong Past pipeline: https://github.com/kevinhwong1/Thermal_Transplant_Molecular/blob/main/scripts/Past_WGBS_Workflow.md
  • Wong methylseq trimming tests pipeline: https://kevinhwong1.github.io/KevinHWong_Notebook/Methylseq-trimming-test-to-remove-m-bias/

Contents:

Setting Up Andromeda

Make a new directory for output files

$ mkdir HoloInt_WGBS
$ mkdir scripts
$ mkdir fastqc_results

Test runs were done on a small subset (n=5) to reduce the time that these scripts had to run. I.e. 5 samples instead of 60.

Creating a test run folder

$ mkdir test_set
$ cp /data/putnamlab/KITT/hputnam/20211008_HoloInt_WGBS/1047_S0_L001_R{1,2}_001.fastq.gz /data/putnamlab/estrand/HoloInt_WGBS/test_set
$ cp /data/putnamlab/KITT/hputnam/20211008_HoloInt_WGBS/1051_S0_L001_R{1,2}_001.fastq.gz /data/putnamlab/estrand/HoloInt_WGBS/test_set
$ cp /data/putnamlab/KITT/hputnam/20211008_HoloInt_WGBS/1059_S0_L001_R{1,2}_001.fastq.gz /data/putnamlab/estrand/HoloInt_WGBS/test_set
$ cp /data/putnamlab/KITT/hputnam/20211008_HoloInt_WGBS/1090_S0_L001_R{1,2}_001.fastq.gz /data/putnamlab/estrand/HoloInt_WGBS/test_set
$ cp /data/putnamlab/KITT/hputnam/20211008_HoloInt_WGBS/1103_S0_L001_R{1,2}_001.fastq.gz /data/putnamlab/estrand/HoloInt_WGBS/test_set

Initial fastqc run

fastqc.sh. This took over 24 hours and timed out the first time around so I increased the -t to 60 hours.

#!/bin/bash
#SBATCH -t 60:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=emma_strand@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/estrand/HoloInt_WGBS                
#SBATCH --error="script_error_fastqc" #if your job fails, the error report will be put in this file
#SBATCH --output="output_script_fastqc" #once your job is completed, any final job report comments will be put in this file

source /usr/share/Modules/init/sh # load the module function

cd /data/putnamlab/estrand/HoloInt_WGBS

module load FastQC/0.11.9-Java-11
module load MultiQC/1.9-intel-2020a-Python-3.8.2

for file in /data/putnamlab/KITT/hputnam/20211008_HoloInt_WGBS/*fastq.gz
do
fastqc $file --outdir /data/putnamlab/estrand/HoloInt_WGBS/fastqc_results/         
done

multiqc --interactive fastqc_results

This failed at the multiqc line above because the path was incorrect. So I did that function in the terminal.

scp emma_strand@ssh3.hac.uri.edu:../../data/putnamlab/estrand/HoloInt_WGBS/multiqc_report.html /Users/emmastrand/MyProjects/Acclim_Dynamics/Molecular_paper/WGBS/output/initial_multiqc_report.html

Initial MultiQC Report

Full report here: https://github.com/hputnam/Acclim_Dynamics/blob/master/Molecular_paper/WGBS/output/initial_multiqc_report.html

All samples have sequences of a single length (150bp , 151bp).

Sample Name % Dups % GC M Seqs
1047_S0_L001_R1_001 28.6% 27% 67.0
1047_S0_L001_R2_001 26.8% 27% 67.0
1051_S0_L001_R1_001 36.6% 23% 104.2
1051_S0_L001_R2_001 36.0% 24% 104.2
1059_S0_L001_R1_001 30.3% 26% 98.9
1059_S0_L001_R2_001 30.6% 26% 98.9
1090_S0_L001_R1_001 25.8% 27% 97.3
1090_S0_L001_R2_001 26.2% 27% 97.3
1103_S0_L001_R1_001 19.1% 28% 64.8
1103_S0_L001_R2_001 19.3% 28% 64.8
1147_S0_L001_R1_001 19.8% 28% 27.1
1147_S0_L001_R2_001 20.1% 28% 27.1
1159_S0_L001_R1_001 19.9% 29% 57.8
1159_S0_L001_R2_001 20.7% 29% 57.8
1168_S0_L001_R1_001 25.8% 32% 94.9
1168_S0_L001_R2_001 26.4% 31% 94.9
1184_S0_L001_R1_001 20.8% 26% 80.5
1184_S0_L001_R2_001 21.7% 26% 80.5
1205_S0_L001_R1_001 28.3% 27% 117.1
1205_S0_L001_R2_001 28.9% 27% 117.1
1225_S0_L001_R1_001 43.4% 25% 63.5
1225_S0_L001_R2_001 43.0% 25% 63.5
1238_S0_L001_R1_001 22.9% 30% 70.3
1238_S0_L001_R2_001 23.5% 30% 70.3
1281_S0_L001_R1_001 29.3% 26% 122.8
1281_S0_L001_R2_001 30.0% 26% 122.8
1296_S0_L001_R1_001 42.0% 24% 86.1
1296_S0_L001_R2_001 42.0% 24% 86.1
1303_S0_L001_R1_001 22.5% 27% 83.6
1303_S0_L001_R2_001 22.7% 27% 83.6
1312_S0_L001_R1_001 27.7% 28% 94.8
1312_S0_L001_R2_001 28.5% 28% 94.8
1329_S0_L001_R1_001 43.9% 25% 117.1
1329_S0_L001_R2_001 44.3% 26% 117.1
1415_S0_L001_R1_001 25.6% 27% 80.2
1415_S0_L001_R2_001 26.1% 27% 80.2
1416_S0_L001_R1_001 26.4% 27% 98.4
1416_S0_L001_R2_001 27.2% 27% 98.4
1427_S0_L001_R1_001 24.4% 29% 74.5
1427_S0_L001_R2_001 23.9% 29% 74.5
1445_S0_L001_R1_001 51.9% 23% 81.9
1445_S0_L001_R2_001 51.2% 23% 81.9
1459_S0_L001_R1_001 21.1% 28% 91.6
1459_S0_L001_R2_001 21.8% 29% 91.6
1487_S0_L001_R1_001 21.3% 27% 90.4
1487_S0_L001_R2_001 22.1% 28% 90.4
1536_S0_L001_R1_001 21.4% 27% 81.4
1536_S0_L001_R2_001 22.2% 27% 81.4
1559_S0_L001_R1_001 28.4% 29% 86.4
1559_S0_L001_R2_001 28.5% 29% 86.4
1563_S0_L001_R1_001 32.2% 25% 110.5
1563_S0_L001_R2_001 32.7% 26% 110.5
1571_S0_L001_R1_001 31.9% 28% 116.3
1571_S0_L001_R2_001 32.2% 28% 116.3
1582_S0_L001_R1_001 20.1% 28% 79.1
1582_S0_L001_R2_001 20.2% 28% 79.1
1596_S0_L001_R1_001 22.7% 25% 76.8
1596_S0_L001_R2_001 23.7% 25% 76.8
1641_S0_L001_R1_001 29.4% 26% 122.5
1641_S0_L001_R2_001 30.0% 26% 122.5
1647_S0_L001_R1_001 19.4% 28% 84.4
1647_S0_L001_R2_001 19.7% 28% 84.4
1707_S0_L001_R1_001 83.1% 24% 99.2
1707_S0_L001_R2_001 82.0% 24% 99.2
1709_S0_L001_R1_001 55.8% 27% 680.6
1709_S0_L001_R2_001 60.5% 27% 680.6
1728_S0_L001_R1_001 29.9% 24% 106.4
1728_S0_L001_R2_001 28.1% 25% 106.4
1732_S0_L001_R1_001 26.2% 29% 97.4
1732_S0_L001_R2_001 26.5% 28% 97.4
1755_S0_L001_R1_001 23.6% 26% 81.1
1755_S0_L001_R2_001 24.6% 26% 81.1
1757_S0_L001_R1_001 28.4% 28% 92.4
1757_S0_L001_R2_001 29.6% 28% 92.4
1765_S0_L001_R1_001 17.2% 28% 65.5
1765_S0_L001_R2_001 18.0% 27% 65.5
1777_S0_L001_R1_001 19.6% 29% 36.6
1777_S0_L001_R2_001 19.1% 29% 36.6
1820_S0_L001_R1_001 28.5% 31% 133.3
1820_S0_L001_R2_001 28.6% 31% 133.3
2012_S0_L001_R1_001 40.1% 24% 110.5
2012_S0_L001_R2_001 40.3% 26% 110.5
2064_S0_L001_R1_001 18.7% 26% 83.6
2064_S0_L001_R2_001 19.2% 27% 83.6
2072_S0_L001_R1_001 17.2% 28% 61.7
2072_S0_L001_R2_001 17.7% 28% 61.7
2087_S0_L001_R1_001 31.0% 25% 106.0
2087_S0_L001_R2_001 31.8% 26% 106.0
2185_S0_L001_R1_001 44.0% 30% 83.1
2185_S0_L001_R2_001 44.1% 30% 83.1
2197_S0_L001_R1_001 45.2% 23% 94.4
2197_S0_L001_R2_001 45.1% 24% 94.4
2212_S0_L001_R1_001 77.0% 22% 86.8
2212_S0_L001_R2_001 75.0% 23% 86.8
2300_S0_L001_R1_001 27.9% 27% 113.1
2300_S0_L001_R2_001 28.6% 26% 113.1
2304_S0_L001_R1_001 30.4% 29% 91.5
2304_S0_L001_R2_001 30.7% 29% 91.5
2306_S0_L001_R1_001 20.7% 28% 81.5
2306_S0_L001_R2_001 21.1% 27% 81.5
2409_S0_L001_R1_001 31.3% 27% 101.5
2409_S0_L001_R2_001 31.9% 27% 101.5
2413_S0_L001_R1_001 43.9% 23% 108.4
2413_S0_L001_R2_001 44.1% 24% 108.4
2513_S0_L001_R1_001 21.6% 32% 96.5
2513_S0_L001_R2_001 22.3% 32% 96.5
2550_S0_L001_R1_001 34.4% 25% 152.1
2550_S0_L001_R2_001 34.9% 26% 152.1
2564_S0_L001_R1_001 22.5% 27% 73.5
2564_S0_L001_R2_001 22.5% 27% 73.5
2668_S0_L001_R1_001 18.1% 28% 67.6
2668_S0_L001_R2_001 18.6% 28% 67.6
2861_S0_L001_R1_001 26.6% 26% 76.9
2861_S0_L001_R2_001 27.7% 26% 76.9
2877_S0_L001_R1_001 24.2% 26% 65.3
2877_S0_L001_R2_001 24.6% 27% 65.3
2878_S0_L001_R1_001 29.8% 24% 159.4
2878_S0_L001_R2_001 17.4% 25% 159.4
2879_S0_L001_R1_001 23.2% 29% 85.7
2879_S0_L001_R2_001 23.0% 28% 85.7

Sample 1709 had a very large # of reads. The rest are within the range of my other projects (not low reads, graph is relative).

This peak has a normal distribution (good, what we’re looking for) but is shifted to a peak ~20-25… This is usually higher? Red flag? Is this shifted because methylation changes unmethylated Cytosine to Thymine and therefore a smaller amount of GC content? Especially in an invertebrate that only has 20% methylation.. This is actually a green flag then?

~6 samples have a red peak >10. I’m not sure why.. come back to this.

Methylseq: Trimming parameters test

Options tested

This was done on a small subset (n=5) to reduce the time that these scripts had to run. I.e. 5 samples instead of 60. Each script took a little over 24 hours to run.

Goal: Reduce M-bias but keep as much of the sequence as possible.

Reasoning behind values: I chose to not include the Zymo preset because this one worked the least well for Kevin’s set. I proceeded with my own preset cut-offs to test. I included the final version that Kevin went with for his and what appeared to be the second best option as well.

Options tested:

  1. HoloInt_methylseq.sh: clip_r1 = 10, clip_r2 = 10, three_prime_clip_r1 = 10, three_prime_clip_r2 = 10
  2. HoloInt_methylseq2.sh: clip_r1 = 15, clip_r2 = 30, three_prime_clip_r1 = 15, three_prime_clip_r2 = 15 (options that worked best for Kevin Past workflow)
  3. HoloInt_methylseq3.sh: clip_r1 = 15, clip_r2 = 30, three_prime_clip_r1 = 30, three_prime_clip_r2 = 15

Nextflow version 21.03.0 requires an -input command.

HoloInt_methylseq (1)

nano HoloInt_methylseq.sh
#!/bin/bash
#SBATCH --job-name="methylseq"
#SBATCH -t 120:00:00 #use higher # next time
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --mem=500GB
#SBATCH --account=putnamlab
#SBATCH --export=NONE
#SBATCH -D /data/putnamlab/estrand/HoloInt_WGBS
#SBATCH -p putnamlab
#SBATCH --cpus-per-task=3

# load modules needed
source /usr/share/Modules/init/sh # load the module function

module load Nextflow/21.03.0

# run nextflow methylseq

nextflow run nf-core/methylseq -resume \
-profile singularity \
--aligner bismark \
--igenomes_ignore \
--fasta /data/putnamlab/estrand/Pocillopora_acuta_HIv1.assembly.fasta \
--save_reference \
--input '/data/putnamlab/estrand/HoloInt_WGBS/test_set/*_R{1,2}_001.fastq.gz' \
--clip_r1 10 \
--clip_r2 10 \
--three_prime_clip_r1 10 --three_prime_clip_r2 10 \
--non_directional \
--cytosine_report \
--relax_mismatches \
--unmapped \
--outdir /data/putnamlab/estrand/HoloInt_WGBS/HoloInt_methylseq \
-name WGBS_methylseq_HoloInt-10

The multiqc report ran into this error: Missing output file(s) multiqc_plots expected by process multiqc (1).

I ran this command in the terminal and it worked :

$ module load MultiQC/1.9-intel-2020a-Python-3.8.2

$ multiqc -f --title "WGBS_methylseq_HoloInt-10" --filename WGBS_methylseq_HoloInt_10_multiqc_report  . \
      -m custom_content -m picard -m qualimap -m bismark -m samtools -m preseq -m cutadapt -m fastqc

Ouptut: WGBS_methylseq_HoloInt_10_multiqc_report.html.

scp emma_strand@ssh3.hac.uri.edu:../../data/putnamlab/estrand/HoloInt_WGBS/HoloInt_methylseq/WGBS_methylseq_HoloInt_10_multiqc_report.html /Users/emmastrand/MyProjects/Acclim_Dynamics/Molecular_paper/WGBS/output/WGBS_methylseq_HoloInt_10_multiqc_report.html

HoloInt_methylseq2

nano HoloInt_methylseq2.sh
#!/bin/bash
#SBATCH --job-name="methylseq2"
#SBATCH -t 120:00:00 #use higher # next time
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --mem=500GB
#SBATCH --account=putnamlab
#SBATCH --export=NONE
#SBATCH -D /data/putnamlab/estrand/HoloInt_WGBS
#SBATCH -p putnamlab
#SBATCH --cpus-per-task=3

# load modules needed
source /usr/share/Modules/init/sh # load the module function

module load Nextflow/21.03.0

# run nextflow methylseq

nextflow run nf-core/methylseq \
-profile singularity \
--aligner bismark \
--igenomes_ignore \
--fasta /data/putnamlab/estrand/Pocillopora_acuta_HIv1.assembly.fasta \
--save_reference \
--input '/data/putnamlab/estrand/HoloInt_WGBS/test_set/*_R{1,2}_001.fastq.gz' \
--clip_r1 15 \
--clip_r2 30 \
--three_prime_clip_r1 15 --three_prime_clip_r2 15 \
--non_directional \
--cytosine_report \
--relax_mismatches \
--unmapped \
--outdir /data/putnamlab/estrand/HoloInt_WGBS/HoloInt_methylseq2 \
-name WGBS_methylseq_HoloInt2

The multiqc report ran into this error: Missing output file(s) multiqc_plots expected by process multiqc (1).

I ran this command in the terminal and it worked (within the HoloInt_methylseq2 folder):

$ module load MultiQC/1.9-intel-2020a-Python-3.8.2

$ multiqc -f --title "WGBS_methylseq_HoloInt2" --filename WGBS_methylseq_HoloInt2_multiqc_report  . \
      -m custom_content -m picard -m qualimap -m bismark -m samtools -m preseq -m cutadapt -m fastqc

Ouptut: WGBS_methylseq_HoloInt2_multiqc_report.html.

scp emma_strand@ssh3.hac.uri.edu:../../data/putnamlab/estrand/HoloInt_WGBS/HoloInt_methylseq2/WGBS_methylseq_HoloInt2_multiqc_report.html /Users/emmastrand/MyProjects/Acclim_Dynamics/Molecular_paper/WGBS/output/WGBS_methylseq_HoloInt2_multiqc_report.html

HoloInt_methylseq3

nano HoloInt_methylseq3.sh
#!/bin/bash
#SBATCH --job-name="methylseq3"
#SBATCH -t 120:00:00 #use higher # next time
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --mem=500GB
#SBATCH --account=putnamlab
#SBATCH --export=NONE
#SBATCH -D /data/putnamlab/estrand/HoloInt_WGBS
#SBATCH -p putnamlab
#SBATCH --cpus-per-task=3

# load modules needed
source /usr/share/Modules/init/sh # load the module function

module load Nextflow/21.03.0

# run nextflow methylseq

nextflow run nf-core/methylseq \
-profile singularity \
--aligner bismark \
--igenomes_ignore \
--fasta /data/putnamlab/estrand/Pocillopora_acuta_HIv1.assembly.fasta \
--save_reference \
--input '/data/putnamlab/estrand/HoloInt_WGBS/test_set/*_R{1,2}_001.fastq.gz' \
--clip_r1 15 \
--clip_r2 30 \
--three_prime_clip_r1 30 --three_prime_clip_r2 15 \
--non_directional \
--cytosine_report \
--relax_mismatches \
--unmapped \
--outdir /data/putnamlab/estrand/HoloInt_WGBS/HoloInt_methylseq3 \
-name WGBS_methylseq_HoloInt3

The multiqc report ran into this error: Missing output file(s) multiqc_plots expected by process multiqc (1).

I ran this command in the terminal and it worked (within the HoloInt_methylseq2 folder):

$ module load MultiQC/1.9-intel-2020a-Python-3.8.2

$ multiqc -f --title "WGBS_methylseq_HoloInt3" --filename WGBS_methylseq_HoloInt3_multiqc_report  . \
      -m custom_content -m picard -m qualimap -m bismark -m samtools -m preseq -m cutadapt -m fastqc

Ouptut: WGBS_methylseq_HoloInt3_multiqc_report.html.

scp emma_strand@ssh3.hac.uri.edu:../../data/putnamlab/estrand/HoloInt_WGBS/HoloInt_methylseq3/WGBS_methylseq_HoloInt3_multiqc_report.html /Users/emmastrand/MyProjects/Acclim_Dynamics/Molecular_paper/WGBS/output/WGBS_methylseq_HoloInt3_multiqc_report.html

Results of the multiqc on the three reports above

HoloInt_methylseq full report: https://github.com/hputnam/Acclim_Dynamics/blob/master/Molecular_paper/WGBS/output/WGBS_methylseq_HoloInt_10_multiqc_report.html
HoloInt_methylseq2 full report: https://github.com/hputnam/Acclim_Dynamics/blob/master/Molecular_paper/WGBS/output/WGBS_methylseq_HoloInt2_multiqc_report.html
HoloInt_methylseq3 full report: https://github.com/hputnam/Acclim_Dynamics/blob/master/Molecular_paper/WGBS/output/WGBS_methylseq_HoloInt3_multiqc_report.html

Comparison of methylseq statistics found here: https://github.com/hputnam/Acclim_Dynamics/blob/master/Molecular_paper/scripts/methylseq_statistics.md

HoloInt_methylseq

HoloInt_methylseq2

HoloInt_methylseq3

Based on the above, I am moving forward with HoloInt_methylseq trimming of clip 10 for all four options.

Methylseq: final script run

GENOME VERSION 1

HoloInt_methylseq_final.sh. This timed out after 10 days so I ran again with the -resume flag and changed the output name to WGBS_methylseq_HoloInt_final2. Total this took 2 weeks.

#!/bin/bash
#SBATCH --job-name="methylseq_final"
#SBATCH -t 250:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --mem=500GB
#SBATCH --account=putnamlab
#SBATCH --export=NONE
#SBATCH -D /data/putnamlab/estrand/HoloInt_WGBS/HoloInt_methylseq_final
#SBATCH -p putnamlab
#SBATCH --cpus-per-task=3

# load modules needed
source /usr/share/Modules/init/sh # load the module function

module load Nextflow/21.03.0

# run nextflow methylseq

nextflow run nf-core/methylseq -resume \
-profile singularity \
--aligner bismark \
--igenomes_ignore \
--fasta /data/putnamlab/estrand/Pocillopora_acuta_HIv1.assembly.fasta \
--save_reference \
--input '/data/putnamlab/KITT/hputnam/20211008_HoloInt_WGBS/*_R{1,2}_001.fastq.gz' \
--clip_r1 10 \
--clip_r2 10 \
--three_prime_clip_r1 10 --three_prime_clip_r2 10 \
--non_directional \
--cytosine_report \
--relax_mismatches \
--unmapped \
--outdir /data/putnamlab/estrand/HoloInt_WGBS/HoloInt_methylseq_final \
-name WGBS_methylseq_HoloInt_final

The multiqc function is running into errors from the above script so I ran:

multiqc -f --title "WGBS_methylseq_HoloInt_final2" --filename WGBS_methylseq_HoloInt_final2_multiqc_report  . \
      -m custom_content -m picard -m qualimap -m bismark -m samtools -m preseq -m cutadapt -m fastqc

This was part of the output file. Shouldn’t there be 60 preseq files, not 59?

[INFO   ]        qualimap : Found 60 BamQC reports
[INFO   ]          preseq : Found 59 reports
[INFO   ]         bismark : Found 60 alignment reports
[INFO   ]         bismark : Found 60 dedup reports
[INFO   ]         bismark : Found 60 methextract reports
[INFO   ]        cutadapt : Found 120 reports
[INFO   ]          fastqc : Found 240 reports

Copying this file to project folder:

scp emma_strand@ssh3.hac.uri.edu:../../data/putnamlab/estrand/HoloInt_WGBS/HoloInt_methylseq_final/WGBS_methylseq_HoloInt_final2_multiqc_report.html /Users/emmastrand/MyProjects/Acclim_Dynamics/Molecular_paper/WGBS/output/WGBS_methylseq_HoloInt_final2_multiqc_report.html

Methylseq: Final Multiqc Report Output

GENOME VERSION 1

I created a new github repo for the molecular portion of this project: MyProjects/Acclim_Dynamics_molecular/.

Based on this assessment we will likely take a few of the wonky samples out prior to analysis, but will keep them in through the next following steps.

GENOME VERSION 2

A newer version of the genome was released while I was analyzing this data so I want to run this again with the newer version (http://cyanophora.rutgers.edu/Pocillopora_acuta/).

wget http://cyanophora.rutgers.edu/Pocillopora_acuta/Pocillopora_acuta_HIv2.assembly.fasta.gz and gunzip Pocillopora_acuta_HIv2.assembly.fasta.gz.

HoloInt_methylseq_genomev2.sh:

#!/bin/bash
#SBATCH --job-name="methylseq_v2"
#SBATCH -t 600:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --mem=128GB
#SBATCH --account=putnamlab
#SBATCH --export=NONE
#SBATCH -D /data/putnamlab/estrand/HoloInt_WGBS/HoloInt_methylseq_genomev2
#SBATCH -p putnamlab
#SBATCH --cpus-per-task=3
#SBATCH --error="script_error_methylseq_v2" #if your job fails, the error report will be put in this file
#SBATCH --output="output_script_methylseq_v2" #once your job is completed, any final job report comments will be put in this file

# load modules needed
source /usr/share/Modules/init/sh # load the module function

module load Nextflow/21.03.0
module load FastQC/0.11.9-Java-11
module load MultiQC/1.9-intel-2020a-Python-3.8.2

# run nextflow methylseq

nextflow run nf-core/methylseq -resume \
-profile singularity \
--aligner bismark \
--igenomes_ignore \
--fasta /data/putnamlab/estrand/Pocillopora_acuta_HIv2.assembly.fasta \
--save_reference \
--input '/data/putnamlab/KITT/hputnam/20211008_HoloInt_WGBS/*_R{1,2}_001.fastq.gz' \
--clip_r1 10 \
--clip_r2 10 \
--three_prime_clip_r1 10 --three_prime_clip_r2 10 \
--non_directional \
--cytosine_report \
--relax_mismatches \
--unmapped \
--outdir /data/putnamlab/estrand/HoloInt_WGBS/HoloInt_methylseq_genomev2 \
-name HoloInt_methylseq_genomev2_3 ### I had to change this any time running a new project or running this script again

The multiqc function is running into errors from the above script so I ran:

interactive 

module load MultiQC/1.9-intel-2020a-Python-3.8.2

multiqc -f --filename WGBS_methylseq_HoloInt_genomev2_multiqc_report  . \
      -m custom_content -m picard -m qualimap -m bismark -m samtools -m preseq -m cutadapt -m fastqc

Copying this file to project folder:

scp emma_strand@ssh3.hac.uri.edu:../../data/putnamlab/estrand/HoloInt_WGBS/HoloInt_methylseq_genomev2/WGBS_methylseq_HoloInt_genomev2_multiqc_report.html /Users/emmastrand/MyProjects/Acclim_Dynamics_molecular/data/WGBS/output/WGBS_methylseq_HoloInt_genomev2_multiqc_report.html

MultiQC Report

Merge Strands

The file output from the methylseq pipeline that is used for the following steps: bismark_methylation_calls/methylation_coverage/*deduplicated.bismark.cov.gz.

The Bismark coverage2cytosine command re-reads the genome-wide report and merges methylation evidence of both top and bottom strand to create one file.

Input: *deduplicated.bismark.cov.gz.
Output: *merged_CpG_evidence.cov.

GENOME VERSION 1

Make a new directory for this output: mkdir merged_cov.

This takes ~7 hours (60 samples).

merge_strands.sh (named cov_to_cyto in other lab members’ pipelines):

#!/bin/bash
#SBATCH --job-name="merge"
#SBATCH -t 500:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --mem=500GB
#SBATCH --account=putnamlab
#SBATCH --export=NONE
#SBATCH -D /data/putnamlab/estrand/HoloInt_WGBS/merged_cov #### this should be your new output directory so all the outputs ends up here
#SBATCH --cpus-per-task=3
#SBATCH --error="script_error_merge" #if your job fails, the error report will be put in this file
#SBATCH --output="output_script_merge" #once your job is completed, any final job report comments will be put in this file

# load modules needed (specific need for my computer)
source /usr/share/Modules/init/sh # load the module function

# load modules needed

module load Bismark/0.20.1-foss-2018b

# run coverage2cytosine merge of strands
# change paths below 
# change file names below (_S0_L001_*)
# there can't be any spaces after the \

find /data/putnamlab/estrand/HoloInt_WGBS/HoloInt_methylseq_final/bismark_methylation_calls/methylation_coverage/*deduplicated.bismark.cov.gz \
 | xargs basename -s _S0_L001_R1_001_val_1_bismark_bt2_pe.deduplicated.bismark.cov.gz \
 | xargs -I{} coverage2cytosine \
 --genome_folder /data/putnamlab/estrand/HoloInt_WGBS/HoloInt_methylseq_final/reference_genome/BismarkIndex \
 -o {} \
 --merge_CpG \
 --zero_based \
/data/putnamlab/estrand/HoloInt_WGBS/HoloInt_methylseq_final/bismark_methylation_calls/methylation_coverage/{}_S0_L001_R1_001_val_1_bismark_bt2_pe.deduplicated.bismark.cov.gz

GENOME VERSION 2

Make a new directory for this output: mkdir merged_cov_genomev2.

scripts/genomev2/merge_strandsv2.sh

#!/bin/bash
#SBATCH --job-name="v2merge"
#SBATCH -t 500:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --mem=128GB
#SBATCH --account=putnamlab
#SBATCH --export=NONE
#SBATCH -D /data/putnamlab/estrand/HoloInt_WGBS/merged_cov_genomev2 #### this should be your new output directory so all the outputs ends up here
#SBATCH --cpus-per-task=3
#SBATCH --error="script_error_merge_v2" #if your job fails, the error report will be put in this file
#SBATCH --output="output_script_merge_v2" #once your job is completed, any final job report comments will be put in this file

# load modules needed (specific need for my computer)
source /usr/share/Modules/init/sh # load the module function

# load modules needed

module load Bismark/0.20.1-foss-2018b

# run coverage2cytosine merge of strands
# change paths below 
# change file names below (_S0_L001_*)
# there can't be any spaces after the \

find /data/putnamlab/estrand/HoloInt_WGBS/HoloInt_methylseq_genomev2/bismark_methylation_calls/methylation_coverage/*deduplicated.bismark.cov.gz \
 | xargs basename -s _S0_L001_R1_001_val_1_bismark_bt2_pe.deduplicated.bismark.cov.gz \
 | xargs -I{} coverage2cytosine \
 --genome_folder /data/putnamlab/estrand/HoloInt_WGBS/HoloInt_methylseq_genomev2/reference_genome/BismarkIndex \
 -o {} \
 --merge_CpG \
 --zero_based \
/data/putnamlab/estrand/HoloInt_WGBS/HoloInt_methylseq_genomev2/bismark_methylation_calls/methylation_coverage/{}_S0_L001_R1_001_val_1_bismark_bt2_pe.deduplicated.bismark.cov.gz

OVERVIEW

The script is saying:

  • for every file in methylation_coverage repo that ends with deduplicated.bismark.cov.gz (there should be 60 for this project),
  • and has basename of _S0_L001_R1_001_val_1_bismark_bt2_pe.deduplicated.bismark.cov.gz (everything that comes after the sample ID)
  • perform the function coverage2cytosine within the Bismark module
  • identifies where the output genome is located (folder reference_genome/BismarkIndex within methylseq output folder)
  • --zero_based: uses 0-based genomic coordinates instead of 1-based coordinates. Default is OFF
  • -o: output file names; {} identifies these remain the same
  • merge_CpG: write out additional coverage files that has the top and bottom strand methylation evidence pooled into a single CpG dinucleotide entity.

Help on merge_CpG function: https://github.com/FelixKrueger/Bismark/issues/86/.

The output files will look like (without the headers):

Scaffold Start Position Stop Position % Methylated Methylated Unmethylated
000000F 29076 29078 0.000000 0 5
000000F 29158 29160 0.000000 0 12
000000F 29185 29187 0.000000 0 8
000000F 29215 29217 0.000000 0 4
000000F 29232 29234 0.000000 0 3
000000F 29241 29243 11.111111 1 8
000000F 29277 29279 0.000000 0 11
000000F 29282 29284 0.000000 0 12
000000F 29313 29315 0.000000 0 11
29335 29335 29337 0.000000 0 10

Each CpG dinucleotide will have data for % methylation, and how many times that CpG was methylated or unmethylated.

Sort CpG .cov file

This function sorts all the merged files so that all scaffolds are in the same order. This needs to be done for multiIntersectBed to run correctly. This sets up a loop to do this for every sample (file).

GENOME VERSION 1

bedtools_sort.sh:

#!/bin/bash
#SBATCH --job-name="H-sort"
#SBATCH -t 500:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --mem=500GB
#SBATCH --account=putnamlab
#SBATCH --export=NONE
#SBATCH -D /data/putnamlab/estrand/HoloInt_WGBS/merged_cov #### this is the output from the merge cov step above 
#SBATCH --cpus-per-task=3
#SBATCH --error="script_error_sort" #if your job fails, the error report will be put in this file
#SBATCH --output="output_script_sort" #once your job is completed, any final job report comments will be put in this file

# load modules needed (specific need for my computer)
source /usr/share/Modules/init/sh # load the module function

# load BEDTools 
module load BEDTools/2.27.1-foss-2018b

for f in *merged_CpG_evidence.cov
do
  STEM=$(basename "${f}" .CpG_report.merged_CpG_evidence.cov)
  bedtools sort -i "${f}" \
  > "${STEM}"_sorted.cov
done

GENOME VERSION 2

bedtools_sortv2.sh:

#!/bin/bash
#SBATCH --job-name="v2H-sort"
#SBATCH -t 500:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --mem=500GB
#SBATCH --account=putnamlab
#SBATCH --export=NONE
#SBATCH -D /data/putnamlab/estrand/HoloInt_WGBS/merged_cov_genomev2 #### this is the output from the merge cov step above 
#SBATCH --cpus-per-task=3
#SBATCH --error="script_error_sortv2" #if your job fails, the error report will be put in this file
#SBATCH --output="output_script_sortv2" #once your job is completed, any final job report comments will be put in this file

# load modules needed (specific need for my computer)
source /usr/share/Modules/init/sh # load the module function

# load BEDTools 
module load BEDTools/2.27.1-foss-2018b

for f in *merged_CpG_evidence.cov
do
  STEM=$(basename "${f}" .CpG_report.merged_CpG_evidence.cov)
  bedtools sort -i "${f}" \
  > "${STEM}"_sorted.cov
done

No errors - move on.

OVERVIEW

The script is saying:

  • For every sample’s .cov file in the output folder merged_cov, use bedtools function to sort and then output a file with the same name plus _sorted.cov.

No errors reported with this script, move on to the next one. Viewing one file to make sure this worked and it appears to be sorted.

less 1536_sorted.cov

# example of output from this step V1
Pocillopora_acuta_HIv1___Scaffold_000000F___length_7015273      3246    3248    0.000000        0       2
Pocillopora_acuta_HIv1___Scaffold_000000F___length_7015273      3517    3519    0.000000        0       3
Pocillopora_acuta_HIv1___Scaffold_000000F___length_7015273      3538    3540    0.000000        0       3
Pocillopora_acuta_HIv1___Scaffold_000000F___length_7015273      3555    3557    0.000000        0       3

Greate bedgraph files

GENOME VERSION 2

bedgraph-v2.sh:

#!/bin/bash
#SBATCH --job-name="bedgraph-v2"
#SBATCH -t 500:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --mem=128GB
#SBATCH --export=NONE
#SBATCH -D /data/putnamlab/estrand/HoloInt_WGBS/merged_cov_genomev2 #### this is the output from the merge cov step above 
#SBATCH --cpus-per-task=3
#SBATCH --error="%x_error.%j" #if your job fails, the error report will be put in this file
#SBATCH --output="%x_output.%j" #once your job is completed, any final job report comments will be put in this file

# load modules needed (specific need for my computer)
source /usr/share/Modules/init/sh # load the module function

# Bedgraphs for 5X coverage 

for f in *_sorted.cov
do
  STEM=$(basename "${f}" _sorted.cov)
  cat "${f}" | awk -F $'\t' 'BEGIN {OFS = FS} {if ($5+$6 >= 5) {print $1, $2, $3, $4}}' \
  > "${STEM}"_5x_sorted.bedgraph
done

# Bedgraphs for 10X coverage 

for f in *_sorted.cov
do
  STEM=$(basename "${f}" _sorted.cov)
  cat "${f}" | awk -F $'\t' 'BEGIN {OFS = FS} {if ($5+$6 >= 10) {print $1, $2, $3, $4}}' \
  > "${STEM}"_10x_sorted.bedgraph
done

Filter for a specific coverage (5X, 10X)

This script is running a loop to filter CpGs for a specified coverage and creating tab files.

Essentially, the loop in this script will take columns 5 (Methylated) and 6 (Unmethylated) positions and keeps that row if it is greater than or equal to 5. This means that we have 5x coverage for this position. This limits our interpretation to 0%, 20%, 40%, 60%, 80%, 100% methylation resolution per position.

Input File: *merged_CpG_evidence.cov
Output File: 5x_sorted.tab and 10x_sorted.tab

GENOME VERSION 1

covX.sh:

#!/bin/bash
#SBATCH --job-name="H-covX"
#SBATCH -t 500:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --mem=500GB
#SBATCH --account=putnamlab
#SBATCH --export=NONE
#SBATCH -D /data/putnamlab/estrand/HoloInt_WGBS/merged_cov #### this is the output from the merge cov step above 
#SBATCH --cpus-per-task=3
#SBATCH --error="script_error_covX" #if your job fails, the error report will be put in this file
#SBATCH --output="output_script_covX" #once your job is completed, any final job report comments will be put in this file

# load modules needed (specific need for my computer)
source /usr/share/Modules/init/sh # load the module function

### Filtering for CpG for 5x coverage. To change the coverage, replace X with your desired coverage in ($5+6 >= X)

for f in *_sorted.cov
do
  STEM=$(basename "${f}" _sorted.cov)
  cat "${f}" | awk -F $'\t' 'BEGIN {OFS = FS} {if ($5+$6 >= 5) {print $1, $2, $3, $4, $5, $6}}' \
  > "${STEM}"_5x_sorted.tab
done

### Filtering for CpG for 10x coverage. To change the coverage, replace X with your desired coverage in ($5+6 >= X)

for f in *_sorted.cov
do
  STEM=$(basename "${f}" _sorted.cov)
  cat "${f}" | awk -F $'\t' 'BEGIN {OFS = FS} {if ($5+$6 >= 10) {print $1, $2, $3, $4, $5, $6}}' \
  > "${STEM}"_10x_sorted.tab
done

GENOME VERSION 2

covX-v2.sh:

#!/bin/bash
#SBATCH --job-name="v2H-covX"
#SBATCH -t 500:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --mem=128GB
#SBATCH --account=putnamlab
#SBATCH --export=NONE
#SBATCH -D /data/putnamlab/estrand/HoloInt_WGBS/merged_cov_genomev2 #### this is the output from the merge cov step above 
#SBATCH --cpus-per-task=3
#SBATCH --error="script_error_covX-v2" #if your job fails, the error report will be put in this file
#SBATCH --output="output_script_covX-v2" #once your job is completed, any final job report comments will be put in this file

# load modules needed (specific need for my computer)
source /usr/share/Modules/init/sh # load the module function

### Filtering for CpG for 5x coverage. To change the coverage, replace X with your desired coverage in ($5+6 >= X)

for f in *_sorted.cov
do
  STEM=$(basename "${f}" _sorted.cov)
  cat "${f}" | awk -F $'\t' 'BEGIN {OFS = FS} {if ($5+$6 >= 5) {print $1, $2, $3, $4, $5, $6}}' \
  > "${STEM}"_5x_sorted.tab
done

### Filtering for CpG for 10x coverage. To change the coverage, replace X with your desired coverage in ($5+6 >= X)

for f in *_sorted.cov
do
  STEM=$(basename "${f}" _sorted.cov)
  cat "${f}" | awk -F $'\t' 'BEGIN {OFS = FS} {if ($5+$6 >= 10) {print $1, $2, $3, $4, $5, $6}}' \
  > "${STEM}"_10x_sorted.tab
done

OVERVIEW

Moving forward I want to see the differences in data we get from 5X and 10X. We’ll have to decide which threshold to use moving forward. We want confidence and high resolution but also a large dataset so we need a happy medium.

No errors or output messages so we are good to move on.

At this point all samples have the following files:

  • 1755.CpG_report.txt
  • 1755.CpG_report.merged_CpG_evidence.cov
  • 1755_5x_sorted.tab
  • 1755_10x_sorted.tab

5x coverage

wc -l *5x_sorted.tab:

    6865623 1047_5x_sorted.tab
    8069887 1051_5x_sorted.tab
    7444259 1059_5x_sorted.tab
    7980157 1090_5x_sorted.tab
    6832422 1103_5x_sorted.tab
    2852639 1147_5x_sorted.tab
    6283492 1159_5x_sorted.tab
    6976028 1168_5x_sorted.tab
    7422350 1184_5x_sorted.tab
    7882151 1205_5x_sorted.tab
    5010918 1225_5x_sorted.tab
    6755446 1238_5x_sorted.tab
    8279427 1281_5x_sorted.tab
    6140381 1296_5x_sorted.tab
    7465439 1303_5x_sorted.tab
      80637 1312_5x_sorted.tab
    7859939 1329_5x_sorted.tab
    7546253 1415_5x_sorted.tab
    8010156 1416_5x_sorted.tab
    6948549 1427_5x_sorted.tab
    5184139 1445_5x_sorted.tab
    7492393 1459_5x_sorted.tab
    7658361 1487_5x_sorted.tab
    7545425 1536_5x_sorted.tab
    7246827 1559_5x_sorted.tab
    8173394 1563_5x_sorted.tab
    8099950 1571_5x_sorted.tab
    7413674 1582_5x_sorted.tab
    7495356 1596_5x_sorted.tab
    8376183 1641_5x_sorted.tab
    7389924 1647_5x_sorted.tab
     688805 1707_5x_sorted.tab
    9910482 1709_5x_sorted.tab
    8284869 1728_5x_sorted.tab
    7562180 1732_5x_sorted.tab
    7367647 1755_5x_sorted.tab
    7283732 1757_5x_sorted.tab
    6814612 1765_5x_sorted.tab
    4084161 1777_5x_sorted.tab
    8032989 1820_5x_sorted.tab
    7953679 2012_5x_sorted.tab
    7451856 2064_5x_sorted.tab
    6390273 2072_5x_sorted.tab
    8048988 2087_5x_sorted.tab
    5472832 2185_5x_sorted.tab
    6521355 2197_5x_sorted.tab
    1707460 2212_5x_sorted.tab
    8257655 2300_5x_sorted.tab
    7474390 2304_5x_sorted.tab
    7265909 2306_5x_sorted.tab
    7794581 2409_5x_sorted.tab
    7172620 2413_5x_sorted.tab
    6997311 2513_5x_sorted.tab
    8368965 2550_5x_sorted.tab
    6815393 2564_5x_sorted.tab
    6847716 2668_5x_sorted.tab
    7157433 2861_5x_sorted.tab
    6735201 2877_5x_sorted.tab
    8929120 2878_5x_sorted.tab
    7282137 2879_5x_sorted.tab
  415456130 total

10x coverage

wc -l *10x_sorted.tab:

 3982376 1047_10x_sorted.tab
    5467738 1051_10x_sorted.tab
    5273220 1059_10x_sorted.tab
    5707710 1090_10x_sorted.tab
    3913593 1103_10x_sorted.tab
     527796 1147_10x_sorted.tab
    3262202 1159_10x_sorted.tab
    4656862 1168_10x_sorted.tab
    5010797 1184_10x_sorted.tab
    5914757 1205_10x_sorted.tab
    1898034 1225_10x_sorted.tab
    4054856 1238_10x_sorted.tab
    6454765 1281_10x_sorted.tab
    3033366 1296_10x_sorted.tab
    5037380 1303_10x_sorted.tab
      46104 1312_10x_sorted.tab
    5370531 1329_10x_sorted.tab
    4721474 1415_10x_sorted.tab
    5994279 1416_10x_sorted.tab
    4263187 1427_10x_sorted.tab
    2073468 1445_10x_sorted.tab
    5046779 1459_10x_sorted.tab
    5408130 1487_10x_sorted.tab
    5030710 1536_10x_sorted.tab
    4676058 1559_10x_sorted.tab
    5935725 1563_10x_sorted.tab
    6100858 1571_10x_sorted.tab
    4908976 1582_10x_sorted.tab
    4802522 1596_10x_sorted.tab
    6385840 1641_10x_sorted.tab
    4952502 1647_10x_sorted.tab
      93089 1707_10x_sorted.tab
    9198191 1709_10x_sorted.tab
    5957516 1728_10x_sorted.tab
    5272299 1732_10x_sorted.tab
    5059171 1755_10x_sorted.tab
    4701658 1757_10x_sorted.tab
    3976650 1765_10x_sorted.tab
    1147103 1777_10x_sorted.tab
    6090104 1820_10x_sorted.tab
    5349823 2012_10x_sorted.tab
    5071313 2064_10x_sorted.tab
    3560352 2072_10x_sorted.tab
    5657362 2087_10x_sorted.tab
    2403681 2185_10x_sorted.tab
    3511657 2197_10x_sorted.tab
     206987 2212_10x_sorted.tab
    6133288 2300_10x_sorted.tab
    4836135 2304_10x_sorted.tab
    4608006 2306_10x_sorted.tab
    5409933 2409_10x_sorted.tab
    4339191 2413_10x_sorted.tab
    4547010 2513_10x_sorted.tab
    6939987 2550_10x_sorted.tab
    4086151 2564_10x_sorted.tab
    4132946 2668_10x_sorted.tab
    4501831 2861_10x_sorted.tab
    3682802 2877_10x_sorted.tab
    7221073 2878_10x_sorted.tab
    4803990 2879_10x_sorted.tab
  272411894 total

Remove samples that we don’t want to include in analysis

List of potential samples to take out:

  1. 1312: 5X coverage = 80,637; low aligned uniquely in multiqc report; methylation % = 13.1%
  2. 1147: 5X coverage = 2,852,639; low number of reads from the beginning
  3. 1707: 5X coverage = 688,805; a lot of deduplicated reads removed
  4. 1709: 5X coverage = 9,910,482; way high reads at the beginning
  5. 1777: 5X coverage = 4,084,161; low number of reads from the beginning
  6. 2197: Methlyation % = 4.1%
  7. 1225: Methylation % = 4.2%

1312 is M. capitata for another project. Filter this one out.

Only 59 P. acuta samples, remove 2197 and 1225 for high methylation bias

  • 2197 = ATHC 20181117 Tank 11
  • 1225 = HTAC 20181117 Tank 4
mkdir unused_samples

# doing cp first just in case this command doesn't run correctly 
cp merged_cov_genomev2/1225* unused_samples/
cp merged_cov_genomev2/2197* unused_samples/

rm merged_cov_genomev2/2197* 
rm merged_cov_genomev2/1225*

# this worked and is safe 
mv merged_cov_genomev2/1312* unused_samples/

After removing files - run the below scripts (again if already done previously)

Total sample number is now 57.

Create a file with positions found in all samples

We need to create a file that is filtered to only positions that are found in all samples (both methylated and unmethylated). multiIntersectBed creates a file that merges all samples together. The 4th column then tells you how samples have that position. We can then filter positions based on this column that is equal to our sample size. n=60 for this project.

Here is where we can loose the most reads.

Input file: 5x_sorted.tab and 10x_sorted.tab
Output file: CpG.filt.all.samps.5x_sorted.bed and CpG.filt.all.samps.10x_sorted.bed (one file for each coverage that has positions found in all samples)

GENOME VERSION 1

cov_allsamples.sh:

#!/bin/bash
#SBATCH --job-name="H-all_cov"
#SBATCH -t 500:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --mem=500GB
#SBATCH --account=putnamlab
#SBATCH --export=NONE
#SBATCH -D /data/putnamlab/estrand/HoloInt_WGBS/merged_cov #### this is the output from the merge cov step above 
#SBATCH --cpus-per-task=3
#SBATCH --error="script_error_all_cov" #if your job fails, the error report will be put in this file
#SBATCH --output="output_script_all_cov" #once your job is completed, any final job report comments will be put in this file

# load modules needed  
source /usr/share/Modules/init/sh # load the module function (specific to my computer)
module load BEDTools/2.27.1-foss-2018b

multiIntersectBed -i *_5x_sorted.tab > CpG.all.samps.5x_sorted.bed
multiIntersectBed -i *_10x_sorted.tab > CpG.all.samps.10x_sorted.bed

cat CpG.all.samps.5x_sorted.bed | awk '$4 ==60' > CpG.filt.all.samps.5x_sorted.bed

cat CpG.all.samps.10x_sorted.bed | awk '$4 ==60' > CpG.filt.all.samps.10x_sorted.bed

No errors in the script and all four files were created.

GENOME VERSION 2

cov_allsamples-v2.sh:

#!/bin/bash
#SBATCH --job-name="v2H-all_cov"
#SBATCH -t 500:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --mem=128GB
#SBATCH --account=putnamlab
#SBATCH --export=NONE
#SBATCH -D /data/putnamlab/estrand/HoloInt_WGBS/merged_cov_genomev2 #### this is the output from the merge cov step above 
#SBATCH --cpus-per-task=3
#SBATCH --error="script_error_all_cov-v2" #if your job fails, the error report will be put in this file
#SBATCH --output="output_script_all_cov-v2" #once your job is completed, any final job report comments will be put in this file

# load modules needed  
source /usr/share/Modules/init/sh # load the module function (specific to my computer)
module load BEDTools/2.27.1-foss-2018b

multiIntersectBed -i *_5x_sorted.tab > CpG.all.samps.5x_sorted.bed
multiIntersectBed -i *_10x_sorted.tab > CpG.all.samps.10x_sorted.bed

cat CpG.all.samps.5x_sorted.bed | awk '$4 ==57' > CpG.filt.all.samps.5x_sorted.bed

cat CpG.all.samps.10x_sorted.bed | awk '$4 ==57' > CpG.filt.all.samps.10x_sorted.bed
wc -l CpG.filt.all.samps.5x_sorted.bed
202245 CpG.filt.all.samps.5x_sorted.bed

wc -l CpG.filt.all.samps.10x_sorted.bed
11731 CpG.filt.all.samps.10x_sorted.bed

Gene Annotation file

http://cyanophora.rutgers.edu/Pocillopora_acuta/. This is the step I switched to version 2 fully and did not do with version 1.

GENOME VERSION 2

Download the genome.

wget http://cyanophora.rutgers.edu/Pocillopora_acuta/Pocillopora_acuta_HIv2.genes.gff3.gz
gunzip Pocillopora_acuta_HIv2.genes.gff3.gz

View the contents.

$ head Pocillopora_acuta_HIv2.genes.gff3

Pocillopora_acuta_HIv2___Sc0000016	AUGUSTUS	transcript	151	2746	.	+	.	ID=Pocillopora_acuta_HIv2___RNAseq.g24100.t1
Pocillopora_acuta_HIv2___Sc0000016	AUGUSTUS	CDS	151	172	.	+	0	Parent=Pocillopora_acuta_HIv2___RNAseq.g24100.t1
Pocillopora_acuta_HIv2___Sc0000016	AUGUSTUS	exon	151	172	.	+	0	Parent=Pocillopora_acuta_HIv2___RNAseq.g24100.t1
Pocillopora_acuta_HIv2___Sc0000016	AUGUSTUS	CDS	264	304	.	+	2	Parent=Pocillopora_acuta_HIv2___RNAseq.g24100.t1
Pocillopora_acuta_HIv2___Sc0000016	AUGUSTUS	exon	264	304	.	+	2	Parent=Pocillopora_acuta_HIv2___RNAseq.g24100.t1
Pocillopora_acuta_HIv2___Sc0000016	AUGUSTUS	CDS	1491	1602	.	+	0	Parent=Pocillopora_acuta_HIv2___RNAseq.g24100.t1
Pocillopora_acuta_HIv2___Sc0000016	AUGUSTUS	exon	1491	1602	.	+	0	Parent=Pocillopora_acuta_HIv2___RNAseq.g24100.t1
Pocillopora_acuta_HIv2___Sc0000016	AUGUSTUS	CDS	1889	1990	.	+	2	Parent=Pocillopora_acuta_HIv2___RNAseq.g24100.t1
Pocillopora_acuta_HIv2___Sc0000016	AUGUSTUS	exon	1889	1990	.	+	2	Parent=Pocillopora_acuta_HIv2___RNAseq.g24100.t1
Pocillopora_acuta_HIv2___Sc0000016	AUGUSTUS	CDS	2107	2127	.	+	2	Parent=Pocillopora_acuta_HIv2___RNAseq.g24100.t1

Filtering the 3rd column for only ‘genes’:

$ awk '{if ($3 == "gene") {print}}' Pocillopora_acuta_HIv2.genes.gff3  > Pocillopora_acuta_HIv2.genes_genesonly.gff3

This came up empty so the original file is only genes and we don’t need to filter to this. I removed the file created.

IntersectBed: Loci mapped to annotated gene

GENOME VERSION 2

Next, merge each sample file with gene annotation file using intersectBed.

Input files: *5x_sorted.tab and *10x_sorted.tab and Montipora_capitata_HIv2.genes.gff3
Output files: *_5x_sorted.tab_gene and *_10x_sorted.tab_gene

intersectBed-v2.sh:

#!/bin/bash
#SBATCH --job-name="v2H-intersectBed"
#SBATCH -t 500:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --mem=128GB
#SBATCH --account=putnamlab
#SBATCH --export=NONE
#SBATCH -D /data/putnamlab/estrand/HoloInt_WGBS/merged_cov_genomev2 #### this is the output from the merge cov step above 
#SBATCH --cpus-per-task=3
#SBATCH --error="script_error_intersectBed-v2" #if your job fails, the error report will be put in this file
#SBATCH --output="output_script_intersectBed-v2" #once your job is completed, any final job report comments will be put in this file

#  load modules needed  
source /usr/share/Modules/init/sh # load the module function (specific to my computer)
module load BEDTools/2.27.1-foss-2018b

for i in *5x_sorted.tab
do
  intersectBed \
  -wb \
  -a ${i} \
  -b /data/putnamlab/estrand/Pocillopora_acuta_HIv2.genes.gff3 \
  > ${i}_gene
done

for i in *10x_sorted.tab
do
  intersectBed \
  -wb \
  -a ${i} \
  -b /data/putnamlab/estrand/Pocillopora_acuta_HIv2.genes.gff3 \
  > ${i}_gene
done

No errors - move on.

IntersectBed: File to only positions found in all samples

GENOME VERSION 2

intersect_enrichment-v2.sh:

Run time = 15 minutes

#!/bin/bash
#SBATCH --job-name="v2H-enrich"
#SBATCH -t 500:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --mem=128GB
#SBATCH --account=putnamlab
#SBATCH --export=NONE
#SBATCH -D /data/putnamlab/estrand/HoloInt_WGBS/merged_cov_genomev2 #### this is the output from the merge cov step above 
#SBATCH --cpus-per-task=3
#SBATCH --error="script_error_v2H-enrich" #if your job fails, the error report will be put in this file
#SBATCH --output="output_script_v2H-enrich" #once your job is completed, any final job report comments will be put in this file

# load modules needed  
source /usr/share/Modules/init/sh # load the module function (specific to my computer)
module load BEDTools/2.27.1-foss-2018b

for i in *_5x_sorted.tab_gene
do
  intersectBed \
  -a ${i} \
  -b CpG.filt.all.samps.5x_sorted.bed \
  > ${i}_CpG_5x_enrichment.bed
done

for i in *_10x_sorted.tab_gene
do
  intersectBed \
  -a ${i} \
  -b CpG.filt.all.samps.10x_sorted.bed \
  > ${i}_CpG_10x_enrichment.bed
done

Within merged_cov_genomev2 folder:

wc -l *10x_enrichment.bed > 10x_enrichment_sample_size.txt 
wc -l *5x_enrichment.bed > 5x_enrichment_sample_size.txt

5X COVERAGE

head 5x_enrichment_sample_size.txt

    221239 1047_5x_sorted.tab_gene_CpG_5x_enrichment.bed
    221239 1051_5x_sorted.tab_gene_CpG_5x_enrichment.bed
    221239 1059_5x_sorted.tab_gene_CpG_5x_enrichment.bed
    221239 1090_5x_sorted.tab_gene_CpG_5x_enrichment.bed
    221239 1103_5x_sorted.tab_gene_CpG_5x_enrichment.bed
    221239 1147_5x_sorted.tab_gene_CpG_5x_enrichment.bed
    221239 1159_5x_sorted.tab_gene_CpG_5x_enrichment.bed
    221239 1168_5x_sorted.tab_gene_CpG_5x_enrichment.bed
    221239 1184_5x_sorted.tab_gene_CpG_5x_enrichment.bed
    221239 1205_5x_sorted.tab_gene_CpG_5x_enrichment.bed

10X COVERAGE

head 10x_enrichment_sample_size.txt

    10152 1047_10x_sorted.tab_gene_CpG_10x_enrichment.bed
    10152 1051_10x_sorted.tab_gene_CpG_10x_enrichment.bed
    10152 1059_10x_sorted.tab_gene_CpG_10x_enrichment.bed
    10152 1090_10x_sorted.tab_gene_CpG_10x_enrichment.bed
    10152 1103_10x_sorted.tab_gene_CpG_10x_enrichment.bed
    10152 1147_10x_sorted.tab_gene_CpG_10x_enrichment.bed
    10152 1159_10x_sorted.tab_gene_CpG_10x_enrichment.bed
    10152 1168_10x_sorted.tab_gene_CpG_10x_enrichment.bed
    10152 1184_10x_sorted.tab_gene_CpG_10x_enrichment.bed
    10152 1205_10x_sorted.tab_gene_CpG_10x_enrichment.bed

creating a file without the filtering for all samples step

I’m trying to see if Triploids have a higher number of methylated loci than diploids.

intersect_all-v2.sh:

#!/bin/bash
#SBATCH --job-name="v2H-allint"
#SBATCH -t 500:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --mem=128GB
#SBATCH --account=putnamlab
#SBATCH --export=NONE
#SBATCH -D /data/putnamlab/estrand/HoloInt_WGBS/merged_cov_genomev2 #### this is the output from the merge cov step above 
#SBATCH --cpus-per-task=3
#SBATCH --error="script_error_v2H-allint" #if your job fails, the error report will be put in this file
#SBATCH --output="output_script_v2H-allint" #once your job is completed, any final job report comments will be put in this file

# load modules needed  
source /usr/share/Modules/init/sh # load the module function (specific to my computer)
module load BEDTools/2.27.1-foss-2018b

for i in *_5x_sorted.tab_gene
do
  intersectBed \
  -a ${i} \
  -b CpG.all.samps.5x_sorted.bed \
  > ${i}_CpG_5x_all.bed
done

Within merged_cov_genomev2 folder:

wc -l *5x_all.bed > 5x_all_sample_size.txt

Output: head 5x_all_sample_size.txt

    5623521 1047_5x_sorted.tab_gene_CpG_5x_all.bed
    6164773 1051_5x_sorted.tab_gene_CpG_5x_all.bed
    5939069 1059_5x_sorted.tab_gene_CpG_5x_all.bed
    6109077 1090_5x_sorted.tab_gene_CpG_5x_all.bed
    5623282 1103_5x_sorted.tab_gene_CpG_5x_all.bed
    2932721 1147_5x_sorted.tab_gene_CpG_5x_all.bed
    5365392 1159_5x_sorted.tab_gene_CpG_5x_all.bed
    5717905 1168_5x_sorted.tab_gene_CpG_5x_all.bed
    5857148 1184_5x_sorted.tab_gene_CpG_5x_all.bed
    6119823 1205_5x_sorted.tab_gene_CpG_5x_all.bed

Export Files

scp 'emma_strand@ssh3.hac.uri.edu:/data/putnamlab/estrand/HoloInt_WGBS/merged_cov_genomev2/*_5x_enrichment.bed' ~/MyProjects/Acclim_Dynamics_molecular/data/WGBS/output/meth_counts_5x

scp 'emma_strand@ssh3.hac.uri.edu:/data/putnamlab/estrand/HoloInt_WGBS/merged_cov_genomev2/*_10x_enrichment.bed' ~/MyProjects/Acclim_Dynamics_molecular/data/WGBS/output/meth_counts_10x

without last filtering step test

scp 'emma_strand@ssh3.hac.uri.edu:/data/putnamlab/estrand/HoloInt_WGBS/merged_cov_genomev2/*_5x_all.bed' ~/MyProjects/Acclim_Dynamics_molecular/data/WGBS/output/meth_counts_5x

Troubleshooting

Prior to filtering out samples 1312, 1225, and 2197

wc -l CpG.all.samps.5x_sorted.bed
# 10189403 CpG.all.samps.5x_sorted.bed

wc -l CpG.filt.all.samps.5x_sorted.bed
# 3465 CpG.filt.all.samps.5x_sorted.bed

cat CpG.all.samps.5x_sorted.bed | awk '$4 ==57' > CpG.filt.all.samps.5x_sorted-57.bed
# 981018 CpG.filt.all.samps.5x_sorted-57.bed

cat CpG.all.samps.5x_sorted.bed | awk '$4 ==58' > CpG.filt.all.samps.5x_sorted-58.bed
# 610555 CpG.filt.all.samps.5x_sorted-58.bed

cat CpG.all.samps.5x_sorted.bed | awk '$4 ==59' > CpG.filt.all.samps.5x_sorted-59.bed
# 204177 CpG.filt.all.samps.5x_sorted-59.bed

5X COVERAGE

    2955 1047_5x_sorted.tab_gene_CpG_5x_enrichment.bed
    2955 1051_5x_sorted.tab_gene_CpG_5x_enrichment.bed
    2955 1059_5x_sorted.tab_gene_CpG_5x_enrichment.bed
    2955 1090_5x_sorted.tab_gene_CpG_5x_enrichment.bed
    2955 1103_5x_sorted.tab_gene_CpG_5x_enrichment.bed
    2955 1147_5x_sorted.tab_gene_CpG_5x_enrichment.bed
    2955 1159_5x_sorted.tab_gene_CpG_5x_enrichment.bed
    2955 1168_5x_sorted.tab_gene_CpG_5x_enrichment.bed
    2955 1184_5x_sorted.tab_gene_CpG_5x_enrichment.bed
    2955 1205_5x_sorted.tab_gene_CpG_5x_enrichment.bed

10X COVERAGE

    190 1047_10x_sorted.tab_gene_CpG_10x_enrichment.bed
    190 1051_10x_sorted.tab_gene_CpG_10x_enrichment.bed
    190 1059_10x_sorted.tab_gene_CpG_10x_enrichment.bed
    190 1090_10x_sorted.tab_gene_CpG_10x_enrichment.bed
    190 1103_10x_sorted.tab_gene_CpG_10x_enrichment.bed
    190 1147_10x_sorted.tab_gene_CpG_10x_enrichment.bed
    190 1159_10x_sorted.tab_gene_CpG_10x_enrichment.bed
    190 1168_10x_sorted.tab_gene_CpG_10x_enrichment.bed
    190 1184_10x_sorted.tab_gene_CpG_10x_enrichment.bed
    190 1205_10x_sorted.tab_gene_CpG_10x_enrichment.bed

Versions 1 and 2 of the Pacuta genome

Halfway through my analysis, a new and improved P. acuta genome was released. I did the analysis with both of these files.

Gene annotation with the wrong Pacuta file

This step needs a modified gff file that is only includes gene positions.

http://ihpe.univ-perp.fr/acces-aux-donnees/. This website has a Data_to_downoload.rar file that I downloaded to Andromeda using wget http://ihpe.univ-perp.fr/telechargement/Data_to_downoload.rar. Kevin Bryan downloaded the module unrar/6.0.2-GCCcore-10.2.0 for me to use the function unrar to see what files are stored in this Data_to_downoload.rar file.

Data_to_downoload.rar is not mispelled. There is an extra “o” after “down”.

# mkdir /data/putnamlab/estrand/Pacuta_download_genome_rar
# wget command above

# load the module 
$ module load unrar/6.0.2-GCCcore-10.2.0

# to see all files that are in this .rar file
$ unrar -l Data_to_downoload.rar 

Some files of interest in the output: 
/Pocillopora_acuta_genome_v1.fasta
/READ_ME.txt
/Structural_annotation_abintio.gff
/Structural_annotation_experimental.gff
/Functionnal_annotation_orthoMCL_abinitio

# download only one file from the .rar file 
unrar x Data_to_downoload.rar Data_to_downoload/READ_ME.txt 
# then deleted the Data_to_downoload directory folder made in the above command (b/c mispelled)

unrar x Data_to_downoload.rar Data_to_downoload/Structural_annotation_abintio.gff

After opening the READ_ME.txt file, I belive this is the .gff I’m looking for:

  • Data_to_downoload/Structural_annotation_abintio.gff (not spelled the same as in the READ_ME.txt file. File name is missing an “i”).
  • This file is the results of the structural annotation performed on the genes predicted from AUGUSTUS. It give the position of genes, transcripts, exon (initial, internal and terminal) CDS, intron, start codon and stop codon on the genome sequence. It is in gff format.

Contents of that file:

scaffold7cov100 b2h     ep      12544   12555   0       .       .       grp=TCONS00053944;pri=4;src=E "3.16e+04;0.995;16:2"
scaffold7cov100 b2h     ep      12533   12556   0       .       .       grp=TCONS00035771;pri=4;src=E "1.58e+03;0.995;3:2"
scaffold7cov100 b2h     ep      12532   12564   0       .       .       grp=TCONS00018041;pri=4;src=E "316;0.995;0:2"
scaffold7cov100 b2h     ep      12535   12564   0       .       .       grp=TCONS00051924;pri=4;src=E "1.58e+03;0.995;3:2"
scaffold7cov100 b2h     ep      12533   12565   0       .       .       grp=TCONS00013753;pri=4;src=E "588;0.995;1:2"

Filtering the 3rd column for only ‘genes’:

$ awk '{if ($3 == "gene") {print}}' Structural_annotation_abintio.gff  > Structural_annotation_abintio_genes.gff

This filters out exon and CDS? Those are parts of genes so don’t we want those too? I’m filtering for ‘genes’ for now but I might need to come back to this step and use a version that has all the parts of a gene..

IntersectBed: Loci mapped to annotated gene

Next, merge each sample file with gene annotation file using intersectBed.

Input files: *5x_sorted.tab and *10x_sorted.tab and Montipora_capitata_HIv2.genes.gff3
Output files: *_5x_sorted.tab_gene and *_10x_sorted.tab_gene

intersectBed.sh:

#!/bin/bash
#SBATCH --job-name="H-intersectBed"
#SBATCH -t 500:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --mem=500GB
#SBATCH --account=putnamlab
#SBATCH --export=NONE
#SBATCH -D /data/putnamlab/estrand/HoloInt_WGBS/merged_cov #### this is the output from the merge cov step above 
#SBATCH --cpus-per-task=3
#SBATCH --error="script_error_intersectBed" #if your job fails, the error report will be put in this file
#SBATCH --output="output_script_intersectBed" #once your job is completed, any final job report comments will be put in this file

# load modules needed  
source /usr/share/Modules/init/sh # load the module function (specific to my computer)
module load BEDTools/2.27.1-foss-2018b

for i in *5x_sorted.tab
do
  intersectBed \
  -wb \
  -a ${i} \
  -b /data/putnamlab/estrand/Pacuta_download_genome_rar/Data_to_downoload/Structural_annotation_abintio_genes.gff \
  > ${i}_gene
done

for i in *10x_sorted.tab
do
  intersectBed \
  -wb \
  -a ${i} \
  -b /data/putnamlab/estrand/Pacuta_download_genome_rar/Data_to_downoload/Structural_annotation_abintio_genes.gff \
  > ${i}_gene
done

The _gene files came back empty..

I think it’s because the Structural_annotation_abintio.gff (and genes subset file I made) and the sorted tab files for coverage don’t have the same gene names? So intersectBed is coming up with a blank file because none of them are the same.

Structural_annotation_abintio_genes.gff:

# start gene g1
scaffold6cov64  AUGUSTUS        gene    1       5652    0.46    -       .       g1
# end gene g1
# start gene g2
scaffold6cov64  AUGUSTUS        gene    5805    6678    0.57    +       .       g2
# end gene g2
# start gene g3
scaffold7cov100 AUGUSTUS        gene    1       2566    0.96    +       .       g3
# end gene g3
# start gene g4

Sample ID File: 2197_5x_sorted.tab

Pocillopora_acuta_HIv1___Scaffold_000000F___length_7015273      3981    3983    0.000000        0       5
Pocillopora_acuta_HIv1___Scaffold_000000F___length_7015273      11833   11835   0.000000        0       5
Pocillopora_acuta_HIv1___Scaffold_000000F___length_7015273      11869   11871   0.000000        0       5
Pocillopora_acuta_HIv1___Scaffold_000000F___length_7015273      11880   11882   0.000000        0       5
Pocillopora_acuta_HIv1___Scaffold_000000F___length_7015273      14825   14827   0.000000        0       10
Pocillopora_acuta_HIv1___Scaffold_000000F___length_7015273      14876   14878   0.000000        0       10
Pocillopora_acuta_HIv1___Scaffold_000000F___length_7015273      14904   14906   0.000000        0       8
Pocillopora_acuta_HIv1___Scaffold_000000F___length_7015273      14910   14912   0.000000        0       10
Pocillopora_acuta_HIv1___Scaffold_000000F___length_7015273      14947   14949   0.000000        0       10
Pocillopora_acuta_HIv1___Scaffold_000000F___length_7015273      14970   14972   0.000000        0       10

left off here: might be using wrong gff file… try one not built with augustus? what other files are in the .rar file? Do we have this correct file anywhere else on andromeda?

Original methylseq parameters (old_HoloInt_methylseq) - the first time it was run (which I don’t think was fully which is why I ran this again)

This doesn’t look right.. I don’t think this script ran all the way..

#!/bin/bash
#SBATCH --job-name="methylseq"
#SBATCH -t 120:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --mem=128GB
#SBATCH --export=NONE
#SBATCH -D /data/putnamlab/estrand/HoloInt_WGBS
#SBATCH --cpus-per-task=3

# load modules needed
source /usr/share/Modules/init/sh # load the module function

module load Nextflow/21.03.0

# run nextflow methylseq

nextflow run nf-core/methylseq -resume \
-profile singularity \
--aligner bismark \
--igenomes_ignore \
--fasta /data/putnamlab/estrand/Pocillopora_acuta_HIv1.assembly.fasta \
--save_reference \
--input '/data/putnamlab/KITT/hputnam/20211008_HoloInt_WGBS/*_R{1,2}_001.fastq.gz' \
--clip_r1 10 \
--clip_r2 10 \
--three_prime_clip_r1 10 --three_prime_clip_r2 10 \
--non_directional \
--cytosine_report \
--relax_mismatches \
--unmapped \
--outdir /data/putnamlab/estrand/HoloInt_WGBS

I couldn’t find the multiqc report from this script so I ran this command in the terminal and it worked (in the old_HoloInt_methylseq folder):

$ module load MultiQC/1.9-intel-2020a-Python-3.8.2

$ multiqc -f --filename WGBS_methylseq_HoloInt_original_multiqc_report  . \
      -m custom_content -m picard -m qualimap -m bismark -m samtools -m preseq -m cutadapt -m fastqc

[WARNING]         multiqc : MultiQC Version v1.12 now available!
[INFO   ]         multiqc : This is MultiQC v1.9
[INFO   ]         multiqc : Template    : default
[INFO   ]         multiqc : Searching   : /data/putnamlab/estrand/HoloInt_WGBS/old_HoloInt_methylseq
[INFO   ]         multiqc : Only using modules custom_content, picard, qualimap, bismark, samtools, preseq, cutadapt, fastqc
Searching 3881 files..  [####################################]  100%          
[INFO   ]  custom_content : nf-core-methylseq-summary: Found 1 sample (html)
[INFO   ]        qualimap : Found 50 BamQC reports
[INFO   ]          preseq : Found 49 reports
[INFO   ]         bismark : Found 60 alignment reports
[INFO   ]         bismark : Found 50 dedup reports
[INFO   ]         bismark : Found 29 methextract reports
[INFO   ]        cutadapt : Found 120 reports
[INFO   ]          fastqc : Found 240 reports
[INFO   ]         multiqc : Compressing plot data
[INFO   ]         multiqc : Report      : WGBS_methylseq_HoloInt_original_multiqc_report.html
[INFO   ]         multiqc : Data        : WGBS_methylseq_HoloInt_original_multiqc_report_data
[INFO   ]         multiqc : MultiQC complete

The above numbers look wonky.. Should be 60 preseq reports?

Ouptut: WGBS_methylseq_HoloInt_original_multiqc_report.html.

scp emma_strand@ssh3.hac.uri.edu:../../data/putnamlab/estrand/HoloInt_WGBS/old_HoloInt_methylseq/WGBS_methylseq_HoloInt_original_multiqc_report.html /Users/emmastrand/MyProjects/Acclim_Dynamics/Molecular_paper/WGBS/output/WGBS_methylseq_HoloInt_original_multiqc_report.html
Written on October 21, 2021