Point Judith Oyster DNA Methylation (MBD-BS)
Point Judith Oyster DNA Methylation
We used MBD-BS for our 2022 lab meetings oyster methylation paper: Cvir github page; laboratory protocol details post. Sequenced on a single Illumina NovaSeq S4 flow cell lane for 2 x 300 bp sequencing at Genewiz.
Contents:
Setting Up Andromeda
Original data lives here: /data/putnamlab/KITT/hputnam/20200119_Oyst_Nut/MBDBS.
Make a new directory for output files
$ mkdir /data/putnamlab/estrand/PointJudithData_MBDBS
## within the new PointJudithData_MBDBS folder
$ mkdir fastqc_results
$ mkdir scripts
$ mkdir PJ_methylseq1 ### for methylseq output
Creating a test run folder
Creating a test set folder in case I need to run different iterations of the methylseq script.
$ mkdir test_set
$ cp /data/putnamlab/KITT/hputnam/20200119_Oyst_Nut/MBDBS/HPB10_S44_L001_R{1,2}_001.fastq.gz /data/putnamlab/estrand/PointJudithData_MBDBS/test_set
$ cp /data/putnamlab/KITT/hputnam/20200119_Oyst_Nut/MBDBS/HPB11_S45_L001_R{1,2}_001.fastq.gz /data/putnamlab/estrand/PointJudithData_MBDBS/test_set
$ cp /data/putnamlab/KITT/hputnam/20200119_Oyst_Nut/MBDBS/HPB12_S46_L001_R{1,2}_001.fastq.gz /data/putnamlab/estrand/PointJudithData_MBDBS/test_set
Download genome file
C. virginica genome (the reference we will be using): https://www.ncbi.nlm.nih.gov/genome/398.
$ wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/002/022/765/GCF_002022765.2_C_virginica-3.0/GCF_002022765.2_C_virginica-3.0_genomic.fna.gz
into desired directory.
fna = FastA format file containing Nucleotide sequence (DNA)
Initial fastqc on files
fastqc.sh
. This took less than 10 minutes.
#!/bin/bash
#SBATCH -t 24:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=emma_strand@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/estrand/PointJudithData_MBDBS/scripts
#SBATCH --error="script_error_fastqc" #if your job fails, the error report will be put in this file
#SBATCH --output="output_script_fastqc" #once your job is completed, any final job report comments will be put in this file
source /usr/share/Modules/init/sh # load the module function
cd /data/putnamlab/estrand/PointJudithData_MBDBS
module load FastQC/0.11.9-Java-11
module load MultiQC/1.9-intel-2020a-Python-3.8.2
for file in /data/putnamlab/KITT/hputnam/20200119_Oyst_Nut/MBDBS/*fastq.gz
do
fastqc $file --outdir /data/putnamlab/estrand/PointJudithData_MBDBS/fastqc_results
done
multiqc --interactive fastqc_results
Initial MultiQC Report
scp emma_strand@ssh3.hac.uri.edu:../../data/putnamlab/estrand/PointJudithData_MBDBS/multiqc_report.html /Users/emmastrand/MyProjects/Cvir_Nut_Int/output/MBDBS/initial_multiqc_report.html
Full report here: https://github.com/hputnam/Cvir_Nut_Int/blob/master/output/MBDBS/initial_multiqc_report.html
Number of reads range from 820k to 1550k but most fall around 900-100k per sample. Most sequences are the full 300 bp length, and none lower than 280 bp.
Mean quality sequencing is low
We might need to mess with cut off variations in the methylseq because of this. We’ll see how low cut-offs go first.
NF-core: Methylseq
Different runs
Parameters
PJ_methylseq1.sh
: –clip_r1 10 \ –clip_r2 10 \ –three_prime_clip_r1 10 –three_prime_clip_r2 10 \
- Run this first to assess m-bias and then decide if we need more trial runs. See Emma Strand and Kevin Wong’s notebook posts for methylation scripts and how they dealt with these issues.
PJ_methylseq2.sh
: –clip_r1 10 \ –clip_r2 10 \ –three_prime_clip_r1 20 –three_prime_clip_r2 20 \
- Because the run time seemed oddly short, I wanted to run this again to be sure it worked.
PJ_methylseq3.sh
: –clip_r1 150 \ –clip_r2 150 \ –three_prime_clip_r1 150 –three_prime_clip_r2 150 \
- We think that methylseq isn’t properly recognizing the adapters because 1.) the multiqc report says 300 bp length when usually for methylation data we use 2x150 bp. 2.) If the program isn’t trimming the adapter fully that would explain the extreme m-bias, adapter content, and decreased quality all after 150 bp length. 3.) Usually Illumina adapters are ~150 bp of the read length and are not included in multiqc reports (if recognized appropriately).
- I’m cutting 150 here to see if this makes a difference but unsure if we trust methylseq if it’s not recognizing the adapters (i.e. what other problems might we have and don’t see yet?)
Run Time
PJ_methylseq1.sh
Run time: 3h 49m 45s (this seems weirdly short..)
PJ_methylseq2.sh
Run time: 20+hr
PJ_methylseq3.sh
Run time:
Methylseq script
Take out the comments after the lines of code if you copy and paste this script (i.e. “# EDIT HERE FOR PJ METHYLSEQ#”)
#!/bin/bash
#SBATCH --job-name="1methylseq" # EDIT HERE FOR PJ METHYLSEQ#
#SBATCH -t 200:00:00
#SBATCH --nodes=1 --cpus-per-task=15 # PJ METHYLSEQ_3 needs --cpus-per-task=15; the others didn't
#SBATCH --mem=128GB
#SBATCH --account=putnamlab
#SBATCH --export=NONE
#SBATCH -D /data/putnamlab/estrand/PointJudithData_MBDBS
#SBATCH --error=../"%x_error.%j" #if your job fails, the error report will be put in this file
#SBATCH --output=../"%x_output.%j" #once your job is completed, any final job report comments will be put in this file
# load modules needed
source /usr/share/Modules/init/sh # load the module function
module load Nextflow/21.03.0
# run nextflow methylseq
nextflow run nf-core/methylseq -resume \
-profile singularity \
--aligner bismark \
--igenomes_ignore \
--fasta /data/putnamlab/estrand/PointJudithData_MBDBS/GCF_002022765.2_C_virginica-3.0_genomic.fa \
--save_reference \
--input '/data/putnamlab/KITT/hputnam/20200119_Oyst_Nut/MBDBS/*_R{1,2}_001.fastq.gz' \
--clip_r1 10 \ # EDIT HERE FOR PJ METHYLSEQ#
--clip_r2 10 \ # EDIT HERE FOR PJ METHYLSEQ#
--three_prime_clip_r1 10 \ # EDIT HERE FOR PJ METHYLSEQ#
--three_prime_clip_r2 10 \ # EDIT HERE FOR PJ METHYLSEQ#
--non_directional \
--cytosine_report \
--relax_mismatches \
--unmapped \
--outdir /data/putnamlab/estrand/PointJudithData_MBDBS/PJ_methylseq1 # EDIT HERE FOR PJ METHYLSEQ#
-name PJ_3 # EDIT HERE FOR PJ METHYLSEQ# ; THIS HAS TO BE DIFFERENT EVERY TIME YOU RUN IT
The multiqc function is running into errors from the above script so I ran (this took much longer than KBay and HoloInt?):
creating multiqc report
The multiqc report fails with methylseq b/c we need a more updated versinon. The below code also works.
PJ_methylseq1
interactive
module load MultiQC/1.9-intel-2020a-Python-3.8.2
multiqc -f --filename MBDBS_methylseq_PJ_multiqc_report . \
-m custom_content -m picard -m qualimap -m bismark -m samtools -m preseq -m cutadapt -m fastqc
mv MBDBS_methylseq_PJ_multiqc_report.html MBDBS_methylseq_PJ_multiqc_report2.html
I found another multiqc report that I think is what I’m looking for?
Copying this file to project folder:
scp emma_strand@ssh3.hac.uri.edu:../../data/putnamlab/estrand/PointJudithData_MBDBS/PJ_methylseq1/MultiQC/multiqc_report.html /Users/emmastrand/MyProjects/Cvir_Nut_Int/output/MBDBS/MBDBS_methylseq_PJ_multiqc_report.html
scp emma_strand@ssh3.hac.uri.edu:../../data/putnamlab/estrand/PointJudithData_MBDBS/MBDBS_methylseq_PJ_multiqc_report.html /Users/emmastrand/MyProjects/Cvir_Nut_Int/output/MBDBS/MBDBS_methylseq_PJ_multiqc_report2.html
PJ_methylseq2
interactive
module load MultiQC/1.9-intel-2020a-Python-3.8.2
multiqc -f --filename MBDBS_methylseq_PJ_multiqc_report_methylseq2 . \
-m custom_content -m picard -m qualimap -m bismark -m samtools -m preseq -m cutadapt -m fastqc
Copying this file to project folder:
scp emma_strand@ssh3.hac.uri.edu:../../data/putnamlab/estrand/PointJudithData_MBDBS/PJ_methylseq2/MBDBS_methylseq_PJ_multiqc_report_methylseq2.html /Users/emmastrand/MyProjects/Cvir_Nut_Int/output/MBDBS/MBDBS_methylseq_PJ_multiqc_report_methylseq2.html
Methylseq 1: Full Multiqc Report
https://github.com/hputnam/Cvir_Nut_Int/blob/master/output/MBDBS/MBDBS_methylseq_PJ_multiqc_report.html
The remaining multiqc reports are on our Cvir repo..