BSSnper on WGBS data for KBay Bleaching Pairs
Previous workflow
I previously ran nextflow on WGBS data from the KBay Bleaching Pairs (workflow here), but now I want to add the BS-Snper program to identify SNPs in our dataset.
Corresponding github repo.
This time I will run nextflow on Unity within the scratch directory.
Set up a shared scratch directory
7-15-2025: I set up a shared directory in scratch for me and Hollie for re-running this with the ploidy dataset.
https://docs.unity.uri.edu/documentation/managing-files/hpc-workspace/. Max days = 30
Options for creating shared workspaces:
Usage: ws_allocate: [options] workspace_name duration
Options:
-h [ --help ] produce help message
-V [ --version ] show version
-d [ --duration ] arg duration in days
-n [ --name ] arg workspace name
-F [ --filesystem ] arg filesystem
-r [ --reminder ] arg reminder to be sent n days before expiration
-m [ --mailaddress ] arg mailaddress to send reminder to
-x [ --extension ] extend workspace
-u [ --username ] arg username
-g [ --group ] group workspace
-G [ --groupname ] arg groupname
-c [ --comment ] arg comment
Creating space shared between me and Hollie
ws_allocate -G pi_hputnam_uri_edu shared -m emma_strand@uri.edu -r 1
## that successfully created it but then I tried:
ws_allocate -G pi_hputnam_uri_edu shared -m emma_strand@uri.edu -r 2 -d 30 -n Strand_Putnam
## this also worked but just re-used the previous workspace, so the -d 30 didn't take effect. The allocation only has ~7 days on it, so I'll need to use the 5 available extensions to keep it going.
ws_list
id: shared
workspace directory : /scratch3/workspace/emma_strand_uri_edu-shared
remaining time : 6 days 23 hours
creation time : Wed Jul 16 23:19:35 2025
expiration date : Wed Jul 23 23:19:35 2025
filesystem name : workspace
available extensions : 5
7-23-2025 I’m extending this workspace: ws_extend shared 30
emma_strand_uri_edu@login1:/scratch3/workspace/emma_strand_uri_edu-shared$ ws_extend shared 30
Info: extending workspace.
Info: reused mail address emma_strand@uri.edu
/scratch3/workspace/emma_strand_uri_edu-shared
remaining extensions : 4
remaining time in days: 30
Download genome
Navigate to the proper folder: /work/pi_hputnam_uri_edu/estrand/BleachingPairs_WGBS
Download the genome:
wget http://cyanophora.rutgers.edu/montipora/Montipora_capitata_HIv3.assembly.fasta.gz
gunzip Montipora_capitata_HIv3.assembly.fasta.gz
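A quick, optional sanity check on the download (just counts the number of sequences in the unzipped assembly):
## count the sequences in the assembly fasta
grep -c ">" Montipora_capitata_HIv3.assembly.fasta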
Creating samplesheet
Create a list of rawdata files: ls -d /project/pi_hputnam_uri_edu/raw_sequencing_data/20211008_BleachingPairs_WGBS/*.gz > /work/pi_hputnam_uri_edu/estrand/BleachingPairs_WGBS/rawdata_file_list
Use RStudio in OOD to run the following R script (create_metadata.R) to create the samplesheet.
### Creating samplesheet for nextflow methylseq
### Bleaching Pairs
## Load libraries
library(dplyr)
library(stringr)
library(strex)
### Read in the raw data file list and build sample IDs
sample_list <- read.delim2("/work/pi_hputnam_uri_edu/estrand/BleachingPairs_WGBS/rawdata_file_list", header=F) %>%
dplyr::rename(fastq_1 = V1) %>%
mutate(sample = str_after_nth(fastq_1, "WGBS/", 1),
sample = str_before_nth(sample, "_S", 1),
sample = paste0("HI_", sample)
)
# replace dashes with underscores in the sample ID
sample_list$sample <- gsub("-", "_", sample_list$sample)
# keeping only rows with R1
sample_list <- filter(sample_list, grepl("R1", fastq_1, ignore.case = TRUE))
# duplicating column
sample_list$fastq_2 <- sample_list$fastq_1
# replacing R1 with R2 in only one column
sample_list$fastq_2 <- gsub("R1", "R2", sample_list$fastq_2)
# rearranging columns
sample_list <- sample_list[,c(2,1,3)]
sample_list %>% write.csv("/work/pi_hputnam_uri_edu/estrand/BleachingPairs_WGBS/samplesheet.csv",
row.names=FALSE, quote = FALSE)
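The samplesheet needs the headers sample, fastq_1, fastq_2 that methylseq expects. A quick check, with a hypothetical example row (the actual fastq filenames will differ):
head -2 /work/pi_hputnam_uri_edu/estrand/BleachingPairs_WGBS/samplesheet.csv
## expected format (filenames invented for illustration):
## sample,fastq_1,fastq_2
## HI_17,/project/pi_hputnam_uri_edu/raw_sequencing_data/20211008_BleachingPairs_WGBS/17_S1_R1_001.fastq.gz,/project/pi_hputnam_uri_edu/raw_sequencing_data/20211008_BleachingPairs_WGBS/17_S1_R2_001.fastq.gz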
Nextflow methyl-seq
01-KBay_WGBS_nexflow.sh
#!/usr/bin/env bash
#SBATCH --export=NONE
#SBATCH --nodes=1 --ntasks-per-node=48
#SBATCH --partition=uri-cpu
#SBATCH --no-requeue
#SBATCH --mem=600GB
#SBATCH -t 120:00:00
#SBATCH -o output/"%x_output.%j"
#SBATCH -e output/"%x_error.%j"
## Load Nextflow and Apptainer environment modules
module purge
module load nextflow/24.10.3
module load apptainer/latest
## Set Nextflow directories to use scratch
out="/scratch3/workspace/emma_strand_uri_edu-shared/BleachingPairs_WGBS"
export NXF_WORK=${out}/nextflow_work
export NXF_TEMP=${out}/nextflow_temp
export NXF_LAUNCHER=${out}/nextflow_launcher
export APPTAINER_CACHEDIR=${out}/apptainer_cache
export SINGULARITY_CACHEDIR=${out}/apptainer_cache
export NXF_SINGULARITY_CACHEDIR=${out}/apptainer_cache
## set paths
samplesheet="/work/pi_hputnam_uri_edu/estrand/BleachingPairs_WGBS/samplesheet.csv"
ref="/work/pi_hputnam_uri_edu/estrand/BleachingPairs_WGBS/Montipora_capitata_HIv3.assembly.fasta"
# run nextflow methylseq
nextflow run nf-core/methylseq -resume \
-profile singularity \
--aligner bismark \
--igenomes_ignore \
--fasta ${ref} \
--input ${samplesheet} \
--clip_r1 10 --clip_r2 10 \
--three_prime_clip_r1 10 --three_prime_clip_r2 10 \
--non_directional \
--cytosine_report \
--relax_mismatches \
--outdir ${out} \
--skip_fastqc --skip_multiqc
Troubleshooting
Testing to see if the .sh works: use salloc to grab an interactive node, then bash 01-KBay_WGBS_nexflow.sh.
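For reference, the interactive test looked roughly like this (the salloc resource flags here are illustrative, not the exact values I used):
## grab an interactive allocation, then run the script directly
salloc -p uri-cpu -c 12 --mem=100G -t 04:00:00
bash 01-KBay_WGBS_nexflow.sh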
7-17-2025: Interactive node test of the script. HoloInt started running, but I need to change the samplesheet headers to sample, fastq_1, fastq_2. Edited the samplesheet and ran again on the interactive node. Got a CPU-requirement error (below) because the interactive node only had 1 CPU available, so now I can switch this to sbatch!
Execution cancelled -- Finishing pending tasks before exit
Pulling Singularity image https://depot.galaxyproject.org/singularity/multiqc:1.29--pyhdfd78af_0 [cache /scratch3/workspace/emma_strand_uri_edu-shared/BleachingPairs_WGBS/apptainer_cache/depot.galaxyproject.org-singularity-multiqc-1.29--pyhdfd78af_0.img]
-[nf-core/methylseq] Pipeline completed with errors-
ERROR ~ Error executing process > 'NFCORE_METHYLSEQ:METHYLSEQ:TRIMGALORE (HI_17)'
Caused by:
Process requirement exceeds available CPUs -- req: 12; avail: 1
7-18-2025: The work directory didn't export to /scratch, so I stopped the run, added --skip_fastqc and --skip_multiqc, took out the save-unmapped-reads option, and upped the memory and tasks. I also got an error and need to unzip this again. Still got the CPU-requirement error on the interactive node, so switching this to sbatch.
Take out Windows carriage returns: sed -i 's/\r$//' 01-KBay_WGBS_nexflow.sh
Finished!!
Sorting deduplicated bam files
The pipeline already does this, yay.
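If it hadn't, sorting and indexing would look something like this (a minimal sketch assuming samtools is available on the cluster; the filename is illustrative):
## coordinate-sort a deduplicated bam and index it
samtools sort -@ 8 -o HI_17.deduplicated.sorted.bam HI_17.deduplicated.bam
samtools index HI_17.deduplicated.sorted.bam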
Running biscuit on deduplicated bam files
https://shellywanamaker.github.io/401th-post/
https://huishenlab.github.io/biscuit/
Set up a conda environment to install biscuit in
## path to putnamlab conda environments
/work/pi_hputnam_uri_edu/conda/envs
## I need to set up conda channels (only once)
conda config --add channels defaults
conda config --add channels conda-forge
conda config --add channels bioconda
conda config --set channel_priority strict
## making new conda environment in that path so all can use it
conda create --prefix /work/pi_hputnam_uri_edu/conda/envs/biscuit biscuit
## test that
conda activate /work/pi_hputnam_uri_edu/conda/envs/biscuit
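A quick check that the environment works and which options this biscuit build accepts (running a subcommand with no arguments should print its usage), which turns out to matter later for the threads-flag issue:
## confirm the binary resolves from the new environment
which biscuit
## print the pileup usage to see which threads flag this version uses
biscuit pileup 2>&1 | head -20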
Run the script. Shelly had trouble with the sample name not being retained through subsequent steps, so she ran a for loop; the script below will use a slurm array instead.
nano 02-biscuit_SNP.sh
#!/usr/bin/env bash
#SBATCH --export=NONE
#SBATCH --nodes=1 --ntasks-per-node=12
#SBATCH --partition=uri-cpu
#SBATCH --no-requeue
#SBATCH --mem=50GB
#SBATCH -t 120:00:00
#SBATCH -o output/biscuit/"%x_output.%j"
#SBATCH -e output/biscuit/"%x_error.%j"
## Load Conda environment with biscuit downloaded
module load conda/latest
conda activate /work/pi_hputnam_uri_edu/conda/envs/biscuit
## Set output directories to use scratch
out="/scratch3/workspace/emma_strand_uri_edu-shared/BleachingPairs_WGBS/biscuit"
ref="/work/pi_hputnam_uri_edu/estrand/BleachingPairs_WGBS/Montipora_capitata_HIv3.assembly.fasta"
## set paths
deduplicated_bams="/scratch3/workspace/emma_strand_uri_edu-shared/BleachingPairs_WGBS/bismark/deduplicated"
biscuit_path="/work/pi_hputnam_uri_edu/conda/envs/biscuit/bin"
## CREATE SAMPLE LIST FOR SLURM ARRAY
### 1. Create list of all sorted, deduplicated bam files
ls -d ${deduplicated_bams}/*sorted.bam > ${deduplicated_bams}/samplelist
### 2. Create a list of filenames based on that list created in step 1
mapfile -t FILENAMES < ${deduplicated_bams}/samplelist
### 3. Assign the file for this array task to variable i (note: bash arrays are 0-indexed)
i=${FILENAMES[$SLURM_ARRAY_TASK_ID]}
# Create a pileup VCF of DNA methylation and genetic information
# Also compresses and indexes the VCF
filename=${i##*/}
${biscuit_path}/biscuit pileup -q 48 -o ${out}/${filename}.vcf ${ref} ${i}
bgzip --threads 48 ${out}/${filename}.vcf
tabix -p vcf ${out}/${filename}.vcf.gz
# Extract DNA methylation into BED format
# Also compresses and indexes the BED
${biscuit_path}/biscuit vcf2bed ${out}/${filename}.vcf.gz > ${out}/${filename}.bed
bgzip ${out}/${filename}.bed
tabix -p bed ${out}/${filename}.bed.gz
To run that with 40 files: sbatch --array=1-40 02-biscuit_SNP.sh (though since FILENAMES is 0-indexed, --array=0-39 is what actually covers all 40 files; see troubleshooting below).
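Once the samplelist exists, a quick way to sanity-check the task-ID-to-file mapping (an interactive sketch; same path as the variable in the script above):
deduplicated_bams="/scratch3/workspace/emma_strand_uri_edu-shared/BleachingPairs_WGBS/bismark/deduplicated"
mapfile -t FILENAMES < ${deduplicated_bams}/samplelist
echo "number of files: ${#FILENAMES[@]}"
echo "task 0  -> ${FILENAMES[0]}"    ## first file
echo "task 39 -> ${FILENAMES[39]}"   ## last of 40 files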
Troubleshooting
7-23-2025: I tried this for the first time with an array of 1-40. It couldn't find the conda command, so I added module load conda/latest to load the latest version of conda.
7-25-2025/7-28-2025: OK, this worked, but now I get the error below because I don't have bgzip and tabix (conda install -c bioconda htslib). Successfully installed and will try again.
pileup: invalid option -- '@'
[main_pileup:1302] Unrecognized command/option: ?.
/var/spool/slurm/slurmd/job40305031/slurm_script: line 42: bgzip: command not found
/var/spool/slurm/slurmd/job40305031/slurm_script: line 43: tabix: command not found
[wzopen:20] Fatal, cannot open file: filename.vcf.gz
/var/spool/slurm/slurmd/job40305031/slurm_script: line 49: bgzip: command not found
/var/spool/slurm/slurmd/job40305031/slurm_script: line 50: tabix: command not found
Added the full path to biscuit too, just in case it wasn't finding the program. OK, this is better now! The next error tells me I need to remove the extra 'biscuit' in the file path.
/var/spool/slurm/slurmd/job40371768/slurm_script: line 38: /work/pi_hputnam_uri_edu/conda/envs/biscuit/bin/biscuit/biscuit: Not a directory
[bgzip] No such file or directory: filename.vcf
tbx_index_build failed: filename.vcf.gz
/var/spool/slurm/slurmd/job40371768/slurm_script: line 45: /work/pi_hputnam_uri_edu/conda/envs/biscuit/bin/biscuit/biscuit: Not a directory
[bgzip] can't create filename.bed.gz: File exists
[tabix] the index file exists. Please use '-f' to overwrite.
I still get the filename.vcf issue:
[bgzip] No such file or directory: filename.vcf
tbx_index_build failed: filename.vcf.gz
[wzopen:20] Fatal, cannot open file: filename.vcf.gz
[bgzip] can't create filename.bed.gz: File exists
[tabix] the index file exists. Please use '-f' to overwrite.
OK, this is now recognizing the file name, but biscuit pileup still errors on the threads flag... switching to -q for pileup (as in the script above) and --threads for bgzip.
pileup: invalid option -- '@'
[main_pileup:1302] Unrecognized command/option: ?.
[bgzip] No such file or directory: /scratch3/workspace/emma_strand_uri_edu-shared/BleachingPairs_WGBS/biscuit/HI_17.deduplicated.sorted.bam.vcf
tbx_index_build failed: /scratch3/workspace/emma_strand_uri_edu-shared/BleachingPairs_WGBS/biscuit/HI_17.deduplicated.sorted.bam.vcf.gz
[wzopen:20] Fatal, cannot open file: /scratch3/workspace/emma_strand_uri_edu-shared/BleachingPairs_WGBS/biscuit/HI_17.deduplicated.sorted.bam.vcf.gz
OK, now I get this error: it can't find the {i} file, so it aborts. Realized that it needs to be ${i}.
[E::hts_open_format] fail to open file '{i}'
[main_pileup:1361] Cannot open {i}
Abort.
[bgzip] No such file or directory: /scratch3/workspace/emma_strand_uri_edu-shared/BleachingPairs_WGBS/biscuit/HI_18.deduplicated.sorted.bam.vcf
tbx_index_build failed: /scratch3/workspace/emma_strand_uri_edu-shared/BleachingPairs_WGBS/biscuit/HI_18.deduplicated.sorted.bam.vcf.gz
[wzopen:20] Fatal, cannot open file: /scratch3/workspace/emma_strand_uri_edu-shared/BleachingPairs_WGBS/biscuit/HI_18.deduplicated.sorted.bam.vcf.gz
[bgzip] can't create /scratch3/workspace/emma_strand_uri_edu-shared/BleachingPairs_WGBS/biscuit/HI_18.deduplicated.sorted.bam.bed.gz: File exists
[tabix] the index file exists. Please use '-f' to overwrite.
8-4-2025: Running this again after adding the $. This ran! I now have vcf and bam.bed outputs. Some files have a methylation-average tsv output... what is this? e.g. HI_39.deduplicated.sorted.bam.vcf_meth_average.tsv. Files that don't have this: 17, 18.
From the error file for 17 and 18 (below): this error file also showed the FASTA index being built. Maybe that was happening simultaneously and was then fine for the rest? Maybe I can just re-run these two files? This also didn't run HI_16, probably because I used --array=1-40 instead of 0-39 for the slurm array.
[fai_load] build FASTA index.
[refcache_fetch:95] Error, cannot retrieve reference Montipora_capitata_HIv3___Scaffold_1:0-0.
8-5-2025: I made 3 new scripts with just the 16, 17, and 18 input files so they run individually.
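An alternative to separate scripts would be re-submitting only the failed array indices (a sketch; the 0-based indices for HI_16, HI_17, and HI_18 would need to be confirmed against the samplelist first, so the values below are hypothetical):
## re-submit only the failed tasks by their 0-based indices (hypothetical indices shown)
sbatch --array=0,1,2 02-biscuit_SNP.sh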
Next Steps:
- How to analyze vcf files
- How to get C>T SNPs from the bed files? (one possible starting point sketched below)
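One possible starting point for the C>T question is to filter the biscuit VCFs rather than the BEDs (a rough sketch, assuming bcftools is installed, e.g. conda install -c bioconda bcftools; the HI_17 filename is just an example):
## keep only C>T SNPs (and the reverse-strand G>A equivalents), the variants that
## can be confounded with bisulfite conversion
vcf="/scratch3/workspace/emma_strand_uri_edu-shared/BleachingPairs_WGBS/biscuit/HI_17.deduplicated.sorted.bam.vcf.gz"
bcftools view -v snps -i '(REF="C" && ALT="T") || (REF="G" && ALT="A")' -Oz -o HI_17.CT_snps.vcf.gz ${vcf}
tabix -p vcf HI_17.CT_snps.vcf.gz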