CpG OE Analysis for DNA Methylation
Historical DNA Methylation Analysis: CpG O/E
Background on CpG O/E:
Based on scripts and pipeline instructions from:
- Dimond et al 2016.
- Jay Dimond workflow: https://github.com/jldimond/Coral-CpG
1. Preparing M.capitata and P.acuta CDS files
Download from Rutgers
wget http://cyanophora.rutgers.edu/montipora/Montipora_capitata_HIv3.genes.cds.fna.gz
wget http://cyanophora.rutgers.edu/Pocillopora_acuta/Pocillopora_acuta_HIv2.genes.cds.fna.gz
Format of these files:
Pocillopora_acuta_HIv2.genes.cds.fna
:
>Pocillopora_acuta_HIv2___RNAseq.g24143.t1a
ATGCCTCTTTCTTCCACAGTAGTTGGTTGTTTTCCAAAGCCTTCCTATCTGCACATTCCAGACTGGTTCCAAACCTCCCACAGTGGAACATTCACCGAGGATTACAATCGCTTCCTCGAAACCTCATCCAAGGAAGAAACAGAAGACCTGATCAAGAGAGCTACAAAAGATATAATTGATATCCAAGCCGAAGCTGGCATTGATGTGATCAGTGACGGTGAGTTACGGAGAGAATCTTACATCTTGCACTTCTGTCGGGCGCTAAAAGGGTTAGACTTCCACAATTTGTTTTCTAAAAACTGTCGCGATGGTGCTTGTGTCACGGACGTTCCCCGTATCATTGGCCGGGTGACACCGCGAGAAAGCGAAGCGTGGGTATGGAAAGAGTGGAAATTTTCTCAAGATTTGTCTCCGTTGCCCATGAAAATTACTATACCGGGACCGATGACAATAATAAACTCCACCGAGGATCAGTATTACAATGATGACAGGGCGCTAGGGAAAGTTTTGGCAGAGATAATAAACAGTGAGATCAAGGCTCTTGTTTCCGCTGGATGTAGATGTATTCAGATACCGGACTGGTTCCAGCCAGCGTACGCCGAAAATTTCCCAAGGGAGTACAATCGCTTTTTGCAGAGTTTTTCATCCACAGACATAGAAGAATTGTTTATCAGAGCTATTCGAGAAGTGATTGAAATACAGAGCAAAGCCGGTCTTGATGTGATAACAGATGGGGAAGTGCGGCGTGAAAACTACATCCATTACTTTTGTAGAAAGCTGAATGGCTTTGATTTCAAAAGTTTATGTCTGACGTCGATTCGCGACGACGCTGCAACAATTCTATTGCCACAAATTGTGGGAGAGGTATCAGCGCAGGAAAATGAAGGTTGGGTTTGGAAAGAATGGAAGAGTTCACAGGTTTTATCCAAACTTCCTGTGAAAATTACTCTTCCGGGACCCATGACAATTGCGCACACAGTCGTCGATCAATACTACAAGGATCAAAAGGTTCTAGGTGCTGTGTTGACAAAAATTTTAAACAGGGAAATTAAAGCTCTTGTATCGGTTGGCTGTAAGGTTATTCAGGTTGATGAACCAGTGTTAATGCGATTTCCTGATGAGGCTTTGGATTATGGAATTGATCAACTTTCAACTTGTTTTGACGGAGTTACATCTCCTGATGTAGTCAAGGCTGTTCATTTGTGTTGCGGTTATCCAACCTACTTGGATCAGGATGACTATAAGAAGGCAGACAAGGATGGATACCTGAGACTGGCTGATAAACTAGATGCTGCAGGCTTTGATGAGATTTCAATTGAGGATGCTCATAGGCACAATGATTTATCATTATTTGAGCATTTTAAAAAGAGCAAGGTTGTTCTAGGTGTTGTTAAGATTGCTAGCAGCAAAGTAGAGAGTGTCGAGGAGATTCGCAGCCGTCTAAAGCAAGTACTGACTGTTTTACCAGCAGAACGATTGGTTGTTGCTCCAGACTGTGGTTTGGGATTTCTTCCAACTGATCTGGCAAAGCAGAAACTTTTTAATATGGTACAAGCGGCAAAGTCTTTACCCTAA
Montipora_capitata_HIv3.genes.cds.fna
:
>Montipora_capitata_HIv3___RNAseq.g7985.t1
ATGGCAGAGGCTGCTTCACGGACTTGTGTGAAAGTTACGGGCGAACAGTGTTCCAACAAATGGAAGAAACTGGAGGAGAAGTATAAAAAAGTCAAGGAACACAACGACAAAACGGGTAACGACAGGAAAGAGTGCGACTTCCAAGAAGAGCTGGAAGAGTTTTTCGGAACTGATCCGAAGATAATTCCAACGGCTACCGTTTCTTCTTTGACAGCAAAGCTGGCCACAGGCAATACGTCAACGGATGATGAGGAGGAATTACCTCTGGCAAACTGCCGAGAGCCACCAAAGAAGAAGAAGAAACGCCGCTCGAAGTCGTCCGCTACTGAAATGATTGAGTTTCTGACCGAGTACAAGGAAGAAAAGCGGAAAGAGGAAGAAGCAAAAATTGTCCTTGCGCAGAAAATGCATAATGAAAAATTGGGCATAATGGAGAGATTCCTTGACGTCCTGTCGAAGAAAGCAAATTGA
>Montipora_capitata_HIv3___RNAseq.g36632.t1
ATGGCACTCACTGAGGACCAACGGTCGTGTCGTAGCAGTTTTCTGTCAGGGACTCCAGGATCAGGGGCTGATGTAGCTCCACCAGCTGAAAAAAAAAACCTAGAGCTTACAGAGCAAGTCAACGCTCTGCCCGAAGAGATTCGTAAGCTGAAAGAACACATTACTGAACATACTAGCTCTCCGGACCGCAGTGCACGGTCAACAGAAGCGGCTTTGCAGTTTTACAGTGATGCACATGACGAGTCTCAAACCTTCCAAGCAGATGTGAGTAAGGAACTGAAACGCTTTAGTTCTTGGCTCGCTGAAATCAGCACCAGAGTTGATGAAATCGGGAAGGCAATTGATAGTATGTGTGAATATAGCTATCAATACAATGTAAAGATCGTTGGGATGCCCGAGCTAAGTGAACAAGAGTCTTACTCTCAAACAATTTCTTTATGCGTCAAATTATTCTCTGATATGGGAGCCGATGTTTCTCTTCACGATATAGACATTGCACATCGTGTGCCGTAG
Number of sequences from each:
fgrep -c ">" Pocillopora_acuta_HIv2.genes.cds.fna
: 33730
fgrep -c ">" Montipora_capitata_HIv3.genes.cds.fna
: 54384
Converting FASTA to tabular format and placing output file in analyses directory
cd /data/putnamlab/estrand/CpG_OE
for the following commands
perl -e '$count=0; $len=0; while(<>) {s/\r?\n//; s/\t/ /g; if (s/^>//) { if ($. != 1) {print "\n"} s/ |$/\t/; $count++; $_ .= "\t";} else {s/ //g; $len += length($_)} print $_;} print "\n"; warn "\nConverted $count FASTA records in $. lines to tabular format\nTotal sequence length: $len\n\n";' \
../Pocillopora_acuta_HIv2.genes.cds.fna > PA.v2.fasta2tab
Output: Converted 33730 FASTA records in 67460 lines to tabular format Total sequence length: 45794064
perl -e '$count=0; $len=0; while(<>) {s/\r?\n//; s/\t/ /g; if (s/^>//) { if ($. != 1) {print "\n"} s/ |$/\t/; $count++; $_ .= "\t";} else {s/ //g; $len += length($_)} print $_;} print "\n"; warn "\nConverted $count FASTA records in $. lines to tabular format\nTotal sequence length: $len\n\n";' \
../Montipora_capitata_HIv3.genes.cds.fna > MC.v3.fasta2tab
Output: Converted 54384 FASTA records in 108768 lines to tabular format Total sequence length: 63481968
head -2 MC.v3.fasta2tab
:
Montipora_capitata_HIv3___TS.g31918.t1 ATGTCATGTTATGCTATTCGATCTCATCTAGATACTGTTGCCAAGGCTTACTATGTCGGTTTGAGTCAAGCGATCGAGGATAATCCTTCACGAAATTCTAAGCTTTGCATCCTTCGATCCCAACTTGTAACACGCAGTCGTCAATTCATCCTCCTTTTGAGAGAATGCACCACCTCCTACACTGCATCATGCCTTGTACCCGACGAAAAGAGCAGCACACTTCGTGATGCCCTCGCCCGATTAGCAGTGGGTCTACACCCCCTCGATGGACCTCAAGCTGTCATTCGGGTCGACCCTGCACCGGGATTTGTCTCACTGAAAACCACACATGCTTTACAGCCTCTTGGCATCTCAGTCGAAGTTGGTCGCATCAAAAACTCCAACAAAAATCCTGTTGCCGAACGAGCTGTGCTGGAACTACAAGAAGAGCTGTTACGTCAGGAGCCTGGCGGAGGCCCTGTTACCGAACTCGGCCTTGCTGTTGCCACTGCCCGCCTGAACTCGCGACTCCGCAGCCAAGGCCTGTCCTCACGGGAGCTTTGGACACAGTGCAACCAGTGCACCAATGAGCAGATACCCGTCAATGACGTACAACACATTGTAGCCAAGCACCAAGCTCGCCAAACCAACCATCCTTTCAGCGAAGCGGCCAAAGGTGGCTACCGCCCACTAGCTCCCGTTGCCCCCTTGCAGGTTGGAGACGTGGTGTATGTCAAGTCTGACCGAGATAAGTCACGCGCCTGTGATCGCTATATCATTGTTAGCATCGACGGCGAATGGTGTTTCATTAAGAAGTTCTCTGGCTCAAAACTCAGGGCTACATCTTACAAGGTGAAGCTAGCCGAGTGCTACACTGTACCACATACCCTCCCACCACCCTCGTACCAGTCTGTCGTCCCTCCACTTGACGAAGACGAATGTGTGGAGATTCCTGAACAGCCCCAGCCTTGTCAAGAGCCCTCTGCCCCACCAGACCTGTTACGTCCGCCCAACCCAGATCCTCCAGACTCCACCCCGGACCAAGCCGAACAGCAGGAAGACTCGACCCCCCTTGAAACTGATCGTGTTCCTGTTTCAATTTCATGTGAACCAAGACCCCAACGGGCACGCCGGGCCCCCGCATACTTACAGGAGTACATTCTCAACTAG
Montipora_capitata_HIv3___RNAseq.g39902.t1 ATGCAGACGTTTCGCATGGGATACAACACGCCGGTGTCTGCTCAGCGTTTCTTGGGAGTTTATGACGACTTGCGCTTGATTGTAACTCAACTCGCAAACTCTTGGGTGAAAATTGGGGGAAATGGAGAAGTTGCAGAAGATTTTGTTATTTGTGGCAATAAAGTAGGCAATAAGAGGGCACGAGAGAAACTGAACACTATTGTGGCCGGTGTCAGCTGTAAAACGATTGACATCAAACAGCCTGGGCTGAGGAAATTCTATACCAGTGGTAAAGCAGATCTTCTGGAAAAATCGATAGAGAAAAACCATGAATGCGCGATCCTTGTCCAGAAGAACTTTGACTCGAAGCGAGAGGAAGGTAAAGTACAAGGTGCACAAGTAGGTTCTGACTCCACAAGCAGTGATGACGATGATGATGAGGCTGTCATTAGTGGTGATGATGATATTTATGATGAGTGGGAAGCAGTTGTAAGTGAAACTGATCGATCATCTTTCGCCACCAAAACTGGCCACCATGTTTCGTGGAGAGCTGGAATCATTGAGGCTGAAAAGGTGAGTTAA
head -2 PA.v2.fasta2tab
:
Pocillopora_acuta_HIv2___RNAseq.g24143.t1a ATGCCTCTTTCTTCCACAGTAGTTGGTTGTTTTCCAAAGCCTTCCTATCTGCACATTCCAGACTGGTTCCAAACCTCCCACAGTGGAACATTCACCGAGGATTACAATCGCTTCCTCGAAACCTCATCCAAGGAAGAAACAGAAGACCTGATCAAGAGAGCTACAAAAGATATAATTGATATCCAAGCCGAAGCTGGCATTGATGTGATCAGTGACGGTGAGTTACGGAGAGAATCTTACATCTTGCACTTCTGTCGGGCGCTAAAAGGGTTAGACTTCCACAATTTGTTTTCTAAAAACTGTCGCGATGGTGCTTGTGTCACGGACGTTCCCCGTATCATTGGCCGGGTGACACCGCGAGAAAGCGAAGCGTGGGTATGGAAAGAGTGGAAATTTTCTCAAGATTTGTCTCCGTTGCCCATGAAAATTACTATACCGGGACCGATGACAATAATAAACTCCACCGAGGATCAGTATTACAATGATGACAGGGCGCTAGGGAAAGTTTTGGCAGAGATAATAAACAGTGAGATCAAGGCTCTTGTTTCCGCTGGATGTAGATGTATTCAGATACCGGACTGGTTCCAGCCAGCGTACGCCGAAAATTTCCCAAGGGAGTACAATCGCTTTTTGCAGAGTTTTTCATCCACAGACATAGAAGAATTGTTTATCAGAGCTATTCGAGAAGTGATTGAAATACAGAGCAAAGCCGGTCTTGATGTGATAACAGATGGGGAAGTGCGGCGTGAAAACTACATCCATTACTTTTGTAGAAAGCTGAATGGCTTTGATTTCAAAAGTTTATGTCTGACGTCGATTCGCGACGACGCTGCAACAATTCTATTGCCACAAATTGTGGGAGAGGTATCAGCGCAGGAAAATGAAGGTTGGGTTTGGAAAGAATGGAAGAGTTCACAGGTTTTATCCAAACTTCCTGTGAAAATTACTCTTCCGGGACCCATGACAATTGCGCACACAGTCGTCGATCAATACTACAAGGATCAAAAGGTTCTAGGTGCTGTGTTGACAAAAATTTTAAACAGGGAAATTAAAGCTCTTGTATCGGTTGGCTGTAAGGTTATTCAGGTTGATGAACCAGTGTTAATGCGATTTCCTGATGAGGCTTTGGATTATGGAATTGATCAACTTTCAACTTGTTTTGACGGAGTTACATCTCCTGATGTAGTCAAGGCTGTTCATTTGTGTTGCGGTTATCCAACCTACTTGGATCAGGATGACTATAAGAAGGCAGACAAGGATGGATACCTGAGACTGGCTGATAAACTAGATGCTGCAGGCTTTGATGAGATTTCAATTGAGGATGCTCATAGGCACAATGATTTATCATTATTTGAGCATTTTAAAAAGAGCAAGGTTGTTCTAGGTGTTGTTAAGATTGCTAGCAGCAAAGTAGAGAGTGTCGAGGAGATTCGCAGCCGTCTAAAGCAAGTACTGACTGTTTTACCAGCAGAACGATTGGTTGTTGCTCCAGACTGTGGTTTGGGATTTCTTCCAACTGATCTGGCAAAGCAGAAACTTTTTAATATGGTACAAGCGGCAAAGTCTTTACCCTAA
Pocillopora_acuta_HIv2___RNAseq.g24143.t1b ATGTCTCAATGTTTTGTTCAGGGAGCATTTATGCCGGAGCATCCACAGACATCCTTTTGCAGGGCGGATTTTGTCGTTCGAGTGAAAGTGCTGGGAATGATCAAAAGGGACGACGAGGTACCCACAACGCAAGCTTCAGCGGGTGATAAGTTTACGCTTCCTCCCACGACTGAGACGCTTGTAGCGGACACAGACGCCCCAGAAACGAAGACTGCTTCCCCTCCGAGGGTTCAGAAAATTGGTGATTTACCGCCACAGAGCAATAACAGCGATGTTGGGTCGCTTCTAGGTTTCATCTTTGGCCGGCCCAAGCGAAGTATCGGAATACGAGGTGCCCCACCTCCTGGAGGAGAAATGCGGTACCGTCCAATAGGCGGGAAAGGCCGCGACAAGGGAATGATGTACTATGATGTTCTAATAAAGAAAATCTTCAGGGGTGAAGACAGAGTGCGGAGAACCAAAAGAACTTTTGTTGATGCCGCCAATCCATCTCGATTGTTTGCAAGGATTTACTTCTCCAGTCATCACGGTCAGGACCTGGTACCGGGTGATGCTTACATGCTTTCTGGTAGAATCATGGGTAATGAATTGGTTCTCCCATCGCGAGGCTGCTGGATGCAAAAATGGCGTGACCTGTCCAGAGATCAAACTCACGGTCTGAGTAAGGTCTATGCGGAAAACTGCGAGTGTATGATCCAGTTTTGCATTCCTGGTCTTTGTGCATCTATGCCTGAGTCTCAGTTGCAGTGTAACTGGGAATTGCAAAATTTTAGGCAACCAAGAAGGGATTGTGGAGCAATGCACAACACATGCGTCAATAAACATGGTCAATGCCAGTGGAAAATAGGCAAAGGATATGATTCTTGTACAAAACCCTCGGGACCTCTCCCATAA
Add column with length of sequence
perl -e '$col = 2;' -e 'while (<>) { s/\r?\n//; @F = split /\t/, $_; $len = length($F[$col]); print "$_\t$len\n" } warn "\nAdded column with length of column $col for $. lines.\n\n";' \
MC.v3.fasta2tab > MC.v3.tab_1
## Added column with length of column 2 for 54384 lines.
perl -e '$col = 2;' -e 'while (<>) { s/\r?\n//; @F = split /\t/, $_; $len = length($F[$col]); print "$_\t$len\n" } warn "\nAdded column with length of column $col for $. lines.\n\n";' \
PA.v2.fasta2tab > PA.v2.tab_1
## Added column with length of column 2 for 33730 lines.
Example output:
Montipora_capitata_HIv3___TS.g31918.t1 ATGTCATGTTATGCTATTCGATCTCATCTAGATACTGTTGCCAAGGCTTACTATGTCGGTTTGAGTCAAGCGATCGAGGATAATCCTTCACGAAATTCTAAGCTTTGCATCCTTCGATCCCAACTTGTAACACGCAGTCGTCAATTCATCCTCCTTTTGAGAGAATGCACCACCTCCTACACTGCATCATGCCTTGTACCCGACGAAAAGAGCAGCACACTTCGTGATGCCCTCGCCCGATTAGCAGTGGGTCTACACCCCCTCGATGGACCTCAAGCTGTCATTCGGGTCGACCCTGCACCGGGATTTGTCTCACTGAAAACCACACATGCTTTACAGCCTCTTGGCATCTCAGTCGAAGTTGGTCGCATCAAAAACTCCAACAAAAATCCTGTTGCCGAACGAGCTGTGCTGGAACTACAAGAAGAGCTGTTACGTCAGGAGCCTGGCGGAGGCCCTGTTACCGAACTCGGCCTTGCTGTTGCCACTGCCCGCCTGAACTCGCGACTCCGCAGCCAAGGCCTGTCCTCACGGGAGCTTTGGACACAGTGCAACCAGTGCACCAATGAGCAGATACCCGTCAATGACGTACAACACATTGTAGCCAAGCACCAAGCTCGCCAAACCAACCATCCTTTCAGCGAAGCGGCCAAAGGTGGCTACCGCCCACTAGCTCCCGTTGCCCCCTTGCAGGTTGGAGACGTGGTGTATGTCAAGTCTGACCGAGATAAGTCACGCGCCTGTGATCGCTATATCATTGTTAGCATCGACGGCGAATGGTGTTTCATTAAGAAGTTCTCTGGCTCAAAACTCAGGGCTACATCTTACAAGGTGAAGCTAGCCGAGTGCTACACTGTACCACATACCCTCCCACCACCCTCGTACCAGTCTGTCGTCCCTCCACTTGACGAAGACGAATGTGTGGAGATTCCTGAACAGCCCCAGCCTTGTCAAGAGCCCTCTGCCCCACCAGACCTGTTACGTCCGCCCAACCCAGATCCTCCAGACTCCACCCCGGACCAAGCCGAACAGCAGGAAGACTCGACCCCCCTTGAAACTGATCGTGTTCCTGTTTCAATTTCATGTGAACCAAGACCCCAACGGGCACGCCGGGCCCCCGCATACTTACAGGAGTACATTCTCAACTAG 1149
Montipora_capitata_HIv3___RNAseq.g39902.t1 ATGCAGACGTTTCGCATGGGATACAACACGCCGGTGTCTGCTCAGCGTTTCTTGGGAGTTTATGACGACTTGCGCTTGATTGTAACTCAACTCGCAAACTCTTGGGTGAAAATTGGGGGAAATGGAGAAGTTGCAGAAGATTTTGTTATTTGTGGCAATAAAGTAGGCAATAAGAGGGCACGAGAGAAACTGAACACTATTGTGGCCGGTGTCAGCTGTAAAACGATTGACATCAAACAGCCTGGGCTGAGGAAATTCTATACCAGTGGTAAAGCAGATCTTCTGGAAAAATCGATAGAGAAAAACCATGAATGCGCGATCCTTGTCCAGAAGAACTTTGACTCGAAGCGAGAGGAAGGTAAAGTACAAGGTGCACAAGTAGGTTCTGACTCCACAAGCAGTGATGACGATGATGATGAGGCTGTCATTAGTGGTGATGATGATATTTATGATGAGTGGGAAGCAGTTGTAAGTGAAACTGATCGATCATCTTTCGCCACCAAAACTGGCCACCATGTTTCGTGGAGAGCTGGAATCATTGAGGCTGAAAAGGTGAGTTAA 561
wc MC.v3.tab_1
= 54384 163152 66110515 MC.v3.tab_1
wc PA.v2.tab_1
= 33730 101190 47386752 PA.v2.tab_1
Create only sequence file - The file used to count Cs and Gs will only include the sequence
awk '{print $2}' MC.v3.tab_1 > MC.v3.tab_2
awk '{print $2}' PA.v2.tab_1 > PA.v2.tab_2
Counts CGs - both cases
-F
fs = –field-separator=fs
echo "CG" | awk -F\CG '{print NF-1}' MC.v3.tab_2 > mcap_CG
echo "CG" | awk -F\CG '{print NF-1}' PA.v2.tab_2 > pacuta_CG
Count C’s
echo "C" | awk -F\[Cc] '{print NF-1}' MC.v3.tab_2 > mcap_C
echo "C" | awk -F\[Cc] '{print NF-1}' PA.v2.tab_2 > pacuta_C
Count G’s
echo "C" | awk -F\[Gg] '{print NF-1}' MC.v3.tab_2 > mcap_G
echo "C" | awk -F\[Gg] '{print NF-1}' PA.v2.tab_2 > pacuta_G
Combining counts
paste MC.v3.tab_1 \
mcap_CG \
mcap_C \
mcap_G \
> mcap_comb
head -1 mcap_comb
Montipora_capitata_HIv3___TS.g31918.t1 ATGTCATGTTATGCTATTCGATCTCATCTAGATACTGTTGCCAAGGCTTACTATGTCGGTTTGAGTCAAGCGATCGAGGATAATCCTTCACGAAATTCTAAGCTTTGCATCCTTCGATCCCAACTTGTAACACGCAGTCGTCAATTCATCCTCCTTTTGAGAGAATGCACCACCTCCTACACTGCATCATGCCTTGTACCCGACGAAAAGAGCAGCACACTTCGTGATGCCCTCGCCCGATTAGCAGTGGGTCTACACCCCCTCGATGGACCTCAAGCTGTCATTCGGGTCGACCCTGCACCGGGATTTGTCTCACTGAAAACCACACATGCTTTACAGCCTCTTGGCATCTCAGTCGAAGTTGGTCGCATCAAAAACTCCAACAAAAATCCTGTTGCCGAACGAGCTGTGCTGGAACTACAAGAAGAGCTGTTACGTCAGGAGCCTGGCGGAGGCCCTGTTACCGAACTCGGCCTTGCTGTTGCCACTGCCCGCCTGAACTCGCGACTCCGCAGCCAAGGCCTGTCCTCACGGGAGCTTTGGACACAGTGCAACCAGTGCACCAATGAGCAGATACCCGTCAATGACGTACAACACATTGTAGCCAAGCACCAAGCTCGCCAAACCAACCATCCTTTCAGCGAAGCGGCCAAAGGTGGCTACCGCCCACTAGCTCCCGTTGCCCCCTTGCAGGTTGGAGACGTGGTGTATGTCAAGTCTGACCGAGATAAGTCACGCGCCTGTGATCGCTATATCATTGTTAGCATCGACGGCGAATGGTGTTTCATTAAGAAGTTCTCTGGCTCAAAACTCAGGGCTACATCTTACAAGGTGAAGCTAGCCGAGTGCTACACTGTACCACATACCCTCCCACCACCCTCGTACCAGTCTGTCGTCCCTCCACTTGACGAAGACGAATGTGTGGAGATTCCTGAACAGCCCCAGCCTTGTCAAGAGCCCTCTGCCCCACCAGACCTGTTACGTCCGCCCAACCCAGATCCTCCAGACTCCACCCCGGACCAAGCCGAACAGCAGGAAGACTCGACCCCCCTTGAAACTGATCGTGTTCCTGTTTCAATTTCATGTGAACCAAGACCCCAACGGGCACGCCGGGCCCCCGCATACTTACAGGAGTACATTCTCAACTAG 1149 60 362 249
paste PA.v2.tab_1 \
pacuta_CG \
pacuta_C \
pacuta_G \
> pacuta_comb
head -1 pacuta_comb
Pocillopora_acuta_HIv2___RNAseq.g24143.t1a ATGCCTCTTTCTTCCACAGTAGTTGGTTGTTTTCCAAAGCCTTCCTATCTGCACATTCCAGACTGGTTCCAAACCTCCCACAGTGGAACATTCACCGAGGATTACAATCGCTTCCTCGAAACCTCATCCAAGGAAGAAACAGAAGACCTGATCAAGAGAGCTACAAAAGATATAATTGATATCCAAGCCGAAGCTGGCATTGATGTGATCAGTGACGGTGAGTTACGGAGAGAATCTTACATCTTGCACTTCTGTCGGGCGCTAAAAGGGTTAGACTTCCACAATTTGTTTTCTAAAAACTGTCGCGATGGTGCTTGTGTCACGGACGTTCCCCGTATCATTGGCCGGGTGACACCGCGAGAAAGCGAAGCGTGGGTATGGAAAGAGTGGAAATTTTCTCAAGATTTGTCTCCGTTGCCCATGAAAATTACTATACCGGGACCGATGACAATAATAAACTCCACCGAGGATCAGTATTACAATGATGACAGGGCGCTAGGGAAAGTTTTGGCAGAGATAATAAACAGTGAGATCAAGGCTCTTGTTTCCGCTGGATGTAGATGTATTCAGATACCGGACTGGTTCCAGCCAGCGTACGCCGAAAATTTCCCAAGGGAGTACAATCGCTTTTTGCAGAGTTTTTCATCCACAGACATAGAAGAATTGTTTATCAGAGCTATTCGAGAAGTGATTGAAATACAGAGCAAAGCCGGTCTTGATGTGATAACAGATGGGGAAGTGCGGCGTGAAAACTACATCCATTACTTTTGTAGAAAGCTGAATGGCTTTGATTTCAAAAGTTTATGTCTGACGTCGATTCGCGACGACGCTGCAACAATTCTATTGCCACAAATTGTGGGAGAGGTATCAGCGCAGGAAAATGAAGGTTGGGTTTGGAAAGAATGGAAGAGTTCACAGGTTTTATCCAAACTTCCTGTGAAAATTACTCTTCCGGGACCCATGACAATTGCGCACACAGTCGTCGATCAATACTACAAGGATCAAAAGGTTCTAGGTGCTGTGTTGACAAAAATTTTAAACAGGGAAATTAAAGCTCTTGTATCGGTTGGCTGTAAGGTTATTCAGGTTGATGAACCAGTGTTAATGCGATTTCCTGATGAGGCTTTGGATTATGGAATTGATCAACTTTCAACTTGTTTTGACGGAGTTACATCTCCTGATGTAGTCAAGGCTGTTCATTTGTGTTGCGGTTATCCAACCTACTTGGATCAGGATGACTATAAGAAGGCAGACAAGGATGGATACCTGAGACTGGCTGATAAACTAGATGCTGCAGGCTTTGATGAGATTTCAATTGAGGATGCTCATAGGCACAATGATTTATCATTATTTGAGCATTTTAAAAAGAGCAAGGTTGTTCTAGGTGTTGTTAAGATTGCTAGCAGCAAAGTAGAGAGTGTCGAGGAGATTCGCAGCCGTCTAAAGCAAGTACTGACTGTTTTACCAGCAGAACGATTGGTTGTTGCTCCAGACTGTGGTTTGGGATTTCTTCCAACTGATCTGGCAAAGCAGAAACTTTTTAATATGGTACAAGCGGCAAAGTCTTTACCCTAA 157553 297 380
2. Calculate CpG O/E
pacuta_comb
and mcap_comb
= final files from step #1
use ^ instead of ** for exponent
awk '{print $1, "\t", (($4)/($5*$6))*(($3^2)/($3-1))}' mcap_comb > mcap_ID_CpG
head mcap_ID_CpG
:
Montipora_capitata_HIv3___TS.g31918.t1 0.765493
Montipora_capitata_HIv3___RNAseq.g39902.t1 0.766505
Montipora_capitata_HIv3___RNAseq.g22918.t1 1.10518
Montipora_capitata_HIv3___RNAseq.g7985.t1 1.02165
Montipora_capitata_HIv3___RNAseq.g36632.t1 0.797079
Montipora_capitata_HIv3___RNAseq.g20158.t1 0.306959
Montipora_capitata_HIv3___RNAseq.g27719.t1 0.558765
Montipora_capitata_HIv3___RNAseq.g34935.t1 0.538187
Montipora_capitata_HIv3___RNAseq.g14270.t1 1.04623
Montipora_capitata_HIv3___RNAseq.g4696.t1 0.818074
awk '{print $1, "\t", (($4)/($5*$6))*(($3^2)/($3-1))}' pacuta_comb > pacuta_ID_CpG
head pacuta_ID_CpG
:
Pocillopora_acuta_HIv2___RNAseq.g24143.t1a 0.740103
Pocillopora_acuta_HIv2___RNAseq.g24143.t1b 0.766137
Pocillopora_acuta_HIv2___RNAseq.g22918.t1 0.827021
Pocillopora_acuta_HIv2___RNAseq.g18333.t1 0.494065
Pocillopora_acuta_HIv2___RNAseq.g7985.t1 0.608104
Pocillopora_acuta_HIv2___RNAseq.g13511.t1 0.570192
Pocillopora_acuta_HIv2___RNAseq.g27719.t1 0.192805
Pocillopora_acuta_HIv2___RNAseq.g14270.t1 0.647634
Pocillopora_acuta_HIv2___TS.g15308.t1 0.489107
Pocillopora_acuta_HIv2___RNAseq.g2057.t1 0.537751
3. Functional annotation of Mcap and Pacuta
Genome functional annotation:
- Emma Strand Mcap v3 functional annotation: https://github.com/emmastrand/EmmaStrand_Notebook/blob/master/_posts/2022-11-02-M.capitata-Genome-v3-Functional-Annotation.md
- Emma Strand Pacuta v2 functional annotation: https://github.com/emmastrand/EmmaStrand_Notebook/blob/master/_posts/2022-11-17-Pocillopora-acuta-v2-Functional-Genome-Annotation.md
RNASeq gene names functional annotation from Rutgers:
wget http://cyanophora.rutgers.edu/Pocillopora_acuta/Pocillopora_acuta_HIv2.genes.EggNog_results.txt.gz
wget http://cyanophora.rutgers.edu/montipora/Montipora_capitata_HIv3.genes.EggNog_results.txt.gz
Sort files in EggNog functional annotation and CpG OE results files
sort -i pacuta_ID_CpG > pacuta_ID_CpG_sorted
Pocillopora_acuta_HIv2___RNAseq.10002_t 0.874359
Pocillopora_acuta_HIv2___RNAseq.10010_t 0.824372
Pocillopora_acuta_HIv2___RNAseq.1016_t 0.852742
Pocillopora_acuta_HIv2___RNAseq.10171_t 0.711555
Pocillopora_acuta_HIv2___RNAseq.10212_t 0.625815
Pocillopora_acuta_HIv2___RNAseq.10213_t 1.22103
Pocillopora_acuta_HIv2___RNAseq.10263_t 0.781719
Pocillopora_acuta_HIv2___RNAseq.10287_t 0.733141
Pocillopora_acuta_HIv2___RNAseq.10316_t 1.05813
Pocillopora_acuta_HIv2___RNAseq.10405_t 0.640849
sort -i mcap_ID_CpG > mcap_ID_CpG_sorted
:
Montipora_capitata_HIv3___RNAseq.10136_t 0.615473
Montipora_capitata_HIv3___RNAseq.10187_t 1.07151
Montipora_capitata_HIv3___RNAseq.10207_t 0.566746
Montipora_capitata_HIv3___RNAseq.10214_t 0.475799
Montipora_capitata_HIv3___RNAseq.10304_t 0.893919
Montipora_capitata_HIv3___RNAseq.10384_t 0.707401
Montipora_capitata_HIv3___RNAseq.10476_t 0.666366
Montipora_capitata_HIv3___RNAseq.10478_t 0.666366
Montipora_capitata_HIv3___RNAseq.1048_t 0.827187
Montipora_capitata_HIv3___RNAseq.10547_t 0.613498
sort -i Pocillopora_acuta_HIv2.genes.EggNog_results.txt > CpG_OE/Pocillopora_acuta_HIv2.genes.EggNog_results_sorted.txt
sort -i Montipora_capitata_HIv3.genes.EggNog_results.txt > CpG_OE/Montipora_capitata_HIv3.genes.EggNog_results_sorted.txt
Combine functional annotation file with CpG sorted file
join mcap_ID_CpG_sorted Montipora_capitata_HIv3.genes.EggNog_results_sorted.txt | awk '{print $1, "\t", $2}' > Mcap_cpg_anno
:
Montipora_capitata_HIv3___RNAseq.10187_t 1.07151
Montipora_capitata_HIv3___RNAseq.10207_t 0.566746
join pacuta_ID_CpG_sorted Pocillopora_acuta_HIv2.genes.EggNog_results_sorted.txt | awk '{print $1, "\t", $2}' > Pacuta_cpg_anno
:
Pocillopora_acuta_HIv2___RNAseq.10002_t 0.874359
Pocillopora_acuta_HIv2___RNAseq.10010_t 0.824372
Export data to local computer
M. capitata is for Bleaching Pairs project.
scp emma_strand@ssh3.hac.uri.edu:/data/putnamlab/estrand/CpG_OE/Mcap_cpg_anno /Users/emmastrand/MyProjects/HI_Bleaching_Timeseries/Dec-July-2019-analysis/output/WGBS/Mcap_cpg_anno.txt
P. acuta is for Holobiont Integration project.
scp emma_strand@ssh3.hac.uri.edu:/data/putnamlab/estrand/CpG_OE/Pacuta_cpg_anno /Users/emmastrand/MyProjects/Acclim_Dynamics_molecular/data/WGBS/Pacuta_cpg_anno.txt
Next steps
The dataframe I need is:
Gene_ID CpG_ratio Annotation
The annotation is the GOSlim terms from the functional annotation pipeline I use.
1.) R script to analyze CpG Density 2.) Identify which Gene IDs (and annotations) are in weakly and strongly methylated sets 3.) Interpret what P. acuta and M. capitata traditionally heavily methylate (or not) 4.) Track % methylation on those two groups of genes (weakly and strongly) through time, treatment (might have to collapse temperature)
CpG ratio is only at the species level not individual because the timescale changes in nucleotides is larger than the shorter-term effect of environment or phenotype in the case of bleaching pairs.