ssHMM: extracting sequence-structure motifs

We can now use the called binding sites and find out whether they share a common motif/pattern. For this purpose, we use ssHMM. The following steps are explained for the binding sites from PureCLIP in basic mode. They can be found in ~/protein-RNA-interactions/RBFOX2_data/PureCLIP_results/bindingRegions.basic.bed. To analyze the binding sites from PureCLIP with control data you have to use ~/protein-RNA-interactions/RBFOX2_data/PureCLIP_results/bindingRegions.input_CLmotifs.bed instead.

Get intermediate files: Show/Hide

In case something went wrong, you can copy the intermediate files from ~/protein-RNA-interactions/intermediate_results/.


Preprocessing

The binding sites that PureCLIP calls are very short - too short for containing a motif. Therefore, we need to extend the binding sites by 20 bp in both directions.

cd ~/protein-RNA-interactions/RBFOX2_data/PureCLIP_results/
bedtools slop -i bindingRegions.basic.bed -g ~/protein-RNA-interactions/hg19_data/genome_index/chrNameLength.txt -b 20 > bindingRegions.basic.elong.bed

Next, we sort the binding sites by their score and fetch only the best 3000.

sort -g -k5,5 -r bindingRegions.basic.elong.bed > bindingRegions.basic.elong.soSc.bed
head -n 3000 bindingRegions.basic.elong.soSc.bed > bindingRegions.basic.elong.soSc.top3000.bed

To use the binding regions for ssHMM, we copy them to the sshmm directory.

mkdir -p ~/sshmm/datasets/bed/RBFOX2_PureCLIP-basic_regions
cp bindingRegions.basic.elong.soSc.top3000.bed ~/sshmm/datasets/bed/RBFOX2_PureCLIP-basic_regions/positive_raw.bed

The binding regions are currently just genomic coordinates, i.e. numbers in a file. But ssHMM needs RNA sequences and structures to learn motifs. We therefore use a preprocessing script to fetch the nucleotide sequence from the coordinates and predict the RNA structure of the region with the tool RNAshapes.

cd ~/sshmm/
preprocess_dataset -e 20 datasets RBFOX2_PureCLIP-basic_regions 1 0.0

ssHMM

Now everything is ready and we can start ssHMM. Over many iterations it tries to find the most prominent sequence-structure motif in the data. The intermediate results are printed onto the command line so you can see the current iteration number and a few statistics.

train_seqstructhmm datasets/fasta/RBFOX2_PureCLIP-basic_regions/positive.fasta datasets/shapes/RBFOX2_PureCLIP-basic_regions/positive.txt -o results/ -n 6 -b -j RBFOX2_PureCLIP-basic_top3000_regions_len6_b_random

When ssHMM is done, it tells you the location of the final model graph. You can open this png image in an image viewer.