(804.6) In Vitro Transcription Factor Binding Site Predictions Using Support Vector Machine Classification

Tuesday, April 5, 2022

12:30 PM – 1:45 PM

Location: Exhibit/Poster Hall A-B - Pennsylvania Convention Center

Poster Board Number: A239

Diego Pomales-Matos (University of Puerto Rico, Río Piedras), Diego Rosado-Tristani (University of Puerto Rico, Río Piedras), Emmanuel Carrasquillo-Dones (University of Puerto Rico, Río Piedras), José Rodríguez-Martínez (University of Puerto Rico, Río Piedras)

Presenting Author(s)

Diego A. Pomales-Matos

Presenting Author
University of Puerto Rico, Río Piedras

Transcription factors (TFs) are sequence-specific DNA-binding proteins that are essential for regulating gene expression. Determining TF DNA-binding specificity helps in studying gene regulatory networks within cells and in understanding how genetic variation can disrupt normal gene expression. One method for characterizing TF specificity is to train Support Vector Machines (SVMs) on chromatin immunoprecipitation followed by DNA sequencing (ChIP-seq) data. However, specificity can also be characterized using data from Systematic Evolution of Ligands by Exponential Enrichment (SELEX), a method that also aids in determining TF-DNA binding preferences. In this project, we implemented a gapped k-mer SVM to study TF-DNA binding preferences using SELEX-seq data. We used a large-scale gapped k-mer SVM, a sequence-based classifier for analyzing TF specificity. It builds a predictive model that is trained on bound and unbound sequences from SELEX data. For this project, we used T-box transcription factor 5 (TBX5). After training the model for TBX5 and testing its performance, we obtained an area under the receiver operating characteristic curve (AUROC) of 0.8248, indicating that the model discriminates bound from unbound sequences well above the chance level of 0.5. Likewise, the highest-scoring sequences contained TBX5 motifs. Given these results, we concluded that the SVM was successfully implemented. In addition, SELEX data had not previously been used to train SVM-based predictive models; these results indicate that SELEX data are compatible with, and useful for, developing such models.
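
As an illustration of the two ideas the abstract relies on, the following is a minimal Python sketch of gapped k-mer features and of how an AUROC value is computed. It is not the software used in this work and it does not train an SVM: the function names (`gapped_kmers`, `gkm_similarity`, `auroc`), the window length `l=6`, and the number of informative positions `k=4` are all assumptions chosen for readability. A real analysis would pass a kernel of this kind to an SVM trained on bound and unbound SELEX sequences.

```python
from collections import Counter
from itertools import combinations


def gapped_kmers(seq, l=6, k=4):
    """Count gapped k-mers in seq.

    Each length-l window is expanded into every pattern that keeps k
    informative positions and masks the other l - k positions with '.'.
    The values of l and k are illustrative assumptions.
    """
    masks = [set(m) for m in combinations(range(l), k)]
    counts = Counter()
    for i in range(len(seq) - l + 1):
        window = seq[i:i + l]
        for keep in masks:
            key = "".join(c if j in keep else "." for j, c in enumerate(window))
            counts[key] += 1
    return counts


def gkm_similarity(a, b, l=6, k=4):
    """Normalized inner product of two gapped k-mer count vectors.

    This is the kind of sequence similarity a gapped k-mer kernel
    measures; it is a simplified stand-in, not an optimized kernel.
    """
    fa, fb = gapped_kmers(a, l, k), gapped_kmers(b, l, k)
    dot = sum(v * fb[key] for key, v in fa.items())
    norm_a = sum(v * v for v in fa.values()) ** 0.5
    norm_b = sum(v * v for v in fb.values()) ** 0.5
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def auroc(scores, labels):
    """AUROC as the probability that a randomly chosen bound sequence
    (label 1) scores higher than a randomly chosen unbound sequence
    (label 0); ties count as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Under this definition, an AUROC of 0.5 corresponds to random ranking and 1.0 to perfect separation of bound from unbound sequences, which is the scale on which the reported value of 0.8248 should be read.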

NSF Grant Award 1852259, NIH Grant Award SC1GM127231, NSF EPSCoR Research Infrastructure Grant Award 1736026, RISE Program Grant Award 5R25GM061151-19