Our research is in the area of regulatory genomics, with a focus on developing quantitative assays and mechanistic models to investigate how transcription factors interact with their genomic targets, how they compete/cooperate with one another, and how they interact with other cellular processes that involve the genome, such as DNA damage and repair.
Transcriptional regulation of gene expression is critical for all cellular processes. A major step in the regulation of gene expression is binding of regulatory proteins called transcription factors (TFs) to specific short DNA sites in the promoters and enhancers of the regulated genes. Recent genome-wide association studies (GWAS) estimate that 85-90% of disease-associated variants occur in non-coding genomic regions, and thus have the potential to affect TF binding. Such mutations in TF binding sites may lead to dysregulated gene expression and contribute to disease. Importantly, even small changes in gene expression can lead to disease over time (e.g. in neurodegenerative disorders), and even a small change in DNA binding affinity can have important phenotypic consequences (e.g. during development). Thus, there is a growing demand for sensitive and quantitative approaches to measure and model TF-DNA binding, both for individual TFs and for TFs that bind cooperatively or competitively. This knowledge can then be used to assess the effect of genetic variation/mutations on the strength of TF-DNA binding, and subsequently on the level of gene expression. We develop such quantitative approaches in the Gordan laboratory.
Deciphering the codes that govern the genomic recruitment of transcription factors
Inside eukaryotic cells, transcription factors must scan billions of nucleotides and navigate a complex environment in order to identify and bind their genomic target sites. The genome is ‘decorated’ with a myriad of proteins and other molecules, including transcription factors from various protein families, nucleosomes that compact the DNA, epigenetic modifications (such as DNA methylation), DNA damage resulting from internal or external factors, DNA repair enzymes, etc. Understanding how transcription factors identify their targets in this complex environment is critical for studying the changes in binding and gene regulation that occur during evolution and disease.
A major research direction in our lab is deciphering the codes that govern the genomic recruitment of transcription factors. Toward this goal, our general approach is to recreate and model regulatory systems in a controlled cell-free environment, through the quantitative design and careful inclusion of individual components, one at a time, to elucidate the role of each component. We start with simple systems of two molecules (purified transcription factors and naked DNA containing single genomic binding sites) and then we gradually add more components to increase the complexity of the system (e.g. competing factors, additional binding sites and cooperating factors, epigenetic factors, etc.), while adapting our quantitative models at every step. We compare our models against cell-based (in vivo) data, being careful to take into account critical biological variables such as DNA accessibility, as well as the technical biases of in vivo assays. This approach is perfectly suited for learning sensitive, quantitative models of transcription factor binding. The in vitro systems allow us to assign causality to the observed transcription factor-DNA binding signals, which we then test by perturbing the systems in vivo.
Most cell-free experiments performed in our laboratory use large libraries of DNA molecules synthesized de novo on glass slides, also known as chips or microarrays. Using chips allows us to perform precise binding measurements for tens or hundreds of thousands of DNA sequences simultaneously, a throughput that cannot be achieved with standard biochemical experiments. We have pioneered the development and refinement of the genomic-context protein-binding microarray (gcPBM) assay (Gordan et al. 2013). Inspired by the PBM technology developed in the Bulyk lab (Harvard Medical School), the gcPBM assay measures transcription factor binding to DNA sites in their native genomic sequence context (i.e. to the same sequences that the proteins encounter in the cell), but in a controlled cell-free environment where we can measure binding quantitatively and with high accuracy (replicate R² = 0.95-0.99; R² > 0.9 when compared to independent binding affinity data). gcPBM assays allow us to quantify the contribution of genomic flanking sequences to TF-DNA binding affinity, through mechanisms such as DNA shape and non-consensus DNA binding (Gordan et al. 2013, Mordelet et al. 2013, Yang et al. 2014, Afek et al. 2014, Zhou et al. 2015, Schipper et al., Shen et al. 2018).
Based on high-quality gcPBM data, we develop quantitative models of protein-DNA binding specificity, able to capture dependencies within TF binding sites. Such models include positional k-mer regression models, trained using support vector regression (SVR) and LASSO (Gordan et al. 2013, Mordelet et al. 2013, Schipper et al., Shen et al.). Our models oftentimes reach the predictive accuracy of replicate experiments. This is due to the unique combination of high-quality data and state-of-the-art machine learning approaches.
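As an illustration of what a positional k-mer representation can look like, the following is a minimal sketch that encodes a DNA sequence as a 0/1 vector with one entry per (k-mer, position) pair. The function names, the choice of k = 1, 2, 3, and the `kmer@position` naming are assumptions made for this example and are not taken from the published models. Vectors of this kind could then be passed to a regression method such as SVR or LASSO, with measured binding intensities as the response.

```python
from itertools import product

ALPHABET = "ACGT"


def _all_kmers(k):
    """All 4**k DNA k-mers in lexicographic order."""
    return ["".join(p) for p in product(ALPHABET, repeat=k)]


def positional_kmer_features(seq, k_values=(1, 2, 3)):
    """Encode `seq` as a 0/1 list with one entry per (k, position, k-mer).

    Illustrative encoding only: for each k and each start position,
    exactly one of the 4**k entries is set to 1.
    """
    seq = seq.upper()
    features = []
    for k in k_values:
        kmers = _all_kmers(k)
        for pos in range(len(seq) - k + 1):
            window = seq[pos:pos + k]
            features.extend(1 if window == kmer else 0 for kmer in kmers)
    return features


def feature_names(seq_len, k_values=(1, 2, 3)):
    """Names aligned with positional_kmer_features, e.g. 'CAC@0'."""
    names = []
    for k in k_values:
        kmers = _all_kmers(k)
        for pos in range(seq_len - k + 1):
            names.extend(f"{kmer}@{pos}" for kmer in kmers)
    return names


if __name__ == "__main__":
    site = "CACGTG"  # example 6-bp site, chosen only for illustration
    vec = positional_kmer_features(site)
    active = [n for n, x in zip(feature_names(len(site)), vec) if x]
    print(len(vec), "features;", "active:", active)
```

Because 2-mer and 3-mer features are tied to specific positions, a linear model over this encoding can represent dependencies between neighboring positions within a binding site, which a model with independent per-position terms cannot.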
Molecular mechanisms by which paralogous TFs select distinct genomic targets
Most human TFs are part of large protein families, with many family members (i.e. paralogous TFs) expressed at the same time in the cell, but binding to different sets of genomic sites and playing different regulatory roles, despite having indistinguishable DNA motifs. The differential in vivo binding and regulatory activity of paralogous TFs is currently believed to be entirely due to interactions with protein cofactors or the chromatin environment. In contrast to this belief, we have identified, using the gcPBM technique, differences in intrinsic binding specificity between paralogous TFs previously reported to have identical DNA motifs (Gordan et al. 2013, Shen et al.). Sequence-based differences are typically located in the core TF binding sites, while structural differences are more common in the flanking regions. Some of the differences are large and others are more subtle, but even the subtle ones correlate with in vivo binding data. Importantly, many differences identified by gcPBM are undetected by other technologies. This is because gcPBMs measure TF binding to sites in their genomic sequence context (i.e. the same sequences they encounter in the cell), but in a controlled cell-free environment where we can measure binding quantitatively and with high accuracy.
Characterizing the transcriptional regulatory effects of genetic variants and mutations in non-coding genomic regions
Recent studies estimate that 85-90% of disease-associated variants occur in non-coding genomic regions, suggesting that such variants are important in disease etiology. However, the effect of non-coding variants on TF-DNA binding is currently assessed based on DNA motif models (which are just approximations of specificity) and/or low-resolution genomic binding data for TFs and histone marks. Neither approach can make accurate quantitative predictions. In contrast, we are leveraging PBM data to develop new methods for predicting the effect of non-coding variants on the level of TF-DNA binding. We focus on two applications: 1) we develop statistical methods for identifying genetic mutations that affect TF-DNA binding, leveraging the quantitative nature of in vitro data generated in our lab or available from the literature (Zhao et al., RECOMB 2017), and 2) we develop methods for pathway-based analyses of non-coding somatic mutations identified from tumor whole-genome sequencing data (Jusakul et al., Cancer Discovery 2017).
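The general idea of scoring a variant by the change in predicted binding between the reference and alternate alleles can be sketched as follows. This is a toy illustration under stated assumptions, not the statistical method of the cited work: `toy_kmer_score` is a hypothetical stand-in that counts matches to an arbitrary consensus, where a real analysis would use binding scores derived from in vitro measurements, and all function names are invented for the example.

```python
def toy_kmer_score(kmer, consensus="CACGTG"):
    """Hypothetical stand-in for a measured binding score.

    Counts positions that match an arbitrary consensus; a real analysis
    would look up or predict a score from in vitro binding data.
    """
    return sum(a == b for a, b in zip(kmer, consensus))


def best_score_at(seq, pos, k=6, score=toy_kmer_score):
    """Highest score among the k-mer windows that overlap position `pos`."""
    if len(seq) < k:
        raise ValueError("sequence is shorter than k")
    first = max(0, pos - k + 1)
    last = min(pos, len(seq) - k)
    return max(score(seq[s:s + k]) for s in range(first, last + 1))


def variant_effect(ref_seq, pos, alt_base, k=6, score=toy_kmer_score):
    """Alternate-minus-reference change in the best overlapping score."""
    if not 0 <= pos < len(ref_seq):
        raise ValueError("variant position is outside the sequence")
    alt_seq = ref_seq[:pos] + alt_base + ref_seq[pos + 1:]
    return best_score_at(alt_seq, pos, k, score) - best_score_at(ref_seq, pos, k, score)


if __name__ == "__main__":
    ref = "TTGACACGTGATCC"  # made-up sequence containing CACGTG at 4..9
    # Negative values indicate a predicted loss of binding, positive a gain.
    print(variant_effect(ref, 7, "A"))
    print(variant_effect(ref, 0, "A"))
```

Only windows that overlap the variant are scored, since windows elsewhere in the sequence are identical between the two alleles. A statistical treatment would additionally need an estimate of uncertainty for each score before calling a change significant.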
Deciphering the interplay between transcription factors and DNA damage/repair processes
DNA damage occurs frequently in the genome, at a rate of ~10⁴ events per day. These chemical changes, if unrepaired, can impair critical cellular functions, threaten cell viability, and lead to mutations and the development of disease. Several molecular pathways are responsible for repairing different types of DNA lesions. The first step in most repair pathways is the recognition of damaged sites by the appropriate DNA repair enzymes. TFs bound at or around damaged sites can interfere with the recognition of these sites by repair enzymes, and thus exert a direct effect on DNA repair. In addition, binding of TFs to the DNA is likely affected by DNA lesions. In a new research direction in our lab, we are investigating the complex interplay between transcription factors and DNA damage/repair processes.