When the genome learned its own vocabulary

Traditionally, predictive models in biology were built on hand-engineered features derived from prior biological knowledge, such as known protein domains, conserved sequence motifs, or experimentally validated regulatory elements. In regulatory genomics specifically, transcription factor binding sites were typically encoded as position weight matrices curated from experimental data and compiled into databases that served as a standardized vocabulary of regulatory features. This feature engineering paradigm was powerful within the boundaries of established knowledge, but it was inherently limited: it could only recognize signatures that had already been characterized, not discover new ones.
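
To make the classic approach concrete, here is a minimal sketch of PWM scanning. The matrix below is an invented 4 bp motif with illustrative log-odds weights, not an entry from any real database; real pipelines would pull curated matrices and handle ambiguity codes, strand, and background models.

```python
import numpy as np

# Hypothetical position weight matrix for a 4 bp motif (rows: A, C, G, T).
# The log-odds values are made up for illustration.
PWM = np.array([
    [ 1.2, -0.5, -1.0,  0.8],   # A
    [-0.7,  1.5, -0.9, -0.6],   # C
    [-0.4, -0.8,  1.4, -0.3],   # G
    [-0.9, -0.6, -0.7,  0.9],   # T
])

BASE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}

def scan_sequence(seq: str, pwm: np.ndarray) -> list[float]:
    """Slide the PWM along the sequence and score every window."""
    width = pwm.shape[1]
    scores = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        # A window's score is the sum of per-position log-odds weights.
        score = sum(pwm[BASE_INDEX[base], pos] for pos, base in enumerate(window))
        scores.append(score)
    return scores

scores = scan_sequence("GGACATTACG", PWM)
best = int(np.argmax(scores))
print(f"best window starts at {best}, score = {scores[best]:.2f}")
```

The key point is that the matrix itself is fixed in advance: the model can only find motifs someone has already measured and deposited.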

In the early 2010s, a new paradigm began to emerge. In 2012, AlexNet demonstrated in computer vision that deep convolutional networks trained on large-scale datasets could learn representations that outperformed those designed through manual feature engineering. The deep learning revolution had arrived, and it raised a key question in genomics: could a model learn the regulatory code from raw DNA sequence, without being told in advance what to look for?
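
As a rough sketch of what that shift looks like in code (a toy model in a PyTorch-style API, not any particular published architecture), the input becomes the raw sequence itself, one-hot encoded as a 4-channel signal, and learned convolutional filters take over the role the curated PWMs used to play. The filter count, kernel size, and prediction head here are arbitrary illustrative choices.

```python
import torch
import torch.nn as nn

def one_hot_encode(seq: str) -> torch.Tensor:
    """Encode a DNA sequence as a (4, length) tensor with channels A, C, G, T."""
    index = {"A": 0, "C": 1, "G": 2, "T": 3}
    x = torch.zeros(4, len(seq))
    for pos, base in enumerate(seq):
        x[index[base], pos] = 1.0
    return x

# Each of the 16 convolutional filters acts like a motif detector,
# but its weights are fit to data rather than curated by hand.
model = nn.Sequential(
    nn.Conv1d(in_channels=4, out_channels=16, kernel_size=12, padding="same"),
    nn.ReLU(),
    nn.AdaptiveMaxPool1d(1),   # keep the strongest activation per filter
    nn.Flatten(),
    nn.Linear(16, 1),          # e.g. a binding / no-binding score
)

seq = "ACGTACGTTGACGTACGATCGTACGTACGATT"
x = one_hot_encode(seq).unsqueeze(0)   # add batch dimension: (1, 4, L)
with torch.no_grad():
    print(model(x).shape)              # -> torch.Size([1, 1])
```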
