Polygraph: A Software Framework for Systematic Evaluation of Synthetic Regulatory DNA Elements

May 07, 2025 16:20 CST

Genentech

In gene and cell therapy, designing regulatory elements is crucial, but evaluating synthetic DNA sequences remains challenging. In the study "Polygraph: a software framework for the systematic assessment of synthetic regulatory DNA elements," researchers developed Polygraph, a framework for assessing synthetic DNA elements that may help advance therapeutic development and deepen understanding of gene regulation mechanisms.

In gene and cell therapy, a well-designed regulatory element acts like a precise key: it can switch on expression of a specific gene in a specific cell type, offering new hope for treating a range of diseases. Evaluating synthetic DNA sequences, however, has been difficult. On one hand, the lack of reliable evaluation metrics makes it hard to compare and integrate results across studies; on the other, the absence of simple, user-friendly software tools has made the evaluation process cumbersome. To address these barriers, Avantika Lal, Laura Gunsalus, and colleagues from gRED Computational Sciences at Genentech in the United States developed a Python framework named Polygraph, with their findings published in *Genome Biology*. The framework offers a systematic way to evaluate synthetic DNA sequences and is expected to support progress in gene and cell therapy.

The researchers relied on five main technical methods. First, sequence composition analysis: they calculated metrics such as GC content and k-mer frequency to compare synthetic sequences with reference sequences. Second, transcription factor binding motif analysis: they used the FIMO tool to scan for transcription factor binding sites within the sequences, followed by non-negative matrix factorization (NMF) to uncover regulatory patterns. Third, embedding analysis: sequences were projected into a low-dimensional space for visualization and diversity calculations. Fourth, predictive modeling: pre-trained neural network models were used to evaluate sequence characteristics. Fifth, language modeling: HyenaDNA was used to compute sequence likelihood as a measure of "human-likeness." The yeast promoter dataset used in the study was derived from Vaishnav et al.'s research, while the human enhancer dataset came from Gosai et al.'s work.
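To make the sequence composition metrics concrete, here is a minimal sketch of GC content and k-mer frequency in plain Python. The function names `gc_content` and `kmer_frequencies` are illustrative assumptions and are not taken from the Polygraph package, whose own implementation may differ.

```python
from collections import Counter
from itertools import product


def gc_content(seq: str) -> float:
    """Fraction of bases in seq that are G or C (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def kmer_frequencies(seq: str, k: int = 2) -> dict[str, float]:
    """Frequency of every possible k-mer over A/C/G/T in seq.

    Windows containing other characters (such as N) still count toward
    the total, so the returned frequencies may then sum to less than 1.
    """
    seq = seq.upper()
    total = max(len(seq) - k + 1, 0)
    counts = Counter(seq[i:i + k] for i in range(total))
    all_kmers = ("".join(p) for p in product("ACGT", repeat=k))
    return {km: (counts[km] / total if total else 0.0) for km in all_kmers}


if __name__ == "__main__":
    # Hypothetical toy sequences, not data from the study.
    synthetic = "GCGCGGCCGC"
    native = "ATATGCATTA"
    print(gc_content(synthetic), gc_content(native))
    print(kmer_frequencies(native, k=2)["AT"])
```

Comparing these values between a synthetic group and a reference group is the kind of check the composition analysis describes; a statistical test across many sequences would be needed for any real conclusion.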

The Powerful Features of the Polygraph Software Package

Polygraph, as a comprehensive Python software package, is capable of accepting DNA sequences of arbitrary length and conducting in-depth analysis on them. In terms of sequence composition analysis, it can evaluate metrics such as GC content, k-mer frequency, and the edit distance between synthetic sequences and reference sequences. For instance, human regulatory regions have high GC content, and through the quantitative assessment of these metrics, Polygraph can provide deep insights into the novelty and "human-likeness" of computationally designed regulatory elements. In transcription factor binding motif analysis, Polygraph not only scans for transcription factor binding motifs within sequences and reports the number of matches but also analyzes motif combinations, positions, and orientations. Through NMF decomposition, it reveals common transcription factor programs, assisting researchers in uncovering higher-order regulatory rules utilized by different design methods. The embedding analysis feature enables Polygraph to embed sequences into a low-dimensional space, allowing users to perform various analyses such as sequence visualization, diversity calculation, and classifier training, comparing synthetic sequences with natural DNA from different perspectives. In predictive modeling, Polygraph integrates pre-trained neural network models to evaluate key characteristics of designed sequences, such as activity, specificity, and chromatin accessibility, providing crucial criteria for sequence screening. Language modeling utilizes HyenaDNA to quantify the log-likelihood of synthetic sequences, measuring their "human-likeness" and assessing their similarity to the human genome. Additionally, Polygraph offers a "guided evolution" function, helping researchers evolve DNA sequences with high predicted activity while maintaining similarity to natural sequences.
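The edit distance between a synthetic sequence and a reference set can be illustrated with the standard Levenshtein dynamic program. This is a sketch under assumptions: the helper names are hypothetical, and the article does not specify which edit-distance variant or implementation Polygraph uses.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, base_a in enumerate(a, start=1):
        curr = [i]
        for j, base_b in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                       # delete base_a
                curr[j - 1] + 1,                   # insert base_b
                prev[j - 1] + (base_a != base_b),  # substitute or match
            ))
        prev = curr
    return prev[-1]


def distance_to_nearest_reference(seq: str, references: list[str]) -> int:
    """Smallest edit distance from seq to any sequence in references."""
    return min(edit_distance(seq, ref) for ref in references)


if __name__ == "__main__":
    # Hypothetical toy sequences, not data from the study.
    refs = ["AAAA", "ACGA"]
    print(distance_to_nearest_reference("ACGT", refs))
```

A small distance to the nearest reference suggests a designed sequence stays close to known natural DNA, while a large one points to novelty.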

Polygraph Reveals Differences in Regulatory Strategies of Yeast Promoters

As a first case study, the researchers selected 50 natural yeast promoters with low activity in the medium (Native (Weak)) as the starting point for design. Using a convolutional regression model as the activity predictor, they edited these promoters with three methods to enhance activity: Directed Evolution, the gradient-based optimization method Ledidi (Gradient), and Polygraph’s Guided Evolution. They also collected the 50 natural yeast promoters with the highest activity (Native (Strong)) as a reference set. The predicted activities of the synthetic promoters generated by the three methods were similar to one another and equal to or higher than those of the natural strong promoters. In terms of sequence composition, however, the Directed Evolution and Gradient sequences showed no significant change in edit distance from Native (Strong) promoters but a significant increase in GC content, introducing many non-natural sequence patterns, whereas the Guided Evolution sequences were more similar to Native (Strong), with GC content close to that of natural DNA. In the analysis of transcription factor binding motifs, the motif content of Directed Evolution and Gradient promoters differed significantly from that of Native (Strong) promoters and included many motifs rarely seen in natural promoters, while the motif content of Guided Evolution promoters was more similar to that of Native (Strong). Embedding analysis showed that Guided Evolution promoters were closest to Native (Strong) promoters in low-dimensional space, and that the diversity of all synthetic promoter groups was significantly lower than that of Native (Strong) promoters, with the Guided Evolution group having the lowest diversity.
This indicates that Directed Evolution and Gradient methods adopted a regulatory syntax based on GC-rich activation motifs, which differs significantly from natural sequences; whereas the sequences produced by Guided Evolution were closer to natural strong promoters while still achieving high predicted activity.
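The contrast between plain directed evolution and an evolution procedure that also rewards similarity to natural sequences can be sketched with a toy example. Everything here is an assumption made for illustration: `toy_activity` stands in for the convolutional regression model, the Hamming-distance penalty stands in for whatever similarity term Polygraph actually uses, and `guided_evolution_sketch` is not the package's function.

```python
from typing import Callable


def toy_activity(seq: str) -> float:
    """Hypothetical stand-in for a trained activity model: GC fraction."""
    return sum(base in "GC" for base in seq) / len(seq)


def hamming(a: str, b: str) -> int:
    """Number of mismatched positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def guided_evolution_sketch(
    start: str,
    references: list[str],
    predictor: Callable[[str], float],
    n_rounds: int = 10,
    weight: float = 1.0,
) -> str:
    """Greedy single-base evolution with a penalty for leaving the references.

    Assumes all sequences have the same length. With weight=0 this reduces
    to plain greedy directed evolution on the predictor alone.
    """
    def objective(seq: str) -> float:
        nearest = min(hamming(seq, ref) for ref in references)
        return predictor(seq) - weight * nearest / len(seq)

    current = start
    for _ in range(n_rounds):
        best, best_score = current, objective(current)
        # Try every single-base substitution and keep the best one.
        for pos, old in enumerate(current):
            for base in "ACGT":
                if base == old:
                    continue
                candidate = current[:pos] + base + current[pos + 1:]
                score = objective(candidate)
                if score > best_score:
                    best, best_score = candidate, score
        if best == current:  # no substitution improves the objective
            break
        current = best
    return current


if __name__ == "__main__":
    # Hypothetical toy sequences, not data from the study.
    weak, strong_refs = "ATATATAT", ["GCATATAT"]
    print(guided_evolution_sketch(weak, strong_refs, toy_activity, 8, weight=0.0))
    print(guided_evolution_sketch(weak, strong_refs, toy_activity, 8, weight=2.0))
```

In this toy setting, the unpenalized run drives the sequence toward pure GC, while the penalized run stops once further gains in predicted activity would cost too much similarity to the reference, which loosely mirrors the trade-off described above.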

Polygraph Identifies a Versatile Subset of Human Enhancers

The researchers further applied Polygraph to approximately 16,000 synthetic human enhancer sequences designed for the HepG2 cell line and generated by three methods: AdaLead, FastSeqProp, and simulated annealing. Although the average GC content of all synthetic sequence groups was close to that of natural sequences, the GC variance of synthetic sequences was lower, and they could be clearly distinguished from natural sequences based on k-mer and motif content. Sequence likelihoods computed through language modeling showed that the log-likelihood of all synthetic sequence groups was significantly lower than that of natural sequences, indicating that they differ from the human genome. NMF analysis showed that synthetic sequences differed from natural sequences in their use of liver-specific regulatory grammar, with some NMF factors (regulatory programs) significantly enriched and others depleted. Chromatin accessibility models predicted that all designed sequences would be highly accessible in cell types similar to HepG2 but also in other cell types not considered during design, suggesting potential off-target activity. The researchers also divided the synthetic enhancers into successful and failed groups based on experimentally measured activity and found that the failure rate increased with distance from natural sequences in the embedding space, highlighting the importance of Polygraph’s sequence authenticity metrics. For sequence selection, Polygraph’s metrics can help researchers choose sequences by different criteria: sequences selected for "Naturalness" had a zero experimental failure rate, those selected for "Novelty" were enriched in regulatory program factors that differ from natural sequences, and geometric sketching could be used to select subsets of sequences representing diverse regulatory strategies.
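The idea of ranking designed sequences by their distance from natural sequences in an embedding space can be sketched as follows. This is only an illustration under assumptions: a normalized k-mer frequency vector serves as the embedding, Euclidean distance to the nearest natural sequence serves as the "naturalness" score, and all function names are hypothetical rather than Polygraph's.

```python
import math
from itertools import product


def kmer_embedding(seq: str, k: int = 2) -> list[float]:
    """Embed a sequence as its normalized k-mer frequency vector."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    n_windows = max(len(seq) - k + 1, 0)
    for i in range(n_windows):
        kmer = seq[i:i + k].upper()
        if kmer in counts:  # skip windows with non-ACGT characters
            counts[kmer] += 1
    denom = max(n_windows, 1)
    return [counts[km] / denom for km in kmers]


def nearest_natural_distance(vec: list[float], natural_vecs: list[list[float]]) -> float:
    """Euclidean distance from vec to its closest natural embedding."""
    return min(math.dist(vec, nat) for nat in natural_vecs)


def rank_by_naturalness(synthetic: list[str], natural: list[str], k: int = 2) -> list[str]:
    """Sort synthetic sequences from most to least natural-like."""
    natural_vecs = [kmer_embedding(s, k) for s in natural]
    scored = [
        (nearest_natural_distance(kmer_embedding(s, k), natural_vecs), s)
        for s in synthetic
    ]
    return [seq for _, seq in sorted(scored)]


if __name__ == "__main__":
    # Hypothetical toy sequences, not data from the study.
    natural = ["ATATATAT", "TTAATTAA"]
    designed = ["GCGCGCGC", "ATATATAA"]
    print(rank_by_naturalness(designed, natural))
```

Taking the top of this ranking corresponds to a "Naturalness"-style selection, and taking the bottom to a "Novelty"-style selection; the embeddings and distance measures used in the study itself may be different.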

Research Conclusions and Significance

The Polygraph software framework developed in this study provides a comprehensive tool for evaluating synthetic regulatory DNA elements. By combining several analytical methods, it characterizes the regulatory elements produced by different generative algorithms and helps researchers understand how these elements resemble and differ from natural DNA. In the yeast and human case studies, Polygraph demonstrated significant potential for assessing design methods and for selecting synthetic sequences for experimental validation. It enables researchers to design and screen regulatory elements more efficiently in fields such as gene and cell therapy, which may accelerate the development of therapeutic interventions. The framework also supports a deeper understanding of gene regulatory mechanisms, providing new perspectives and methods for life science research. There is still room for improvement, such as integrating more predictive models, experimentally validating sequences, and feeding results back into the design process for optimization. Even so, Polygraph lays a solid foundation for the systematic evaluation of synthetic regulatory DNA elements in synthetic biology and gene therapy.

References

Genome Biology: Polygraph: a software framework for the systematic assessment of synthetic regulatory DNA elements
