Maurizio Morri – Science Blog

AI Is Drowning in Data. The Fix Might Be a Protein.

May 29, 2026

There is a quiet problem sitting underneath the AI boom that almost nobody talks about. All of this, the training runs, the generated images, the endless model checkpoints, produces data. An astonishing, accelerating amount of it. And all of that data has to live somewhere physical: a hard drive, a tape, a cluster of disks in a warehouse humming with cooling fans. Those warehouses are filling up, drawing enormous amounts of power, and the drives inside them quietly fail and get replaced every few years. We are generating information far faster than we can comfortably store it.

This month a team at The Hong Kong Polytechnic University proposed a fix that sounds almost like a category error. Do not store the data on a disk. Store it inside a protein.

Writing a file into a molecule

The idea of molecular data storage is not new in itself. For years, researchers have been excited about storing information in DNA, because DNA is nature’s own archival format: incredibly dense, and stable for thousands of years under the right conditions. The PolyU team went a step further and used proteins instead, and the way they did it is worth slowing down on.

A protein is a chain of amino acids. There are twenty common ones, and the order they come in is what makes one protein different from another. If you think of each amino acid as a symbol, then a sequence of them is just a string of symbols, which is to say, it is information. So the team did the obvious-once-you-hear-it thing: they encoded digital files as amino acid sequences. To make this manageable, they borrowed a structural trick from biology and used collagen, the protein that makes up much of your connective tissue, as a template. Collagen has a clean, repetitive backbone, and they embedded their data-carrying sequences into that scaffold.
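To make the "sequence as string of symbols" idea concrete, here is a minimal sketch that rewrites a file's bytes in base 20, one digit per amino acid. It is an illustration only: the alphabet ordering, the sentinel byte, and the function names are assumptions, and the PolyU team's actual encoding scheme is not described here and is almost certainly more elaborate (error correction, sequence constraints, and so on).

```python
# The 20 standard amino acids as one-letter codes; this ordering is arbitrary.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
BASE = len(ALPHABET)  # 20 symbols, so at most log2(20) ~ 4.32 bits per residue


def encode(data: bytes) -> str:
    """Map bytes to an amino acid string by rewriting the data in base 20."""
    # A leading 0x01 sentinel byte keeps leading zero bytes from being lost.
    number = int.from_bytes(b"\x01" + data, "big")
    residues = []
    while number > 0:
        number, digit = divmod(number, BASE)
        residues.append(ALPHABET[digit])
    return "".join(reversed(residues))


def decode(sequence: str) -> bytes:
    """Invert encode(): read the base-20 digits back into bytes."""
    number = 0
    for residue in sequence:
        number = number * BASE + ALPHABET.index(residue)
    raw = number.to_bytes((number.bit_length() + 7) // 8, "big")
    return raw[1:]  # drop the sentinel byte


message = b"hello, protein"
sequence = encode(message)
print(sequence, decode(sequence) == message)
```

The useful takeaway is the ceiling: with twenty symbols, each residue can carry at most about 4.32 bits, compared with 2 bits for each DNA base.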

Then comes the part I find genuinely clever. To actually produce these designer proteins, they did not synthesize them chemically. They handed the instructions to E. coli, the workhorse bacterium of every molecular biology lab, and let living cells manufacture the data-bearing proteins for them. To read the data back, they chopped the proteins into fragments and ran them through mass spectrometry, an instrument that measures molecular weights precisely enough to reconstruct the original sequence, and therefore the original file.
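The read-out step can be sketched in an idealized form: if you had the cumulative masses of successively longer fragments, the difference between neighbors would identify each residue. The code below uses standard monoisotopic residue masses, but everything else is a simplification I am assuming for illustration. Real spectra involve different ion types, charge states, noise, and missing peaks, and the team's actual pipeline is not described here.

```python
# Monoisotopic residue masses in daltons (standard reference values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def ladder(sequence):
    """Cumulative residue masses of successively longer prefixes (idealized)."""
    total = 0.0
    masses = []
    for residue in sequence:
        total += RESIDUE_MASS[residue]
        masses.append(total)
    return masses


def read_ladder(masses, tolerance=0.01):
    """Call one residue per step from the mass difference between neighbors."""
    previous = 0.0
    calls = []
    for mass in masses:
        step = mass - previous
        matches = sorted(r for r, m in RESIDUE_MASS.items()
                         if abs(m - step) <= tolerance)
        # Several matches means the step is ambiguous; none means unreadable.
        calls.append("/".join(matches) if matches else "?")
        previous = mass
    return calls


print(read_ladder(ladder("GAKL")))
```

Note the last call in that example: leucine and isoleucine have identical masses, so mass alone cannot tell them apart. That is one reason an encoding alphabet might avoid using both, although whether the PolyU scheme does so is not stated in the material summarized here.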

Encode into amino acids, grow in bacteria, read out on a mass spec, recover the file. A full round trip, which is the thing that had not been cleanly demonstrated before in designed unnatural proteins.

Why proteins and not DNA

The natural question is why bother with proteins when DNA storage already exists. The team’s answer is mostly about efficiency. Compared with their own earlier peptide-based approach, the protein method reportedly reached around thirty times the storage density at roughly a tenth of the cost. They also report that the data-bearing proteins were considerably more stable than DNA, which matters a great deal for anything meant to be an archive rather than a working file.

Two further features push this from a curiosity toward something that looks like real engineering. The first is random access. With a naive molecular archive, you cannot easily grab one file out of the middle; you have to decode everything. By attaching specific affinity tags to the proteins carrying particular data, the team could fish out just the segment they wanted using matching antibodies, the molecular equivalent of pulling one labeled box off a shelf. The second is encryption. Because a tagged message can only be retrieved by someone who has the matching capture compound, the storage medium comes with a built-in lock. The data is not just stored in biology; it is hidden in it.

The part I keep thinking about

It is easy to file this under “neat lab demo,” and to be fair, that is what it is for now. Mass spectrometry is slow and expensive compared with reading a hard drive, the capacities are tiny next to a data center, and growing your archive in a bacterial culture is not something you will be doing on a Tuesday afternoon. Nobody is backing up their photos to E. coli this year.

But step back and look at the shape of the thing. The problem was created by the digital world: AI generating data faster than silicon can hold it. The proposed solution comes from the biological world: the same molecules that store and process information inside every living cell, repurposed to hold our files. That crossover is the recurring theme of this whole era. We spent decades teaching computers to read biology, predicting protein structures, modeling cells. Now biology is being recruited to solve a computing problem in the other direction.

I do not know whether protein storage specifically will be the thing that scales. Most clever demonstrations like this do not become the standard; they become the idea that the eventual standard borrowed from. What I am fairly sure of is that the wall between “computing” and “biology” is going to keep getting thinner, because the most interesting problems increasingly sit right on top of it. A file written into a collagen backbone, grown in a dish, and read back out a fortnight ago is a small, strange, and rather beautiful reminder of where things are heading.
CREsted and the Shift From Prediction to Regulatory Design

April 16, 2026

A technically interesting AI and biology paper, published in Nature Methods on April 2, 2026, focuses on something deeper than simple classification. The paper introduces CREsted, a software framework for modeling and designing cell-type-specific enhancers directly from single-cell chromatin accessibility data. In practice, that means using deep learning not just to read regulatory DNA, but to help decode enhancer logic and generate new candidate sequences across tissues and even across species.

What makes this especially strong from an engineering perspective is the end-to-end structure of the framework. CREsted combines preprocessing of scATAC-seq data, model training, interpretation of cell-type-specific enhancer code, and synthetic enhancer design in one pipeline. The authors report applications in mouse cortex, human peripheral blood mononuclear cells, mesenchymal-like cancer states, and zebrafish development, which gives the method a broader scope than many narrowly tuned genomics models.

The technical point here is that enhancer modeling is becoming a design problem, not just an annotation problem. The paper describes multi-output regression and multi-label classification settings, transfer learning from large-scale models, nucleotide-level explanation methods, motif discovery, and downstream matching to transcription factor candidates. That is a meaningful step because it links foundation-style sequence modeling to interpretable regulatory biology instead of stopping at raw predictive performance.

The most compelling part is that the system was not presented only as a computational benchmark. The authors say they trained on a zebrafish development atlas and then used the framework to design synthetic enhancers that were validated in vivo. That is exactly the direction many people have been waiting for in AI biology: models that move from recognizing patterns in genomic data to proposing regulatory elements that can actually be tested in living systems.

This is why papers like this matter. The field is slowly moving away from AI as a passive analysis layer and toward AI as a tool for writing biology with stronger mechanistic grounding. If that trend continues, some of the most important models in genomics will be the ones that can infer regulatory grammar well enough to support real sequence design.

Sources

https://www.nature.com/articles/s41592-026-03057-2

https://doi.org/10.1038/s41592-026-03057-2
How AI is helping antibody discovery

March 24, 2026

AI Is Getting Better at Finding Useful Antibodies Without Screening Everything

One of the most interesting biology and AI stories from the last couple of weeks is a March 15, 2026 Nature Communications paper on AI guided antibody discovery. The study describes a method for mining antibody functionality through structural landscape profiling, using AI to organize and search antibody space more intelligently instead of relying only on brute force experimental screening.

That matters because antibody discovery usually has a scale problem. The space of possible antibodies is enormous, but only a tiny fraction will bind the right target with the right behavior. A system that can map structural relationships and prioritize promising candidates earlier could make discovery faster and cheaper, especially in therapeutic and diagnostic development where screening campaigns are expensive.

The deeper shift is that AI is not just being used here to classify data after the fact. It is being used to navigate biological possibility space. That is a more important role. In practical terms, it means the model can help researchers decide where to look next, which is often the hardest part of modern biology once the raw data become too large to explore by hand.

This is why the story feels bigger than a niche antibody paper. A lot of AI in biology is moving away from simple prediction and toward guided search. Instead of only asking whether a sequence or structure looks plausible, researchers are asking whether AI can help find the rare molecules that are actually worth testing in the lab. That is a much more useful frontier for medicine.

Sources

https://www.nature.com/articles/s41467-026-70553-6
Predicting transcriptomics from chemistry

March 19, 2026

One of the most technically interesting biology and AI stories of the last two weeks is a new Cell paper on a platform called GPS, short for Gene expression profile Predictor on chemical Structures. The core idea is unusually ambitious but easy to state: infer how a molecule will reshape gene expression by looking at the molecule itself, then use that prediction to screen libraries and optimize leads before doing huge amounts of wet lab work. The paper was published in Cell in mid March 2026, and it pushes AI driven drug discovery one step closer to a more mechanistic middle layer between chemical structure and disease phenotype.

That matters because a lot of drug discovery still suffers from an awkward gap. We are reasonably good at representing small molecules, and we are increasingly good at measuring transcriptomic consequences after perturbation, but mapping one to the other at scale is still expensive. In practice, if you want to know how a large library of compounds changes cellular state, you usually need to run a huge number of experiments or fall back on rougher proxies. GPS tries to compress that loop. According to the Cell paper and Michigan State’s summary, the model was trained on millions of experimental measurements and was designed to predict compound induced gene expression profiles directly from chemical structure.

The technical reason this is interesting is that transcriptomics is a much richer target than a single binary property such as toxicity, permeability, or target binding. A gene expression profile is closer to a systems level readout of cellular response. If a model can reliably predict that response from structure, even imperfectly, it becomes a higher bandwidth interface between chemistry and biology. That changes the search problem. Instead of asking only whether a molecule binds one target, researchers can ask whether a molecule pushes a diseased transcriptional state back toward a healthier one. The paper explicitly frames GPS as a platform for identifying drugs that reverse disease associated transcriptomic features, not only for repurposing but also for de novo discovery and lead optimization.
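The idea of "pushing a diseased state back toward a healthier one" can be sketched as a scoring function: compare a disease signature with a compound's predicted expression change and reward profiles that point the opposite way. The version below uses negative cosine similarity. It is a generic illustration, not the scoring used by GPS, and the gene names, values, and compound labels are all made up.

```python
import math


def reversal_score(disease_signature, drug_profile):
    """Negative cosine similarity over shared genes.

    Both arguments map gene name -> log fold change. A score near +1 means
    the compound moves expression opposite to the disease signature.
    """
    genes = sorted(set(disease_signature) & set(drug_profile))
    if not genes:
        raise ValueError("no shared genes between signature and profile")
    d = [disease_signature[g] for g in genes]
    p = [drug_profile[g] for g in genes]
    dot = sum(x * y for x, y in zip(d, p))
    norm = math.sqrt(sum(x * x for x in d)) * math.sqrt(sum(y * y for y in p))
    return 0.0 if norm == 0 else -dot / norm


# Hypothetical disease signature and predicted compound profiles.
disease = {"GENE_A": 2.0, "GENE_B": -1.5, "GENE_C": 0.5}
predicted = {
    "compound_1": {"GENE_A": -1.8, "GENE_B": 1.2, "GENE_C": -0.4},
    "compound_2": {"GENE_A": 1.9, "GENE_B": -1.4, "GENE_C": 0.6},
}

ranked = sorted(predicted, key=lambda c: reversal_score(disease, predicted[c]),
                reverse=True)
print(ranked)
```

In a screening setting, the predicted profiles would come from the model rather than from experiments, which is what makes ranking a large library cheap.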

This is a subtle but important shift in how AI is being used in drug discovery. Many successful models still operate in relatively narrow prediction spaces. They estimate affinity, classify toxicity, or rank candidates against a defined assay endpoint. GPS is closer to learning a perturbational biology prior. It tries to model how chemistry perturbs cellular programs. That makes it potentially more useful in diseases where the phenotype is distributed across pathways rather than dominated by one obvious molecular switch. In those settings, transcriptomic reversal can act as a practical objective because it captures a broader notion of cellular correction.

There is also a real modeling challenge here. Predicting transcriptional change from structure is hard because the mapping is many to many and heavily context dependent. The same compound can produce different profiles depending on dose, cell type, timing, and baseline network state. So the achievement is not that biology has suddenly become predictable in the abstract. It is that researchers are starting to build models that are useful despite that complexity, by training on very large perturbation datasets and focusing on patterns that generalize enough to drive screening and optimization. The Cell abstract describes GPS as screening large compound libraries and optimizing lead molecules under transcriptomic guidance, which suggests the model is meant to be part of an active design loop rather than a static benchmark artifact.

Another reason this story stands out is that the tool appears to be open source. The project’s GitHub repository describes GPS as an Apache 2.0 platform for predicting the effects of chemical structures on gene expression, screening large scale libraries, and optimizing lead compounds, with support for retraining on custom data. That matters a lot for technical readers. In AI biology, the difference between a paper and a platform is huge. A method starts to matter much more when labs can actually inspect it, adapt it, and plug it into their own pipelines.

From a computational biology perspective, this sits in an increasingly important zone between foundation models and practical translational tooling. On one side, the field now has very large biological models trained on sequence, structure, and multimodal omics. On the other side, drug discovery still needs operational systems that can rank molecules, suggest modifications, and narrow expensive search spaces. GPS looks like an attempt to connect those worlds through transcriptomics, which is one of the most information dense phenotypic layers available at scale. If that works robustly, it could become a valuable abstraction layer for medicinal chemistry, especially in indication areas where pathway rewiring matters more than single target potency.

The realistic caveat is that transcriptomic prediction is not the same thing as therapeutic truth. A molecule can produce a promising expression signature and still fail because of pharmacokinetics, toxicity, off target effects, or the simple fact that in vitro cell state does not fully represent disease biology in a living organism. So the right way to read this result is not that AI can now design drugs from scratch by itself. The more serious interpretation is that AI is getting better at predicting one of the richest intermediate biological responses we can measure, and that can make the front end of discovery more efficient and more biologically informed.

That is why this paper feels important. It is not just another claim that AI can score molecules faster. It is a claim that structure can be mapped into transcriptomic consequence at enough fidelity to help drive discovery. If that continues to improve, the future workflow for small molecule discovery may look less like blind chemical search and more like iterative programming of cellular state.

Sources

https://www.cell.com/cell/fulltext/S0092-8674(26)00223-0

https://pubmed.ncbi.nlm.nih.gov/41850287/

https://humanmedicine.msu.edu/news/2026-msu-study-demonstrates-faster-discovery-of-therapeutic-drugs-through-ai%20.html

https://github.com/Bin-Chen-Lab/GPS
AI Is Moving From Reading Biology to Running It

March 13, 2026

For years, the big promise of AI in biology was interpretation. Models could read papers, analyze genomic data, classify images, and suggest hypotheses faster than any human team. Over the last two weeks, the story has started to feel more concrete. The frontier is no longer just AI that understands biology. It is AI that can participate in the experimental loop itself, proposing tests, learning from the results, and steering the next round of lab work. That shift became especially visible this month through new reporting on autonomous biology experiments and through continued discussion around models that can now generate short genomic sequences.

The clearest example came from OpenAI and Ginkgo Bioworks. In work highlighted by both OpenAI and Scientific American, GPT-5 was connected to Ginkgo’s cloud laboratory to optimize cell free protein synthesis, a widely used method for making proteins without living cells. According to OpenAI, the system ran more than 36,000 unique reactions across 580 automated plates and achieved a 40 percent reduction in protein production cost, with a 57 percent improvement in reagent cost. Scientific American described the broader significance well: this was not just a chatbot commenting on biology, but an AI system designing experiments, receiving data back from a robotic lab, and iterating at a speed that would be difficult for a human team to match.

That matters because biology has always resisted the hype cycle that dominates other areas of AI. In coding or mathematics, answers can often be checked quickly. In biology, the real bottleneck is usually experimentation. Wet lab work is slow, expensive, noisy, and full of physical constraints. If AI can meaningfully reduce the cost and time of iteration, the impact could spill into drug discovery, diagnostics, synthetic biology, and biomanufacturing. Cell free protein synthesis may sound niche, but proteins sit at the center of modern therapeutics, diagnostics, enzymes, and research tools. Lowering the cost of making and testing them is not a side improvement. It changes how fast real science can move.

At the same time, another strand of the story is developing on the design side. Nature reported on March 4 that the Evo 2 genomic language model can generate short genome sequences, although researchers quoted in the piece stressed that there is still a major gap between writing plausible DNA strings and creating genomes that function reliably inside living cells. That distinction is important. It shows how quickly the field is moving while also reminding us that biological reality is still the final judge. AI can now propose increasingly sophisticated biological designs, but living systems remain far more complex than text, images, or code.

This is exactly why the most interesting development is not raw model capability on its own. It is the coupling of models to instruments, protocols, and validation layers. OpenAI’s writeup makes clear that the experimental loop included strict programmatic checks so the AI could not submit experiments that looked good in text but could not actually run on the automation platform. Scientific American also reported an instructive failure case, where the model tried to assign a negative amount of water when exploring a new condition space. That is not a trivial anecdote. It is a reminder that useful AI in medicine and biology will depend on constraints, guardrails, and interfaces to the physical world. Real progress is going to come from systems that are not only creative, but also grounded.
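A programmatic check of this kind does not need to be sophisticated to be useful. Here is a minimal sketch of a validator that would reject the negative-water proposal before it reached a robot. The component names, the well capacity, and the data structure are hypothetical; this is not the validation layer OpenAI or Ginkgo actually used.

```python
from dataclasses import dataclass


@dataclass
class Reaction:
    """One proposed reaction: component name -> volume in microliters."""
    volumes_ul: dict


def validate(reaction, max_total_ul=20.0):
    """Return a list of problems; an empty list means the plan is runnable."""
    problems = []
    for name, volume in reaction.volumes_ul.items():
        if volume < 0:
            problems.append(f"{name}: negative volume ({volume} uL)")
    total = sum(reaction.volumes_ul.values())
    if total > max_total_ul:
        problems.append(
            f"total {total} uL exceeds well capacity {max_total_ul} uL")
    return problems


# A proposal that looks fine as text but cannot be pipetted.
proposed = Reaction({"lysate": 5.0, "energy_mix": 4.0, "water": -2.0})
problems = validate(proposed)
if problems:
    print("rejected:", problems)
```

The design point is that the check sits between the model and the hardware, so a plausible-sounding but physically impossible plan fails fast and the model gets a concrete error to learn from.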

There is also a necessary caution here for anyone tempted to treat every impressive accuracy number as biological understanding. A University of Warwick study released on March 2 warned that some AI pathology models may rely on shortcuts and confounding signals rather than truly detecting the underlying biology they claim to measure. In other words, a model can perform well on paper while still learning the wrong lesson. That warning lands at exactly the right moment. As AI tools move deeper into medicine, the question is no longer whether they can generate plausible outputs. The real question is whether they are discovering meaningful biological structure or only exploiting correlations that break when conditions change.

That tension is what makes this moment worth writing about for a general audience. We are watching AI in biology become more physical, more operational, and more useful, but also more exposed to the discipline of reality. The next phase will not be won by the model that sounds smartest in a demo. It will be won by systems that can survive the messiness of experiments, the variability of cells and tissues, and the rigor required for medical evidence. If the last era was about AI reading biology, the next one may be about AI doing biology, one validated experiment at a time.

Sources

https://openai.com/index/gpt-5-lowers-protein-synthesis-cost/

https://www.scientificamerican.com/article/openai-and-ginkgo-bioworks-show-how-ai-can-accelerate-scientific-discovery/

https://www.nature.com/articles/d41586-026-00681-y

https://www.eurekalert.org/news-releases/1118118
A Whole Cell Ran on a Supercomputer, and It Took Six Days

March 10, 2026

This week, a research team at the University of Illinois Urbana-Champaign reported something that used to sound like science fiction: a full life cycle simulation of a living cell, from DNA replication and metabolism to growth and division. They did it for a genetically minimal bacterium, and they did it at nanoscale resolution, tracking how the cell’s molecules behave throughout the cycle.

The trick was choosing the right organism and the right computing strategy. The team used a “minimal cell” called JCVI syn3A, engineered to carry only the genes needed for basic life functions, which makes the modeling problem hard but not impossible. Even so, the simulation still had to account for every gene, protein, RNA molecule, and chemical reaction accurately enough that the timing of cellular events came out close to reality.

What makes the story feel like a real milestone is the engineering detail. One part of the biology, chromosome replication, was so computationally expensive that it almost doubled the runtime. The team ended up dedicating a separate GPU to DNA replication while another GPU handled the rest of the cell dynamics, which is the kind of pragmatic systems decision you only make after you have actually tried to run the whole thing. With that split, they simulated a 105 minute cell cycle in six days of compute time on the Delta supercomputing system at the National Center for Supercomputing Applications.
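For a sense of scale, the ratio between compute time and simulated time can be worked out directly from those two numbers. The sketch assumes "six days" means exactly 144 hours of wall-clock time, which the source does not specify.

```python
simulated_minutes = 105          # length of the simulated cell cycle
compute_minutes = 6 * 24 * 60    # six days of wall-clock time, assumed exact

slowdown = compute_minutes / simulated_minutes
print(f"{slowdown:.0f}x slower than real time")  # prints "82x slower than real time"
```

In other words, under that assumption each simulated minute of the cell's life cost a bit under an hour and a half of supercomputer time.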

This is not an atom by atom digital cell, and it is not a replacement for experiments. The point is leverage. A whole cell model that can predict many cellular properties at once is like running hundreds of coordinated experiments in silico, then using real data to keep the model honest and refine it. If this approach scales, it changes what “understanding a cell” can mean, because you can start asking systems questions that are too entangled to isolate in the lab one variable at a time.

Sources

https://news.illinois.edu/team-simulates-a-living-cell-that-grows-and-divides/

https://www.ncsa.illinois.edu/2026/03/10/simulating-the-life-cycle-of-a-cell-with-ncsas-delta/
The Next Wave of Bio AI Is About Interactions

March 3, 2026
In the last couple of weeks, the most interesting shift in biology focused AI has not been a better single structure predictor. It is the jump from predicting shapes to predicting interactions and designing the parts that create them. A Nature report described a new proprietary drug discovery model from Isomorphic Labs that impressed researchers because it appears to predict how drug sized molecules interact with protein targets at a level people compare to a hypothetical next generation AlphaFold, but now aimed at binding, selectivity, and chemistry relevant signals rather than only static structure. (Nature)

The important technical point is that biology is not only geometry. Drug action is about ensembles, pockets that breathe, water and ions, and the coupling between protein motion and ligand chemistry. If a model can learn interaction landscapes well enough to propose molecules that survive real world constraints, then the bottleneck shifts from “can we model a protein” to “can we close the loop from target to candidate with fewer wet lab cycles”. That is why pharma partnerships keep clustering around models that explicitly predict binding and other interaction level properties, not just sequence to structure. (Reuters)

In parallel, Nature also highlighted how generative biology tools are moving up the abstraction ladder toward designing biological components more directly, including higher level assemblies and genomes, with the same pattern: you get value when the model is constrained by what can actually function inside cells and what can actually be built. The takeaway is that the frontier is becoming system level. The winning models will not just output a plausible sequence. They will output a design that fits a manufacturable path, a measurable assay, and a safety envelope. (Nature)

Sources

https://www.nature.com/articles/d41586-026-00365-7

https://www.reuters.com/business/healthcare-pharmaceuticals/takeda-deepens-ai-drug-discovery-push-with-17-billion-iambic-deal-2026-02-09/

https://www.nature.com/articles/d41586-026-00566-0
Persistency and genomic data

February 19, 2026

Most data breaches fade with time. Passwords get rotated. Credit cards get replaced. Even medical facts can become stale. Genomic data is different because it is persistent, inherently identifying, and useful far beyond the context in which it was collected. Once a genome is out, it is out forever, and it can be linked back to a person in ways that keep improving as more reference data becomes public.

That persistence creates a mismatch between how teams think about privacy and how genomic privacy actually works. Many organizations treat privacy as a compliance perimeter. They focus on access controls, encryption, and policies. Those are necessary, but they are not sufficient because the risk is not only unauthorized access. The risk is also unintended inference, reidentification, and downstream use that was never anticipated when the data was shared or the consent was signed.

NIST has been pushing the conversation toward risk based practice rather than checkbox security. The NIST Privacy Framework is meant to help organizations identify and manage privacy risk as part of enterprise risk management, not as an afterthought bolted onto engineering. https://www.nist.gov/privacy-framework 

For genomics specifically, NIST has also published work that frames genomic cybersecurity and privacy as a combined problem, because in real systems the privacy failures often happen through security failures, and the security failures matter because of the privacy outcomes. A relevant example is NIST’s Genomic Data Cybersecurity and Privacy community profile work, which explicitly positions genomic data as requiring a structured approach to both privacy and cybersecurity capabilities. https://csrc.nist.rip/pubs/ir/8467/2pd 

The research ecosystem has learned this the hard way, which is why controlled access has become the norm for many human datasets. NIH’s Genomic Data Sharing policy lays out expectations for responsible sharing, and the dbGaP access process makes it clear that access is not just a technical permission, it is a governance decision with terms, renewals, and institutional accountability. https://grants.nih.gov/policy-and-compliance/policy-topics/sharing-policies/gds/overview https://grants.nih.gov/policy-and-compliance/policy-topics/sharing-policies/accessing-data/dbgap 

This governance direction is also why machine readable identity and authorization are becoming central in federated genomics. GA4GH Passports formalize the idea that a researcher presents verifiable permissions, called visas, that communicate what they are authorized to access across systems without manual reapproval at every boundary. It is not just an implementation detail. It is an architectural choice that assumes access decisions must be portable, auditable, and harder to spoof. https://www.ga4gh.org/product/ga4gh-passports/ 
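To show what "machine readable" means in practice, here is a rough sketch of an access check against an already decoded visa. The claim names (`ga4gh_visa_v1`, `type`, `value`, `source`, `exp`) follow my reading of the GA4GH Passport specification and should be checked against it before use. The URLs are placeholders, and the sketch deliberately skips the most important step, verifying the token's signature, which a real service must do first.

```python
import time


def visa_grants_access(visa_claims, dataset_id, trusted_sources, now=None):
    """Check a decoded visa payload for a controlled-access grant.

    Assumes the JWT signature was already verified elsewhere.
    """
    now = time.time() if now is None else now
    if visa_claims.get("exp", 0) <= now:
        return False  # expired visas grant nothing
    visa = visa_claims.get("ga4gh_visa_v1", {})
    return (
        visa.get("type") == "ControlledAccessGrants"
        and visa.get("value") == dataset_id
        and visa.get("source") in trusted_sources
    )


# Hypothetical decoded payload; all identifiers are placeholders.
claims = {
    "exp": 2_000,
    "ga4gh_visa_v1": {
        "type": "ControlledAccessGrants",
        "value": "https://example.org/datasets/710",
        "source": "https://example.org/dac",
        "asserted": 1_000,
        "by": "dac",
    },
}
trusted = {"https://example.org/dac"}
print(visa_grants_access(claims, "https://example.org/datasets/710",
                         trusted, now=1_500))
```

Because the decision is a pure function of the claims, the dataset, and a list of trusted issuers, it can be logged and replayed, which is what makes access auditable.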

People often assume that legal protections solve the discrimination problem, but the reality is narrower. In the United States, GINA makes it illegal for employers to discriminate based on genetic information and restricts how genetic information can be used in employment decisions. That matters, but it does not erase the risk landscape, and it does not automatically cover every scenario a person worries about. The EEOC summary captures the core employment protections under Title II. https://www.eeoc.gov/genetic-information-discrimination 

So what should a genomics team do differently, in practical terms, if they take persistence seriously?

First, design for least data, not just least privilege. The simplest way to reduce genomic privacy risk is to avoid moving raw or near raw data when you do not need it. If a workflow can be done on derived representations, summary statistics, or privacy preserving features, that is a real risk reduction because it narrows what an attacker can steal and what a partner can misuse.

Second, treat consent and data use limits as technical requirements, not just documents. NIH’s approach to controlled access is a reminder that “allowed use” is part of the system specification, and it has to be enforceable through identity, logging, and process, not simply written down. https://grants.nih.gov/policy-and-compliance/policy-topics/sharing-policies/accessing-data/using-genomic-data 

Third, assume linkage will get easier. A dataset that looks deidentified today can become linkable tomorrow because reference panels grow, genealogy databases expand, and methods improve. Your threat model should assume that your future adversary will have better tools than your present self.
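The "least data" principle can be sketched with a small example: instead of shipping per-sample genotypes to a partner, compute allele frequencies locally and suppress any variant observed in too few people. The variant IDs, the threshold of 10, and the data layout are assumptions for illustration, not a recommended policy.

```python
def allele_frequencies(genotypes, min_samples=10):
    """Reduce per-sample genotypes to per-variant alternate allele frequency.

    genotypes maps variant id -> list of alternate allele counts per sample
    (0, 1, or 2 for a diploid call, None for a missing call).
    """
    summary = {}
    for variant, calls in genotypes.items():
        observed = [c for c in calls if c is not None]
        if len(observed) < min_samples:
            continue  # too few samples; suppress rather than publish
        summary[variant] = sum(observed) / (2 * len(observed))
    return summary


# Hypothetical cohort: one well-covered variant, one seen in only two people.
cohort = {
    "var_1": [0, 1, 2, 1, None, 0, 1, 0, 2, 1, 1],
    "var_2": [1, 0],
}
print(allele_frequencies(cohort))
```

This narrows what can be stolen or misused, but it is not a guarantee. Membership inference from aggregate genomic statistics has been demonstrated, so summaries reduce risk rather than eliminate it, which is consistent with the third point above.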

Genomic data is powerful because it compresses a lifetime of biology into a format that machines can search, aggregate, and predict from. That same power is what makes it uniquely dangerous to handle casually. The organizations that earn trust in genomics will not be the ones that say they care about privacy. They will be the ones that build systems where privacy risk is engineered down as a default property of how data is collected, accessed, analyzed, and shared.
AI Is Becoming Biology’s New Lab Partner

November 2, 2025

AI is slowly becoming a real collaborator in understanding life. Over the past few months AI systems have gone from predicting structures or gene expression to actually helping design molecules, simulate cells, and guide lab experiments.

Much of this progress comes from a new generation of foundation models in biology, massive systems trained on DNA, protein, and multi-omics data. These models can learn patterns across biology, making them useful for everything from genome decoding to protein design. According to a recent review, such models are starting to connect different biological layers—genes, cells, tissues—in a unified framework.

Foundational Models for AI in Biology

Another example is a single-cell foundation model described in Experimental & Molecular Medicine, which can integrate cellular data from different species and conditions to reveal hidden regulatory links.

https://www.nature.com/articles/s12276-025-01547-5

Why does this matter? Because the way we do biology is changing. The time between hypothesis and experiment is shrinking dramatically. An idea that once took months to test can now move from model to lab in days. The space for innovation is also expanding. These systems let scientists ask questions that span molecules, cells, and tissues rather than treating them separately. And finally, the responsibility is growing. As AI starts generating biological designs, researchers must make sure results are reproducible, safe, and interpretable.

https://arxiv.org/abs/2505.23579

If you work in genomics, metabolomics, or synthetic biology, this shift affects you directly. Do you have the right datasets to fine-tune these models? Can your infrastructure support rapid cycles of prediction and validation? Do you track provenance and reproducibility for AI-generated hypotheses? The labs that can answer yes to these questions will lead the next phase of digital biology.

AI in biology is moving from being an assistant to becoming a creative partner. The next generation of discoveries will not just come from analyzing data but from collaborating with intelligent systems that can imagine new forms of life and help us test them responsibly.

References

Baek S, et al. “Single-cell foundation models: bringing artificial intelligence to biology.” Experimental & Molecular Medicine, 2025.

https://www.nature.com/articles/s12276-025-01547-5

Le Song, Eran Segal, Eric Xing. “Toward AI-Driven Digital Organism: Multiscale Foundation Models for Predicting, Simulating and Programming Biology at All Levels.” arXiv preprint, December 2024.

https://arxiv.org/abs/2412.06993

“Foundational Models for AI in Biology.” Ardigen, 2025.

“Foundation models in drug discovery: phenomenal growth in biotech.” ScienceDirect, 2025.

https://www.sciencedirect.com/science/article/pii/S1359644625002314
Embedding-Driven Protein Generation Enables Motif Diversification

October 22, 2025

A paper posted to arXiv, “Protein generation with embedding learning for motif diversification” (arXiv:2510.18790), introduces an approach to protein design that combines deep learning embeddings with generative modeling. The paper is available at https://arxiv.org/abs/2510.18790

The study addresses a long-standing challenge in computational biology: generating new protein structures that preserve key functional motifs while introducing meaningful diversity. Conventional design pipelines often fail to balance these goals. Small modifications maintain stability but limit innovation, while large ones disrupt the structural or functional integrity of the protein.

The authors propose a model that learns high-dimensional embeddings of protein motifs and structures, allowing controlled perturbations in embedding space rather than direct coordinate manipulations. This makes it possible to generate diverse but still functional variants. Using a diffusion-based architecture, the system produces proteins that preserve biochemical motifs while varying scaffold backbones in a realistic manner.

Applied to three benchmark systems, including a protein-protein interface and a transcription-factor complex, the model produced substantially more viable structures than existing baselines. The generated designs were predicted to fold stably and retain the target motifs, suggesting the embeddings capture key biophysical constraints.

This work demonstrates how generative AI can move beyond prediction and toward active biological design. By integrating structural embeddings with diffusion processes, the model opens a path to broader exploration of sequence-structure space while maintaining biological plausibility. As experimental validation follows, methods like this may accelerate the creation of new enzymes, therapeutic proteins, and synthetic scaffolds.

It is another sign that AI is beginning to influence the creative side of molecular biology, offering not just analysis but generation of functional biological matter.
