Most data breaches fade with time. Passwords get rotated. Credit cards get replaced. Even medical facts can become stale. Genomic data is different because it is persistent, inherently identifying, and useful far beyond the context in which it was collected. Once a genome is out, it is out forever, and it can be linked back to a person in ways that keep improving as more reference data becomes public.
That persistence creates a mismatch between how teams think about privacy and how genomic privacy actually works. Many organizations treat privacy as a compliance perimeter. They focus on access controls, encryption, and policies. Those are necessary, but they are not sufficient because the risk is not only unauthorized access. The risk is also unintended inference, reidentification, and downstream use that was never anticipated when the data was shared or the consent was signed.
NIST has been pushing the conversation toward risk-based practice rather than checkbox security. The NIST Privacy Framework is meant to help organizations identify and manage privacy risk as part of enterprise risk management, not as an afterthought bolted onto engineering. https://www.nist.gov/privacy-framework
For genomics specifically, NIST has also published work that frames genomic cybersecurity and privacy as a combined problem, because in real systems the privacy failures often happen through security failures, and the security failures matter because of the privacy outcomes. A relevant example is NIST's Genomic Data Cybersecurity and Privacy community profile work (NIST IR 8467), which explicitly positions genomic data as requiring a structured approach to both privacy and cybersecurity capabilities. https://csrc.nist.gov/pubs/ir/8467/2pd
The research ecosystem has learned this the hard way, which is why controlled access has become the norm for many human datasets. NIH's Genomic Data Sharing policy lays out expectations for responsible sharing, and the dbGaP access process makes it clear that access is not just a technical permission: it is a governance decision with terms, renewals, and institutional accountability. https://grants.nih.gov/policy-and-compliance/policy-topics/sharing-policies/gds/overview https://grants.nih.gov/policy-and-compliance/policy-topics/sharing-policies/accessing-data/dbgap
This governance direction is also why machine-readable identity and authorization are becoming central in federated genomics. GA4GH Passports formalize the idea that a researcher presents verifiable permissions, called visas, that communicate what they are authorized to access across systems without manual reapproval at every boundary. It is not just an implementation detail. It is an architectural choice that assumes access decisions must be portable, auditable, and harder to spoof. https://www.ga4gh.org/product/ga4gh-passports/
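To make the idea concrete, here is a minimal sketch of the authorization logic a relying system might apply to an already-decoded visa. The field names follow the general shape of the GA4GH Passport v1 claim, but the payload, dataset URI, and helper function are illustrative assumptions; a real deployment would first verify the JWT signature and issuer with a proper JOSE library before trusting any of these claims.

```python
import time

def visa_grants_access(visa, dataset_id, now=None):
    """Check a decoded (and already signature-verified) visa against a dataset.

    This is a sketch of the authorization logic only; it assumes the
    cryptographic validation happened upstream.
    """
    now = now if now is not None else time.time()
    claim = visa.get("ga4gh_visa_v1", {})
    return (
        claim.get("type") == "ControlledAccessGrants"   # right kind of visa
        and claim.get("value") == dataset_id            # grant names this dataset
        and claim.get("asserted", float("inf")) <= now  # assertion is not from the future
        and visa.get("exp", 0) > now                    # visa itself has not expired
    )

# Hypothetical visa payload for a hypothetical dataset URI.
visa = {
    "exp": time.time() + 3600,
    "ga4gh_visa_v1": {
        "type": "ControlledAccessGrants",
        "value": "https://example.org/datasets/phsEXAMPLE",
        "source": "https://example.org/dac",
        "asserted": time.time() - 86400,
    },
}
print(visa_grants_access(visa, "https://example.org/datasets/phsEXAMPLE"))  # True
print(visa_grants_access(visa, "https://example.org/datasets/other"))       # False
```

The point of the structure is that every condition is checkable by a machine at every boundary, which is what makes the access decision portable and auditable rather than a one-time manual approval.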
People often assume that legal protections solve the discrimination problem, but the reality is narrower. In the United States, GINA makes it illegal for employers to discriminate based on genetic information and restricts how genetic information can be used in employment decisions. That matters, but it does not erase the risk landscape, and it does not automatically cover every scenario a person worries about. The EEOC summary captures the core employment protections under Title II. https://www.eeoc.gov/genetic-information-discrimination
So what should a genomics team do differently, in practical terms, if they take persistence seriously?
First, design for least data, not just least privilege. The simplest way to reduce genomic privacy risk is to avoid moving raw or near-raw data when you do not need it. If a workflow can run on derived representations, summary statistics, or privacy-preserving features, that is a real risk reduction because it narrows what an attacker can steal and what a partner can misuse.
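As a toy illustration of the least-data idea: release per-variant allele frequencies instead of individual-level genotypes, and refuse to summarize cohorts too small to hide anyone. The cohort data and the minimum-size threshold here are made-up illustrations, not a recommended policy value.

```python
MIN_COHORT = 50  # assumed policy knob: smallest cohort we will summarize

def allele_frequencies(genotypes):
    """Compute per-variant alternate-allele frequencies.

    genotypes: list of per-individual lists of 0/1/2 alt-allele counts
    (diploid, so each person contributes 2 alleles per variant).
    """
    n = len(genotypes)
    if n < MIN_COHORT:
        raise ValueError(f"cohort of {n} is too small to summarize safely")
    n_variants = len(genotypes[0])
    return [
        sum(person[v] for person in genotypes) / (2 * n)
        for v in range(n_variants)
    ]

# Fabricated cohort of 60 individuals, each with three variants.
cohort = [[0, 1, 2] for _ in range(60)]
print(allele_frequencies(cohort))  # [0.0, 0.5, 1.0]
```

The derived output is still useful for many analyses, but an attacker who steals it learns far less than one who steals the genotype matrix. (Summary statistics carry their own inference risks at scale, which is one reason even aggregate genomic data is often kept under controlled access.)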
Second, treat consent and data use limits as technical requirements, not just documents. NIH’s approach to controlled access is a reminder that “allowed use” is part of the system specification, and it has to be enforceable through identity, logging, and process, not simply written down. https://grants.nih.gov/policy-and-compliance/policy-topics/sharing-policies/accessing-data/using-genomic-data
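One way to make "allowed use" enforceable is to encode consent-derived limits as machine-checkable terms and log every decision. The sketch below uses data-use codes in the style of the GA4GH Data Use Ontology (DUO); the dataset identifier, the specific code assignments, and the in-memory audit log are illustrative assumptions, not a complete DUO implementation.

```python
# Hypothetical mapping from dataset to its consent-derived use terms.
# DUO:0000007 is DUO's "disease specific research" code; treat the
# assignment here as an example, not a statement about any real dataset.
DATASET_USE_TERMS = {
    "phsEXAMPLE": {"DUO:0000007"},
}

def check_use(dataset_id, requested_uses, audit_log):
    """Permit a request only if every requested use is in the dataset's terms."""
    allowed = DATASET_USE_TERMS.get(dataset_id, set())
    decision = requested_uses <= allowed  # subset check: all uses must be permitted
    audit_log.append((dataset_id, sorted(requested_uses), decision))  # log every decision
    return decision

log = []
print(check_use("phsEXAMPLE", {"DUO:0000007"}, log))  # True: matches the terms
print(check_use("phsEXAMPLE", {"DUO:0000042"}, log))  # False: broader "general research use"
```

The design point is that the consent document becomes a data structure the system consults on every request, and the audit log makes enforcement reviewable after the fact.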
Third, assume linkage will get easier. A dataset that looks deidentified today can become linkable tomorrow because reference panels grow, genealogy databases expand, and methods improve. Your threat model should assume that your future adversary will have better tools than your present self.
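A back-of-the-envelope calculation shows why "a few markers" is not a safety margin. Under Hardy-Weinberg equilibrium with allele frequency p, the chance that two random people share a genotype at one biallelic SNP is (p²)² + (2pq)² + (q²)², and independent SNPs multiply. The population size and the uniform p = 0.5 assumption below are deliberate simplifications for illustration.

```python
def match_probability(p):
    """Chance two random individuals share a genotype at one biallelic SNP (HWE)."""
    q = 1 - p
    return (p * p) ** 2 + (2 * p * q) ** 2 + (q * q) ** 2

def expected_random_matches(n_snps, population=8_000_000_000, p=0.5):
    """Expected number of people in the population matching a given profile by chance."""
    return population * match_probability(p) ** n_snps

for k in (10, 20, 30, 40):
    print(k, expected_random_matches(k))
```

At p = 0.5 the per-SNP match probability is 0.375, and by around 30 independent common SNPs the expected number of chance matches in a world-scale population falls below one: the profile is effectively unique. Any release that exposes even a modest number of variants should be threat-modeled as identifying, because the reference data needed to exploit it only grows.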
Genomic data is powerful because it compresses a lifetime of biology into a format that machines can search, aggregate, and predict from. That same power is what makes it uniquely dangerous to handle casually. The organizations that earn trust in genomics will not be the ones that say they care about privacy. They will be the ones that build systems where privacy risk is engineered down as a default property of how data is collected, accessed, analyzed, and shared.