As you can see from my slides in the ‘Translational Bioinformatics’ posting, I highlighted three papers that I think create an interesting dance of ideas. The first one, from Homer et al., showed that if you have the genotype data from a single individual and from a mixture of up to 1000 others, then you can determine whether that first individual is part of the mixture.
What’s the big deal? Well, previously it was assumed that a mixture of 1000 was too complex to unravel, and so people freely released data on mixtures to the public. With this paper, it is now slightly dangerous to do that: if someone has a DNA sample from a person and is wondering whether that person is part of the mixture of 1000, they can find out. Basically, because we measure 500,000 different SNPs, we can use each SNP to slightly alter our confidence that the individual is or isn’t in the mixture. No single SNP proves it, but 500,000 together make confidence very high.
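To get a feel for how all those tiny nudges add up, here is a minimal Python sketch on simulated data. The per-SNP score (how much closer a person's genotype is to the mixture's allele frequency than to a reference population's) and the t-statistic across SNPs follow the spirit of the Homer et al. approach as I understand it. Everything else is my own assumption for illustration: the function names, the uniform allele frequencies, the 20,000 SNPs, and the mixture of 50 people.

```python
import math
import random
import statistics


def genotype(p, rng):
    """Draw one genotype as an allele fraction (0, 0.5 or 1) for allele frequency p."""
    return ((rng.random() < p) + (rng.random() < p)) / 2


def membership_t(person, mixture_freq, reference_freq):
    """t-statistic of the per-SNP scores |Y - Pop| - |Y - M| across all SNPs."""
    d = [abs(y - ref) - abs(y - mix)
         for y, mix, ref in zip(person, mixture_freq, reference_freq)]
    return statistics.mean(d) / (statistics.stdev(d) / math.sqrt(len(d)))


def simulate(n_snps=20000, n_people=50, seed=0):
    """Return (t for someone in the mixture, t for someone outside it)."""
    rng = random.Random(seed)
    member, outsider, mixture_freq, reference_freq = [], [], [], []
    for _ in range(n_snps):
        p = rng.uniform(0.1, 0.9)  # assumed allele frequency at this SNP
        mix = [genotype(p, rng) for _ in range(n_people)]  # people in the mixture
        ref = [genotype(p, rng) for _ in range(n_people)]  # separate reference sample
        member.append(mix[0])               # person 0 really is in the mixture
        outsider.append(genotype(p, rng))   # this person is not
        mixture_freq.append(sum(mix) / n_people)
        reference_freq.append(sum(ref) / n_people)
    return (membership_t(member, mixture_freq, reference_freq),
            membership_t(outsider, mixture_freq, reference_freq))


if __name__ == "__main__":
    t_in, t_out = simulate()
    print(f"t-statistic, person in the mixture:     {t_in:.1f}")
    print(f"t-statistic, person not in the mixture: {t_out:.1f}")
```

Each individual score is tiny and noisy, so any one SNP tells you almost nothing. Averaged over thousands of SNPs, though, the member's scores should lean consistently positive while the outsider's hover around zero, which is the whole trick.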
As a result, NIH’s genome-wide database dbGaP stopped publishing the aggregate data that was heretofore thought to be secure. One could argue that this is overkill because of all the other things that have to be true to allow re-identification (having a DNA sample from someone, genotyping them). The paper was written with forensic applications in mind: I have a DNA sample from a potential perp, is there any chance that this perp contributed to the DNA mixture at the scene of the crime? Also, if someone already has the DNA and genotypes, what further loss of privacy could there be? Well, in theory, knowing that the person participated in a disease-related study could reveal a disease they have and didn’t want to share. It’s all remote and not very likely, but this paper shows it’s possible, and it was a big wake-up call for many. For me, not so much: I am on the record saying that DNA anonymity is basically a fiction and that we should use social (not technical) means to stop abuse of the information.
So in the next paper, I showed an article by Nyholt et al. in which these principles are applied, kind of. Jim Watson donated his DNA to be sequenced and it was released to the public, but for some reason he didn’t want his APOE genotype to be released (some APOE alleles are associated with increased risk for Alzheimer’s disease). So they redacted that part of his sequence. In this paper, the authors showed that there is plenty of correlation between different parts of the genome (Linkage Disequilibrium = LD is the official confusing term), enough that they could infer the APOE genotypes from the surrounding sequence. As a courtesy, they informed the folks who distributed the genome and gave them a chance to remove a much larger 2 Megabase chunk of the sequence to make this kind of correlation analysis less easy. However, the point was made: be careful what you release, because you may be releasing more than you think.
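The LD idea can be shown with a toy Python sketch. To be clear, this is not the method or the data from Nyholt et al.: the reference panel, its counts, the allele labels, and the function name are all hypothetical, made up only to show how a visible neighboring SNP can give away a redacted one.

```python
from collections import Counter

# Hypothetical reference panel of 100 haplotypes. Each entry pairs the allele at
# a nearby, unredacted SNP with the allele at the hidden site. Counts are invented.
REFERENCE_HAPLOTYPES = (
    [("A", "risk")] * 18
    + [("A", "other")] * 2
    + [("G", "risk")] * 5
    + [("G", "other")] * 75
)


def hidden_allele_probabilities(observed_allele, panel):
    """Estimate P(hidden allele | observed neighboring allele) from panel counts."""
    counts = Counter(hidden for tag, hidden in panel if tag == observed_allele)
    total = sum(counts.values())
    if total == 0:
        raise ValueError(f"allele {observed_allele!r} never seen in the panel")
    return {allele: n / total for allele, n in counts.items()}


if __name__ == "__main__":
    for allele in ("A", "G"):
        print(allele, hidden_allele_probabilities(allele, REFERENCE_HAPLOTYPES))
```

In this made-up panel, seeing "A" at the neighboring SNP makes the "risk" allele at the hidden site far more likely than seeing "G" does. Deleting the site itself does not delete that information, which is why a much larger surrounding region had to come out.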
Finally, on a whimsical note, I presented a paper by Christley et al. showing that the same Jim Watson genome can be compressed with some clever data compression techniques into just 4.1 Megabytes, easily small enough to email to a friend as an attachment. So the full circle (?) is achieved: we identify genomes, we infer things about them, and then we email them to our friends. Genome security? Forget about it… Or else pass laws and stuff.