Put SNP identifiers in scholarly abstracts!

It drives me crazy (and costs me lots of grant money, see below) that people publish reports of associations between SNPs (single nucleotide polymorphisms) and phenotypes, but use non-standard ways to refer to the SNPs. They can appear as nucleotide changes (e.g. C 3435 T) where the position is on some mRNA sequence in GenBank. They can appear as amino acid changes (e.g. GLY 49 ARG) where the position is on some protein sequence from Swiss-Prot, GenBank, or who knows where? They can have names that are incredibly general (e.g. I-D alleles for insertion/deletion; how many zillion of these must there be?). Some fields have a very formal-looking but irritating numbering system (e.g. CYP2D6*1, CYP2D6*2, etc.), which conveys very little information except to those who know the literature so well that the *35 has meaning.
I think the best strategy would be to ask that authors use the identifiers from dbSNP, also called “RS numbers.” These are defined by providing flanking DNA sequence around a location, and can usually be mapped to a fairly unambiguous location in the genome. Yes, there are potential problems (copy number variations, individual differences in detailed sequence, dbSNP not covering a particular SNP [submit it then!], etc.), but for most of us the RS# is a pretty darn good standard identifier. It would make life much better for those of us creating databases to have some assistance (in the abstract of articles!) in figuring out what area of the genome is being mentioned.

Right now, I have a staff of very devoted and trained curators who do the “mapping” from the informal identifiers to dbSNP RS#s. But they are expensive, and there are a lot of other meritorious things I would love for them to do. I would be really excited if someone clever could come up with an automated or semi-automated way to map “beta-adrenergic receptor GLY->Ser 49” to “HGNC = ADRB1, position = chr10:115,794,026 (hg17).” This would involve (1) identifying the gene being mentioned, (2) identifying the location in the protein or DNA sequence that is mutated, (3) finding the GenBank entry that has the numbering system used (i.e. 49 is defined relative to what?), and (4) translating it to the human genome browser address. It is fraught with problems, but I bet someone could do a reasonable job. Maybe I should post this on one of those “solve this problem” websites I have heard about.
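
To make step (2) concrete, here is a minimal Python sketch of pulling the residue change and position out of free text like the examples above. It handles only the two three-letter amino acid styles quoted in this post ("GLY 49 ARG" and "GLY->Ser 49"). The names `ProteinChange` and `parse_protein_change` are made up for illustration, and the patterns are an assumption about how such mentions are usually written, not a survey of the literature. Steps (1), (3), and (4) need gene-name, sequence, and genome-mapping resources and are not attempted here.

```python
import re
from typing import NamedTuple, Optional

# Standard three-letter to one-letter amino acid codes.
AA3_TO_1 = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


class ProteinChange(NamedTuple):
    """Hypothetical container for a parsed amino acid substitution."""
    ref: str        # one-letter code of the original residue
    position: int   # residue number as written (reference sequence unknown)
    alt: str        # one-letter code of the variant residue


_AA = "|".join(AA3_TO_1)
# Style 1: "GLY 49 ARG" (position between the two residues).
_POSITION_MIDDLE = re.compile(
    rf"\b({_AA})\s*(\d+)\s*({_AA})\b", re.IGNORECASE)
# Style 2: "GLY->Ser 49" (position after the two residues).
_POSITION_LAST = re.compile(
    rf"\b({_AA})\s*(?:->|>|/)\s*({_AA})\s*(\d+)\b", re.IGNORECASE)


def parse_protein_change(text: str) -> Optional[ProteinChange]:
    """Return the first substitution found in text, or None."""
    match = _POSITION_MIDDLE.search(text)
    if match:
        ref, pos, alt = match.group(1), match.group(2), match.group(3)
    else:
        match = _POSITION_LAST.search(text)
        if match is None:
            return None
        ref, alt, pos = match.group(1), match.group(2), match.group(3)
    return ProteinChange(AA3_TO_1[ref.upper()], int(pos), AA3_TO_1[alt.upper()])


print(parse_protein_change("beta-adrenergic receptor GLY->Ser 49"))
print(parse_protein_change("GLY 49 ARG"))
```

Note that the parsed position is still just "49 relative to something." Case-insensitive matching will also trip over ordinary words such as "his" or "met," so anything like this would need a curator to check the output.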



  1. Amen! But even rs#s have problems, so be wary of the homozygous forms of (A;T) or (C;G) SNPs: because each allele is the complement of the other, the reported genotype depends on which strand was read.

    CNVs don’t yet have any similar identifier, and as a result I’ve yet to read two papers that were clearly talking about the same CNV.

  2. SNPs are just one kind of the many different variations possible. The community is fixated on SNPs because it is easy to measure them. CNVs are the next thing to be hyped, but isn’t it time to have a consistent way of referring to a “variation”?

  3. When you have a full genome sequence, 99% of your information is known. So take a reference genome, and state your genome as a series of ‘differences’ from this reference. SNPs are the smallest and most common difference, but CNVs, inversions, methylations, and many other features can all be recorded as ‘differences’ in the same way that rs#s are. rs#s are already used for small indels, and there is no reason the same system couldn’t be expanded to cover these other variations as well.
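
The "differences from a reference" idea in the last comment can be sketched in a few lines of Python. This is a toy illustration on a made-up eight-base reference, not a real variant format: the `Difference` class and `apply_differences` function are hypothetical names, positions are assumed to be 1-based, and the differences are assumed not to overlap. A SNP and a small deletion are both stored the same way, as a position plus the reference and alternate bases.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Difference:
    """Hypothetical record of one difference from the reference."""
    position: int  # 1-based start position on the reference
    ref: str       # bases present in the reference at that position
    alt: str       # bases present in the individual's genome instead


def apply_differences(reference: str, differences: Iterable[Difference]) -> str:
    """Rebuild an individual's sequence from a reference plus differences.

    Differences are applied from right to left so that an insertion or
    deletion never shifts the coordinates of a difference still to be applied.
    Assumes the differences do not overlap.
    """
    seq = reference
    for diff in sorted(differences, key=lambda d: d.position, reverse=True):
        start = diff.position - 1
        end = start + len(diff.ref)
        if seq[start:end] != diff.ref:
            raise ValueError(f"reference mismatch at position {diff.position}")
        seq = seq[:start] + diff.alt + seq[end:]
    return seq


toy_reference = "ACGTACGT"  # made-up sequence for illustration
my_differences = [
    Difference(position=2, ref="C", alt="T"),   # a SNP
    Difference(position=5, ref="AC", alt="A"),  # a one-base deletion
]
print(apply_differences(toy_reference, my_differences))
```

Larger features such as inversions or CNVs would need richer records than a ref/alt pair, which is the extension the comment is asking for.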
