
Goals

  • Represent variants at DNA, RNA, and protein level
  • Types of variants – substitutions, insertions, deletions, fusions, unspecified, etc.
  • Non-coding DNA variants
  • Equivalencing of representations
  • Representation should support mapping of measurement data to network
  • Compatible with related representation needs, e.g., representation of DNA modifications

Proposal

  • Represent most variants with a single BEL function var(), to replace sub() and trunc()
    • g(ns:v, var(expression))
    • var() applicable to Gene, RNA, microRNA, and Protein abundances
    • Adopt HGVS mutation nomenclature for most BEL variation expressions
  • Represent fusions with a different function than other variants, fus()
    • fus(ns:value, range, ns:value, range)
    • Link to 5' and 3' fusion partners, but treat as "new" sequence rather than as variant of fusion partners
    • Use the fus() construct in place of the namespace value in the g(), r(), and p() BEL functions
  • Will need to enable use of namespaces referencing specific sequence versions, e.g., REFSEQ
    • add to OpenBEL-provided namespaces
    • relationship connecting isoforms to parent/root gene
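
The term shapes proposed above can be sketched as plain string composition. This is only an illustrative sketch, not part of any OpenBEL tooling: the helper names `var`, `fus`, and `abundance` are hypothetical, and the short form `m` for microRNA abundance is assumed from BEL's existing short names rather than stated in this proposal.

```python
# Hypothetical helpers that assemble BEL terms in the proposed syntax.
# Nothing here validates HGVS expressions or namespace values.

ABUNDANCE_FUNCTIONS = {"g", "r", "m", "p"}  # "m" (microRNA) is an assumption


def var(expression: str) -> str:
    """Wrap a variation expression in the proposed var() function."""
    return f"var({expression})"


def fus(five_prime: str, five_range: str, three_prime: str, three_range: str) -> str:
    """Build the proposed fus() construct from the 5' and 3' partners."""
    return f"fus({five_prime}, {five_range}, {three_prime}, {three_range})"


def abundance(function: str, subject: str, *modifiers: str) -> str:
    """Compose an abundance term: subject first, then any var() modifiers."""
    if function not in ABUNDANCE_FUNCTIONS:
        raise ValueError(f"unsupported abundance function: {function!r}")
    return f"{function}({', '.join((subject, *modifiers))})"


if __name__ == "__main__":
    print(abundance("p", "REF:NP_000483.3", var("p.Phe508del")))
    # A fusion takes the place of the namespace value (placeholders only).
    print(abundance("p", fus("ns:value", "range", "ns:value", "range")))
```

Passing the result of `fus()` as the subject mirrors the proposal that a fusion replaces the namespace value rather than acting as a variant modifier.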

Examples

Proteins

Reference allele

p(HGNC:CFTR, var(=))

This is different from p(HGNC:CFTR), the root protein abundance, which includes all variants and modifications.

Unspecified variant

p(HGNC:CFTR, var(?))

Substitution variant

p(REF:NP_000483.3, var(p.Gly576Ala))

CFTR substitution variant NP_000483.3:p.Gly576Ala

NOTE – because a specific position is referenced, an ID for a non-ambiguous sequence is preferred

Deletion variant

p(REF:NP_000483.3, var(p.Phe508del))

CFTR deletion variant NP_000483.3:p.Phe508del (ΔF508)

Frameshift variant

p(REF:NP_000483.3, var(p.Thr1220Lysfs))

CFTR frameshift variant NP_000483.3:p.Thr1220Lysfs (HGVS short description)


p(REF:NP_000483.3, var(p.Thr1220Lysfs*7))

CFTR frameshift variant NP_000483.3:p.Thr1220Lysfs*7 (HGVS long description)
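
As a rough illustration of how the protein expressions above could be told apart by a consumer, the sketch below classifies the contents of var() with regular expressions. It covers only the forms shown in these examples (reference, unspecified, single-residue substitution, single-residue deletion, and short/long frameshift); the function name `classify_protein_variant` is hypothetical, and the patterns are a deliberate simplification of HGVS, not a full grammar.

```python
import re

_AA = r"[A-Z][a-z]{2}"  # three-letter amino acid code, e.g. Gly

# Order matters: frameshift is tried before substitution because both
# start with <ref><position><alt>.
_PATTERNS = [
    ("reference", re.compile(r"^=$")),
    ("unspecified", re.compile(r"^\?$")),
    ("frameshift", re.compile(
        rf"^p\.(?P<ref>{_AA})(?P<pos>\d+)(?P<alt>{_AA})fs(?:\*(?P<stop>\d+))?$")),
    ("deletion", re.compile(rf"^p\.(?P<ref>{_AA})(?P<pos>\d+)del$")),
    ("substitution", re.compile(
        rf"^p\.(?P<ref>{_AA})(?P<pos>\d+)(?P<alt>{_AA})$")),
]


def classify_protein_variant(expression: str):
    """Return (kind, fields) for the expression inside var()."""
    for kind, pattern in _PATTERNS:
        match = pattern.match(expression)
        if match:
            # Drop optional groups that did not participate in the match.
            fields = {k: v for k, v in match.groupdict().items() if v is not None}
            return kind, fields
    raise ValueError(f"unrecognized variant expression: {expression!r}")


if __name__ == "__main__":
    for expr in ("=", "?", "p.Gly576Ala", "p.Phe508del",
                 "p.Thr1220Lysfs", "p.Thr1220Lysfs*7"):
        print(expr, classify_protein_variant(expr))
```

The short and long frameshift forms differ only in the optional `stop` field, which is one way to make the ambiguity of the short description explicit in data.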

Representation of variant across DNA/RNA

These are all representations of CFTR ΔF508 (the deletion variant example above).

DNA - SNP

g(SNP:rs113993960, var(delCTT))

DNA - chromosome

g(REF:NC_000007.13, var(g.117199646_117199648delCTT))

DNA - coding sequence

g(REF:NM_000492.3, var(c.1521_1523delCTT))

RNA - coding sequence

r(REF:NM_000492.3, var(c.1521_1523delCTT))

RNA - RNA sequence

r(REF:NM_000492.3, var(r.1653_1655delcuu))
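
The ranged expressions above share one internal consistency property that is easy to check: the coordinate range and the deleted sequence must have the same length, and the RNA form is the lower-case transcription of the DNA form. The sketch below illustrates that check only. It does not establish that the chromosome, coding-sequence, and RNA terms are equivalent, since that requires mapping between reference sequences, which BEL does not yet define. The function names are hypothetical, and the SNP form `var(delCTT)` is out of scope because it carries no range.

```python
import re

# <prefix>.<start>_<end>del<sequence>, e.g. c.1521_1523delCTT
_RANGE_DELETION = re.compile(
    r"^(?P<prefix>[gcr])\.(?P<start>\d+)_(?P<end>\d+)del(?P<seq>[A-Za-z]+)$"
)


def parse_range_deletion(expression: str):
    """Parse a ranged deletion and verify range length equals sequence length."""
    match = _RANGE_DELETION.match(expression)
    if match is None:
        raise ValueError(f"not a ranged deletion: {expression!r}")
    start, end = int(match["start"]), int(match["end"])
    seq = match["seq"]
    if end - start + 1 != len(seq):
        raise ValueError(f"range and sequence length disagree in {expression!r}")
    return match["prefix"], start, end, seq


def dna_to_rna(seq: str) -> str:
    """Lower-case a DNA sequence and replace t with u, as in the r. form above."""
    return seq.lower().replace("t", "u")


if __name__ == "__main__":
    for expr in ("g.117199646_117199648delCTT",
                 "c.1521_1523delCTT",
                 "r.1653_1655delcuu"):
        print(parse_range_deletion(expr))
    print(dna_to_rna("CTT"))
```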


5 Comments

    1. If I am understanding your proposal correctly, you are using the REFSEQ record to establish a coordinate system, rather than stating that the variant actually occurs in that specific sequence. But this is not clear from the proposed representation. Rather than using REF:, perhaps you should use something like COORD:.
    2. The frameshift variant example for proteins would appear to be ambiguous. It is unclear how many nucleotides were deleted or inserted that result in the Glu to Arg change, and therefore it is unclear what amino acid changes would come after the Arg.
  1. Richard,

    1. The REFSEQ record is intended to state that the variant occurs in that sequence. The gene symbol or ID could be used as an alternative, but it would be more ambiguous than a reference to a specific sequence. Please let me know what might help clarify this. One issue I see is that when referring to the DNA (geneAbundance), it is not clear if the BEL terms using chromosome, coding sequence, or SNP reference sequences are equivalent - they refer to the same variant, but the reference IDs point to sequences of different lengths, and the relationships between these are not yet defined in BEL.
    2. The frameshift example is consistent with the "short description" form in the HGVS recommendations, which is indeed somewhat ambiguous - http://www.hgvs.org/mutnomen/recs-prot.html#ins. I'll add a second example using the long description.
    1. For #2, the protein frameshift, if it is ambiguous, I don't think you should do it.  Just because someone else does a crappy job at knowledge representation doesn't mean that you need to.  (wink)

      For #1, if I understand you correctly this would mean that the rest of the sequence would have to be identical to the REFSEQ record.  The odds of a specific allele being identical to a REFSEQ record while containing the single substitution in question would be very low.  I think you are really just trying to map to a coordinate position using a REFSEQ record, which is why I recommended using some kind of tag to indicate it as such.

  2. For #1, the rest of the sequence would not necessarily be identical to the REFSEQ record. It should really be interpreted as "this variation is present at this sequence position"; we may need to think about how to represent the cases where the rest of the sequence is known and "wild-type". This is similar to the post-translational modifications: if a single modification (or none) is specified, it should not be interpreted as unmodified at all other positions. BEL was designed for recording observations and enabling mapping of measurement data, which is not always the complete picture.

    For #2, it is useful to be able to represent some level of ambiguity. If the specific nucleotide change is reported/known, it would be preferable to report that insertion/deletion at the DNA level. The examples are simply meant to demonstrate representations of different types of variants at the DNA, RNA, and protein levels.

    1. For #1, that's exactly my point. Using something like r(COORD:NM_000492.3, var(c.1521_1523delCTT)) instead of r(REF:NM_000492.3, var(c.1521_1523delCTT)) might make that clearer.