
This page provides new corpora in the systems biology modeling language BEL (Biological Expression Language) for training and testing biological relationship extraction systems. These corpora were also used in the BioCreative V BEL track.




Datasets / Corpora

No. | Corpus | Description | Selection Criterion | Content
1 | BEL_Extraction training | Training corpus provided as training data to the BioCreative V BEL track participants. This corpus can be used for Task 1 and Task 2. It is automatically restricted to the entity classes, functions, and relationships selected for the BioCreative V BEL track. | Randomly selected | positive examples
2 | BEL_Extraction sample | Sample and development corpus provided to the BioCreative V BEL track participants for proper system evaluation during development. | Randomly selected; manually re-annotated | positive examples
3 | BEL_Extraction test | This corpus was used for the Task 1 evaluation of the systems participating in the BioCreative V BEL track. | Randomly selected; manually re-annotated for Task 1 evaluation | positive examples
4 | BEL_Sentence classification | This corpus was provided to the BioCreative V BEL track Task 2 participants for the extraction of supporting text excerpts. Every sentence and its associated BEL statement are annotated with respect to two classes: fully supportive and partially supportive. | Predicted by two different systems, one of them a system participating in Task 2 of the BioCreative V BEL track | positive and negative examples

BEL_Extraction training corpus

Version | Description | #Unique Sentences | #Statements | Formats
v1 | Training for task 1 and 2 BioCreative V BEL track | 6353 | 11066 | BEL, Sentence, Tab, BioC (key file), Graphs (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11), Fragments, Entities

BEL_Extraction sample corpus

Version | Description | #Unique Sentences | #Statements | Formats
v1 | Evaluation for task 1 during training phase BioCreative V BEL track, task 1 | 183 | 354 | BEL, Sentence, Tab, BioC (key file), Graphs, Fragments


BEL_Extraction test corpus

Version | Description | #Unique Sentences | #Statements | Formats
v1 | Gold standard corpus for evaluation | 105 | 202 | BEL, Sentence, Entities
v2 | Gold standard corpus for evaluation | 105 | 202 | BEL, Sentence
v3 | Gold standard corpus for evaluation | 105 | 207 | BEL, Sentence, Tab

BEL_Sentence_Classification corpus

Version | Description | #Unique Sentences | #Statements | Formats
v1 | This corpus was used in BioCreative V BEL track task 2 and provided to the participants. | 806 | 100 | Tab, BEL, BioC, Graphs (PNG, SVG), Fragments, Entities
v2 | This corpus contains the manually annotated tri-occurrences that were extracted from Medline (Method 1) and the manually annotated predictions from the BioCreative V BEL track task 2 (Method 2). The Tab file also contains the classification labels (fully supportive and partly supportive) as well as the method with which the sentence was found. | 1554 | 99 | Tab (with class annotations and method)

Format Description


BEL

  • Contains the BEL statements
  • Fields: Sentence-ID, BEL original, BEL-ID
  • Description: These are the original BEL statements, which were manually generated from evidence sentences in the literature. Each statement refers to its evidence sentence through the Sentence-ID. The BEL-ID is a unique identifier of the specific BEL statement.
  • A visualization of the BEL statements in the context of the PubMed abstract from which they are derived is available through the TextAE editor of PubAnnotation (select "show" for the PubMed abstract to be visualized, then pick the "TextAE" option).


Sentence

  • Contains the evidence sentences
  • Fields: Sentence-ID, PMID, Sentence
  • Description: These are the evidence sentences from which the BEL statements in the BEL file were manually generated. Each sentence refers to its BEL statements through the Sentence-ID. In addition, each sentence carries a reference (PMID) to the PubMed article it originates from.


Tab

  • Combination of the BEL and Sentence files.
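
The link between the BEL and Sentence files through the Sentence-ID, which is what the Tab file combines, can be sketched as follows. This is only an illustration: it assumes both files are tab-separated, have no header row, and list their fields in the order given above. The delimiter, the sample rows, the ID formats, and the helper names `read_rows` and `join_on_sentence_id` are assumptions and not part of the corpus documentation.

```python
import csv
import io
from collections import defaultdict


def read_rows(handle, field_names):
    """Read a headerless tab-separated file into a list of dicts (assumed layout)."""
    # QUOTE_NONE keeps any double quotes inside BEL statements untouched.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [dict(zip(field_names, row)) for row in reader if row]


def join_on_sentence_id(bel_rows, sentence_rows):
    """Attach every BEL statement to its evidence sentence via the Sentence-ID."""
    statements = defaultdict(list)
    for row in bel_rows:
        statements[row["sentence_id"]].append((row["bel_id"], row["bel"]))
    return [
        {**sentence, "statements": statements.get(sentence["sentence_id"], [])}
        for sentence in sentence_rows
    ]


# Invented example rows; real files would be opened with open(path, newline="").
bel_file = io.StringIO(
    "S1\tp(HGNC:AKT1) increases p(HGNC:MTOR)\tB1\n"
    "S1\tp(HGNC:AKT1) decreases p(HGNC:FOXO1)\tB2\n"
)
sentence_file = io.StringIO("S1\t12345678\tAn invented evidence sentence.\n")

bel_rows = read_rows(bel_file, ["sentence_id", "bel", "bel_id"])
sentence_rows = read_rows(sentence_file, ["sentence_id", "pmid", "sentence"])
combined = join_on_sentence_id(bel_rows, sentence_rows)
```

Each entry of `combined` holds one evidence sentence with its PMID and the list of (BEL-ID, BEL statement) pairs that refer to it.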


BioC

BioC XML representation of the sample BEL set, including evidence sentences, BEL-ID, Sentence-ID and PMID.

Several text mining tools and components have been customized to handle BioC, which is meant to facilitate data interchange for biomedical text mining systems.

Reference: Official BioC Webpage
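
As an illustration of reading a BioC file with the Python standard library, the sketch below walks the generic BioC structure (collection, document, passage, infon, text). How the BEL-ID, Sentence-ID and PMID are stored in the corpus files is not described here, so the inline sample, its infon key name, and the use of the document id as a PMID are assumptions; the code simply collects whatever infons each passage carries.

```python
import xml.etree.ElementTree as ET

# Invented sample; the infon key "sentence_id" is hypothetical.
SAMPLE = """<collection>
  <source>invented example</source>
  <document>
    <id>12345678</id>
    <passage>
      <infon key="sentence_id">S1</infon>
      <offset>0</offset>
      <text>An invented evidence sentence.</text>
    </passage>
  </document>
</collection>"""


def read_passages(root):
    """Yield one dict per passage: document id, passage text, and all infons."""
    for document in root.iter("document"):
        doc_id = document.findtext("id")
        for passage in document.iter("passage"):
            infons = {i.get("key"): i.text for i in passage.findall("infon")}
            yield {
                "document_id": doc_id,
                "text": passage.findtext("text"),
                "infons": infons,
            }


# For a file on disk, ET.parse(path).getroot() would replace ET.fromstring.
passages = list(read_passages(ET.fromstring(SAMPLE)))
```

Printing `passages` for a real corpus file is a quick way to discover which infon keys it actually uses before writing anything more specific.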


Graphs

Sample Set Graph Visualizations

(Generated from the BioC version of the BEL statements)


Fragments

CSV representation of the sample BEL set. Each row corresponds to a fragment (entity, function argument, function, relationship or statement) and indicates the IDs of its parent and child fragments. For entities, there is an additional field with the internal BEL UUID as specified by the equivalence file(s) of the corresponding namespace (refer to the BEL documentation for details).
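
To illustrate how the parent references in such a file let a statement be rebuilt as a tree of fragments, here is a minimal sketch. The column names (`fragment_id`, `type`, `text`, `parent_id`), the presence of a header row, and the sample rows are all hypothetical; the actual column layout of the Fragments file should be checked against the file itself.

```python
import csv
import io
from collections import defaultdict

# Invented rows; the header names are hypothetical, not the corpus's actual columns.
SAMPLE = """fragment_id,type,text,parent_id
f1,statement,p(HGNC:AKT1) increases p(HGNC:MTOR),
f2,function,p(HGNC:AKT1),f1
f3,relationship,increases,f1
f4,function,p(HGNC:MTOR),f1
f5,entity,HGNC:AKT1,f2
"""


def children_by_parent(rows):
    """Map each parent fragment ID to the IDs of its child fragments."""
    children = defaultdict(list)
    for row in rows:
        if row["parent_id"]:
            children[row["parent_id"]].append(row["fragment_id"])
    return dict(children)


def depth_first(fragment_id, children):
    """Return fragment IDs in depth-first order starting at fragment_id."""
    order = [fragment_id]
    for child in children.get(fragment_id, []):
        order.extend(depth_first(child, children))
    return order


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
tree = children_by_parent(rows)
```

Calling `depth_first("f1", tree)` visits the statement first, then each of its functions and relationships together with their nested entities.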


Entities

This file contains the list of entities present in the training data. It is a tab-separated file with two columns: BEL-ID, Entity-ID.
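
A minimal sketch of loading such a two-column file and grouping the entities by statement. It assumes there is no header row; the sample IDs and the function name `entities_by_statement` are invented for illustration.

```python
import csv
import io
from collections import defaultdict


def entities_by_statement(handle):
    """Group Entity-IDs by BEL-ID from a two-column tab-separated file."""
    grouped = defaultdict(set)
    for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) != 2:
            continue  # skip blank or malformed lines
        bel_id, entity_id = row
        grouped[bel_id].add(entity_id)
    return dict(grouped)


# Invented example rows; a real file would be opened with open(path, newline="").
sample = io.StringIO("B1\tHGNC:AKT1\nB1\tHGNC:MTOR\nB2\tHGNC:AKT1\n")
entities = entities_by_statement(sample)
```

The resulting dictionary maps each BEL-ID to the set of Entity-IDs listed for it.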
