New corpora in the systems biology modeling language BEL are available for training and testing biological relationship extraction systems. These corpora were also used in the BioCreative V BEL track.
Datasets / Corpora
|1||BEL_Extraction training||Training corpus provided to the BioCreative V BEL track participants. This corpus can be used for Tasks 1 and 2. It is automatically restricted to the entity classes, functions, and relationships selected for the BioCreative V BEL track.||Randomly selected||positive examples|
|2||BEL_Extraction sample||Sample and development corpus provided to the BioCreative V BEL track participants for proper system evaluation during development.||Randomly selected; manually re-annotated||positive examples|
|3||BEL_Extraction test||This corpus was used for the Task 1 evaluation of the systems participating in the BioCreative V BEL track.||Randomly selected; manually re-annotated for Task 1 evaluation||positive examples|
This corpus was provided to the BioCreative V BEL track Task 1 participants for the extraction of supporting text excerpts. Each sentence and its associated BEL statement are annotated with two classes: fully supportive and partially supportive.
|Predicted by two different systems, one prediction done by a system participating in the BioCreative BEL track task 2||positive and negative examples|
BEL_Extraction training corpus
|v1||Training for Tasks 1 and 2 of the BioCreative V BEL track||6353||11066|
BEL_Extraction sample corpus
|v1||Evaluation for Task 1 during the training phase of the BioCreative V BEL track||183||354||BEL, Sentence, Tab, BioC (key file), Graphs, Fragments|
BEL_Extraction test corpus
|v1||Gold standard corpus for evaluation||105||202||BEL, Sentence, Entities|
|v2||Gold standard corpus for evaluation||105||202||BEL, Sentence|
|v3||Gold standard corpus for evaluation||105||207||BEL, Sentence, Tab|
|v1||This corpus was used in BioCreative V BEL track task 2 and provided to the participants||806||100||Tab, BEL, BioC, Graphs( , SVG), Fragments, Entities|
|v2||This corpus contains the manually annotated tri-occurrences extracted from Medline (Method 1) and the manually annotated predictions from the BioCreative V BEL track Task 2 (Method 2). The Tab file also contains the classification labels (fully supportive and partially supportive) as well as the method by which each sentence was found.||1554||99||Tab (with class annotations and method)|
- Contains the BEL statements
- Fields: Sentence-ID, BEL original, BEL-ID
- Description: These are the original BEL statements which have been manually generated from evidence sentences from the literature. A reference to the respective evidence sentence is made by the Sentence-ID. The BEL-IDs are unique identifiers of the specific BEL statement.
- The BEL statements can be visualized in the context of the PubMed abstract from which they are derived using the TextAE editor of PubAnnotation (select "show" for the PubMed abstract to be visualized, then pick the "TextAE" option).
- Contains the evidence sentences
- Fields: Sentence-ID, PMID, Sentence
- Description: These are the evidence sentences from which the BEL statements in File 2 were manually generated. A reference to the respective BEL statements is made through the Sentence-ID. In addition, each sentence carries a reference (PMID) to the PubMed article it originates from.
Combination of BEL and Sentence files.
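Because the BEL and Sentence files share the Sentence-ID field, such a combination can be reproduced by a simple join. The following is a minimal sketch, assuming both files are tab-separated, have no header row, and use the field order listed above. The sample rows (IDs, PMIDs, statements, sentences) and the function names are hypothetical and only serve as illustration.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample rows; IDs, PMIDs and statements are made up.
# Assumed layout: tab-separated, no header row, field order as listed above.
BEL_FILE = (
    "SEN:1\tp(HGNC:AKT1) increases p(HGNC:MDM2)\tBEL:1\n"
    "SEN:1\tp(HGNC:MDM2) decreases p(HGNC:TP53)\tBEL:2\n"
    "SEN:2\tp(HGNC:TNF) increases p(HGNC:IL6)\tBEL:3\n"
)
SENTENCE_FILE = (
    "SEN:1\t11111111\tAKT1 activates MDM2, which in turn lowers TP53 levels.\n"
    "SEN:2\t22222222\tTNF induces IL6.\n"
)


def read_rows(handle):
    # QUOTE_NONE: BEL statements may contain double quotes that are not CSV quoting.
    return csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)


def load_sentences(handle):
    """Map Sentence-ID -> (PMID, sentence text)."""
    return {row[0]: (row[1], row[2]) for row in read_rows(handle) if len(row) >= 3}


def load_statements(handle):
    """Map Sentence-ID -> list of (BEL-ID, BEL statement)."""
    by_sentence = defaultdict(list)
    for row in read_rows(handle):
        if len(row) >= 3:
            sentence_id, statement, bel_id = row[0], row[1], row[2]
            by_sentence[sentence_id].append((bel_id, statement))
    return by_sentence


def join_corpus(sentences, statements):
    """Produce one record per BEL statement, enriched with its evidence sentence."""
    records = []
    for sentence_id, (pmid, text) in sentences.items():
        for bel_id, statement in statements.get(sentence_id, []):
            records.append(
                {
                    "sentence_id": sentence_id,
                    "pmid": pmid,
                    "sentence": text,
                    "bel_id": bel_id,
                    "bel": statement,
                }
            )
    return records


records = join_corpus(
    load_sentences(io.StringIO(SENTENCE_FILE)),
    load_statements(io.StringIO(BEL_FILE)),
)
for record in records:
    print(record["bel_id"], record["pmid"], record["bel"], sep="\t")
```

One sentence can support several BEL statements, so the join is one-to-many from sentences to statements. To read real files, replace the `io.StringIO` objects with opened file handles.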
BioC XML representation of the sample BEL set, including evidence sentences, BEL-ID, Sentence-ID and PMID.
Several text mining tools and components have been customized to handle BioC, which is meant to facilitate data interchange for biomedical text mining systems.
Reference: Official BioC Webpage
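A BioC file can be read with a standard XML parser. The sketch below uses Python's `xml.etree.ElementTree` on a small inline sample that follows the general BioC layout (collection, document, passage, infon, text). The infon keys `Sentence-ID` and `BEL-ID`, the sample values, and the function name `read_passages` are assumptions for illustration; the keys used in the actual corpus file should be checked against its key file.

```python
import xml.etree.ElementTree as ET

# Hypothetical BioC sample; infon keys and all values are assumptions.
SAMPLE_BIOC = """<collection>
  <source>hypothetical sample</source>
  <date>20150101</date>
  <key>sample.key</key>
  <document>
    <id>11111111</id>
    <passage>
      <infon key="Sentence-ID">SEN:1</infon>
      <infon key="BEL-ID">BEL:1</infon>
      <offset>0</offset>
      <text>AKT1 activates MDM2.</text>
    </passage>
  </document>
</collection>"""


def read_passages(xml_text):
    """Return one dict per passage with its document ID, infons and text."""
    root = ET.fromstring(xml_text)
    passages = []
    for document in root.iter("document"):
        doc_id = document.findtext("id")
        for passage in document.iter("passage"):
            # Collect all key/value infons attached directly to the passage.
            infons = {i.get("key"): i.text for i in passage.findall("infon")}
            passages.append(
                {
                    "document_id": doc_id,
                    "infons": infons,
                    "text": passage.findtext("text"),
                }
            )
    return passages


passages = read_passages(SAMPLE_BIOC)
for passage in passages:
    print(passage["document_id"], passage["infons"], passage["text"])
```

For a file on disk, `ET.parse(path).getroot()` can replace `ET.fromstring`.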
Sample Set Graph Visualizations
(Generated from the BioC version of the BEL statements)
CSV representation of the sample BEL set. Each row corresponds to a fragment (entity, function argument, function, relationship or statement) and indicates the IDs of parent and children fragments. For entities, there's an additional field with the internal BEL UUID as specified by the equivalence file(s) of the corresponding namespace (refer to the BEL documentation for details).
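Since every row points to its parent and children, the fragment table can be turned back into a tree per statement. The sketch below shows one way to do this. The column names (`id`, `type`, `text`, `parent`, `children`), the `|` separator for children IDs, and the sample hierarchy are all assumptions, not the layout of the actual file, so they need to be adapted to the real header.

```python
import csv
import io

# Hypothetical fragment table; column names and the "|" separator for
# children IDs are assumptions, not the layout of the actual file.
SAMPLE_FRAGMENTS = """id,type,text,parent,children
f1,statement,p(HGNC:AKT1) increases p(HGNC:MDM2),,f2
f2,relationship,increases,f1,f3|f5
f3,function,p,f2,f4
f4,entity,HGNC:AKT1,f3,
f5,function,p,f2,f6
f6,entity,HGNC:MDM2,f5,
"""


def load_fragments(handle):
    """Index fragments by ID and split the children field into a list."""
    fragments = {}
    for row in csv.DictReader(handle):
        row["children"] = row["children"].split("|") if row["children"] else []
        fragments[row["id"]] = row
    return fragments


def outline(fragments, fragment_id, depth=0):
    """Render a fragment and its descendants as indented lines (depth-first)."""
    node = fragments[fragment_id]
    lines = ["  " * depth + f'{node["type"]}: {node["text"]}']
    for child_id in node["children"]:
        lines.extend(outline(fragments, child_id, depth + 1))
    return lines


fragments = load_fragments(io.StringIO(SAMPLE_FRAGMENTS))
# Fragments without a parent are treated as statement roots.
roots = [fid for fid, frag in fragments.items() if not frag["parent"]]
tree_lines = outline(fragments, roots[0])
print("\n".join(tree_lines))
```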
This file contains the list of entities present in the training data. It's a tab-separated file with two columns: BEL-ID, Entity-ID.
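As a small illustration, the two columns can be grouped by BEL-ID and the entities counted across statements. This is a minimal sketch, assuming the file has no header row; the sample rows and the function name `load_entities` are hypothetical.

```python
import csv
import io
from collections import Counter, defaultdict

# Hypothetical sample rows (BEL-ID, Entity-ID); assumed to have no header row.
ENTITY_FILE = (
    "BEL:1\tHGNC:AKT1\n"
    "BEL:1\tHGNC:MDM2\n"
    "BEL:2\tHGNC:MDM2\n"
    "BEL:2\tHGNC:TP53\n"
)


def load_entities(handle):
    """Map BEL-ID -> set of Entity-IDs occurring in that statement."""
    entities_by_statement = defaultdict(set)
    for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) == 2:
            bel_id, entity_id = row
            entities_by_statement[bel_id].add(entity_id)
    return entities_by_statement


entities_by_statement = load_entities(io.StringIO(ENTITY_FILE))

# Number of distinct statements in which each entity appears.
entity_counts = Counter(
    entity for entities in entities_by_statement.values() for entity in entities
)
print(entity_counts.most_common())
```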