The BEL resource data pipeline (resource-generator repository) assembles biological identifiers (genes, proteins, disease concepts, etc.) into an RDF model. The model is cohesive: each identifier is described and, where possible, linked to its equivalent identifiers.
The BEL data pipeline is a Python project that performs the following tasks:
- Download dataset files.
- Parse each dataset for descriptive, equivalent, and historical information.
- Relate parsed datasets based on cross references.
- Export the result as an RDF file using the OpenBEL / SKOS vocabularies.
The project would benefit from a streamlined design, so it suits a knowledgeable Python programmer. One approach to explore is map-reduce: process each dataset separately, then assemble the RDF model in the final stage.
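A minimal sketch of the map-reduce idea, assuming each dataset can be parsed independently into a set of (subject, predicate, object) triples. The `DATASETS` records, `parse_dataset`, `merge`, and `build_model` are hypothetical names made up for illustration and are not taken from the resource-generator code; identifiers are kept as short strings rather than full URIs to keep the example small.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

SKOS = "http://www.w3.org/2004/02/skos/core#"

# Hypothetical in-memory stand-ins for downloaded dataset files.
DATASETS = {
    "hgnc": [{"id": "HGNC:391", "label": "AKT1", "xrefs": ["EGID:207"]}],
    "entrez": [{"id": "EGID:207", "label": "AKT1", "xrefs": []}],
}


def parse_dataset(name):
    """Map step: turn one dataset into triples, independently of the others."""
    triples = set()
    for record in DATASETS[name]:
        triples.add((record["id"], SKOS + "prefLabel", record["label"]))
        for xref in record["xrefs"]:
            # Cross references become equivalence links between identifiers.
            triples.add((record["id"], SKOS + "exactMatch", xref))
    return triples


def merge(model, triples):
    """Reduce step: fold one dataset's triples into the combined model."""
    return model | triples


def build_model(names):
    # Datasets are parsed concurrently; the model is assembled only at the end.
    with ThreadPoolExecutor() as pool:
        per_dataset = list(pool.map(parse_dataset, names))
    return reduce(merge, per_dataset, set())


model = build_model(sorted(DATASETS))
```

Because each map step touches only its own dataset, a new dataset could be added by writing one more parser without changing the assembly stage.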
Additionally, the following enhancements would be helpful:
- New namespaces to support BEL version 2.0.
- Incorporate Parent-Child identifiers from datasets.
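One possible way to represent parent-child identifiers is sketched below, assuming the SKOS `broader` property is chosen to model the hierarchy. The project description does not say which property the pipeline would use, and the `hierarchy_triples` function and the `EX:` identifiers are hypothetical.

```python
SKOS = "http://www.w3.org/2004/02/skos/core#"


def hierarchy_triples(child_to_parents):
    """Yield (child, skos:broader, parent) triples from a child -> parents mapping."""
    for child, parents in sorted(child_to_parents.items()):
        for parent in sorted(parents):
            yield (child, SKOS + "broader", parent)


# Hypothetical parent-child relations parsed from a dataset.
relations = {
    "EX:child-a": ["EX:parent"],
    "EX:child-b": ["EX:parent", "EX:other-parent"],
}

triples = list(hierarchy_triples(relations))
```

Sorting the keys and parents keeps the output order stable between runs, which makes exported files easier to diff.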
The overall goal of these enhancements is to streamline the BEL data pipeline, making it easier to expand and maintain.