Data Packages
Curated third-party data packages
Gene and Protein Annotations
Description: These packages provide standardized identifiers, gene-to-transcript-to-protein mappings, and functional annotations like protein domains, isoforms, and sequence features. They serve as the structural backbone for linking molecular data across other packages in the knowledge graph.
Use Case: Use this category to ensure consistent representation of genes and proteins across datasets. It's particularly valuable for enabling cross-referencing between genomic, transcriptomic, and proteomic data, and for supporting downstream integration of evidence from functional studies, clinical datasets, and biological pathways.
Ensembl Genes
Description: Gene features extracted from the GENCODE release 45 comprehensive gene annotation.
Ensembl Documentation: https://www.ensembl.org/index.html
Concepts: Gene
Ensembl - Human Genome Annotation (Extended)
Description: Contains human genome transcripts and proteins from the Ensembl database.
Ensembl Documentation: https://www.ensembl.org/index.html
Concepts: Gene, Transcript, Protein
ENCODE Regulatory Regions
Description: Regulatory regions identified by SCREEN (Search Candidate cis-Regulatory Elements by ENCODE). The "predicted to regulate" relationship is determined by finding the genes closest to each regulatory element, based on the Ensembl GRCh38 GTF.
ENCODE Documentation: https://www.encodeproject.org/software/regulatory-elements-database/
Concepts: RegulatoryFeature, CisRegulatoryElement, DistalEnhancerLike, PromoterLike, ProximalEnhancerLike
Relationships:
· predicted to regulate
· region start on
· region end on
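The closest-gene assignment behind "predicted to regulate" can be sketched as follows. This is a minimal illustration under stated assumptions, not the package's actual pipeline: the `Interval` type, the `gap` and `nearest_gene` helpers, the distance definition (gap between intervals, zero on overlap), and the coordinates are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Interval:
    """A named genomic interval (hypothetical type for this sketch)."""
    name: str
    chrom: str
    start: int
    end: int


def gap(a: Interval, b: Interval) -> int:
    """Bases separating two intervals on the same chromosome; 0 if they overlap."""
    return max(0, max(a.start, b.start) - min(a.end, b.end))


def nearest_gene(element: Interval, genes: Iterable[Interval]) -> Optional[Interval]:
    """Return the gene closest to the element, or None if no gene shares its chromosome."""
    candidates = [g for g in genes if g.chrom == element.chrom]
    if not candidates:
        return None
    return min(candidates, key=lambda g: gap(element, g))


# Made-up coordinates, for illustration only (not taken from Ensembl or ENCODE).
genes = [
    Interval("GENE_A", "chr1", 1_000, 5_000),
    Interval("GENE_B", "chr1", 20_000, 30_000),
    Interval("GENE_C", "chr2", 1_000, 2_000),
]
enhancer = Interval("ELEMENT_1", "chr1", 17_000, 17_500)

closest = nearest_gene(enhancer, genes)
if closest is not None:
    # Emit the edge as a (subject, relationship, object) triple.
    print((enhancer.name, "predicted to regulate", closest.name))
```

Measuring the gap between interval boundaries is only one possible choice; a real pipeline might instead measure distance to the transcription start site or apply a maximum-distance cutoff.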
UniProt
Proteins, protein features, and protein domains taken from UniProt's database of human proteins and mapped to Ensembl genes where possible.
UniProt Documentation: https://www.uniprot.org/help/uniprot_data
Concepts: Protein, ProteinFeature, ProteinDomain, ProteinRegion
Relationships:
· contained within
· has feature
· encoded by
Genetics and Gene to Disease Associations
Description: This category encompasses data packages that link genetic variants, genes, and inherited traits to human diseases. It includes resources like ClinVar, Open Targets Genetics, and Alliance Genome, which aggregate evidence from genome-wide association studies (GWAS), model organism research, and curated variant databases. These packages capture variant pathogenicity, gene-level disease associations, and mechanisms such as susceptibility, resistance, or progression.
Use Case: Use this category to prioritize therapeutic targets based on genetic evidence, interpret variant impacts in patient cohorts, or explore the heritability and mechanistic basis of disease. It’s essential for genetic-driven drug discovery, target validation, and rare disease research.
Open Targets - Genetics
This data package contains information extracted from Open Targets Genetics (OTG) to enhance your graph with GWAS semantics and observations. The contents of this package provide an evidence-based connection between known GWAS traits and their related variants and genes. Variants are assigned to their lead and/or causal genes by the OTG Locus2Gene machine learning model.
Open Targets Genetics Documentation: https://genetics-docs.opentargets.org/
Concepts: Trait, Variant, VariantAssociation
Human Phenotype Ontology Annotations (HPOA)
The Human Phenotype Ontology Annotations (HPOA) map the associations between abnormal phenotypes and diseases. Source disease identifiers are cross-mapped and harmonized to either EFO or MONDO accessions for compatibility.
Human Phenotype Ontology Annotations Documentation: https://hpo.jax.org/
Concepts: Phenotype, Gene
Relationships:
· phenotype associated with
· associated with
· has genetic association
Alliance - Disease Gene Associations
The Alliance of Genome Resources is a consortium of 7 model organism databases (MODs) and the Gene Ontology (GO) Consortium whose goal is to provide an integrated view of their data to all biologists, clinicians, and other interested parties. This data package contains edges between Genes and Diseases. It requires the EFO and MONDO ontologies to be loaded first.
Alliance - Disease Gene Association Documentation: http://alliancegenome.org/downloads
ClinVar
Mutational information extracted from the ClinVar GRCh38 VCF and mapped to relevant diseases, after running the mutations through SnpEff to determine gene- and transcript-level effects.
ClinVar Documentation: https://www.ncbi.nlm.nih.gov/clinvar/
Concepts: Mutation
Relationships:
· associated with disease
· found on transcript
· found on
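As a rough sketch of how SnpEff output could feed the transcript- and gene-level relationships above, the snippet below parses the `ANN` entry that SnpEff adds to a VCF INFO column. The pipe-delimited layout follows SnpEff's documented `ANN` format; everything else is an assumption made for illustration: the `parse_ann` and `to_edges` helpers, the identifiers in the example record, and in particular the reading of "found on" as a mutation-to-gene edge.

```python
from typing import Dict, List, Tuple

# First seven sub-fields of a SnpEff ANN annotation, in order.
ANN_FIELDS = [
    "allele", "annotation", "impact",
    "gene_name", "gene_id", "feature_type", "feature_id",
]


def parse_ann(info: str) -> List[Dict[str, str]]:
    """Extract SnpEff annotations from a VCF INFO string; empty list if ANN is absent."""
    for entry in info.split(";"):
        if entry.startswith("ANN="):
            annotations = entry[len("ANN="):].split(",")
            # zip() stops at the shortest input, so only the first seven sub-fields are kept.
            return [dict(zip(ANN_FIELDS, ann.split("|"))) for ann in annotations]
    return []


def to_edges(variant_id: str, records: List[Dict[str, str]]) -> List[Tuple[str, str, str]]:
    """Turn annotations into (subject, relationship, object) triples."""
    edges = []
    for rec in records:
        if rec.get("feature_type") == "transcript" and rec.get("feature_id"):
            edges.append((variant_id, "found on transcript", rec["feature_id"]))
        if rec.get("gene_id"):
            # Assumption: "found on" links a mutation to its gene.
            edges.append((variant_id, "found on", rec["gene_id"]))
    return edges


# Hypothetical INFO value; identifiers are placeholders, not real ClinVar or Ensembl accessions.
info = (
    "CLNSIG=Pathogenic;"
    "ANN=A|missense_variant|MODERATE|GENE1|GENE1_ID|transcript|TX1|protein_coding|2/5|c.100G>A|p.Val34Ile|||||,"
    "A|upstream_gene_variant|MODIFIER|GENE2|GENE2_ID|transcript|TX2|protein_coding||c.-50G>A|||||50|"
)

records = parse_ann(info)
edges = to_edges("VAR_EXAMPLE", records)
for edge in edges:
    print(edge)
```

Splitting INFO by hand keeps the sketch dependency-free; for real VCF files a dedicated VCF parser is the safer choice.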
Cell Line Models and Dependency Data
Description: This category includes experimental datasets derived from cancer cell lines, such as those from DepMap and CCLE. These packages capture gene essentiality (via CRISPR or RNAi screens), expression profiles, and disease modeling relationships. They offer quantitative insights into gene function, dependencies, and transcriptional signatures across diverse cell line models.
Use Case: Use this category to identify context-specific essential genes, select relevant in vitro models for disease research, or evaluate therapeutic vulnerabilities across cancer subtypes. It’s particularly valuable for functional genomics, target prioritization, and preclinical model selection.
DepMap Dependency and CCLE Expression
Cell-line-to-gene dependency edges taken from DepMap, and RPKM/normalized expression values taken from CCLE.
DepMap / CCLE Documentation: https://depmap.org/portal/ccle/
Concepts: CellLine
Relationships:
· models disease
· has dependency
· expresses gene
Drug Targets and Safety Liabilities
Description: This category includes data packages that annotate drug targets with known therapeutic indications, mechanisms of action, and safety profiles. Sources like Open Targets Drug Annotations and Safety Liabilities provide curated information on approved and investigational drugs, target-disease relationships, and adverse effect associations, often derived from clinical, pharmacological, and genetic evidence.
Use Case: Use this category to evaluate the druggability and safety risks of potential targets, identify repurposing opportunities, and flag off-target effects early in the discovery pipeline. It is particularly useful for therapeutic hypothesis refinement, target de-risking, and translational planning.
Open Targets - Drug Annotation
Drug and drug annotation data extracted from curated data in the Open Targets platform. This data package specifically incorporates basic molecule information, clinical trial and indication details associated with diseases, and mechanism of action annotation associated with genes.
Open Targets Drug Annotation Documentation: https://platform-docs.opentargets.org/drug
Concepts: Gene, Drug, Disease
Relationships:
· acts on
· is approved for
· has clinical precedence for
Open Targets - Safety Liabilities
Contains manually curated experimental data and insights on target safety from publications and other well-known sources, extracted from the Open Targets database.
Open Targets Safety Liabilities Documentation: https://platform-docs.opentargets.org/target/safety
Concepts: Gene, SafetyLiability
Relationships:
· occurs when increasing
· occurs when decreasing
· manifests
· detected in biosystem
Clinical Trials and Intervention Studies
Description: This category compiles interventional study records from sources like ClinicalTrials.gov, focusing on drug-condition relationships, study designs, and observed outcomes. Each entry typically represents a registered human trial and includes metadata such as trial phase, status, and therapeutic intervention details.
Use Case: Use this category to link experimental or approved drugs to diseases, mine evidence for repurposing opportunities, or assess the clinical relevance of therapeutic hypotheses. It is essential for translational strategy, benchmarking therapeutic development, and integrating real-world evidence into research pipelines.
Clinical Trials - Interventional Studies
A subset of the clinical trials database that captures all intervention studies.
ClinicalTrials.gov Documentation: https://clinicaltrials.gov/
Concepts: ClinicalTrial
Relationships:
· studies condition
· using the drug
Genetic Models & Animal Data
Description: This category includes animal models of disease and genotype-phenotype relationships, providing in vivo evidence and translational insights.
Alliance Genome - Animal Models of Disease
Description: Allele-genotype-disease reports extracted from MGI.
Alliance Genome Documentation: https://www.alliancegenome.org/downloads
Concepts: Allele, Genotype
Relationships:
· has allele
· is model of
Biological Processes and Pathways
Description: These data packages define how genes and proteins participate in cellular functions, biochemical reactions, and higher-order physiological processes through structured, hierarchical ontologies and pathway maps.
Use Case: Use this category to contextualize gene or protein activity within known biological functions, interpret omics data through enrichment analysis, or trace mechanistic cascades underlying disease phenotypes. It is foundational for hypothesis generation, systems biology modeling, and mechanistic target evaluation.
Reactome Events
A data package containing a partial extract of Reactome. Reactome Events and adjacent nodes labeled 'Event', 'Pathway', 'Drug', 'GO', 'Gene', and 'Disease' were extracted. External ontologies were used for mapping: Diseases were mapped from DOID to EFO, Drugs from Reactome to ChEMBL, and Genes to Ensembl. Source: the Reactome Neo4j graph database dump (Neo4j 4.3.6), which was restored and migrated to be compatible with the newer range indexes.
Reactome Documentation: https://reactome.org/documentation
Concepts: Event, Pathway, Drug, BiologicalProcess, CellularComponent, Gene, Disease
Gene Ontology (GO)
Contains classes in the Gene Ontology transformed into graph objects, each assigned to one of the 3 top-level Gene Ontology concepts. Only direct-parent subClassOf relations are preserved, as `is a` edges.
Gene Ontology (GO) Documentation: https://geneontology.org/docs/ontology-documentation/
Concepts: BiologicalProcess, CellularComponent, MolecularFunction
Gene Ontology - Annotations
A GO annotation is a statement about the function of a particular gene. GO annotations are created by associating a gene or gene product with a GO term. Together, these statements comprise a “snapshot” of current biological knowledge. Hence, GO annotations capture statements about how a gene functions at the molecular level, where in the cell it functions, and what biological processes (pathways, programs) it helps to carry out.
Gene Ontology (GO) Documentation: https://geneontology.org/docs/ontology-documentation/
Reference Ontologies
Description: This category contains structured vocabularies that represent standardized biological concepts such as diseases, phenotypes, and cell types. They are foundational for harmonizing and annotating biomedical data.
Cell Type (Cell Ontology)
A data package that contains objects representing Cell Types extracted from classes in the Cell Ontology. Object properties between classes are also transformed into edges between Cell Type objects.
Cell Ontology (CL) Documentation: https://cell-ontology.github.io/
Concepts: CellType
Experimental Factor Ontology - Subset
This data package is a subset of classes extracted from the Experimental Factor Ontology. It is primarily used to introduce Objects into the knowledge graph that support the annotation and analysis of other data sources, such as Open Targets and other EBI databases. Generation of these Objects for use in BioBox knowledge graphs is restricted to EFO-namespaced terms.
Experimental Factor Ontology (EFO) Documentation: https://www.ebi.ac.uk/efo
Concepts: EFO
Disease (MONDO)
A data package that contains objects representing Diseases extracted from classes in MONDO, including inter-ontology axioms. Object properties between classes are also transformed into edges between Disease objects.
MONDO Disease Documentation: https://mondo.monarchinitiative.org/
Concepts: Disease
Disease
A collection of diseases and abnormal phenotypes that have known annotations and data available through Open Targets, ChEMBL, and other knowledge bases. This package uses ontology classes taken from EFO, HP, and MONDO as real-world objects in the BioBox knowledge graph. Inheritance and subclass relationships are preserved through the 'is a' relation.
Concepts: Disease
Relationships:
· is a
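Because inheritance is preserved as 'is a' edges, broader disease categories can be recovered by walking those edges upward. The sketch below shows one way to do that over a plain edge list. The `ancestors` helper and the example edges are hypothetical; the disease labels are illustrative and were not extracted from the package.

```python
from collections import deque
from typing import Dict, Iterable, Set, Tuple


def ancestors(term: str, is_a_edges: Iterable[Tuple[str, str]]) -> Set[str]:
    """Collect every term reachable from `term` by following (child, parent) 'is a' edges."""
    parents: Dict[str, Set[str]] = {}
    for child, parent in is_a_edges:
        parents.setdefault(child, set()).add(parent)

    seen: Set[str] = set()
    queue = deque([term])
    while queue:
        current = queue.popleft()
        for parent in parents.get(current, ()):
            if parent not in seen:  # guards against revisiting shared ancestors
                seen.add(parent)
                queue.append(parent)
    return seen


# Illustrative (child, parent) pairs only.
edges = [
    ("type 2 diabetes mellitus", "diabetes mellitus"),
    ("diabetes mellitus", "metabolic disease"),
    ("metabolic disease", "disease"),
    ("type 1 diabetes mellitus", "diabetes mellitus"),
]

broader_terms = ancestors("type 2 diabetes mellitus", edges)
print(sorted(broader_terms))
```

A breadth-first walk with a `seen` set handles terms that have several parents, which is common in disease ontologies.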
Tissues (UBERON)
This data package supplies your knowledge graph with a collection of nodes that represent tissues. The UBERON ontology is the data source: its ontology classes are transformed into objects inside your knowledge graph. Additionally, annotation properties that indicate inheritance and other cross-class axioms are transformed into relationships that connect tissue objects.
UBERON Documentation: https://www.ebi.ac.uk/ols4/ontologies/uberon
Concepts: Tissue