Data Packages

Curated third-party data packages

Gene and Protein Annotations

Description: These packages provide standardized identifiers, gene-to-transcript-to-protein mappings, and functional annotations like protein domains, isoforms, and sequence features. They serve as the structural backbone for linking molecular data across other packages in the knowledge graph.

Use Case: Use this category to ensure consistent representation of genes and proteins across datasets. It's particularly valuable for enabling cross-referencing between genomic, transcriptomic, and proteomic data, and for supporting downstream integration of evidence from functional studies, clinical datasets, and biological pathways.

Ensembl Genes

Description: Gene features extracted from the GENCODE release 45 comprehensive gene annotation.

Ensembl Documentation: https://www.ensembl.org/index.html

Concepts: Gene

Ensembl - Human Genome Annotation (Extended)

Description: Contains human genome transcripts and proteins from the Ensembl database.

Ensembl Documentation: https://www.ensembl.org/index.html

Concepts: Gene, Transcript, Protein

ENCODE Regulatory Regions

Description: Regulatory regions identified by SCREEN (Search Candidate cis-Regulatory Elements by ENCODE). The "predicted to regulate" relationship is determined by finding the genes closest to each regulatory element, based on the Ensembl GRCh38 GTF.

ENCODE Documentation: https://www.encodeproject.org/software/regulatory-elements-database/

Concepts: RegulatoryFeature, CisRegulatoryElement, DistalEnhancerLike, PromoterLike, ProximalEnhancerLike

Relationships:

· predicted to regulate

· region start on

· region end on
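
The "predicted to regulate" assignment can be pictured as a nearest-gene lookup. The sketch below is a minimal illustration under stated assumptions, not the package's actual pipeline: the `Interval` type, the `gap` and `closest_gene` helpers, the distance definition (gap between intervals, zero on overlap), and all coordinates are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Interval:
    """A named genomic interval on one chromosome (hypothetical helper type)."""
    name: str
    chrom: str
    start: int
    end: int


def gap(a: Interval, b: Interval) -> int:
    """Distance in base pairs between two intervals; 0 if they overlap."""
    return max(0, max(a.start, b.start) - min(a.end, b.end))


def closest_gene(region: Interval, genes: Iterable[Interval]) -> Optional[Interval]:
    """Return the gene on the same chromosome with the smallest gap to the region."""
    same_chrom = [g for g in genes if g.chrom == region.chrom]
    if not same_chrom:
        return None
    return min(same_chrom, key=lambda g: gap(region, g))


# Made-up coordinates, for illustration only.
genes = [
    Interval("GENE_A", "chr1", 1_000, 5_000),
    Interval("GENE_B", "chr1", 20_000, 30_000),
    Interval("GENE_C", "chr2", 100, 900),
]
enhancer = Interval("cCRE_1", "chr1", 17_500, 18_000)
nearest = closest_gene(enhancer, genes)
```

Here `nearest` is the interval that would receive a "predicted to regulate" edge from `cCRE_1`. A production version would likely use an interval index rather than a linear scan.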

UniProt

Proteins, protein features, and protein domains taken from UniProt's database of human proteins and mapped to Ensembl genes where possible.

UniProt Documentation: https://www.uniprot.org/help/uniprot_data

Concepts: Protein, ProteinFeature, ProteinDomain, ProteinRegion

Relationships:

· contained within

· has feature

· encoded by


Genetics and Gene to Disease Associations

Description: This category encompasses data packages that link genetic variants, genes, and inherited traits to human diseases. It includes resources like ClinVar, Open Targets Genetics, and Alliance Genome, which aggregate evidence from genome-wide association studies (GWAS), model organism research, and curated variant databases. These packages capture variant pathogenicity, gene-level disease associations, and mechanisms such as susceptibility, resistance, or progression.

Use Case: Use this category to prioritize therapeutic targets based on genetic evidence, interpret variant impacts in patient cohorts, or explore the heritability and mechanistic basis of disease. It’s essential for genetic-driven drug discovery, target validation, and rare disease research.

Open Targets - Genetics

This data package contains information extracted from Open Targets Genetics (OTG) to enhance your graph with GWAS semantics and observations. Its contents provide an evidence-based connection between known GWAS traits and the variants and genes related to them. Variants are assigned to their lead and/or causal genes by the OTG Locus2Gene machine learning model.

Open Targets Genetics Documentation: https://genetics-docs.opentargets.org/

Concepts: Trait, Variant, VariantAssociation

Open Targets Genetics Relationships

· protective against

· risk of trait

· of trait

· when variant is

· increased gene product level of

· decreased gene product level of

· altered gene product level of

· has association

Human Phenotype Ontology Annotations (HPOA)

The Human Phenotype Ontology Annotations (HPOA) map the associations between abnormal phenotypes and diseases. Source disease identifiers are cross-mapped and harmonized to either EFO or MONDO accessions for compatibility.

Human Phenotype Ontology Annotations Documentation: https://hpo.jax.org/

Concepts: Phenotype, Gene

Relationships:

· phenotype associated with

· associated with

· has genetic association

Alliance - Disease Gene Associations

The Alliance of Genome Resources is a consortium of 7 model organism databases (MODs) and the Gene Ontology (GO) Consortium whose goal is to provide an integrated view of their data to all biologists, clinicians, and other interested parties. This data package contains edges between Genes and Diseases. It requires the EFO and MONDO ontologies to be loaded first.

Alliance - Disease Gene Association Documentation: http://alliancegenome.org/downloads

Alliance Disease Gene Associations Relationships

· is marker via orthology disease progression of

· is marker via orthology susceptibility to

· is marker for disease progression of

· is implicated via orthology sexual dimorphism in

· is implicated via orthology penetrance of

· is implicated in sexual dimorphism in

· is implicated in

· is implicated in susceptibility to

· is implicated in disease progression of

· is implicated in resistance to

· is implicated via orthology disease progression of

· is implicated via orthology onset of

· is implicated in severity of

· is marker for susceptibility to

· is implicated via orthology resistance to

· is marker for onset of

· is implicated via orthology severity of

· is marker via orthology resistance to

· is implicated in onset of

· is marker via orthology sexual dimorphism in

· is not marker for

· is marker for resistance to

· is marker for severity of

· is implicated via orthology

· is marker for

· is marker for sexual dimorphism in

· is implicated via orthology susceptibility to

· is marker via orthology

· is marker via orthology severity of

· is marker via orthology onset of

· is not implicated in
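
The relationship names above appear to combine an association type (marker or implicated), an optional negation, an optional "via orthology" flag, and an optional qualifier such as "susceptibility to". The sketch below splits a name into those parts. The naming pattern is inferred from the list itself rather than from a documented schema, and `AllianceRelation` and `parse_relation` are hypothetical names.

```python
import re
from typing import NamedTuple, Optional


class AllianceRelation(NamedTuple):
    association: str          # "marker" or "implicated"
    negated: bool             # True for "is not ..."
    via_orthology: bool       # True when inferred through an ortholog
    qualifier: Optional[str]  # e.g. "susceptibility to", or None


# Pattern inferred from the relationship names listed above (an assumption).
_PATTERN = re.compile(
    r"^is (?P<neg>not )?(?P<assoc>marker|implicated) "
    r"(?P<link>via orthology|for|in)"
    r"(?: (?P<qualifier>.+))?$"
)


def parse_relation(name: str) -> AllianceRelation:
    """Split a relationship name into its components."""
    match = _PATTERN.match(name)
    if match is None:
        raise ValueError(f"unrecognized relationship name: {name!r}")
    return AllianceRelation(
        association=match["assoc"],
        negated=match["neg"] is not None,
        via_orthology=match["link"] == "via orthology",
        qualifier=match["qualifier"],
    )


direct = parse_relation("is implicated in susceptibility to")
orthology = parse_relation("is marker via orthology disease progression of")
negative = parse_relation("is not marker for")
```

Splitting names this way makes it easy to, for example, filter out negated edges or separate direct evidence from orthology-based evidence before querying.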

ClinVar

Mutation information from the ClinVar GRCh38 VCF, extracted and mapped to relevant diseases after running the mutations through SnpEff to determine gene- and transcript-level effects.

ClinVar Documentation: https://www.ncbi.nlm.nih.gov/clinvar/

Concepts: Mutation

Relationships:

· associated with disease

· found on transcript

· found on
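
SnpEff writes its predictions into the `ANN` key of the VCF INFO column, with one pipe-delimited entry per predicted effect. The sketch below pulls the gene and transcript identifiers out of such an entry. It is a minimal illustration, assuming the standard `ANN` field order; the `parse_info` and `parse_ann` helpers and the example record are made up, and escaped characters inside values are not handled. Whether the package's own pipeline reads the field this way is not documented here.

```python
from typing import Dict, List

# Leading sub-fields of a SnpEff ANN entry, in their standard order.
ANN_FIELDS = [
    "allele",
    "annotation",
    "impact",
    "gene_name",
    "gene_id",
    "feature_type",
    "feature_id",
]


def parse_info(info: str) -> Dict[str, str]:
    """Split a VCF INFO string into key/value pairs (flags get an empty value)."""
    pairs = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        pairs[key] = value
    return pairs


def parse_ann(info: str) -> List[Dict[str, str]]:
    """Return one dict per SnpEff effect found in the INFO string."""
    ann = parse_info(info).get("ANN", "")
    if not ann:
        return []
    # Multiple effects are comma-separated; sub-fields are pipe-separated.
    return [dict(zip(ANN_FIELDS, entry.split("|"))) for entry in ann.split(",")]


# Made-up INFO value, for illustration only.
info = (
    "CLNSIG=Pathogenic;"
    "ANN=T|missense_variant|MODERATE|GENE1|ENSG00000000001|transcript|"
    "ENST00000000001|protein_coding|2/5|c.10C>T|p.Arg4Cys"
)
effects = parse_ann(info)
gene_id = effects[0]["gene_id"]
transcript_id = effects[0]["feature_id"]
```

In this reading, `gene_id` would back a "found on" edge and `transcript_id` a "found on transcript" edge.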


Cell Line Models and Dependency Data

Description: This category includes experimental datasets derived from cancer cell lines, such as those from DepMap and CCLE. These packages capture gene essentiality (via CRISPR or RNAi screens), expression profiles, and disease modeling relationships. They offer quantitative insights into gene function, dependencies, and transcriptional signatures across diverse cell line models.

Use Case: Use this category to identify context-specific essential genes, select relevant in vitro models for disease research, or evaluate therapeutic vulnerabilities across cancer subtypes. It’s particularly valuable for functional genomics, target prioritization, and preclinical model selection.

DepMap Dependency and CCLE Expression

Cell-line-to-gene dependency edges taken from DepMap, together with RPKM/normalized expression values taken from CCLE.

DepMap / CCLE Documentation: https://depmap.org/portal/ccle/

Concepts: CellLine

Relationships:

· models disease

· has dependency

· expresses gene


Drug Targets and Safety Liabilities

Description: This category includes data packages that annotate drug targets with known therapeutic indications, mechanisms of action, and safety profiles. Sources like Open Targets Drug Annotations and Safety Liabilities provide curated information on approved and investigational drugs, target-disease relationships, and adverse effect associations, often derived from clinical, pharmacological, and genetic evidence.

Use Case: Use this category to evaluate the druggability and safety risks of potential targets, identify repurposing opportunities, and flag off-target effects early in the discovery pipeline. It is particularly useful for therapeutic hypothesis refinement, target de-risking, and translational planning.

Open Targets - Drug Annotation

Drug and drug annotation data extracted from curated data in the Open Targets platform. This data package specifically incorporates basic molecule information, clinical trial and indication details associated with diseases, and mechanism of action annotation associated with genes.

Open Targets Drug Annotation Documentation: https://platform-docs.opentargets.org/drug

Concepts: Gene, Drug, Disease

Relationships:

· acts on

· is approved for

· has clinical precedence for

Open Targets - Safety Liabilities

Contains manually curated experimental data and insights on target safety from publications and other well-known sources, extracted from the Open Targets database.

Open Targets Safety Liabilities Documentation: https://platform-docs.opentargets.org/target/safety

Concepts: Gene, SafetyLiability

Relationships:

· occurs when increasing

· occurs when decreasing

· manifests

· detected in biosystem


Clinical Trials and Intervention Studies

Description: This category compiles interventional study records from sources like ClinicalTrials.gov, focusing on drug-condition relationships, study designs, and observed outcomes. Each entry typically represents a registered human trial and includes metadata such as trial phase, status, and therapeutic intervention details.

Use Case: Use this category to link experimental or approved drugs to diseases, mine evidence for repurposing opportunities, or assess the clinical relevance of therapeutic hypotheses. It is essential for translational strategy, benchmarking therapeutic development, and integrating real-world evidence into research pipelines.

Clinical Trials - Interventional Studies

A subset of the clinical trials database that captures all intervention studies.

ClinicalTrials.gov Documentation: https://clinicaltrials.gov/

Concepts: ClinicalTrial

Relationships:

· studies condition

· using the drug


Genetic Models & Animal Data

Description: Includes animal models of disease and genotype-phenotype relationships.

Use Case: Use this category to obtain in vivo evidence and translational insights.

Alliance Genome - Animal Models of Disease

Description: Allele-genotype-disease reports extracted from MGI.

Alliance Genome Documentation: https://www.alliancegenome.org/downloads

Concepts: Allele, Genotype

Relationships:

· has allele

· is model of


Biological Processes and Pathways

Description: These data packages define how genes and proteins participate in cellular functions, biochemical reactions, and higher-order physiological processes through structured, hierarchical ontologies and pathway maps.

Use Case: Use this category to contextualize gene or protein activity within known biological functions, interpret omics data through enrichment analysis, or trace mechanistic cascades underlying disease phenotypes. It is foundational for hypothesis generation, systems biology modeling, and mechanistic target evaluation.

Reactome Events

A data package containing partial information extracted from Reactome. Reactome Events and adjacent nodes labeled 'Event', 'Pathway', 'Drug', 'GO', 'Gene', or 'Disease' were extracted. External ontologies were used for cross-mapping: diseases were mapped from DOID to EFO, drugs from Reactome to ChEMBL, and genes to Ensembl. Source: Reactome graph database dump (Neo4j 4.3.6). The database was restored from this dump and migrated for compatibility with the newer range indexes.

Reactome Documentation: https://reactome.org/documentation

Concepts: Event, Pathway, Drug, BiologicalProcess, CellularComponent, Gene, Disease

Reactome Event Relationships

· in compartment

· relates to GO process

· has encapsulated event

· has event

· uses drug as input

· related to normal pathway

· produces drug

· preceding event

· has reversable reaction

· found in disease

· involved in

Gene Ontology (GO)

Contains classes in the Gene Ontology transformed into graph objects, each assigned to one of the three Gene Ontology top-level concepts. Only direct-parent subClassOf relations are preserved as `is a` edges.

Gene Ontology (GO) Documentation: https://geneontology.org/docs/ontology-documentation/

Concepts: BiologicalProcess, CellularComponent, MolecularFunction
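
The two rules described for this package (direct parents become `is a` edges, and every class belongs to one of three top-level concepts) can be sketched as follows. The three root accessions are the real GO roots; the `EX:` terms, the `direct_parents` table, and both helper functions are hypothetical and only illustrate the idea.

```python
from typing import Dict, List, Tuple

# The three GO root terms and the concept each one corresponds to.
ROOTS = {
    "GO:0008150": "BiologicalProcess",
    "GO:0005575": "CellularComponent",
    "GO:0003674": "MolecularFunction",
}

# Toy subClassOf table: class -> direct parents (placeholder identifiers).
direct_parents: Dict[str, List[str]] = {
    "EX:child": ["EX:parent"],
    "EX:parent": ["GO:0008150"],
    "EX:organelle": ["GO:0005575"],
}


def is_a_edges(parents: Dict[str, List[str]]) -> List[Tuple[str, str, str]]:
    """Emit one `is a` edge per direct parent; no transitive edges are added."""
    return [(child, "is a", parent) for child, ps in parents.items() for parent in ps]


def top_level_concept(term: str, parents: Dict[str, List[str]]) -> str:
    """Walk up the hierarchy until one of the three GO roots is reached."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current in ROOTS:
            return ROOTS[current]
        if current in seen:
            continue
        seen.add(current)
        stack.extend(parents.get(current, []))
    raise ValueError(f"{term} does not reach a GO root")


edges = is_a_edges(direct_parents)
concept = top_level_concept("EX:child", direct_parents)
```

Note that `EX:child` gets no direct edge to the root: reaching ancestors beyond the direct parent requires traversing `is a` edges in the graph.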

Gene Ontology - Annotations

A GO annotation is a statement about the function of a particular gene. GO annotations are created by associating a gene or gene product with a GO term. Together, these statements comprise a “snapshot” of current biological knowledge. Hence, GO annotations capture statements about how a gene functions at the molecular level, where in the cell it functions, and what biological processes (pathways, programs) it helps to carry out.

Gene Ontology (GO) Documentation: https://geneontology.org/docs/ontology-documentation/

Gene Ontology Relationships

· enables

· located in

· involved in

· part of

· not enables

· not involved in

· is active in

· not colocalizes with

· colocalizes with

· acts upstream of or within

· contributes to

· not located in

· not part of

· acts upstream of positive effect

· not acts upstream of or within

· acts upstream of

· acts upstream of negative effect

· acts upstream of or within positive effect

· acts upstream of or within negative effect

· not contributes to

· not acts upstream of or within negative effect

· not is active in
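
The relationship names above line up with the qualifier values used in GO annotation files (GAF 2.2), where a qualifier such as `involved_in` may be preceded by `NOT|`. The sketch below converts a qualifier into a relationship name of the form listed here. The `qualifier_to_relationship` helper is hypothetical, and whether the package derives its names exactly this way is an assumption.

```python
def qualifier_to_relationship(qualifier: str) -> str:
    """Turn a GAF-style qualifier into a relationship name.

    Example inputs: "enables", "NOT|involved_in".
    """
    parts = qualifier.split("|")
    negated = "NOT" in parts
    # The relation is whichever part is not the NOT marker.
    relation = next(part for part in parts if part != "NOT")
    name = relation.replace("_", " ")
    return f"not {name}" if negated else name


positive = qualifier_to_relationship("acts_upstream_of_or_within_negative_effect")
negative = qualifier_to_relationship("NOT|is_active_in")
```

Keeping negated statements as separate relationship types means a query for `enables` never returns a `not enables` assertion by accident.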

Reference Ontologies

Description: Structured vocabularies used to represent standardized biological concepts like diseases, phenotypes, and cell types.

Use Case: Use this category as the foundation for harmonizing and annotating biomedical data.

Cell Type (Cell Ontology)

A data package that contains objects representing Cell Types extracted from classes in the Cell Ontology. Object properties between classes are also transformed into edges between Cell Type objects.

Cell Ontology (CL) Documentation: https://cell-ontology.github.io/

Concepts: CellType

Cell Type Relationships

· is a

· develops from

· has_part

· develops into

· synapsed to

· synapsed by

· directly develops from

· has potential to directly develop into

· derives from

· has synaptic terminal in

· innervates

Experimental Factor Ontology - Subset

This data package is a subset of classes extracted from the Experimental Factor Ontology. It is primarily used to introduce Objects into the knowledge graph that support the annotation and analysis of other data sources such as Open Targets and other EBI databases. Generation of these Objects for use in the BioBox knowledge graphs is restricted to EFO-namespaced terms.

Experimental Factor Ontology (EFO) Documentation: https://www.ebi.ac.uk/efo

Concepts: EFO

Disease (MONDO)

A data package that contains objects representing Diseases extracted from classes in MONDO with inter-ontology axioms. Object properties between classes are also transformed into edges between Disease objects.

MONDO Disease Documentation: https://mondo.monarchinitiative.org/

Concepts: Disease

MONDO Disease Relationships

· is a

· has characteristic

· predisposes towards

· disease has feature

· disease arises from feature

· disease shares features of

· disease has major feature

· part of progression of disease

Disease

A collection of diseases and abnormal phenotypes that have known annotation and data available through Open Targets, ChEMBL, and other knowledge bases. This package uses ontology classes taken from EFO, HP, and MONDO as real-world objects in the BioBox knowledge graph. Inheritance and subclass relationships are preserved through the 'is a' relation.

Concepts: Disease

Relationships:

· is a

Tissues (UBERON)

This data package supplies your knowledge graph with a collection of nodes that represent tissues. The UBERON ontology is the data source; its ontology classes are transformed into objects inside your knowledge graph. Additionally, annotation properties that indicate inheritance and other cross-class axioms are transformed into relationships that connect tissue objects.

UBERON Documentation: https://www.ebi.ac.uk/ols4/ontologies/uberon

Concepts: Tissue

UBERON Relationships

· is a

· part_of

· continuous with

· contributes to morphology of

· connects

· develops from

· has skeleton

· has_part

· immediate transformation of

· immediately deep to

· composed primarily of

· in lateral side of

· existence ends during

· attached to

· has component

· has potential to develop into

· located in

· location of

· extends_fibers_into

· channel for

· channels_from

· channels_into

· conduit for

· adjacent to

· existence ends during or before

· has 2D boundary

· preceded by

· precedes

· immediately preceded by

· ends

· surrounded by

· developmentally replaces

· developmentally induced by

· existence starts during

· overlaps

· ends with

· starts

· transformation of

· bounding layer of

· surrounds

· anterior to

· has developmental contribution from

· proximalmost part of

· luminal space of

· subdivision of

· has member

· produced by

· dorsal to

· has muscle origin

· has muscle insertion

· existence starts and ends during

· innervates

· branching part of

· innervated_by

· attached to part of

· produces

· in left side of

· connected to

· drains

· supplies

· connecting branch of

· sexually_homologous_to

· existence starts with

· existence ends with

· intersects midsagittal plane of

· in anterior side of

· anteriorly connected to

· posteriorly connected to

· proximally connected to

· distally connected to

· preaxialmost part of

· distalmost part of

· posterior to

· tributary of

· has muscle antagonist

· in right side of

· protects

· lumen of

· develops in

· derived from ancestral fusion of

· anastomoses with

· filtered through

· postaxialmost part of

· skeleton of

· directly develops from

· develops from part of

· indirectly_supplies

· in posterior side of

· deep to

· superficial to

· immediately superficial to

· serially homologous to

· ventral to

· trunk_part_of

· distal to

· proximal to

· in deep part of

· in superficial part of

· in dorsal side of

· has potential to developmentally contribute to

· in ventral side of

· layer part of

· in distal side of

· in proximal side of

· in_innermost_side_of

· in_outermost_side_of

· aboral to

· existence starts during or after

· in central side of

· immediately anterior to

· immediately posterior to

· develops into
