Upload Internal Data
How to add internal data to your graph through the creation of a data package.
A data package is a collection of objects, the concepts they are associated with, and the relationships that exist between the objects. Any concepts and relationships that you would like to leverage must be created before uploading. To make the data upload as easy as possible, we have several Python data adapters that will convert your processed data files to the required JSON format. Information on how to use our data adapters can be found here. Should you choose to format the data yourself without the use of our data adapters, please follow the instructions outlined in this article.
Within the Foundry, use the left-hand panel to select "Internal Data Package". Select "Upload Data" and follow the stepper to create a new Data Package.
When creating a Data Package, you will have to provide the following information:
Name of the data package you are creating
Description of the data package you are creating
Nodes
If using our Python data adapters, this would be the node.json file
All nodes must have a universally unique identifier. Generally, we recommend using uuid-v4.
All nodes must have 1 or more labels. These labels are synonymous with your ontology Concepts. Specifically, to make sure they get picked up in your knowledge graph, you must supply them exactly as you've specified in your DbLabel.
A properties object must be defined with the uuid and the displayName. Here the uuid is the same as the _id.
All property values must be primitives (string, number, boolean) or arrays of primitives.
It is your responsibility to ensure that labels used are semantically consistent with your Concept structure
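The node rules above can be sketched as a small helper. This is a minimal illustration, not the adapters' implementation: the keys `_id`, `properties`, `uuid` and `displayName` come from the requirements above, while the `labels` key name, the `Sample` label and the extra properties are assumptions made for the example.

```python
import json
import uuid

# Property values must be primitives or arrays of primitives.
PRIMITIVES = (str, int, float, bool)


def make_node(labels, display_name, **extra_properties):
    """Build one node record that satisfies the rules listed above."""
    if not labels:
        raise ValueError("a node needs at least one label")
    node_id = str(uuid.uuid4())  # uuid-v4, as recommended
    # uuid and displayName are set last so extra properties cannot override them.
    properties = {**extra_properties, "uuid": node_id, "displayName": display_name}
    for key, value in properties.items():
        values = value if isinstance(value, list) else [value]
        if not all(isinstance(v, PRIMITIVES) for v in values):
            raise TypeError(f"property {key!r} must be a primitive or an array of primitives")
    # "labels" as the key name is an assumption; check it against your schema.
    return {"_id": node_id, "labels": list(labels), "properties": properties}


# "Sample" is a hypothetical label; use the DbLabel of your own Concept.
node = make_node(["Sample"], "Liver biopsy 01", tissue="liver", replicates=[1, 2])
print(json.dumps(node))
```

Nested objects are rejected with a `TypeError`, so a property such as `{"meta": {"a": 1}}` would need to be flattened first.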
Select from any of the concepts that exist within your graph that pertain to the labels you have provided in your node JSON schema.
If you have used a BioBox adapter to generate your file, run list_schema to see the concepts that should be selected.
Relationships data describe the relationships between objects within your Node JSON schema. This is optional.
If using our data adapters, this would be the edge.json file.
from.uuid and to.uuid must be ids of objects that exist or are expected to exist during loading
Select from any of the relationships that exist within your graph that pertain to the relationship data you have provided.
If you have used a BioBox adapter to generate the file, run list_schema to see the relationships that could be selected.
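As a rough sketch of the relationship rules, the helper below builds an edge record and checks that every endpoint refers to a known node. The nested `from`/`to` objects are inferred from the `from.uuid` and `to.uuid` names; the `label` and `properties` keys, the `RELATED_TO` label and the placeholder ids are assumptions made for illustration.

```python
def make_edge(from_uuid, to_uuid, label, **properties):
    """Build one edge record; 'label' and 'properties' are assumed key names."""
    return {
        "from": {"uuid": from_uuid},
        "to": {"uuid": to_uuid},
        "label": label,
        "properties": properties,
    }


def missing_endpoints(edges, known_uuids):
    """Return every endpoint uuid that does not match a known node."""
    known = set(known_uuids)
    missing = set()
    for edge in edges:
        for side in ("from", "to"):
            endpoint = edge[side]["uuid"]
            if endpoint not in known:
                missing.add(endpoint)
    return missing


# Placeholder ids for readability; real ids would be the uuids of your nodes.
edges = [
    make_edge("node-a", "node-b", "RELATED_TO"),
    make_edge("node-a", "node-c", "RELATED_TO", weight=0.4),
]
known_nodes = {"node-a", "node-b"}
print(missing_endpoints(edges, known_nodes))  # → {'node-c'}
```

An endpoint reported here is not necessarily an error, since the target may already exist in the graph or be loaded by another package. The check is only meant to surface typos before upload.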
All nodes and edges should be written as .jsonl (newline-delimited JSON), with nodes and edges in separate files.
Best practice: Namespace your files for clarity, e.g. gene_sets.node.jsonl.gz and gene_sets.edge.jsonl.gz
If you are not using our Python adapters, please format your data as outlined below.
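One way to produce gzip-compressed, newline-delimited JSON with the Python standard library is sketched below. The helper names are hypothetical, and the example writes to a temporary directory only so that it can be run safely as-is.

```python
import gzip
import json
import tempfile
from pathlib import Path


def write_jsonl_gz(records, path):
    """Write one JSON object per line to a gzip-compressed file; return the row count."""
    count = 0
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record) + "\n")
            count += 1
    return count


def read_jsonl_gz(path):
    """Read the file back, mainly to sanity-check it before upload."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


with tempfile.TemporaryDirectory() as workdir:
    # Nodes and edges go to separate, namespaced files.
    node_path = Path(workdir) / "gene_sets.node.jsonl.gz"
    written = write_jsonl_gz([{"_id": "a"}, {"_id": "b"}], node_path)
    round_trip = read_jsonl_gz(node_path)

print(written, round_trip)
```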
For bulk RNAseq, on a per library basis, you will want to store the RAW values and the TPM adjusted values. The graph representation looks like this:
Quantified data should be serialized into an array of edges.
If you plan to quantify and store results for the same library more than once, record the file_id or analysis_id on every edge of a given run so that the runs can be told apart.
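A minimal sketch of serializing one library's quantification into an array of edges follows. It rests on several assumptions: that edges run from the library to each gene, that the gene's id is used as the endpoint uuid, and that the `HAS_EXPRESSION` label and the `raw`/`tpm` property names suit your ontology. Only `analysis_id` is taken from the text above.

```python
def quantification_edges(library_uuid, rows, analysis_id):
    """Turn per-gene quantification rows into one edge per gene."""
    return [
        {
            "from": {"uuid": library_uuid},
            "to": {"uuid": row["gene_id"]},
            "label": "HAS_EXPRESSION",  # hypothetical relationship name
            "properties": {
                "raw": row["raw"],  # raw count
                "tpm": row["tpm"],  # TPM-adjusted value
                "analysis_id": analysis_id,  # identifies this quantification run
            },
        }
        for row in rows
    ]


# Illustrative values only.
rows = [
    {"gene_id": "ENSG00000141510", "raw": 532, "tpm": 41.7},
    {"gene_id": "ENSG00000012048", "raw": 88, "tpm": 6.2},
]
run_edges = quantification_edges("library-1", rows, analysis_id="run-2024-01")
print(len(run_edges))  # → 2
```

Quantifying the same library again would call the function with a different `analysis_id`, leaving both sets of edges distinguishable.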
Create a node row for each Differential Expression Dataset. Keep track of the uuid you've assigned to it, because you'll need to reference it later.
Assuming you know the samples (and their ids) that map to the experimental and reference groups, you will need to set up the edges that connect them to the dataset.
Repeat this process for the rest of your experimental group and your reference groups.
It is not recommended to have your edges point across different Concepts. Using these paths consistently will improve your ability to reuse them in your graph models later.
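The dataset node and its group edges might be assembled as in the sketch below. The label `DifferentialExpressionDataset`, the relationship names `HAS_EXPERIMENTAL_SAMPLE` and `HAS_REFERENCE_SAMPLE`, the edge direction (dataset to sample) and the sample ids are all assumptions made for illustration. Substitute the Concepts and relationships defined in your own graph.

```python
import uuid


def dataset_with_groups(display_name, experimental_ids, reference_ids):
    """Create one dataset node plus an edge to every sample in each group."""
    dataset_id = str(uuid.uuid4())  # keep this id; later edges reference it
    node = {
        "_id": dataset_id,
        "labels": ["DifferentialExpressionDataset"],  # hypothetical label
        "properties": {"uuid": dataset_id, "displayName": display_name},
    }
    groups = (
        ("HAS_EXPERIMENTAL_SAMPLE", experimental_ids),  # hypothetical names
        ("HAS_REFERENCE_SAMPLE", reference_ids),
    )
    edges = [
        {
            "from": {"uuid": dataset_id},
            "to": {"uuid": sample_id},
            "label": label,
            "properties": {},
        }
        for label, sample_ids in groups
        for sample_id in sample_ids
    ]
    return node, edges


dataset_node, group_edges = dataset_with_groups(
    "Treated vs control", ["sample-1", "sample-2"], ["sample-3"]
)
print(len(group_edges))  # → 3
```

Using the same two relationship names for every dataset keeps the paths consistent, which is what makes them reusable in graph models.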
For each dataset, iterate through the rows and transform them into edges. Every gene has its uuid set to its corresponding Ensembl stable ID.
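A possible shape for this row-to-edge step is sketched below using only the standard library. The column names (`ensembl_id`, `log2FoldChange`, `padj`), the `DIFFERENTIALLY_EXPRESSES` label and the dataset-to-gene direction are assumptions, and the sketch expects every statistic to be numeric, so rows with missing values would need handling first.

```python
import csv
import io


def de_rows_to_edges(dataset_uuid, table_text):
    """Convert each row of a differential expression table into one edge."""
    edges = []
    for row in csv.DictReader(io.StringIO(table_text)):
        edges.append(
            {
                "from": {"uuid": dataset_uuid},
                # The gene's uuid is its Ensembl stable ID.
                "to": {"uuid": row["ensembl_id"]},
                "label": "DIFFERENTIALLY_EXPRESSES",  # hypothetical name
                "properties": {
                    "log2FoldChange": float(row["log2FoldChange"]),
                    "padj": float(row["padj"]),
                },
            }
        )
    return edges


# Illustrative values only.
TABLE = """ensembl_id,log2FoldChange,padj
ENSG00000141510,1.8,0.002
ENSG00000012048,-0.7,0.04
"""

de_edges = de_rows_to_edges("dataset-1", TABLE)
print(de_edges[0]["to"]["uuid"])  # → ENSG00000141510
```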
For scRNA, we want to isolate each cell by its barcode and treat the cells similarly to RNAseq libraries.
How to prepare this data?
From .h5ad files where the cell barcode is available, transform the count columns of raw or log-normalized values and create nodes and edges similar to the process with bulk RNAseq, except here you are swapping Cells for Samples.
From any differential expression dataset where you've compared cell populations. The groupings here don't matter as much, because by connecting the dataset directly to the cells, you can always identify along what separation the exp/ref was built. So long as you maintain the appropriate annotations for the Cell node(s), they can be used in the graph explorer and the graph models.
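The first route above can be sketched without any third-party dependency by starting from barcodes, gene ids and a dense count matrix. With the anndata library these would typically come from `adata.obs_names`, `adata.var_names` and `adata.X`. The `Cell` label, the `HAS_EXPRESSION` relationship name, the `barcode` property and the choice to skip zero counts are assumptions made for this illustration.

```python
import uuid


def cells_to_graph(barcodes, gene_ids, counts, value_name="raw"):
    """Create one Cell node per barcode and one edge per non-zero cell/gene value."""
    nodes, edges = [], []
    for barcode, row in zip(barcodes, counts):
        cell_id = str(uuid.uuid4())
        nodes.append(
            {
                "_id": cell_id,
                "labels": ["Cell"],  # hypothetical label
                "properties": {
                    "uuid": cell_id,
                    "displayName": barcode,
                    "barcode": barcode,
                },
            }
        )
        for gene_id, value in zip(gene_ids, row):
            if value == 0:
                continue  # assumption: zero counts are not stored as edges
            edges.append(
                {
                    "from": {"uuid": cell_id},
                    "to": {"uuid": gene_id},
                    "label": "HAS_EXPRESSION",  # hypothetical name
                    "properties": {value_name: value},
                }
            )
    return nodes, edges


# Illustrative values only.
barcodes = ["AAACCTGAGAAACCAT-1", "AAACCTGAGAAACCGC-1"]
gene_ids = ["ENSG00000141510", "ENSG00000012048"]
counts = [[3, 0], [0, 5]]
cell_nodes, cell_edges = cells_to_graph(barcodes, gene_ids, counts)
print(len(cell_nodes), len(cell_edges))  # → 2 2
```

Passing `value_name="log_normalized"` with a log-normalized matrix would store those values under a separate property name.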