⤴️Upload Internal Data

How to add internal data to your graph through the creation of a data package.

A data package is a collection of objects, the concepts they are associated with, and the relationships that exist between the objects. Concepts and relationships that you would like to leverage must be created prior to uploading. To make the data upload as easy as possible, we have several Python data adapters that will convert your processed data files to the required JSON format. Information on how to use our data adapters can be found here. Should you choose to format the data yourself without the use of our data adapters, please follow the instructions outlined in this article.

Uploading your data

Within the Foundry, use the left-hand panel to select “Internal Data Package”. Select “Upload Data” and follow the stepper to create a new Data Package.

When creating a Data Package, you will have to provide the following information:

Name

  • Name of the data package you are creating

Description

  • Description of the data package you are creating

Object Data

  • Nodes

  • If using our Python data adapters, this would be the node.json file.

    Node Payload Object Example
    {
    	"_id": "someid",
    	"labels": ["LabelA", "LabelB"],
    	"properties": {
    		"uuid": "someid",
    		"displayName": "some_name"
    	}
    }
Node JSON Schema
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "https://schema.biobox.io/node.json",
  "title": "Node",
  "definition": "Schema for submitting node objects to load into graph",
  "type": "object",
  "properties": {
    "_id": {
      "type": "string",
      "description": "universally unique identifier for the node"
    },
    "labels": {
      "type": "array",
      "description": "List of Concept labels that this node belongs to.",
      "items": {
        "type": "string"
      },
      "minItems": 1
    },
    "properties": {
      "type": "object",
      "description": "set of properties associated with the object including a unique uuid and a human readable name",
      "properties": {
        "uuid": {
          "type": "string",
          "description": "universally unique identifier for the node. Same as _id"
        },
        "displayName": {
          "type": "string",
          "description": "A human readable name for the object"
        }
      },
      "patternProperties": {
        ".*": {
          "oneOf": [
            {
              "type": ["string", "number", "boolean"]
            },
            {
              "type": "array",
              "items": {
                "type": ["string", "number", "boolean"]
              }
            }
          ]
        }
      },
      "required": ["uuid", "displayName"],
      "additionalProperties": false
    }
  },
  "required": ["_id", "labels", "properties"]
}

Requirements

  1. All nodes must have a universally unique identifier. Generally, we recommend using uuid-v4.

  2. All nodes must have 1 or more labels. These labels are synonymous with your ontology Concepts. Specifically, to make sure they get picked up in your knowledge graph, you must supply them exactly as you’ve specified in your DbLabel.

  3. A properties object must be defined with the uuid and the displayName. Here the uuid is the same as the _id.

  4. All property values must be primitives (string, number, boolean) or arrays of primitives.

It is your responsibility to ensure that the labels used are semantically consistent with your Concept structure.
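If you are building node payloads by hand, a small helper can make the four requirements above harder to miss. The following is a minimal sketch, not part of our data adapters: the function names make_node and validate_node are hypothetical, and the checks only mirror the requirements listed above, not the full JSON schema.

```python
import uuid

PRIMITIVES = (str, int, float, bool)


def make_node(labels, display_name, extra_properties=None):
    """Build a node payload; one fresh uuid-v4 is used for both _id and properties.uuid."""
    node_id = str(uuid.uuid4())
    properties = {"uuid": node_id, "displayName": display_name}
    properties.update(extra_properties or {})
    return {"_id": node_id, "labels": list(labels), "properties": properties}


def _is_valid_value(value):
    """True for a primitive or a list made only of primitives."""
    if isinstance(value, PRIMITIVES):
        return True
    return isinstance(value, list) and all(isinstance(v, PRIMITIVES) for v in value)


def validate_node(node):
    """Return a list of problems; an empty list means the requirements above hold."""
    problems = []
    if not isinstance(node.get("_id"), str) or not node.get("_id"):
        problems.append("_id must be a non-empty string")
    labels = node.get("labels")
    if (
        not isinstance(labels, list)
        or len(labels) < 1
        or not all(isinstance(label, str) for label in labels)
    ):
        problems.append("labels must be a list with at least one string")
    props = node.get("properties")
    if not isinstance(props, dict):
        problems.append("properties must be an object")
        return problems
    for key in ("uuid", "displayName"):
        if not isinstance(props.get(key), str):
            problems.append(f"properties.{key} must be a string")
    if props.get("uuid") != node.get("_id"):
        problems.append("properties.uuid must equal _id")
    for key, value in props.items():
        if not _is_valid_value(value):
            problems.append(f"properties.{key} must be a primitive or an array of primitives")
    return problems
```

For example, validate_node(make_node(["Gene"], "TP53")) returns an empty list, where "Gene" stands in for whichever DbLabel you have defined. The helper cannot tell whether a label actually exists in your ontology, so that check remains with you.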

Concepts

  • Select from any of the concepts that exist within your graph that pertain to the labels you have provided in your node JSON file.

  • If you have used a BioBox adapter to generate your file, run list_schema to see the concepts that should be selected.

Relationship Data

  • Relationship data describes the relationships between the objects in your node JSON file. This is optional.

  • If using our data adapters, this would be the edge.json file.

    Edge Payload Object Example
    {
    	"from": {
    		"uuid": "someid-1"
    	},
    	"to": {
    		"uuid": "someid-2"
    	},
    	"label": "acts on",
    	"properties": {}
    }
Edge JSON Schema
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "https://schema.biobox.io/edge.json",
  "title": "Edge",
  "definition": "Schema for submitting edge connections to load into graph",
  "type": "object",
  "properties": {
    "from": {
      "type": "object",
      "description": "An object with a single property, uuid, that marks the start node of the edge",
      "properties": {
        "uuid": {
          "type": "string"
        }
      },
      "required": ["uuid"]
    },
    "to": {
      "type": "object",
      "description": "An object with a single property, uuid, that marks the end node of the edge",
      "properties": {
        "uuid": {
          "type": "string"
        }
      },
      "required": ["uuid"]
    },
    "label": {
      "type": "string",
      "description": "The label for the edge"
    },
    "properties": {
      "type": "object",
      "description": "set of properties associated with the edge",
      "patternProperties": {
        ".*": {
          "oneOf": [
            {
              "type": ["string", "number", "boolean"]
            },
            {
              "type": "array",
              "items": {
                "type": ["string", "number", "boolean"]
              }
            }
          ]
        }
      }
    }
  },
  "required": ["from", "to", "label", "properties"]
}

Requirements:

  1. from.uuid and to.uuid must be ids of objects that exist or are expected to exist during loading
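Before uploading, it can help to check that every edge points at a node you are actually providing. The sketch below is an illustration under assumptions: find_dangling_edges is a hypothetical helper name, and known_ids stands for the ids of objects you expect to already exist in the graph.

```python
def find_dangling_edges(nodes, edges, known_ids=()):
    """Return edges whose from.uuid or to.uuid is neither in `nodes` nor in `known_ids`."""
    available = {node["_id"] for node in nodes} | set(known_ids)
    return [
        edge
        for edge in edges
        if edge["from"]["uuid"] not in available
        or edge["to"]["uuid"] not in available
    ]
```

An empty result means every endpoint is accounted for. Any edge that is returned references an id that is missing from both your node file and the ids you listed as already loaded.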

Relationships

  • Select from any of the relationships that exist within your graph that pertain to the relationship data you have provided.

  • If you have used a BioBox adapter to generate the file, run list_schema to see the relationships that could be selected.

File Formats

Nodes and edges should be written to separate files in .jsonl (newline-delimited JSON) format, with one JSON object per line.

Best practice: Namespace your files for clarity, e.g. gene_sets.node.jsonl.gz and gene_sets.edge.jsonl.gz
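As an illustration of the file format, the following sketch writes a list of payloads as gzip-compressed, newline-delimited JSON using only the Python standard library. The helper names write_jsonl_gz and read_jsonl_gz are hypothetical and are not part of our data adapters.

```python
import gzip
import json


def write_jsonl_gz(records, path):
    """Write one JSON object per line to a gzip-compressed file; return the record count."""
    count = 0
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record) + "\n")
            count += 1
    return count


def read_jsonl_gz(path):
    """Read the file back into a list of dicts, skipping blank lines."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]
```

You would call write_jsonl_gz(nodes, "gene_sets.node.jsonl.gz") and write_jsonl_gz(edges, "gene_sets.edge.jsonl.gz") so that nodes and edges end up in separate files. Reading a file back with read_jsonl_gz is a quick way to confirm that every line parses on its own.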

Preparing your data

If you are not using our python adapters, please format your data as outlined below.

Expression Data

For bulk RNAseq, you will want to store the RAW values and the TPM-adjusted values on a per-library basis. In the graph, these values are stored as properties on an edge from the sample to the gene.

How to prepare the data?

Quantified data should be serialized into an array of edges.

{
	"from": {
		"uuid": "Example_sample:1"
	},
	"to": {
		"uuid": "ENSGxxx123"
	},
	"label": "expresses",
	"properties": {
		"RAW": 100,
		"TPM": 10.123
	}
}

If you plan on quantifying and storing the results for the same library more than once, you should record the file_id or analysis_id as a property on all the edges for a given run.
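The serialization described above can be sketched as follows. This is a minimal example under assumptions: expression_edges is a hypothetical helper, the input is assumed to be a mapping of gene id to a (RAW, TPM) pair for one library, and analysis_id is stored as an edge property only when you supply one.

```python
def expression_edges(sample_id, counts, analysis_id=None):
    """Turn {gene_id: (raw, tpm)} for one library into a list of 'expresses' edges."""
    edges = []
    for gene_id, (raw, tpm) in counts.items():
        properties = {"RAW": raw, "TPM": tpm}
        if analysis_id is not None:
            # Tag the run so repeated quantifications of the same library stay distinguishable.
            properties["analysis_id"] = analysis_id
        edges.append(
            {
                "from": {"uuid": sample_id},
                "to": {"uuid": gene_id},
                "label": "expresses",
                "properties": properties,
            }
        )
    return edges
```

Calling expression_edges("Example_sample:1", {"ENSGxxx123": (100, 10.123)}) yields a single edge with the same shape as the example above.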

Differential Expression

How to prepare the data?

  1. Create a node row for each Differential Expression Dataset. Keep track of the uuid you’ve assigned to it; you’ll need to reference it later.

{
	"_id": "some-uuid-v4",
	"labels": ["DifferentalExpressionDataset"],
	"properties": {
		"uuid": "some-uuid-v4",
		"displayName": "some-human-readable-name"
	}
}
  2. Assuming you know which samples (and their ids) map to the experimental and reference groups, set up the edges that connect the dataset to them.

{
	"from": {
		"uuid": "some-uuid-v4"
	},
	"to": {
		"uuid": "example_sample:1"
	},
	"label": "experimental group includes",
	"properties": {}
}

Repeat this process for the rest of your experimental group and your reference groups.

It is not recommended to have your edges point across different Concepts. Using these paths consistently will improve your ability to reuse them in your graph models later.

  3. For each dataset, iterate through the rows and transform them into edges. All genes will have their uuid set to their corresponding Ensembl stable id.
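The group and result edges from the steps above can be generated with two small loops. The sketch below rests on several assumptions that you should adapt to your own graph: both helper names are hypothetical, the label for the reference group and the label for the dataset-to-gene edges are not specified in this article and must match relationships you have created, and each result row is assumed to be a dict whose gene_id column holds the Ensembl stable id.

```python
def group_edges(dataset_uuid, sample_ids, label):
    """Connect a dataset node to each sample in one group (experimental or reference)."""
    return [
        {
            "from": {"uuid": dataset_uuid},
            "to": {"uuid": sample_id},
            "label": label,
            "properties": {},
        }
        for sample_id in sample_ids
    ]


def result_edges(dataset_uuid, rows, label, gene_key="gene_id"):
    """Turn result rows into dataset-to-gene edges; the remaining columns become properties."""
    edges = []
    for row in rows:
        # Everything except the gene id column is carried over as an edge property.
        properties = {key: value for key, value in row.items() if key != gene_key}
        edges.append(
            {
                "from": {"uuid": dataset_uuid},
                "to": {"uuid": row[gene_key]},
                "label": label,
                "properties": properties,
            }
        )
    return edges
```

For the experimental group you would call group_edges("some-uuid-v4", ["example_sample:1"], "experimental group includes"), then repeat the call with your reference samples and the matching reference label.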

Single Cell RNAseq

For scRNA, we want to isolate each cell by its barcode and treat the cells similarly to RNAseq libraries.

How to prepare this data?

  1. From .h5ad files where the cell barcode is available, transform the count columns of raw or log-normalized values and create nodes and edges similar to the process with bulk RNAseq, except here you are swapping Cells for Samples.

  2. From any differential expression dataset where you’ve compared cell populations, connect the dataset directly to the cells. The groupings here don’t matter as much, because by connecting the dataset directly to the cells, you can always identify along what separation the exp/ref was built. So long as you maintain the appropriate annotations for the Cell node(s), they can be used in the graph explorer and the graph models.
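The first step above can be sketched in plain Python once the barcodes, gene ids, and count matrix have been read out of the .h5ad file. Everything specific here is an assumption for illustration: the helper name, the "Cell" label, the prefix used to keep barcodes unique across datasets, the reuse of the "expresses" label and RAW property from the bulk example, and the choice to skip zero counts to keep the edge file small.

```python
def cell_nodes_and_edges(barcodes, gene_ids, matrix, prefix):
    """Build one Cell node per barcode and one edge per non-zero (cell, gene) count."""
    nodes, edges = [], []
    for barcode, row in zip(barcodes, matrix):
        cell_id = f"{prefix}:{barcode}"
        nodes.append(
            {
                "_id": cell_id,
                "labels": ["Cell"],
                "properties": {"uuid": cell_id, "displayName": barcode},
            }
        )
        for gene_id, value in zip(gene_ids, row):
            if value == 0:
                continue  # assumption: unexpressed genes are not stored as edges
            edges.append(
                {
                    "from": {"uuid": cell_id},
                    "to": {"uuid": gene_id},
                    "label": "expresses",
                    "properties": {"RAW": value},
                }
            )
    return nodes, edges
```

Here matrix is a list of rows, one per barcode, with one value per gene in the same order as gene_ids. For large datasets you would stream the rows to disk rather than hold every edge in memory.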
