How to prepare and upload data
Key concepts
Background
The INTERVALS platform and its associated data warehouse were created as a sharing platform for all studies, protocols, and data related to risk assessment of potential and candidate modified risk tobacco products as well as for reporting mechanistic investigations linked to tobacco-related diseases and pathways of toxicity.
Because such information will be uploaded by different actors reporting on different study types and assays that produce many different data types, a flexible concept for storing data and metadata was implemented in the INTERVALS data warehouse. In the near future, it will also be possible to upload files in standard/regulatory formats (e.g., SDTM, ADaM, SEND, ISA-Tab; see Figure 1).
With the introduction of data schemata accompanying every dataset, our aim was to offer the flexibility needed for uploading new and modified endpoints while adding capabilities for semantically annotating and verifying datasets to allow for easy retrieval, interoperability, and meta-analyses. We also decided to split the data and metadata files to ease the preparation of files (possibly by different scientists and/or laboratories) and to retain as much information as needed to understand what was measured, how it was measured, and what results were obtained, so that the datasets can be reanalyzed meaningfully.
Following this concept, datasets are composed of four components:
- Description of metadata format: Metadata schema
- Metadata file
- Description of data format: readme file or data schema
- Data file(s)
The relationships between these four components are shown in Figure 2, and more details are given in the following sections. First, the metadata and data schemata are described together since they have the same format. Then, the distinct cases for uploading different data types (tabular or non-tabular) are outlined.
Figure 1 – Summary flow of a dataset upload
Figure 2 - The four components of an INTERVALS dataset and their interrelationship.
The schemata define the columns and properties of the columns of the (meta)data files. Unique identifiers, the sampleID in this case, are used to link metadata and data.
Metadata and data schemata
Metadata and data schemata should be uploaded in a tabular form - csv, tab-separated and Excel are supported currently. In any schema, the following eight columns have to be included:
A. StudyStage: This groups the information about the dataset into different stages: study setup, treatment, sample preparation, endpoint measurement, and analysis (Figure 3). This information would be especially helpful for producing ISA-Tab files for export.
Figure 3 - Proposed study stages
B. ColumnName: Names of the columns as found in the respective metadata or data files. A relevant ontology term or common vocabulary should be used whenever possible.
C. Role: Not everyone will use the same column names, but key column types must be recognizable in every file for validation and to allow for analytics and meta-analysis. Such mandatory columns are therefore identified in the Role column. They are treated specifically during data upload and validation, as well as for data visualization, meta-analyses, and by any other analytical tool that will be made available on the platform. At the moment, possible entries in this column include “sampleID”, “groupID” and “contrastID” in both metadata and data schemata, “fileReference” in metadata schemata, and, in data schemata, “endpointID” specifying the endpoint and “endpointValue” specifying the primary readout. Additional roles could be added later to provide means for the automatic processing of datasets. The full list of available roles can be found here.
D. Type: This specifies the type (e.g. string, float, int,...) of (meta)data expected in the column.
E. Ontology: This column is marked with an “x” if the (meta)data is annotated with ontology terms in the (meta)data files. In this case, three columns are expected for this (meta)data entry: 1) the human-readable term, 2) the ontology the term is taken from and 3) the identifier of this term (e.g. “tissue”, “obo”, “UBERON_0000479”). A specific naming scheme for these columns in the corresponding (meta)data files is also expected, and datasets not following this scheme will not pass the validation performed during the upload process (see below). According to this scheme, the column headers of the 2nd and 3rd columns have to be the original column name given in the ColumnName column (column B) of the schema file followed by “Ontology” and “Ontology Entry”, respectively. For example, the two additional columns “MaterialType Ontology” (value: “obo”) and “MaterialType Ontology Entry” (value: “UBERON_0000479”) have to be added following the metadata column “MaterialType” (value: “tissue”) (Figure 2).
F. Unit: This column is marked with an “x” if the values are given in a specific unit. Two columns will be included in the (meta)data files for 1) the numeric value and 2) the unit.
G. UnitOntology: This column is marked with an “x” if the unit is annotated with ontology terms. In addition to the two columns for 1) the numeric value and 2) the unit, two columns are added for 3) the ontology the unit is taken from and 4) the identifier of the unit. This allows for automatic conversion from one unit to another. Here, too, the naming scheme has to be followed when creating the corresponding (meta)data files: the column headers of the 2nd to 4th columns have to be the original column name given in the ColumnName column (column B) of the schema file followed by “Unit”, “Unit Ontology” and “Unit Ontology Entry”, respectively. For example, the three additional columns “ExposureDuration Unit” (value: “month”), “ExposureDuration Unit Ontology” (value: “EFO”) and “ExposureDuration Unit Ontology Entry” (value: “UO_0000035”) have to be added following the metadata column “ExposureDuration” (value: “6”).
H. Description: This is a free-text field with a human-readable description of the (meta)data in this column. This information is essential for understanding the content and structure of the dataset; providing it is therefore mandatory.
Examples of a metadata and a data schema are available here.
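The naming scheme for the Ontology, Unit and UnitOntology flags can be illustrated with a short Python sketch. It assumes the schema has already been read into a list of dictionaries keyed by the schema column headers. The function name `expected_headers` is hypothetical and not part of INTERVALS, and the sketch further assumes that a row flagged only in UnitOntology still needs its Unit column.

```python
def expected_headers(schema_rows):
    """Return the column headers a (meta)data file should contain for a schema.

    schema_rows: list of dicts keyed by the schema column names.
    A flag counts as set when the cell contains "x" (case-insensitive).
    """
    def flagged(row, key):
        return (row.get(key) or "").strip().lower() == "x"

    headers = []
    for row in schema_rows:
        name = row["ColumnName"]
        headers.append(name)
        if flagged(row, "Ontology"):
            headers += [f"{name} Ontology", f"{name} Ontology Entry"]
        # Assumption: a unit ontology annotation implies a unit column.
        if flagged(row, "Unit") or flagged(row, "UnitOntology"):
            headers.append(f"{name} Unit")
        if flagged(row, "UnitOntology"):
            headers += [f"{name} Unit Ontology", f"{name} Unit Ontology Entry"]
    return headers


schema = [
    {"ColumnName": "SampleID", "Role": "sampleID", "Type": "string"},
    {"ColumnName": "MaterialType", "Type": "string", "Ontology": "x"},
    {"ColumnName": "ExposureDuration", "Type": "int",
     "Unit": "x", "UnitOntology": "x"},
]
for header in expected_headers(schema):
    print(header)
```

For the three schema rows above, the sketch lists “MaterialType Ontology” and “MaterialType Ontology Entry” directly after “MaterialType”, and the three unit columns directly after “ExposureDuration”.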
Data files
Datasets in INTERVALS’s flexible format can be divided into tabular datasets and non-tabular datasets:
- In tabular datasets, the data file is structured in a tabular form and a data schema is provided as described above.
- Non-tabular datasets, by contrast, allow data files to be stored in an arbitrary format, including binary formats and zip files.
Another distinction between datasets in INTERVALS, independent of the tabular/non-tabular categorization, is to divide them into raw, processed, and contrast data files:
- Raw data files will hold sample-specific primary readouts.
- Processed datasets report data generated from the raw data in a processing or analysis step. Such data may still be sample-specific (e.g. normalized values or fold changes) or may be based on groups of samples (e.g. average values, standard deviations or p-values).
- Finally, results comparing two different treatments are reported in contrast datasets.
Tabular datasets
A tabular dataset is composed of the metadata and data schemata, the metadata file and exactly one data file, all provided in csv, tab-separated or Excel file format.
The metadata file has one line either 1) per sample or 2) per group of samples.
To be able to refer to a specific sample, group, or contrast between two groups, one column with the role sampleID, groupID, or contrastID, respectively, has to be provided (Figure 4). Otherwise, the validation of the dataset will fail (see below). Additionally, the sampleID/groupID/contrastID is used to match metadata to data. To do this unambiguously, one of the identifiers, the primary identifier (PID), must be unique, i.e. each value can occur only once in the metadata file.
Figure 4 - Primary identifiers
The primary ID is determined by evaluating the following rules (Figure 4):
- 1. For raw data, sampleID is always used as the PID.
- 2. contrastID is the PID if it is available.
- 3. If sampleID and groupID are provided, sampleID is selected as the PID.
- 4. groupID is chosen as the PID only if it is the only identifier available.
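The four rules can be written down as a small decision function. This is a minimal sketch, not the platform's implementation: the function name `primary_identifier` and the dataset labels "raw" and "processed" are assumptions, as is the handling of a processed dataset that provides only a sampleID (treated here like rule 3).

```python
def primary_identifier(dataset_kind, roles):
    """Pick the primary identifier role following the four rules above.

    dataset_kind: "raw" or "processed" (labels assumed for this sketch).
    roles: set of identifier roles present in the schema.
    """
    if dataset_kind == "raw":
        # Rule 1: raw data always uses sampleID.
        if "sampleID" not in roles:
            raise ValueError("raw datasets need a sampleID column")
        return "sampleID"
    if "contrastID" in roles:
        # Rule 2: contrastID wins whenever it is available.
        return "contrastID"
    if "sampleID" in roles:
        # Rule 3 (and, as an assumption, sampleID on its own).
        return "sampleID"
    if "groupID" in roles:
        # Rule 4: groupID only if it is the only identifier.
        return "groupID"
    raise ValueError("no sampleID, groupID or contrastID column found")


print(primary_identifier("raw", {"sampleID", "groupID"}))
print(primary_identifier("processed", {"sampleID", "groupID", "contrastID"}))
print(primary_identifier("processed", {"groupID"}))
```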
SampleIDs, groupIDs or contrastIDs that do not match between the data and metadata files result in a warning during the upload validation process. This applies to the primary identifier and to any other identifier given in both files. Another validation criterion is the consistency of the columns and types between the schema and the corresponding (meta)data file. Data files need to provide an endpointID and an endpointValue column, which represent the primary readout of the experiment.
Non-tabular datasets
Non-tabular data, as defined by INTERVALS, can come in many different forms, such as CEL files, images from high-content imaging, or standard file formats used in different databases and supported by analysis and modeling tools. The metadata schema and metadata file have the same format as in the tabular case, but the metadata columns are adapted to cover information on the data files. One important addition is that a column holding the data file name has to be provided and marked with the role “fileReference” (since only one data file is supported for tabular data, this column is not needed there).
Data can be provided as sample-specific files (such as CEL files or images from high-content imaging experiments) or combined in one file containing, e.g., fold changes for multiple samples. In the latter case, additional information, such as the column or row number in which the data for a specific sample is stored, can be provided. If the sample identifiers differ between the metadata and data files, which might be the case if the data is copied from another database, this also has to be specified.
Although multiple files can be uploaded to allow for sample-specific files, only one file type can be associated with one dataset. Validation fails if multiple columns have the role “fileReference”, if an uploaded file is referenced only in a column not labeled “fileReference”, or if an uploaded file is not referenced at all. The only exception is a zip file packing together all the sample-specific files, which will be unpacked during the upload process. Finally, a readme file describing the data file format has to be provided for every non-tabular dataset.
Step-by-step tutorial - dataset creation
INTERVALS datasets are composed of four files in csv, tab-separated, or Microsoft Excel file format: 1) metadata schema, 2) metadata file, 3) data schema (tabular data) or readme file (non-tabular data) and 4) data file. The creation of these files is described in this tutorial.
Metadata schema (tabular and non-tabular data) and data schema (tabular data) creation
(Meta)data schema creation based on existing schema (recommended)
- Download a dataset from the INTERVALS platform most similar to the experiment you want to upload.
- Open the (meta)data schema file in Microsoft Excel or a compatible program like Google Sheets or OpenOffice and save it under a new name. This should result in a view similar to the figure below.
- Check all rows carefully, especially the ColumnName and Description. Remove rows you don’t want to report for your experiment, and add rows if you wish to report additional metadata. Whenever possible, group these new rows according to the StudyStage.
Example: You want to add the reference of the tissue in the catalogue of the tissue provider as a new metadata field in the Study setup section
- For each new row:
- Specify the StudyStage, e.g. study setup, treatment, sample preparation, endpoint measurement, and analysis
Example: Create a new row close to the other metadata of the “Study setup” StudyStage section (if any) and write “Study setup” into the first column.
- Enter ColumnName for each new (meta)data entry as they will later also appear in the (meta)data file (see below).
Example: Write “CellLineTissueProviderRef” into the ColumnName column.
- Define the role of the column. The full list of available roles can be found here. Roles such as sampleID, groupID and contrastID are used to link entries in the metadata file to entries in the data file; these roles, together with others such as endpointID and endpointValue, are also used for data analytics.
- Specify the variable type of the (meta)data to be filled in (string, integer, float).
Example: Set the type to string since CellLineTissueProviderRef can be an arbitrary combination of letters and numbers.
- Specify if the (meta)data should be annotated with ontology terms or units and unit ontology terms.
Example: Leave the next three columns empty since no ontology term or unit is needed for CellLineTissueProviderRef.
- Provide a description explaining the purpose of the (meta)data entry as detailed as possible.
Example: Provide the description “Reference of the tissue in the provider's catalogue” in the Description column. This concludes the adaptation of the metadata schema resulting in the new schema shown below.
Please note that reporting non-tabular datasets requires adding a column with the role “fileReference” to the corresponding metadata schema. This column holds either the name of the data file uploaded to INTERVALS or links to entries in other standard repositories. Individual files for each sample or for groups of samples, or one file for the complete dataset, are possible. Additional columns can also be added to simplify locating the data in the non-tabular data file. For example, the sample identifier used in a data file uploaded to a standard repository can be specified if it differs from the one in the metadata file.
- Save the metadata schema.
(Meta)data schema creation without using a template
- Create a new tabular file in csv, tab-delimited or Microsoft Excel format.
- Create eight columns with the headers StudyStage, ColumnName, Role, Type, Ontology, Unit, UnitOntology and Description.
- Add rows to be able to report metadata. If possible, group these rows according to the StudyStage.
- For each row:
- Specify the StudyStage, e.g. study setup, treatment, sample preparation, endpoint measurement, and analysis.
- Enter a ColumnName for each new (meta)data entry as it will later also appear in the (meta)data file.
- Define the role of the column. The full list of available roles can be found here. Roles such as sampleID, groupID and contrastID are used to link entries in the metadata file to entries in the data file; these roles, together with others such as endpointID and endpointValue, are also used for data analytics.
- Specify the variable type of the (meta)data to be filled in (string, integer, float).
- Specify if the (meta)data should be annotated with ontology terms or units and unit ontology terms.
- Provide a description explaining the purpose of the (meta)data entry as detailed as possible.
- Save the (meta)data schema.
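A schema file can also be created programmatically. The following sketch writes a small metadata schema in csv format with Python's standard `csv` module. The rows, the description texts other than the one for CellLineTissueProviderRef, and the helper name `write_schema` are illustrative assumptions, not content prescribed by INTERVALS.

```python
import csv
import io

SCHEMA_COLUMNS = ["StudyStage", "ColumnName", "Role", "Type",
                  "Ontology", "Unit", "UnitOntology", "Description"]

# Illustrative rows only; cells that are not given are written as empty.
rows = [
    {"StudyStage": "Study setup", "ColumnName": "SampleID",
     "Role": "sampleID", "Type": "string",
     "Description": "Unique identifier of the sample"},
    {"StudyStage": "Study setup", "ColumnName": "CellLineTissueProviderRef",
     "Type": "string",
     "Description": "Reference of the tissue in the provider's catalogue"},
    {"StudyStage": "Treatment", "ColumnName": "ExposureDuration",
     "Type": "int", "Unit": "x", "UnitOntology": "x",
     "Description": "Duration of the exposure"},
]


def write_schema(schema_rows, handle):
    """Write the eight-column schema header and the given rows as csv."""
    writer = csv.DictWriter(handle, fieldnames=SCHEMA_COLUMNS, restval="")
    writer.writeheader()
    writer.writerows(schema_rows)


buffer = io.StringIO()
write_schema(rows, buffer)
schema_csv = buffer.getvalue()
print(schema_csv)
```

To write to disk instead of an in-memory buffer, open the target file with `open(path, "w", newline="", encoding="utf-8")` and pass the handle to `write_schema`; UTF-8 matches the encoding the validation expects for csv files.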
Metadata file creation
- If a metadata file corresponding to the metadata schema already exists, e.g. because the schema was already used for uploading another dataset generated with the same experimental setup, open this file, save it under a new name and remove all sample information remaining from the old dataset.
- Else,
- create a new tabular file,
- create columns using the headers exactly as specified in the ColumnName column in the corresponding metadata schema,
- add ontology and ontology entry or unit, unit ontology and unit ontology entry columns, when such are specified in the metadata schema. The column headers have to be the original column name followed by “Ontology”, “Ontology Entry”, “Unit”, “Unit Ontology” and “Unit Ontology Entry”, e.g. for the metadata column “MaterialType”, the two additional columns “MaterialType Ontology” and “MaterialType Ontology Entry” have to be added.
- Fill in the metadata for each sample, group or contrast.
- Save the metadata file.
Tabular data file creation
- If a data file corresponding to the data schema already exists, open this file, save it under a new name and remove all rows other than the first one with the column headers.
- Else,
- create a new file in Excel format,
- create columns using the headers exactly as specified in the ColumnName column in the corresponding data schema,
- add ontology and ontology entry or unit, unit ontology and unit ontology entry columns, when such are specified in the data schema. The column headers have to be the original column name followed by “Ontology”, “Ontology Entry”, “Unit”, “Unit Ontology” and “Unit Ontology Entry”, e.g. for the data column “EndpointValue”, the three additional columns “EndpointValue Unit”, “EndpointValue Unit Ontology” and “EndpointValue Unit Ontology Entry” have to be added.
- Fill in the data for each sample, group or contrast.
- Save the data file.
Non-tabular data and read-me file creation
- Upload non-tabular data files as they are. Use either multiple files, all in the same file format, for individual samples or one file with all the data. A data file is not needed if links to entries in other standard repositories are provided in the fileReference column of the metadata file.
- Create a readme file for the dataset. This includes a description of the file format and, in case the data is not uploaded to INTERVALS, of the public resource the data is available from. In its simplest form, the readme can just name the standard file format used, e.g. “Data is provided according to the Affymetrix (c) CEL Data File Format”.
Step-by-step tutorial - validation process
The metadata and tabular data files must be uploaded as Excel or text-based tabular files (CSV, TSV, etc.). If support for additional file formats is desired, please contact the administrators.
After uploading the metadata and data files, trigger the validation process for metadata and data in the user interface. INTERVALS will then perform the following automatic validation steps and return the following errors or warnings should they arise in the validation:
Tabular data validation
Figure 5 - Metadata validation summary
Metadata and data schema validation
- Check that the file format is Excel or CSV/TSV.
Possible errors:
- Unrecognized file format: must be a valid Excel or UTF-8 CSV file.
- CSV file must use UTF-8 character encoding.
- Check that the schema file has exactly the following columns: “StudyStage, ColumnName, Role, Type, Ontology, Unit, UnitOntology, Description”
Possible errors:
- Schema file must have exactly these columns: list of columns
- The following schema columns are invalid: list of columns
- The following schema columns are missing: list of columns
- Check that only valid types are given (float, int, string)
Possible errors:
- Unknown type x found in the Type column: column
- Check that the role column contains:
- For RAW datasets: one sampleID and an optional groupID
- For PROCESSED datasets: at least one of sampleID, groupID or contrastID
Possible errors:
- Either sampleID or groupID or contrastId (or any combination) must occur (exactly once), but none of these was present
- Role x must occur exactly once
- Role x must occur exactly once but it appears y times
- Check that at least one line in the schema file in the role column contains a value.
Possible errors:
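The schema checks listed above can be approximated locally before uploading. The sketch below is not the INTERVALS validator: the function name `validate_schema`, the dataset labels and the exact message wording are assumptions modelled on the errors listed above, and the file format and encoding checks are left out.

```python
from collections import Counter

SCHEMA_COLUMNS = ["StudyStage", "ColumnName", "Role", "Type",
                  "Ontology", "Unit", "UnitOntology", "Description"]
VALID_TYPES = {"float", "int", "string"}
ID_ROLES = ("sampleID", "groupID", "contrastID")


def validate_schema(header, rows, dataset_kind="processed"):
    """Return a list of error messages; an empty list means all checks passed.

    header: list of column names found in the schema file.
    rows: list of dicts, one per schema row, keyed by the header names.
    dataset_kind: "raw" or "processed" (labels assumed for this sketch).
    """
    errors = []
    invalid = [c for c in header if c not in SCHEMA_COLUMNS]
    missing = [c for c in SCHEMA_COLUMNS if c not in header]
    if invalid:
        errors.append("The following schema columns are invalid: "
                      + ", ".join(invalid))
    if missing:
        errors.append("The following schema columns are missing: "
                      + ", ".join(missing))
    if invalid or missing:
        return errors  # the row checks below rely on a correct header

    for row in rows:
        type_name = (row.get("Type") or "").strip()
        if type_name not in VALID_TYPES:
            errors.append(f"Unknown type {type_name} found in the Type "
                          f"column: {row.get('ColumnName')}")

    counts = Counter((row.get("Role") or "").strip() for row in rows)
    for role in ID_ROLES:
        if counts[role] > 1:
            errors.append(f"Role {role} must occur exactly once "
                          f"but it appears {counts[role]} times")
    if dataset_kind == "raw" and counts["sampleID"] == 0:
        errors.append("Role sampleID must occur exactly once")
    elif not any(counts[role] for role in ID_ROLES):
        errors.append("Either sampleID or groupID or contrastID must occur, "
                      "but none of these was present")
    return errors


example_rows = [
    {"ColumnName": "SampleID", "Role": "sampleID", "Type": "string"},
    {"ColumnName": "Dose", "Role": "", "Type": "number"},
]
print(validate_schema(SCHEMA_COLUMNS, example_rows, "raw"))
```

In this example the only reported problem is the unsupported type "number" in the Dose row.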
Metadata validation
Figure 6 – Summary of dataset validation for tabular data
- Check that the file format is Excel or CSV.
Possible errors:
- Unrecognized file format: must be a valid Excel or UTF-8 CSV file.
- CSV file must use UTF-8 character encoding.
- Check that the metadata file has exactly the columns specified in the schema file.
Possible errors:
- Metadata file must have exactly these columns: columns
- The following metadata columns are invalid: columns
- The following metadata columns are missing: columns
- For each row in the metadata file, if the schema marks a column as an ID, check that the field is not empty.
Possible errors:
- Missing value in the ID column x at line y
- For each row in the metadata file, if the schema marks a column as a file reference, check that the field is not empty.
Possible errors:
- Missing value in the file reference column x at line y
- For each row in the metadata file, if the schema specifies a unit and the unit column contains a value, check that the value column is not empty.
Possible errors:
- Unit is present for a value with unit but the value is missing in column x at line y
- For each row in the metadata file, check that the value matches the data type specified in the schema.
Possible errors:
- Invalid data type in the column x at line y: expected r but found 's' of type t
- For each row in the metadata file, if the schema specified a unit and the value column is non-empty, check that the unit column also contains a value.
Possible errors:
- Missing unit for a value with unit in column x at line y
- For each row in the metadata file, check that the values are unique if the schema specifies unique values for the column.
Possible errors:
- Found duplicate value in the unique column x: y (occurs z times)
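Several of the row-level checks above (non-empty IDs, data types, value/unit pairing and duplicates) can be sketched as follows. All names are hypothetical. The sketch assumes that cells are read as text, that empty cells are empty strings and that the header is line 1 of the file. The message texts are modelled on the errors listed above.

```python
from collections import Counter

ID_ROLES = ("sampleID", "groupID", "contrastID")


def matches_type(value, type_name):
    """Check a cell read as text against a schema type (string, int, float)."""
    if type_name == "string":
        return True
    try:
        if type_name == "int":
            int(value)
        else:
            float(value)
    except ValueError:
        return False
    return True


def validate_rows(schema_rows, file_rows):
    """Return error messages for the ID, type and unit checks."""
    errors = []
    for schema_row in schema_rows:
        name = schema_row["ColumnName"]
        type_name = schema_row["Type"]
        has_unit = schema_row.get("Unit", "") == "x"
        is_id = schema_row.get("Role", "") in ID_ROLES
        # Line 1 is assumed to be the header, so data starts at line 2.
        for line, row in enumerate(file_rows, start=2):
            value = row.get(name, "")
            if is_id and value == "":
                errors.append(
                    f"Missing value in the ID column {name} at line {line}")
            if value != "" and not matches_type(value, type_name):
                errors.append(
                    f"Invalid data type in the column {name} at line {line}: "
                    f"expected {type_name} but found '{value}'")
            if has_unit:
                unit = row.get(f"{name} Unit", "")
                if value != "" and unit == "":
                    errors.append(f"Missing unit for a value with unit "
                                  f"in column {name} at line {line}")
                if unit != "" and value == "":
                    errors.append(f"Unit is present for a value with unit but "
                                  f"the value is missing in column {name} "
                                  f"at line {line}")
    return errors


def duplicate_values(file_rows, column):
    """Map each value occurring more than once in a column to its count."""
    counts = Counter(row.get(column, "") for row in file_rows)
    return {value: n for value, n in counts.items() if n > 1}


schema_rows = [
    {"ColumnName": "SampleID", "Role": "sampleID", "Type": "string"},
    {"ColumnName": "ExposureDuration", "Type": "int", "Unit": "x"},
]
file_rows = [
    {"SampleID": "S1", "ExposureDuration": "6",
     "ExposureDuration Unit": "month"},
    {"SampleID": "", "ExposureDuration": "six",
     "ExposureDuration Unit": ""},
]
for message in validate_rows(schema_rows, file_rows):
    print(message)
```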
Tabular data validation
All validations that are performed for metadata files are also performed on data files (except the mandatory presence of units). In addition, the following validations are performed:
For RAW datasets: sampleID is required and must be unique; groupID is optional and does not have to be unique. The sets of IDs used in the metadata and data files have to match exactly.
Possible warnings:
- The following IDs are duplicated in the data file x
- The following IDs are duplicated in the metadata file x
- ID specified in the metadata is not present in the data: x
- ID found in the data is not present in the metadata: x
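The ID comparison behind these warnings amounts to a duplicate count plus two set differences. A minimal sketch, assuming the primary identifiers have already been extracted from both files into plain lists (the function name `compare_ids` is hypothetical):

```python
from collections import Counter


def compare_ids(metadata_ids, data_ids):
    """Return warnings for duplicated or unmatched primary identifiers."""
    warnings = []
    for label, ids in (("data", data_ids), ("metadata", metadata_ids)):
        duplicated = sorted(v for v, n in Counter(ids).items() if n > 1)
        if duplicated:
            warnings.append(f"The following IDs are duplicated in the "
                            f"{label} file: {', '.join(duplicated)}")
    for missing in sorted(set(metadata_ids) - set(data_ids)):
        warnings.append(
            f"ID specified in the metadata is not present in the data: {missing}")
    for extra in sorted(set(data_ids) - set(metadata_ids)):
        warnings.append(
            f"ID found in the data is not present in the metadata: {extra}")
    return warnings


for warning in compare_ids(["S1", "S2", "S3"], ["S1", "S2", "S2", "S4"]):
    print(warning)
```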
Non-tabular data validation
Figure 7 - Non-tabular dataset validation summary
- Check that a readme file is present.
Possible errors:
- Check that no filename is given more than once.
Possible errors:
- File 'x' is provided y times but should be provided only once
- Check that the set of specified filenames exactly matches the set of uploaded files.
Possible errors:
- Data file referenced in the metadata is missing: x
- Provided data file is not referenced in the metadata: x
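The file checks for non-tabular datasets can be sketched in the same way. The function name `check_file_references` is hypothetical. The sketch assumes that the duplicate check applies to the list of uploaded file names, and it leaves out the readme check and the unpacking of zip files.

```python
from collections import Counter


def check_file_references(referenced, uploaded):
    """Compare names in the fileReference column with the uploaded files."""
    errors = []
    for name, count in sorted(Counter(uploaded).items()):
        if count > 1:
            errors.append(f"File '{name}' is provided {count} times "
                          f"but should be provided only once")
    for name in sorted(set(referenced) - set(uploaded)):
        errors.append(f"Data file referenced in the metadata is missing: {name}")
    for name in sorted(set(uploaded) - set(referenced)):
        errors.append(
            f"Provided data file is not referenced in the metadata: {name}")
    return errors


for error in check_file_references(["s1.cel", "s2.cel"], ["s1.cel", "s3.cel"]):
    print(error)
```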