scRNA-seq
HTAN scRNA-seq Data Model - Single-cell RNA sequencing data
BaseSequencingAttributes
Minimal base attributes shared across all sequencing types
| Attribute | Type | Required | Description |
|---|---|---|---|
|  | string | No | Checksum for data integrity verification |
|  | string | Yes | Name of the file |
|  | string | Yes | Format of the file (e.g., fastq, bam, vcf, h5ad) |
|  | string | Yes | HTAN Data File ID (Primary Key) |
|  | string | Yes | HTAN Parent ID(s) - Foreign key(s) to parent entity (B for Biospecimen, D for data file). One or more IDs; for aggregated files provide multiple. Each ID must have B or D suffix. Supports HTA200-229 for phase 2. |
BaseSequencingLevel1Attributes
Level 1 attributes - sequencing run and library (raw data)
| Attribute | Type | Required | Description |
|---|---|---|---|
|  |  | Yes | Library layout (paired-end or single-end) |
|  |  | Yes | Sequencing platform used |
|  | string | No | Sequencing batch identifier |
|  | integer | No | Number of days between when the sample for assay was received in the lab and when the libraries were prepared for sequencing. If not applicable, enter ‘Not Applicable’. |
|  | string | No | Technical replicate group identifier |
|  | string | No | Link to sequencing protocol |
|  | string | No | Checksum for data integrity verification |
|  | string | Yes | Name of the file |
|  | string | Yes | Format of the file (e.g., fastq, bam, vcf, h5ad) |
|  | string | Yes | HTAN Data File ID (Primary Key) |
|  | string | Yes | HTAN Parent ID(s) - Foreign key(s) to parent entity (B for Biospecimen, D for data file). One or more IDs; for aggregated files provide multiple. Each ID must have B or D suffix. Supports HTA200-229 for phase 2. |
BaseSequencingLevel2Attributes
Level 2 attributes - alignment and alignment workflow
| Attribute | Type | Required | Description |
|---|---|---|---|
|  |  | Yes | Genomic or transcriptomic reference assembly used for alignment. If your genome reference is not among the valid values, please contact your data liaison. |
|  | string | Yes | URL to genomic or transcriptomic reference |
|  | string | Yes | URL to genome or transcriptome annotation |
|  | string | Yes | Major version of the workflow, or ‘Not applicable’ when no workflow version applies. |
|  | string | Yes | Link to workflow or command. Dockstore.org is recommended. |
|  |  | Yes | Library layout (paired-end or single-end) |
|  |  | Yes | Sequencing platform used |
|  | string | No | Sequencing batch identifier |
|  | integer | No | Number of days between when the sample for assay was received in the lab and when the libraries were prepared for sequencing. If not applicable, enter ‘Not Applicable’. |
|  | string | No | Technical replicate group identifier |
|  | string | No | Link to sequencing protocol |
|  | string | No | Checksum for data integrity verification |
|  | string | Yes | Name of the file |
|  | string | Yes | Format of the file (e.g., fastq, bam, vcf, h5ad) |
|  | string | Yes | HTAN Data File ID (Primary Key) |
|  | string | Yes | HTAN Parent ID(s) - Foreign key(s) to parent entity (B for Biospecimen, D for data file). One or more IDs; for aggregated files provide multiple. Each ID must have B or D suffix. Supports HTA200-229 for phase 2. |
BaseSequencingLevel3Attributes
Level 3+ attributes - inherits alignment and workflow; used for processed/analysis levels
| Attribute | Type | Required | Description |
|---|---|---|---|
|  |  | Yes | Genomic or transcriptomic reference assembly used for alignment. If your genome reference is not among the valid values, please contact your data liaison. |
|  | string | Yes | URL to genomic or transcriptomic reference |
|  | string | Yes | URL to genome or transcriptome annotation |
|  | string | Yes | Major version of the workflow, or ‘Not applicable’ when no workflow version applies. |
|  | string | Yes | Link to workflow or command. Dockstore.org is recommended. |
|  |  | Yes | Library layout (paired-end or single-end) |
|  |  | Yes | Sequencing platform used |
|  | string | No | Sequencing batch identifier |
|  | integer | No | Number of days between when the sample for assay was received in the lab and when the libraries were prepared for sequencing. If not applicable, enter ‘Not Applicable’. |
|  | string | No | Technical replicate group identifier |
|  | string | No | Link to sequencing protocol |
|  | string | No | Checksum for data integrity verification |
|  | string | Yes | Name of the file |
|  | string | Yes | Format of the file (e.g., fastq, bam, vcf, h5ad) |
|  | string | Yes | HTAN Data File ID (Primary Key) |
|  | string | Yes | HTAN Parent ID(s) - Foreign key(s) to parent entity (B for Biospecimen, D for data file). One or more IDs; for aggregated files provide multiple. Each ID must have B or D suffix. Supports HTA200-229 for phase 2. |
scRNALevel1
scRNA-seq Level 1 data - Raw sequencing files and metadata
| Attribute | Type | Required | Description |
|---|---|---|---|
|  | string | Yes | Format of the raw sequencing file (fastq or fastq.gz) |
|  | string | Yes | Name of the file. Must end with an extension matching the FILE_FORMAT (.fastq for fastq; .fastq.gz or .fq.gz for fastq.gz) |
|  |  | Yes | Method used to isolate single cells |
|  |  | Yes | Method used to dissociate tissue into single cells |
|  | boolean | No | Whether cells were cryopreserved in the sample |
|  |  | Yes | Type of nucleic acid used for sequencing |
|  |  | Yes | Method used to construct the sequencing library |
|  |  | Yes | Primer used for reverse transcription |
|  |  | Yes | Type of spike-in used, if any |
|  |  | Yes | Type of read (forward, reverse, index) |
|  |  | Yes | Library layout (paired-end or single-end) |
|  |  | Yes | Sequencing platform used |
|  | string | No | Sequencing batch identifier |
|  | integer | No | Number of days between when the sample for assay was received in the lab and when the libraries were prepared for sequencing. If not applicable, enter ‘Not Applicable’. |
|  | string | No | Technical replicate group identifier |
|  | string | No | Link to sequencing protocol |
|  | string | No | Checksum for data integrity verification |
|  | string | Yes | HTAN Data File ID (Primary Key) |
|  | string | Yes | HTAN Parent ID(s) - Foreign key(s) to parent entity (B for Biospecimen, D for data file). One or more IDs; for aggregated files provide multiple. Each ID must have B or D suffix. Supports HTA200-229 for phase 2. |
scRNALevel2
scRNA-seq Level 2 data - Workflow and processing metadata
| Attribute | Type | Required | Description |
|---|---|---|---|
|  | string | Yes | Format of the aligned file (bam or cram) |
|  | string | Yes | Name of the file. Must end with an extension matching the FILE_FORMAT (.bam for bam; .cram for cram) |
|  |  | Yes | Generic name for the workflow used to analyze the dataset |
|  | string | No | Link to whitelist cell barcode file |
|  | string | No | Tag used for cell barcodes |
|  | string | No | Tag used for UMIs |
|  |  | Yes | Genomic or transcriptomic reference assembly used for alignment. If your genome reference is not among the valid values, please contact your data liaison. |
|  | string | Yes | URL to genomic or transcriptomic reference |
|  | string | Yes | URL to genome or transcriptome annotation |
|  | string | Yes | Major version of the workflow, or ‘Not applicable’ when no workflow version applies. |
|  | string | Yes | Link to workflow or command. Dockstore.org is recommended. |
|  |  | Yes | Library layout (paired-end or single-end) |
|  |  | Yes | Sequencing platform used |
|  | string | No | Sequencing batch identifier |
|  | integer | No | Number of days between when the sample for assay was received in the lab and when the libraries were prepared for sequencing. If not applicable, enter ‘Not Applicable’. |
|  | string | No | Technical replicate group identifier |
|  | string | No | Link to sequencing protocol |
|  | string | No | Checksum for data integrity verification |
|  | string | Yes | HTAN Data File ID (Primary Key) |
|  | string | Yes | HTAN Parent ID(s) - Foreign key(s) to parent entity (B for Biospecimen, D for data file). One or more IDs; for aggregated files provide multiple. Each ID must have B or D suffix. Supports HTA200-229 for phase 2. |
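
The filename rules stated in the scRNALevel1 and scRNALevel2 tables above can be checked mechanically before submission. The following is a minimal sketch, not an official HTAN validator: the function name `filename_matches_format` and the `ALLOWED_SUFFIXES` mapping are hypothetical, and the comparison is assumed to be case-sensitive, which the tables do not specify.

```python
# Allowed filename endings per FILE_FORMAT value, copied from the
# scRNALevel1 and scRNALevel2 attribute descriptions.
ALLOWED_SUFFIXES = {
    "fastq": (".fastq",),
    "fastq.gz": (".fastq.gz", ".fq.gz"),
    "bam": (".bam",),
    "cram": (".cram",),
}


def filename_matches_format(filename: str, file_format: str) -> bool:
    """Return True if filename ends with a suffix allowed for file_format.

    Assumption: matching is case-sensitive; the data model does not say.
    """
    suffixes = ALLOWED_SUFFIXES.get(file_format)
    if suffixes is None:
        raise ValueError(f"unsupported file format: {file_format!r}")
    # str.endswith accepts a tuple and returns True if any suffix matches.
    return filename.endswith(suffixes)
```

For example, `filename_matches_format("sample_R1.fq.gz", "fastq.gz")` passes, while a gzipped file declared as plain `fastq` is rejected because its name does not end in `.fastq`.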
CoreFileAttributes
Universal attributes that apply to all file-based data in HTAN
| Attribute | Type | Required | Description |
|---|---|---|---|
|  | string | Yes | Name of the file |
|  | string | Yes | Format of the file (e.g., fastq, bam, vcf, h5ad) |
|  | string | Yes | HTAN Data File ID (Primary Key) |
|  | string | Yes | HTAN Parent ID(s) - Foreign key(s) to parent entity (B for Biospecimen, D for data file). One or more IDs; for aggregated files provide multiple. Each ID must have B or D suffix. Supports HTA200-229 for phase 2. |
scRNALevel3and4
Single-cell RNA-seq Level 3 and 4 - Gene expression files and cell relationships
| Attribute | Type | Required | Description |
|---|---|---|---|
|  | string | Yes | Format of the file (only h5ad files accepted for Level 3/4) |
|  | string | Yes | Name of the file. Must end with .h5ad extension |
|  |  | Yes | Generic name for the workflow used to analyze a data set |
|  | string | Yes | Parameters used to run the workflow. For scRNA-seq Level 3, for example: normalization and log transformation, whether empty-droplet or doublet detection was run, and any filter on the number of genes per cell. For scRNA-seq Level 4, for example: dimensionality reduction with PCA and 50 components, nearest-neighbor graph with k = 20, Leiden clustering with resolution = 1, UMAP visualization using 50 PCA components, marker genes used to annotate cell types, and how the droplet matrix (all barcodes) was converted to the cell matrix (only informative barcodes representing real cells). |
|  |  | Yes | Specific content type of the data file |
|  |  | Yes | Type of data stored in matrix |
|  | string | No | All matrices associated with every part of a SingleCellExperiment object. Comma-delimited list of filenames. |
|  | integer | Yes | Median number of reads per cell |
|  | integer | Yes | Median number of genes detected per cell |
|  | integer | Yes | Number of sequenced cells. Applies to raw counts matrix only |
|  | string | Yes | Version of AnnData schema (must be 0.1 for CellxGene compliance) |
|  | boolean | Yes | Whether the h5ad file structure has been validated against AnnData 0.1 schema |
|  |  | Yes | Genomic or transcriptomic reference assembly used for alignment. If your genome reference is not among the valid values, please contact your data liaison. |
|  | string | Yes | URL to genomic or transcriptomic reference |
|  | string | Yes | URL to genome or transcriptome annotation |
|  | string | Yes | Major version of the workflow, or ‘Not applicable’ when no workflow version applies. |
|  | string | Yes | Link to workflow or command. Dockstore.org is recommended. |
|  |  | Yes | Library layout (paired-end or single-end) |
|  |  | Yes | Sequencing platform used |
|  | string | No | Sequencing batch identifier |
|  | integer | No | Number of days between when the sample for assay was received in the lab and when the libraries were prepared for sequencing. If not applicable, enter ‘Not Applicable’. |
|  | string | No | Technical replicate group identifier |
|  | string | No | Link to sequencing protocol |
|  | string | No | Checksum for data integrity verification |
|  | string | Yes | HTAN Data File ID (Primary Key) |
|  | string | Yes | HTAN Parent ID(s) - Foreign key(s) to parent entity (B for Biospecimen, D for data file). One or more IDs; for aggregated files provide multiple. Each ID must have B or D suffix. Supports HTA200-229 for phase 2. |
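
Two of the required Level 3/4 summary values, the number of cells and the median number of genes detected per cell, can be derived from a raw count matrix. The sketch below is illustrative only and makes several assumptions that the data model does not state: the matrix is given as rows of cells by columns of genes, a gene counts as "detected" when its count is greater than zero, and a fractional median is truncated to fit the integer type. The function name and dictionary keys are hypothetical, not HTAN attribute names.

```python
from statistics import median


def summarize_counts(counts):
    """Summarize a raw count matrix given as a list of per-cell rows.

    Assumptions: rows are cells, columns are genes, and a gene is
    'detected' in a cell when its count is greater than zero.
    """
    if not counts:
        raise ValueError("count matrix has no cells")
    # Number of genes with a nonzero count in each cell.
    genes_per_cell = [sum(1 for c in row if c > 0) for row in counts]
    return {
        "cell_count": len(counts),
        # Truncate to int because the attribute type is integer;
        # the rounding policy is an assumption.
        "median_genes_per_cell": int(median(genes_per_cell)),
    }
```

The median number of reads per cell is not computed here because it depends on read-level information that a count matrix alone may not carry.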
Enums
DataCategoryEnum
| Value | Description |
|---|---|
| Exon Expression Quantification | Exon expression quantification |
| Gene Expression | Gene expression data |
| Gene Expression Quantification | Gene expression quantification |
| Isoform Expression Quantification | Isoform expression quantification |
| Other | Other data category |
| Splice Junction Quantification | Splice junction quantification |
| Transcript Expression | Transcript expression data |
DissociationMethodEnum
| Value | Description |
|---|---|
| Enzymatic | Enzymatic dissociation method |
| Mechanical | Mechanical dissociation method |
| Other | Other dissociation method |
| Unknown | Unknown dissociation method |
GenomicReferenceEnum
Genomic or transcriptomic reference assembly used for alignment
| Value | Description |
|---|---|
| GRCh37 | Genome Reference Consortium human build 37 |
| GRCh37.p13 | GRCh37 patch release 13 |
| GRCh38 | Genome Reference Consortium human build 38 |
| GRCh38.p13 | GRCh38 patch release 13 |
| GRCh38.p14 | GRCh38 patch release 14 |
| hg19 | UCSC human genome reference hg19 |
| hg38 | UCSC human genome reference hg38 |
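
A controlled vocabulary such as GenomicReferenceEnum maps naturally onto a language-level enumeration, which rejects values outside the permitted list. The sketch below is one possible encoding, not the project's own implementation: the class name `GenomicReference`, its member names, and the helper `parse_genomic_reference` are hypothetical, while the string values are the ones listed in the table above. Matching is assumed to be exact and case-sensitive.

```python
from enum import Enum


class GenomicReference(Enum):
    """Permitted values, copied from the GenomicReferenceEnum table."""
    GRCH37 = "GRCh37"
    GRCH37_P13 = "GRCh37.p13"
    GRCH38 = "GRCh38"
    GRCH38_P13 = "GRCh38.p13"
    GRCH38_P14 = "GRCh38.p14"
    HG19 = "hg19"
    HG38 = "hg38"


def parse_genomic_reference(value: str) -> GenomicReference:
    """Look up a permitted value or raise ValueError listing the options."""
    try:
        return GenomicReference(value)
    except ValueError:
        valid = ", ".join(member.value for member in GenomicReference)
        raise ValueError(
            f"{value!r} is not a permitted genomic reference; "
            f"expected one of: {valid}"
        ) from None
```

A value such as `"grch38"` fails the lookup under this assumption, which surfaces capitalization mistakes early instead of at submission time.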
LibraryConstructionMethodEnum
| Value | Description |
|---|---|
| 10X Genomics | 10X Genomics library construction method |
| Drop-seq | Drop-seq library construction method |
| Fluidigm C1 | Fluidigm C1 library construction method |
| InDrop | InDrop library construction method |
| Other | Other library construction method |
| Smart-seq | Smart-seq library construction method |
| Unknown | Unknown library construction method |
LibraryLayoutEnum
| Value | Description |
|---|---|
| Paired-end | Paired-end sequencing |
| Single-end | Single-end sequencing |
MatrixTypeEnum
| Value | Description |
|---|---|
| Batch Corrected Counts | Batch corrected count matrix |
| Normalized Counts | Normalized count matrix |
| Raw Counts | Raw count matrix |
| Scaled Counts | Scaled count matrix |
NucleicAcidSourceEnum
| Value | Description |
|---|---|
| DNA | DNA nucleic acid source |
| RNA | RNA nucleic acid source |
| Unknown | Unknown nucleic acid source |
ReadIndicatorEnum
| Value | Description |
|---|---|
| Forward | Forward read indicator |
| Index | Index read indicator |
| Reverse | Reverse read indicator |
| Unknown | Unknown read indicator |
ReverseTranscriptionPrimerEnum
| Value | Description |
|---|---|
| Oligo-dT | Oligo-dT reverse transcription primer |
| Random Hexamer | Random hexamer reverse transcription primer |
| Unknown | Unknown reverse transcription primer |
SingleCellIsolationMethodEnum
| Value | Description |
|---|---|
| Cell Sorting | Cell sorting isolation method |
| Droplet-based | Droplet-based isolation method |
| Manual Picking | Manual picking isolation method |
| Microfluidics | Microfluidics isolation method |
| Other | Other isolation method |
| Unknown | Unknown isolation method |
SpikeInEnum
| Value | Description |
|---|---|
| ERCC | ERCC spike-in |
| None | No spike-in |
| Other | Other spike-in |
| Unknown | Unknown spike-in |
scRNAseqWorkflowTypeEnumLevel2
| Value | Description |
|---|---|
| CellRanger | CellRanger workflow |
| HCA Optimus | HCA Optimus workflow |
| Other | Other workflow |
| SEQC | SEQC workflow |
| STARsolo | STARsolo workflow |
| Unknown | Unknown workflow |
| dropEST | dropEST workflow |
scRNAseqWorkflowTypeEnumLevel3and4
| Value | Description |
|---|---|
| Cell annotation | Cell annotation workflow |
| CellRanger | 10x Genomics CellRanger workflow |
| Cufflinks | Cufflinks workflow |
| DEXSeq | DEXSeq workflow |
| Differentiation trajectory analysis | Differentiation trajectory analysis workflow |
| HCA Optimus | Human Cell Atlas Optimus workflow |
| HTSeq - FPKM | HTSeq FPKM workflow |
| Other | Other workflow type |
| SEQC | SEQC workflow |
| STARsolo | STARsolo alignment workflow |
| dropEST | dropEST workflow |
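
The two workflow enums overlap but are not interchangeable: going by the tables above, `Unknown` is listed only for Level 2, while values such as `Cufflinks` and `Cell annotation` are listed only for Level 3 and 4. The sketch below shows one way to check a workflow value against the enum for a given data level. The constant and function names are hypothetical, and the value sets are copied from the two tables.

```python
# Values from scRNAseqWorkflowTypeEnumLevel2.
LEVEL2_WORKFLOWS = frozenset({
    "CellRanger", "HCA Optimus", "Other", "SEQC",
    "STARsolo", "Unknown", "dropEST",
})

# Values from scRNAseqWorkflowTypeEnumLevel3and4.
LEVEL3AND4_WORKFLOWS = frozenset({
    "Cell annotation", "CellRanger", "Cufflinks", "DEXSeq",
    "Differentiation trajectory analysis", "HCA Optimus",
    "HTSeq - FPKM", "Other", "SEQC", "STARsolo", "dropEST",
})


def is_valid_workflow(value: str, level: int) -> bool:
    """Return True if value is permitted for the given data level (2, 3, or 4)."""
    if level == 2:
        return value in LEVEL2_WORKFLOWS
    if level in (3, 4):
        return value in LEVEL3AND4_WORKFLOWS
    raise ValueError(f"no workflow enum defined for level {level}")


# Values valid at one level but not the other.
ONLY_LEVEL2 = LEVEL2_WORKFLOWS - LEVEL3AND4_WORKFLOWS
ONLY_LEVEL3AND4 = LEVEL3AND4_WORKFLOWS - LEVEL2_WORKFLOWS
```

Keeping the two sets separate, instead of merging them into one list, preserves the level-specific differences that the set subtraction at the end makes explicit.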