# scRNA-seq

HTAN scRNA-seq Data Model - Single-cell RNA sequencing data

## BaseSequencingAttributes

**Minimal base attributes shared across all sequencing types**

| Attribute | Type | Required | Description |
|-----------|------|----------|-------------|
| `CHECKSUM` | string | No | Checksum for data integrity verification |
| `FILENAME` | string | Yes | Name of the file |
| `FILE_FORMAT` | string | Yes | Format of the file (e.g., fastq, bam, vcf, h5ad) |
| `HTAN_DATA_FILE_ID` | string | Yes | HTAN Data File ID (Primary Key) |
| `HTAN_PARENT_ID` | string | Yes | HTAN Parent ID(s) - Foreign key(s) to parent entity (B for Biospecimen, D for data file). One or more IDs; for aggregated files provide multiple. Each ID must have B or D suffix. Supports HTA200-229 for phase 2. |

## BaseSequencingLevel1Attributes

**Level 1 attributes - sequencing run and library (raw data)**

| Attribute | Type | Required | Description |
|-----------|------|----------|-------------|
| `LIBRARY_LAYOUT` | [LibraryLayoutEnum](#librarylayoutenum) | Yes | Library layout (paired-end or single-end) |
| `SEQUENCING_PLATFORM` | [SequencingPlatformEnum](#sequencingplatformenum) | Yes | Sequencing platform used |
| `SEQUENCING_BATCH_ID` | string | No | Sequencing batch identifier |
| `LIBRARY_PREPARATION_DAYS_FROM_INDEX` | integer | No | Number of days between when the sample for assay was received in the lab and when the libraries were prepared for sequencing. If not applicable, enter 'Not Applicable'. |
| `TECHNICAL_REPLICATE_GROUP` | string | No | Technical replicate group identifier |
| `PROTOCOL_LINK` | string | No | Link to sequencing protocol |
| `CHECKSUM` | string | No | Checksum for data integrity verification |
| `FILENAME` | string | Yes | Name of the file |
| `FILE_FORMAT` | string | Yes | Format of the file (e.g., fastq, bam, vcf, h5ad) |
| `HTAN_DATA_FILE_ID` | string | Yes | HTAN Data File ID (Primary Key) |
| `HTAN_PARENT_ID` | string | Yes | HTAN Parent ID(s) - Foreign key(s) to parent entity (B for Biospecimen, D for data file). One or more IDs; for aggregated files provide multiple. Each ID must have B or D suffix. Supports HTA200-229 for phase 2. |

## BaseSequencingLevel2Attributes

**Level 2 attributes - alignment and alignment workflow**

| Attribute | Type | Required | Description |
|-----------|------|----------|-------------|
| `GENOMIC_REFERENCE` | [GenomicReferenceEnum](#genomicreferenceenum) | Yes | Genomic or transcriptomic reference assembly used for alignment. If your genome reference is not among the valid values, please contact your data liaison. |
| `GENOMIC_REFERENCE_URL` | string | Yes | URL to genomic or transcriptomic reference |
| `GENOME_ANNOTATION_URL` | string | Yes | URL to genome or transcriptome annotation |
| `WORKFLOW_VERSION` | string | Yes | Major version of the workflow, or 'Not applicable' when no workflow version applies. |
| `WORKFLOW_LINK` | string | Yes | Link to workflow or command. Dockstore.org is recommended. |
| `LIBRARY_LAYOUT` | [LibraryLayoutEnum](#librarylayoutenum) | Yes | Library layout (paired-end or single-end) |
| `SEQUENCING_PLATFORM` | [SequencingPlatformEnum](#sequencingplatformenum) | Yes | Sequencing platform used |
| `SEQUENCING_BATCH_ID` | string | No | Sequencing batch identifier |
| `LIBRARY_PREPARATION_DAYS_FROM_INDEX` | integer | No | Number of days between when the sample for assay was received in the lab and when the libraries were prepared for sequencing. If not applicable, enter 'Not Applicable'. |
| `TECHNICAL_REPLICATE_GROUP` | string | No | Technical replicate group identifier |
| `PROTOCOL_LINK` | string | No | Link to sequencing protocol |
| `CHECKSUM` | string | No | Checksum for data integrity verification |
| `FILENAME` | string | Yes | Name of the file |
| `FILE_FORMAT` | string | Yes | Format of the file (e.g., fastq, bam, vcf, h5ad) |
| `HTAN_DATA_FILE_ID` | string | Yes | HTAN Data File ID (Primary Key) |
| `HTAN_PARENT_ID` | string | Yes | HTAN Parent ID(s) - Foreign key(s) to parent entity (B for Biospecimen, D for data file). One or more IDs; for aggregated files provide multiple. Each ID must have B or D suffix. Supports HTA200-229 for phase 2. |

## BaseSequencingLevel3Attributes

**Level 3+ attributes - inherits alignment and workflow; used for processed/analysis levels**

| Attribute | Type | Required | Description |
|-----------|------|----------|-------------|
| `GENOMIC_REFERENCE` | [GenomicReferenceEnum](#genomicreferenceenum) | Yes | Genomic or transcriptomic reference assembly used for alignment. If your genome reference is not among the valid values, please contact your data liaison. |
| `GENOMIC_REFERENCE_URL` | string | Yes | URL to genomic or transcriptomic reference |
| `GENOME_ANNOTATION_URL` | string | Yes | URL to genome or transcriptome annotation |
| `WORKFLOW_VERSION` | string | Yes | Major version of the workflow, or 'Not applicable' when no workflow version applies. |
| `WORKFLOW_LINK` | string | Yes | Link to workflow or command. Dockstore.org is recommended. |
| `LIBRARY_LAYOUT` | [LibraryLayoutEnum](#librarylayoutenum) | Yes | Library layout (paired-end or single-end) |
| `SEQUENCING_PLATFORM` | [SequencingPlatformEnum](#sequencingplatformenum) | Yes | Sequencing platform used |
| `SEQUENCING_BATCH_ID` | string | No | Sequencing batch identifier |
| `LIBRARY_PREPARATION_DAYS_FROM_INDEX` | integer | No | Number of days between when the sample for assay was received in the lab and when the libraries were prepared for sequencing. If not applicable, enter 'Not Applicable'. |
| `TECHNICAL_REPLICATE_GROUP` | string | No | Technical replicate group identifier |
| `PROTOCOL_LINK` | string | No | Link to sequencing protocol |
| `CHECKSUM` | string | No | Checksum for data integrity verification |
| `FILENAME` | string | Yes | Name of the file |
| `FILE_FORMAT` | string | Yes | Format of the file (e.g., fastq, bam, vcf, h5ad) |
| `HTAN_DATA_FILE_ID` | string | Yes | HTAN Data File ID (Primary Key) |
| `HTAN_PARENT_ID` | string | Yes | HTAN Parent ID(s) - Foreign key(s) to parent entity (B for Biospecimen, D for data file). One or more IDs; for aggregated files provide multiple. Each ID must have B or D suffix. Supports HTA200-229 for phase 2. |
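
The four base tables above are cumulative: each level repeats the attributes of the level before it and adds its own. The following is a minimal Python sketch of that relationship, built from the attribute names in the tables. The variable names and the `added_at` helper are illustrative and not part of the HTAN schema.

```python
# Attribute names copied from the tables above; variable names are illustrative.
BASE = {"CHECKSUM", "FILENAME", "FILE_FORMAT", "HTAN_DATA_FILE_ID", "HTAN_PARENT_ID"}

LEVEL1 = BASE | {
    "LIBRARY_LAYOUT",
    "SEQUENCING_PLATFORM",
    "SEQUENCING_BATCH_ID",
    "LIBRARY_PREPARATION_DAYS_FROM_INDEX",
    "TECHNICAL_REPLICATE_GROUP",
    "PROTOCOL_LINK",
}

LEVEL2 = LEVEL1 | {
    "GENOMIC_REFERENCE",
    "GENOMIC_REFERENCE_URL",
    "GENOME_ANNOTATION_URL",
    "WORKFLOW_VERSION",
    "WORKFLOW_LINK",
}

# The Level 3+ table lists the same attributes as the Level 2 table.
LEVEL3 = set(LEVEL2)


def added_at(level, previous):
    """Return the attribute names a level introduces, sorted for display."""
    return sorted(level - previous)


print(added_at(LEVEL2, LEVEL1))
print(len(BASE), len(LEVEL1), len(LEVEL2), len(LEVEL3))
```

Modelling the levels as set unions makes it easy to see which attributes each level introduces without re-reading the repeated table rows.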

## scRNALevel1

**scRNA-seq Level 1 data - Raw sequencing files and metadata**

| Attribute | Type | Required | Description |
|-----------|------|----------|-------------|
| `FILE_FORMAT` | string | Yes | Format of the raw sequencing file (fastq or fastq.gz) |
| `FILENAME` | string | Yes | Name of the file. Must end with an extension matching the FILE_FORMAT (.fastq for fastq; .fastq.gz or .fq.gz for fastq.gz) |
| `SINGLE_CELL_ISOLATION_METHOD` | [SingleCellIsolationMethodEnum](#singlecellisolationmethodenum) | Yes | Method used to isolate single cells |
| `DISSOCIATION_METHOD` | [DissociationMethodEnum](#dissociationmethodenum) | Yes | Method used to dissociate tissue into single cells |
| `CRYOPRESERVED_CELLS_IN_SAMPLE` | boolean | No | Whether cells were cryopreserved in the sample |
| `NUCLEIC_ACID_SOURCE` | [NucleicAcidSourceEnum](#nucleicacidsourceenum) | Yes | Type of nucleic acid used for sequencing |
| `LIBRARY_CONSTRUCTION_METHOD` | [LibraryConstructionMethodEnum](#libraryconstructionmethodenum) | Yes | Method used to construct the sequencing library |
| `REVERSE_TRANSCRIPTION_PRIMER` | [ReverseTranscriptionPrimerEnum](#reversetranscriptionprimerenum) | Yes | Primer used for reverse transcription |
| `SPIKE_IN` | [SpikeInEnum](#spikeinenum) | Yes | Type of spike-in used, if any |
| `READ_INDICATOR` | [ReadIndicatorEnum](#readindicatorenum) | Yes | Type of read (forward, reverse, index) |
| `LIBRARY_LAYOUT` | [LibraryLayoutEnum](#librarylayoutenum) | Yes | Library layout (paired-end or single-end) |
| `SEQUENCING_PLATFORM` | [SequencingPlatformEnum](#sequencingplatformenum) | Yes | Sequencing platform used |
| `SEQUENCING_BATCH_ID` | string | No | Sequencing batch identifier |
| `LIBRARY_PREPARATION_DAYS_FROM_INDEX` | integer | No | Number of days between when the sample for assay was received in the lab and when the libraries were prepared for sequencing. If not applicable, enter 'Not Applicable'. |
| `TECHNICAL_REPLICATE_GROUP` | string | No | Technical replicate group identifier |
| `PROTOCOL_LINK` | string | No | Link to sequencing protocol |
| `CHECKSUM` | string | No | Checksum for data integrity verification |
| `HTAN_DATA_FILE_ID` | string | Yes | HTAN Data File ID (Primary Key) |
| `HTAN_PARENT_ID` | string | Yes | HTAN Parent ID(s) - Foreign key(s) to parent entity (B for Biospecimen, D for data file). One or more IDs; for aggregated files provide multiple. Each ID must have B or D suffix. Supports HTA200-229 for phase 2. |
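
The `FILENAME` rule for Level 1 ties the file extension to `FILE_FORMAT`. A minimal Python sketch of that check follows. The function name and the extension map are illustrative, and the comparison is assumed to be case-sensitive, which the table does not state.

```python
# Accepted extensions per FILE_FORMAT, taken from the FILENAME description above.
LEVEL1_EXTENSIONS = {
    "fastq": (".fastq",),
    "fastq.gz": (".fastq.gz", ".fq.gz"),
}


def filename_matches_format(filename: str, file_format: str) -> bool:
    """Return True if filename ends with an extension allowed for file_format."""
    allowed = LEVEL1_EXTENSIONS.get(file_format)
    if allowed is None:
        return False  # not a Level 1 raw sequencing format
    # str.endswith accepts a tuple of suffixes (case-sensitive: an assumption).
    return filename.endswith(allowed)


# Hypothetical filenames, for illustration only.
print(filename_matches_format("sample_R1.fq.gz", "fastq.gz"))
print(filename_matches_format("sample_R1.fastq.gz", "fastq"))
```

The second call is rejected because a gzipped file declared as plain `fastq` does not end in `.fastq`.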

## scRNALevel2

**scRNA-seq Level 2 data - Workflow and processing metadata**

| Attribute | Type | Required | Description |
|-----------|------|----------|-------------|
| `FILE_FORMAT` | string | Yes | Format of the aligned file (bam or cram) |
| `FILENAME` | string | Yes | Name of the file. Must end with an extension matching the FILE_FORMAT (.bam for bam; .cram for cram) |
| `SCRNASEQ_WORKFLOW_TYPE` | [scRNAseqWorkflowTypeEnumLevel2](#scrnaseqworkflowtypeenumlevel2) | Yes | Generic name for the workflow used to analyze the dataset |
| `WHITELIST_CELL_BARCODE_FILE_LINK` | string | No | Link to whitelist cell barcode file |
| `CELL_BARCODE_TAG` | string | No | Tag used for cell barcodes |
| `UMI_TAG` | string | No | Tag used for UMIs |
| `GENOMIC_REFERENCE` | [GenomicReferenceEnum](#genomicreferenceenum) | Yes | Genomic or transcriptomic reference assembly used for alignment. If your genome reference is not among the valid values, please contact your data liaison. |
| `GENOMIC_REFERENCE_URL` | string | Yes | URL to genomic or transcriptomic reference |
| `GENOME_ANNOTATION_URL` | string | Yes | URL to genome or transcriptome annotation |
| `WORKFLOW_VERSION` | string | Yes | Major version of the workflow, or 'Not applicable' when no workflow version applies. |
| `WORKFLOW_LINK` | string | Yes | Link to workflow or command. Dockstore.org is recommended. |
| `LIBRARY_LAYOUT` | [LibraryLayoutEnum](#librarylayoutenum) | Yes | Library layout (paired-end or single-end) |
| `SEQUENCING_PLATFORM` | [SequencingPlatformEnum](#sequencingplatformenum) | Yes | Sequencing platform used |
| `SEQUENCING_BATCH_ID` | string | No | Sequencing batch identifier |
| `LIBRARY_PREPARATION_DAYS_FROM_INDEX` | integer | No | Number of days between when the sample for assay was received in the lab and when the libraries were prepared for sequencing. If not applicable, enter 'Not Applicable'. |
| `TECHNICAL_REPLICATE_GROUP` | string | No | Technical replicate group identifier |
| `PROTOCOL_LINK` | string | No | Link to sequencing protocol |
| `CHECKSUM` | string | No | Checksum for data integrity verification |
| `HTAN_DATA_FILE_ID` | string | Yes | HTAN Data File ID (Primary Key) |
| `HTAN_PARENT_ID` | string | Yes | HTAN Parent ID(s) - Foreign key(s) to parent entity (B for Biospecimen, D for data file). One or more IDs; for aggregated files provide multiple. Each ID must have B or D suffix. Supports HTA200-229 for phase 2. |
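
To catch incomplete Level 2 records before submission, the attributes marked "Yes" in the Required column can be checked directly. The sketch below assumes a metadata record is held as a Python `dict` keyed by attribute name and that an empty string counts as missing. The function name and the example record are hypothetical.

```python
# Attributes marked "Yes" in the Required column of the scRNALevel2 table.
LEVEL2_REQUIRED = (
    "FILE_FORMAT", "FILENAME", "SCRNASEQ_WORKFLOW_TYPE",
    "GENOMIC_REFERENCE", "GENOMIC_REFERENCE_URL", "GENOME_ANNOTATION_URL",
    "WORKFLOW_VERSION", "WORKFLOW_LINK", "LIBRARY_LAYOUT",
    "SEQUENCING_PLATFORM", "HTAN_DATA_FILE_ID", "HTAN_PARENT_ID",
)


def missing_required(record: dict) -> list:
    """List required attributes that are absent or empty in a metadata record."""
    return [name for name in LEVEL2_REQUIRED if record.get(name) in (None, "")]


# Hypothetical, deliberately incomplete record for illustration.
record = {
    "FILE_FORMAT": "bam",
    "FILENAME": "example.bam",
    "SCRNASEQ_WORKFLOW_TYPE": "STARsolo",
    "LIBRARY_LAYOUT": "Paired-end",
    "SEQUENCING_PLATFORM": "ILLUMINA",
}
print(missing_required(record))
```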

## CoreFileAttributes

**Universal attributes that apply to all file-based data in HTAN**

| Attribute | Type | Required | Description |
|-----------|------|----------|-------------|
| `FILENAME` | string | Yes | Name of the file |
| `FILE_FORMAT` | string | Yes | Format of the file (e.g., fastq, bam, vcf, h5ad) |
| `HTAN_DATA_FILE_ID` | string | Yes | HTAN Data File ID (Primary Key) |
| `HTAN_PARENT_ID` | string | Yes | HTAN Parent ID(s) - Foreign key(s) to parent entity (B for Biospecimen, D for data file). One or more IDs; for aggregated files provide multiple. Each ID must have B or D suffix. Supports HTA200-229 for phase 2. |

## scRNALevel3and4

**Single-cell RNA-seq Level 3 and 4 - Gene expression files and cell relationships**

| Attribute | Type | Required | Description |
|-----------|------|----------|-------------|
| `FILE_FORMAT` | string | Yes | Format of the file (only h5ad files accepted for Level 3/4) |
| `FILENAME` | string | Yes | Name of the file. Must end with .h5ad extension |
| `SCRNASEQ_WORKFLOW_TYPE` | [scRNAseqWorkflowTypeEnumLevel3and4](#scrnaseqworkflowtypeenumlevel3and4) | Yes | Generic name for the workflow used to analyze a data set |
| `SCRNASEQ_WORKFLOW_PARAMETERS_DESCRIPTION` | string | Yes | Parameters used to run the workflow. Level 3 examples: normalization and log transformation, whether empty drops or doublet detection was run, filters on the number of genes per cell. Level 4 examples: dimensionality reduction with PCA and 50 components, nearest-neighbor graph with k = 20, Leiden clustering with resolution = 1, UMAP visualization using 50 PCA components, marker genes used to annotate cell types, and how the droplet matrix (all barcodes) was converted to the cell matrix (only informative barcodes representing real cells). |
| `DATA_CATEGORY` | [DataCategoryEnum](#datacategoryenum) | Yes | Specific content type of the data file |
| `MATRIX_TYPE` | [MatrixTypeEnum](#matrixtypeenum) | Yes | Type of data stored in matrix |
| `LINKED_MATRICES` | string | No | All matrices associated with every part of a SingleCellExperiment object. Comma-delimited list of filenames |
| `CELL_MEDIAN_NUMBER_READS` | integer | Yes | Median number of reads per cell |
| `CELL_MEDIAN_NUMBER_GENES` | integer | Yes | Median number of genes detected per cell |
| `CELL_TOTAL` | integer | Yes | Number of sequenced cells. Applies to raw counts matrix only |
| `ANNDATA_SCHEMA_VERSION` | string | Yes | Version of AnnData schema (must be 0.1 for CellxGene compliance) |
| `ANNDATA_STRUCTURE_VALIDATED` | boolean | Yes | Whether the h5ad file structure has been validated against AnnData 0.1 schema |
| `GENOMIC_REFERENCE` | [GenomicReferenceEnum](#genomicreferenceenum) | Yes | Genomic or transcriptomic reference assembly used for alignment. If your genome reference is not among the valid values, please contact your data liaison. |
| `GENOMIC_REFERENCE_URL` | string | Yes | URL to genomic or transcriptomic reference |
| `GENOME_ANNOTATION_URL` | string | Yes | URL to genome or transcriptome annotation |
| `WORKFLOW_VERSION` | string | Yes | Major version of the workflow, or 'Not applicable' when no workflow version applies. |
| `WORKFLOW_LINK` | string | Yes | Link to workflow or command. Dockstore.org is recommended. |
| `LIBRARY_LAYOUT` | [LibraryLayoutEnum](#librarylayoutenum) | Yes | Library layout (paired-end or single-end) |
| `SEQUENCING_PLATFORM` | [SequencingPlatformEnum](#sequencingplatformenum) | Yes | Sequencing platform used |
| `SEQUENCING_BATCH_ID` | string | No | Sequencing batch identifier |
| `LIBRARY_PREPARATION_DAYS_FROM_INDEX` | integer | No | Number of days between when the sample for assay was received in the lab and when the libraries were prepared for sequencing. If not applicable, enter 'Not Applicable'. |
| `TECHNICAL_REPLICATE_GROUP` | string | No | Technical replicate group identifier |
| `PROTOCOL_LINK` | string | No | Link to sequencing protocol |
| `CHECKSUM` | string | No | Checksum for data integrity verification |
| `HTAN_DATA_FILE_ID` | string | Yes | HTAN Data File ID (Primary Key) |
| `HTAN_PARENT_ID` | string | Yes | HTAN Parent ID(s) - Foreign key(s) to parent entity (B for Biospecimen, D for data file). One or more IDs; for aggregated files provide multiple. Each ID must have B or D suffix. Supports HTA200-229 for phase 2. |
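
The Level 3/4 table adds a few fixed-value constraints: only `h5ad` files, a `.h5ad` filename, and an AnnData schema version of 0.1. The Python sketch below checks those constraints and splits the comma-delimited `LINKED_MATRICES` value. The function names are hypothetical, and trimming whitespace around each filename is an assumption the table does not specify.

```python
def check_level3and4(record: dict) -> list:
    """Return human-readable problems found in a Level 3/4 record (empty if none)."""
    problems = []
    if record.get("FILE_FORMAT") != "h5ad":
        problems.append("FILE_FORMAT must be h5ad")
    if not str(record.get("FILENAME", "")).endswith(".h5ad"):
        problems.append("FILENAME must end with .h5ad")
    if record.get("ANNDATA_SCHEMA_VERSION") != "0.1":
        problems.append("ANNDATA_SCHEMA_VERSION must be 0.1")
    return problems


def split_linked_matrices(value: str) -> list:
    """Split the comma-delimited LINKED_MATRICES string into filenames."""
    # Whitespace trimming and dropping empty entries are assumptions.
    return [name.strip() for name in value.split(",") if name.strip()]


# Hypothetical record for illustration.
record = {
    "FILE_FORMAT": "h5ad",
    "FILENAME": "counts.h5",
    "ANNDATA_SCHEMA_VERSION": "0.1",
}
print(check_level3and4(record))
print(split_linked_matrices("raw.h5ad, normalized.h5ad"))
```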

## Enums

### DataCategoryEnum

| Value | Description |
|-------|-------------|
| Exon Expression Quantification | Exon expression quantification |
| Gene Expression | Gene expression data |
| Gene Expression Quantification | Gene expression quantification |
| Isoform Expression Quantification | Isoform expression quantification |
| Other | Other data category |
| Splice Junction Quantification | Splice junction quantification |
| Transcript Expression | Transcript expression data |

### DissociationMethodEnum

| Value | Description |
|-------|-------------|
| Enzymatic | Enzymatic dissociation method |
| Mechanical | Mechanical dissociation method |
| Other | Other dissociation method |
| Unknown | Unknown dissociation method |

### GenomicReferenceEnum

Genomic or transcriptomic reference assembly used for alignment

| Value | Description |
|-------|-------------|
| GRCh37 | Genome Reference Consortium human build 37 |
| GRCh37.p13 | GRCh37 patch release 13 |
| GRCh38 | Genome Reference Consortium human build 38 |
| GRCh38.p13 | GRCh38 patch release 13 |
| GRCh38.p14 | GRCh38 patch release 14 |
| hg19 | UCSC human genome reference hg19 |
| hg38 | UCSC human genome reference hg38 |
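
Enumerations such as this one map naturally onto a Python `Enum`. The sketch below shows one way to test whether a submitted value is among the valid ones. The class, member, and function names are illustrative, and exact case-sensitive matching is an assumption.

```python
from enum import Enum
from typing import Optional


class GenomicReference(Enum):
    """Values copied from the GenomicReferenceEnum table above."""
    GRCH37 = "GRCh37"
    GRCH37_P13 = "GRCh37.p13"
    GRCH38 = "GRCh38"
    GRCH38_P13 = "GRCh38.p13"
    GRCH38_P14 = "GRCh38.p14"
    HG19 = "hg19"
    HG38 = "hg38"


def parse_genomic_reference(value: str) -> Optional[GenomicReference]:
    """Return the matching enum member, or None if the value is not listed."""
    try:
        return GenomicReference(value)  # lookup by value, case-sensitive
    except ValueError:
        return None


print(parse_genomic_reference("GRCh38.p14"))
print(parse_genomic_reference("T2T-CHM13"))
```

A `None` result corresponds to the case where the reference is not among the valid values and the data liaison should be contacted.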

### LibraryConstructionMethodEnum

| Value | Description |
|-------|-------------|
| 10X Genomics | 10X Genomics library construction method |
| Drop-seq | Drop-seq library construction method |
| Fluidigm C1 | Fluidigm C1 library construction method |
| InDrop | InDrop library construction method |
| Other | Other library construction method |
| Smart-seq | Smart-seq library construction method |
| Unknown | Unknown library construction method |

### LibraryLayoutEnum

| Value | Description |
|-------|-------------|
| Paired-end | Paired-end sequencing |
| Single-end | Single-end sequencing |

### MatrixTypeEnum

| Value | Description |
|-------|-------------|
| Batch Corrected Counts | Batch corrected count matrix |
| Normalized Counts | Normalized count matrix |
| Raw Counts | Raw count matrix |
| Scaled Counts | Scaled count matrix |

### NucleicAcidSourceEnum

| Value | Description |
|-------|-------------|
| DNA | DNA nucleic acid source |
| RNA | RNA nucleic acid source |
| Unknown | Unknown nucleic acid source |

### ReadIndicatorEnum

| Value | Description |
|-------|-------------|
| Forward | Forward read indicator |
| Index | Index read indicator |
| Reverse | Reverse read indicator |
| Unknown | Unknown read indicator |

### ReverseTranscriptionPrimerEnum

| Value | Description |
|-------|-------------|
| Oligo-dT | Oligo-dT reverse transcription primer |
| Random Hexamer | Random hexamer reverse transcription primer |
| Unknown | Unknown reverse transcription primer |

### SequencingPlatformEnum

| Value | Description |
|-------|-------------|
| ABI_SOLID | ABI SOLID sequencing platform |
| BGISEQ | BGI sequencing platform |
| CAPILLARY | Capillary sequencing platform |
| COMPLETE_GENOMICS | Complete Genomics sequencing platform |
| HELICOS | Helicos sequencing platform |
| ILLUMINA | Illumina sequencing platform |
| ION_TORRENT | Ion Torrent sequencing platform |
| LS454 | 454 sequencing platform |
| OXFORD_NANOPORE | Oxford Nanopore sequencing platform |
| PACBIO_SMRT | PacBio SMRT sequencing platform |

### SingleCellIsolationMethodEnum

| Value | Description |
|-------|-------------|
| Cell Sorting | Cell sorting isolation method |
| Droplet-based | Droplet-based isolation method |
| Manual Picking | Manual picking isolation method |
| Microfluidics | Microfluidics isolation method |
| Other | Other isolation method |
| Unknown | Unknown isolation method |

### SpikeInEnum

| Value | Description |
|-------|-------------|
| ERCC | ERCC spike-in |
| None | No spike-in |
| Other | Other spike-in |
| Unknown | Unknown spike-in |

### scRNAseqWorkflowTypeEnumLevel2

| Value | Description |
|-------|-------------|
| CellRanger | CellRanger workflow |
| HCA Optimus | HCA Optimus workflow |
| Other | Other workflow |
| SEQC | SEQC workflow |
| STARsolo | STARsolo workflow |
| Unknown | Unknown workflow |
| dropEST | dropEST workflow |

### scRNAseqWorkflowTypeEnumLevel3and4

| Value | Description |
|-------|-------------|
| Cell annotation | Cell annotation workflow |
| CellRanger | 10x Genomics CellRanger workflow |
| Cufflinks | Cufflinks workflow |
| DEXSeq | DEXSeq workflow |
| Differentiation trajectory analysis | Differentiation trajectory analysis workflow |
| HCA Optimus | Human Cell Atlas Optimus workflow |
| HTSeq - FPKM | HTSeq FPKM workflow |
| Other | Other workflow type |
| SEQC | SEQC workflow |
| STARsolo | STARsolo alignment workflow |
| dropEST | dropEST workflow |