WES
HTAN Whole Exome Sequencing Data Model Schema
CoreFileAttributes
Universal attributes that apply to all file-based data in HTAN
| Attribute | Type | Required | Description |
|---|---|---|---|
| | string | Yes | Name of the file |
| | string | Yes | Format of the file (e.g., fastq, bam, vcf, h5ad) |
| | string | Yes | HTAN Data File ID (Primary Key) |
| | string | Yes | HTAN Parent ID(s) - Foreign key(s) to parent entity (B for Biospecimen, D for data file). One or more IDs; for aggregated files provide multiple. Each ID must have B or D suffix. Supports HTA200-229 for phase 2. |
BaseSequencingAttributes
Minimal base attributes shared across all sequencing types
| Attribute | Type | Required | Description |
|---|---|---|---|
| | string | No | Checksum for data integrity verification |
| | string | Yes | Name of the file |
| | string | Yes | Format of the file (e.g., fastq, bam, vcf, h5ad) |
| | string | Yes | HTAN Data File ID (Primary Key) |
| | string | Yes | HTAN Parent ID(s) - Foreign key(s) to parent entity (B for Biospecimen, D for data file). One or more IDs; for aggregated files provide multiple. Each ID must have B or D suffix. Supports HTA200-229 for phase 2. |
BaseSequencingLevel1Attributes
Level 1 attributes - sequencing run and library (raw data)
| Attribute | Type | Required | Description |
|---|---|---|---|
| | | Yes | Library layout (paired-end or single-end) |
| | | Yes | Sequencing platform used |
| | string | No | Sequencing batch identifier |
| | integer | No | Number of days between when the sample for assay was received in the lab and when the libraries were prepared for sequencing. If not applicable, enter ‘Not Applicable’. |
| | string | No | Technical replicate group identifier |
| | string | No | Link to sequencing protocol |
| | string | No | Checksum for data integrity verification |
| | string | Yes | Name of the file |
| | string | Yes | Format of the file (e.g., fastq, bam, vcf, h5ad) |
| | string | Yes | HTAN Data File ID (Primary Key) |
| | string | Yes | HTAN Parent ID(s) - Foreign key(s) to parent entity (B for Biospecimen, D for data file). One or more IDs; for aggregated files provide multiple. Each ID must have B or D suffix. Supports HTA200-229 for phase 2. |
BaseSequencingLevel2Attributes
Level 2 attributes - alignment and alignment workflow
| Attribute | Type | Required | Description |
|---|---|---|---|
| | | Yes | Genomic or transcriptomic reference assembly used for alignment. If your genome reference is not among the valid values, please contact your data liaison. |
| | string | Yes | URL to genomic or transcriptomic reference |
| | string | Yes | URL to genome or transcriptome annotation |
| | string | Yes | Major version of the workflow, or ‘Not applicable’ when no workflow version applies. |
| | string | Yes | Link to workflow or command. Dockstore.org is recommended. |
| | | Yes | Library layout (paired-end or single-end) |
| | | Yes | Sequencing platform used |
| | string | No | Sequencing batch identifier |
| | integer | No | Number of days between when the sample for assay was received in the lab and when the libraries were prepared for sequencing. If not applicable, enter ‘Not Applicable’. |
| | string | No | Technical replicate group identifier |
| | string | No | Link to sequencing protocol |
| | string | No | Checksum for data integrity verification |
| | string | Yes | Name of the file |
| | string | Yes | Format of the file (e.g., fastq, bam, vcf, h5ad) |
| | string | Yes | HTAN Data File ID (Primary Key) |
| | string | Yes | HTAN Parent ID(s) - Foreign key(s) to parent entity (B for Biospecimen, D for data file). One or more IDs; for aggregated files provide multiple. Each ID must have B or D suffix. Supports HTA200-229 for phase 2. |
BaseSequencingLevel3Attributes
Level 3+ attributes - inherits alignment and workflow; used for processed/analysis levels
| Attribute | Type | Required | Description |
|---|---|---|---|
| | | Yes | Genomic or transcriptomic reference assembly used for alignment. If your genome reference is not among the valid values, please contact your data liaison. |
| | string | Yes | URL to genomic or transcriptomic reference |
| | string | Yes | URL to genome or transcriptome annotation |
| | string | Yes | Major version of the workflow, or ‘Not applicable’ when no workflow version applies. |
| | string | Yes | Link to workflow or command. Dockstore.org is recommended. |
| | | Yes | Library layout (paired-end or single-end) |
| | | Yes | Sequencing platform used |
| | string | No | Sequencing batch identifier |
| | integer | No | Number of days between when the sample for assay was received in the lab and when the libraries were prepared for sequencing. If not applicable, enter ‘Not Applicable’. |
| | string | No | Technical replicate group identifier |
| | string | No | Link to sequencing protocol |
| | string | No | Checksum for data integrity verification |
| | string | Yes | Name of the file |
| | string | Yes | Format of the file (e.g., fastq, bam, vcf, h5ad) |
| | string | Yes | HTAN Data File ID (Primary Key) |
| | string | Yes | HTAN Parent ID(s) - Foreign key(s) to parent entity (B for Biospecimen, D for data file). One or more IDs; for aggregated files provide multiple. Each ID must have B or D suffix. Supports HTA200-229 for phase 2. |
BulkWESLevel1
Bulk Whole Exome Sequencing Level 1 - Raw files
| Attribute | Type | Required | Description |
|---|---|---|---|
| | string | Yes | Format of the raw sequencing file (fastq or fastq.gz) |
| | string | Yes | Name of the file. Must end with an extension matching the FILE_FORMAT (.fastq for fastq; .fastq.gz or .fq.gz for fastq.gz) |
| | string | No | Read indicator |
| | | Yes | Method used for library selection |
| | integer | Yes | Read length in base pairs |
| | string | No | Target capture kit used |
| | string | No | Name of the library preparation kit |
| | string | No | Vendor of the library preparation kit |
| | string | No | Version of the library preparation kit |
| | string | No | Name of the adapter used |
| | string | No | Adapter sequence |
| | string | No | Name of the base caller |
| | string | No | Version of the base caller |
| | string | No | Flow cell barcode |
| | integer | No | Maximum fragment length |
| | integer | No | Mean fragment length |
| | integer | No | Minimum fragment length |
| | integer | No | Standard deviation of fragment length |
| | integer | No | Lane number |
| | string | No | Multiplex barcode |
| | integer | No | Days from index for library preparation |
| | string | No | Size selection range |
| | integer | No | Target sequencing depth |
| | boolean | No | Whether to trim adapter sequence |
| | | Yes | Library layout (paired-end or single-end) |
| | | Yes | Sequencing platform used |
| | string | No | Sequencing batch identifier |
| | string | No | Technical replicate group identifier |
| | string | No | Link to sequencing protocol |
| | string | No | Checksum for data integrity verification |
| | string | Yes | HTAN Data File ID (Primary Key) |
| | string | Yes | HTAN Parent ID(s) - Foreign key(s) to parent entity (B for Biospecimen, D for data file). One or more IDs; for aggregated files provide multiple. Each ID must have B or D suffix. Supports HTA200-229 for phase 2. |
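The file-name rule for Level 1 (the name must end with an extension matching FILE_FORMAT) can be checked mechanically. The following is a minimal sketch, not part of the HTAN tooling: the function name `filename_matches_format` and the mapping `ALLOWED_EXTENSIONS` are hypothetical, and the comparison is assumed to be case-sensitive.

```python
# Allowed file-name endings per FILE_FORMAT value, copied from the
# Level 1 table. The dictionary and function names are illustrative only.
ALLOWED_EXTENSIONS = {
    "fastq": (".fastq",),
    "fastq.gz": (".fastq.gz", ".fq.gz"),
}


def filename_matches_format(filename: str, file_format: str) -> bool:
    """Return True if `filename` ends with an extension allowed for `file_format`."""
    extensions = ALLOWED_EXTENSIONS.get(file_format)
    if extensions is None:
        # Not a Level 1 format listed in the table.
        return False
    # str.endswith accepts a tuple of suffixes; matching is case-sensitive
    # here, which is an assumption rather than a documented rule.
    return filename.endswith(extensions)
```

For example, `filename_matches_format("sample_R1.fq.gz", "fastq.gz")` passes, while a gzipped file declared as plain `fastq` does not.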
BulkWESLevel2
Bulk Whole Exome Sequencing Level 2 - Reads mapped to the genome and alignment QC
| Attribute | Type | Required | Description |
|---|---|---|---|
| | string | Yes | Format of the aligned file (bam or cram) |
| | string | Yes | Name of the file. Must end with an extension matching the FILE_FORMAT (.bam for bam; .cram for cram) |
| | string | Yes | Type of alignment workflow used |
| | string | No | Name of the index file |
| | float | No | Average base quality |
| | integer | No | Average insert size |
| | integer | No | Average read length |
| | float | No | Contamination estimate |
| | float | No | Contamination error estimate |
| | float | Yes | Mean coverage depth |
| | string | No | Adapter content information |
| | string | No | Basic statistics from QC |
| | string | No | Encoding information |
| | string | No | Overrepresented sequences |
| | string | No | Per base N content |
| | string | No | Per base sequence content |
| | string | No | Per base sequence quality |
| | string | No | Per sequence GC content |
| | string | No | Per sequence quality score |
| | string | No | Per tile sequence quality |
| | float | No | Percent GC content |
| | string | No | Sequence duplication levels |
| | string | No | Sequence length distribution |
| | string | No | QC workflow type |
| | string | No | QC workflow version |
| | string | No | Link to QC workflow |
| | integer | No | Number of read pairs on different chromosomes |
| | integer | Yes | Total number of reads |
| | integer | Yes | Total number of uniquely mapped reads |
| | integer | Yes | Total number of unmapped reads |
| | float | No | Proportion of duplicated reads |
| | float | Yes | Proportion of mapped reads |
| | float | No | Proportion of targets with no coverage |
| | float | No | Proportion of base mismatches |
| | integer | No | Number of short reads |
| | float | No | Proportion of coverage at 10x |
| | float | No | Proportion of coverage at 30x |
| | boolean | No | Whether this is the lowest level |
| | | Yes | Genomic or transcriptomic reference assembly used for alignment. If your genome reference is not among the valid values, please contact your data liaison. |
| | string | Yes | URL to genomic or transcriptomic reference |
| | string | Yes | URL to genome or transcriptome annotation |
| | string | Yes | Major version of the workflow, or ‘Not applicable’ when no workflow version applies. |
| | string | Yes | Link to workflow or command. Dockstore.org is recommended. |
| | | Yes | Library layout (paired-end or single-end) |
| | | Yes | Sequencing platform used |
| | string | No | Sequencing batch identifier |
| | integer | No | Number of days between when the sample for assay was received in the lab and when the libraries were prepared for sequencing. If not applicable, enter ‘Not Applicable’. |
| | string | No | Technical replicate group identifier |
| | string | No | Link to sequencing protocol |
| | string | No | Checksum for data integrity verification |
| | string | Yes | HTAN Data File ID (Primary Key) |
| | string | Yes | HTAN Parent ID(s) - Foreign key(s) to parent entity (B for Biospecimen, D for data file). One or more IDs; for aggregated files provide multiple. Each ID must have B or D suffix. Supports HTA200-229 for phase 2. |
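The required Level 2 read metrics (total reads, uniquely mapped reads, unmapped reads, proportion of mapped reads) lend themselves to a quick plausibility check before submission. The sketch below is illustrative only: the schema itself does not state these constraints, the field names are hypothetical because the attribute names are not shown in the table, and the proportion is assumed to be a fraction between 0 and 1.

```python
from dataclasses import dataclass


@dataclass
class AlignmentCounts:
    # Hypothetical field names; the schema's real attribute names may differ.
    total_reads: int
    uniquely_mapped_reads: int
    unmapped_reads: int
    proportion_mapped: float


def check_alignment_counts(counts: AlignmentCounts) -> list[str]:
    """Return a list of plausibility problems (empty if none were found)."""
    problems = []
    if min(counts.total_reads, counts.uniquely_mapped_reads, counts.unmapped_reads) < 0:
        problems.append("read counts must be non-negative")
    # Uniquely mapped and unmapped reads are disjoint subsets of all reads,
    # assuming all three values count the same unit (reads, not pairs).
    if counts.uniquely_mapped_reads + counts.unmapped_reads > counts.total_reads:
        problems.append("uniquely mapped + unmapped reads exceed total reads")
    # Assumes the proportion is expressed as a fraction, not a percentage.
    if not 0.0 <= counts.proportion_mapped <= 1.0:
        problems.append("proportion of mapped reads is outside [0, 1]")
    return problems
```

A record that passes returns an empty list; each violated assumption adds one message, so the caller can decide whether to warn or reject.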
BulkWESLevel3
Bulk Whole Exome Sequencing Level 3 - Called variants and MSI analysis
| Attribute | Type | Required | Description |
|---|---|---|---|
| | string | Yes | Format of the variant file (vcf or vcf.gz) |
| | string | Yes | Name of the file. Must end with an extension matching the FILE_FORMAT (.vcf for vcf; .vcf.gz for vcf.gz) |
| | string | No | URL to the germline variants workflow |
| | string | No | Type of germline variants workflow |
| | string | No | URL to the somatic variants workflow |
| | string | No | Type of somatic variants workflow |
| | | No | Type of sample for somatic variants |
| | string | No | URL to the structural variant workflow |
| | string | No | Type of structural variant workflow |
| | string | No | Link to MSI workflow |
| | float | No | MSI score |
| | | No | MSI status |
| | | Yes | Genomic or transcriptomic reference assembly used for alignment. If your genome reference is not among the valid values, please contact your data liaison. |
| | string | Yes | URL to genomic or transcriptomic reference |
| | string | Yes | URL to genome or transcriptome annotation |
| | string | Yes | Major version of the workflow, or ‘Not applicable’ when no workflow version applies. |
| | string | Yes | Link to workflow or command. Dockstore.org is recommended. |
| | | Yes | Library layout (paired-end or single-end) |
| | | Yes | Sequencing platform used |
| | string | No | Sequencing batch identifier |
| | integer | No | Number of days between when the sample for assay was received in the lab and when the libraries were prepared for sequencing. If not applicable, enter ‘Not Applicable’. |
| | string | No | Technical replicate group identifier |
| | string | No | Link to sequencing protocol |
| | string | No | Checksum for data integrity verification |
| | string | Yes | HTAN Data File ID (Primary Key) |
| | string | Yes | HTAN Parent ID(s) - Foreign key(s) to parent entity (B for Biospecimen, D for data file). One or more IDs; for aggregated files provide multiple. Each ID must have B or D suffix. Supports HTA200-229 for phase 2. |
Enums
GenomicReferenceEnum
Genomic or transcriptomic reference assembly used for alignment
| Value | Description |
|---|---|
| GRCh37 | Genome Reference Consortium human build 37 |
| GRCh37.p13 | GRCh37 patch release 13 |
| GRCh38 | Genome Reference Consortium human build 38 |
| GRCh38.p13 | GRCh38 patch release 13 |
| GRCh38.p14 | GRCh38 patch release 14 |
| hg19 | UCSC human genome reference hg19 |
| hg38 | UCSC human genome reference hg38 |
LibraryLayoutEnum
| Value | Description |
|---|---|
| Paired-end | Paired-end sequencing |
| Single-end | Single-end sequencing |
LibrarySelectionMethodEnum
| Value | Description |
|---|---|
| Hybrid Selection | Hybrid selection method |
| PCR | PCR-based selection |
| RANDOM | Random selection |
| other | Other selection method |
MSIStatusEnum
| Value | Description |
|---|---|
| MSI-H | High microsatellite instability |
| MSI-L | Low microsatellite instability |
| MSS | Microsatellite stable |
| Unknown | Unknown MSI status |
SomaticVariantsSampleTypeEnum
| Value | Description |
|---|---|
| Metastatic | Metastatic tumor sample |
| Normal | Normal tissue sample |
| Other | Other sample type |
| Primary | Primary tumor sample |
| Recurrent | Recurrent tumor sample |
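The enumerations above can be mirrored in code so that metadata values are rejected early if they are not among the valid values. This is a minimal sketch covering two of the enums: the class names, member names, and the helper `parse_enum` are hypothetical, and exact, case-sensitive matching of the listed values is an assumption.

```python
from enum import Enum


class LibraryLayout(Enum):
    # Values copied from LibraryLayoutEnum; member names are illustrative.
    PAIRED_END = "Paired-end"
    SINGLE_END = "Single-end"


class MSIStatus(Enum):
    # Values copied from MSIStatusEnum; member names are illustrative.
    MSI_H = "MSI-H"
    MSI_L = "MSI-L"
    MSS = "MSS"
    UNKNOWN = "Unknown"


def parse_enum(enum_cls, raw: str):
    """Convert a raw metadata string to an enum member, or raise ValueError."""
    try:
        # Enum lookup by value is exact, so "paired-end" would not match.
        return enum_cls(raw)
    except ValueError:
        allowed = ", ".join(member.value for member in enum_cls)
        raise ValueError(
            f"{raw!r} is not a valid {enum_cls.__name__}; expected one of: {allowed}"
        ) from None
```

Calling `parse_enum(MSIStatus, "MSI-H")` returns the matching member, and an unlisted value raises an error that names the allowed values. The other enums can be added the same way.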