WES

HTAN Whole Exome Sequencing Data Model Schema

CoreFileAttributes

Universal attributes that apply to all file-based data in HTAN

Attribute

Type

Required

Description

FILENAME

string

Yes

Name of the file

FILE_FORMAT

string

Yes

Format of the file (e.g., fastq, bam, vcf, h5ad)

HTAN_DATA_FILE_ID

string

Yes

HTAN Data File ID (Primary Key)

HTAN_PARENT_ID

string

Yes

HTAN Parent ID(s) - Foreign key(s) to parent entity (B for Biospecimen, D for data file). One or more IDs; for aggregated files provide multiple. Each ID must have B or D suffix. Supports HTA200-229 for phase 2.

BaseSequencingAttributes

Minimal base attributes shared across all sequencing types

Attribute

Type

Required

Description

CHECKSUM

string

No

Checksum for data integrity verification

FILENAME

string

Yes

Name of the file

FILE_FORMAT

string

Yes

Format of the file (e.g., fastq, bam, vcf, h5ad)

HTAN_DATA_FILE_ID

string

Yes

HTAN Data File ID (Primary Key)

HTAN_PARENT_ID

string

Yes

HTAN Parent ID(s) - Foreign key(s) to parent entity (B for Biospecimen, D for data file). One or more IDs; for aggregated files provide multiple. Each ID must have B or D suffix. Supports HTA200-229 for phase 2.

BaseSequencingLevel1Attributes

Level 1 attributes - sequencing run and library (raw data)

Attribute

Type

Required

Description

LIBRARY_LAYOUT

LibraryLayoutEnum

Yes

Library layout (paired-end or single-end)

SEQUENCING_PLATFORM

SequencingPlatformEnum

Yes

Sequencing platform used

SEQUENCING_BATCH_ID

string

No

Sequencing batch identifier

LIBRARY_PREPARATION_DAYS_FROM_INDEX

integer

No

Number of days between when the sample for assay was received in the lab and when the libraries were prepared for sequencing. If not applicable, enter ‘Not Applicable’.

TECHNICAL_REPLICATE_GROUP

string

No

Technical replicate group identifier

PROTOCOL_LINK

string

No

Link to sequencing protocol

CHECKSUM

string

No

Checksum for data integrity verification

FILENAME

string

Yes

Name of the file

FILE_FORMAT

string

Yes

Format of the file (e.g., fastq, bam, vcf, h5ad)

HTAN_DATA_FILE_ID

string

Yes

HTAN Data File ID (Primary Key)

HTAN_PARENT_ID

string

Yes

HTAN Parent ID(s) - Foreign key(s) to parent entity (B for Biospecimen, D for data file). One or more IDs; for aggregated files provide multiple. Each ID must have B or D suffix. Supports HTA200-229 for phase 2.

BaseSequencingLevel2Attributes

Level 2 attributes - alignment and alignment workflow

Attribute

Type

Required

Description

GENOMIC_REFERENCE

GenomicReferenceEnum

Yes

Genomic or transcriptomic reference assembly used for alignment. If your genome reference is not among the valid values, please contact your data liaison.

GENOMIC_REFERENCE_URL

string

Yes

URL to genomic or transcriptomic reference

GENOME_ANNOTATION_URL

string

Yes

URL to genome or transcriptome annotation

WORKFLOW_VERSION

string

Yes

Major version of the workflow, or ‘Not applicable’ when no workflow version applies.

WORKFLOW_LINK

string

Yes

Link to workflow or command. Dockstore.org is recommended.

LIBRARY_LAYOUT

LibraryLayoutEnum

Yes

Library layout (paired-end or single-end)

SEQUENCING_PLATFORM

SequencingPlatformEnum

Yes

Sequencing platform used

SEQUENCING_BATCH_ID

string

No

Sequencing batch identifier

LIBRARY_PREPARATION_DAYS_FROM_INDEX

integer

No

Number of days between when the sample for assay was received in the lab and when the libraries were prepared for sequencing. If not applicable, enter ‘Not Applicable’.

TECHNICAL_REPLICATE_GROUP

string

No

Technical replicate group identifier

PROTOCOL_LINK

string

No

Link to sequencing protocol

CHECKSUM

string

No

Checksum for data integrity verification

FILENAME

string

Yes

Name of the file

FILE_FORMAT

string

Yes

Format of the file (e.g., fastq, bam, vcf, h5ad)

HTAN_DATA_FILE_ID

string

Yes

HTAN Data File ID (Primary Key)

HTAN_PARENT_ID

string

Yes

HTAN Parent ID(s) - Foreign key(s) to parent entity (B for Biospecimen, D for data file). One or more IDs; for aggregated files provide multiple. Each ID must have B or D suffix. Supports HTA200-229 for phase 2.

BaseSequencingLevel3Attributes

Level 3+ attributes - inherits alignment and workflow; used for processed/analysis levels

Attribute

Type

Required

Description

GENOMIC_REFERENCE

GenomicReferenceEnum

Yes

Genomic or transcriptomic reference assembly used for alignment. If your genome reference is not among the valid values, please contact your data liaison.

GENOMIC_REFERENCE_URL

string

Yes

URL to genomic or transcriptomic reference

GENOME_ANNOTATION_URL

string

Yes

URL to genome or transcriptome annotation

WORKFLOW_VERSION

string

Yes

Major version of the workflow, or ‘Not applicable’ when no workflow version applies.

WORKFLOW_LINK

string

Yes

Link to workflow or command. Dockstore.org is recommended.

LIBRARY_LAYOUT

LibraryLayoutEnum

Yes

Library layout (paired-end or single-end)

SEQUENCING_PLATFORM

SequencingPlatformEnum

Yes

Sequencing platform used

SEQUENCING_BATCH_ID

string

No

Sequencing batch identifier

LIBRARY_PREPARATION_DAYS_FROM_INDEX

integer

No

Number of days between when the sample for assay was received in the lab and when the libraries were prepared for sequencing. If not applicable, enter ‘Not Applicable’.

TECHNICAL_REPLICATE_GROUP

string

No

Technical replicate group identifier

PROTOCOL_LINK

string

No

Link to sequencing protocol

CHECKSUM

string

No

Checksum for data integrity verification

FILENAME

string

Yes

Name of the file

FILE_FORMAT

string

Yes

Format of the file (e.g., fastq, bam, vcf, h5ad)

HTAN_DATA_FILE_ID

string

Yes

HTAN Data File ID (Primary Key)

HTAN_PARENT_ID

string

Yes

HTAN Parent ID(s) - Foreign key(s) to parent entity (B for Biospecimen, D for data file). One or more IDs; for aggregated files provide multiple. Each ID must have B or D suffix. Supports HTA200-229 for phase 2.
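The base tables above are cumulative: each level repeats the attributes of the level before it and adds its own. The layering of the attributes marked Required = Yes can be sketched in Python as below. The function name `missing_required` and the choice to represent a record as a plain dict are assumptions made for illustration; they are not part of any HTAN tooling.

```python
# Required attributes per base class, copied from the tables above.
CORE_REQUIRED = {"FILENAME", "FILE_FORMAT", "HTAN_DATA_FILE_ID", "HTAN_PARENT_ID"}

# BaseSequencingAttributes adds only CHECKSUM, which is optional.
BASE_REQUIRED = set(CORE_REQUIRED)

# Level 1 adds the library layout and the sequencing platform.
LEVEL1_REQUIRED = BASE_REQUIRED | {"LIBRARY_LAYOUT", "SEQUENCING_PLATFORM"}

# Level 2 adds the reference and workflow attributes.
LEVEL2_REQUIRED = LEVEL1_REQUIRED | {
    "GENOMIC_REFERENCE",
    "GENOMIC_REFERENCE_URL",
    "GENOME_ANNOTATION_URL",
    "WORKFLOW_VERSION",
    "WORKFLOW_LINK",
}

# The Level 3 table lists the same attributes as Level 2.
LEVEL3_REQUIRED = set(LEVEL2_REQUIRED)

REQUIRED_BY_CLASS = {
    "CoreFileAttributes": CORE_REQUIRED,
    "BaseSequencingAttributes": BASE_REQUIRED,
    "BaseSequencingLevel1Attributes": LEVEL1_REQUIRED,
    "BaseSequencingLevel2Attributes": LEVEL2_REQUIRED,
    "BaseSequencingLevel3Attributes": LEVEL3_REQUIRED,
}


def missing_required(record, class_name):
    """Return the sorted required attribute names that are absent or empty."""
    required = REQUIRED_BY_CLASS[class_name]
    return sorted(name for name in required if record.get(name) in (None, ""))
```

Calling `missing_required(row, "BaseSequencingLevel2Attributes")` on a manifest row returns an empty list when every required attribute has a value.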

BulkWESLevel1

Bulk Whole Exome Sequencing Level 1 - Raw files

Attribute

Type

Required

Description

FILE_FORMAT

string

Yes

Format of the raw sequencing file (fastq or fastq.gz)

FILENAME

string

Yes

Name of the file. Must end with an extension matching the FILE_FORMAT (.fastq for fastq; .fastq.gz or .fq.gz for fastq.gz)

READ_INDICATOR

string

No

Read indicator

LIBRARY_SELECTION_METHOD

LibrarySelectionMethodEnum

Yes

Method used for library selection

READ_LENGTH

integer

Yes

Read length in base pairs

TARGET_CAPTURE_KIT

string

No

Target capture kit used

LIBRARY_PREPARATION_KIT_NAME

string

No

Name of the library preparation kit

LIBRARY_PREPARATION_KIT_VENDOR

string

No

Vendor of the library preparation kit

LIBRARY_PREPARATION_KIT_VERSION

string

No

Version of the library preparation kit

ADAPTER_NAME

string

No

Name of the adapter used

ADAPTER_SEQUENCE

string

No

Adapter sequence

BASE_CALLER_NAME

string

No

Name of the base caller

BASE_CALLER_VERSION

string

No

Version of the base caller

FLOW_CELL_BARCODE

string

No

Flow cell barcode

FRAGMENT_MAXIMUM_LENGTH

integer

No

Maximum fragment length

FRAGMENT_MEAN_LENGTH

integer

No

Mean fragment length

FRAGMENT_MINIMUM_LENGTH

integer

No

Minimum fragment length

FRAGMENT_STANDARD_DEVIATION_LENGTH

integer

No

Standard deviation of fragment length

LANE_NUMBER

integer

No

Lane number

MULTIPLEX_BARCODE

string

No

Multiplex barcode

LIBRARY_PREPARATION_DAYS_FROM_INDEX

integer

No

Number of days between when the sample for assay was received in the lab and when the libraries were prepared for sequencing.

SIZE_SELECTION_RANGE

string

No

Size selection range

TARGET_DEPTH

integer

No

Target sequencing depth

TO_TRIM_ADAPTER_SEQUENCE

boolean

No

Whether to trim adapter sequence

LIBRARY_LAYOUT

LibraryLayoutEnum

Yes

Library layout (paired-end or single-end)

SEQUENCING_PLATFORM

SequencingPlatformEnum

Yes

Sequencing platform used

SEQUENCING_BATCH_ID

string

No

Sequencing batch identifier

TECHNICAL_REPLICATE_GROUP

string

No

Technical replicate group identifier

PROTOCOL_LINK

string

No

Link to sequencing protocol

CHECKSUM

string

No

Checksum for data integrity verification

HTAN_DATA_FILE_ID

string

Yes

HTAN Data File ID (Primary Key)

HTAN_PARENT_ID

string

Yes

HTAN Parent ID(s) - Foreign key(s) to parent entity (B for Biospecimen, D for data file). One or more IDs; for aggregated files provide multiple. Each ID must have B or D suffix. Supports HTA200-229 for phase 2.

BulkWESLevel2

Bulk Whole Exome Sequencing Level 2 - Reads mapped to the genome and alignment QC

Attribute

Type

Required

Description

FILE_FORMAT

string

Yes

Format of the aligned file (bam or cram)

FILENAME

string

Yes

Name of the file. Must end with an extension matching the FILE_FORMAT (.bam for bam; .cram for cram)

ALIGNMENT_WORKFLOW_TYPE

string

Yes

Type of alignment workflow used

INDEX_FILE_NAME

string

No

Name of the index file

AVERAGE_BASE_QUALITY

float

No

Average base quality

AVERAGE_INSERT_SIZE

integer

No

Average insert size

AVERAGE_READ_LENGTH

integer

No

Average read length

CONTAMINATION

float

No

Contamination estimate

CONTAMINATION_ERROR

float

No

Contamination error estimate

MEAN_COVERAGE

float

Yes

Mean coverage depth

ADAPTER_CONTENT

string

No

Adapter content information

BASIC_STATISTICS

string

No

Basic statistics from QC

ENCODING

string

No

Encoding information

OVERREPRESENTED_SEQUENCES

string

No

Overrepresented sequences

PER_BASE_N_CONTENT

string

No

Per base N content

PER_BASE_SEQUENCE_CONTENT

string

No

Per base sequence content

PER_BASE_SEQUENCE_QUALITY

string

No

Per base sequence quality

PER_SEQUENCE_GC_CONTENT

string

No

Per sequence GC content

PER_SEQUENCE_QUALITY_SCORE

string

No

Per sequence quality score

PER_TILE_SEQUENCE_QUALITY

string

No

Per tile sequence quality

PERCENT_GC_CONTENT

float

No

Percent GC content

SEQUENCE_DUPLICATION_LEVELS

string

No

Sequence duplication levels

SEQUENCE_LENGTH_DISTRIBUTION

string

No

Sequence length distribution

QC_WORKFLOW_TYPE

string

No

QC workflow type

QC_WORKFLOW_VERSION

string

No

QC workflow version

QC_WORKFLOW_LINK

string

No

Link to QC workflow

PAIRS_ON_DIFF_CHR

integer

No

Number of read pairs on different chromosomes

TOTAL_READS

integer

Yes

Total number of reads

TOTAL_UNIQUELY_MAPPED

integer

Yes

Total number of uniquely mapped reads

TOTAL_UNMAPPED_READS

integer

Yes

Total number of unmapped reads

PROPORTION_READS_DUPLICATED

float

No

Proportion of duplicated reads

PROPORTION_READS_MAPPED

float

Yes

Proportion of mapped reads

PROPORTION_TARGETS_NO_COVERAGE

float

No

Proportion of targets with no coverage

PROPORTION_BASE_MISMATCH

float

No

Proportion of base mismatches

SHORT_READS

integer

No

Number of short reads

PROPORTION_COVERAGE_10X

float

No

Proportion of coverage at 10x

PROPORTION_COVERAGE_30X

float

No

Proportion of coverage at 30x

IS_LOWEST_LEVEL

boolean

No

Whether this is the lowest level

GENOMIC_REFERENCE

GenomicReferenceEnum

Yes

Genomic or transcriptomic reference assembly used for alignment. If your genome reference is not among the valid values, please contact your data liaison.

GENOMIC_REFERENCE_URL

string

Yes

URL to genomic or transcriptomic reference

GENOME_ANNOTATION_URL

string

Yes

URL to genome or transcriptome annotation

WORKFLOW_VERSION

string

Yes

Major version of the workflow, or ‘Not applicable’ when no workflow version applies.

WORKFLOW_LINK

string

Yes

Link to workflow or command. Dockstore.org is recommended.

LIBRARY_LAYOUT

LibraryLayoutEnum

Yes

Library layout (paired-end or single-end)

SEQUENCING_PLATFORM

SequencingPlatformEnum

Yes

Sequencing platform used

SEQUENCING_BATCH_ID

string

No

Sequencing batch identifier

LIBRARY_PREPARATION_DAYS_FROM_INDEX

integer

No

Number of days between when the sample for assay was received in the lab and when the libraries were prepared for sequencing. If not applicable, enter ‘Not Applicable’.

TECHNICAL_REPLICATE_GROUP

string

No

Technical replicate group identifier

PROTOCOL_LINK

string

No

Link to sequencing protocol

CHECKSUM

string

No

Checksum for data integrity verification

HTAN_DATA_FILE_ID

string

Yes

HTAN Data File ID (Primary Key)

HTAN_PARENT_ID

string

Yes

HTAN Parent ID(s) - Foreign key(s) to parent entity (B for Biospecimen, D for data file). One or more IDs; for aggregated files provide multiple. Each ID must have B or D suffix. Supports HTA200-229 for phase 2.
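The BulkWESLevel2 table declares a type for each QC metric, and five of the metrics are required. The sketch below converts those five values from text to their declared types and reports anything missing or unparseable. It assumes the manifest row arrives as a dict of strings, as it would from a CSV reader; the function name `parse_required_metrics` is hypothetical.

```python
# Declared types of the Level 2 metrics marked Required = Yes in the table above.
REQUIRED_METRIC_TYPES = {
    "MEAN_COVERAGE": float,
    "TOTAL_READS": int,
    "TOTAL_UNIQUELY_MAPPED": int,
    "TOTAL_UNMAPPED_READS": int,
    "PROPORTION_READS_MAPPED": float,
}


def parse_required_metrics(row):
    """Convert the required metric strings to their declared types.

    Returns (values, errors): the parsed values keyed by attribute name, and
    a list of messages for entries that are missing or cannot be parsed.
    """
    values, errors = {}, []
    for name, caster in REQUIRED_METRIC_TYPES.items():
        raw = row.get(name)
        if raw is None or str(raw).strip() == "":
            errors.append(f"{name}: missing")
            continue
        try:
            values[name] = caster(str(raw).strip())
        except ValueError:
            errors.append(f"{name}: expected {caster.__name__}, got {raw!r}")
    return values, errors
```

The sketch checks types only. It applies no range checks, because the table does not state valid ranges for these metrics.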

BulkWESLevel3

Bulk Whole Exome Sequencing Level 3 - Called variants and MSI analysis

Attribute

Type

Required

Description

FILE_FORMAT

string

Yes

Format of the variant file (vcf or vcf.gz)

FILENAME

string

Yes

Name of the file. Must end with an extension matching the FILE_FORMAT (.vcf for vcf; .vcf.gz for vcf.gz)

GERMLINE_VARIANTS_WORKFLOW_URL

string

No

URL to the germline variants workflow

GERMLINE_VARIANTS_WORKFLOW_TYPE

string

No

Type of germline variants workflow

SOMATIC_VARIANTS_WORKFLOW_URL

string

No

URL to the somatic variants workflow

SOMATIC_VARIANTS_WORKFLOW_TYPE

string

No

Type of somatic variants workflow

SOMATIC_VARIANTS_SAMPLE_TYPE

SomaticVariantsSampleTypeEnum

No

Type of sample for somatic variants

STRUCTURAL_VARIANT_WORKFLOW_URL

string

No

URL to the structural variant workflow

STRUCTURAL_VARIANT_WORKFLOW_TYPE

string

No

Type of structural variant workflow

MSI_WORKFLOW_LINK

string

No

Link to MSI workflow

MSI_SCORE

float

No

MSI score

MSI_STATUS

MSIStatusEnum

No

MSI status

GENOMIC_REFERENCE

GenomicReferenceEnum

Yes

Genomic or transcriptomic reference assembly used for alignment. If your genome reference is not among the valid values, please contact your data liaison.

GENOMIC_REFERENCE_URL

string

Yes

URL to genomic or transcriptomic reference

GENOME_ANNOTATION_URL

string

Yes

URL to genome or transcriptome annotation

WORKFLOW_VERSION

string

Yes

Major version of the workflow, or ‘Not applicable’ when no workflow version applies.

WORKFLOW_LINK

string

Yes

Link to workflow or command. Dockstore.org is recommended.

LIBRARY_LAYOUT

LibraryLayoutEnum

Yes

Library layout (paired-end or single-end)

SEQUENCING_PLATFORM

SequencingPlatformEnum

Yes

Sequencing platform used

SEQUENCING_BATCH_ID

string

No

Sequencing batch identifier

LIBRARY_PREPARATION_DAYS_FROM_INDEX

integer

No

Number of days between when the sample for assay was received in the lab and when the libraries were prepared for sequencing. If not applicable, enter ‘Not Applicable’.

TECHNICAL_REPLICATE_GROUP

string

No

Technical replicate group identifier

PROTOCOL_LINK

string

No

Link to sequencing protocol

CHECKSUM

string

No

Checksum for data integrity verification

HTAN_DATA_FILE_ID

string

Yes

HTAN Data File ID (Primary Key)

HTAN_PARENT_ID

string

Yes

HTAN Parent ID(s) - Foreign key(s) to parent entity (B for Biospecimen, D for data file). One or more IDs; for aggregated files provide multiple. Each ID must have B or D suffix. Supports HTA200-229 for phase 2.
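Each BulkWES level states that FILENAME must end with an extension matching FILE_FORMAT. Those three rules can be sketched as one lookup table in Python. The function name `filename_matches_format` is hypothetical, and the case-sensitive comparison is an assumption, since the tables do not say whether upper-case extensions are accepted.

```python
# Extension rules copied from the FILENAME descriptions of the three levels.
ALLOWED_EXTENSIONS = {
    "BulkWESLevel1": {"fastq": (".fastq",), "fastq.gz": (".fastq.gz", ".fq.gz")},
    "BulkWESLevel2": {"bam": (".bam",), "cram": (".cram",)},
    "BulkWESLevel3": {"vcf": (".vcf",), "vcf.gz": (".vcf.gz",)},
}


def filename_matches_format(level, file_format, filename):
    """Return True if filename ends with an extension allowed for file_format.

    A file_format that is not listed for the level returns False.
    The comparison is case-sensitive (an assumption, see the note above).
    """
    extensions = ALLOWED_EXTENSIONS[level].get(file_format)
    if extensions is None:
        return False
    # str.endswith accepts a tuple and matches any of its members.
    return filename.endswith(extensions)
```

For example, `filename_matches_format("BulkWESLevel1", "fastq.gz", "sample_R1.fq.gz")` is true, while the same filename with FILE_FORMAT `fastq` is rejected.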

Enums

GenomicReferenceEnum

Genomic or transcriptomic reference assembly used for alignment

Value

Description

GRCh37

Genome Reference Consortium human build 37

GRCh37.p13

GRCh37 patch release 13

GRCh38

Genome Reference Consortium human build 38

GRCh38.p13

GRCh38 patch release 13

GRCh38.p14

GRCh38 patch release 14

hg19

UCSC human genome reference hg19

hg38

UCSC human genome reference hg38

LibraryLayoutEnum

Value

Description

Paired-end

Paired-end sequencing

Single-end

Single-end sequencing

LibrarySelectionMethodEnum

Value

Description

Hybrid Selection

Hybrid selection method

PCR

PCR-based selection

RANDOM

Random selection

other

Other selection method

MSIStatusEnum

Value

Description

MSI-H

High microsatellite instability

MSI-L

Low microsatellite instability

MSS

Microsatellite stable

Unknown

Unknown MSI status

SequencingPlatformEnum

Value

Description

ABI_SOLID

ABI SOLID sequencing platform

BGISEQ

BGI sequencing platform

CAPILLARY

Capillary sequencing platform

COMPLETE_GENOMICS

Complete Genomics sequencing platform

HELICOS

Helicos sequencing platform

ILLUMINA

Illumina sequencing platform

ION_TORRENT

Ion Torrent sequencing platform

LS454

454 sequencing platform

OXFORD_NANOPORE

Oxford Nanopore sequencing platform

PACBIO_SMRT

PacBio SMRT sequencing platform

SomaticVariantsSampleTypeEnum

Value

Description

Metastatic

Metastatic tumor sample

Normal

Normal tissue sample

Other

Other sample type

Primary

Primary tumor sample

Recurrent

Recurrent tumor sample
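The permissible values listed above mix casing conventions (`RANDOM`, `other`, `Hybrid Selection`), so exact spelling matters when filling in a manifest. The Python sketch below checks a value against the enum tables. It assumes matching is exact and case-sensitive, which the schema text does not state explicitly, and the function name `check_enum` is hypothetical.

```python
# Permissible values copied from the enum tables above.
PERMISSIBLE_VALUES = {
    "GenomicReferenceEnum": {
        "GRCh37", "GRCh37.p13", "GRCh38", "GRCh38.p13", "GRCh38.p14",
        "hg19", "hg38",
    },
    "LibraryLayoutEnum": {"Paired-end", "Single-end"},
    "LibrarySelectionMethodEnum": {"Hybrid Selection", "PCR", "RANDOM", "other"},
    "MSIStatusEnum": {"MSI-H", "MSI-L", "MSS", "Unknown"},
    "SequencingPlatformEnum": {
        "ABI_SOLID", "BGISEQ", "CAPILLARY", "COMPLETE_GENOMICS", "HELICOS",
        "ILLUMINA", "ION_TORRENT", "LS454", "OXFORD_NANOPORE", "PACBIO_SMRT",
    },
    "SomaticVariantsSampleTypeEnum": {
        "Metastatic", "Normal", "Other", "Primary", "Recurrent",
    },
}


def check_enum(enum_name, value):
    """Return None if value is permissible, otherwise an error message."""
    allowed = PERMISSIBLE_VALUES[enum_name]
    if value in allowed:
        return None
    options = ", ".join(sorted(allowed))
    return f"{value!r} is not a valid {enum_name}; expected one of: {options}"
```

Under the exact-match assumption, `check_enum("LibrarySelectionMethodEnum", "Other")` returns an error message, because that enum spells the value `other`.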