# WES

HTAN Whole Exome Sequencing Data Model Schema

## CoreFileAttributes

**Universal attributes that apply to all file-based data in HTAN**

| Attribute | Type | Required | Description |
|-----------|------|----------|-------------|
| `FILENAME` | string | Yes | Name of the file |
| `FILE_FORMAT` | string | Yes | Format of the file (e.g., fastq, bam, vcf, h5ad) |
| `HTAN_DATA_FILE_ID` | string | Yes | HTAN Data File ID (Primary Key) |
| `HTAN_PARENT_ID` | string | Yes | HTAN Parent ID(s) - Foreign key(s) to parent entity (B for Biospecimen, D for data file). One or more IDs; for aggregated files provide multiple. Each ID must have B or D suffix. Supports HTA200-229 for phase 2. |

## BaseSequencingAttributes

**Minimal base attributes shared across all sequencing types**

| Attribute | Type | Required | Description |
|-----------|------|----------|-------------|
| `CHECKSUM` | string | No | Checksum for data integrity verification |
| `FILENAME` | string | Yes | Name of the file |
| `FILE_FORMAT` | string | Yes | Format of the file (e.g., fastq, bam, vcf, h5ad) |
| `HTAN_DATA_FILE_ID` | string | Yes | HTAN Data File ID (Primary Key) |
| `HTAN_PARENT_ID` | string | Yes | HTAN Parent ID(s) - Foreign key(s) to parent entity (B for Biospecimen, D for data file). One or more IDs; for aggregated files provide multiple. Each ID must have B or D suffix. Supports HTA200-229 for phase 2. |
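
The Required column of the base attributes can be checked mechanically before submission. The following is a minimal sketch, not an official HTAN validator: the `missing_required` helper and the dictionary layout are hypothetical, and it assumes a metadata record is held as a plain Python `dict` keyed by attribute name. It checks presence only and does not inspect the ID format.

```python
# Required flags copied from the BaseSequencingAttributes table.
# The dict layout and helper name are illustrative assumptions.
BASE_SEQUENCING_ATTRIBUTES = {
    "CHECKSUM": False,
    "FILENAME": True,
    "FILE_FORMAT": True,
    "HTAN_DATA_FILE_ID": True,
    "HTAN_PARENT_ID": True,
}


def missing_required(record, spec=BASE_SEQUENCING_ATTRIBUTES):
    """Return the required attribute names that are absent or empty in record."""
    return [
        name
        for name, required in spec.items()
        if required and record.get(name) in (None, "", [])
    ]


if __name__ == "__main__":
    # Placeholder values only; real HTAN IDs follow the project's own format.
    record = {"FILENAME": "sample.fastq", "FILE_FORMAT": "fastq"}
    print(missing_required(record))
```

Because `CHECKSUM` is marked as not required, a record without it passes this check.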

## BaseSequencingLevel1Attributes

**Level 1 attributes - sequencing run and library (raw data)**

| Attribute | Type | Required | Description |
|-----------|------|----------|-------------|
| `LIBRARY_LAYOUT` | [LibraryLayoutEnum](#librarylayoutenum) | Yes | Library layout (paired-end or single-end) |
| `SEQUENCING_PLATFORM` | [SequencingPlatformEnum](#sequencingplatformenum) | Yes | Sequencing platform used |
| `SEQUENCING_BATCH_ID` | string | No | Sequencing batch identifier |
| `LIBRARY_PREPARATION_DAYS_FROM_INDEX` | integer | No | Number of days between when the sample for assay was received in the lab and the libraries were prepared for sequencing. If not applicable, enter 'Not Applicable'. |
| `TECHNICAL_REPLICATE_GROUP` | string | No | Technical replicate group identifier |
| `PROTOCOL_LINK` | string | No | Link to sequencing protocol |
| `CHECKSUM` | string | No | Checksum for data integrity verification |
| `FILENAME` | string | Yes | Name of the file |
| `FILE_FORMAT` | string | Yes | Format of the file (e.g., fastq, bam, vcf, h5ad) |
| `HTAN_DATA_FILE_ID` | string | Yes | HTAN Data File ID (Primary Key) |
| `HTAN_PARENT_ID` | string | Yes | HTAN Parent ID(s) - Foreign key(s) to parent entity (B for Biospecimen, D for data file). One or more IDs; for aggregated files provide multiple. Each ID must have B or D suffix. Supports HTA200-229 for phase 2. |

## BaseSequencingLevel2Attributes

**Level 2 attributes - alignment and alignment workflow**

| Attribute | Type | Required | Description |
|-----------|------|----------|-------------|
| `GENOMIC_REFERENCE` | [GenomicReferenceEnum](#genomicreferenceenum) | Yes | Genomic or transcriptomic reference assembly used for alignment. If your genome reference is not among the valid values, please contact your data liaison. |
| `GENOMIC_REFERENCE_URL` | string | Yes | URL to genomic or transcriptomic reference |
| `GENOME_ANNOTATION_URL` | string | Yes | URL to genome or transcriptome annotation |
| `WORKFLOW_VERSION` | string | Yes | Major version of the workflow, or 'Not applicable' when no workflow version applies. |
| `WORKFLOW_LINK` | string | Yes | Link to workflow or command. Dockstore.org is recommended. |
| `LIBRARY_LAYOUT` | [LibraryLayoutEnum](#librarylayoutenum) | Yes | Library layout (paired-end or single-end) |
| `SEQUENCING_PLATFORM` | [SequencingPlatformEnum](#sequencingplatformenum) | Yes | Sequencing platform used |
| `SEQUENCING_BATCH_ID` | string | No | Sequencing batch identifier |
| `LIBRARY_PREPARATION_DAYS_FROM_INDEX` | integer | No | Number of days between when the sample for assay was received in the lab and the libraries were prepared for sequencing. If not applicable, enter 'Not Applicable'. |
| `TECHNICAL_REPLICATE_GROUP` | string | No | Technical replicate group identifier |
| `PROTOCOL_LINK` | string | No | Link to sequencing protocol |
| `CHECKSUM` | string | No | Checksum for data integrity verification |
| `FILENAME` | string | Yes | Name of the file |
| `FILE_FORMAT` | string | Yes | Format of the file (e.g., fastq, bam, vcf, h5ad) |
| `HTAN_DATA_FILE_ID` | string | Yes | HTAN Data File ID (Primary Key) |
| `HTAN_PARENT_ID` | string | Yes | HTAN Parent ID(s) - Foreign key(s) to parent entity (B for Biospecimen, D for data file). One or more IDs; for aggregated files provide multiple. Each ID must have B or D suffix. Supports HTA200-229 for phase 2. |

## BaseSequencingLevel3Attributes

**Level 3+ attributes - inherits alignment and workflow; used for processed/analysis levels**

| Attribute | Type | Required | Description |
|-----------|------|----------|-------------|
| `GENOMIC_REFERENCE` | [GenomicReferenceEnum](#genomicreferenceenum) | Yes | Genomic or transcriptomic reference assembly used for alignment. If your genome reference is not among the valid values, please contact your data liaison. |
| `GENOMIC_REFERENCE_URL` | string | Yes | URL to genomic or transcriptomic reference |
| `GENOME_ANNOTATION_URL` | string | Yes | URL to genome or transcriptome annotation |
| `WORKFLOW_VERSION` | string | Yes | Major version of the workflow, or 'Not applicable' when no workflow version applies. |
| `WORKFLOW_LINK` | string | Yes | Link to workflow or command. Dockstore.org is recommended. |
| `LIBRARY_LAYOUT` | [LibraryLayoutEnum](#librarylayoutenum) | Yes | Library layout (paired-end or single-end) |
| `SEQUENCING_PLATFORM` | [SequencingPlatformEnum](#sequencingplatformenum) | Yes | Sequencing platform used |
| `SEQUENCING_BATCH_ID` | string | No | Sequencing batch identifier |
| `LIBRARY_PREPARATION_DAYS_FROM_INDEX` | integer | No | Number of days between when the sample for assay was received in the lab and the libraries were prepared for sequencing. If not applicable, enter 'Not Applicable'. |
| `TECHNICAL_REPLICATE_GROUP` | string | No | Technical replicate group identifier |
| `PROTOCOL_LINK` | string | No | Link to sequencing protocol |
| `CHECKSUM` | string | No | Checksum for data integrity verification |
| `FILENAME` | string | Yes | Name of the file |
| `FILE_FORMAT` | string | Yes | Format of the file (e.g., fastq, bam, vcf, h5ad) |
| `HTAN_DATA_FILE_ID` | string | Yes | HTAN Data File ID (Primary Key) |
| `HTAN_PARENT_ID` | string | Yes | HTAN Parent ID(s) - Foreign key(s) to parent entity (B for Biospecimen, D for data file). One or more IDs; for aggregated files provide multiple. Each ID must have B or D suffix. Supports HTA200-229 for phase 2. |

## BulkWESLevel1

**Bulk Whole Exome Sequencing Level 1 - Raw files**

| Attribute | Type | Required | Description |
|-----------|------|----------|-------------|
| `FILE_FORMAT` | string | Yes | Format of the raw sequencing file (fastq or fastq.gz) |
| `FILENAME` | string | Yes | Name of the file. Must end with an extension matching the FILE_FORMAT (.fastq for fastq; .fastq.gz or .fq.gz for fastq.gz) |
| `READ_INDICATOR` | string | No | Read indicator |
| `LIBRARY_SELECTION_METHOD` | [LibrarySelectionMethodEnum](#libraryselectionmethodenum) | Yes | Method used for library selection |
| `READ_LENGTH` | integer | Yes | Read length in base pairs |
| `TARGET_CAPTURE_KIT` | string | No | Target capture kit used |
| `LIBRARY_PREPARATION_KIT_NAME` | string | No | Name of the library preparation kit |
| `LIBRARY_PREPARATION_KIT_VENDOR` | string | No | Vendor of the library preparation kit |
| `LIBRARY_PREPARATION_KIT_VERSION` | string | No | Version of the library preparation kit |
| `ADAPTER_NAME` | string | No | Name of the adapter used |
| `ADAPTER_SEQUENCE` | string | No | Adapter sequence |
| `BASE_CALLER_NAME` | string | No | Name of the base caller |
| `BASE_CALLER_VERSION` | string | No | Version of the base caller |
| `FLOW_CELL_BARCODE` | string | No | Flow cell barcode |
| `FRAGMENT_MAXIMUM_LENGTH` | integer | No | Maximum fragment length |
| `FRAGMENT_MEAN_LENGTH` | integer | No | Mean fragment length |
| `FRAGMENT_MINIMUM_LENGTH` | integer | No | Minimum fragment length |
| `FRAGMENT_STANDARD_DEVIATION_LENGTH` | integer | No | Standard deviation of fragment length |
| `LANE_NUMBER` | integer | No | Lane number |
| `MULTIPLEX_BARCODE` | string | No | Multiplex barcode |
| `LIBRARY_PREPARATION_DAYS_FROM_INDEX` | integer | No | Days from index for library preparation |
| `SIZE_SELECTION_RANGE` | string | No | Size selection range |
| `TARGET_DEPTH` | integer | No | Target sequencing depth |
| `TO_TRIM_ADAPTER_SEQUENCE` | boolean | No | Whether to trim adapter sequence |
| `LIBRARY_LAYOUT` | [LibraryLayoutEnum](#librarylayoutenum) | Yes | Library layout (paired-end or single-end) |
| `SEQUENCING_PLATFORM` | [SequencingPlatformEnum](#sequencingplatformenum) | Yes | Sequencing platform used |
| `SEQUENCING_BATCH_ID` | string | No | Sequencing batch identifier |
| `TECHNICAL_REPLICATE_GROUP` | string | No | Technical replicate group identifier |
| `PROTOCOL_LINK` | string | No | Link to sequencing protocol |
| `CHECKSUM` | string | No | Checksum for data integrity verification |
| `HTAN_DATA_FILE_ID` | string | Yes | HTAN Data File ID (Primary Key) |
| `HTAN_PARENT_ID` | string | Yes | HTAN Parent ID(s) - Foreign key(s) to parent entity (B for Biospecimen, D for data file). One or more IDs; for aggregated files provide multiple. Each ID must have B or D suffix. Supports HTA200-229 for phase 2. |

## BulkWESLevel2

**Bulk Whole Exome Sequencing Level 2 - Reads mapped to the genome and alignment QC**

| Attribute | Type | Required | Description |
|-----------|------|----------|-------------|
| `FILE_FORMAT` | string | Yes | Format of the aligned file (bam or cram) |
| `FILENAME` | string | Yes | Name of the file. Must end with an extension matching the FILE_FORMAT (.bam for bam; .cram for cram) |
| `ALIGNMENT_WORKFLOW_TYPE` | string | Yes | Type of alignment workflow used |
| `INDEX_FILE_NAME` | string | No | Name of the index file |
| `AVERAGE_BASE_QUALITY` | float | No | Average base quality |
| `AVERAGE_INSERT_SIZE` | integer | No | Average insert size |
| `AVERAGE_READ_LENGTH` | integer | No | Average read length |
| `CONTAMINATION` | float | No | Contamination estimate |
| `CONTAMINATION_ERROR` | float | No | Contamination error estimate |
| `MEAN_COVERAGE` | float | Yes | Mean coverage depth |
| `ADAPTER_CONTENT` | string | No | Adapter content information |
| `BASIC_STATISTICS` | string | No | Basic statistics from QC |
| `ENCODING` | string | No | Encoding information |
| `OVERREPRESENTED_SEQUENCES` | string | No | Overrepresented sequences |
| `PER_BASE_N_CONTENT` | string | No | Per base N content |
| `PER_BASE_SEQUENCE_CONTENT` | string | No | Per base sequence content |
| `PER_BASE_SEQUENCE_QUALITY` | string | No | Per base sequence quality |
| `PER_SEQUENCE_GC_CONTENT` | string | No | Per sequence GC content |
| `PER_SEQUENCE_QUALITY_SCORE` | string | No | Per sequence quality score |
| `PER_TILE_SEQUENCE_QUALITY` | string | No | Per tile sequence quality |
| `PERCENT_GC_CONTENT` | float | No | Percent GC content |
| `SEQUENCE_DUPLICATION_LEVELS` | string | No | Sequence duplication levels |
| `SEQUENCE_LENGTH_DISTRIBUTION` | string | No | Sequence length distribution |
| `QC_WORKFLOW_TYPE` | string | No | QC workflow type |
| `QC_WORKFLOW_VERSION` | string | No | QC workflow version |
| `QC_WORKFLOW_LINK` | string | No | Link to QC workflow |
| `PAIRS_ON_DIFF_CHR` | integer | No | Number of read pairs on different chromosomes |
| `TOTAL_READS` | integer | Yes | Total number of reads |
| `TOTAL_UNIQUELY_MAPPED` | integer | Yes | Total number of uniquely mapped reads |
| `TOTAL_UNMAPPED_READS` | integer | Yes | Total number of unmapped reads |
| `PROPORTION_READS_DUPLICATED` | float | No | Proportion of duplicated reads |
| `PROPORTION_READS_MAPPED` | float | Yes | Proportion of mapped reads |
| `PROPORTION_TARGETS_NO_COVERAGE` | float | No | Proportion of targets with no coverage |
| `PROPORTION_BASE_MISMATCH` | float | No | Proportion of base mismatches |
| `SHORT_READS` | integer | No | Number of short reads |
| `PROPORTION_COVERAGE_10X` | float | No | Proportion of coverage at 10x |
| `PROPORTION_COVERAGE_30X` | float | No | Proportion of coverage at 30x |
| `IS_LOWEST_LEVEL` | boolean | No | Whether this is the lowest level |
| `GENOMIC_REFERENCE` | [GenomicReferenceEnum](#genomicreferenceenum) | Yes | Genomic or transcriptomic reference assembly used for alignment. If your genome reference is not among the valid values, please contact your data liaison. |
| `GENOMIC_REFERENCE_URL` | string | Yes | URL to genomic or transcriptomic reference |
| `GENOME_ANNOTATION_URL` | string | Yes | URL to genome or transcriptome annotation |
| `WORKFLOW_VERSION` | string | Yes | Major version of the workflow, or 'Not applicable' when no workflow version applies. |
| `WORKFLOW_LINK` | string | Yes | Link to workflow or command. Dockstore.org is recommended. |
| `LIBRARY_LAYOUT` | [LibraryLayoutEnum](#librarylayoutenum) | Yes | Library layout (paired-end or single-end) |
| `SEQUENCING_PLATFORM` | [SequencingPlatformEnum](#sequencingplatformenum) | Yes | Sequencing platform used |
| `SEQUENCING_BATCH_ID` | string | No | Sequencing batch identifier |
| `LIBRARY_PREPARATION_DAYS_FROM_INDEX` | integer | No | Number of days between when the sample for assay was received in the lab and the libraries were prepared for sequencing. If not applicable, enter 'Not Applicable'. |
| `TECHNICAL_REPLICATE_GROUP` | string | No | Technical replicate group identifier |
| `PROTOCOL_LINK` | string | No | Link to sequencing protocol |
| `CHECKSUM` | string | No | Checksum for data integrity verification |
| `HTAN_DATA_FILE_ID` | string | Yes | HTAN Data File ID (Primary Key) |
| `HTAN_PARENT_ID` | string | Yes | HTAN Parent ID(s) - Foreign key(s) to parent entity (B for Biospecimen, D for data file). One or more IDs; for aggregated files provide multiple. Each ID must have B or D suffix. Supports HTA200-229 for phase 2. |

## BulkWESLevel3

**Bulk Whole Exome Sequencing Level 3 - Called variants and MSI analysis**

| Attribute | Type | Required | Description |
|-----------|------|----------|-------------|
| `FILE_FORMAT` | string | Yes | Format of the variant file (vcf or vcf.gz) |
| `FILENAME` | string | Yes | Name of the file. Must end with an extension matching the FILE_FORMAT (.vcf for vcf; .vcf.gz for vcf.gz) |
| `GERMLINE_VARIANTS_WORKFLOW_URL` | string | No | URL to the germline variants workflow |
| `GERMLINE_VARIANTS_WORKFLOW_TYPE` | string | No | Type of germline variants workflow |
| `SOMATIC_VARIANTS_WORKFLOW_URL` | string | No | URL to the somatic variants workflow |
| `SOMATIC_VARIANTS_WORKFLOW_TYPE` | string | No | Type of somatic variants workflow |
| `SOMATIC_VARIANTS_SAMPLE_TYPE` | [SomaticVariantsSampleTypeEnum](#somaticvariantssampletypeenum) | No | Type of sample for somatic variants |
| `STRUCTURAL_VARIANT_WORKFLOW_URL` | string | No | URL to the structural variant workflow |
| `STRUCTURAL_VARIANT_WORKFLOW_TYPE` | string | No | Type of structural variant workflow |
| `MSI_WORKFLOW_LINK` | string | No | Link to MSI workflow |
| `MSI_SCORE` | float | No | MSI score |
| `MSI_STATUS` | [MSIStatusEnum](#msistatusenum) | No | MSI status |
| `GENOMIC_REFERENCE` | [GenomicReferenceEnum](#genomicreferenceenum) | Yes | Genomic or transcriptomic reference assembly used for alignment. If your genome reference is not among the valid values, please contact your data liaison. |
| `GENOMIC_REFERENCE_URL` | string | Yes | URL to genomic or transcriptomic reference |
| `GENOME_ANNOTATION_URL` | string | Yes | URL to genome or transcriptome annotation |
| `WORKFLOW_VERSION` | string | Yes | Major version of the workflow, or 'Not applicable' when no workflow version applies. |
| `WORKFLOW_LINK` | string | Yes | Link to workflow or command. Dockstore.org is recommended. |
| `LIBRARY_LAYOUT` | [LibraryLayoutEnum](#librarylayoutenum) | Yes | Library layout (paired-end or single-end) |
| `SEQUENCING_PLATFORM` | [SequencingPlatformEnum](#sequencingplatformenum) | Yes | Sequencing platform used |
| `SEQUENCING_BATCH_ID` | string | No | Sequencing batch identifier |
| `LIBRARY_PREPARATION_DAYS_FROM_INDEX` | integer | No | Number of days between when the sample for assay was received in the lab and the libraries were prepared for sequencing. If not applicable, enter 'Not Applicable'. |
| `TECHNICAL_REPLICATE_GROUP` | string | No | Technical replicate group identifier |
| `PROTOCOL_LINK` | string | No | Link to sequencing protocol |
| `CHECKSUM` | string | No | Checksum for data integrity verification |
| `HTAN_DATA_FILE_ID` | string | Yes | HTAN Data File ID (Primary Key) |
| `HTAN_PARENT_ID` | string | Yes | HTAN Parent ID(s) - Foreign key(s) to parent entity (B for Biospecimen, D for data file). One or more IDs; for aggregated files provide multiple. Each ID must have B or D suffix. Supports HTA200-229 for phase 2. |
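
The `FILENAME` rule in the three BulkWES levels (the name must end with an extension matching `FILE_FORMAT`) can be expressed as a small lookup. This is a sketch under stated assumptions rather than the official validation logic: the `filename_matches_format` helper is hypothetical, the mapping is transcribed from the `FILENAME` descriptions above, and the comparison is assumed to be case-sensitive.

```python
# FILE_FORMAT -> accepted filename extensions, transcribed from the FILENAME
# descriptions of BulkWESLevel1, BulkWESLevel2, and BulkWESLevel3.
ALLOWED_EXTENSIONS = {
    "fastq": (".fastq",),
    "fastq.gz": (".fastq.gz", ".fq.gz"),
    "bam": (".bam",),
    "cram": (".cram",),
    "vcf": (".vcf",),
    "vcf.gz": (".vcf.gz",),
}


def filename_matches_format(filename, file_format):
    """Return True if filename ends with an extension allowed for file_format.

    Unknown formats return False. Case-sensitive matching is an assumption.
    """
    extensions = ALLOWED_EXTENSIONS.get(file_format)
    if extensions is None:
        return False
    # str.endswith accepts a tuple of candidate suffixes.
    return filename.endswith(extensions)


if __name__ == "__main__":
    print(filename_matches_format("tumor_R1.fq.gz", "fastq.gz"))
    print(filename_matches_format("tumor.vcf.gz", "vcf"))
```

Note that a compressed file such as `tumor.vcf.gz` does not satisfy `FILE_FORMAT` `vcf`, since the name must end in `.vcf` for that format.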

## Enums

### GenomicReferenceEnum

Genomic or transcriptomic reference assembly used for alignment

| Value | Description |
|-------|-------------|
| GRCh37 | Genome Reference Consortium human build 37 |
| GRCh37.p13 | GRCh37 patch release 13 |
| GRCh38 | Genome Reference Consortium human build 38 |
| GRCh38.p13 | GRCh38 patch release 13 |
| GRCh38.p14 | GRCh38 patch release 14 |
| hg19 | UCSC human genome reference hg19 |
| hg38 | UCSC human genome reference hg38 |

### LibraryLayoutEnum

| Value | Description |
|-------|-------------|
| Paired-end | Paired-end sequencing |
| Single-end | Single-end sequencing |

### LibrarySelectionMethodEnum

| Value | Description |
|-------|-------------|
| Hybrid Selection | Hybrid selection method |
| PCR | PCR-based selection |
| RANDOM | Random selection |
| other | Other selection method |

### MSIStatusEnum

| Value | Description |
|-------|-------------|
| MSI-H | High microsatellite instability |
| MSI-L | Low microsatellite instability |
| MSS | Microsatellite stable |
| Unknown | Unknown MSI status |

### SequencingPlatformEnum

| Value | Description |
|-------|-------------|
| ABI_SOLID | ABI SOLID sequencing platform |
| BGISEQ | BGI sequencing platform |
| CAPILLARY | Capillary sequencing platform |
| COMPLETE_GENOMICS | Complete Genomics sequencing platform |
| HELICOS | Helicos sequencing platform |
| ILLUMINA | Illumina sequencing platform |
| ION_TORRENT | Ion Torrent sequencing platform |
| LS454 | 454 sequencing platform |
| OXFORD_NANOPORE | Oxford Nanopore sequencing platform |
| PACBIO_SMRT | PacBio SMRT sequencing platform |

### SomaticVariantsSampleTypeEnum

| Value | Description |
|-------|-------------|
| Metastatic | Metastatic tumor sample |
| Normal | Normal tissue sample |
| Other | Other sample type |
| Primary | Primary tumor sample |
| Recurrent | Recurrent tumor sample |
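
The enum tables above can be mirrored in code so that invalid values are rejected early. The sketch below covers three of the enums using Python's standard `enum` module; the class names, member names, and the `parse_enum` helper are illustrative choices, not part of the HTAN schema. It assumes values must match exactly, including case, which matters because `LibrarySelectionMethodEnum` mixes `RANDOM` and `other`.

```python
from enum import Enum


class LibraryLayout(Enum):
    PAIRED_END = "Paired-end"
    SINGLE_END = "Single-end"


class LibrarySelectionMethod(Enum):
    HYBRID_SELECTION = "Hybrid Selection"
    PCR = "PCR"
    RANDOM = "RANDOM"
    OTHER = "other"


class MSIStatus(Enum):
    MSI_H = "MSI-H"
    MSI_L = "MSI-L"
    MSS = "MSS"
    UNKNOWN = "Unknown"


def parse_enum(enum_cls, raw):
    """Convert a raw metadata string to a member of enum_cls.

    Raises ValueError listing the permitted values when raw is not one of
    them. Exact, case-sensitive matching is an assumption of this sketch.
    """
    try:
        return enum_cls(raw)
    except ValueError:
        allowed = ", ".join(member.value for member in enum_cls)
        raise ValueError(
            f"{raw!r} is not a valid {enum_cls.__name__}; expected one of: {allowed}"
        ) from None


if __name__ == "__main__":
    print(parse_enum(MSIStatus, "MSS"))
    print(parse_enum(LibrarySelectionMethod, "other"))
```

The remaining enums (`GenomicReferenceEnum`, `SequencingPlatformEnum`, `SomaticVariantsSampleTypeEnum`) would follow the same pattern.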