NIAGADS Guidelines for Submitting Genotype Data

To submit data, please contact niagads@upenn.edu with the required documentation. Please use the following guidelines when submitting data to NIAGADS.

Download complete instructions: Data Submission Guidelines

Required documentation for all data submissions:

  1. NIA AD Genetics Sharing Plan- Please include the NIA AD Genetics Sharing Plan, signed by both the PI and a supervisor with signatory authority for the PI’s institution.
  2. Informed Consent Form(s), IRB-approved Consent Levels, and Institutional Certification- Please provide, in PDF format, the IRB-approved informed consent form(s) that comply with the NIH Genomic Data Sharing Policy (http://gds.nih.gov/). A signing official from the investigator’s institution must provide an Institutional Certification document that also states the level of consent. Our submission process parallels that of dbGaP, so submitters are asked to use the same document.
  3. Phenotype Data File- Please use tab-delimited plain text (.txt) or Excel (.xls/.xlsx) file formats, along with a data dictionary listing each variable and its description. Please include a column indicating the level of consent for each subject according to the Institutional Certification document.
  4. Pedigree Data File- Please use tab-delimited plain text (.txt) or Excel (.xls/.xlsx) file formats following the standard pedigree file format.

Use standard labels for the following fields: FAMID (family ID), SUBJID (subject ID), FATHER (father ID), MOTHER (mother ID), SEX (1 for male and 2 for female).

Example:
FAMID SUBJID FATHER MOTHER SEX
100 1 0 0 1
100 2 0 0 2
100 3 1 2 1
100 4 1 2 1

Note: If you decide to use the alternative genotype format (see below) to store your genotype data and the file also contains pedigree information (the five columns above), the pedigree data does not need to be saved in a separate file as required here.
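
As an illustration only, a pedigree file in the layout above could be checked before submission with a short Python sketch like the one below. The function name check_pedigree is hypothetical, and treating 0 as "parent not in the file" is an assumption taken from the sample rows, not a stated NIAGADS rule.

```python
REQUIRED = ["FAMID", "SUBJID", "FATHER", "MOTHER", "SEX"]

def check_pedigree(text):
    """Return a list of problems found in tab- or space-delimited pedigree text."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    if not rows:
        return ["no header line found"]
    header = [label.upper() for label in rows[0]]
    missing = [name for name in REQUIRED if name not in header]
    if missing:
        return ["missing columns: " + ", ".join(missing)]
    col = {name: header.index(name) for name in REQUIRED}
    data = rows[1:]
    # Keep IDs as text so values are never silently converted to numbers.
    known = {(r[col["FAMID"]], r[col["SUBJID"]]) for r in data if len(r) == len(header)}
    problems = []
    for n, r in enumerate(data, start=1):
        if len(r) != len(header):
            problems.append(f"row {n}: expected {len(header)} fields, found {len(r)}")
            continue
        if r[col["SEX"]] not in ("1", "2"):
            problems.append(f"row {n}: SEX must be 1 (male) or 2 (female)")
        for parent in ("FATHER", "MOTHER"):
            pid = r[col[parent]]
            # Assumption: 0 marks a parent who is not listed (a founder).
            if pid != "0" and (r[col["FAMID"]], pid) not in known:
                problems.append(f"row {n}: {parent} {pid} is not listed in family {r[col['FAMID']]}")
    return problems
```

An empty list means no problems were detected by these particular checks; it does not replace review by NIAGADS staff.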

 

If you are submitting Polymorphism Genotyping data:

  • APOE Genotypes- When available, please provide the APOE genotype. Describe the lab(s) that performed genotyping and the genotyping methodology in the README file described below.
  • Preferred Format- Computer files containing genotype or genetic mapping data should be plain text files in the genetic pedigree file format. We ask the contributor(s) to use either the PLINK (.ped and .map files) or MERLIN (.ped, .map, and .dat files) pedigree formats. For more detailed definitions, please refer to the following two URLs or the example file formats listed below.

PLINK: http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml#ped

MERLIN: http://www.sph.umich.edu/csg/abecasis/merlin/tour/input_files.html

There must be separate documentation (preferably in the README file) that clearly explains the format used for the files mentioned above. For PLINK pedigree format, including a URL to the definition of the file format is enough; for MERLIN pedigree format, both a detailed definition of the fields and a URL to the format definition should be included. The columns in each file should be listed and explained. Summary statistics, such as the number of individuals and the number of markers, should also be included. If a system was used to divide the genotyping into "plates", etc., it should be explained (for example, when the genotype files are named "plate1", "plate2", and so on).
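
The summary statistics requested above could be produced with a small Python sketch such as the following. It assumes the standard PLINK text layout: six leading columns in the .ped file (family ID, individual ID, paternal ID, maternal ID, sex, phenotype) followed by two allele columns per marker, and one line per marker in the .map file. The function name plink_summary is hypothetical.

```python
def plink_summary(ped_text, map_text):
    """Count individuals, families and markers in PLINK .ped/.map text."""
    map_rows = [line.split() for line in map_text.splitlines() if line.strip()]
    ped_rows = [line.split() for line in ped_text.splitlines() if line.strip()]
    n_markers = len(map_rows)
    # Each .ped row: 6 leading columns plus 2 alleles per marker.
    expected = 6 + 2 * n_markers
    for n, row in enumerate(ped_rows, start=1):
        if len(row) != expected:
            raise ValueError(f".ped row {n}: expected {expected} fields, found {len(row)}")
    return {
        "individuals": len(ped_rows),
        "families": len({row[0] for row in ped_rows}),
        "markers": n_markers,
    }
```

The returned counts can be pasted into the README; the field-count check also catches a .map file that does not match its .ped file.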

Loci Labeling

Microsatellite markers, SNPs, genes, etc., should be labeled with the common usage employed at NCBI. For microsatellites, use the "DnSmmmm" format so that the marker appears in the Marshfield or deCODE maps. For SNPs, use dbSNP rs numbers, not ss numbers. For genes, use the official NCBI Entrez Gene name and numerical gene ID, not an alias. For example, DRD2 is an official name and has the aliases D2DR, D2R, etc.
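
A quick way to screen marker labels against these conventions is sketched below in Python. The regular expressions are assumptions about what the labels look like ("rs" or "ss" followed by digits; "D", a chromosome number or X/Y, "S", then digits for microsatellites) and are not an official NCBI definition. The function name classify_marker is hypothetical.

```python
import re

# Assumed label shapes; case is ignored because field labels are case insensitive.
RS_NUMBER = re.compile(r"^rs[0-9]+$", re.IGNORECASE)
SS_NUMBER = re.compile(r"^ss[0-9]+$", re.IGNORECASE)
MICROSATELLITE = re.compile(r"^D([0-9]{1,2}|X|Y)S[0-9]+$", re.IGNORECASE)

def classify_marker(label):
    """Give a rough classification of a marker label."""
    if RS_NUMBER.match(label):
        return "dbSNP rs number"
    if SS_NUMBER.match(label):
        return "ss number (use the rs number instead)"
    if MICROSATELLITE.match(label):
        return "microsatellite"
    return "unrecognized"
```

Labels reported as "unrecognized" are not necessarily wrong (gene names fall into this group); they simply need a manual look.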

Alternative Genotype Format- If it is difficult to format data into the formats above, data sets can be formatted as additional plain text files according to the following guidelines:

  • Data should be formatted as tables in plain text, using spaces or tabs to separate the fields within each record.
  • Include a line at the top of each file that indicates the labels of the columns. These labels should begin with letters and not contain spaces (i.e., standard rules for the definition of variables in computer programs). The field labels are case insensitive and must remain distinct when case is ignored (e.g., SEX, Sex, and sex are considered to be the same).
  • Use standard labels for the following fields: FAMID (family ID), SUBJID (subject ID), FATHER (father ID), MOTHER (mother ID), SEX (1 for male and 2 for female), AGE, DX (for diagnosis: 1 for control, 2 for case).

Example:
FAMID SUBJID FATHER MOTHER SEX AGE DX RS1001 RS1002
100 1 0 0 1 45 0 A/T A/A
100 2 0 0 2 43 0 A/A C/C
100 3 1 2 1 12 0 A/T A/C
100 4 1 2 1 10 0 A/T A/C
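
The header rules for the alternative format can be sketched as a simple Python check. Allowing only letters, digits, and underscores after the first letter is an assumption based on the "variables in computer programs" wording above; the function name check_header is hypothetical.

```python
import re

# Assumption: a valid label is a letter followed by letters, digits or underscores.
LABEL = re.compile(r"^[A-Za-z][A-Za-z0-9_]*$")

def check_header(header_line):
    """Return a list of problems found in the column-label line."""
    problems = []
    seen = set()
    for label in header_line.split():
        if not LABEL.match(label):
            problems.append(f"{label!r} should begin with a letter and contain no special characters")
        key = label.upper()  # labels are compared without regard to case
        if key in seen:
            problems.append(f"{label!r} repeats an earlier label when case is ignored")
        seen.add(key)
    return problems
```

For the example header above, this check returns an empty list.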

Required files if you are submitting Next Generation Sequencing (NGS) data:

  1. Called reads prior to quality control in FASTQ format, compressed using the gzip or bzip2 program.
  2. Mapped reads in BAM format (see SAMtools, http://samtools.sourceforge.net).
  3. For studies focusing on genetic variants (e.g. genomic resequencing): called variants in the Variant Call Format (VCF).
  4. For read abundance studies (e.g. RNA-seq for gene expression profiling, ChIP-seq), provide summaries in tab-separated file format with explanations.
  5. Additional Relevant Information- The contributor(s) should also provide additional information to facilitate future analysis and enable replication of primary findings by other researchers:
  • Sequencer information (technology, machine type, version, protocol)
  • How the FASTQ files are generated (e.g. pipeline version, base calling software, settings).
  • How the BAM files are generated (e.g. workflow, read alignment software, parameters, reference genome, quality control parameters).
  • How the VCF files are generated (e.g. variant calling program, stringency settings).
  • For targeted enrichment sequencing and whole exome sequencing, provide information on the enrichment target regions using the UCSC Genome Browser BED format (see http://genome.ucsc.edu/FAQ/FAQformat.html). The genomic coordinates should be based on the reference genome version used for read alignment. Note how coordinates are defined in BED files: the start position is zero-based, and the end position does not include the rightmost base pair in the interval.
  • For RNA-seq experiments, how the transcript-level summaries are generated (e.g. statistical/computational steps to generate the summaries, transcript and isoform assemblers, software used).
  • For ChIP-seq experiments, how the summaries are generated (e.g. peak caller software).
  • Other information that the submitter deems relevant.
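
Because the BED coordinate convention is a common source of off-by-one errors, the conversion from 1-based inclusive positions (as used in many annotation tables) can be sketched in Python as follows. The function name to_bed_line and its parameters are hypothetical.

```python
def to_bed_line(chrom, start_1based, end_1based_inclusive, name=None):
    """Format a 1-based inclusive interval as a tab-separated BED line."""
    if start_1based < 1 or end_1based_inclusive < start_1based:
        raise ValueError("expected 1 <= start <= end")
    # BED start is zero-based; BED end is exclusive, so it equals the
    # 1-based inclusive end position.
    fields = [chrom, str(start_1based - 1), str(end_1based_inclusive)]
    if name is not None:
        fields.append(name)
    return "\t".join(fields)
```

With this convention the first 100 bases of a chromosome are written with start 0 and end 100, and the interval length is simply end minus start.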

Other considerations:

  • Family and Individual Identifiers

If using dashes to separate the different parts of the identifier of an individual, please explain each part of the ID. Example: individual "SJ-12321-1". To avoid issues with computer programs such as Excel, which may convert ID text into numbers without notice, do not use identification numbers with leading zeros.
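
Identifiers at risk of this kind of conversion can be listed with a few lines of Python. This sketch flags only IDs made up entirely of digits that begin with a zero, which is one interpretation of the rule above; the function names are hypothetical.

```python
def has_leading_zero(identifier):
    """True for all-digit IDs such as '00123' that change if read as numbers."""
    return identifier.isdigit() and len(identifier) > 1 and identifier.startswith("0")

def risky_ids(identifiers):
    """Return the identifiers that would lose digits if converted to numbers."""
    return [i for i in identifiers if has_leading_zero(i)]
```

When reading files with other tools, loading ID columns as text rather than as numbers avoids the same problem.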

  • Information for the reference genome

If a genetic or physical map is supplied, please explain the source.  For example, "NCBI physical map build 37.5".

  • Be consistent

Make sure all files use exactly the same system. We have noticed in the past that this rule is typically violated when data are sent at different times. Please make sure subsequent submissions use the same format as the original data.

If you have any further questions about data submission or would like to submit data, please contact niagads@upenn.edu.