#########################################################################################
## This README describes the individual-level data from participants who consented to public sharing.
## The study was originally posted on medRxiv (https://www.medrxiv.org/content/10.1101/2020.06.25.20140277v1) as
## "Genomic and multi-tissue proteomic integration for understanding the biology of disease and other complex traits".
## During peer review, the title was changed to "Genomic and multi-tissue proteomic integration for understanding the genetic architecture of neurological diseases".
#########################################################################################

#########################################################################################
## This repository shares three parts of data: proteomics data, genotype data, and consent information.
## Proteomics data were generated from three tissues: CSF, plasma, and brain.
## Genotype data were measured with genotyping arrays (array details are given in the type-1b covariate tables).
## Consent information is curated for each participant and records a detailed future research category.

## Number of participants who consented to public sharing of CSF data: 817;
## Number of participants who consented to public sharing of plasma data: 528;
## Number of participants who consented to public sharing of brain data: 343.

## 713 CSF proteins passed QC;
## 931 plasma proteins passed QC;
## 1079 brain proteins passed QC.
#########################################################################################


#########################################################################################
## As for part-1 proteomics data:
### There are three subtypes: a) proteomics-expression matrix; b) proteomics-covariate table; c) proteomics-annotation table.
#########################################################################################
### type-1a) proteomics-expression matrix (tab-delimited txt file)
#### type-1a.1 proteomics_exprs_t1CSF_toSharePublic.txt
#### type-1a.2 proteomics_exprs_t2plasma_toSharePublic.txt
#### type-1a.3 proteomics_exprs_t3brain_toSharePublic.txt
##### content description:
##### samples (rows) by proteins (columns)
##### PA_DB_UID are the sample IDs released for public sharing
##### proteins are denoted as SOMAseqID (see proteomics-annotation table below for details)
##### missing values are denoted as NA
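##### A minimal sketch of reading one of these matrices with the Python standard library.
##### It assumes the file has a header row and that the sample-ID column is named PA_DB_UID;
##### the function name read_expression and the tiny inline example (IDs, protein names, values) are hypothetical.

```python
import csv
import io

def read_expression(handle, id_column="PA_DB_UID"):
    """Parse a samples-by-proteins, tab-delimited table.

    Returns {sample_id: {SOMAseqID: value}}; 'NA' entries become None,
    every other entry is parsed as a float.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    matrix = {}
    for row in reader:
        sample_id = row.pop(id_column)  # assumed name of the ID column
        matrix[sample_id] = {
            protein: (None if value == "NA" else float(value))
            for protein, value in row.items()
        }
    return matrix

# Made-up example in the documented layout (not real data).
example = io.StringIO(
    "PA_DB_UID\tseqA\tseqB\n"
    "S001\t7.25\tNA\n"
    "S002\t6.5\t8.0\n"
)
exprs = read_expression(example)
print(exprs["S001"])

# For a real file, pass an open handle instead, for example:
# with open("proteomics_exprs_t1CSF_toSharePublic.txt", newline="") as fh:
#     exprs = read_expression(fh)
```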

#########################################################################################
### type-1b) proteomics-covariate table  (tab-delimited txt file)
#### type-1b.1 proteomics_covar_t1CSF_toSharePublic.txt
#### type-1b.2 proteomics_covar_t2plasma_toSharePublic.txt
#### type-1b.3 proteomics_covar_t3brain_toSharePublic.txt
##### content description:
##### samples (rows) by covariates (columns)
##### PA_DB_UID are the sample IDs released for public sharing
##### columns are age, sex, and genotype_platform (encoded as dummy variables)
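##### Because the expression matrix and the covariate table are both keyed by PA_DB_UID, it may be worth checking
##### that the two files cover the same samples before analysis. A rough sketch, assuming both files have a header row
##### with a PA_DB_UID column; the helper names and the example columns (seqA, platform_A) are hypothetical.

```python
import csv
import io

def sample_ids(handle, id_column="PA_DB_UID"):
    """Return the sample IDs of a tab-delimited table, in file order."""
    return [row[id_column] for row in csv.DictReader(handle, delimiter="\t")]

def compare_samples(exprs_handle, covar_handle):
    """Report which sample IDs are shared and which appear in only one table."""
    exprs_ids = set(sample_ids(exprs_handle))
    covar_ids = set(sample_ids(covar_handle))
    return {
        "shared": sorted(exprs_ids & covar_ids),
        "exprs_only": sorted(exprs_ids - covar_ids),
        "covar_only": sorted(covar_ids - exprs_ids),
    }

# Made-up tables in the documented layout (not real data).
exprs_txt = io.StringIO("PA_DB_UID\tseqA\nS001\t7.25\nS002\t6.5\n")
covar_txt = io.StringIO(
    "PA_DB_UID\tage\tsex\tplatform_A\n"
    "S001\t70\t1\t1\n"
    "S003\t65\t0\t0\n"
)
overlap = compare_samples(exprs_txt, covar_txt)
print(overlap)
```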

#########################################################################################
### type-1c) proteomics-annotation table  (tab-delimited txt file)
#### type-1c.1 proteomics_t1CSF_featureFile.txt
#### type-1c.2 proteomics_t2plasma_featureFile.txt
#### type-1c.3 proteomics_t3brain_featureFile.txt
##### content description:
##### proteins (rows) by annotations (columns)
##### SOMAseqID: unique SOMAmer ID from the SOMAscan platform; used in the proteomics-expression matrix.
##### SeqId: unique SOMAmer ID from the SOMAscan platform, with the SOMAmer version appended after '_'.
##### SomaId: unique SOMAmer ID from the SOMAscan platform, starting with "SL".
##### TargetFullName: full name of the protein targeted by the SOMAmer
##### Target: short name of the protein targeted by the SOMAmer
##### UniProt: protein ID from the UniProt database
##### EntrezGeneID: ID of the gene encoding the protein, from the NCBI database
##### EntrezGeneSymbol: symbol of the gene encoding the protein, from the NCBI database
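##### The annotation table can serve as a lookup from the SOMAseqID columns of the expression matrix to gene symbols.
##### A small sketch, assuming the header row uses exactly the column names listed above;
##### the function name load_annotation and every value in the example row are made up for illustration.

```python
import csv
import io

def load_annotation(handle, key="SOMAseqID"):
    """Index a tab-delimited annotation table by SOMAseqID."""
    return {row[key]: row for row in csv.DictReader(handle, delimiter="\t")}

# One made-up annotation row (placeholder values, not real data).
annot_txt = io.StringIO(
    "SOMAseqID\tSeqId\tSomaId\tTargetFullName\tTarget\tUniProt\tEntrezGeneID\tEntrezGeneSymbol\n"
    "seqA\tseqA_1\tSL000000\tExample protein A\tProtA\tP00000\t000\tGENEA\n"
)
annotation = load_annotation(annot_txt)

# Map each expression-matrix column name to its gene symbol.
gene_symbols = {seq_id: row["EntrezGeneSymbol"] for seq_id, row in annotation.items()}
print(gene_symbols)
```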


#########################################################################################
## As for part-2 genotype data:
## The genotype (GWAS) data are in PLINK binary format, as three associated files (.bed, .bim, .fam).
## PA_DB_UID are used as the sample IDs (FID/IID), fatherID, and motherID.
#########################################################################################
## The PLINK binary format is described at https://www.cog-genomics.org/plink/1.9/formats
### .bed (PLINK binary biallelic genotype table): https://www.cog-genomics.org/plink/1.9/formats#bed
### .bim (PLINK extended MAP file): https://www.cog-genomics.org/plink/1.9/formats#bim
### .fam (PLINK sample information file): https://www.cog-genomics.org/plink/1.9/formats#fam
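### A quick sanity check on a PLINK binary fileset, sketched under the format described at the links above:
### a .fam line has six whitespace-separated fields, and a variant-major .bed file starts with the bytes 0x6c 0x1b 0x01,
### followed by one block per variant that packs four genotypes into each byte.
### The function names and the example .fam line are hypothetical.

```python
import math

# PLINK 1 .bed header: two magic bytes plus 0x01 for variant-major mode.
BED_MAGIC = bytes([0x6C, 0x1B, 0x01])

FAM_FIELDS = ("FID", "IID", "father", "mother", "sex", "phenotype")

def parse_fam(lines):
    """Turn .fam lines into dicts keyed by the six standard field names."""
    return [dict(zip(FAM_FIELDS, line.split())) for line in lines if line.strip()]

def has_bed_magic(first_bytes):
    """True if the leading bytes match the variant-major .bed header."""
    return first_bytes[:3] == BED_MAGIC

def expected_bed_size(n_samples, n_variants):
    """3 header bytes, then ceil(n_samples / 4) bytes per variant."""
    return 3 + n_variants * math.ceil(n_samples / 4)

# Made-up .fam line (not real data).
fam = parse_fam(["S001 S001 0 0 1 -9\n"])
print(fam[0]["IID"])
print(expected_bed_size(n_samples=5, n_variants=10))

# With real files, n_samples is the line count of the .fam file, n_variants is the
# line count of the .bim file, and the result can be compared with the .bed file size.
```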


#########################################################################################
## As for part-3 consent information:
## It is a csv file with two columns:
## column-1: PA_DB_UID, the sample IDs of participants with genotype data and proteomics data from at least one tissue
## column-2: FutResearchCat, short for future research category
#########################################################################################
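## One possible way to group sample IDs by consent category before selecting samples for an analysis.
## This sketch assumes the csv file has a header row with the column names PA_DB_UID and FutResearchCat;
## the function name samples_by_category and the category labels catA and catB are hypothetical placeholders.

```python
import csv
import io
from collections import defaultdict

def samples_by_category(handle):
    """Group PA_DB_UID values by their FutResearchCat value."""
    groups = defaultdict(list)
    for row in csv.DictReader(handle):
        groups[row["FutResearchCat"]].append(row["PA_DB_UID"])
    return dict(groups)

# Made-up consent table (not real data; category labels are placeholders).
consent_txt = io.StringIO(
    "PA_DB_UID,FutResearchCat\n"
    "S001,catA\n"
    "S002,catB\n"
    "S003,catA\n"
)
by_category = samples_by_category(consent_txt)
print(by_category)
```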