What do tag SNPs define
The comparisons were carried out using sequence data for 10 representative candidate genes for atherosclerotic cardiovascular disease, with varying levels of LD, in a sample of European-Americans (Figure 1, Table 1). This difference may be due to long-range associations between SNPs.
For example, LD may exist between bins, which were partitioned based on LD r², or between haplotype blocks, such as in All common haplotypes, whereas TagIT is able to incorporate such long-range LD. With increase in tagging criteria from 0. No significant change in tagging effectiveness (TE) was noted using larger sample sizes (n ranged from 24 to 90) for each tSNPs selection method (Supplementary Figure 2).
The prediction accuracy of tSNPs selected by different methods approached or exceeded the threshold of 0. Given the higher TE of Haplotype diversity and TagIT, the prediction accuracy of these two methods was higher in the gene regions with high LD and comparable to other methods in the moderate and low LD regions. A similar pattern for TE and prediction accuracy among different methods was noted (Supplementary Figure 4). However, for a given method of tSNPs selection, no simple linear relationship between TE and prediction accuracy was obvious in the 10 genes.
For all eight tSNPs selection methods, tagging effectiveness in high LD regions was much higher than in moderate and low LD regions. Pairwise comparisons of tSNPs sets revealed poor consistency between tSNPs selected using any two of the eight methods. This may be due to the low haplotype diversity in each block for the haplotype-block-based methods, which makes it more likely that two different methods select similar tSNPs to represent the common haplotypes.
Thus, the tSNPs sets identified by haplotype-block-free methods differed considerably. Forton et al. [35] have suggested that haplotype reconstruction by tSNPs generated by haplotype-block-based methods is more accurate than haplotype-block-free methods. The International HapMap project is meant to facilitate the optimal selection of SNPs for cost-effective and robust whole-genome association studies. However, tSNPs may be transferable between different geographical samples of an ethnic group [37] or between various non-African populations.
At least two initiatives, SeattleSNPs [15, 20] and the Environmental Genome Project (EGP) [39], have resequenced several hundred candidate genes involved in inflammation and environmental response, to facilitate candidate-gene-based association studies. Comparing various tSNPs selection methods is far from straightforward.
First, selecting representative gene data sets for analysis is problematic because of different LD levels in different genes and the variability in the number of SNPs among genes. Second, the size of candidate genes and genomic regions to be studied could be much larger than the regions (50 kb maximum) investigated in the present study, and the extent of LD could also extend well beyond this size.
Third, there is no consensus on the most appropriate statistics for evaluating the performance of tSNPs sets. The measure we used for evaluating the quality of tSNP selection, prediction accuracy, aims to maximize the expected accuracy of predicting untyped SNPs, given the unphased genotype information of the tSNPs. Generating a matching TE to compare prediction accuracy of eight tSNPs selection methods would require significant computational resources beyond the scope of the present study.
A major expectation from using tSNPs is that the genotyping cost is reduced while the statistical power for identifying associations is only minimally compromised.
Statistical power may be an important metric in deciding which method is optimal in association studies. A direct comparison of tSNPs selection methods in the context of statistical power may be possible in a simulation study [42], but was outside the scope of the present study. Except for LD r², which uses genotype data to calculate the pairwise LD measure r², the methods for selecting tSNPs are based on haplotype data.
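As a rough illustration of a genotype-based pairwise r², the sketch below computes the squared Pearson correlation between two SNPs whose genotypes are coded as 0, 1 or 2 copies of one allele. The text does not state which estimator the LD r² method uses, so this coding and the function name `genotype_r2` are assumptions made for illustration only.

```python
def genotype_r2(g1, g2):
    """Squared Pearson correlation between two genotype vectors.

    Assumption: genotypes are coded as 0/1/2 copies of one allele and
    both vectors list the same individuals in the same order.
    """
    n = len(g1)
    if n == 0 or n != len(g2):
        raise ValueError("genotype vectors must be non-empty and of equal length")
    mean1 = sum(g1) / n
    mean2 = sum(g2) / n
    cov = sum((a - mean1) * (b - mean2) for a, b in zip(g1, g2))
    var1 = sum((a - mean1) ** 2 for a in g1)
    var2 = sum((b - mean2) ** 2 for b in g2)
    if var1 == 0 or var2 == 0:
        # A monomorphic SNP carries no information about the other SNP.
        return 0.0
    return (cov * cov) / (var1 * var2)


if __name__ == "__main__":
    snp_a = [0, 1, 2, 1]
    snp_b = [2, 1, 0, 1]  # same pattern with the allele labels swapped
    print(genotype_r2(snp_a, snp_b))
```

Because r² is a squared correlation, swapping the allele labels at one SNP leaves the value unchanged.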
Although convenient, statistical inference of haplotypes is associated with a degree of uncertainty as a proportion of the inferred haplotypes may be incorrect.
This may reduce the statistical power of a haplotype approach to detect an association with disease. A limitation of the present analyses is that there are moderate amounts of missing data in the 10 genes; missing data ranged from 1. In conclusion, our comparison of the performance of several methods for choosing tSNPs revealed that TE varied with the methods, being highest for Haplotype Diversity [5] and TagIT (haplotype r²).
Risch N, Merikangas K: The future of genetic studies of complex human diseases. Science.
Dissecting human disease in the postgenomic era.
Jeffreys AJ, Neumann R: Reciprocal crossover asymmetry and meiotic drive in a human recombination hot spot. Nat Genet; 31.
Hidden Markov models (HMMs) are interesting alternatives to Markov models. An HMM consists of two layers. The maximum likelihood estimates of the parameters can be obtained by the classical Expectation-Maximization (EM) algorithm [42-44]. Some HMMs for haplotype sequences have already been introduced for purposes other than marker selection. We will compare results obtained with two HMMs: the model of Daly et al.
For the sake of completeness, we also include a simple HMM embedding the Markov model described in the previous section. We adopt a simple bottom-up strategy to select subsets of markers. Of course, this cannot ensure an optimal solution. Starting with an empty set of markers, we add markers one by one in such a way that the gain of information content is maximized at each step.
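The bottom-up strategy can be sketched as a greedy forward selection. The example below is a simplified stand-in, not the procedure used in the text: it measures information content by the empirical joint entropy of the selected markers in a haplotype sample rather than by an entropy derived from a fitted Markov model or HMM. The names `joint_entropy` and `greedy_select` are hypothetical.

```python
from collections import Counter
from math import log2


def joint_entropy(haplotypes, markers):
    """Empirical joint entropy (in bits) of the alleles at the given markers."""
    n = len(haplotypes)
    counts = Counter(tuple(h[m] for m in markers) for h in haplotypes)
    return -sum((c / n) * log2(c / n) for c in counts.values())


def greedy_select(haplotypes, k):
    """Add markers one at a time, each time maximizing the joint entropy.

    Assumption: haplotypes are equal-length sequences of allele symbols.
    Ties are broken in favour of the lowest marker index.
    """
    selected = []
    remaining = list(range(len(haplotypes[0])))
    for _ in range(min(k, len(remaining))):
        best = max(remaining,
                   key=lambda m: joint_entropy(haplotypes, selected + [m]))
        selected.append(best)
        remaining.remove(best)
    return selected


if __name__ == "__main__":
    # Markers 0 and 1 are redundant, as are markers 2 and 3.
    sample = ["0000", "0011", "1100", "1111"]
    print(greedy_select(sample, 2))
```

In this toy sample the second pick skips marker 1, which duplicates marker 0, and takes marker 2, which is the behaviour a gain-based criterion is meant to produce. As the text notes, a greedy search of this kind does not guarantee an optimal subset.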
Let J denote the current set of markers. In the HMM framework, exact computation of the conditional entropy H(X_i | X_J) is not tractable, but we will now describe an efficient way to approximate it. Using a sample x_1, ... The whole bottom-up selection algorithm would then be of complexity O(n^2).
Thereby, at each step of the algorithm, we can automatically restrict the computation to the relevant portion of each sequence x_k. This considerably decreases the time complexity of the bottom-up algorithm for large sequences. With a practical sample size, we can only deal with unconstrained HMMs of a few hidden states. In order to use HMMs with a large number of hidden states, we need to reduce the number of free parameters by imposing constraints on transition and emission probabilities.
Daly et al. Namely, all the transitions associated with a change of backbone are modeled as having the same probability. This model includes more haplotype backbones and is more meaningful than the unconstrained two-state model.
However, the choice of the number of hidden states is mainly arbitrary, even though four may be adequate for the particular region studied by Daly et al. Besides, at each position the four haplotype backbones are modeled as having the same marginal probability. Li and Stephens [22] introduced an attractive generalization of Daly's model that bypasses the choice of the number of hidden states. In this model, the hidden variable S_i corresponds to the template sequence at locus i. A version of this HMM accounting for recombination hot-spots and other recombination rate heterogeneities has also been proposed [22].
This maximization is related to the pseudo-likelihood methods introduced by Besag [ 45 ]. As an exploratory effort, we drop the linear structure embedded in Markov models and HMMs while sticking to the principle of maximum entropy. We smooth the estimates of the conditional probabilities by adding 0. Later on, we refer to these models as the "greedy" models with context-size 1 and context-size 2. For comparison, we use two SNP tag procedures described in the literature.
Both methods select subsets whose sizes depend simultaneously on the thresholds chosen by the user and on the particular set of sequences. The first method is the block-based dynamic programming algorithm implemented in the HapBlock software [ 12 , 13 ]. The user needs to choose among three criteria to define blocks and five criteria to select the tag SNPs within the blocks. In keeping with [ 9 ] and [ 12 ], we use the common haplotype criterion to define potential haplotype blocks. The second method comes from a recent work by Carlson et al.
A greedy algorithm searches for a tag set such that any SNP not in the subset with a minor allele frequency higher than 0. It has been suggested that any set of SNPs approximately evenly spaced along the sequence is a good tag set [ 46 ].
To check this statement, we use a simple procedure to build up sets of increasing size whose SNPs are approximately evenly spaced.
We start with a single SNP, the one closest to the middle point of the sequence. One SNP at a time, we increase the size of the tag set by (1) finding the longest interval that does not contain an already selected SNP but does contain at least one unselected SNP, and (2) adding the SNP that is closest to the middle point of this interval. We assess the performances of the tag SNP selection procedures on several data sets that differ by the number and density of markers as well as by the population sampled.
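The evenly spaced procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions: SNP positions are distinct numbers, the sequence is taken to span from the first to the last SNP unless `start` and `end` are given, and ties are broken toward the leftmost interval and the smaller position. The function name `evenly_spaced_tags` is hypothetical.

```python
def evenly_spaced_tags(positions, k, start=None, end=None):
    """Select up to k SNP positions that are approximately evenly spaced.

    Returns the positions in the order in which they were selected.
    """
    pos = sorted(positions)
    start = pos[0] if start is None else start
    end = pos[-1] if end is None else end

    def closest(candidates, target):
        # Nearest candidate to the target; ties go to the smaller position.
        return min(candidates, key=lambda p: (abs(p - target), p))

    # Start with the SNP closest to the middle point of the sequence.
    selected = [closest(pos, (start + end) / 2)]
    while len(selected) < min(k, len(pos)):
        bounds = [start] + sorted(selected) + [end]
        best = None
        for a, b in zip(bounds, bounds[1:]):
            # Unselected SNPs lying in the interval between two boundaries.
            candidates = [p for p in pos if a <= p <= b and p not in selected]
            if candidates and (best is None or b - a > best[0]):
                best = (b - a, a, b, candidates)
        _, a, b, candidates = best
        # Add the SNP closest to the middle point of the longest interval.
        selected.append(closest(candidates, (a + b) / 2))
    return selected


if __name__ == "__main__":
    snps = list(range(0, 101, 10))
    print(evenly_spaced_tags(snps, 3))
```

Each call rebuilds the tag set from scratch, so sets of increasing size are nested: the first k selections are the same whatever larger size is requested.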
A genotype trio is made of the genotype of one child and those of its two parents. Trio data make it possible to infer haplotypes by simple Mendelian genetic rules at most genotyped positions assuming no recombination in the last generation.
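The Mendelian rules can be illustrated for a single biallelic site. In this sketch, genotypes are coded as the number of copies of allele 1 (0, 1 or 2), which is an assumption about the data encoding; the function name `phase_child_site` is hypothetical, and the sketch does not check for Mendelian inconsistencies or missing genotypes.

```python
def phase_child_site(child, father, mother):
    """Resolve the child's alleles at one biallelic site using the parents.

    Returns (paternal_allele, maternal_allele), or None when the site
    cannot be resolved by Mendelian rules alone.
    """
    if child == 0:
        return (0, 0)
    if child == 2:
        return (1, 1)
    # The child is heterozygous: a homozygous parent reveals which allele
    # that parent transmitted, and the other allele came from the other parent.
    if father != 1:
        paternal = father // 2
        return (paternal, 1 - paternal)
    if mother != 1:
        maternal = mother // 2
        return (1 - maternal, maternal)
    # All three individuals are heterozygous: the phase is ambiguous.
    return None


if __name__ == "__main__":
    trio_sites = [(1, 0, 1), (1, 1, 2), (1, 1, 1), (2, 1, 1)]
    print([phase_child_site(c, f, m) for c, f, m in trio_sites])
```

Sites that return `None` are the unresolved positions that a statistical phasing step has to fill in.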
We use the hap2 program [47] to infer haplotypes at unsolved positions and missing data. This program combines trio information with population information obtained from the entire sample. The phased data sets are available for download [see Additional file 2]. Our genome-wide data consists of two sets of haplotypes: one is sampled from a Yoruba population in Nigeria (YRI data sets), the other comes from a Utah population of European ancestry (CEU data sets).
In each population, 30 trios were genotyped at approximately 1,, SNPs roughly evenly spaced along the genome. The independent haplotypes are obtained after discarding children's haplotypes. The average spacing between adjacent SNPs in this genome-wide data set is around 3 kbp. This density is probably not enough to capture the detailed pattern of polymorphism. These ten regions were chosen to represent a wide variety of LD levels.
All our evaluations rely on a six-fold cross-validation scheme. The haplotypes are randomly split into six disjoint test sets of 20 haplotypes and six complementary training sets of haplotypes. Parameter estimation and tag SNP selection are based on the training sets while evaluations rely on the test sets.
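A minimal sketch of such a split is given below. Six disjoint test sets of 20 haplotypes imply 120 haplotypes in total, so each complementary training set holds the remaining 100. The function name `six_fold_splits` and the fixed random seed are illustrative assumptions.

```python
import random


def six_fold_splits(n_haplotypes, n_folds=6, seed=0):
    """Randomly partition haplotype indices into disjoint test sets.

    Returns a list of (training_indices, test_indices) pairs, one per fold.
    """
    indices = list(range(n_haplotypes))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into n_folds disjoint test sets.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    everything = set(indices)
    return [(sorted(everything - set(test)), sorted(test)) for test in folds]


if __name__ == "__main__":
    for train, test in six_fold_splits(120):
        print(len(train), len(test))
```

Parameter estimation and tag selection would use only the training indices of a fold, with the matching test indices held back for evaluation.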
Botstein D, Risch N: Discovering genotypes underlying human phenotypes: past successes for mendelian disease, future approaches for complex disease. Nat Genet, 33 Suppl.
Nothnagel M, Rohde K: The effect of single-nucleotide polymorphism marker selection on patterns of haplotype blocks and haplotype frequency estimates.
Kingman JFC: The coalescent. Stochastic Process Appl.
Li N, Stephens M: Modeling linkage disequilibrium and identifying recombination hotspots using single-nucleotide polymorphism data. Genetics.
Akaike H: A new look at the statistical model identification.
Schwarz G: Estimating the dimension of a model. Annals of Statistics, 6.
Li L, Yu B: Iterated logarithmic expansions of the pathwise code lengths for exponential families.
Anderson EC, Novembre J: Finding haplotype block boundaries by using the minimum-description-length principle. Am J Hum Genet.
Shannon CE: A mathematical theory of communication. Bell Sys Tech Journal.