Microsoft Research

HLA Completion Details

Overview

This tool takes HLA class I (loci A, B, C) typing data as input, specified at possibly multiple resolutions (2-digit, 4-digit, or a combination of 2-/4-digit alternatives separated by a ‘/’), and probabilistically resolves the typing ambiguities (i.e., probabilistically “completes” the data to 4-digit resolution). Both phased and unphased outputs are provided, each at the 4-digit level. It is assumed that all HLA input data are defined at the molecular level (i.e., neither serological nor supertype). To see an example, click “Load Sample” and then “Compute”. You must choose a single ethnicity from the drop-down menu (African, European, Asian, Hispanic, or Amerindian) for each data set; a model trained on that ethnicity will then be used. After the computation completes, please read the warnings output in the third box on the web page. Use the ‘Clear Results’ button before computing results for a new data set so that it is obvious when the new computation has completed.

If you input data with considerable ambiguity, the computation can take a long time, so please be patient. If the input contains too much ambiguity, the tool will not process it and will instead provide a warning. To get around this, you may wish to download the executables (available soon) to your own machine and run them there.

Details and empirical evaluation of our methodology can be found in our 2008 PLoS Comp. Bio paper: J. Listgarten, Z. Brumme, C. Kadie, G. Xiaojiang, B. Walker, M. Carrington, P. Goulder, D. Heckerman, Statistical resolution of ambiguous HLA typing data, to appear in PLoS Computational Biology, 2008 (preprint).

Handling of special cases

The alleles A74XX, C17XX and C18XX are always modeled at the 2-digit level. Thus if C1701 is provided as input, it is truncated to C17 and left at this 2-digit resolution. In such cases, the output will read "C17(01)" to denote that it was modeled only at the 2-digit level, and the “Lower-resolution model used” column will be set to 1 in the output.

If a 4-digit allele specified in the input never appears in the training data (i.e., the allele is not in the domain of the model), then this allele is modeled at the 2-digit level (in other words, we "back off" to a 2-digit model). For example, if a new allele B5720 is identified and given as input to our tool, which does not currently model this allele, then we will model the allele as B57 and compute probabilities accordingly (effectively we "integrate out" the last 2 digits from our model). In such cases, the output will be written "B57(20)" to denote that only the first two digits were used to model the data, and the “Lower-resolution model used” column will be set to 1 in the output.

If a list of alternative alleles is provided in the input but only some of these are in the model domain, then those not in the model domain will be removed as alternatives and will not appear in the output. For example, if an allele were specified as "B5701/B5720" (meaning it is one of these two alleles), then since the latter allele is not in the model domain, it would be ignored (only the first allele would be used to compute probabilities) and the output would be written as "B5701".

Lastly, the 2-digit alleles B15 and B95 are treated as identical by the tool, as are A02 and A92.
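The back-off rules for 4-digit inputs can be sketched in a few lines of Python. This is only an illustration of the rules as stated here, not the tool's code: EXAMPLE_DOMAIN is a made-up stand-in for the real model domain, the function names are hypothetical, and the behaviour when none of the listed alternatives is in the domain is an assumption (the sketch backs off each one).

```python
# Alleles always modeled at the 2-digit level (from the text above).
ALWAYS_TWO_DIGIT = {"A74", "C17", "C18"}

# Hypothetical model domain, for illustration only; the real list is not given here.
EXAMPLE_DOMAIN = {"A0301", "B5701", "B5801"}


def model_label(allele, domain):
    """Return (output label, lower-resolution flag) for one 4-digit allele such as 'B5720'."""
    locus, digits = allele[0], allele[1:5]
    two_digit = locus + digits[:2]
    if two_digit in ALWAYS_TWO_DIGIT or allele not in domain:
        # Back off to the 2-digit model; the dropped digits are shown in parentheses.
        return f"{two_digit}({digits[2:]})", 1
    return allele, 0


def resolve_alternatives(alternatives, domain):
    """Drop alternatives outside the model domain when at least one is inside it."""
    in_domain = [a for a in alternatives if a in domain]
    # Assumption: if no alternative is in the domain, back off each of them.
    candidates = in_domain if in_domain else alternatives
    return [model_label(a, domain) for a in candidates]


print(model_label("C1701", EXAMPLE_DOMAIN))
print(model_label("B5720", EXAMPLE_DOMAIN))
print(resolve_alternatives(["B5701", "B5720"], EXAMPLE_DOMAIN))
```

The sketch does not cover the B15/B95 and A02/A92 equivalences, and it does not attempt the probabilistic completion itself.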

Input file format

The program accepts tab-delimited text files (as produced, for example, by pasting from Excel).

It supports a dense format and a sparse format. Both require headers. Here is an example of the dense format:

pid A1 A2 B1 B2 C1 C2
c04 A0301 A2301 B4403 B5801 C0701 C1601
c09 A03 A2601/2602 B1302 B3801 C0602 C12

There are seven columns. The first column is the id of the case. It can be any string. The other values are allele expressions. They start with a class letter: "A", "B", "C", "A*", "B*", "C*", or "Cw*". After the class letter is the allele number. It can be 0, 2, 4, or more digits. When more than 4 digits are given, they are trimmed. Likewise, any "00" digits are trimmed.

Slashes are supported. For example, "A2601/2602" or "A2601/A2602" or "A26/2702".
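For readers who want to pre-check their own files, the rules above might be approximated as follows. This is a rough sketch, not the tool's parser: the function names are hypothetical, "00" trimming is interpreted as dropping a trailing "00" from a 4-digit number (an assumption), and alternatives that omit the class letter are assumed to inherit it from the preceding alternative.

```python
import re

# Class letter as listed in the text: A, B, C, A*, B*, C*, or Cw*; then the digits.
_ALLELE = re.compile(r"^(Cw\*|[ABC]\*?)(\d*)$")


def normalize_allele(text, default_locus=None):
    """Normalize one allele such as 'Cw*0701' or 'A*260101' to the form 'C0701' / 'A2601'."""
    match = _ALLELE.match(text)
    if match:
        locus, digits = match.group(1)[0], match.group(2)
    elif text.isdigit() and default_locus:
        # Bare digits after a slash: reuse the class letter of the previous alternative.
        locus, digits = default_locus, text
    else:
        raise ValueError(f"unrecognized allele expression: {text!r}")
    if len(digits) in (1, 3):
        raise ValueError(f"expected 0, 2, 4, or more digits: {text!r}")
    digits = digits[:4]  # more than 4 digits are trimmed
    if len(digits) == 4 and digits.endswith("00"):
        digits = digits[:2]  # assumption: a trailing '00' carries no information
    return locus + digits


def parse_expression(expression):
    """Split a slash-separated expression into a list of normalized alternatives."""
    alternatives, locus = [], None
    for part in expression.split("/"):
        allele = normalize_allele(part.strip(), locus)
        locus = allele[0]
        alternatives.append(allele)
    return alternatives


print(parse_expression("A2601/2602"))
print(parse_expression("A26/2702"))
```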

The sparse format has only two columns. Here is an example:

pid hla
c04 A0301
c04 A2301
c04 B4403
c04 B5801
c04 C0701
c04 C1601
c09 A03
c09 A2601/2602
c09 B1302
c09 B3801
c09 C0602
c09 C12
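If your data are in the dense format and you would rather submit the sparse one, a conversion along the following lines should work. It is a minimal sketch that assumes tab-delimited input with the seven-column header shown earlier; the function name is hypothetical, and skipping empty cells is an assumption rather than documented behaviour.

```python
import csv
import io


def dense_to_sparse(dense_text):
    """Convert dense-format text (pid plus six allele columns) to sparse-format text."""
    reader = csv.reader(io.StringIO(dense_text), delimiter="\t")
    next(reader)  # skip the dense header row (pid, A1, A2, B1, B2, C1, C2)
    rows = [("pid", "hla")]
    for row in reader:
        if not row:
            continue  # ignore blank lines
        pid, alleles = row[0], row[1:7]
        # One output row per allele expression; empty cells are skipped (assumption).
        rows.extend((pid, allele) for allele in alleles if allele)
    out = io.StringIO()
    csv.writer(out, delimiter="\t", lineterminator="\n").writerows(rows)
    return out.getvalue()


dense = (
    "pid\tA1\tA2\tB1\tB2\tC1\tC2\n"
    "c09\tA03\tA2601/2602\tB1302\tB3801\tC0602\tC12\n"
)
print(dense_to_sparse(dense))
```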

Details of the training data & acknowledgements

The training data used for our model are an aggregate of two main data sets: i) those typed in the laboratory of Mary Carrington (see specific cohort acknowledgements below), and ii) data provided to us by the National Marrow Donor Program (NMDP), as described in Maiers et al, "High-resolution HLA alleles and haplotypes in the U.S. population" Hum Immunol. 2007 Sep; 68(9):779-88, though not the NMDP European data as this is transplant biased. All in all, there were 6057 African-descent data, 256 Amerindian data, 3088 Asian descent data, 8067 European descent data, and 2860 Hispanic descent data. A model for each ethnicity was trained separately.

The following cohorts and investigators generously allowed us to use their data, typed in the laboratory of Mary Carrington at NCI: the International HIV Controllers Study, the Multicenter AIDS Cohort Study, the Multicenter Hemophilia Cohort Study, the Washington and New York Men’s Cohort Study, the San Francisco City Clinic Cohort, the AIDS Linked to Intravenous Experience, the Swiss HIV Cohort, the Urban Health Study, the NIH Focal Segmental Glomerulosclerosis Genetic Study, Hepatitis C Antiviral Long-term Treatment against Cirrhosis, National Cancer Institute Surveillance Epidemiology and End Results Non-Hodgkin Lymphoma Case-Control Study, Woman Interagency Health Study, Classic Kaposi Sarcoma Case-Control Study I and II, Genetic Modifiers Study, Nairobi CTL Cohort, Grace John-Stewart, Stephen O’Brien, and Thomas O’Brien. Acquisition of this data has been funded in whole or in part with federal funds from the National Cancer Institute, National Institutes of Health, under contract N01-CO-12400. The content of this publication does not necessarily reflect the views or policies of the Department of Health and Human Services, nor does mention of trade names, commercial products, or organizations imply endorsement by the US Government. This research was supported in part by the Intramural Research Program of NIH, National Cancer Institute, Center for Cancer Research.

Tips on how to use the probabilistic tool optimally

If you cannot or do not wish to use the full probabilistic output from our tool, as described below, you can instead, with loss of information, use the single best ‘completion’ for each individual. However, if you do this, be careful to i) always break ties at random to avoid introducing bias, and ii) not filter out any individuals based on the probabilistic output, as this will bias your ensuing analyses. If you were to filter out individuals who have a low probability for their best ‘completion’, you would throw out individuals who don't match the model well, and would therefore effectively be making a biased selection of your data in a way that specifically depends on the HLA type of the individuals being filtered out. So if, for example, you were trying to correlate HLA type with progression/non-progression status, and the non-progressors tended not to match the model as well, you might filter out more of them, and specifically only those with certain HLA types, therefore biasing your analysis.
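Picking the single best completion with random tie-breaking could look like the sketch below. The (completion, probability) pair layout and the function name are assumptions for illustration; adapt them to however you load the tool's output.

```python
import math
import random


def best_completion(completions, rng=random):
    """Return the most probable completion, breaking exact ties uniformly at random.

    `completions` is a list of (completion, probability) pairs for one individual.
    """
    top = max(prob for _, prob in completions)
    # Treat probabilities equal up to floating-point noise as tied (assumption).
    tied = [c for c, prob in completions if math.isclose(prob, top, rel_tol=1e-9)]
    return rng.choice(tied)


# Every individual is kept, however low the winning probability is.
print(best_completion([("completion_1", 0.6), ("completion_2", 0.4)]))
```

Always taking the first of several tied completions would favour whichever allele happens to be listed first, which is the bias that random tie-breaking avoids.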

If possible, avoid using the single best answer, and instead try to do whatever analysis you are doing by pretending each entry in the list is an individual and weighting that individual by the probability column. This is the optimal way to use the probabilistic output generated by the model because it averages over the uncertainty that you have based on the output of the model. Although there are several ways one could use this output, the most general way is described next; technically, it is “sampling from the posterior HLA distribution”.
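As one simple instance of the weighting idea, the sketch below computes the expected number of individuals carrying each allele by treating every listed completion as a pseudo-individual weighted by its probability. The row layout (pid, alleles, probability) and the function name are assumptions for illustration, not the tool's actual output format.

```python
from collections import defaultdict


def expected_carrier_counts(rows):
    """Sum, per allele, the probabilities of the completions that contain it.

    `rows` is an iterable of (pid, alleles, probability), one row per completion.
    """
    counts = defaultdict(float)
    for _pid, alleles, probability in rows:
        for allele in set(alleles):  # count a homozygous carrier once
            counts[allele] += probability
    return dict(counts)


rows = [
    ("c04", ("A0301", "A2301"), 1.0),
    ("c09", ("A0301", "A2601"), 0.6),
    ("c09", ("A0301", "A2602"), 0.4),
]
print(expected_carrier_counts(rows))
```

Here c09 contributes 0.6 of a carrier to A2601 and 0.4 to A2602, rather than a whole carrier to whichever is more probable.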

Sampling from the output HLA probability distribution means that, for each individual, you choose one HLA completion probabilistically, based on the probabilities that our model outputs. So if there were two possible ‘completions’, one with probability 0.6 and the other with probability 0.4, then after doing this "sampling from the distribution" many, many times, on average you would pick the first 60% of the time and the second 40% of the time.

The way to achieve this is to generate a random number between 0 and 1 (uniformly on this interval). Say you generated 0.34. Then, using the list of completions for that individual, in the order they are written in the output, add up the probability of each possibility and all the ones before it; this gives the so-called "cumulative probability distribution". (See this small example in Excel to make things clearer.) Then you simply index into this cumulative probability distribution with the number 0.34 to pick one single HLA completion for that individual. This is called "sampling from the HLA posterior distribution", and you can think of it as a noisy version of "picking the single best completion".

Why do we want this strange noisy version? Once you have done this for each individual, you can compute a p-value from whatever statistical test you want to use, just as you would have had you known the true 4-digit completions. Now, the trick here is to repeat this "noisy-picking" many times (say N=10, or 100, or 1000), and then to average the p-values that result from each set of noisy-pickings to get a single p-value. As you can imagine, in a situation where there were no clear, single, best completions, this will give you a more reasonable answer than having just picked the one best completion for each person. It is, in effect, a way to average over the uncertainty that you know you have based on the output of the model. The more times you do this (larger N), the less information you throw out.
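The procedure above can be sketched in Python as follows. The data layout (a dictionary from individual id to a list of (completion, probability) pairs) and the function names are assumptions for illustration, and `test` stands for whatever statistical test you supply; it must take the sampled completions and return a p-value.

```python
import bisect
import itertools
import random


def pick(completions, u):
    """Index into the cumulative distribution with a number u in [0, 1)."""
    cumulative = list(itertools.accumulate(prob for _, prob in completions))
    # Scale u in case the probabilities do not sum to exactly 1.
    index = bisect.bisect_right(cumulative, u * cumulative[-1])
    return completions[min(index, len(completions) - 1)][0]


def averaged_p_value(posteriors, test, n_rounds=100, seed=0):
    """Average the p-values from n_rounds independent sets of sampled completions."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_rounds):
        # One noisy pick per individual, each with its own uniform random number.
        sampled = {pid: pick(comps, rng.random()) for pid, comps in posteriors.items()}
        total += test(sampled)
    return total / n_rounds


completions = [("completion_1", 0.6), ("completion_2", 0.4)]
print(pick(completions, 0.34))  # 0.34 falls in the first interval, [0, 0.6)
```

With the two-completion example, any random number below 0.6 selects the first completion and any number from 0.6 upward selects the second, so over many draws the first is chosen about 60% of the time.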
