On this page, I compare the genotype data reported in Ian Hasse's paper with the TB genotype data that came in the Microsoft Access database.

I am comparing these data sets because Ian's paper reports a very large proportion of genotyped cases (93%), while I have far fewer (about 60%). Since both data sets come from the same source, I try to find the reason for this difference.

Ian's Table


Ian found 900 distinct cases between 1996 and 2000. It is unclear exactly how he assigned each case to a year; he may have used the initial date of treatment.

In the TB database that I am using, there is a variable called "Year of Case". I used this variable as a comparison, because very few rows have this field missing.

By using the "Year of Case", I found 902 distinct cases.

Relapse Cases

Ian does not appear to exclude relapse cases. My data contain 3 such cases, and I did not exclude them in this analysis either.

Invalid Cases

In my data, there is a field called "invalid cases". However, in this period, it is never true.

Institutional Cases

Ian excluded 37 people who lived in institutions. He defines institutions as largely refugee dormitories, plus long-term care facilities, prisons or shelters.

I found only 19 people who lived in institutions. I drew this from the is_institution column in the situation_role table. This is a large difference (19 versus 37), and I'm still not sure why it exists.

Therefore Ian was left with 863 cases (900 - 37), while I was left with 883 cases (902 - 19).

Positive Culture

In my data, out of the 883 cases, 102 were culture negative, leaving 781 that were culture positive.

In Ian's data, out of 863 cases, 103 were culture negative, leaving 760 that were culture positive.
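The two exclusion steps so far can be checked with a few lines of arithmetic. This is only a sketch: the counts are the ones quoted on this page, while the dictionary layout and the function name are made up for illustration.

```python
# Case counts quoted on this page; the layout is illustrative only.
funnel = {
    "Ian": {"distinct": 900, "institutional": 37, "culture_negative": 103},
    "mine": {"distinct": 902, "institutional": 19, "culture_negative": 102},
}

def culture_positive(counts):
    """Apply the two exclusions in order and return both running totals."""
    after_institutions = counts["distinct"] - counts["institutional"]
    after_culture = after_institutions - counts["culture_negative"]
    return after_institutions, after_culture

for name, counts in funnel.items():
    print(name, culture_positive(counts))
```

Both data sets lose almost the same number of culture-negative cases, so the gap in the culture-positive totals comes almost entirely from the institutional exclusion.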


Ian says that they attempted to genotype all culture-positive TB. They first tried RFLP, which resulted in some isolates having six or more bands, which they called 'high-copy'. They found clusters among the high-copy cases in two ways: first, they required exact band matches, and second, they used a looser criterion, which tolerated one difference.
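To make the difference between the two criteria concrete, here is a minimal sketch of how strict and loose clustering could work on band patterns. It is not Ian's actual procedure. It assumes that each isolate's RFLP pattern can be written as a set of band positions, that "one difference" means one band present in one pattern and absent in the other, and that matches are chained transitively. The isolate names and band positions are invented.

```python
from itertools import combinations

def cluster(patterns, max_diff=0):
    """Group isolates whose band patterns differ by at most max_diff bands.

    patterns maps an isolate id to a frozenset of band positions.
    Matches are chained transitively (an assumption), using union-find.
    """
    parent = {iso: iso for iso in patterns}

    def find(x):
        # Follow parent links to the root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(patterns, 2):
        # Symmetric difference: bands present in one pattern but not the other.
        if len(patterns[a] ^ patterns[b]) <= max_diff:
            parent[find(a)] = find(b)

    groups = {}
    for iso in patterns:
        groups.setdefault(find(iso), set()).add(iso)
    # Only groups of two or more isolates count as clusters.
    return [members for members in groups.values() if len(members) >= 2]

# Invented example patterns (hypothetical isolates A-D).
patterns = {
    "A": frozenset({1, 2, 3, 4, 5, 6}),
    "B": frozenset({1, 2, 3, 4, 5, 6}),
    "C": frozenset({1, 2, 3, 4, 5, 6, 7}),  # one extra band
    "D": frozenset({10, 11, 12, 13, 14, 15}),
}

strict = cluster(patterns, max_diff=0)
loose = cluster(patterns, max_diff=1)
```

In this toy example the strict criterion groups only A and B, while the loose criterion also pulls in C. That is the general effect to expect: the loose criterion gives fewer singletons and larger clusters.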

Because RFLP has poor specificity for low-copy TB, they spoligotyped all low-copy TB. They then found clusters within the low-copy TB group.

It is very improbable that a low-copy TB is related to a high-copy TB. Thus, clustering can be done separately within the high-copy group and within the low-copy group.

In Ian's data, out of the remaining 760, 54 did not have an informative genotype, leaving 706 genotyped cases. This is where he gets his 93% genotyped number from (706 of 760).

In Ian's data, of the 706 genotyped culture-positive cases, 563 were high-copy, and 143 were low-copy.

In my data, of the 781 culture-positive cases, 586 had an entry in the tb_genetic table.
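As a quick check of the two genotyped proportions, the sketch below divides the genotyped counts by the culture-positive counts given above. The function name is illustrative, and for my data it assumes that having an entry in tb_genetic is what counts as "genotyped".

```python
def genotyped_share(genotyped, culture_positive):
    """Fraction of culture-positive cases that have a genotype."""
    return genotyped / culture_positive

# Ian: culture-positive cases minus those without an informative genotype.
ian_genotyped = 760 - 54
ian_share = genotyped_share(ian_genotyped, 760)

# Mine: assumes a tb_genetic entry is what counts as genotyped.
my_share = genotyped_share(586, 781)

print(f"Ian:  {ian_genotyped} of 760 = {ian_share:.1%}")
print(f"Mine: 586 of 781 = {my_share:.1%}")
```

With culture-positive cases as the denominator, this gives about 93% for Ian and about 75% for my data. Dividing my 586 by all 902 cases instead gives about 65%, so the choice of denominator matters when quoting a "proportion genotyped".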

In my data, it is unknown what method was used to generate the genotypes, but I highly suspect that I have only the high-copy clustering, because my 586 genotyped cases are close to Ian's 563 high-copy cases. Also, because I was given no way to distinguish between the two types (high-copy and low-copy), I further suspect that I only have one of them.



When using the strict criterion, within 563 high-copy cases, Ian found 27 clusters, ranging in size from 2 to 9 people. 69 people were part of a cluster.

In my data, which I suspect is high-copy only, within 586 cases, I found 29 clusters, ranging in size from 2 to 9 people. 75 people were part of a cluster.

The strong similarity between these numbers implies that my clusters operate on the 'strict' criterion. I would rather have the 'loose' criterion.

When using the loose criterion, within 563 high-copy cases, Ian found 45 clusters, ranging in size from 2 to 29 people. 194 people were part of a cluster.
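Putting the three sets of cluster numbers side by side shows why my data look like the strict criterion. This sketch uses only the counts quoted above; the labels and the function name are illustrative.

```python
# (clusters, clustered cases, total cases) as quoted on this page.
reported = {
    "Ian, strict": (27, 69, 563),
    "mine": (29, 75, 586),
    "Ian, loose": (45, 194, 563),
}

def clustered_share(clustered, total):
    """Fraction of cases that belong to any cluster."""
    return clustered / total

for label, (n_clusters, clustered, total) in reported.items():
    share = clustered_share(clustered, total)
    print(f"{label}: {n_clusters} clusters, {share:.1%} of cases clustered")
```

About 12% of Ian's high-copy cases are clustered under the strict criterion and about 13% of mine are, compared with about 34% of Ian's under the loose criterion.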


Within 143 low-copy cases, Ian found 15 clusters. 49 people were part of a cluster.

I suspect that I do not have any low-copy spoligotype data.


I strongly suspect that I only have data for high-copy cases. Also, my clusters are probably using the 'strict' criterion.

What I need is the spoligotype data, and the RFLP loose criterion data.


TB Genotype Data Comparison (last edited 2010-05-19 13:28:08 by AmanVerma)