PKDD'99 Discovery Challenge
Guide to the Medical Data Set
Domain
The database was collected at Chiba University hospital. Each patient
came to the outpatient clinic of the hospital on collagen diseases, as recommended
by a home doctor or a general physician in the local hospital.
Collagen diseases are auto-immune diseases. Patients generate
antibodies attacking their own bodies. For example, if a patient generates
antibodies in the lungs, he or she will progressively lose respiratory function
and eventually die. The disease mechanisms are only
partially known and their classification is still fuzzy. Some patients
may generate many kinds of antibodies and their manifestations may
include all the characteristics of collagen diseases.
In collagen diseases, thrombosis is one of the most important and severe
complications and one of the major causes of death.
Thrombosis is an increased coagulation of blood that clogs blood vessels.
Usually it will last several hours and can repeat over time. Thrombosis can
arise from different collagen diseases.
It has been found that this complication is closely related to anti-cardiolipin
antibodies. This was discovered by physicians, one of
whom donated the datasets for the Discovery Challenge.
Thrombosis must be treated as an emergency.
It is important to detect and predict the possibility of its occurrence.
However, no immunology expert has yet carried out this kind of database
analysis.
Domain experts are very much interested in discovering regularities
behind patients' observations.
Goals
- Search for patterns which detect and predict thrombosis.
- Search for temporal patterns specific/sensitive to thrombosis.
(The examination date is very close to the date of thrombosis.
Specific/sensitive patterns found before or after the
thrombosis would be very useful.)
- Search for features which classify collagen diseases correctly.
- Search for temporal patterns specific/sensitive to each collagen disease.
Domain experts told us that useful patterns, if discovered,
would be acceptable for publication in major journals on rheumatology
(collagen diseases).
Evaluation Scheme
One of the domain experts, who is well known in rheumatology,
will attend the PKDD'99 conference and evaluate all the results.
The results will also be evaluated in a clinical environment in the
future.
Database
The database consists of three tables
(TSUM_A.CSV, TSUM_B.CSV, TSUM_C.CSV).
The patients in these three tables are connected by ID number.
TSUM_A.CSV
Basic information about patients (input by doctors).
This dataset includes all patients (about 1000 records).
| item | meaning | remark |
| ID | identification of the patient | |
| Sex | | |
| Birthday | | YYYY/M/D |
| Description date | the first date when the patient's data was recorded | YY.MM.DD |
| First date | the date when the patient came to the hospital | YY.MM.DD |
| Admission | patient was admitted to the hospital (+) or followed at the outpatient clinic (-) | |
| Diagnosis | disease names | multivalued attribute |
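The date columns of TSUM_A use two different layouts, so a loader has to try both. The following is a minimal Python sketch, assuming the values appear exactly as the remark column states. The function name `parse_tsum_a_date` is hypothetical and not part of the challenge material. Note that Python's `%y` maps two-digit years 69-99 to 1969-1999 and 00-68 to 2000-2068.

```python
from datetime import datetime, date

# Layouts taken from the remark column of the TSUM_A table:
# Birthday uses YYYY/M/D; Description date and First date use YY.MM.DD.
FORMATS = ("%Y/%m/%d", "%y.%m.%d")

def parse_tsum_a_date(text):
    """Return a date for a TSUM_A date string, or None if no layout matches."""
    text = text.strip()
    for fmt in FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue  # try the next layout
    return None  # empty or unexpected value (assumption: treat as missing)
```

Returning `None` for unparseable values is a design choice of this sketch; a stricter loader might raise instead.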
TSUM_B.CSV
Special laboratory examinations (input by doctors;
measured by the Laboratory on Collagen Diseases). This
dataset does not include all the patients,
only those who underwent these special tests.
| item | meaning | remark |
| ID | identification of the patient | |
| Examination Date | date of the test | YYYY/MM/DD |
| aCL IgG | anti-Cardiolipin antibody (IgG) concentration | |
| aCL IgM | anti-Cardiolipin antibody (IgM) concentration | |
| ANA | anti-nuclear antibody concentration | |
| ANA Pattern | pattern observed in the sheet of ANA examination | |
| aCL IgA | anti-Cardiolipin antibody (IgA) concentration | |
| Diagnosis | disease names | multivalued attribute |
| KCT | measure of degree of coagulation | |
| RVVT | measure of degree of coagulation | |
| LAC | measure of degree of coagulation | |
| Symptoms | other symptoms observed | multivalued attribute |
| Thrombosis | degree of thrombosis | 0: negative (no thrombosis); 1: positive (the most severe one); 2: positive (severe); 3: positive (mild) |
The examination date is very close to the date of thrombosis. In negative examples, these tests
were performed because thrombosis was suspected.
TSUM_C.CSV
Laboratory examinations stored in the Hospital Information System
(from 1980 to March 1999). All the data are ordinary laboratory examinations
and carry time stamps. The tests are not necessarily connected to
thrombosis.
| item | meaning | normal range |
| ID | identification of the patient | |
| Date | Date of the laboratory tests (YYMMDD) | |
| GOT | AST glutamic oxaloacetic transaminase | N < 60 |
| GPT | ALT glutamic pyruvic transaminase | N < 60 |
| LDH | lactate dehydrogenase | N < 500 |
| ALP | alkaline phosphatase | N < 300 |
| TP | total protein | 6.0 < N < 8.5 |
| ALB | albumin | 3.5 < N < 5.5 |
| UA | uric acid | N < 8.0 (Male) N < 6.5 (Female) |
| UN | urea nitrogen | N < 30 |
| CRE | creatinine | N < 1.5 |
| T-BIL | total bilirubin | N < 2.0 |
| T-CHO | total cholesterol | N < 250 |
| TG | triglyceride | N < 200 |
| CPK | creatine phosphokinase | N < 250 |
| GLU | blood glucose | N < 180 |
| WBC | White blood cell | 3.5 < N < 9.0 |
| RBC | Red blood cell | 3.5 < N < 6.0 |
| HGB | Hemoglobin | 10 < N < 17 |
| HCT | Hematocrit | 29 < N < 52 |
| PLT | platelet | 100 < N < 400 |
| PT | prothrombin time | N < 14 |
| Note | comment for the test PT | |
| APTT | activated partial thromboplastin time | N < 45 |
| FG | fibrinogen | 150 < N < 450 |
| AT3 | marker of DIC, one of the most important complications of collagen diseases | 70 < N < 130 |
| A2PI | marker of DIC | 70 < N < 130 |
| U-PRO | proteinuria | 0 < N < 30 |
| IGG | Ig G | 900 < N < 2000 |
| IGA | Ig A | 80 < N < 500 |
| IGM | Ig M | 40 < N < 400 |
| CRP | C-reactive protein | N= -, +-, or N < 1.0 |
| RA | Rheumatoid Factor | N= -, +- |
| RF | RAHA | N < 20 |
| C3 | complement 3 | N > 35 |
| C4 | complement 4 | N > 10 |
| RNP | anti-ribonuclear protein | N= -, +- |
| SM | anti-SM | N= -, +- |
| SCl70 | anti-scl70 | N= -, +- |
| SSA | anti-SSA | N= -, +- |
| SSB | anti-SSB | N= -, +- |
| CENTROMEA | anti-centromere | N= -, +- |
| DNA | anti-DNA | N < 8 |
| DNA-II | anti-DNA | N < 8 |
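The normal ranges in the table above can be used to flag abnormal measurements. The sketch below is a minimal Python illustration, not part of the original challenge material. It encodes only four of the listed items, and the names `NORMAL_RANGES` and `is_normal` are hypothetical. Bounds are treated as strict inequalities, as the table writes them.

```python
# (low, high) bounds copied from the TSUM_C table; None means the table
# gives no bound on that side. Only a few items are shown here.
NORMAL_RANGES = {
    "GOT": (None, 60),    # N < 60
    "TP": (6.0, 8.5),     # 6.0 < N < 8.5
    "PLT": (100, 400),    # 100 < N < 400
    "C3": (35, None),     # N > 35
}

def is_normal(item, value):
    """Return True if value lies strictly inside the normal range of item."""
    low, high = NORMAL_RANGES[item]
    if low is not None and value <= low:
        return False
    if high is not None and value >= high:
        return False
    return True
```

For example, `is_normal("PLT", 90)` is false because 90 lies below the lower bound of 100.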
This database was donated by Dr. Katsuhiko Takabayashi and prepared by
Prof. Shusaku Tsumoto.
For questions about the data or the task description, contact Petr Berka.
All questions and answers will be published as appendices to this
document.
Asked Questions
- The description of attributes (and their normal ranges) doesn't correspond
to the data.
(21.7.1999)
Following several questions, I checked the original database at the hospital
and found several errors in the attribute information.
The errors affect the attributes from PT to TAT2. Please replace the first line with the
second one.
Old: PT APTT FG PIC TAT TAT2 U-PRO
New: PT Note APTT FG AT3 A2PI U-PRO
The Normal Range is:
AT3 70 < N < 130
A2PI 70 < N < 130
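The renaming above can be applied mechanically to a column list. The following Python sketch assumes the columns are held as a list of names in file order. Whether the CSV files actually carry a header row is not stated here, and `fix_header` is a hypothetical helper.

```python
# Old and new attribute runs, copied from the correction above.
OLD = "PT APTT FG PIC TAT TAT2 U-PRO".split()
NEW = "PT Note APTT FG AT3 A2PI U-PRO".split()

def fix_header(columns):
    """Replace the old PT..U-PRO run with the corrected names, if present."""
    columns = list(columns)
    width = len(OLD)
    for start in range(len(columns) - width + 1):
        if columns[start:start + width] == OLD:
            return columns[:start] + NEW + columns[start + width:]
    return columns  # old run not found: leave the list unchanged
```

Both runs contain seven names, so the number of columns stays the same after the replacement.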
- While most values fall into the normal range (as given by the Guide to the
Medical Data Set), this is not the case for items UN and CRE.
(16.7.1999)
These are the experts' mistakes. The normal ranges are UN: N<30 and CRE: N<1.5.
Sorry.
- For the following items the normal range does not fit the data:
WBC: values 0.1<=N<=119.5, normal range 3500< N <9000,
RBC: values 0.01<=N<=6.57, normal range 350< N <600,
(16.7.1999)
Yes, these are also the experts' mistakes. Please change the normal ranges to:
WBC: normal range 3.5 < N < 9.0
RBC: normal range 3.5 < N < 6.0
- Are the values of diagnosis ordered lists or just sets? I.e., is
'RA, SJS' equal to 'SJS, RA'?
(14.7.1999)
The values are just sets, so RA,SJS is equal to SJS,RA.
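Because order carries no meaning, a multivalued field can be loaded into a set before comparison. This is a minimal Python sketch, assuming the values are separated by commas as in the 'RA, SJS' example. The function name `parse_multivalued` is hypothetical.

```python
def parse_multivalued(field):
    """Split a comma-separated field into an order-independent frozenset."""
    return frozenset(part.strip() for part in field.split(",") if part.strip())
```

With this representation, `parse_multivalued("RA, SJS")` and `parse_multivalued("SJS,RA")` compare equal.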
- What does the word 'susp' mean that comes after many diagnoses, as in
'SLE susp'? And what about the words in parentheses, as in
'BEHCET (entero)', 'EN (r/o BEHCET)' and 'RA (seronegative)'?
Can all diagnoses like 'SLE susp' be grouped into a higher level like
'SLE'?
(14.7.1999)
Susp stands for "suspected", so these diagnoses have not been confirmed.
'BEHCET (entero)': entero stands for the enterocolitis type of Behcet's disease.
It is a type of Behcet's
disease in which the colon is the main target of the autoimmune process. Behcet's
disease has several types.
In the case of BEHCET (neuro), the main target is the neurons.
'EN (r/o BEHCET)': This is an enterocolitis case in which Behcet's disease is
strongly suspected.
'RA (seronegative)': From the observations (symptoms), this case can be
diagnosed as RA, but the serum tests (laboratory examinations) are
negative. We do see such unusual cases in real clinical practice. So this
case is clinically RA but negative in the laboratory tests. (So,
from the viewpoint of the laboratory tests, they are "true-negative"
cases.)
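The qualifiers described above can be separated from the disease name before any grouping is attempted. The Python sketch below is one possible way to do this, assuming the qualifier is either a trailing ' susp' or a trailing parenthesised note. The helper `split_diagnosis` is hypothetical. Whether suspected cases should be merged with confirmed ones remains a decision for the analyst.

```python
import re

def split_diagnosis(value):
    """Return (base name, suspected flag, parenthesised note or None)."""
    text = value.strip()
    note = None
    match = re.search(r"\(([^)]*)\)\s*$", text)  # trailing "(...)" qualifier
    if match:
        note = match.group(1).strip()
        text = text[:match.start()].strip()
    suspected = text.lower().endswith(" susp")
    if suspected:
        text = text[:-len(" susp")].strip()
    return text, suspected, note
```

For instance, 'SLE susp' yields the base name 'SLE' with the suspected flag set, and 'BEHCET (entero)' yields 'BEHCET' with the note 'entero'.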
- We came across some attribute values in table TSUM_C which puzzled
us a bit. (30.6.1999)
I found one error in the attribute information for Tsumoto_c.csv.
One laboratory examination was omitted between TAT and U-PRO.
All the questions about CRP, IGM, RF and IGG stem from this error.
So, please replace the attribute list:
ID Date GOT GPT LDH ALP TP ALB UA UN
CRE T-BIL T-CHO TG CPK
GLU WBC RBC HGB HCT PLT PT APTT FG PIC
TAT U-PRO IGG IGA
IGM CRP RA RF C3 C4 RNP SM SC170 SSA
SSB CENTROMEA DNA DNA-II
with:
ID Date GOT GPT LDH ALP TP ALB UA UN
CRE T-BIL T-CHO TG CPK
GLU WBC RBC HGB HCT PLT PT APTT FG PIC
TAT TAT2 U-PRO IGG IGA
IGM CRP RA RF C3 C4 RNP SM SC170 SSA
SSB CENTROMEA DNA DNA-II
- Another little puzzle has been the attribute ANA Pattern in table
TSUM_B. First of all, are its values ordered lists or just sets?
I.e., is "P,S" = "S,P"? Are there any other considerations you think important
regarding this attribute? (30.6.1999)
These values are just sets, so {P,S} = {S,P}.
- Attributes RNP, SM, SC170, SSA, SSB, CENTROMEA, which are
expected to take values in [-, +-], often have numbers as values. How
are they supposed to be interpreted? (30.6.1999)
Usually, these tests have two kinds of measurements:
qualitative and quantitative. We thought that they were measured by
qualitative methods. I will check the normal range.
- We found values such as "<30" and ">=1000" for some numerical
attributes. How should they be interpreted? Could they be replaced with
some number? (30.6.1999)
They mean that the values are too small or too large. You can
substitute a number for each case: for example, "<30" can be transformed to 10 and
">=1000" to 1500.
- The new attribute U-PRO (after TAT-2) sometimes takes the
value TR. What does it mean?
TR means that the laboratory could not measure the value because of
some problem with the blood serum. So, it means "error in measurement due to
the problems with submitted blood serum".
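The last two answers (censored values such as "<30", and the TR marker) can be handled in one conversion step. This is a minimal Python sketch under stated assumptions. The substitution values are the ones suggested in the answer, blank cells are treated as missing (an assumption), and `SUBSTITUTES` and `to_number` are hypothetical names.

```python
# Substitutions suggested in the answer; any other censored value
# (for example "<5") needs its own entry chosen by the analyst.
SUBSTITUTES = {"<30": 10.0, ">=1000": 1500.0}

def to_number(raw):
    """Convert a raw cell to float; None for TR (failed measurement) or blank."""
    text = raw.strip()
    if text == "" or text.upper() == "TR":
        return None
    if text in SUBSTITUTES:
        return SUBSTITUTES[text]
    return float(text)  # raises ValueError for anything unexpected
```

Unlisted censored values raise an error instead of being guessed, so every substitution stays an explicit choice.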
- We have downloaded and started to analyze the medical
data set. It seems that 350 of the patient IDs in TSUM_B.csv
have no corresponding entry in TSUM_A.csv.
Therefore, data of BOTH examinations exist for only
about 400 of the 1200 patients in TSUM_A.csv? (30.6.1999)
Tsumoto_a.csv includes the data of all patients who have been
followed by doctors at the outpatient clinic of the University Hospital
for at least several months.
On the other hand, Tsumoto_b.csv includes the data of two types of
patients. The first is a patient followed at the University Hospital.
The second is a patient who is not followed at the University Hospital
but for whom specific laboratory examinations were made (even in this case,
we register the patient and assign a University Hospital ID number).
So, tsumoto_a.csv and tsumoto_b.csv together include three types of patients:
- First type: a patient followed at the outpatient clinic of the University
hospital, for whom no special examinations were made.
(Patients of the first type do not suffer from thrombosis;
that is, they are negative with respect to thrombosis.)
- Second type: a patient followed at the University hospital for whom
special examinations were made.
- Third type: a patient who is not followed at the University hospital
but for whom special examinations were made.
Thus, about 400 patients in Tsumoto_b.csv belong to the third type.
Because they are not followed at the University hospital, they do not have temporal
data. So, please use first-type and second-type patients for the analysis
of thrombosis.
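The three patient types described above follow directly from which tables an ID appears in. The following Python sketch shows the set logic, assuming the ID columns of the two files have already been read into collections. The function name `classify_patients` is hypothetical.

```python
def classify_patients(ids_a, ids_b):
    """Group patient IDs by their presence in TSUM_A and TSUM_B."""
    a, b = set(ids_a), set(ids_b)
    return {
        "first": a - b,   # followed at the hospital, no special examinations
        "second": a & b,  # followed at the hospital, special examinations made
        "third": b - a,   # special examinations only, not followed
    }
```

Following the advice above, an analysis of thrombosis would keep the "first" and "second" groups and set the "third" group aside.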
- For a given patient, are the values of attribute "Diagnosis" in table
TSUM_A the same as the values of attribute "Diagnosis" in table
TSUM_B?
Yes. If not, please use the diagnosis in TSUM_A. That is the most
recently updated file about diagnosis.
- Is "Diagnosis" (tables TSUM_A, TSUM_B) a concept that
contributors to the discovery challenge should consider, or is
"Thrombosis" the only target attribute?
Diagnosis is also a target attribute.
My colleagues are interested not only in "Thrombosis" but also in
"Diagnosis".
- In table TSUM_C, you gave the normal range of values. What is
the possible range of all values? (E.g. I have found value "+" for attributes with
normal range "-", "+-", but I have also found value "-" in an
attribute with normal range N<8.)
Okay, first, {-,+-,+} is a usual notation in medical "qualitative"
tests.
"-" is negative (in normal range), "+-" is not negative but at the border of the normal range, and "+" means
positive, or abnormal.
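A loader can distinguish the three qualitative symbols from numeric results as follows. This is a minimal Python sketch, assuming every cell is either one of the three symbols or a plain number. The labels and the names `QUALITATIVE` and `interpret` are hypothetical.

```python
# Meaning of the qualitative symbols, as described in the answer.
QUALITATIVE = {"-": "negative", "+-": "borderline", "+": "positive"}

def interpret(raw):
    """Return ('qualitative', label) or ('quantitative', float value)."""
    text = raw.strip()
    if text in QUALITATIVE:
        return ("qualitative", QUALITATIVE[text])
    return ("quantitative", float(text))  # ValueError if neither form applies
```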
{-,+-,+} can be observed in a simple test: usually, each symbol
corresponds
to a range of "quantitative" values. In the case of "-" with normal range
(N<8),
"-" means that the value of this test is in the normal range (N<8).