Modeling Survival Data: Extending the Cox Model Datasets
Tumor recurrence data for patients with bladder cancer
- Treatment group: 1=placebo 2=thiotepa
- FU time and recurrence time are measured in months.
- Initial size of the largest tumor in centimeters.
- Initial number of tumors, '8' denotes 8 or more initial tumors.
- Up to 4 recurrence times are recorded per patient.
Reference: Wei, Lin and Weissfeld, JASA 1989, p 1067.
Chronic granulomatous disease
Data are from a placebo controlled trial of gamma interferon in chronic granulomatous disease (CGD). Uses the complete data on time to first serious infection observed through end of study for each patient, which includes the initial serious infections observed through the 7/15/89 interim analysis data cutoff, plus the residual data on occurrence of initial serious infections between the interim analysis cutoff and the final blinded study visit for each patient. Only one patient was taken off on the day of his last infection.
- Subject id, (128 subjects, max id=135)
- Center
- 174 Harvard Medical School
- 204 Scripps Institute
- 222 Copenhagen, Denmark
- 238 NIH
- 242 L.A. Children's Hospital
- 243 Mott Children's Hospital
- 245 Univ. of Utah
- 246 Children's Hospital of PA
- 248 Univ. of Washington
- 249 Univ. of MN
- 328 Univ. of Zurich, Switzerland
- 331 Texas Children's Hospital
- 332 Amsterdam, Netherlands
- 336 Mt. Sinai Medical Center
- Randomization date, month/day/year
- Treatment arm, 0=placebo, 1=rIFN-g
- Sex, 1=Male, 2=Female
- Age (in years) at time of study entry.
- Height (in cm) at time of study entry.
- Weight (in kg) at time of study entry.
- Pattern of Inheritance (stratification factor)
- 1 X-linked
- 2 Autosomal Recessive
- Use of corticosteroids at time of study entry
- 1 Used corticosteroids
- 2 Did not use corticosteroids
- Use of prophylactic antibiotics at time of study entry
- 1 Used prophylactic antibiotics
- 2 Did not use prophylactic antibiotics
- Institution category
- 1 US - NIH
- 2 US - Other
- 3 Europe - Amsterdam
- 4 Europe - Other
- Time from randomization to last follow-up
- Event times: time(s) from randomization until infection. The maximum number of infections observed was 7.
Reference: Fleming and Harrington, Counting Processes and Survival Analysis, appendix D.2.
Levamisole and 5FU in Stage C colon cancer
The colon data set is used in D. Lin, Cox regression analysis of multivariate failure time data, the marginal approach, Statistics in Medicine 13:2233-2247, 1994. This data is not exactly the same as in Lin’s paper — we have more follow-up. One can get close to Lin’s results by truncating follow-up on 15 Aug, 1989.
The Moertel paper discusses results from both stage B2 and stage C patients. This data file contains only the stage C results. Patients 1-929 are from the main trial, and the remainder from an older trial; Lin uses only the larger study.
- Patient number
- Study
- 1=Larger study (Moertel)
- 2=Historical study (Laurie)
- Treatment: 1=Observation, 2=Levamisole, 3=5-FU + Levamisole
- Sex: 0=female, 1=male
- Age
- Date of registration
- Obstruction: 0=No, 1=Yes
- Perforation: 0=No, 1=Yes
- Adherence: 0=No, 1=Yes
- Number of positive nodes
- Date of progression
- Progression status: 0=No, 1=Yes
- Date of last contact or death
- Last status: 0=alive, 1=dead
The remaining variables are only available for study 1
- First stratification factor
- 0 Treatment started 7-20 days post surgery
- 1 Treatment started 21-35 days post surgery
- Second stratification factor
- 0 1-4 lymph nodes involved
- 1 more than 4 involved lymph nodes
- Third stratification factor, invasion of local organs: 0, 1, or 2
- Location of primary neoplasm
- 1 Cecum
- 2 Right Colon
- 3 Hepatic flexure
- 4 Transverse colon
- 5 Splenic flexure
- 6 Left colon
- 7 Sigmoid colon
- 8 Rectosigmoid
- 9 Rectum
- 10 Multiple sites
- Histologic type
- 1 Adenocarcinoma
- 2 Colloid(mucinous)
- 3 Signet ring type
- 4 Other
- Differentiation
- 1 Well
- 2 Moderate well (gr 2-3)
- 3 Poor (grade 4)
- Extent of local spread
- 1 Submucosa/not muscle
- 2 Muscular/not serosa
- 3 Serosa/not contiguous
- 4 Contiguous structures
- Regional Implants: 0=No, 1=Yes
- Pre-operative CEA level (missing for many patients)
- Date of tumor resection
- Date of start of treatment
Reference: Moertel et al., Levamisole and fluorouracil for adjuvant therapy of resected colon carcinoma, NEJM 322: 352-58, 1990. The historical data are from Laurie et al., J Clin Oncol, 1989, vol 7, p 1447-56.
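The truncation mentioned above (cutting follow-up at 15 Aug 1989 to approximate Lin's results) could be sketched as follows. This is a minimal illustration only: the function name `truncate_followup` and the example records are hypothetical, the dates are assumed to be already parsed into `datetime.date` values, and patients registered after the cutoff are not handled.

```python
from datetime import date

CUTOFF = date(1989, 8, 15)  # truncation date given in the text

def truncate_followup(event_date, status, cutoff=CUTOFF):
    """Administratively censor one (date, status) pair at `cutoff`.

    event_date: date of last contact or death (or of progression)
    status:     1 if the event occurred on event_date, 0 if censored
    """
    if event_date > cutoff:
        # Anything after the cutoff is treated as unobserved.
        return cutoff, 0
    return event_date, status

# Hypothetical records, not taken from the data file.
print(truncate_followup(date(1990, 3, 1), 1))   # death after the cutoff
print(truncate_followup(date(1988, 11, 2), 1))  # death before the cutoff
```

The same function would be applied separately to the progression and death endpoints of each patient.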
Diabetes
The 197 patients in this dataset were a 50% random sample of the patients with “high-risk” diabetic retinopathy as defined by the Diabetic Retinopathy Study (DRS). Each patient had one eye randomized to laser treatment and the other eye received no treatment. For each eye, the event of interest was the time from initiation of treatment to the time when visual acuity dropped below 5/200 two visits in a row (call it “blindness”). Thus there is a built-in lag time of approximately 6 months (visits were every 3 months). Survival times in this dataset are therefore the actual time to blindness in months, minus the minimum possible time to event (6.5 months). Censoring was caused by death, dropout, or end of the study.
- Subject id
- laser type: 1=xenon, 2=argon
- treated eye: 1=right 2=left
- age at diagnosis of diabetes
- type of diabetes: 1=juvenile (age at dx < 20), 2=adult
- Outcome for the treated eye:
- risk group: 6-12
- status: 0=censored, 1=blindness
- follow-up time
- Outcome for the untreated eye
- risk group: 6-12
- status: 0=censored, 1=blindness
- follow-up time
The risk group variable was used to define the “high risk” samples.
Reference: Huster, Brookmeyer and Self, Biometrics, 1989.
A reference for the Diabetic Retinopathy Study (DRS) which describes the design and interim results is American Journal of Ophthalmology, 1976, 81:4, pp 383-396.
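The time shift described for this data set is plain arithmetic, and a small sketch may help when moving between the recorded times and calendar time since treatment. The names `LAG_MONTHS`, `recorded_time` and `actual_time` are hypothetical and are not part of the data file.

```python
LAG_MONTHS = 6.5  # minimum possible time to event, per the description above

def recorded_time(actual_months):
    """Time as stored in the data set: actual time to blindness minus the lag."""
    return actual_months - LAG_MONTHS

def actual_time(recorded_months):
    """Recover the actual number of months since initiation of treatment."""
    return recorded_months + LAG_MONTHS

# A hypothetical eye that went blind 20 months after treatment began.
print(recorded_time(20.0))               # value that would appear in the file
print(actual_time(recorded_time(20.0)))  # round trip back to 20 months
```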
Generator fans
The data come from a field engineering study of the time to failure of diesel generator fans. The ultimate goal was to decide whether or not to replace the working fans with a higher quality fan to prevent future failures. Seventy generators were studied. For each one, the number of hours of running time from its first being put into service until fan failure or until the end of the study (whichever came first) was recorded.
- hours of service
- status: 1=failure, 0=censored
Reference: Nelson, Journal of Quality Technology, 1:27-52, 1969.
Survival times of gastric cancer patients
- survival time
- status: 1=death, 0=censored
- treatment: 1=chemotherapy 2=combined chemotherapy/radiation
Reference: Moreau, T., O'Quigley, J., and Mesbah, M. (1985), A global goodness-of-fit statistic for the proportional hazards model, Appl. Statist., 34, 212-218 (p. 213).
Infinite coefficient example
A simple data set that shows that an infinite coefficient is not always obvious. This was sent to us as a query about the S-Plus software: 'why does the robust variance fail?' The answer is that a standard error for an infinite coefficient really doesn't make sense, and that the robust variance, being an approximate jackknife, is always far from infinity.
> coxph(Surv(t1, t2, status) ~ x1 + x2 + cluster(id))
     coef exp(coef) se(coef) robust se     z       p
x1   7.64      2085     25.3     0.732 10.44 0.0e+00
x2   5.85       347     25.3     1.151  5.08 3.8e-07

Likelihood ratio test=9.84 on 2 df, p=0.0073  n=50
Both x1 and x2 are binary covariates. A table showing the number of event/censored observations of each type is
             x2
          0       1
       +---------------
x1   0 |  1/7     1/7
     1 | 24/10    0/0
There is no obvious “no hazard” column or row, such as usually causes infinite coefficients in a main effects model; or even a 0/x no events cell as would cause this for a model with interactions.
However, detailed examination shows that the one event in the 0,0 cell of the table happens to be the largest time point in the entire data set. The likelihood of the model as a whole is unchanged if this observation were censored; its contribution to the score statistic is (covariate value of the event - average covariate value), which is 0 since the average includes only one observation. A pair of covariates with pattern
    no risk          positive risk
    positive risk    no risk
will have both coefficients infinite in the bivariate model, while both may be finite in the univariate models.
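The claim about the score statistic can be checked numerically. The sketch below is an illustration under simplifying assumptions: it uses ordinary right-censored data with one covariate rather than the (t1, t2, status) form of the example, the function name `score_contribution` is hypothetical, and the toy data are made up. It computes one subject's term in the Cox partial-likelihood score, which is the covariate value minus the risk-weighted mean over the risk set.

```python
import math

def score_contribution(times, status, x, i, beta=0.0):
    """Score term for subject i: x[i] minus the weighted mean of x at risk.

    Returns 0.0 for a censored subject, who contributes no score term.
    """
    if status[i] == 0:
        return 0.0
    # Risk set: everyone whose follow-up reaches the event time of subject i.
    at_risk = [j for j in range(len(times)) if times[j] >= times[i]]
    weights = [math.exp(beta * x[j]) for j in at_risk]
    mean_x = sum(w * x[j] for w, j in zip(weights, at_risk)) / sum(weights)
    return x[i] - mean_x

# Made-up data: the last subject has an event at the largest time.
times = [2, 3, 5, 8, 9]
status = [1, 0, 1, 1, 1]
x = [1, 0, 1, 1, 0]

print(score_contribution(times, status, x, 4))            # risk set is one subject
print(score_contribution(times, status, x, 4, beta=3.0))  # same for any beta
print(score_contribution(times, status, x, 0))            # an earlier event
```

Because the risk set at the largest time holds only the subject who fails, the weighted mean equals that subject's own covariate value whatever `beta` is, so the term vanishes and the event carries no information about the coefficient.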
Recurrent infection of kidney catheters
Data on the recurrence times to infection, at the point of insertion of the catheter, for kidney patients using portable dialysis equipment. Catheters may be removed for reasons other than infection, in which case the observation is censored. Each patient has exactly 2 observations.
The data set has been used by several authors to illustrate random effects (“frailty”) models for survival data. However, any non-zero estimate of the random effect is almost entirely due to one outlier, subject 21.
- patient id
- follow-up time
- status: 0=censored, 1=infection
- age
- sex (1=male, 2=female)
- disease type
- 0=Glomerulo Nephritis
- 1=Acute Nephritis
- 2=Polycystic Kidney Disease
- 3=Other
- estimate of the frailty, as listed in the reference below
Reference: McGilchrist and Aisbett, Biometrics 47, 461-66, 1991.
Survival of subjects with advanced lung cancer
- Enrolling institution
- Survival time
- Status 1=alive, 2=dead
- Age
- Sex 1=male, 2=female
- ECOG performance score, as judged by physician: 0, 1, 2, 3
- Karnofsky performance score, as judged by physician: 100, 90, ..., 30
- Karnofsky performance score, as judged by the patient (self)
- Daily calories consumed at meals
- Weight loss in the last 30 days (negative number = weight gain)
Reference: Loprinzi et al., J. Clinical Oncology, 1994.
Monoclonal Gammopathy of Undetermined Significance
All 241 patients who had been diagnosed at the Mayo Clinic with an apparently benign monoclonal gammopathy before January 1, 1971, were followed forward through 1992. Of primary interest was the possible development of serious plasma cell proliferative disorders; however, the advanced age of many patients makes death from other causes a significant competing risk.
Most subjects in the study were discovered incidentally in the process of being examined for other indications. The laboratory values (albumin, creatinine, etc.) may be related to the severity of those other indications, but have shown less relationship to MGUS per se.
- Subject number
- Age at first diagnosis of MGUS
- Sex: 1=male, 2=female
- Calendar year of first diagnosis of MGUS
- Type of plasma cell proliferative disorder (blank if none)
- AM systemic amyloidosis
- LP malignant lymphoproliferative disease
- MA macroglobulinemia
- MM multiple myeloma
- Time to plasma cell proliferative disorder
- Time to death or last follow-up
- Status at last follow-up: 0=alive, 1=dead
- Albumin level, at MGUS diagnosis
- Serum creatinine level, at MGUS diagnosis
- Hemoglobin level, at MGUS diagnosis
- Size of the monoclonal protein peak, at MGUS diagnosis
Reference: R Kyle, Benign monoclonal gammopathy — after 20 to 35 years of follow-up, Mayo Clinic Proc 1993; 68:26-36.
Primary Biliary Cirrhosis
A nearly identical data set is found in appendix D of Fleming and Harrington. The differences in this data set are: age is in days, status is coded with 3 levels, and the sex and stage variables are not missing for obs 313-418.
The data is from the Mayo Clinic trial in primary biliary cirrhosis (PBC) of the liver conducted between 1974 and 1984. A total of 424 PBC patients, referred to Mayo Clinic during that ten-year interval, met eligibility criteria for the randomized placebo controlled trial of the drug D-penicillamine. The first 312 cases in the data set participated in the randomized trial and contain largely complete data. The additional 112 cases did not participate in the clinical trial, but consented to have basic measurements recorded and to be followed for survival. Six of those cases were lost to follow-up shortly after diagnosis, so the data here are on an additional 106 cases as well as the 312 randomized participants. Missing data items are denoted by a period.
- case number
- number of days between registration and the earlier of death, transplantation, or study analysis time in July, 1986
- status: 0=alive, 1=liver transplant, 2=dead
- drug: 1=D-penicillamine, 2=placebo
- age in days
- sex: 0=male, 1=female
- presence of ascites: 0=no 1=yes
- presence of hepatomegaly: 0=no 1=yes
- presence of spiders: 0=no 1=yes
- presence of edema: 0=no edema and no diuretic therapy for edema; .5=edema present without diuretics, or edema resolved by diuretics; 1=edema despite diuretic therapy
- serum bilirubin in mg/dl
- serum cholesterol in mg/dl
- albumin in gm/dl
- urine copper in ug/day
- alkaline phosphatase in U/liter
- SGOT in U/ml
- triglycerides in mg/dl
- platelets per cubic ml / 1000
- prothrombin time in seconds
- histologic stage of disease
Reference: Fleming and Harrington, Counting Processes and Survival Analysis, Wiley, 1991.
Primary Biliary Cirrhosis, sequential data
This data set is a follow-up to the original PBC data set, and contains the follow-up laboratory data for each study patient. An analysis based on the enclosed data is found in Murtaugh PA, Dickson ER, Van Dam GM, Malinchoc M, Grambsch PM, Langworthy AL, Gips CH, “Primary biliary cirrhosis: prediction of short-term survival based on repeated patient visits,” Hepatology 20(1.1):126-34, 1994.
The primary PBC data set contains only baseline measurements of the laboratory parameters. This data set contains multiple laboratory results, but only on the first 312 patients. Some baseline data values in this file differ from the original PBC file, for instance, the data errors in prothrombin time and age which were discovered after the original analysis, during research work on dfbeta residuals. (These two data points are discussed in Fleming and Harrington, figure 4.6.7). Another major difference is that there was significantly more follow-up for many of the patients at the time this data set was assembled.
One “feature” of the data deserves special comment. The last observation before death or liver transplant often has many more missing covariates than other data rows. The original clinical protocol for these patients specified visits at 6 months, 1 year, and annually thereafter. At these protocol visits lab values were obtained for a large pre-specified battery of tests. “Extra” visits, often undertaken because of worsening medical condition, did not necessarily have all this lab work. The missing values are thus potentially informative, and violate the usual “missing at random” (MCAR or MAR) assumptions made in most analyses. Because of the earlier published results on the Mayo PBC risk score, however, the 5 variables involved in that computation were usually obtained, i.e., age, bilirubin, albumin, prothrombin time, and edema score.
- case number
- number of days between registration and the earlier of death, transplantation, or study analysis time
- status: 0=alive, 1=transplanted, 2=dead
- drug: 1=D-penicillamine, 0=placebo
- age in days, at registration
- sex: 0=male, 1=female
- day: number of days between enrollment and this visit date, remaining values on the line of data refer to this visit.
- presence of ascites: 0=no 1=yes
- presence of hepatomegaly: 0=no 1=yes
- presence of spiders: 0=no 1=yes
- presence of edema: 0=no edema and no diuretic therapy for edema; .5 =edema present without diuretics, or edema resolved by diuretics; 1 =edema despite diuretic therapy
- serum bilirubin in mg/dl
- serum cholesterol in mg/dl
- albumin in gm/dl
- alkaline phosphatase in U/liter
- SGOT in U/ml (serum glutamic-oxaloacetic transaminase, the enzyme name has subsequently changed to “AST” in the medical literature)
- platelets per cubic ml / 1000
- prothrombin time in seconds
- histologic stage of disease
Litter matched rats
There are 3 rats per litter, one treated and 2 control.
- Litter number, even litter numbers are male rats, odd litter numbers are female
- Treatment: 0=no, 1=yes
- Follow-up time
- Status: 0-tumor, 1-censored (due to animal’s death)
Reference: Mantel, Bohidar, and Ciminera (1979), Cancer Research.
Randomized trial of rhDNase for treatment of cystic fibrosis
- subject id
- treatment arm: 0=placebo, 1=rhDNase
- FEV: “forced expiratory volume”, a measure of lung capacity
- FEV2: a second measurement of FEV
- randomization date
- last follow-up date on study
- infections: there are up to 5 infections, each is a pair of numbers, e.g., “90 104” shows that a patient had a lung infection and was on antibiotic therapy from day 90 to 104.
Note! A few subjects were infected at the time of enrollment, 951317 for instance has a first infection interval of -21 to 7. We do not count this first infection as an “event”, and the subject first enters the risk set at day 7.
Reference: TM Therneau and SA Hamilton, “rhDNase as an example of recurrent event analysis”, Statistics in Medicine, vol 16, 2029-2047, 1997.
To our embarrassment, we cannot exactly reproduce the numbers in the paper. There are multiple ways to define an infection, the number of endpoints and exact timing of them changed at times during the analysis, and we didn't save copies of the relevant data. (For instance, does an infection start with oral antibiotic or is IV antibiotic required?) None of the substantive conclusions is changed; this data set gives the results in the book.
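The handling of infection intervals described in the note above can be sketched as a conversion to (start, stop, event) rows. This is only an illustration under stated assumptions: the function name `at_risk_intervals` is hypothetical, a subject is assumed to be off the risk set exactly during each recorded infection interval and to re-enter on its last day, and apart from the -21 to 7 interval quoted in the note the numbers are made up.

```python
def at_risk_intervals(infections, last_followup):
    """Convert infection (begin, end) day pairs into (start, stop, event) rows.

    Day 0 is randomization. An infection already under way at entry is not
    counted as an event; the subject enters the risk set when it ends.
    """
    rows = []
    entry = 0  # first day at risk, moved forward past each infection
    for begin, end in sorted(infections):
        if begin <= entry:
            # Infected at (re-)entry: no event, just delay entry.
            entry = max(entry, end)
            continue
        rows.append((entry, begin, 1))  # at risk until this infection starts
        entry = end                     # back at risk once therapy ends
    if entry < last_followup:
        rows.append((entry, last_followup, 0))  # final censored interval
    return rows

# First interval is the one quoted in the note; the rest is hypothetical.
print(at_risk_intervals([(-21, 7), (90, 104)], 170))
# -> [(7, 90, 1), (104, 170, 0)]
```

Whether extra days off risk should follow each infection is a modeling choice this sketch does not address.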
Stanford Heart Transplant study
- Id
- Date of birth
- Date of acceptance
- Date of transplant — missing if no transplant was done
- Date last follow-up
- Status at last follow up 1=dead, 0=alive
- Prior surgery 1=yes 0=no
Reference: Crowley and Hu, JASA, 1977, p27-36.
Randomized trial of UDCA in PBC
A randomized trial of ursodeoxycholic acid in patients with primary biliary cirrhosis, conducted at the Mayo Clinic from 1988 to 1992.
- id number
- rx: 0=placebo 1=UDCA
- date of entry to the study
- date of last complete follow-up
- histologic stage: 0=stage 1/2 at entry 1=stage 3/4
- bilirubin value at entry to the study
- Mayo risk score at entry to the study
- date of death
- date of liver transplant
- date of voluntary withdrawal
- date of histologic progression
- date of appearance of esophageal varices
- date of ascites
- date of encephalopathy
- date of doubling of bilirubin
- date of “worsening of symptom score by 2”
A date is missing if a given complication did not arise. It is possible for death, transplant, or withdrawal to occur after the date of the last complete follow-up (they can be ascertained without a patient visit).
Reference: Lindor, et al., Gastroenterology, 1994, pp 1284-1290.
Veteran’s Administration Lung Cancer Trial
- Treatment: 1=standard, 2=test
- Celltype: 1=squamous, 2=smallcell, 3=adeno, 4=large
- Survival in days
- Status: 1=dead, 0=censored
- Karnofsky score
- Months from Diagnosis
- Age in years
- Prior therapy: 0=no, 10=yes
NOTE: Prior to Nov 2001 the data set posted here was incorrect: line 1 of the data set had been omitted.