Technical Reports

The Department of Quantitative Health Sciences produces technical reports pertaining to faculty members' research interests and makes them available to others who share these interests. These publications often include more code than would be allowed in a journal. Some of these publications are precursors to peer-reviewed articles.

These technical reports are saved as PDF files; you need Adobe Acrobat Reader software to view or print them. Some reports are large and may take a few minutes to load.

Credit where credit is due

If you use technical reports from Mayo Clinic, please acknowledge the original contributor of the material.

Sample citation:

TM Therneau, PM Grambsch, VS Pankratz. Technical Report Series No. 66: Penalized Survival Models and Frailty. Department of Quantitative Health Sciences, Mayo Clinic, Rochester, Minnesota; 2000.

Copyright 2005 Mayo Foundation for Medical Education and Research. All rights reserved. Permission is granted for unlimited distribution for noncommercial use.

Technical reports


The Concordance Statistic and the Cox Model (PDF)

The concordance statistic is used to measure the amount of agreement between two variables, often a risk score and time until an event in survival analysis. Surprisingly, the concordance statistic is a score statistic from a Cox model with a time-varying coefficient. This relationship connects the literature on the concordance statistic with that on the Cox model, specifically nonparametric techniques for survival analysis with time-weighted Cox models.

The authors also discuss the sensitivity of the concordance statistic with respect to the censoring distribution and introduce robust variance estimators for both the concordance statistic and comparisons between two correlated concordance statistics.

Therneau TM, Watson DA (December 2017)


The Basics of Propensity Scoring and Marginal Structural Models (PDF)

This report describes the basics of marginal structural models, propensity scores and inverse probability weighting. These methods are useful for addressing confounding in observational studies. A detailed SAS example is shown with discussion of model assumptions and checks.

Crowson CS, Schenck LA, Green AB, Atkinson EJ, Therneau TM (August 2013)
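
As a rough illustration of the inverse probability weighting idea summarized above, the following Python sketch turns already-estimated propensity scores into weights. It is not taken from the report, which works its example in SAS; the function name `ip_weights`, its arguments, and the choice to offer stabilized weights are assumptions made for illustration only.

```python
def ip_weights(treated, propensity, stabilized=True):
    """Inverse probability of treatment weights (illustrative sketch).

    treated:    list of 0/1 treatment indicators
    propensity: estimated P(treated = 1 | covariates) for each subject
    stabilized: if True, multiply by the marginal probability of the
                treatment actually received (a common variance-reducing
                variant); if False, use the plain 1/probability weights
    """
    if len(treated) != len(propensity):
        raise ValueError("treated and propensity must have equal length")
    p_treated = sum(treated) / len(treated)  # marginal treatment probability
    weights = []
    for a, e in zip(treated, propensity):
        if not 0.0 < e < 1.0:
            # A score of exactly 0 or 1 violates the positivity assumption
            raise ValueError("propensity scores must lie strictly in (0, 1)")
        if a:
            numerator = p_treated if stabilized else 1.0
            weights.append(numerator / e)
        else:
            numerator = (1.0 - p_treated) if stabilized else 1.0
            weights.append(numerator / (1.0 - e))
    return weights


# Hypothetical data: two treated and two untreated subjects
example = ip_weights([1, 0, 1, 0], [0.5, 0.5, 0.8, 0.2], stabilized=False)
```

Subjects who received a treatment that was unlikely given their covariates get larger weights, so the weighted sample mimics one in which treatment is unrelated to the measured confounders. In practice the weights should be inspected for extreme values before use.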


Comparison of Mayo Clinic Coding Systems (PDF)

The goal of this project was to determine how three disease coding systems (HICDA, ICD-9 and SNOMED CT) compared in retrieving lists of individuals with specific medical conditions from among patients seen at Mayo Clinic throughout 2004.

St. Sauver J, Buntrock J, Rademacher D, Albrecht D, Gregg M, Ihrke D, Kaggal V, Weaver A (December 2010)


Attributable Risk Estimation in Cohort Studies (PDF)

Population attributable risk (PAR) is a function of time because both the prevalence of a risk factor and its effect on exposed individuals may change over time, as may the underlying risk of disease. Time-specific PAR can be estimated based on cumulative incidence adjusted for the competing risk of death.

Crowson CS, Therneau TM, O'Fallon WM (October 2009)


Poisson Models for Person-Years and Expected Rates (PDF)

This report summarizes approaches used to model observed events with respect to the expected number of events. Examples are provided for examining the excess risk or relative risk using both additive and multiplicative models.

Atkinson EJ, Crowson CS, Pederson RA, Therneau TM (September 2008)


Concordance for Survival Time Data: Fixed and Time-Dependent Covariates and Possible Ties in Predictor and Time (PDF)

Concordance, or synonymously the C-statistic, is a valuable measure of model discrimination in analyses involving survival time data. This report provides a definition of concordance in the case of survival data, allowing for time-dependent covariates with the counting process data representation and accounting for ties in the covariates and times.

Kremers, WK and Mayo Clinic William J. von Liebig Transplant Center (April 2007)
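
For readers unfamiliar with concordance, here is a minimal Python sketch of a pairwise C-statistic for right-censored data with fixed covariates. It is not the report's definition: it does not handle time-dependent covariates or the counting process representation, and the tie-handling rules (tied risk scores count one half, tied times are skipped) are simplifying assumptions. The function name `concordance` and its arguments are hypothetical.

```python
from itertools import combinations


def concordance(times, events, scores):
    """Pairwise concordance for right-censored survival data (sketch).

    times:  observed follow-up times
    events: 1 if the event was observed at that time, 0 if censored
    scores: risk scores, where a higher score predicts an earlier event
    """
    concordant = discordant = tied_score = 0
    for i, j in combinations(range(len(times)), 2):
        if times[i] == times[j]:
            continue  # simplification: pairs with tied times are skipped
        # Let `a` be the subject with the shorter observed time
        a, b = (i, j) if times[i] < times[j] else (j, i)
        if not events[a]:
            continue  # shorter time is censored, so the order is unknown
        if scores[a] > scores[b]:
            concordant += 1
        elif scores[a] < scores[b]:
            discordant += 1
        else:
            tied_score += 1
    comparable = concordant + discordant + tied_score
    if comparable == 0:
        raise ValueError("no comparable pairs")
    # Pairs with tied scores contribute one half
    return (concordant + 0.5 * tied_score) / comparable


# Hypothetical data: the third subject is censored at time 3
c = concordance([1, 2, 3, 4], [1, 1, 0, 1], [4, 3, 2, 1])
```

A value of 1 means the risk score orders every comparable pair correctly, 0.5 corresponds to an uninformative score, and values below 0.5 indicate a reversed ordering.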


Finding Optimal Cutpoints for Continuous Covariates with Binary and Time-to-Event Outcomes (PDF)

This report provides an overview of the literature and describes a unified strategy for finding optimal cutpoints with respect to binary and time-to-event outcomes. Two SAS macros for identifying a cutpoint have been developed in conjunction with this technical report.

Williams BA, Mandrekar JN, Mandrekar SJ, Cha SS, Furth AF (June 2006)


Estimating Genetic Components of Variance for Quantitative Traits in Family Studies Using the MULTIC Routines (PDF)

This report provides an overview of the theory behind the variance components approach for analyzing one or more quantitative traits in the face of familial correlation. It also provides an introduction to the S-Plus/R multic library, which contains software to carry out this analysis.

de Andrade M, Atkinson EJ, Lunde E, Amos CI, Chen J (May 2006)


Joint Estimation of Calibration and Expression for High-Density Oligonucleotide Arrays (PDF)

The authors present a unified algorithm that incorporates normalization and class comparison in one analysis using probe level perfect match and mismatch data. The algorithm is based on calibration models common to most biological assays, and the resulting chip-specific parameters have a natural interpretation.

The authors show that the algorithm fits into the statistical generalized linear models framework, describe a practical fitting strategy and present results of the algorithm based on commonly used metrics.

Oberg AL, Mahoney DW, Ballman KV, Therneau TM (February 2006)


Evaluation of a Simultaneous Mass-Calibration and Peak-Detection Algorithm for FT-ICR Mass Spectrometry (PDF)

Electrospray ionization Fourier transform ion cyclotron resonance mass spectrometry (ESI-FT-ICR-MS) is a potentially superior biomarker discovery platform because it offers high mass-measurement accuracy, high mass-measurement precision and high resolving power over a broad mass-to-charge range.

The authors describe and evaluate a simultaneous mass-calibration and peak-detection algorithm that exploits resolved isotopic peak-spacing information as well as space-charge frequency shifts across isotopic clusters that represent the same molecular species but differ in charge states by integer values.

Eckel-Passow JE, Therneau TM, Oberg AL, Mason CJ, Muddiman DC (January 2006)


What Does PLIER Really Do? (PDF)

The PLIER (Probe Logarithmic Intensity ERror) algorithm was developed by Affymetrix and released in 2004. It is part of several commercially available software packages that analyze GeneChip data, such as Strand Genomics' Avadis and Stratagene's ArrayAssist. Compared with the Affymetrix MAS5 algorithm, the PLIER algorithm produces an improved gene expression value (a summary value for a probe set) for the GeneChip microarray platform.

In this report, the authors look at why the PLIER algorithm performs so well given that its derivation is based on a biologically implausible assumption.

Therneau TM, Ballman KV (November 2005)


An Exploration of Affymetrix Probe-Set Intensities in Spike-In Experiments (PDF)

In this report, the authors look at the characteristics of the relationship between the observed probe intensity values produced by the Affymetrix GeneChip platform and the known concentration level of the target gene. This is done using data from three publicly available spike-in gene experiments.

The report discusses characteristics of the relationship and implications for statistical models and analysis of Affymetrix GeneChip data. The authors learned a considerable amount from looking at plots of the data, which are provided in the appendices (Appendix A (PDF), Appendix B (PDF) and Appendix C (PDF)), and encourage readers to look at and learn from the data themselves.

Ballman KV, Therneau TM (July 2005)


Evaluating Methods of Symmetry (PDF)

Knowing the symmetry of the underlying data is important for parametric analysis, fitting distributions or doing transformations to the data. The authors evaluate five different methods to assess skewness. They have also developed a comprehensive and efficient SAS macro for computing the various skewness measures and the appropriate power transformation, if one exists, to make an asymmetric distribution symmetric.

Mandrekar JN, Mandrekar SJ, Cha SS (January 2005)


Transmission Disequilibrium Methods for Family-Based Studies (PDF)

The study of the association of genetic markers with complex traits has generated a wide range of statistical methods, particularly those that are based on transmission disequilibrium. This report provides a review of methods in this area through approximately 1999.

Schaid DJ (July 2004)


Duane's Little Handbook of Advice for Young Biostatisticians on How to Work with Investigators (PDF)

This handbook is intended to provide young biostatisticians with a set of guidelines about how to work effectively with investigators. Not all of these guidelines will work well in every consulting situation. You may develop better ways of dealing with some situations than those given here. The advice should, however, help you at least to formulate for yourself how to conduct your own consultations.

Ilstrup DM (August 2004)


Normalization of Two-Channel Microarray Experiments: A Semiparametric Approach (PDF)

An important underlying assumption of any experiment is that the experimental subjects are similar across levels of the treatment variable, so that changes in the response variable can be attributed to exposure to the treatment under study.

This assumption is often not valid in the analysis of a microarray experiment due to systematic biases in the measured expression levels related to experimental factors such as spot location (often referred to as a print-tip effect), arrays, dyes and various interactions of these effects.

Thus, normalization is a critical initial step in the analysis of a microarray experiment, where the objective is to balance the individual signal intensity levels across the experimental factors, while maintaining the effect due to the treatment under investigation.

Eckel JE, Gennings C, Therneau TM, Burgoon LD, Boverhof DR, Zacharewski TR (July 2004)


Faster Cyclic Loess: Normalizing DNA Arrays via Linear Models (PDF)

This technical report describes a normalization technique that yields results similar to cyclic loess normalization with speed comparable to quantile normalization.

Ballman K, Grill D, Oberg A, Therneau T (November 2004)


An Introduction for Multiple Imputation Methods: Handling Missing Data With SAS V8.2 (PDF)

This report is organized to give a general overview of the basic concepts of data imputation, with emphasis on application. The purpose is to explain the basic principles of multiple imputation for handling missing data and how to implement this method using SAS version 8.2.

Vargas-Chanes D, Decker PA, Schroeder DR, Offord KP (July 2003)


Penalized Survival Models and Frailty (PDF)

The authors demonstrate that solutions for gamma shared frailty models can be obtained exactly via penalized estimation. Similarly, Gaussian frailty models are closely linked to penalized models. This makes it possible to apply penalized estimation to other frailty models using Laplace approximations. Fitting frailty models with penalized likelihoods can be made quite rapid by taking advantage of computational methods available for penalized models.

The authors have implemented penalized regression for the coxph function of S-Plus and illustrate the algorithms with examples using the Cox model.

Therneau TM, Grambsch PM, Pankratz VS (June 2000)


MCSTRAT: A SAS Macro to Analyze Data From a Matched or Finely Stratified Case-Control Design (PDF)

A case-control design is a common approach used to assess disease-exposure relationships, and the logistic regression model is the most common framework for the analysis of such data. This model expresses the logit transform of the disease probability as a linear combination of independent, or exposure, variables.

Vierkant RA, Kosanke JL, Therneau TM, Naessens JM (February 2000)


Calculating Incidence Rates Among Hospitalized Residents of Olmsted County, Minnesota (PDF)

The purpose of this technical report is to describe the SAS macro, %inchosp, which allows users to calculate the incidence rate of any disease or event among hospitalized residents of Olmsted County, Minnesota, from 1980 to 1990, provided that the location at onset is collected.

Lohse CM, Petterson TM, O'Fallon WM, Melton LJ III (February 1999)


Expected Survival Based on Hazard Rates (Update) (PDF)

This paper is an extension and update of Technical Report #52. An update to the rate tables themselves is based on the recently released data from the 1990 decennial census, which allowed the authors to replace extrapolated 1990 death rates with actual rates and to improve the extrapolated year 2000 values. Much of the material in the prior report is contained here to make this document useful on its own.

Therneau T, Offord J (February 1999)


Computing the Cox Model for Case Cohort Designs (PDF)

Prentice proposed a case-cohort design as an efficient subsampling mechanism for survival studies. Several other authors have expanded on these ideas to create a family of related sampling plans, along with estimators for the covariate effects. The authors describe how to obtain the proposed parameter estimates and their variance estimates using standard software packages, with SAS and S-Plus as particular examples.

Therneau TM, Li H (June 1998)


An Introduction to Recursive Partitioning Using the RPART Routines (PDF)

This report gives a short overview of the methods found in the RPART routines, which implement many of the ideas found in the CART (Classification and Regression Trees) book and programs of Breiman, Friedman, Olshen and Stone.

Therneau TM, Atkinson EJ (September 1997)


Extending the Cox Model (PDF)

Since its introduction, the proportional hazards model proposed by Cox has become the workhorse of regression analysis for censored data. In the last several years, the theoretical basis for the model has been solidified by connecting it to the study of counting processes and martingale theory. Comprehensive accounts of the underlying mathematics are given in the books of Fleming and Harrington and of Andersen et al.

These developments have, in turn, led to the introduction of several new extensions of the original model. These include the analysis of residuals, time-varying covariates, time-dependent coefficients, multiple or correlated observations, multiple time scales, time-dependent strata, and estimation of underlying hazard functions.

The aim of this monograph is to show how many of these methods and extensions of the model can be approached using standard statistical software, in particular the S-Plus and SAS packages. As such, it should be a bridge between the statistical journals and actual practice.

Therneau TM (November 1996)


How Many Stratification Factors is "Too Many" to Use in a Randomization Plan? (PDF)
Published in Controlled Clinical Trials, 1993;14:98.

The issue of stratification and its role in patient assignment has generated much discussion, mostly focused on its importance to a study or lack thereof. This report focuses on a much narrower problem: Assuming that stratified assignment is desired, how many factors can be accommodated?

This question is investigated for two methods of balanced patient assignment. The first is based on the minimization method of Taves, and the second on the commonly used method of stratified assignment.

Simulation results show that the former method can accommodate a large number of factors (10-20) without difficulty, but that the latter begins to fail if the total number of distinct combinations of factor levels is greater than approximately n/2. The two methods are related to a linear discriminant model, which helps to explain the results.

Therneau TM (1993)


Computerized Matching of Cases to Controls (PDF)

The purpose of this report is to describe a new SAS macro, %match, written to facilitate the matching of cases to controls, where one case is matched to one or more controls.

Bergstralh EJ, Kosanke JL (April 1995)


Extrapolation of the U.S. Life Tables (PDF)

Therneau TM, Scheib C (October 1994)


Generalized Population Attributable Risk Estimation (PDF)

Kahn MJ, O'Fallon WM, Sicks JD (April 1994, revised July 2000)


A Package for Survival Analysis in S (PDF)

Therneau TM (February 1999)


Expected Survival Based on Hazard Rates (PDF)

Therneau T, Sicks J, Bergstralh E, Offord J (March 1994)


Calculating Incidence, Prevalence and Mortality Rates for Olmsted County, Minnesota: An Update (PDF)

Bergstralh EJ, Offord KP, Chu CP, Beard CM, O'Fallon WM, Melton LJ III (April 1992)


Comparing Two Samples: Extensions of the t, Rank Sum, and Log Rank Tests (PDF)

O'Brien PC (November 1985)