A popular dataset originally analysed in Penrose et al. (1985). It lists estimates of the percentage of body fat, determined by underwater weighing, together with various body measurements for 252 men.
bodyfat.raw
A data frame with 252 rows and 15 columns:
density: Density (g/cm^3; determined from underwater weighing)
bodyfat: Percent body fat (from Siri's 1956 equation)
age: Age (years)
weight: Weight (lbs)
height: Height (inches)
neck: Neck circumference (cm)
chest: Chest circumference (cm)
abdomen: Abdomen 2 circumference (cm)
hip: Hip circumference (cm)
thigh: Thigh circumference (cm)
knee: Knee circumference (cm)
ankle: Ankle circumference (cm)
biceps: Biceps (extended) circumference (cm)
forearm: Forearm circumference (cm)
wrist: Wrist circumference (cm)
StatLib Datasets Archive: https://lib.stat.cmu.edu/datasets/bodyfat.
This data set can be used to illustrate data cleaning and multiple regression techniques (e.g. Johnson, 1996). The percentage of body fat for an individual can be estimated from body density, for instance by using Siri's (1956) equation: $$bodyfat = 495/density - 450.$$ Volume, and hence body density, can be accurately measured by underwater weighing (e.g. Katch and McArdle, 1977). However, underwater weighing is inconvenient and costly, so easy methods of estimating body fat from simple body measurements are desirable.
Measurement standards are apparently those listed in Behnke and Wilmore (1974), pp. 45-48, where, for instance, the abdomen 2 circumference is measured 'laterally, at the level of the iliac crests, and anteriorly, at the umbilicus'.
Johnson (1996) uses the original data in an activity to introduce students to data cleaning before performing multiple linear regression. An examination of the data reveals some unusual cases:
Cases 48, 76, and 96 seem to have a one-digit error in the listed density values.
Case 42 appears to have a one-digit error in the height value.
Case 182 appears to have an error in the density value: it is greater than 1.1, the density of the "fat free mass", so Siri's equation gives a negative estimate of body fat percentage, which was truncated to zero.
Johnson (1996) suggests some rules for correcting these values (see examples below).
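One way to spot density values that disagree with the recorded body fat is to recompute Siri's equation and compare. The following is only a sketch under stated assumptions: the helper names `siri_bodyfat` and `flag_siri_mismatch` and the tolerance of 0.5 percentage points are hypothetical choices, not part of the dataset's package, and the toy data frame holds only the density and body fat of cases 39 and 182 as printed in the examples.

```r
# Hypothetical helpers, not part of the dataset's package.
# Siri's (1956) equation: percent body fat from density (g/cm^3)
siri_bodyfat <- function(density) 495 / density - 450

# Flag rows whose recorded body fat differs from the Siri value
# by more than `tol` percentage points (tol = 0.5 is an assumed cutoff)
flag_siri_mismatch <- function(data, tol = 0.5) {
  abs(data$bodyfat - siri_bodyfat(data$density)) > tol
}

# Toy data: density and body fat of cases 39 and 182 only
toy <- data.frame(
  density = c(1.0202, 1.1089),
  bodyfat = c(35.2, 0.0),
  row.names = c("39", "182")
)

round(siri_bodyfat(toy$density), 1)     # 35.2 and -3.6
rownames(toy)[flag_siri_mismatch(toy)]  # "182"
```

Case 39 is consistent with Siri's equation, whereas case 182 gives a negative value that was recorded as zero. A check of this kind only covers the density and body fat columns; it cannot detect errors elsewhere, such as the height of case 42.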
Johnson, R. W. (1996). Fitting Percentage of Body Fat to Simple Body Measurements. Journal of Statistics Education, 4(1). doi:10.1080/10691898.1996.11910505.
Penrose, K., Nelson, A. and Fisher, A. (1985). Generalized Body Composition Prediction Equation for Men Using Simple Measurement Techniques. Medicine and Science in Sports and Exercise, 17(2), 189. doi:10.1249/00005768-198504000-00037.
Siri, W. E. (1956). Gross Composition of the Body, in Advances in Biological and Medical Physics (Vol. IV), eds. J. H. Lawrence and C. A. Tobias, Academic Press.
bodyfat <- bodyfat.raw
# Johnson's (1996) corrections
cases <- c(48, 76, 96) # bodyfat != 495/density - 450
bodyfat$density[cases] <- 495 / (bodyfat$bodyfat[cases] + 450)
bodyfat$height[42] <- 69.5
# Other possible data entry errors
# See https://stat-ata-asu.github.io/PredictiveModelBuilding/BFdata.html
bodyfat$ankle[31] <- 23.9
bodyfat$ankle[86] <- 23.7
bodyfat$forearm[159] <- 24.9
# Outlier and influential observation
outliers <- c(182, 39)
bodyfat[outliers, ]
#> density bodyfat age weight height neck chest abdomen hip thigh knee ankle
#> 182 1.1089 0.0 40 118.50 68.00 33.8 79.3 69.4 85.0 47.2 33.5 20.2
#> 39 1.0202 35.2 46 363.15 72.25 51.2 136.2 148.1 147.7 87.3 49.1 29.6
#> biceps forearm wrist
#> 182 27.7 24.6 16.5
#> 39 45.0 29.0 21.4
bodyfat <- bodyfat[-outliers, ]
# Body mass index (kg/m^2); weight is in lbs and height in inches
bodyfat$bmi <- with(bodyfat, (weight*0.45359237)/(height*0.0254)^2)
# Alternate body mass index
bodyfat$bmi2 <- with(bodyfat, (weight*0.45359237)^1.2/(height*0.0254)^3.3)
# See e.g. https://en.wikipedia.org/wiki/Body_fat_percentage#From_BMI
# (Adult) body fat percentage = (1.39 * BMI) + (0.16 * age)
#                               - (10.34 * gender) - 9
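The BMI-based formula quoted in the comment above can be sketched as a small function. The names `bmi_metric` and `bodyfat_from_bmi` are hypothetical, and `gender = 1` assumes the coding 1 for men and 0 for women. The inputs are the weight, height and age of case 39 as printed earlier in the examples.

```r
# Hypothetical helpers, not part of the dataset's package.
# BMI in kg/m^2 from weight in lbs and height in inches
bmi_metric <- function(weight_lbs, height_in) {
  (weight_lbs * 0.45359237) / (height_in * 0.0254)^2
}

# Adult body fat percentage from BMI and age
# (gender: assumed coding 1 = male, 0 = female)
bodyfat_from_bmi <- function(bmi, age, gender = 1) {
  1.39 * bmi + 0.16 * age - 10.34 * gender - 9
}

# Case 39: 363.15 lbs, 72.25 inches, 46 years
bmi39 <- bmi_metric(363.15, 72.25)
bodyfat_from_bmi(bmi39, age = 46)
```

The result can be compared with the density-based value of 35.2 recorded for this case.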