Popular dataset originally analysed in Penrose et al. (1985). Lists estimates of the percentage of body fat, determined by underwater weighing, together with various body measurements for 252 men.

`bodyfat.raw`

A data frame with 252 rows and 15 columns:

- density
Density (g/cm^3; determined from underwater weighing)

- bodyfat
Percent body fat (from Siri's 1956 equation)

- age
Age (years)

- weight
Weight (lbs)

- height
Height (inches)

- neck
Neck circumference (cm)

- chest
Chest circumference (cm)

- abdomen
Abdomen 2 circumference (cm)

- hip
Hip circumference (cm)

- thigh
Thigh circumference (cm)

- knee
Knee circumference (cm)

- ankle
Ankle circumference (cm)

- biceps
Biceps (extended) circumference (cm)

- forearm
Forearm circumference (cm)

- wrist
Wrist circumference (cm)

StatLib Datasets Archive: https://lib.stat.cmu.edu/datasets/bodyfat.

This data set can be used to illustrate data cleaning and multiple regression techniques (e.g. Johnson 1996). Percentage of body fat for an individual can be estimated from body density, for instance by using Siri's (1956) equation: $$\text{bodyfat} = 495/\text{density} - 450.$$ Volume, and hence body density, can be accurately measured by underwater weighing (e.g. Katch and McArdle, 1977). However, this procedure is inconvenient and costly, so it is desirable to have simpler methods that estimate body fat from easily obtained body measurements.
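As a minimal sketch of Siri's equation and its inverse, the following defines two small helpers and applies them to the densities of cases 39 and 182 as printed in the examples. The function names `siri_bodyfat` and `siri_density` are illustrative and are not part of the dataset or any package.

```r
# Siri's (1956) equation: percent body fat from body density (g/cm^3).
# Hypothetical helper name, used here for illustration only.
siri_bodyfat <- function(density) 495 / density - 450

# Inverse relation: the body density implied by a percent body fat value.
siri_density <- function(bodyfat) 495 / (bodyfat + 450)

# Densities of cases 39 and 182, copied from the printed example output.
density <- c(case39 = 1.0202, case182 = 1.1089)
round(siri_bodyfat(density), 1)
```

Case 39 comes out at about 35.2, matching its listed body fat, while case 182 comes out negative (about -3.6), which is consistent with the listed value of zero being a truncated estimate.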

Measurement standards are apparently those listed in Behnke and Wilmore (1974), pp. 45-48 where, for instance, the abdomen 2 circumference is measured 'laterally, at the level of the iliac crests, and anteriorly, at the umbilicus'.

Johnson (1996) uses the original data in an activity to introduce students to data cleaning before performing multiple linear regression. An examination of the data reveals some unusual cases:

Cases 48, 76, and 96 seem to have a one-digit error in the listed density values.

Case 42 appears to have a one-digit error in the height value.

Case 182 appears to have an error in the density value: it is greater than 1.1, the density of the "fat free mass", which results in a negative estimate of body fat percentage that was truncated to zero.

Johnson (1996) suggests some rules for correcting these values (see examples below).
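One way such unusual cases could be found is to compare the listed body fat with the value implied by Siri's equation and to check for densities above 1.1. The sketch below is only an illustration, not Johnson's procedure: the helper `flag_inconsistent` and the tolerance of 0.5 percentage points are assumptions, and the two rows are copied from the printed example output so that the block runs without the full data.

```r
# Two rows copied from the printed example output (cases 39 and 182).
toy <- data.frame(
  density = c(1.0202, 1.1089),
  bodyfat = c(35.2, 0.0),
  row.names = c("39", "182")
)

# Hypothetical helper: row names whose listed body fat differs from
# Siri's equation by more than `tol` percentage points (tol is an assumption).
flag_inconsistent <- function(data, tol = 0.5) {
  implied <- 495 / data$density - 450
  rownames(data)[abs(data$bodyfat - implied) > tol]
}

flag_inconsistent(toy)

# Densities above 1.1 (the "fat free mass" density) are physically suspect.
rownames(toy)[toy$density > 1.1]
```

Both checks flag case 182 and leave case 39 alone. Applied to the full data frame, the same comparison would be expected to pick up rows where density and body fat disagree, although the appropriate tolerance depends on how the listed values were rounded.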

Johnson, R. W. (1996). Fitting Percentage of Body Fat to Simple Body
Measurements. *Journal of Statistics Education*, 4(1).
doi:10.1080/10691898.1996.11910505.

Penrose, K., Nelson, A. and Fisher, A. (1985). Generalized Body Composition
Prediction Equation for Men Using Simple Measurement Techniques.
*Medicine and Science in Sports and Exercise*, 17(2), 189.
doi:10.1249/00005768-198504000-00037.

Siri, W. E. (1956). Gross Composition of the Body, in *Advances in Biological
and Medical Physics* (Vol. IV), eds. J. H. Lawrence and C. A. Tobias,
Academic Press.

```
bodyfat <- bodyfat.raw
# Johnson's (1996) corrections
cases <- c(48, 76, 96) # bodyfat != 495/density - 450
bodyfat$density[cases] <- 495 / (bodyfat$bodyfat[cases] + 450)
bodyfat$height[42] <- 69.5
# Other possible data entry errors
# See https://stat-ata-asu.github.io/PredictiveModelBuilding/BFdata.html
bodyfat$ankle[31] <- 23.9
bodyfat$ankle[86] <- 23.7
bodyfat$forearm[159] <- 24.9
# Outlier and influential observation
outliers <- c(182, 39)
bodyfat[outliers, ]
#>     density bodyfat age weight height neck chest abdomen   hip thigh knee ankle
#> 182  1.1089     0.0  40 118.50  68.00 33.8  79.3    69.4  85.0  47.2 33.5  20.2
#> 39   1.0202    35.2  46 363.15  72.25 51.2 136.2   148.1 147.7  87.3 49.1  29.6
#>     biceps forearm wrist
#> 182   27.7    24.6  16.5
#> 39    45.0    29.0  21.4
bodyfat <- bodyfat[-outliers, ]
# Body mass index (kg/m^2); weight is recorded in lbs and height in inches
bodyfat$bmi <- with(bodyfat, (weight*0.45359237)/(height*0.0254)^2)
# Alternate body mass index
bodyfat$bmi2 <- with(bodyfat, (weight*0.45359237)^1.2/(height*0.0254)^3.3)
# See e.g. https://en.wikipedia.org/wiki/Body_fat_percentage#From_BMI
# (Adult) body fat percentage = (1.39 * BMI) + (0.16 * age)
#   - (10.34 * gender) - 9
```
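Because weight is recorded in pounds and height in inches, a body mass index in kg/m^2 needs both quantities converted to metric units first. The sketch below works through this for cases 182 and 39, using the values from the printed output, and then applies the BMI-based formula quoted in the comments above. The helper names are hypothetical, and coding `gender` as 1 for men is an assumption about that formula rather than something stated in this documentation.

```r
# Unit conversions (exact definitions of the pound and the inch).
lb_to_kg <- 0.45359237
in_to_m  <- 0.0254

# Hypothetical helper: BMI (kg/m^2) from weight in lbs and height in inches.
bmi_imperial <- function(weight_lb, height_in) {
  (weight_lb * lb_to_kg) / (height_in * in_to_m)^2
}

# Hypothetical helper: BMI-based body fat estimate from the quoted formula.
# Assumption: `gender` is coded 1 for men and 0 for women.
bodyfat_from_bmi <- function(bmi, age, gender = 1) {
  1.39 * bmi + 0.16 * age - 10.34 * gender - 9
}

# Cases 182 and 39, copied from the printed example output.
cases <- data.frame(
  age = c(40, 46), weight = c(118.50, 363.15), height = c(68.00, 72.25),
  row.names = c("182", "39")
)
cases$bmi <- with(cases, bmi_imperial(weight, height))
cases$bodyfat_bmi <- with(cases, bodyfat_from_bmi(bmi, age))
round(cases, 1)
```

The resulting BMI values are about 18.0 for case 182 and about 48.9 for case 39. The BMI-based estimates (roughly 12 and 56 percent) differ considerably from the listed values of 0.0 and 35.2, which fits with both cases being treated as unusual observations in the examples.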