site stats

Fitting Distributions to Data With R

In “Fitting Distributions with R” Vito Ricci writes;

“Fitting distributions consists in finding a mathematical function which represents in a good way a statistical variable. A statistician often is facing with this problem: he has some observations of a quantitative character $x_1, x_2, …, x_n$ and he wishes to test if those observations, being a sample of an unknown population, belong from a population with a pdf (probability density function) $\ f(x,\theta)$, where $\ \theta$ is a vector of parameters to estimate with available data.

We can identify 4 steps in fitting distributions:

  1. Model/function choice: hypothesize families of distributions;
  2. Estimate parameters;
  3. Evaluate quality of fit;
  4. Goodness of fit statistical tests.”

In SAS this can be done by using proc capability whereas in R we can do the same thing by using fdistrplus and some other packages. In this post I will try to compare the procedures in R and SAS.

Following code chunk creates 10,000 observations from normal distribution with a mean of 10 and standard deviation of 5 and then gives the summary of the data and plots a histogram of it.

1
2
3
4
5
set.seed(pi)
library(ggplot2)
x <- rnorm(10000, mean = 10, sd = 5)
summary(x)
qplot(x)

If we import the data we created in R into SAS and run the following code;

PROC CAPABILITY;
HISTOGRAM x / NORMAL;
RUN;

SAS gives us the following results;

  1. Moments
  2. Basic Statistical Measures (Location and Variability)
  3. Tests for Location
  4. Observed Quantiles
  5. Extreme Observations
  6. Histogram
  7. Parameter Estimates
  8. Goodness-of-Fit Test Results
  9. Estimated Quantiles

We can obtain same results in R by using e1071, raster, plotrix, stats, fitdistrplus and nortest packages.

1. Moments

N :

1
length(x)

Sum Weights : A numeric variable can be specified as a weight variable to weight the values of the analysis variable. The default weight variable is defined to be 1 for each observation. This field is the sum of observation values for the weight variable. In our case, since we didn’t specify a weight variable, SAS uses the default weight variable. Therefore, the sum of weight is the same as the number of observations. (Source)

1
sum(x)

Mean :

1
mean(x)

Sum Observations :

1
sum(x)

Std Deviation :

1
sd(x)

Skewness :

1
2
library(e1071)
skewness(x, type = 2)

Kurtosis :

1
kurtosis(x, type = 2)

Uncorrected SS : Sum of squared data values. (Source)

1
sum(x*x)

Corrected SS : The sum of squared distance of data values from the mean. (Source)

1
sum((x-mean(x))^2)

Coeff Variation : The ratio of the standard deviation to the mean. (Source)

1
2
library(raster)
cv(x)

Std Error Mean : The estimated standard deviation of the sample mean. (Source)

1
2
library(plotrix)
std.error(x)

2. Basic Statistical Measures (Location and Variability)

Range :

1
range(x)[[2]]-range(x)[[1]]

Interquartile Range :

1
IQR(x)

3. Tests for Location

Student’s t : Skipped this part

Sign : Skipped this part

Signed Rank :

1
wilcox.test(x)

4. Observed Quantiles

Signed Rank :

1
quantile(x, probs = c(0, 0.01, 0.05, 0.10, 0.25, 0.50, 0.75, 0.90, 0.95, 0.99, 1))

5. Extreme Observations : Skipped this part

6. Histogram

1
2
3
library(fitdistrplus)
normal.fit <- fitdist(x,"norm")
qplot(x, geom = 'blank') + geom_histogram(aes(y = ..density..), alpha = 0.5) + geom_line(aes(y = ..density.., linetype = 'Empirical Density'), stat = 'density', size=0.8) + stat_function(fun = dnorm, aes(linetype = 'Normal Density'), args=list(mean=mean(x), sd=sd(x))) + scale_linetype_discrete(name = "Densities") + scale_x_continuous(name="x") + scale_y_continuous(name="Density") + theme(legend.justification=c(0,1), legend.position=c(0,1))

6. Parameter Estimates

Mean (Mu) :

1
summary(normal.fit)$estimate[[1]]

Std Dev (Sigma) :

1
summary(normal.fit)$estimate[[2]] 

7. Goodness-of-Fit Test Results

Kolmogorov-Smirnov, Cramer-von Mises, and Anderson-Darling

1
gofstat.normal <- gofstat(normal.fit)

or

Kolmogorov-Smirnov :

1
2
3
library(nortest)
gofstat.normal$ks
lillie.test(x)

Cramer-von Mises :

1
2
gofstat.normal$cvm
cvm.test(x)

Anderson-Darling :

1
2
gofstat.normal$ad
ad.test(x)

Chi-Square :

1
2
3
gofstat.normal$chisq
gofstat.normal$chisqpvalue
gofstat.normal$chisqdf

8. Estimated Quantiles : Skipped this part

We can change the commands to fit other distributions. This is as simple as changing normal to something like beta(theta = SOME NUMBER, scale = SOME NUMBER) or weibull in SAS. Whereas in R one may change the name of the distribution in normal.fit <- fitdist(x,"norm") command to the desired distribution name. While fitting densities you should take the properties of specific distributions into account. For example, Beta distribution is defined between 0 and 1. So you may need to rescale your data in order to fit the Beta distribution.