Fitting Distributions to Data With R
In “Fitting Distributions with R” Vito Ricci writes:
“Fitting distributions consists in finding a mathematical function which represents in a good way a statistical variable. A statistician often is facing with this problem: he has some observations of a quantitative character $x_1, x_2, …, x_n$ and he wishes to test if those observations, being a sample of an unknown population, belong from a population with a pdf (probability density function) $\ f(x,\theta)$, where $\ \theta$ is a vector of parameters to estimate with available data.
We can identify 4 steps in fitting distributions:
 Model/function choice: hypothesize families of distributions;
 Estimate parameters;
 Evaluate quality of fit;
 Goodness of fit statistical tests.”
In SAS this can be done with proc capability, whereas in R we can do the same thing with fitdistrplus and a few other packages. In this post I will try to compare the procedures in R and SAS.
The following code chunk creates 10,000 observations from a normal distribution with a mean of 10 and a standard deviation of 5, then prints a summary of the data and plots a histogram of it.
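A minimal sketch of that chunk. The seed is my own assumption (the original values are not recoverable), so the exact numbers will differ from any SAS output computed on another sample.

```r
# Draw 10,000 observations from N(mean = 10, sd = 5).
set.seed(123)  # assumed seed, only here for reproducibility
x <- rnorm(10000, mean = 10, sd = 5)

# Minimum, quartiles, median, mean and maximum.
print(summary(x))

# When run as a script, use a null device so no Rplots.pdf is written.
if (!interactive()) pdf(NULL)

# Histogram of the simulated data.
hist(x, main = "Histogram of x", xlab = "x")
```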

If we import the data we created in R into SAS and run the following code:
PROC CAPABILITY;
HISTOGRAM x / NORMAL;
RUN;
SAS gives us the following results:
 Moments
 Basic Statistical Measures (Location and Variability)
 Tests for Location
 Observed Quantiles
 Extreme Observations
 Histogram
 Parameter Estimates
 Goodness-of-Fit Test Results
 Estimated Quantiles
We can obtain the same results in R by using the e1071, raster, plotrix, stats, fitdistrplus and nortest packages.
1. Moments
N :
Sum Weights : A numeric variable can be specified as a weight variable to weight the values of the analysis variable. The default weight variable is defined to be 1 for each observation. This field is the sum of observation values for the weight variable. In our case, since we didn’t specify a weight variable, SAS uses the default weight variable. Therefore, the sum of weights is the same as the number of observations. (Source)
Mean :
Sum Observations :
Std Deviation :
Skewness :
Kurtosis :
Uncorrected SS : Sum of squared data values. (Source)
Corrected SS : The sum of squared distances of the data values from the mean. (Source)
Coeff Variation : The ratio of the standard deviation to the mean. (Source)
Std Error Mean : The estimated standard deviation of the sample mean. (Source)
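The quantities in the Moments table can be sketched in base R as follows. This is an illustration under a few assumptions, not the original code: the seed is mine, skewness and kurtosis use the bias-adjusted "type 2" formulas (which is what `e1071::skewness(x, type = 2)` and `e1071::kurtosis(x, type = 2)` compute and what I understand SAS to report), and the coefficient of variation is expressed as a percentage.

```r
# Regenerate the sample so this block runs on its own (seed is an assumption).
set.seed(123)
x <- rnorm(10000, mean = 10, sd = 5)

n <- length(x)   # N
m <- mean(x)     # Mean
s <- sd(x)       # Std Deviation (n - 1 denominator)

# Central moments with an n denominator, used for skewness and kurtosis.
m2 <- mean((x - m)^2)
m3 <- mean((x - m)^3)
m4 <- mean((x - m)^4)

# Bias-adjusted ("type 2") skewness and excess kurtosis.
skew <- (m3 / m2^1.5) * sqrt(n * (n - 1)) / (n - 2)
kurt <- ((n + 1) * (m4 / m2^2 - 3) + 6) * (n - 1) / ((n - 2) * (n - 3))

moments <- c(
  "N"                = n,
  "Sum Weights"      = n,              # default weight of 1 per observation
  "Mean"             = m,
  "Sum Observations" = sum(x),
  "Std Deviation"    = s,
  "Skewness"         = skew,
  "Kurtosis"         = kurt,
  "Uncorrected SS"   = sum(x^2),       # sum of squared data values
  "Corrected SS"     = sum((x - m)^2), # squared distances from the mean
  "Coeff Variation"  = 100 * s / m,    # sd / mean, as a percentage
  "Std Error Mean"   = s / sqrt(n)
)
print(moments)

# Optional cross-check against e1071, if it is installed.
if (requireNamespace("e1071", quietly = TRUE)) {
  print(c(skewness = e1071::skewness(x, type = 2),
          kurtosis = e1071::kurtosis(x, type = 2)))
}
```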

2. Basic Statistical Measures (Location and Variability)
Range :
Interquartile Range :
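Both variability measures are available in base R. In the sketch below the seed is an assumption, and `type = 2` is my choice of quantile definition because, as far as I know, it corresponds to the default percentile definition in SAS. R's own default (`type = 7`) can give slightly different quartiles.

```r
# Regenerate the sample so this block runs on its own (seed is an assumption).
set.seed(123)
x <- rnorm(10000, mean = 10, sd = 5)

# Range: largest value minus smallest value.
rng <- diff(range(x))

# Interquartile range: Q3 - Q1, using quantile definition 2.
iqr <- IQR(x, type = 2)

print(c(Range = rng, "Interquartile Range" = iqr))
```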


3. Tests for Location
Student’s t : Skipped this part
Sign : Skipped this part
Signed Rank :
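A sketch of the signed rank test using `wilcox.test()` from the stats package, testing a location of 0. The seed is an assumption. Note that R reports V, the sum of the ranks of the positive values, while SAS (to my understanding) reports the centered statistic S = V - n(n + 1)/4, so the conversion in the last step is also an assumption.

```r
# Regenerate the sample so this block runs on its own (seed is an assumption).
set.seed(123)
x <- rnorm(10000, mean = 10, sd = 5)

# Wilcoxon signed rank test of H0: location = 0.
sr <- wilcox.test(x, mu = 0)
print(sr)

# R's statistic V is the sum of the ranks of the positive observations.
V <- unname(sr$statistic)

# Centered statistic, assumed to match the value SAS prints as S.
n <- length(x)
S <- V - n * (n + 1) / 4
print(c(V = V, S = S))
```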


4. Observed Quantiles
Quantiles :
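The observed quantiles can be sketched with `quantile()`. The list of probabilities mirrors the levels SAS usually prints, and `type = 2` is again my assumption about which R definition matches the SAS default; the seed is an assumption too.

```r
# Regenerate the sample so this block runs on its own (seed is an assumption).
set.seed(123)
x <- rnorm(10000, mean = 10, sd = 5)

# Quantile levels from minimum (0%) to maximum (100%).
probs <- c(0, 0.01, 0.05, 0.10, 0.25, 0.50, 0.75, 0.90, 0.95, 0.99, 1)

# type = 2: empirical distribution function with averaging.
q <- quantile(x, probs = probs, type = 2)

# Print from 100% down to 0%, the order used in the SAS table.
print(rev(q))
```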


5. Extreme Observations : Skipped this part
6. Histogram
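A sketch of a density-scaled histogram with the fitted normal curve drawn on top, similar in spirit to what HISTOGRAM x / NORMAL produces. The seed, the number of breaks and the plot labels are my own choices.

```r
# Regenerate the sample so this block runs on its own (seed is an assumption).
set.seed(123)
x <- rnorm(10000, mean = 10, sd = 5)

# Parameters of the normal curve to overlay.
mu    <- mean(x)
sigma <- sd(x)

# When run as a script, use a null device so no Rplots.pdf is written.
if (!interactive()) pdf(NULL)

# freq = FALSE puts the histogram on the density scale (total area 1).
h <- hist(x, breaks = 30, freq = FALSE,
          main = "Histogram of x with fitted normal density", xlab = "x")

# Inside curve(), `x` is the plotting grid, not the data vector.
curve(dnorm(x, mean = mu, sd = sigma), add = TRUE, lwd = 2)
```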

7. Parameter Estimates
Mean (Mu) :
Std Dev (Sigma) :
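For the normal distribution both parameters have closed forms, so a sketch needs only base R. I am assuming here that SAS reports the sample standard deviation (n - 1 denominator) for Sigma, whereas maximum likelihood, which is the default method of `fitdist()` in fitdistrplus, divides by n. With 10,000 observations the two are almost identical. The seed is an assumption, and the fitdistrplus call only runs if the package is installed.

```r
# Regenerate the sample so this block runs on its own (seed is an assumption).
set.seed(123)
x <- rnorm(10000, mean = 10, sd = 5)
n <- length(x)

mu_hat    <- mean(x)                     # Mean (Mu)
sigma_n1  <- sd(x)                       # Std Dev with n - 1 denominator
sigma_mle <- sqrt(mean((x - mu_hat)^2))  # maximum likelihood, n denominator

print(c(Mu = mu_hat, Sigma_n_minus_1 = sigma_n1, Sigma_MLE = sigma_mle))

# Optional: the same fit through fitdistrplus.
if (requireNamespace("fitdistrplus", quietly = TRUE)) {
  normal.fit <- fitdistrplus::fitdist(x, "norm")
  print(normal.fit$estimate)  # named vector with elements mean and sd
}
```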


8. Goodness-of-Fit Test Results
Kolmogorov-Smirnov, Cramer-von Mises, and Anderson-Darling :
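One way to get all three statistics in a single call is `gofstat()` from fitdistrplus, applied to a fitted object. This is a sketch under assumptions: the seed is mine, and the block does nothing beyond generating the data when fitdistrplus is not installed.

```r
# Regenerate the sample so this block runs on its own (seed is an assumption).
set.seed(123)
x <- rnorm(10000, mean = 10, sd = 5)

gof <- NULL
if (requireNamespace("fitdistrplus", quietly = TRUE)) {
  # Fit a normal distribution by maximum likelihood.
  normal.fit <- fitdistrplus::fitdist(x, "norm")

  # Goodness-of-fit statistics for the fitted distribution.
  gof <- fitdistrplus::gofstat(normal.fit)
  print(gof)

  # The individual statistics are stored as list elements.
  print(c(KS = unname(gof$ks), CvM = unname(gof$cvm), AD = unname(gof$ad)))
}
```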


or
Kolmogorov-Smirnov :
Cramer-von Mises :
Anderson-Darling :
Chi-Square :
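The individual tests can be sketched with `ks.test()` from stats and the normality tests in nortest. Two caveats: the seed is an assumption, and the p-value from `ks.test()` is not adjusted for the fact that the mean and standard deviation were estimated from the same data, which is why the Lilliefors variant from nortest is shown as well. The nortest calls only run if that package is installed, and the p-values need not match the SAS output exactly.

```r
# Regenerate the sample so this block runs on its own (seed is an assumption).
set.seed(123)
x <- rnorm(10000, mean = 10, sd = 5)

# Kolmogorov-Smirnov test against a normal with the estimated parameters.
ks <- ks.test(x, "pnorm", mean = mean(x), sd = sd(x))
print(ks)

if (requireNamespace("nortest", quietly = TRUE)) {
  print(nortest::lillie.test(x))   # Kolmogorov-Smirnov, Lilliefors correction
  print(nortest::cvm.test(x))      # Cramer-von Mises
  print(nortest::ad.test(x))       # Anderson-Darling
  print(nortest::pearson.test(x))  # Pearson chi-square
}
```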

9. Estimated Quantiles : Skipped this part
We can change the commands to fit other distributions. In SAS this is as simple as changing normal to something like beta(theta = SOME NUMBER, scale = SOME NUMBER) or weibull. In R, one can change the name of the distribution in the normal.fit <- fitdist(x, "norm") command to the desired distribution name. While fitting densities you should take the properties of the specific distribution into account. For example, the Beta distribution is defined between 0 and 1, so you may need to rescale your data in order to fit it.
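As an illustration of the rescaling point, the sketch below maps the data into the open interval (0, 1) before fitting a Beta distribution. The seed, the 1% padding on each side and the names `lower`, `upper`, `y` and `beta.fit` are all my own hypothetical choices. The padding keeps every value strictly inside the interval, since the Beta likelihood is not finite at exactly 0 or 1. The fit itself only runs if fitdistrplus is installed.

```r
# Regenerate the sample so this block runs on its own (seed is an assumption).
set.seed(123)
x <- rnorm(10000, mean = 10, sd = 5)

# Pad the observed range by 1% on each side (hypothetical choice).
pad   <- 0.01 * diff(range(x))
lower <- min(x) - pad
upper <- max(x) + pad

# Linear rescaling into the open interval (0, 1).
y <- (x - lower) / (upper - lower)

if (requireNamespace("fitdistrplus", quietly = TRUE)) {
  # Same call as for the normal fit, with a different distribution name.
  beta.fit <- fitdistrplus::fitdist(y, "beta")
  print(beta.fit$estimate)  # shape1 and shape2
}
```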