CE 397: Environmental Risk Assessment

Department of Civil Engineering, The University of Texas at Austin

Solution to Assignment #1: Statistical Analysis of Environmental Data

Statistics of the Data

Frequency Histogram

Exceedance Probabilities

Difference in Means Between two Data Sets

1. Descriptive Statistics

(a) Statistics of the original 90 benzene values (in µg benzene / kg soil):

Column1

Mean                 258704
Standard Error       166934
Median               500
Mode                 19
Standard Deviation   1583676
Sample Variance      2508030041719
Kurtosis             78
Skewness             8.62
Range                14599999
Minimum              1
Maximum              14600000
Sum                  23283324
Count                90

The data range from 1 µg/kg to 14,600,000 µg/kg (14.6 g benzene / kg soil). The mean is 258,700 µg/kg and the median is 500 µg/kg. Because the mean is much greater than the median, the data contain a small number of very large values that dominate the mean and make it a poor indicator of the midrange of the data. The coefficient of skewness is 8.62, which shows that the data are highly positively skewed.

(b) Statistics of the logarithms to base 10 of the 90 benzene data:

Log C

Mean                       2.837
Standard Error             0.167
Median                     2.699
Mode                       1.279
Standard Deviation         1.588
Sample Variance            2.521
Kurtosis                   -0.121
Skewness                   0.534
Range                      7.289
Minimum                    -0.125
Maximum                    7.164
Sum                        255.363
Count                      90
Confidence Level (95.0%)   0.333

The coefficient of skewness of these data is 0.534, which is much closer to zero. The median is 2.699 and the mean is 2.837. In real space these values correspond to 10^2.699 = 500 µg/kg and 10^2.837 = 687 µg/kg, respectively, which are reasonable values for the midrange of the data. The fact that the mean is greater than the median is consistent with the data being positively skewed.

The logarithms are a better representation of the data than the original values because the range has been greatly reduced (now –0.125 to 7.164) and the values are spread far more evenly over that range.
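The comparison between raw and log-transformed statistics can be sketched in a few lines of Python. The full set of 90 values is not reproduced in this solution, so the sketch uses only the 30 concentration values listed in the table of Section 4. The skewness formula is the bias-adjusted sample skewness used by common spreadsheet tools; that this is the formula behind the tabulated values is an assumption.

```python
import math
import statistics

# The 30 concentration values listed in the Section 4 table (a subset of the 90).
VALUES = [2180, 90, 477, 4400, 18, 70, 310, 490, 300, 350,
          6000, 1500, 82, 210, 460, 1800, 120, 86, 510, 130,
          1600, 360000, 750, 30, 270, 8, 109, 44, 86, 1100]


def sample_skewness(data):
    """Bias-adjusted sample skewness: n/((n-1)(n-2)) * sum(((x - mean)/s)^3)."""
    n = len(data)
    mean = statistics.fmean(data)
    s = statistics.stdev(data)
    return n / ((n - 1) * (n - 2)) * sum(((x - mean) / s) ** 3 for x in data)


def describe(data):
    """Return the summary statistics discussed in this section as a dict."""
    s = statistics.stdev(data)
    return {
        "mean": statistics.fmean(data),
        "median": statistics.median(data),
        "stdev": s,
        "std_error": s / math.sqrt(len(data)),
        "skewness": sample_skewness(data),
    }


raw_stats = describe(VALUES)
log_stats = describe([math.log10(v) for v in VALUES])

for name, stats in (("raw", raw_stats), ("log10", log_stats)):
    print(name, {k: round(v, 3) for k, v in stats.items()})
```

Even for this subset the raw mean lies far above the raw median, while the mean and median of the logarithms are close together.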

2. Frequency Histograms

The histogram of the logarithms of the data:

Bin (upper limit, log10 C)   Frequency   Cumulative %
-1                           0           0.00%
0                            2           2.22%
1                            9           12.22%
2                            21          35.56%
3                            20          57.78%
4                            20          80.00%
5                            7           87.78%
6                            7           95.56%
7                            3           98.89%
8                            1           100.00%
9                            0           100.00%
More                         0           100.00%
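As a rough sketch of how such a table can be built, the Python below bins values by upper limit and converts frequencies to cumulative percentages. Treating each bin label as the upper limit of its class (lower < x <= upper) is an assumption about how the histogram was generated, although it is consistent with the maximum of 7.164 falling in bin 8. The short list of log values passed to `histogram` is illustrative only.

```python
from bisect import bisect_left
from itertools import accumulate


def histogram(values, upper_limits):
    """Count values per bin, where a bin holds lower < x <= upper limit.

    upper_limits must be sorted; the last count is the overflow ("More") bin.
    """
    counts = [0] * (len(upper_limits) + 1)
    for x in values:
        counts[bisect_left(upper_limits, x)] += 1
    return counts


def cumulative_percent(counts):
    """Convert bin frequencies to cumulative percentages of the total."""
    total = sum(counts)
    return [100.0 * c / total for c in accumulate(counts)]


# Frequencies from the histogram table for bins -1 through 9.
FREQ = [0, 2, 9, 21, 20, 20, 7, 7, 3, 1, 0]
cum = cumulative_percent(FREQ)
print([round(p, 2) for p in cum])

# Illustrative log10 values binned with upper limits -1, 0, ..., 9.
demo_counts = histogram([-0.125, 0.0, 2.699, 3.0, 7.164], list(range(-1, 10)))
print(demo_counts)
```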

3. Exceedance Probabilities

The exceedance probability is estimated as P(C > c*) = m/(n+1), where m is the number of data values exceeding c*, and n is the total number of data values (n = 90).

Threshold Concentration, c*   Number of samples exceeding c*   Exceedance Probability
1000                          38                               38/91 = 0.418
10000                         18                               18/91 = 0.198
100000                        11                               11/91 = 0.121

Note that there are two values of exactly 1000 µg/kg. These are not counted in the calculation because they equal but do not exceed the threshold concentration. A similar conclusion holds for the single value of 100 µg/kg. The exceedance probabilities of the last 30 values and the last 60 values in the data set are shown below. The last 30 values have a lower chance of exceeding each threshold level than does the data set as a whole, while the last 60 values have a lower chance only at the 100,000 threshold.

Threshold Concentration, c*   Exceedance Prob with 30 data   Exceedance Prob with 60 data
1000                          11/31 = 0.355                  30/61 = 0.492
10000                         6/31 = 0.194                   17/61 = 0.279
100000                        2/31 = 0.065                   4/61 = 0.066
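The estimator P(C > c*) = m/(n+1) and its handling of values equal to the threshold can be sketched as follows. The function names and the small `demo` list are hypothetical. The counts in the final loop are those reported for the full data set.

```python
def exceedance_probability(values, threshold):
    """Estimate P(C > c*) as m / (n + 1), counting only values strictly above c*."""
    m = sum(1 for v in values if v > threshold)
    return m / (len(values) + 1)


def exceedance_from_count(m, n):
    """Same estimate when only the exceedance count m and sample size n are known."""
    return m / (n + 1)


# Hypothetical mini data set: the two values equal to 1000 are not counted.
demo = [250, 1000, 1000, 4400, 360000]
print(exceedance_probability(demo, 1000))  # 2 exceedances over n + 1 = 6

# Counts reported for the full data set (n = 90).
for threshold, m in [(1000, 38), (10000, 18), (100000, 11)]:
    print(threshold, round(exceedance_from_count(m, 90), 3))
```

Dividing by n + 1 rather than n keeps the estimate below 1 even when every observed value exceeds the threshold.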

4. Recursive Estimation of the Mean

The logs of the data have been analyzed to estimate the standard error of the mean, and thus the approximate 95% confidence limits on the mean (the mean plus or minus two standard errors), with the results shown below:

N    Values   Log(C)   Mean    Stdev   Stdev/n^0.5   m+2Se   m-2Se
1    2180     3.338    3.338
2    90       1.954    2.646   0.979   0.692         4.031   1.262
3    477      2.679    2.657   0.692   0.400         3.457   1.858
4    4400     3.643    2.904   0.750   0.375         3.654   2.153
5    18       1.255    2.574   0.983   0.439         3.453   1.695
6    70       1.845    2.453   0.928   0.379         3.210   1.695
7    310      2.491    2.458   0.847   0.320         3.098   1.818
8    490      2.690    2.487   0.789   0.279         3.045   1.929
9    300      2.477    2.486   0.738   0.246         2.978   1.994
10   350      2.544    2.492   0.696   0.220         2.932   2.052
11   6000     3.778    2.609   0.766   0.231         3.070   2.147
12   1500     3.176    2.656   0.748   0.216         3.088   2.224
13   82       1.914    2.599   0.745   0.207         3.012   2.186
14   210      2.322    2.579   0.720   0.192         2.964   2.194
15   460      2.663    2.585   0.694   0.179         2.943   2.226
16   1800     3.255    2.627   0.691   0.173         2.972   2.281
17   120      2.079    2.594   0.682   0.165         2.925   2.264
18   86       1.934    2.558   0.680   0.160         2.878   2.237
19   510      2.708    2.566   0.662   0.152         2.869   2.262
20   130      2.114    2.543   0.652   0.146         2.835   2.252
21   1600     3.204    2.575   0.651   0.142         2.859   2.290
22   360000   5.556    2.710   0.899   0.192         3.093   2.327
23   750      2.875    2.717   0.879   0.183         3.084   2.351
24   30       1.477    2.666   0.896   0.183         3.031   2.300
25   270      2.431    2.656   0.879   0.176         3.008   2.305
26   8        0.903    2.589   0.927   0.182         2.952   2.225
27   109      2.037    2.568   0.915   0.176         2.921   2.216
28   44       1.643    2.535   0.915   0.173         2.881   2.190
29   86       1.934    2.515   0.905   0.168         2.851   2.178
30   1100     3.041    2.532   0.895   0.163         2.859   2.205

Or in graphical form as:

It can be seen that there is little reduction in the error of estimate of the mean after about 15 samples, because a longer sample brings in additional areas of the site that have differing concentration levels. The statistical sample is therefore not homogeneous: adding more data from further afield increases the variance of the data set, which offsets the reduction in the standard error of the mean with the square root of the number of samples.
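The running mean, standard deviation, standard error, and approximate 95% limits in the table can be reproduced with a short recursive update. The sketch below uses Welford's online algorithm, which is one standard way to update these quantities one value at a time; it is not necessarily how the original table was computed. The function name is hypothetical.

```python
import math

# The 30 concentration values from the table, in the order listed.
VALUES = [2180, 90, 477, 4400, 18, 70, 310, 490, 300, 350,
          6000, 1500, 82, 210, 460, 1800, 120, 86, 510, 130,
          1600, 360000, 750, 30, 270, 8, 109, 44, 86, 1100]


def running_confidence_limits(values):
    """Yield (n, mean, stdev, std_error, upper, lower) after each new value.

    Welford's update keeps the running mean and the running sum of squared
    deviations (m2), so each new value costs a constant amount of work.
    """
    mean = 0.0
    m2 = 0.0
    for n, x in enumerate(values, start=1):
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        if n < 2:
            # The sample standard deviation is undefined for a single value.
            yield n, mean, None, None, None, None
            continue
        stdev = math.sqrt(m2 / (n - 1))
        se = stdev / math.sqrt(n)
        yield n, mean, stdev, se, mean + 2 * se, mean - 2 * se


rows = list(running_confidence_limits(math.log10(v) for v in VALUES))
for n, mean, stdev, se, upper, lower in rows:
    if stdev is None:
        print(f"{n:2d} {mean:6.3f}")
    else:
        print(f"{n:2d} {mean:6.3f} {stdev:6.3f} {se:6.3f} {upper:6.3f} {lower:6.3f}")
```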

5. T-test for the Difference in Two Means

For the original data, the means and standard deviations of the first and second 45 data values are:

            First 45   Second 45
Mean        13957      503451
StDev       57593      2224234
Coeff var   4.12       4.42

It can be seen that the standard deviation is much greater than the mean in both cases, so it doesn’t make a lot of sense to use the normal distribution to study these data since there is a high chance of getting a negative concentration and that isn’t physically reasonable. The coefficient of variation is the ratio of the standard deviation to the mean, which is about 4 in both data sets, so their relative variability is about the same, but the mean value is higher in the second set than in the first.
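To put a number on the "high chance of getting a negative concentration," the sketch below evaluates P(C < 0) for a normal distribution fitted to each half of the raw data, using the means and standard deviations tabulated above. The function name is hypothetical, and the normal model is used here only to show why it fits these data poorly.

```python
from statistics import NormalDist


def prob_negative(mean, stdev):
    """P(C < 0) if C were normally distributed with the given mean and stdev."""
    return NormalDist(mu=mean, sigma=stdev).cdf(0.0)


p_first = prob_negative(13957, 57593)       # first 45 values
p_second = prob_negative(503451, 2224234)   # second 45 values
print(f"first 45: {p_first:.2f}, second 45: {p_second:.2f}")
```

With a coefficient of variation near 4, zero lies only about a quarter of a standard deviation below the mean, so the fitted normal distribution would place roughly 40% of its probability on negative concentrations.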

Let's use the logs of the data for the test instead, since these data are more nearly normally distributed.

        First 45   Second 45
Mean    2.513      3.162
StDev   1.105      1.914

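If X = the logs to base 10 of the first 45 data and Y = the logs to base 10 of the second 45 data, then the t statistic is computed as:

$$ t = \frac{\bar{X} - \bar{Y}}{S_p \sqrt{\dfrac{1}{m} + \dfrac{1}{n}}}, \qquad S_p^2 = \frac{(m-1) S_X^2 + (n-1) S_Y^2}{m + n - 2} $$

With m = n = 45, this result is found to be:

$$ t = \frac{2.513 - 3.162}{0.329} = \frac{-0.649}{0.329} = -1.97 $$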

What this calculation says is that the difference between the two means, –0.649, is about twice the standard error of that difference, 0.329, which is computed from the pooled standard deviation of the two samples. This is on the borderline of being considered a statistically significant difference. In general, if the t statistic is > 2 or < -2 then the difference in the means is statistically significant, and if |t| < 2, the hypothesis that there is no difference between the two means cannot be rejected.
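The numbers quoted above can be checked from the summary statistics alone. The sketch below assumes the standard pooled-variance form of the two-sample t statistic, which reproduces the 0.329 denominator. The function name is hypothetical.

```python
import math


def pooled_t(mean_x, sd_x, m, mean_y, sd_y, n):
    """Two-sample t statistic with a pooled variance estimate.

    Returns (t, standard error of the difference in means, degrees of freedom).
    """
    dof = m + n - 2
    pooled_var = ((m - 1) * sd_x ** 2 + (n - 1) * sd_y ** 2) / dof
    se_diff = math.sqrt(pooled_var * (1 / m + 1 / n))
    return (mean_x - mean_y) / se_diff, se_diff, dof


# Means and standard deviations of the logs of the first and second 45 values.
t_stat, se_diff, dof = pooled_t(2.513, 1.105, 45, 3.162, 1.914, 45)
print(f"t = {t_stat:.2f}, standard error = {se_diff:.3f}, dof = {dof}")
```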

To be more precise, we should find the t statistic values for ν = n + m - 2 = 88 degrees of freedom, for which the values are (from Probability Concepts in Engineering Planning and Design, Vol. 1, Table A.2, by A. H-S. Ang and W.H. Tang, Wiley Publishers, 1975):

Percentile Value   0.9    0.95   0.975   0.99
T statistic        1.29   1.65   1.98    2.36

This means that the magnitude of the observed t statistic (about 1.97) is essentially at the two-sided 95% confidence limit of 1.98, since the 95% interval spans percentile values from 0.025 to 0.975.
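The comparison against the tabulated values can be made explicit. In the sketch below, each one-sided percentile p is converted to a two-sided confidence level 2p - 1, and the observed |t| is checked against the corresponding critical value. The dictionary holds the four tabulated values quoted above, and the helper name is hypothetical.

```python
# One-sided percentile -> critical t value, as tabulated for 88 degrees of freedom.
T_TABLE = {0.90: 1.29, 0.95: 1.65, 0.975: 1.98, 0.99: 2.36}


def two_sided_significance(t_stat, table=T_TABLE):
    """Map each two-sided confidence level (2p - 1) to whether |t| exceeds its limit."""
    return {round(2 * p - 1, 3): abs(t_stat) > crit for p, crit in table.items()}


t_observed = -0.649 / 0.329  # difference in means over its standard error
levels = two_sided_significance(t_observed)
for level, significant in levels.items():
    print(f"{level:.0%} two-sided: {'significant' if significant else 'not significant'}")
```

On these tabulated values the difference is significant at the two-sided 90% level and falls just short at the 95% level, which is why the result is described as borderline.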