CE 397: Environmental Risk Assessment

Department of Civil Engineering, The University of Texas at Austin


Solution to Assignment #1: Statistical Analysis of Environmental Data

Statistics of the Data

Frequency Histogram

Exceedance Probabilities

Standard Error of the Mean

Difference in Means Between Two Data Sets

1. Descriptive Statistics

(a) Statistics of the Original 90 Benzene Values (in mg benzene/kg soil):

 

Statistic              Benzene Concentration, C (mg/kg)
Mean                   258,704
Standard Error         166,934
Median                 500
Mode                   19
Standard Deviation     1,583,676
Sample Variance        2,508,030,041,719
Kurtosis               78
Skewness               8.62
Range                  14,599,999
Minimum                1
Maximum                14,600,000
Sum                    23,283,324
Count                  90

 

The data range from 1 mg/kg to 14,600,000 mg/kg (the latter equivalent to 14.6 kg of benzene per kg of soil). The mean is 258,700 mg/kg and the median is 500 mg/kg. Since the mean is much greater than the median, the data contain a small number of very large values that strongly influence the mean and make it a poor indicator of the midrange of the data. The coefficient of skewness is 8.62, which shows that the data are highly positively skewed.
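For reference, a minimal Python sketch of how these summary statistics can be computed. The array name `benzene` is hypothetical and only the first ten of the 90 values are listed for brevity; with the full record the output would match the table above.

import numpy as np
from scipy import stats

# Hypothetical array of the measured benzene concentrations (mg/kg);
# only the first 10 of the 90 values are shown here.
benzene = np.array([2180, 90, 477, 4400, 18, 70, 310, 490, 300, 350], dtype=float)

print("Mean    :", benzene.mean())
print("Median  :", np.median(benzene))
print("Std dev :", benzene.std(ddof=1))                  # sample standard deviation (n - 1 divisor)
print("Skewness:", stats.skew(benzene, bias=False))      # bias-corrected, comparable to Excel's SKEW
print("Kurtosis:", stats.kurtosis(benzene, bias=False))  # excess kurtosis, comparable to Excel's KURT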

 

(b) Statistics of the Logarithms to Base 10 of the 90 Benzene Values:

 

Statistic                    Log10 C
Mean                         2.837
Standard Error               0.167
Median                       2.699
Mode                         1.279
Standard Deviation           1.588
Sample Variance              2.521
Kurtosis                     -0.121
Skewness                     0.534
Range                        7.289
Minimum                      -0.125
Maximum                      7.164
Sum                          255.363
Count                        90
Confidence Level (95.0%)     0.333

 

The coefficient of skewness of these data is 0.534, which is much smaller and indicates a far more symmetric distribution. The median is 2.699 and the mean is 2.837. These values correspond in concentration units to 10^2.699 ≈ 500 mg/kg and 10^2.837 ≈ 687 mg/kg, respectively, which are reasonable values for the midrange of the data. The fact that the mean is greater than the median is consistent with the data being positively skewed.

 

The logarithms are a better representation of the data than the original values because the range is greatly reduced (to -0.125 through 7.164) and the data are spread much more evenly over that range.
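The effect of the log transform can be checked directly. The sketch below assumes `benzene` holds all 90 raw concentrations (mg/kg); it compares the skewness before and after the transform and back-transforms the log-space mean and median to concentration units.

import numpy as np
from scipy import stats

# Assumes `benzene` is an array of all 90 raw concentrations in mg/kg.
log_c = np.log10(benzene)

print("Skewness of raw data  :", stats.skew(benzene, bias=False))  # ~8.6, highly skewed
print("Skewness of log10 data:", stats.skew(log_c, bias=False))    # ~0.53, nearly symmetric

# Back-transform the log-space mean and median to concentration units.
print("10^mean(log C)   =", 10 ** log_c.mean())      # geometric mean of C, ~687 mg/kg
print("10^median(log C) =", 10 ** np.median(log_c))  # ~500 mg/kg, the median of the raw data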

 


2. Frequency Histograms

 

The histogram of the logarithms of the data (each bin is labeled by its upper limit):

 

Bin    Frequency   Cumulative %
-2         0           0.00%
-1         0           0.00%
 0         2           2.22%
 1         9          12.22%
 2        21          35.56%
 3        20          57.78%
 4        20          80.00%
 5         7          87.78%
 6         7          95.56%
 7         3          98.89%
 8         1         100.00%
 9         0         100.00%
More       0         100.00%
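The frequency table above can be reproduced with a short sketch, assuming `log_c` is the array of the 90 log10 values. Note that numpy's bins are closed on the left while Excel's histogram tool closes them on the right, which only matters for values falling exactly on a bin limit.

import numpy as np

# Assumes `log_c` holds the 90 values of log10(concentration).
edges = np.arange(-2, 10)                   # bin limits -2, -1, ..., 9
freq, _ = np.histogram(log_c, bins=edges)   # 11 bins, labeled below by their upper limit
cum_pct = 100.0 * np.cumsum(freq) / len(log_c)

print("Bin  Frequency  Cumulative %")
for upper, f, c in zip(edges[1:], freq, cum_pct):
    print(f"{upper:3d}  {f:9d}  {c:11.2f}%")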

 

 


3. Exceedance Probabilities

 

The exceedance probability is estimated as P(C > c*) = m/(n+1), where m is the number of data values exceeding the threshold concentration c* and n is the total number of data values (n = 90).

 

Threshold Concentration, c* (mg/kg)   Number of Samples Exceeding c*   Exceedance Probability
1,000                                 38                               38/91 = 0.418
10,000                                18                               18/91 = 0.198
100,000                               11                               11/91 = 0.121

 

Note that there are two values of exactly 1000 mg/kg; these are not counted in the calculation because we are seeking the exceedance probability, and values that equal the threshold concentration do not exceed it. A similar comment applies to the single value of 100 mg/kg. The exceedance probabilities computed from the last 30 values and the last 60 values in the data set are shown below. In general, the later values in the data set have a lower chance of exceeding the threshold levels than does the data set as a whole.

 

Threshold Concentration, c* (mg/kg)   Exceedance Prob. (last 30 values)   Exceedance Prob. (last 60 values)
1,000                                 11/31 = 0.355                       30/61 = 0.492
10,000                                 6/31 = 0.194                       17/61 = 0.279
100,000                                2/31 = 0.065                        4/61 = 0.066
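A sketch of the exceedance probability calculation, with `benzene` again assumed to be the array of the 90 raw concentrations. The strict inequality implements the convention discussed above that values tied with the threshold are not counted.

import numpy as np

def exceedance_prob(data, c_star):
    """Estimate P(C > c*) as m/(n+1), counting only values strictly greater than c*."""
    data = np.asarray(data)
    m = int(np.sum(data > c_star))   # strict inequality: ties at c* do not exceed it
    n = len(data)
    return m, n, m / (n + 1)

for c_star in (1_000, 10_000, 100_000):
    m, n, p = exceedance_prob(benzene, c_star)
    print(f"c* = {c_star:>7,} mg/kg:  {m}/{n + 1} = {p:.3f}")

# The same function applied to a portion of the record, e.g. the last 30 values:
# exceedance_prob(benzene[-30:], 1_000)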

 

4. Recursive Estimation of the Mean

 

The logarithms of the data have been analyzed to estimate the standard error of the mean, and thus the approximate 95% confidence limits on the mean, as each successive sample is added. The results are:

 

 N   Values   Log(C)   Mean    Stdev   Stdev/n^0.5   m+2Se   m-2Se
 1     2180   3.338    3.338
 2       90   1.954    2.646   0.979   0.692         4.031   1.262
 3      477   2.679    2.657   0.692   0.400         3.457   1.858
 4     4400   3.643    2.904   0.750   0.375         3.654   2.153
 5       18   1.255    2.574   0.983   0.439         3.453   1.695
 6       70   1.845    2.453   0.928   0.379         3.210   1.695
 7      310   2.491    2.458   0.847   0.320         3.098   1.818
 8      490   2.690    2.487   0.789   0.279         3.045   1.929
 9      300   2.477    2.486   0.738   0.246         2.978   1.994
10      350   2.544    2.492   0.696   0.220         2.932   2.052
11     6000   3.778    2.609   0.766   0.231         3.070   2.147
12     1500   3.176    2.656   0.748   0.216         3.088   2.224
13       82   1.914    2.599   0.745   0.207         3.012   2.186
14      210   2.322    2.579   0.720   0.192         2.964   2.194
15      460   2.663    2.585   0.694   0.179         2.943   2.226
16     1800   3.255    2.627   0.691   0.173         2.972   2.281
17      120   2.079    2.594   0.682   0.165         2.925   2.264
18       86   1.934    2.558   0.680   0.160         2.878   2.237
19      510   2.708    2.566   0.662   0.152         2.869   2.262
20      130   2.114    2.543   0.652   0.146         2.835   2.252
21     1600   3.204    2.575   0.651   0.142         2.859   2.290
22   360000   5.556    2.710   0.899   0.192         3.093   2.327
23      750   2.875    2.717   0.879   0.183         3.084   2.351
24       30   1.477    2.666   0.896   0.183         3.031   2.300
25      270   2.431    2.656   0.879   0.176         3.008   2.305
26        8   0.903    2.589   0.927   0.182         2.952   2.225
27      109   2.037    2.568   0.915   0.176         2.921   2.216
28       44   1.643    2.535   0.915   0.173         2.881   2.190
29       86   1.934    2.515   0.905   0.168         2.851   2.178
30     1100   3.041    2.532   0.895   0.163         2.859   2.205

 

The same results can be shown in graphical form as a plot of the mean and its ±2 standard error limits against the number of samples.

It can be seen that there is little reduction in the standard error of the mean after about 15 samples, because a longer sample brings in additional areas of the site that have differing concentration levels. The statistical sample is therefore not homogeneous: adding more data from farther afield increases the variance of the data set, which offsets the reduction in the standard error of the mean with the square root of the number of samples.
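A sketch of how the running mean, standard deviation, and standard error in the table can be computed, assuming `log_c` holds the log10 values in the order in which the samples were taken.

import numpy as np

# Assumes `log_c` holds log10(concentration) in sampling order.
for i in range(2, 31):                  # n = 2 ... 30, matching the table above
    sample = log_c[:i]
    mean = sample.mean()
    sd = sample.std(ddof=1)             # sample standard deviation
    se = sd / np.sqrt(i)                # standard error of the mean
    print(f"n={i:2d}  mean={mean:.3f}  stdev={sd:.3f}  stdev/n^0.5={se:.3f}  "
          f"m+2Se={mean + 2 * se:.3f}  m-2Se={mean - 2 * se:.3f}")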

 

5. T-test for the Difference in Two Means

For the original data, the means and standard deviations of the first and second 45 data values are:

                       First 45     Second 45
Mean                     13,957       503,451
StDev                    57,593     2,224,234
Coeff. of Variation        4.12          4.42

 

It can be seen that the standard deviation is much greater than the mean in both cases, so it makes little sense to describe these data with a normal distribution: a normal distribution with these parameters would give a high probability of negative concentrations, which is not physically reasonable. The coefficient of variation is the ratio of the standard deviation to the mean; it is about 4 in both data sets, so their relative variability is about the same, but the mean value is much higher in the second set than in the first.

Let's use the logarithms of the data for the test instead, since they are more nearly normally distributed.

          First 45 (X)   Second 45 (Y)
Mean         2.513          3.162
StDev        1.105          1.914

If X = the logs to base 10 of the first 45 data values and Y = the logs to base 10 of the second 45, then the t statistic is computed with the pooled sample variance as

t = [mean(X) - mean(Y)] / sqrt[ s_p^2 (1/m + 1/n) ],   where   s_p^2 = [ (m-1) s_X^2 + (n-1) s_Y^2 ] / (m + n - 2)

With m = n = 45, s_p^2 = (44 × 1.105^2 + 44 × 1.914^2) / 88 = 2.442 and sqrt[ s_p^2 (2/45) ] = 0.329, so

t = (2.513 - 3.162) / 0.329 = -0.649 / 0.329 = -1.97

What this calculation says is that the difference between the two means, -0.649, is about twice the standard error of the difference computed from the pooled sample, 0.329. This is on the borderline of being a statistically significant difference. As a rule of thumb, if the t statistic is greater than 2 or less than -2, the difference in the means is statistically significant, and if |t| < 2 the hypothesis that there is no difference between the two means cannot be rejected.
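A sketch of the pooled two-sample t test, assuming `x` and `y` are arrays holding the log10 values of the first and second 45 concentrations. SciPy's equal-variance test and the critical value of the t distribution are included as cross-checks.

import numpy as np
from scipy import stats

# Assumes x and y are arrays of log10(concentration) for the first and second 45 samples.
m, n = len(x), len(y)
sp2 = ((m - 1) * x.var(ddof=1) + (n - 1) * y.var(ddof=1)) / (m + n - 2)  # pooled variance
se_diff = np.sqrt(sp2 * (1.0 / m + 1.0 / n))                             # ~0.329 for these data
t = (x.mean() - y.mean()) / se_diff                                      # ~ -1.97

# Cross-check with SciPy's equal-variance (pooled) two-sample t test.
t_scipy, p_value = stats.ttest_ind(x, y, equal_var=True)

# Two-sided 5% critical value with m + n - 2 = 88 degrees of freedom.
t_crit = stats.t.ppf(0.975, df=m + n - 2)                                # ~1.99

print(f"t = {t:.2f}  (scipy: {t_scipy:.2f}, p = {p_value:.3f}, critical value = ±{t_crit:.2f})")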

To be more precise, we should compare the observed value against the t distribution with n + m - 2 = 88 degrees of freedom, for which selected percentiles are (from Probability Concepts in Engineering Planning and Design, Vol. 1, Table A.2, by A. H-S. Ang and W. H. Tang, Wiley, 1975):

Percentile    0.90    0.95    0.975    0.99
t value       1.29    1.65    1.98     2.36

This means that the observed t statistic of about -1.97 lies essentially at the two-sided 95% confidence limit, since that limit corresponds to the 0.025 and 0.975 percentiles (t = ±1.98).

