MIME-Version: 1.0 Content-Type: multipart/related; boundary="----=_NextPart_01C99806.1AB1DEA0" This document is a Single File Web Page, also known as a Web Archive file. If you are seeing this message, your browser or editor doesn't support Web Archive files. Please download a browser that supports Web Archive, such as Windows® Internet Explorer®. ------=_NextPart_01C99806.1AB1DEA0 Content-Location: file:///C:/EF5444F3/Ex5Trends.htm Content-Transfer-Encoding: quoted-printable Content-Type: text/html; charset="windows-1252"
CE 397 Statistics=
in
Water Resources
Exercise 5
Analyzing Trends<= o:p>
by:
Cody Hudson, Ashlynn Stillwell,
Patrick Frasier and David Maidment
University of Texas at Austin
February 2009
Contents
In this exercise we will =
explore
how to deal with trends. How can w=
e tell
if Y is dependent on X? How do we =
tell
if a data point is an outlier? How=
can
we tell if Y is dependent on X1, X2, X3 or more variables?
Goals of this Exe=
rcise
To answer these questions, we will evaluate two differ= ent data sets. In the first we will ev= aluate data to see if there is a trend between two variables using simple linear regression methods. We will use th= e same data set to determine if a data point is an outlier. Finally, we will evaluate= the case where there may be multiple explanatory variables.
Computer Requirem=
ents
This exercise is to be performed in Microsoft Excel (2= 007 version used here) using the Data Analysis Toolpack from Exercise #2. The data for this exercise is at: = http://www.ce.utexas.edu/prof/maidment/StatWR2009/Ex5/Ex5Data.xls
The data used in this exercise were derived from the following web sites:
Census (links to each decade): http://www.census= .gov/popest/archives/index.html
TWDB Historical Water Use: http://www.twdb.state.tx.us/wushistorical/DesktopDefault.aspx?PageID= =3D1
Fresh Water Withdrawal by Country: http://www.worldwater.org/data= .html
Population by Country: http:= //en.wikipedia.org/wiki/List_of_countries_by_population
GDP by Country: https://www.cia.gov/library/publications/the-world-factbook/fiel= ds/2195.html
Average annual rainfall in largest city: http://www.worldclimate.com/
Simple Trend Anal=
ysis
We all love Texas. We all want Texas to continue to have e= nough water to serve its residents. This= part of the exercise evaluates water use in Texas over the past 35 years. We will look at water use in Texas and = plot it versus census population data starting in 1974. Water use data was taken from the Texas= Water Development Board (TWDB). Populati= on census data was taken from the census.gov webpage.
Lets make a simple study of the trends in these= data through time.
Analysis 1: Population Growth vs Time.<= /b> First, create a graph for population growth versus time. Add a trendline to this graph by right-clicking on the data points, selecting ‘Add Trendline’, and selecting linear. Check the boxes to add the equation and= R2 value to the chart as shown below.
With this graph we see a = strong trend of increasing population growth with time.
Analysis 2: Water Use vs Time. Next,
graph water use versus time with a trendline. What did you expect to see?
Analysis 3: Water Use vs Population. Finally, graph total water use versus population growth with a = trendline. Does there appear to be a trend in the = data? The total water use includes all source= s, including manufacturing, steam electric, irrigation, mining, and livestock.= These water uses are most likely not af= fected by increased population.
Analysis 4: Municipal Water Use vs Time= . Now let’s consider municipal water use only.
Graph municipal water use= versus time with a trendline. What did you expect to see?
Analysis 5: Municipal Water Use vs Popu= lation. Finally, graph municipal water use versus population with a trendline. Does there appear to be a trend in the = data?
To be turned i=
n: A
graph of the data with the trend lines and equations for Analyses 1-5, plus=
a
table that summarizes the five trend equations and their R2
values. Discuss the results. Which relationships do you think are
statistically significant? What is=
the
population growth rate per year in Texas?
Interpreting Regr=
ession
Statistics
Now, anybody can fit trend lines but the differences a=
mong
the equations you have just fitted need to be examined more closely and the
significance of the individual coefficients studied. We can do this with the Regression
Statistics. Let’s take a closer look at the trendline<=
/span>
for municipal water use versus
population using the Excel Regression function from the Data Analysis <=
span
class=3DSpellE>Toolpack. Under Data, click on Data Analysis, Regression.
The Input Y Range will be the municipal water use and = Input X will be population. Check labels= as appropriate. We’d also like to hav= e the 95% confidence level and residuals for later, as shown below.
Excel gives us lots of fun output regarding the relati= onship between municipal water use and population. I’ll explain more about how to interpret this information in class.<= span style=3D'mso-spacerun:yes'> Being able to do this properly is what separates naïve statistics from properly informed statistical analysis.
The right way to describe this equation is:
Water Use =3D=
-7407.22 + 0.18348 * Population R2 =3D 0.903 Se =3D 159033 F =3D 195.83
(-0.03053) (13.994)
The numbers in parentheses under the coefficients are the t-va=
lues
that tell you how significant the coefficients are. The R2 value tells you what=
% of
the variability in Water Use is explained by the equation. The Standard Error (Se) val=
ue
defines the vertical uncertainty around the line for estimates made by this
equation. The F-ratio is a more f=
ormal
measure of the statistical significance of the equation as a whole than R2.
We also get p-values for = the parameter estimates for the intercept (0.976) and population slope coeffici= ent (4.07E-12). Since the p-value for = the intercept is not less than a significance level (a) of 0.05, we fail to reject the null hypothesis that the intercept is zero, = thus the intercept estimate is not statistically different from zero. Another way we can determine whether a parameter estimate is statistically significant is to look at the upper and lower 95% confidence intervals. Fo= r the intercept, zero is contained within the interval, thus the estimate is not statistically significant. On the = other hand, the slope coefficient estimate for population is highly significant.<= /p>
Repeat the Regression ana= lysis but this time check the box that sets “Constant is Zero”. This eliminates the intercept of the equation.
TO BE TURNED I=
N: The table of interpretive statistics fr=
om the
simple regression of municipal water use vs
population for the regressions with and without an intercept. Write both equations in the format
illustrated above. Which equation =
has
the greater statistical significance? What is the average municipal water
demand in gallons per capita day in Texas?
Is it changing through time? Note: 1 acre-ft =3D 325,851 US gallons.=
We’re going to use the same data set to evaluate if can
consider a data point to be an outlier.
Use the information from the regression function in the Data Analysi=
s Toolpack. Th=
e summary
will contain the value of the standard error, s.
According to Helsel and Hi=
rsch
p.246, “leverage is a measure of an ‘outlier’ in the x direction.” A high
leverage point is one where
For deviation in the x-direction, the statistics hi is computed as:
Where SSx =
sub>is
the sum of the squares x.
For deviations in the y direction we use the standardized residual esi. It is the actual residual
Where=
the s in
this equation is the standard error of estimate of the regression equation.=
An extreme outlier is one for which |esi|>3.
Outlier?
Using the information in = the Excel regression summary, determine if the 1974 data point is an outlier in= the x and y.
To be turned in: Without m=
aking
any calculations, does the point for municipal water use in 1974 appear to =
be
an outlier? Why or why not? Using the procedure outlined above from=
Helsel and Hirsch, does the point for municipal water=
use
in 1974 have high leverage? Is this
point an outlier?
Mulitple Linear
Regression is used when it is thought that the (y) variable is dependent on=
multiple
(x) variables. For example, stormwater runoff is not just dependent on rainfall b=
ut on
soil types and land slopes as well.
Expanding the model for the (y) variable will help us determine which
factors are statistically important and which do not affect the (y)
variable.
Data=
was
compiled concerning multiple countries.
We are interested in which factors would affect water use such as Gr=
oss
Domestic Product (GDP), population, and annual rainfall. Annual rainfall was approximated as the
annual rainfall in the most populated city of the country. In some cases (small countries) an
approximate national average was obtained.
GDP and population data were taken from the most recent data availab=
le.
Annual Fresh Water Withdrawals |
Population |
GDP |
Average Annual Rainfall of Largest City
(mm) |
|
US |
477 |
305,862,0=
00 |
14330 |
960 |
Canada |
44.72 |
33,563,00=
0 |
1564 |
817.5 |
Mexico |
78.22 |
106,682,5=
00 |
1143 |
749 |
Australia |
24.06 |
21,597,12=
1 |
1069 |
1200 |
China |
549.76 |
1,335,962=
,132 |
4222 |
970 |
Brazil |
59.3 |
190,769,0=
00 |
1665 |
1350 |
India |
645.84 |
1,144,810=
,000 |
1237 |
2200 |
Sudan |
37.32 |
38,560,00=
0 |
62.19 |
250 |
Cameroon |
0.99 |
18,549,00=
0 |
25 |
4060 |
Ukraine |
37.53 |
46,191,02=
2 |
198 |
615 |
UK |
11.75 |
61,612,30=
0 |
2787 |
584 |
France |
33.16 |
65,073,48=
2 |
2978 |
642 |
Japan |
88.43 |
127,704,0=
00 |
4844 |
1565 |
Vietnam |
71.39 |
87,375,00=
0 |
90.88 |
1976 |
Saudi
Arabia |
17.32 |
24,735,00=
0 |
528.3 |
81 |
Honduras
|
0.86 |
7,106,000=
|
13.78 |
907 |
Source: =
span>The
World's Water, Pacific Institute; Wikipedia; CIA
Here=
we have
three independent variables and one dependent variable, thus we cannot graph
our data. To determine which facto=
rs
were important we use multiple lin=
ear
regression. Use the Data Analy=
sis Toolpack to do a regression analysis on the data. Highlight all the independent variables=
as
the input X range and the fresh water withdrawals and the dependent Y
variable. We will not use residual=
s in
this step.
Note=
the
p-values for each coefficient estimate in the output of the regression anal=
ysis. Which do you think is the most signific=
ant
factor? What other issues could be
affecting this model that were not accounted for=
?
To be turned in: Present t=
he
output table of regression statistics and write the estimated model for a
country’s fresh water withdrawals as a function of population, GDP, and ave=
rage
annual rainfall in the standard form with t-statistics, R2, S
2. =
Using the procedure outlined above fro=
m Helsel and Hirsch, does the point for municipal water=
use
in 1974 have high leverage? Is this
point an outlier?
3. =
The table of interpretive statistics f=
rom
the simple regression of municipal water use vs
population for the regressions with and without an intercept. Write both equations in the format
illustrated above. Which equation =
has
the greater statistical significance? What is the average municipal water
demand in gallons per capita day in Texas?
Is it changing through time? Note: 1 acre-ft =3D 325,851 US gallons.=