
###### Date: 2019-06-05 10:27

MULTIPLE CHOICE [25 marks]

Question 1

A survey of deer was performed in a catchment based on surveying in a number of 1km by 1km

parcels of land. Based on exploratory data analysis you have the following information.

$\bar{y} = 14.3$ deer/ha; $s = 3.0$ deer/ha; $n = 10$; $t^{0.025}_{10} = 2.228$; $t^{0.025}_{9} = 2.262$.

The 95% confidence interval around the mean is?

(a) [2.65, 28.65]

(b) [12.15, 16.45]

(c) [8.03, 24.65]

(d) [3.24, 21.32]

(e) none of the above

Question 2

A survey of deer was performed in a catchment based on surveying in a number of 1km by 1km

parcels of land. Based on exploratory data analysis you have the following information.

$\bar{y} = 14.3$ deer/ha; $s = 3.0$ deer/ha; $n = 10$.

When you survey next year, you want a survey with a standard error of the mean equal to

0.75. Based on the statistics above how many parcels of land will you survey next year?

Question 3

When designing a monitoring scheme which of the following statements is correct?

(a) If we have a large covariance between the 2 sampling periods then it is most likely best that

we resample the same locations or sampling units.

(b) If we have a small covariance between the 2 sampling periods then it is most likely best that

we resample the same locations or sampling units.

(c) If we have zero covariance between the 2 sampling periods then it is most likely best that

we resample the same locations or sampling unit.

(d) It does not matter whether we resample the same units.

(e) None of the above.

Question 4

Soil carbon was measured in a field at the start of a season and at the end of the season. The

aim is to estimate the change in mean carbon for the field and see if the change is statistically

significant. You sampled different locations between the 2 sampling events.

The outputs from 2 different analyses are shown below.

Two Sample t-test

data: init and fin

t = 0.28987, df = 10, p-value = 0.7778

alternative hypothesis: true difference in means is not equal to 0

95 percent confidence interval:


-28.97550 37.64217

sample estimates:

mean of x mean of y

42.83333 38.50000

Paired t-test

data: init and fin

t = 6.0613, df = 5, p-value = 0.001764

alternative hypothesis: true difference in means is not equal to 0

95 percent confidence interval:

2.495572 6.171095

sample estimates:

mean of the differences

4.333333

Which of the following statements is true?

(a) The most appropriate analysis is the paired t-test and we can state that there was a significant change in soil carbon over the season.

(b) The most appropriate analysis is the paired t-test and we can state that there wasn’t a

significant change in soil carbon over the season.

(c) The most appropriate analysis is the two sample t-test and we can state that there was a

significant change in soil carbon over the season.

(d) The most appropriate analysis is the two sample t-test and we can state that there wasn’t a

significant change in soil carbon over the season.

(e) None of the above.

Question 5

When performing a survey - stratified random sampling is better than simple random sampling

because it is likely to

(a) give a more representative sample.

(b) give a better estimate of the mean and variance.

(c) give more precise estimates of the mean.

(d) all of a, b and c.

(e) none of these/insufficient information.

Questions 6-7 relate to the analysis of the following dataset.

The protein content of milk (%) of two breeds of cattle was compared, with a random sample of

20 of each breed being selected. The following table of descriptive statistics was obtained.

Breed n Mean SD

Breed 1 20 3.352 0.212

Breed 2 20 3.681 0.233

The data were analysed using a pooled (i.e. equal variance) two-sample t-test, and a t-value

of 4.67 was obtained. The data were subsequently re-analysed using an analysis of variance.


Question 6

In this re-analysis, what is the F-value?

Question 7

In this re-analysis, what are the Breed df and Residual df?

Question 8

You are establishing a plant breeding trial at a site with five soil types which would impact on the

yield of the different varieties of wheat you intend to use. The experimental design you would

use is a

(a) completely randomised design.

(b) paired design.

(c) randomised complete block design.

(d) factorial treatment design.

(e) none of these/insufficient information.

Question 9

A field experiment was conducted to compare yields of 10 varieties of wheat. A randomised

complete block design was used for the experiment with four blocks being used, each block containing

all 10 varieties. The data generated from this design were to be analysed using ANOVA.

What are the residual degrees of freedom?

Question 10

A field experiment was conducted to compare weight gains of sheep under 2 pasture systems. 40

sheep were available for the experiment and 20 sheep were randomly allocated to 2 paddocks,

each representing one pasture system. Each sheep was weighed before and after the experiment

to estimate weight gain. Which statement is true?

(a) This is an example of confounding as we cannot disentangle whether it is the pasture system

causing differences in weight gain or other factors that may vary between each paddock, e.g.

soil.

(b) The experimental unit is the paddock and sampling unit is a sheep.

(c) Both the experimental and sampling unit are a sheep.

(d) a and b are True.

(e) a, b and c are True.

Question 11

An experiment was being planned, and R was used to generate a randomisation for the experimental

design. The following output was obtained:


> library(agricolae)

> (Trt <- LETTERS[1:5])

 "A" "B" "C" "D" "E"

> design.crd(trt = Trt, r = 4)$book

Based on this code and output, this would be appropriate for the following experimental design:

(a) completely randomised design with four treatments and five replicates per treatment.

(b) completely randomised design with five treatments and four replicates per treatment.

(c) randomised complete block design with four treatments and five blocks.

(d) randomised complete block design with five treatments and four blocks.

(e) none of these.

Questions 12-15 relate to the analysis of the following dataset.

An experiment was designed to assess the usefulness of synthetic protein dietary supplements

in cattle, involving a comparison of three different supplements (Supplement A, Supplement B,

Supplement C), with a Control (no supplement). The level of total protein content (g/100 ml)

was measured in the blood of cows. A total of 40 similar cows was used for this experiment,

with ten cows being randomly allocated to each of the four treatments. The following analysis

was undertaken in R. However, not all of the output from the LSD.test function is shown below.

Df Sum Sq Mean Sq F value Pr(>F)

Supplement 3 2.651 0.8838 4.347 0.0103 *

Residuals 36 7.319 0.2033

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

test p.ajusted name.t ntr alpha

Fisher-LSD none Supplement 4 0.05

MSerror Df Mean CV t.value LSD

0.2033186 36 7.24625 6.222648 2.028094 0.4089702


Protein std r LCL UCL Min Max Q25 Q50 Q75

Control 7.017 0.3652716 10 6.727814 7.306186 6.32 7.55 6.8675 7.015 7.2200

Supp A 7.683 0.4541671 10 7.393814 7.972186 7.00 8.57 7.3950 7.645 7.8975

Supp B 7.130 0.4212943 10 6.840814 7.419186 6.34 7.77 6.9550 7.180 7.3275

Supp C 7.155 0.5441456 10 6.865814 7.444186 6.53 8.27 6.6700 7.225 7.4225

Question 12

The proportion of variation in protein explained by Supplement is:

(a) 0.266

(b) 0.362

(c) 0.638

(d) 0.734

(e) none of the above.

Question 13

Assuming that Residuals $\sim N(0, s^2)$, what is the correct value of $s^2$?

(a) 2.651

(b) 7.319

(c) 0.8838

(d) 0.2033

(e) none of the above.

Question 14

From the output, which treatment(s) show significantly (α = 0.05) the highest blood protein

content?

(a) Control + Supp B + Supp C

(b) Supp A

(c) Supp B

(d) Supp C

(e) none of the above.

Question 15

From the output, which treatment(s) show significantly (α = 0.05) the lowest blood protein

content?

(a) Control + Supp B + Supp C

(b) Supp A

(c) Supp B

(d) Supp C

(e) none of the above.

Questions 16-20 relate to the analysis of the following dataset.

Researchers are interested in what explains the relative abundance of C3 and C4 plants at 73

sites in North America. The data contains the following:

Response variable (y)

C3: relative abundance of C3 plants at 73 sites

C4: relative abundance of C4 plants at 73 sites

Predictor variables

1. MAP: Mean annual rainfall (mm) at the site.

2. MAT: Mean annual temperature (degrees C) at the site.

3. JJAMAP: proportion of mean annual rainfall in June, July and August (summer rainfall).


4. DJFMAP: proportion of mean annual rainfall in December, January and February (winter

rainfall).

5. LAT: Latitude in centesimal degrees.

6. LONG: Longitude in centesimal degrees.

The researchers were first interested in C3 plants before also looking at C4 plants. After initial

inspection of the data, the researchers decided to log10 transform both the response variables

(C3 and C4). Part of the output of the model predicting the abundance of C3 plants gives the

coefficients and the statistical detail, based on this, answer the following 2 questions.

Estimate Std. Error t value Pr(>|t|)

(Intercept) -6.986574e-01 3.329192e-01 -2.0985795 0.039684448

MAP 7.784743e-05 5.843685e-05 1.3321634 0.187388322

MAT 1.637150e-03 3.123558e-03 0.5241298 0.601944104

JJAMAP -9.622251e-02 1.276384e-01 -0.7538678 0.453609777

DJFMAP -1.854179e-01 1.924485e-01 -0.9634680 0.338829626

LONG 2.931004e-03 2.695546e-03 1.0873509 0.280836839

LAT 1.243240e-02 2.667459e-03 4.6607671 0.000015784

Question 16

How many degrees of freedom are there to calculate the significance of this model?

(a) 7

(b) 73

(c) 66

(d) you can’t work that out from this data.

Question 17

Is the model significant?

(a) no, because all $\beta_i \neq 0$

(b) yes, because at least one $\beta_i > 0$

(c) yes, because all $\beta_i \neq 0$

(d) you can’t tell this from this information.

The researchers subsequently did the same analysis on the C4 plants and found the following

results:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 6.886485e-01 2.963954e-01 2.32341153 0.02324746

MAP 8.551407e-05 5.202588e-05 1.64368332 0.10499743

MAT 3.492573e-03 2.780880e-03 1.25592384 0.21357141

JJAMAP 2.952668e-01 1.136355e-01 2.59836713 0.01154167

DJFMAP -1.495957e-02 1.713354e-01 -0.08731162 0.93068818

LONG -5.033666e-03 2.399823e-03 -2.09751523 0.03978172

LAT -5.316684e-03 2.374818e-03 -2.23877505 0.02854770

Question 18

Which of the following statements is TRUE?

(a) With each unit increase in JJAMAP rainfall the abundance of C4 plants increases by 1.97

units, with all other variables held constant.

(b) With each unit increase in JJAMAP rainfall the abundance of C4 plants increases by 0.2953

units, with all other variables held constant.

(c) With each unit increase in JJAMAP rainfall the abundance of C4 plants decreases by 1.97

units.

(d) You cannot calculate this from this data.


Question 19

Rather than including all the variables, the researchers decided to look at simpler models that

included fewer variables. The first model they tried was based only on Longitude and Latitude.

Call:

lm(formula = Log10C4 ~ LONG + LAT, data = C4data1)

Residuals:

Min 1Q Median 3Q Max

-0.120517 -0.043905 -0.007991 0.046580 0.161236

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 1.475395 0.127024 11.615 < 2e-16 ***

LONG -0.010037 0.001126 -8.915 3.79e-13 ***

LAT -0.007724 0.001366 -5.653 3.17e-07 ***

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.06119 on 70 degrees of freedom

Multiple R-squared: 0.636, Adjusted R-squared: 0.6256

F-statistic: 61.16 on 2 and 70 DF, p-value: 4.347e-16

Explain from a statistical model point of view why the p-value and the estimate of LONG and

LAT have changed compared to the earlier full model.

(a) Because it is a totally different model and data.

(b) Because LONG and LAT individually have a lot more explaining power with fewer variables

in the model.

(c) Because the true value of the estimates of LONG and LAT is only visible in a simple linear

regression.

(d) Because the estimates of LONG and LAT are partial regression coefficients and both estimate

and p-value changes with the number of variables in the model.

Question 20

The researchers also looked at a different model which also includes JJAMAP, and then ran a

partial F-test to see which model was the best.

Call:

lm(formula = Log10C4 ~ LONG + LAT + JJAMAP, data = C4data1)

Residuals:

Min 1Q Median 3Q Max

-0.12407 -0.03799 -0.01356 0.03388 0.14974

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 1.228906 0.144208 8.522 2.22e-12 ***

LONG -0.008134 0.001230 -6.614 6.60e-09 ***

LAT -0.008287 0.001303 -6.359 1.89e-08 ***

JJAMAP 0.230981 0.074997 3.080 0.00297 **

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


Residual standard error: 0.05779 on 69 degrees of freedom

Multiple R-squared: 0.68, Adjusted R-squared: 0.6661

F-statistic: 48.87 on 3 and 69 DF, p-value: < 2.2e-16

Analysis of Variance Table

Model 1: Log10C4 ~ LONG + LAT + JJAMAP

Model 2: Log10C4 ~ LONG + LAT

Res.Df RSS Df Sum of Sq F Pr(>F)

1 69 0.23042

2 70 0.26209 -1 -0.031676 9.4855 0.002973 **

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Which model did the researchers choose as their final model and why?

- Model 1: Model with Latitude, Longitude and JJAMAP.

- Model 2: Model with only Latitude and Longitude.

(a) Model 2, as this model has fewer variables and based on the principle of parsimony this is

the best model.

(b) Model 1, as this has the higher r-squared.

(c) Model 1, as the F-test says the difference between the models is significant and therefore

adding a variable is warranted.

(d) Model 2 as the F-test says the difference between the models is significant and therefore

adding a variable is not warranted.
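
The partial F statistic in the analysis of variance table above can be reproduced by hand from the two residual sums of squares. The following is a minimal sketch in Python, assuming the usual nested-model formula $F = \dfrac{(RSS_{reduced} - RSS_{full})/(df_{reduced} - df_{full})}{RSS_{full}/df_{full}}$. The function name `partial_f` is illustrative and is not part of the R output.

```python
def partial_f(rss_full, df_full, rss_reduced, df_reduced):
    """Partial F statistic for comparing two nested linear models.

    rss_* are residual sums of squares and df_* are residual degrees
    of freedom; the 'full' model contains the extra term(s).
    """
    extra_ss = rss_reduced - rss_full   # sum of squares explained by the extra term(s)
    extra_df = df_reduced - df_full     # number of extra terms
    return (extra_ss / extra_df) / (rss_full / df_full)


# Values read from the "Analysis of Variance Table" in Question 20.
f_stat = partial_f(rss_full=0.23042, df_full=69,
                   rss_reduced=0.26209, df_reduced=70)
print(round(f_stat, 2))
```

The result is close to, but not identical with, the 9.4855 that R reports, because the RSS values printed in the table are rounded.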

Question 21

The researchers redid the analysis with a smaller dataset (they removed 20% of the data) so

they could validate the best model. They verified that they got similar results with the reduced

calibration data set as with the full data set by looking at the coefficients in the following table:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 1.225171731 0.158891238 7.710757 2.586135e-10

LONG -0.007966020 0.001385858 -5.748077 4.091923e-07

LAT -0.008301983 0.001501368 -5.529612 9.134469e-07

JJAMAP 0.206330437 0.089569087 2.303590 2.505033e-02

They then calculated the correlation, r, between the calibration data and the validation data, to

check the performance of the models. They also calculated Lin’s concordance.

Calibration_r Validation_r

1 0.809256 0.8798249

Calibration_Lins.est Calibration_Lins.lower Calibration_Lins.upper

1 0.7914643 0.6811801 0.8666261

Validation_Lins.est Validation_Lins.lower Validation_Lins.upper

1 0.7494391 0.4662542 0.8933096

Based on this, which one of the following conclusions can be drawn?

(a) The model does not perform well as the validation results are considerably lower than the

calibration results.

(b) The correlation coefficient alone is sufficient to indicate how well the model performs in

validation and calibration.

(c) Lin’s concordance indicates the relationship between validated and calibrated predictions does


not follow the 1:1 line and therefore the model performs poorly.

(d) Lin’s concordance indicates that both validation and calibration results follow the 1:1 line

between predicted and observed quite well and therefore the model performs well.

Question 22

Which of the following are multivariate methods?

(a) Regression.

(b) Classification.

(c) Clustering.

(d) Ordination.

(e) All of the above.

Question 23

You are going to run an nMDS on a data set which contains numbers of ant species collected at

different sites. Some of the ant species are much higher in abundance than other ant species,

and there are many species absent from some sites. Which of the following statements is true?

(a) You should use a Bray-Curtis similarity matrix.

(b) You should use a 4th root transformation.

(c) You should not use a Euclidean distance matrix.

(d) All of a, b and c.

(e) None of a, b, c.

Question 24

Which of the following is required by K-means clustering?

(a) defined distance metric.

(b) number of clusters.

(c) initial guess as to cluster centroids.

(d) All a, b, and c.

(e) None of a, b, and c.

Question 25

In cluster analysis, objects with larger distances between them are more similar to each other

than are those at smaller distances.

(a) True.

(b) False.

End of multiple choice questions


Question A [6 marks]

An experiment was performed to examine the impact of different insecticide treatments on the

number of living insect larvae in a rice crop. The experiment had 9 insecticide treatments arranged

in a randomised complete block design with 4 blocks. The control (no insecticide used)

was coded T9. The aim is to have the smallest amount of larvae after treatment.

The R output below shows the results of an analysis on the data.

Analysis of Variance Table

Response: larvae

Df Sum Sq Mean Sq F value Pr(>F)

rep 3 385.64 128.546 3.7550 0.024209 *

trt 8 1255.50 156.937 4.5843 0.001719 **

Residuals 24 821.61 34.234

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

test p.ajusted name.t ntr alpha

Fisher-LSD none trt 9 0.05

MSerror Df Mean CV t.value LSD

34.2338 24 7.916667 73.90693 2.063899 8.538879

[Figure: normal Q-Q plot of the residuals (x-axis: Theoretical Quantiles; y-axis: Sample Quantiles)]

The R output below shows the results of an analysis on the data after log transformation.

Analysis of Variance Table


Response: log_larvae

Df Sum Sq Mean Sq F value Pr(>F)

rep 3 0.9567 0.31889 3.6511 0.0267223 *

trt 8 3.9823 0.49779 5.6995 0.0004092 ***

Residuals 24 2.0961 0.08734

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

test p.ajusted name.t ntr alpha

Fisher-LSD none trt 9 0.05

MSerror Df Mean CV t.value LSD

0.08733941 24 0.7547053 39.15865 2.063899 0.431299

log_larvae groups

Choose and justify the appropriate output based on meeting the model assumptions. Use this

output to identify which insecticide treatments are better at controlling insects than the control

(T9). Explain how you made this decision.


Question B [7 marks]

An experiment was conducted to compare the yields of leys grown from four mixtures

(labelled A, B, C and D) and three seed rates (4, 6 and 8 g per unit area). A randomised

complete block design was used with three blocks. An analysis of the original data detected

an unstable variance and to remedy this a natural log transformation was used. The following

ANOVA was obtained together with means on the log-scale. Some of the entries have been

omitted in the ANOVA table.

(i) Fill in the values of the cells in the analysis of variance table marked with a *.

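| Source | df | SS | MS | F-ratio | P-value |
|:--|--:|--:|--:|--:|--:|
| Block | * | 0.12469 | 0.0623 | 4.75 | |
| Rate | * | 0.34706 | 0.1735 | 13.22 | <0.001 |
| Mixture | * | 0.76693 | 0.2556 | 19.47 | <0.001 |
| Rate × Mixture | * | 0.21969 | * | * | 0.036 |
| Residual | * | 0.28879 | * | | |
| Total | * | 1.74715 | | | |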

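| Seed rate | A | B | C | D | Overall |
|:--|--:|--:|--:|--:|--:|
| 4 g | 2.960 | 2.462 | 2.586 | 2.739 | 2.700 |
| 6 g | 2.965 | 2.486 | 2.863 | 2.940 | 2.807 |
| 8 g | 2.974 | 2.788 | 3.048 | 2.951 | 2.940 |
| Overall | 2.966 | 2.579 | 2.823 | 2.895 | 2.816 |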


(ii) What is the least significant difference (LSD) to compare a pair of means of different mixtures but the same seed rate? The LSD is calculated as $t_{\text{crit}} \times \text{SED}$, where $\text{SED} = \sqrt{\text{ResMS} \times 2 / \text{rep}}$.

The following selected critical values from the t-tables will be useful:

| df | Critical value (one-tailed P = 0.025; two-tailed P = 0.05) |
|--:|--:|
| 6 | 2.447 |
| 22 | 2.074 |
| 30 | 2.042 |
| 40 | 2.021 |
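
The LSD formula quoted in part (ii) can be checked against output that already appears in this paper. The sketch below is a minimal Python version of $\text{LSD} = t_{\text{crit}} \times \sqrt{\text{ResMS} \times 2 / \text{rep}}$. It is applied to the untransformed Question A analysis (MSerror 34.2338, 4 blocks, t.value 2.063899), not to Question B, so it illustrates the calculation without working the answer. The function name `lsd` is an assumption made for illustration.

```python
import math


def lsd(t_crit, res_ms, rep):
    """Least significant difference between two treatment means.

    t_crit : two-tailed critical t value on the residual df
    res_ms : residual mean square from the ANOVA table
    rep    : number of observations behind each mean being compared
    """
    sed = math.sqrt(res_ms * 2 / rep)   # standard error of the difference
    return t_crit * sed


# Values taken from the LSD.test output for Question A (raw larvae counts).
lsd_larvae = lsd(t_crit=2.063899, res_ms=34.2338, rep=4)
print(round(lsd_larvae, 4))
```

The value agrees with the LSD of 8.538879 printed by `LSD.test` in Question A, which suggests the same formula underlies that output.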

(iii) Prior to the experiment, the experimenters were particularly interested in comparing seed

mixtures B and C at a seed rate of 6 g per unit area. Provide an estimate of this difference and

the 95% confidence interval for this comparison, initially on the log-scale, then on the original

scale of measurement.


Question C [7 marks]

Data on water quality parameters and algae counts were collected in many different Queensland

lakes and rivers for different algae species. Here we concentrate on the data for the

Cylindrospermopsis raciborskii (CR). The water quality data considered were Sulfide (mg/L),

Total N(itrogen) (mg/L), Total P(hosphorus) (mg/L), Dissolved Oxygen (DO, mg/L), Turbidity

(NTU), Temperature (degrees C), pH, and electrical conductivity (Cond) (uS/m). The

researchers are interested in developing the best possible model to predict the occurrence of CR

in Queensland waters.

Here is a snippet of the data.

| | SulfidesmgL| TotalNmgL| TotalPmgL| ConduScm| DOmgL|

|:--|-----------:|---------:|---------:|---------:|---------:|

|23 | 0.02| 1.450| 0.085| 788.9091| 1.308182|

|26 | 0.00| 0.455| 0.040| 235.4000| 14.428000|

|38 | 0.02| 1.600| 0.095| 1009.5000| 3.662500|

|54 | 0.02| 0.650| 0.045| 399.0000| 5.240000|

|61 | 0.01| 0.295| 0.024| 153.5000| 7.500000|

(i) On inspection of the data, the researchers decide to log10 transform most of the data

columns, namely SulfidesmgL, TotalNmgL, TotalPmgL, ConduScm, DOmgL, TurbidityNTU,

and CR. Subsequently a correlation table was generated. Based on the correlation matrix, explain

which independent variable you suspect will be the best predictor for CR in a single variable

linear regression (simple linear regression).

SulfidesmgL TotalNmgL TotalPmgL ConduScm DOmgL

SulfidesmgL 1.000000000 0.011520667 0.009485123 0.123405309 0.022352894

TotalNmgL 0.011520667 1.000000000 0.999325874 -0.014672749 -0.007831802

TotalPmgL 0.009485123 0.999325874 1.000000000 -0.017052459 -0.009546349

ConduScm 0.123405309 -0.014672749 -0.017052459 1.000000000 -0.173858384

DOmgL 0.022352894 -0.007831802 -0.009546349 -0.173858384 1.000000000

pH 0.103157999 -0.147361880 -0.158040039 0.239738159 0.109014015

Temperature 0.091086576 -0.016655411 -0.012593383 0.212635719 -0.118745738

TurbidityNTU 0.140494654 -0.027861083 -0.013408112 -0.001087102 -0.070986845

CR 0.022925155 -0.056605979 -0.055010325 -0.232874072 0.109641652

pH Temperature TurbidityNTU CR

SulfidesmgL 0.10315800 0.09108658 0.140494654 0.02292515

TotalNmgL -0.14736188 -0.01665541 -0.027861083 -0.05660598

TotalPmgL -0.15804004 -0.01259338 -0.013408112 -0.05501032

ConduScm 0.23973816 0.21263572 -0.001087102 -0.23287407

DOmgL 0.10901402 -0.11874574 -0.070986845 0.10964165

pH 1.00000000 0.22301849 -0.087793362 -0.12148631

Temperature 0.22301849 1.00000000 0.204382392 -0.29034452

TurbidityNTU -0.08779336 0.20438239 1.000000000 -0.11048542

CR -0.12148631 -0.29034452 -0.110485416 1.00000000


(ii) The researchers continued with a multiple regression analysis on all the data for CR. The

model output is given below. Explain whether this is a satisfactory statistical relationship.

Call:

lm(formula = log10CR ~ ., data = CR_QAlgae_tr)

Residuals:

Min 1Q Median 3Q Max

-2.28110 -0.69332 -0.00444 0.64203 2.07150

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 7.91824 1.84634 4.289 4.39e-05 ***

pH -0.26345 0.23046 -1.143 0.2559

Temperature -0.01908 0.02783 -0.686 0.4946

log10SulfidesmgL -3.07906 12.90546 -0.239 0.8120

log10TotalNmgL 4.87464 1.04979 4.643 1.12e-05 ***

log10TotalPmgL -11.31783 2.34518 -4.826 5.44e-06 ***

log10ConduScm -0.79975 0.38771 -2.063 0.0419 *

log10DOmgL 0.27440 0.53338 0.514 0.6082

log10TurbidityNTU -0.55953 0.21525 -2.599 0.0109 *

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.8497 on 93 degrees of freedom

Multiple R-squared: 0.3331, Adjusted R-squared: 0.2758

F-statistic: 5.807 on 8 and 93 DF, p-value: 5.251e-06

(iii) Why is the adjusted $r^2$ value important for model selection? Relate this to the principle of parsimony.


(iv) The following residual plots were obtained for the full multi-linear regression model with three

variables. Based on the residual plots, argue whether there are any concerns about continuing

with the regression analysis.

[Figure: residual plots, including a normal Q-Q plot (x-axis: Theoretical Quantiles; y-axis: Standardized residuals)]

(v) Backward elimination was performed and the following output was obtained. Using the

output explain in some statistical detail what the best final model for predicting log10(CR)

is and how the output of the variable selection procedure informs you of this (Indicate what

information you would be looking at).

Start: AIC=-24.64

log10CR ~ pH + Temperature + log10SulfidesmgL + log10TotalNmgL +

log10TotalPmgL + log10ConduScm + log10DOmgL + log10TurbidityNTU

Df Sum of Sq RSS AIC


- log10SulfidesmgL 1 0.0411 67.191 -26.5783

- log10DOmgL 1 0.1911 67.341 -26.3509

- Temperature 1 0.3394 67.489 -26.1264

- pH 1 0.9436 68.093 -25.2175

<none> 67.150 -24.6407

- log10ConduScm 1 3.0722 70.222 -22.0777

- log10TurbidityNTU 1 4.8790 72.029 -19.4865

- log10TotalNmgL 1 15.5683 82.718 -5.3725

- log10TotalPmgL 1 16.8166 83.966 -3.8447

Step: AIC=-26.58

log10CR ~ pH + Temperature + log10TotalNmgL + log10TotalPmgL +

log10ConduScm + log10DOmgL + log10TurbidityNTU

Df Sum of Sq RSS AIC

- log10DOmgL 1 0.1766 67.368 -28.3106

- Temperature 1 0.3536 67.545 -28.0430

- pH 1 0.9782 68.169 -27.1040

<none> 67.191 -26.5783

- log10ConduScm 1 3.1269 70.318 -23.9386

- log10TurbidityNTU 1 5.0982 72.289 -21.1186

- log10TotalNmgL 1 15.5567 82.748 -7.3361

- log10TotalPmgL 1 16.8151 84.006 -5.7966

Step: AIC=-28.31

log10CR ~ pH + Temperature + log10TotalNmgL + log10TotalPmgL +

log10ConduScm + log10TurbidityNTU

Df Sum of Sq RSS AIC

- Temperature 1 0.5775 67.945 -29.4399

- pH 1 0.8772 68.245 -28.9909

<none> 67.368 -28.3106

- log10ConduScm 1 3.6333 71.001 -24.9526

- log10TurbidityNTU 1 5.2317 72.599 -22.6819

- log10TotalNmgL 1 15.3830 82.751 -9.3325

- log10TotalPmgL 1 16.6387 84.006 -7.7963

Step: AIC=-29.44

log10CR ~ pH + log10TotalNmgL + log10TotalPmgL + log10ConduScm +

log10TurbidityNTU

Df Sum of Sq RSS AIC

- pH 1 1.2354 69.180 -29.6020

<none> 67.945 -29.4399

- log10ConduScm 1 4.6170 72.562 -24.7341

- log10TurbidityNTU 1 6.1241 74.069 -22.6372

- log10TotalNmgL 1 18.6416 86.587 -6.7104

- log10TotalPmgL 1 19.8658 87.811 -5.2783

Step: AIC=-29.6

log10CR ~ log10TotalNmgL + log10TotalPmgL + log10ConduScm + log10TurbidityNTU

Df Sum of Sq RSS AIC

<none> 69.180 -29.6020


- log10TurbidityNTU 1 5.6547 74.835 -23.5879

- log10ConduScm 1 6.2639 75.444 -22.7609

- log10TotalNmgL 1 18.5837 87.764 -7.3326

- log10TotalPmgL 1 19.1970 88.377 -6.6223
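
The AIC column in this stepwise output is consistent with the form $\text{AIC} = n \ln(\text{RSS}/n) + 2p$, which is what R's `extractAIC` computes for linear models (it omits an additive constant, so values can be negative). The sketch below is a minimal Python check of that relationship. The sample size $n = 102$ is inferred from the full model's 93 residual degrees of freedom plus its 9 estimated coefficients, and the function name `step_aic` is illustrative.

```python
import math


def step_aic(rss, n, n_params, k=2.0):
    """AIC in the form reported by stepwise selection for linear models.

    rss      : residual sum of squares of the candidate model
    n        : number of observations
    n_params : number of estimated coefficients, including the intercept
    k        : penalty per parameter (2 gives AIC)
    """
    return n * math.log(rss / n) + k * n_params


n_obs = 93 + 9  # residual df + coefficients of the full model (inferred)

aic_start = step_aic(rss=67.150, n=n_obs, n_params=9)   # full model
aic_final = step_aic(rss=69.180, n=n_obs, n_params=5)   # last step shown
print(round(aic_start, 2), round(aic_final, 2))
```

Each step removes the term whose deletion gives the lowest AIC, and the procedure stops once `<none>` is the row with the lowest AIC.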


Question D [5 marks]

The climate dataset contains various sea surface temperature anomalies (AO and NPI) as well

as rainfall data, ice cover data, temperature, and the year each of the average measurements

were made. Explore the R output and then answer questions i, ii, and iii.

Arctic Oscillation AO - Annual

North Pacific Index NPI - Annual

Ice - Annual, January to July, October to December, coverage, ice free days

Temp - Annual, summer and winter

Rain - Annual, summer and winter

Importance of components:

PC1 PC2 PC3 PC4 PC5 PC6 PC7

Standard deviation 1.7825 1.6162 1.4399 1.3051 1.00640 0.90898 0.67192

Proportion of Variance 0.2444 0.2009 0.1595 0.1310 0.07791 0.06356 0.03473

Cumulative Proportion 0.2444 0.4453 0.6048 0.7358 0.81373 0.87729 0.91202

PC8 PC9 PC10 PC11 PC12 PC13

Standard deviation 0.62628 0.50743 0.48147 0.38410 0.29745 0.16192

Proportion of Variance 0.03017 0.01981 0.01783 0.01135 0.00681 0.00202

Cumulative Proportion 0.94219 0.96200 0.97983 0.99118 0.99798 1.00000

PC1 PC2 PC3 PC4 PC5

AO 0.156287502 -0.50815718 0.12983848 -0.008177511 0.14421247

NPI 0.250939630 -0.42144515 -0.27943937 0.050866762 0.10421175

Temp 0.165175832 0.20243032 -0.01502539 -0.474768071 -0.56365852

SummerTemp 0.180231809 -0.29613178 -0.25368167 -0.195303153 -0.16756395

WinterTemp -0.266893813 0.14433019 -0.14234741 -0.428832652 0.34726092

Rain 0.213236238 -0.24402773 0.44640553 -0.314164836 0.06113113

SummerRain -0.001514271 -0.13388458 0.58070456 -0.229915571 0.21086146

WinterRain 0.149930136 -0.24867925 -0.47822831 -0.113466132 0.09667562

Ice -0.480852604 -0.23695449 0.01721970 -0.093068314 -0.20416734

Ice_JanJul -0.442093681 -0.16548521 -0.04652799 -0.151069775 -0.35809493

Ice_OctDec -0.271927458 -0.42494690 0.06788669 0.026653770 -0.21552433

IceCover 0.153102429 -0.05717217 0.21768767 0.532971801 -0.39965508

IceFreeDays 0.435433891 0.08921713 0.02359594 -0.263081329 -0.26139430


[Figure: Screeplot climate PCA (y-axis: Variances)]

(i) According to Kaiser’s criterion, how many components would you consider in your analysis?
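
Kaiser’s criterion is usually stated as: for a PCA on standardised variables, keep the components whose eigenvalue (variance) exceeds 1. The sketch below is a minimal Python illustration using the standard deviations printed under "Importance of components"; squaring each one gives the eigenvalue. The variable names are illustrative. Whether the data were standardised is an assumption, although it is supported by the eigenvalues summing to about 13, the number of variables.

```python
# Standard deviations of PC1..PC13 from "Importance of components".
sdev = [1.7825, 1.6162, 1.4399, 1.3051, 1.00640, 0.90898, 0.67192,
        0.62628, 0.50743, 0.48147, 0.38410, 0.29745, 0.16192]

# The eigenvalue of each component is its variance, i.e. sdev squared.
eigenvalues = [s ** 2 for s in sdev]

# Kaiser's criterion: retain components with eigenvalue greater than 1.
n_retained = sum(ev > 1.0 for ev in eigenvalues)

print(n_retained, round(sum(eigenvalues), 2))
```

Components whose eigenvalue is only just above 1 are borderline under this rule, so it is commonly read alongside the scree plot and the cumulative proportion of variance.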


(ii) Consider the biplot of the climate data and report three relationships of interest.


Equations

Sample variance

Confidence interval for mean, given unknown standard deviation

$95\%\ \text{CI} = \bar{y} \pm t^{0.025}_{n-1} \times \dfrac{s}{\sqrt{n}}$
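
As a worked illustration of this interval, the sketch below assumes the standard form $\bar{y} \pm t^{0.025}_{n-1} \times s/\sqrt{n}$ and applies it to the summary statistics given in Question 1 ($\bar{y} = 14.3$, $s = 3.0$, $n = 10$, $t^{0.025}_{9} = 2.262$). The function name `ci_mean` is hypothetical.

```python
import math


def ci_mean(ybar, s, n, t_crit):
    """Two-sided confidence interval for a mean with unknown sigma.

    t_crit must be the critical t value on n - 1 degrees of freedom
    for the chosen confidence level.
    """
    half_width = t_crit * s / math.sqrt(n)   # t times the standard error of the mean
    return ybar - half_width, ybar + half_width


# Summary statistics from Question 1; df = n - 1 = 9, so t = 2.262 is used.
lower, upper = ci_mean(ybar=14.3, s=3.0, n=10, t_crit=2.262)
print(round(lower, 2), round(upper, 2))
```

Note that the critical value is taken on $n - 1 = 9$ degrees of freedom, not on $n = 10$.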

Variance of the mean for simple random sampling (SiR)

Degrees of freedom for simple random sampling

Mean for stratified random sampling (StR)

Variance of the mean for stratified random sampling (StR)

Degrees of freedom for stratified random sampling

$df = n - H$

Variance of the change in mean

Covariance between 2 sets of observations

Total sum of squares

TotalSS = TreatmentSS + ResidualSS

One way ANOVA table

| Source | df | SS | MS | F-ratio |
|:--|:--|:--|:--|:--|
| Treatment | $t - 1$ | TrtSS | TrtSS/$(t - 1)$ | TrtMS/ResMS |
| Residual | $N - t$ | ResSS | ResSS/$(N - t)$ | |
| Total | $N - 1$ | TotSS | | |
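
The one-way table can be filled in mechanically once the sums of squares are known. The sketch below is a minimal Python version, assuming $t$ treatments and $N$ observations in total, and it is checked against the Supplement ANOVA printed for Questions 12-15 (TrtSS 2.651, ResSS 7.319, 4 treatments, 40 cows). The function name `one_way_anova` and the dictionary keys are illustrative.

```python
def one_way_anova(trt_ss, res_ss, t, n_total):
    """Fill in a one-way ANOVA table from its sums of squares.

    t       : number of treatments
    n_total : total number of observations (N)
    """
    trt_df = t - 1
    res_df = n_total - t
    trt_ms = trt_ss / trt_df
    res_ms = res_ss / res_df
    return {
        "trt_df": trt_df,
        "res_df": res_df,
        "tot_df": n_total - 1,
        "tot_ss": trt_ss + res_ss,   # TotalSS = TreatmentSS + ResidualSS
        "trt_ms": trt_ms,
        "res_ms": res_ms,
        "f_ratio": trt_ms / res_ms,
    }


# Sums of squares from the Supplement analysis (Questions 12-15).
table = one_way_anova(trt_ss=2.651, res_ss=7.319, t=4, n_total=40)
print(table["trt_df"], table["res_df"], round(table["f_ratio"], 2))
```

Small differences from the R output in the third or fourth decimal place come from the sums of squares being rounded in the printed table.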

ANOVA table with Block

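| Source | df | SS | MS | F-ratio |
|:--|:--|:--|:--|:--|
| Block | $b - 1$ | BlkSS | BlkSS/$(b - 1)$ | |
| Treatment | $t - 1$ | TrtSS | TrtSS/$(t - 1)$ | TrtMS/ResMS |
| Residual | $(b - 1)(t - 1)$ | ResSS | ResSS/$((b - 1)(t - 1))$ | |
| Total | $bt - 1$ | TotSS | | |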

Full factorial ANOVA table with Block

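| Source | df | SS | MS | F-ratio |
|:--|:--|:--|:--|:--|
| Block | $b - 1$ | BlkSS | BlkSS/$(b - 1)$ | |
| Treatment A | $t_A - 1$ | TrtASS | TrtASS/$(t_A - 1)$ | TrtAMS/ResMS |
| Treatment B | $t_B - 1$ | TrtBSS | TrtBSS/$(t_B - 1)$ | TrtBMS/ResMS |
| Treatment AB | $(t_A - 1)(t_B - 1)$ | TrtABSS | TrtABSS/$((t_A - 1)(t_B - 1))$ | TrtABMS/ResMS |
| Residual | $(b - 1)(t_A t_B - 1)$ | ResSS | ResSS/$((b - 1)(t_A t_B - 1))$ | |
| Total | $b\,t_A t_B - 1$ | TotSS | | |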

Treatment SS

TreatmentSS = TreatmentASS + TreatmentBSS + TreatmentABSS


A.1 Some probabilities for the cumulative standard normal

distribution

The distribution tabulated is for the normal distribution with mean 0 and standard deviation 1.

For each value of $z$, the table gives the proportion, $P$, of the distribution less than $z$, $P(Z < z)$.

ENVX1001 Introductory Statistical Methods

A.2 Some right-tail critical values for the Student’s T

distribution

The distribution tabulated is that of Student’s t. The first column is the degrees of freedom (df).

The remaining columns give either the one-tailed (upper tail) critical values so that $P(T_{df} > t) = P$, or the two-tailed critical values so that $P(T_{df} > t \text{ or } T_{df} < -t) = P$, where $P$ is the probability

shown at the top of the columns.

A.3 Some right-tail critical values for the Chi-Squared ($\chi^2$)

Distribution

The distribution tabulated is that of $\chi^2$. The first column is the degrees of freedom (df). The

remaining columns give the upper tail critical values so that $P(\chi^2_{df} > x^2) = P$, where $P$ is the

probability shown at the top of the columns.

For larger degrees of freedom than tabulated here, use the normal approximation to the $\chi^2$, and refer $z$ to the “Table of Probabilities for the Standard Normal

Distribution”.

A.4 Table of Probabilities of Fisher’s F Distribution

The distribution tabulated is that of Fisher’s F. The numerator degrees of freedom ($\nu_1$) are given by the column position and the denominator degrees of freedom

($\nu_2$) are given by the row position. The values in the body of the table are the upper tail critical values so that $P(F > f) = P$, where $P$ is the probability shown

(0.10, 0.05, 0.01).