Date: 2019-06-07 11:28

PSTAT 126 Final Project Option 1: the CDI Data

1 Description

The following description can be found in Appendix C.2 of Applied Linear Regression Models, fourth edition, by Kutner, Nachtsheim, and Neter:

This data set provides selected county demographic information (CDI) for 440 of the most populous counties in the United States. Each line of the data set has an identification number with a county name and state abbreviation and provides information on 14 variables for a single county. Counties with missing data were deleted from the data set. The information generally pertains to the years 1990 and 1992. The 16 variables are

Variable Name   Description
County          County name
State           Two-letter state abbreviation
LandArea        Land area (square miles)
TotalPop        Estimated 1990 population
Pop18           Percent of 1990 CDI population aged 18–34
Pop65           Percent of 1990 CDI population aged 65 years old and older
Physicians      Number of professionally active nonfederal physicians during 1990
Beds            Total number of beds, cribs, and bassinets during 1990
Crimes          Total number of serious crimes in 1990, including murder, rape, robbery, aggravated assault, burglary, larceny-theft, and motor vehicle theft, as reported by law enforcement agencies
HSGrad          Percent of adult population (persons 25 years old or older) who completed 12 or more years of school
Bachelor        Percent of adult population (persons 25 years old or older) with bachelor’s degree
Poverty         Percent of 1990 CDI population with income below poverty level
Unemp           Percent of 1990 CDI labor force that is unemployed
IncPerCap       Per capita income of 1990 CDI population (dollars)
PersonalInc     Total personal income of 1990 CDI population (in millions of dollars)
Region          Geographic region classification used by the U.S. Bureau of the Census, where 1 = NE, 2 = NC, 3 = S, 4 = W

The file CDI.rds contains these data and is available on Gauchospace.

2 Project Components

The overall project consists of a thorough investigation of two regression models that combine concepts and methods of linear regression used throughout the quarter.

2.1 Part I

You will investigate the model

Physicians ~ log(TotalPop) + LandArea + IncPerCap (1)

by answering the following questions.

a) What relationships do you expect to see between the response and each of the predictors, and why? What kind of associations, if any, do you expect will be present between the three predictors, and why? Do some exploratory analysis (e.g. plots and/or numerical summaries) to test your intuition.

b) Fit the model in (1) and provide interpretations of the estimated coefficients. Report the value of R² and explain its meaning.
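
As a rough illustration of what fitting a model like (1) and computing R² involves, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the helper names `solve_linear_system` and `fit_ols` are hypothetical, and the six rows are made-up numbers, not values from CDI.rds. In practice a statistics package would do this in one call.

```python
import math

def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the caller's lists are not modified.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def fit_ols(rows, y):
    """Least-squares fit with an intercept; returns (coefficients, R^2)."""
    design = [[1.0] + list(r) for r in rows]
    k = len(design[0])
    # Normal equations: (X'X) beta = X'y.
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    beta = solve_linear_system(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, d)) for d in design]
    y_bar = sum(y) / len(y)
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    sst = sum((yi - y_bar) ** 2 for yi in y)
    # R^2 is the share of the variation in y accounted for by the fit.
    return beta, 1.0 - sse / sst

# Made-up rows (TotalPop, LandArea, IncPerCap); NOT taken from CDI.rds.
toy = [(120000, 500, 17000), (250000, 900, 21000), (400000, 650, 18000),
       (800000, 1200, 19000), (1500000, 2000, 24000), (3000000, 4000, 20000)]
physicians = [150, 420, 600, 1500, 3900, 6800]

# Mirror the right-hand side of (1): log(TotalPop) + LandArea + IncPerCap.
predictors = [(math.log(pop), area, inc) for pop, area, inc in toy]
beta, r2 = fit_ols(predictors, physicians)
```

Here `beta` holds the intercept followed by one slope per predictor, and `r2` is 1 - SSE/SST. Solving the normal equations directly is used only to keep the sketch short; it is not the numerically preferred approach for real data.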

c) Do diagnostic checks to assess whether or not the linear regression assumptions seem to hold. If the model assumptions do not hold in your view, investigate possible transformations for predictors and/or response. Once suitable transformations are found, repeat b) for this new model and use this model for the remainder of Part I. Otherwise, move on to d).

d) Using your fitted model, compute 95% confidence intervals for each of the coefficients in the model, and provide an interpretation for each. Conduct a test for the existence of a linear relationship between the predictors and response at α = 0.01. Give the null and alternative hypotheses (defining any notation that you use), value of the test statistic and its null distribution, the p-value or critical value, and your decision.
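
For orientation, the overall test in d) usually takes the following form. This is a sketch under two assumptions: the working model still has three predictor terms as in (1), and all n = 440 counties are used. The symbols β_j, SSR, and SSE are notation introduced here (slope coefficients, regression sum of squares, and residual sum of squares).

```latex
H_0:\ \beta_1 = \beta_2 = \beta_3 = 0
\qquad \text{vs.} \qquad
H_1:\ \beta_j \neq 0 \ \text{for at least one } j \in \{1, 2, 3\}

F = \frac{\mathrm{SSR}/p}{\mathrm{SSE}/(n - p - 1)} \;\sim\; F_{p,\; n-p-1}
\quad \text{under } H_0,
\qquad p = 3,\ n = 440 \ \Rightarrow\ F_{3,\,436}
```

If a transformation in c) changes the number of predictor terms, p and the degrees of freedom change accordingly.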

e) Does the variance increase or decrease with log(TotalPop)? Perform a test to support your conclusion. If you conclude that the variance is not constant, refit the model using weighted least squares and comment on any differences in the fitted coefficients or their standard errors.
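
To make the weighted least squares step concrete, here is a minimal one-predictor sketch in plain Python. It is not the full model from Part I: the function name `weighted_least_squares` is hypothetical, the data are made up, and the variance model (error variance proportional to x squared, hence weights 1/x²) is purely an assumption for illustration. In a real analysis the weights would come from the variance pattern you actually find.

```python
def weighted_least_squares(x, y, w):
    """Minimize sum(w_i * (y_i - b0 - b1 * x_i)^2) for a single predictor."""
    sw = sum(w)
    # Weighted means replace the ordinary means of simple regression.
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Made-up values standing in for log(TotalPop) and Physicians; NOT from CDI.rds.
x = [11.0, 11.5, 12.0, 12.5, 13.0, 13.5]
y = [200.0, 310.0, 520.0, 700.0, 1300.0, 1500.0]

# Assumed variance model: Var(error_i) proportional to x_i^2, so w_i = 1 / x_i^2.
weights = [1.0 / xi ** 2 for xi in x]

wls_fit = weighted_least_squares(x, y, weights)
ols_fit = weighted_least_squares(x, y, [1.0] * len(x))  # equal weights give OLS
```

Comparing `wls_fit` with `ols_fit` shows how down-weighting the noisier observations shifts the intercept and slope, which is the kind of difference e) asks you to comment on.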

f) Summarize your analysis and comment on any interesting or unexpected findings.

2.2 Part II

You will investigate the model

Physicians ~ TotalPop + Region (2)

a) Fit the model in (2), and check the diagnostics. Find transformations if necessary.

b) Using your transformations from a), refit the model. For each region separately, write out an equation that expresses the estimated mean number of physicians as a function of total population. Based on these equations, explain why this model is called a parallel regression model.
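
As a sketch of the general shape of such equations, consider model (2) as written, whose only predictors are TotalPop and Region. The notation is an assumption made here: Region 1 (NE) is taken as the baseline level, β_0 and β_1 are the baseline intercept and the common slope, and γ_k is the intercept shift for region k.

```latex
\begin{aligned}
E[\text{Physicians} \mid \text{TotalPop},\ \text{Region} = 1]
  &= \beta_0 + \beta_1\,\text{TotalPop} \\
E[\text{Physicians} \mid \text{TotalPop},\ \text{Region} = k]
  &= (\beta_0 + \gamma_k) + \beta_1\,\text{TotalPop}, \qquad k = 2, 3, 4
\end{aligned}
```

If a) leads to transformed variables, the same pattern applies with the transformed response and predictor in place of the raw ones.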

c) Does the geographic region have a significant effect on the number of physicians in a county? Explain your answer. If geographic region is not important, remove it from the model from now on.

d) Using model selection techniques from class, build on your current model by selecting relevant predictors from Pop65, Crimes, Bachelor, Poverty, and PersonalInc. Perform a partial F-test to assess whether the improvement from adding these predictors, compared to the model without them, is statistically significant at α = 0.05.
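
The partial F-test in d) compares a reduced model with a full model that contains it. A minimal sketch of the statistic in plain Python follows; the function name `partial_f_statistic` is hypothetical and the sums of squares and degrees of freedom are made-up numbers, not results from the CDI data.

```python
def partial_f_statistic(sse_reduced, df_reduced, sse_full, df_full):
    """F statistic for comparing two nested linear models.

    sse_* are residual sums of squares; df_* are residual degrees of
    freedom (n minus the number of fitted coefficients).
    """
    if df_reduced <= df_full:
        raise ValueError("the reduced model must have more residual df")
    # Drop in SSE per extra coefficient, relative to the full model's MSE.
    numerator = (sse_reduced - sse_full) / (df_reduced - df_full)
    denominator = sse_full / df_full
    return numerator / denominator

# Made-up inputs: the full model adds 3 coefficients to the reduced model.
f_stat = partial_f_statistic(sse_reduced=1200.0, df_reduced=438,
                             sse_full=1000.0, df_full=435)
```

Under the null hypothesis that the added coefficients are all zero, the statistic is compared with an F distribution with (df_reduced - df_full, df_full) degrees of freedom.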

e) Using the model chosen in d), identify any influential points. For any data points with large influence, use leverages and/or residuals (standardized or studentized) to explain why they are influential.
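
To show how leverage and residual size combine into influence, here is a minimal sketch for a one-predictor regression in plain Python. It is a simplification of the multi-predictor model chosen in d): the function name `influence_measures` is hypothetical, the data are made up, and Cook's distance is used as one common influence measure, which may or may not be the one emphasized in class.

```python
def influence_measures(x, y):
    """Leverages, standardized residuals, and Cook's distances
    for the simple regression y = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    p = 2  # fitted coefficients: intercept and slope
    mse = sum(e ** 2 for e in resid) / (n - p)
    # Leverage grows with the squared distance of x_i from the mean of x.
    leverage = [1.0 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    # Standardized (internally studentized) residuals.
    standardized = [e / (mse * (1.0 - h)) ** 0.5 for e, h in zip(resid, leverage)]
    # Cook's distance combines residual size and leverage.
    cooks = [r ** 2 * h / (p * (1.0 - h)) for r, h in zip(standardized, leverage)]
    return leverage, standardized, cooks

# Made-up data with one point far from the others in x; NOT from CDI.rds.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 15.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8, 20.0]
leverage, standardized, cooks = influence_measures(x, y)
```

A point can be influential because it has high leverage, a large residual, or both; inspecting `leverage` and `standardized` alongside `cooks` separates the two explanations.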

f) Summarize your analysis and comment on any interesting or unexpected findings.
