### Transcription

#### REGRESSION [25 Points]

The results below show a regression of team winning percentage (WIN PCT) on the log of the average player salary using the data in the previous problem. (The regression was not computed with Minitab, but it shows the same numbers that Minitab would have reported.)

Least squares regression

| Statistic | Value |
|---|---|
| LHS | WIN PCT |
| Mean | .49976 |
| Standard deviation | .06738 |
| No. of observations | 468 |

| Source | Sum of Squares | DegFreedom | Mean square |
|---|---|---|---|
| Regression | .105725 | 1 | .10573 |
| Residual | 2.01448 | 466 | .00432 |
| Total | 2.12020 | 467 | .00454 |

| Statistic | Value |
|---|---|
| Standard error of e | .06575 |
| Root MSE | .06561 |
| Fit: R-squared | .04987 |
| Fit: R-bar squared | .04783 |
| Model test: F[1,466] | 24.45696 |
| Prob F > F* | .00000 |

| WIN PCT | Coefficient | Standard Error | t | Prob. \|t\| > T* | 95% Confidence Interval |
|---|---|---|---|---|---|
| Constant | .47675*** | .00556 | 85.78 | .0000 | .46586 to .48765 |
| LogAVGSALRY | .02225*** | .00450 | 4.95 | .0000 | .01343 to .03108 |

Note: ***, **, * Significance at 1%, 5%, 10%

Answer each of the following. If the answer cannot be determined from the numbers given, you may respond “unable to tell.”

(a) What is the value of se, the standard error of the regression?
(b) What is the standard deviation of WIN PCT?
(c) How many data points were used in this regression?
(d) What is the value of R² for this regression?
(e) What is the average of logAVGSALRY in this data set?
(f) What is the correlation coefficient r between WIN PCT and logAVGSALRY? Is it positive or negative? How do you know?
(g) Does this estimated regression provide convincing evidence of a statistical relationship between WIN PCT and logAVGSALRY? Explain.
(h) The slope of the regression line is .02225. How would you interpret this value?
(i) There are 31 teams in the data set. To see if there was a systematic difference between teams, I added 30 dummy variables for the first 30 teams in the sample to this model. R² went up from 0.04987 to 0.36437.
(ia) Shouldn’t I have added all 31 dummy variables to the equation? Why only 30?
(ib) Would you conclude that there is a systematic difference between teams? (Hint: The F table on page 10 does not contain a column for 30 degrees of freedom. But, given the statistic, you should be able to draw a conclusion.)
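Parts (f) and (ib) both reduce to short calculations on the printed numbers. The sketch below is one way to set them up, not an official answer key. It assumes the model with dummies has 32 estimated parameters (the constant, LogAVGSALRY, and 30 dummies), and the function name `partial_f` is made up for illustration.

```python
import math


def partial_f(r2_restricted, r2_full, n_added, n_obs, n_params_full):
    """F statistic for adding n_added regressors, built from the two R-squared values."""
    numerator = (r2_full - r2_restricted) / n_added
    denominator = (1.0 - r2_full) / (n_obs - n_params_full)
    return numerator / denominator


# Values taken from the printout and from part (i).
n_obs = 468
r2_simple = 0.04987
r2_dummies = 0.36437
n_dummies = 30
n_params_full = 1 + 1 + n_dummies  # assumption: constant, LogAVGSALRY, 30 dummies

f_stat = partial_f(r2_simple, r2_dummies, n_dummies, n_obs, n_params_full)

# In a simple regression r is the square root of R-squared,
# carrying the sign of the slope (.02225 is positive).
r = math.copysign(math.sqrt(r2_simple), 0.02225)

print(f"F[{n_dummies},{n_obs - n_params_full}] = {f_stat:.2f}, r = {r:.4f}")
```

The F statistic comes out at roughly 7.2. With large denominator degrees of freedom, 5% critical values for F are below 2 whether the numerator has 20 or 40 degrees of freedom, so the missing column for 30 does not affect the conclusion.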

#### Part VI. Model Building [50 points]

Predicting Presidential Elections and Other Things
Ray Fair, Stanford U. Press,

tions, Federal Reserve behavior, and inflation, and “not quite so big” topics (wine quality), the book takes on questions of more direct, personal interest. Who of your friends is most likely to have an extramarital affair? How important is class attendance for academic performance in college?

This book, published in 2002, describes many of Ray Fair’s [see http://fairmodel.econ.yale.edu/] research projects over a 35 year career as an economist. Among Ray’s most famous papers is “A Theory of Extramarital Affairs,” in which he examined from the point of view of an economist the time that married people spent in extramarital activity. As part of that research, he examined survey data sets gathered by two magazines, Psychology Today and Redbook. We are going to examine the first of these. The variables in this data set, which included both men and women, are:

y = Number of affairs in the past year,
z1 = Sex,
z2 = Age,
z3 = Number of years married,
z4 = Children (1 = yes, 0 = no),
z5 = Religiousness (rated 1 to 5),
z6 = Education (years),
z7 = Occupation (rating on Hollingshead scale),
z8 = Self rating of marriage (rated 1 to 5).

The first set of regression results (not computed with Minitab)

| Statistic | Value |
|---|---|
| LHS | NAFFAIR |
| Mean | 1.455907 |
| Standard deviation | 3.298758 |
| Residuals: Sum of squares | 5668.953 |
| Residuals: Standard error of e | 3.094501 |
| Residuals: Degrees of freedom | 592 |
| Fit: R-squared | .1317381 |
| Fit: Adjusted R-squared | .1200048 |
| Model test: F[8,592] (prob) | 11.2 (.00) |

| Variable | Coefficient | Standard Error | b/St.Er. | P value | Mean of X |
|---|---|---|---|---|---|
| Constant | 5.87201*** | 1.13750 | 5.162 | .0000 | |
| SEX | .05409 | .30049 | .180 | .8572 | .47587 |
| AGE | -.05098** | .02262 | -2.254 | .0242 | 32.4875 |
| YRSMARRD | .16947*** | .04122 | 4.111 | .0000 | 8.17770 |
| NUMKIDS | -.14262 | .35020 | -.407 | .6838 | .71547 |
| RELIGOUS | -.47761*** | .11173 | -4.275 | .0000 | 3.11647 |
| EDUC | -.01375 | .06414 | -.214 | .8303 | 16.1664 |
| OCCUPATN | .10492 | .08888 | 1.180 | .2378 | 4.19468 |
| RATEMARR | -.71188*** | .12001 | -5.932 | .0000 | 3.93178 |

Note: ***, **, * Significance at 1%, 5%, 10%

a. How many observations were used to compute the regression?
b. Do you think this regression provides a “significant” explanation of number of extramarital affairs?
c. What are the significant predictors of the number of extramarital affairs in these results?
d. What does the reported value p = 0.0242 reported with AGE mean?
e. Can you provide a specific interpretation to the reported value of -0.47761 reported with RELIG in the results?
f. Explain the meaning of the F statistic and the associated p value reported.
g. Based on these results, test the hypothesis that the coefficient on Sex in the model is zero.

#### Part VII. Logistic Regression

a. The book description states something about “who is most likely to have an affair.” Of course, the linear regression model does not do that at all. I created another variable, IFAFFAIR = 1 if the number of affairs is greater than 0 and 0 if it is 0. In this data set, using 600 observations, the proportion of people who reported having an affair is .250. What would you report to the reporter from People Magazine who is interested in a plausible range of values for the proportion of people in the population who are having extramarital affairs?

b. These are the results of a binary logistic regression. What does the model suggest is the best variable for predicting whether someone will have an affair?

Binary Logistic Model for Binary Choice
Dependent variable: IFAFFAIR

| Variable | Coefficient | Standard Error | b/St.Er. | P value | Mean of X |
|---|---|---|---|---|---|
| Constant | 1.37726 | .88776 | 1.551 | .1208 | |
| SEX | .28029 | .23910 | 1.172 | .2411 | .47587 |
| AGE | -.04426** | .01825 | -2.425 | .0153 | 32.4875 |
| YRSMARRD | .09477*** | .03221 | 2.942 | .0033 | 8.17770 |
| NUMKIDS | .39767 | .29151 | 1.364 | .1725 | .71547 |
| RELIGOUS | -.32472*** | .08975 | -3.618 | .0003 | 3.11647 |
| EDUC | .02105 | .05051 | .417 | .6769 | 16.1664 |
| OCCUPATN | .03092 | .07178 | .431 | .6666 | 4.19468 |
| RATEMARR | -.46845*** | .09091 | -5.153 | .0000 | 3.93178 |

Note: ***, **, * Significance at 1%, 5%, 10%
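Two of the questions above are direct computations: the test on the SEX coefficient in the linear model (Part VI, g) and the plausible range for the population proportion (Part VII, a). The following is a minimal sketch, assuming the usual large-sample normal approximation for both. The helper names `proportion_ci` and `two_sided_p` are illustrative and not part of the exam.

```python
import math
from statistics import NormalDist


def proportion_ci(p_hat, n, level=0.95):
    """Large-sample (Wald) confidence interval for a population proportion."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    std_err = math.sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat - z * std_err, p_hat + z * std_err


def two_sided_p(coef, std_err):
    """Two-sided p value for H0: coefficient = 0, using the normal approximation."""
    z = abs(coef / std_err)
    return 2.0 * (1.0 - NormalDist().cdf(z))


# Part VII (a): proportion .250 from 600 observations.
low, high = proportion_ci(0.250, 600)

# Part VI (g): SEX coefficient and standard error from the linear regression.
p_sex = two_sided_p(0.05409, 0.30049)

print(f"95% interval for the proportion: {low:.4f} to {high:.4f}")
print(f"p value for SEX: {p_sex:.4f}")
```

The interval is about .215 to .285. The p value for SEX is close to the .8572 shown in the printout, so the hypothesis of a zero coefficient is not rejected.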

#### VIII. Simple Regression

The results below show an analysis of the yearly costs of running a railroad in Switzerland (there are 50 Swiss railroads and these are several years of data). The results show the regression of the log of total costs on the log of total output (freight plus

| Statistic | Value |
|---|---|
| LHS | logTOTALCOST |
| Mean | 9.26126 |
| Standard deviation | 1.09804 |
| No. of observations | 605 |

| Source | Sum of Squares | DegFreedom | Mean square |
|---|---|---|---|
| Regression | 619.686 | 1 | 619.68635 |
| Residual | 108.557 | 603 | .18003 |
| Total | 728.243 | 604 | 1.20570 |

| Statistic | Value |
|---|---|
| Standard error of e | .42430 |
| Fit: R-squared | .85093 |
| Fit: R-bar squared | .85069 |
| Model test: F[1,603] | 3442.15897 |
| Prob F > F* | .00000 |

| logTOTCO | Coefficient | Standard Error | t | Prob. \|t\| > T* | 95% Confidence Interval |
|---|---|---|---|---|---|
| Constant | -3.83900 | .22395 | -17.14 | .0000 | -4.27794 to -3.40006 |
| LogOUTPUT | 1.00283 | .01709 | 58.67 | .0000 | .96933 to 1.03634 |

Answer each of the following based on this regression. If the answer cannot be determined from the numbers given, you may respond “unable to tell.”

[2 points] (a) What is the value of se, the standard error of the regression? 0.42430
[2 points] (b) What is the standard deviation of the log of TOTALCOST? 1.09804
[2 points] (c) What is the value of R² for this regression? 0.85093
[2 points] (d) What is the sample average of logTOTALCOST in this data set? 9.26126
[3 points] (e) What is the correlation coefficient r between logTOTALCOST and logOUTPUT? Is it positive or negative? How do you know?
[2 points] (f) Does this estimated regression provide convincing evidence of a statistical relationship between logTOTALCOST and logOUTPUT? Explain.
[2 points] (g) The slope of the regression line is 1.00283. How would you interpret this value?
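For part (e), the correlation can be recovered from the printout because this is a simple regression: r is the square root of R² with the sign of the slope. The sketch below works that out, and also checks that the squared t ratio on the slope reproduces the model F up to rounding. It is illustrative arithmetic on the printed values, not the exam's answer key.

```python
import math

# Values from the simple regression printout.
r_squared = 0.85093
slope, slope_se = 1.00283, 0.01709
reported_f = 3442.15897

# Correlation: square root of R-squared, signed like the slope.
r = math.copysign(math.sqrt(r_squared), slope)

# With one regressor, the model F statistic equals the squared slope t ratio.
t_slope = slope / slope_se
f_from_t = t_slope ** 2

print(f"r = {r:.4f}")
print(f"t = {t_slope:.2f}, t squared = {f_from_t:.1f}, reported F = {reported_f:.1f}")
```

The small gap between the squared t ratio and the reported F comes from the standard error being printed to only five decimals.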

#### IX. Multiple Regression

The regression below is based on an idea that there are other things besides output that explain the cost of running a Swiss railroad. These other things are

STOPS = number of stations
TUNNEL = 1 if the railroad has to go through lots of tunnels
VIRAGE = 1 if the railroad has lots of sharp curves
logNETWORK is the log of the number of meters of track on the railroad

There are 3 distinct language areas in Switzerland, German (Zurich), French (Geneva), and Italian (Lugano, Bellinzona). A not very good theory suggests that costs depend on where the railroad mostly operates. FRENCH and GERMAN are dummy region

| Statistic | Value |
|---|---|
| LHS | logTOTALCOST |
| Mean | 9.26126 |
| No. of observations | 605 |

| Source | Sum of Squares | DegFreedom | Mean square |
|---|---|---|---|
| Regression | 644.722 | 7 | 92.10320 |
| Residual | 83.5211 | 597 | .13990 |
| Total | 728.243 | 604 | 1.20570 |

| Statistic | Value |
|---|---|
| Standard error of e | .37403 |
| Fit: R-squared | .88531 |
| Fit: R-bar squared | .88397 |
| Model test: F[7,597] | 658.34390 |
| Prob F > F* | .00000 |

| logTOTAL | Coefficient | Standard Error | t | Prob. \|t\| > T* | 95% Confidence Interval |
|---|---|---|---|---|---|
| Constant | -3.35136 | .29007 | -11.55 | .0000 | -3.91988 to -2.78283 |
| logOUTPU | .66764 | .03335 | 20.02 | .0000 | .60227 to .73300 |
| STOPS | -.00225 | .00136 | -1.66 | .0973 | -.00491 to .00041 |
| logNETWO | .39339 | .03968 | 9.91 | .0000 | .31562 to .47117 |
| TUNNEL | .32096 | .04501 | 7.13 | .0000 | .23274 to .40918 |
| VIRAGE | -.15045 | .03474 | -4.33 | .0000 | -.21854 to -.08236 |
| FRENCH | -.01281 | .03122 | -.41 | .6815 | -.07401 to .04838 |
| GERMAN | .04195 | .03324 | 1.26 | .2069 | -.02320 to .10710 |

[2 points] (a) How many observations were used to compute the regression? 605
[2 points] (b) Which variable(s) are “statistically significant?” Based on the P values less than
[2 points] (c) There are three language regions, but I included only two dummy variables in the model. Shouldn’t I have included ITALIAN in the equation?
[2 points] (d) Based on the value of F, is this regression “significant?” Explain. The F is
[2 points] (f) The “P-value” for the coefficient on STOPS is 0.0973. How should I interpret this value?
[3 points] (g) Form a confidence interval for the coefficient on TUNNEL.
[3 points] (h) Based on these results, should I conclude that the costs are higher in the German speaking part of the country? Justify your answer.
[4 points] (i) I suspect that there are no significant differences between the three cities. To find out, I recomputed my regression model without the FRENCH and GERMAN variables (so all cities are now treated the same). The R squared fell to 0.88498. Do the results support my suspicion?
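Parts (g) and (i) can be worked directly from the printout. The sketch below assumes a normal critical value for the confidence interval (reasonable with 597 residual degrees of freedom) and uses the R² form of the F test for dropping FRENCH and GERMAN. The helper names `normal_ci` and `f_for_dropped` are invented for this illustration.

```python
from statistics import NormalDist


def normal_ci(coef, std_err, level=0.95):
    """Confidence interval for a coefficient using a normal critical value."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return coef - z * std_err, coef + z * std_err


def f_for_dropped(r2_full, r2_restricted, n_dropped, resid_df_full):
    """F statistic for the hypothesis that the n_dropped coefficients are all zero."""
    return ((r2_full - r2_restricted) / n_dropped) / ((1.0 - r2_full) / resid_df_full)


# Part (g): TUNNEL coefficient and standard error from the printout.
tunnel_low, tunnel_high = normal_ci(0.32096, 0.04501)

# Part (i): R-squared with and without FRENCH and GERMAN; 597 residual df in the full model.
f_language = f_for_dropped(0.88531, 0.88498, 2, 597)

print(f"TUNNEL 95% interval: {tunnel_low:.5f} to {tunnel_high:.5f}")
print(f"F[2,597] for the language dummies: {f_language:.3f}")
```

The interval reproduces the one printed for TUNNEL. The F statistic is below 1, far under any conventional critical value, which is consistent with the suspicion that the language dummies add nothing.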

#### Part IX. Multiple Regression Modeling [50 points]

The regressions on the next page show a study of box office revenues for a sample of movies in 2008/2009. The variables in the model are as follows:

The dependent variable is the log of box office revenues.

The predictors are:
Log of production budget
Index of ‘star power’
Dummy variable for whether the movie is a sequel
MPAA rating, G = 1, PG = 2, PG13 = 3, R = 4
Dummy variables for 4 kinds of movies (the left out category is every other kind of movie other than these 4)
CantWait3 is Fandango’s index of how much fans ‘can’t wait’
LogAddct is Trailer Addict.com’s index of fans ‘can’t wait’
LogFndGo is another internet buzz variable from Fandango
LogCmSon is a buzz variable from RottenTomatos.com

a. Based on the first regression, which variables are statistically significant in explaining LogBox?
b. Form a confidence interval for the true coefficient on CNTWAIT3 using the first regression.
c. Do the four internet buzz variables in the first regression provide a significant additional fit in the model compared to the second regression? How can you determine this?
d. Interpret the coefficient on SEQUEL in the first regression model.
e. Is it possible to determine the regression sum of squares in the second regression? How?
f. How many observations were used to compute the regressions?
g. Do you think the third regression provides a “significant” explanation of LogBox? Explain.
h. What does the reported value p = 0.2122 reported with LogBudget in the first equation mean?
i. Explain the meaning of the F statistic and the associated p value reported with the first model.

FIRST

| Statistic | Value |
|---|---|
| LHS | LOGBOX |
| Mean | 16.47993 |
| Standard deviation | .94297 |
| Model size: Predictors | 12 |
| Model size: Degrees of freedom | 49 (Residual) |
| Residuals: Sum of Squared Res. | 20.54972 (Sum of squared residuals) |
| Residuals: Standard error of e | .64760 (Mean squared residual) |
| Fit: R-squared | .62114 |
| Fit: Adjusted R-squared | .52836 |
| Model test: F[12,49] (prob) | 6.7 (.0000) |

| LOGBOX | Coefficient | Standard Error | t | Prob. \|t\| > T |
|---|---|---|---|---|
| Constant | 12.5388*** | .98766 | 12.70 | .0000 |
| LOGBUDGT | .23193 | .18346 | 1.26 | .2122 |
| STARPOWR | .00175 | .01303 | .13 | .8935 |
| SEQUEL | .43480 | .29668 | 1.47 | .1492 |
| MPRATING | -.26265* | .14179 | -1.85 | .0700 |
| ACTION | -.83091*** | .29297 | -2.84 | .0066 |
| COMEDY | -.03344 | .23626 | -.14 | .8880 |
| ANIMATED | -.82655** | .38407 | -2.15 | .0363 |
| HORROR | .33094 | .36318 | .91 | .3666 |
| CNTWAIT3 | 2.59489*** | .90981 | 2.85 | .0063 |
| LOGADCT | .29451** | .13146 | 2.24 | .0296 |
| LOGFNDGO | .02322 | .11460 | .20 | .8403 |
| LOGCMSON | .05950 | .12633 | .47 | .6397 |

SECOND

| Statistic | Value |
|---|---|
| Model size: Predictors | 8 |
| Model size: Degrees of freedom | 53 |
| Residuals: Sum of squares | 35.82544 |
| Residuals: Standard error of e | .82216 |
| Fit: R-squared | .33951 |

| LOGBOX | Coefficient | Standard Error | t | Prob. \|t\| > T |
|---|---|---|---|---|
| Constant | 14.0582*** | .94003 | 14.95 | .0000 |
| LOGBUDGT | .70488*** | .20434 | 3.45 | .0011 |
| STARPOWR | .00680 | .01540 | .44 | .6605 |
| SEQUEL | .64860* | .32555 | 1.99 | .0515 |
| MPRATING | -.12656 | .16306 | -.78 | .4411 |
| ACTION | -.29625 | .33730 | -.88 | .3838 |
| COMEDY | .00299 | .29845 | .01 | .9920 |
| ANIMATED | -.74281 | .47975 | -1.55 | .1275 |
| HORROR | 1.03248** | .43203 | 2.39 | .0204 |

THIRD

| Statistic | Value |
|---|---|
| Residuals: Sum of squares | 42.84084 |
| Residuals: Standard error of e | .86694 |
| Fit: R-squared | .21018 |
| Fit: Adjusted R-squared | .15475 |
| Model test: F[4,57] (prob) | 3.8 (.0084) |

| LOGBOX | Coefficient | Standard Error | t | Prob. \|t\| > T |
|---|---|---|---|---|
| Constant | 14.6306*** | .84708 | 17.27 | .0000 |
| LOGBUDGT | .46695** | .19506 | 2.39 | .0200 |
| STARPOWR | .00241 | .01567 | .15 | .8785 |
| SEQUEL | .57934* | .32530 | 1.78 | .0802 |
| MPRATING | -.00429 | .14595 | -.03 | .9766 |
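Questions c, e, and f about the movie regressions can be approached with the residual sums of squares printed for the first and second models. The following is a sketch under two assumptions: both regressions use the same sample, and the standard deviation of LOGBOX in the first printout uses the n - 1 divisor. It is one possible route to the answers, not the official solution.

```python
# Question f: observations = residual degrees of freedom + estimated coefficients.
resid_df_first = 49
n_coefs_first = 12 + 1  # 12 predictors plus the constant
n_obs = resid_df_first + n_coefs_first

# Question c: F test for the four buzz variables, from the residual sums of squares.
ssr_first = 20.54972   # first (full) regression
ssr_second = 35.82544  # second regression, buzz variables left out
n_buzz = 4
f_buzz = ((ssr_second - ssr_first) / n_buzz) / (ssr_first / resid_df_first)

# Question e: total sum of squares from the standard deviation of LOGBOX
# (assumes the n - 1 divisor), then regression SS = total SS - residual SS.
sd_logbox = 0.94297
total_ss = (n_obs - 1) * sd_logbox ** 2
reg_ss_second = total_ss - ssr_second

print(f"n = {n_obs}")
print(f"F[{n_buzz},{resid_df_first}] for the buzz variables = {f_buzz:.2f}")
print(f"total SS = {total_ss:.3f}, regression SS (second model) = {reg_ss_second:.3f}")
```

As a cross-check, the residual sum of squares divided by 1 - R² gives nearly the same total sum of squares for both the first and the second model, which supports the assumption that the two regressions share one sample.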

REGRESSION [25 Points] The results below show a regression of team winning percentage (WIN_PCT) on the log of the average player salary using the data in the previous problem. (The regression was not computed with Minitab, but it shows the same numbers that Minitab would have reporte