 ### Transcription

Main topics
Exploratory data analysis
Probability
Statistical inference and hypothesis testing
Simple and multiple linear regression

UNIVARIATE EXPLORATORY DATA ANALYSIS
1. Graphical summaries of the data
2. Numerical descriptive measures
3. Boxplot

MULTIVARIATE EXPLORATORY DATA ANALYSIS
1. How to relate two things
2. Correlations and covariances
3. Linearly related variables
4. Portfolio example
5. Simple linear regression

BASIC PROBABILITY
1. Probability and random variables
2. Bivariate random variables
3. Marginal distribution
4. Conditional distribution
5. Independence
6. Computing joints from conditionals and marginals

MORE ON PROBABILITY
1. Continuous distributions
2. Normal distribution
3. Cumulative distribution function
4. Expectation as a long run average
5. Expected value and variance of continuous random variables
6. Random variables and formulas
7. Covariance/correlation for pairs of random variables
8. Independence and correlation

STATISTICAL INFERENCE
0. I.I.D. draws from the normal distribution
1. Binomial distribution
2. The central limit theorem
3. Estimating p, population and sample values
4. The sampling distribution of the estimator
5. Confidence interval for p

HYPOTHESIS TESTING
1. Hypothesis testing
2. P-values
3. Confidence intervals, tests, and p-values in general

SIMPLE LINEAR REGRESSION
1. Simple linear regression model
2. Estimates and plug-in prediction
3. Confidence intervals and hypothesis testing
4. Fits, residuals, and R-squared

MULTIPLE LINEAR REGRESSION
1. Multiple linear regression model
2. Estimates and plug-in prediction
3. Confidence intervals and hypothesis testing
4. Fits, residuals, R-squared, and the overall F-test
5. Categorical explanatory variables: dummy variables

TOPICS IN REGRESSION
1. Residuals as diagnostics
2. Transformations as cures
3. Logistic regression
4. Understanding multicollinearity
5. Autoregressive models
6. Financial time series

Univariate Exploratory Data Analysis
1. Graphical summaries of the data
1.1 Dot plot
1.2 Histogram
1.3 Time series plot
2. Numerical descriptive measures
2.1 Measures of central tendency
2.1.1 The sample mean
2.1.2 The median
2.2 Measures of dispersion
2.2.1 The sample variance
2.2.2 The sample standard deviation
2.3 Measure of asymmetry: skewness
2.4 Measure of extremity: kurtosis
2.5 Quantiles
2.6 Empirical rule
3. Boxplot

Summary of the lecture

In this class you will learn how to graph
- small sets of quantitative observations: dot plot
- large sets of quantitative observations: histogram
- observations that are collected as time evolves: time series plot

You will also learn how to construct a boxplot, which can prove useful when comparing observations from several samples.

Even though graphs are extremely useful and relatively simple to draw, in many situations numerical summaries are required, for instance as input into other systems. We will also talk about
- measures of central tendency (mean and median)
- measures of dispersion (variance, standard deviation)
- measure of asymmetry (skewness)
- measure of extremity (kurtosis)

We will also discuss the empirical rule, which says that in a mound-shaped sample roughly 68% of the observations fall within one sample standard deviation of the sample mean and roughly 95% fall within two sample standard deviations of the sample mean.

Book material

Chapter 1
Types of statistics (pages 6-7 (12 & 13)*) and types of variables (pages 8-9 (12 & 13))
Chapter 2
Frequency distributions and histogram (pages 25-33 (12), 22-37 (13))
Chapter 3
Sample mean (page 58 (12 & 13)) and sample median (page 62 (12 & 13))
Measures of dispersion (pages 71-77 (12), 71-80 (13))
Empirical rule (page 80 (12), 82 (13))
Chapter 4
Dot plots (pages 97-98 (12), 99-100 (13))
Boxplots (pages 108-111 (12), 110-113 (13))
Skewness (pages 114-117 (12), 113-117 (13))
*Numbers in parentheses refer to the book edition.

1. Graphical Summaries of the Data

Two key ideas

Exploratory (descriptive) issues:
Look at the data (sample).
Understand its structure without generalizing.

Inference issues:
Use data (sample) to generalize results to a larger population of interest.

Example

Problem: How many of 100,000 voters (population) prefer A over B? We can't ask them all!
Solution: Ask a sample of 500 voters.
Summarize, describe the data: 300 voters for A (A = 1), 200 for B (B = 0).
We will learn how to generalize to the population. For now, we just learn how to analyze (describe) the data.

Let us look at some data. Data are the statistician's raw material, the numbers that we use to interpret reality. All statistical problems involve either the collection, description and analysis of data, or thinking about the collection, description and analysis of data.

There are many aspects of data. Data may be:
- univariate (one variable per case) or
- multivariate (more than one variable per case).

There are also different types of data:
- discrete (transactions in a given day) and
- continuous (SP500)

The Canadian Return Data

Here is a specific data set (or sample). We have 107 monthly returns on a broad based portfolio of Canadian assets.

[Table: the 107 monthly Canadian returns, in time order]

Interpret: Each number corresponds to a month. They are given in time order (go across columns first). Our first observation is .07: in the first month the return was .07, in the 11th it was .03.

1.1 The dot plot

We are interested in ways to summarize or "see" the data. The previous table was very unclear.
To display the returns we can use a simple graphical tool: the dot plot.
For each number simply place a dot above the corresponding point on the number line.

[Dot plot of the Canadian returns]

Interpret:
The returns are centered or located at about .01.
The spread or variation in the returns is huge.

The two features to read off the plot are the center or location of the data and the variation or spread about the center.

Notice that the data has a nice mound or bell shape. There is a central peak and right and left "tails" that die off roughly symmetrically.

Some data does not have the mound shape.

[Dot plot: daily volume of trades in the cattle pit]

It is skewed to the right or positively skewed.

We also have data on countries other than Canada. Let us compare Canada with Japan.

[Dot plots of the Canadian and Japanese returns]

It really helps to get things on the same scale. How is Japan different from Canada?

[Dot plots of the Canadian and Japanese returns on a common scale]

Mutual fund data

Let us use the dot plot to compare returns on some other kinds of assets. We will look at returns on different mutual funds such as the equally weighted market and T-bills.
The equally weighted market represents returns on a portfolio where you spread your money out equally over a wide variety of stocks.

Character Dotplot

Data on 4 different kinds of returns:
- drefus: Dreyfus growth fund
- Putnminc: Putnam income fund (note that each dot is now 2 points)
- eqmrkt: equally weighted market
- tbill: T-bills (each dot is 7 points here; this is the risk free asset)

[Character dot plots of drefus, Putnminc, eqmrkt and tbill on a common axis from -0.20 to 0.30]

The beer data

nbeerm: the number of beers male MBA students claim they can drink without getting drunk
nbeerf: same for females

[Character dot plots of nbeerm and nbeerf on a common axis from 0.0 to 20.0]

We call a point that sits far away from the rest of the data an outlier.

Generally the males claim they can drink more: their numbers are centered or located at larger values.

1.2 The histogram

Sometimes the dot plot can look rather jumpy. The histogram gives us a smoother picture of the data.
The height of each bar tells us how many observations are in the corresponding interval.

nbeerf: 4.0, 2.0, 5.0, 6.0, 0.5, 7.0, 6.0, 2.5, 5.0

[Histogram of nbeerf]

3 women have a number of beers in the interval (1.5, 4.5).
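The bar heights of a histogram are just counts per interval, which can be sketched in a few lines of Python. The nbeerf values are the ones listed on the slide; only the interval (1.5, 4.5) comes from the notes, and the other bin edges are an assumption chosen so that every bin is 3 beers wide.

```python
# Number of beers reported by the women (the nbeerf values from the slide).
nbeerf = [4.0, 2.0, 5.0, 6.0, 0.5, 7.0, 6.0, 2.5, 5.0]

# Only the bin (1.5, 4.5) is given in the notes; the remaining edges are
# an assumption that keeps all bins the same width.
edges = [-1.5, 1.5, 4.5, 7.5]


def histogram_counts(values, edges):
    """Count how many values fall in each half-open bin [left, right)."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for k in range(len(counts)):
            if edges[k] <= v < edges[k + 1]:
                counts[k] += 1
                break
    return counts


counts = histogram_counts(nbeerf, edges)

# Print a sideways text histogram: one star per observation.
for k, c in enumerate(counts):
    print(f"[{edges[k]:5.1f}, {edges[k + 1]:5.1f}): {'*' * c} ({c})")
```

The middle bin holds 3 observations (4.0, 2.0 and 2.5), which matches the count read off the slide. Changing the number of bins changes how smooth the picture looks.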

Here is the histogram of the Canadian returns.

[Histogram of the Canadian returns]

The number of bars you use affects how "smooth" the picture looks.

1.3 The time series plot

We just looked at two kinds of data:
1) the return data
2) the number of beers

For the return data, each number corresponds to a month. For the beer data, each number corresponds to a person.
The return data has an important feature that the beer data does not have. It has an order!
There is a first one, a second one, and so on.

A sequence of observations taken over time is often called a time series.
We could have daily data (temperature), annual data (inflation), quarterly data (inflation, GDP) and so on.
For time series data, the time series plot is an important way to look at the data.

Time series plot of the Canadian returns:

[Time series plot: returns on the vertical axis, time on the horizontal axis]

Do you see a pattern?

Time series plot of daily volume of trades in the cattle pit:

[Time series plot: volumes on the vertical axis, days on the horizontal axis]

Do you see a pattern?

Monthly US beer production.

[Time series plot of monthly US beer production]

Now, do you see a pattern?

Australia: monthly production of beer, in megalitres. April 1956 - Aug 1995

[Time series plot of Australian monthly beer production]

Two components: a seasonal (annual) cycle plus an increasing trend from 100 to 175, then a constant trend for the second half of the time series.

2. Numerical Descriptive Measures

We have looked at graphs. Suppose we are now interested in having numerical summaries of the data rather than graphical representations.

We have seen that two important features of any data set are:
1) how spread out the data is, and
2) the central or typical value of the data set.

In this part of the notes we will describe methods to summarize a data set numerically.

First, we will introduce measures of central tendency to determine the "center" of a distribution of data values, or possibly the "most typical" data value. Measures of central tendency include the mean and the median.

Second, we will discuss measures of dispersion, such as the sample standard deviation and the sample variance.

2.1 Measures of Central Tendency

2.1.1 The sample mean

Suppose we collect n pieces of data. We need some way of describing the data. We write

$$x_1, x_2, \ldots, x_n$$

Here $x_1$ is the first number and $x_n$ is the last number. n is the number of numbers, or the "number of observations." You may also hear it referred to as the "sample size." The $x_i$ are the values that we observe.

Here, x is just a name for the set of numbers; we could just as easily use y (or Buddy).

Example: x = 5, 2, 8, 6, 2, so n = 5.

Sometimes the order of the observations means something. In our return data the first observation corresponds to the first time period.
Sometimes it does not. In our beer data we just have a list of numbers, each of which corresponds to a student.

The sample mean is just the average of the numbers "x":

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$

We often use the symbol $\bar{x}$ to denote the mean of the numbers x. We call it "x bar".

Here is a more compact way to write the same thing. Consider

$$x_1 + x_2 + \cdots + x_n$$

We use a shorthand for it (it is just notation):

$$\sum_{i=1}^{n} x_i$$

This is summation notation.

Using summation notation we have:

The sample mean:

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

Graphical interpretation of the sample mean

Let us go back to our standard dot plots.

[Character dot plots of nbeerf and nbeerm on a common axis from 0.0 to 12.5]

In some sense, the men claim to drink more. To summarize this we can compute the average value for both men and women.
(I deleted the outlier, I do not believe him!)
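The summation formula for the sample mean translates directly into a loop. The sketch below is only an illustration: `sample_mean` is a name made up here, x is the five-number example from the notes, and nbeerf is the women's beer data listed with the histogram.

```python
def sample_mean(x):
    """Return (x_1 + ... + x_n) / n, the average of the observations."""
    n = len(x)
    total = 0.0
    for x_i in x:  # the loop plays the role of the summation sign
        total += x_i
    return total / n


x = [5, 2, 8, 6, 2]  # the five example numbers, so n = 5
x_bar = sample_mean(x)
print("x bar =", x_bar)

# Women's beer data from the histogram slide.
nbeerf = [4.0, 2.0, 5.0, 6.0, 0.5, 7.0, 6.0, 2.5, 5.0]
print("mean of nbeerf =", round(sample_mean(nbeerf), 4))
```

The second value agrees with the mean of nbeerf reported in the notes, 4.2222.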

Mean of nbeerf = 4.2222
Mean of nbeerm = 7.8625

"On average women claim they can drink 4.2 beers. Men claim they can drink 7.9 beers."

In the picture, I think of the mean as the "center" of the data.

[Character dot plots of nbeerf and nbeerm on a common axis from 0.0 to 12.5, with the means 4.22 and 7.86 marked]

Let us compare the means of the Canadian and Japanese returns.

Mean of canada = 0.0090654
Mean of japan = 0.0023364

This is a big difference. It was hard to see this difference in the earlier dot plots because the difference is small compared to the variation.

More on summation notation (take this as an aside)

Let us look at summation in more detail.

$$\sum_{i=1}^{n} x_i$$

means that for each value of i, from 1 to n, we add to the sum the value indicated, in this case $x_i$.

To understand how it works let us consider some examples.

Think of each row as an observation on both x and y. To make things concrete, think of each row as corresponding to a year and let x and y be annual returns on two different assets.

[Table: annual returns on assets x and y]

In year 1 asset "x" had return 7%. In year 4 asset "y" had return 3%.

Compute x bar.
Compute y bar.

For each value of i, we can add in anything we want:

[Worked summation examples]

2.1.2 The median

After ordering the data, the median is the middle value of the data. If there is an even number of data points, the median is the average of the two middle values.

Example
1, 2, 3, 4, 5: Median = 3
1, 1, 2, 3, 4, 5: Median = (2 + 3)/2 = 2.5

Mean versus median

Although both the mean and the median are good measures of the center of a distribution of measurements, the median is less sensitive to extreme values.
The median is not affected by how extreme the largest and smallest values are, since only the ordering of the measurements and the middle value(s) enter its computation.

Example
1, 2, 3, 4, 5: Mean = 3, Median = 3
1, 2, 3, 4, 100: Mean = 22, Median = 3

2.2 Measures of Dispersion

The mean and the median give us information about the central tendency of a set of observations, but they shed no light on the dispersion, or spread, of the data.

Example: Which data set is more variable?
5, 5, 5, 5, 5: Mean = 5
1, 3, 5, 8, 8: Mean = 5

Do you only care about the average return on a mutual fund, or do you need a measure of risk, too?
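The mean-versus-median comparison can be checked with Python's standard `statistics` module. This is a small sketch using the same numbers as the examples in the notes; the variable names are made up for the illustration.

```python
from statistics import mean, median

clean = [1, 2, 3, 4, 5]
with_outlier = [1, 2, 3, 4, 100]  # same data, largest value replaced by 100

# The mean moves a lot when one value becomes extreme; the median does not.
for data in (clean, with_outlier):
    print(data, "mean =", mean(data), "median =", median(data))

# With an even number of points the median averages the two middle values.
even = [1, 1, 2, 3, 4, 5]
print(even, "median =", median(even))
```

Replacing 5 by 100 moves the mean from 3 to 22 while the median stays at 3.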

2.2.1 The Sample Variance

[Character dot plots of x and y on a common axis from 0.030 to 0.105]

The y numbers are more spread out than the x numbers. We want a numerical measure of variation or spread.
The basic idea is to view variability in terms of the distance between each measurement and the mean.

We cannot just look at the distance between each measurement and the mean. We need an overall measure of how big the differences are (i.e., just one number, as in the case of the mean).
Also, we cannot just sum the individual distances, because the negative distances cancel out with the positive ones, always giving zero (Why?).

We average the squared distances. So, the sample variance of the x data is defined to be:

Sample variance:

$$s_x^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

We use n-1 instead of n for technical reasons that will be discussed later.
Think of it as the average squared distance of the observations from the mean.

Questions
1) What is the smallest value a variance can be?
2) What are the units of the variance?

It is helpful to have a measure of spread which is in the original units. The sample variance is not in the original units. We now introduce a measure of dispersion that solves this problem: the sample standard deviation.

2.2.2 The sample standard deviation

It is defined as the square root of the sample variance (easy).

The sample standard deviation:

$$s_x = \sqrt{s_x^2} = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$$

The units of the standard deviation are the same as those of the original data.
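As a rough check of the definitions, the sketch below computes the sample variance and standard deviation with the n-1 divisor for the two small data sets from the dispersion example (5, 5, 5, 5, 5 and 1, 3, 5, 8, 8). The function names are illustrative, not taken from the course software.

```python
import math


def sample_variance(x):
    """Average squared distance from the mean, using n - 1 as the divisor."""
    n = len(x)
    x_bar = sum(x) / n
    return sum((x_i - x_bar) ** 2 for x_i in x) / (n - 1)


def sample_std(x):
    """Square root of the sample variance, in the units of the data."""
    return math.sqrt(sample_variance(x))


flat = [5, 5, 5, 5, 5]
spread = [1, 3, 5, 8, 8]

# Plain deviations from the mean cancel out, which is why we square them.
x_bar = sum(spread) / len(spread)
deviations = [x_i - x_bar for x_i in spread]
print("sum of deviations:", sum(deviations))

for name, data in (("flat", flat), ("spread", spread)):
    print(name, "variance =", sample_variance(data), "std =", sample_std(data))
```

Both data sets have mean 5, but the first has variance 0 (the smallest value a variance can take) while the second has variance 38/4 = 9.5.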

Example 1 (numerical)

Assume as before:
0.04, -0.02, 0.02, -0.04
0.02, 0.01, 0.01, 0.02

The sample standard deviation for the y data is bigger than that for the x data. This numerically captures the fact that y has "more variation" about its mean than x.

Example 2 (graphical)

The standard deviations measure the fact that there is more spread in the Japanese returns.

[Character dot plots of canada and japan on a common axis]

2.3 Measure of asymmetry: Skewness

Measures asymmetry of a distribution.
Symmetric data has zero skewness.

Negative skewness (the left tail is longer; mean < median)
Occurs when the values to the left of (less than) the mean are fewer but farther from the mean than are values to the right of the mean.

Positive skewness (the right tail is longer; mean > median)
Example: investment returns -5%, -10%, -15%, 30%

People like bets with positive skewness. They are willing to accept low, or even negative, expected returns when an asset exhibits positive skewness.

[Figure: two skewed distributions with the positions of the mean and the median marked]

2.4 Measure of extremity: Kurtosis

Measures the degree to which exceptional values occur more frequently (high kurtosis) or less frequently (low kurtosis).
A reference distribution is the normal distribution, whose kurtosis is three.

High kurtosis means exceptional values are more common; the distribution is said to have "fat tails." Fat tails indicate a higher percentage of very low and very high returns than would be expected with a normal distribution.

Low kurtosis results in "thin tails" and a wide middle: more values are close to the average, and fewer are in the tails, than there would be in a normal distribution.

[Figure: volume data]

Kurtosis: historical facts

KURTOSIS was used by Karl Pearson in 1905 in "Das Fehlergesetz und seine Verallgemeinerungen durch Fechner und Pearson. A Rejoinder," Biometrika, 4, 169-212, in the phrase "the degree of kurtosis." He states therein that he has used the term previously (OED). According to the OED and to Schwartzman the term is based on the Greek meaning a bulging, convexity.

He introduced the terms leptokurtic, platykurtic and mesokurtic, writing in Biometrika (1905), 5. 173: "Given two frequency distributions which have the same variability as measured by the standard deviation, they may be relatively more or less flat-topped than the normal curve. If more flat-topped I term them platykurtic, if less flat-topped leptokurtic, and if equally flat-topped mesokurtic" (OED2).

In his "Errors of Routine Analysis," Biometrika, 19 (1927), p. 160, Student provided a mnemonic:

[Figure: Student's mnemonic]

Computing skewness and excess kurtosis

Excess kurtosis is kurtosis minus 3.
Excel computes excess kurtosis.
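A minimal sketch of how skewness and excess kurtosis can be computed from central moments is given below. It assumes the plain moment definitions (dividing by n); spreadsheet and statistics packages often apply small-sample adjustments, so their output can differ somewhat from these values. The function names are made up for the illustration, and the data are the investment returns from the skewness example.

```python
def central_moment(x, k):
    """k-th central moment: the average of (x_i - x_bar)^k, dividing by n."""
    n = len(x)
    x_bar = sum(x) / n
    return sum((x_i - x_bar) ** k for x_i in x) / n


def skewness(x):
    """Third central moment scaled by the 3/2 power of the second."""
    return central_moment(x, 3) / central_moment(x, 2) ** 1.5


def excess_kurtosis(x):
    """Fourth central moment over the squared second moment, minus 3."""
    return central_moment(x, 4) / central_moment(x, 2) ** 2 - 3.0


returns = [-5, -10, -15, 30]  # returns in percent, from the skewness example
print("skewness        =", round(skewness(returns), 4))
print("excess kurtosis =", round(excess_kurtosis(returns), 4))

symmetric = [1, 2, 3, 4, 5]
print("skewness of symmetric data =", skewness(symmetric))
```

The returns have positive skewness (one large gain, several small losses), while the symmetric data set has zero skewness.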

[Figure: volume data, kurtosis and outliers]

[Figure: same kurtosis, different skewness]

[Figure: same skewness, different kurtosis]

Left histogram: higher variability.
Left histogram: lower kurtosis or thinner tails.
Bottom curves: left tail behavior of both histograms.

Same mean, variance, skewness. Different kurtosis.

In both cases, mean = 0, variance = 3 and skewness = 0.
Excess kurtosis is 0.054 for the thin-tail distribution (black).
Excess kurtosis is 65.18 for the fat-tail distribution (red).

Percentage of observations below cutoff

| cutoff | Red | Black |
|---|---|---|
| -10 | 0.1064 | 0.0000 |
| -9 | 0.1448 | 0.0000 |
| -8 | 0.2038 | 0.0005 |
| -7 | 0.2993 | 0.0065 |
| -6 | 0.4636 | 0.0571 |
| -5 | 0.7696 | 0.3571 |
| -4 | 1.4004 | 1.6004 |
| -3 | 2.8834 | 5.1393 |
| -2 | 6.9663 | 11.8255 |
| -1 | 19.5501 | 19.4970 |

2.5 Quantiles

Quartiles: divide the data into 4 equal parts.
Q1 = Median of the first half of the data
Q2 = Median
Q3 = Median of the second half of the data
IQ = Interquartile range
IQ = Q3 - Q1

Deciles: divide the data into 10 equal parts.
Percentiles: divide the data into 100 equal parts.

2.6 The Empirical Rule

We now have two numerical summaries for the data:
$\bar{x}$: where the data is
$s_x$: how spread out, how variable the data is

The mean is pretty easy to interpret (some sort of "center" of the data).
We know that the bigger $s_x$ is, the more variable the data is, but how do we really interpret this number? What is a big $s_x$, what is a small one?

The empirical rule will help us understand $s_x$ and relate the summaries back to the dot plot (or the histogram).

Empirical Rule

For "mound shaped data":
Approximately 68% of the data is in the interval $(\bar{x} - s_x,\ \bar{x} + s_x)$
Approximately 95% of the data is in the interval $(\bar{x} - 2 s_x,\ \bar{x} + 2 s_x)$

Let us see this with the Canadian returns.

[Dot plot of the Canadian returns with dotted lines at one standard deviation and dashed lines at two standard deviations from the mean]

The empirical rule says that roughly 95% of the observations are between the dashed lines and roughly 68% between the dotted lines. Looks reasonable.
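The empirical rule can be checked numerically by counting how many observations fall within k standard deviations of the mean. The Canadian returns are not reproduced in this transcription, so the sketch below uses simulated mound-shaped data instead; the mean 0.01 and standard deviation 0.04 passed to the generator are hypothetical values chosen only for the illustration.

```python
import random
import statistics


def fraction_within(data, k):
    """Fraction of observations inside (x_bar - k*s_x, x_bar + k*s_x)."""
    x_bar = statistics.mean(data)
    s_x = statistics.stdev(data)  # sample standard deviation (n - 1 divisor)
    inside = [x for x in data if x_bar - k * s_x <= x <= x_bar + k * s_x]
    return len(inside) / len(data)


# Simulated mound-shaped "returns"; the parameters are assumptions.
random.seed(0)
sample = [random.gauss(0.01, 0.04) for _ in range(10_000)]

for k in (1, 2):
    share = 100 * fraction_within(sample, k)
    print(f"within {k} standard deviation(s): {share:.1f}%")
```

For data that are strongly skewed or fat-tailed the two fractions can be quite different from 68% and 95%, which is why the rule is stated for mound-shaped data only.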

Same thing viewed from the perspective of the time series plot.

[Time series plot of the Canadian returns with the two standard deviation band marked]

5% outside would be about 5 points. There are 4 points outside, which is pretty close.

3. BOXPLOT

[Boxplot illustration: 1-2-2-3-3-4-4-4-4-4-4-5-5-5.5-7]

Step by step illustration

Data: 65 69 70 63 63 72 63 60 69 66 71 73 70 65 74 69 69 87
Sort: 60 63 63 63 65 65 66 69 69 69 69 70 70 71 72 73 74 87

Q1 =
Q2 =
Q3 =
IQ =
1.5*IQ =
Q1 - 1.5*IQ =
Q3 + 1.5*IQ =

Solution

Sort: 60 63 63 63 65 65 66 69 69 69 69 70 70 71 72 73 74 87

Q1 = 65
Q2 = 69
Q3 = 71
IQ = Q3 - Q1 = 71 - 65 = 6
1.5*IQ = 9
Q1 - 1.5*IQ = 65 - 9 = 56
Q3 + 1.5*IQ = 71 + 9 = 80

[Boxplot example: European returns]

[Boxplot example: annual salary (in thousands of dollars)]
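The step by step boxplot computation can be reproduced with a short script. This is a sketch under two assumptions that are not spelled out in the notes: with an odd number of observations the middle value is left out of both halves, and points beyond Q1 - 1.5*IQ or Q3 + 1.5*IQ are flagged as outliers, as in the usual boxplot convention.

```python
from statistics import median


def quartiles(data):
    """Q1, Q2, Q3 using the median-of-halves rule from the notes."""
    s = sorted(data)
    n = len(s)
    half = n // 2
    lower = s[:half]
    # Assumption: with an odd n, the middle value belongs to neither half.
    upper = s[half:] if n % 2 == 0 else s[half + 1:]
    return median(lower), median(s), median(upper)


data = [65, 69, 70, 63, 63, 72, 63, 60, 69, 66, 71, 73, 70, 65, 74, 69, 69, 87]

q1, q2, q3 = quartiles(data)
iq = q3 - q1
lower_fence = q1 - 1.5 * iq
upper_fence = q3 + 1.5 * iq
outliers = [v for v in data if v < lower_fence or v > upper_fence]

print("Q1, Q2, Q3 =", q1, q2, q3)
print("IQ =", iq)
print("fences =", lower_fence, upper_fence)
print("outliers =", outliers)
```

The script gives Q1 = 65, Q2 = 69, Q3 = 71 and fences at 56 and 80, so the observation 87 lies above the upper fence.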

Example: SP500 components

[Figure: S&P500 components. 1st row: ordered by skewness. 2nd row: ordered by kurtosis.]

S&P500: kurtosis and skewness

[Figure: skewness and logarithm of excess kurtosis for the S&P500 components]

S&P500: Components with fattest and thinnest tails

[Figure: S&P500 components with the fattest and thinnest tails]

Example: Number of siblings - MBA students

Data collected from Business Stats students on January 10th 2009

[Figure: number of siblings, MBA students]

Xbar = 1.73
Median = 1.50
Var = 1.087
St.dev. = 1.042
Q1 = 1.00
Q3 = 2.00
Skewness = 1.616
Excess kurtosis = 3.093

Example: Number of siblings - Boston College

[Figure: number of siblings, Boston College]

Xbar = 2.022013
St.Dev. = 1.640233
Skewness = 2.165848
Excess kurtosis = 8.029811
Q1 = 1
Q2 = 2
Q3 = 3

[Figures: daily temperature time series]

Seasonality is more pronounced in Durham and Chicago.
Variability is also higher in Durham and Chicago.
Longer winters in Chicago (really?!?!)

The time series were smoothed by replacing each observation by the average of 21 neighboring days, 10 to the left and 10 to the right of the observation.
Smoothing the time series helps to highlight the short-term patterns.

The time series were smoothed by replacing each observation by the average of 364 neighboring days, 182 to the left and 182 to the right of the observation.
Smoothing the time series helps to highlight the long-term patterns.

Monthly behavior

Rio: variability seems to be constant throughout the year.
Durham, Chicago: variability seems to be higher during colder months than during warmer months.

[Figure: monthly behavior. Dot: medians. Vertical bar: Q1 to Q3.]
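The smoothing described above is a centered moving average. The sketch below is one possible way to write it; the function name is made up, the window includes the observation itself together with `half_window` days on each side, and days too close to either end of the series are simply dropped, which is an assumption because the notes do not say how the ends were handled. The short temperature list is hypothetical.

```python
def centered_moving_average(series, half_window):
    """Average each point with half_window neighbors on each side.

    Points without a full window (near either end) are skipped.
    """
    width = 2 * half_window + 1
    smoothed = []
    for t in range(half_window, len(series) - half_window):
        window = series[t - half_window : t + half_window + 1]
        smoothed.append(sum(window) / width)
    return smoothed


# Hypothetical daily temperatures, just to show the call.
temps = [20, 22, 19, 24, 25, 23, 21]
print(centered_moving_average(temps, 1))

# A 21-day window as in the notes would use half_window=10.
```

A short window keeps short-term patterns visible, while a very long window averages them away and leaves only the long-term behavior.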

January behavior

[Figure: January behavior]

Example: Highest temperatures in the ...

[Data table of highest temperatures]

Highest temperatures
Count: 50
Mean: 45.604
Sample variance: 13.901
Sample standard deviation: 3.728
Minimum: 37.8
Maximum: 56.7
Range: 18.9
mean - 2s: 38.147
mean + 2s: 53.061
1st quartile: 43.300
Median: 45.600
3rd quartile: 47.800
Interquartile range: 4.500

[Rows for the percent of observations within 2s (compared with 95.44%), mean - 3s, mean + 3s and the percent within 3s are illegible in the source]

Example: US 2004 unemployment rates

Mean (xbar) = 5.2078431
Variance = 1.1691373
Standard deviation (s) = 1.0812665
Q1 = 4.6 (Georgia)
Q2 = 5.2 (Rhode Island)
Q3 = 5.7 (New Mexico)
Skewness = 0.4798145
Kurtosis = 0.3317919

Empirical rule
[xbar - 1*s; xbar + 1*s] = [4.13; 6.289110]
[xbar - 2*s; xbar + 2*s] = [3.05; 7.370376]
[xbar - 3*s; xbar + 3*s] = [...

Multivariate Exploratory Data Analysis
1. How to relate two things
2. Correlations and covariances
3. Linearly related variables
3.1 Mean and variance of a linear function
3.2 Linear combinations
3.3 Mean and variance of a linear combination: 2 inputs
3.4 Mean and variance of a linear combination: 3 inputs
3.5 Mean and variance of a linear combination: k inputs
4. Portfolio example
5. Simple linear regression

Summary of the lecture

In this class you will learn how to
- relate two sets of variables: sample linear correlation coefficient
- compute the sample mean, variance and standard deviation of linear combinations of variables
- study the practical example of portfolio allocation

Book
Skewness (pages 114-117 (12)*, 113-117 (13))
What is correlation analysis? (pages 429-435 (12), 458-465 (13))
*Numbers in parentheses refer to the book edition.

Example: Comparing international stock returns

July 10, 1987 until December 31, 1997 (2733 days): Amsterdam (EOE), Frankfurt (DAX), Paris (CAC40), London (FTSE100), Hong Kong (Hang Seng), Tokyo (Nikkei), Singapore (Singapore All Shares), New York (S&P500).

[Figure: histograms of the international stock returns]

[Table: statistical summary of the international stock returns]

It is considered good to have a large mean return and a small standard deviation.

1. How to Relate Two Things

The mean and standard deviation help us summarize a bunch of numbers which are measurements of just one thing (one variable).
A fundamental and totally different question is how one thing relates to another.

In this section of the notes we look at scatter plots and how covariance and correlation can be used to summarize them.
When examining two things (variables) at a time, the scatter plot will be our main graphical tool, whereas covariance and correlation will be our main numerical tools.

Is the number of beers you can drink related to your ...

Most of the computations in the classroom examples are simple enough to be performed with a scientific calculator and/or Excel. Several of the computations and plots that appear in the lecture notes were obtained from MINITAB, R, Excel or MegaStat for Excel. MegaStat for Excel is a set of routines that can