
Transcription
Kernel Smoothing Methods (Part 1)
Henry Tan
Georgetown University
April 13, 2015

Introduction - Kernel Smoothing

Previously
Basis expansions and splines: use all the data to minimise the least squares of a piecewise-defined function with smoothness constraints.

Kernel Smoothing
A different way to do regression.
Not the same inner-product kernel we have seen previously.

Kernel Smoothing

In Brief
For any query point x0, the value of the function at that point, f(x0), is some combination of the (nearby) observations, such that f(x) is smooth.
The contribution of each observation (xi, f(xi)) to f(x0) is calculated using a weighting function or kernel Kλ(x0, xi).
λ - the width of the neighborhood.

Kernel Introduction - Question

Question (Sicong)
1) Comparing Eq. (6.2) and Eq. (6.1), it is using the kernel values as weights on yi to calculate the average. What could be the underlying reason for using kernel values as weights?

Answer
By definition, the kernel is the weighting function.
The goal is to give more importance to closer observations without ignoring observations that are further away.

K-Nearest-Neighbor Average

Consider a problem in 1 dimension x.
A simple estimate of f(x0) at any point x0 is the mean of the k points closest to x0:

$\hat f(x) = \mathrm{Ave}(y_i \mid x_i \in N_k(x))$    (6.1)
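
A minimal sketch of Eq. (6.1) in Python with NumPy; the sine toy data and the choice k = 15 are illustrative assumptions, not from the slides:

```python
import numpy as np

def knn_average(x_train, y_train, x0, k=5):
    """k-nearest-neighbor average estimate of f(x0), as in Eq. (6.1)."""
    dist = np.abs(x_train - x0)          # distances from the query point
    nearest = np.argsort(dist)[:k]       # indices of the k closest observations
    return y_train[nearest].mean()       # f_hat(x0) is the mean response in the neighborhood

# Toy data: noisy observations of a sine function.
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 100))
y = np.sin(4 * x) + rng.normal(0, 0.3, 100)
print(knn_average(x, y, x0=0.5, k=15))
```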

KNN Average Example

[Figure: the true function, the KNN average fit, and the observations contributing to $\hat f(x_0)$.]

Problem with KNN Average

Problem
The regression function $\hat f(x)$ is discontinuous - "bumpy".
The neighborhood set changes discontinuously.

Solution
Weigh all points such that their contributions drop off smoothly with distance.

Epanechnikov Quadratic Kernel Example

[Figure] The estimated function is smooth.
The yellow area indicates the weight assigned to observations in that region.

Epanechnikov Quadratic Kernel Equations

$\hat f(x_0) = \dfrac{\sum_{i=1}^{N} K_\lambda(x_0, x_i)\, y_i}{\sum_{i=1}^{N} K_\lambda(x_0, x_i)}$    (6.2)

$K_\lambda(x_0, x) = D\!\left(\dfrac{|x - x_0|}{\lambda}\right)$    (6.3)

$D(t) = \begin{cases} \frac{3}{4}(1 - t^2) & \text{if } |t| \le 1 \\ 0 & \text{otherwise} \end{cases}$    (6.4)
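
A minimal sketch of the kernel-weighted average of Eqs. (6.2)-(6.4); the toy data and the bandwidth λ = 0.2 are illustrative assumptions:

```python
import numpy as np

def epanechnikov(t):
    """D(t) from Eq. (6.4): 3/4 (1 - t^2) for |t| <= 1, else 0."""
    return np.where(np.abs(t) <= 1, 0.75 * (1 - t**2), 0.0)

def kernel_average(x_train, y_train, x0, lam=0.2):
    """Kernel-weighted average of Eq. (6.2) with the kernel of Eq. (6.3)."""
    w = epanechnikov((x_train - x0) / lam)   # K_lambda(x0, x_i)
    return np.sum(w * y_train) / np.sum(w)   # weighted average of the y_i

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 100))
y = np.sin(4 * x) + rng.normal(0, 0.3, 100)
# Evaluate the smooth fit on a grid of query points.
grid = np.linspace(0, 1, 11)
print([round(kernel_average(x, y, x0, lam=0.2), 3) for x0 in grid])
```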

KNN vs Smooth Kernel Comparison

[Figure comparing the KNN average fit with the Epanechnikov kernel fit.]

Other Details

Selection of λ - covered later.
Metric window widths vs KNN widths - bias vs variance.
Nearest neighbors - multiple observations with the same xi: replace them with a single observation with the average yi and increase the weight of that observation.
Boundary problems - less data at the boundaries (covered soon).

Popular Kernels

Epanechnikov - compact (only local observations have non-zero weight).
Tri-cube - compact and differentiable at the boundary of its support.
Gaussian density - non-compact (all observations have non-zero weight).

Popular Kernels - Question

Question (Sicong)
2) The presentation in Figure 6.2 is pretty interesting. It mentions that "The tri-cube kernel is compact and has two continuous derivatives at the boundary of its support, while the Epanechnikov kernel has none." Can you explain this in more detail in class?

Answer
Tri-cube kernel: $D(t) = \begin{cases} (1 - |t|^3)^3 & \text{if } |t| \le 1 \\ 0 & \text{otherwise} \end{cases}$, with $D'(t) = 3(1 - |t|^3)^2 \cdot (-3t^2)$ on the interior of the support.

Epanechnikov kernel: $D(t) = \begin{cases} \frac{3}{4}(1 - t^2) & \text{if } |t| \le 1 \\ 0 & \text{otherwise} \end{cases}$
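
Spelling out the behaviour at the boundary of the support, t = 1 (one-sided derivatives taken from inside the support):

$\text{Epanechnikov:}\quad D'(1^-) = -\tfrac{3}{2} \;\neq\; 0 = D'(1^+)$, so not even the first derivative is continuous at the boundary.

$\text{Tri-cube, } 0 < t < 1:\quad D'(t) = -9\,t^2 (1 - t^3)^2 \to 0 \quad\text{and}\quad D''(t) = -18\,t (1 - t^3)^2 + 54\,t^4 (1 - t^3) \to 0 \quad\text{as } t \to 1^-,$

both matching the zero function outside the support, while the third derivative does not vanish at t = 1. Hence the tri-cube kernel has exactly two continuous derivatives at the boundary, and the Epanechnikov kernel has none.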

Problems with the Smooth Weighted Average

Boundary Bias
At some x0 at a boundary, more of the observations are on one side of x0.
The estimated value becomes biased (by those observations).

Local Linear Regression

Constant vs Linear Regression
The technique described previously is equivalent to local constant regression at each query point.
Local linear regression: fit a line at each query point instead.

Note
The bias problem can exist at an internal query point x0 as well, if the observations local to x0 are not well distributed.

Local Linear Regression

[Figure: local linear regression fit.]

Local Linear Regression Equations

$\min_{\alpha(x_0),\, \beta(x_0)} \sum_{i=1}^{N} K_\lambda(x_0, x_i)\,[y_i - \alpha(x_0) - \beta(x_0) x_i]^2$    (6.7)

Solve a separate weighted least squares problem at each target point (i.e., solve the linear regression on a subset of weighted points).
Obtain $\hat f(x_0) = \hat\alpha(x_0) + \hat\beta(x_0) x_0$, where $\hat\alpha, \hat\beta$ are the constants of the solution above for the query point x0.
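
A minimal sketch of solving (6.7) at a single query point, assuming the same Epanechnikov kernel and toy data as the earlier example; np.polyfit applies its weights to the residuals, so the square root of the kernel weights is passed:

```python
import numpy as np

def epanechnikov(t):
    return np.where(np.abs(t) <= 1, 0.75 * (1 - t**2), 0.0)

def local_linear_fit(x_train, y_train, x0, lam=0.2):
    """Solve the weighted least squares problem of Eq. (6.7) at x0."""
    w = epanechnikov((x_train - x0) / lam)           # kernel weights K_lambda(x0, x_i)
    # np.polyfit minimizes sum_i (w_i * (y_i - p(x_i)))**2, so pass sqrt(K).
    beta1, beta0 = np.polyfit(x_train, y_train, deg=1, w=np.sqrt(w))
    return beta0 + beta1 * x0                        # f_hat(x0) = alpha_hat + beta_hat * x0

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 100))
y = np.sin(4 * x) + rng.normal(0, 0.3, 100)
print(local_linear_fit(x, y, x0=0.05, lam=0.2))      # query point near the left boundary
```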

Local Linear Regression Equations 2

$\hat f(x_0) = b(x_0)^T (\mathbf{B}^T \mathbf{W}(x_0) \mathbf{B})^{-1} \mathbf{B}^T \mathbf{W}(x_0) \mathbf{y}$    (6.8)

$\hat f(x_0) = \sum_{i=1}^{N} l_i(x_0)\, y_i$    (6.9)

6.8: the general solution to weighted local linear regression.
6.9: just to highlight that this is a linear model (a linear contribution from each observation).
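
A sketch of the matrix form (6.8) and the equivalent-kernel view (6.9), again with illustrative toy data and bandwidth:

```python
import numpy as np

def epanechnikov(t):
    return np.where(np.abs(t) <= 1, 0.75 * (1 - t**2), 0.0)

def equivalent_kernel(x_train, x0, lam=0.2):
    """The weights l_i(x0) of Eq. (6.9), via the matrix form of Eq. (6.8)."""
    B = np.column_stack([np.ones_like(x_train), x_train])   # N x 2 regression matrix
    W = np.diag(epanechnikov((x_train - x0) / lam))          # N x N diagonal kernel weights
    b0 = np.array([1.0, x0])                                 # b(x0) = (1, x0)
    # l(x0)^T = b(x0)^T (B^T W B)^{-1} B^T W
    return b0 @ np.linalg.solve(B.T @ W @ B, B.T @ W)

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 100))
y = np.sin(4 * x) + rng.normal(0, 0.3, 100)
l = equivalent_kernel(x, x0=0.05, lam=0.2)
print(l @ y)      # f_hat(x0) is linear in y: sum_i l_i(x0) * y_i
print(l.sum())    # the equivalent-kernel weights sum to 1
```

The printed sum of the weights is 1, reflecting that local linear regression reproduces constant functions exactly.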

Question - Local Linear Regression Matrix

Question (Yifang)
1. What is the regression matrix in Equation 6.8? How does Equation 6.9 derive from 6.8?

Answer
For a linear model (i.e., the solution is comprised of a linear sum of the observations), the least squares minimization problem has the closed-form solution given by Equation 6.8. Here B is the N x 2 regression matrix whose i-th row is b(xi)^T = (1, xi), and W(x0) is the N x N diagonal matrix of kernel weights Kλ(x0, xi).
Equation 6.9 can be obtained from 6.8 by expansion, which is straightforward since y only shows up once.

Historical (worse) Way of Correcting Kernel Bias

Modifying the kernel based on "theoretical asymptotic mean-square-error considerations" (don't know what this means; probably not important).
Local linear regression: a kernel correction to first order ("automatic kernel carpentry").

Locally Weighted Regression vs Linear Regression - Question

Question (Grace)
2. Compare locally weighted regression and the linear regression that we learned last time. How does the former automatically correct the model bias?

Answer
Interestingly, simply by solving a linear regression using local weights, the bias is accounted for (since most functions are approximately linear at the boundaries).

Local Linear Equivalent Kernel

[Figure] The dots are the equivalent kernel weights li(x0) from 6.9.
Much more weight is given to boundary points.

Bias Equation

Using a Taylor series expansion on $\hat f(x_0) = \sum_{i=1}^{N} l_i(x_0)\, f(x_i)$, the bias $\hat f(x_0) - f(x_0)$ depends only on superlinear terms.
More generally, polynomial-p regression removes the bias of p-order terms.
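
Filling in the expansion behind that statement (this uses the standard identities satisfied by the local linear regression weights):

$E\hat f(x_0) = \sum_{i=1}^{N} l_i(x_0) f(x_i) = f(x_0)\sum_{i=1}^{N} l_i(x_0) + f'(x_0)\sum_{i=1}^{N} (x_i - x_0)\, l_i(x_0) + \dfrac{f''(x_0)}{2}\sum_{i=1}^{N} (x_i - x_0)^2\, l_i(x_0) + R$

For local linear regression, $\sum_i l_i(x_0) = 1$ and $\sum_i (x_i - x_0)\, l_i(x_0) = 0$, so the constant term reproduces f(x0), the first-order term vanishes, and the bias $E\hat f(x_0) - f(x_0)$ consists only of the quadratic and higher-order ("superlinear") terms.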

Local Polynomial Regression

Local Polynomial Regression
A similar technique - solve the least squares problem for a polynomial function.

Trimming the hills and filling the valleys
Local linear regression tends to flatten regions of curvature.

Question - Local Polynomial Regression

Question (Brendan)
1) Could you use a polynomial fitting function with an asymptote to fix the boundary variance problem described in 6.1.2?

Answer
Ask for elaboration in class.

Question - Local Polynomial Regression

Question (Sicong)
3) In local polynomial regression, can the parameter d also be a variable rather than a fixed value, as in Eq. (6.11)?

Answer
I don't think so. It seems that you have to choose the degree of your polynomial before you can start solving the least squares minimization problem.

Local Polynomial Regression - Interior Curvature Bias

[Figure illustrating interior curvature bias.]

Cost to Polynomial Regression

Variance for Bias
Quadratic regression reduces the bias by allowing for curvature.
Higher-order regression also increases the variance of the estimated function.

Variance Comparisons

[Figure: variance comparisons.]

Final Details on Polynomial Regression

Local linear regression removes bias dramatically at the boundaries.
Local quadratic regression increases variance at the boundaries but doesn't help much with bias there.
Local quadratic regression removes interior bias in regions of curvature.
Asymptotically, local polynomials of odd degree dominate local polynomials of even degree.

Kernel Width λ

Each kernel function Kλ has a parameter λ which controls the size of the local neighborhood.
Epanechnikov/tri-cube kernel: λ is the fixed-size radius around the target point.
Gaussian kernel: λ is the standard deviation of the Gaussian function.
λ = k for KNN kernels.

Kernel Width - Bias-Variance Tradeoff

Small λ = narrow window
Fewer observations, each contribution closer to x0:
High variance (the estimated function will vary a lot).
Low bias (fewer points to bias the function).

Large λ = wide window
More observations over a larger area:
Low variance (averaging makes the function smoother).
Higher bias (observations from further away contribute to the value at x0).

Local Regression in R^p

Previously, we considered problems in 1 dimension.
Local linear regression fits a local hyperplane, by weighted least squares, with weights from a p-dimensional kernel.

Example: dimension p = 2, polynomial degree d = 2:
$b(X) = (1, X_1, X_2, X_1^2, X_2^2, X_1 X_2)$

At each query point x0 in R^p, solve
$\min_{\beta(x_0)} \sum_{i=1}^{N} K_\lambda(x_0, x_i)\,(y_i - b(x_i)^T \beta(x_0))^2$
to obtain the fit $\hat f(x_0) = b(x_0)^T \hat\beta(x_0)$.
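
A sketch of the p = 2, d = 2 case, assuming a Gaussian radial kernel and toy data (both illustrative choices; any radial kernel would work the same way):

```python
import numpy as np

def basis_2d_quadratic(X):
    """b(X) = (1, X1, X2, X1^2, X2^2, X1*X2) for p = 2, d = 2."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([np.ones_like(x1), x1, x2, x1**2, x2**2, x1 * x2])

def local_poly_fit_2d(X_train, y_train, x0, lam=0.5):
    """Weighted least squares for local quadratic regression at x0 in R^2."""
    d = np.linalg.norm(X_train - x0, axis=1)
    w = np.exp(-0.5 * (d / lam) ** 2)                 # Gaussian radial kernel K_lambda(x0, x_i)
    B = basis_2d_quadratic(X_train)
    b0 = basis_2d_quadratic(x0.reshape(1, -1))[0]
    Wsqrt = np.sqrt(w)[:, None]
    beta, *_ = np.linalg.lstsq(Wsqrt * B, np.sqrt(w) * y_train, rcond=None)
    return b0 @ beta                                   # f_hat(x0) = b(x0)^T beta_hat(x0)

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
y = X[:, 0] ** 2 - X[:, 1] + rng.normal(0, 0.1, 200)
print(local_poly_fit_2d(X, y, x0=np.array([0.2, -0.3]), lam=0.5))
```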

Kernels in R^p

Radius-based kernels
Convert a distance-based kernel to a radius-based one: $K_\lambda(x_0, x) = D\!\left(\dfrac{\lVert x - x_0 \rVert}{\lambda}\right)$
The Euclidean norm depends on the units in each coordinate, so each predictor variable has to be normalised somehow, e.g., to unit standard deviation, to weigh them properly.
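
A small illustration of why standardisation matters before taking Euclidean distances; the toy data with deliberately mismatched units are an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two predictors measured in very different units.
X = np.column_stack([rng.normal(0, 1, 200),      # e.g. metres
                     rng.normal(0, 100, 200)])   # e.g. grams

x0 = X[0]
raw = np.linalg.norm(X - x0, axis=1)             # dominated by the second coordinate

X_std = (X - X.mean(axis=0)) / X.std(axis=0)     # each column rescaled to unit standard deviation
std = np.linalg.norm(X_std - X_std[0], axis=1)   # both coordinates now contribute comparably

print(np.corrcoef(raw, np.abs(X[:, 1] - x0[1]))[0, 1])   # close to 1: raw distance tracks the large-unit coordinate alone
print(np.corrcoef(std, np.abs(X[:, 1] - x0[1]))[0, 1])   # noticeably smaller after standardising
```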

Question - Local Regression in R^p

Question (Yifang)
What is the meaning of "local" in local regression? Does Equation 6.12 use a kernel mixing a polynomial kernel and a radial kernel?

Answer
I think "local" still means weighing observations based on their distance from the query point.

Problems with Local Regression in R^p

Boundary problem
More and more points can be found on the boundary as p increases.
Local polynomial regression still helps deal with boundary issues automatically for any p.

Curse of dimensionality
However, for high dimensions (p > 3), local regression still isn't very useful.
As with the kernel-width problem, it is difficult to maintain both localness of the observations (for low bias) and a sufficient number of samples (for low variance), since the number of samples needed grows exponentially in p.

Non-visualizable
A goal of getting a smooth fitting function is to visualise the data, which is difficult in high dimensions.

Structured Local Regression Models in R^p

Structured kernels
Use a positive semidefinite matrix A to weigh the coordinates (instead of normalising all of them):
$K_{\lambda, A}(x_0, x) = D\!\left(\dfrac{\sqrt{(x - x_0)^T \mathbf{A}\,(x - x_0)}}{\lambda}\right)$
The simplest form is a diagonal matrix, which simply weighs the coordinates without considering correlations.
General forms of A are cumbersome.
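
A sketch of the structured kernel with a diagonal A; the toy data, λ, and the particular A are illustrative assumptions:

```python
import numpy as np

def epanechnikov(t):
    return np.where(np.abs(t) <= 1, 0.75 * (1 - t**2), 0.0)

def structured_kernel(x0, X, A, lam=1.0):
    """K_{lambda,A}(x0, x) = D( sqrt((x - x0)^T A (x - x0)) / lambda ).
    A is positive semidefinite; a diagonal A just up- or down-weights coordinates."""
    diff = X - x0                                    # shape (N, p)
    quad = np.einsum('ij,jk,ik->i', diff, A, diff)   # (x - x0)^T A (x - x0) for each row
    return epanechnikov(np.sqrt(quad) / lam)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
x0 = np.zeros(2)
A = np.diag([1.0, 0.1])   # the second coordinate barely affects the distance
print(structured_kernel(x0, X, A, lam=1.0)[:5])
```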

Question - ANOVA Decomposition and Backfitting

Question (Brendan)
2. Can you touch a bit on the backfitting described in 6.4.2? I don't understand the equation they give for estimating gk.

Structured Local Regression Models in R^p - 2

Structured regression functions
Ignore high-order interactions and perform iterative backfitting on each sub-function.

In more detail
Analysis-of-variance (ANOVA) decomposition form:
$f(X_1, X_2, \ldots, X_p) = \alpha + \sum_{j} g_j(X_j) + \sum_{k < l} g_{kl}(X_k, X_l) + \cdots$
Ignore high-order cross terms, e.g., $g_{klm}(X_k, X_l, X_m)$, to reduce complexity.

Iterative backfitting
Assume that all terms are known except for some gk(Xk).
Perform local (polynomial) regression to find $\hat g_k(X_k)$.
Repeat for all terms, and repeat until convergence.
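
A rough backfitting sketch for the additive part of the decomposition, using a local constant smoother in place of the local polynomial regression the slide mentions; the toy data and bandwidth are illustrative assumptions:

```python
import numpy as np

def epanechnikov(t):
    return np.where(np.abs(t) <= 1, 0.75 * (1 - t**2), 0.0)

def kernel_smooth(x, r, lam=0.3):
    """Local constant smooth of residuals r against one coordinate x."""
    W = epanechnikov((x[:, None] - x[None, :]) / lam)   # W[i, j] = K_lambda(x_i, x_j)
    return (W @ r) / W.sum(axis=1)

def backfit_additive(X, y, lam=0.3, n_iter=10):
    """Backfitting for f(X) = alpha + sum_j g_j(X_j), dropping all cross terms.
    Each pass re-estimates one g_j from the partial residuals, holding the rest fixed."""
    n, p = X.shape
    alpha = y.mean()
    g = np.zeros((n, p))                                 # g[:, j] = g_j evaluated at the data
    for _ in range(n_iter):
        for j in range(p):
            partial = y - alpha - g.sum(axis=1) + g[:, j]   # remove every term except g_j
            g[:, j] = kernel_smooth(X[:, j], partial, lam)
            g[:, j] -= g[:, j].mean()                       # centre so alpha stays identifiable
    return alpha, g

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2 + rng.normal(0, 0.1, 200)
alpha, g = backfit_additive(X, y)
print(alpha, g[:3])
```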

Structured Local Regression Models in R^p - 3

Varying coefficients model
Perform regression over some variables while keeping others constant.
Divide the predictors: take (X1, X2, ..., Xq) and collect the remaining variables in a vector Z, then express the function as
$f(X) = \alpha(Z) + \beta_1(Z) X_1 + \cdots + \beta_q(Z) X_q$
where, for any given z0, the coefficients α, βj are fixed.
This can be solved using the same locally weighted least squares problem.
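
A sketch of fitting a varying coefficients model at a single z0 by locally weighted least squares; the toy data loosely mimic the aorta example on the next slide but omit gender, and all names and values are hypothetical:

```python
import numpy as np

def epanechnikov(t):
    return np.where(np.abs(t) <= 1, 0.75 * (1 - t**2), 0.0)

def varying_coef_fit(X, Z, y, z0, lam=0.5):
    """Estimate alpha(z0), beta_1(z0), ..., beta_q(z0): kernel weights come
    from Z, while the regression itself is linear in X."""
    w = epanechnikov(np.linalg.norm(Z - z0, axis=1) / lam)   # K_lambda(z0, z_i)
    B = np.column_stack([np.ones(len(y)), X])                # (1, X_1, ..., X_q)
    Wsqrt = np.sqrt(w)[:, None]
    coef, *_ = np.linalg.lstsq(Wsqrt * B, np.sqrt(w) * y, rcond=None)
    return coef                                               # [alpha(z0), beta_1(z0), ...]

# Hypothetical stand-in: diameter ~ alpha(depth) + beta(depth) * age
rng = np.random.default_rng(0)
n = 300
age = rng.uniform(20, 80, n)
depth = rng.uniform(0, 1, n)
diameter = (2 + depth) + (0.01 + 0.02 * depth) * age + rng.normal(0, 0.1, n)
print(varying_coef_fit(age.reshape(-1, 1), depth.reshape(-1, 1), diameter, z0=np.array([0.5])))
```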

Varying Coefficients Model Example

Human aorta example
Predictors: (age, gender, depth). Response: diameter.
Let Z be (gender, depth) and model diameter as a linear function of age.

Question - Local Regression vs Structured Local Regression

Question (Tavish)
1. What is the difference between local regression and structured local regression? And is there any similarity between the "structuring" described in this chapter and the one described in the previous chapter, where the inputs are transformed/structured into different inputs that are fed to linear models?

Answer
Very similar, I think. But the interesting thing is that different methods have different "natural" ways to perform the transformations or simplifications.

Discussion Question

Question (Tavish)
3) As a discussion question, and also to understand better: the main idea of this chapter is to find the best parameters for kernels keeping in mind the bias-variance tradeoff. I would like to know more about how a good fit/model can be achieved and what the considerations are for trading off variance for bias and vice versa. It would be great if we can discuss some examples.

Starter Answer
As far as the book goes - if you want to examine the boundaries, use a linear fit for reduced bias. If you care more about interior points, use a quadratic fit to reduce interior bias (without as much of a cost in interior variance).