Transcription

Gaussian Processes for Machine Learning

Adaptive Computation and Machine Learning
Thomas Dietterich, Editor
Christopher Bishop, David Heckerman, Michael Jordan, and Michael Kearns, Associate Editors

Bioinformatics: The Machine Learning Approach, Pierre Baldi and Søren Brunak
Reinforcement Learning: An Introduction, Richard S. Sutton and Andrew G. Barto
Graphical Models for Machine Learning and Digital Communication, Brendan J. Frey
Learning in Graphical Models, Michael I. Jordan
Causation, Prediction, and Search, second edition, Peter Spirtes, Clark Glymour, and Richard Scheines
Principles of Data Mining, David Hand, Heikki Mannila, and Padhraic Smyth
Bioinformatics: The Machine Learning Approach, second edition, Pierre Baldi and Søren Brunak
Learning Kernel Classifiers: Theory and Algorithms, Ralf Herbrich
Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond, Bernhard Schölkopf and Alexander J. Smola
Introduction to Machine Learning, Ethem Alpaydin
Gaussian Processes for Machine Learning, Carl Edward Rasmussen and Christopher K. I. Williams

Gaussian Processes for Machine Learning
Carl Edward Rasmussen
Christopher K. I. Williams
The MIT Press
Cambridge, Massachusetts
London, England

© 2006 Massachusetts Institute of Technology

All rights reserved. No part of this book may be reproduced in any form by any electronic or mechanical means (including photocopying, recording, or information storage and retrieval) without permission in writing from the publisher.

MIT Press books may be purchased at special quantity discounts for business or sales promotional use. For information, please email special [email protected] or write to Special Sales Department, The MIT Press, 55 Hayward Street, Cambridge, MA 02142.

Typeset by the authors using LaTeX 2ε.
This book was printed and bound in the United States of America.

Library of Congress Cataloging-in-Publication Data

Rasmussen, Carl Edward.
Gaussian processes for machine learning / Carl Edward Rasmussen, Christopher K. I. Williams.
p. cm. —(Adaptive computation and machine learning)
Includes bibliographical references and indexes.
ISBN 0-262-18253-X
1. Gaussian processes—Data processing. 2. Machine learning—Mathematical models.
I. Williams, Christopher K. I. II. Title. III. Series.

QA274.4.R37 2006
519.2'3—dc22
2005053433

10 9 8 7 6 5 4 3 2

The actual science of logic is conversant at present only with things either certain, impossible, or entirely doubtful, none of which (fortunately) we have to reason on. Therefore the true logic for this world is the calculus of Probabilities, which takes account of the magnitude of the probability which is, or ought to be, in a reasonable man's mind.
— James Clerk Maxwell [1850]


Contents

Series Foreword
Preface
Symbols and Notation

1 Introduction
1.1 A Pictorial Introduction to Bayesian Modelling
1.2 Roadmap

2 Regression
2.1 Weight-space View
2.1.1 The Standard Linear Model
2.1.2 Projections of Inputs into Feature Space
2.2 Function-space View
2.3 Varying the Hyperparameters
2.4 Decision Theory for Regression
2.5 An Example Application
2.6 Smoothing, Weight Functions and Equivalent Kernels
2.7 Incorporating Explicit Basis Functions
2.7.1 Marginal Likelihood
2.8 History and Related Work
2.9 Exercises

3 Classification
3.1 Classification Problems
3.1.1 Decision Theory for Classification
3.2 Linear Models for Classification
3.3 Gaussian Process Classification
3.4 The Laplace Approximation for the Binary GP Classifier
3.4.1 Posterior
3.4.2 Predictions
3.4.3 Implementation
3.4.4 Marginal Likelihood
3.5 Multi-class Laplace Approximation
3.5.1 Implementation
3.6 Expectation Propagation
3.6.1 Predictions
3.6.2 Marginal Likelihood
3.6.3 Implementation
3.7 Experiments
3.7.1 A Toy Problem
3.7.2 One-dimensional Example
3.7.3 Binary Handwritten Digit Classification Example
3.7.4 10-class Handwritten Digit Classification Example
3.8 Discussion

Sections marked by an asterisk contain advanced material that may be omitted on a first reading.

3.9 Appendix: Moment Derivations
3.10 Exercises

4 Covariance Functions
4.1 Preliminaries
4.1.1 Mean Square Continuity and Differentiability
4.2 Examples of Covariance Functions
4.2.1 Stationary Covariance Functions
4.2.2 Dot Product Covariance Functions
4.2.3 Other Non-stationary Covariance Functions
4.2.4 Making New Kernels from Old
4.3 Eigenfunction Analysis of Kernels
4.3.1 An Analytic Example
4.3.2 Numerical Approximation of Eigenfunctions
4.4 Kernels for Non-vectorial Inputs
4.4.1 String Kernels
4.4.2 Fisher Kernels
4.5 Exercises

5 Model Selection and Adaptation of Hyperparameters
5.1 The Model Selection Problem
5.2 Bayesian Model Selection
5.3 Cross-validation
5.4 Model Selection for GP Regression
5.4.1 Marginal Likelihood
5.4.2 Cross-validation
5.4.3 Examples and Discussion
5.5 Model Selection for GP Classification
5.5.1 Derivatives of the Marginal Likelihood for Laplace's Approximation
5.5.2 Derivatives of the Marginal Likelihood for EP
5.5.3 Cross-validation
5.5.4 Example
5.6 Exercises

6 Relationships between GPs and Other Models
6.1 Reproducing Kernel Hilbert Spaces
6.2 Regularization
6.2.1 Regularization Defined by Differential Operators
6.2.2 Obtaining the Regularized Solution
6.2.3 The Relationship of the Regularization View to Gaussian Process Prediction
6.3 Spline Models
6.3.1 A 1-d Gaussian Process Spline Construction
6.4 Support Vector Machines
6.4.1 Support Vector Classification
6.4.2 Support Vector Regression
6.5 Least-squares Classification
6.5.1 Probabilistic Least-squares Classification

6.6 Relevance Vector Machines
6.7 Exercises

7 Theoretical Perspectives
7.1 The Equivalent Kernel
7.1.1 Some Specific Examples of Equivalent Kernels
7.2 Asymptotic Analysis
7.2.1 Consistency
7.2.2 Equivalence and Orthogonality
7.3 Average-case Learning Curves
7.4 PAC-Bayesian Analysis
7.4.1 The PAC Framework
7.4.2 PAC-Bayesian Analysis
7.4.3 PAC-Bayesian Analysis of GP Classification
7.5 Comparison with Other Supervised Learning Methods
7.6 Appendix: Learning Curve for the Ornstein-Uhlenbeck Process
7.7 Exercises

8 Approximation Methods for Large Datasets
8.1 Reduced-rank Approximations of the Gram Matrix
8.2 Greedy Approximation
8.3 Approximations for GPR with Fixed Hyperparameters
8.3.1 Subset of Regressors
8.3.2 The Nyström Method
8.3.3 Subset of Datapoints
8.3.4 Projected Process Approximation
8.3.5 Bayesian Committee Machine
8.3.6 Iterative Solution of Linear Systems
8.3.7 Comparison of Approximate GPR Methods
8.4 Approximations for GPC with Fixed Hyperparameters
8.5 Approximating the Marginal Likelihood and its Derivatives
8.6 Appendix: Equivalence of SR and GPR Using the Nyström Approximate Kernel
8.7 Exercises

9 Further Issues and Conclusions
9.1 Multiple Outputs
9.2 Noise Models with Dependencies
9.3 Non-Gaussian Likelihoods
9.4 Derivative Observations
9.5 Prediction with Uncertain Inputs
9.6 Mixtures of Gaussian Processes
9.7 Global Optimization
9.8 Evaluation of Integrals
9.9 Student's t Process
9.10 Invariances
9.11 Latent Variable Models
9.12 Conclusions and Future Directions

