Computational Learning Theory
We examine methods for constructing regression ensembles based on a linear program (LP). The ensemble regression function consists of linear combinations of base hypotheses generated by some boosting-type base learning algorithm. Unlike the classification case, for regression the set of possible hypotheses producible by the base learning algorithm may be infinite. We explicitly tackle the issue of how to define and solve ensemble regression when the hypothesis space is infinite. Our approach is based on a semi-infinite linear program that has an infinite number of constraints and a finite number of variables. We show that the regression problem is well posed for infinite hypothesis spaces in both the primal and dual spaces. Most importantly, we prove there exists an optimal solution to the infinite hypothesis space problem consisting of a finite number of hypotheses. We propose two algorithms for solving the infinite and finite hypothesis problems. One uses a column generation simplex-type algorithm and the other adopts an exponential barrier approach. Furthermore, we give sufficient conditions for the base learning algorithm and the hypothesis set to be used for infinite regression ensembles. Computational results show that these methods are extremely promising.
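To make the LP formulation concrete, here is a minimal sketch of the finite-hypothesis-pool case using SciPy's linprog: a pool of decision stumps is combined with nonnegative weights by minimizing the absolute training residuals plus an L1 penalty on the weights. The stump pool, the penalty constant C, and the toy data are illustrative assumptions; this is not the paper's exact semi-infinite formulation nor its column-generation or exponential-barrier algorithms.

```python
# Toy LP ensemble regression over a finite pool of base hypotheses (stumps).
# Minimize sum_i xi_i + C * sum_t a_t  s.t.  |y_i - sum_t a_t h_t(x_i)| <= xi_i,
# with a >= 0 and xi >= 0. Illustrative sketch only, not the authors' algorithm.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(-3, 3, 60))
y = np.tanh(x) + 0.1 * rng.standard_normal(60)

# Base hypotheses: threshold stumps in both orientations, h(x) = +/- sign(x - theta).
thresholds = np.linspace(-3, 3, 25)
stumps = np.sign(x[:, None] - thresholds[None, :])
H = np.hstack([stumps, -stumps])                     # shape (m, T)

m, T = H.shape
C = 0.05                                             # assumed regularization constant

# Decision vector z = [a_1..a_T, xi_1..xi_m].
c = np.concatenate([C * np.ones(T), np.ones(m)])
A_ub = np.block([[-H, -np.eye(m)],                   #   y - H a <= xi
                 [ H, -np.eye(m)]])                  # -(y - H a) <= xi
b_ub = np.concatenate([-y, y])
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, None), method="highs")

a = res.x[:T]
print("selected base hypotheses:", int(np.sum(a > 1e-8)), "of", T)
print("mean absolute training error:", float(np.mean(np.abs(y - H @ a))))
```

The L1 penalty on the weights is what drives the sparsity of the ensemble; in the infinite-hypothesis setting the paper replaces the fixed pool with columns generated on demand by the base learner.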
Bauer, E., & Kohavi, R. (1999). An empirical comparison of voting classification algorithms: Bagging, boosting, and variants. Machine Learning, 36, 105–142.
Bennett, K., Demiriz, A., & Shawe-Taylor, J. (2000). A column generation algorithm for boosting. In Pat Langley (Ed.), Proceedings Seventeenth International Conference on Machine Learning (pp. 65–72). San Francisco: Morgan Kaufmann.
Bertoni, A., Campadelli, P., & Parodi, M. (1997). A boosting algorithm for regression. In W. Gerstner, A. Germond, M. Hasler, & J.-D. Nicoud (Eds.), Proceedings ICANN'97, Int. Conf. on Artificial Neural Networks, Vol. V of LNCS (pp. 343–348). Berlin: Springer.
Bertsekas, D. (1995). Nonlinear programming . Belmont, MA: Athena Scientific.
Bradley, P., Mangasarian, O., & Rosen, J. (1998). Parsimonious least norm approximation. Computational Optimization and Applications, 11:1 , 5–21.
Breiman, L. (1999). Prediction games and arcing algorithms. Neural Computation, 11:7 , 1493–1518. Also Technical Report 504, Statistics Department, University of California, Berkeley.
Breneman, C., Sukumar, N., Bennett, K., Embrechts, M., Sundling, M., & Lockwood, L. (2000). Wavelet representations of molecular electronic properties: Applications in ADME, QSPR, and QSAR. Presentation, QSAR in Cells Symposium of the Computers in Chemistry Division's 220th American Chemistry Society National Meeting.
Censor, Y., & Zenios, S. (1997). Parallel optimization: Theory, algorithms and applications. Numerical Mathematics and Scientific Computation. Oxford: Oxford University Press.
Chen, S., Donoho, D., & Saunders, M. (1995). Atomic decomposition by basis pursuit. Technical Report 479, Department of Statistics, Stanford University.
Collins, M., Schapire, R., & Singer, Y. (2000). Adaboost and logistic regression unified in the context of information geometry. In Proceedings of the Thirteenth Annual Conference on Computational Learning Theory.
Cominetti, R., & Dussault, J.-P. (1994). A stable exponential penalty algorithm with superlinear convergence. J.O.T.A., 83:2.
Demiriz, A., Bennett, K., Breneman, C., & Embrechts, M. (2001). Support vector machine regression in chemometrics. Computer Science and Statistics. In Proceedings of the 32nd Symposium on the Interface, to appear.
Dietterich, T. (1999). An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization. Machine Learning, 40:2 .
Drucker, H., Schapire, R., & Simard, P. (1993). Boosting performance in neural networks. International Journal of Pattern Recognition and Artificial Intelligence, 7 , 705–719.
Duffy, N., & Helmbold, D. (2000). Leveraging for regression. In Colt'00 (pp. 208-219).
Embrechts, M., Kewley, R., & Breneman, C. (1998). Computationally intelligent data mining for the automated design and discovery of novel pharmaceuticals. In C. D. et al. (Ed.), Intelligent engineering systems through artificial neural networks (pp. 391–396). ASME Press.
Drucker, H. (1997). Improving regressors using boosting techniques. In D. H. Fisher (Ed.), Proceedings of the Fourteenth International Conference on Machine Learning.
Frean, M., & Downs, T. (1998). A simple cost function for boosting. Technical Report, Department of Computer Science and Electrical Engineering, University of Queensland.
Freund, Y., & Schapire, R. (1996). Game theory, on-line prediction and boosting. In COLT (pp. 325–332). New York, NY: ACM Press.
Freund, Y., & Schapire, R. (1994). A decision-theoretic generalization of on-line learning and an application to boosting. In EuroCOLT: European Conference on Computational Learning Theory . LNCS.
Freund, Y., & Schapire, R. (1996). Experiments with a new boosting algorithm. In Proc. 13th International Conference on Machine Learning (pp. 148–156). San Mateo, CA: Morgan Kaufmann.
Friedman, J., Hastie, T., & Tibshirani, R. (1998). Additive logistic regression: A statistical view of boosting. Technical Report, Department of Statistics, Sequoia Hall, Stanford University.
Friedman, J. (1999). Greedy function approximation. Technical Report, Department of Statistics, Stanford University.
Frisch, K. (1955). The logarithmic potential method of convex programming. Memorandum, University Institute of Economics, Oslo.
Grove, A., & Schuurmans, D. (1998). Boosting in the limit: Maximizing the margin of learned ensembles. In Proceedings of the Fifteenth National Conference on Artificial Intelligence.
Hettich, R., & Kortanek, K. (1993). Semi-infinite programming: Theory, methods and applications. SIAM Review, 35:3, 380–429.
Kaliski, J. A., Haglin, D. J., Roos, C., & Terlaky, T. (1997). Logarithmic barrier decomposition methods for semi-infinite programming. International Transactions in Operational Research, 4:4 , 285–303.
Kivinen, J., & Warmuth, M. (1999). Boosting as entropy projection. In Proc. 12th Annual Conference on Computational Learning Theory (pp. 134–144). New York: ACM Press.
LeCun, Y., Jackel, L. D., Bottou, L., Brunot, A., Cortes, C., Denker, J., Drucker, H., Guyon, I., Müller, U. A., Säckinger, E., Simard, P., & Vapnik, V. (1995). Comparison of learning algorithms for handwritten digit recognition. In F. Fogelman-Soulié, & P. Gallinari (Eds.), Proceedings ICANN'95-International Conference on Artificial Neural Networks (Vol. II, pp. 53–60). Nanterre, France. EC2.
Luenberger, D. (1984). Linear and nonlinear programming (2nd edn.). Reading: Addison-Wesley Publishing Co., Reprinted with corrections in May, 1989.
Mackey, M. C., & Glass, L. (1977). Oscillation and chaos in physiological control systems. Science , 197 , 287–289.
Maclin, R., & Opitz, D. (1997). An empirical evaluation of bagging and boosting. In Proc. of AAAI .
Mason, L., Bartlett, P., & Baxter, J. (1998). Improved generalization through explicit optimization of margins. Technical Report, Department of Systems Engineering, Australian National University.
Mason, L., Baxter, J., Bartlett, P., & Frean, M. (1999). Functional gradient techniques for combining hypotheses. In A. Smola, P. Bartlett, B. Schölkopf, & D. Schuurmans (Eds.), Advances in large margin classifiers (pp. 221–247). Cambridge, MA: MIT Press.
Mika, S., Rätsch, G., & Müller, K.-R. (2001). A mathematical programming approach to the Kernel Fisher algorithm. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in Neural Information Processing Systems, 13 , 591–597.
Mosheyev, L., & Zibulevsky, M. (2000). Penalty/barrier multiplier algorithm for semidefinite programming. Optimization Methods and Software, 13:4 , 235–262.
Müller, K.-R., Kohlmorgen, J., & Pawelzik, K. (1995). Analysis of switching dynamics with competing neural networks. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences , E78-A:10 , 1306–1315.
Müller, K.-R., Smola, A., Rätsch, G., Schölkopf, B., Kohlmorgen, J., & Vapnik, V. (1999). Predicting time series with support vector machines. In B. Schölkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in Kernel methods-support vector learning (pp. 243–254). Cambridge, MA: MIT Press. Short version appeared in ICANN'97, Springer Lecture Notes in Computer Science.
Müller, K.-R., Smola, A., Rätsch, G., Schölkopf, B., Kohlmorgen, J., & Vapnik, V. (1997). Predicting time series with support vector machines. In W. Gerstner, A. Germond, M. Hasler, & J.-D. Nicoud (Eds.), Artificial neural networks-ICANN'97 (pp. 999–1004). Berlin: Springer. Lecture Notes in Computer Science, Vol. 1327.
Pawelzik, K., Kohlmorgen, J., & Müller, K.-R. (1996). Annealed competition of experts for a segmentation and classification of switching dynamics. Neural Computation , 8:2 , 342–358.
Rätsch, G. (2001). Robust boosting via convex optimization . Ph.D. Thesis, University of Potsdam, Neues Palais 10, 14469 Potsdam, Germany.
Rätsch, G., Onoda, T., & Müller, K.-R. (2001). Soft margins for AdaBoost. Machine Learning, 42:3, 287–320. Also NeuroCOLT Technical Report NC-TR-1998-021.
Rätsch, G., Schölkopf, B., Mika, S., & Müller, K.-R. (2000a). SVM and boosting: One class. Technical report 119, GMD FIRST, Berlin. Accepted for publication in IEEE TPAMI.
Rätsch, G., Schölkopf, B., Smola, A., Mika, S., Onoda, T., & Müller, K.-R. (2000b). Robust ensemble learning. In A. Smola, P. Bartlett, B. Schölkopf, & D. Schuurmans (Eds.), Advances in large margin classifiers (pp. 207–219). Cambridge, MA: MIT Press.
Rätsch, G. R., & Warmuth, M. K. (2001). Marginal boosting. Royal Holloway College, NeuroCOLT2 Technical report, 97. London.
Rätsch, G., Warmuth, M., Mika, S., Onoda, T., Lemm, S., & Müller, K.-R. (2000). Barrier boosting. In COLT'2000 (pp. 170–179). San Mateo, CA: Morgan Kaufmann.
Ridgeway, G., Madigan, D., & Richardson, T. (1999). Boosting methodology for regression problems. In D. Heckerman, & J. Whittaker (Eds.), Proceedings of Artificial Intelligence and Statistics '99 (pp. 152–161). http://www.rand.org/methodology/stat/members/gregr.
Schapire, R., Freund, Y., Bartlett, P., & Lee, W. (1997). Boosting the margin: A new explanation for the effectiveness of voting methods. In Proc. 14th International Conference on Machine Learning (pp. 322–330). San Mateo, CA: Morgan Kaufmann.
Schölkopf, B., Burges, C., & Smola, A. (Eds.). (1999). Advances in Kernel methods-support vector learning . Cambridge, MA: MIT Press.
Schölkopf, B., Smola, A., Williamson, R. C., & Bartlett, P. L. (2000). New support vector algorithms. Neural Computation, 12 , 1207–1245.
Schwenk, H., & Bengio, Y. (1997). AdaBoosting neural networks. In W. Gerstner, A. Germond, M. Hasler, & J.-D. Nicoud (Eds.), Proc. of the Int. Conf. on Artificial Neural Networks (ICANN'97) , Vol. 1327 of LNCS (pp. 967–972). Berlin: Springer.
Smola, A. J. (1998). Learning with Kernels. Ph.D. Thesis, Technische Universität Berlin.
Smola, A., Schölkopf, B., & Rätsch, G. (1999). Linear programs for automatic accuracy control in regression. In Proceedings ICANN'99, Int. Conf. on Artificial Neural Networks , Berlin: Springer.
Vapnik, V. (1995). The nature of statistical learning theory . New York: Springer Verlag.
Weigend, A., & Gershenfeld, N. A. (Eds.) (1994). Time series prediction: Forecasting the future and understanding the past. Addison-Wesley. Santa Fe Institute Studies in the Sciences of Complexity.
Zemel, R., & Pitassi, T. (2001). A gradient-based boosting algorithm for regression problems. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in Neural Information Processing Systems 13 (pp. 696–702). Cambridge, MA: MIT Press.
Authors and Affiliations
GMD FIRST, Kekuléstr. 7, 12489, Berlin, Germany
Gunnar Rätsch
Department of Decision Sciences and Eng. Systems, Rensselaer Polytechnic Institute, Troy, NY, 12180, USA
Ayhan Demiriz
Department of Mathematical Sciences, Rensselaer Polytechnic Institute, Troy, NY, 12180, USA
Kristin P. Bennett
Rätsch, G., Demiriz, A., & Bennett, K.P. Sparse Regression Ensembles in Infinite and Finite Hypothesis Spaces. Machine Learning 48, 189–218 (2002). https://doi.org/10.1023/A:1013907905629
Issue Date: July 2002
DOI: https://doi.org/10.1023/A:1013907905629
A hypothesis is a function that best describes the target in supervised machine learning. The hypothesis that an algorithm comes up with depends on the data and also on the restrictions and bias that we have imposed on the data. A simple linear hypothesis can be written as y = mx + b, where y is the output (range), m is the slope of the line, x is the input (domain), and b is the intercept.
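As a small illustration (the data and names below are made up), a single linear hypothesis of this form can be represented as an ordinary function whose slope and intercept are fit from data:

```python
# A linear hypothesis h(x) = m*x + b: m (slope) and b (intercept) are fit by
# least squares on a tiny made-up dataset; h then maps inputs to predictions.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # inputs (domain)
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # outputs (range)

m, b = np.polyfit(x, y, deg=1)            # slope and intercept

def h(x_new: float) -> float:
    """The learned hypothesis."""
    return m * x_new + b

print(f"h(x) = {m:.2f}*x + {b:.2f}; h(6.0) = {h(6.0):.2f}")
```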
Infinite Hypothesis Space. The previous analysis was restricted to finite hypothesis spaces. Some infinite hypothesis spaces are more expressive than others, e.g., rectangles vs. 17-sided convex polygons vs. general convex polygons, or a linear threshold function vs. a conjunction of LTUs. We need a measure of the expressiveness of an infinite ...
Finite hypothesis space. A first simple example of PAC-learnable spaces: finite hypothesis spaces. Theorem (uniform convergence for finite H): Let \(\mathcal{H}\) be a finite hypothesis space and \(\ell : \mathcal{Y} \times \mathcal{Y} \to [0, 1]\) be a bounded loss function; then \(\mathcal{H}\) has the uniform convergence property with \(M(\epsilon, \delta) = \frac{\ln(2|\mathcal{H}|/\delta)}{2\epsilon^2}\) and is therefore PAC learnable by the ERM algorithm. Proof ...
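A quick way to use this bound (a sketch; the helper name and the example numbers are mine) is to evaluate \(M(\epsilon, \delta)\) directly:

```python
# Sample size sufficient for uniform convergence over a finite hypothesis
# space: M(eps, delta) = ln(2|H| / delta) / (2 * eps^2), as stated above.
import math

def uniform_convergence_samples(h_size: int, eps: float, delta: float) -> int:
    return math.ceil(math.log(2 * h_size / delta) / (2 * eps ** 2))

# e.g. |H| = 10**6 hypotheses, accuracy eps = 0.05, confidence delta = 0.01
print(uniform_convergence_samples(10**6, 0.05, 0.01))   # -> 3823
```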
Learning Bound for Finite H — Consistent Case. Theorem: let \(H\) be a finite set of functions from \(X\) to \(\{0, 1\}\) and \(L\) an algorithm that for any target concept \(c \in H\) and sample \(S\) returns a consistent hypothesis \(h_S\): \(\widehat{R}_S(h_S) = 0\). Then, for any \(\epsilon, \delta > 0\), with probability at least \(1 - \delta\), ...
VC Dimension (Vapnik & Chervonenkis, 1968–1971; Vapnik, 1982, 1995, 1998). Definition: the VC-dimension of a hypothesis set \(H\) is defined by \(\mathrm{VCdim}(H) = \max\{m : \Pi_H(m) = 2^m\}\). Thus, the VC-dimension is the size of the largest set that can be fully shattered by \(H\). Purely combinatorial notion.
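The shattering condition can be checked by brute force for small cases. The sketch below (the hypothesis class and helper names are my own toy choices) verifies that 1-D threshold classifiers shatter any single point but no pair of points, i.e. their VC-dimension is 1:

```python
# Brute-force check of whether a hypothesis class shatters a given point set.
# Here H is the class of 1-D thresholds h_t(x) = 1 if x >= t else 0, which
# shatters any single point but no set of two points (VC-dimension 1).
from itertools import product

def shatters(points, hypotheses):
    """True if every labeling of `points` is realized by some hypothesis."""
    realizable = {tuple(h(x) for x in points) for h in hypotheses}
    return all(tuple(lab) in realizable for lab in product([0, 1], repeat=len(points)))

# A finite sample of the threshold class, enough for these small examples.
thresholds = [t / 10 for t in range(-20, 21)]
H = [lambda x, t=t: 1 if x >= t else 0 for t in thresholds]

print(shatters([0.5], H))        # True  -> a single point can be shattered
print(shatters([0.3, 0.7], H))   # False -> the labeling (1, 0) is not realizable
```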
10-806 Foundations of Machine Learning and Data Science, Maria-Florina Balcan, Lectures 4–5: September 21st and September 23rd, 2015. Sample Complexity Results for Infinite Hypothesis Spaces. The Shattering Coefficient: Let C be a concept class over an instance space X, i.e. a set of functions from X to {0, 1} (where both C and X may be infinite).
The learning-in-the-limit model is too strong: it allows unlimited data and computational resources. The PAC model only requires learning a Probably Approximately Correct concept (learn a decent approximation most of the time) and requires polynomial sample complexity and computational complexity.
hypothesis space, i.e., if you have a small hypothesis space, you do not have to see too many examples. There is also a trade-off in the choice of hypothesis space: if the space is small, then it generalizes well, but it may not be expressive enough. Consistent Learners: using the results from the previous section, we can get this general scheme for PAC ...
The VC-dimension of a hypothesis space H is the cardinality of the largest set S that can be shattered by H. Definition: if arbitrarily large finite sets can be shattered by H, then \(\mathrm{VCdim}(H) = \infty\). H shatters S if \(|H_S| = 2^{|S|}\). Shattering, VC-dimension: the VC-dimension of a hypothesis space H is the ...
What if the hypothesis space is not finite? • Q: If H is infinite (e.g. the class of perceptrons), what measure of hypothesis-space complexity can we use in place of |H|? • A: the largest subset of 𝒳 for which H can guarantee zero training error, regardless of the target function. This is known as the Vapnik-Chervonenkis dimension (VC ...
We will see that PAC can provide learnability bounds for a finite hypothesis space, and by using the Vapnik-Chervonenkis (VC) dimension, such results can be extended to an infinite hypothesis space. These results will give us theoretical characterizations of the difficulty of machine learning problems and the capabilities of certain models.
finite hypothesis spaces since we are using the cardinality of H. This led us to briefly discuss the generalization of Occam's Razor to infinite hypothesis spaces at the end of last week's lecture. Sample Complexity for Infinite Hypothesis Spaces: in order to generalize Occam's Razor to infinite hypothesis spaces, we have to somewhat ...
Computational learning theory, or statistical learning theory, refers to mathematical frameworks for quantifying learning tasks and algorithms. These are sub-fields of machine learning that a machine learning practitioner does not need to know in great depth in order to achieve good results on a wide range of problems. Nevertheless, it is a sub-field where having a high-level understanding of ...
Machine Learning, Chapter 7, Part 2 CSE 574, Spring 2004 Sample Complexity for infinite hypothesis spaces • Another measure of the complexity of H called Vapnik-Chervonenkis dimension, or VC(H) • We will use VC(H) instead of |H| • Results in tighter bounds • Allows characterizing sample complexity of infinite hypothesis spaces and is ...
Supervised machine learning is often described as the problem of approximating a target function that maps inputs to outputs. This description is characterized as searching through and evaluating candidate hypotheses from hypothesis spaces. The discussion of hypotheses in machine learning can be confusing for a beginner, especially when "hypothesis" has a distinct, but related meaning […]
Definition. Let \(X\) be a space which we call the input space, and \(Y\) be a space which we call the output space, and let \(Z\) denote the product \(X \times Y\). For example, in the setting of binary classification, \(X\) is typically a finite-dimensional vector space and \(Y\) is the set \(\{-1, 1\}\). Fix a hypothesis space \(\mathcal{H}\) of functions \(g : X \to Y\). A learning algorithm over \(\mathcal{H}\) is a computable map from \(Z^m\) to \(\mathcal{H}\). In other words, it is an algorithm that takes as ...
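A minimal sketch of this definition in code (the finite hypothesis class and the grid of thresholds are my own toy choices): the learning algorithm is literally a map from a finite sample to a hypothesis, here chosen by empirical risk minimization.

```python
# A learning algorithm as a map from a finite sample S in (X x Y)^m to a
# hypothesis in H: here, empirical risk minimization over a small finite H
# of threshold classifiers with outputs in {-1, +1}. Illustrative sketch only.
from typing import Callable, List, Tuple

Hypothesis = Callable[[float], int]

def make_H() -> List[Hypothesis]:
    # H = { x -> sign(x - t) : t on a small grid of thresholds }
    return [lambda x, t=t: 1 if x >= t else -1 for t in [-1.0, -0.5, 0.0, 0.5, 1.0]]

def erm(sample: List[Tuple[float, int]], H: List[Hypothesis]) -> Hypothesis:
    """The learning algorithm: returns a hypothesis minimizing empirical error."""
    return min(H, key=lambda h: sum(h(x) != y for x, y in sample))

S = [(-0.8, -1), (-0.2, -1), (0.3, 1), (0.9, 1)]
h_S = erm(S, make_H())
print([h_S(x) for x, _ in S])    # predictions of the returned hypothesis
```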
In Chapter 5 we introduced the main notions of machine learning, with particular regard to hypothesis and data representation, and we saw that concept learning can be formulated in terms of a search problem in the hypothesis space H. As H is in general very large, or even infinite, well-designed strategies are required in order to perform the search for good hypotheses efficiently.
The number of dichotomies could be finite even for an infinite hypothesis space \(\mathcal {H}\); ... Award committee pointed out that Valiant's paper published in 1984 created a new research area known as computational learning theory that puts machine learning on a sound mathematical footing.
The hypothesis class can be finite or infinite; for example, a discrete set of shapes to encircle a certain portion of the input space is a finite hypothesis space, whereas the hypothesis space of parametrized functions like neural nets and linear regressors is infinite. ... is commonly used. For example, this famous book on machine learning and ...
The hypothesis is one of the commonly used concepts of statistics in machine learning. It is specifically used in supervised machine learning, where an ML model learns a function that best maps inputs to the corresponding outputs with the help of an available dataset. In supervised learning techniques, the main aim is to determine the possible ...
COS 511: Theoretical Machine Learning. Lecturer: Rob Schapire. Lecture #5, Scribe: David Bieber, February 19, 2013. Recall Occam's razor: with probability at least \(1 - \delta\), a hypothesis \(h \in \mathcal{H}\) consistent with \(m\) examples sampled independently from distribution \(\mathcal{D}\) satisfies \(\mathrm{err}(h) \le \frac{\ln|\mathcal{H}| + \ln\frac{1}{\delta}}{m}\). Sample complexity for infinite hypothesis spaces ...
Fact: Every consistent learner outputs a hypothesis belonging to the version space. Therefore, we need to bound the number of examples needed to assure that the version space contains no unacceptable hypothesis.
In Section 3.2 we investigate the dual of this linear program for ensemble regression. In Section 3.3, we propose a semi-infinite linear program formulation for "boosting" of infinite hypothesis sets, first in the dual and then in the primal space. The dual problem is called semi-infinite because it has an infinite number of constraints and ...