One of the fundamental questions of machine learning is how to compare examples. If an algorithm could perfectly determine whether two examples were semantically similar or dissimilar, most subsequent ma- chine learning tasks would become trivial. For example, in classification settings, one would only require one labeled example per class and could then, during test-time, categorize all similar examples with the same class-label. An analogous reduction applies to regression if a continuous estimate of the degree of similarity were available. It is not surprising that many popular machine learning algorithms, such as Support Vector Machines, Gaussian Processes, kernel regression, k-means or k-nearest neighbors (kNN) fundamentally rely on a representation of the input data for which a reliable, although not perfect, measure of dissimilarity is known.
A common choice of dissimilarity measure is an uninformed norm, like the Euclidean distance. Here it is assumed that the features are represented in a Euclidean subspace in which similar inputs are close and dissimilar inputs are far away. Although the Euclidean distance is convenient and intuitive, it ignores the fact that the semantic meaning of “similarity” is inherently task- and data-dependent. To illustrate this point, imagine two researchers who use the same data set of written documents for clustering. The first one is interested in clustering the articles by author, whereas the second wants to cluster by topic. Given the nature of their respective tasks, both should use very different metrics to measure document similarity, even if the underlying features are computed through similar means (e.g., bag-of-words or tf-idf ). Often, domain experts adjust the feature representations by hand — but clearly, this is not a robust approach. It is therefore desirable to learn the metric (or data representation) explicitly for each specific application.
Manifold Learning (often also referred to as non-linear dimensionality reduction) pursuits the goal to embed data that originally lies in a high dimensional space in a lower dimensional space, while preserving characteristic properties. This is possible because for any high dimensional data to be interesting, it must be intrinsically low dimensional. For example, images of faces might be represented as points in a high dimensional space (let’s say your camera has 5MP -- so your images, considering each pixel consists of three values [r,g,b], lie in a 15M dimensional space), but not every 5MP image is a face. Faces lie on a sub-manifold in this high dimensional space. A sub-manifold is locally Euclidean, i.e. if you take two very similar points, for example two images of identical twins, you can interpolate between them and still obtain an image on the manifold, but globally not Euclidean -- if you take two images that are very different --- for example Arnold Schwarzenegger and Hillary Clinton -- you cannot interpolate between them. I develop algorithms that map these high dimensional data points into a low dimensional space, while preserving local neighborhoods. This can be interpreted as a non-linear generalization of PCA.
Learning from Noisy Cameras
The classical approach to understanding video data is to find and track features from frame to frame. But many image sets are noisy, small, and observe objects undergoing complicated deformations. For many real world cases, such as MRI images of a heart beating, these deformations have just one or two root causes. I have worked to apply Manifold Learning to image data sets, in order to capture the root causes of the image deformation. This gives new tools for registration, segmentation, and de-noising of medical or aerial image data sets, an important first step in many image analysis pipelines. It also offers approach for video action recognition, and in general for understanding non-rigid and deformable motion.
I was among the earliest popularizers of the Isomap algorithm, demonstrating its potential for problems such as video classification and pose estimation [plessICCV]. With my former student Richard Souvenir, I proposed the first algorithm to support clustering and parameterization of data sets that are drawn from multiple intersecting manifolds [souvenirManifoldClustering].
Subsequently, we used Pattern Theory as a framework in order to define image distance functions that correspond to natural image changes [souvenirPless07] -- in order to organize images by their type of variation, including overall motion, deformation, and appearance change. This allowed us to develop problem specific tools to improve image segmentation [qilongCVPR, georgCVPR2008], and image interpolation and registration [manifoldInterpICIP]. This was applied to a large retrospective study of patients with lung cancer, in which we helped quantify the effects of radiation dose on health outcomes by capturing the breathing motion in the archived lung CT imagery. This paper received the ``reviewers choice'' award at the International Conference on the Use of Computers in Radiation Therapy [georgICCR].
Learning to Rank
The goal of learning to rank is to develop machine learning algorithms that learn preferred orderings from experience. This is a particular important problem in web-search. Given a query consisting of a few words, the task is to accurately predict which web-page the user wants to obtain from the web. My research has focussed primarily on extensions of boosted regression trees, which have shown to be arguably the best algorithms for web-ranking. (All winning teams of the Yahoo Learning to Rank Challenge used boosted regression trees in one form or another.)
Check out our implementation of boosted regression tree and random forests.
Optimization is a computational problem with broad scientific and engineering applications. We have been working on the theory and algorithms for improving the quality and time cost of optimization. We are also studying applications of optimization. Our recent research contributions include:
- Constraint partitioning approach for large-scale optimization (AIJ-05, CP-05, PAAP-08)
- Extended duality theory (COA-08)
- Stochastic constrained optimization (JOGO-07, GrC-08)
- Intensity modulated radiotherapy optimization (LAA-07, ICCR-07)
- Wireless sensor network optimization (RTSS-08, MobiHoc-10, RTSS-10, ICDCS-11, ECRTS-11, RTAS-11, TPDS-11, RTAS-12, InfoCom-12)
- Design optimization for streaming applications (SAAHPC-10, ASAP-11)
- Nonlinear optimization in biological systems (PLoS-12)
Multitask learning aims to improve the performance of learning algorithms by learning classifiers for multiple tasks jointly. This works particularly well if these tasks have some commonality and are generally slightly under sampled. One example is a spam-filter. Everybody has a slightly different distribution over spam or not-spam emails (e.g. all emails in russian are spam for me -- but not so for my russian colleagues), yet there is definitely a common aspect across users. Multi-task learning works, because encouraging a classifier (or a modification thereof) to also performs well on a slightly different task is a better regularization than uninformed regularizers e.g. to enforce that all weights are small (the typical l2-regularization).