Statistical learning nowadays plays a growing role in many scientific fields and must therefore face new problems. It is consequently important to propose statistical learning methods adapted to the modern problems posed by the various…
Consider the usual regression problem in which we want to study the conditional distribution of a response Y given a set of predictors X. Sufficient dimension reduction (SDR) methods aim at replacing the high-dimensional vector of predictors by a lower-dimensional function R(X) with no loss of information about the dependence of the response variable on the predictors. Almost all SDR methods restrict attention to the class of linear reductions, which can be represented in terms of the projection of X onto a dimension-reduction subspace (DRS). Several methods have been proposed to estimate the basis of the DRS, such as sliced inverse regression (SIR; Li, 1991), principal Hessian directions (PHD; Li, 1992), sliced average variance estimation (SAVE; Cook and Weisberg, 1991), directional regression (DR; Li et al., 2005) and inverse regression estimation (IRE; Cook and Ni, 2005). A novel SDR method, called MSIR, based on finite mixtures of Gaussians, was recently proposed (Scrucca, 2011) as an extension of SIR. The talk will present the MSIR methodology and some recent advances. In particular, a BIC-type criterion for selecting the dimensionality of the DRS will be introduced, along with its extension to variable selection. Finally, the application of MSIR to classification problems, both supervised and semi-supervised, will be discussed.
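As a point of reference, the classical SIR step that MSIR extends can be sketched in a few lines: standardize the predictors, slice the response, and extract the leading eigenvectors of the weighted covariance of the slice means. The Python sketch below implements plain SIR only (not MSIR); the toy data, number of slices and dimension d are illustrative choices.

```python
# Minimal sketch of sliced inverse regression (SIR; Li, 1991) -- not the MSIR
# extension discussed in the talk.
import numpy as np

def sir_directions(X, y, n_slices=10, d=1):
    """Estimate a basis of the dimension-reduction subspace with SIR."""
    n, p = X.shape
    # Standardize the predictors: Z = (X - mean) Sigma^{-1/2}
    Xc = X - X.mean(axis=0)
    Sigma = np.cov(Xc, rowvar=False)
    eigval, eigvec = np.linalg.eigh(Sigma)
    Sigma_inv_sqrt = eigvec @ np.diag(eigval ** -0.5) @ eigvec.T
    Z = Xc @ Sigma_inv_sqrt

    # Slice the response and accumulate the covariance of the slice means of Z
    order = np.argsort(y)
    slices = np.array_split(order, n_slices)
    M = np.zeros((p, p))
    for idx in slices:
        m = Z[idx].mean(axis=0)
        M += (len(idx) / n) * np.outer(m, m)

    # Leading eigenvectors of M, mapped back to the original predictor scale
    w, v = np.linalg.eigh(M)
    directions = Sigma_inv_sqrt @ v[:, ::-1][:, :d]
    return directions / np.linalg.norm(directions, axis=0)

# Toy single-index model y = (X b)^3 + noise
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
b = np.array([1.0, -1.0, 0.0, 0.0, 0.0]) / np.sqrt(2)
y = (X @ b) ** 3 + 0.1 * rng.normal(size=500)
print(sir_directions(X, y, n_slices=10, d=1).ravel())
```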
Information visualization is a research area that focuses on making the structure and content of large and complex data sets visually understandable and interactively analyzable. The goal of information visualization tools and techniques is to increase our ability to gain insight and make decisions for many types of datasets, tasks, and analysis scenarios. With today's increase in the size and complexity of data sets, the area of information visualization gains steadily in importance and recognition. In this talk I will present principles of data representation and interaction, introduce a number of existing applications, tools, and techniques, and show how they can be applied to questions in statistics and statistical learning.
In the early days of kernel machines research, the "kernel trick" was considered a useful way of constructing nonlinear learning algorithms from linear ones, by applying the linear algorithms to feature space mappings of the original data. Recently, it has become clear that a potentially more far reaching use of kernels is as a linear way of dealing with higher order statistics, by mapping probabilities to a suitable reproducing kernel Hilbert space (i.e., the feature space is an RKHS). I will describe how probabilities can be mapped to reproducing kernel Hilbert spaces, and how to compute distances between these mappings. A measure of strength of dependence between two random variables follows naturally from this distance. Applications that make use of kernel probability embeddings include:
- Nonparametric two-sample testing and independence testing in complex (high dimensional) domains. As an application, we find whether text in English is translated from the French, as opposed to being random extracts on the same topic.
- Bayesian inference, in which the prior and likelihood are represented as feature space mappings, and a posterior feature space mapping is obtained. In this case, Bayesian inference can be undertaken even in the absence of a model, by learning the prior and likelihood mappings from samples.
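A minimal sketch of the distance between two kernel mean embeddings (the maximum mean discrepancy) is given below; the Gaussian kernel, the median-heuristic bandwidth and the biased estimator are illustrative assumptions rather than the specific choices of the talk.

```python
# Sketch: distance between kernel mean embeddings of two samples (MMD) with a
# Gaussian RBF kernel; bandwidth chosen by the median heuristic.
import numpy as np
from scipy.spatial.distance import cdist

def mmd_rbf(X, Y, sigma=None):
    """Biased empirical MMD^2 between samples X and Y."""
    Z = np.vstack([X, Y])
    D = cdist(Z, Z, "sqeuclidean")
    if sigma is None:                       # median heuristic for the bandwidth
        sigma = np.sqrt(np.median(D[D > 0]) / 2)
    K = np.exp(-D / (2 * sigma ** 2))
    n = len(X)
    Kxx, Kyy, Kxy = K[:n, :n], K[n:, n:], K[:n, n:]
    return Kxx.mean() + Kyy.mean() - 2 * Kxy.mean()

rng = np.random.default_rng(1)
X = rng.normal(0.0, 1.0, size=(200, 2))
Y = rng.normal(0.5, 1.0, size=(200, 2))    # shifted distribution
print(mmd_rbf(X, X[::-1]), mmd_rbf(X, Y))  # near zero vs. clearly positive
```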
Functional data are becoming increasingly common in a variety of fields. Many studies underline the importance of representing data as functions. This has sparked growing attention to the development of statistical tools adapted to the analysis of such data: functional data analysis (FDA). The aims of FDA are largely the same as in classical statistical analysis, e.g. representing and visualizing the data, studying variability and trends, comparing different data sets, as well as modeling and predicting. Recent advances in FDA allow the construction of various classification methods, based for instance on comparisons between centrality curves or on change points. We review some procedures that have been used to classify functional data. The main point is to show the good practical behavior of these procedures on a sample of curves. In addition, theoretical advances on functional estimation related to these classification methods are provided.
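As one concrete (and deliberately simple) instance of classification based on centrality curves, a new curve can be assigned to the class whose mean curve is closest in L2 distance. The sketch below, with simulated curves observed on a common grid, illustrates this idea only; it is not a specific procedure from the talk.

```python
# Centroid-type classifier for curves: assign a new curve to the class whose
# mean curve is closest in (discretized) L2 distance.  Simulated data and the
# evaluation grid are illustrative assumptions.
import numpy as np

def fit_mean_curves(curves, labels):
    """Class-wise mean curves, each curve observed on a common grid."""
    return {c: curves[labels == c].mean(axis=0) for c in np.unique(labels)}

def classify_curve(x_new, mean_curves, grid):
    """Label of the nearest mean curve in L2 distance (trapezoidal rule)."""
    dist = {c: np.trapz((x_new - m) ** 2, grid) for c, m in mean_curves.items()}
    return min(dist, key=dist.get)

# Toy example: two groups of noisy sine/cosine curves on [0, 1]
rng = np.random.default_rng(2)
grid = np.linspace(0, 1, 100)
group0 = np.sin(2 * np.pi * grid) + 0.3 * rng.normal(size=(50, 100))
group1 = np.cos(2 * np.pi * grid) + 0.3 * rng.normal(size=(50, 100))
curves = np.vstack([group0, group1])
labels = np.repeat([0, 1], 50)

means = fit_mean_curves(curves, labels)
x_new = np.sin(2 * np.pi * grid) + 0.3 * rng.normal(size=100)
print(classify_curve(x_new, means, grid))   # expected: 0
```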
A new family of 12 probabilistic models, introduced recently, aims to simultaneously cluster and visualize high-dimensional data. It is based on a mixture model which fits the data into a latent discriminative subspace with an intrinsic dimension bounded by the number of clusters. An estimation procedure, named the Fisher-EM algorithm, has also been proposed and turns out to outperform other subspace clustering methods in most situations. Moreover, the convergence properties of the Fisher-EM algorithm are discussed; in particular, it is proved that the algorithm is a GEM algorithm and converges under weak conditions in the general case. Finally, a sparse extension of the Fisher-EM algorithm is proposed in order to select, among the original variables, those which are discriminative.
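To give a feel for the alternation involved, the sketch below alternates a soft Fisher-type subspace update with a Gaussian mixture fit in the projected space. This is a simplified illustration of discriminative subspace clustering, not the actual Fisher-EM algorithm or the underlying latent mixture model; the k-means initialization, ridge term and iteration count are arbitrary choices.

```python
# Simplified illustration: alternate (i) a Fisher-type subspace update computed
# from soft between/within scatter matrices and (ii) a Gaussian mixture fitted
# in that subspace.  Not the Fisher-EM algorithm of the talk.
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

def soft_fisher_subspace(X, resp, d):
    """Directions maximizing between/within scatter computed with soft labels."""
    n, p = X.shape
    mu = X.mean(axis=0)
    Sb = np.zeros((p, p))
    Sw = 1e-6 * np.eye(p)                       # small ridge for stability
    for k in range(resp.shape[1]):
        w = resp[:, k]
        nk = w.sum()
        mk = (w[:, None] * X).sum(axis=0) / nk
        Sb += nk * np.outer(mk - mu, mk - mu)
        Xc = X - mk
        Sw += (w[:, None] * Xc).T @ Xc
    vals, vecs = eigh(Sb, Sw)                   # generalized eigenproblem
    return vecs[:, ::-1][:, :d]                 # d leading directions

def cluster_in_subspace(X, K, d, n_iter=15, seed=0):
    labels = KMeans(n_clusters=K, n_init=10, random_state=seed).fit_predict(X)
    resp = np.eye(K)[labels]                    # hard start, then soften
    for _ in range(n_iter):
        U = soft_fisher_subspace(X, resp, d)
        gm = GaussianMixture(n_components=K, random_state=seed).fit(X @ U)
        resp = gm.predict_proba(X @ U)
    return resp.argmax(axis=1), U

# Toy example: 3 clusters living in 2 of 10 dimensions
rng = np.random.default_rng(3)
centers = np.array([[0, 0], [4, 0], [0, 4]])
X2 = np.vstack([rng.normal(c, 1.0, size=(100, 2)) for c in centers])
X = np.hstack([X2, rng.normal(size=(300, 8))])  # add 8 noise dimensions
labels, U = cluster_in_subspace(X, K=3, d=2)
print(np.bincount(labels))
```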
Cluster analysis is concerned with finding homogeneous groups in a population. Model-based clustering methods provide a framework for developing clustering methods through the use of statistical models. This approach allows uncertainty to be quantified using probability and the properties of a clustering method to be understood on the basis of a well-defined statistical model. Mixture models provide a basis for many model-based clustering methods. Ranking data arise when judges rank some or all of a set of objects. Examples of ranking data include voting data from elections that use preferential voting systems (e.g. PR-STV) and customer preferences for products in marketing applications. A mixture of experts model is a mixture model in which the model parameters are functions of covariates. We explore the use of mixture of experts models in cluster analysis, so that clustering can be better understood. The choice of how and where covariates enter the mixture of experts model has implications for the clustering performance and the interpretation of the results. The use of covariates in clustering is demonstrated on examples from studying voting blocs in elections and examining customer segments in marketing.
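To fix ideas with generic notation (the symbols here are assumptions, not taken from the talk), a mixture of experts model for a response y with covariates x can be written as

p(y \mid x) = \sum_{k=1}^{K} \pi_k(x)\, f\big(y \mid \theta_k(x)\big),
\qquad
\pi_k(x) = \frac{\exp(x^{\top}\beta_k)}{\sum_{l=1}^{K}\exp(x^{\top}\beta_l)},

where f is a component model for the response (for ranking data, typically a ranking model such as Plackett-Luce). Letting the covariates enter only the gating weights \pi_k, only the component parameters \theta_k, both, or neither corresponds to the different choices of "how and where covariates enter" mentioned above.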
Cluster analysis is an important tool in a variety of scientific areas including pattern recognition, document clustering, and the analysis of microarray data. Although many clustering procedures, such as hierarchical, strict partitioning and overlapping clusterings, aim to construct an optimal partition of objects or, sometimes, of variables, there are other methods, known as co-clustering or block clustering procedures, which consider the two sets simultaneously. In several situations, co-clustering has been shown to be more effective than classical clustering algorithms in discovering hidden clustering structures in the data matrix. I will present the different aims of co-clustering and several approaches to it, focusing on block mixture models and the non-negative matrix factorization approach. Models, algorithms and applications will be presented.
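As a small illustration of the matrix factorization route, the sketch below factors a nonnegative data matrix as A ≈ WH and reads row clusters from the columns of W and column clusters from the rows of H. This argmax rule is only a simple heuristic for illustration; the block mixture models mentioned above treat both partitions as latent variables of a single probabilistic model.

```python
# NMF-based co-clustering heuristic on a toy matrix with a planted 2 x 2 block
# structure.  The data and the argmax assignment rule are illustrative choices.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(4)

# Toy nonnegative matrix: two row blocks x two column blocks plus Poisson noise
block = np.kron(np.array([[5.0, 0.5], [0.5, 5.0]]), np.ones((30, 20)))
A = rng.poisson(block).astype(float)

model = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=500)
W = model.fit_transform(A)          # (n_rows, 2) row loadings
H = model.components_               # (2, n_cols) column loadings

row_clusters = W.argmax(axis=1)     # each row goes to its dominant factor
col_clusters = H.argmax(axis=0)     # each column likewise
print(np.bincount(row_clusters), np.bincount(col_clusters))
```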
Networks are widely used to represent complex systems as sets of interactions between units of interest. For instance, regulatory networks can describe the regulation of genes by transcription factors, while metabolic networks focus on representing pathways of biochemical reactions. In social sciences, networks are commonly used to represent relational ties between actors. Numerous graph clustering algorithms have been proposed since the early work of Moreno [2]. Most of them partition the vertices into disjoint clusters depending on their connection profiles. However, recent studies showed that these techniques are too restrictive, since most real networks contain overlapping clusters. To tackle this issue, we proposed the Overlapping Stochastic Block Model (OSBM) in [1]. This approach allows the vertices of a network to belong to multiple classes and can be seen as a generalization of the stochastic block model [3]. In [1], we developed a variational method to cluster the vertices of networks and showed that the algorithm had good clustering performance on both simulated and real data. However, no criterion was proposed to estimate the number of classes from the data, which is a major issue in practice. Here, we tackle this limitation using a Bayesian framework. Thus, we introduce priors over the model parameters and consider variational Bayes methods to approximate the full posterior distribution. We show how a model selection criterion can be obtained in order to estimate the number of (overlapping) clusters in a network. On both simulated and real data, we compare our work with other approaches.
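The simulation sketch below generates a network in the spirit of an overlapping block model: each vertex carries a binary membership vector and edges appear with a probability driven by shared memberships. The logistic link, the parameter values and the handling of self-loops are illustrative assumptions, not the exact OSBM specification of [1].

```python
# Toy simulation of an overlapping-membership network: vertex i has a binary
# vector Z_i in {0,1}^Q and edge (i, j) appears with probability
# sigmoid(Z_i' W Z_j + b).  Parameters are illustrative only.
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

rng = np.random.default_rng(5)
n, Q = 60, 2
Z = rng.binomial(1, 0.4, size=(n, Q))        # overlapping memberships
W = np.array([[3.0, -1.0], [-1.0, 3.0]])     # within-class affinity, between-class repulsion
b = -2.5                                     # global sparsity offset

P = sigmoid(Z @ W @ Z.T + b)                 # edge probabilities
A = rng.binomial(1, P)                       # adjacency matrix
np.fill_diagonal(A, 0)                       # drop self-loops

print("vertices in both classes:", int((Z.sum(axis=1) == 2).sum()))
print("edge density:", A.mean())
```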
The idea of selecting a model by penalizing a log-likelihood type criterion goes back to the early seventies with the pioneering works of Mallows and Akaike. One can find many consistency results in the literature for such criteria. These results are asymptotic, in the sense that one deals with a given number of models and the number of observations tends to infinity. A non-asymptotic theory for this type of criterion has been developed in recent years, allowing the size as well as the number of models to depend on the sample size. For the practical relevance of these methods, it is desirable to obtain a precise expression of the penalty terms involved in the penalized criteria on which they are based. We will discuss some heuristics to design data-driven penalties, review some new results and discuss some open problems.
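In generic notation (the symbols are assumptions used here for illustration): given a collection of models indexed by m, each with D_m parameters and maximized likelihood \hat{L}_m computed from n observations, a penalized criterion selects

\hat{m} = \arg\min_{m} \big\{ -\log \hat{L}_m + \operatorname{pen}(m) \big\},

with \operatorname{pen}(m) = D_m for AIC and \operatorname{pen}(m) = (D_m/2)\log n for BIC. A data-driven penalty typically takes the form \operatorname{pen}(m) = \hat{\kappa}\,\operatorname{pen}_{\mathrm{shape}}(m), where the shape is suggested by non-asymptotic risk bounds and the multiplicative constant \hat{\kappa} is calibrated from the data, one well-known example being the slope heuristic.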
In this communication, we focus on data arriving sequentially by blocks in a stream. A semiparametric regression model involving a common EDR (Effective Dimension Reduction) direction B is assumed in each block. Our goal is to estimate this direction at each arrival of a new block. A simple direct approach consists in pooling all the observed blocks and estimating the EDR direction by the SIR (Sliced Inverse Regression) method. However, drawbacks appear in practice, such as the storage of all the blocks and the running time for high-dimensional data. To overcome them, we propose an adaptive SIR estimator of B based on the SIR approach for a stratified population developed by Chavent et al. (2011). The proposed approach is faster in terms of both computational complexity and running time, and provides data storage benefits. We show the consistency of our estimator at the root-n rate and give its asymptotic distribution. We propose an extension to the multiple-indices model. We also provide a graphical tool to detect whether a drift occurs in the EDR direction or whether some aberrant blocks appear in the data stream. In a simulation study, we illustrate the good numerical behavior of our estimator. One important advantage of this approach is its adaptability to changes in the underlying model.
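To illustrate why block-wise processing avoids storing past data, the sketch below keeps only running sufficient statistics (predictor sums, cross-products and per-slice sums with fixed slice boundaries) and recomputes a SIR direction on demand. This is an illustrative streaming variant of plain SIR, not the adaptive estimator of Chavent et al. (2011) discussed here; the slice boundaries and the toy model are assumptions.

```python
# Block-wise SIR from running sufficient statistics: each new block only updates
# accumulators, so past blocks need not be stored.
import numpy as np

class StreamingSIR:
    def __init__(self, slice_bounds, p):
        self.bounds = slice_bounds                    # fixed cut points for y
        H = len(slice_bounds) + 1
        self.n = 0
        self.sum_x = np.zeros(p)
        self.sum_xx = np.zeros((p, p))
        self.slice_n = np.zeros(H)
        self.slice_sum = np.zeros((H, p))

    def update(self, X, y):
        """Absorb one block of observations."""
        self.n += len(X)
        self.sum_x += X.sum(axis=0)
        self.sum_xx += X.T @ X
        h = np.searchsorted(self.bounds, y)           # slice index of each y
        for k in range(len(self.slice_n)):
            mask = h == k
            self.slice_n[k] += mask.sum()
            self.slice_sum[k] += X[mask].sum(axis=0)

    def direction(self, d=1):
        """Recompute the estimated EDR direction(s) from the accumulators."""
        mean = self.sum_x / self.n
        Sigma = self.sum_xx / self.n - np.outer(mean, mean)
        vals, vecs = np.linalg.eigh(Sigma)
        S_inv_sqrt = vecs @ np.diag(vals ** -0.5) @ vecs.T
        M = np.zeros_like(Sigma)
        for k in range(len(self.slice_n)):
            if self.slice_n[k] == 0:
                continue
            z = S_inv_sqrt @ (self.slice_sum[k] / self.slice_n[k] - mean)
            M += (self.slice_n[k] / self.n) * np.outer(z, z)
        w, v = np.linalg.eigh(M)
        B = S_inv_sqrt @ v[:, ::-1][:, :d]
        return B / np.linalg.norm(B, axis=0)

# Toy stream: blocks arriving from y = (X b)^3 + noise
rng = np.random.default_rng(6)
b = np.array([1.0, -1.0, 0.0, 0.0]) / np.sqrt(2)
sir = StreamingSIR(slice_bounds=np.linspace(-3, 3, 9), p=4)
for _ in range(20):                                   # 20 blocks of 100 observations
    X = rng.normal(size=(100, 4))
    y = (X @ b) ** 3 + 0.1 * rng.normal(size=100)
    sir.update(X, y)
print(sir.direction(d=1).ravel())
```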
We consider a classification problem: the goal is to assign class labels to an unlabeled test data set, given several labeled training data sets drawn from different but similar distributions. In essence, the goal is to predict labels from (an estimate of) the marginal distribution (of the unlabeled data) by learning the trends present in related classification tasks that are already known. In this sense, this problem belongs to the category of so-called "transfer learning" in machine learning. The probabilistic model used is that the different training and test distributions are themselves i.i.d. realizations from a distribution on distributions. Conceptually, this setting can be related to traditional random effects models in statistics, although here the approach is nonparametric and distribution-free. This problem arises in several applications where data distributions fluctuate because of biological, technical, or other sources of variation. We develop a distribution-free, kernel-based approach to the problem. This approach involves identifying an appropriate reproducing kernel Hilbert space and optimizing a regularized empirical risk over the space. We present generalization error analysis, describe universal kernels, and establish universal consistency of the proposed methodology. Experimental results on flow cytometry data are presented.
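The following sketch conveys the flavor of such a kernel-based construction: each observation is paired with (the mean embedding of) its task's marginal distribution, and a product kernel k((x, P), (x', P')) = k_P(P, P') k_X(x, x') is plugged into an off-the-shelf kernel classifier. The Gaussian kernels, the bandwidths and the use of sklearn's SVC are illustrative assumptions, not the exact construction analyzed in the talk.

```python
# Sketch: product kernel on (observation, task marginal) pairs, where the task
# kernel is a Gaussian of the empirical MMD between task samples.
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.svm import SVC

def rbf(A, B, sigma):
    return np.exp(-cdist(A, B, "sqeuclidean") / (2 * sigma ** 2))

def task_kernel(samples, sigma_x, sigma_p):
    """Gaussian kernel between tasks, based on the MMD between their samples."""
    T = len(samples)
    G = np.array([[rbf(samples[s], samples[t], sigma_x).mean()
                   for t in range(T)] for s in range(T)])   # <mu_s, mu_t>
    mmd2 = np.diag(G)[:, None] + np.diag(G)[None, :] - 2 * G
    return np.exp(-mmd2 / (2 * sigma_p ** 2))

# Toy data: several binary tasks whose class means drift from task to task
rng = np.random.default_rng(7)
def make_task(shift, n=100):
    X0 = rng.normal([0 + shift, 0], 1.0, size=(n, 2))
    X1 = rng.normal([2 + shift, 2], 1.0, size=(n, 2))
    return np.vstack([X0, X1]), np.repeat([0, 1], n)

train_tasks = [make_task(s) for s in (-1.0, 0.0, 1.0)]
test_X, test_y = make_task(0.5)

X = np.vstack([Xt for Xt, _ in train_tasks])
y = np.concatenate([yt for _, yt in train_tasks])
tid = np.repeat(np.arange(3), [len(Xt) for Xt, _ in train_tasks])

samples = [Xt for Xt, _ in train_tasks] + [test_X]
KP = task_kernel(samples, sigma_x=1.0, sigma_p=0.5)

K_train = rbf(X, X, 1.0) * KP[np.ix_(tid, tid)]           # product kernel on (x, P)
clf = SVC(kernel="precomputed").fit(K_train, y)

K_test = rbf(test_X, X, 1.0) * KP[3, tid][None, :]        # test task has index 3
print("test accuracy:", (clf.predict(K_test) == test_y).mean())
```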