Estimating treatment effects over time holds significance in various domains, including precision medicine, epidemiology, economics, and marketing. This paper introduces a unique approach to counterfactual regression over time, emphasizing long-term predictions. Distinguishing itself from existing models like Causal Transformer, our approach highlights the efficacy of employing RNNs for long-term forecasting, complemented by Contrastive Predictive Coding (CPC) and Information Maximization (InfoMax). Emphasizing efficiency, we avoid the need for computationally expensive transformers. Leveraging CPC, our method captures long-term dependencies in the presence of time-varying confounders. Notably, recent models have disregarded the importance of invertible representations, compromising identification assumptions. To remedy this, we employ the InfoMax principle, maximizing a lower bound of the mutual information between sequence data and its representation. Our method achieves state-of-the-art counterfactual estimation results on both synthetic and real-world data, marking the pioneering incorporation of Contrastive Predictive Coding in causal inference.
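To make the CPC ingredient concrete, here is a minimal numpy sketch of an InfoNCE-style objective, which lower-bounds the mutual information between a context representation and a future representation. All names, shapes and the toy data are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def infonce_loss(z_context, z_future, temperature=0.1):
    """InfoNCE: each context vector should score its own future encoding
    higher than the futures of the other sequences in the batch."""
    # Cosine-similarity logits between every context and every future, shape (B, B)
    z_c = z_context / np.linalg.norm(z_context, axis=1, keepdims=True)
    z_f = z_future / np.linalg.norm(z_future, axis=1, keepdims=True)
    logits = z_c @ z_f.T / temperature
    # Cross-entropy with the matching pair on the diagonal as the target
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    loss = -np.mean(np.diag(log_probs))
    return loss  # log(B) - loss is a lower bound on I(context; future)

rng = np.random.default_rng(0)
ctx, fut = rng.normal(size=(32, 16)), rng.normal(size=(32, 16))
print(infonce_loss(ctx, fut))
```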
This work addresses the challenge of causal representation learning (CRL) for complex, high-dimensional, time-varying data. We enhance transparency and confidence in learned causal abstractions by linking them to observational space. The existing literature rarely explores the association between latent causal variables and observed ones, with only one notable work imposing a simplistic single-latent-factor decoding constraint. Our approach, in contrast, allows for a flexible entangling of latent factors, reflecting the complexity of real-world datasets. We introduce a structural sparsity pattern over generative functions and leverage induced grouping structures over observed variables for better model understanding. Our regularization technique, based on sparse subspace clustering over the Jacobian matrix of the decoder, promotes the sparsity and readability of model results. We apply our model to real-world datasets, including Saint-Gobain purchase data and MIMIC III medical data.
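As a rough illustration of the kind of regularization described above, the sketch below computes a finite-difference Jacobian of a toy decoder and an L1 penalty over it, which encourages each observed variable to depend on few latent factors. The decoder, shapes and penalty form are illustrative assumptions, not the paper's sparse subspace clustering procedure.

```python
import numpy as np

def decoder(z):
    """Toy decoder standing in for a learned generative function g: z -> x."""
    W = np.array([[1.0, 0.0, 0.0],
                  [0.8, 0.2, 0.0],
                  [0.0, 0.0, 1.5]])
    return np.tanh(W @ z)

def jacobian_sparsity_penalty(dec, z, eps=1e-4):
    """L1 penalty on the finite-difference Jacobian dx/dz, encouraging each
    observed variable to depend on few latent factors (structural sparsity)."""
    x0 = dec(z)
    J = np.zeros((x0.size, z.size))
    for j in range(z.size):
        dz = np.zeros_like(z)
        dz[j] = eps
        J[:, j] = (dec(z + dz) - x0) / eps
    return np.abs(J).sum(), J

penalty, J = jacobian_sparsity_penalty(decoder, np.array([0.3, -0.5, 1.0]))
print(penalty)
print(np.round(J, 2))  # grouping structure: which x_i depend on which z_j
```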
Additive Noise Models (ANMs) are a common model class for causal discovery from observational data. Due to a lack of real-world data for which an underlying ANM is known, ANMs with randomly sampled parameters are commonly used to simulate data for the evaluation of causal discovery algorithms. While some parameters may be fixed by explicit assumptions, fully specifying an ANM requires choosing all parameters. Reisach et al. (2021) show that, for many ANM parameter choices, sorting the variables by increasing variance yields an ordering close to a causal order and introduce ‘var-sortability’ to quantify this alignment. Since increasing variances may be unrealistic and cannot be exploited when data scales are arbitrary, ANM data are often rescaled to unit variance in causal discovery benchmarking. We show that synthetic ANM data are characterized by another pattern that is scale-invariant and thus persists even after standardization: the explainable fraction of a variable’s variance, as captured by the coefficient of determination R2, tends to increase along the causal order. The result is high ‘R2-sortability’, meaning that sorting the variables by increasing R2 yields an ordering close to a causal order. We propose a computationally efficient baseline algorithm termed ‘R2-SortnRegress’ that exploits high R2-sortability and that can match and exceed the performance of established causal discovery algorithms. We show analytically that sufficiently high edge weights lead to a relative decrease of the noise contributions along causal chains, resulting in increasingly deterministic relationships and high R2. We characterize R2-sortability on synthetic data with different simulation parameters and find high values in common settings. Our findings reveal high R2-sortability as an assumption about the data generating process relevant to causal discovery and implicit in many ANM sampling schemes. It should be made explicit, as its prevalence in real-world data is an open question. For causal discovery benchmarking, we provide implementations of R2-sortability, the R2-SortnRegress algorithm, and ANM simulation procedures in our library CausalDisco.
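A simplified sketch of the R2-sortability idea follows: each variable's R2 is computed by regressing it on all the other variables, variables are ordered by increasing R2, and each variable is then regressed on its predecessors to propose edges. This is a hedged toy version for intuition; the reference implementation of R2-SortnRegress lives in the CausalDisco library.

```python
import numpy as np

def r2_of(y, X):
    """R^2 of an ordinary least-squares regression of y on X (with intercept)."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return 1.0 - resid.var() / y.var()

def r2_sort_n_regress(X, threshold=0.1):
    """Order variables by R^2 against all other variables, then regress each
    variable on its predecessors to recover candidate edges (simplified)."""
    d = X.shape[1]
    r2 = [r2_of(X[:, i], np.delete(X, i, axis=1)) for i in range(d)]
    order = np.argsort(r2)                      # low R^2 first ~ upstream first
    W = np.zeros((d, d))
    for k, i in enumerate(order[1:], start=1):
        parents = order[:k]
        Xp = np.column_stack([np.ones(len(X)), X[:, parents]])
        beta, *_ = np.linalg.lstsq(Xp, X[:, i], rcond=None)
        W[parents, i] = np.where(np.abs(beta[1:]) > threshold, beta[1:], 0.0)
    return order, W

# Toy linear ANM over the DAG x0 -> x1, x0 -> x2, x1 -> x2
rng = np.random.default_rng(0)
n = 2000
x0 = rng.normal(size=n)
x1 = 2.0 * x0 + rng.normal(size=n)
x2 = 2.0 * x0 + 2.0 * x1 + rng.normal(size=n)
order, W = r2_sort_n_regress(np.column_stack([x0, x1, x2]))
print(order)           # typically [0 1 2]: R^2 increases along the causal order
print(np.round(W, 2))  # nonzero entries approximate the true edge weights
```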
Proprietary and closed APIs are becoming increasingly common for processing natural language, and are impacting the practical applications of natural language processing, including few-shot classification. Few-shot classification involves training a model to perform a new classification task with a handful of labeled data. This paper presents three contributions. First, we introduce a scenario where the embedding of a pre-trained model is served through a gated API with compute-cost and data-privacy constraints. Second, we propose transductive inference, a learning paradigm that has been overlooked by the NLP community. Transductive inference, unlike traditional inductive learning, leverages the statistics of unlabeled data. We also introduce a new parameter-free transductive regularizer based on the Fisher-Rao loss, which can be used on top of the gated API embeddings. This method fully utilizes unlabeled data, does not share any label with the third-party API provider, and could serve as a baseline for future research. Third, we propose an improved experimental setting and compile a benchmark of eight datasets involving multiclass classification in four different languages, with up to 151 classes. We evaluate our methods using eight backbone models, along with an episodic evaluation over 1,000 episodes, which demonstrates the superiority of transductive inference over the standard inductive setting.
We tackle the Few-Shot Open-Set Recognition (FSOSR) problem, i.e. classifying instances among a set of classes for which we only have a few labeled samples, while simultaneously detecting instances that do not belong to any known class. We explore the popular transductive setting, which leverages the unlabelled query instances at inference. Motivated by the observation that existing transductive methods perform poorly in open-set scenarios, we propose a generalization of the maximum likelihood principle, in which latent scores down-weighting the influence of potential outliers are introduced alongside the usual parametric model. Our formulation embeds supervision constraints from the support set and additional penalties discouraging overconfident predictions on the query set. We proceed with a block-coordinate descent, with the latent scores and parametric model co-optimized alternately, thereby benefiting from each other. We call our resulting formulation Open-Set Likelihood Optimization (OSLO). OSLO is interpretable and fully modular; it can be applied on top of any pre-trained model seamlessly. Through extensive experiments, we show that our method surpasses existing inductive and transductive methods on both aspects of open-set recognition, namely inlier classification and outlier detection. Code is available at https://github.com/ebennequin/few-shot-open-set.
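The following toy sketch illustrates the alternating (block-coordinate) principle only: class centroids are refined using the support set plus inlier-weighted query points, and latent inlierness scores are refreshed from distances to the nearest centroid. It is an assumption-laden simplification for intuition, not the OSLO likelihood.

```python
import numpy as np

def open_set_alternation(support, s_labels, query, n_iter=10):
    """Illustrative block-coordinate scheme (not the exact OSLO objective):
    alternate between (i) updating latent inlierness scores from distances to
    the nearest centroid and (ii) refining class centroids from the support
    set plus inlier-weighted query points."""
    classes = np.unique(s_labels)
    inlier = np.full(len(query), 0.5)                              # latent scores
    centroids = np.stack([support[s_labels == c].mean(0) for c in classes])
    for _ in range(n_iter):
        d = ((query[:, None, :] - centroids[None]) ** 2).sum(-1)   # (Q, C)
        # (i) latent scores: queries far from every centroid are down-weighted
        gap = d.min(1) - np.median(d.min(1))
        inlier = 1.0 / (1.0 + np.exp(np.clip(gap, -30, 30)))
        # (ii) parametric model: inlier-weighted assignments refine the centroids
        resp = np.exp(-np.clip(d, 0, 30)) * inlier[:, None]
        for k, c in enumerate(classes):
            mask = s_labels == c
            centroids[k] = ((support[mask].sum(0) + (resp[:, k:k + 1] * query).sum(0))
                            / (mask.sum() + resp[:, k].sum()))
    return inlier, d.argmin(1)

rng = np.random.default_rng(0)
support = np.concatenate([rng.normal(0, 1, (5, 2)), rng.normal(6, 1, (5, 2))])
s_labels = np.repeat([0, 1], 5)
query = np.concatenate([rng.normal(0, 1, (20, 2)), rng.normal(6, 1, (20, 2)),
                        rng.normal(-8, 1, (10, 2))])               # last 10 are outliers
scores, preds = open_set_alternation(support, s_labels, query)
print(np.round(scores[:3], 2), np.round(scores[-3:], 2))           # outliers get low scores
```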
The ever-growing complexity of industrial systems raises new scientific challenges for tasks related to predictive maintenance. In this article, we present a review of the methods used for fault diagnosis and point out their limitations in handling the multi-source, heterogeneous data characteristic of Industry 4.0. We formalize this new setting theoretically and propose StreaMulT, an architecture able to process multimodal, unaligned and arbitrarily long sequences in a streaming fashion.
Every day, a new method is published to tackle Few-Shot Image Classification, showing better and better performances on academic benchmarks. This is truly great news, yet we observe that these current benchmarks do not accurately represent the real industrial use cases that we encountered. In this work, through both qualitative and quantitative studies, we expose that the widely used benchmark \textit{tiered}ImageNet is strongly biased towards tasks composed of very semantically dissimilar classes, e.g. bathtub, cabbage, pizza, schipperke, and cardoon. This makes \textit{tiered}ImageNet (and similar benchmarks) irrelevant to evaluate the ability of a model to solve real-life use cases, which usually involve more fine-grained classification. We combat this bias using semantic information about the classes of \textit{tiered}ImageNet and generate an improved, balanced benchmark. Going further, we also introduce a new benchmark for Few-Shot Image Classification using the Danish Fungi 2020 dataset. This benchmark proposes a wide variety of evaluation tasks with varying degrees of fine-grainedness. Moreover, it includes \textit{many-way} tasks (e.g., composed of 100 classes), a challenging yet very common setting in industrial applications. Our experiments bring out the correlation between the difficulty of a task and the semantic similarity between its classes, as well as a heavy performance drop of state-of-the-art methods on many-way few-shot classification, raising questions about the scaling abilities of our models. We hope that our work will encourage the community to further question the quality of standard evaluation processes and their relevance to real-life applications.
This paper tackles the problem of efficiently processing and combining arbitrarily long data streams coming from different modalities with different acquisition frequencies. Common applications include, for instance, the long-term monitoring of industrial or real-life systems from multimodal heterogeneous data (sensor data, monitoring reports, images, etc.). To tackle this problem, we propose StreaMulT, a Streaming Multimodal Transformer relying on cross-modal attention and an augmented memory bank to process arbitrarily long input sequences at training time and run in a streaming way at inference time. StreaMulT reproduces state-of-the-art results on the CMU-MOSEI dataset, while being able to deal with much longer inputs than other models such as the earlier Multimodal Transformer.
Domain Invariant Representations (IR) have drastically improved the transferability of representations from a labelled source domain to a new and unlabelled target domain. However, Unsupervised Domain Adaptation (UDA) in the presence of label shift remains an open problem. To this purpose, we present a bound of the target risk which incorporates both weights and invariant representations. Our theoretical analysis highlights the role of inductive bias in aligning distributions across domains. We illustrate it on standard benchmarks by proposing a new learning procedure for UDA. We observe empirically that a weak inductive bias makes adaptation robust to label shift. The elaboration of stronger inductive biases is a promising direction for new UDA algorithms.
Unsupervised Domain Adaptation (UDA) aims to bridge the gap between a source domain, where labelled data are available, and a target domain only represented by unlabelled data. While domain invariant representations have dramatically improved the adaptability of models, guaranteeing their transferability remains a challenging problem. This paper addresses this problem by using active learning to annotate a small budget of target data. Although this setup, called Active Domain Adaptation (ADA), deviates from UDA's standard setup, a wide range of practical applications face this situation. To this purpose, we introduce Stochastic Adversarial Gradient Embedding (SAGE), a framework that makes a triple contribution to ADA. First, we select for annotation the target samples that are likely to improve the representations' transferability, by measuring the variation, before and after annotation, of the gradient of the transferability loss. Second, we increase sampling diversity by promoting different gradient directions. Third, we introduce a novel training procedure for actively incorporating target samples when learning invariant representations. SAGE is based on solid theoretical ground and validated on various UDA benchmarks against several baselines. Our empirical investigation demonstrates that SAGE takes the best of uncertainty and diversity samplings and substantially improves representation transferability.
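As a rough illustration of gradient-based selection with diversity, the sketch below greedily picks samples from per-sample gradient embeddings: the largest-norm gradient first, then points far from the already selected ones. The embeddings, the greedy rule and all names are illustrative assumptions and do not reproduce the exact SAGE criterion.

```python
import numpy as np

def select_for_annotation(grad_embeddings, budget):
    """Greedy selection over per-sample gradient embeddings: the first pick is
    the sample with the largest gradient norm (largest expected change of the
    transferability loss); subsequent picks maximize the distance to already
    selected gradients, promoting diverse gradient directions."""
    norms = np.linalg.norm(grad_embeddings, axis=1)
    selected = [int(norms.argmax())]
    for _ in range(budget - 1):
        d_to_selected = np.min(
            np.linalg.norm(grad_embeddings[:, None, :]
                           - grad_embeddings[selected][None, :, :], axis=-1),
            axis=1)
        d_to_selected[selected] = -np.inf        # never re-pick a selected sample
        selected.append(int(d_to_selected.argmax()))
    return selected

rng = np.random.default_rng(0)
grads = rng.normal(size=(500, 64))               # stand-in gradient embeddings
print(select_for_annotation(grads, budget=5))
```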
Few-Shot Learning (FSL) algorithms have made substantial progress in learning novel concepts with just a handful of labelled data. To classify query instances from novel classes encountered at test-time, they only require a support set composed of a few labelled samples. FSL benchmarks commonly assume that those queries come from the same distribution as instances in the support set. However, in a realistic setting, the data distribution is plausibly subject to change, a situation referred to as Distribution Shift (DS). The present work addresses the new and challenging problem of Few-Shot Learning under Support/Query Shift (FSQS), i.e., when support and query instances are sampled from related but different distributions. Our contributions are the following. First, we release a testbed for FSQS, including datasets, relevant baselines and a protocol for a rigorous and reproducible evaluation. Second, we observe that well-established FSL algorithms unsurprisingly suffer from a considerable drop in accuracy when facing FSQS, stressing the significance of our study. Finally, we show that transductive algorithms can limit the inopportune effect of DS. In particular, we study both the role of Batch-Normalization and Optimal Transport (OT) in aligning distributions, bridging Unsupervised Domain Adaptation with FSL. This results in a new method that efficiently combines OT with the celebrated Prototypical Networks. We present compelling experiments demonstrating the advantage of our method. Our work opens an exciting line of research by providing a testbed and strong baselines. Our code is available at this https URL.
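To illustrate how Optimal Transport can align query instances with class prototypes, here is a minimal numpy Sinkhorn sketch that couples query features to prototypes under uniform marginals and reads off soft class assignments. The cost, marginals and normalization are illustrative assumptions rather than the method proposed in the paper.

```python
import numpy as np

def sinkhorn(cost, eps=0.1, n_iter=200):
    """Entropic-regularized OT between uniform marginals (Sinkhorn iterations)."""
    n, m = cost.shape
    a, b = np.ones(n) / n, np.ones(m) / m       # uniform marginals
    K = np.exp(-cost / eps)
    v = np.ones(m) / m
    for _ in range(n_iter):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]          # transport plan, sums to 1

def ot_assign(query_feats, prototypes):
    """Soft class assignments: couple query points to class prototypes so that
    classes receive balanced mass, then normalize rows into probabilities."""
    cost = ((query_feats[:, None, :] - prototypes[None]) ** 2).sum(-1)
    cost = cost / cost.max()                    # scale for numerical stability
    plan = sinkhorn(cost)
    return plan / plan.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
protos = np.array([[0.0, 0.0], [5.0, 5.0]])
query = np.concatenate([rng.normal(0, 1, (10, 2)), rng.normal(5, 1, (10, 2))])
probs = ot_assign(query, protos)
print(np.round(probs[:3], 2), np.round(probs[-3:], 2))
```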
In this paper, we explore contrastive learning for few-shot classification, in which we propose to use it as an additional auxiliary training objective acting as a data-dependent regularizer to promote more general and transferable features. In particular, we present a novel attention-based spatial contrastive objective to learn locally discriminative and class-agnostic features. As a result, our approach overcomes some of the limitations of the cross-entropy loss, such as its excessive discrimination towards seen classes, which reduces the transferability of features to unseen classes. With extensive experiments, we show that the proposed method outperforms state-of-the-art approaches, confirming the importance of learning good and transferable embeddings for few-shot learning. Code: https://github.com/yassouali/SCL.
We propose here a generalization of regression trees, referred to as Probabilistic Regression (PR) trees, which adapt to the smoothness of the prediction function relating input and output variables while preserving the interpretability of the prediction and being robust to noise. In PR trees, an observation is associated with all regions of a tree through a probability distribution that reflects how far the observation is from each region. We show that such trees are consistent, meaning that their error tends to 0 when the sample size tends to infinity, a property that has not been established for similar previous proposals such as Soft trees and Smooth Transition Regression trees. We further explain how PR trees can be used in different ensemble methods, namely Random Forests and Gradient Boosted Trees. Lastly, we assess their performance through extensive experiments that illustrate their benefits in terms of performance, interpretability and robustness to noise.
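A didactic sketch of the soft-region idea is given below for a one-dimensional chain of thresholds: each observation belongs to every region with a probability obtained from a Gaussian kernel around its value, and the prediction is the membership-weighted average of the leaf values. This is a simplified stand-in, not the PR-tree construction itself.

```python
import numpy as np
from scipy.stats import norm

def soft_region_predict(x, splits, leaf_values, sigma=0.5):
    """Soft prediction for a 1-D partition defined by sorted thresholds:
    each threshold is crossed with a probability given by a Gaussian kernel
    around x, and the prediction is the membership-weighted average of the
    leaf values (a didactic simplification of PR trees)."""
    thresholds = np.asarray(splits)
    p_left = norm.cdf(thresholds, loc=x, scale=sigma)
    # Regions of a chain t_1 < t_2 < ... : (-inf, t1], (t1, t2], ..., (t_k, +inf)
    upper = np.append(p_left, 1.0)
    lower = np.append(0.0, p_left)
    membership = upper - lower                  # probability of each region
    return float(membership @ np.asarray(leaf_values))

# Hard tree would predict 1.0 below 0, 2.0 in (0, 3], 5.0 above 3
for x in [-2.0, 0.1, 2.9, 6.0]:
    print(x, round(soft_region_predict(x, splits=[0.0, 3.0],
                                       leaf_values=[1.0, 2.0, 5.0]), 3))
```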
Learning Invariant Representations to adapt deep classifiers from a source domain to a new target domain has recently attracted much attention. In this paper, we show that the search for invariance favors the compression of representations. We point out that this may negatively impact the adaptability of representations, expressed as a minimal combined domain error. By considering the risk of compression, we show that weighting representations can align representation distributions without impairing their adaptability. This supports the claim that representation invariance is too strict a constraint. First, we introduce a new bound on the target risk that reveals a trade-off between compression and invariance of learned representations. More precisely, our results show that the adaptability of a representation can be better controlled when the compression risk is taken into account. In contrast, preserving adaptability may lead to overestimating the risk of compression, which makes the bound impracticable. We support these statements with a theoretical analysis illustrated on a standard domain adaptation benchmark. Second, we show that learning weighted representations plays a key role in relaxing the constraint of invariance while controlling the risk of compression. Taking advantage of this trade-off may open up promising directions for the design of new adaptation methods.
Unsupervised Domain Adaptation (UDA) has attracted a lot of attention in the last ten years. The emergence of Domain Invariant Representations (IR) has drastically improved the transferability of representations from a labelled source domain to a new and unlabelled target domain. However, a potential pitfall of this approach, namely the presence of \textit{label shift}, has been brought to light. Some works address this issue with a relaxed version of domain invariance obtained by weighting samples, a strategy often referred to as Importance Sampling. From our point of view, the theoretical aspects of how Importance Sampling and Invariant Representations interact in UDA have not been studied in depth. In the present work, we present a bound of the target risk which incorporates both weights and invariant representations. Our theoretical analysis highlights the role of inductive bias in aligning distributions across domains. We illustrate it on standard benchmarks by proposing a new learning procedure for UDA. We observe empirically that a weak inductive bias makes adaptation more robust. The elaboration of stronger inductive biases is a promising direction for new UDA algorithms.
In this work, we propose a new unsupervised image segmentation approach based on mutual information maximization between different constructed views of the inputs. Taking inspiration from autoregressive generative models that predict the current pixel from past pixels in a raster-scan ordering created with masked convolutions, we propose to use different orderings over the inputs using various forms of masked convolutions to construct different views of the data. For a given input, the model produces a pair of predictions with two valid orderings, and is then trained to maximize the mutual information between the two outputs. These outputs can either be low-dimensional features for representation learning or output clusters corresponding to semantic labels for clustering. While masked convolutions are used during training, in inference, no masking is applied and we fall back to the standard convolution where the model has access to the full input. The proposed method outperforms current state-of-the-art on unsupervised image segmentation. It is simple and easy to implement, and can be extended to other visual tasks and integrated seamlessly into existing unsupervised learning methods requiring different views of the data.
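The sketch below shows, in numpy, the kind of mutual information objective that can be maximized between the soft cluster assignments produced for two views of the same inputs (e.g. two masked-convolution orderings). The shapes, the symmetrization step and the toy inputs are illustrative assumptions, not the exact training objective.

```python
import numpy as np

def mutual_information_objective(p1, p2, eps=1e-8):
    """Mutual information between the cluster assignments produced for two
    views of the same inputs. p1, p2: (N, K) soft assignments (rows sum to 1).
    Maximizing this value encourages predictions that are consistent across
    views yet not collapsed onto a single cluster."""
    joint = p1.T @ p2 / len(p1)                 # (K, K) empirical joint
    joint = (joint + joint.T) / 2.0             # symmetrize the two views
    marg_i = joint.sum(axis=1, keepdims=True)
    marg_j = joint.sum(axis=0, keepdims=True)
    return float((joint * (np.log(joint + eps)
                           - np.log(marg_i + eps)
                           - np.log(marg_j + eps))).sum())

rng = np.random.default_rng(0)
logits = rng.normal(size=(256, 10))
p = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
print(mutual_information_objective(p, p))                       # matched views: higher MI
print(mutual_information_objective(p, np.roll(p, 1, axis=0)))   # mismatched: near zero
```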
In this paper, we present a novel cross-consistency based semi-supervised approach for semantic segmentation. Consistency training has proven to be a powerful semi-supervised learning framework for leveraging unlabeled data under the cluster assumption, in which the decision boundary should lie in low density regions. In this work, we first observe that for semantic segmentation, the low density regions are more apparent within the hidden representations than within the inputs. We thus propose cross-consistency training, where an invariance of the predictions is enforced over different perturbations applied to the outputs of the encoder. Concretely, a shared encoder and a main decoder are trained in a supervised manner using the available labeled examples. To leverage the unlabeled examples, we enforce a consistency between the main decoder predictions and those of the auxiliary decoders, taking as inputs different perturbed versions of the encoder’s output, and consequently, improving the encoder’s representations. The proposed method is simple and can easily be extended to use additional training signal, such as image-level labels or pixel-level labels across different domains. We perform an ablation study to tease apart the effectiveness of each component, and conduct extensive experiments to demonstrate that our method achieves state-of-the-art results in several datasets. Code is available at https://github.com/yassouali/CCT.
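For intuition, the following numpy sketch assembles the loss structure only (no gradient updates): a supervised cross-entropy for the main decoder on labelled data, plus a consistency term asking auxiliary decoders fed perturbed encoder outputs to agree with the main decoder. The toy encoder, decoders, perturbations and weighting are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def encoder(x):                  # toy stand-in for the shared encoder
    return np.tanh(x @ W_enc)

def decode(z, W):                # toy linear decoder followed by a softmax
    logits = z @ W
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

D_in, D_z, K = 8, 16, 4
W_enc = rng.normal(size=(D_in, D_z)) * 0.1
W_main = rng.normal(size=(D_z, K)) * 0.1
W_aux = [rng.normal(size=(D_z, K)) * 0.1 for _ in range(3)]

x_lab, y_lab = rng.normal(size=(32, D_in)), rng.integers(0, K, 32)
x_unlab = rng.normal(size=(64, D_in))

# Supervised term: cross-entropy of the main decoder on labelled data.
p_lab = decode(encoder(x_lab), W_main)
sup_loss = -np.log(p_lab[np.arange(len(y_lab)), y_lab] + 1e-8).mean()

# Unsupervised term: auxiliary decoders see perturbed encoder outputs and
# must agree with the main decoder's predictions on the clean encodings.
z = encoder(x_unlab)
p_main = decode(z, W_main)
cons_loss = np.mean([
    ((decode(z + rng.normal(scale=0.1, size=z.shape), W) - p_main) ** 2).mean()
    for W in W_aux])

total_loss = sup_loss + cons_loss   # weighting and perturbations are illustrative
print(round(float(sup_loss), 3), round(float(cons_loss), 5))
```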
Tree-based ensemble methods, such as Random Forests and Gradient Boosted Trees, have been successfully used for regression in many applications and research studies. Furthermore, these methods have been extended in order to deal with uncertainty in the output variable, using for example a quantile loss in Random Forests (Meinshausen, 2006). To the best of our knowledge, no extension has yet been provided for dealing with uncertainties in the input variables, even though such uncertainties are common in practical situations. We propose here such an extension by showing how standard regression trees optimizing a quadratic loss can be adapted and learned while taking into account the uncertainties in the input. By doing so, one no longer assumes that an observation lies in a single region of the regression tree, but rather that it belongs to each region with a certain probability. Experiments conducted on several data sets illustrate the good behavior of the proposed extension.
In this work, an EM estimation algorithm for a Structural Equation Model (SEM) and its Latent Variables (LVs) is proposed. Unlike the more prominent Covariance-Based SEM (CBSEM) approach, this estimation is not based on the constrained estimation of the covariance structure of the data. Latent variables are considered as missing data, and the EM algorithm is used to maximize the likelihood of the entire model, simultaneously providing estimators of the model's coefficients and predictions of the LVs. Contrary to CBSEM, which does not take into consideration the structural part (the equations involving LVs exclusively) of the SEM for the prediction of the LVs, this EM approach considers the whole data and the complete model. In this context, when the LVs are latent factors, the SEM includes factorial models in addition to a structural part. Then, contrary to the EM algorithm for maximum likelihood factor analysis, this work extends the EM estimation in order to take into account the structural part of the SEM (the links between the latent factors). Through a simulation study, accuracy and algorithmic performance are investigated. The prevailing approaches, CBSEM and PLS-PM, are compared to the EM estimation according to different criteria. Finally, this approach is applied to a real environmental dataset, providing interesting conclusions.
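As a much-simplified illustration of treating latent variables as missing data, the sketch below runs EM for a single-factor model: the E-step computes the posterior moments of the factor given each observation, and the M-step re-estimates the loadings and noise variances. The structural part linking several latent factors, which is the contribution of this work, is deliberately left out; all names and the toy data are illustrative.

```python
import numpy as np

def em_one_factor(X, n_iter=100):
    """EM for a single-factor model x = lam * f + eps, with f ~ N(0, 1) and
    eps ~ N(0, diag(psi)). The latent factor is treated as missing data:
    the E-step computes E[f | x], the M-step re-estimates (lam, psi)."""
    n, p = X.shape
    lam = np.ones(p)
    psi = X.var(axis=0)
    S = X.T @ X / n                                       # sample covariance (X centred)
    for _ in range(n_iter):
        # E-step: posterior moments of the factor given each observation
        beta = lam / psi / (1.0 + (lam ** 2 / psi).sum()) # E[f | x] = beta @ x
        Ef = X @ beta                                     # (n,) posterior means
        Eff = 1.0 - beta @ lam + Ef ** 2                  # (n,) second moments
        # M-step: regression-style updates of loadings and noise variances
        lam = (X * Ef[:, None]).sum(axis=0) / Eff.sum()
        psi = np.maximum(np.diag(S) - lam * (X * Ef[:, None]).mean(axis=0), 1e-6)
    return lam, psi, Ef                                   # Ef ~ factor scores

rng = np.random.default_rng(0)
true_lam = np.array([1.0, 0.8, -0.6, 1.2])
f = rng.normal(size=500)
X = f[:, None] * true_lam + rng.normal(scale=0.3, size=(500, 4))
X = X - X.mean(axis=0)
lam_hat, psi_hat, scores = em_one_factor(X)
print(np.round(lam_hat, 2))   # close to true_lam up to a global sign flip
print(np.round(psi_hat, 2))   # roughly 0.09 (noise variance)
```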
Health-related quality of life (HRQoL) data are measured via patient questionnaires, completed by the patients themselves at different time points. We focused on oncology data gathered through the use of European Organization for Research and Treatment of Cancer questionnaires, which decompose HRQoL into several functional dimensions, several symptomatic dimensions, and the global health status (GHS). We aimed to perform a global analysis of HRQoL and reduce the number of analyses required by using a two-step approach. First, a structural equation model (SEM) was used for each time point; in these models, the GHS is explained by two latent variables. Each latent variable is a factor that summarizes the contribution to the global measurement of, respectively, the functional dimensions and the symptomatic dimensions. This is achieved through the maximization of the likelihood of each SEM using the EM algorithm, which has the advantage of giving an estimation of the subject-specific factors and of the influence of additional explanatory variables. Then, to account for the longitudinal aspect, the GHS variable and the two factors were concatenated over the visits at which each patient completed the questionnaire. The GHS and the two factors estimated in the first step can then be explained by additional explanatory variables using a linear mixed model.
Antoine Barbieri and Myriam Tami contributed equally to this work.
De Saporta B. (jury president), Saporta G. (reviewer), Saracco J. (reviewer), Trinchera L. (examiner), Lavergne C. (supervisor), Bry X. (co-supervisor)
Structural equation models enable the modeling of interactions between observed variables and latent ones. The two leading estimation methods are partial least squares on components and covariance-structure analysis. In this work, we first describe the PLS-PM and CBSEM methods and then propose an estimation method using the EM algorithm in order to maximize the likelihood of a structural equation model with latent factors. Through a simulation study, we investigate how fast and accurate the method is, and, through an application to real environmental data, we show how one can readily construct a model or evaluate its quality. Finally, in the context of oncology, we apply the EM approach to health-related quality-of-life data. We show that it simplifies the longitudinal analysis of quality of life and helps evaluate the clinical benefit of a treatment.
This R package is an implementation of the simultaneous EM estimation of a structural equation model and its latent factors. It fits structural equation models using the EM algorithm. More precisely, the model is a multi-block latent factor model with one structural equation, in which the latent variables are latent factors. The EM approach consists in viewing the latent factors as missing data and using the EM algorithm to maximize the whole model's likelihood, which simultaneously provides estimators not only of the model's coefficients, but also of the values of the latent factors.
Currently, before each construction project involving welding operations, TOTAL is required to carry out a welding procedure qualification and its validation. This procedure is based on expensive pre-qualification tests, which can last several months and sometimes yield unpredictable results. The goal of this work is to build a predictive model which could limit this preliminary phase, improve its success rate and thus make it less expensive. Using past Test Certificates, a database has been built. It includes input and output variables of heterogeneous nature, and each quantitative observation has been measured with an uncertainty. Expert knowledge from welding engineers allows one to make assumptions about the distributions that could be associated with each quantitative variable as well as their variances. The challenge is to develop a method to learn a predictive model of the welding quality, handling uncertainty measures and including all the relevant variables, quantitative or qualitative, as input. Random Forests, a popular ensemble method, make it possible to tackle part of these limitations (namely handling heterogeneous input variables) while yielding state-of-the-art predictions. To deal with uncertainty measures, this work proposes to extend Random Forests by replacing the hard partitioning into regions, at the basis of the classification/regression trees making up the forest, with a soft partitioning. The prediction is then based on a weighted average of the predictions over the regions, where the weights reflect the probability that a particular instance belongs to a region. We will present numerical and empirical results of the approach. Then we will discuss some theoretical results and guarantees, and perspectives on a quantile regression extension.
We propose an estimation method for a Structural Equation Model (SEM). It consists in viewing the Latent Variables (LVs) as missing data and using the EM algorithm to maximize the whole model's likelihood, which simultaneously provides estimates not only of the model's coefficients, but also of the values of the LVs. Through a simulation study, we investigate how fast and accurate the method is, and finally apply it to real data.
Ensemble methods are popular machine learning techniques which are powerful when one wants to deal with either classification or prediction problems. A set of estimators (regression or classification trees) is constructed, and the classification or the prediction of a new data instance is obtained by taking a weighted vote. A tree is a piece-wise constant estimator on partitions obtained from the data. These partitions are induced by recursive dyadic splits of the set of input variables. For example, CART (Classification And Regression Trees) [Breiman et al., 1984] is an efficient algorithm for the construction of a tree. The goal is to partition the space of input variable values into K disjoint regions that are as homogeneous as possible. More precisely, each partitioning value has to minimize a risk function. However, in practice, experimental measures can be observed with uncertainty. This work proposes to extend the CART algorithm to this kind of data. We present an induced model adapted to uncertain data, together with a prediction rule and a split rule for tree construction that take into account the uncertainty of each quantitative observation in the database.
Currently, a welding procedure qualification is based on expensive practical pre-qualification tests, such as, for example, mechanical tests. The goal of this work is to build a predictive model which could both replace these costly pre-qualification tests and be consistent with physico-chemical rules and welding engineers’ expert knowledge. Using past Test Certificates, a database has been built. The challenge is to design an appropriate model allowing one to predict the welding quality and including all the relevant variables, quantitative or qualitative, as input. But measuring the welding quality is a complex task which can be subdivided into different sub-tasks. We choose to focus on the welding mechanical quality. A preliminary statistical analysis based on linear regression modeling gave encouraging results, but it is limited to quantitative variables and is subject to some predictive quality issues (Saporta, 2006; Wikistat, 2016c). Ensemble methods make it possible to tackle these limitations. These methods take into account the predictions of several single estimators (decision trees) in order to improve the robustness and prediction performance with respect to a single estimator. There are two kinds of ensemble methods (Friedman et al., 2001): some based on “bagging”, such as Random Forests (Breiman, 2001; Genuer and Poggi, 2016; Liaw et al., 2002; Louppe, 2014), and others based on “boosting”, such as Gradient Boosted Trees (Freund et al., 1999; Ridgeway, 2007). We will present the predictive performances of several tested models among them (Wikistat, 2016b; Meyer et al., 2017; Wikistat, 2016a; Strobl et al., 2007; Kuhn, 2008; Ridgeway, 2017; Liaw and Wiener, 2015; Hothorn et al., 2017; Kuhn et al., 2017), in the case of the prediction of the Ultimate Tensile Strength variable. Then we will discuss the encouraging prospects regarding the development of a Gradient Boosted Trees based method for learning to predict welding mechanical quality from both quantitative and qualitative variables.
Health-related quality of life data are measured through self-questionnaires completed at different times. We focused on the oncology data reported through the EORTC questionnaires, which decompose health-related quality of life into several functioning dimensions, several symptomatic dimensions and the Global Health Status. The aim is to explain the latter, which represents the most general concept, through the other dimensions. First, a similar structural equation model is used at each time point, in which the global health status is explained by two latent variables. Each latent variable is a factor which summarizes, respectively, the functional dimensions and the symptomatic dimensions. This is achieved through the maximization of the likelihood of each structural equation model using the EM algorithm, with the advantage of giving an estimation of the subject-specific factors. Then, to consider the longitudinal aspect, the global health status variable and the two factors are concatenated for each visit. The global health status can then be explained by the two factors estimated in the first step and by additional explanatory variables using a linear mixed model. This model takes into account the inter-subject variability via subject-specific random effects and other covariates such as the treatment.