
Challenges and an Empirical Evaluation Framework for Causal Inference with Natural Language

Using natural language, such as text, for causal inference has enormous potential for impactful research in computational social science and other domains.

The Fundamental Problem of Causal Inference is that we can never observe the counterfactual: for a given unit in a study, we can only ever observe the outcome under treatment or the outcome under no treatment, never both.

In observational studies, propensity scores can be used to estimate the counterfactual and adjust for observed confounders. When doing so with natural language, which is unstructured and extremely high-dimensional, the challenge is to build a propensity score model that effectively recovers confounders and covariates. However, evaluating these natural language propensity score models is itself extremely challenging, as ground-truth causal effects are rarely available.
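As a concrete illustration of the adjustment step, the sketch below estimates an average treatment effect by inverse propensity weighting, fitting a logistic propensity model on confounder features that would, in practice, be derived from text (e.g. bag-of-words counts). This is a minimal, self-contained example, not the models evaluated in the paper; the function name and gradient-descent fit are our own simplifications.

```python
import numpy as np

def ipw_ate(X, t, y, lr=0.1, steps=2000):
    """Estimate the ATE by inverse propensity weighting.

    X: (n, d) confounder features derived from text (e.g. bag-of-words counts)
    t: (n,) binary treatment indicator
    y: (n,) observed outcome
    """
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    # Fit a logistic-regression propensity model P(T=1 | X) by gradient descent.
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        grad = p - t
        w -= lr * (X.T @ grad) / n
        b -= lr * grad.mean()
    e = np.clip(1.0 / (1.0 + np.exp(-(X @ w + b))), 0.01, 0.99)  # trim extremes
    # Hajek (self-normalized) IPW estimate of E[Y(1)] - E[Y(0)].
    y1 = np.sum(t * y / e) / np.sum(t / e)
    y0 = np.sum((1 - t) * y / (1 - e)) / np.sum((1 - t) / (1 - e))
    return y1 - y0
```

On simulated confounded data with a known effect, this weighted estimate recovers the true ATE where the naive difference in group means does not.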

In this work, we present a series of six tasks that together compose a principled empirical evaluation framework for causal inference with natural language.

What do we contribute?


Six Evaluation Tasks

Our primary contribution is a series of six tasks, each inspired by a real challenge in causal inference with natural language. Each task has a series of levels of increasing difficulty, allowing us to evaluate the specific strengths and weaknesses of different models.

Neural Network

SHERBERT Model

SHERBERT is a novel cauSal HiERarchical variant of BERT, which effectively scales to long text inputs by using multiple pre-trained BERT models to produce an intermediate embedding.

Paper (pre-print available on arXiv)

[PDF]

Abstract: Leveraging text, such as social media posts, for causal inferences requires the use of NLP models to ‘learn’ and adjust for confounders, which could otherwise impart bias. However, evaluating such models is challenging, as ground truth is almost never available. We demonstrate the need for empirical evaluation frameworks for causal inference in natural language by showing that existing, commonly used models regularly disagree with one another on real world tasks. We contribute the first such framework, generalizing several challenges across these real world tasks. Using this framework, we evaluate a large set of commonly used causal inference models based on propensity scores and identify their strengths and weaknesses to inform future improvements. We make all tasks, data, and models public to inform applications and encourage additional research.

Tasks

All code used for this publication, and datasets for each task, are available on GitHub.

Each task is designed to evaluate a model's ability to handle one of six different challenges in natural language. Tasks have multiple levels of difficulty, allowing us to see how well models handle each challenge.

Tasks consist of semi-synthetic data, generated by inserting synthetic comments into real Reddit users' post histories according to a known probability distribution. This allows us to control the true average treatment effect (ATE), and therefore empirically evaluate the bias of each model on each task. Semi-synthetic data provides the best of both worlds: the realism of real-world data, with the principled structure and known ATE of synthetic data.
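To make the construction concrete, here is a hypothetical sketch of how semi-synthetic data of this kind can be built: a synthetic "treatment" comment is inserted into a real post history with a probability that depends on a confounding attribute, and outcomes are simulated so the true ATE is known by construction. The function name, probabilities, and outcome model are illustrative assumptions, not the paper's actual pipeline.

```python
import random

def make_semi_synthetic(histories, confounder, true_ate=1.0, seed=0):
    """Insert synthetic treatment comments into real post histories.

    histories:  list of post lists (real user post histories)
    confounder: list of booleans, a text-derived confounding attribute
    true_ate:   the treatment effect built into the simulated outcomes
    """
    rng = random.Random(seed)
    rows = []
    for posts, c in zip(histories, confounder):
        # Treatment probability depends on the confounder, inducing selection.
        p_treat = 0.8 if c else 0.2
        treated = rng.random() < p_treat
        if treated:
            posts = posts + ["<synthetic treatment comment>"]
        # Outcome depends on both treatment and confounder, plus noise,
        # so the true ATE is known while naive estimates are biased.
        y = true_ate * treated + 2.0 * c + rng.gauss(0, 0.1)
        rows.append({"posts": posts, "treated": treated, "outcome": y})
    return rows
```

Because the true ATE is fixed by construction, a model's bias on the task is simply its estimated ATE minus `true_ate`, while the naive difference in mean outcomes is inflated by the confounder.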

A description of each task is below. For more details, refer to the paper.

Linguistic Complexity

Models should be able to recognize the joint importance of multiple phrases.

Signal Intensity

Models should be able to detect weak signals.

Order of Text

Models should be sensitive to the order in which posts appear.

Strength of Selection Effect

Models should be able to adjust for confounding, even when there is reduced overlap in the distribution of language between treated and untreated users.

Number of Users

Models should be able to perform well even with limited training data.

Absence of Non-Zero Treatment Effect

Models should not report a treatment effect when none is present.

Diagram of SHERBERT Model

SHERBERT Model

Our implementation of the SHERBERT model is available on GitHub.

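The hierarchical idea can be sketched as follows: each post in a long history is encoded separately by a segment encoder (a stand-in for a pre-trained BERT here), and the segment embeddings are pooled into one intermediate user-level embedding that feeds a logistic propensity head. The encoder, pooling choice, and dimensions below are our own illustrative assumptions, not the published architecture; see the GitHub implementation for the real model.

```python
import numpy as np

DIM = 8  # assumed embedding size for illustration only

def segment_encode(text, dim=DIM):
    """Stand-in for a pre-trained BERT: a fixed pseudo-embedding per segment."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.normal(size=dim)

def hierarchical_embed(posts):
    """Encode each post separately, then mean-pool the segment embeddings
    into a single intermediate embedding for the whole post history."""
    segments = np.stack([segment_encode(p) for p in posts])
    return segments.mean(axis=0)

def propensity(posts, w, b=0.0):
    """Propensity score from the pooled embedding via a logistic head."""
    z = hierarchical_embed(posts) @ w + b
    return 1.0 / (1.0 + np.exp(-z))
```

The point of the hierarchy is that no single encoder call ever sees more than one post, so the approach scales to histories far longer than one encoder's input limit.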

Acknowledgements

We would like to thank Adobe Research for their support of this work.