Paper reading: "Characterization of Overlap in Observational Studies"


"Characterization of Overlap in Observational Studies"
Fredrik D. Johansson, Dennis Wei, Michael Oberst, Tian Gao, Gabriel Brat, David Sontag, Kush R. Varshney

Doing causal inference requires overlap between the treated and untreated groups. Suppose we were testing (from observational data) a pill meant to improve a patient's memory; if all the pill-takers were old and the abstainers were all young, there would be little we could infer from their results on a memory test due to the lack of overlap in the "age" covariate.

Johansson et al. describe an algorithm for finding and describing the region of overlap in an observational study, in a form that is succinct and easily interpretable by a human being who is not a machine learning expert. A primary goal is to build guardrails for the application of learnings from observational studies:
When researchers publish the findings of a clinical trial, they also share the eligibility criteria (e.g., Age ≥ 18, Serum M protein ≥ 1g/dl or Urine M protein ≥ 200 mg/24 hrs, Recent diagnosis (National Cancer Institute, 2012)) .... We seek to provide the same for observational studies.
One common approach for dealing with lack of overlap in observational studies is to set thresholds on the propensity score, the probability of receiving a treatment. But a propensity score threshold is less valuable than a set of rules because it is uninterpretable—though it drops non-overlapping data on the cutting room floor, it doesn't provide introspection into where the overlap actually occurs.

Defining Overlap

How would we like to talk about overlap? The researchers describe three criteria:
(D.1) They cover regions where all populations (treatment groups) are well-represented. 
(D.2) They exclude all other regions, including those outside the support of the study.
(D.3) They can be expressed using a small set of simple rules.
To satisfy the first criterion, we can keep data whose propensity scores are "high-but-not-too-high", i.e., bounded away from both 0 and 1. To satisfy the second, we'd like to take a region that gives us the most bang for our buck in terms of probability mass; this can be formalized using the concept of an "$\alpha$-minimum-volume set". We want the minimum-volume set that gives us "enough" probability mass:
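In rough form (my notation, reconstructed from the description above, so it may differ from the paper's), a set $S$ is an $\alpha$-minimum-volume set of a distribution $P$ if

$$ S \in \arg\min_{S' \in \mathcal{S}} \left\{ \operatorname{vol}(S') \;:\; P(S') \ge 1 - \alpha \right\}, $$

that is, the smallest-volume set (within some class of candidate sets $\mathcal{S}$) that still captures at least a $1-\alpha$ fraction of the probability mass.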



Now take the intersection of the first region and the second. That's overlap!


Identifying the region of overlap

Our goal now is to come up with a set of boolean rules that fit either the "OR of ANDs" or the "AND of ORs" pattern. (These are also known as disjunctive and conjunctive normal forms, respectively. They're desirable because they're easily interpretable by our mediocre human logic engines: "(Parlor AND Candlestick) OR (Kitchen AND Lead Pipe)".)
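To make the "OR of ANDs" idea concrete, here's a minimal sketch of evaluating a DNF rule set over a record (the rule names and data are made up; this is not the paper's representation):

```python
# A DNF ("OR of ANDs") rule set: each inner list is an AND-clause of
# (feature, predicate) pairs; a record is covered if ANY clause is
# fully satisfied.

def covered(record, dnf_rules):
    return any(all(pred(record[feat]) for feat, pred in clause)
               for clause in dnf_rules)

rules = [
    [("room", lambda r: r == "Parlor"), ("weapon", lambda w: w == "Candlestick")],
    [("room", lambda r: r == "Kitchen"), ("weapon", lambda w: w == "Lead Pipe")],
]

print(covered({"room": "Parlor", "weapon": "Candlestick"}, rules))  # True
print(covered({"room": "Parlor", "weapon": "Lead Pipe"}, rules))    # False
```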

The algorithm for finding these rules, termed OverRule by the researchers, works as follows:
  • First we'll find boolean rules that approximate the desired probability support region
  • Then we'll fit the propensity-scoring model that defines membership in the bounded-propensity set
  • Finally we'll find rules that approximate the bounded-propensity region and estimate the overlap between the two rule-based regions
(Note: the researchers found via in-person evaluations with medical practitioners that "fitting rules for $\hat{S}$ and $B^\epsilon$ separately improved interpretability as it makes clear which rules apply to which task and prevents the bulk of the rules from being consumed by one of the two tasks.")
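A runnable toy version of the middle step, the bounded-propensity set $B^\epsilon$, might look like the following. Everything here is a stand-in I made up for illustration: synthetic one-dimensional "age" data, a plain logistic regression as the propensity model, and a simple interval in place of the learned boolean rules.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy data: one covariate (age); treated patients skew older.
age = rng.uniform(20, 90, size=2000)
p_treat = 1 / (1 + np.exp(-(age - 55) / 8))   # true propensity rises with age
t = rng.random(2000) < p_treat
X = age.reshape(-1, 1)

# Fit a propensity model and keep the bounded-propensity set
# B^eps = {x : eps < e(x) < 1 - eps}.
eps = 0.1
e = LogisticRegression().fit(X, t).predict_proba(X)[:, 1]
in_overlap = (e > eps) & (e < 1 - eps)

# A stand-in "rule" summarizing the overlap region as an age interval
# (OverRule would instead learn a boolean rule set over the covariates).
lo, hi = age[in_overlap].min(), age[in_overlap].max()
print(f"overlap rule: {lo:.0f} <= age <= {hi:.0f}")
```

The printed interval excludes both the very young (almost never treated) and the very old (almost always treated), which is exactly the memory-pill problem from the introduction.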

The estimation of these two sets is framed as an optimization problem:
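In rough form (my notation; the paper's exact formulation also penalizes rule complexity), the support-region problem is: minimize the volume of the rule set $C$ subject to it capturing enough probability mass,

$$ \min_{C \in \mathcal{C}} \; \operatorname{vol}(C) \quad \text{s.t.} \quad P(C) \ge 1 - \alpha, $$

where $\mathcal{C}$ is the class of boolean rule sets under consideration.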

But the volume might be hard to compute! So we estimate it by drawing samples from a uniform distribution over our covariates (the set $U$ below) and counting the number of samples that fall in our rule set. We can use a similar technique for estimating the probability mass $P(C)$: simply count the number of our $m$ observations that are captured by $C$.
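Here's a small sketch of both Monte Carlo estimates, using a made-up rule set over a 2-D unit box so the true volume is known:

```python
import numpy as np

rng = np.random.default_rng(0)

# Volume of a rule set C, estimated by uniform sampling over the
# covariate box [0, 1]^2. Example rule: "x0 < 0.5 AND x1 < 0.5"
# (true volume 0.25).
U = rng.uniform(0, 1, size=(100_000, 2))
in_C = (U[:, 0] < 0.5) & (U[:, 1] < 0.5)
vol_estimate = in_C.mean()

# Probability mass P(C): the fraction of the m observed points that
# the rule set captures.
obs = rng.normal(0.3, 0.2, size=(500, 2))
mass_estimate = ((obs[:, 0] < 0.5) & (obs[:, 1] < 0.5)).mean()

print(vol_estimate, mass_estimate)
```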

That sampling turns the above optimization problem into a "Neyman-Pearson-like classification problem":
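Concretely, with $U$ the uniform sample and $\{x_i\}_{i=1}^m$ the observations, the empirical problem looks something like (my notation, reconstructed from the description):

$$ \min_{C \in \mathcal{C}} \; \frac{1}{|U|} \sum_{u \in U} \mathbb{1}[u \in C] \quad \text{s.t.} \quad \frac{1}{m} \sum_{i=1}^{m} \mathbb{1}[x_i \in C] \ge 1 - \alpha, $$

that is, minimize the estimated volume while keeping at least a $1-\alpha$ fraction of the data covered: cap one error rate, minimize the other.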

The researchers derive a theoretical bound on the regret of this Neyman-Pearson classifier that depends on the dimension of the input data, the size of the dataset, and regularization parameters.

(I wasn't familiar with the Neyman-Pearson classification problem; it is a paradigm in which we set a tolerance on the false-positive rate [a.k.a. Type I error] and minimize the false-negative rate [Type II error]. Somewhat intuitively, the false-negative rate will be minimized at exactly the point where the false-positive rate is equal to our tolerance.)
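That intuition is easy to demonstrate with a threshold classifier on synthetic scores (data made up for illustration): choosing the threshold whose false-positive rate sits right at the tolerance $\alpha$ yields the smallest achievable false-negative rate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Scores for the two classes.
neg = rng.normal(0.0, 1.0, 5000)   # class 0: flagging these is a Type I error
pos = rng.normal(1.5, 1.0, 5000)   # class 1: missing these is a Type II error

alpha = 0.05                        # tolerated false-positive rate
# Smallest threshold whose FPR on negatives is (approximately) alpha...
thresh = np.quantile(neg, 1 - alpha)
fpr = (neg > thresh).mean()
fnr = (pos <= thresh).mean()        # ...which minimizes the FNR

print(f"threshold={thresh:.2f}  FPR={fpr:.3f}  FNR={fnr:.3f}")
```

Any stricter threshold would only push the false-negative rate higher, which is why the optimum sits exactly at the Type I tolerance.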

In a similar manner, once we've fit our propensity-scoring model over $\hat{S}$, we can reduce the problem of finding rules for $B^\epsilon$ to a Neyman-Pearson classification problem.

The resulting optimization target is intractable (it's an NP-hard integer program), but it can be approximated using an integer-programming technique called "column generation" that comes from the "Boolean Decision Rules" paper linked below. I skimmed through that part, but we'll try to tackle that paper in an upcoming reading.

Evaluation
The researchers evaluated their algorithm on several well-known datasets (Iris, LaLonde's Jobs) and some novel ones: an opioid prescription dataset and an antibiotic prescription dataset for treatment of UTIs. They compared OverRule to several classical methods of balance-checking in causal inference as well as to MaxBox, another research framework for identifying human-interpretable overlap. On the Iris dataset, OverRule proved to be more precise than MaxBox.

For a reader unfamiliar with rule-generation algorithms, the most interesting part of the experimentation section was reading the generated rulesets. They're intuitive and easy to digest, even for a reader without experience in medicine. The two-step nature of the algorithm is beneficial to rule interpretability: it allows us to mentally separate the support rules (where most of the probability mass lives) from the propensity-overlap rules (where both treatment and non-treatment are sufficiently likely).



Further reading

"Boolean Decision Rules via Column Generation" https://arxiv.org/pdf/1805.09901.pdf

Source code for the OverRule algorithm https://github.com/clinicalml/overlap-code
