It was recently brought to my attention by several people that Andrew Gelman, a statistician at Columbia whose blog is a must-read for anyone interested in applied statistics, had discussed my critique of Flaxman et al.’s paper on the effectiveness of non-pharmaceutical interventions during the first wave of the COVID-19 pandemic in Europe. He first wrote a pretty detailed comment in response to Nathanael Schilling, who linked to my blog post in a comment where he criticized Flaxman et al.’s paper. Here is the first part of this comment:
I read the linked post by Lemoine and I can’t understand why you’re so harsh regarding Flaxman et al. I agree with several of Lemoine’s points, and I think Flaxman et al. would agree too: the model is indeed a variant of SEIR, there are indeed many real-world complexities that are not included in the model, in reality all coefficients should vary across countries and over time—this involves the usual compromises that are made when fitting statistical models to data—any measured changes are correlational not necessarily causal, policies vary across countries, there’s lots of variation in behavior and exposure within any country, and it’s impossible to pin down the effects of policy in a setting where people are continually changing their behavior on their own. That last issue is key, and I see this even with my own behavior: back in March we were so scared that we were bringing wipes with us when we went outside so as to swab the supermarket carts. Now we’re not doing that, but there are people outside wearing masks while jogging. The number of human contacts we make is somewhat under our control. There was all sorts of fuss about Thanksgiving gatherings, but you can end up in close proximity to people in a simple subway ride.
I think the way to interpret results from this sort of model is as statistical averages. I went back to the Flaxman et al. paper and this is what they say in their abstract:
We study the effect of major interventions across 11 European countries for the period from the start of the COVID-19 epidemics in February 2020 until 4 May 2020 . . . Our model relies on fixed estimates of some epidemiological parameters (such as the infection fatality rate), does not include importation or subnational variation and assumes that changes in Rt are an immediate response to interventions rather than gradual changes in behaviour. . . . We estimate that—for all of the countries we consider here—current interventions have been sufficient to drive Rt below 1 (probability Rt < 1.0 is greater than 99%) and achieve control of the epidemic. . . . Our results show that major non-pharmaceutical interventions—and lockdowns in particular—have had a large effect on reducing transmission. Continued intervention should be considered to keep transmission of SARS-CoV-2 under control.
This all seems reasonable to me, except for near the end when they say, “Our results show . . .” They really should’ve said, “Our fitted model implies . . .” or “Our model estimates . . .” or “The data are consistent with the conclusion that . . .”
Other than that, they’re pretty clear. Indeed, right in the abstract they state that their model (a) “relies on fixed estimates of some epidemiological parameters,” (b) “does not include importation or subnational variation,” and (c) “assumes that changes in Rt are an immediate response to interventions rather than gradual changes in behaviour.” These are all points that Lemoine makes, too, but it would help if he were to clarify that these are not any kind of secret—Flaxman et al. say them all right in their abstract!
Now, a flaw is a flaw. Again, the model is far from perfect, and stating a flaw in the abstract of the paper does not suddenly then make the model correct. My point here is that no sleuthing was needed to find these problems. The problems in the model are there for the usual reasons, which is that there’s a limit to what can be learned from any particular dataset. Policies and behaviors are entangled, and so any estimate will be some sort of average. Again, that’s implied by the Flaxman et al. abstract but I do wish they’d not said “Our results show” near the end.
There is a lot I agree with in this part of his comment, so I don’t want to spend too much time on it, but I still don’t agree with everything.
Specifically, I disagree that Flaxman et al. were clear enough about what their analysis shows and, more importantly, about what it does not show. I suppose even Gelman doesn’t really disagree on that point, since he criticizes the language they used at the end of their abstract, but I don’t think his criticism was strong enough. Of course, I have no doubt that Gelman and people who are sufficiently familiar with statistical modeling understand the limitations of this kind of work, but most people don’t, and I think that includes many scientists. Anyone who doesn’t believe me can type the url of Flaxman et al.’s paper in the search bar on Twitter and have a look at the results. What you will find is thousands of people, many of them scientists, citing this paper as proof that lockdowns are incredibly effective. I just tried and here is the first result that search returned:

[embedded tweet]

Note that Pakdel is not some random guy; he is a physician with a PhD. So if he thought Flaxman et al.’s analysis was conclusive, imagine what people with no statistical training must conclude from reading the abstract of that paper. (Not, to be clear, that I have a very high opinion of physicians, even when they have a PhD.) Again, this is literally the first result that search returned, but I’ve seen hundreds of scientists, to say nothing of regular people, citing Flaxman et al.’s paper uncritically.
Now, to be fair to Flaxman et al., it’s true that, no matter how careful you are and how many caveats you include, there will always be people who overinterpret your results. But the fact is that, as even Gelman acknowledges, they were not careful enough in writing the abstract of their paper. In my opinion, they should have known that, in the context in which this paper was published, it would be read by many people, such as journalists but not only them, who wouldn’t understand how serious the limitations mentioned in the abstract were. The truth is that, unless you explain the import of those limitations, most people will not grasp it. Thus, not only do I think that Flaxman et al. should not have included the sentence in which they claim to have shown that non-pharmaceutical interventions, and lockdowns in particular, had dramatically reduced transmission, but I think they should have explicitly noted that it was difficult to interpret their results causally. Because they failed to do that, the vast majority of people who read this abstract took away that non-pharmaceutical interventions, and lockdowns in particular, had definitely saved millions of lives, which the paper doesn’t show. In fact, it’s worse than that, because I think the full results actually make that claim extremely unlikely, which brings me to the second part of Gelman’s response.
This is the part where he talks about my point on the country-specific effect in Sweden, which in my opinion is the most important point. Here is the rest of Gelman’s comment about my post, where he addresses that point:
The other thing Lemoine says is that they’re hiding data from Sweden. I doubt that “they swept that fact under the rug.” They had tons of results and I’m not sure how they decided what to include in the paper. Lemoine writes: “What I don’t understand, or would not understand if I didn’t know how peer review actually works, is that no reviewer asked for it.”
I’ve seen a lot of peer review over the years, and I think the best way to summarize peer review is that (a) it tends to be results-oriented rather than methods-oriented (even when reviewers criticize methods it’s often because they don’t believe the claimed results) and (b) it’s random. A paper can have obvious flaws that don’t get caught; other times, reviewers will obsess over something trivial. In this case, I expect that (a) the reviewers believed the result so they wanted to see the paper published, and (b) they probably did suggest some alternative analyses, but I’m guessing that the journal was also putting lots of pressure on the authors to keep the paper short. I’m not sure what restrictions were on the supplementary material, or why they didn’t include a few hundred more graphs there. But I really really doubt that they were trying to hide something. I was giving them some advice when they were doing this work—I’m not sure if my advice was about this particular paper, but in any case I’ve known Flaxman for a while, ever since he spent some time working with our group a few years ago. What was happening is they were acutely aware of the statistical challenges in estimating the effects of policies in this observational setting, also they were acutely aware of the flaws in their model—which is why they mentioned these flaws right in their abstract—and they were doing their best. If they really thought the country-specific effect for Sweden didn’t make sense, I think they would’ve highlighted this point, not hidden it.
In any case, if you take out some of the speculations and rhetoric, I think it’s good that Lemoine did simulations, as that’s a great way to understand what a model is doing. As I wrote a few months ago, the model is transparent, so you should be able to map back from inferences to assumptions.
Honestly, I was taken aback by this response; it feels like a sort of scientific gaslighting to me.
To be honest, it feels a bit awkward to criticize Gelman, because whatever understanding of statistics I have owes a lot to his blog and his published work, but I nevertheless think he is side-stepping the substantive issue here. I refuse to take back anything I said about why Flaxman et al. did not discuss this country-specific effect, because I still think I was right and I don’t want to pretend that I have changed my mind. But in the interest of a productive discussion, let’s set that issue aside and focus on whether they should have discussed the country-specific effect in Sweden, regardless of why they didn’t. In effect, Gelman is saying that Flaxman et al. had lots of results, that choices had to be made about what to discuss, and that they didn’t think this particular result was worth discussing. Strictly speaking, that only addresses why they didn’t discuss the country-specific effect in Sweden, so again let’s set it aside and focus on whether it made sense not to discuss it. I think it’s absolutely indisputable that it should have been discussed; it’s just not something reasonable people can disagree about, yet Flaxman et al. did not even mention it.
Remember that, in their model, Flaxman et al. assumed that non-pharmaceutical interventions had the same effect in every country, except for the last intervention that was allowed to have a different effect in each country. The last intervention was a lockdown everywhere except in Sweden, which didn’t lock down, where it was a ban on public events. Because of this country-specific effect, their model found that banning public events had reduced transmission by ~72.2% in Sweden, but only by ~1.6% everywhere else. Moreover, according to the prior they used for the country-specific effect, the probability that it would be that large was only ~0.025%. So if Flaxman et al. think it was okay not to mention this small detail, they should be willing to say that it’s possible that, for some mysterious reasons, banning public events was almost 45 times more effective in Sweden than everywhere else. If they don’t believe that it’s possible, which of course they don’t because it’s crazy, then it undermines their whole analysis since it means their model was badly misspecified, so they obviously should have discussed that. Even if you think they just didn’t realize the extent to which it undermined their conclusion, you should at least acknowledge that they were wrong about that. It’s simply not serious to reply that they had lots of results and that choices had to be made.
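To make the arithmetic concrete, here is a minimal sketch in Python, assuming the multiplicative effect structure Flaxman et al.’s model uses (an intervention with effect parameter alpha scales Rt by exp(-alpha), so its implied reduction in transmission is 1 - exp(-alpha)); the code and the variable names are mine, with the reductions taken from the figures just quoted:

```python
import math

# Reductions in transmission attributed to the ban on public events,
# as quoted above (these are the model's estimates, not my numbers).
reduction_sweden = 0.722     # Sweden (pooled effect + country-specific effect)
reduction_elsewhere = 0.016  # everywhere else (pooled effect only)

# Implied effect parameters on the log scale, under reduction = 1 - exp(-alpha)
alpha_elsewhere = -math.log(1 - reduction_elsewhere)  # ~0.016
alpha_sweden = -math.log(1 - reduction_sweden)        # ~1.28

# The country-specific effect has to absorb essentially the whole difference
beta_sweden = alpha_sweden - alpha_elsewhere          # ~1.26

print(f"ratio of implied reductions: {reduction_sweden / reduction_elsewhere:.1f}")  # ~45
```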
In fact, even Flaxman et al. now acknowledge this point, though only reluctantly. As Gelman explained in another post, after reading my post he wrote to Flaxman to ask him about my critique. Flaxman explained that, as it happened, a Swedish team had written a critique of Flaxman et al.’s paper making points similar to mine, which Nature published a few days ago along with a response by Flaxman et al. (Incidentally, the critique was apparently sent to Nature only a week after the publication of Flaxman et al.’s paper, but it took six months before it was finally published, by which time the paper had already been cited more than 500 times. So while it’s good that the critique was eventually published, I think this illustrates the problems with pre-publication peer review, which I had already mentioned in my post about Flaxman et al.’s paper.) Here is the passage in their response where they acknowledge that the country-specific effect in Sweden should have been mentioned:
The focus of Soltesz et al. is the size of the random effect assigned by our model to the last intervention in Sweden. Specifically, a large random effect is needed to explain the Swedish data, and this could have been more explicitly stated in our original paper. Soltesz et al. claim that the difference between effect sizes in a full pool model and in Flaxman et al. points to our model having little practical statistical identifiability. On this basis, Soltesz et al. question whether the effectiveness of lockdown can be resolved to the degree our paper stated.
To say that it “could have been more explicitly stated” in the paper is really the understatement of the year, since they didn’t mention it at all in the paper or even in the supplementary material. More importantly, their reply still fails to acknowledge how much this undermines their conclusion.
After this passage, they defend their use of a partial pooling approach, i.e., allowing the effect of the last intervention to vary by adding a country-specific effect to the model, instead of assuming the effect of every intervention is exactly the same everywhere (full pooling) or running the model on each country separately (no pooling). They argue that, even if you use the full pooling approach or run the model separately on each country, lockdown is still the only intervention that consistently has a significant effect; partial pooling simply allows you to overcome the limitations of the data, due in particular to the fact that, in most countries, the interventions are closely spaced in time and sometimes even coincide exactly (I give a toy illustration of this identifiability problem below). But that just misses the point, which is that, since the epidemic followed a similar trajectory in Sweden as in other countries despite the absence of a lockdown, their conclusion that only lockdowns had a meaningful effect on transmission is overwhelmingly unlikely to be true. At the very least, even if lockdowns did most of the work in most countries, it’s clear that other, less stringent interventions plus more or less spontaneous behavioral changes would have done a similar job even in the absence of a lockdown, as they evidently did in Sweden.
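As promised, here is a toy illustration of the identifiability problem created by coincident interventions (a plain regression stand-in of my own, not their semi-mechanistic model): when two interventions take effect on the same day, their indicators are identical, so the data can only determine the sum of their effects, never how that sum is split between them.

```python
import numpy as np

days = np.arange(60)

# Two interventions that take effect on the same day, as happened in
# some of the 11 countries: their step indicators are identical columns,
# so any model driven by these covariates can only identify the SUM of
# the two effects, not the split between them.
x1 = (days >= 20).astype(float)  # e.g. school closure
x2 = (days >= 20).astype(float)  # e.g. ban on public events

X = np.column_stack([np.ones(days.size), x1, x2])
print(np.linalg.matrix_rank(X))  # 2, not 3: the split is unidentified

# Partial pooling "resolves" this by borrowing the split from countries
# where the two interventions were not simultaneous.
```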
It’s not enough to say, as Flaxman et al. do in their response, that “additional covariates beyond the timing of mandatory measures are likely to be needed to provide a fully satisfactory explanation of the trajectory of the epidemic in Sweden”. Indeed, unless you believe there are magical faeries in Sweden that made banning public events 45 times (not 50% or even 3 times) more effective than in other European countries, the problem is deeper than that. The right conclusion from this observation is not that you need some additional covariates in your model to explain what happened in Sweden; it’s that your model is misspecified in a more fundamental way and that you can’t trust your conclusion. As I argued in my post, if their model found that only lockdowns had a meaningful effect on transmission, it’s probably because it tends to ascribe most of the reduction in transmission to whatever intervention took place after the peak of infections and was closest to that peak or, if no intervention took place after the peak, to the last intervention; in the data, that intervention happened to be a lockdown in most countries.
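That tendency is easy to reproduce in miniature. Here is a toy example of my own (again a least-squares stand-in, not their Bayesian renewal model) in which transmission falls entirely because of a gradual behavioral change, yet a model offered only step-function interventions credits most of the fall to the step that happens to sit closest to the drop:

```python
import numpy as np

days = np.arange(90)

# True driver: a gradual, voluntary behaviour change centred on day 35,
# which takes Rt from ~3.5 down to below 1. No intervention causes it.
behaviour = 1 / (1 + np.exp(-(days - 35) / 3))
log_rt = np.log(3.5) - 1.3 * behaviour

# The model is only offered two step-function interventions.
x_events = (days >= 25).astype(float)    # ban on public events, well before the drop
x_lockdown = (days >= 34).astype(float)  # lockdown, right at the drop

X = np.column_stack([np.ones(days.size), x_events, x_lockdown])
(_, a_ev, a_ld), *_ = np.linalg.lstsq(X, log_rt, rcond=None)

# Most of the (entirely behavioural) fall in transmission is credited to
# the lockdown, simply because its timing matches the drop:
# prints roughly "events: 17%, lockdown: 65%".
print(f"events: {1 - np.exp(-a_ev):.0%}, lockdown: {1 - np.exp(-a_ld):.0%}")
```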
In their response, Flaxman et al. show with simulated data that their model was not bound to find that only lockdowns had a meaningful effect on transmission:
We used our model to simulate synthetic epidemics for all 11 countries, keeping the original timing and ordering of interventions and the same initialization priors, but assigning hypothetical effect sizes to each intervention. We assigned small effect sizes (5% with a tight prior) to all but one NPI, giving the remaining one an effect size with a mean of 59%, also with a tight prior, across countries. In addition, to better reflect reality, we simulate another, country-varying NPI, at a random time, which we treat as unobserved in our model. This unknown and unobserved NPI has a diffuse prior bounded between 0% and 100%, with a mean of 27%, and it is included to assess whether an omitted variable (for example, representing spontaneous behaviour change in response to government messaging) could bias the effect-size estimates of our modelled NPIs. We keep the dates for NPIs the same as the ones in the real data to account for concerns raised about the possible effects of coincident timing on the identifiability of effect sizes.
Next, we fitted the Flaxman et al. model to these simulated datasets (20 different simulations for each setting). As shown in Fig. 2, the estimates from the Flaxman et al. fitted model (without any information about the unobserved NPI) are in agreement with the NPI effect sizes that were used to generate the data. This analysis provides further evidence that the results we found were not merely artefacts of the modelling approach; if there is a strong signal in the data for a specific NPI, our model can recover it.
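To give a sense of what such a recovery check looks like in miniature, here is a toy analogue of my own, using a simple log-linear Rt series instead of their full semi-mechanistic model across 11 countries, and leaving out their unobserved-NPI wrinkle: simulate data from known effect sizes, refit, and check that the assigned sizes come back.

```python
import numpy as np

rng = np.random.default_rng(1)
days = np.arange(90)

# Known effect sizes, mirroring the ones they assign: one NPI with a
# small effect (5% reduction) and one with a large effect (59%).
a_small = -np.log(1 - 0.05)
a_large = -np.log(1 - 0.59)

x_small = (days >= 25).astype(float)
x_large = (days >= 40).astype(float)

# Simulate a noisy log-Rt series from those effects, then refit.
log_rt = (np.log(3.5) - a_small * x_small - a_large * x_large
          + rng.normal(0, 0.05, days.size))

X = np.column_stack([np.ones(days.size), x_small, x_large])
(_, b_small, b_large), *_ = np.linalg.lstsq(X, log_rt, rcond=None)
print(f"recovered: {1 - np.exp(-b_small):.2f} (true 0.05), "
      f"{1 - np.exp(-b_large):.2f} (true 0.59)")
```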
However, I never said that their model was bound to find that only lockdowns had a meaningful effect on transmission even if that were not in fact the case. What I said is that it could easily reach this conclusion for purely mathematical reasons if the interventions were timed in a certain way relative to the epidemic trajectory, even if the data generating process was in fact inconsistent with that conclusion. In fact, that’s precisely what the country-specific effect in Sweden makes extremely likely, which is why it should obviously have been discussed in their paper.
To be clear, I’m not saying that Flaxman et al., who I’m sure know more statistics than I do, don’t understand this methodological point. On the contrary, I have no doubt they do, especially since they explicitly make that point in their response to Soltesz et al.’s letter to Nature:
However, this does not on its own show that the converse [of the claim that “if there is a strong signal in the data for a specific NPI, our model can recover it” in the passage I quoted above] is necessarily true. To evaluate competing explanations for the observed dynamics of transmission, additional empirical evidence—such as NPI efficacy or alternative epidemiological explanations—is needed.
What I’m saying is that, as I noted above, the epidemic trajectory in Sweden actually constitutes such additional empirical evidence against their conclusion. As I said in my critique of their paper, “the truth is that, with the data and methods they used, it’s impossible to estimate the effect of non-pharmaceutical interventions”. However, we don’t need complicated statistical methods to know that their conclusion is extremely implausible, at least if it’s interpreted as the claim that interventions other than a lockdown would not have meaningfully reduced transmission even if a lockdown had not been implemented (which is how it was interpreted by everyone); we just need to look at what happened in Sweden.
> Anyone who doesn’t believe me can type the url of Flaxman et al.’s paper in the search bar on Twitter and have a look at the results. What you will find is thousands of people, many of them scientists, citing this paper as proof that lockdowns are incredibly effective. I just tried and here is the first result that search returned:
That’s a terrible way to assess this issue. Type in the Imperial College study and you find tons of responses where people ignore the conditional aspect of their projections. Reaction in the Twitterverse is not a direct function of the scientific attributes of the language used by researchers.
That isn’t to say that criticism of how they frame uncertainty isn’t reasonable, or that it’s invalid to question how researchers parameterize uncertainty. But judging through reverse-engineering from responses among highly polarized fanatics on Twitter is just an exercise in confirmation bias.
>… at least if it’s interpreted as the claim that interventions other than a lockdown would not have meaningfully reduced transmission even if a lockdown had not been implemented (which is how it was interpreted by everyone), we just need to look at what happened in Sweden.
I’m having trouble parsing this to figure out what it means. Lots o’ negatives to unwind.
Are you saying that what happened in Sweden shows that interventions absent a “lockdown” (whatever that means) meaningfully reduce transmission? If that’s right, how so? What interventions implemented in Sweden absent a “lockdown” reduced transmission? Limiting public gathering size? (The whole definition of “lockdown” seems so arbitrary to me as to render most these discussions pretty useless.)
“Are you saying that what happened in Sweden shows that interventions absent a “lockdown” (whatever that means) meaningfully reduce transmission?”
The post is about Flaxman et al.’s paper, more precisely about a critique of it. The question is not what Lemoine says but what the model itself seems to say. What Lemoine says is that one of the results of Flaxman’s model is that, in Sweden, interventions other than lockdowns created a large reduction in transmission. Lemoine argues that this result, which comes from the model itself, contradicts the interpretation Flaxman et al. made of it.
I don’t think the definition of lockdown is relevant when discussing Sweden vs. other countries. After all, there is no definition of “lockdown” that would allow us to say Sweden was a “lockdown country”. At worst, changing the definition of lockdown could only recategorize a “lockdown country” as a non-lockdown country, so it could only work against lockdown efficacy.
> I don’t think the definition of lockdown is relevant when discussing Sweden vs. other countries. After all, there is no definition of “lockdown” that would allow us to say Sweden was a “lockdown country”. At worst, changing the definition of lockdown could only recategorize a “lockdown country” as a non-lockdown country, so it could only work against lockdown efficacy.
First, I think the term “lockdown” is largely rhetorical gamesmanship. But regardless, it’s a problem that people are arguing about the effect of something that has no universal meaning. That’s part of what this discussion is about: looking for a sensitivity analysis.
The primary goal of “lockdowns” is to reduce mobility and social interaction. Say that in country X those factors decline without state mandates, and that in a very similar country Y behavior changes similarly with mandates. Does “lockdown” then mean only that the changes are mandated (when people would have changed their behavior similarly anyway)?