Richard Tol and the 97% consensus

I said in my earlier post that I might write about Richard Tol’s publicly stated views on John Cook’s 97% consensus paper. Since then, however, I’ve had a rather frustrating Twitter discussion with Richard Tol that, sadly, ended quite sourly – as they have a tendency to do. Let me be clear: I’m not trying to wind up Richard Tol, nor am I trying to tick him off – although I probably will end up doing so, even if that is not my intent. I’m simply commenting on what he has already openly stated in the public domain.

Anyway, Richard Tol now has a 7th draft of his paper criticising the Cook et al. study, and appears to want to resubmit this to a journal. For those who don’t know, his first submission was rejected by the editor of the journal to which it was submitted. To be fair, Richard has made a number of changes to the paper and has listened to some of the criticisms and suggestions. He even acknowledged that something I had said had made him check something again and slightly change his test. Admittedly, he only mentioned this during a Twitter debate about how best to undertake academic discussions, and so was, I think, more trying to make a point about discussions than actually acknowledging that I had said something credible – although he did acknowledge it and so I should at least be grateful. Now, I do still have issues with Richard’s statistical tests, but I thought I would make a more fundamental point.

Richard’s been making a great deal of noise about John Cook not releasing all his data. He wants all the ratings, not just the final ratings, and even thinks that individual keystrokes and time stamps should be provided, but accepts that this may be asking a bit much. He’s also been critical of the lack of a robust survey strategy. Now, here’s where I have an issue. If the goal of the Cook et al. work was to survey a group of people to, for example, determine their views on climate change, then Richard Tol would be perfectly correct. Such work would require that you had a well-defined survey strategy and that you kept track of all your data so that you could eliminate biases, or discuss biases if any exist. You’d need to know something about the sample; for example, the age distribution, the gender distribution, political affiliations, and scientific background. In such a case I would completely agree with Richard.

However, the goal of the Cook et al. work was not to survey a group of people, it was to survey a set of abstracts that had been extracted from a database using a well-defined search. The people involved were simply a tool that analysed these abstracts so as to ultimately give each abstract a rating that reflected its position with regard to anthropogenic global warming (AGW). In some sense, what is important is whether or not the final ratings “properly” reflect the position of each abstract, not really how each rating was achieved. Now, I should be careful. I’m not suggesting that one doesn’t need to know about the strategy, simply that the requirements with respect to the intermediate data are different from what would be required if the goal of the work was to study the people doing the rating, rather than to study the abstracts.

Let me try to give you an analogy. I do quite a lot of computational work. I take some set of initial conditions (my raw data, if you like) and evolve them using a simulation to produce a result that I then analyse. If someone thought my results were wrong, they wouldn’t typically ask for my code so that they could check it line by line. They might ask for my code and I might give it to them, but they would then redo the simulations (after checking the code). More typically, they would simply do the simulations using their own, or another, code. No journal would let them publish a paper pointing out that my code had an error in line 1002 (for example). A journal would expect them to show the significance of the error by redoing some of the simulations.

So, in my view, to check the validity of the Cook et al. work, you need to redo the analysis of some of the abstracts to see if the new results are consistent with those obtained by Cook et al. Simply showing that some of their intermediate data fails a statistical test doesn’t really tell you if there’s anything significantly wrong with the Cook et al. results. In fact, in earlier drafts of his paper, Richard Tol acknowledged that the Cook et al. results probably did reasonably reflect the level of agreement in the scientific community (this appears to have been left out of the recent draft, although maybe I’ve missed it). Failing a statistical test may indeed indicate something is wrong, but doesn’t prove it. Furthermore, there’s another issue that I have. Presumably someone could design such a study with a very precise and rigorous analysis procedure. This could be designed to pass all (or most) of Richard Tol’s tests. But this doesn’t tell you that such a procedure can suitably rate a sample of abstracts, it just tells you that it satisfies a set of tests that someone thinks are important. Just to be clear, let me restate something. If the goal was to study the people, then passing these tests might well be relevant, but the goal wasn’t to study the people, it was to study the abstracts.

So, this is where I get a little more controversial and somewhat more critical of Richard Tol. The problem I’m having with this whole event is understanding Richard Tol’s motivation. His claim is that he is simply interested in making sure that a piece of work is robust and done properly. Even if the results are correct – he says – if the strategy is flawed, the work has no merit. However, what I have issues with are Richard’s own style and his own strategy. Firstly, he’s often remarkably rude and unpleasant. I have been told that this is pretty standard in his field, but I find it a strange way to interact with other academics. It shows a lack of decency, and if you’re not willing to be decent why should others be decent towards you? As far as his strategy goes, he has spent quite a lot of time trying to convince people that John Cook’s reluctance to release all his data implies that he’s trying to hide something. Richard’s paper then consists of a set of statistical tests that the Cook et al. data apparently fail, hence indicating a problem with the work. However, as I try to explain above, it’s not clear that these tests are actually telling us anything about whether or not the Cook et al. results are robust. They might be perfectly fine tests to do if the goal was to study the people rating the abstracts, but that wasn’t the goal.

Richard Tol’s intentions may well be good and honourable. I obviously can’t claim otherwise. However, from my perspective, this all seems a little suspicious. Make people think that Cook et al. are hiding something and then, when they do release data, run a set of statistical tests that the data fails. Those who don’t know better will think this means the Cook et al. study is nonsense, when – as far as I can tell – it’s told you nothing of the sort. I’m not claiming that there aren’t problems with the Cook et al. study, simply that Richard Tol’s tests aren’t a particularly good indicator of whether or not there are problems. They might indicate something, but until you test the actual abstract ratings, how can you know? It makes me think that Richard is following a similar strategy to that adopted by McIntyre & McKitrick when they tried to debunk Michael Mann’s hockey stick paper (although, with all due respect to Cook et al., I’m not suggesting that the Cook et al. paper is of the same calibre as Michael Mann’s hockey stick paper). If you’re unfamiliar with the story, you can read John Mashey’s comment here. Basically, do something that looks credible but that’s complicated enough that few will have the knowledge or understanding to know whether it is actually credible or not.

I appreciate that my last paragraph has suggested that I think Richard Tol’s motives may not be as pure as – I assume – he would like others to think. If this seems unfair, I apologise. Richard is more than welcome to simply ignore this, of course. It is just a blog post and is just my impression, based on what I’ve seen and read. He’s, of course, welcome to comment to clarify his position and to explain his motives. Whether I respond to his comments or not may depend on the tone he chooses to use. I am now on leave again, so am keen to have a nice relaxing week off before going back to work, where I plan to focus on my lecture preparation, write a chapter of a book, and do some of my own research; rather than reading and commenting on climate science. We’ll see if I succeed.

This entry was posted in Climate change, Global warming. Bookmark the permalink.

239 Responses to Richard Tol and the 97% consensus

  1. toby52 says:

    I have no time for Richard Tol and am instinctively suspicious of his motives. Here is someone who boasted on his Irish Economy blog of cordial dinners with Dr Rajendra Pachauri, while hobnobbing with Lord Lawson of the Global Warming Policy Foundation and repeating every slur against Pachauri, and also against Professor Michael Mann.

    While Tol’s science might be sound, to me he is someone who always plays both sides to what is his best momentary advantage, which is at the moment to attack John Cook. He is a latter-day Vicar of Bray, who did not care if a Catholic or Protestant was King, as long as he remained Vicar of Bray. You are right to question his motives.

  2. Thanks for the comment. I’m reluctant to make too strong a statement myself, partly because I’m trying to keep things reasonably civil, partly because I’m naive and would like to assume people do have the best of intentions, and partly because I don’t know enough to really say more. I will say, however, that I have found it all rather odd and not something I’ve encountered before in my 20 or so years in academia.

  3. You raise a lot of the issues I have with how Tol is approaching the Cook paper. What he’s doing is completely irrelevant to the process of checking if the results are valid, especially as all the data he needs to do this is already available.

    He also completely ignored my point that the raters didn’t have any strange biases or problems associated with them as they are in agreement with the self-ratings from the authors of the papers. In fact the raters are more conservative compared to how the authors rated their papers.

    That, together with the agreement between the raters’ results and the authors’, doesn’t indicate any big issues with the paper. If he wants to tackle this he should see if he can replicate the same results; that he isn’t doing this completely baffles me.

  4. Thanks Collin. I noticed your Twitter exchange with Richard. I’ve tried to discuss the same issues with Richard as you have, and have – similarly – had little success in convincing him that his interpretation is wrong. Of course, maybe we’re the ones who are wrong 🙂

  5. Barry Woods says:

    If you (Cook) are planning the media blitz and marketing of the 97% consensus project WHILST you are analysing the papers, there’s just a tiny chance a bit of confirmation bias might slip in for the end result… (/sarc off)

    “To achieve this goal, we mustn’t fall into the trap of spending too much time on analysis and too little time on promotion. As we do the analysis, would be good to have the marketing plan percolating along as well.” – John Cook

    Not a good place for Cook and his team to be, even if it’s only the perception that this happened.

  6. Barry Woods says:

    Ari Jokimaki responded to Cook,

    “I have to say that I find this planning of huge marketing strategies somewhat strange when we don’t even have our results in and the research subject is not that revolutionary either (just summarizing existing research).” – Ari Jokimäki

  7. Possibly, but that’s not the point I’m making. The point I’m making is that the tests that Richard Tol is applying to the Cook et al. data may not be appropriate for the survey they’ve undertaken. I’d be much more convinced by what Richard was trying to do if he addressed, for example, the emails you’ve exposed, or addressed actual issues with the ratings of the abstracts (rather than the process that produced the ratings). I don’t think there’s a major issue with the Cook et al. work, but that’s my view and I could be wrong and someone might even be able to convince me that I’m wrong. However, Richard doing a bunch of statistical tests as if this were some kind of social science survey is certainly not going to convince me that there are major issues.

  8. And that might suggest that some on the team realised that planning how to market the results before you had them might not be an ideal strategy. Kudos to Ari Jokimaki.

  9. Spence_UK says:

    I think you demonstrate quite a significant naivety in how scientific experiments are falsified. In my experience, particularly in contentious cases, the approach you describe just doesn’t work. Usually you have to show a flaw with the original experimenter’s work to bring about an effective conclusion.

    – Blondlot’s N-rays. The UK Royal Society and many US institutions repeat Blondlot’s experiment, fail to replicate results. Blondlot insists the experiments were done wrong. Other institutions claim to repeat Blondlot’s results. Blondlot’s work is finally debunked when Wood goes to inspect Blondlot’s own apparatus and tricks him into making a false positive, showing bias in Blondlot’s methods.

    – Benveniste and memory of water molecules. Again, other labs show mixed results with some failing to replicate and some claiming success. In the end the only way to resolve the dispute was to look in great detail at Benveniste’s methods and discover a key part of the analysis was not blinded. Maddox and Randi blind that part of the study at Benveniste’s lab, results disappear.

    I could produce many other similar examples.

    As you can see, when others repeat the experiment from scratch and get different results, it is rarely accepted as an effective debunking, as both sides claim the others did something wrong. On the other hand, finding a flaw in the original method or data is a far more effective way of bringing a conclusion to a controversial issue. So I think Tol is right in his approach. It is a more efficient use of time and resources and more likely to reach a meaningful conclusion.

  10. And I think you’ve demonstrated that you don’t understand my post, or haven’t read it properly. Why don’t you try again?

  11. Also, what your examples illustrate is that people tried to replicate and couldn’t. That’s how it’s meant to work. If you think someone pointing out an error in some process is going to convince people that the original results are wrong, then I think you are horribly mistaken.

  12. Spence_UK says:

    I’m demonstrating that a fundamental idea in your post (how problems are found in scientific research) is misguided, and contradicted by history. Sorry if that’s not clear to you; I can explain, but I can’t force you to understand.

  13. Spence_UK says:

    No, in both examples some labs successfully claimed replication and other labs claimed that replication failed. In both instances, they pointed at each other and claimed the other group did something wrong. There was no resolution in these cases by independent replication, the resolution came from digging into the intermediate steps and finding methodological errors in the details.

  14. If I’m wrong I would be one of the first people to correct my stance, but I severely doubt that this is the case with the Cook paper. It has similar results to other reviews and nothing jumps out as a problem.

    The way Tol is approaching this, by constantly speculating, seems to indicate that there’s nothing fundamentally wrong with the paper, simply because he’s focussing on details that are irrelevant for checking if the conclusions in the paper are correct.

  15. Certainly, but it was the lack of agreement about the replication that drove the check for methodological errors. You can’t find an error and know that the result will not be replicated without checking that it is indeed not replicated. That’s really all that I’m suggesting. I’m not suggesting that finding a methodological error is a bad thing to do, simply that – alone – it isn’t enough. Plus, what Richard is doing, isn’t really this. He is assuming that the Cook et al. survey is equivalent to a survey of the volunteers (in which case the data he wants would be relevant) but it isn’t, it’s a survey of the abstracts.

  16. No you’re not. It’s not contradicted by history. All your examples included an attempt at replication of the results. That’s precisely what I’m suggesting should be done. I too can explain and, I too, cannot force you to understand. I, however, can try not to be condescending.

  17. Scientifically, I view Cook’s 97% paper as a largely irrelevant bean-counting exercise. If I want to know what the consensus is amongst climate scientists, I’d rather wait till the IPCC report is published this autumn. But I do concede that Cook et al has value in showing the public that the few climate skeptics who are scientists with sufficient expertise to publish a paper are a tiny minority of the community. The only thing that surprises me is that the consensus was not higher. Looking at the list of contrarian papers, I see some familiar names – Chilingar, Scafetta – who write predictably bad papers that few climate scientists would rate above nonsense.

    But I find Tol’s comment awful; if this is the seventh draft, I am not surprised an earlier draft was rejected. I think comments should challenge the heart of a paper, not be a laundry list of complaints that the analysis wasn’t done the way the author would have preferred.

    I have published comments, and worry about getting the tone right. Tol obviously doesn’t suffer from such qualms – “incompetent”, “Data quality is low.” Is it possible to be more abrupt? Of course the data are not perfect – there will inevitably be errors, as there are in any dataset – but unless Tol can demonstrate that the errors make a material difference to the conclusions, he cannot reasonably claim that “Data quality is low.”

    I agree there are issues around mitigation papers. If the aim is to assess how much consensus there is about IPCC WG1, they are irrelevant. But if we are interested in the whole IPCC remit, these papers are relevant.

    Since Tol clearly agrees with the conclusions, it is not clear why he has put so much effort into raising so many, largely spurious or unquantified, issues with Cook et al.

  18. I largely agree. Ideally a paper such as Cook et al. shouldn’t be necessary (and maybe some could justifiably argue that it isn’t). It’s really Richard Tol’s behaviour that I’ve found quite amazing. Your last paragraph is exactly what I’ve been thinking. Why? He appears to agree with the results and has – as you say – put an awful lot of effort into doing a bunch of tests that he hasn’t really explained or justified (and actually got quite a number of these tests wrong initially – rather ironic given how he was claiming to be the rigorous careful one).

  19. We have a winner. 🙂

  20. I’m sure what you’ve said is really amusing, but I can’t quite work out why 🙂

  21. It wasn’t intended to be funny. Just that this is in my opinion the best comment so far that almost exactly holds the same position as I have.

  22. I see, yes. It was pretty close to my views too, so I agree 🙂

  23. Spence_UK says:

    No, the studies were replicated because people knew the results were wrong, i.e. questions were raised from the moment the first study was published. But the replications achieved nothing but continued disagreement over why the replications gave differing results.

    Just like questions are raised today by many good scientists – Mike Hulme and Richard Tol being examples. The replication stage would achieve nothing. Identifying methodological problems at intermediate steps would get straight to the heart of the matter.

    All I’m trying to do here is explain a little bit of scientific history. I’m not saying replication is bad – on less controversial topics, it is very useful. On this type of topic, it would achieve nothing and history shows us that.

    While the history of science is interesting, I’m not actually that interested in this particular study – the Itia group from the National Technical University of Athens explained four years ago why the consensus argument is based on an unscientific premise. This article was written discussing the claims made by Doran and Zimmerman 2009 but the same issues apply:

    Those who don’t learn from history are doomed to repeat its mistakes.

  24. Sorry, but whatever the order is, replication is crucial. Maybe you think it would be simpler if we simply identified the errors without having to replicate, but the reality is that without attempting replication, you don’t know if the error is significant or not. This really shouldn’t be something we should be arguing about. Replication (or failure to replicate) is a fundamental part of determining if something is credible or not.

    Also, Richard Tol is not a scientist. He is an economist and may well be a very good one, but he is not a scientist. Mike Hulme apparently has a degree in Geography so maybe could claim some scientific training, but appears to have focused on policy, rather than on science.

    But here’s a simple question for you. Do you dispute that all the examples that you provided involved an attempt at replication? Also, can you find an example where an error in methodology was found and accepted without any attempt at replication?

  25. Martin says:

    Maybe you have missed how high-profile economists, Nobels included, are punching each other constantly and very loudly. Krugman bashing Reinhart/Rogoff was a hugely publicised example. Another is Acemoglu/Robinson against Jeff Sachs, see for example here:

    Tol is not especially dickish here, he is just being an economist: papers are made available in draft form and float around for years before they go into formal peer review (if at all), often with major revisions. Whoever doesn’t share their data and everything that has been done gets bashed until suicidal, especially if the topic is of interest to the broader public. Here is the exasperated last-ditch attempt of the aforementioned Reinhart/Rogoff to stop Krugman’s prolonged onslaught on them:

    So, whatever you think about this really boring episode (but interestingly obsessive for quite a few people, without whom this would probably have been over for quite some time and possibly would not have entailed more than a couple of tweets by Tol), I would not ascribe this to Tol personally, but rather to the profession.

  26. Well, I believe that I – at least partly – addressed this in my post

    Firstly, he’s often remarkably rude and unpleasant. I have been told that this is pretty standard in his field, but I find it a strange way to interact with other academics.

    I discovered (or was led to believe) that this form of exchange is not unusual in economics. However, that still doesn’t change the fact that I find it remarkably unpleasant. I would say that if there is to be an exchange of views between economists and scientists, it’s not obvious that it’s the scientists who should accommodate the economists when it comes to the form of such exchanges.

  27. KR says:

    What amazes me is how utterly irrelevant Tol’s analysis is to the Cook et al paper.

    In his 7th draft (as per the earlier ones), Tables 1-3 and Figures S5-S14, computed from rolling statistics of the Supplemental data listing, are _entirely_ dependent on the order presented in that supplemental data: sorted by year and alphabetically (Note, incidentally, that alphabetic ordering represents a crude keyword sort based on the beginning of the titles – plenty of clusters there).

    Yet Cook et al states “Abstracts were randomly distributed via a web-based system to raters…”, meaning that the ordered supplemental data and the rolling windows have _exactly zilch_ to say about the ordering of abstract ratings. Tol’s tests for skew, autocorrelation, drift, and “fatigue” are therefore meaningless with respect to rating order – yet they make up the majority of his figures and results, not to mention insinuations. They only tell you about groupings under year/alphabetic ordering.
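    The order dependence KR describes is easy to demonstrate with synthetic data. The sketch below (invented numbers on a 1–7 scale, not the actual Cook et al. ratings) builds a year-sorted sequence whose mean drifts slowly with year: the lag-1 autocorrelation is clearly non-zero in the sorted listing, and vanishes once the sequence is shuffled into a random order, which is how the abstracts were actually presented to the raters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for ~12,000 endorsement ratings (1-7 scale), with a
# small drift in the mean across publication years 1991-2011.
years = np.repeat(np.arange(1991, 2012), 12000 // 21 + 1)[:12000]
ratings = np.clip(
    np.round(4 + 0.05 * (years - 2001) + rng.normal(0, 1, size=12000)),
    1, 7)

def lag1_autocorr(x):
    """Lag-1 autocorrelation: sensitive to how the sequence is ordered."""
    x = x - x.mean()
    return (x[:-1] @ x[1:]) / (x @ x)

by_year = lag1_autocorr(ratings)                        # listing order
random_order = lag1_autocorr(rng.permutation(ratings))  # rating order

print(f"year-sorted: {by_year:.3f}, shuffled: {random_order:.3f}")
```

    The same values, merely reordered, pass or fail an autocorrelation test; the test examines the presentation order, not the ratings themselves.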

    Tol claims that the consensus trends are due to compositional changes between impacts, mitigation, methods and paleoclimate – yet the major trend in consensus is in the first 25% of the data, while the composition of the set of papers only changes in the last 50%. These are therefore unrelated.

    Tol also makes considerable noise about disagreements between the abstract and full paper ratings – these are related but very different data sets; it would be stunning only if they were in complete agreement. Why would that difference between data sets represent an error in their methods or conclusions?

    Finally, he spends considerable time on several very small numbers – 40 papers reclassified, 7 authors in disagreement – when the survey looks at 12,000 papers. Whether these small numbers were discussed with insufficient clarity, handled correctly or incorrectly, or mistakenly given over to rabid squirrels, he’s making noise about less than 1% of the data – insignificant to the conclusions even if Tol’s unsupported (and over the top) accusations of poor practice were correct.

    These are appalling errors for someone who claims to be a statistician.

    The rest of the comment primarily consists of Tol wishing that different or additional statistical tests and different samplings were used in Cook et al. However, Cook et al applied the methods and tests they stated, on the sample they stated, and reported the results from those tests. While there are possible grounds for additional papers using different techniques, Tol has failed to show any significant error in the Cook et al analysis. It is (IMO) a horrible attempt at a comment.

    If Tol feels there are errors in the Cook et al 2013 paper, he should rate a sample of the abstracts (which are the real and released data) himself and see if the consensus and trends thereof are significantly different under analysis. Oreskes 2004 did that with 928 papers – it’s well within the means of a single person to do such analysis. The data for this paper _is the set of abstracts for analysis_ – he has everything he needs. In the meantime he gives the appearance of searching everywhere for objections – due primarily to ruffled feathers on his part regarding how many of his papers were analyzed and how they were classified.

  28. chris says:

    Spence, you’re arguing over two points: (i) on replication and (ii) on the nature/significance of scientific consensus. I’ll address (i) here.

    (i) replication: In my experience the spurious nature of flawed papers is identified either by the paper being so obviously wrong that it is essentially ignored, or by different groups/labs investigating the claims under their own conditions (e.g. in their own system) – at some point the weight of evidence supports the view that the claims of the spurious paper are unsupportable and the field moves on. This involves various degrees of repetition. Sometimes methodological flaws in the original study are identified, but not always.

    The Benveniste affair is entirely unrepresentative of standard scientific practice. No informed individual took Benveniste’s claim seriously at the time [as described in Caroline Picart’s 1994 account published in Social Studies of Science (vol 24, 7-37)] and Nature were engaging in one of their occasional provocative episodes in publishing it. If one were to make a parallel with the Cook/Tol episode, it would be that resolving these issues generally involves some dispassionate analysis and publishing in the scientific literature.

    There is another interesting parallel that Caroline Picart’s paper highlights, which relates to the rather unpleasant bullying perpetrated by Tol and his supporters. Surprisingly, even in the pre-internet days this happened during the Benveniste affair, as Picart recounts:

    “Another peculiar feature of these mirror-image trials [i.e. the argy-bargy between the Benveniste and Maddox camps] was the personalism and irrationalism with which charges and counter-charges were exchanged. The emotionalism of the debates was itself a constitutive feature of the farcical (that is, laughable) atmosphere of the ‘ghostly imprint’ saga. Both the Benveniste and Maddox camps employed various low-level arguments: non-sequiturs of different kinds and ad hominem arguments ranging from name-calling to attempts to undermine the other’s credibility.”

    Plus ça change…!

    Tol should quietly get on with doing some science and publishing it. Then we can assess whether his complaints have merit…

  29. > So I think Tol is right in his approach.

    Richard won’t be able to replicate more results with the data he requests.

    It is basically a fishing expedition.

  30. > Since Tol clearly agrees with the conclusions […]

    Richard might be of two minds on this.

    In response to Rob Honeycut:

    So, your complaint, Richard, is that John got the right answer with the wrong methods?

    Richard tweeted:


    We can also read in his drafts:

    There is no doubt in my mind that the literature on climate change overwhelmingly supports the hypothesis that climate change is caused by humans. I have very little reason to doubt that the consensus is indeed correct.


    Is it possible to claim that the consensus is correct and that Cook & al did not get the right answer?

    Seems that it is.

  31. It’s getting a little late, but I think what you’ve just written here essentially describes the exact issues I had with his test. Thanks, nice summary.

  32. Martin says:

    No, there is no cheap lesson to be learnt here. But, though both nominal topics (i.e. quantitatively establishing where the consensus lies, and whatever Tol has written) bore me to tears, this is the one interesting question arising here, IMO: how to review research in today’s world. While I do not see that economics should rule how it is done, aggressive data sharing has its advantages (as long as anonymity issues are accounted for, and as long as the original author does not have to give up a considerable comparative advantage by sharing her data). While I grant that Tol’s kinda-sorta allusion that something may be fishy if Cook doesn’t share everything is fishy in and of itself, his argument rings true in a rather trite sense: what, exactly, is the argument for not simply sharing all the data, as long as the above-mentioned limitations are accounted for? Put differently: I also do not see why the review system in the sciences should serve as a default – it’s an institutionalised approach, not something justified on epistemological grounds. And though decency might be an issue in this special case, it has nothing to do with the underlying question: either data are to be shared, or there are reasons not to share them. Whether the one to share them with smells of elderberries has absolutely no bearing on the question.

  33. It’s getting a little late, so I’ll respond briefly. Technically, sure. Why not share all your data? One point I was trying to make in this post, though, is that there is a difference between surveying a group of people in order to determine their views, and using a group of people to carry out an analysis. If you’re going to share all your data, those you share it with have to understand the difference between these two situations. So, the issue I have with Richard Tol’s tests is that he appears to want to test the intermediate data (i.e., not the final ratings, but the intermediate ratings) because he appears to think that if this data fails his tests then it shows that the survey is flawed. If the goal of the survey was to test the ability of a group of volunteers to rate abstracts, he’d have a point. The goal, however, was not to do this; it was to determine a rating for each abstract. So providing all the data is fine, as long as everyone understands the relevance of this data.

    As far as your latter point is concerned, I’m not really getting what you’re suggesting (although it is getting late and I have drunk more of the bottle of wine than I probably should have). There’s no question that there are issues with how academia operates and how reviews are done. That, however, is a much bigger issue than the issue of Richard Tol vs Cook et al.

  34. > [A]s long as anonymity issues are accounted for […]

    In this case, I don’t think it would be possible to preserve anonymity, since we know the raters and we know a ball-park for the number of ratings they did.

    All this for an ad hoc interest in rater fatigue, when all the ABSTRACT ratings were corroborated or arbitrated, a little detail Richard’s tests have failed to capture to date.

    Anyway, this:

    Well played, Martin!

  35. BBD says:

    Meanwhile, radiative physics.

  36. Old arguments
    Part of my comment has a series of statistical tests that show that the data as reported have patterns that cannot be explained by chance or anything that is stated in the paper. This is indicative of poor data quality or worse. There are more revealing tests, but these require a further release of data. Alternatively, Cook could perform these tests and publish the results.

    A further data release would also show (if present) individual rater bias (which, dear Collin, is not the same as the average rater bias) and rater fatigue.

    Abstract ratings and paper ratings have a kappa of 8% and a weighted kappa of 16-26%. A score below 70% is deemed unsatisfactory.

    New arguments
    Wotts argues that the ratings were done objectively. This is not the case. Cook could have run a text interpretation algorithm. He asked humans instead. Survey researchers have known since at least the 1950s that answers should also be interpreted with the psyche of the surveyed in mind. In Cook’s case, each abstract was rated twice. Ratings disagreed in 33% of cases. This means that 18% of abstracts were rated incorrectly. The reconciliation & re-rating process reduced the error rate to 12%.
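    [For readers unfamiliar with the statistic mentioned a few comments up: Cohen’s kappa compares the observed agreement between two raters with the agreement expected by chance. A minimal sketch on the 1-7 endorsement scale follows; the two rating lists are invented for illustration and are not data from Cook et al.]

```python
# A minimal sketch of (unweighted) Cohen's kappa on the 1-7
# endorsement scale. The rating lists are invented for
# illustration; they are not data from Cook et al.
from collections import Counter

def cohen_kappa(r1, r2, categories=range(1, 8)):
    """(observed agreement - chance agreement) / (1 - chance agreement)."""
    n = len(r1)
    observed = sum(a == b for a, b in zip(r1, r2)) / n
    c1, c2 = Counter(r1), Counter(r2)
    chance = sum(c1[c] * c2[c] for c in categories) / (n * n)
    return (observed - chance) / (1 - chance)

rater_a = [4, 4, 3, 2, 4, 4, 1, 4]
rater_b = [4, 3, 3, 2, 4, 4, 2, 4]
print(round(cohen_kappa(rater_a, rater_b), 3))  # → 0.6
```

    [Note that kappa can be low even when raw agreement is high, if one category dominates, which is one reason its interpretation here is contested below.]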

  37. Richard, thanks for the comment. I didn’t say they were done objectively, simply that the humans were a tool used to rate the abstracts. You’re correct, they could have chosen to use a text interpretation algorithm rather than humans, but they didn’t. I agree that there may be “errors”, hence my view that redoing a portion of the ratings would be a better way to test the robustness of the Cook et al. results than doing statistical tests on the “intermediate data”. Let me stress that I’m not claiming that the Cook et al. results are “correct” and that there aren’t problems, simply that your tests are – at best – an indicator and not proof of a problem.

    The issue that I was trying to address in this post is that your tests appear to be the kind of tests that one would use if the goal was to survey the humans (for example, how good are humans at rating abstracts). But this wasn’t the goal. The goal was to rate abstracts, not survey humans. Therefore finding some kind of rater bias or rater fatigue doesn’t tell you that the ultimate, reconciled ratings are wrong, simply that there were steps in the algorithm that needed correcting.

  38. Sure, Wotts, if you want to measure the temperature, the last thing you should do is check your thermometer.

  39. Tom Curtis says:

    I am puzzled as to why Tol continues to burn his academic reputation by insisting that, by some statistical analysis of the filing order, he can determine facts contingent on the order of rating, which is known to be different from the filing order and stated to be random with respect to it. Until he ditches that obvious nonsense from his discussion, there is no reason to waste time in the hope that, by some miracle, the rest of the critique will rise above that abysmal standard.

  40. I’ve repeatedly told you that a comparison was done between the results from the raters and how the authors rated their papers. This showed that the raters tended to be conservative (towards rejection/neutrality towards AGW and not acceptance/endorsement) compared to how the authors rated their papers. In general there was good agreement between the two groups with how the papers were categorized.

    This indicates that there isn’t an underlying problem with how the raters rated the papers (because this is not what you would see if there was a huge problem with how these papers were rated). That’s why I said that individual ratings are irrelevant in this case. Especially considering you have everything you need to see if the results are correct.

  41. Collin, again, there is a difference between the average rater and the individual rater. With 12 raters only, individuals matter.

  42. And again you ignored my point about the agreement between the authors and the raters. Why should I engage you if you keep ignoring what I say?

    So I’m yet again disengaging and I won’t bother engaging you in the foreseeable future.

  43. Collin, you make three mistakes:

    If the average is fine, there is no reason to assume that underlying data is fine. Errors may have cancelled.

    This is particularly relevant in this case, as Cook defines “consensus” not as a central tendency, but as left-tail/(left-tail+right-tail). Tail statistics are much more sensitive than central tendencies.

    There is strong disagreement between abstract and paper ratings.

  44. KR says:

    “Part of my comment has a series of statistical tests that show that the data as reported have patterns that cannot be explained by chance or anything that is stated in the paper.” – Complete nonsense.

    There are at least four different structures in the sorted data pointing to those patterns. The progression of consensus level over the years, the differing levels of consensus in different categories (both reported in Cook et al), the alphabetical clustering of subject/category by title (clusters starting with “Anthropogenic”, “Carbon dioxide”, “The Impact”, “Life Cycle”, “Effect of”, “Global Warming And”, “Radiocarbon”, “Modelling”) leading to auto-correlation, and finally the relative scarcity of rejections (a rare ‘7’ spiking the stats in its vicinity, for example).

    Your windowed stats over the year/alphabetic ordering are irrelevant – meaningless to the (unrelated) random rating order – and the patterns you see in sorted data are entirely unsurprising. This issue is abundantly clear to everyone who has looked at your Comment, and I find it quite odd that you cannot recognize it.

    WRT timed rating stats/trends: I fully expect that rater levels will change over time (but is that fatigue or experience?), that averages will vary between raters (experience in the field, interpretations, language issues due to multi-language authors), etc. However, all that your requested intermediate data would indicate is the uncertainty of the results, not any errors in conclusions. An error in the reported Cook et al conclusions requires consistent bias in ratings, which can only be established by checking the actual primary data and rating some abstracts yourself – checking for bias against some (hopefully) ground truth, something you have stated you will not do. Established literature looking at the primary data (Doran, Oreskes, Anderegg) indicates that the Cook et al conclusions, which replicate those findings, are unbiased.

    And kappa between abstracts and author ratings of full papers is no indication of error – because those are two different data sets for evaluation.

    Your rolling stats are a strawman – you are arguing against an ordering not used in the paper, and the results you see are entirely unsurprising due to known and reported structures in the data. Intermediate data might (perhaps) indicate the level of uncertainties in the results (‘tho Cook et al made no +/- claims, so irrelevant to the correctness of the paper), but cannot determine bias in the conclusions without an independent evaluation of the sample set, cannot find errors. And kappa statistics (“assessing the degree to which two or more raters, examining the same data, agree when it comes to assigning the data to categories”) are wholly inappropriate when looking at two different data sets.

    Your Comment, as it stands, is therefore pointless.

  45. @KR
    You are welcome to demonstrate these points. Please don’t wave your hands. Demonstrate. You can use my code if that helps.

  46. Sou says:

    Richard Tol has been behaving very strangely. He’s showing the same obsessive symptoms as the Auditor. I haven’t read the comments here and they may shed some light on what he’s on about. But the data for Cook et al is all there with their paper and free to download. They’ve provided the full list of all the abstracts they rated. Richard can do his own analysis. I can’t for the life of me think what other “data” he could want that would be relevant to the study.

    Richard Tol’s tied up with the loony mob the GWPF. Same crowd Matt Ridley hangs out with and Matt’s been acting weird lately too. Bunch of cranks and nutters, and given his recent behaviour, Tol fits in with them very well.

    Richard and Matt have written such odd stuff recently that I now lump them in with blogger James Delingpole, who’s a complete fruitcake. Richard and Matt are catching up to James and at the rate they are going, they’ll overtake him for sheer nuttery sometime in the coming 12 months or so.

  47. KR says:

    Kappa is inappropriate across ratings of different data sets (by definition), you yourself have demonstrated autocorrelation/clustering via alphabetic sorting (your figure S14, weakened only slightly by the additional year sort), rater variance and drift cannot say anything about consistent bias (i.e., any error in Cook et al conclusions) without some ground-truth check (which you have declined to do), wottsupwiththat (IIRC) has computed the skew from ‘7’s in your rolling stats, any trends in rolling stats vanish under random ordering, and the timing of composition and consensus change do not come close to overlap (as you have shown in your Figure 1).

    Done and done. Rolling statistics of an unrelated ordering, and kappa, are irrelevant to this data, and simply fail to support your comment. I feel no need to say more in that regard.

    _You_ are the person claiming incompetence and potential malfeasance on the part of Cook et al – it is incumbent on _you_ to prove your point, and you have not done so. Cook et al stated their sampling and methods, ran the stated tests, and produced the results given by their data. You are more than welcome to suggest alternative approaches, but your attempts to show methodological errors are, as noted by multiple respondents, ill-founded.

    I will be unable to participate/respond further for some time (about to spend a week away from electricity), but these points have all been raised by others – who I invite to respond as they see fit – and, given the progression of this discussion, I fully expect it to be continuing when I return.

  48. @KR
    You’re mixing up different tests.

    Wotts is quite wrong to suggest that bootstraps are sensitive to tails.

  49. You’re handwaving, Richard.

    You should at least acknowledge that your use of Kappa lies between an idiosyncrasy and a marketing prop.

  50. and here’s Willard …

    … who argued that kappa was wrong, and weighted kappa was the way to go, and when shown the results for weighted kappa quickly changed the subject, and now argues against kappa (weighted or not) altogether

    If abstract and paper ratings are so incomparable that kappa is invalid, then Cook was wrong to compare these results in the 97% paper.

  51. BBD says:

    Meanwhile, radiative physics.

    I would not tolerate blatant diversionary tactics like this in a business context.

  52. Fragmeister says:

    I agree that Delingpole is the cherry on the cake of fruitloopery. It might be interesting for a sociologist to track the way that deniers go from a moderate position (“I’m not sure about X”) to a wild and way-out-there denial of clear evidence. Delingpole, Monckton and others are a long way down that path. Ridley is heading over the horizon. Richard Tol seems to me still in sight, but agreeing with the outcome while disagreeing with the method seems quaint. He may have a valid point which I am not qualified to check, but his pursuit of the point does not look sensible. But what do I know?

  53. Fragmeister: Simples. Science is a method, rather than a set of results. If the method is wrong, the result is invalid. The result may be correct, but it is not valid. Cook’s 97% consensus is invalid.

    Validity matters for long-term credibility. Climate change is a long-term problem. Validity thus matters.

    Climate change is not like acidification, where we solved the problem (in Europe and North America) before we discovered it was misconstrued. With climate, we do not have that option as we need a century or so to decarbonize the economy.

  54. dana1981 says:

    Good post and a lot of good comments. What particularly irks me is that Tol keeps claiming he’s trying to replicate our study, which is obviously not true. We provided all the data necessary to replicate our study right after it was published. Tol is trying to perform invalid statistical tests to insinuate a problem in our survey, which is not at all the same as replication.

    He’s also repeatedly accused us of hiding something because John Cook isn’t available at his beck and call to provide every irrelevant subset of data that he requests (very Auditor-like), even after Cook explained that he’s been very busy and will provide the data when he has time. After Cook provided the self-ratings data, how did Tol respond? With more accusations that we’re hiding data.

    I don’t think you can blame the profession of economics for that abhorrent behavior.

  55. The data requested are the data that would normally go into peer-review. John Cook has had 190 days to upload the data file. What is he waiting for? Why do you not just make the data available? What terrible secrets are hiding in the data?

  56. dana1981 says:

    “What terrible secrets are hiding in the data?”
    Thanks for proving my point.

  57. As you know, Dana, I release both data and code. I will publish any result, whether embarrassing for you or for me.

  58. I did not argue for weighted K, Richard, but hinted a few times at the point that a weighted Kappa would at least lead you in the right direction. I think this exchange shows you got that hint:

    This exchange also shows that your last comment is misleading at best.


    Running the default weights for K gets you a symmetrical measure. This is problematic, for rating an ABSTRACT as a 3 is more consistent with a self-rating of 2 than a 4. Even your exercise in button-pushing leads you to what some may consider a fair agreement, at least according to authorities you can search for yourself. Not an excellent one, not a good one, but not a bad one either. A fair one.

    Having a fair agreement for two different sets of items and raters might not be that bad, Richard. And even if we grant you all this, you have self-ratings vs ratings. One does not simply take a survey as a classification task. To see how absurd it is, imagine a scientist who tries to measure the reliability of DSM diagnostics by asking psychiatrists who had psychological issues in the past to diagnose themselves.

    You are simply fishing in the dark, Richard.


    As I see it, here’s the part of Cook & al where a Kappa with an appropriate model could be applied:

    Each abstract was categorized by two independent, anonymized raters. A team of 12 individuals completed 97.4% (23 061) of the ratings; an additional 12 contributed the remaining 2.6% (607). Initially, 27% of category ratings and 33% of endorsement ratings disagreed. Raters were then allowed to compare and justify or update their rating through the web system, while maintaining anonymity. Following this, 11% of category ratings and 16% of endorsement ratings disagreed; these were then resolved by a third party.

    Upon completion of the final ratings, a random sample of 1000 ‘No Position’ category abstracts were re-examined to differentiate those that did not express an opinion from those that take the position that the cause of GW is uncertain.

    This seems to indicate a lot of “supervised learning” in the process. (Ask your local machine learning guru.) If that can’t satisfy you, what will?

    Hence the point of all my interactions with you. You, Sir, are fishing in the dark from day one about a paper you consider worthless that claims something you consider trivial, and irrelevant for the grand scheme of things to boot. For two months now we see the most cited economist from the Stern Review mudthrowing when not “acting like an economist” (H/T Martin) instead of providing constructive criticisms, and by “constructive criticisms”, I mean a formal specification of what would be contrarian-proof.

    It would be nice if you started to act like a scientist, for a change, not like a silly talking head from The Global Warming Policy Foundation.

    Hope this helps,


    PS: A PS to bypass WP’s filter.

  59. “Climate change is not like acidification, where we solved the problem (in Europe and North America) before we discovered it was misconstrued.”

    In what way was acidification misconstrued? Unless I have misconstrued your sentence, it is nonsense.

  60. Kappa statistics can be applied across *any* two or more methods of rating or evaluation. They are just a statistical tool, useful as long as they are applied appropriately and conclusions are drawn appropriately from the application.

    When kappa statistics are applied to an abstract/volunteer vs paper/author rating, kappa functions as a test of one rating mechanism versus the other. As with any equivalent situation, inferences can be drawn in either direction. How well do authors perform versus a team of people who had more experience in applying their self-devised rating scheme? How well do volunteers perform, reading only the abstract, against the rating given by the paper’s own author? In either direction, however, kappa concordance tests one thing: how well the rating system itself performs.

    Kappa is abysmal. Calculated in any reasonable manner, it remains very low. The reason is that the raw agreement between raters and authors is only 33% or so. In other words, nothing can rescue this rating system.

  61. Richard, you’re doing it again. Rather than actually explaining what I’ve said that you disagree with, you make a snide little remark that suggests that I don’t want to do something obvious. You’re mis-representing the point I’m trying to make and avoiding actually addressing the issue that I’m raising.

  62. Richard, I think you’re missing what I said about tails (by which I assume you mean outliers). Here’s my explanation and you are welcome to explain where I’m going wrong. Let’s consider the skew test. Your null hypothesis is that the skew should be constant in time. You produce a random distribution with the same number of 1, 2, 3, 4, 5, 6, 7 ratings as in the Cook et al. survey. You run your skew test on this. You repeat this 10000 times to give you a distribution of skews that you can use to produce your confidence interval. You realise (as you did) that there is some drift in the data, so you ignore the first 2000 abstracts when you compare the skew of the Cook et al. data with your bootstrap confidence interval. You discover that some of the skews from the Cook et al. data fall outside your 5 – 95% confidence interval. You count them up and discover that more than 5% do. It fails your skew test. Am I right (or at least close)?

    My point is that one of the reasons the Cook et al. skew test data is falling outside your 5 – 95% confidence interval is that there are a tiny number (9 in all, I believe) of abstracts rated as 7, and these significantly influence the skew when they are included in the group whose skew is being calculated. Okay, so more than 5% fall outside the 5 – 95% confidence interval. Is it relevant? I don’t see why. It’s not Cook et al.’s fault that there were so few abstracts rated as 7 and that these have a big impact on the skew of their data. So I’m not saying that the outliers are influencing the bootstrap. I’m saying the outliers are influencing the skew of the Cook et al. data so that it fails (by your reasoning) the skew test.

    Here’s the issue I have. It fails your test because more than 5% fall outside the 5 – 95% confidence interval. You haven’t, however, really explained why failing this test means anything. I’m sure there are many tests that the Cook et al. data could fail. It’s only those that are relevant that really matter, though.
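    [If I have described the procedure correctly, it can be sketched as follows. The window length, rating counts, and number of resamples below are my own illustrative assumptions, not Richard’s actual code, and the toy “observed” ordering is invented with a deliberate drift.]

```python
# A rough sketch of the rolling-skew bootstrap test described above.
# All numbers (window, counts, resamples) are illustrative
# assumptions; this is not Tol's code or the Cook et al. data.
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

rng = np.random.default_rng(0)

# toy "observed" ordering with a drift: early abstracts skew one way
early = rng.choice([3, 4, 4, 4], size=250)
late = rng.choice([2, 3, 3, 4, 7], size=250)
ratings = np.concatenate([early, late])

def rolling_skew(x, window=100):
    # sample skewness of each overlapping window
    w = sliding_window_view(x, window).astype(float)
    m = w.mean(axis=1, keepdims=True)
    m2 = ((w - m) ** 2).mean(axis=1)
    m3 = ((w - m) ** 3).mean(axis=1)
    return m3 / m2 ** 1.5

# bootstrap: reshuffle the same multiset of ratings many times
boot = np.array([rolling_skew(rng.permutation(ratings)) for _ in range(200)])
lo, hi = np.percentile(boot, [5, 95], axis=0)

observed = rolling_skew(ratings)
outside = np.mean((observed < lo) | (observed > hi))
print(f"fraction of windows outside the 5-95% envelope: {outside:.2f}")
```

    [Whether more than 5% of windows land outside the envelope depends entirely on the ordering assumed for the “observed” data, which is precisely the point under dispute in this thread.]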

  63. Richard, I think you’re making the same mistake that Spence_UK made above. Certainly the method is important and I’m not suggesting otherwise. But, if one finds an error in the method you still have to illustrate what impact this would have on the results. Sometimes it is obvious, but the case still has to be made. One can’t simply go “error – it’s all wrong!”.

    Let me clarify a little more though. As I tried to explain in this post, if the Cook et al. survey was a survey of individuals, rather than a survey of abstracts, then your tests may well be sufficient to show that the results are suspect. However, it isn’t a survey of individuals. The individuals are simply the tool being used to rate the abstracts. Each abstract is rated by two people precisely because there is an expectation that a single person cannot be expected to always give a reasonable rating. So there is an expectation that these two raters will not always give the same rating (otherwise, why use two). Different ratings are reconciled in order to determine a single final rating for each abstract. I would argue that this entire process is effectively a form of error correction. The process was designed with the knowledge that there would be differences. That there are differences, doesn’t indicate an error, it simply indicates a difference in judgement that needs to be reconciled.

    So I return to the main point of this post. The only valid way – in my opinion – to test the robustness of the Cook et al. survey is to redo a sample of the abstracts. Anything else doesn’t really tell you anything. Almost all (if not all) of your tests are simply quantifying what was already known.

  64. > The reason is, the raw agreement between raters and authors is only 33% or so. In other words, nothing can rescue this rating system.

    I think to complete your assessment, you also need to explain why we would expect a better agreement between the author ratings (which rate the entire paper) and the volunteer ratings (that only rate the abstract). Without at least providing that, the significance of the lack of agreement doesn’t really have any basis.

    As I think I’ve said before, the goal of the Cook et al. paper was not to discover if one could use an abstract to determine the position of an individual paper with respect to AGW. The goal was to use abstracts to estimate the level of agreement within the literature. The lack of agreement between the author ratings and abstract ratings simply means that the abstract alone doesn’t necessarily tell you the position of an individual paper wrt AGW. It doesn’t, however, tell you that you can’t use the abstracts to determine the level of endorsement within the recent literature.

  65. dana1981 says:

    Long story short, statistical tests don’t mean diddly if you don’t understand what you’re testing.

  66. Richard Telford,

    This may relate to this discussion at Judy’s:

    I can give you a summary of the arguments, but you can simply follow the comments of omnologos and Latimer Adler. This dynamic duo has repeated most of the concerns regarding the use of the word “acidification”.


  67. “you also need to explain why we would expect a better agreement between the author ratings (which rate the entire paper) and the volunteer ratings (that only rate the abstract).”

    The authors were given the rating classification system. They were asked to rate their paper using the same scale. If you take the position that authors are in the best position to rate their paper, the kappa tells you how the raters and the chosen system performs.

    Two rating methods can be completely discordant, i.e., misidentify every single paper, and yet give the same 97%. That’s what has happened.
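    [The logical possibility Shub raises can be illustrated with a toy simulation; every number below is invented and chosen only to make the arithmetic transparent, and nothing here says anything about whether the possibility is realised in the actual data.]

```python
# A toy illustration of the claim above: two rating schemes can
# disagree on most individual items yet produce the same headline
# consensus fraction. Categories 1-3 endorse, 4 is neutral, and
# 5-7 reject; all numbers here are invented.
import random
random.seed(1)

n = 1000
# scheme A: 970 endorsements (as 2s and 3s), 30 rejections (as 5s)
a = [random.choice([2, 3]) for _ in range(970)] + [5] * 30
# scheme B: same overall split, expressed in different categories
b = [random.choice([1, 2]) for _ in range(970)] + [6] * 30
random.shuffle(a)
random.shuffle(b)

def consensus(r):
    # endorsements as a fraction of all papers taking a position
    endorse = sum(c <= 3 for c in r)
    reject = sum(c >= 5 for c in r)
    return endorse / (endorse + reject)

item_agreement = sum(x == y for x, y in zip(a, b)) / n
print(consensus(a), consensus(b), round(item_agreement, 2))
```

    [With these invented numbers, item-level agreement is low while both schemes report a 97% consensus, which is the tension the surrounding comments argue over.]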

  68. Tom Curtis says:

    Shub, what you are missing is that disagreement between author and abstract ratings can arise from several reasons.

    First, the abstract raters may simply be mistaken. An example of this is the rating of Nir Shaviv’s paper as highlighted on WUWT.

    Second, the author ratings may be mistaken (possibly due to misunderstanding the ratings, or due to misunderstanding what is meant by “endorse”). Examples of the latter include Richard Tol’s claims that several of his papers have been misrated.

    Third, author ratings may differ from abstract ratings because they have more information to hand from which to rate the paper. If, for example, a paper exists which says nothing about AGW in the abstract, discussing instead the impact of changes between wet and dry climate on the propagation of azaleas, but buried in a comments section of the paper there is one sentence saying, “This relationship shows that natural factors are the sole cause of twentieth century warming”, the abstract raters would rate the paper as 4; and the author would rate it as 7, and that would not be a mistake by the raters.

    Fourth, ratings may differ because authors simply misrepresent the contents and implications of their paper. Thus, Scafetta first misrepresents the IPCC consensus as “… NOT that human emissions have contributed 50%+ of the global warming since 1900 but that almost 90-100% of the observed global warming was induced by human emission.” Given that the IPCC is explicit that it claims only that greater than 50% of warming over the last 50 years was anthropogenic, that is an absurd misrepresentation. Second, Scafetta then goes on to misrepresent his own paper, claiming that:

    “What my papers say is that the IPCC view is erroneous because about 40-70% of the global warming observed from 1900 to 2000 was induced by the sun. “

    But what his paper actually says is that 35-60% of warming from 1900-2000 could be solar in origin (45-50% in the abstract), and that only 20-40% of warming from 1980-2000 could be solar in origin (25-30% in the abstract). The only place that figures above 60% are mentioned is a 75% contribution from 1900-1950. Thus he has plainly misrepresented his paper as well.

    I will add, parenthetically, that there was effectively no warming from 1950 to 1970, so that if only 35% of warming from 1980-2000 was solar in origin, then 50% plus of the warming from 1950 to 2000 must have been anthropogenic (volcanic forcing being negative over that period). Therefore the rating in the paper was entirely justified, a fact Scafetta obscures only by first misrepresenting the IPCC, and then the contents of his own paper.

    The key point here is that kappa will tell you nothing about the relative weight of these different reasons for differences between abstract and author ratings. In particular, it will not tell you that the reason suggested in Cook et al, i.e., that the authors rated overwhelmingly fewer papers as 4 because of access to information not found in the abstract alone, is wrong. Absent direct discussion establishing that their hypothesis is false, discussions of kappa are of restricted relevance as a critique of Cook et al. It is only by pretending that differences between author and abstract ratings can only arise from abstract rater error that it can be pretended otherwise.

  69. Seems we have a new argument:

  70. @Richard Telford
    The fear of Waldsterben was the main driver of acidification policy. Lots of trees were indeed dying, but we now know that this was due to drought rather than acidity.

  71. @Wotts
    95% of the data should be inside the 95% confidence interval. If not, the data generating process is not described accurately or completely.

    In Cook’s case, it’s the spacing of the 6s that stretches belief.

  72. If I use Willard weights (0 = agreement; 1 = numerically different but agreement on endorsement or rejection; 2 = endorsement + neutral, rejection + neutral; 3 = endorsement + rejection), kappa is still only 24%.
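    [The weight scheme described here can be plugged into the standard weighted-kappa formula. A sketch follows; the mapping of categories to sides (1-3 endorse, 4 neutral, 5-7 reject) and the rating pairs are my own illustrative assumptions, not Richard’s code or the Cook et al. data.]

```python
# A sketch of weighted kappa with the ad hoc weights described
# above (0 = same category; 1 = different category, same side;
# 2 = one side vs neutral; 3 = endorsement vs rejection), on the
# 1-7 scale where 1-3 endorse, 4 is neutral, and 5-7 reject.
# The rating pairs below are invented.
import numpy as np

def side(c):
    """0 = endorse (1-3), 1 = neutral (4), 2 = reject (5-7)."""
    return 0 if c <= 3 else (1 if c == 4 else 2)

def weight(a, b):
    if a == b:
        return 0
    sa, sb = side(a), side(b)
    if sa == sb:
        return 1  # numerically different but same side
    if 1 in (sa, sb):
        return 2  # one side vs neutral
    return 3      # endorsement vs rejection

def weighted_kappa(r1, r2, cats=range(1, 8)):
    n = len(r1)
    W = np.array([[weight(a, b) for b in cats] for a in cats], float)
    O = np.zeros_like(W)  # observed joint proportions
    for a, b in zip(r1, r2):
        O[a - 1, b - 1] += 1 / n
    E = np.outer(O.sum(axis=1), O.sum(axis=0))  # chance-expected
    return 1 - (W * O).sum() / (W * E).sum()

abstract = [2, 3, 4, 1, 4, 3, 5, 2, 4, 4]
paper    = [3, 3, 2, 1, 4, 4, 7, 4, 4, 5]
print(round(weighted_kappa(abstract, paper), 3))
```

    [Note that the choice of weight matrix directly drives the resulting number, which is part of why the two sides of this thread read the same statistic so differently.]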

  73. @Wotts
    And that is exactly what was found: Abstract ratings and paper ratings yield completely different results.

    Yet, Cook et al. use the latter to validate the former.

    [I’m abstracting from their use of a non-representative subsample to validate the sample.]

  74. I would not be surprised if at some sites waldsterben can be attributed to drought or pathogens – waldsterben is a complex phenomenon – but it would be irrational to leap from that to assuming that acid rain had been misconstrued.

    -The detrimental effect of acid rain on rivers and lakes is not misconstrued.
    -The effects of acid rain on buildings and monuments are not misconstrued.
    -The effect of acid rain on soil chemistry is not misconstrued.
    -That acid rain can affect vegetation or ecosystems (for example in the Dutch dunes) is not misconstrued.

    It is only the contrarian websites that misconstrue the evidence. Where do you get your information from?

  75. I’m not saying that acidification is not a problem. It is. It is good that we solved it (in Europe and North America). However, the public understanding of the problem did not evolve with the scientific understanding.

    Any historical review of acid rain science and policy will roughly tell this story. Hans von Storch had a nice edited volume a few years ago.

  76. Okay, I think you need to define result here. The abstracts and papers are different things. One is a short (10 lines or so) description of the key points in the paper. The other is a multi-page “document” that goes into detail about the background, assumptions, model, results, and conclusions. There is no obvious reason why the “results” (ratings wrt AGW) for the abstracts and papers should be the same. What would be surprising is if the endorsement fraction was different for the abstracts than for the papers – which it wasn’t.

  77. @Wotts
    So are you saying that these are different things yet they validate one another?

  78. Let’s be careful here. I’m suggesting that they are different (but related) things and, firstly, that there are perfectly credible reasons why the rating of the abstracts (by the volunteers) might differ from the rating of the papers (by the authors). Therefore, that a test shows they are different doesn’t tell us anything new or tell us, necessarily, that this is a problem. One should at least explain why failing such a test has some significance.

    However, if you want to use abstracts to determine the level if consensus in the literature you would be concerned if the consensus you obtained using abstracts differed greatly from what you got if you asked the authors to rate the full paper. That was my second point. So, yes – in a sense it is used for validation and I’m struggling to see why this is an issue.

    This is not a terribly unusual process. It’s early (or at least, I had a late night) so I’m struggling to think of a suitable example where you would use two different measurements to determine the same thing. Maybe an example is different temperature proxies. They’re all being used to determine the same thing (past temperature history), but you wouldn’t run a test to compare – directly – the data for one proxy with that for another.

  79. @Wotts
    So you’re saying that they’re different but the same, and that we can use one statistic for validation but not another. I guess you did have a late night.

  80. No, Richard. I’m not saying that.

  81. Well, whatever you are trying to say, either the ratings measure completely different things and therefore cannot be used for validation, or they measure roughly the same thing and then the full suite of validation tests should be applied.

  82. Hold on, the point is that the ratings are then used to determine the level of consensus in the literature. Therefore, you can use the two different ratings to validate the level of consensus obtained, not to validate the ratings. That’s what I’m suggesting. Let’s see if we can actually reach some agreement about a few things. I’m not confident, but I’ll try nonetheless. I’m also hoping that we can avoid being pedantic about terminology, but again not that confident.

    The goal of the Cook et al. paper is to determine the level of consensus with regard to AGW in the recent scientific literature, not to determine the position of individual papers wrt AGW. Agreed?

    They assess a large sample of abstracts using volunteers (2 per abstract) and reconcile differences so as to produce a rating for each abstract. Agreed?

    They ask authors to also rate their papers against the same criteria. Agreed?

    They then use the abstract ratings to determine the level of consensus in the literature. Agreed?

    They then use the author ratings to also determine the level of consensus in the literature. Agreed?

    They compare the level of consensus determined using abstract ratings with the level of consensus obtained using paper ratings. They agree. The level of consensus obtained using abstract ratings has been “validated” using the level of consensus obtained from the paper ratings. Agreed?

    So, I’m not suggesting that the paper ratings validated the abstract ratings. I’m suggesting that the paper ratings can be used to give some confidence that the level of consensus obtained using the abstract ratings has some merit.

  83. Agreed until the last point. Abstract ratings and paper ratings strongly disagree and therefore cannot be used to validate one another.

    Even if they could, a non-representative subsample cannot be used for validation, and finding numerical similarity is a sign of invalidation.

  84. Okay, but I very clearly did not say that they validated each other. I suggested that the result one obtains using the abstract ratings could be compared with the result one obtains using the paper ratings. So, they are not – and should not be – used to validate each other. But one can compare the results one obtains using the two different ratings. As I think I said in an earlier comment, one would be quite concerned if the level of consensus one obtained using the abstract ratings differed greatly from what one obtained using the paper ratings.

    Validation may be the wrong term to use then, as they weren’t actually being used to validate each other; the paper ratings (done by the authors) were simply an alternative way in which to estimate the level of consensus.

  85. So, we have here a thingy that is not a validation and therefore does not need to meet the usual criteria for validation but is used for validation nonetheless.

  86. Are you being intentionally misleading? Do you agree that the comparison is between the level of consensus derived using abstract ratings and the level of consensus derived using paper ratings? There is no claim in Cook et al. that the paper ratings validate the abstract ratings. All that they claim is that the level of consensus obtained using the paper ratings is essentially the same as that obtained using the abstract ratings.

  87. And that is not a claim of validation?

  88. I think that depends on what you mean. I would argue that it would indicate that one can use abstracts to estimate the level of consensus in the literature. It does not mean that you can use an abstract to determine the position of an individual paper with respect to AGW. It – in part – “validates” the final result, but does not indicate that every abstract has been correctly rated (or that it has been rated correctly with respect to the position of the paper).

  89. I think I now understand. It is a validating non-validation.

  90. Oh dear, back to short, snappy little retorts that don’t actually address what’s being said. Anyway, as I mentioned in the post, I’m on holiday for the rest of this week, so – since I’d quite like it to be relaxing and reasonably stress free – I’m going to ignore the rest of your comments. Feel free to continue making comments if you wish. This isn’t me trying to suppress your right to free-speech, simply me exercising my right to not have to respond. Have a good week 😉

  91. Tom, all your criticisms apply to the volunteer classification of abstracts before they apply to a comparison of volunteer ratings vs author ratings. What you list are the problems with the rating system.

  92. So has Sonia Boehmer-Christiansen. The title is ‘Acid Politics’.

  93. Tom Curtis says:

    Really? The fact that Scafetta misrepresents his own research is a problem with the rating system? Obviously we have barely touched the surface. No doubt the Cook et al rating system is also responsible for scrofula and WW2 as well.

  94. Don Monfort says:

    Why did you quit the Cook survey team?

    Have you seen table 5 in the Cook et al paper? Doesn’t the disparity between the author’s self-ratings of their own papers and the Cook et al team’s ratings of the abstracts of the respective papers indicate that the survey rating system was a bust? Wouldn’t it have been a better plan to just survey the authors? Is that what you told the team, when you quit?

  95. To be honest, I’m slightly uncomfortable with you asking what could be a personal question. Of course, if Tom wishes to answer, he is welcome to do so. I’m largely unaware of Tom’s role with the team, so I don’t know whether your question is really implying anything or not. However, I think expecting someone to answer such a question is a little unreasonable (if he did quit the team, there could be many reasons, some of which may be personal and private). Your question is, however, available for all to see. I’m simply responding to make the case that Tom not answering should not be seen as anything significant, and I would hope that you would agree.

  96. Don Monfort says:

    How is it a personal question? Aren’t we discussing Cook et al? If he was involved, why shouldn’t he be asked about his involvement?

  97. Don Monfort says:

    PS: I don’t expect him to answer.

  98. Hypothetical answer, Don. If I were to have to pull out of something because I’d suddenly developed some kind of medical condition that I’d rather others didn’t know about, I don’t see why I should be obliged to tell everyone. Now I have no knowledge of Tom’s involvement or whether or not he pulled out or why. I was simply asking you to accept that people could pull out of something for reasons that they’d rather others didn’t know about. I have no idea if this is at all relevant to Tom, simply that your question may be – in some sense – unreasonable and was asking that you might be willing to at least acknowledge this possibility.

  99. PS: That makes it seem like a loaded question then.

  100. Don Monfort says:

    Yes, it is a loaded question. He quit the team because he saw fatal flaws in the project design, but now that it turned out to his liking he is a vociferous defender. Relevant, or not? Penalize me for putting him on the spot.

  101. Not trying to penalize you at all. You seem to know the answer and have now had a chance to post the supposed reason on my blog. I have no way of knowing if you’re correct or not and I suspect you don’t either. You may have some form of communication that you’ve interpreted in this way, but that doesn’t make your interpretation correct. I have yet to decide on my moderation policy, but making an unproven claim about the motives of another person is coming pretty close to crossing the line. Given that I suspect your view is already something that has been expressed elsewhere, I’ll let this stand unless I find evidence to suggest otherwise. I will certainly be considering whether or not I should be insisting on evidence if such statements are to be made in future.

  102. Don Monfort says:

    Sorry, I am not accusing you of penalizing me. Throw that comment away. You seem to be a fair and civil sort. Refreshing and appreciated from your side.

    I would suggest that you do some more investigation of this subject. Read the threads on the Blackboard.

  103. I have read quite extensively on this subject. I clearly disagree with the position you hold. That, in itself, doesn’t necessarily mean anything. Maybe it’s time we all started realising that we can hold opposing positions and still have a decent discussion about this subject.

  104. Don Monfort says:

    The position I hold is that the Cook paper is useless. It concludes that 97% of climate science papers support the assertion that humans are causing some unspecified amount of global warming. Nuff said.

  105. BBD says:

    None of the contrarian attempts to create a fake controversy over Cook et al. makes the slightest difference to the facts. There is a strong scientific consensus on the causation and potential negative effects of AGW.

    What we see here is a sustained and transparent attempt to obscure the facts with manufactured doubt.


  106. chris says:

    Don Montfort, you ask a potentially relevant question and a largely irrelevant question. On your relevant question:

    “Doesn’t the disparity between the author’s self-ratings of their own papers and the Cook et al team’s ratings of the abstracts of the respective papers indicate that the survey rating system was a bust? Wouldn’t it have been a better plan to just survey the authors?”

    As an interested reader of the Cook et al paper (a paper I probably wouldn’t have read were it not for the extraordinary nonsense written about it on blogs!), I think Table 5 is fascinating and gets to the heart of some relevant aspects of how one perceives and understands scientific information as interested members of the public, as scientists and as scientific authors.

    As a member of the public one might rarely peruse scientific abstracts. An abstract is likely to be the first piece of information a scientist reads about a paper before deciding whether to explore it in more depth. An abstract should encapsulate the context of a piece of work and its essential conclusions. However, it can’t reproduce the entire paper, nor can it convey the unique knowledge and understanding of the paper’s authors. So a survey of non-expert assessments of scientific abstracts is bound to give a different outcome to a survey of the authors’ assessments of their own papers.

    It’s fascinating and instructive that authors endorse the AGW position with respect to their own papers considerably more strongly than non-authors do based on the latter’s perusal of abstracts (the Table 5 you refer to). That says something quite significant about the nature of scientific consensus and the interpretations non-experts might make upon a relatively superficial perusal of a paper (i.e. by reading the abstract).

    “Wouldn’t it have been a better plan to just survey the authors?”

    Not really. We would have been deprived of some useful and fascinating information as described just above.

    One thing that we might remember about science and scientific papers, Don, is that (rather like bringing up children) we are not aiming for perfection in our scientific papers. Much as we might aim to be a “good enough” parent, so our scientific papers should be “good enough” (normally we aim for “pretty bloody good even if I say so myself” rather than “good enough”!). One might think of all sorts of ways that a scientific paper might be done differently or, even in hindsight, improved.

    That rather trite truism is what will continually scupper poor Richard Tol, who is trying ever so hard to find a crack that will allow him to trash a pretty bloody good paper according to a notional perfectionism that he is struggling to define…!

  107. chris says:

    of course I should have called you Don Monford and not Don Montford…apologies

  108. chris says:

    for goodness sake – I’ve done it again…it’s not as if your name is so difficult! sorry, Don Monfort!

  109. That’s why I just call him Don 🙂

  110. Don Monfort says:

    No problem, Chriss. The big picture is that almost nobody cares if humans are responsible for some unspecified – Ari’s p0rno approach – amount of recent global warming. The public opinion polls, conducted by professionals, tell us that.

  111. Don Monfort says:

    Faithfully stated repetition of the meme, BBD. But it really ain’t working for you. See public opinion polls. The people ain’t scared of global warming.

  112. BBD says:

    Repetition of the facts, not a “meme”.

    Interesting that you frame this as a contest to persuade the public that the facts are not as they are.

  113. Don Monfort says:

    See public opinion polls. You ain’t scaring them. Keep at it.

  114. BBD says:

    This doesn’t change the facts. Which are:

    – A strong scientific consensus exists that AGW is real and potentially pernicious

    – “The public” will only start to worry when climate impacts are a little more undeniable even than at present

    – Science communication cannot address this problem

    – Physics doesn’t care

  115. And here’s why I call him Don Don:

    I bet we meet on the street one day, stevie. I would recognize you anywhere. When I come up to you and introduce myself, you will run. It won’t take me long to catch you by the scruff of your fat little neck. But I won’t hurt you. At the most a vigorous dutch rub. I’ll drag you to the nearest decent bar and buy you some drinks.

  116. Very good. I’ve been rather frustrated by a discussion I’ve been having with Don and others on another blog. Not only have you cheered me up, I’m now somewhat more positively inclined towards Don Don 🙂

  117. BBD says:

    I must admit that a dutch rub was a new one for me. It doesn’t sound very collegiate.

  118. In the right context it is a friendly although somewhat condescending (i.e., you silly little thing) gesture. Does make Don appear somewhat patronising maybe, but would not normally (as far as I’m aware) be seen as something that you do to be mean.

  119. > He quit the team because he saw fatal flaws in the project design […]

    Here’s Tom’s first reaction to the publication of the paper:

    Congratulations to John Cook and the SkS team for this important paper. I know how much work was involved and the team that carried it out have done a marvelous job.

    Perhaps we should say that Tom Curtis believes that Cook et al. contains some marvelous flaws.

  120. Don Don’s hypothesis may be a part of a theory:

    It’s a stupid survey. That’s why Tom Curtis quit the team. He couldn’t see how they could get a good result out of that mess. But hey, it didn’t matter. They got the right answer for propaganda purposes. So Tom is back on the team with his convoluted defenses of the foolishness.

  121. Tom Curtis says:

    Don Montfort, the reasons I left the project are the ones I gave at the time, which no doubt you have already read. They were my belief that the rating guidelines, as given, would have required me to rate too many papers which (IMO) endorsed the consensus as neutral; and my further opinion that deniers would focus on the “no position” papers rather than the more relevant ratio of “endorsement papers” to “rejection papers” to generate talking points, so that the faithful would not be informed by the paper.

    I find hindsight gives an interesting view of those reasons. The far greater proportion of endorsements relative to “no position” (4) rated papers in the author ratings, compared to the abstract ratings, makes my initial concern seem prescient. On the other hand, it is clear that some raters did not take as strict a view of the criteria as I did, or at least some of the time they did not, possibly due to rater fatigue. As, IMO, in this sort of study it is better to get a false negative than a false positive, ie, to understate rather than to overstate the acceptance of the consensus, it appears the stricter criteria used by Cook were desirable. With hindsight, therefore, I would not have left the project.

    Further, my cynicism regarding deniers has been fulfilled in spades. Not only have there been frequent comments by deniers insisting that the true, and only relevant, ratio is the ratio between endorsement papers and total papers – some have gone much further. In at least one instance, they have gone as far as to insist that the only true and relevant ratio was that between explicit endorsement with quantification (1) and all papers. That particular organization is, in effect, insisting that an abstract that said “the IPCC has clearly shown the danger of human induced climate change” should count as not endorsing the IPCC consensus. Parallel to that is a theme that insists “endorsement” means “is evidence of”, so that, according to them, papers which explicitly contain a summary of the IPCC findings which they endorse are to be excluded because they only endorse them rather than provide new research in support (ie, are not WG 1 papers, in the parlance of Lucia).

    In contradiction to that, there is now a theme that suggests all papers should have been endorsement papers because endorsement purportedly means only affirming that CO2 is a greenhouse gas, or something equally inane. This is in straightforward contradiction of the classification of ratings 1 and 7, in which quantification requires explicitly asserting that anthropogenic factors are responsible for greater than 50% of warming for endorsement, or less than 50% for rejection. That the denier meme that “we all endorse AGW” requires that “endorse” and “reject” be interpreted as being inconsistent in meaning for ratings 2 and 3 with respect to 1 (for “endorse”) and 5 and 6 with respect to 7 (for “reject”) is not ignored, but then turned into a second argument against the paper. I guess it is remarkably efficient to gain two arguments against the paper by simply reinterpreting the terms, but it would be more honest to interpret the terms so that they are consistent – even if it deprives you of two talking points.

    Overall, the sheer range of denier talking points, and their frequent mutual inconsistency, shows that their problem is not with the design or execution of the paper per se, but rather with the result – which they are desperately casting about to find any form of words which they can accept as a basis for rejecting it.

  122. Tom Curtis says:

    Don Montfort, I did not see fatal flaws in the design, and you know it. Seeing you are asking these questions you have undoubtedly read the hacked SkS forum posts, and know that I left because of a personal disagreement with the survey design. Had I considered the issues over which I left “fatal”, I would not have been so hypocritical as to wish Cook and his team well on the effort. Rather I would have argued that the “flaw” was in fact fatal; and suggested methods of correcting it.

    For those who have not read the forum hack, here, in its entirety, is my resignation from the project:

    “Under current instructions I am being forced to rate far too many papers as neutral when it is almost certain they accept the concensus, but do not mention “anthropogenic” or “GHG”. The result will be that the the ratings will underestimate support of the consensus to a level that amounts to distortion, IMO. Of the last five papers I have rated, 3 where rated as neutral as per instructions, but which I believe to clearly have supported AGW.

    In light of this, I feel I can no longer participate in this project. I wish you and everybody else invovled good luck, and look forward to the final results. “

    John Cook’s response was, in effect, that it was OK to overestimate neutral papers at the expense of endorsing papers because the author ratings would show that that had in fact occurred. In the event he was wiser than I.

  123. dana1981 says:

    The only reason anybody knows that Tom quit the project (very early on, immediately after it started) is because of the stolen contents of our hacked private forum. Tom also specifies why he quit in that same stolen conversation material. So Don Don is very well aware of why Tom quit, he’s just trolling (as usual).

    I don’t know if Tom wants to discuss the issue so I won’t delve into it. Personally I don’t mind if he wants to discuss it – if anything his reason for quitting makes our result even more impressive, IMO.

  124. BBD says:

    Alpha Chimps vs Beta Chimps in Don’s world-view. He is not on common ground.

  125. dana1981 says:

    For the record, I agreed with Tom. I felt that our methodology was too conservative, putting too many papers in the ‘no opinion’ category that could easily have been put in the implicit endorsement category. As Brysse and Oreskes put it, we were ‘erring on the side of least drama’. We knew our study would be incessantly attacked by deniers (and as you know, it has been), so the majority decision was to be conservative in our ratings, and I deferred to the majority and carried on.

    Thus if anything, we underestimated the consensus (though probably not by a significant amount – when you’re already as high as 97% you can’t go a whole lot higher). It’s amusing that deniers would take Tom’s decision as somehow undermining our result, when in reality it simply indicates how conservative it is.

  126. Don Monfort says:

    dana, dana
    Stolen, hacked, carelessly left in the open for all to see – take your pick. It is what it is. We know what you guys were up to. Those of you who want to be amused as well as informed should Google “Ari’s p0rno approach”.

  127. dana1981 says:

    Yeah, who cares if the material was stolen, right? Not deniers, that’s for sure. The ends always justify the means.

    The really annoying thing is when people take stolen material written during the early stages of the project to misrepresent the final results, as with this ‘p0rno approach’ BS, which I’ve already debunked several times and will debunk again tomorrow on the U of Nottingham blog.

    It’s just like Climategate – people take stolen material out of context and just don’t give a crap when the context and subsequent events are explained to them. That’s denialist thinking for you – there’s no way to break through that ideological bias filter with facts.

  128. Don Monfort says:

    You got issues with the survey design, which were consequential enough for you to quit the team. I got it. Now everything is fine with the survey. Your issues that required you to quit your team went away. What happened is you found out that Cook was smart enough to conjure up the 97%, no matter how the ill-conceived project started out. Check the public opinion polls, Tom. The general public still do not care if 97% of alleged climate scientists agree that humans are causing some unspecified amount of global warming, including the Hiroshima bombs exploding in the abysses of the oceans.

  129. Don Monfort says:

    Tom, do you know how many people are scared by the assertion that 97% of climate science/scientists endorses the assertion that humans are causing some unspecified amount of global warming? I will help you. Not nearly enough to get any meaningful mitigation of CO2 emissions. Try some other ploy.

  130. Don Monfort says:

    Life isn’t fair. They stole my secrets. Maybe I should not have left them lying around for anyone to pick up. Of course I have proof that they were stolen, but I am looking for it. Maybe the proof was also stolen. We never learn our lesson. One of the reasons we are losing the propaganda war.

  131. BBD says:

    Try some other ploy.

    So there’s a conspiracy to misinform and frighten the public?

    Sounds a bit paranoid to me.

  132. > What happened is you found out that Cook was smart enough to conjure up the 97%, no matter how the ill-conceived project started out.

    Don Don’s theory is thickening.

  133. Tom Curtis says:

    Don Montfort, the idea that the purpose of Cook et al is to scare people is entirely of your manufacture. It is certainly not my intention to scare people, and nor, SFAIK, is it John Cook’s.

    We do, however, wish people to be able to make correctly informed opinions about the issue of climate change. So, in this instance, when denier organizations spread lies suggesting that significantly less than 90% of climate scientists believe that >50% of recent warming is due to anthropogenic factors, and when polls show that that lie is widely accepted, we wish people to be informed of what the real value actually is (whatever it may be). That, by the way, is why I have pointed out at Skeptical Science, that Cook et al is a survey of literature rather than scientists, and that Anderegg et al showed climate scientists who accepted the evidence for AGW have published, on average, twice as many papers as those who reject AGW, and that the number of climate scientists who accept AGW is likely closer to 94% than 97% based on Cook et al. It is also why I quote endorsement in the literature as being almost certainly greater than 90% and very likely greater than 95%. For what it is worth, among climate scientists acceptance of AGW is almost certainly greater than 80% and very likely greater than 90%, while rejection of AGW is restricted to at most 5% of climate scientists.

  134. Tom Curtis says:

    It is a mark of Montford’s discomfort with facts that he needs to paraphrase any response to him in ways which plainly misrepresent its contents. It is as though, if he actually let people speak for themselves without distortion, his opinions would just dissolve away.

  135. Don Monfort says:

    The takeaway from this thread:

    “Richard Tol (@RichardTol) says:
    July 27, 2013 at 1:29 pm

    Collin, you make three mistakes:

    If the average is fine, there is no reason to assume that underlying data is fine. Errors may have cancelled.

    This is particularly relevant in this case, as Cook defines “consensus” not as a central tendency, but as left-tail/(left-tail+right-tail). Tail statistics are much more sensitive than central tendencies.

    There is strong disagreement between abstract and paper ratings.”

    kappa kappa
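    [A brief aside on the quoted Tol point about tail statistics. His claim is that a consensus defined as left-tail/(left-tail+right-tail) is more sensitive to rating errors than a central tendency would be. A toy calculation illustrates this; the rating counts below are invented for illustration and are not the actual Cook et al. data:]

    ```python
    # Illustrative only: hypothetical rating counts on a 1-7 endorsement
    # scale, NOT the actual Cook et al. data.
    base = {1: 60, 2: 900, 3: 2900, 4: 7900, 5: 50, 6: 15, 7: 10}

    def mean_rating(counts):
        """Central tendency: the mean rating across all papers."""
        total = sum(counts.values())
        return sum(k * v for k, v in counts.items()) / total

    def consensus(counts):
        """Tail statistic: endorse (1-3) / (endorse + reject (5-7))."""
        left = sum(v for k, v in counts.items() if k <= 3)
        right = sum(v for k, v in counts.items() if k >= 5)
        return left / (left + right)

    # Move 30 papers from category 3 (implicit endorse) to 5 (implicit
    # reject) and compare how each statistic responds.
    shifted = dict(base)
    shifted[3] -= 30
    shifted[5] += 30

    print(f"mean:      {mean_rating(base):.3f} -> {mean_rating(shifted):.3f}")
    print(f"consensus: {consensus(base):.2%} -> {consensus(shifted):.2%}")
    ```

    [Because the rejection tail is tiny, moving 30 papers barely nudges the mean rating but shifts the consensus ratio by the better part of a percentage point. Whether that sensitivity matters for Cook et al. is exactly what the two sides here dispute.]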

  136. Don Monfort says:


    Please explain the conclusion of Cook et al. Does the paper claim that 97% of climate science papers endorse the assertion that >50% of recent warming has been caused by humans?

  137. Tom Curtis says:

    I had a personal issue, as you put it, with saying that 66.4% (as it turned out) of papers had no position on AGW when, as near as I could determine, the true figure would have been about half that (35.5% as it happened). While not actually dishonest, in that I would only be reporting the value determined by a method rather than the real value, I dislike reporting things that are not true. Hence my withdrawal.

    What I find interesting is your desperation to construe my prediction that the paper would overestimate “no position” papers (something which actually occurred) as evidence that the 97% figure cannot be trusted.

    Finally, even if the abstract rating had been better at distinguishing “endorsement” papers from “no position” papers, the percentage of endorsement would only have increased to 98.8% – hardly consequential. I was aware of that likelihood based on prior results. My concern was entirely with overstating “no position” papers rather than with significantly understating the percentage of endorsement papers.

  138. Tom Curtis says:

    From the introduction of the paper:

    “We examined a large sample of the scientific literature on global CC, published over a 21 year period, in order to determine the level of scientific consensus that human activity is very likely causing most of the current GW (anthropogenic global warming, or AGW).”

    From the abstract:

    “We analyze the evolution of the scientific consensus on anthropogenic global warming (AGW) in the peer-reviewed scientific literature, examining 11 944 climate abstracts from 1991–2011 matching the topics ‘global climate change’ or ‘global warming’. We find that 66.4% of abstracts expressed no position on AGW, 32.6% endorsed AGW, 0.7% rejected AGW and 0.3% were uncertain about the cause of global warming. Among abstracts expressing a position on AGW, 97.1% endorsed the consensus position that humans are causing global warming.”

    So, yes it does.

    It is only possible to pretend that it does not by assuming that the theory of AGW does not endorse that claim, despite the statements of the IPCC and despite the clear criteria implicit in instructions for endorsement levels 1 and 7.
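    [The headline number in the abstract Tom quotes can be reproduced from the percentages it gives. Note the quoted figures are rounded to one decimal place, so this recomputation gives roughly 97.0% rather than the 97.1% the paper reports from the underlying raw abstract counts:]

    ```python
    # Percentages as quoted in the Cook et al. abstract (rounded figures).
    endorse, reject, uncertain = 32.6, 0.7, 0.3

    # Consensus among abstracts expressing a position on AGW.
    position = endorse + reject + uncertain
    consensus_pct = 100 * endorse / position

    # Rounding in the quoted percentages gives ~97.0%; the paper's 97.1%
    # comes from the raw counts rather than these rounded figures.
    print(f"{consensus_pct:.1f}% of position-expressing abstracts endorse AGW")
    ```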

  139. BBD says:

    And so it goes on. Manufactured uncertainty vs the facts.

  140. Don Monfort says:

    No it doesn’t, Tom. The alleged 97% consensus in Cook et al makes no statement on quantification. You know that. You know about Ari’s p0rno approach. Why do you people continue to play these games? It is really not working for you. Pay attention to Mike Hulme. He is telling you that post-2009 – Climategate – you can’t get away with this stuff any more.

  141. Tom Curtis says:

    I should add that the idea that the respondents to the author survey were confused about what was being asked assumes them to be complete idiots who do not know what is at stake in surveys like this. It further implies that those authors who indicated that their papers rejected AGW actually thought the papers rejected the claim that CO2 is a greenhouse gas. Neither claim is credible.

    The suggestion that endorsement means simply endorsing that CO2 is a greenhouse gas or something equally inane is a trumped up argument deployed because it plays well rather than because there is any evidence in its favour. It is dishonest, and you know it.

  142. Don Monfort says:

    Do you know how many papers out of 12,000 surveyed allegedly endorsed the assertion that >50% of recent warming was caused by humans?

  143. Don Monfort says:

    How many responding authors affirmed that their papers endorsed the assertion that humans are responsible for >50% of recent global warming?

  144. Don Monfort says:

    Tom:”It is only possible to pretend that it does not by assuming that the theory of AGW does not endorse that claim, despite the statements of the IPCC and despite the clear criteria implicit in instructions for endorsement levels 1 and 7.”

    How about giving us the data for each of those categories? There must be a bunch in cat. 1, since you are claiming the 97% consensus includes the >50% quantification. How many papers out of 12,000 support your claim of consensus? I know the answer. Do I have to help you?

  145. Tom Curtis says:

    Yet again, either the endorsement means endorsement of the claim that most warming over the last 50 years is anthropogenic in origin or it is not. Now consider an abstract that says that:
    CO2 is a greenhouse gas with a radiative forcing of 3.7 W/m^2;
    The recent CO2 increase is anthropogenic in origin;
    Feedbacks are net negative, so that the likely increase in temperature due to a doubling of CO2 is only 0.5 C; and that
    The temperature increase since 1950 is largely (80%) the result of natural variation, specifically the response to the AMO.

    According to your interpretation, this paper must be classified as both 2 and 7. That, however, is impossible because ratings are exclusive. Therefore your interpretation forces you to interpret the rating system as inconsistent.

    In contrast, by my interpretation, the paper can only be classified as 7, and the rating system is consistent. It is a standard principle of criticism that when a paper being criticized is ambiguous, if one of two possible interpretations makes the paper inconsistent while the other makes it consistent, you choose the latter interpretation. You may still criticize the ambiguity, but criticizing the purported inconsistency amounts to erecting a straw man, for it was open to you to treat the paper as consistent. Ergo, given these facts your interpretation is simply wrong.

    Further, it requires you to treat every time the paper says “Humans are causing global warming” as meaning “humans are causing [some of] global warming” rather than “humans are causing [the majority of] global warming”, despite the fact that the phrase is used interchangeably with terms like “AGW” (i.e., you ignore context), and despite the fact that, based on the principles of conversational implicature, sentences must be relevant, so that “humans are causing global warming” without further qualification means that “humans are the main cause of global warming”.

    I will not respond further to you on this point. You will not get it because you have a strategic interest in not getting it. More rational people who are prepared to be guided by facts, however, will see the logic of the argument.

  146. Don Monfort says:

    I know why you won’t respond further, Tom. You are talking foolishness and blowing smoke. I don’t blame you for giving up.

    This is very simple. Read the categories 1-7. Category 1. is the only category that indicates an endorsement of AGW that includes the quantification of >50% human responsibility. Period. Out of the 2,000+ papers on which Cook received authors’ self-ratings, 228 were rated in category 1. Out of 12,000 papers, the Cook team rated fewer papers than that in category 1., so we can obviously throw that crap out. 228 papers is all you have got to claim a 97% consensus with a >50% quantification. And even that is tainted by self-selection bias and other issues that smart guys like Tol have hit you with. kappa kappa

    Oh, but there were very few papers that disputed our alleged consensus. BS. That’s marketing survey baloney, not science. You don’t in good conscience claim a consensus based on a positive affirmation from only 10% of the population. Hello!

  147. Don Monfort says:

    Anybody there?

    Mr Wotts, has Tom re-convinced you that it’s a 97% consensus endorsing the >50% human contribution story? It won’t look so good if you change your mind again, within a couple of hours.

  148. @Don Yes, the same public opinion polls show a low acceptance among the public of the theory of evolution in the U.S. Doesn’t mean it isn’t true.

  149. dana1981 says:

    The consensus on AGW > 50% is only a piddly little 96%.

  150. @Don
    According to the released data (but note that Cook et al. hold back the bulk of their data):
    The consensus rate is 98.0%
    The explicit consensus rate is 97.6%
    The explicit & quantified consensus rate is 87.7%

    Only 64 (out of 11944) explicitly endorse the hypothesis that human greenhouse gas emissions have caused climate change, and quantify its contribution at more than half.

    32 of these papers are in impact and policy journals, however. Discounting those
    The explicit & quantified consensus rate (scientific literature) is 86.5%.
    Bias-correcting the sample
    The explicit & quantified consensus rate (sci lit, bias-corr) is 86.4%.

  151. Sorry, overlooked your second question:
    8 papers have been rated as 1 (explicit & quantified endorsement) by both Cook’s team and the authors of the paper.

  152. I should add that the error rate is 12% in the abstract ratings, so one of the eight abstract ratings is probably wrong. I cannot estimate the error rate in the paper ratings because Mr Cook suppressed the necessary information in his data release. The error rate is at least 3%.

  153. Don Monfort says:


    On Bart V’s blog, I tricked dana1981 into revealing that the author’s self-ratings in category 1. amounted to 228. Let’s give them that instead of the 65, because the authors should know their own papers and the Cook raters’ ability to place papers in the right category obviously sucks, as proven by table 5. Throw the Cook et al authors’ ratings out with the trash. They are useless. Worthless. Crap. Did you know that the anonymous raters were the authors themselves, self-described as “citizen scientists”? DIY climate science, as revealed in Climate Science for Dummies.

    And yes, they hold back the bulk of their data because they don’t want to reveal the number of papers rated in category 1, versus the other categories. They are pretending that the 97% consensus is defined as endorsement of the >50% story, even though the paper does not state that and the data do not support it. They lump categories 2 and 3 in with category 1. What kind of sense does that make? Damn dishonest. What is the freaking point of having categories 2 and 3, if you want to prove that there is a 97% consensus that humans cause the majority of global warming? The respective papers either support that assertion, or they don’t. The great majority don’t. And if the 97% catch-all consensus is that humans are causing some unspecified amount of global warming, who freaking cares?

    The Cook paper is nothing but transparently crude agitprop. How did that crap get published in an alleged science journal? Didn’t the reviewers ask to see the freaking data?

    You are killing them.
    kappa kappa

  154. Dana, Don is referring to the following statement in the paper

    Explicit endorsements were divided into non-quantified (e.g., humans are contributing to global warming without quantifying the contribution) and quantified (e.g., humans are contributing more than 50% of global warming, consistent with the 2007 IPCC statement that most of the global warming since the mid-20th century is very likely due to the observed increase in anthropogenic greenhouse gas concentrations).

    hence only explicit quantified endorsements are directly associated with 50% due to humans (and then category 7 says explicitly less than 50%). Are you suggesting that of those that quantify, 96% state that humans are associated with more than 50% of the recent warming?

  155. On Bart V’s blog, I tricked dana1981 into revealing that the author’s self-ratings in category 1. amounted to 228.

    Not only does this make you seem devious and deceptive (which is certainly consistent with what I’ve experienced from you); when I read it I simply thought that Dana had misunderstood. It seems as though he quoted the number of abstracts rated as 1 rather than the number of papers (by the authors) rated as 1. I appreciate that you never forgive a mistake, but if you can’t do so these discussions all become rather pointless. You don’t “win” the argument because someone misunderstood what you were asking. You “win” the argument by having the most convincing argument.

  156. Don Monfort says:

    You should stay on your own 97% blog, under the protection of your minion moderators. Let’s see, the best you can claim is that 228 authors affirm that their papers should be in category 1. That is out of 2142 papers for which you received responses. And knowing about self-selection bias, that is not a representative sample.

    There were 39 papers that rejected AGW. Don’t you have to have a positive affirmation to count a paper in the consensus? Those that are not in the consensus, ain’t in the consensus. You can’t lump categories 2 and 3 into category 1, unless you are crooked and shameless. So how do you come up with 96%. Show us your hocus pocus arithmetic. You got like 10% positive affirmation of being what you want them to be. Get serious.

  157. Don Monfort says:

    We have been through that, Wotts. Don’t let them fool you again. Categories 2 and 3 do not belong in category 1. It ain’t any more complicated than that.

  158. Tom Curtis says:

    Richard Tol steps forward with his shallow analysis.

    Shallow because he again pushes the vapid notion that “impacts” papers have no relevance to the issue. What he neglects (knowingly in that it has been pointed out to him before) is that the “impacts” category does not correspond to the “Working Group 2” subject matter in the IPCC. Thus he wishes to exclude as not belonging to the scientific literature (apparently) such papers as:

    Schlesinger and Ramankutty (1992), Implications for global warming of intercycle solar irradiance variations

    “FOLLOWING earlier studies1–6, attention has recently been directed again to the possibility that long-term solar irradiance variations, rather than increased greenhouse gas concentrations, have been the dominant cause of the observed rise in global-mean surface temperature from the mid-nineteenth century to the present. Friis-Christensen and Lassen7 report a high correlation (0.95; ref. 8) between the variable period of the ’11-year’ sunspot cycle and the mean Northern Hemisphere land surface temperature from 1865 to 1985. The Marshall Institute report9 concludes that ‘…the sun has been the controlling influence on climate in the last 100 years, with the greenhouse effect playing a smaller role.” Here we explore the implication that such putative solar irradiance variations would have for global warming. Our results provide strong circumstantial evidence that there have been intercycle variations in solar irradiance which have contributed to the observed temperature changes since 1856. However, we find that since the nineteenth century, greenhouse gases, not solar irradiance variations, have been the dominant contributor to the observed temperature changes.”

    Or Schonweise et al (1997)

    “The problem of global climate change forced by anthropogenic emissions of greenhouse gases (GHG) and sulfur components (SU) has to be addressed by different methods, including the consideration of concurrent forcing mechanisms and the analysis of observations. This is due to the shortcoming and uncertainties of all methods, even in case of the most sophisticated ones. In respect to the global mean surface air temperature, we compare the results from multiple observational statistical models such as multiple regression (MRM) and neural networks (NNM) with those of energy balance (EBM) and general circulation models (GCM) where, in the latter case, we refer to the recent IPCC Report. Our statistical assessments, based on the 1866–1994 period, lead to a GHG signal of 0.8–1.3 K and a combined GHG-SU signal of 0.5–0.8 K detectable in observations. This is close to GCM simulations and clearly larger than the volcanic, solar and ENSO (El Niño/southern oscillation) signals also considered.”

    See also Roeckner (1992), Ma et al (2004), and Verdes (2007).

    Another paper in that grouping is Karl and Trenberth (2003), which, while not a direct analysis of the evidence for AGW, is a summary of the literature and also falls under a notional WG 1 grouping on that basis. A second example of that sort is Shea et al (2007) (Rating 2, impacts) which Tol had previously, and mistakenly, indicated to be a duplicate paper. While not “evidence” of AGW, these papers are without question informed endorsements of the theory of AGW in the scientific literature, and endorsements after consideration of the evidence.

    It may be possible, and interesting, to do as Tol here attempts and split off those papers which provide evidence for the hypothesis from those that merely endorse it. It is certainly not possible to do so by the mere pro forma exclusion of “mitigation” and “impacts” papers from consideration. Clearly those who suggest that you can have never tested their hypothesis that that was a valid procedure. Seizing on a talking point seems to have been sufficient for their level of intellectual rigour. They have done this despite one obvious lack in Cook et al, i.e., an “attribution” category. That lack forces attribution papers to be shoehorned into whichever of the other categories best fits. While a lack, it is not obviously an error in that the categories were deliberately chosen to mimic those of Oreskes (2004) for easy comparison.

  159. Don, are you referring to me or to Dana? Your comments are becoming quite unpleasant and you are behaving a little trollishly (disruptive, borderline inflammatory). As I’ve mentioned before, I don’t have a moderation policy and don’t really want to introduce one or prevent someone from making their case. I would ask, however, that you try to at least be pleasant and, ideally, consider what other people are saying before pointing out that they’re completely wrong.

  160. Don, I’m not suggesting that they do and nor is Dana. Re my previous comment: you could also try not misrepresenting what other people say, too.

  161. Tom Curtis says:

    Double count much, Richard?

    Having already excluded 10 abstract ratings as errors because they disagree with the author rating, you now insist that the 8 remaining must contain further errors because, what – it makes good copy? It will make the GWPF think you’re part of the team?

  162. Okay, I think I’ve worked it out. 228 abstracts rated as category 1. 9 rated as category 7. Hence 96% of those that quantify the level of human influence, endorse AGW > 50%.

  163. Note that Dana cannot “affirm” that: There are only 64 abstracts rated 1.

  164. Don Monfort says:

    Look Wotts, they hid their data. The only way to find out what they are up to is to pry it out of them. You ask dana if it is not 228. Ask him why they lumped categories 1, 2 and 3 together. If they are trying to prove that the consensus is that >50% bullcrap, then that is exclusively category 1. What is so hard to understand about this stuff?

    I know propaganda when I see it. I am a pro. They are amateurs. I helped bring down an evil empire exposing and fighting this kind of crap. They can’t fool me. Good night, Wotts. You are a nice guy, but naive.

  165. Tom Curtis says:

    I note here that Richard has accepted without comment Montfort’s premise that only rating 1 papers endorse 50%+ anthropogenic cause of recent warming. Given that, how can he justify his claim that several of his papers were incorrectly rated as endorsing the consensus when they clearly endorse the idea that GHGs have a positive warming effect? Is he merely adopting convenient rhetoric with no attempt at consistency?

  166. Don Monfort says:

    Give it a rest, Tom. You quit the team and now you want back on the bandwagon. You are not qualified to tussle with Dr. Tol.

    kappa kappa

  167. Don, I have the data for the abstract ratings. The 96% Dana quoted above is correct. You don’t like them lumping 1, 2, 3 together. I don’t see the problem. The categories are well explained and clear. You disagree, that’s your right.

    I’m glad you think I’m a nice guy, and I may well be naive. However, you should ask yourself why you object so strongly to the results of this paper. Any reason?

  168. @Tom
    Admittedly, I have yet to consider the errors in subject classification. Schoenwiese will be surprised to learn that he has published on the impacts of climate change. Schlesinger and Ramankutty have, but their joint papers have not.

    Thanks for pointing out yet another fault in Cook et al.

  169. Richard, I don’t know if you’ve given any thought as to why some people are suspicious of your motives. If you have, a hint at the answer lies in your above comment.

  170. Don Monfort says:

    Oh Wotts, the categories are well explained and clear. The only category that supports the >50% story is category 1. There are, according to your count, about 100 papers in that category. According to the respondent authors there are 228 in category 1 and 39 that reject AGW. That’s not a scientific sample, but go with it. How do you get that 96% bull crap? There are 12,000 papers. 100 or 228 papers out of 12,000 do not a grand consensus make, except in agitprop fantasy land. This is old Soviet apparatchik kind of stuff, Wotts. Elementary. It failed then and it fails now. Study table 5. Ask dana why they will not tell you what the data are for the individual categories 1-7. I will go back to my single malt now. kappa kappa

  171. Don Monfort says:

    Are you suspicious of Mike Hulme’s motives too, Wotts? Is it necessary to have a suspicious motive to criticize Cook et al? Why are you so suspicious?

  172. Because it’s 228 rated as category 1 and 9 rated as category 7 (which is the other category that is quantified and explicit). Simple calculation: 228/237 = 0.962. That’s where the 96% comes from. Enjoy your single malt. Maybe you can ponder why you object to the result of this study so strongly.
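That division is easy to verify; a minimal sketch, taking the counts quoted in this thread (228 quantified-endorsement and 9 quantified-rejection self-ratings) as given:

```python
# Counts quoted in the thread (assumed here, not taken from the released dataset):
# category 1 = explicit endorsement with quantification (>50% human contribution)
# category 7 = explicit rejection with quantification (<50% human contribution)
cat1 = 228
cat7 = 9

quantified = cat1 + cat7          # all self-ratings that quantify the human contribution
share = cat1 / quantified         # fraction of quantified ratings endorsing >50%

print(f"{share:.1%}")             # 96.2%
```

Which is the rounded 96% figure being argued over above.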

  173. Don, no I’m not suspicious of Mike Hulme’s motives. If I’m suspicious of Richard’s motives it’s because he has explicitly said things that make it appear that he is very keen to show that Cook et al. is wrong and very pleased when he thinks he has. It doesn’t appear particularly objective.

  174. Don Monfort says:

    There are 39 author self-ratings that reject AGW. At least get that part right. Why don’t you read table 5.

    I reject Cook et al, because I understand what they are doing. You don’t.

  175. What are you talking about? I recalculated Dana’s 96% based on the abstract ratings. I have the data in an excel spreadsheet. That’s what was being discussed. Stop diverting the discussion to something new and then claiming that I don’t understand what I’m talking about. Don’t mix up the abstract ratings and the author ratings.

    If you want to use the author ratings (table 5) then the endorse fraction is 1342/(1342+39) = 0.972 = 97%. However, this table combines (for the author rated papers) 1, 2, 3 into a single category and combines 5, 6, 7 into a single category. So this isn’t comparing papers rated as 1 with all papers rated as 1 and 7. You seem to be doing precisely what you’ve been criticising Cook et al. for doing. Ironic or what?

    It seems to me that you reject Cook et al. because you actually don’t know what you’re talking about. I know that I get irritated when people try to give me advice, so apologies for doing the same to you. Simply claiming that your argument has more merit than someone else’s because you understand something and they don’t isn’t a particularly credible argument. Why not try actually focusing on the paper and stop basing your arguments on your supposed brilliance and other people’s supposed ignorance.
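The table 5 fraction quoted in this comment can be reproduced the same way; a sketch, assuming the lumped counts given there (1342 endorse, 39 reject, among author-rated papers with a position):

```python
# Author self-rating counts from table 5 as quoted above
# (endorsement categories 1-3 lumped together, rejection categories 5-7 lumped together)
endorse = 1342
reject = 39

with_position = endorse + reject          # author-rated papers taking a position
print(f"{endorse / with_position:.1%}")   # 97.2%
```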

  176. @Wotts
    Objectivity is immeasurable. You cannot look inside my head. You cannot look inside Hulme’s head. You do not know what drives us, or what makes us happy or sad.

    Replicability is measurable. My data and algorithms are free for all to inspect. Please tell me which of my calculations is wrong or incomplete, and I will change them.

  177. Don Monfort says:

    You are making it up as you go along, Wotts. You said 228, which is from the authors’ self-ratings. Now you say you calculated based on the abstract ratings. And you want to switch from using the authors’ category 1 ratings of 228 to the bullcrap amalgamation of categories 1, 2, and 3. I proved that you were wrong on the very basic data in the Cook paper over on Ben’s blog. You reluctantly admitted it. You are backsliding. And you are proven wrong again. I am done with you, Wotts. You people can play your little games amongst yourselves, but you will accomplish nothing. You are just shooting yourselves in your little feet. You should listen to Mike Hulme.

  178. Richard, I make no claims about your motives and apologise if it appears that I have done so. I simply comment on my view of what I see, read or hear. I completely agree that one cannot see inside someone’s head, but when you thank someone for supposedly illustrating another error with Cook et al., it makes me think that you take pleasure from trying to show that their work is wrong.

    Richard, I think I have commented in quite some detail with regards to what I think is wrong with your analysis. Simply getting some calculations right (which you didn’t actually manage initially) does not mean that your results have any merit. If you think that the only way to invalidate your work is to find a mistake in your calculation, then you have a rather simplistic view of how science/research works. As I’ve said quite clearly in this post, it is my view that even if your calculations do find some “error” in the Cook et al. analysis, until you actually show that this somehow influences their ratings of the abstracts you haven’t actually shown that their results are wrong. This isn’t a survey of the volunteers, this is a survey of the abstracts and you haven’t actually shown that there are any major problems with the abstract ratings determined by Cook et al.

  179. Okay, it appears that you have a point. The 228 is from the author self ratings, not the abstracts. However, the 39 you use is wrong. The total number of author rated papers that quantify is 237, hence 228/237 = 0.962 = 96%. If you want to use table 5, you should do the calculation I did in my previous comment. So, there you go. I’ve acknowledged an error. You willing to do the same? I suspect not. Also, by the way, if you really have done with me and choose not to respond, that’s also fine. I find your style frustrating (as I’ve already made clear). I haven’t reluctantly admitted anything. I’ve admitted my errors when I realise them, and see nothing wrong with doing so. I don’t mind making errors (that’s all part of learning) and nor do I mind admitting errors, it implies nothing. That you make such a big deal of it suggests that your actual arguments are weak and you’re relying on poking holes in what others say (by being – as you admit yourself – deceptive). In my opinion, if you were genuinely interested in this discussion, you would be more accommodating. That you are not does not, in my opinion, reflect well on you.

  180. Tom Curtis says:


    1) Cook et al purport to measure endorsement of AGW in the literature, not evidence for AGW in the literature. Confining the literature search to (approximately) WG 1 material is irrelevant to the former, if not the latter.

    2) Cook et al make no claim that their categories (methods, mitigation, etc) correspond in whole or in part with the subject matter of any IPCC WG or chapter. In particular, they do not claim that the “impacts” category corresponds to the impacts analyzed by WG 2. Thus including papers which examine the impact of anthropogenic emissions on temperatures under “impacts” is not a mistake by Cook et al.

    3) Cook et al do indicate that the categories are chosen for easy comparison of their results with Oreskes 2004. For that purpose it is necessary that their categories match those of Oreskes, which they do at least nominally. You may wish to make a case that they do not match in actuality. If you make that case successfully, you will have found a flaw in Cook et al, but you don’t find it by simply saying it is there.

    4) It is you, not Cook et al, who have suggested that certain categories are privileged with respect to endorsement. That examination of the categories does not support your claim shows that you, not Cook et al, have made a mistake.

    5) The inability to acknowledge your own mistakes is the ultimate sign of academic vacuity. You got it wrong – again. Man up and admit it.

  181. Tom Curtis says:

    Wotts, comparison of the self rating data shows that the six papers whose data was not released were all likely endorsement papers. In addition, 17 endorsement papers according to table 4 are recorded as no position papers in the release. As a result, by my calculation, the figures are:
    Endorse 1319: (1342) 61.75%
    No Position: 778 (761) 36.42%
    Reject: 39 (39) 1.83%
    Total: 2136 (2142) 100%
    (Figures in brackets from table 4)
    Excluding the no position papers, it is 97.13% endorse, 2.87% reject.

    The self rated category 1 papers as a percentage of self rated category 1 and 7 papers is 96.17%, being 226/235. If Montford is correct in his claim about what Dana said, two of the six papers excluded to preserve the anonymity of self rating authors were category 1 papers; the figure then becomes 228/237, as you have calculated.

    Of course, Montford will not agree to that figure. Although insisting that only rating 1 papers be counted as endorsing, he insists that ratings 5-7 papers be counted as rejecting, thereby showing his intention is propaganda, not analysis. Even so, that still shows a resounding 85% endorsement level. This is what makes the deniers hate Cook et al so much. No matter how they twist and distort the data, it still shows their position to be rejected by the overwhelming majority of people with expert knowledge in the area.
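The percentages in the comment above follow directly from the corrected counts it gives; a sketch, taking those counts (1319 endorse, 778 no position, 39 reject, plus the 226/235 category 1 vs 7 split) as given:

```python
# Self-rating counts as corrected in the comment above (table 4 figures are in brackets there)
endorse, no_position, reject = 1319, 778, 39
total = endorse + no_position + reject        # 2136

for label, n in [("Endorse", endorse), ("No position", no_position), ("Reject", reject)]:
    print(f"{label}: {n} ({n / total:.2%})")
# Endorse: 1319 (61.75%)
# No position: 778 (36.42%)
# Reject: 39 (1.83%)

# Excluding the no-position papers:
print(f"{endorse / (endorse + reject):.2%}")  # 97.13%

# Self-rated category 1 as a share of self-rated categories 1 and 7:
print(f"{226 / 235:.2%}")                     # 96.17%
```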

  182. Thanks, Tom. So maybe Don had a point. There are aspects of this that I don’t understand (or know). But, as you explain quite clearly, that doesn’t give Don’s views any more credence. My lack of knowledge doesn’t increase Don’s knowledge. Knowledge isn’t a conserved quantity as far as I’m aware 🙂

  183. Of the six hidden paper ratings, 4 were endorsements and 2 were neutral.

  184. BBD says:

    You are manufacturing a fake controversy. Who cares about your distortions and misdirections? I don’t. The very real, very strong scientific consensus doesn’t. And radiative physics continues to operate as it always has. Your denialism is of no account whatsoever.

  185. Tom Curtis says:

    Montford has been trying to repackage his inconsistent interpretation of Cook et al by asking how many papers endorsed that >50% of recent warming was anthropogenic in origin. The answer is, of course, 3896 abstract ratings, and 1325 author rated papers. That is because any endorsement is an endorsement of that proposition.

    Montford disagrees based on the rating instructions, which are:

    ” (1) Explicit endorsement with quantification – Explicitly states that humans are the primary cause of recent global warming;

    (2) Explicit endorsement without quantification – Explicitly states humans are causing global warming or refers to anthropogenic global warming/climate change as a known fact;

    (3) Implicit endorsement – Implies humans are causing global warming. E.g., research assumes greenhouse gas emissions cause warming without explicitly stating humans are the cause;

    (4a) No position – Does not address or mention the cause of global warming;

    (4b) Uncertain – Expresses position that human’s role on recent global warming is uncertain/undefined;

    (5) Implicit rejection – Implies humans have had a minimal impact on global warming without saying so explicitly E.g., proposing a natural mechanism is the main cause of global warming;

    (6) Explicit rejection without quantification – Explicitly minimizes or rejects that humans are causing global warming;

    (7) Explicit rejection with quantification – Explicitly states that humans are causing less than half of global warming

    Note that the first phrase (before the hyphen) is the “level of endorsement”, while the second phrase is the description. As such, the second phrase is not part of the definition of each level, but rather a guide as to when it applies. Thus, if a paper is rated (6), it is rated as “Explicit rejection without quantification”, not as “Explicitly minimizes or rejects that humans are causing global warming”. The latter tells us when to apply the rating, but does not define the rating. It follows that, for consistency, “endorsement” must have the same meaning in each of ratings 1-3, and “rejection” must have the same meaning in each of ratings 5-7. Despite this, Montford insists that they do not.

    The descriptions do give some clue as to what is being endorsed, or rejected which has some bearing. It is that “humans are causing global warming”, and given conversational implicature, this must be understood as “humans are the main cause of global warming” unless other causes are explicitly mentioned, which they are not. Again, Montford rejects this.

    That leaves Montford with a 7 point classification scheme (ignoring 4b for convenience) which he insists classifies based on level of endorsement of human factors causing some global warming. That means that for consistency there must exist numbers a, b and c, and e, d, and f such that a > b, b > c, e > d, d > f, and c ≥ e; and such that a paper endorsing that a+% of warming is anthropogenic is endorsement level 1, b+% is endorsement level 2, c+% is endorsement level 3, while e-% is endorsement level 5, d-% is endorsement level 6, and f-% is endorsement level 7.

    Of course, this condition is impossible for Montford to meet, for we know that a = f = 50%. It is only by relaxing the condition that a > b to the condition that a ≥ b (and so on), and setting all of a-f at 50%, that the classification system can be consistent.

    In that case, the question arises as to in what the difference between endorsement levels 1-3, and 5-7, consists. The answer is in epistemological warrant. Endorsement levels 1-3 each endorse anthropogenic factors as causing 50+% of recent warming. However, with endorsement level 1 our warrant for asserting that is unquestionable due to direct assertion. With level 2, our warrant is less in that it contains language which does not explicitly assert the quantification. At level 3, we must rely on background knowledge to determine if, together with the assertions in the abstract (or paper), they imply 50+% anthropogenic warming. The difference between levels 1 through 3 then corresponds approximately to the increased probability of our being mistaken about the endorsement of the position by the paper.

    It should be noted that Montford cannot consistently object to this understanding, for he relies on exactly this distinction to distinguish between levels 2 and 3, and 5 and 6. Thus he must consider this sort of distinction coherent (although I do not expect from him sufficient consistency to admit as much). Thus, the difference between Montford’s interpretation of the classifications and mine is that I, for consistency, extend the logical category of the distinction between implicit and explicit endorsement (or rejection) to the distinction between those endorsement levels and 1 (or 7). Montford’s position implies, therefore, that the classification scheme is not only strictly inconsistent, but inconsistent in mixing logical categories within the one classification scheme.

    Finally, Montford is likely to insist that his interpretation of the classification scheme is correct, and that the scheme is simply inconsistent. That, however, is not a legitimate approach in criticism. If an alternative and consistent interpretation exists (which is clearly the case), insisting that your inconsistent interpretation be used instead has no bearing on the merits of the paper you are criticizing. It is Montford who introduces the flaw by his chosen interpretation. The flaw is, then, of his manufacture and not a flaw in Cook et al. The proper procedure is to interpret the classification scheme consistently and see what follows.

  186. Dude, it is Monfort, not Montford

  187. “Don Montfort, the reason I left the project are the reasons I gave at the time, which no doubt you have already read.”

    The reasons you quit the project are simple. They are as true today as they were on the day you quit: the classification scheme is subjective and tolerant of large errors in the classification of papers which do not state a position. I don’t care much for the 97% number, seeing as it has been pulled out of some nether regions in the paper. Among things that do matter, however, is whether it accomplishes what it says it does, namely, classify a large number of papers accurately enough for use.

    You saw that there was a mismatch between how categories 3 and 4 should be classified. Your direction did not match where the other raters were headed. This is the exact error I quantified in my post, “Why the Cook paper is bunk: Part II”.

    Give your rating scheme and the abstracts to a bunch of un-biased people: you’ll see the same phenomenon. Categories 3, 4 and 5 will clash significantly. These are the least well-defined, well-demarcated categories. They are nothing but reflective of the inherent uncertainty in a project of this type. Tol’s rolling standard deviation and skew statistics reflect precisely the problems arising from raters dealing with this issue.

  188. Tom Curtis says:


    1) You misstate the reasons for my resigning from the project for your own rhetorical purposes (and I am rather sick of deniers doing so).

    2) Your blog post is bunk, but I’m not going to go into that now.

    3) It is true that:

    “Give your rating scheme and the abstracts to a bunch of un-biased people: you’ll see the same phenomenon. Categories 3, 4 and 5 will clash significantly. These are the least well-defined, well-demarcated categories. They are nothing but reflective of the inherent uncertainty in a project of this type.”

    I would add that the “bias” towards a classification of 4 from among these three, and of 3 from among 3 and 5, reflects the actual literature as represented in the abstracts. That does not mean there will not be significant disagreement on individual ratings. There will be. But the fifty-fifty calls tend to be between 3 and 4 in the literature far more than between 4 and 5, or 3 and 5; so any unbiased rating team will produce results qualitatively similar to those produced by the Cook et al rating team.

    However, leave that aside. Throw out the implicit categories on both sides as too uncertain to use. The result is that 97.6% of the non-implicit papers with a position are non-implicit endorsements. So while it could be argued that the implicit categories add nothing useful to Cook et al, it cannot reasonably be argued that they distort the result.
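    [For anyone who wants to check Tom’s arithmetic: the 97.6% figure can be reproduced from the per-category abstract counts commonly quoted from the Cook et al. data file. The counts below are an assumption of this sketch, not figures stated in this thread.]

```python
# Abstract-rating counts per endorsement level, as commonly quoted from the
# Cook et al. (2013) data file (levels 1-3 endorse, 5-7 reject, 4 = no position).
# These figures are assumptions of this sketch, not taken from the thread.
counts = {1: 64, 2: 922, 3: 2910, 5: 54, 6: 15, 7: 9}

# Throw out the implicit categories (3 and 5), as Tom suggests.
non_implicit_endorse = counts[1] + counts[2]  # explicit endorsement
non_implicit_reject = counts[6] + counts[7]   # explicit rejection

share = 100 * non_implicit_endorse / (non_implicit_endorse + non_implicit_reject)
print(f"{share:.1f}%")  # -> 97.6%
```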

  189. Don Monfort says:

    You are a joker, tom. Only category 1. includes the >50% quantification. Anyone who can read can see that. You talk a lot but say nothing. Invoking some BS subjective interpretation by conversational implacature is not science, tommy. It’s just BS.

    Someone tell tommy why:

    1. explicit endorsement with quantification

    2. explicit endorsement without quantification

    3. implicit endorsement

    ain’t the same. It’s got something to do with the different wording. Even a moron should be able to see that. Unless they want to be fooled, like Wotts.

  190. Don Monfort says:

    You don’t get even the basics, wotts. They are leading you around by the nose.

  191. Tom Curtis says:

    Richard, what is the source of your information about the held back papers?

  192. @Tom C
    There are totals in the paper, and there are totals from the released data. I took the difference.

  193. Ditto for the error rate. 3% of ratings are non-integer. Therefore, the authors must have disagreed.

  194. Tom Curtis says:

    Poor Montfort. Does it really take that little to exhaust your capacity to follow a rational argument? How do you manage to tie your shoe laces in the morning?

  195. Don Monfort says:

    Do you know the difference between with and without, tommy? Do you know up from down, tommy?

  196. Don Monfort says:

    Dana is getting reamed on the nottingham blog. He should stay behind the protective shield of his Guardian moderators.

    Dana Nuccitelli July 29, 2013 at 2:35 pm

    Did anyone commenting here actually read the above post?

    Nice argument, dana.

  197. Don Monfort says:

    Somebody help tommy. Please explain to him why these two words do not mean the same thing:

    1. with

    2. without

  198. Don’t forget part I:

    I’ve heard through the grapevines that the discussion in the comments section was not bad.

  199. Dude, it’s implicature.

    Here’s a scientific article on the subject:

    The paper reviews a substantial part of the research on linguistic politeness, with the objective to evaluate current politeness theories and to outline directions for future politeness studies. The topics addressed comprise (1) the distinction of politeness as strategic conflict avoidance and social indexing; (2) the linguistic enactment of politeness; (3) social and psychological factors determining politeness forms and functions; (4) the impact of discourse type on politeness; (5) the counterpart to politeness, i.e. rudeness. Furthermore, the paper provides an introduction to the remaining contributions to this Special Issue.

    Just a random example, of course, Don Don.

    Thank you for your concerns.

  200. > Dana is getting reamed […]

    That must hurt.

    Let’s see. Barry. TLITB1. Roger. Richard. Shub. MangoChutney. Arthur. Foxgoose. Mark. Now, that’s a big tag team.

    Why don’t you go and join in, Don Don?

  201. Yes, Tom – you make a good point. I think I have been rather caught out in this discussion. My mistake for assuming that those with whom I’m engaging are being open and honest and that the questions being asked are genuine. Should have known better really. Live and learn.

  202. Don Monfort says:

    I have, willie. Why don’t you go and chime in with your bizarre knee-jerk defenses of anything alarmist. Dana needs a sycophant.

    Given that I’m trying to maintain a semblance of civility, can I ask that everyone do their best to do the same?

  204. Don Monfort says:

    Why don’t you explain what that has got to do with the difference between ‘with’ and ‘without’, as they were used in an allegedly scientific paper.

  205. Don Monfort says:

    Dana couldn’t handle the truth on the nottingham blog. After insisting on the right to defend his paper, he scampered off failing to address the issues raised in the comments. Here is one that he ran from:

    Don Monfort July 29, 2013 at 3:48 pm

    This is not science, Dana:

    “Our survey also included categories for papers that quantified the human contribution to global warming. In the author self-ratings phase of our study, 237 papers fell into these categories. 96 percent of these said that humans are the primary cause of the observed global warming since 1950. The consensus on human-caused global warming is robust.”

    You have stated that authors’ responses included 228 self-ratings in category 1.-humans are the primary cause. In your table 5 there are 39 authors’ self-ratings that reject AGW. Obviously they reject the assertion that humans are the primary cause of AGW. Follow me so far, Dana? If you want to play this game you should at least consistently use the appropriate numbers. It’s 228 versus 39.

  206. BBD says:

    There’s a strong scientific consensus that most if not all warming (GAT/OHC) since the 1970s is anthropogenic. Bickering over the details to create a false impression of disunity where none exists is the tactic of first and final resort for contrarians who have no robust scientific counter-argument to the scientific consensus on AGW.

  207. Don Monfort says:

    If there is a strong scientific consensus, then we don’t need any more bogus 97% propaganda papers to prove it. What other science tries to sell an alleged massive consensus using toothpaste peddling tactics? The pause is killing your cause. You need to step up your game.

  208. BBD says:

    It’s not a “cause”. It’s physics.

    The only reason papers like Cook et al. are written is because contrarians falsely claim disunity amongst scientists. It is an attempt to address the rhetoric of contrarianism. The astonishingly virulent response to Cook et al. demonstrates how much contrarianism depends on false claims of disunity amongst scientists rather than on any coherent scientific argument.

  209. Don Monfort says:

    OK, the pause is killing your physics.

    Bring ’em on! A new faux 97% consensus paper every six months will keep our troops fired up. Big Oil will pump ever increasing bundles of filthy lucre into our right-wing Creationist tobacco puffing disinformation conspiracy campaign. The pollutants will continue to be spewed out in ever increasing billows by our Chinese accomplices and we will be rolling in the CO2 fertilized high clover.

    If you really believe that Cook et al is scientific argument, you are delusional. And only the already convinced are buying it. Check the scientific public opinion polls conducted by professionals.

  210. BBD says:

    OK, the pause is killing your physics.

    Nonsense. OHC 0 – 2000m.

    Anyone conflating surface temperature with “global warming” has misunderstood the basics.

  211. BBD says:

    If you really believe that Cook et al is scientific argument, you are delusional.

    This is an irrelevance. I am forced to repeat myself, again:

    There’s a strong scientific consensus that most if not all warming (GAT/OHC) since the 1970s is anthropogenic. Bickering over the details to create a false impression of disunity where none exists is the tactic of first and final resort for contrarians who have no robust scientific counter-argument to the scientific consensus on AGW.

    The only reason papers like Cook et al. are written is because contrarians falsely claim disunity amongst scientists. It is an attempt to address the rhetoric of contrarianism. The astonishingly virulent response to Cook et al. demonstrates how much contrarianism depends on false claims of disunity amongst scientists rather than on any coherent scientific argument.

    Please read the words.

  212. Tom Curtis says:

    That is a sharp lesson anybody who attempts to defend science (as opposed to defending their political ideology from science) very quickly learns when entering the climate debate.

  213. dana1981 says:

    Simple solution I try hard to live by (sometimes don’t manage to live up to it) – DNFTT.

    Don Don is gleeful because nearly all the commenters on my post are contrarians. Because they make negative (and generally invalid and inaccurate and ignorant) comments, I’m “getting reamed”. Whoopdy-doo. DNFTT.

  214. Don Monfort says:

    I didn’t force you to repeat yourself. But keep doing it. It’s amusing. Especially that part about the missing heat sneaking through the atmosphere, penetrating the land and ocean surfaces undetected and secreting itself in the deep frigid abysses. It’s really not working for you. If you subtract that BS 97% number from 100%, you get the part of the general public that is worried about CAGW. Nine out of ten dentists surveyed recommend toothpaste and are scared of global warming. No, that’s just some survey conducted by DIY hobbyist pretend “climate scientists”. It’s more like 3 out of a hundred.

  215. dana1981 says:

    You know my advice here BBD – DNFTT.

  216. Thanks, Richard.

    Could you make your computation available? I’d like to see how you got that number.

  217. > Kappa statistics can be applied across *any* two or more methods of rating or evaluation.

    Indeed, the question is how well it does on specific cases. Even random rating may be problematic:

    In order to obtain a good equation of chance agreement probability, it is necessary to define what chance agreement is and to explain the circumstances under which it occurs. Any agreement between 2 raters A and B can be considered as a chance agreement if a rater has performed a random rating (i.e. classified a subject without being guided by its characteristics) and both raters have agreed. If a rating is random, it is possible to demonstrate that agreement can occur with a fixed probability of 0.5. Simulations that we have conducted also tend to confirm this fact. It follows that a reasonable value for chance-agreement probability should not exceed 0.5.

    Click to access kappa_statistic_is_not_satisfactory.pdf

  218. > If you take the position that authors are in the best position to rate their paper […]

    An alternative is to take the position that the authors rated their PAPER, while the raters rated the ABSTRACTS. These are two different objects. This fails a basic condition:

    Interobserver variation can be measured in any situation in which two or more independent observers are evaluating the same thing.

    Click to access Interrater_agreement.Kappa_statistic.pdf

    While we can posit that an ABSTRACT is related to a PAPER, we can’t say it’s the same thing. Therefore, a Kappa test might be better suited to the rater-rater reliability, and perhaps not to the rater-self-rater one.

    Also note the guidelines offered in that paper. Even a loosely-weighted Kappa of 24% would be said to be “fair”, whereas anything above an 80% threshold would be almost perfect.

    Considering we’re evaluating two “things”, we’re far from the abyss advertised.
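    [For readers unfamiliar with the statistic being argued over: Cohen’s kappa discounts the raw agreement rate between two raters by the agreement expected from chance alone. A minimal sketch follows; the 3x3 confusion matrix is invented purely to show the mechanics, not drawn from the Cook et al. data.]

```python
def cohen_kappa(matrix):
    """Cohen's kappa for two raters, from a square confusion matrix
    (rows: rater A's categories, columns: rater B's)."""
    n = sum(sum(row) for row in matrix)
    k = len(matrix)
    # Observed agreement: fraction of items on the diagonal.
    observed = sum(matrix[i][i] for i in range(k)) / n
    # Chance agreement: product of the raters' marginal proportions, per category.
    expected = sum(
        (sum(matrix[i]) / n) * (sum(row[i] for row in matrix) / n)
        for i in range(k)
    )
    return (observed - expected) / (1 - expected)

# Hypothetical example: two raters, three categories (endorse / neutral / reject).
m = [[20,  5,  0],
     [10, 40,  5],
     [ 0,  5, 15]]
print(round(cohen_kappa(m), 2))  # -> 0.59, "moderate" on the usual guidelines
```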

  219. Don Monfort says:

    You have misrepresented the Cook et al data, dana. You know that 228 and 39 ain’t 96%, in the real world. Are you going to correct your error? If you don’t, that makes you a liar. Very pathetic performance on the nottingham blog, dana. Cook should bench you. Do you have any other hobbies, besides playing at DIY climate science?

    kappa kappa

    Don, apart from suggesting that you look at table 5 again (and actually give it some thought before posting another comment), I’m going to take Dana’s advice. I find your style unpleasant, your discussion tactics dishonest, and your absolute certainty concerning. If you can avoid thread-bombing, being abusive, and being personally insulting (which you haven’t quite achieved TBH, but I’ll let that pass for now), you are free to comment. Don’t expect me to respond though. Free speech does not mean that you’re entitled to a response.

  221. Rob Painting says:

    Don Mon – There’s nothing sneaky about the wind-driven ocean circulation transporting heat to the deep ocean, it’s just that no contrarian, it appears, can be bothered studying a bit of oceanography. Personal incredulity is not a scientific argument.

  222. BBD says:

    @ Dana @ Wotts.

    Sage counsel.

  223. Don Monfort says:

    You look at table 5, wotts. It very clearly states that 39 authors’ self-ratings reject AGW. If they reject AGW-guess what-they reject that humans are responsible for most of it . What is so hard to understand about that? Dana is desperate to maintain and promote the 97%-96% meme and he has misrepresented the data in Cook et al to make it look like the consensus includes quantification of >50% human responsibility. He has not corrected his error, so now I am calling him a deliberate liar. And I know that you will be a denier of these facts. Now call me a troll.

  224. > It very clearly states that 39 authors’ self-ratings reject AGW. If they reject AGW-guess what-they reject that humans are responsible for most of it .

    This conception of AGW defuses the bomb that Dana misclassified unquantified endorsements that do not minimize or reject AGW.

    You can have the last word, Don Don.

  225. No, all that Tom is saying applies to any rating system.

  226. Don Monfort says:

    You are not supposed to reply to my comments, willie. I have been designated a troll.

    228 for, 39 against (and that is not counting 4b ratings, which are a position and do not support the consensus).

    You do the math. The 97% meme is busted.

    kappa kappa

  227. Don Don gets called for unpleasantness, dishonesty, over-confidence, thread-bombing, being abusive and being insulting.

    Don Don now plays victim.

    Life is unfair.

  228. Don Monfort says:

    I am not complaining, willie. I am amused by your fellow travelers’ attempts to marginalize my comments with that troll bullshit. They are just little punk sissies. At least you have the guts not to run away. Did you see dana’s pathetic performance on the nottingham blog? He doesn’t do so well when he is not being protected by sycophant moderators. That’s all I have for you clowns. The audience is too small and dense to continue taking my time to dispense wisdom on this barren venue. See you elsewhere, willie.

  229. And you don’t understand why you get called for your behaviour? Hmmm, quite remarkable!

  230. dana1981 says:

    Wotts – correct, except those are the numbers for self-ratings, not abstracts. There weren’t many abstracts rated in categories 7 or 1. Much bigger sample size in the self ratings (228 cat 1, 9 cat 7).

  231. dana1981 says:

    By “tricked me” Don Don means he asked and I answered. The self-ratings data are publicly available and anyone can check this for themselves. 228 Cat 1, 9 Cat 7.

    Thanks and, indeed, I did get confused between self-ratings and abstract ratings. Still amazed that Don thinks we should use the 39 rejects from Table 5 with the 228 self-rated as category 1.

  233. Table 5 combines Cats 1-3 and 5-7. The 39 rejects is comparable to the 1342 endorsements. If you want to look at those explicitly quantifying AGW, that’s only Cat 1 (228) and 7 (9), which are not listed in Table 5. Not surprisingly, Don Don is either confused or misrepresenting our results.
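    [The disagreement in the exchange above is ultimately about denominators. Using only the self-rating totals quoted in this thread, both published figures fall out of the arithmetic:]

```python
# Self-rating totals as quoted in the thread (Cook et al. 2013).
endorse_all, reject_all = 1342, 39  # categories 1-3 vs 5-7 (the Table 5 split)
cat1, cat7 = 228, 9                 # explicit, quantified endorsement vs rejection

# Each ratio uses a numerator and denominator drawn from the SAME split.
print(f"{100 * endorse_all / (endorse_all + reject_all):.1f}%")  # -> 97.2%
print(f"{100 * cat1 / (cat1 + cat7):.1f}%")                      # -> 96.2%
```

    [Don Don’s 228-versus-39 comparison takes the numerator of the second ratio and the rejection count of the first, which is why it reproduces neither figure.]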

  234. Pingback: Watt about the Tol Poll? | Wotts Up With That Blog

  235. Pingback: The Climate Change Debate Thread - Page 3055

  236. Pingback: Watt about the 97% consensus, again? | Wotts Up With That Blog

  237. Pingback: Real Sceptic » Cook’s 97% Climate Consensus Paper Doesn’t Crumble Upon Examination

Comments are closed.