The hope and despair of science and TDM
Chris Hartgerink spoke at an event we co-hosted with ScienceEurope and MEP Zdzisław Krasnodębski last week at the European Parliament. Read his story about the opportunities TDM can bring and the challenges he faces due to inadequate copyright rules.
Text and data mining (TDM) first came to my attention when I was a research assistant, and it excited me greatly. Here was a method that allowed you to systematically extract information from the literature at a rate unthinkable without it, while reducing the potential for human error by replacing manual work with software procedures.
Our research group was conducting TDM before we knew it was called that, although when this line of research began around 2010 the work was done by hand. More specifically, the work focused on whether statistical results in research articles were reported correctly. For example, a study on a drug trial might state the following result: “the experimental drug decreased mortality of the patients significantly compared to the control group, F(1, 39) = 2.43, p < .05.” This is a typical way of reporting statistics in science. The claim of a significant effect rests on “p < .05”, and from the reported test statistic and degrees of freedom, F(1, 39) = 2.43, we can recalculate the p-value and check whether it is consistent with the reported one. Here the recalculated p-value is about 0.13, so the reported “p < .05” is inconsistent with the test statistic and the claim of a significant effect is not supported.
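The recalculation described above can be sketched in a few lines of Python. This is only an illustration, not the code our group used: the function names are made up for this example, and the p-value comes from the standard relation between the F distribution and the regularized incomplete beta function, evaluated with a textbook continued fraction so that nothing beyond the standard library is needed.

```python
import math


def _beta_continued_fraction(a, b, x, max_iter=300, eps=1e-14):
    """Continued fraction for the incomplete beta function (modified Lentz method)."""
    tiny = 1e-300
    c = 1.0
    d = 1.0 - (a + b) * x / (a + 1.0)
    d = tiny if abs(d) < tiny else d
    d = 1.0 / d
    h = d
    for m in range(1, max_iter + 1):
        even = m * (b - m) * x / ((a + 2 * m - 1) * (a + 2 * m))
        odd = -(a + m) * (a + b + m) * x / ((a + 2 * m) * (a + 2 * m + 1))
        for num in (even, odd):
            d = 1.0 + num * d
            d = tiny if abs(d) < tiny else d
            c = 1.0 + num / c
            c = tiny if abs(c) < tiny else c
            d = 1.0 / d
            delta = d * c
            h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h


def regularized_incomplete_beta(a, b, x):
    """I_x(a, b), the regularized incomplete beta function."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    log_front = (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                 + a * math.log(x) + b * math.log(1.0 - x))
    front = math.exp(log_front)
    # Use the symmetry relation where the continued fraction converges faster.
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _beta_continued_fraction(a, b, x) / a
    return 1.0 - front * _beta_continued_fraction(b, a, 1.0 - x) / b


def f_p_value(f_value, df1, df2):
    """Upper-tail p-value of an F statistic with (df1, df2) degrees of freedom."""
    if f_value <= 0.0:
        return 1.0
    x = df2 / (df2 + df1 * f_value)
    return regularized_incomplete_beta(df2 / 2.0, df1 / 2.0, x)


def is_consistent(p_recomputed, comparison, p_reported):
    """Check a recomputed p-value against a reported '<' or '>' claim.

    Assumption for this sketch: only strict inequalities are handled;
    reported exact values ('p = ...') would need a rounding rule.
    """
    if comparison == "<":
        return p_recomputed < p_reported
    if comparison == ">":
        return p_recomputed > p_reported
    raise ValueError("unsupported comparison: " + comparison)


# The example from the text: F(1, 39) = 2.43, reported as p < .05.
p = f_p_value(2.43, 1, 39)
print(round(p, 2))                  # about 0.13
print(is_consistent(p, "<", 0.05))  # False: the reported p < .05 does not hold
```

With a statistics library at hand, the three helper functions collapse into a single call to an F distribution survival function; they are spelled out here only to keep the example self-contained.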
Such statistical information can be extracted from many papers. Initially, my supervisor Jelte Wicherts and his then PhD student Marjan Bakker set out to do exactly this. However, because this was done manually, the time investment to extract all those data points was substantial, and the effort yielded 1148 results from 49 papers. Several years later, when colleagues Sacha Epskamp and Michele Nuijten came along, they suggested automating these procedures, which resulted in the software called ‘statcheck’. For comparison, this software processes hundreds of papers in mere minutes, where doing this manually would take days, if not weeks, of work. The hopes of using TDM for large-scale data collection were high.
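To give a feel for what automating the extraction step involves, here is a minimal sketch in Python. It is not how statcheck is implemented; the pattern and the function name `extract_f_results` are assumptions made for this illustration, and the pattern only covers F-tests written in the style of the example above.

```python
import re

# Hypothetical pattern for results written like "F(1, 39) = 2.43, p < .05".
F_RESULT = re.compile(
    r"F\s*\(\s*(\d+)\s*,\s*(\d+)\s*\)\s*=\s*(\d*\.?\d+)\s*,"
    r"\s*p\s*([<>=])\s*(\d*\.\d+)"
)


def extract_f_results(text):
    """Return every F-test result found in the text as a dictionary."""
    results = []
    for match in F_RESULT.finditer(text):
        df1, df2, f_value, comparison, p_reported = match.groups()
        results.append({
            "df1": int(df1),
            "df2": int(df2),
            "f": float(f_value),
            "comparison": comparison,
            "p_reported": float(p_reported),
        })
    return results


# The first result is the example from the text; the second is invented
# purely to show that several results can be picked up in one pass.
sample = (
    "The experimental drug decreased mortality significantly, "
    "F(1, 39) = 2.43, p < .05. A second, made-up result reads "
    "F(2, 120) = 5.10, p = .007."
)
for result in extract_f_results(sample):
    print(result)
```

Each extracted result can then be fed into a p-value recalculation, and running that loop over the full text of many articles is what turns weeks of manual work into minutes.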
When I became a research assistant for the research group back in 2013, I was tasked with collecting articles to run this software on. I ended up manually downloading roughly 30,000 research articles over several months without any problems, from which we extracted 250,000 results. However, I love computers and started thinking about easier ways to do this, and eventually came across the idea of webscraping (automatically downloading webpages). Over the next year I taught myself to program a scraping procedure and informed my supervisor Jelte Wicherts that I could now automatically download 900,000 articles. He was astounded by the scale.
However, that was the moment the despair of TDM set in. Once I started downloading these articles, two large publishers told my university that this infringed their copyright and that they considered it theft of their content. They threatened to shut down access for my entire university if I did not stop (see here and here). Despite Tilburg University’s legal access to these journals, the threats put pressure on me to stop, and 900,000 articles became 300,000 articles. Some subscription-based publishers explicitly allow webscraping, whereas others require various additional agreements. The legal landscape is unclear, and there is a large power asymmetry when individual researchers are told that their work could get an entire university blocked from access, yet are not allowed to negotiate those agreements themselves.
Even when a non-commercial exception for research is in place, such as in the UK under the so-called “Hargreaves exception”, TDM research remains difficult. In light of the events that I described above, I sought a collaboration with a UK university to do TDM research. Because that university had legal access to the articles, doing TDM there was fully legal. Nonetheless, the university shut down our project before we could even start collecting data. Even when there is a legal right to do TDM, management seems to imagine risks and consequences.
Given these hardships, despair is the only word I can give for TDM research within the current legal and management system. If an exception finds its way into the reforms, it needs to be unequivocal and remove any doubt from management’s mind that TDM is legal. It needs to empower those who want to do TDM instead of just encouraging them. Moreover, I think that publishers are underestimating how much they would benefit from a wide exception. In the age of big data, they could capitalize on TDM even more than others, but that value is seemingly not part of the discussion.
On a final, more personal note: many of my colleagues have told me that “I wouldn’t be able to do what you are doing.” And frankly, I have thought about quitting TDM research many times. But the potential is just too great for me to let it slide.
Text made available under a CC 0 public domain dedication.
By Chris Hartgerink, PhD candidate in Statistics at Tilburg University