For some, quantity trumps quality in scientific research. Countries that have submitted the fewest papers to the online repository arXiv since 2011 tended to plagiarize the most. That’s what Science’s news and policy tracker, ScienceInsider, found when it asked arXiv to share data about the papers researchers submitted to it.
Anyone can submit a manuscript to arXiv (pronounced “archive”) as long as it documents a study in the math or physics domains. And the documents don’t have to go through the orthodox peer-review process, which makes it relatively easy to get accepted.
Instead, a bot vets each new study for quality, with a focus on text reused from older studies. The automated program compares a new article’s text to the text of every other document in arXiv’s database. After ruling out exceptions, such as when an author cites her own work or uses quotes, the bot flags the papers that heavily lift text word for word from older studies.
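The article doesn’t publish the bot’s code, but overlap detectors of this kind are commonly built on shared word n-grams. A minimal sketch in Python, assuming a simple shingle-overlap approach (all function names here are hypothetical, not arXiv’s actual implementation):

```python
from typing import Set, Tuple

def word_ngrams(text: str, n: int = 7) -> Set[Tuple[str, ...]]:
    """Return the set of n-word shingles in a text (case-insensitive)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_score(new_text: str, old_text: str, n: int = 7) -> float:
    """Fraction of the new text's n-grams that also appear in the old text."""
    new_grams = word_ngrams(new_text, n)
    if not new_grams:
        return 0.0
    return len(new_grams & word_ngrams(old_text, n)) / len(new_grams)

# A submission would be scored against every prior document, and flagged
# when the overlap with some non-self, non-quoted source is high.
```

In practice a system like this would exclude an author’s own earlier papers and quoted passages before flagging, as the article describes.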
Copying, the bot finds, is quite common: among the 767,000 papers submitted from arXiv’s inception in 1991 until 2012, one in 16 authors was found to have copied long phrases and sentences from their own previously published work, and about one out of every 1,000 authors copied about a paragraph’s worth of text from other people’s papers without citing them.
So what happens to these copycat studies? ArXiv’s founder, Paul Ginsparg, and Cornell PhD student Daniel Citron have conducted what they say is the first comprehensive study of patterns of text reuse within the full texts of an important large scientific corpus. They found that the papers with the most reused text tended not to get cited as much by later researchers.
“One motivation for undertaking this analysis of arXiv data was the known incidence of text copying and plagiarism, usually noticed by readers, and sometimes reported in the news media,” the researchers write of their study, which attempts to focus on “textual overlap” within arXiv, not “plagiarism” per se. There are no universal guidelines for what constitutes plagiarism in science anyway, they note, but rather “a standard somewhat more lenient than currently applied to journalists, popular authors, and public figures.”
Even if the algorithm can’t detect clear-cut plagiarism, it can help: an author’s tendency to reuse text in an article is a good indicator of her likelihood to plagiarize. Citron and Ginsparg shared their results in PNAS earlier this month, and posited that the plagiarism was influenced by cultural differences “in academic infrastructure and mentoring, or incentives that emphasize quantity of publication over quality.”
Those not highly proficient in English might also be likely to lift text from English sources. The paper observes this at the student level, “where in order to explain concepts, students less confident in their English proficiency tended to employ longer phrases from other sources, rather than just words.” But even at a later career stage, there may be a continued impetus for plagiarism. “A researcher concerned that his or her articles are rejected due to the quality of writing may feel compelled to imitate sentence structures from other articles.”
But ScienceInsider wanted to get to the bottom of these cultural differences. Knowing that authors had to report their home countries with each submission, it asked Ginsparg to release this data. ScienceInsider then mapped out all the countries from which authors submitted at least 100 papers since August 2011 and found that a small number of countries, such as the U.S., Canada, and Japan, had the fewest flagged authors.
Incidentally, authors from these industrialized countries turned out to submit the most papers to arXiv. Authors from less industrialized countries, ScienceInsider noted, had the most flagged studies but also tended to submit fewer papers:
For example, of all the authors from Bulgaria who submitted papers since August 2011, 20% submitted flagged articles, while arXiv’s bot flagged only 6% of the authors from Japan. In the same time frame, around 4,700 papers came out of Japan, while Bulgaria submitted only around 200.
In the U.S., 1,236 out of 26,052 authors were flagged, while in Germany, 297 out of 9,201 authors were flagged. In Iran, 164 out of 1,054 authors were cited for “text overlap,” while in China, 688 out of 6,372 authors were flagged.
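The raw counts above are easier to compare as per-country flag rates. A quick back-of-the-envelope check, using only the figures quoted in the text:

```python
# (flagged authors, total authors) per country, from the article's figures
flagged = {
    "U.S.": (1236, 26052),
    "Germany": (297, 9201),
    "Iran": (164, 1054),
    "China": (688, 6372),
}

for country, (hits, total) in flagged.items():
    print(f"{country}: {100 * hits / total:.1f}% of authors flagged")
# U.S.: 4.7% of authors flagged
# Germany: 3.2% of authors flagged
# Iran: 15.6% of authors flagged
# China: 10.8% of authors flagged
```

The rates make the pattern explicit: the countries submitting the most papers also have the lowest share of flagged authors.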
“While conceivably exacerbated by the ease of cutting and pasting text in electronic format,” the researchers note, “the problem does predate both the new technology and the use of preprints. Ironically, the combination of those makes that reuse that much easier to detect.”
Leave it to a bot, a bit of data mining, and a map to keep researchers in check, or at least raise more questions about how the same science is conducted differently across the globe.
If the world were mapped according to how many scientific research papers each country produced, it would take on a rather bizarre, uneven appearance. The Northern hemisphere would balloon beyond recognition. The global south, including Africa, would effectively melt off the map.
This image is based on data from 2001, but, as this interactive map which tracks the same data from 1990 to 2011 shows, very little has changed in the past decade.
The map makes a dramatic point about the complexities of global inequalities in knowledge production and exchange. So what is driving this inequality and how can it be corrected?
Money and technology are needed to produce research. The average research and development intensity (R&D spending as a percentage of GDP) was 2.4% for OECD countries in 2009, but few developing countries had reached 1%. Without sufficient national funds, researchers must spend a great deal of time fundraising and dealing with grant organisations outside their universities. This means less time for actually undertaking and producing research.
When it comes to technology, substantial bandwidth powers the global north and connects it to its neighbours. The internet is far slower and more expensive in Africa, which makes collaboration between researchers on the continent difficult and puts them at a disadvantage relative to colleagues in the US, Europe and Asia.
These technical, financial and even mechanical issues are easy to identify. It is tempting to put one’s faith in the idea that more money and machines will solve the problems of knowledge production inequality. But it’s not that simple.
A double bind
Values and practices contribute just as much to global imbalances as material disparities do. The science journals that publish the research which populates our strange map aren’t neutral: engagement with them is characterised by several levels of uneven participation.
A study of four high-impact journals in the management social sciences found that they attracted authors from many countries worldwide, but their empirical sites of investigation were concentrated in Europe and North America. This suggests that local researchers will use their scarce financial and technical resources to get published in high-impact, supposedly international journals.
Given the overall constrained research environments in which researchers operate, these resources are lost to local research needs and may in effect subsidise the research of the global north. At the same time, relatively well-resourced researchers from the global north undertake research in developing countries and publish in those same journals.
In the worst cases, the global south simply provides novel empirical sites and local academics may not become equal partners in these projects about their own contexts.
Researchers in the global south are caught in a double bind. They are rewarded for publishing in “international” journals in several ways: through promotions and often even financially. But development imperatives, government policies and their own interests pressurise them to undertake research that is relevant to pressing social and related problems which may not be appealing or even “academic” enough to interest the international journals.
There is another problem with this journals map: it measures science journal articles as the sole representation of scientific research output. It ignores things like monographs and edited collections, and it interprets “science” narrowly, excluding the social sciences and humanities.
In many contexts valid research is undertaken and published under the unfortunate label of “grey literature”. This includes working papers and technical and policy reports. These genres of output are often prevalent in research areas focused on pressing development issues.
Another category of “invisible research” from the South is the considerable output commissioned by government and undertaken by consultants, many of whom are practising academics. Even when it is published, this kind of research is often not attributed to its actual authors. It has the added problem of often being embargoed – researchers sometimes even have to sign confidentiality agreements or “official secrets acts” when they are given grants.
Some complain that including these genres in our understanding of scientific research will compromise quality. But we shouldn’t reject these outputs. We should find ways to prove their worth, whether through new mechanisms of peer review or new metrics that measure impact and value through use and re-use.
Access is another issue. These coveted journals generally reside behind paywalls. This excludes those who cannot afford access, such as researchers in resource-constrained environments and members of the public who don’t have passwords for the electronic facilities of universities and research institutions.
This situation will improve thanks to the open access policies that are currently being developed in the European Union, the UK and elsewhere. These policies will substantially increase the volume of research to which scholars and readers worldwide have access. But there’s an ironic danger in this more ubiquitous availability.
If the developing world doesn’t have similar national and regional policies and if resources aren’t made available to actively support open dissemination in these countries, research from the developing world will be rendered even more invisible.
This may unwittingly consolidate the erroneous impression that these scholars are undertaking little of value, have little to contribute to global knowledge and are reliant on the intellectual capacity of the global north.
Starting to change the map will require several steps. Firstly, funding and technological infrastructure must be improved. At the same time, our own perceptions of “science” must be broadened to encompass the social sciences.
Research outputs need to be recognised as existing beyond the boundaries of the formal journal article. Incentives and reward systems need to be adjusted to encourage and legitimise the new, fairer practices that are made possible in a digitally networked world.
And finally, the open access movement needs to broaden its focus from access to knowledge to full participation in knowledge creation and in scholarly communication.
Editor’s note: this article was updated after publication to reflect newer map data.
A longer version of this article originally appeared on the London School of Economics’ Impact Blog.