From Guerrilla Open Archives


On Data Refuge

This is the text of an essay I contributed to a pamphlet published by Memory of the World. I am deeply grateful to Memory of the World for inviting me to prepare this paper, for editing it, and for teaching me so much since then. The full pamphlet, including this essay and two others, is available here: https://hcommons.org/deposits/item/hc:19825/

My goal in this paper is to tell the story of a grass-roots project called Data Refuge (http://www.datarefuge.org) that I helped to found shortly after, and in response to, the Trump election in the USA. Trump’s reputation as anti-science, and the promise that his administration would elevate into positions of power people with a track record of distorting, hiding, or obscuring the scientific evidence of climate change, caused widespread concern that valuable federal data was now in danger. The Data Refuge project grew from the work of Professor Bethany Wiggin and the graduate students within the Penn Program in Environmental Humanities (PPEH), notably Patricia Kim, and was formed in collaboration with the Penn Libraries, where I work. In this paper, I will discuss the Data Refuge project and call attention to a few of the challenges inherent in the effort, especially as they overlap with the goals of this collective.

I am not a scholar. Instead, I am a librarian, and my perspective as a practicing information professional informs the way I approach this paper, which weaves together the practical and technical work of ‘saving data’ with the theoretical, systemic, and ethical issues that frame and inform what we have done. I work as the head of a relatively small and new department within the libraries of the University of Pennsylvania, in the city of Philadelphia, Pennsylvania, in the US. I was hired to lead the Digital Scholarship department in the spring of 2016, and most of the seven (soon to be eight) people within Digital Scholarship have joined the library since then in newly created positions. Our group includes a mapping and spatial data librarian and three people focused explicitly on supporting the creation of new Digital Humanities scholarship. There are also two people in the department who provide services connected with digital scholarly open access publishing, including the maintenance of the Penn Libraries’ repository of open access scholarship, and one Data Curation and Management Librarian. This Data Librarian, Margaret Janz, started working with us in September 2016, and figures heavily in the story I’m about to tell about our work helping to build Data Refuge. While Margaret and I were the main people in our department involved in the project, it is useful to understand the work we did as connected more broadly to the intersection of activities (from multimodal, digital, humanities creation to open access publishing across disciplines) represented in our department at Penn.

At the start of Data Refuge, Professor Wiggin and her students had already been exploring the ways that data about the environment can empower communities through their art, activism, and research, especially along the lower Schuylkill River in Philadelphia. They were especially attuned to the ways that missing data, or data that is not collected or communicated, can be a source of disempowerment. After the Trump election, PPEH graduate students raised the concern that the political commitments of the new administration would result in the disappearance of environmental and climate data that is vital to work in cities and communities around the world. When they raised this concern with the library, together we co-founded Data Refuge. It is worth noting that, while the Penn Libraries is a large and relatively well-resourced research library in the United States, it did not have any automatic way to ingest and steward the data that Professor Wiggin and her students were concerned about. Our system of acquiring, storing, describing and sharing publications did not account for, and could not easily handle, the evident need to take in large quantities of public data from the open web and make them available and citable by future scholars. Indeed, no large research library was positioned to respond to this problem in a systematic way, though there was general agreement that the community would like to help.

The collaborative, grass-roots movement that formed Data Refuge included many librarians, archivists, and information professionals, but it was clear from the beginning that my own profession did not have in place a system for stewarding these vital information resources, or for treating them as ‘publications’ of the federal government. This fact was widely understood by various members of our profession, notably by government document librarians, who had been calling attention to this lack of infrastructure for years. As Government Information Librarian Shari Laster described in a blog post in November of 2016, government documents librarians have often felt like they are ‘under siege’ not from political forces, but from the inattention to government documents afforded by our systems and infrastructure. Describing the challenges facing the profession in light of the 2016 election, she commented: “Government documents collections in print are being discarded, while few institutions are putting strategies in place for collecting government information in digital formats. These strategies are not expanding in tandem with the explosive proliferation of these sources, and certainly not in pace with the changing demands for access from public users, researchers, students, and more.” (Laster 2016) Beyond government documents librarians, our project joined efforts that were ongoing in a huge range of communities, including: open data and open science activists; archival experts working on methods of preserving born-digital content; cultural historians; federal data producers and the archivists and data scientists they work with; and, of course, scientists.

Born from the collaboration between Environmental Humanists and Librarians, Data Refuge was always an effort both at storytelling and at storing data. During the first six months of 2017, volunteers across the US (and elsewhere) organized more than 50 Data Rescue events, with participants numbering in the thousands. At each event, a group of volunteers used tools created by our collaborators at the Environmental Data and Governance Initiative (EDGI) (https://envirodatagov.org/) to support the End of Term Harvest (http://eotarchive.cdlib.org/) project by identifying seeds from federal websites for web archiving in the Internet Archive. Simultaneously, more technically advanced volunteers wrote scripts to pull data out of complex data systems, and packaged that data for longer term storage in a repository we maintained at datarefuge.org. Still other volunteers held teach-ins, built profiles of data storytellers, and otherwise engaged in safeguarding environmental and climate data through community action (http://www.ppehlab.org/datarefugepaths). The repository at datarefuge.org that houses the more difficult data sources has been stewarded by Margaret Janz and me through our work at Penn Libraries, but it exists outside the library’s main technical infrastructure.

This distributed approach to the work of downloading and saving the data encouraged people to see how they were invested in environmental and scientific data, and to reflect on how our government records should be considered the property of all of us. Attending Data Rescue events was a way for people who value the scientific record to fight back, in a concrete way, against an anti-fact establishment. By downloading data and moving it into the Internet Archive and the Data Refuge repository, volunteers were actively claiming the importance of accurate records in maintaining or creating a just society.

Of course, access to data need not rely on its inclusion in a particular repository. As is demonstrated so well in other contexts, technological methods of sharing files can make the digital repositories of libraries and archives seem like a redundant holdover from the past. However, as I will argue further in this paper, the data that was at risk in Data Refuge differed in important ways from the contents of what Bodó refers to as ‘shadow libraries’ (Bodó 2015). For opening access to copies of journal articles, shadow libraries work perfectly. However, the value of these shadow libraries relies on the existence of widely agreed upon trusted versions. If in doubt about whether a copy is trustworthy, scholars can turn to more mainstream copies. This was not the situation we faced building Data Refuge. Instead, we were often dealing with the sole public, authoritative copy of a federal dataset and had to assume that, if it were taken down, there would be no way to check the authenticity of other copies. The data could not easily be pulled out of these systems, as the data and the software that contained it were often inextricably linked. We were dealing with unique, tremendously valuable, but often difficult to untangle datasets rather than neatly packaged publications. The workflow we established was designed to privilege authenticity and trustworthiness over either the speed of the copying or the easy usability of the resulting data.

This extra care around authenticity was necessary because of the politicized nature of environmental data that made many people so worried about its removal after the election. It was important that our project supported the strongest possible scientific arguments that could be made with the data we were ‘saving’. That meant that our copies of the data needed to be citable in scientific scholarly papers, and that those citations needed to be able to withstand hostile political forces who claim that the science of human-caused climate change is ‘uncertain’. It was easy to imagine in the autumn of 2016, and even easier to imagine now, that hostile actors might wish to muddy the science of climate change by releasing fake data designed to cast doubt on it. For that reason, I believe that the unique facts we were seeking to safeguard in the Data Refuge bear less similarity to the contents of shadow libraries than they do to news reports in our current distributed and destabilized mass media environment. Referring to the ease of publishing ideas on the open web, Zeynep Tufekci wrote in a recent column, “And sure, it is a golden age of free speech—if you can believe your lying eyes. Is that footage you’re watching real? Was it really filmed where and when it says it was? Is it being shared by alt-right trolls or a swarm of Russian bots? Was it maybe even generated with the help of artificial intelligence? (Yes, there are systems that can create increasingly convincing fake videos.)” (Tufekci 2018). This was the state we were trying to avoid when it comes to scientific data, fearing that we might have the only copy of a given dataset without solid proof that our copy matched the original.
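One narrow, technical piece of such proof can be sketched in code. The following is a minimal illustration of fixity checking with cryptographic checksums, a common digital preservation technique. It is not a description of the tooling actually used at Data Rescue events, and the function names (`sha256_of`, `build_manifest`, `verify_copy`) are hypothetical. It assumes that a manifest of checksums was recorded from the original files by a party people already trust, which is precisely the part that code alone cannot supply.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of one file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_copy(root: Path, manifest: dict[str, str]) -> list[str]:
    """Return relative paths that are missing, altered, or unexpected.

    Assumption: `manifest` was recorded from the original dataset
    by someone the reader has reason to trust.
    """
    current = build_manifest(root)
    problems = [name for name, digest in manifest.items()
                if current.get(name) != digest]
    problems += [name for name in current if name not in manifest]
    return sorted(problems)
```

A match shows only that a copy is identical to whatever was fingerprinted. Whether the fingerprint came from the authentic original remains a question of trust in the people and institutions who recorded it.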

If US federal websites cease functioning as reliable stewards of trustworthy scientific data, reproducing their data without a new model of quality control risks producing the very censorship that our efforts are supposed to avoid, and further undermining faith in science. Said another way, if volunteers duplicated federal data all over the Internet without a trusted system for ensuring the authenticity of that data, then as soon as the originals were removed, a sea of fake copies could easily render the authentic copies invisible, and they would be just as effectively censored. As Tufekci puts it, “The most effective forms of censorship today involve meddling with trust and attention, not muzzling speech itself” (Tufekci 2018).

These concerns about the risks of open access to data should not be understood as capitulation to the current market-driven approach to scholarly publishing, nor as a call for continuation of the status quo. Instead, I hope to encourage continuation of the creative approaches to scholarship represented in this collective. I also hope the issues raised in Data Refuge will serve as a call to take greater responsibility for the systems into which scholarship flows and the structures of power and assumptions of trust (by whom, of whom) that scholarship relies on.

While plenty of participants in the Data Refuge community posited scalable technological approaches to help people trust data, none emerged that were strong enough to justify risking the further erosion of faith in science that a malicious attack might cause. Instead of focusing on technical solutions that rely on the existing systems staying roughly as they are, I would like to focus on developing networks that explore different models of trust in institutions, and that honor the values of marginalized and indigenous people. For example, in a recent paper, Stacie Williams and Jarrett Drake describe the detailed decisions they made to establish and become deserving of trust in supporting the creation of an Archive of Police Violence in Cleveland (Williams and Drake 2017). The work of Michelle Caswell and her collaborators on exploring post-custodial archives, and on engaging in radical empathy in the archives, provides great models of the kind of work that I believe is necessary to establish new models of trust that might help inform new modes of sharing and relying on community information (Caswell and Cifor 2016).

Beyond seeking new ways to build trust, it has become clear that new methods are needed to help filter and contextualize publications. Our current reliance on a few for-profit companies to filter and rank what we see of the information landscape has proved to be tremendously harmful for the dissemination of facts, and has been especially dangerous to marginalized communities (Noble 2018). While the world of scholarly humanities publishing is doing somewhat better than open data or mass media, there is still a risk that without new forms of filtering and establishing quality and trustworthiness, good ideas and important scholarship will be lost in the rankings of search engines and the algorithms of social media. We need new, large-scale systems to help people filter and rank the information on the open web. In our current situation, according to media theorist danah boyd, “[t]he onus is on the public to interpret what they see. To self-investigate. Since we live in a neoliberal society that prioritizes individual agency, we double down on media literacy as the ‘solution’ to misinformation. It’s up to each of us as individuals to decide for ourselves whether or not what we’re getting is true.” (boyd 2018)

In closing, I’ll return to the notion of guerrilla warfare that brought this panel together. While some of our collaborators and some in the press did use the term ‘Guerrilla archiving’ to describe the data rescue efforts (Currie and Paris 2017), I generally did not. The work we did was indeed designed to take advantage of tactics that allow a small number of actors to resist giant state power. However, if anything, the most direct target of these guerrilla actions in my mind was not the Trump administration. Instead, the action was designed to prompt responses by the institutions where many of us work and by communities of scholars and activists who make up these institutions. It was designed to get as many people as possible working to address the complex issues raised by the two interconnected challenges that the Data Refuge project threw into relief. The first challenge, of course, is the need for new scientific, artistic, scholarly and narrative ways of contending with the reality of global, human-made climate change. And the second challenge, as I’ve argued in this paper, is that our systems of establishing and signaling trustworthiness, quality, reliability and stability of information are in dire need of creative intervention as well. It is not just publishing but all of our systems for discovering, sharing, acquiring, describing and storing that scholarship that need support, maintenance, repair, and perhaps in some cases, replacement. And this work will rely on scholars, as well as expert information practitioners from a range of fields (Caswell 2016).

Closing note: The workflow established and used at Data Rescue events was designed to tackle this set of difficult issues, but needed refinement, and was retired in mid-2017. The Data Refuge project continues, led by Professor Wiggin and her colleagues and students at PPEH, who are “building a storybank to document how data lives in the world – and how it connects people, places, and non-human species.” (“DataRefuge” n.d.) In addition, the set of issues raised by Data Refuge continues to inform my work and the work of many of our collaborators.

Bodó, Balázs. 2015. “Libraries in the Post-Scarcity Era.” In Copyrighting Creativity: Creative Values, Cultural Heritage Institutions and Systems of Intellectual Property, edited by Porsdam. Routledge.

boyd, danah. 2018. “You Think You Want Media Literacy… Do You?” Data & Society: Points. March 9, 2018. https://points.datasociety.net/you-think-you-want-media-literacy-do-you-7cad6af18ec2.

Caswell, M. L. 2016. “‘The Archive’ Is Not an Archives: On Acknowledging the Intellectual Contributions of Archival Studies.”

Caswell, Michelle, and Marika Cifor. 2016. “From Human Rights to Feminist Ethics: Radical Empathy in the Archives.” Archivaria 82 (0): 23–43.

Currie, Morgan, and Britt Paris. 2017. “How the ‘Guerrilla Archivists’ Saved History – and Are Doing It Again under Trump.” The Conversation (blog). February 21, 2017. https://theconversation.com/how-the-guerrilla-archivists-saved-history-and-are-doing-it-again-under-trump-72346.

“DataRefuge.” n.d. PPEH Lab. Accessed May 21, 2018. http://www.ppehlab.org/datarefuge/.

“DataRescue Paths.” n.d. PPEH Lab. Accessed May 20, 2018. http://www.ppehlab.org/datarefugepaths/.

“End of Term Web Archive: U.S. Government Websites.” n.d. Accessed May 20, 2018. http://eotarchive.cdlib.org/.

“Environmental Data and Governance Initiative.” n.d. EDGI. Accessed May 19, 2018. https://envirodatagov.org/.

Laster, Shari. 2016. “After the Election: Libraries, Librarians, and the Government.” Free Government Information (FGI). November 23, 2016. https://freegovinfo.info/node/11451.

Noble, Safiya Umoja. 2018. Algorithms of Oppression: How Search Engines Reinforce Racism. New York: NYU Press.

Tufekci, Zeynep. 2018. “It’s the (Democracy-Poisoning) Golden Age of Free Speech.” WIRED. Accessed May 20, 2018. https://www.wired.com/story/free-speech-issue-tech-turmoil-new-censorship/.

“Welcome - Data Refuge.” n.d. Accessed May 20, 2018. https://www.datarefuge.org/.

Williams, Stacie M, and Jarrett Drake. 2017. “Power to the People: Documenting Police Violence in Cleveland.” Journal of Critical Library and Information Studies 1 (2). https://doi.org/10.24242/jclis.v1i2.33.
