Archives as Data
Digital history and archiving are thriving, but the increasing volume of digitized and “born digital” materials for historical research also presents new challenges for archivists and historians. Typically, the only way to explore these resources has been through keyword searching. More direct access to the data creates tremendous new research opportunities, but the barriers to entry can seem daunting.
The third edition of this NEH-funded program will offer practical training for historians and archivists in processing and analyzing textual data. Participants in the Archiving Digital Records workshop, designed for archivists, will learn how to use new technology to improve the description and arrangement of digital or digitized records, especially PDFs, and to provide users with new ways to access them. They will receive training in metadata tools such as PDF Processing, OCR Processing, and Named Entity Recognition (NER) analysis. Participants in the Text-as-Data workshop, designed for historians, will learn how to organize and analyze large document collections and use new methods to formulate original arguments. They will receive training in data science technologies such as R and SQL and will be expected to attend afternoon lab sessions where they will put these tools into practice. All participants will come together during lunch for invited speakers and seminar-style discussions on the novel challenges posed by archival research in the age of “big data,” including community representation, the protection of private information in online archives, and the professional and scholarly pitfalls of navigating this new terrain.
The Institute will be led by Matthew Connelly and Courtney Chartier, with co-teacher Ray Hicks, who has extensive experience processing and analyzing textual data. Lunchtime talks will feature presentations from archivists, historians, and data scientists (see list of previous invitees below). The Text-as-Data workshop will run for two weeks, while the Archiving Digital Records workshop classes will run for only one week. In the second week, participants in the Archiving Digital Records workshop will be expected to join the lunchtime talks and discussions remotely. Attendance is free, and limited funding is available for those who need to travel to participate. Note that we expect all participants to attend daily, and group activities will require everyone to be present and actively contributing.
The Institute is a joint project of Columbia's History Lab and Columbia Libraries, and is funded by the NEH Institutes for Advanced Topics in the Digital Humanities. Hands-on training will use textual data from the Freedom of Information Archive, a project that has aggregated the largest database of declassified government documents in the world. Here are the draft syllabi for the workshops as well as the slides from a typical Text-as-Data course. Please review these documents before applying.