
libraryofcongress / btp-data

This Python tutorial demonstrates how to process and visualize the Library of Congress's By the People transcription data using natural language processing.

Home Page: https://libraryofcongress.github.io/btp-data/

License: Creative Commons Zero v1.0 Universal



Processing and Visualizing By the People Transcription Data


About the data

This tutorial provides an example of data processing and visualization using transcription data from By the People, the Library of Congress's crowdsourced transcription program. By the People invites anyone to contribute to the Library of Congress as a virtual volunteer by transcribing and reviewing digital images of texts to enhance Library of Congress digital collections. The Library of Congress thanks all By the People volunteers for sharing their time and knowledge with us to make this tutorial possible.

By the People transcriptions and tags are created by anonymous and registered volunteers. Once a transcription is finished, it must be reviewed by a registered volunteer. A transcription may undergo multiple rounds of edits before being completed. Finally, transcriptions are spot-checked by Library of Congress subject matter experts before they are incorporated into the digital collections on the Library's website to enhance search and accessibility. Transcriptions are also packaged into .csv files and made available as datasets as part of the Selected Datasets Collection.

In this tutorial, we will work with four datasets related to the movement for women's suffrage in the United States. Each dataset’s README includes additional information about its content and creation.

  • Anthony, Susan B. Transcription datasets from Susan B. Anthony Papers, Manuscript Division. compiled by By The People. Washington, D.C.: By the People, Library of Congress, to 2022, 2021. Software, E-Resource. https://www.loc.gov/item/2020445591/.
  • Catt, Carrie Chapman. Transcription datasets from Carrie Chapman Catt Papers, Manuscript Division. compiled by By The People. Washington, D.C.: By the People, Library of Congress, to 2022, 2020. Software, E-Resource. https://www.loc.gov/item/2019667239/.
  • Stanton, Elizabeth Cady. Transcription datasets from Elizabeth Cady Stanton Papers, Manuscript Division. compiled by By The People. Washington, D.C.: By the People, Library of Congress, 2021. Software, E-Resource. https://www.loc.gov/item/2020445592/.
  • Terrell, Mary Church. Transcription dataset from the Mary Church Terrell Papers, Manuscript Division. compiled by By The People. Washington, D.C.: By the People, Library of Congress, to 2021, 2018. Software, E-Resource. https://www.loc.gov/item/2021387726/.

Transcription volunteers are instructed to transcribe the text as written, including misspellings and abbreviations. Formatting is generally not preserved, with the exception of line breaks. Minimal markup includes “?” for illegible or unclear text, square brackets around deleted text, and square brackets with asterisks around marginalia ([*example*]). Pages without text are marked “nothing to transcribe” and do not have transcriptions on loc.gov.
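
As an illustration of these conventions, bracketed markup can be stripped with regular expressions before analysis. The following is a minimal sketch rather than code from the tutorial; the sample string and the strip_markup function are hypothetical:

import re

def strip_markup(text):
    # Remove marginalia, which volunteers mark with brackets and asterisks: [*example*]
    text = re.sub(r"\[\*.*?\*\]", "", text)
    # Remove deleted text, which volunteers enclose in square brackets
    text = re.sub(r"\[.*?\]", "", text)
    # Collapse any whitespace left behind
    return re.sub(r"\s+", " ", text).strip()

print(strip_markup("We demand the vote [*margin note*] for [all] women"))
# Prints: We demand the vote for women

Illegible words marked with “?” are handled later, during token filtering (see About the notebooks below).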

By the People datasets contain the following fields (a loading sketch follows the list):

  • Campaign – this is the highest hierarchical level in the arrangement of collections on By the People (example: Susan B. Anthony Papers). This field displays the campaign’s title.
  • Project – this is the second-highest hierarchical level of collections on By the People. Projects may map to an existing subset of a digital collection, such as an archival series, or may be a grouping of related items uniquely organized for By the People. This field displays the project’s title.
  • Item – this is the third-highest hierarchical level of collections on By the People, typically representing a folder, letter, document, or diary. This field displays the item title.
  • ItemId – this is the identifier for the item (see above for definition). This numerical identifier is consistent across the By the People website and loc.gov. The item and its metadata are usually located on the Library’s website at https://www.loc.gov/item/[ItemId]/
  • Asset – this is the identifier for the individual asset image. It is also referred to colloquially as the “page” by By the People volunteers and Community Managers. This identifier is used in the By the People site and on loc.gov.
  • AssetStatus – this indicates the status of the asset in the peer review workflow – Not Started, In Progress, Needs Review, or Completed. Dataset assets will always be marked as “Completed.”
  • DownloadURL – this link provides access to the image file for the Asset from which the transcription was created.
  • Transcription – this is the text created by the By the People volunteers, representing the written content of the DownloadURL image and corresponding to the Asset. This field will be blank for assets that volunteers marked “Nothing to transcribe”.
  • Tags – these are all the tags that have been applied to the asset. If there is more than one tag, the tags are delimited by a semicolon and space.
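
Loaded with pandas, each dataset is a table with one row per asset and the columns above. Below is a minimal loading sketch; the CSV file name and the derived TagList and ItemURL columns are illustrative, not part of the datasets:

import pandas as pd

# Load one dataset; the file name here is illustrative
df = pd.read_csv("data/susan-b-anthony-papers.csv")

# Inspect the columns described above
print(df.columns.tolist())

# Split the semicolon-and-space delimited Tags field into lists of tags
df["TagList"] = df["Tags"].fillna("").str.split("; ")

# Build each item's loc.gov URL from its ItemId
df["ItemURL"] = "https://www.loc.gov/item/" + df["ItemId"].astype(str) + "/"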

About the notebooks

This tutorial first cleans and processes transcriptions from the four datasets using Pandas and the spaCy Natural Language Processing library. This code tokenizes the transcriptions, breaking the strings of text into tokens (words) that will be further analyzed. It then identifies the lemma, or root, for each word. For example, the lemma of "voted" is "vote", and the lemma of "women" is "woman". The code next iterates over each token to produce a list of lemmas from the original transcriptions that excludes stop words, punctuation, numbers, and words that volunteers were unable to fully transcribe, which are designated with "?". Stop words are commonly used words, such as "the", "a", or "is".
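
Condensed into a few lines, the cleaning step might look like the sketch below. This is an approximation of the approach just described, not the notebooks' exact code:

import spacy

# Load the large English model that the tutorial installs
nlp = spacy.load("en_core_web_lg")

def clean_lemmas(text):
    doc = nlp(text)  # tokenize the transcription into words and punctuation
    return [
        token.lemma_.lower()
        for token in doc
        if not token.is_stop       # exclude stop words such as "the", "a", "is"
        and not token.is_punct     # exclude punctuation
        and not token.like_num     # exclude numbers
        and "?" not in token.text  # exclude partially transcribed words
    ]

print(clean_lemmas("The women voted for suffrage."))
# Prints: ['woman', 'vote', 'suffrage']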

The tutorial then creates two visualizations from the cleaned data using the Matplotlib and NumPy Python libraries. The first is a combined bar graph showing the five most used words in each of the four datasets. The second is a focused look at the “Speeches” series from the Susan B. Anthony Papers. Using dates from a typed inventory of speeches found in the collection, this code groups Anthony’s speeches by year and plots the usage of the top five words in her speeches over time.
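
For a single dataset, the first visualization reduces to a few lines of Matplotlib. This sketch substitutes a placeholder list of lemmas for the real cleaned transcriptions:

from collections import Counter
import matplotlib.pyplot as plt

# Placeholder lemmas; the notebooks use the cleaned transcription data
lemmas = ["woman", "vote", "suffrage", "woman", "right", "vote", "woman"]

# Count lemma frequencies and keep the five most common
top_five = Counter(lemmas).most_common(5)
words, counts = zip(*top_five)

plt.bar(words, counts)
plt.title("Five most used words")
plt.ylabel("Frequency")
plt.tight_layout()
plt.show()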


Running the notebooks

To run a Jupyter notebook, navigate to the directory that contains the notebook files using cd /path/to/dcm-btp-notebooks, then run the command jupyter notebook. This will launch the Notebook Dashboard in a web browser.

To run these notebooks properly, make sure that the required Python libraries are installed (see Prerequisites, dependencies, and jupyter commands below). The dataset files are already included in this tutorial in the data directory (along with each dataset’s README), which can be seen in the Notebook Dashboard.

An entire notebook can be run by clicking Run in the menu bar. Individual cells can be run by clicking into the cell, then hitting Shift + Enter. The notebooks in this tutorial must be run in order, from 2 to 5.

These notebooks contain optional code that can be run to print results to the notebook. This helps show what the code is doing at each step. These cells have “Optional:” in the title. Add # to the beginning of any lines of code you do not want to run to comment them out.

The outputs from Data-Processing will be saved to the outputs directory, which can be seen in the Notebook Dashboard.


Prerequisites, dependencies, and jupyter commands

This repo contains Jupyter Notebooks and therefore requires jupyter to be installed before launching a Jupyter session.

Prerequisites

  1. If you are using Anaconda Python, jupyter should already be installed.
  2. If you are using a different Python distribution, install jupyter with the following pip command:
pip install notebook

Dependencies

  1. The following libraries are required:
  • pandas *Anaconda Default
  • re *Python Standard Library
  • ast *Python Standard Library
  • matplotlib *Anaconda Default
  • numpy *Anaconda Default
  • collections *Python Standard Library
  • operator *Python Standard Library
  • spacy
  2. If you are using Anaconda Python, all libraries except for spacy should already be installed. Install spacy using the following:
pip install -U pip setuptools wheel
pip install -U spacy
python -m spacy download en_core_web_lg

Note: Do not use pip install --user with Anaconda.

  3. For other Python installations, you will need to install pandas, matplotlib, and numpy, unless you have previously installed these. You can install them with:
pip install pandas matplotlib numpy

Then, install spacy:

pip install -U pip setuptools wheel
pip install -U spacy
python -m spacy download en_core_web_lg

Starting a Jupyter session

  1. Launch an Anaconda Prompt (or whatever shell you use to run Python)
  2. cd into the dcm-btp-notebooks directory
  3. Run the command jupyter notebook
  4. A new tab should open in your browser listing the contents of the dcm-btp-notebooks directory

Create an empty notebook

  1. Once a Jupyter session is running, use the New drop-down and select Python 3
  2. Give the notebook a name
  3. Open the File menu and choose Save and Checkpoint
  4. Start adding content!

Authorship and use

These notebooks were created by Dave Durden and Madeline Goebel, Digital Collection Specialists at the Library of Congress. They are marked with a CC0 1.0 Universal public domain dedication.

All contributions to the By the People application are released into the public domain as they are created. Anyone can use and re-use the datasets in any way they want.


Processing and Visualizing By the People Transcription Data by David Durden and Madeline Goebel is marked with CC0 1.0

