
ohtap's Introduction

Oral History Text Analysis Project (OHTAP)

Stanford University

Estelle Freedman, Natalie Marine-Street

w/ Katie McDonough

Research Assistants:

  • Nick Gardner (Fall 2020 - )
  • Yibing Du (Fall 2020 - )
  • Jade Lintott (Fall 2020 - )
  • Anika Asthana (Summer 2020 - )
  • Natalie Sada (Summer 2020)
  • Maddie Street (Summer 2020)
  • Jenny Hong (2019-20)
  • Preston Carlson (2019-20)
  • Hilary Sun (2018-19)
  • Cheng-Hau Kee (Summer 2018)

Winnow Subcorpus Creation Tool

Version 1

Version 1 can only handle one corpus and one keywords file at a time.

Run python subcorpora_tool/find_subcorpora_v1.py -d corpus_directory_name -w keywords_file.txt -m metadata_file.csv.

You need to pass in three arguments:

  • -d: The directory where the corpus files are located.
  • -w: The text file that contains the list of keywords.
  • -m: The CSV file where the metadata for all of the corpora is located. This file should be in the format noted in the metadata document in the Drive folder.
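
For example, a hypothetical invocation (the directory and file names below are placeholders):

python subcorpora_tool/find_subcorpora_v1.py -d interviews -w keywords.txt -m metadata.csv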

Version 2

Run python find_subcorpora_v2.py from within the directory subcorpora_tool. Make sure that all text files are UTF-8 encoded to avoid encoding errors.

You need to pass in three arguments:

  • -d: The directory where the corpus files are located. The folder name should be the name of the corpus. If no folder is specified, the default is corpus.
  • -w: The text file for the keywords. These are currently assumed to be all lowercase. If no file is specified, the default is keywords.txt.
  • -m: The CSV file where the metadata for all of the corpora is located. This file should be in the format noted in the metadata document in the Drive folder. We assume the CSV has a header and that its encoding is UTF-8. If no file is specified, the default is metadata.csv.
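
A minimal sketch of how these arguments and defaults might be wired up with argparse (the actual script's option handling may differ):

import argparse

parser = argparse.ArgumentParser(description="Winnow subcorpus creation tool (v2)")
parser.add_argument("-d", default="corpus", help="directory containing the corpus files")
parser.add_argument("-w", default="keywords.txt", help="text file listing keywords, one per line")
parser.add_argument("-m", default="metadata.csv", help="UTF-8 CSV of corpus metadata with a header row")
args = parser.parse_args()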

For the keywords file, the keywords can be specified using the following rules:

  • The * symbol is a wildcard: it matches any number of letters.
  • Keywords are separated by newlines (one keyword per line). Keywords to be included come first; keywords to be excluded follow, separated from the included keywords by a single blank line. An example is as follows:
rape
rap*

rapport
rapping

In this example, rape and rap* are included, and rapport and rapping are excluded.
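
A sketch of one way to parse this file format and expand the * wildcard into a regular expression (the tool's actual implementation may differ):

import re

def load_keywords(path):
    # The first blank line separates included keywords from excluded ones.
    with open(path, encoding="utf-8") as f:
        included_block, _, excluded_block = f.read().partition("\n\n")
    included = [w for w in included_block.splitlines() if w.strip()]
    excluded = [w for w in excluded_block.splitlines() if w.strip()]
    return included, excluded

def keyword_pattern(keyword):
    # "*" matches any number of letters, e.g. "rap*" matches "rape" and "rapping".
    return re.compile(re.escape(keyword).replace(r"\*", "[a-z]*") + "$")

def is_hit(word, included, excluded):
    # Simplification: excluded terms are treated as literal words.
    if word in excluded:
        return False
    return any(keyword_pattern(k).match(word) for k in included)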

The following files and folders will be output:

  • corpus_keywords_report.html: Contains the basic report about the keywords and corpus.
  • corpus_keywords_file_top_words.csv: Contains the top words of each file in the corpus in the following format: filename, word, count.
  • corpus_keywords_keyword_collocations.csv: Contains the keyword collocations and their counts from all the files in the following format: word_1, word_2, count. If there are no collocations, there is no file.
  • corpus_keywords_keyword_collocations_formats.csv: Contains the keyword collocations and their formats. If there are no collocations, there is no file.
  • corpus_keywords_multiple_keywords.csv: Contains how many times keywords appeared together in the same document in the following format: word_1, word_2, count. If none is output, then there are no multiple keywords.
  • corpus_keywords_keyword_counts.csv: Contains the keyword counts across all files in the following format: keyword, count.
  • corpus_keywords_keyword_counts_by_file.csv: Contains the keyword counts in each file in the following format: filename, keyword, count.
  • corpus_keywords_keyword_formats.csv: Contains all the different formats of each keyword. If none is output, then there are no keywords at all.
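
As an illustration of the CSV layouts above, a sketch that writes the per-file keyword counts (assuming counts are held in a dict keyed by (filename, keyword); the tool's internals may differ):

import csv

def write_counts_by_file(counts, out_path="corpus_keywords_keyword_counts_by_file.csv"):
    # counts: dict mapping (filename, keyword) -> count
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["filename", "keyword", "count"])
        for (filename, keyword), count in sorted(counts.items()):
            writer.writerow([filename, keyword, count])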

Encoding files

For now, we have different scripts to encode different collections. I made some notes on problems/patterns with the files.

We currently separate the transcripts as follows (the template is not complete yet; it captures just enough to separate basic information). We still need to do the following:

  • Include page numbers for the collections that don't have them.
  • Figure out what to do with additional information that isn't currently covered in encoding.
  • Figure out what to do with images (WOL).
<!DOCTYPE TEI.2>
<TEI.2>
	<teiHeader type = "[Collection Name]">
		<fileDesc>
			<titleStmt>
				<title>[Title of Interview]</title>
				<author>
					<name id = "[Interviewee1 Initials]" reg = "[Interviewee Name]" type = "interviewee">[Interviewee Name]</name>, interviewee
						...
				</author>
				<respStmt>
					<resp>Interview conducted by </resp><name id = "[Interviewer1 Initials]" reg = "[Interviewer1 Name]" type = "interviewer">[Interviewer1 Name]</name>
				</respStmt>
				<respStmt>
					<resp>Text encoded by </resp><name id = "[Encoder Initials]">[Encoder Name]</name>
				</respStmt>
			</titleStmt>
			<sourceDesc>
				<biblFull>
					<titleStmt>
						<title>[Title of Interview]</title>
						<author>[Interviewees]</author>
					</titleStmt>
					<extent></extent>
					<publicationStmt>
						<publisher>[Institution]</publisher>
						<pubPlace></pubPlace>
						<date>[Interview Date]</date>
						<authority/>
					</publicationStmt>
				</biblFull>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<langUsage>
				<language id = "[Language Abbr.]">[Language]</language>
			</langUsage>
		</profileDesc>
	</teiHeader>
	<text>
		<body>
			<div1 type = "about_interview">
				<head>[Interview Boilerplate]</head>
				<list type = "simple">
					<item>Interviewer:<name id = "spk1" key = "[Interviewer1 Initials]" reg = "[Interviewer1 Name]" type = "interviewer">[Interviewer1 Name]</name></item>
					...
					<item>Subject:<name id = "spk?" key = "[Interviewee1 Initials]" reg = "[Interviewee1 Name]" type = "interviewee">[Interviewee1 Name]</name></item>
					...
					<item>Date:<date>[Interview Date]</date></item>
				</list>
			</div1>
			<div2>
				<pb id = "p[Page Number]" n = "[Page Number]" />
				<sp who = "spk1">
					<speaker n = "1">...</speaker>
				</sp>
			</div2>
		</body>
	</text>
</TEI.2>
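
The encoding scripts fill templates like this one with values from the metadata. A minimal sketch of the idea (the template fragment and field names here are illustrative, not the scripts' actual variables):

TITLE_FRAGMENT = (
    '<title>{title}</title>\n'
    '<author>\n'
    '    <name id="{initials}" reg="{name}" type="interviewee">{name}</name>, interviewee\n'
    '</author>'
)

def fill_title_stmt(title, name, initials):
    # Substitute one interviewee's metadata into the template fragment.
    return TITLE_FRAGMENT.format(title=title, name=name, initials=initials)

print(fill_title_stmt("An Oral History", "Jane Doe", "JD"))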

Phase I Oral History Collections

Black Women Oral History Project (BWOH)

Brown Women Speak: Pembroke Center Transcripts (BWSP)

Rutgers Oral History Archives (ROHA)

Rosie the Riveter WWII American Homefront Project - Bancroft (RTRB)

In order to run, you need:

  • the path for a Metadata folder with the following files:
      • "Interviews.csv": a CSV version of the Interviews metadata sheet
      • "Interviewees.csv": a CSV version of the Interviewees metadata sheet
      • "Collections.csv": a CSV version of the Collections metadata sheet
  • the path for a folder containing all of the transcripts
  • a "name map" of the appropriate format (it should be in the GitHub repo; it can also be generated using map_names_RTRB.py)
  • the pandas library (included with the Anaconda distribution)

Run with py encode_RTRB.py [transcript folder path] [metadata folder path] [name map path]
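
A sketch of loading the three metadata sheets with pandas, given the positional arguments above (the actual script may read them differently):

import sys

import pandas as pd

transcript_dir, metadata_dir, name_map_path = sys.argv[1:4]

# Each sheet is a CSV export of the corresponding metadata spreadsheet.
interviews = pd.read_csv(f"{metadata_dir}/Interviews.csv")
interviewees = pd.read_csv(f"{metadata_dir}/Interviewees.csv")
collections = pd.read_csv(f"{metadata_dir}/Collections.csv")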

Smith College Alumnae Oral History Project (SCAP)

Stanford Nurse Alumni Interviews (SNAI)

Stanford Historical Society Alumni Interviews (SHSA)

Stanford Historical Society Faculty and Staff and Misc Interviews (SHSF)

Phase II

Oklahoma Centennial Farm Families (OCFF)

In order to run, you need:

  • the path for a Metadata folder with the following files:
      • "Interviews.csv": a CSV version of the Interviews metadata sheet
      • "Interviewees.csv": a CSV version of the Interviewees metadata sheet
      • "Collections.csv": a CSV version of the Collections metadata sheet
  • the path for a folder containing all of the transcripts
  • a "name map" of the appropriate format (it should be in the GitHub repo; it can also be generated using map_names_OFCC.py)
  • the pandas library (included with the Anaconda distribution)

Run with py encode_OCFF.py [transcript folder path] [metadata folder path] [name map path]

NOTE: In order for this script to work, all of the transcripts must be preprocessed manually by adding the following "tags" to each transcript:

  • <<BOILERPLATE START>>, <<BOILERPLATE END>>
  • <<INTRODUCTION START>>, <<INTRODUCTION END>>
  • <<INTERVIEW START>>, <<INTERVIEW END>>
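
A sketch of how these tags might be used to slice a preprocessed transcript (a simplification; encode_OCFF.py may do this differently):

def extract_section(text, name):
    # Pull out the text between <<NAME START>> and <<NAME END>>.
    start_tag = f"<<{name} START>>"
    end_tag = f"<<{name} END>>"
    start = text.index(start_tag) + len(start_tag)
    end = text.index(end_tag)
    return text[start:end].strip()

# Hypothetical transcript path.
with open("transcript.txt", encoding="utf-8") as f:
    transcript = f.read()

boilerplate = extract_section(transcript, "BOILERPLATE")
introduction = extract_section(transcript, "INTRODUCTION")
interview = extract_section(transcript, "INTERVIEW")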

Oklahoma One Hundred Year Life Collection (OOHYLC)

O-STATE Stories (OSS)

Run py encoding_tool/encode_OSS.py. You need to have metadata.csv and oss_export.xml in the folder encoding_tool.

Challenges:

  • There are multiple interviewees, multiple dates, and multiple interviewers for the interviews.
  • The transcripts are nicely formatted as XML, but the content is incorrect: speakers are not attributed correctly to their own lines. It seems the transcripts were extracted with a PDF reader that misread the layout.
  • Another special note about this one: it is really hard to distinguish the boilerplate from the interview text. I first found the most common ways the interviewers started the interview ("I am [interviewer name]", etc.; these are listed under boilerplate_sep in the code) and used those, then looked at each remaining file individually. See the sketch below.
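
A sketch of that heuristic: search for common interview-opening phrases and cut everything before the earliest match. The patterns below are illustrative; the script's actual list lives in boilerplate_sep.

import re

# Illustrative opening patterns; the real boilerplate_sep list may differ.
OPENINGS = [
    r"I am [A-Z][a-z]+",
    r"My name is [A-Z][a-z]+",
    r"This is an interview with",
]

def strip_boilerplate(text):
    # Keep the text from the earliest opening phrase onward, if any match.
    starts = [m.start() for p in OPENINGS for m in [re.search(p, text)] if m]
    return text[min(starts):] if starts else text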

TODO:

  • Separate speakers. For now, I just put it all into the <div2>.
  • Add in the images and other artifacts. For now, I just have empty pages.

Dust, Drought and Dreams Gone Dry: Oklahoma Women and the Dust Bowl (OWDB)

Inductees of the Oklahoma Women's Hall of Fame Oral History Project (OWHF)

Smith College AARJ (SCAARJ)

Run py encoding_tool/encode_SCAARJ.py. You need to have metadata.csv and a folder SCAARJ containing the .txt files within it (download from the Drive). Make sure that the .txt files are encoded in utf-8.

This one was definitely one of the cleaner transcript sets to work with. It had one interviewer, one interviewee per transcript, and each transcript was written out the same way.

Challenges:

  • .docx files, the original format of the interviews, are difficult to work with in Python. I just used the .txt files that I had already copied and pasted for this code.
  • I handled the GOOGLE VOICE speaker manually for the Nguyen, Tu-Uyen transcript: I labeled GOOGLE VOICE as speaker 3 and designated it as an interviewer.

TODO:

  • Add in page numbers...which has to be done manually unfortunately.

Smith College Activist Life (SCAL)

NEED TO COMPLETE!

Challenges:

  • Not all of the interviews are formatted the same way, but these can be handled manually since there aren't many interviews in this collection anyway.

TODO:

  • Add in page numbers...which has to be done manually unfortunately.

Smith College Voices of Feminism (SCVF)

Spotlighting Oklahoma Oral History Project (SOOH)

Run py encoding_tool/encode_SOOH.py. You need to have metadata.csv and sok_export.xml in the folder encoding_tool.

Challenges:

  • There are multiple interviewees.
  • There was a bug in one of the transcripts: the creator tag for Van Deman, Jim does not contain his full name, so I added it manually.
  • For interviewer Julie Pearson Little Thunder, stray hyphens appear in her name in the transcriptions.
  • Some do not have a full interview (specifically, Steinle, Alice).
  • I had to do a lot of manual checks to separate the boilerplate and to write the regexes.

TODO:

  • Separate speakers. For now, I just put it all into the <div2>.
  • Take out End of Interview. For now, it doesn't affect the results.

UNC The Long Civil Rights Movement: Gender and Sexuality (UNCGAS)

UNC Southern Women (UNCSW)

UNC The Long Civil Rights Movement: The Women's Movement in the South (UNCTWMS)

Women of the Oklahoma Legislature (WOL)

Run py encoding_tool/encode_WOL.py. You need to have metadata.csv and wol_export.xml in the folder encoding_tool.

Challenges:

  • The transcripts are nicely formatted as XML, but the content is incorrect: speakers are not attributed correctly to their own lines. It seems the transcripts were extracted with a PDF reader that misread the layout.
  • Another special note about this one: it is really hard to distinguish the boilerplate from the interview text. I first found the most common ways the interviewers started the interview ("I am [interviewer name]", etc.; these are listed under boilerplate_sep in the code) and used those, then looked at each remaining file individually (see the sketch in the OSS section above).

TODO:

  • Separate speakers. For now, I just put it all into the <div2>.
  • Add in the images and other artifacts. For now, I just have empty pages.

Miscellaneous Pre-processing Scripts

Helping to categorize occupations.

The script miscellaneous_scripts/output_top_occupations.py reads in metadata.csv (in the same folder) and outputs the top occupations listed under Past Occupations and Current Occupation in our metadata spreadsheet. This helps with grouping them into job categories. It outputs an occupations.csv file that lists all of the occupations by how often they appear.
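
A sketch of the counting step, assuming the Past Occupations and Current Occupation columns hold comma-separated values (the real script may split and normalize differently):

import pandas as pd

df = pd.read_csv("metadata.csv")

# Pool both occupation columns, split multi-valued cells, and count.
occupations = (
    pd.concat([df["Past Occupations"], df["Current Occupation"]])
    .dropna()
    .str.split(",")
    .explode()
    .str.strip()
    .value_counts()
)
occupations.to_csv("occupations.csv", header=["count"])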

Helping to isolate the transcript

RTRB

Run miscellaneous_scripts/separating_interview/separate_RTRB.py with a folder RTRB of .txt files inside the separating_interview folder.

ROHA

Run miscellaneous_scripts/separating_interview/separate_ROHA.py with a folder ROHA of .txt files inside the separating_interview folder.

SOOH

Run miscellaneous_scripts/separating_interview/separate_SOOH.py with the sok_export.xml file within the separating_interview folder.

BWOH

Run miscellaneous_scripts/separating_interview/separate_BWOH.py with a folder BWOH of .txt files inside the separating_interview folder. Some had to be checked manually.

SCAP

Run miscellaneous_scripts/separating_interview/separate_SCAP.py with a folder SCAP of .txt files inside the separating_interview folder. Some had to be checked manually.

SCVF

Run miscellaneous_scripts/separating_interview/separate_SCVF.py with a folder SCVF of .txt files inside the separating_interview folder. Many of the interviews were blank, so they had no transcripts.

ohtap's People

Contributors

anika-asthana, hilarysun95, hsun083, jadelintott, jennyhong, kmcdono2, nickdgardner, njmarine, percystreet, pjames27, yibing-du


ohtap's Issues

Research question 2

Summary

"Based on searches using pre-established keyword sets, how and when do women talk about issues of sexual assault / harassment / abuse in oral histories? Capture term frequency and relative term frequency overall and per collection?"

Actions

TODO

Related Issues and Pull Requests

Updates

viz label guidelines

  • collection name (shorthand abbr)

  • variables being examined

  • date the chart was created

  • be sure to label axes

  • include a markdown cell (above or below) to serve as a caption. Be very descriptive and explain anything that text, color, or other visual signals are meant to communicate.

Winnow Update

Summary

Need to update Winnow so that it works with our new metadata set-up.
Will update both here and in the README as more detail emerges.

Actions

  • Set up meeting with Natalie
  • Do detailed look over Winnow code
  • Add Natalie Sada to this ticket once her github is added

review RQ1-3 code to date

Summary

Ok this is a more specific set of instructions for setting up the code review process.

Each RA should complete these steps independently for their own code. @Yibing-Du @jadelintott @anika-asthana (as appropriate)

If something doesn't make sense or you need more detailed instructions, just let me know!

@ebf77 @njmarine

Actions

  1. fix file names per the github basics tips -> #34
  2. create a branch for your RQ (see tips in #34)
  3. make a back up of the code/other files you plan to move from one branch to the other
  4. move your code and other files related to the RQ into the new branch
  5. remove them from the master branch
  6. set up pull request from your new branch
  7. assign me (@kmcdono2) as the reviewer
  8. your pull request will appear in the project board under the 'To Review' column. You'll get a notification when I begin commenting on the code.

Related Issues and Pull Requests

Updates

  • DATE OF UPDATE:

review existing code in OHTAP repo


Summary

Ticket for organizing intro to existing code for 3 new research assistants.

Actions

Jade

Nick

Yibing

Related Issues and Pull Requests

Updates

  • DATE OF UPDATE:

download NVivo

  • Nick
  • Yibing
  • Jade
    (This is Jade. I'm trying to download it, but it keeps saying the code I got from Stanford isn't valid. Ideas?)

transition to app for running tools

  1. create UI for running subcorpus tool
  2. step for signaling false hits
  3. ability to modify false hits
  4. storing false and validated hits as JSON objects (one possible shape is sketched after this list)
  5. preserving stats from all phrases (raw, adjusted 1, adjusted 2, etc)
  6. integrating metadata
  7. creating options for viz
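
One possible shape for those stored hits, sketched in Python (all field names here are hypothetical; the app's schema is still to be designed):

import json

# Hypothetical record for one keyword hit flagged during review.
hit = {
    "file": "interview_001.txt",
    "keyword": "rap*",
    "match": "rapping",
    "context": "...the sentence surrounding the match...",
    "status": "false_hit",  # or "validated"
}
print(json.dumps(hit, indent=2))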

review and finish Github setup

  • update weekly milestones (I've only made them for weeks 1-3) --> https://github.com/ohtap/ohtap/milestones
  • move Maddie's code into github @kmcdono2 will do this with Maddie
  • add labels if needed (play this by ear, but you might want labels for "RQ1", "RQ2", and "RQ3" for each of the research questions)
  • create umbrella ticket for each RQ to get started
  • create tickets for preprocessing tasks (and for Nick, the annotation tool eval task)

look into exporting NVivo annotations for ML

Summary

Actions

  • see what data formats NVivo annotations can be exported as
  • search for previous work using NVivo annotations in ML

Related Issues and Pull Requests

Updates

  • DATE OF UPDATE:

practice weekly meeting style

There are a few goals for the weekly meetings: to build community, to ensure progress and resolve obstacles, and to be a place to discuss new ideas.

For starters:

  • before the meeting, be sure to update your tickets with any progress/blockers!!
  • starting next week, practice working through the project board at the beginning of the meeting
    1. review "done" tickets
    2. review any remaining "in progress"
    3. move "to do" tickets into "in progress"
  • use any remaining time to have more in-depth conversations
  • invite Katie/Natalie/Estelle to answer questions as needed
  • for help/ideas re: collaborative communication, check out: https://the-turing-way.netlify.app/communication/os-comms.html

Metadata Validation

Summary

Going through the metadata sheet and making everything consistent

Actions

  • Talk to Natalie
  • Washington DC problem
  • AD vs AC columns switched
  • Fill-in country
  • Ask about gaps in location data
  • Geocoding

Research Question #3

Summary

Research Question #3: Does the rate of speech (and the extent of speech) about these topics differ by demographic criteria (e.g., race, birth year/cohort) or by interview year? Map change over time in rate of speech using metadata on age, race, and class, potentially using tf-idf of bi-grams by speaker/year (or another metadata filter). After NVivo coding for event time, ask whether there are historical periods with more events mentioned. We would have to adjust statistically for the distribution of the corpus (the group that was alive at the time).
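
A sketch of the bi-gram tf-idf idea with scikit-learn, as a starting point rather than settled methodology (grouping documents by speaker/year would happen before this step):

from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical corpus: one string per transcript (or per speaker/year group).
docs = ["first transcript text ...", "second transcript text ..."]

# ngram_range=(2, 2) restricts the features to bi-grams.
vectorizer = TfidfVectorizer(ngram_range=(2, 2), lowercase=True)
tfidf = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out()[:10])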

Actions

  • Testing whether I can post in this repository

Related Issues and Pull Requests

Updates

  • DATE OF UPDATE:

upload code for encoding transcripts

Summary

Ticket to arrange for the transfer of @butter4fish's transcript-encoding code into the repo.

Actions

  • assess where @hilarysun95 put the old code
  • decide whether to make a new repo for encoding work
  • introduce Yibing to this work

Related Issues and Pull Requests

Updates

  • DATE OF UPDATE:
