jespersens-cycle-middle-english

Description

This repository contains the source code to generate data from Jespersen's Cycle in Middle English using the second edition of the Penn Parsed Corpus of Middle English (PPCME2).

The queries and code included here are based on work done by Aaron Ecay and Meredith Tamminga. I'd like to thank them for sharing their time and expertise in helping me understand the queries and code. Note that the results generated here are distinct from theirs.

Instructions

Data

Starting in the 12th century we see a change in how sentential negation is expressed. We initially observe purely pre-verbal negation.

Ic ne secge

This is followed by a period of bipartite negation where we observe both pre- and post-verbal negative markers.

I ne seye not

Finally, we observe purely post-verbal negation.

I say not

This change is often referred to as Jespersen's Cycle, following Dahl (1979), due to Jespersen's (1917) observation that:

The history of negative expressions in various languages makes us witness the following curious fluctuation: the original negative adverb is first weakened, then found insufficient and therefore strengthened, generally through some additional word, and this in its turn may be felt as the negative proper and may then in course of time be subject to the same developments as the original word

Code

To run the code either download the files as a ZIP, or clone the repository:

git clone https://github.com/christopherahern/jespersens-cycle-middle-english.git

Change directories to the cloned repository and create a symbolic link named corpus that points to the root directory of your local copy of the PPCME2:

ln -s <location of PPCME2> corpus 

Now run the make script to output the data to data/neg-data.csv:

./make.sh

As a point of reference, make.sh takes less than two minutes to run on a laptop:

time ./make.sh

real	1m24.266s
user	1m48.135s
sys	0m2.013s

Output

The data will be output to data/neg-data.csv with the following columns:

  • exclude : tokens we might want to exclude for various reasons
  • ne : whether ne appears and whether it is contracted
  • not : whether not appears and whether it is before or after the verb
  • clausetype : details about whether the token appears in a matrix or relative clause
  • never.posn : whether never appears and whether it is before or after the verb
  • finite : whether or not the clause is finite
  • id : unique id of the sentence containing the token
  • year : year of the document containing the token
  • document : name of the document containing the token
  • stage : (1) ne..., (2) ne...not, (3) ...not

Note that these are all defined by the queries in coding.c and the script data.R.
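As a rough illustration of how the stage column relates to the presence of ne and not, here is a minimal sketch. It is not the logic in coding.c or data.R, which remain the authoritative definitions; the function name and the boolean inputs are assumptions made for this example, and the real ne and not columns also record contraction and position.

```python
from typing import Optional


def stage(has_ne: bool, has_not: bool) -> Optional[int]:
    """Map the presence of the two negators to a stage of the cycle.

    Hypothetical helper: the boolean inputs are a simplification of the
    ne and not columns described above.
    """
    if has_ne and has_not:
        return 2  # bipartite: ne...not ("I ne seye not")
    if has_ne:
        return 1  # pre-verbal only: ne... ("Ic ne secge")
    if has_not:
        return 3  # post-verbal only: ...not ("I say not")
    return None  # neither negator present: no stage applies
```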

The dates of each document can be found in the description of the corpus and are summarized in data/document-dates.csv.
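Once the data has been generated, the year and stage columns can be used to tally how often each stage occurs over time. The sketch below assumes that stage holds the values 1, 2, or 3 and that year holds a numeric year; rows where either value is missing or non-numeric are skipped. The function name and the hundred-year binning are illustrative choices, not part of the repository.

```python
import csv
from collections import Counter, defaultdict


def stage_counts_by_bin(rows, width=100):
    """Count tokens per stage within fixed-width year bins.

    `rows` is any iterable of dicts with "year" and "stage" keys, such as
    a csv.DictReader. The result maps the first year of each bin to a
    dict of {stage: count}.
    """
    counts = defaultdict(Counter)
    for row in rows:
        try:
            year = int(float(row["year"]))
            stage = int(float(row["stage"]))
        except (KeyError, TypeError, ValueError, OverflowError):
            continue  # skip rows without a usable year or stage
        bin_start = (year // width) * width
        counts[bin_start][stage] += 1
    return {start: dict(c) for start, c in sorted(counts.items())}


# Usage, assuming make.sh has already written the output file:
#
#     with open("data/neg-data.csv", newline="") as f:
#         print(stage_counts_by_bin(csv.DictReader(f)))
```

Depending on the analysis, it may make sense to filter on the exclude column before tallying; its values are defined in data.R and are not assumed here.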

Citation

If you use this repository to generate data, please cite it. More importantly, if you use data generated from the parsed corpus, please cite the corpus:

Kroch, Anthony, and Ann Taylor. 2000. The Penn-Helsinki Parsed Corpus of Middle English (PPCME2). Department of Linguistics, University of Pennsylvania. CD-ROM, second edition, release 4.

It takes a lot of time and effort to build and annotate historical corpora well, so it's important to acknowledge that hard work.

Comments

If you have comments or questions about anything, feel free to email [email protected] or create an issue.

jespersens-cycle-middle-english's Issues

Dating scheme for documents

Here's a proposed method for assigning the date given the corpus description: http://www.ling.upenn.edu/histcorpora/PPCME2-RELEASE-4/

  1. If only a single date is provided, then use that as the document date regardless of the status of the information (e.g. c, a, ?).
  2. If both c and a dates are provided, use the c date.
  3. If there are multiple c dates, then use the one without ?. If both are simply c, then use the earlier date.
  4. If there are multiple a dates, then use the one without ?. If both are simply a, then use the earlier date.
  5. If both a c or a date and a date marked only with ? are provided, use the c or a date.
  6. Adjust any dates according to expert knowledge or detailed examination of the document information.
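Rules 1 to 5 above can be sketched as a single ranking, under some assumptions that are mine rather than the proposal's. Each date is assumed to arrive as a string with an optional leading ?, an optional c or a, and a year (for example "c1225", "a1450", "?c1200"). The rules are also assumed to apply in the order listed, so a c date wins over an a date even when the c date carries a ?. Rule 6 is a manual step and is not automated. All names here are hypothetical.

```python
import re
from typing import List, NamedTuple


class DateNote(NamedTuple):
    uncertain: bool  # True when the date carries a leading "?"
    kind: str        # "c", "a", or "" when neither marker is present
    year: int


_NOTE = re.compile(r"^(\?)?([ca])?(\d{3,4})$")

# Rules 2 and 5: c dates outrank a dates, which outrank unmarked dates.
_KIND_RANK = {"c": 0, "a": 1, "": 2}


def parse_note(text: str) -> DateNote:
    """Parse a string such as "?c1200" into its three components."""
    match = _NOTE.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognised date note: {text!r}")
    return DateNote(match.group(1) is not None,
                    match.group(2) or "",
                    int(match.group(3)))


def assign_year(notes: List[str]) -> int:
    """Pick one year for a document from its date notes."""
    parsed = [parse_note(n) for n in notes]
    if not parsed:
        raise ValueError("no dates provided")
    if len(parsed) == 1:
        return parsed[0].year  # rule 1: a lone date is used as-is
    # Rules 2-5: best kind first, then dates without "?", then earliest.
    best = min(parsed,
               key=lambda d: (_KIND_RANK[d.kind], d.uncertain, d.year))
    return best.year
```

A different reading of the rules, for instance one that discards every ?-marked date before comparing c and a, would only require reordering the elements of the sort key.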
