cltk / cltk Goto Github PK

The Classical Language Toolkit

License: MIT License

Python 96.25% Makefile 0.08% Jupyter Notebook 3.67%

ai greek historical-linguistics latin ling nlp nltk python spacy stanza

cltk's Introduction

The Classical Language Toolkit (CLTK) is a Python library offering natural language processing (NLP) for pre-modern languages.

Installation

For the CLTK's latest version:

$ pip install cltk

For more information, see Installation docs or, to install from source, Development.

Pre-1.0 software remains available on the branch v0.1.x and docs at https://legacy.cltk.org. Install it with pip install "cltk<1.0".

Documentation

Documentation at https://docs.cltk.org.

Citation

When using the CLTK, please cite the following publication, including the DOI:

Johnson, Kyle P., Patrick J. Burns, John Stewart, Todd Cook, Clément Besnier, and William J. B. Mattingly. "The Classical Language Toolkit: An NLP Framework for Pre-Modern Languages." In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: System Demonstrations, pp. 20-29. 2021. 10.18653/v1/2021.acl-demo.3

The complete BibTeX entry:

@inproceedings{johnson-etal-2021-classical,
    title = "The {C}lassical {L}anguage {T}oolkit: {A}n {NLP} Framework for Pre-Modern Languages",
    author = "Johnson, Kyle P.  and
      Burns, Patrick J.  and
      Stewart, John  and
      Cook, Todd  and
      Besnier, Cl{\'e}ment  and
      Mattingly, William J. B.",
    booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: System Demonstrations",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.acl-demo.3",
    doi = "10.18653/v1/2021.acl-demo.3",
    pages = "20--29",
    abstract = "This paper announces version 1.0 of the Classical Language Toolkit (CLTK), an NLP framework for pre-modern languages. The vast majority of NLP, its algorithms and software, is created with assumptions particular to living languages, thus neglecting certain important characteristics of largely non-spoken historical languages. Further, scholars of pre-modern languages often have different goals than those of living-language researchers. To fill this void, the CLTK adapts ideas from several leading NLP frameworks to create a novel software architecture that satisfies the unique needs of pre-modern languages and their researchers. Its centerpiece is a modular processing pipeline that balances the competing demands of algorithmic diversity with pre-configured defaults. The CLTK currently provides pipelines, including models, for almost 20 languages.",
}

License

cltk's People

Contributors

Stargazers

Watchers

Forkers

lukehollis amitshilo eamonnbell marpozzi captainclutcho linearregression theaverageguy nishant23 coderbhupendra hemantpugaliya ashg1910 sameeriitkgp vaibhav4595 manvig chestnut3108 manglakaran f2014063 aakarshsingh neagaze yaknastak rishy oudalab saikrishnar vipul-sharma20 vfelso pavanirv madity aman-ks dlc-mexico ololobus wencanluo rajkumarpalanisamy96 aviash kokmritunjay chaduhl vishalbhalla shaiwilson pr4chimishra talkingavocado vgpprasad91 captainilu andreasgrv ponteineptique xuanhuangyiqi soumyag213 akirato sarthak30 stenskjaer mbevila themichaelbizness swierh soumenganguly jfaville saikswaroop mark-keaton ryderwishart nimitbhardwaj ryanfb ratulghosh akash-pardasani koushik101 michaalbert demfier manu-chroma shekharrajak akhilesh28 satya1612 j-duff nodage ykl7 souravsingh prasastoadi deathcod arkitpatel silentflame thomascothran omargamal8 rrichajalota parag00991 karthik24iyer bigmanstan anurag-198 hallicopter tauqs diehumblex rockstarrrishere brillforward umarbrowser xcitic arpitaagrawal somiyagawa varshit97 priyank-p0 dimen61 pranaydeeps kvaghel1 gaurav1911 ashutoshsaboo srishti-1795 shreyaasridhar

cltk's Issues

Corpus variables

Here's a first attempt at a json list of most of the corpora now within the cltk organization.

The two types below are text, treebank, and training_set. In the near future I hope to add parallel_text. And there should be some kind of undefined option as a catchall.

@smargh is this enough to get you started?

[
   {
      "encoding":"utf-8",
      "languages":[
         "greek"
      ],
      "markup":"tei_xml",
      "name":"greek_corpus_perseus",
      "retrieval ":"remote",
      "type":"text",
      "url":"https://github.com/cltk/greek_treebank_perseus/raw/master/greek_treebank_perseus.tar.gz"
   },
   {
      "encoding":"utf-8",
      "languages":[
         "latin"
      ],
      "markup":"tei_xml",
      "name":"latin_corpus_perseus",
      "retrieval ":"remote",
      "type":"text",
      "url":"https://github.com/cltk/latin_corpus_perseus/raw/master/latin_corpus_perseus.tar.gz"
   },
   {
      "encoding":"utf-8",
      "languages":[
         "latin"
      ],
      "markup":"xml",
      "name":"latin_treebank_perseus",
      "retrieval ":"remote",
      "type":"treebank",
      "url":"https://github.com/cltk/latin_treebank_perseus/raw/master/latin_treebank_perseus.tar.gz"
   },
   {
      "encoding":"utf-8",
      "languages":[
         "greek"
      ],
      "markup":"xml",
      "name":"greek_treebank_perseus",
      "retrieval ":"remote",
      "type":"treebank",
      "url":"https://github.com/cltk/greek_treebank_perseus/blob/master/greek_treebank_perseus.tar.gz"
   },
   {
      "encoding":"utf-8",
      "languages":[
         "greek"
      ],
      "markup":"plaintext",
      "name":"greek_training_set_sentence",
      "retrieval ":"remote",
      "type":"training_set",
      "url":"https://github.com/cltk/greek_training_set_sentence/blob/master/greek.tar.gz"
   },
   {
      "encoding":"utf-8",
      "languages":[
         "latin"
      ],
      "markup":"plaintext",
      "name":"latin_training_set_sentence",
      "retrieval ":"remote",
      "type":"training_set",
      "url":"https://github.com/cltk/latin_training_set_sentence/blob/master/latin.tar.gz"
   },
   {
      "encoding":"utf-8",
      "languages":[
         "latin"
      ],
      "markup":"plaintext",
      "name":"latin_corpus_lacus_curtius",
      "retrieval ":"remote",
      "type":"text",
      "url":"https://github.com/cltk/latin_corpus_lacus_curtius/blob/master/lacus_curtius.tar.gz"
   },
   {
      "encoding":"utf-8",
      "languages":[
         "latin"
      ],
      "markup":"plaintext",
      "name":"latin_corpus_latin_library",
      "retrieval ":"remote",
      "type":"text",
      "url":"https://github.com/cltk/latin_corpus_latin_library/blob/master/latin_library.tar.gz"
   },
   {
      "encoding":"latin-1",
      "languages":[
         "greek"
      ],
      "markup":"tlg_beta_code",
      "name":"tlg",
      "retrieval ":"local",
      "type":"text"
   },
   {
      "encoding":"latin-1",
      "languages":[
         "latin",
         "coptic"
      ],
      "markup":"phi_beta_code",
      "name":"phi5",
      "retrieval ":"local",
      "type":"text"
   },
   {
      "encoding":"latin-1",
      "languages":[
         "greek",
         "latin"
      ],
      "markup":"phi_beta_code",
      "name":"phi7",
      "retrieval ":"local",
      "type":"text"
   }
]

Write TreebankPerseusLatin sub-class

We need a TreebankPerseusLatin sub-class to mirror the PerseusLatin sub-class. It will be a sub-class of the RemoteCorpus class.

All of the great discussion here reminds me that we probably need some structure. What tasks are what people doing when? In general, I think we should have a timeline plan. What needs to be done now? What should be done next? What do we postpone? Then, we can assign project leads. Anyone can work on a task, but one person is responsible for getting it out the door. For example, I have given myself the responsibility of writing the Corpus class and the corpus specific classes for the default corpora. This clearly needs to be done now, as this is the background (having the data and having access to it) for most of the other functionality. We need to clarify what we are doing, who is responsible for what, and what a reasonable timetable is. So, if you have a task that you are passionate about, speak up and take it. Then, we can figure out where it fits in the overall timeline and go from there.

Kyle, what do you see as the two or three things that need to be done NOW? Let's start there.

Read/interpret TLG's *.BIN files

Does anyone know of a way to read the TLG/PHI's *.BIN files? I know that these are CD/DVD files in binary format. But, I don't know how to get any type of even partially usable data from them. Any help or suggestions would be greatly appreciated.

Write SentenceTokensLatin sub-class

This will be a sub-class of RemoteCorpus.

Write TreebankPerseusGreek sub-class

We need a TreebankPerseusGreek sub-class to mirror the PerseusGreek sub-class. It will be a sub-class of the RemoteCorpus class.

Write indices for text corpora

Write test for Latin stemmer.py

Assert expected result for simple stemmer use, as at http://docs.cltk.org/en/latest/classical_latin.html#stemming

Add cltk download to tests

This will have to come after the Corpus class rewrite.

Update travis config: https://github.com/kylepjohnson/cltk/blob/master/.travis.yml

and rm the try/excepts at 65 and 88 of https://github.com/kylepjohnson/cltk/blob/master/cltk/tests/test_cltk.py

Integrate Logeion data

Logeion is the best Greek/Latin dictionary on the web and mobile. It has a large number of the best dictionaries in well-formatted structures and has a lot of frequency and other word statistics. I think we should add this data.

Here is the SourceForge .sqlite files: http://sourceforge.net/projects/logeion/files/, and here's the GitHub repo: https://github.com/logeion/logeion-backend. I don't really know much of anything about servers or such, so getting this setup is currently beyond me. But if we could get a corpus together, I think it would be advantageous.

Compile Lacus Curtius Latin for CLTK

https://github.com/kylepjohnson/corpus_lacus_curtius_latin

Compiling TLG texts

Ok, I'm writing the TLG class right now, and in reading through the code, I was struck by the compiling process as well as the actual output. Obviously the output is not ideal (@ everywhere and other odd characters), and simply dropping all non-ASCII characters is dropping information.

So, I went looking trying to figure out how in the hell to interpret the TLG binary text files, and I found something: http://tlgu.carmen.gr. This is a C program that converts the TLG binary text files to Unicode text files. And it is really good. I've run some tests, and it knows what it's doing. Now, how do we handle this? There are two obvious choices:

force users to install and compile tlgu as a dependency
try to write a Python version of tlgu

The first is obviously a user-facing problem, while the latter is a major pain in one or both of our asses. However, we need to do one of them, because it has what we need.

Compile PROIEL treebank for CLTK

https://github.com/proiel/proiel-treebank

Remove large files from entire git tree

The repository's size was out of control when I started work on it. I think it's close to 1GB all together. I'm using bfg to rm any files larger than 1MB. https://rtyley.github.io/bfg-repo-cleaner/

Any large with lexicon and stem now and elsewhere I am moving to a proper corpus repository.

TLG/PHI indices

So, the newest version of tlg.py has new versions of the index_authors() and index_meta() meta, using tlgu to convert the .DIR files into ASCII text, and then parse those into dicts.

This goes back to an earlier discussion, but I think it's cleaner to have a space for this discussion. Do we generate these indices and then package them with CLTK, so that users download them (although they won't be downloading these corpora themselves), or do we generate them during the compile process and save them to disk? I understand the argument for packaging indices with the corpora that we will host on GitHub ourselves (remote corpora), but for these local corpora, how do we want to handle this?

My two cents, generate on the fly on first run, save to disk, read from file on all subsequent runs.

Write wrapper API for `tlgu`

tlgu is a command line utility for converting TLG and PHI binary files into Unicode files. There is a base class (TLGU) in the tlgu.py file. We need to write a semantic API for the various options. These should be *kwargs passed to the run() method.

So, we need to:

name the various options
refactor the run() method

For reference, here are the options as given on the man page:

OPTIONS

−b :: inserts a form feed and citation information (levels a, b, c, d) on every "book" citation change. By default the program will output line feeds only (see also −p).

−p :: observes paging instructions. By default the program will output line feeds only.

−r :: primarily Roman text (PHI). Some TLG texts, notably doccan1.txt and doccan2.txt are mostly roman texts lacking explicit language change codes. Setting this option will force a change to roman text after each citation block is encountered.

−v :: highest-level reference citation is included before each text line (v-level)

−w :: reference citation is included before each text line (w-level)

−x :: reference citation is included before each text line (x-level)

−y :: reference citation is included before each text line (y-level)

−z :: lowest-level reference citation is included before each text line (z-level).

−Z <custom_citation_format_string> :: an arbitrary combination of citation information is included before each text line; see also -e option e.g. "%A/%B/%x/%y/%z\t" will output the contents of the A, B citation description levels, followed by x, y, z citation reference levels, followed by a TAB character.

−e <custom_blank_citation_string> :: if there is no citation information for a citation level defined with the -Z option above, a single right-hand slash is substituted by default; you may define any string with this option e.g. "-" or "[NONE]" are valid inputs

−B :: inserts blank space (a tab) before each and every line.

−X :: compact format; v, w, x citations are inserted as they change at the beginning of each section.

−Y :: compact format; w, x, y citations are inserted as they change at the beginning of each section.

−N :: no spaces; line ends and hyphens before an ID code are removed while hyphens and spaces before page and column ends are (still) retained.

−C :: citation debug information is output.

−S :: special code debug information is output.

−V :: block processing information is output (verbose).

−W :: each work (book) is output as a separate file in the form output_file-xxx.txt; if an output file is not specified, this option has no effect.

Write LacusCurtius sub-class

We need a LacusCurtius sub-class that mirrors the LatinLibrary sub-class. It will be a sub-class of the RemoteCorpus class.

Compile Index Thomisticus corpus for CLTK

http://www.corpusthomisticum.org/it/index.age

Document type classes

We need to think about what the different document types are and how they might be structured. For example, prose vs poetry. And structure may be work, book, section, sentence, word. What are further complexities here?

If we can get a good object schema for documents, it should make our indexing, schema, and API better.

Move Greek Treebank

Move to CLTK organization.

Move Latin Treebank

Move to CLTK organization.

FYI: I'm rewriting the compiling/importing code

Kyle,

I have started down the path of writing code to convert Perseus XML into structured, formatted plain text. It is a task, so who knows how long it'll take, but along the way, I have dug into the cltk code, primarily the code under /corpus. I have forked this repo and am working on this fork, but it will probably be a while till I get a working Pull Request. Before then, I wanted to let you know what I'm doing and why I think it will help.

I am restructuring the entire code within this scope to be entirely modular. I am writing classes for each corpus, which are all actually sub-classes of a Corpus class, which itself uses a CLTK class. Once finished, there will be no redundant code, each corpus will be individually accessible (not just the import/compile code, but also future code, like convert to structured plain text), and the code should be easier to adapt over time.

Like I said, a Pull Request is probably a ways out, but you can see where I am heading once I push my initial work (and then all subsequent work) to my fork.

stephen

Write POSLatin sub-class

This will be a sub-class of the RemoteCorpus class.

Compile Poeti d’Italia for CLTK

http://www.mqdq.it/mqdq/poetiditalia/indice_autori_alfa.jsp?scelta=AZ&path=metri_opere

Write downloader for `tlgu`

The TLGU class, which wraps the tlgu command line utility, needs a method to automatically download the utility. There is already an initial method to compile the executable from the downloaded C file. I have hosted the files on GitHub here (https://github.com/smargh/tlgu), but we will likely eventually move this under the CLTK organization.

The downloader definitely needs the tlgu.c and tlgu.h files, but do we want all the rest as well? And where do we download them to? To a temp directory that we then delete? Or make it permanent somewhere in the cltk directory tree?

Download API

This is my working thinking for the downloading of corpora (importing and compiling). This simultaneously tries to mirror the better parts of NLTK, as well as address the peculiarities of our situation. Thoughts and suggestions welcome.

CLTK API

`cltk_data/` Directory Structure

cltk_data/
    corpora/
        originals/
            latin/
                {name}/
            greek/
                {name}/
        structured/
            latin/
                text/
                    perseus/
                    lacus_curtius/
                    latin_library/
                    phi5/
                    phi7/
                treebank/
                    perseus/
                training_set/
                    sentence/
            greek/
                text/
                    perseus/
                    tlg/
                    phi7/
                treebank/
                    perseus/
                training_set/
                    sentence/
        plain/
            latin/
                perseus/
                lacus_curtius/
                latin_library/
                phi5/
                phi7/
            greek/
                perseus/
                tlg/
                phi7/
        {
        readable/
            latin/
                perseus/
                lacus_curtius/
                latin_library/
                phi5/
                phi7/
            greek/
                perseus/
                tlg/
                phi7/
        }

Importing Corpora

Importing a corpus will write data to two or possibly three sub-directories in cltk_data/corpora/: originals/, structured/, and possibly plain/. There are 3 types of corpora that the CLTK currently employs: [1] text, [2] treebank, and [3] training set. Treebanks and training sets are only written to originals/ and structured/, while text corpora are written to these as well as plain/.

The content of these three directories should be self-explanatory. The originals/ directory contains a corpus' original data in whatever format it comes in, whether that be a .tar.gz compressed file or a full directory tree. The structured/ directory contains a corpus' fully-structured, transformed data. Finally, the plain/ directory contains a corpus' text files stripped of all structural metadata. Let's look at two examples of text corpora and how their data will look in each directory. First, consider the TLG corpus. The files in originals/ consist of a collection of binary encoded files using Beta Code markup. The files written to structured/ are transformed to utf-8 Unicode files with full structural information included. The files in plain/ contain simply the Greek text without any structural information. Next, let's look at the Perseus Greek corpus. Only a .tar.gz compressed file will reside in originals/ directory. The uncompressed files written to structured/ are in TEI-XML format. The plain/ directory then contains the uncompressed files with all XML tags stripped out.

So, this is what importing a corpus will do, but how do you actually import a corpus? Here's how:

import cltk

# If you want to download all corpora 
cltk.download()

# If you want to download only one corpus
cltk.download('tlg')

# If you want to download a certain set of corpora
cltk.download(['tlg', 'phi5', 'phi7'])

Write corpora sub-classes

This is a new, single Issue for all of the remaining corpora that need their own sub-class:

Updated greek sentence tokenizer for final sigma

Once this is closed, push the update to the cltk.

Question about new project structure

Since we're getting closer to rolling out the new corpus functionality, I should bring up the reorganized file structure that you've introduced. I'll briefly throw out a few points and questions, which I don't mean to sound accusatory (or defensive).

what was wrong with the previous dir structure? from what I have seen, what I had was most common for packed apps.
I liked that the docs were kept out of the other cltk dirs, since totally different software runs it.
Likewise I prefer to keep files like pylintrc, requirements.txt, and .travis.yml isolated from our core python software
does the current structure build correctly with distutils (iepython setup.py sdist install?)
as I have mentioned, I was to keep parity with the NLTK to degree possible. This is in part to lower others' barrier to entry and also because I would like to push some of my linguistic research (and now the corpora too, perhaps) upstream
Is there anything (bad ideas, code) that prevents us from returning the dirs to what I had before?

Unfortunately I won't have time to work through the corpora branch until this Saturday. I know that I will have a more informed perspective once I work with your code. Nevertheless I want to give a heads up that this is the one thing that sticks out to me as needing to be addressed.

As you know, I am happy to talk (and learn) about how else we can organize the application for long-term success.

Write SentenceTokensGreek sub-class

Compile IntraText Latin corpus for CLTK

http://www.intratext.com/Catalogo/Autori/Aut33.HTM

http://www.intratext.com/LAT/

Compile Musisque Deoque for CLTK

http://www.mqdq.it/mqdq/indice_autori_alfa.jsp?scelta=AZ&path=metri_opere

Move Greek sentence tokenizer training set

Move the CLTK organization.

Compile Corpus Grammaticorum Latinorum for CLTK

http://kaali.linguist.jussieu.fr/CGL/text.jsp

Restructure `cltk_data/`

How about scrapping original and compiled idea and, from the root, list the types of data available, which is ~/nltk_data does, eg with corpora, taggers, tokenizers.

Then within corpora, eg:
$ pwd
/Users/kyle/cltk_data/corpora
$ ls -l
latin_library
perseus_greek
phi5
phi7
pos_latin
sentence_tokens_greek
sentence_tokens_latin
tlg
treebank_perseus_greek
treebank_perseus_latin
Then, within any given corpus, we can have any directory structure that makes sense for that corpus. For the latin_library, say, this structure might be very simple (just an index and files) and for others, like the tlg, corpora can be more refined (original, full_structure, semi_structured, plaintext). Your tlgu interface could default to output to this location with these names, and then we also have the option of giving a custom dir name in this same location too.

It probably won't be until this weekend that I can give you a thorough overview of the kinds of data and corpora that we'll need to account for. Though if coded in the way I imagine, this dir structure for ~/cltk_data should decoupled enough that you need not know exactly what each contains.

Finally, I have wondered about other types of data that could go into~/cltk_data. Two thoughts: (a) helper apps like tlgu. Something like ~/cltk_data/helper_apps could be an intuitive and predictable place for the cltk to look. (b) User-generated data. For small datasets, users can of course save data anywhere. But there might be value in making workspaces available. (a) Sounds smart to me (to you too?) but for (b) I am not ready to make a decision.

This is a great idea. One of the things I added to the CLTK class was the ability to set another path for cltk_data/, but under that dir, I do think another schema would be beneficial. I think corpora/, taggers/, tokenizers/, utilities/, and user_data/ should be second level directories.

Question tho: Should

pos_latin
sentence_tokens_greek
sentence_tokens_latin
treebank_perseus_greek
treebank_perseus_latin

be under corpora/ or under either taggers/ or tokenizers/?

Create multiple base classes for corpora

I'm starting to think that there would be some advantages to creating a sort-of matrix of base classes that can be mixed and matched in the corpus specific classes. So, for example, there is the distinction between RemoteCorpus and LocalCorpus, but there is also a distinction between GreekCorpus and LatinCorpus (and then any other languages added to cltk). There is also the distinction between BinaryCorpus (like TLG and PHI) and UnicodeCorpus (like Latin Library).

What do we think of using Python's multiple inheritance capabilities to split functionality into these 6 sub-classes (all themselves sub-classes of Corpus), and then have any new specific corpus class inherit up to three of these sub-classes (as they are in pairs)? For example, the TLG class would be a sub-class of LocalCorpus, GreekCorpus, and BinaryCorpus. I think this would help in organizing the corpora and keeping the code clean. I'm also thinking it will make adding new corpora easier.

However, I am learning on the go about multiple inheritance in Python, so any help and suggestions is appreciated.

Plus, are they any suggestions for other base-classes? I'm aiming to fully abstract the corpora, so that corpus specific classes are basically just wrappers of their respective parent classes. This means that we should try to think of all integral divisions for all possible corpora. So, for example, I haven't really looked into the Treebank data and the POS data. But I imagine that these corpora are of a different type than, say, Perseus Greek. What would this division be? What should we name them?

I think it would be helpful to think out a full API and class structure before finalizing the actual code.

Move Latin Library and write index.json

Move to CLTK organization and add index.json.

Compile Index Thomisticus treebank for CLTK

http://itreebank.marginalia.it/

http://www.corpusthomisticum.org/it/index.age

Add Sphinx documentation

I think we need to have fuller documentation, and I think that Sphinx is as good an option as any. I'm working to add it to my code now.

Move POS tagging training data sets

Move to CLTK organization.

Write TEI XML converter

Now that we are nearly done with TLGU and have our output spectrum for TLG, PHI5, and PHI7, we need to find a way to mirror that as close as possible with Perseus Greek and Perseus Latin, which come in TEI-XML format. Like the TLG, this is a highly structured, machine readable format. We should be able to (at least) generate a plain text version (easy enough, but I want to think through finding the most efficient means possible) and a partially structured version (this is the trickier part). On a somewhat side note, I am currently compiling the full TEI-XML documents by using BeautifulSoup's pretty_print() function, to ensure consistent formatting.

We need to determine what the partially structured version will retain, and how to convert the TEI to this format.

Thoughts? Suggestions?

Investigate new corpora

I’ve been reading B. McGillivray’s »Methods in Latin Computation Linguistics« (Brill 2014) which has a good summary of Latin corpora, lexicons, tools, etc. in Ch. 2 “Computational Resources and Tools for Latin.” What I should do is write a review/summary of the book when I finish—for now, I’ll list a few corpora that she mentions in the chapter that might be worthwhile to consider for the future. I haven’t looked into the availability, rights status, or potential cost of these resources—just made some notes on her resources overview.

Index Thomisticus http://www.corpusthomisticum.org/it/index.age

Index Thomisticus Treebank http://itreebank.marginalia.it/

Musisque Deoque http://www.mqdq.it/mqdq/indice_autori_alfa.jsp?scelta=AZ&path=metri_opere

Poeti d’Italia… http://www.mqdq.it/mqdq/poetiditalia/indice_autori_alfa.jsp?scelta=AZ&path=metri_opere

Corpus Grammaticorum Latinorum http://kaali.linguist.jussieu.fr/CGL/text.jsp

Her own LatinISE Sketch Engine corpus (documentation http://www.sketchengine.co.uk/documentation/wiki/Corpus/Latin), built in turn from…

LacusCurtius

IntraText http://www.intratext.com/latina/

MusisqueDeoque

…and trained against the Perseus Treebank, the IT-TB, and the PROIEL Treebank https://github.com/proiel/proiel-treebank

The IT and IT-TB are particularly important to McGillivray’s computational/statistical methods (cf. pp. 11-17, esp. 15) as the size of the IT greatly increases the total amount of data she has to work with. Her first study in the book, for example, is a verbal valency lexicon of Latin built from the Perseus Treebank and IT-TB.

By the way, it’s been amazing to see the energy (and progress!) around the CLTK in the past week. Great work.

—Patrick (@diyclassics)=

Python3.4 issue

So, I am trying to get cltk up and running on my new Yosemite machine, and I have hit a problem. I have Python 2.x and 3.x installed via Homebrew. I used pyvenv to create a virtual environment and then pip installed cltk into that directory. All is good up to now. But, when I try to create a script to do the compiling (as in the docs), I discover that Sublime Text (my editor) is importing all of the modules from my Python 2.7 installation. I can see from the docs that you activate the venv from Terminal, and then you can run the code from the shell (I will do that next), but I want to develop alongside cltk, so I want to be able to write scripts in Sublime. How do I have Python properly setup in that situation to act properly?

Let me know if you need any other info to help me, and thanks in advance for your help.

stephen

Python 2.7 compatibility

Are you using something that makes cltk impossible to work with 2.7? If not, we should make is cross-compatible.

Move Latin sentence tokenizer training set

Move to CLTK organization repository.

Compile Lacus Curtius Greek for CLTK

http://penelope.uchicago.edu/Thayer/E/Roman/Texts/home.html

Write PHI7 sub-class

We need a PHI7 sub-class that mirrors the TLG and PHI5 sub-classes. This will be a sub-class of the LocalCorpus class.

How to generate lemmata_list.py?

That would be great to get everything to regenerate /cltk/stem/classical_latin/lemmata_list.py (instructions, script, corpus...).

I'd like to tweak it to add word frequency to create, for example, predictive IME keyboard in Latin, or sort lemma candidates by their frequency.

Convert TLG DOCCAN2 metadata keys

I am working on extracting as much info from the TLG index files as possible. While I still can't get to the *.BIN files (cf. Issue 31), I can read the *.DIR and *.TXT files with tlgu, and DOCCAN2.DIR has a robust index of information for the TLG texts. The problem is that the key:value pairs are a bit odd. The keys are all three letter abbreviations, and I haven't translated them all to clear, full keys. If anyone could help me fill this mapping out, I would greatly appreciate it:

METADATA_KEYS = {
    'key': 'key',
    'nam': 'name',
    'epi': 'genre',
    'geo': 'geographical_adj',
    'dat': 'date',
    'vid': 'cf',            # ??
    'wrk': 'work',
    'cla': 'classification',
    'xmt': 'format',        # ??
    'typ': 'type',
    'wct': 'word_count',
    'cit': 'citation_structure',
    'tit': 'title',
    'pub': 'publisher',
    'pla': 'publication_place',
    'pyr': 'publication_year',
    'ryr': 'republication_year',
    'rpl': 'republication_place',
    'rpu': 'republication_publisher',
    'pag': 'pages',
    'edr': 'editor',
    'brk': 'broken',        # ??
    'ser': '',              # ??
    'srt': 'short_title',   # ??
    'crf': '',              # ??
    'syn': 'synonym',
    'gen': 'genre',         # relation to `epi`??
    'ref': 'reference'      # ??
}

cltk / cltk Goto Github PK

cltk's Introduction

Installation

Documentation

Citation

License

cltk's People

Contributors

Stargazers

Watchers

Forkers

cltk's Issues

CLTK API

cltk_data/ Directory Structure

Importing Corpora

Recommend Projects

Recommend Topics

Recommend Org

`cltk_data/` Directory Structure