Comments (7)
Stephen, in case it helps: I wrote a script for converting Perseus XML into plain text for my APA 2014 alliteration poster. The XML files, though pretty much all orderly TEI/XML, are not consistent, so there are some peculiarities to deal with for different authors/works. The script also includes a specific workaround for handling section breaks (which was what the alliteration study needed). But you might find some of this useful, so here's the code: https://github.com/diyclassics/Alliteration-in-Latin-Literature/blob/master/code/perseusPreprocess.py
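The general shape of such a TEI-to-plain-text pass can be sketched with lxml. This is a minimal illustration, not the linked script; the sample document below is invented:

```python
from lxml import etree

def tei_to_plaintext(xml_string):
    """Parse a TEI/XML string and return its concatenated text content."""
    root = etree.fromstring(xml_string.encode("utf-8"))
    # itertext() walks every text node in document order, so all
    # markup (divs, milestones, apparatus, etc.) is simply dropped
    text = " ".join(t.strip() for t in root.itertext() if t.strip())
    return text

sample = "<TEI><text><body><p>arma virumque cano</p></body></text></TEI>"
print(tei_to_plaintext(sample))  # arma virumque cano
```

A real preprocessor would of course need the per-author special cases described above; this only shows the core tag-stripping step.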
Best,
PJB
@diyclassics
On Oct 4, 2014, at 12:49 PM, Stephen Margheim [email protected] wrote:
Kyle,
I have started down the path of writing code to convert Perseus XML into structured, formatted plain text. It is a task, so who knows how long it'll take, but along the way, I have dug into the cltk code, primarily the code under /corpus. I have forked this repo and am working on this fork, but it will probably be a while till I get a working Pull Request. Before then, I wanted to let you know what I'm doing and why I think it will help.
I am restructuring the entire code within this scope to be entirely modular. I am writing classes for each corpus, which are all actually sub-classes of a Corpus class, which itself uses a CLTK class. Once finished, there will be no redundant code, each corpus will be individually accessible (not just the import/compile code, but also future code, like convert to structured plain text), and the code should be easier to adapt over time.
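The structure described above might look roughly like the following. All names, paths, and methods here are illustrative assumptions, not code from the actual fork:

```python
import os

class Corpus:
    """Shared logic for all corpora: data paths, download, compile, etc."""
    def __init__(self, name, data_root="~/cltk_data"):
        self.name = name
        self.data_root = os.path.expanduser(data_root)

    @property
    def local_path(self):
        # Every corpus lives under the shared data directory
        return os.path.join(self.data_root, self.name)

class PerseusGreek(Corpus):
    """Corpus-specific behavior lives in the subclass."""
    def __init__(self):
        super().__init__("perseus_greek")

corpus = PerseusGreek()
print(corpus.local_path)  # e.g. /home/user/cltk_data/perseus_greek
```

The point is that path handling and other shared plumbing is written once in `Corpus`, while each corpus subclass only carries what is unique to it.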
Like I said, a Pull Request is probably a ways out, but you can see where I am heading once I push my initial work (and then all subsequent work) to my fork.
stephen
from cltk.
Wow, you guys are awesome.
@smargh You have correctly identified some very redundant code. Your cleanup of this will be a terrific help. That module grew organically as I needed to add access to new corpora, and it is becoming hard to manage. Two tips: (1) Make sure that the corpus importer will be able to grow with other languages. For example, consider some kind of logic or class or argument to separate the downloading of, say, Hebrew from Greek from Sanskrit. I am 100% open to how this gets done. (2) If you think this revision will become an overwhelming task, try breaking it into two parts. In this case, I see two discrete tasks: (i) improving my spaghetti code for downloading corpora into `~/cltk_data`, and (ii) modularizing text manipulation (e.g., XML parsing, code cleanup, Beta Code transliteration) of downloaded data. Based on my experience, I suspect that by first separating the downloading from the parsing, the text-processing code will be much easier to modularize. With TEI so popular these days, the latter would be an incredible boon!
@diyclassics Thanks for sharing your Perseus XML parsing. This will surely come in handy sooner than later.
@smargh + @diyclassics An update from my end: I too have been struggling with XML lately. In my case, I have been parsing the Perseus treebank data for the purpose of making a POS training set & automated tagger. I have some early hacking here for Greek: https://github.com/kylepjohnson/treebank_perseus_greek What I have done is make a POS training set (`pos_training_set.txt`) with `make_pos_training_set.py` (in which you'll see my use of lxml). You can then generate your own machine-learning tagger and tag untagged text. (Note: this stuff is covered in chapters 3 & 4 of the "Python Text Processing with NLTK 2.0 Cookbook".) You can follow along with the recipe I give in the README.
So far I have only used the `UnigramTagger()`, but it seems to work well on texts that are part of the original training set (which is a good sign). I have yet to test it thoroughly, however. I'm going to do Latin tonight.
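The training step described above can be sketched with NLTK's `UnigramTagger`. The toy training sentences below are invented; the real training set comes from the Perseus treebank:

```python
from nltk.tag import UnigramTagger

# Each training sentence is a list of (token, POS-tag) pairs, as one
# would extract from the treebank. These toy sentences are invented.
train_sents = [
    [("arma", "N"), ("virumque", "N"), ("cano", "V")],
    [("cano", "V"), ("arma", "N")],
]

# A unigram tagger assigns each word its most frequent tag from the
# training data; words never seen in training get None.
tagger = UnigramTagger(train_sents)
print(tagger.tag(["arma", "cano"]))  # [('arma', 'N'), ('cano', 'V')]
```

In practice one would chain taggers with a backoff (e.g. a `DefaultTagger`) so unseen words still receive a tag, as the NLTK cookbook chapters recommend.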
Thanks again, both, and please holler if you get badly stuck on a problem. I try not to be touchy about edits to my code, so if you see anything of mine that looks like a bad idea, try your hand at making it better, more intuitive, and/or more Pythonic.
Ok, I've pushed my initial work to my fork at https://github.com/smargh/cltk. You can see the class-based structure I am taking. I have some initial code for a base `CLTK` class, which integrates with a `config` file (right now, the only thing this is used for is to alter the location of the `cltk_data` directory). I then have the beginnings of the `Corpus` class, which is the foundation class for all of the individual corpus classes. I have renamed `/corpus` to `/corpora` (more specific, and it avoids conflict with `corpus.py`), and within that dir I have the beginnings of the individual corpus classes. These are all sub-classes of the `Corpus` base class. You can see that a lot of redundancy is already gone and the individual corpus classes are very streamlined. The idea is to put any corpus-specific code into these classes, while all shared code (I define shared as used by two or more corpora) goes in the `Corpus` base class. You will also see that I am structuring each individual corpus to have two classes: one for the corpus as a whole (where you can download and compile), and one for a specific document in that corpus. Any of the text/format manipulation will go there. You can see some (not very good or directed) examples of this in the `perseus_greek.py` file.
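The config lookup mentioned above might look something like this. This is a hypothetical sketch: the file name, section, and option names are all assumptions, not the fork's actual format:

```python
import configparser
import os

def data_dir(config_path="~/.cltk.cfg"):
    """Return the cltk data directory, honoring an optional config override."""
    path = os.path.expanduser(config_path)
    parser = configparser.ConfigParser()
    # parser.read() returns the list of files successfully read, so a
    # missing config file simply falls through to the default
    if parser.read(path) and parser.has_option("cltk", "data_dir"):
        return os.path.expanduser(parser.get("cltk", "data_dir"))
    return os.path.expanduser("~/cltk_data")

print(data_dir())  # falls back to ~/cltk_data when no config is present
```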
Right now, my plan is to mirror the current API in `compiler.py`, but have it call these classes (and their `retrieve()` methods) for downloading. So the downloading stage will be able to remain the same, but users can then access any corpus, or any text in a corpus, as an object. I think that on the data side a fully fledged object-oriented approach will make access more user-friendly as well as more flexible. By going fully modular, we ought to enable easier "hacking" of CLTK.
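The backward-compatible wrapper idea can be sketched like this. The function and class names are illustrative, not the actual `compiler.py` API:

```python
class Corpus:
    """Base class; concrete corpora would subclass this."""
    def __init__(self, name):
        self.name = name

    def retrieve(self):
        # In real code this would download the corpus into ~/cltk_data
        return f"downloading {self.name}"

# Registry mapping old-style corpus names to the new corpus objects
CORPORA = {"perseus_greek": Corpus("perseus_greek")}

def import_corpus(name):
    """Old-style entry point, now a thin wrapper over the class API."""
    return CORPORA[name].retrieve()

print(import_corpus("perseus_greek"))  # downloading perseus_greek
```

Existing callers keep using the function, while new code can work with the corpus objects directly.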
How does this sound? Thoughts, comments, concerns?
stephen
Hi Stephen,
This is more than I expected and I am very impressed.
Concerning the big-picture stuff you're talking about, I think you're right on target. Your object-oriented approach to interacting with the corpora is especially apt.
A couple of nit-picky points/questions:
- I have been trying to shadow `nltk`'s directory structure when possible. This leads me to think that it would be preferable to keep the dir name `corpus` and to keep most of your new code (`corpus.py`, `soup_utils.py`, `config.py`, `main.py`) within it.
- Specialized handling for specific languages can wait for now. If we do what you're working on well, we can add this modularly, as you say.
- I have done work here and there for parsing TLG and PHI texts, so I can consolidate and contribute that to `tlg.py` (and `phi5.py` and `phi7.py`).
I would really like to see this in action. What can I do to help it along? From my end, the closer the repository stays to the current cltk master, the smoother the transition. If you can get this new code into `corpus/` and write out a few example commands to illustrate usage, I think the corpus imports at least will be close to ready. From there, it's just a matter of writing and improving text cleanup and interaction for specific corpora.
Thanks,
Kyle
My thought is that, initially, `corpus.py` can handle the expansion to other languages. The `Corpus` base class should stay as generic as possible, so that ought not to be a problem. And all corpus-specific tasks will reside in that corpus' sub-class anyway. Aside from that, we can create a new `Document` class (I'm actually going to mirror the `Corpus` -> specific-corpus structure) for any new language. So, in the future, there are two base classes: `Corpus`, for the corpus-level functions, and `Document`, for the document-level functions. Each individual corpus will have corresponding classes: e.g. `PerseusGreek` and `PerseusGreekDoc`, which are sub-classes of the base classes. This would effectively eliminate any language-specific problems, as such things would live in the corpus-specific code.
Now, we may want to add one level of complexity, for flexibility and efficiency, and make some intermediate classes. On the corpus side, some corpora only require downloading from the internet (`retrieve()`), while others require a local directory. We could make sub-classes of the `Corpus` base class for these two basic types (maybe `RemoteCorpus` and `LocalCorpus`), and individual corpora would sub-class one of these depending on their type. For documents, we would do similarly. We could have `TEIDoc` and `TXTDoc` (I have actually already started these), and even `GreekDoc`, `LatinDoc`, and then any other languages. Corpus-specific classes would then be sub-classes of these.
This would allow the most generic code to go into `Corpus` and `Document`, any redundant code to go into the intermediate classes, and corpus-specific code to go into the corpus sub-sub-class. Thus, when new languages are added, we would definitely add a new corpus sub-sub-class and maybe a new intermediate sub-class, but the base classes would remain the same. Not only would this hopefully make adding more languages and corpora easier, but it should also help direct people's thinking: whatever divisions we make in the intermediate classes will guide people in considering how to classify the specific corpus they want to add.
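Put together, the three-level hierarchy might be sketched like this. Only `Corpus`, `RemoteCorpus`, `LocalCorpus`, `PerseusGreek`, and `TLG` come from the discussion above; the method bodies are invented placeholders:

```python
class Corpus:
    """Most generic corpus behavior (level 1)."""

class RemoteCorpus(Corpus):
    """Corpora fetched over the network (level 2)."""
    def retrieve(self):
        return f"fetching {type(self).__name__} from remote"

class LocalCorpus(Corpus):
    """Corpora read from a local directory, e.g. TLG, PHI (level 2)."""
    def retrieve(self):
        return f"reading {type(self).__name__} from local disk"

class PerseusGreek(RemoteCorpus):
    """Corpus-specific code goes in the sub-sub-class (level 3)."""

class TLG(LocalCorpus):
    """Corpus-specific code goes in the sub-sub-class (level 3)."""

print(PerseusGreek().retrieve())  # fetching PerseusGreek from remote
print(TLG().retrieve())           # reading TLG from local disk
```

Adding a new corpus then means writing only a small level-3 class and picking (or adding) the intermediate class it belongs under.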
Anyway, those are my two cents. I'm finishing up the `TLG` sub-class now. This is actually what led to the intermediate-classes thought: while all the remote corpora were easy to write once I had the `Corpus` class, I'm putting a lot of code in the `TLG` class which I think could be more generically useful for the other local corpora. So, once I finish it and turn to PHI, I ought to have a much clearer picture of what such a three-level setup would look like. The goal, regardless, is to make the corpus-specific classes as lightweight as possible. That will make adding new ones as easy as possible.
The `Corpus` and `Document` classes sound great. And I love the `RemoteCorpus` and `LocalCorpus` retrieval idea. Can't wait to see it!
Basic skeletons of `RemoteCorpus` and `LocalCorpus` are up on my fork. For now, I think this discussion can be closed. I want to move to more structured conversations/issues.