Comments (5)
@herrtao Wow, somehow this completely slipped by me -- apologies for not responding.
Tethne is primarily designed for cases where you are starting with bibliographic metadata (e.g. from Web of Science, JSTOR, Zotero). If you're just working with a bunch of plain-text files, then there are potentially simpler approaches.
As a starting point, you might take a look at the notebooks in this project. There are several different workflows -- in the topic modeling sections, there are notebooks that demonstrate LDA with Tethne/MALLET and with gensim. In particular, this notebook demonstrates LDA with gensim -- if you don't have metadata, you can just skip or comment out those parts.
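For readers without the notebook at hand, here is a rough sketch of what the metadata-free gensim route might look like. It is not taken from the notebooks: the helper names `load_texts` and `fit_lda`, the crude tokenizer, and the default parameter values are all assumptions made for illustration.

```python
import io
import os
import re


def load_texts(directory, pattern=r'.+\.txt'):
    """Return {filename: list of lowercase word tokens} for matching files."""
    texts = {}
    for name in sorted(os.listdir(directory)):
        # Anchor the pattern so that the whole filename has to match.
        if not re.match(pattern + '$', name):
            continue
        with io.open(os.path.join(directory, name), encoding='utf-8') as handle:
            # Crude tokenizer (an assumption): keep runs of ASCII letters only.
            texts[name] = re.findall(r'[a-z]+', handle.read().lower())
    return texts


def fit_lda(texts, num_topics=5, passes=10):
    """Fit a gensim LDA model on the token lists returned by load_texts."""
    # Imported here so that load_texts works without gensim installed.
    from gensim import corpora, models

    documents = list(texts.values())
    dictionary = corpora.Dictionary(documents)
    bag_of_words = [dictionary.doc2bow(tokens) for tokens in documents]
    return models.LdaModel(bag_of_words, num_topics=num_topics,
                           id2word=dictionary, passes=passes)
```

Something like `fit_lda(load_texts('/path/to/directory/with/texts')).print_topics()` would then list the fitted topics. A real workflow would probably also want stopword removal and a better tokenizer.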
I hope that helps! Let me know if you have any other questions. We can also discuss further off-channel if you'd prefer ([email protected]).
from tethne.
This will be TETHNE-131
@herrtao Take a look at this thread for a related discussion. It's not exactly what you asked, but it may be helpful.
@herrtao Ok, as of v0.8.1.dev5 this is now a feature! Since this is a pre-release version you'll have to upgrade Tethne with the --pre flag.
pip install -U tethne --pre
Here's an example. Please let me know what you think. If you run into issues, or have other requests, please check out our new Q/A group.
>>> from tethne.readers.plain_text import read
>>> corpus = read('/path/to/directory/with/texts')
To use the corpus for topic modeling, you could then do:
>>> model = LDAModel(corpus, featureset_name='plain_text')
>>> model.fit(Z=5, max_iter=200)
More documentation will be forthcoming, but here's the docstring for now:
"""
Generate a :class:`.Corpus` from a collection of plain-text files.

Plain-text content will be available as a feature set called "plain_text".
Uses :class:`nltk.corpus.reader.plaintext.PlaintextCorpusReader`\.

Parameters
----------
path : str
    Path to a directory containing plain text files.
pattern : str
    (default: '.+\.txt') A RegEx pattern used to select texts for inclusion
    in the corpus. By default will select any file ending in `.txt`.
extractor : function
    This function can be used to parse the name of each file for additional
    metadata. It should accept a single string (the filename), and return
    a dictionary of fields and values. These fields will be added to the
    resulting :class:`.Paper` instance.
index_by : str
    (default: 'fileid') Field on :class:`.Paper` to use as the primary
    index.
structured : bool
    (default: True) If True, the contents of the document collection will be
    represented by a :class:`.StructuredFeatureSet`\. If False, a
    :class:`.FeatureSet` will be used instead. Setting ``structured=False``
    is appropriate if word-order does not matter (e.g. topic modeling).
corpus : bool
    (default: True) If False, will return a list of :class:`.Paper`
    instances rather than a :class:`.Corpus`\.
kwargs : kwargs
    Any additional kwargs will be passed to the
    :class:`nltk.corpus.reader.plaintext.PlaintextCorpusReader` constructor.
    Refer to the `NLTK documentation
    <http://www.nltk.org/api/nltk.corpus.reader.html#nltk.corpus.reader.plaintext.PlaintextCorpusReader>`_
    for details.

Returns
-------
:class:`.Corpus`
"""
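To make the `extractor` parameter a bit more concrete, here is a minimal sketch of a function with the shape the docstring describes: it accepts a filename and returns a dictionary. The `author_year.txt` naming convention, the field names `author` and `date`, and the helper names `extract_metadata` and `build_corpus` are assumptions for illustration, not something Tethne prescribes. Whether the filename arrives with a directory prefix is also not stated in the docstring, so the pattern tolerates one.

```python
import re

# Hypothetical naming convention, e.g. "smith_1999.txt".
FILENAME_PATTERN = re.compile(r'(?P<author>[^_/]+)_(?P<date>\d{4})\.txt$')


def extract_metadata(filename):
    """Parse author and year from a filename; return {} if it does not match."""
    match = FILENAME_PATTERN.search(filename)
    if match is None:
        return {}
    return {'author': match.group('author'), 'date': int(match.group('date'))}


def build_corpus(path):
    """Read a directory of texts, passing extract_metadata as the extractor."""
    # Imported lazily so that extract_metadata can be tried without Tethne.
    from tethne.readers.plain_text import read
    # structured=False, since word order does not matter for topic modeling.
    return read(path, extractor=extract_metadata, structured=False)
```

Returning an empty dictionary for filenames that do not follow the convention is a design choice in this sketch; raising an error instead may be preferable if every file is expected to carry metadata.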
thanks for the reply!