stanfordhci / termite

(development moved to new repos)

License: BSD 3-Clause "New" or "Revised" License

JavaScript 43.79% CSS 2.85% Shell 6.53% Python 44.87% Scala 1.95%

termite's Introduction

Current Development

Starting in 2014, we have split Termite into two components:

Termite Data Server for processing the output of topic models and providing the content as a web service
Termite Visualizations for visualizing topic model outputs in a web browser

Our goals are to:

support multiple topic modeling tools
reduce the cost of developing new visualizations through shared infrastructure
allow multiple visualizations to interact with any number of topic modeling tools and with other visualizations

Please see the respective repositories for the latest software and additional information.

Termite

Termite is a visualization tool for inspecting the output of statistical topic models based on the techniques described in the following publication. For more details about this repository, see the file "README.old".

Termite: Visualization Techniques for Assessing Textual Topic Models
Jason Chuang, Christopher D. Manning, Jeffrey Heer
Computer Science Dept, Stanford University
http://vis.stanford.edu/papers/termite

termite's Issues

Bug with empty string term?

Hi folks,
I tried this out and got an error caused by an empty-string term. It looks like the parser does not filter u'' out of the term lists. In the Mallet case I can work around it, but the Stanford topic modeler seems more likely to produce only empty strings, and then it fails completely: no topic terms are output at all. If you want my input file (tab-separated text), email me and I can send it.

Good work though...
Lynn
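
One possible workaround for the empty-term problem is to drop empty or whitespace-only terms, together with their weights, before the term lists are used. This is a minimal sketch and not the project's actual parser; the function name drop_empty_terms and the parallel-list layout are assumptions made for illustration.

```python
def drop_empty_terms(terms, weights):
    """Remove empty or whitespace-only terms along with their weights.

    Hypothetical helper: `terms` and `weights` are assumed to be
    parallel lists of equal length.
    """
    kept = [(term, weight) for term, weight in zip(terms, weights) if term.strip()]
    kept_terms = [term for term, _ in kept]
    kept_weights = [weight for _, weight in kept]
    return kept_terms, kept_weights


terms = ["cow", "", "moon", "  "]
weights = [0.5, 0.1, 0.3, 0.1]
clean_terms, clean_weights = drop_empty_terms(terms, weights)
```

Filtering terms and weights in one pass keeps the two lists aligned, which matters if later steps index them by position.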

font-awesome.css missing in v2.0.0

Modeling ran successfully, and I started the server with ./web.py

Extracting topic model outputs: [data/music/lda] --> [data/music/entry-0000]
--------------------------------------------------------------------------------
Importing a Mallet model...
    model = data/music/lda
    output = data/music/entry-0000
    min_term_freq = 20
    min_term_count = 5
Reading "topic-word-weights.txt" from Mallet...
Writing data to disk...
--------------------------------------------------------------------------------
ericks-air:termite-2.0.0 erickpeirson$ ./web.py 
Web server is now running at http://localhost:8888
Press "Ctrl + C" to stop the web server.

At http://localhost:8888/client_src/ the Javascript log shows:

GET http://localhost:8888/client_src/css/font-awesome.css 404 (File not found)

Track state history (and prepare for UI logging)

Keep track of all changes in states in the states sqlite database. Add stubs to enable UI logging.

To be completed after issue #11.

Remove all data-related attributes from MatrixState except for dataID

Remove all data-related attributes "sparseMatrix", "rowDims", "columnDims", "rowAdmissions", "columnAdmissions" and make them private fields in MatrixState.

Update MatrixState to automatically invoke getData API whenever the attribute "dataID" changes.

Topic-to-Topic Comparison Visualization

A list of subtasks to come.

SQLite implementation of states.json

Convert states.json (located in each dataset entry's folder) from a plain JSON file to a SQLite database.

Basic server API

Problems generating example

I'm running into the following error:

$ ./execute.py example.cfg
--------------------------------------------------------------------------------
Tokenizing source corpus...
    corpus_path = corpus/toy.txt (file)
    model_path = output/example-project/topic-model (mallet)
    data_path = output/example-project
    num_topics = 3
    number_of_seriated_terms = 10
--------------------------------------------------------------------------------
Current time = Wed Jun  5 11:01:52 2013
--------------------------------------------------------------------------------
Tokenizing source corpus...
    corpus_path = corpus/toy.txt (file)
    data_path = output/example-project
    tokenziation = [A-Za-z_]+
Connecting to data...
Reading from disk...
Traceback (most recent call last):
  File "./execute.py", line 164, in <module>
    main()
  File "./execute.py", line 161, in main
    Execute( logging_level ).execute( corpus_format, corpus_path, model_library, model_path, data_path, num_topics, number_of_seriated_terms )
  File "./execute.py", line 70, in execute
    Tokenize( self.logger.level ).execute( corpus_format, corpus_path, data_path )
  File "/cygdrive/u/termite/pipeline/tokenize.py", line 55, in execute
    self.documents.read()
  File "/cygdrive/u/termite/pipeline/api_utils.py", line 26, in read
    docID, docContent = line.split( '\t' )
ValueError: need more than 1 value to unpack

This is what happens when the "documents" in toy.txt are line separated. I get the error ValueError: too many values to unpack when I try to separate by tabs. Here is the toy file:

1115 W Franklin
Bessy the Cow
Big Farm Way
The cow jumped over the moon
Look at me
I'm some text
What will be next?
Who knows
Look, over there

Any help is much appreciated!
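
Both errors come from the line docID, docContent = line.split( '\t' ), which requires exactly one tab per line: a line with no tab yields one value ("need more than 1 value to unpack") and a line with several tabs yields too many. The sketch below shows a more tolerant parser, assuming the expected format is one document per line as docID, a tab, then the content. The function name read_documents and the choice to skip malformed lines are illustrative assumptions, not the code in api_utils.py.

```python
def read_documents(lines):
    """Parse 'docID<TAB>docContent' lines, skipping lines without a tab.

    Returns the parsed documents and the 1-based numbers of skipped lines.
    """
    documents = {}
    skipped = []
    for line_number, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # ignore blank lines
        # maxsplit=1 keeps any additional tabs inside the document content
        parts = line.split("\t", 1)
        if len(parts) != 2:
            skipped.append(line_number)
            continue
        doc_id, doc_content = parts
        documents[doc_id] = doc_content
    return documents, skipped


sample = ["01\t1115 W Franklin\n", "Bessy the Cow\n", "03\tBig\tFarm Way\n"]
documents, skipped = read_documents(sample)
```

With the toy file above, every line lacks a docID and a tab, so each line would need an identifier prepended (for example 01, a tab, then 1115 W Franklin) before the pipeline can read it.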

Dump document content and metadata for each dataset.

Basic client code

Problem in visualizing generated model

Hi all,

I imported one file with one passage per line and ran LDA modeling. The LDA model trains successfully, but I cannot visualize the result.

The data format is like this:
01 [tab] xxx xxx

This is part of the output:

Extracting topic model outputs: [data/esoap/lda] --> [data/esoap/entry-0000]

Importing a Mallet model...
model = data/esoap/lda
output = data/esoap/entry-0000
min_term_freq = 20
min_term_count = 5
Reading "topic-word-weights.txt" from Mallet...

Writing data to disk...

Creating default index file: /index.json
Creating default state file: data/esoap/entry-0000/states.json

For more context, the index.json file contains:
{ "runID" : "$RUN_IDENTIFIER", "entryIDs" : [ 0 ], "nextEntryID" : 1 }
while the states.json file contains nothing but a pair of empty brackets.

Has anyone successfully run the visualization? Please help!

Cheers!

Add setEntry.py server API

Add a setEntry.py script to match getEntry.py, in order to support synchronization of states between the server and the client.

Add synchronized file/database access on the server

At the moment, states on the server are saved as flat JSON files. This is convenient for now, but runs the risk that if two setEntry.py calls are made at the same time, we'll have two threads writing to the same file, potentially corrupting the data. Changes here involve converting file read/write calls to SQLite read/write calls, and ensuring that the calls are synchronized.

To be completed after issues #10 and #11.
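
The move from flat JSON files to SQLite could look roughly like the sketch below. It is only an illustration under stated assumptions: the table name states, the columns entry_id and state, and the three helper functions are hypothetical and do not come from the repository. Each write runs inside a transaction, so concurrent writers are serialized by SQLite's locking instead of overwriting a shared file.

```python
import json
import sqlite3


def open_state_store(path):
    """Open (or create) a SQLite database holding one JSON state per entry."""
    # timeout makes a writer wait for a competing lock instead of failing at once
    conn = sqlite3.connect(path, timeout=10.0)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS states ("
        "entry_id INTEGER PRIMARY KEY, state TEXT NOT NULL)"
    )
    conn.commit()
    return conn


def set_state(conn, entry_id, state):
    """Write a state atomically; the transaction rolls back on error."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute(
            "INSERT OR REPLACE INTO states (entry_id, state) VALUES (?, ?)",
            (entry_id, json.dumps(state)),
        )


def get_state(conn, entry_id):
    """Return the stored state for entry_id, or None if there is none."""
    row = conn.execute(
        "SELECT state FROM states WHERE entry_id = ?", (entry_id,)
    ).fetchone()
    return json.loads(row[0]) if row else None


conn = open_state_store(":memory:")
set_state(conn, 0, {"rowLabels": ["cow"]})
set_state(conn, 0, {"rowLabels": ["moon"]})
latest = get_state(conn, 0)
missing = get_state(conn, 1)
```

Storing the state as a JSON string keeps the client-facing format unchanged while delegating locking and atomic writes to the database.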

SQLite implementation of index.json

Convert the index file for each dataset from a JSON file to a SQLite database.

Document Viewer

A list of subtasks to come.

Your setup has a bad line

Starting at line 146, change to:
echo "Uncompressing Google Closure Compiler..."
unzip $LIBRARY/compiler-latest.zip closure-compiler-v20170423.jar -d $LIBRARY
mv $LIBRARY/closure-compiler-v20170423.jar $LIBRARY/closure-compiler.jar

Note that the name of the jar file has changed.

MemoryError in computing term similarity

Using Termite (with Mallet) for topic model visualisation, I encountered two errors:

1. In io_utils.py, str(some string) should be changed (e.g. to unicode(some string)) so that people can work with non-ASCII text too.

2. I am trying to visualise a topic model of more than 300,000 Persian documents, and I get this error, which indicates that my 32 GB of RAM is exhausted.
The stack trace:

Computing term similarity...
data_path = output/example-project
sliding_window_size = 10
Connecting to data...
Reading data from disk...
Computing document co-occurrence...
Traceback (most recent call last):
  File "./execute.py", line 166, in <module>
    main()
  File "./execute.py", line 163, in main
    Execute( logging_level ).execute( corpus_format, corpus_path, model_library, model_path, data_path, num_topics, number_of_seriated_terms )
  File "./execute.py", line 88, in execute
    ComputeSimilarity( self.logger.level ).execute( data_path )
  File "/home/.../termite-master/pipeline/compute_similarity.py", line 50, in execute
    self.computeDocumentCooccurrence()
  File "/home/.../termite-master/pipeline/compute_similarity.py", line 93, in computeDocumentCooccurrence
    self.incrementCount( cooccurrence, (aToken, bToken) )
  File "/home/.../termite-master/pipeline/compute_similarity.py", line 76, in incrementCount
    occurrence[ key ] = 1
MemoryError

Running the code again doesn't solve anything. The code consumes all the memory I have and wants more.
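
The number of distinct term pairs grows roughly with the square of the vocabulary size, which is one plausible reason a dictionary keyed by (aToken, bToken) exhausts memory on a large corpus. The sketch below illustrates one way to bound that growth: count pairs only among the most frequent terms, and store each unordered pair once. It is not the algorithm in compute_similarity.py, it changes the result by ignoring rare terms, and the names document_cooccurrence and max_vocab are assumptions.

```python
from collections import Counter
from itertools import combinations


def document_cooccurrence(documents, max_vocab):
    """Count document-level co-occurrence among the max_vocab most frequent terms.

    `documents` is an iterable of token lists. Terms outside the retained
    vocabulary are ignored, which caps the number of possible pairs.
    """
    # First pass: document frequency of every term
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))
    vocab = {term for term, _ in doc_freq.most_common(max_vocab)}

    # Second pass: count each unordered pair once per document
    counts = Counter()
    for tokens in documents:
        kept = sorted(set(tokens) & vocab)
        # sorted() gives every unordered pair a single canonical key
        counts.update(combinations(kept, 2))
    return counts


docs = [["cow", "moon", "jumped"], ["cow", "moon"], ["cow", "farm"]]
counts = document_cooccurrence(docs, max_vocab=2)
```

With max_vocab terms retained, at most max_vocab * (max_vocab - 1) / 2 keys can exist, so memory use no longer depends on the full vocabulary size.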

Add fetch() support to MatrixState

Add JavaScript code to read data and state information from the server (via cgi-bin/getEntry.py) whenever the attribute "dataID" changes.

In MatrixState.js:

Define the appropriate MatrixState.prototype.fetch function.

In index.html:

Currently, the webpage loads states/data from the server via an AJAX call and then initializes the matrix by calling setEntries() or setMatrix(). The webpage should now only need to initialize a MatrixState object and set its dataID attribute.

Implement issue #9 prior to this task.

Cannot use

Hi,

I have the following issue:

docID, docContent = line.split( '\t' )

ValueError: need more than 1 value to unpack

Could you please let me know how to fix this.

Problem generating visualization with sample data

I am running into this error at the end of processing input file (formatted with number[tab]text[\n]):

Iteration no.  74
#-------- breaking out early ---------#
candidates checked:  4
change in energy:  0.0
maxTerm:  
maxPosition:  0
Traceback (most recent call last):
  File "./execute.py", line 164, in <module>
    main()
  File "./execute.py", line 161, in main
    Execute( logging_level ).execute( corpus_format, corpus_path, model_library, model_path, data_path, num_topics, number_of_seriated_terms )
  File "./execute.py", line 89, in execute
    ComputeSeriation( self.logger.level ).execute( data_path, number_of_seriated_terms )
  File "/media/drive/Downloads/termite-master/pipeline/compute_seriation.py", line 56, in execute
    self.compute( numSeriatedTerms )
  File "/media/drive/Downloads/termite-master/pipeline/compute_seriation.py", line 103, in compute
    (candidateTerms, self.seriation.term_ordering, self.seriation.term_iter_index, self.buffers) = self.iterate_eff(candidateTerms, self.seriation.term_ordering, self.seriation.term_iter_index, self.buffers, self.bestEnergies, iteration)
  File "/media/drive/Downloads/termite-master/pipeline/compute_seriation.py", line 205, in iterate_eff
    candidateTerms.remove(maxTerm)
ValueError: list.remove(x): x not in list

Input data file: http://db.tt/XYfTQDzG
Complete console log: http://db.tt/19rTChaM

@jcchuang - I would appreciate any help to get Termite running without any issues! Thanks.

No results displayed in v2.0.0

Modeling ran successfully, and I started the server with ./web.py. I navigated to http://localhost:8888/client_src/ and see the following:

The javascript log shows:

Uncaught TypeError: Cannot read property 'dataIDs' of null (index):105
(anonymous function) (index):105
(anonymous function) d3.js:1939
event d3.js:430
respond

Need a proper web application framework

Removed all previous custom web.py and server_src code.
Reimplemented all functionalities in web2py.
Updated MALLET import scripts to match web2py.

d3 download link out of date in setup.sh

When I ran the setup, it gave me some errors concerning the d3.v3.zip that was downloaded. I looked at the file, and it was only 9 kB instead of the 123 kB I get when I download it manually from d3's website. I changed the URL to point to the newest download, and that fixed the issue for me.

Hash representation for "rowLabels" and "columnLabels"

Currently, MatrixState contains two copies of row labels (i.e., terms) and column labels (i.e., topic names).

When MatrixState reads the matrix information associated with a dataset, the received data (saved in MatrixState.data) contains two arrays "terms" and "topics" that are the list of vocabulary used in a topic model and the default names for topics, respectively. Currently, user-defined row labels and column labels are saved as arrays "rowLabels" and "columnLabels" as MatrixState attributes.

In MatrixState.js file:

Create two new attributes "rowModifiedLabels" and "columnModifiedLabels" of type {number: String} that record only the modified labels.
Automatically update attributes "rowLabels" and "columnLabels" whenever attributes "rowModifiedLabels" and "columnModifiedLabels" are changed.
Modify functions rowLabels() and columnLabels() so that they write to (or unset) attributes "rowModifiedLabels" and "columnModifiedLabels".
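
The idea of storing only modified labels can be illustrated independently of the client code. The sketch below is written in Python rather than in MatrixState.js, and the helper names effective_labels and set_label are hypothetical. The full label list is derived from the defaults plus a sparse {index: label} mapping, and unsetting a label removes its key so the default shows through again.

```python
def effective_labels(default_labels, modified_labels):
    """Overlay the sparse {index: label} overrides on the default labels."""
    return [
        modified_labels.get(index, label)
        for index, label in enumerate(default_labels)
    ]


def set_label(modified_labels, index, label=None):
    """Return a new override mapping; label=None unsets the override."""
    updated = dict(modified_labels)
    if label is None:
        updated.pop(index, None)
    else:
        updated[index] = label
    return updated


default_topics = ["Topic 0", "Topic 1", "Topic 2"]
overrides = set_label({}, 1, "Farm animals")
labels = effective_labels(default_topics, overrides)
restored = effective_labels(default_topics, set_label(overrides, 1))
```

Only the override mapping would need to be synchronized with the server, since the defaults can always be recovered from the dataset itself.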

Build inverted index to support search.

problems running termite

Hi, I am running into the following error:

./execute.py example.cfg
Traceback (most recent call last):
  File "./execute.py", line 166, in <module>
    main()
  File "./execute.py", line 143, in main
    logging_level = config.getint( 'Misc', 'logging' )
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/ConfigParser.py", line 351, in getint
    return self._get(section, int, option)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/ConfigParser.py", line 348, in _get
    return conv(self.get(section, option))
ValueError: invalid literal for int() with base 10: '20 # Display info messages'

My example file is tab delimited (from one of the examples someone reported in a previous issue):

01 1115 W Franklin
02 Bessy the Cow
03 Big Farm Way
04 The cow jumped over the moon
05 Look at me

I am not sure what might be causing this error. Any help will be appreciated.
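
The error message shows that the value read for the logging option is the whole string '20 # Display info messages', so the trailing comment was treated as part of the value and the input file is not involved. In Python 2.7's ConfigParser, only ";" starts an inline comment, so the simplest workaround there is to move the "#" comment onto its own line in example.cfg. The sketch below reproduces the behaviour with Python 3's configparser, which is an assumption since the traceback comes from Python 2.7, and shows the inline_comment_prefixes option that Python 3 provides. The config text is a reduced stand-in for the real file.

```python
import configparser

# Reduced stand-in for the relevant part of example.cfg
CONFIG_TEXT = """
[Misc]
logging = 20 # Display info messages
"""

# Default parser: the inline comment stays part of the value
strict = configparser.ConfigParser()
strict.read_string(CONFIG_TEXT)
try:
    strict.getint("Misc", "logging")
    strict_error = None
except ValueError as error:
    strict_error = str(error)

# Python 3 only: strip inline comments introduced by "#"
lenient = configparser.ConfigParser(inline_comment_prefixes=("#",))
lenient.read_string(CONFIG_TEXT)
logging_level = lenient.getint("Misc", "logging")
```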

Incremental changes for "rowLabels" and "columnLabels"

When synchronizing states to the server, read/write only "rowModifiedLabels" and "columnModifiedLabels". Refer to issue #22.

Add sync() support for MatrixState

Add client JavaScript code to write states back to the server (via cgi-bin/setStates.py). Changes involve defining the MatrixState.prototype.sync function (in place of the default Backbone implementation).

To be completed after issue #6.

Use sparse matrix representation throughout the entire data processing pipeline.

Right now, there are two ways to load a term-topic probability matrix: By calling either MatrixState.importMatrix() or MatrixState.importEntries().

The former loads a 2D array of numbers, and the latter loads a 1D array of non-zero entries. For matrices with a large number of zeros (as in our case), the latter is the more efficient representation. However, the client code currently passes the matrix from MatrixState to MatrixModel using a full matrix representation, nullifying any advantage we get from loading a sparse matrix.
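
The difference between the two representations can be shown with a small sketch. It is written in Python rather than in the client's JavaScript, and the (row, column, value) tuple layout is an assumption; the actual entry format used by importEntries() is not shown here.

```python
def dense_to_entries(matrix):
    """Convert a 2D list into a list of (row, column, value) non-zero entries."""
    return [
        (row_index, col_index, value)
        for row_index, row in enumerate(matrix)
        for col_index, value in enumerate(row)
        if value != 0
    ]


def entries_to_dense(entries, num_rows, num_cols):
    """Rebuild the full 2D matrix from its non-zero entries."""
    matrix = [[0.0] * num_cols for _ in range(num_rows)]
    for row_index, col_index, value in entries:
        matrix[row_index][col_index] = value
    return matrix


dense = [
    [0.0, 0.7, 0.0],
    [0.0, 0.0, 0.0],
    [0.3, 0.0, 0.0],
]
entries = dense_to_entries(dense)
round_trip = entries_to_dense(entries, 3, 3)
```

Here the dense form stores nine numbers while the sparse form stores two entries; expanding back to the dense form at any stage of the pipeline gives up that saving, which is the concern raised above.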

Split {getEntry, setEntry} API into {getData, getStates, setStates}

Divide server-client communication by content: data, which can only be generated by a topic model and is provided by the server, and states, which can be read and written by the client.