
termite's Introduction

Current Development

Starting in 2014, we have split Termite into two components:

Our goals are to:

  • support multiple topic modeling tools
  • reduce the cost of developing new visualizations through shared infrastructure
  • allow multiple visualizations to interact with any number of topic modeling tools and with other visualizations

Please see the respective repositories for the latest software and additional information.

Termite

Termite is a visualization tool for inspecting the output of statistical topic models based on the techniques described in the following publication. For more details about this repository, see the file "README.old".

Termite: Visualization Techniques for Assessing Textual Topic Models
Jason Chuang, Christopher D. Manning, Jeffrey Heer
Computer Science Dept, Stanford University
http://vis.stanford.edu/papers/termite

termite's People

Contributors

ashpjin, elmer-garduno, jcchuang, joeycozza, ys-l

termite's Issues

Bug with empty string term?

Hi folks,
I've tried this out and get an error with an empty string term. It looks like the parser is not filtering out u'' from the term lists. In the mallet case, I can work around it, but the stanford topic modeler seems to be more likely to find only empty strings and fails totally - no topic terms output at all. If you want the file I input, I can send it if you email me. (It's tab sep txt.)

Good work though...
Lynn
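
One possible workaround for the empty-term problem, sketched under assumptions: the helper name drop_empty_terms and the parallel lists of terms and weights are hypothetical and are not taken from Termite's parser. The idea is only to drop blank terms, together with their weights, before anything is written to disk.

```python
def drop_empty_terms(terms, weights):
    """Return (terms, weights) with blank terms and their weights removed.

    Hypothetical helper: Termite's own parser may store terms differently.
    """
    # Keep a pair only if the term still has characters after stripping whitespace.
    kept = [(term, weight) for term, weight in zip(terms, weights) if term.strip()]
    return [term for term, _ in kept], [weight for _, weight in kept]


terms, weights = drop_empty_terms(["cow", "", "moon", "  "], [0.4, 0.3, 0.2, 0.1])
print(terms, weights)
```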

font-awesome.css missing in v2.0.0

Modeling ran successfully, and I started the server with ./web.py

Extracting topic model outputs: [data/music/lda] --> [data/music/entry-0000]
--------------------------------------------------------------------------------
Importing a Mallet model...
    model = data/music/lda
    output = data/music/entry-0000
    min_term_freq = 20
    min_term_count = 5
Reading "topic-word-weights.txt" from Mallet...
Writing data to disk...
--------------------------------------------------------------------------------
ericks-air:termite-2.0.0 erickpeirson$ ./web.py 
Web server is now running at http://localhost:8888
Press "Ctrl + C" to stop the web server.

At http://localhost:8888/client_src/ the Javascript log shows:

GET http://localhost:8888/client_src/css/font-awesome.css 404 (File not found)

Problems generating example

I'm running into the following error:

$ ./execute.py example.cfg
--------------------------------------------------------------------------------
Tokenizing source corpus...
    corpus_path = corpus/toy.txt (file)
    model_path = output/example-project/topic-model (mallet)
    data_path = output/example-project
    num_topics = 3
    number_of_seriated_terms = 10
--------------------------------------------------------------------------------
Current time = Wed Jun  5 11:01:52 2013
--------------------------------------------------------------------------------
Tokenizing source corpus...
    corpus_path = corpus/toy.txt (file)
    data_path = output/example-project
    tokenziation = [A-Za-z_]+
Connecting to data...
Reading from disk...
Traceback (most recent call last):
  File "./execute.py", line 164, in <module>
    main()
  File "./execute.py", line 161, in main
    Execute( logging_level ).execute( corpus_format, corpus_path, model_library, model_path, data_path, num_topics, number_of_seriated_terms )
  File "./execute.py", line 70, in execute
    Tokenize( self.logger.level ).execute( corpus_format, corpus_path, data_path )
  File "/cygdrive/u/termite/pipeline/tokenize.py", line 55, in execute
    self.documents.read()
  File "/cygdrive/u/termite/pipeline/api_utils.py", line 26, in read
    docID, docContent = line.split( '\t' )
ValueError: need more than 1 value to unpack

This is what happens when the "documents" in toy.txt are line separated. I get the error ValueError: too many values to unpack when I try to separate by tabs. Here is the toy file:

1115 W Franklin
Bessy the Cow
Big Farm Way
The cow jumped over the moon
Look at me
I'm some text
What will be next?
Who knows
Look, over there

Any help is much appreciated!
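
For context, the two-name unpacking in the traceback only succeeds when a line contains exactly one tab: a line with no tab gives "need more than 1 value to unpack", and a line with two or more tabs gives "too many values to unpack". Below is a minimal sketch of a more tolerant reader, assuming the expected input is one docID[tab]content pair per line. The function read_documents is hypothetical and is not part of Termite.

```python
def read_documents(lines):
    """Parse 'docID<TAB>content' lines into a dict, recording malformed lines.

    Hypothetical replacement for the two-name unpacking shown in the traceback.
    """
    documents = {}
    skipped = []  # 1-based numbers of lines that have no tab at all
    for line_number, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line:
            continue  # ignore blank lines
        # maxsplit=1 keeps any additional tabs inside the document content.
        parts = line.split("\t", 1)
        if len(parts) != 2:
            skipped.append(line_number)
            continue
        doc_id, content = parts
        documents[doc_id] = content
    return documents, skipped


docs, skipped = read_documents(["01\tBessy the Cow\n", "Big Farm Way\n", "03\tLook\tat me\n"])
print(docs, skipped)
```

With the toy file above, every line would land in the skipped list, because none of them starts with a document ID followed by a tab.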

Problem in visualizing generated model

Hi all,

I imported one file with one passage per line and ran LDA modeling. I can get the trained LDA model, but cannot visualize the result.

The data format is like this:
01 [tab] xxx xxx

This is part of the output:

Extracting topic model outputs: [data/esoap/lda] --> [data/esoap/entry-0000]

Importing a Mallet model...
model = data/esoap/lda
output = data/esoap/entry-0000
min_term_freq = 20
min_term_count = 5
Reading "topic-word-weights.txt" from Mallet...

Writing data to disk...

Creating default index file: /index.json
Creating default state file: data/esoap/entry-0000/states.json

As a further clue, the index.json file has the content:
{ "runID" : "$RUN_IDENTIFIER", "entryIDs" : [ 0 ], "nextEntryID" : 1 }
while the states.json file contains nothing but a pair of empty brackets.

Has anyone successfully run the visualization? Please help!

Cheers!

Add setEntry.py server API

Add a setEntry.py script to match getEntry.py, in order to support synchronization of states between the server and the client.

Add synchronized file/database access on the server

At the moment, states on the server are saved as flat JSON files. This is convenient for now, but runs the risk that if two setEntry.py calls are made at the same time, we'll have two threads writing to the same file, potentially corrupting the data. Changes here involve converting file read/write calls to SQLite read/write calls, and ensuring that the calls are synchronized.

To be completed after issues #10 and #11.
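
A minimal sketch of what SQLite-backed state storage could look like, using Python's standard sqlite3 module. The table name, column names, and function names are assumptions made for illustration; the issue does not specify a schema. SQLite serializes writers through its own locking, and each write below runs inside a transaction, so two simultaneous calls cannot interleave partial writes the way two threads writing one flat file can.

```python
import json
import sqlite3


def open_store(path):
    """Open (or create) the state database. Schema is hypothetical."""
    # timeout: how long a connection waits for another writer's lock to clear.
    conn = sqlite3.connect(path, timeout=10.0)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS states ("
        "entry_id INTEGER PRIMARY KEY, state TEXT NOT NULL)"
    )
    conn.commit()
    return conn


def set_state(conn, entry_id, state):
    """Store one entry's state as JSON, replacing any previous value."""
    # 'with conn' commits on success and rolls back if an exception is raised.
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO states (entry_id, state) VALUES (?, ?)",
            (entry_id, json.dumps(state)),
        )


def get_state(conn, entry_id):
    """Return the stored state for entry_id, or None if there is none."""
    row = conn.execute(
        "SELECT state FROM states WHERE entry_id = ?", (entry_id,)
    ).fetchone()
    return json.loads(row[0]) if row else None


conn = open_store(":memory:")
set_state(conn, 0, {"rowLabels": ["cow", "moon"]})
print(get_state(conn, 0))
```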

Your setup has a bad line

Starting at line 146, change to:
echo "Uncompressing Google Closure Compiler..."
unzip $LIBRARY/compiler-latest.zip closure-compiler-v20170423.jar -d $LIBRARY
mv $LIBRARY/closure-compiler-v20170423.jar $LIBRARY/closure-compiler.jar

Note that the name of the jar file has changed.

MemoryError in computing term similarity

Using Termite (with Mallet) for topic model visualisation, I encountered two errors:

1- In io_utils.py the str(some string) should be changed (e.g. to unicode(some string)) so that people can work with non-ASCII text too.

2- I'm trying to visualise a topic model of more than 300,000 Persian docs, and I get this error, which indicates that my 32 GB of RAM is exhausted.
The stack trace:

Computing term similarity...
data_path = output/example-project
sliding_window_size = 10
Connecting to data...
Reading data from disk...
Computing document co-occurrence...
Traceback (most recent call last):
  File "./execute.py", line 166, in <module>
    main()
  File "./execute.py", line 163, in main
    Execute( logging_level ).execute( corpus_format, corpus_path, model_library, model_path, data_path, num_topics, number_of_seriated_terms )
  File "./execute.py", line 88, in execute
    ComputeSimilarity( self.logger.level ).execute( data_path )
  File "/home/.../termite-master/pipeline/compute_similarity.py", line 50, in execute
    self.computeDocumentCooccurrence()
  File "/home/.../termite-master/pipeline/compute_similarity.py", line 93, in computeDocumentCooccurrence
    self.incrementCount( cooccurrence, (aToken, bToken) )
  File "/home/.../termite-master/pipeline/compute_similarity.py", line 76, in incrementCount
    occurrence[ key ] = 1
MemoryError

Running the code again doesn't solve anything. The code consumes all the memory I have and wants more.

Add fetch() support to MatrixState

Add JavaScript code to read data and state information from the server (via cgi-bin/getEntry.py) whenever the attribute "dataID" is changed.

In MatrixState.js:

  • Define the appropriate MatrixState.prototype.fetch function.

In index.html:

  • Currently, the webpage loads states/data from the server via an ajax call and then initializes the matrix by calling the functions setEntries() or setMatrix(). The webpage should now only need to initialize a MatrixState object and set its dataID attribute.

Implement issue #9 prior to this task.

Cannot use

Hi,

I have the following issue:

docID, docContent = line.split( '\t' )

ValueError: need more than 1 value to unpack

Could you please let me know how to fix this.

Problem generating visualization with sample data

I am running into this error at the end of processing input file (formatted with number[tab]text[\n]):

Iteration no.  74
#-------- breaking out early ---------#
candidates checked:  4
change in energy:  0.0
maxTerm:  
maxPosition:  0
Traceback (most recent call last):
  File "./execute.py", line 164, in <module>
    main()
  File "./execute.py", line 161, in main
    Execute( logging_level ).execute( corpus_format, corpus_path, model_library, model_path, data_path, num_topics, number_of_seriated_terms )
  File "./execute.py", line 89, in execute
    ComputeSeriation( self.logger.level ).execute( data_path, number_of_seriated_terms )
  File "/media/drive/Downloads/termite-master/pipeline/compute_seriation.py", line 56, in execute
    self.compute( numSeriatedTerms )
  File "/media/drive/Downloads/termite-master/pipeline/compute_seriation.py", line 103, in compute
    (candidateTerms, self.seriation.term_ordering, self.seriation.term_iter_index, self.buffers) = self.iterate_eff(candidateTerms, self.seriation.term_ordering, self.seriation.term_iter_index, self.buffers, self.bestEnergies, iteration)
  File "/media/drive/Downloads/termite-master/pipeline/compute_seriation.py", line 205, in iterate_eff
    candidateTerms.remove(maxTerm)
ValueError: list.remove(x): x not in list

Input data file: http://db.tt/XYfTQDzG
Complete console log: http://db.tt/19rTChaM

@jcchuang - I would appreciate any help to get Termite running without any issues! Thanks.

No results displayed in v2.0.0

Modeling ran successfully, and I started the server with ./web.py. I navigated to http://localhost:8888/client_src/ and see the following:

image

The javascript log shows:

Uncaught TypeError: Cannot read property 'dataIDs' of null (index):105
(anonymous function) (index):105
(anonymous function) d3.js:1939
event d3.js:430
respond

d3 download link out of date in setup.sh

When I ran the setup, it gave me some errors concerning the d3.v3.zip that was downloaded. I looked at it, and it was only 9 kB instead of the 123 kB I get when I download it manually from d3's website. I changed the URL to point to the newest download, and that fixed the issue for me.

Hash representation for "rowLabels" and "columnLabels"

Currently, MatrixState contains two copies of row labels (i.e., terms) and column labels (i.e., topic names).

When MatrixState reads the matrix information associated with a dataset, the received data (saved in MatrixState.data) contains two arrays "terms" and "topics" that are the list of vocabulary used in a topic model and the default names for topics, respectively. Currently, user-defined row labels and column labels are saved as arrays "rowLabels" and "columnLabels" as MatrixState attributes.

In the MatrixState.js file:

  • Create two new attributes "rowModifiedLabels" and "columnModifiedLabels" of type {number: String} that record only the modified labels.
  • Automatically update attributes "rowLabels" and "columnLabels" whenever attributes "rowModifiedLabels" and "columnModifiedLabels" are changed.
  • Modify functions rowLabels() and columnLabels() so that they write to (or unset) attributes "rowModifiedLabels" and "columnModifiedLabels".
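
The data structure proposed above can be sketched as follows. This is an illustration in Python, not the MatrixState.js code: the class LabelSet and its method names are hypothetical. Only the edited labels are stored, keyed by index, and the full label list is derived from the defaults plus the edits.

```python
class LabelSet:
    """Default labels plus a sparse {index: label} map of user edits.

    Hypothetical stand-in for one of rowLabels/columnLabels in the issue.
    """

    def __init__(self, defaults):
        self.defaults = list(defaults)  # e.g. the "terms" or "topics" array
        self.modified = {}              # only labels the user has changed

    def set_label(self, index, label=None):
        """Record a user-defined label, or unset it when label is None."""
        if label is None or label == self.defaults[index]:
            self.modified.pop(index, None)
        else:
            self.modified[index] = label

    def labels(self):
        """Derive the full label list: edits override defaults."""
        return [self.modified.get(i, d) for i, d in enumerate(self.defaults)]


topics = LabelSet(["Topic 1", "Topic 2", "Topic 3"])
topics.set_label(1, "Music")
print(topics.labels(), topics.modified)
```

Storing only the edits avoids keeping a second full copy of the labels, and it makes the state written back to the server proportional to the number of changes rather than to the vocabulary size.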

problems running termite

Hi, I am running into the following error:

./execute.py example.cfg
Traceback (most recent call last):
  File "./execute.py", line 166, in <module>
    main()
  File "./execute.py", line 143, in main
    logging_level = config.getint( 'Misc', 'logging' )
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/ConfigParser.py", line 351, in getint
    return self._get(section, int, option)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/ConfigParser.py", line 348, in _get
    return conv(self.get(section, option))
ValueError: invalid literal for int() with base 10: '20 # Display info messages'

My example file is tab delimited (from one of the examples someone reported in a previous issue):

01 1115 W Franklin
02 Bessy the Cow
03 Big Farm Way
04 The cow jumped over the moon
05 Look at me

I am not sure what might be causing this error. Any help will be appreciated.
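
The error message suggests that the trailing "# Display info messages" comment is being read as part of the option value, so int() receives the whole string. The sketch below reproduces this with Python 3's configparser, which is an assumption: the traceback is from Python 2.7's ConfigParser, and the config text here is reconstructed from the error message rather than copied from example.cfg. In Python 3, the inline_comment_prefixes argument tells the parser to strip such comments.

```python
import configparser

# Reconstructed from the error message; the real example.cfg may differ.
CONFIG_TEXT = """
[Misc]
logging = 20 # Display info messages
"""

# Default parser: the inline comment stays in the value, so getint() fails.
strict = configparser.ConfigParser()
strict.read_string(CONFIG_TEXT)
try:
    strict.getint("Misc", "logging")
except ValueError as error:
    print("default parser:", error)

# Parser told to treat '#' and ';' as inline comment markers.
lenient = configparser.ConfigParser(inline_comment_prefixes=("#", ";"))
lenient.read_string(CONFIG_TEXT)
print("with inline comments stripped:", lenient.getint("Misc", "logging"))
```

A workaround that does not depend on the parser version is to move the comment onto its own line in the config file, so the option line reads only "logging = 20".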

Add sync() support for MatrixState

Add client JavaScript code to write states back to the server (via cgi-bin/setStates.py). Changes involve defining the MatrixState.prototype.sync function (and overriding the default Backbone implementation).

To be completed after issue #6.

Use sparse matrix representation throughout the entire data processing pipeline.

Right now, there are two ways to load a term-topic probability matrix: by calling either MatrixState.importMatrix() or MatrixState.importEntries().

The former loads a 2D array of numbers and the latter loads a 1D array of non-zero entries. For matrices with a large number of zeros (as in our case), the latter is the more efficient representation. However, the client code currently passes the matrix from MatrixState to MatrixModel using a full matrix representation, nullifying any advantage we get from loading a sparse matrix.
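
To illustrate the difference between the two representations, here is a small Python sketch. The field names rowIndex, columnIndex, and value are assumptions, not the format used by importEntries(), and the functions are hypothetical.

```python
def to_entries(matrix):
    """Convert a dense 2D list into a 1D list of non-zero entries."""
    return [
        {"rowIndex": r, "columnIndex": c, "value": value}
        for r, row in enumerate(matrix)
        for c, value in enumerate(row)
        if value != 0
    ]


def to_matrix(entries, num_rows, num_columns):
    """Expand a list of non-zero entries back into a dense 2D list."""
    matrix = [[0.0] * num_columns for _ in range(num_rows)]
    for entry in entries:
        matrix[entry["rowIndex"]][entry["columnIndex"]] = entry["value"]
    return matrix


dense = [
    [0.0, 0.7, 0.0],
    [0.0, 0.0, 0.0],
    [0.2, 0.0, 0.0],
]
entries = to_entries(dense)
print(len(entries), "entries instead of", len(dense) * len(dense[0]), "cells")
```

The saving only holds if every stage keeps the entry list; calling something like to_matrix() between stages reintroduces the full cost, which is the situation the issue describes.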
