williamleif / socialsent
Code and data for inducing domain-specific sentiment lexicons.
License: Apache License 2.0
acc, auc, avg_prec = binary_metrics(polarities, lexicon, eval_words)
It seems that the best result (acc, auc, avg_prec) is obtained with the help of the baseline domain-specific lexicon (lexicon).
I am confused about how this project induces a domain-specific lexicon without the help of an existing domain-specific lexicon. How are scores assigned to the sentiment words? Is polarities["good"] the positive score of the word "good"? If so, what is the negative score of the word "good"?
Thank you for any feedback.
I think it is just the print statements that need to be changed to the future syntax.
Is the "twitter.json" file under the path "socialsent/socialsent/data/lexicons/" a public lexicon or lexicon constructed by your approach?
Hi @williamleif,
I really like that you have provided a reference for each lexicon induction method. Can you please let me know where you adopted the dist function from?
def dist(embeds, positive_seeds, negative_seeds, **kwargs):
    polarities = {}
    sim_mat = similarity_matrix(embeds, **kwargs)
    for i, w in enumerate(embeds.iw):
        if w not in positive_seeds and w not in negative_seeds:
            pol = sum(sim_mat[embeds.wi[p_seed], i] for p_seed in positive_seeds)
            pol -= sum(sim_mat[embeds.wi[n_seed], i] for n_seed in negative_seeds)
            polarities[w] = pol
    return polarities
Regards
Nader
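For readers following the dist function quoted above, here is a minimal, self-contained sketch of the same scoring idea on toy data. It assumes that similarity_matrix returns cosine similarities, which the excerpt does not show; toy_dist and toy_vectors are hypothetical names used only for illustration and are not part of the repository.

```python
import numpy as np

def toy_dist(vectors, positive_seeds, negative_seeds):
    """Score each non-seed word as (sum of similarities to positive seeds)
    minus (sum of similarities to negative seeds).

    Assumption: cosine similarity stands in for similarity_matrix.
    """
    words = list(vectors)
    index = {w: i for i, w in enumerate(words)}
    mat = np.array([vectors[w] for w in words], dtype=float)
    mat /= np.linalg.norm(mat, axis=1, keepdims=True)  # unit-length rows
    sim = mat.dot(mat.T)                               # cosine similarities
    polarities = {}
    for w in words:
        if w in positive_seeds or w in negative_seeds:
            continue  # seeds themselves are not scored
        i = index[w]
        pol = sum(sim[index[p], i] for p in positive_seeds)
        pol -= sum(sim[index[n], i] for n in negative_seeds)
        polarities[w] = float(pol)
    return polarities

# Hypothetical two-dimensional "embeddings", for illustration only
toy_vectors = {
    "good": [1.0, 0.1],
    "bad": [0.1, 1.0],
    "great": [0.9, 0.2],
    "awful": [0.2, 0.9],
}
toy_polarities = toy_dist(toy_vectors, ["good"], ["bad"])
```

In this sketch each word receives a single signed score rather than separate positive and negative scores: a positive value means the word is closer to the positive seeds, a negative value means it is closer to the negative seeds. Under the cosine assumption, the magnitude of a score cannot exceed the total number of seeds.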
First of all, thank you @williamleif for releasing this code. It will be tremendously helpful in my research.
This is not really an issue as much as a question.
What are the minimum and maximum values of the output lexicons? They appear to be somewhere around +/- 4 or 5, but I can't find them stated anywhere in the original paper.
If they are in the code base somewhere, I haven't found them either.
Thanks in advance for the quick help.
Hello! I'm working with your code to build a sentiment scorer for The_Donald as part of a weekend project. I'm working through the code base right now and am finding that the [SUBREDDIT_NAME]-dict.pkl file is necessary, but I don't see anything that clearly shows how this file is generated. What is in this file, and how do I go about generating it?
Thanks!
The file d-starts with cd data, but the data folder does not exist.
Might it be due to my installation, or something else?
I get the following warning and exception while running $ python evaluate_methods.py twitter
WARNING:
/usr/lib64/python2.7/site-packages/sklearn/metrics/classification.py:1074: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 in labels with no predicted samples.
'precision', 'predicted', average, warn_for)
EXCEPTION:
Traceback (most recent call last):
  File "evaluate_methods.py", line 625, in <module>
    evaluate_twitter_methods()
  File "evaluate_methods.py", line 461, in evaluate_twitter_methods
    **DEFAULT_ARGUMENTS)
  File "evaluate_methods.py", line 494, in run_method
    return method(embeddings, positive_seeds, negative_seeds, **kwargs)
  File "/home/michel/bin/socialsent/polarity_induction_methods.py", line 224, in bootstrap
    polarities_list = pool.map(map_func, range(num_boots))
  File "/usr/lib64/python2.7/multiprocessing/pool.py", line 251, in map
    return self.map_async(func, iterable, chunksize).get()
  File "/usr/lib64/python2.7/multiprocessing/pool.py", line 567, in get
Learning embedding transformation
    raise self._value
Exception: Error when checking model target: expected y to have shape (None, 400) but got array with shape (36, 1)
Thank you for any feedback.
Hi William,
I have some doubts about how to use SVDEmbedding. If I'm not mistaken, the paper reports that this method significantly outperformed word2vec and GloVe in preliminary experiments with the domain-specific data, so I'd like to use it to build my word embeddings.
When I modify example.py to use:
embeddings = create_representation("SVD", "data/example_embeddings/glove.6B.100d.txt")
I receive this message:
IOError: [Errno 2] No such file or directory: 'socialsent/data/word_embeddings/glove.6B.100d.txt-u.npy'
Checking the method, I see that I need three files:
ut = np.load(path + '-u.npy')
s = np.load(path + '-s.npy')
vocabfile = path + '-vocab.pkl'
What does each file contain? Sorry if the question is simple or obvious, but I'm new to word embeddings.
Thank you so much!
Lea
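As a rough illustration of the three files named in the question above, the sketch below writes a "-u.npy", "-s.npy" and "-vocab.pkl" triple from a tiny word-by-word matrix using a truncated SVD. Everything beyond the three file suffixes is an assumption: that "-u.npy" holds one row of left singular vectors per word, that "-s.npy" holds the singular values, that "-vocab.pkl" is a pickled list of words in row order, and that the input is a co-occurrence (or PPMI) matrix. The orientation the real loader expects is not visible in the excerpt, and write_svd_files is a hypothetical helper, not a function from the repository.

```python
import os
import pickle
import tempfile

import numpy as np

def write_svd_files(cooc, vocab, path, dim):
    """Write path-u.npy, path-s.npy and path-vocab.pkl from a square matrix.

    Assumptions: rows of u correspond to words in `vocab`, and only the
    top `dim` singular directions are kept.
    """
    u, s, _ = np.linalg.svd(np.asarray(cooc, dtype=float), full_matrices=False)
    np.save(path + "-u.npy", u[:, :dim])   # one row per word
    np.save(path + "-s.npy", s[:dim])      # singular values, largest first
    with open(path + "-vocab.pkl", "wb") as f:
        pickle.dump(list(vocab), f)

# Toy data, for illustration only
toy_vocab = ["good", "bad", "movie"]
toy_cooc = [[0, 1, 4],
            [1, 0, 3],
            [4, 3, 0]]
prefix = os.path.join(tempfile.mkdtemp(), "toy")
write_svd_files(toy_cooc, toy_vocab, prefix, dim=2)

# Reading the files back mirrors the np.load calls quoted in the question
ut = np.load(prefix + "-u.npy")
s = np.load(prefix + "-s.npy")
```

Note that a GloVe text file such as glove.6B.100d.txt is already a dense embedding, so it would not come with these three companion files; they would have to be produced from corpus statistics first.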
I've updated the files to support the newest distribution of Keras. How can I get access to create a PR?
Hi again William,
I appreciate your quick responses. Regarding the lexicons, I got the following error:
In [1]: %run example.py
Using TensorFlow backend.
Evaluting SentProp with 100 dimensional GloVe embeddings
Evaluting only binary classification performance on General Inquirer lexicon
---------------------------------------------------------------------------
IOError Traceback (most recent call last)
/socialsent-master/example.py in <module>()
8 print "Evaluting SentProp with 100 dimensional GloVe embeddings"
9 print "Evaluting only binary classification performance on General Inquirer lexicon"
---> 10 lexicon = lexicons.load_lexicon("inquirer", remove_neutral=True)
11 pos_seeds, neg_seeds = seeds.hist_seeds()
12 embeddings = create_representation("GIGA", "data/example_embeddings/glove.6B.100d.txt",
/socialsent-master/socialsent/lexicons.pyc in load_lexicon(name, remove_neutral)
163
164 def load_lexicon(name=constants.LEXICON, remove_neutral=True):
--> 165 lexicon = util.load_json(constants.PROCESSED_LEXICONS + name + '.json')
166 return {w: p for w, p in lexicon.iteritems() if p != 0} if remove_neutral else lexicon
167
/socialsent-master/socialsent/util.pyc in load_json(fname)
32
33 def load_json(fname):
---> 34 with open(fname) as f:
35 return json.loads(f.read())
36
IOError: [Errno 2] No such file or directory: '/afs/cs.stanford.edu/u/wleif/sentiment/polarity_induction/data/lexicons/inquirer.json'
I know that I have to adjust the paths in the constants.py file. I just wanted to let you know about the bug.
Regards,
Nader
Sorry, it's not quite clear to me which lexicon you used for evaluation on Twitter (Table 2b in the paper). Was it socialsent/socialsent/data/lexicons/twitter.json, or was it something different?
Hi William,
Thanks a lot for the code. I was trying to run example.py to see how the code works. However, I got the following error:
In [1]: %run example.py
Using TensorFlow backend.
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
/Desktop/socialsent-master/example.py in <module>()
2 from socialsent import lexicons
3 from socialsent.polarity_induction_methods import random_walk
----> 4 from socialsent.evaluate_methods import binary_metrics
5 from socialsent.representations.representation_factory import create_representation
6
/Desktop/socialsent-master/socialsent/evaluate_methods.py in <module>()
474
475 def run_method(positive_seeds, negative_seeds, embeddings, transform_embeddings=False, post_densify=False,
--> 476 method=polarity_induction_methods.linear, **kwargs):
477 if transform_embeddings:
478 print "Transforming embeddings..."
AttributeError: 'module' object has no attribute 'linear'
Can you please let me know how to fix this? Moreover, the code is not Python 3 friendly because of the print statements used throughout. When I run the code in Python 3, it gives me a syntax error because print is called without parentheses.
Regards,
Nader
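To illustrate the print issue mentioned above: a print call with parentheses and a single argument behaves the same under Python 2 and Python 3, and on Python 2.6+ adding "from __future__ import print_function" as the first statement of a module makes print a real function there as well. The sketch below shows the portable form; format_metrics and its wording are hypothetical and not taken from the repository.

```python
def format_metrics(acc, auc, avg_prec):
    """Build a one-line report; the wording is illustrative only."""
    return "acc: %.2f, auc: %.2f, avg_prec: %.2f" % (acc, auc, avg_prec)

# Python 2 only:    print "Evaluating ..."
# Python 2 and 3:   print("Evaluating ...")
print(format_metrics(0.5, 0.75, 1.0))
```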
Hi,
When I try to run the example.py file I run into this error.
from keras.models import Graph
ImportError: cannot import name Graph
I heard that Keras removed Graph. Is there any alternative that would make this work?
The paper claims that the transition matrix T=D^0.5*E*D^0.5 is symmetric, and socialsent.graph_construction.transition_matrix takes a sym parameter that is supposed to make the transition matrix symmetric.
That is, however, not the case (mathematically, the above equation does not make a non-symmetric matrix symmetric). This can be demonstrated trivially, e.g.:
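import numpy as np
from collections import namedtuple
from socialsent.graph_construction import transition_matrix
# Toy vectors, normalized so that each row has unit length
v = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
v_norm = np.divide(v.T, np.linalg.norm(v, axis=1)).T
# Minimal stand-in for an embedding object: only the matrix attribute m
E = namedtuple('Embedding', ['m'])
embeddings = E(m=v_norm)
M = transition_matrix(embeddings, nn=1, sym=True, arccos=True)
print(M)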
[[ 0. 0.97276408 0. ]
[ 0. 0. 1. ]
[ 0. 1. 0. ]]
What am I missing?
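The mathematical point in this report can be checked independently of the library. The sketch below assumes D is the diagonal matrix of row sums of the edge matrix E and applies a two-sided inverse-square-root scaling; whether that matches the exponent and degree definition used in transition_matrix is an assumption. The max(E, E^T) step is one common way to symmetrize a nearest-neighbour graph and is named here as a swapped-in technique, not as the repository's method.

```python
import numpy as np

def normalize(E):
    """Return D^{-1/2} E D^{-1/2}, with D the diagonal matrix of row sums of E."""
    d = E.sum(axis=1)
    d_inv_sqrt = np.zeros_like(d, dtype=float)
    nonzero = d > 0
    d_inv_sqrt[nonzero] = 1.0 / np.sqrt(d[nonzero])
    # Scale row i by d_i^{-1/2} and column j by d_j^{-1/2}
    return d_inv_sqrt[:, None] * E * d_inv_sqrt[None, :]

# A directed 1-nearest-neighbour graph: the edge 0 -> 1 has no reverse edge
E = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])

T = normalize(E)                     # diagonal scaling preserves the asymmetry
E_sym = np.maximum(E, E.T)           # keep an edge if it exists in either direction
T_sym = normalize(E_sym)             # symmetric input gives symmetric output

is_T_symmetric = np.allclose(T, T.T)
is_T_sym_symmetric = np.allclose(T_sym, T_sym.T)
```

Entry (i, j) of the normalized matrix is E[i, j] divided by sqrt(d_i * d_j), so the result is symmetric exactly when E[i, j] equals E[j, i] wherever both degrees are nonzero. The scaling alone cannot repair an asymmetric neighbour graph.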
Ideally, a user should be able to pip install and then run a command to download some default word embeddings. The pip install solution is really awkward right now because the user needs to modify constants.py to point to downloaded word embeddings.
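One possible way to avoid editing constants.py by hand, sketched under stated assumptions: resolve the data directory from an environment variable and fall back to a directory next to the package file. The variable name SOCIALSENT_DATA_DIR, the helper names data_dir and lexicon_path, and the "data/lexicons" layout are all hypothetical and are not part of the repository.

```python
import os

def data_dir(package_file, env_var="SOCIALSENT_DATA_DIR"):
    """Return the data directory: an environment override if set,
    otherwise a 'data' folder next to the given package file."""
    override = os.environ.get(env_var)
    if override:
        return os.path.abspath(override)
    return os.path.join(os.path.dirname(os.path.abspath(package_file)), "data")

def lexicon_path(name, package_file, env_var="SOCIALSENT_DATA_DIR"):
    """Build the path of a lexicon JSON file under the resolved data directory."""
    return os.path.join(data_dir(package_file, env_var), "lexicons", name + ".json")
```

Inside a module, package_file would typically be __file__, so the default location moves with the installed package instead of pointing at one developer's home directory.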
Hi Will,
I'm back with some memory issues. My experience so far is that SocialSent runs into memory problems once you reach a threshold of roughly 7000 words to score. So I ran it on a distributed architecture (SHARCNET) with 38000 words to score and asked for 16G of memory, yet it very soon ran out of memory again:
...
Using Theano backend.
/opt/sharcnet/python/2.7.8/intel/lib/python2.7/site-packages/scipy/lib/_util.py:35: DeprecationWarning: Module scipy.linalg.blas.fblas is deprecated, use scipy.linalg.blas instead
DeprecationWarning)
Evaluating SentProp with 100 dimensional GloVe embeddings
Evaluating binary and continuous classification performance
LEXICON
SEEDS
EMBEDDINGS
EVAL_WORDS
Traceback (most recent call last):
  File "concreteness.py", line 95, in <module>
    sym=True, arccos=True)
  File "/home/genereum/socialsent-master/polarity_induction_methods.py", line 99, in random_walk
    M = transition_matrix(embeddings, **kwargs)
  File "/home/genereum/socialsent-master/graph_construction.py", line 62, in transition_matrix
    return Dinv.dot(L).dot(Dinv)
MemoryError
--- SharcNET Job Epilogue ---
job id: 12138822
exit status: 1
cpu time: 313s / 12.0h (0 %)
elapsed time: 479s / 12.0h (1 %)
virtual memory: 11.9G / 16.0G (74 %)
Job returned with status 1.
WARNING: Job only used 1 % of its requested walltime.
WARNING: Job only used 0 % of its requested cpu time.
WARNING: Job only used 65 % of allocated cpu time.
WARNING: Job only used 74% of its requested memory.
...
One solution would be to run it 7000 words at a time. But maybe you know a way to increase the memory the program can use?
Thanks,
Michel
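A back-of-the-envelope estimate may help explain the numbers in this report. The sketch assumes the transition matrix is stored as a dense n-by-n array of 8-byte floats, which the traceback suggests but does not confirm; dense_matrix_gib is a hypothetical helper.

```python
def dense_matrix_gib(n_words, bytes_per_entry=8):
    """Memory needed for one dense n_words x n_words matrix, in GiB."""
    return n_words * n_words * bytes_per_entry / float(2 ** 30)

full_run_gib = dense_matrix_gib(38000)   # the 38000-word job
batch_gib = dense_matrix_gib(7000)       # a 7000-word batch
```

Under that assumption a single 38000-word matrix needs between 10 and 11 GiB, and an expression such as Dinv.dot(L).dot(Dinv) may hold more than one matrix of that size at once, which would be consistent with a 16G job failing. A 7000-word matrix needs well under 1 GiB.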
Hi there,
I just found an issue when creating the embeddings. The details are below.
  File "example.py", line 13, in <module>
    set(lexicon.keys()).union(pos_seeds).union(neg_seeds))
  File "/anaconda3/lib/python3.6/site-packages/socialsent-0.1.2-py3.6.egg/socialsent/representations/representation_factory.py", line 10, in create_representation
    return GigaEmbedding(path, *args, **kwargs)
  File "/anaconda3/lib/python3.6/site-packages/socialsent-0.1.2-py3.6.egg/socialsent/representations/embedding.py", line 120, in __init__
    for line in lines(path):
  File "/anaconda3/lib/python3.6/site-packages/socialsent-0.1.2-py3.6.egg/socialsent/util.py", line 50, in lines
    with open(fname) as f:
FileNotFoundError: [Errno 2] No such file or directory: 'data/example_embeddings/glove.6B.100d.txt'
Kind regards
Dear William,
I’d like to use your code to induce a sentiment lexicon from a new corpus. In your answer to the issue #8, you wrote that the first step is to “Use representations/cooccurgen.py to process a corpus and construct co-occurrence matrices.”
Looking at cooccurgen.py, it seems that it takes as input a corpus in the COHA word_lemma_pos format and that it also needs a file called index.pkl.
Thanks a lot in advance!
Best,
Rachele
Hi William,
First of all thanks for the great work and codes!
I ran into a compatibility problem using the densifier part and managed to fix it. It would be good to check with you whether it's a general case, thanks!
In example.py, I tried to call the densifier directly with
densify(embeddings, pos_seeds, neg_seeds)
which then raises
  File "/mounts/Users/student/mzhao/mytools/temp/socialsent/socialsent/embedding_transformer.py", line 146, in <lambda>
    model.add_node(Lambda(lambda x: K.reshape(K.sqrt(K.sum(x * x, axis=1)), (x.shape[0], 1))),
  File "/mounts/Users/student/mzhao/env/lib/python2.7/site-packages/keras/backend/tensorflow_backend.py", line 271, in reshape
    return tf.reshape(x, shape)
  File "/mounts/Users/student/mzhao/env/lib/python2.7/site-packages/tensorflow/python/ops/gen_array_ops.py", line 2630, in reshape
    name=name)
  File "/mounts/Users/student/mzhao/env/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 494, in apply_op
    raise err
TypeError: Expected binary or unicode string, got Dimension(None)
which is essentially raised from
https://github.com/williamleif/socialsent/blob/master/socialsent/embedding_transformer.py#L146
when trying to get the tensor shape, and can easily be fixed by using K.shape(x)[0] instead.
I am wondering if you have run into this problem?
My environment:
Keras==0.3.3
tensorflow==1.8.0 & tensorflow==1.0.0 (no gpu)
OS: openSUSE Leap 15.0
Thanks!