williamleif / socialsent

Code and data for inducing domain-specific sentiment lexicons.

License: Apache License 2.0

Python 99.35% Shell 0.65%

socialsent's Issues

How does this project induce a sentiment lexicon without the help of an existing domain-specific lexicon?

acc, auc, avg_prec = binary_metrics(polarities, lexicon, eval_words)
It seems to get the best result (acc, auc, avg_prec) with the help of the baseline domain-specific lexicon (lexicon).
I am confused about how this project induces a domain-specific lexicon without the help of an existing domain-specific lexicon. How are sentiment scores assigned to words? Is polarities["good"] the positive score of the word "good"? And what is the negative score of the word "good"?

Thank you for any feedback.
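
A minimal sketch of the distinction this question turns on, under assumptions. Judging from the call above and from the example.py excerpt quoted in a later issue (which loads a lexicon and, separately, seed words), the lexicon passed to binary_metrics looks like a gold standard used for scoring, not an input to induction. The function sign_accuracy and the toy numbers below are hypothetical and are not the library's binary_metrics. Each word carries one signed score, as in the dist function quoted further down, so there is no separate negative score.

```python
def sign_accuracy(polarities, lexicon, eval_words):
    """Fraction of eval_words whose induced score has the same sign as the gold label.

    polarities: dict word -> induced real-valued score (toy values here).
    lexicon:    dict word -> gold label, +1 or -1 (used only for evaluation).
    """
    scored = [w for w in eval_words if w in polarities and w in lexicon]
    if not scored:
        return 0.0
    correct = sum(1 for w in scored if (polarities[w] > 0) == (lexicon[w] > 0))
    return correct / float(len(scored))


# Hypothetical induced scores: one signed number per word.
polarities = {"good": 1.7, "awful": -2.1, "fine": 0.4, "dull": 0.3}
# Gold labels are consulted only here, at evaluation time.
lexicon = {"good": 1, "awful": -1, "fine": 1, "dull": -1}

accuracy = sign_accuracy(polarities, lexicon, list(lexicon))
print(accuracy)
```

In this toy case "dull" gets the wrong sign, so three of the four words are counted as correct.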

Python 3 support

I think it is just the print statements that need to be changed to the future syntax.
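
A small sketch of the change being suggested, not a patch against the repository. A call to print() with a single, pre-formatted string behaves the same on Python 2 and Python 3; printing several comma-separated values on Python 2 additionally needs "from __future__ import print_function" as the first statement of the module. The function name report is hypothetical.

```python
def report(method_name, accuracy):
    """Format one evaluation line, print it, and return it for reuse."""
    # A single parenthesized string prints identically on Python 2 and 3.
    line = "Evaluating {0} (accuracy: {1:.2f})".format(method_name, accuracy)
    print(line)
    return line


message = report("SentProp", 0.75)
```

Print statements may not be the only change required: a traceback quoted in a later issue shows lexicon.iteritems(), and dict.iteritems() does not exist in Python 3.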

one question about lexicons

Is the "twitter.json" file under the path "socialsent/socialsent/data/lexicons/" a public lexicon or a lexicon constructed by your approach?

Reference for a lexicon induction method

Hi @williamleif,

I really like that you have provided a reference for each lexicon induction method. Can you please let me know where you adopted the dist function from?

def dist(embeds, positive_seeds, negative_seeds, **kwargs):
    polarities = {}
    sim_mat = similarity_matrix(embeds, **kwargs)
    for i, w in enumerate(embeds.iw):
        if w not in positive_seeds and w not in negative_seeds:
            pol = sum(sim_mat[embeds.wi[p_seed], i] for p_seed in positive_seeds)
            pol -= sum(sim_mat[embeds.wi[n_seed], i] for n_seed in negative_seeds)
            polarities[w] = pol
    return polarities

Regards
Nader
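
For readers following along, here is a self-contained toy version of the computation in the quoted function. It does not answer where the method comes from. It assumes that similarity_matrix yields cosine similarities, which is not confirmed here; toy_dist, the vocabulary, and the vectors are all made up for illustration.

```python
import numpy as np


def toy_dist(vectors, vocab, positive_seeds, negative_seeds):
    """Score each non-seed word by its summed similarity to the positive seeds
    minus its summed similarity to the negative seeds (cosine assumed)."""
    index = {w: i for i, w in enumerate(vocab)}
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sim = unit.dot(unit.T)  # dense cosine similarity matrix
    polarities = {}
    for i, w in enumerate(vocab):
        if w in positive_seeds or w in negative_seeds:
            continue  # seeds themselves are not scored
        pol = sum(sim[index[p], i] for p in positive_seeds)
        pol -= sum(sim[index[n], i] for n in negative_seeds)
        polarities[w] = float(pol)
    return polarities


vocab = ["good", "bad", "great", "terrible"]
vectors = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]])
scores = toy_dist(vectors, vocab, ["good"], ["bad"])
print(scores)
```

With these vectors "great" lies close to the positive seed and receives a positive score, while "terrible" receives the mirror-image negative score.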

`min` and `max` values in output lexicons

First of all, thank you @williamleif for releasing this code. It will be tremendously helpful in my research.

This is not really an issue as much as a question.

What are the minimum and maximum values of the output lexicons? It appears that they are somewhere around +/- 4 or 5, but I can't find this stated anywhere in the original paper.

If it's in the code base somewhere I haven't found it either.

Thanks in advance for the quick help.

What is in the {}-dict.pkl for subreddits?

Hello! I'm working with your code on a sentiment scorer for The_Donald as part of a weekend project. Going through the code base, I'm finding that the [SUBREDDIT_NAME]-dict.pkl file is necessary, but I don't see anything that clearly shows how this file is generated. What is in this file? How do I go about generating it?

Thanks!

in example.sh

The file starts with cd data, but the data folder does not exist.
Might this be due to my installation, or something else?

errors when running "evaluate_methods.py"

I get the following warning and exception while running $ python evaluate_methods.py twitter

WARNING:
/usr/lib64/python2.7/site-packages/sklearn/metrics/classification.py:1074: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 in labels with no predicted samples.
'precision', 'predicted', average, warn_for)

EXCEPTION:
Traceback (most recent call last):
File "evaluate_methods.py", line 625, in <module>
evaluate_twitter_methods()
File "evaluate_methods.py", line 461, in evaluate_twitter_methods
**DEFAULT_ARGUMENTS)
File "evaluate_methods.py", line 494, in run_method
return method(embeddings, positive_seeds, negative_seeds, **kwargs)
File "/home/michel/bin/socialsent/polarity_induction_methods.py", line 224, in bootstrap
polarities_list = pool.map(map_func, range(num_boots))
File "/usr/lib64/python2.7/multiprocessing/pool.py", line 251, in map
return self.map_async(func, iterable, chunksize).get()
File "/usr/lib64/python2.7/multiprocessing/pool.py", line 567, in get
Learning embedding transformation
raise self._value
Exception: Error when checking model target: expected y to have shape (None, 400) but got array with shape (36, 1)

Thank you for any feedback.

How to use SVDEmbedding?

Hi William,

I have some questions about how to use SVDEmbedding. If I'm not mistaken, the paper reports that this method significantly outperformed word2vec and GloVe in preliminary experiments with the domain-specific data. So I'd like to use it to build my word embeddings.

When I modify the example.py and put:
embeddings = create_representation("SVD", "data/example_embeddings/glove.6B.100d.txt")
I receive this message:
IOError: [Errno 2] No such file or directory: 'socialsent/data/word_embeddings/glove.6B.100d.txt-u.npy'

Checking the method, I see that I need three files:
ut = np.load(path + '-u.npy')
s = np.load(path + '-s.npy')
vocabfile = path + '-vocab.pkl'

What does each file contain? Sorry if the question is simple or obvious, but I'm new to word embeddings.

Thank you so much!
Lea
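
A hedged sketch of one way three files with these suffixes could be produced. It rests on assumptions that the issue does not confirm: that -u.npy holds the left singular vectors of a word-by-context matrix (possibly stored transposed, given the variable name ut), that -s.npy holds the singular values, and that -vocab.pkl is a pickled word list. Note that a GloVe text file is already a dense embedding, so no such files exist next to it, which matches the IOError above.

```python
import os
import pickle
import tempfile

import numpy as np

# Toy symmetric co-occurrence counts for a three-word vocabulary (made up).
vocab = ["good", "bad", "movie"]
counts = np.array([[0.0, 1.0, 4.0],
                   [1.0, 0.0, 3.0],
                   [4.0, 3.0, 0.0]])

# Factorise the matrix: u has one row per word, s holds the singular values.
u, s, _ = np.linalg.svd(counts)

# Write the three files under a common prefix, mirroring the suffixes above.
prefix = os.path.join(tempfile.mkdtemp(), "toy")
np.save(prefix + "-u.npy", u)
np.save(prefix + "-s.npy", s)
with open(prefix + "-vocab.pkl", "wb") as f:
    pickle.dump(vocab, f)

# Read them back the same way the quoted loader lines do.
ut = np.load(prefix + "-u.npy")
s_loaded = np.load(prefix + "-s.npy")
with open(prefix + "-vocab.pkl", "rb") as f:
    vocab_loaded = pickle.load(f)
```

The exact layout the library expects (orientation of u, truncation to a fixed number of dimensions, the type stored in the pickle) would need to be checked against the SVDEmbedding loader itself.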

Updated Keras

I've updated the files to support the newest distribution of Keras. How can I get access to create a PR?

IOError of Lexicons

Hi again William,

I appreciate your quick responses. Regarding the lexicons, I got the following error:

In [1]: %run example.py
Using TensorFlow backend.
Evaluting SentProp with 100 dimensional GloVe embeddings
Evaluting only binary classification performance on General Inquirer lexicon
---------------------------------------------------------------------------
IOError                                   Traceback (most recent call last)
/socialsent-master/example.py in <module>()
      8     print "Evaluting SentProp with 100 dimensional GloVe embeddings"
      9     print "Evaluting only binary classification performance on General Inquirer lexicon"
---> 10     lexicon = lexicons.load_lexicon("inquirer", remove_neutral=True)
     11     pos_seeds, neg_seeds = seeds.hist_seeds()
     12     embeddings = create_representation("GIGA", "data/example_embeddings/glove.6B.100d.txt",

/socialsent-master/socialsent/lexicons.pyc in load_lexicon(name, remove_neutral)
    163 
    164 def load_lexicon(name=constants.LEXICON, remove_neutral=True):
--> 165     lexicon = util.load_json(constants.PROCESSED_LEXICONS + name + '.json')
    166     return {w: p for w, p in lexicon.iteritems() if p != 0} if remove_neutral else lexicon
    167 

/socialsent-master/socialsent/util.pyc in load_json(fname)
     32 
     33 def load_json(fname):
---> 34     with open(fname) as f:
     35         return json.loads(f.read())
     36 

IOError: [Errno 2] No such file or directory: '/afs/cs.stanford.edu/u/wleif/sentiment/polarity_induction/data/lexicons/inquirer.json'

I know that I have to adjust the paths in the constants.py file. I just wanted to let you know about the bug.

Regards,
Nader
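
One way to avoid a machine-specific absolute path like the one in the error is to build data paths relative to the package directory. This is a sketch only: lexicon_path is a hypothetical helper, not something in constants.py, and the data/lexicons layout is taken from the path mentioned in an earlier issue.

```python
import os


def lexicon_path(name, base_dir=None):
    """Return <base_dir>/data/lexicons/<name>.json.

    base_dir defaults to the directory containing this file, so the result
    follows the checkout instead of one user's home directory.
    """
    if base_dir is None:
        base_dir = os.path.dirname(os.path.abspath(__file__))
    return os.path.join(base_dir, "data", "lexicons", name + ".json")


# Explicit base_dir used here so the example does not depend on __file__.
path = lexicon_path("inquirer", base_dir=os.path.join("socialsent", "socialsent"))
print(path)
```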

Lexicon Documentation

Sorry, it's not quite clear to me which lexicon you used for evaluation on Twitter (Table 2b in the paper). Was it socialsent/socialsent/data/lexicons/twitter.json or was it something different?

AttributeError: 'module' object has no attribute 'linear'

Hi William,

Thanks a lot for the code. I was trying to run example.py to see how the code works. However, I got the following error:

In [1]: %run example.py
Using TensorFlow backend.
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
/Desktop/socialsent-master/example.py in <module>()
      2 from socialsent import lexicons
      3 from socialsent.polarity_induction_methods import random_walk
----> 4 from socialsent.evaluate_methods import binary_metrics
      5 from socialsent.representations.representation_factory import create_representation
      6 

/Desktop/socialsent-master/socialsent/evaluate_methods.py in <module>()
    474 
    475 def run_method(positive_seeds, negative_seeds, embeddings, transform_embeddings=False, post_densify=False,
--> 476         method=polarity_induction_methods.linear, **kwargs):
    477     if transform_embeddings:
    478         print "Transforming embeddings..."

AttributeError: 'module' object has no attribute 'linear'

Can you please let me know how to fix this? Moreover, the code is not Python 3 friendly because of the print statements used throughout. When I run the code in Python 3, it raises a syntax error because print is called without parentheses.

Regards,
Nader

Fix Keras >0.3 compatibility

Hi,
When I try to run the example.py file, I run into this error:
from keras.models import Graph
ImportError: cannot import name Graph
I heard that Keras removed the Graph model. Is there an alternative that makes this work?

Symmetric transition matrix

The paper claims that the transition matrix T=D^0.5*E*D^0.5 is symmetric, and socialsent.graph_construction.transition_matrix takes a sym parameter that is supposed to make the transition matrix symmetric.

That is however not the case (mathematically, the above equation does not make a non-symmetric matrix symmetric). This can be trivially demonstrated by e.g.:

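    from collections import namedtuple

    import numpy as np
    from socialsent.graph_construction import transition_matrix

    # Three row-normalized toy vectors standing in for word embeddings.
    v = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
    v_norm = np.divide(v.T, np.linalg.norm(v, axis=1)).T
    # Minimal stand-in object with only an m attribute.
    E = namedtuple('Embedding', ['m'])
    embeddings = E(m=v_norm)
    M = transition_matrix(embeddings, nn=1, sym=True, arccos=True)
    print(M)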

[[ 0.          0.97276408  0.        ]
 [ 0.          0.          1.        ]
 [ 0.          1.          0.        ]]

What am I missing?
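
The algebra behind this observation can be checked without the library. For any diagonal matrix S, the transpose of S E S is S E^T S, so the scaled matrix is symmetric exactly when E is (given nonzero diagonal entries). The NumPy sketch below uses a made-up directed 1-nearest-neighbour graph and a hypothetical helper named scale; it illustrates the point and one possible remedy, and says nothing about what transition_matrix is intended to do.

```python
import numpy as np

# Directed 1-nearest-neighbour graph on three points: each row has one outgoing edge.
E = np.array([[0.0, 0.9, 0.0],
              [0.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])


def scale(E):
    """Return S E S with S = diag(1 / sqrt(row sums)), a two-sided diagonal scaling."""
    d = E.sum(axis=1)
    s = 1.0 / np.sqrt(d)
    return E * s[:, None] * s[None, :]


T_directed = scale(E)
directed_is_symmetric = np.allclose(T_directed, T_directed.T)

# Make the edge set undirected first: keep an edge if either endpoint chose it.
E_sym = np.maximum(E, E.T)
T_undirected = scale(E_sym)
undirected_is_symmetric = np.allclose(T_undirected, T_undirected.T)

print(directed_is_symmetric, undirected_is_symmetric)
```

Symmetrizing the edge weights before scaling (here with an element-wise maximum) is one common choice; averaging E and its transpose is another.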

Clean solution to download default word embeddings.

Ideally, a user should be able to pip install and then run a command to download some default word embeddings. The pip install solution is really awkward right now because the user needs to modify constants.py to point to downloaded word embeddings.

Memory issues for network construction (i.e. nearest neighbor computation)

Hi Will,

I'm back with some memory issues. My experience so far is that SocialSent runs into memory problems once you reach a threshold of roughly 7000 words to score. So I ran it on a distributed architecture (SHARCNET) with 38000 words to score and asked for 16 GB of memory, yet it very soon runs out of memory again:

...
Using Theano backend.
/opt/sharcnet/python/2.7.8/intel/lib/python2.7/site-packages/scipy/lib/_util.py:35: DeprecationWarning: Module scipy.linalg.blas.fblas is deprecated, use scipy.linalg.blas instead
DeprecationWarning)
Evaluating SentProp with 100 dimensional GloVe embeddings
Evaluating binary and continuous classification performance
LEXICON
SEEDS
EMBEDDINGS
EVAL_WORDS
Traceback (most recent call last):
File "concreteness.py", line 95, in <module>
sym=True, arccos=True)
File "/home/genereum/socialsent-master/polarity_induction_methods.py", line 99, in random_walk
M = transition_matrix(embeddings, **kwargs)
File "/home/genereum/socialsent-master/graph_construction.py", line 62, in transition_matrix
return Dinv.dot(L).dot(Dinv)
MemoryError
--- SharcNET Job Epilogue ---
job id: 12138822
exit status: 1
cpu time: 313s / 12.0h (0 %)
elapsed time: 479s / 12.0h (1 %)
virtual memory: 11.9G / 16.0G (74 %)

Job returned with status 1.
WARNING: Job only used 1 % of its requested walltime.
WARNING: Job only used 0 % of its requested cpu time.
WARNING: Job only used 65 % of allocated cpu time.
WARNING: Job only used 74% of its requested memory.
...

One solution would be to run it 7000 words at a time. But maybe you know a way to increase the memory the program can use?

Thanks,
Michel
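
Some arithmetic helps explain the numbers in the log: a dense 38000 x 38000 float64 matrix takes 38000 * 38000 * 8 bytes, roughly 11.5 GB, and a product such as Dinv.dot(L).dot(Dinv) needs room for more than one such matrix if the operands are dense. The sketch below is not the library's code. It shows a generic alternative in which nearest neighbours are computed in row blocks, so that only (n, k) arrays are kept. The name knn_chunked and its parameters are hypothetical.

```python
import numpy as np


def knn_chunked(unit_vectors, k, chunk_size=1000):
    """Top-k cosine neighbours per row, computed one block of rows at a time.

    Only a (chunk_size, n) similarity block is alive at once, and the result
    is two (n, k) arrays instead of a dense (n, n) matrix.
    """
    n = unit_vectors.shape[0]
    neighbours = np.empty((n, k), dtype=np.int64)
    weights = np.empty((n, k), dtype=unit_vectors.dtype)
    for start in range(0, n, chunk_size):
        stop = min(start + chunk_size, n)
        rows = np.arange(stop - start)
        block = unit_vectors[start:stop].dot(unit_vectors.T)
        # Exclude self-similarity so a word is never its own neighbour.
        block[rows, np.arange(start, stop)] = -np.inf
        # argpartition finds the k largest entries per row without a full sort.
        top = np.argpartition(-block, k - 1, axis=1)[:, :k]
        neighbours[start:stop] = top
        weights[start:stop] = block[rows[:, None], top]
    return neighbours, weights


rng = np.random.RandomState(0)
vectors = rng.randn(50, 8)
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)
neighbours, weights = knn_chunked(vectors, k=5, chunk_size=16)
```

The neighbour lists could then be loaded into a sparse matrix for the random walk, which keeps memory proportional to n * k and not n * n.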

An issue in embedding process

Hi there,

I just found an issue when creating the embeddings. The details are below.

File "example.py", line 13, in <module>
set(lexicon.keys()).union(pos_seeds).union(neg_seeds))
File "/anaconda3/lib/python3.6/site-packages/socialsent-0.1.2-py3.6.egg/socialsent/representations/representation_factory.py", line 10, in create_representation
return GigaEmbedding(path, *args, **kwargs)
File "/anaconda3/lib/python3.6/site-packages/socialsent-0.1.2-py3.6.egg/socialsent/representations/embedding.py", line 120, in __init__
for line in lines(path):
File "/anaconda3/lib/python3.6/site-packages/socialsent-0.1.2-py3.6.egg/socialsent/util.py", line 50, in lines
with open(fname) as f:
FileNotFoundError: [Errno 2] No such file or directory: 'data/example_embeddings/glove.6B.100d.txt'

Kind regards

On the use of cooccurgen.py

Dear William,
I’d like to use your code to induce a sentiment lexicon from a new corpus. In your answer to issue #8, you wrote that the first step is to “Use representations/cooccurgen.py to process a corpus and construct co-occurrence matrices.”
Looking at cooccurgen.py, it seems that it takes as input a corpus in the COHA word_lemma_pos format and that it also needs a file called index.pkl.

Do I have to transform my corpus into a tabular format like the COHA format?
How is the index.pkl file created?
Is there any way to use the script starting from a raw corpus?

Thanks a lot in advance!
Best,
Rachele
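
As a rough illustration of what "constructing co-occurrence matrices from a raw corpus" involves, here is a plain-Python sketch. It is not the format cooccurgen.py or index.pkl expects, which the issue leaves open. The helpers build_index and cooccurrence_counts, the whitespace tokenization, and the window size are all assumptions made for the example.

```python
from collections import Counter


def build_index(sentences):
    """Map each distinct token to an integer id, in order of first appearance."""
    index = {}
    for tokens in sentences:
        for tok in tokens:
            index.setdefault(tok, len(index))
    return index


def cooccurrence_counts(sentences, index, window=2):
    """Count symmetric co-occurrences of token ids within +/- window positions."""
    counts = Counter()
    for tokens in sentences:
        ids = [index[t] for t in tokens]
        for i, wi in enumerate(ids):
            # Look only to the right and record both directions.
            for wj in ids[i + 1:i + 1 + window]:
                counts[(wi, wj)] += 1
                counts[(wj, wi)] += 1
    return counts


raw = ["the movie was good", "the movie was bad"]
sentences = [line.lower().split() for line in raw]
index = build_index(sentences)
counts = cooccurrence_counts(sentences, index, window=2)
```

Here index maps tokens to row and column ids and counts holds the sparse entries of the matrix; a real corpus would need proper tokenization and probably a vocabulary cutoff.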

compatibility when keras uses tf backend

Hi William,

First of all thanks for the great work and codes!

I ran into a compatibility problem using the densifier part and managed to fix it. It would be good to check with you whether this is a general case, thanks!

In example.py, I tried to call the densifier directly with

densify(embeddings, pos_seeds, neg_seeds)

and the following is raised:

File "/mounts/Users/student/mzhao/mytools/temp/socialsent/socialsent/embedding_transformer.py", line 146, in <lambda>
model.add_node(Lambda(lambda x: K.reshape(K.sqrt(K.sum(x * x, axis=1)), (x.shape[0], 1))),
File "/mounts/Users/student/mzhao/env/lib/python2.7/site-packages/keras/backend/tensorflow_backend.py", line 271, in reshape
return tf.reshape(x, shape)
File "/mounts/Users/student/mzhao/env/lib/python2.7/site-packages/tensorflow/python/ops/gen_array_ops.py", line 2630, in reshape
name=name)
File "/mounts/Users/student/mzhao/env/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 494, in apply_op
raise err
TypeError: Expected binary or unicode string, got Dimension(None)

which is essentially raised from

https://github.com/williamleif/socialsent/blob/master/socialsent/embedding_transformer.py#L146

when it tries to read the tensor shape. It can easily be fixed by using K.shape(x)[0] instead of x.shape[0].

Have you run into this problem as well?

My environment:
Keras==0.3.3
tensorflow==1.8.0 & tensorflow==1.0.0 (no gpu)
OS: openSUSE Leap 15.0

Thanks!
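
A NumPy analogue of the failing expression, offered as a sketch and not as a tested Keras patch. In NumPy the row count is always known, so both versions work; in a symbolic TensorFlow graph the batch dimension can be unknown, which is what Dimension(None) in the error suggests. Keeping the reduced axis yields the (n, 1) shape without asking for the row count at all. Keras backend reductions accept a keepdims argument in later releases; whether Keras 0.3.3 does is not verified here.

```python
import numpy as np


def row_norms_reshape(x):
    """Row norms reshaped using the static row count, as in the failing line."""
    return np.reshape(np.sqrt(np.sum(x * x, axis=1)), (x.shape[0], 1))


def row_norms_keepdims(x):
    """Same values, but the (n, 1) shape comes from keeping the reduced axis."""
    return np.sqrt(np.sum(x * x, axis=1, keepdims=True))


x = np.array([[3.0, 4.0], [6.0, 8.0]])
a = row_norms_reshape(x)
b = row_norms_keepdims(x)
```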
