auto_cliwc's Introduction

Auto LIWC

The code for Chinese LIWC Lexicon Expansion via Hierarchical Classification of Word Embeddings with Sememe Attention (AAAI18).

Datasets

The datasets folder contains two datasets.

  1. HowNet.txt is a Chinese knowledge base with annotated word-sense-sememe information.
  2. sc_liwc.dic is the Chinese LIWC lexicon. This is a revised version of the original C-LIWC file. The original contains part-of-speech (POS) categories such as verb, adverb, and auxverb; we believe it is more accurate to use a POS tagging program when analyzing a given text, so we deleted the POS categories in our experiment. Furthermore, the hierarchical structure of C-LIWC differs slightly from that of the original English LIWC, so we altered it to follow the English LIWC. As for the exact meaning of each category, you can refer to here and here.

Please note that the above dataset files are for academic and educational use only, not for commercial use. If you have any questions, please contact us before downloading the datasets.

Due to the large size of the embedding file, we can only release the code for training the word embeddings. Please see word2vec.py for details.

Run

Run the following command for training and testing:

python3 train_liwc.py

If the datasets are in a different folder, please change the path here.

The current code generates a different training and testing set on every run. To reproduce the results in the paper, you can load train.bin and test.bin, located in bin_data, using pickle.
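
Loading the saved splits could look roughly like the sketch below. It only assumes what is stated above: two pickle files named train.bin and test.bin inside bin_data. The helper names load_split and load_paper_splits are hypothetical and not part of the repository, and the structure of the unpickled objects is not documented here, so the sketch returns them unchanged.

```python
import pickle
from pathlib import Path


def load_split(path):
    """Unpickle one saved split and return whatever object it contains."""
    # Pickle files must be opened in binary mode.
    with open(path, "rb") as f:
        return pickle.load(f)


def load_paper_splits(bin_dir="bin_data"):
    """Load train.bin and test.bin from bin_dir (hypothetical helper)."""
    bin_dir = Path(bin_dir)
    train = load_split(bin_dir / "train.bin")
    test = load_split(bin_dir / "test.bin")
    return train, test
```

Calling load_paper_splits() from the repository root returns the two objects, which would then replace the randomly generated split. As with any pickle file, only load files from a source you trust.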

Dependencies

  • Tensorflow == 1.4.0
  • Scipy == 0.19.0
  • Numpy == 1.13.1
  • Scikit-learn == 0.18.1
  • Gensim == 2.0.0

Cite

If you use the code, please cite this paper:

Xiangkai Zeng, Cheng Yang, Cunchao Tu, Zhiyuan Liu, Maosong Sun. Chinese LIWC Lexicon Expansion via Hierarchical Classification of Word Embeddings with Sememe Attention. The 32nd AAAI Conference on Artificial Intelligence (AAAI 2018).

auto_cliwc's People

Contributors

wirehack

auto_cliwc's Issues

InvalidArgumentError: slice index 1 of dimension 0 out of bounds.

I added "word = str()" before the variable is used, because of "NameError: name 'word' is not defined". Now there is another error:
ValueError: slice index 1 of dimension 0 out of bounds. for 'Decoder/BahdanauAttention/strided_slice' (op: 'StridedSlice') with input shapes: [1], [1], [1], [1] and with computed input tensors: input[1] = <1>, input[2] = <2>, input[3] = <1>.

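What is the source of the lexicon?

Hello! Thank you for sharing this. I noticed that the SC-LIWC lexicon provided in this project contains only a little over 5,000 words, but the SC-LIWC mentioned in the paper should contain more than 7,000 words. Are some words missing from the lexicon in this project? Which website does the lexicon come from? The official LIWC website now requires payment to use their software.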

FileNotFoundError: [Errno 2] No such file or directory: 'datasets/300Tvectors.txt'

mldl@ub1604:/ub16_prj/Auto_CLIWC$ python3 train_liwc.py
Traceback (most recent call last):
  File "train_liwc.py", line 35, in <module>
    vector_size, word2vectors = load_data.load_vectors(vector_file)
  File "/home/mldl/ub16_prj/Auto_CLIWC/utils/load_data.py", line 32, in load_vectors
    vector_file = io.open(filename, 'r', encoding=encoding)
FileNotFoundError: [Errno 2] No such file or directory: 'datasets/300Tvectors.txt'
mldl@ub1604:/ub16_prj/Auto_CLIWC$

UnboundLocalError: local variable 'word' referenced before assignment

mldl@ub1604:/ub16_prj/Auto_CLIWC$ python3 train_liwc.py
Traceback (most recent call last):
  File "train_liwc.py", line 37, in <module>
    word2sememe_vecs, word2sememe_length, word2average_sememes = load_data.load_hownet(hownet_file, word2type, word2vectors)
  File "/home/mldl/ub16_prj/Auto_CLIWC/utils/load_data.py", line 83, in load_hownet
    word2sememes[word].update(m)
UnboundLocalError: local variable 'word' referenced before assignment
mldl@ub1604:/ub16_prj/Auto_CLIWC$

What is PATH2SOGOUT?

It seems that I should put something in it for training the word embedding model.

ValueError: setting an array element with a sequence.

ValueError Traceback (most recent call last)
in
1 word_embeddings = tf.constant(vector_matrix, dtype=tf.float32)
----> 2 sememe_embeddings = tf.constant(memory_matrix, dtype=tf.float32)
3 sememe_lengths = tf.constant(memory_lengths, dtype=tf.int32)
4 x = tf.gather(word_embeddings, index_holder)
5 sememe_memory = tf.gather(sememe_embeddings, index_holder)

d:\Users\Matt\Anaconda3\lib\site-packages\tensorflow\python\framework\constant_op.py in constant(value, dtype, shape, name, verify_shape)
206 tensor_value.tensor.CopyFrom(
207 tensor_util.make_tensor_proto(
--> 208 value, dtype=dtype, shape=shape, verify_shape=verify_shape))
209 dtype_value = attr_value_pb2.AttrValue(type=tensor_value.tensor.dtype)
210 const_tensor = g.create_op(

d:\Users\Matt\Anaconda3\lib\site-packages\tensorflow\python\framework\tensor_util.py in make_tensor_proto(values, dtype, shape, verify_shape)
441 else:
442 _AssertCompatible(values, dtype)
--> 443 nparray = np.array(values, dtype=np_dt)
444 # check to them.
445 # We need to pass in quantized values as tuples, so don't apply the shape

ValueError: setting an array element with a sequence.
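
NumPy commonly raises this error when a nested list is ragged, that is, when its inner lists have different lengths and cannot form a rectangular array. Whether that is the cause here cannot be confirmed from the traceback alone. The following is a minimal sketch of one possible check and workaround, assuming memory_matrix holds a variable number of sememe vectors per word; pad_sememe_vectors, dim, and pad_value are hypothetical names that do not come from the repository.

```python
def pad_sememe_vectors(rows, dim, pad_value=0.0):
    """Pad each word's list of sememe vectors to a common length.

    rows: one entry per word; each entry is a list of vectors,
          and each vector is a sequence of `dim` floats.
    Returns (padded, lengths), where `padded` is rectangular and
    `lengths` records how many real vectors each word had.
    """
    lengths = [len(vectors) for vectors in rows]
    max_len = max(lengths, default=0)
    padded = []
    for vectors in rows:
        for v in vectors:
            # A vector of the wrong size would also make the array ragged.
            if len(v) != dim:
                raise ValueError("expected vectors of length %d, got %d" % (dim, len(v)))
        filler = [[pad_value] * dim for _ in range(max_len - len(vectors))]
        padded.append([list(v) for v in vectors] + filler)
    return padded, lengths
```

If the lengths returned for the real data are not all equal, the input was ragged, and the padded result could be passed to tf.constant in place of the original list, with lengths playing the role of memory_lengths.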
