
text-summarization-with-amazon-reviews's Introduction

Text-Summarization-with-Amazon-Reviews

The objective of this project is to build a seq2seq model that can create relevant summaries for reviews written about fine foods sold on Amazon. The dataset contains more than 500,000 reviews and is hosted on Kaggle. It is too large to host here (over 300 MB).

To build our model we will use a two-layer bidirectional RNN with LSTMs on the input data, and two LSTM layers with Bahdanau attention on the target data. Jaemin Cho's seq2seq tutorial was really helpful for getting the code into working order, because this is my first project with TensorFlow 1.1 and some of its functions are very different from 1.0. The architecture of this model is similar to Xin Pan's and Peter Liu's; here's their GitHub page.
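The Bahdanau attention mentioned above is additive attention: each encoder output h_t is scored against the current decoder state s via score_t = vᵀ tanh(W1·h_t + W2·s), and the scores are softmaxed into weights over the input positions. A minimal NumPy sketch of that scoring (all dimensions and names here are illustrative assumptions, not the project's TensorFlow 1.1 code):

```python
import numpy as np

def bahdanau_weights(enc_outputs, dec_state, W1, W2, v):
    """Additive (Bahdanau) attention: score_t = v^T tanh(W1 h_t + W2 s)."""
    # enc_outputs: (T, enc_dim), dec_state: (dec_dim,)
    hidden = np.tanh(enc_outputs @ W1 + dec_state @ W2)  # (T, attn_dim)
    scores = hidden @ v                                  # (T,)
    exp = np.exp(scores - scores.max())                  # stable softmax
    return exp / exp.sum()                               # weights over T positions

rng = np.random.default_rng(0)
T, enc_dim, dec_dim, attn_dim = 5, 8, 6, 4
weights = bahdanau_weights(rng.normal(size=(T, enc_dim)),
                           rng.normal(size=dec_dim),
                           rng.normal(size=(enc_dim, attn_dim)),
                           rng.normal(size=(dec_dim, attn_dim)),
                           rng.normal(size=attn_dim))
print(weights.shape, weights.sum())
```

The weights are then used to form a context vector as a weighted sum of the encoder outputs, which is fed to the decoder at each step.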

This model uses ConceptNet Numberbatch's pre-trained word vectors.

Here are some examples of reviews and their generated summaries:

  • Description(1): The coffee tasted great and was at such a good price! I highly recommend this to everyone!

  • Summary(1): great coffee

  • Description(2): This is the worst cheese that I have ever bought! I will never buy it again and I hope you won’t either!

  • Summary(2): omg gross gross

  • Description(3): love individual oatmeal cups found years ago sam quit selling sound big lots quit selling found target expensive buy individually trilled get entire case time go anywhere need water microwave spoon know quaker flavor packets

  • Summary(3): love it

I wrote an article about this project that explains parts of it in detail.

text-summarization-with-amazon-reviews's People

Contributors

currie32


text-summarization-with-amazon-reviews's Issues

Are the output layers concatenated properly?

Here is the encoding_layer for ease:

def encoding_layer(rnn_size, sequence_length, num_layers, rnn_inputs, keep_prob):
    '''Create the encoding layer'''
    
    for layer in range(num_layers):
        with tf.variable_scope('encoder_{}'.format(layer)):
            cell_fw = tf.contrib.rnn.LSTMCell(rnn_size,
                                              initializer=tf.random_uniform_initializer(-0.1, 0.1, seed=2))
            cell_fw = tf.contrib.rnn.DropoutWrapper(cell_fw, 
                                                    input_keep_prob = keep_prob)

            cell_bw = tf.contrib.rnn.LSTMCell(rnn_size,
                                              initializer=tf.random_uniform_initializer(-0.1, 0.1, seed=2))
            cell_bw = tf.contrib.rnn.DropoutWrapper(cell_bw, 
                                                    input_keep_prob = keep_prob)

            enc_output, enc_state = tf.nn.bidirectional_dynamic_rnn(cell_fw, 
                                                                    cell_bw, 
                                                                    rnn_inputs,
                                                                    sequence_length,
                                                                    dtype=tf.float32)
    # Join outputs since we are using a bidirectional RNN
    enc_output = tf.concat(enc_output,2)
    
    return enc_output, enc_state

I'm new to TensorFlow, so here's my understanding:
Based on num_layers, this will create that many tf.nn.bidirectional_dynamic_rnns. So shouldn't the line enc_output = tf.concat(enc_output, 2) be inside the for loop? And will the return statement return all of the enc_output values? If so, how?

Also, how would you measure accuracy for this model? Can you suggest an approach briefly?
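On the concatenation question: tf.nn.bidirectional_dynamic_rnn returns its outputs as a (forward, backward) pair, and concatenating that pair on axis 2 joins them along the feature dimension, doubling it. A minimal NumPy sketch of that shape arithmetic (dimensions here are invented for illustration):

```python
import numpy as np

batch, time, rnn_size = 2, 7, 16
# bidirectional_dynamic_rnn returns a (forward, backward) pair of
# (batch, time, rnn_size) tensors; concatenating on axis 2 joins them.
fw = np.random.rand(batch, time, rnn_size)
bw = np.random.rand(batch, time, rnn_size)
enc_output = np.concatenate((fw, bw), axis=2)
print(enc_output.shape)  # feature dimension doubles to 2 * rnn_size
```

Note that in the loop as written, each iteration overwrites enc_output, so only the last iteration's pair reaches the concat after the loop; the earlier layers' outputs are not returned.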

Restoring Session

Hi,
I'm trying to restore the session from the last checkpoint saved. I do this by uncommenting these two lines of code:

loader = tf.train.import_meta_graph("./" + checkpoint + '.meta')
loader.restore(sess, checkpoint)

But I'm getting this error:
ValueError: cannot add op with name encoder_0/bidirectional_rnn/fw/lstm_cell/kernel/Adam as that name is already used

Am I doing something wrong or is this an issue?

how to make only one bidirectional layer, and other regular rnn layers

Thank you very much for your tutorial; it really helped me a lot.
I am wondering if you can help me with this. In your setup, it seems that every encoding layer is bidirectional. But I think it is also very common to make only the first layer bidirectional and the remaining layers regular RNN layers (like Google's translation model).

I am not sure how to connect a bidirectional layer to a regular RNN layer.
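One common pattern is to concatenate the first layer's forward and backward outputs along the feature axis and feed that tensor into ordinary unidirectional layers. Sketched here in NumPy with a stand-in for an RNN layer (the fake_rnn helper and all dimensions are illustrative assumptions, not the project's code):

```python
import numpy as np

def fake_rnn(inputs, out_size, seed):
    """Stand-in for an RNN layer: maps (batch, time, d) -> (batch, time, out_size)."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(inputs.shape[-1], out_size))
    return np.tanh(inputs @ W)

batch, time, emb, rnn_size = 2, 5, 10, 8
x = np.random.default_rng(0).normal(size=(batch, time, emb))
# Layer 1: bidirectional -> run forward and time-reversed passes,
# then concatenate along features (giving 2 * rnn_size).
fw = fake_rnn(x, rnn_size, 1)
bw = fake_rnn(x[:, ::-1], rnn_size, 2)[:, ::-1]
layer1 = np.concatenate((fw, bw), axis=2)
# Layers 2+: ordinary unidirectional RNNs consume layer1 directly.
layer2 = fake_rnn(layer1, rnn_size, 3)
print(layer1.shape, layer2.shape)
```

The key point is that the unidirectional layer just needs an input whose feature size matches the concatenated bidirectional output (2 * rnn_size); no special wiring is required.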

Trained model making random predictions when loaded

The model made some good predictions when I tested it immediately after training finished. But when I loaded the model after reopening the notebook the next day, it made random, meaningless predictions.
Any reason for this?

accuracy for trained model

Hi, I trained on the Fine Food Reviews data for 100 epochs. After that I changed datasets: from the Amazon reviews datasets I downloaded the Home and Kitchen subset, took the Review Text and Summary fields, and trained for another 100 epochs.

After training completed I tested the model, and it gives the same summary for distinct reviews. I have retrained many times and keep getting the same summary at test time.

Single Layer Encoder

In your code it seems like you are trying to create a multilayer encoder; however, what is actually happening is that multiple single-layer encoders are being created. As you can see in this graph image created by TensorFlow, http://imgur.com/a/jqAn5, the rnn inputs are fed into each bidirectional RNN rather than one feeding into the other. I created this image by running your encoding_layer function, passing in some placeholders, and then using tf.summary.FileWriter to draw the graph.

To create a multilayer encoder you would want something like this:

def encoding_layer(rnn_size, sequence_length, num_layers, rnn_inputs, keep_prob):
    '''Create the encoding layer'''
    layer_input = rnn_inputs
    for layer in range(num_layers):
        with tf.variable_scope('encoder_{}'.format(layer)):
            cell_fw = tf.contrib.rnn.LSTMCell(rnn_size,
                                              initializer=tf.random_uniform_initializer(-0.1, 0.1, seed=2))
            cell_fw = tf.contrib.rnn.DropoutWrapper(cell_fw, 
                                                    input_keep_prob = keep_prob)

            cell_bw = tf.contrib.rnn.LSTMCell(rnn_size,
                                              initializer=tf.random_uniform_initializer(-0.1, 0.1, seed=2))
            cell_bw = tf.contrib.rnn.DropoutWrapper(cell_bw, 
                                                    input_keep_prob = keep_prob)

            enc_output, enc_state = tf.nn.bidirectional_dynamic_rnn(cell_fw, 
                                                                    cell_bw, 
                                                                    layer_input,
                                                                    sequence_length,
                                                                    dtype=tf.float32)
            layer_input = tf.concat(enc_output, 2)
    # layer_input already holds the concatenated (fw, bw) outputs of the final layer
    enc_output = layer_input
    return enc_output, enc_state

You can see in this graph image, http://imgur.com/a/bJANa, that this creates the multilayer structure I assume you are going for. (The caveat is that in this function your rnn size must be half of the embedding size, so if you want them to be different, just move the first bi-RNN out of the loop and have the layers after the first share the same size.)

Giving same summary for all reviews

Hi, I trained the model, but when I tested it with different reviews it gives the same summary for all of them: whatever summary was produced for the first review. I don't understand what the problem is. Can I know the reason for that?
