
char-embeddings's Introduction

char-embeddings

char-embeddings is a repository containing 300D character embeddings derived from the GloVe 840B/300D dataset, along with scripts that use these embeddings to train a Keras deep learning model that generates Magic: The Gathering cards. The generation code and model construction are heavily modified from the automatic text generation Keras example by François Chollet.

Usage

This repository contains a number of Python 3 scripts:

  • create_embeddings.py: Converts a pretrained word embeddings file into a character embeddings file by averaging, for each character, the vectors of all words that contain it (see the sketch after this list).
  • create_magic_text.py: Converts an MTG JSON card dump into a one-per-line card encoding.
  • text_generator_keras.py: Constructs and trains the Keras model, producing Magic cards at each epoch.
  • text_generator_keras_sample.py: Uses the text file and Keras model generated by the previous two scripts to generate a large number of Magic cards.
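
For a sense of how the character embeddings are derived, here is a minimal sketch of the averaging approach (the GloVe input file name, the output file name, and the exact bookkeeping are assumptions; the repository's create_embeddings.py may differ in detail):

import numpy as np

vector_sums = {}   # char -> running sum of word vectors
counts = {}        # char -> number of occurrences seen so far

with open('glove.840B.300d.txt', 'r', encoding='utf-8') as f:
    for line in f:
        parts = line.rstrip().split(' ')
        word, vec = parts[0], np.asarray(parts[1:], dtype=float)
        # every occurrence of a character contributes one copy of the word vector
        for char in word:
            vector_sums[char] = vector_sums.get(char, 0) + vec
            counts[char] = counts.get(char, 0) + 1

with open('glove.840B.300d-char.txt', 'w', encoding='utf-8') as out:
    for char, total in vector_sums.items():
        avg = total / counts[char]
        out.write(char + ' ' + ' '.join('%.6f' % x for x in avg) + '\n')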

The output folder contains Magic card output at each epoch, a log of losses at every 50th batch, the learned character embeddings at the last epoch, the trained model itself, and a large sample of generated Magic cards.

Requirements

keras, tensorflow, h5py, scikit-learn

Maintainer

Max Woolf (@minimaxir)

License

MIT


char-embeddings's Issues

Question on deriving char embeddings from word embeddings.

Hi. I was looking at create_embeddings.py to see how you derived char embeddings directly from word embeddings.

It looks like you equate a char embedding with the average of all word vectors that contain that char, counting each char multiple times if it occurs more than once in a word. Is that correct?

Did you decide to do this because you got good results or was there some other reason for this?

Thanks!

FA

whitespace character embedding!

Hi, I am new to character embeddings. Isn't it necessary to have an embedding for the whitespace character? This averaging method does not produce one, since whitespace never appears inside a word.
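
One common workaround (an assumption on my part, not something this repository necessarily does) is to give any character missing from the embeddings file, including whitespace, a zero or small random fallback vector when building the embedding matrix:

import numpy as np

embedding_dim = 300
chars = sorted(set("some corpus text with spaces"))
char_indices = {c: i for i, c in enumerate(chars)}
embedding_vectors = {}  # char -> 300-dim vector loaded from the -char.txt file

embedding_matrix = np.zeros((len(chars), embedding_dim))
for char, i in char_indices.items():
    vec = embedding_vectors.get(char)
    if vec is not None:
        embedding_matrix[i] = vec
    else:
        # whitespace (and any other character absent from the file) falls back
        # to small random values; all-zeros is another common choice
        embedding_matrix[i] = np.random.uniform(-0.05, 0.05, embedding_dim)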

Attempt to use GloVe character embeddings causes strange errors.

Hello! I've been working on the text-generation example from Keras, and I saw your code. I tried to rework what I already had so it would use your character embeddings. Unfortunately, it's introducing a very strange error that I can't find any documentation on and whose cause I'm unsure of. My first thought is that I've failed to wire the embedding layer into the model properly, but it trains successfully (?). If you could offer me some assistance with fixing this, I'd be greatly in your debt!

Here's an example stack trace / output:

275456/283158 [============================>.] - ETA: 4s - loss: 1.4438e-05
275584/283158 [============================>.] - ETA: 4s - loss: 1.4431e-05
275712/283158 [============================>.] - ETA: 4s - loss: 1.4425e-05
275840/283158 [============================>.] - ETA: 4s - loss: 1.4418e-05
275968/283158 [============================>.] - ETA: 3s - loss: 1.4411e-05
276096/283158 [============================>.] - ETA: 3s - loss: 1.4405e-05
276224/283158 [============================>.] - ETA: 3s - loss: 1.4398e-05
276352/283158 [============================>.] - ETA: 3s - loss: 1.4391e-05
276480/283158 [============================>.] - ETA: 3s - loss: 1.4385e-05
276608/283158 [============================>.] - ETA: 3s - loss: 1.4378e-05
276736/283158 [============================>.] - ETA: 3s - loss: 1.4371e-05
276864/283158 [============================>.] - ETA: 3s - loss: 1.4365e-05
276992/283158 [============================>.] - ETA: 3s - loss: 1.4358e-05
277120/283158 [============================>.] - ETA: 3s - loss: 1.4351e-05
277248/283158 [============================>.] - ETA: 3s - loss: 1.4345e-05
277376/283158 [============================>.] - ETA: 3s - loss: 1.4338e-05
277504/283158 [============================>.] - ETA: 3s - loss: 1.4332e-05
277632/283158 [============================>.] - ETA: 3s - loss: 1.4325e-05
277760/283158 [============================>.] - ETA: 2s - loss: 1.4318e-05
277888/283158 [============================>.] - ETA: 2s - loss: 1.4312e-05
278016/283158 [============================>.] - ETA: 2s - loss: 1.4305e-05
278144/283158 [============================>.] - ETA: 2s - loss: 1.4299e-05
278272/283158 [============================>.] - ETA: 2s - loss: 1.4292e-05
278400/283158 [============================>.] - ETA: 2s - loss: 1.4285e-05
278528/283158 [============================>.] - ETA: 2s - loss: 1.4279e-05
278656/283158 [============================>.] - ETA: 2s - loss: 1.4272e-05
278784/283158 [============================>.] - ETA: 2s - loss: 1.4266e-05
278912/283158 [============================>.] - ETA: 2s - loss: 1.4259e-05
279040/283158 [============================>.] - ETA: 2s - loss: 1.4253e-05
279168/283158 [============================>.] - ETA: 2s - loss: 1.4246e-05
279296/283158 [============================>.] - ETA: 2s - loss: 1.4240e-05
279424/283158 [============================>.] - ETA: 2s - loss: 1.4233e-05
279552/283158 [============================>.] - ETA: 1s - loss: 1.4227e-05
279680/283158 [============================>.] - ETA: 1s - loss: 1.4220e-05
279808/283158 [============================>.] - ETA: 1s - loss: 1.4214e-05
279936/283158 [============================>.] - ETA: 1s - loss: 1.4207e-05
280064/283158 [============================>.] - ETA: 1s - loss: 1.4201e-05
280192/283158 [============================>.] - ETA: 1s - loss: 1.4194e-05
280320/283158 [============================>.] - ETA: 1s - loss: 1.4188e-05
280448/283158 [============================>.] - ETA: 1s - loss: 1.4181e-05
280576/283158 [============================>.] - ETA: 1s - loss: 1.4175e-05
280704/283158 [============================>.] - ETA: 1s - loss: 1.4168e-05
280832/283158 [============================>.] - ETA: 1s - loss: 1.4162e-05
280960/283158 [============================>.] - ETA: 1s - loss: 1.4155e-05
281088/283158 [============================>.] - ETA: 1s - loss: 1.4149e-05
281216/283158 [============================>.] - ETA: 1s - loss: 1.4142e-05
281344/283158 [============================>.] - ETA: 0s - loss: 1.4136e-05
281472/283158 [============================>.] - ETA: 0s - loss: 1.4129e-05
281600/283158 [============================>.] - ETA: 0s - loss: 1.4123e-05
281728/283158 [============================>.] - ETA: 0s - loss: 1.4117e-05
281856/283158 [============================>.] - ETA: 0s - loss: 1.4110e-05
281984/283158 [============================>.] - ETA: 0s - loss: 1.4104e-05
282112/283158 [============================>.] - ETA: 0s - loss: 1.4097e-05
282240/283158 [============================>.] - ETA: 0s - loss: 1.4091e-05
282368/283158 [============================>.] - ETA: 0s - loss: 1.4085e-05
282496/283158 [============================>.] - ETA: 0s - loss: 1.4078e-05
282624/283158 [============================>.] - ETA: 0s - loss: 1.4072e-05
282752/283158 [============================>.] - ETA: 0s - loss: 1.4066e-05
282880/283158 [============================>.] - ETA: 0s - loss: 1.4059e-05
283008/283158 [============================>.] - ETA: 0s - loss: 1.4053e-05
283136/283158 [============================>.] - ETA: 0s - loss: 1.4046e-05
Epoch 00000: loss improved from inf to 0.00001, saving model to models/weights-improvement-00-0.0000-embeddings.hdf5

283158/283158 [==============================] - 154s - loss: 1.4045e-05   

----- diversity: 0.2
----- Generating with seed: "riminal and responsible for not having h"
riminal and responsible for not having h
Traceback (most recent call last):
  File "/root/PycharmProjects/Keras/sandbox.py", line 292, in <module>
    train()
  File "/root/PycharmProjects/Keras/sandbox.py", line 246, in train
    next_index = sample(preds, diversity)
  File "/root/PycharmProjects/Keras/sandbox.py", line 202, in sample
    probas = np.random.multinomial(1, preds, 1)
  File "mtrand.pyx", line 4612, in mtrand.RandomState.multinomial
TypeError: object of type 'numpy.float64' has no len()


Here's my code:

'''Example script to generate text from Nietzsche's writings.

At least 20 epochs are required before the generated text
starts sounding coherent.

It is recommended to run this script on GPU, as recurrent
networks are quite computationally intensive.

If you try this script on new data, make sure your corpus
has at least ~100k characters. ~1M is better.
'''

from keras.models import Sequential, load_model
from keras.layers import Dense, Activation
from keras.layers import LSTM, SimpleRNN, Dropout, Bidirectional, Embedding, GRU
from keras.callbacks import ModelCheckpoint
from keras.optimizers import RMSprop
from keras.utils.data_utils import get_file
import numpy as np
import random
import sys
import pprint

embeddings_path = "glove.840B.300d-char.txt"
embedding_dim = 300


text_file_kapital = open("Das_Kapital.txt", 'rb')

def preprocess(text_file):

    ##get rid of line breaks and non-ASCII
    lines = []
    for line in text_file:
        line = line.strip().lower()
        line = line.decode("ascii", "ignore")
        if len(line) == 0:
            continue
        lines.append(line)
    text_file.close()
    text = " ".join(lines)
    return text

text = preprocess(text_file_kapital)


chars = sorted(list(set(text)))
print('total chars:', len(chars))
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))

# cut the text in semi-redundant sequences of maxlen characters
maxlen = 40
step = 10
sentences = []
next_chars = []
for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i: i + maxlen])
    next_chars.append(text[i + maxlen])
print('nb sequences:', len(sentences))

print('Vectorization...')

#working
# X = np.zeros((len(sentences), maxlen, len(chars)), dtype=np.bool)
# y = np.zeros((len(sentences), len(chars)), dtype=np.bool)
# for i, sentence in enumerate(sentences):
#     for t, char in enumerate(sentence):
#         X[i, t, char_indices[char]] = 1
#     y[i, char_indices[next_chars[i]]] = 1

X = np.zeros((len(sentences), maxlen), dtype=np.int)
y = np.zeros((len(sentences), len(chars)), dtype=np.bool)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        X[i, t] = char_indices[char]
y[i, char_indices[next_chars[i]]] = 1

print('Processing pretrained character embeds...')
embedding_vectors = {}
with open(embeddings_path, 'r') as f:
    for line in f:
        line_split = line.strip().split(" ")
        vec = np.array(line_split[1:], dtype=float)
        char = line_split[0]
        embedding_vectors[char] = vec

embedding_matrix = np.zeros((len(chars), 300))
#embedding_matrix = np.random.uniform(-1, 1, (len(chars), 300))
for char, i in char_indices.items():
    #print ("{}, {}".format(char, i))
    embedding_vector = embedding_vectors.get(char)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector



# build the model: a bidirectional GRU

def build_model():

    ## working
    # print('Build model...')
    # model = Sequential()
    # #model.add(Embedding(len(chars), embedding_dim, input_length=maxlen, weights=[embedding_matrix]))
    # model.add(Bidirectional(GRU(128, unroll=True, return_sequences=True), input_shape=(maxlen, len(chars))))
    # model.add(Dropout(0.2))
    # model.add(Bidirectional(GRU(128, unroll=True)))
    # model.add(Dropout(0.2))
    # model.add(Dense(len(chars)))
    # model.add(Activation('softmax'))
    #
    # model.compile(loss='categorical_crossentropy', optimizer="RMSprop")
    # return model

    print('Build model...')
    model = Sequential()
    model.add(Embedding(len(chars), embedding_dim, input_length=maxlen, weights=[embedding_matrix]))
    model.add(Bidirectional(GRU(32, unroll=True), input_shape=(maxlen, len(chars))))
    model.add(Dropout(0.2))
    # model.add(Bidirectional(GRU(128, unroll=True)))
    # model.add(Dropout(0.2))
    model.add(Dense(len(chars)))
    model.add(Activation('softmax'))

    model.compile(loss='categorical_crossentropy', optimizer="RMSprop")
    return model


#model = load_model("models/weights-improvement-00-1.2807-biggggger.hdf5")

model = build_model()
model.summary()

def sample(preds, temperature=1.0):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds + 1e-6) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)

# train the model, output generated text after each iteration

def train():
    for iteration in range(1, 60):
        filepath = "models/weights-improvement-{epoch:02d}-{loss:.4f}-embeddings.hdf5"
        checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
        callbacks_list = [checkpoint]
        print()
        print('-' * 50)
        print('Iteration', iteration)
        model.fit(X, y,
                  batch_size=128,
                  epochs=1, callbacks = callbacks_list)

        start_index = random.randint(0, len(text) - maxlen - 1)
        #model.save("models/testmodel.h5")

        for diversity in [0.2, 0.5, 1.0, 1.2]:
            print()
            print('----- diversity:', diversity)

            generated = ''
            sentence = text[start_index: start_index + maxlen]
            generated += sentence
            print('----- Generating with seed: "' + sentence + '"')
            sys.stdout.write(generated)

            for i in range(400):
                # x = np.zeros((1, maxlen, len(chars)))
                # for t, char in enumerate(sentence):
                #     x[0, t, char_indices[char]] = 1.
                #
                # preds = model.predict(x, verbose=0)[0]
                # next_index = sample(preds, diversity)
                # next_char = indices_char[next_index]

                x = np.zeros((1, maxlen), dtype=np.int)
                for t, char in enumerate(sentence):
                    x[0, t] = char_indices[char]

                preds = model.predict(x, verbose=0)[0][0]
                next_index = sample(preds, diversity)
                next_char = indices_char[next_index]

                generated += next_char
                sentence = sentence[1:] + next_char

                sys.stdout.write(next_char)
                sys.stdout.flush()
            print()

def predict(num_to_predict, temperature, seed):
    start_index = random.randint(0, len(text) - maxlen - 1)

    if len(seed) > 40:
        print("Type fewer characters, you typed this man characters")
        print(len(seed))
        return 0
    newstring = ""
    space = 40 - len(seed)
    for i in range(space):
        newstring += " "
    seed = newstring + seed
    sentence = seed
    generated = ''
    #sentence = text[start_index: start_index + maxlen]
    generated += sentence
    print('----- Hey Marx, Tell me about:  "' + sentence.strip() + '"')
    sys.stdout.write(generated)

    for i in range(num_to_predict):
        x = np.zeros((1, maxlen, len(chars)))
        for t, char in enumerate(sentence):
            x[0, t, char_indices[char]] = 1.

        preds = model.predict(x, verbose=0)[0]
        next_index = sample(preds, temperature)
        next_char = indices_char[next_index]

        generated += next_char
        sentence = sentence[1:] + next_char

        #sys.stdout.write(next_char)
        #sys.stdout.flush()
    print()
    pprint.pprint(generated)

train()

# ################# PREDICT
# cont = 0
#
# while cont == 0:
#     newtext = input("What do you want me to tell you about?")
#
#     if newtext == "1":
#         cont = 1
#     predict(500, 0.5, newtext.lower())
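
For context on the traceback above: np.random.multinomial expects preds to be a 1-D probability vector over the characters, so the "has no len()" error is consistent with sample() receiving a single float. In the code above, model.predict(x) returns an array of shape (1, len(chars)), so indexing it with [0][0] yields one scalar probability. Here is a hedged sketch of the prediction step that keeps the whole distribution (it reuses the variables defined in the script above and is not a verified fix):

x = np.zeros((1, maxlen), dtype=np.int)
for t, char in enumerate(sentence):
    x[0, t] = char_indices[char]

preds = model.predict(x, verbose=0)[0]   # 1-D vector of length len(chars)
next_index = sample(preds, diversity)    # multinomial now receives a vector
next_char = indices_char[next_index]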

TypeError: Cannot convert -0.05 to EagerTensor of dtype int32

@minimaxir I am trying to do text classification with spam messages.

While running the code provided by @minimaxir (with some slight modifications), I faced this error:

TypeError: Cannot convert -0.05 to EagerTensor of dtype int32

Here is my code:

from __future__ import print_function
from keras.models import Sequential, Model, load_model
from keras.layers import Dense, Activation, Embedding
from keras.layers import LSTM, Input
from keras.optimizers import RMSprop, Adam
from keras.utils.data_utils import get_file
from tensorflow.keras.layers import BatchNormalization
from keras.callbacks import Callback, ModelCheckpoint
from sklearn.decomposition import PCA
import numpy as np
import random
import sys
import csv
import os
import h5py
import pandas as pd
import re
from nltk.corpus import stopwords
from nltk.stem.snowball import PorterStemmer
from sklearn import preprocessing


def preprocess_text(sen):
    # Remove punctuations and numbers
    sentence = re.sub('[^a-zA-Z]', ' ', sen)

    # Single character removal
    sentence = re.sub(r"\s+[a-zA-Z]\s+", ' ', sentence)

    # Removing multiple spaces
    sentence = re.sub(r'\s+', ' ', sentence)

    stops = stopwords.words('english')
    #print(stops)
    porter = PorterStemmer()
    for word in sentence.split():
        if word in stops:
            sentence = sentence.replace(word, '')
        sentence = sentence.replace(word, porter.stem(word))
    return sentence.lower()


maxlen = 40  # must match the sequence length the model was trained with
num_char_generated = 30000
df = pd.read_csv('spam.csv')
df['Message'] = df['Message'].apply(preprocess_text)
df.head(10)
mes = []
for i in df['Message']:
    mes.append(i.split())
for m in mes:
    chars = sorted(list(set(m)))
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))


def sample(preds, temperature=1.0):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds + 1e-6) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)


print('Loading model...')
model = load_model('model.hdf5')
f2 = open('text_sample.txt', 'w')

start_index = random.randint(0, len(text) - maxlen - 1)
print("He")
for diversity in [0.2, 0.5, 1.0, 1.2]:
    print()
    print('----- diversity:', diversity)
    f2.write('----- diversity:' + ' ' + str(diversity) + '\n')

    generated = ''
    sentence = text[start_index: start_index + maxlen]
    generated += sentence
    print('----- Generating with seed: "' + sentence + '"')
    f2.write('----- Generating with seed: "' + sentence + '"' + '\n---\n')
    sys.stdout.write(generated)

    for i in range(num_char_generated):
        x = np.zeros((1, maxlen), dtype=np.int)
        for t, char in enumerate(sentence):
            x[0, t] = char_indices[char]

        preds = model.predict(x, verbose=0)[0][0]
        next_index = sample(preds, diversity)
        next_char = indices_char[next_index]
        generated += next_char
        sentence = sentence[1:] + next_char

        sys.stdout.write(next_char)
        sys.stdout.flush()
f2.write(generated + '\n')
print("H")
f2.close()
