
char-embeddings's Introduction

char-embeddings

char-embeddings is a repository containing 300D character embeddings derived from the GloVe 840B/300D dataset, along with scripts that use these embeddings to train a Keras deep learning model that generates Magic: The Gathering cards. The generation code and model construction are heavily modified from the automatic text generation Keras example by François Chollet.

Usage

This repository contains a number of Python 3 scripts:

  • create_embeddings.py: Converts a pretrained word embeddings file into a character embeddings file by averaging, for each character, the vectors of all words that contain it (see the sketch after this list).
  • create_magic_text.py: Converts an MTG JSON card dump into a one-per-line card encoding.
  • text_generator_keras.py: Constructs and trains the Keras model, producing Magic cards at each epoch.
  • text_generator_keras_sample.py: Uses the text file and Keras model generated by the previous two scripts to generate a large number of Magic cards.
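
For a sense of how the character embeddings are derived, here is a minimal sketch of the averaging approach (the GloVe input file name, the output file name, and the exact bookkeeping are assumptions; the repository's create_embeddings.py may differ in detail):

import numpy as np

vector_sums = {}   # char -> running sum of word vectors
counts = {}        # char -> number of occurrences seen so far

with open('glove.840B.300d.txt', 'r', encoding='utf-8') as f:
    for line in f:
        parts = line.rstrip().split(' ')
        word, vec = parts[0], np.asarray(parts[1:], dtype=float)
        # every occurrence of a character contributes one copy of the word vector
        for char in word:
            vector_sums[char] = vector_sums.get(char, 0) + vec
            counts[char] = counts.get(char, 0) + 1

with open('glove.840B.300d-char.txt', 'w', encoding='utf-8') as out:
    for char, total in vector_sums.items():
        avg = total / counts[char]
        out.write(char + ' ' + ' '.join('%.6f' % x for x in avg) + '\n')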

The output folder contains Magic card output at each epoch, a log of losses at every 50th batch, the learned character embeddings at the last epoch, the trained model itself, and a large sample of generated Magic cards.

Requirements

keras, tensorflow, h5py, scikit-learn

Maintainer

Max Woolf (@minimaxir)

License

MIT


char-embeddings's Issues

Question on deriving char embeddings from word embeddings.

Hi. I was looking at create_embeddings.py to see how you derived char embeddings directly from word embeddings.

It looks like you equate a char embedding with the average of all word vectors that contain that char, counting each char multiple times if it occurs more than once in a word. Is that correct?

Did you decide to do this because you got good results or was there some other reason for this?

Thanks!

FA

whitespace character embedding!

Hi, I am new to character embeddings. Isn't it necessary to have an embedding for the whitespace character? This averaging method does not produce one, since whitespace never appears inside a word.
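
One common workaround (an assumption on my part, not something this repository necessarily does) is to give any character missing from the embeddings file, including whitespace, a zero or small random fallback vector when building the embedding matrix:

import numpy as np

embedding_dim = 300
chars = sorted(set("some corpus text with spaces"))
char_indices = {c: i for i, c in enumerate(chars)}
embedding_vectors = {}  # char -> 300-dim vector loaded from the -char.txt file

embedding_matrix = np.zeros((len(chars), embedding_dim))
for char, i in char_indices.items():
    vec = embedding_vectors.get(char)
    if vec is not None:
        embedding_matrix[i] = vec
    else:
        # whitespace (and any other character absent from the file) falls back
        # to small random values; all-zeros is another common choice
        embedding_matrix[i] = np.random.uniform(-0.05, 0.05, embedding_dim)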

Attempt to use GloVe character embeddings causes strange errors.

Hello! I've been working on the text-generation example from Keras, and I saw your code. I tried to rework what I already had so it would use your character embeddings. Unfortunately, it's introducing a very strange error that I can't find any documentation on and whose cause I'm unsure of. My first thought is that I've failed to wire the embedding layer into the model properly, but it trains successfully (?). If you could offer me some assistance with fixing this, I'd be greatly in your debt!

Here's an example stack trace / output:

275456/283158 [============================>.] - ETA: 4s - loss: 1.4438e-05
275584/283158 [============================>.] - ETA: 4s - loss: 1.4431e-05
275712/283158 [============================>.] - ETA: 4s - loss: 1.4425e-05
275840/283158 [============================>.] - ETA: 4s - loss: 1.4418e-05
275968/283158 [============================>.] - ETA: 3s - loss: 1.4411e-05
276096/283158 [============================>.] - ETA: 3s - loss: 1.4405e-05
276224/283158 [============================>.] - ETA: 3s - loss: 1.4398e-05
276352/283158 [============================>.] - ETA: 3s - loss: 1.4391e-05
276480/283158 [============================>.] - ETA: 3s - loss: 1.4385e-05
276608/283158 [============================>.] - ETA: 3s - loss: 1.4378e-05
276736/283158 [============================>.] - ETA: 3s - loss: 1.4371e-05
276864/283158 [============================>.] - ETA: 3s - loss: 1.4365e-05
276992/283158 [============================>.] - ETA: 3s - loss: 1.4358e-05
277120/283158 [============================>.] - ETA: 3s - loss: 1.4351e-05
277248/283158 [============================>.] - ETA: 3s - loss: 1.4345e-05
277376/283158 [============================>.] - ETA: 3s - loss: 1.4338e-05
277504/283158 [============================>.] - ETA: 3s - loss: 1.4332e-05
277632/283158 [============================>.] - ETA: 3s - loss: 1.4325e-05
277760/283158 [============================>.] - ETA: 2s - loss: 1.4318e-05
277888/283158 [============================>.] - ETA: 2s - loss: 1.4312e-05
278016/283158 [============================>.] - ETA: 2s - loss: 1.4305e-05
278144/283158 [============================>.] - ETA: 2s - loss: 1.4299e-05
278272/283158 [============================>.] - ETA: 2s - loss: 1.4292e-05
278400/283158 [============================>.] - ETA: 2s - loss: 1.4285e-05
278528/283158 [============================>.] - ETA: 2s - loss: 1.4279e-05
278656/283158 [============================>.] - ETA: 2s - loss: 1.4272e-05
278784/283158 [============================>.] - ETA: 2s - loss: 1.4266e-05
278912/283158 [============================>.] - ETA: 2s - loss: 1.4259e-05
279040/283158 [============================>.] - ETA: 2s - loss: 1.4253e-05
279168/283158 [============================>.] - ETA: 2s - loss: 1.4246e-05
279296/283158 [============================>.] - ETA: 2s - loss: 1.4240e-05
279424/283158 [============================>.] - ETA: 2s - loss: 1.4233e-05
279552/283158 [============================>.] - ETA: 1s - loss: 1.4227e-05
279680/283158 [============================>.] - ETA: 1s - loss: 1.4220e-05
279808/283158 [============================>.] - ETA: 1s - loss: 1.4214e-05
279936/283158 [============================>.] - ETA: 1s - loss: 1.4207e-05
280064/283158 [============================>.] - ETA: 1s - loss: 1.4201e-05
280192/283158 [============================>.] - ETA: 1s - loss: 1.4194e-05
280320/283158 [============================>.] - ETA: 1s - loss: 1.4188e-05
280448/283158 [============================>.] - ETA: 1s - loss: 1.4181e-05
280576/283158 [============================>.] - ETA: 1s - loss: 1.4175e-05
280704/283158 [============================>.] - ETA: 1s - loss: 1.4168e-05
280832/283158 [============================>.] - ETA: 1s - loss: 1.4162e-05
280960/283158 [============================>.] - ETA: 1s - loss: 1.4155e-05
281088/283158 [============================>.] - ETA: 1s - loss: 1.4149e-05
281216/283158 [============================>.] - ETA: 1s - loss: 1.4142e-05
281344/283158 [============================>.] - ETA: 0s - loss: 1.4136e-05
281472/283158 [============================>.] - ETA: 0s - loss: 1.4129e-05
281600/283158 [============================>.] - ETA: 0s - loss: 1.4123e-05
281728/283158 [============================>.] - ETA: 0s - loss: 1.4117e-05
281856/283158 [============================>.] - ETA: 0s - loss: 1.4110e-05
281984/283158 [============================>.] - ETA: 0s - loss: 1.4104e-05
282112/283158 [============================>.] - ETA: 0s - loss: 1.4097e-05
282240/283158 [============================>.] - ETA: 0s - loss: 1.4091e-05
282368/283158 [============================>.] - ETA: 0s - loss: 1.4085e-05
282496/283158 [============================>.] - ETA: 0s - loss: 1.4078e-05
282624/283158 [============================>.] - ETA: 0s - loss: 1.4072e-05
282752/283158 [============================>.] - ETA: 0s - loss: 1.4066e-05
282880/283158 [============================>.] - ETA: 0s - loss: 1.4059e-05
283008/283158 [============================>.] - ETA: 0s - loss: 1.4053e-05
283136/283158 [============================>.] - ETA: 0s - loss: 1.4046e-05
Epoch 00000: loss improved from inf to 0.00001, saving model to models/weights-improvement-00-0.0000-embeddings.hdf5

283158/283158 [==============================] - 154s - loss: 1.4045e-05   

----- diversity: 0.2
----- Generating with seed: "riminal and responsible for not having h"
riminal and responsible for not having h
Traceback (most recent call last):
  File "/root/PycharmProjects/Keras/sandbox.py", line 292, in <module>
    train()
  File "/root/PycharmProjects/Keras/sandbox.py", line 246, in train
    next_index = sample(preds, diversity)
  File "/root/PycharmProjects/Keras/sandbox.py", line 202, in sample
    probas = np.random.multinomial(1, preds, 1)
  File "mtrand.pyx", line 4612, in mtrand.RandomState.multinomial
TypeError: object of type 'numpy.float64' has no len()


Here's my code:

'''Example script to generate text from Nietzsche's writings.

At least 20 epochs are required before the generated text
starts sounding coherent.

It is recommended to run this script on GPU, as recurrent
networks are quite computationally intensive.

If you try this script on new data, make sure your corpus
has at least ~100k characters. ~1M is better.
'''

from keras.models import Sequential, load_model
from keras.layers import Dense, Activation
from keras.layers import LSTM, SimpleRNN, Dropout, Bidirectional, Embedding, GRU
from keras.callbacks import ModelCheckpoint
from keras.optimizers import RMSprop
from keras.utils.data_utils import get_file
import numpy as np
import random
import sys
import pprint

embeddings_path = "glove.840B.300d-char.txt"
embedding_dim = 300


text_file_kapital = open("Das_Kapital.txt", 'rb')

def preprocess(text_file):

    ##get rid of line breaks and non-ASCII
    lines = []
    for line in text_file:
        line = line.strip().lower()
        line = line.decode("ascii", "ignore")
        if len(line) == 0:
            continue
        lines.append(line)
    text_file.close()
    text = " ".join(lines)
    return text

text = preprocess(text_file_kapital)


chars = sorted(list(set(text)))
print('total chars:', len(chars))
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))

# cut the text in semi-redundant sequences of maxlen characters
maxlen = 40
step = 10
sentences = []
next_chars = []
for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i: i + maxlen])
    next_chars.append(text[i + maxlen])
print('nb sequences:', len(sentences))

print('Vectorization...')

#working
# X = np.zeros((len(sentences), maxlen, len(chars)), dtype=np.bool)
# y = np.zeros((len(sentences), len(chars)), dtype=np.bool)
# for i, sentence in enumerate(sentences):
#     for t, char in enumerate(sentence):
#         X[i, t, char_indices[char]] = 1
#     y[i, char_indices[next_chars[i]]] = 1

X = np.zeros((len(sentences), maxlen), dtype=np.int)
y = np.zeros((len(sentences), len(chars)), dtype=np.bool)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        X[i, t] = char_indices[char]
y[i, char_indices[next_chars[i]]] = 1

print('Processing pretrained character embeds...')
embedding_vectors = {}
with open(embeddings_path, 'r') as f:
    for line in f:
        line_split = line.strip().split(" ")
        vec = np.array(line_split[1:], dtype=float)
        char = line_split[0]
        embedding_vectors[char] = vec

embedding_matrix = np.zeros((len(chars), 300))
#embedding_matrix = np.random.uniform(-1, 1, (len(chars), 300))
for char, i in char_indices.items():
    #print ("{}, {}".format(char, i))
    embedding_vector = embedding_vectors.get(char)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector



# build the model: a bidirectional GRU

def build_model():

    ## working
    # print('Build model...')
    # model = Sequential()
    # #model.add(Embedding(len(chars), embedding_dim, input_length=maxlen, weights=[embedding_matrix]))
    # model.add(Bidirectional(GRU(128, unroll=True, return_sequences=True), input_shape=(maxlen, len(chars))))
    # model.add(Dropout(0.2))
    # model.add(Bidirectional(GRU(128, unroll=True)))
    # model.add(Dropout(0.2))
    # model.add(Dense(len(chars)))
    # model.add(Activation('softmax'))
    #
    # model.compile(loss='categorical_crossentropy', optimizer="RMSprop")
    # return model

    print('Build model...')
    model = Sequential()
    model.add(Embedding(len(chars), embedding_dim, input_length=maxlen, weights=[embedding_matrix]))
    model.add(Bidirectional(GRU(32, unroll=True), input_shape=(maxlen, len(chars))))
    model.add(Dropout(0.2))
    # model.add(Bidirectional(GRU(128, unroll=True)))
    # model.add(Dropout(0.2))
    model.add(Dense(len(chars)))
    model.add(Activation('softmax'))

    model.compile(loss='categorical_crossentropy', optimizer="RMSprop")
    return model


#model = load_model("models/weights-improvement-00-1.2807-biggggger.hdf5")

model = build_model()
model.summary()

def sample(preds, temperature=1.0):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds + 1e-6) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)

# train the model, output generated text after each iteration

def train():
    for iteration in range(1, 60):
        filepath = "models/weights-improvement-{epoch:02d}-{loss:.4f}-embeddings.hdf5"
        checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
        callbacks_list = [checkpoint]
        print()
        print('-' * 50)
        print('Iteration', iteration)
        model.fit(X, y,
                  batch_size=128,
                  epochs=1, callbacks = callbacks_list)

        start_index = random.randint(0, len(text) - maxlen - 1)
        #model.save("models/testmodel.h5")

        for diversity in [0.2, 0.5, 1.0, 1.2]:
            print()
            print('----- diversity:', diversity)

            generated = ''
            sentence = text[start_index: start_index + maxlen]
            generated += sentence
            print('----- Generating with seed: "' + sentence + '"')
            sys.stdout.write(generated)

            for i in range(400):
                # x = np.zeros((1, maxlen, len(chars)))
                # for t, char in enumerate(sentence):
                #     x[0, t, char_indices[char]] = 1.
                #
                # preds = model.predict(x, verbose=0)[0]
                # next_index = sample(preds, diversity)
                # next_char = indices_char[next_index]

                x = np.zeros((1, maxlen), dtype=np.int)
                for t, char in enumerate(sentence):
                    x[0, t] = char_indices[char]

                preds = model.predict(x, verbose=0)[0][0]
                next_index = sample(preds, diversity)
                next_char = indices_char[next_index]

                generated += next_char
                sentence = sentence[1:] + next_char

                sys.stdout.write(next_char)
                sys.stdout.flush()
            print()

def predict(num_to_predict, temperature, seed):
    start_index = random.randint(0, len(text) - maxlen - 1)

    if len(seed) > 40:
        print("Type fewer characters, you typed this man characters")
        print(len(seed))
        return 0
    newstring = ""
    space = 40 - len(seed)
    for i in range(space):
        newstring += " "
    seed = newstring + seed
    sentence = seed
    generated = ''
    #sentence = text[start_index: start_index + maxlen]
    generated += sentence
    print('----- Hey Marx, Tell me about:  "' + sentence.strip() + '"')
    sys.stdout.write(generated)

    for i in range(num_to_predict):
        x = np.zeros((1, maxlen, len(chars)))
        for t, char in enumerate(sentence):
            x[0, t, char_indices[char]] = 1.

        preds = model.predict(x, verbose=0)[0]
        next_index = sample(preds, temperature)
        next_char = indices_char[next_index]

        generated += next_char
        sentence = sentence[1:] + next_char

        #sys.stdout.write(next_char)
        #sys.stdout.flush()
    print()
    pprint.pprint(generated)

train()

# ################# PREDICT
# cont = 0
#
# while cont == 0:
#     newtext = input("What do you want me to tell you about?")
#
#     if newtext == "1":
#         cont = 1
#     predict(500, 0.5, newtext.lower())
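
For context on the traceback above: np.random.multinomial expects preds to be a 1-D probability vector over the characters, so the "has no len()" error is consistent with sample() receiving a single float. In the code above, model.predict(x) returns an array of shape (1, len(chars)), so indexing it with [0][0] yields one scalar probability. Here is a hedged sketch of the prediction step that keeps the whole distribution (it reuses the variables defined in the script above and is not a verified fix):

x = np.zeros((1, maxlen), dtype=np.int)
for t, char in enumerate(sentence):
    x[0, t] = char_indices[char]

preds = model.predict(x, verbose=0)[0]   # 1-D vector of length len(chars)
next_index = sample(preds, diversity)    # multinomial now receives a vector
next_char = indices_char[next_index]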

TypeError: Cannot convert -0.05 to EagerTensor of dtype int32

@minimaxir I am trying to do text classification with spam messages.

While running the code provided by @minimaxir (with some slight modifications), I faced this error:

TypeError: Cannot convert -0.05 to EagerTensor of dtype int32

Here is my code:

from __future__ import print_function
from keras.models import Sequential, Model, load_model
from keras.layers import Dense, Activation, Embedding
from keras.layers import LSTM, Input
from keras.optimizers import RMSprop, Adam
from keras.utils.data_utils import get_file
from tensorflow.keras.layers import BatchNormalization
from keras.callbacks import Callback, ModelCheckpoint
from sklearn.decomposition import PCA
import numpy as np
import random
import sys
import csv
import os
import h5py
import pandas as pd
import re
from nltk.corpus import stopwords
from nltk.stem.snowball import PorterStemmer
from sklearn import preprocessing


def preprocess_text(sen):
    # Remove punctuations and numbers
    sentence = re.sub('[^a-zA-Z]', ' ', sen)

    # Single character removal
    sentence = re.sub(r"\s+[a-zA-Z]\s+", ' ', sentence)

    # Removing multiple spaces
    sentence = re.sub(r'\s+', ' ', sentence)

    stops = stopwords.words('english')
    #print(stops)
    porter = PorterStemmer()
    for word in sentence.split():
        if word in stops:
            sentence = sentence.replace(word, '')
        sentence = sentence.replace(word, porter.stem(word))
    return sentence.lower()


maxlen = 40  # must match the sequence length the model was trained with
num_char_generated = 30000
df = pd.read_csv('spam.csv')
df['Message'] = df['Message'].apply(preprocess_text)
df.head(10)
mes = []
for i in df['Message']:
    mes.append(i.split())
for m in mes:
    chars = sorted(list(set(m)))
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))


def sample(preds, temperature=1.0):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds + 1e-6) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)


print('Loading model...')
model = load_model('model.hdf5')
f2 = open('text_sample.txt', 'w')

start_index = random.randint(0, len(text) - maxlen - 1)
print("He")
for diversity in [0.2, 0.5, 1.0, 1.2]:
    print()
    print('----- diversity:', diversity)
    f2.write('----- diversity:' + ' ' + str(diversity) + '\n')

    generated = ''
    sentence = text[start_index: start_index + maxlen]
    generated += sentence
    print('----- Generating with seed: "' + sentence + '"')
    f2.write('----- Generating with seed: "' + sentence + '"' + '\n---\n')
    sys.stdout.write(generated)

    for i in range(num_char_generated):
        x = np.zeros((1, maxlen), dtype=np.int)
        for t, char in enumerate(sentence):
            x[0, t] = char_indices[char]

        preds = model.predict(x, verbose=0)[0][0]
        next_index = sample(preds, diversity)
        next_char = indices_char[next_index]
        generated += next_char
        sentence = sentence[1:] + next_char

        sys.stdout.write(next_char)
        sys.stdout.flush()
f2.write(generated + '\n')
print("H")
f2.close()
