The coderecommendations from endruk

coderecommendations's Issues

Share Weights Embedding

Use the same embedding weights for the encoder and the decoder.
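A minimal PyTorch sketch of one way to do this, assuming the encoder and decoder use the same vocabulary. The class name TinySeq2Seq, the GRU layers and all sizes are hypothetical and not taken from the repository; the only point is that a single nn.Embedding module serves both sides.

```python
import torch
import torch.nn as nn


class TinySeq2Seq(nn.Module):
    """Toy encoder/decoder pair sharing one embedding table (hypothetical model)."""

    def __init__(self, vocab_size: int, emb_dim: int, hidden_dim: int):
        super().__init__()
        # One embedding module used by both sides, so only one weight matrix exists.
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.decoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, src: torch.Tensor, tgt: torch.Tensor) -> torch.Tensor:
        # Encoder and decoder inputs go through the same embedding lookup.
        _, hidden = self.encoder(self.embedding(src))
        dec_out, _ = self.decoder(self.embedding(tgt), hidden)
        return self.out(dec_out)


model = TinySeq2Seq(vocab_size=50, emb_dim=8, hidden_dim=16)
src = torch.randint(0, 50, (2, 7))  # batch of 2 source sequences, length 7
tgt = torch.randint(0, 50, (2, 5))  # batch of 2 target sequences, length 5
logits = model(src, tgt)
```

Because the module is registered once, the optimizer sees a single embedding parameter, and gradients from both the encoder and the decoder accumulate into it.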

Split large x into chunks

To also train on very long input sequences, I can split each input sequence into chunks and run the encoder over the chunks in multiple iterations.

[index] * 80000 -> x * [[index] * z]
where x stands for the number of encoder iterations and z for the length of each chunk.
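The mapping above can be sketched in plain Python. The helper name split_into_chunks and the chunk length of 1000 are assumptions for illustration; how the encoder state is carried from one chunk to the next is not covered here.

```python
from typing import List, Sequence


def split_into_chunks(indices: Sequence[int], chunk_len: int) -> List[List[int]]:
    """Split one index sequence into consecutive chunks of at most chunk_len items."""
    if chunk_len <= 0:
        raise ValueError("chunk_len must be positive")
    return [list(indices[i:i + chunk_len]) for i in range(0, len(indices), chunk_len)]


sequence = list(range(80000))                # stands in for [index] * 80000
chunks = split_into_chunks(sequence, 1000)   # z = 1000
num_iterations = len(chunks)                 # x = number of encoder iterations
```

If the sequence length is not a multiple of z, the last chunk is simply shorter and may need padding before it is batched.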

Ignore batches that are too long

Because batches differ in their time_lengths, some batches may be too long to fit into the memory of a single GPU.
Implement a mechanism that skips those batches during training.
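One possible mechanism is a generator that filters batches before they reach the training step. This is only a sketch: it assumes a batch is a list of index sequences, and the name filter_batches and the threshold max_time_length are hypothetical. The threshold would have to be found empirically for the GPU in use.

```python
from typing import Iterable, Iterator, List, Sequence

Batch = List[Sequence[int]]


def filter_batches(batches: Iterable[Batch], max_time_length: int) -> Iterator[Batch]:
    """Yield only batches whose longest sequence fits within max_time_length."""
    for batch in batches:
        longest = max((len(seq) for seq in batch), default=0)
        if longest > max_time_length:
            # Skip this batch instead of risking an out-of-memory error.
            continue
        yield batch


all_batches = [
    [[1, 2, 3], [4, 5]],        # longest sequence: 3
    [[1] * 500, [2] * 20],      # longest sequence: 500, gets skipped
    [[7, 8, 9, 10]],            # longest sequence: 4
]
usable = list(filter_batches(all_batches, max_time_length=100))
```

Logging how many batches get skipped helps to check that the threshold does not silently remove a large share of the training data.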

Multiprocessing Vocab Creation Bug

The vocab creation terminates with this error:

2019-02-06 13:17:46,482 [INFO] sizes of the dataset:
2019-02-06 13:17:46,483 [INFO] training: 197416
2019-02-06 13:17:46,483 [INFO] validation: 29612
2019-02-06 13:17:46,483 [INFO] testing: 19742
2019-02-06 13:17:46,498 [DEBUG] start vocab creation
Traceback (most recent call last):
  File "./run_training.py", line 58, in <module>
    dataset.build_vocab(top_k=top_k, num_processes=num_processes)
  File "/mnt/raid/data/karge/Github/CodeRecommendations/Seq2Seq_Pytorch_test/Data/dataset.py", line 121, in build_vocab
    result = process.get()
  File "/usr/lib/python3.5/multiprocessing/pool.py", line 608, in get
    raise self._value
  File "/usr/lib/python3.5/multiprocessing/pool.py", line 385, in _handle_tasks
    put(task)
  File "/usr/lib/python3.5/multiprocessing/connection.py", line 206, in send
    self._send_bytes(ForkingPickler.dumps(obj))
  File "/usr/lib/python3.5/multiprocessing/connection.py", line 393, in _send_bytes
    header = struct.pack("!i", n)
struct.error: 'i' format requires -2147483648 <= number <= 2147483647

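This indicates that passing the pandas dataframe to the worker processes does not work: the pickled object is larger than the signed 32-bit length header ("!i") in the traceback can describe, i.e. more than 2147483647 bytes (about 2 GiB).
Size of the CSV file: 15 GB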
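The limit can be reproduced in isolation with the same struct.pack call that appears in the traceback. This snippet only illustrates the arithmetic and does not involve multiprocessing or pandas.

```python
import struct

INT32_MAX = 2**31 - 1  # 2147483647, the largest length a "!i" header can hold

# A payload length at the limit still packs into a 4-byte header.
header = struct.pack("!i", INT32_MAX)

# One byte more no longer fits, which is the failure seen in the traceback.
try:
    struct.pack("!i", INT32_MAX + 1)
    overflowed = False
except struct.error:
    overflowed = True
```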

There is a Stack Overflow thread on this issue:
https://stackoverflow.com/questions/47776486/python-struct-error-i-format-requires-2147483648-number-2147483647

So I have to rework the multiprocessing vocab creation.
For now I will stick with single-process vocab creation.
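One possible direction for the rework, sketched under assumptions: instead of sending the whole dataframe to a worker, send small chunks of text lines and merge the partial counts in the parent process, so that every pickled task stays far below the 2 GiB limit. The function names, the whitespace tokenization and the chunk size are hypothetical; only the top_k and num_processes parameters mirror the build_vocab call in the traceback.

```python
from collections import Counter
from multiprocessing import Pool
from typing import Iterable, Iterator, List


def count_tokens(lines: List[str]) -> Counter:
    """Count whitespace-separated tokens in one chunk of lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def iter_chunks(lines: Iterable[str], chunk_size: int) -> Iterator[List[str]]:
    """Group an iterable of lines into lists of at most chunk_size lines."""
    chunk = []
    for line in lines:
        chunk.append(line)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


def build_vocab(lines: Iterable[str], top_k: int,
                num_processes: int = 1, chunk_size: int = 10000) -> List[str]:
    """Return the top_k most frequent tokens, counting chunks serially or in a pool."""
    total = Counter()
    chunks = iter_chunks(lines, chunk_size)
    if num_processes > 1:
        # Each task carries only one small chunk, not the full dataset.
        with Pool(num_processes) as pool:
            for partial in pool.imap_unordered(count_tokens, chunks):
                total.update(partial)
    else:
        for chunk in chunks:
            total.update(count_tokens(chunk))
    return [token for token, _ in total.most_common(top_k)]


sample_lines = ["a b a", "b a c"]
vocab = build_vocab(sample_lines, top_k=2, chunk_size=1)  # serial run
```

When num_processes is greater than 1, the call should sit under an `if __name__ == "__main__":` guard so that it also works with the spawn start method.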

Add features to dataset
Add features to models
Test models

Add validation test generation

At the end of a validation iteration, run a test generation: take one element of the validation set, generate an output from it, and print the result.
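A minimal sketch of such a hook, assuming the validation set is a sequence of (source, target) pairs and that the model's decoding step is available as a callable. The names print_sample_generation and generate_fn are hypothetical, and the lambda used below is only a stand-in for a real model.

```python
import random
from typing import Callable, List, Optional, Sequence, Tuple

Pair = Tuple[List[int], List[int]]


def print_sample_generation(validation_set: Sequence[Pair],
                            generate_fn: Callable[[List[int]], List[int]],
                            rng: Optional[random.Random] = None) -> List[int]:
    """Pick one validation element, generate an output for it and print the result."""
    rng = rng or random.Random()
    source, target = rng.choice(validation_set)
    generated = generate_fn(source)
    print("source:   ", source)
    print("target:   ", target)
    print("generated:", generated)
    return generated


# Stand-in for a trained model: simply reverses the input sequence.
validation_pairs = [([1, 2, 3], [3, 2, 1])]
sample = print_sample_generation(validation_pairs,
                                 lambda seq: list(reversed(seq)),
                                 rng=random.Random(0))
```

Passing a seeded random.Random makes the printed sample reproducible between runs, which helps when comparing outputs across epochs.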

coderecommendations's Issues

Share Weights Embedding

Split large x into chunks

Ignore batches that are too long

Multiprocessing Vocab Creation Bug

Build generation workflow

Attention Network

Feature Design

Add validation test generation
