tca19 / phd-thesis
My PhD thesis with all its source files, including all .tex files and images created, as well as the slides of my defense.
Improving methods to learn word representations
===============================================
for efficient semantic similarities computations
================================================

ABOUT

  This repository contains the PhD thesis of Julien Tissier, entitled
  "Improving methods to learn word representations for efficient semantic
  similarities computations".

  It also contains all the source materials used to produce the thesis,
  including the LaTeX .tex source files, the images and their respective
  source files to generate or modify them (either the LibreOffice Draw
  source or the Python code) and the slides of the PhD defense.

CONTENT

  This repository is composed of:

  - chapters/: this folder contains all the chapters of the thesis, as
    ".tex" source files. There are 10 chapters (from 00-introduction.tex
    to 09-software.tex), a cover page (000-garde.tex) and the
    bibliography (99-bibliography.bib).

  - images/: this folder contains all the images used in the thesis
    (i.e. with the \includegraphics{} command in the .tex files), either
    as PNG or PDF.

  - images-code/: this folder contains the Python code used to generate
    some plots or illustration images of the thesis with the Matplotlib
    library.

  - images-src/: this folder contains the source files of some
    illustration images used in the thesis, as LibreOffice Draw files
    (.odg).

  - PhD-Defense-Julien-Tissier.pdf: the defense presentation as PDF,
    48 slides.

  - PhD-Thesis-Julien-Tissier.pdf: the thesis as PDF, 127 pages.

  - makefile: used to generate the thesis from source files. Use the
    command `make` at the root of this repository to produce it. You
    will need the following tools: make, pdflatex and bibtex.

  - phd-thesis.tex: the main .tex file, containing all the LaTeX
    packages to load and the different chapters to include.

SUMMARY

  Many natural language processing applications rely on word embeddings
  (also called word representations) to achieve state-of-the-art
  results. These numerical representations of the language should encode
  both syntactic and semantic information to perform well in downstream
  tasks. However, common models (word2vec, GloVe) learn them from
  generic corpora like Wikipedia, so the resulting embeddings lack
  specific semantic information. Moreover, storing them requires a large
  amount of memory because the number of representations to save can be
  in the order of a million. The topic of my thesis is to develop new
  learning algorithms that both improve the semantic information encoded
  within the representations and reduce the memory space they require
  for storage and for their application in NLP tasks.

  The first part of my work is to improve the semantic information
  contained in word embeddings. I developed dict2vec, a model that uses
  additional information from online lexical dictionaries when learning
  word representations. The dict2vec word embeddings perform ∼15% better
  than the embeddings learned by other models on word semantic
  similarity tasks.

  The second part of my work is to reduce the memory size of the
  embeddings. I developed an architecture based on an autoencoder to
  transform commonly used real-valued embeddings into binary embeddings,
  reducing their size in memory by 97% with a loss of only ∼2% in
  accuracy in downstream NLP tasks.

AUTHOR

  Written by Julien Tissier <[email protected]>.

COPYRIGHT

  This thesis and all the files in this repository are licensed under
  the "Creative Commons Attribution-NonCommercial-ShareAlike 4.0
  International Public License". By using or downloading this
  repository, you agree to:

  1. NonCommercial - You may not use the material for commercial
     purposes.

  2. Attribution - You must give appropriate credit, provide a link to
     the licensor, and indicate if changes were made. You may do so in
     any reasonable manner, but not in any way that suggests the
     licensor endorses you or your use.

  3. ShareAlike - If you remix, transform, or build upon the material,
     you must distribute your contributions under the same license as
     the original.

  4. No additional restrictions - You may not apply legal terms or
     technological measures that legally restrict others from doing
     anything the license permits.

  For more details, see
  https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode