
ai_book's Introduction

Yeonwoo Sung

Yeonwoo's GitHub stats

Skills

📚 Tech Stack 📚


Kaggle

Kaggle Competition Expert

Education

BSc, University of St Andrews (2016 - 2020)

NVIDIA, Deep Learning NLP certificate (2019)

NVIDIA, Rapid Application Development with Large Language Models (2024)

Coursera - Certification for the "Machine Learning" course (2018)

Awards

  • Dean's List for academic excellence in the academic year 2019/20
  • Medal for performance in Programming Projects module (2018)

ai_book's People

Contributors

dependabot[bot], trellixvulnteam, yeonwoosung


ai_book's Issues

Text Embeddings Reveal (Almost) As Much As Text

paper, code

Abstract

How much private information do text embeddings reveal about the original text? We investigate the problem of embedding *inversion*, reconstructing the full text represented in dense text embeddings. We frame the problem as controlled generation: generating text that, when reembedded, is close to a fixed point in latent space. We find that although a naïve model conditioned on the embedding performs poorly, a multi-step method that iteratively corrects and re-embeds text is able to recover 92% of 32-token text inputs exactly. We train our model to decode text embeddings from two state-of-the-art embedding models, and also show that our model can recover important personal information (full names) from a dataset of clinical notes.

Personal Thoughts

Reconstructing the original text from its embeddings can be considered an AI-based vulnerability, since it may cause unintended privacy leakage.

แ„‰แ…ณแ„แ…ณแ„…แ…ตแ†ซแ„‰แ…ฃแ†บ 2023-12-17 แ„‹แ…ฉแ„’แ…ฎ 9 49 25

The paper states that it is possible to decrease the recovery rate of Vec2Text by adding Gaussian noise directly to each embedding.

As the chart above shows, there is a point where we can maximize the gap between the Vec2Text recovery percentage and the retrieval performance (i.e., keep the retrieval performance while dropping the recovery probability drastically).

This reminds me that adding the "right" amount of noise to embeddings can actually improve AI-based systems!
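Below is a minimal sketch (my own illustration, not the paper's code) of this defense: add isotropic Gaussian noise to each embedding and check that similarity-based retrieval is largely preserved for a small noise scale. The `sigma` knob is a hypothetical parameter to tune.

```python
# Minimal sketch: Gaussian noise on embeddings as a defense against inversion.
import numpy as np

def add_gaussian_noise(embeddings: np.ndarray, sigma: float = 0.01) -> np.ndarray:
    """Add isotropic Gaussian noise to each embedding vector."""
    noise = np.random.normal(loc=0.0, scale=sigma, size=embeddings.shape)
    return embeddings + noise

# Toy check: cosine similarity between clean and noised embeddings stays high
# for small sigma, so nearest-neighbour retrieval is largely unaffected.
emb = np.random.randn(100, 768)
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
noised = add_gaussian_noise(emb, sigma=0.01)
cos = np.sum(emb * noised, axis=1) / (
    np.linalg.norm(emb, axis=1) * np.linalg.norm(noised, axis=1)
)
print(f"mean cosine(clean, noised) = {cos.mean():.4f}")
```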

Radioactive data: tracing through training

Abstract

  • Neural classifiers can improve their performance by training on more data
  • But given a trained classifier, it's difficult to tell what data it was trained on
  • This is especially relevant if you have proprietary or personal data and you want to make sure that other people don't use it to train their models
  • Train a CNN with vanilla data first, then train the CNN with radioactive data (i.e., subtly distorted images)

Basically, this paper introduces a method to mark a dataset with a hidden "radioactive" tag, such that any resulting classifier will clearly exhibit this tag, which can be detected.

Details

fig2

  • When you radioactively mark the data points, it simply adds a feature

    • Let's assume that there are 10 classes for our problem
      • We can imagine that there are 10 vectors (each vector is a unique axis that depicts a corresponding class)
      • The classifier plots the point in that data space and finds the class whose vector it is most aligned with
    • As another example, suppose an image classification problem with 4 classes.
      • In this case, the classifier has 4 axis vectors (w, x, y, z).
      • Strictly speaking, these are not fixed axes but vectors learned through training.
      • When this classifier classifies a data point, it places the point in the data space and assigns the class of the nearest vector.
  • So, here, we are introducing a fake class vector

    • Clearly, this is cheating!
  • By using this method, we modify the training data only

  • We give up a little bit of generalization capability, but this forces the model to pay attention to the fake (radioactive) features
    - This is something that you can later detect

  • For testing, they create random vectors in the augmented data space and compute the cosine similarity between the fake vector and each random vector (see the sketch below)
    - The authors state that if the data is marked well, then theoretically the distribution of the cosine between the fake vector and random vectors should follow a known distribution

distribution

  • The paper also shows the methods for re-aligning the feature spaces
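Here is a minimal sketch (my own illustration, not the authors' code) of the cosine-based detection test described above: in high dimension, the cosine between a fixed "carrier" direction and random directions concentrates around zero, so an unusually large cosine with a classifier's feature direction is evidence that the radioactive mark was learned.

```python
# Minimal sketch of cosine-based detection of a "radioactive" carrier direction.
import numpy as np

rng = np.random.default_rng(0)
dim = 512

def unit(v: np.ndarray) -> np.ndarray:
    return v / np.linalg.norm(v)

carrier = unit(rng.normal(size=dim))          # the radioactive (fake) direction
random_dirs = rng.normal(size=(10000, dim))   # null model: random feature directions
random_dirs /= np.linalg.norm(random_dirs, axis=1, keepdims=True)

null_cosines = random_dirs @ carrier          # reference distribution, ~0 +/- 1/sqrt(dim)

# Hypothetical classifier direction that partially absorbed the carrier.
suspect = unit(0.3 * carrier + rng.normal(size=dim))
print("null cosine std:", null_cosines.std())
print("suspect cosine :", float(suspect @ carrier))
```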

Personal Thoughts

Clearly, data is the modern gold. Neural classifiers can improve their performance by training on more data, but given a trained classifier, it's difficult to tell what data it was trained on. This is especially relevant if you have proprietary or personal data and you want to make sure that other people don't use it to train their models. This paper introduces a method to mark a dataset with a hidden "radioactive" tag, such that any resulting classifier will clearly exhibit this tag, which can be detected.

Modeling Recurrence for Transformer

Abstract

  • propose an additional "Attentive Recurrent Network (ARN)" alongside the Transformer encoder to leverage the strengths of both attention and recurrent networks
  • experiments on WMT14 En-De and WMT17 Zh-En demonstrate the effectiveness
  • the study reveals that a short-cut bridge with a shallow ARN outperforms its deep counterpart

Details

Main Approach

  • add an additional recurrent encoder on the source side

fig2

  • the recurrent model can be either (a) a simple RNN/GRU/LSTM, or (b) an Attentive Recurrent Network (ARN), where the context representation is generated via attention conditioned on the previous hidden state (see the sketch below)
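As a rough illustration of the ARN idea (my own reading, not the authors' code), one recurrent step could look like the following: the previous hidden state queries the encoder memory via attention, and the resulting context drives a GRU cell. The single-head dot-product attention and the tensor shapes are simplifying assumptions.

```python
# Minimal sketch of one "attentive recurrent" step.
import torch
import torch.nn as nn

class AttentiveRecurrentStep(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.query = nn.Linear(d_hidden, d_model)
        self.cell = nn.GRUCell(d_model, d_hidden)

    def forward(self, memory: torch.Tensor, h_prev: torch.Tensor) -> torch.Tensor:
        # memory: (batch, src_len, d_model), h_prev: (batch, d_hidden)
        q = self.query(h_prev).unsqueeze(1)              # (batch, 1, d_model)
        scores = torch.bmm(q, memory.transpose(1, 2))    # (batch, 1, src_len)
        attn = torch.softmax(scores, dim=-1)
        context = torch.bmm(attn, memory).squeeze(1)     # (batch, d_model)
        return self.cell(context, h_prev)                # next hidden state

# Usage: run a few ARN steps over the Transformer encoder output.
mem = torch.randn(2, 7, 512)
step = AttentiveRecurrentStep(d_model=512, d_hidden=256)
h = torch.zeros(2, 256)
for _ in range(8):  # the ablation below suggests ~8 recurrent steps
    h = step(mem, h)
print(h.shape)  # torch.Size([2, 256])
```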

Impact of Components

  • ablation study on the size of the additional recurrent encoder
    • a smaller BiARN encoder attached directly to the top of the decoder outperforms all others

table1

  • ablation study on number of recurrent steps in ARN
    • ~8 seems optimal

fig5

  • ablation study on how to integrate representation in the decoder side
    • stack on top outperformed all others

table2

fig4

Overall Result

  • with additional ARN encoder, BLEU scores improve with statistical significance

table3

Linguistic Analysis

  • what linguistic characteristics are models learning?
    • 1-Layer BiARN performs better on all syntactic and some semantic tasks
  • List of Linguistic Characteristics
    • SeLen : sentence length
    • WC : recover original words given its source embedding
    • TrDep : check whether encoder infers the hierarchical structure of sentences
    • ToCo : classify in terms of the sequence of top constituents
    • BShif : tests whether two consecutive tokens are inverted
    • Tense : predict tense of the main-clause verb
    • SubN : number of main-clause subjects
    • ObjN : number of the direct object of the main clause
    • SoMo : check whether some sentences are modified by replacing a random noun or verb
    • CoIn : two coordinate clauses with half the sentence inverted

table5

Personal Thoughts

  • Translation requires a complicated encoding function on the source side. The strengths of attention, RNNs, and CNNs can complement each other to produce a richer representation
  • This paper showed that there is still some room for improvement by letting an RNN encoder play a part alongside the Transformer encoder via the short-cut trick

Link: https://arxiv.org/pdf/1904.03092v1.pdf
Authors: Hao et al. 2019

Shortformer

O. Press et al. [1] challenge the conventional wisdom that scaling transformer language models to longer sequences always improves results. They show that by initially training on shorter sub-sequences and then progressing to longer ones via staged training, they can improve perplexity and reduce training time.

They additionally define a new method, position-infused attention, that enables caching and efficiently attending to previously computed representations. This method does not require large input sub-sequences.

What Does BERT Look At?

paper

Abstract

Large pre-trained neural networks such as BERT have had great recent success in NLP, motivating a growing body of research investigating what aspects of language they are able to learn from unlabeled data. Most recent analysis has focused on model outputs (e.g., language model surprisal) or internal vector representations (e.g., probing classifiers). Complementary to these works, we propose methods for analyzing the attention mechanisms of pre-trained models and apply them to BERT. BERT's attention heads exhibit patterns such as attending to delimiter tokens, specific positional offsets, or broadly attending over the whole sentence, with heads in the same layer often exhibiting similar behaviors. We further show that certain attention heads correspond well to linguistic notions of syntax and coreference. For example, we find heads that attend to the direct objects of verbs, determiners of nouns, objects of prepositions, and coreferent mentions with remarkably high accuracy. Lastly, we propose an attention-based probing classifier and use it to further demonstrate that substantial syntactic information is captured in BERT's attention.

In short

If you visualize the attention maps of each BERT layer, you will find that heads in the first few layers mostly attend to the [CLS] token, whereas heads in the later layers mostly attend to the [SEP] token.

Most papers say that all following tokens attend to the [CLS] token in self-attention, so the output representation of the [CLS] token contains the semantic information of the input sentence.
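A quick way to check this yourself, assuming the Hugging Face transformers API (my own example, not from the paper): extract the per-layer attention maps and measure how much attention mass lands on [CLS] and [SEP].

```python
# Minimal sketch: how much each BERT layer attends to [CLS] vs. [SEP].
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)
model.eval()

inputs = tokenizer("The cat sat on the mat.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
cls_idx, sep_idx = tokens.index("[CLS]"), tokens.index("[SEP]")

# outputs.attentions: one tensor per layer, shape (batch, heads, seq_len, seq_len).
for layer, attn in enumerate(outputs.attentions):
    to_cls = attn[0, :, :, cls_idx].mean().item()  # avg attention mass on [CLS]
    to_sep = attn[0, :, :, sep_idx].mean().item()  # avg attention mass on [SEP]
    print(f"layer {layer:2d}: to [CLS] = {to_cls:.3f}, to [SEP] = {to_sep:.3f}")
```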

Few-Shot Learning with Class Imbalance

E. Triantafillou et al. [1] ran experiments on few-shot learning with class imbalance to see whether class imbalance actually impacts the performance of few-shot learning methods.

Results

  1. compared to the balanced task, performance on the class-imbalanced counterparts always drops, by up to 18.0% for optimization-based methods and up to 8.4% for metric-based methods

  2. contrary to popular belief, meta-learning algorithms, such as MAML, do not automatically learn to balance by being exposed to imbalanced tasks during (meta-)training time

  3. strategies used to mitigate imbalance in supervised learning, such as oversampling, can offer a stronger solution to the class imbalance problem (see the sketch after this list)

  4. the effect of imbalance at the meta-dataset level is less significant than the effect at the task level with similar imbalance magnitude.
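A minimal sketch (my own illustration, not the paper's code) of the random-oversampling strategy mentioned in point 3, applied to an imbalanced support set:

```python
# Minimal sketch: random oversampling of an imbalanced few-shot support set.
import random
from collections import defaultdict

def oversample(support_set):
    """support_set: list of (example, label) pairs; returns a class-balanced list."""
    by_class = defaultdict(list)
    for example, label in support_set:
        by_class[label].append(example)
    max_count = max(len(v) for v in by_class.values())
    balanced = []
    for label, examples in by_class.items():
        balanced.extend((x, label) for x in examples)
        extra = max_count - len(examples)
        balanced.extend((random.choice(examples), label) for _ in range(extra))
    return balanced

# Usage: a 3-way support set with counts 5 / 2 / 1 becomes 5 / 5 / 5.
support = [(f"img_{i}", "A") for i in range(5)] + \
          [(f"img_{i}", "B") for i in range(2)] + \
          [("img_0", "C")]
print(len(oversample(support)))  # 15
```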

Reference

[1] Eleni Triantafillou, Tyler Zhu, Vincent Dumoulin, Pascal Lamblin, Utku Evci, Kelvin Xu, Ross Goroshin, Carles Gelada, Kevin Swersky, Pierre-Antoine Manzagol, Hugo Larochelle. Few-Shot Learning with Class Imbalance

Model parallelism to use a huge model in Colab

Understanding Retrieval Augmentation for Long-Form Question Answering

Understanding Retrieval Augmentation for Long-Form Question Answering

paper

Summary (tl;dr)

Explores retrieval-augmented language models on long-form question answering; finds that retrieval is an important component but evidence documents should be added to the LLM carefully; finds that attribution errors happen more frequently when the retrieved documents lack sufficient information/evidence for answering the question.

Abstract

We present a study of retrieval-augmented language models (LMs) on long-form question answering. We analyze how retrieval augmentation impacts different LMs, by comparing answers generated from models while using the same evidence documents, and how differing quality of retrieval document set impacts the answers generated from the same LM. We study various attributes of generated answers (e.g., fluency, length, variance) with an emphasis on the attribution of generated long-form answers to in-context evidence documents. We collect human annotations of answer attribution and evaluate methods for automatically judging attribution. Our study provides new insights on how retrieval augmentation impacts long, knowledge-rich text generation of LMs. We further identify attribution patterns for long text generation and analyze the main culprits of attribution errors. Together, our analysis reveals how retrieval augmentation impacts long knowledge-rich text generation and provide directions for future work.

Understanding Black-box Predictions via Influence Functions

Abstract

  • Use influence function to trace a model's prediction back to its training data.
  • An approximation of the influence function that requires only gradients and Hessian-vector products provides valuable information
  • Useful in debugging models and detecting dataset errors

Details

  • Using the influence function, one can ask questions such as "What would the model parameters be if certain training data were missing or altered?" without re-training the whole model
  • Useful in detecting adversarial examples
  • Useful in fixing mislabeled examples by providing good candidate lists, although the boost is limited compared to simply listing examples with the highest training loss
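For reference, the influence of upweighting a training point z on the loss at a test point (the quantity the paper approximates with gradients and Hessian-vector products, so that the Hessian never has to be inverted explicitly) can be written as:

```latex
\mathcal{I}_{\text{up,loss}}(z, z_{\text{test}})
  = -\nabla_\theta L(z_{\text{test}}, \hat{\theta})^{\top}
     H_{\hat{\theta}}^{-1}
     \nabla_\theta L(z, \hat{\theta}),
\qquad
H_{\hat{\theta}} = \frac{1}{n}\sum_{i=1}^{n} \nabla_\theta^2 L(z_i, \hat{\theta})
```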

Personal Thoughts

  • Understanding neural networks is difficult because the usual theoretical assumptions do not hold in non-convex, data-dependent, ... environments.
  • Good approximation methods are always powerful and applicable

Link: https://arxiv.org/pdf/1703.04730.pdf
Authors: Pang Wei Koh(Stanford), Percy Liang(Stanford)

Scaling the Transformer

While the transformer represents a massive leap forward in modeling long-range dependency, the models we have seen so far are still fundamentally limited by the size of the input. Since the size of the dot-product matrix grows quadratically in the sequence length, this quickly becomes the bottleneck as we try to extend the length of the input sequence.

As the length of the input sequence grows, both the computation time and the memory of the self-attention mechanism grow quadratically because of the dot-product matrix.
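A small sketch (my own illustration) that makes the quadratic growth concrete: the attention score matrix Q Kᵀ has shape (seq_len, seq_len), so doubling the sequence length quadruples its memory footprint.

```python
# Minimal sketch: the (seq_len x seq_len) attention matrix is the bottleneck.
import torch

d_model = 64
for seq_len in (512, 1024, 2048, 4096):
    q = torch.randn(seq_len, d_model)
    k = torch.randn(seq_len, d_model)
    scores = q @ k.T                                   # (seq_len, seq_len) dot products
    attn = torch.softmax(scores / d_model**0.5, dim=-1)
    mbytes = attn.numel() * attn.element_size() / 2**20
    print(f"seq_len={seq_len:5d} -> attention matrix {tuple(attn.shape)} ~ {mbytes:.1f} MiB")
```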

ExecuTorch

ExecuTorch
ExecuTorch Runtime Overview

ExecuTorch is a PyTorch platform that provides infrastructure to run PyTorch programs everywhere from AR/VR wearables to standard on-device iOS and Android mobile deployments. One of the main goals for ExecuTorch is to enable wider customization and deployment capabilities of the PyTorch programs.

Mistral-7b, Zephyr-7b-alpha

Mistral-7b-v0.1, Zephyr-7b-alpha

  • Mistral-7b outperformed Llama2-13b-hf and gpt-3.5-turbo
  • Zephyr-7b-alpha outperformed mistral-7b, and beat Llama2-70b

DPO vs PPO (DPO is better for finetuning?)

  • Zephyr-7b-alpha is a finetuned model of the Mistral-7b with DPO trainer.
  • The Hugging Face team found that PPO is fragile with respect to hyperparameters, while DPO is robust to them
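For intuition, here is a minimal sketch of the DPO loss itself (my own illustration, not the Zephyr training code): it needs only log-probabilities of the chosen and rejected responses under the policy and a frozen reference model, with `beta` as the single main hyperparameter, which is part of why it is less hyperparameter-sensitive than PPO.

```python
# Minimal sketch of the DPO objective on one preference pair.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta: float = 0.1):
    """All inputs are per-example summed log-probs, shape (batch,)."""
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    logits = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(logits).mean()

# Toy usage with made-up log-probabilities.
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.0]))
print(loss.item())
```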

Using Fast Weights to Attend to the Recent Past

The authors [1] propose "fast weights", a type of attention mechanism over the recent past that performs multiple steps of computation between each hidden-state update of an RNN. The authors evaluate their architecture on various tasks that require short-term memory, arguing that the fast-weights mechanism frees the hidden state from having to memorize recent history, leaving it available for other types of computation.
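A minimal sketch of the fast-weights idea (my own reading of the mechanism, not the authors' code): the fast weight matrix is a decaying sum of outer products of recent hidden states, and a short inner loop lets it act like attention to the recent past between two ordinary RNN steps. Layer normalization and the exact gating of the paper are omitted.

```python
# Minimal sketch: fast weights as a decaying outer-product memory.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h = 8, 16
W = rng.normal(scale=0.1, size=(d_h, d_h))   # slow recurrent weights
C = rng.normal(scale=0.1, size=(d_h, d_in))  # slow input weights
lam, eta, inner_steps = 0.95, 0.5, 3

h = np.zeros(d_h)
A = np.zeros((d_h, d_h))                     # fast weights
for t in range(20):
    x = rng.normal(size=d_in)
    A = lam * A + eta * np.outer(h, h)       # fast-weight update from the recent past
    pre = W @ h + C @ x                      # slow-weight "boundary condition"
    hs = np.tanh(pre)
    for _ in range(inner_steps):             # inner loop attends to recent history via A
        hs = np.tanh(pre + A @ hs)
    h = hs
print(h[:4])
```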

Reference:

[1] Jimmy Ba, Geoffrey Hinton, Volodymyr Mnih, Joel Z. Leibo, Catalin Ionescu, Using Fast Weights to Attend to the Recent Past

์ง€์†์ ์ธ ํ•™์Šต์ด ์–ด๋ ค์šด ์ด์œ 

์น˜๋ช…์ ์ธ ๋ง๊ฐ (Catastrophic Forgetting)

ํ•˜๋‚˜์˜ ๋ฐ์ดํ„ฐ์…‹์œผ๋กœ ํ•™์Šต ํ›„์— ์ƒˆ๋กœ์šด ๋ฐ์ดํ„ฐ์…‹์œผ๋กœ ๋‹ค์‹œ ํ•™์Šต์„ ํ•˜๋ฉด ๊ธฐ์กด ํ•™์Šต ๋‚ด์šฉ์„ ๋ง๊ฐํ•˜๋Š” ํ˜„์ƒ.

์˜ˆ๋ฅผ ๋“ค์–ด, 100๋งŒ ๊ฐœ์˜ ๋ฐ์ดํ„ฐ๋กœ ํ•™์Šต์„ ํ•œ ํ›„์— 10๋งŒ ๊ฐœ์˜ ๋ฐ์ดํ„ฐ๋ฅผ ์ถ”๊ฐ€ ํ•™์Šต ์‹œํ‚ค๋ ค๋ฉด 110๋งŒ ๊ฐœ์˜ ์ „์ฒด ๋ฐ์ดํ„ฐ๋กœ ๋‹ค์‹œ ํ•™์Šต์„ ์‹œํ‚ค๋Š” ๊ฒƒ์ด ์ผ๋ฐ˜์ ์ธ ๋ฐฉ์‹์ด๋‹ค. ์ด๋ ‡๊ฒŒ ํ•ด์•ผ๋งŒ ์ถ”๊ฐ€ ๋ฐ์ดํ„ฐ๋ฅผ ํฌํ•จํ•œ ๊ธฐ์กด ์„ฑ๋Šฅ์ผ ์œ ์ง€๋  ์ˆ˜ ์žˆ๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค.

Can Active Memory Replace Attention?

Abstract

  • Yes for the case of soft attention: somewhat mixed result across tasks.
  • Active memory operates on all of the memory in parallel in a uniform way, bringing improvements in algorithmic tasks, image processing, and generative modeling.
  • Does active memory perform well in machine translation? [YES]

Details

Attention

  • Only a small part of the memory changes at every step, or the memory remains constant.
  • An important limitation of the attention mechanism is that it can only focus on a single element of the memory, due to the nature of softmax.

Active Memory

  • Any model where every part of the memory undergoes an active change at every step.

NMT with Neural GPU

  • parallel encoding and decoding
  • BLEU < 5
  • conditional dependence between outputs is not considered

NMT with Markovian Neural GPU

  • parallel encoding and 1-step conditioned decoding
  • BLEU < 5
  • Perhaps, Markovian dependence of the outputs is too weak for this problem - a full recurrent dependence of the state is needed for good performance

NMT with Extended Neural GPU

  • parallel encoding and sequential decoding
  • BLEU = 29.6 (WMT 14 En-Fr)
  • the active memory decoder (d) holds the recurrent state of decoding, and the output tape tensor (p) holds past decoded logits, both going through CGRU^d.

CGRU

  • a convolutional operation followed by a gated recurrent update
  • stacking CGRUs expands the receptive field of the convolution operation
  • the output tape tensor acts as an external memory of decoded logits
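A minimal sketch of a CGRU cell (my own reading, not the paper's code): a GRU-style gated update whose transforms are convolutions applied to the whole memory tensor, so every position is updated in parallel at each step.

```python
# Minimal sketch: a convolutional GRU (CGRU) update over a 2D memory tensor.
import torch
import torch.nn as nn

class CGRU(nn.Module):
    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        pad = kernel_size // 2
        self.update = nn.Conv2d(channels, channels, kernel_size, padding=pad)
        self.reset = nn.Conv2d(channels, channels, kernel_size, padding=pad)
        self.candidate = nn.Conv2d(channels, channels, kernel_size, padding=pad)

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        # s: memory tensor of shape (batch, channels, width, height)
        u = torch.sigmoid(self.update(s))          # update gate
        r = torch.sigmoid(self.reset(s))           # reset gate
        cand = torch.tanh(self.candidate(r * s))   # candidate memory
        return u * s + (1.0 - u) * cand            # gated convolutional update

# Stacking a few CGRUs expands the receptive field of the convolution.
s = torch.randn(2, 32, 16, 8)
for layer in [CGRU(32) for _ in range(3)]:
    s = layer(s)
print(s.shape)  # torch.Size([2, 32, 16, 8])
```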

Personal Thoughts

  • Same architecture, but encoder and decoder hidden states may be doing different things
    • encoder: embed semantic locally
    • decoder : track how much it has decoded, use tape tensor to hold information of what it has decoded
  • Will it work for languages with different sentence order?
  • What part of the translation problem can we treat as convolutional?
  • Is "Transformer" a combination of attention and active memory?

Link: https://arxiv.org/pdf/1610.08613.pdf
Authors: Lukas Kaiser (Google Brain) et al. 2017

Similarity metrics

  1. Euclidean distance

Euclidean distance (often called L2 norm) is the most intuitive of the metrics.

[Image: Euclidean distance formula]

However, the Euclidean distance only considers magnitude, not orientation (the direction of the vectors). To overcome this issue, we can adopt either the dot product or cosine similarity.
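As a quick illustration (my own example), the L2 distance in NumPy:

```python
# Minimal sketch: Euclidean (L2) distance between two vectors.
import numpy as np

def euclidean_distance(u: np.ndarray, v: np.ndarray) -> float:
    return float(np.linalg.norm(u - v))

u = np.array([1.0, 2.0, 3.0])
v = np.array([2.0, 2.0, 5.0])
print(euclidean_distance(u, v))  # sqrt(1 + 0 + 4) ~ 2.236
```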

  2. Dot product

One drawback of Euclidean distance is that orientation is not considered in the calculation; it is based solely on magnitude. And this is where we can use our other two metrics. The first of those is the dot product.

The dot product considers direction (orientation) and also scales with vector magnitude.

We care about orientation because similar meaning (as we will often find) can be represented by the direction of the vector, not necessarily its magnitude.

For example, we may find that our vector's magnitude correlates with the frequency of a word that it represents in our dataset. Now, the word hi means the same as hello, and this may not be represented if our training data contained the word hi 1000 times and hello just twice.

So, the orientation of a vector is often seen as being just as important (if not more so) as its magnitude.

The dot product is calculated using:

[Image: dot product formula]

The dot product considers the angle between vectors: where the angle is ~0°, the cosθ component of the formula equals ~1. If the angle is nearer to 90° (orthogonal/perpendicular), the cosθ component equals ~0, and at 180° (opposite directions) it equals -1.

Therefore, the cosθ component increases the result when there is a smaller angle between the two vectors. So, a higher dot product correlates with closer alignment in orientation.

Clearly, the dot product calculation is straightforward (the simplest of the three), and this gives us benefits in terms of computation time.

However, there is one drawback. It is not normalized, meaning larger vectors will tend to score higher dot products despite being less similar.

So, in reality, the dot product is used to identify the general orientation of two vectors, because:

  1. Two vectors that point in a similar direction return a positive dot-product.

  2. Two perpendicular vectors return a dot-product of zero.

  3. Vectors that point in opposing directions return a negative dot-product.
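A quick illustration (my own example) of the three cases above:

```python
# Minimal sketch: the sign of the dot product reflects relative orientation.
import numpy as np

a = np.array([1.0, 1.0])
print(np.dot(a, np.array([2.0, 2.0])))    # positive: same direction
print(np.dot(a, np.array([1.0, -1.0])))   # zero: perpendicular
print(np.dot(a, np.array([-1.0, -1.0])))  # negative: opposing direction
```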

  3. Cosine similarity

Cosine similarity considers vector orientation, independent of vector magnitude.

[Image: cosine similarity formula]

The first thing we should be aware of in this formula is that the numerator is, in fact, the dot product, which considers both magnitude and direction.

In the denominator, we have the strange double vertical bars; these mean 'the length of'. So, we have the length of u multiplied by the length of v. The length, of course, considers magnitude.

When we take a function that considers both magnitude and direction and divide that by a function that considers just magnitude, those two magnitudes cancel out, leaving us with a function that considers direction independent of magnitude.

We can think of cosine similarity as a normalized dot product! And it clearly works.
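And a quick illustration (my own example) of cosine similarity as a length-normalized dot product:

```python
# Minimal sketch: cosine similarity depends only on direction, not magnitude.
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

u = np.array([1.0, 2.0, 3.0])
print(cosine_similarity(u, 10 * u))                      # ~1.0: scaling changes nothing
print(cosine_similarity(u, np.array([3.0, 2.0, 1.0])))   # < 1.0: different direction
```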

Open X-Embodiment: Robot Learning Datasets and RT-X Models

Open X-Embodiment: Robot Learning Datasets and RT-X Models

  • The largest open-source dataset of this kind
    • A dataset collected from 22 different robots through a collaboration of 21 institutions
    • Includes 527 skills (160 thousand tasks)
  • Two RT-X models trained on this data are released
    • RT-1-X: an efficient Transformer-based architecture designed for robot control
    • RT-2-X: a large vision-language model that outputs robot actions from natural language tokens

DeepSpeed Ulysses: System Optimizations for Training Transformer Models with Long Sequences

DeepSpeed Ulysses

  • Provides 4x longer sequence lengths than existing systems, enabling training on sequences with more than one million tokens
  • More than 10x less communication, improving throughput by up to 2.5x, with sustained throughput above 175 TFLOPs/GPU
  • Fully general and implementation-agnostic attention (also works with implementations such as FlashAttention 2)
  • Large-model training support: works together with ZeRO-3 to support large sequence and model sizes
  • Easy to use and highly portable, requiring minimal changes to existing frameworks

Deep Networks are kernel machines

Deep Neural Networks are often said to discover useful representations of the data. However, this paper challenges that prevailing view and suggests that, rather than representing the data, deep neural networks store superpositions of the training data in their weights and act as kernel machines at inference time. This is a theoretical paper with a main theorem and an understandable proof, and the result leads to many interesting implications for the field.

Every Model Learned by Gradient Descent Is Approximately a Kernel Machine

Axial-DeepLab: Stand-Alone Axial-Attention for Panoptic Segmentation

Abstract

  • The paper argues that attention-based models (especially Transformers) will replace ConvNets for image processing
  • Uses axial attention to make the Transformer trainable on images
    • Works well not only as a stand-alone image classification model but also as a backbone for panoptic segmentation
    • SOTA on the COCO dataset
    • comparable performance to 2-stage methods (panoptic segmentation)

Details

  • Uses the Self-Attention

self-attention
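A minimal sketch of the axial-attention idea (my own illustration, not the authors' implementation): attend along the height axis and then along the width axis instead of over all H×W positions at once, which cuts the cost from O((HW)²) to O(HW·(H+W)). The positional terms and gating of the paper are omitted.

```python
# Minimal sketch: axial attention = row attention followed by column attention.
import torch
import torch.nn as nn

class AxialAttention2d(nn.Module):
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.row_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.col_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, height, width, dim)
        b, h, w, d = x.shape
        # Attend along the width axis (each row is an independent sequence).
        rows = x.reshape(b * h, w, d)
        rows, _ = self.row_attn(rows, rows, rows)
        x = rows.reshape(b, h, w, d)
        # Attend along the height axis (each column is an independent sequence).
        cols = x.permute(0, 2, 1, 3).reshape(b * w, h, d)
        cols, _ = self.col_attn(cols, cols, cols)
        return cols.reshape(b, w, h, d).permute(0, 2, 1, 3)

x = torch.randn(2, 16, 16, 64)
print(AxialAttention2d(64)(x).shape)  # torch.Size([2, 16, 16, 64])
```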

Personal Thoughts

  • ConvNet have dominated image processing for the last decade, but transformers are quickly replacing traditional models.

Link: https://arxiv.org/abs/2003.07853
Authors: Huiyu Wang, Yukun Zhu, Bradley Green, Hartwig Adam, Alan Yuille, Liang-Chieh Chen

Attention and Augmented Recurrent Neural Networks

Abstract

  • Augmenting RNN with Attention is a new trend.
  • A human with a piece of paper is, in some sense, much smarter than a human without.
  • Since vectors are the natural language of neural networks, the memory is an array of vectors

Details

  • Neural Turing Machines

    • RNN with the external memory bank
    • reading and writing: instead of predicting a single discrete location to read/write, the model reads from and writes to the entire memory and simply learns the weights (see the sketch after this list)
  • Attentional Interfaces

    • Basic attention
  • Adaptive Computation Time

    • a way for RNNs to do different amounts of computation each step
  • Neural Programmers

    • learns to create programs in order to solve a task
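A minimal sketch (my own illustration, not a full NTM) of the soft read/write mentioned above: the controller emits a weighting over all memory slots, reads are weighted sums, and writes update every slot proportionally, which keeps the whole operation differentiable.

```python
# Minimal sketch: differentiable "soft" memory read and write.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

memory = np.random.randn(8, 16)          # 8 slots of dimension 16
scores = np.random.randn(8)              # similarity scores from the controller
w = softmax(scores)                      # soft addressing weights over all slots

read_vector = w @ memory                 # read: weighted sum over every slot

erase = np.random.rand(16)               # write: blend new content into every slot
add = np.random.randn(16)
memory = memory * (1 - np.outer(w, erase)) + np.outer(w, add)
print(read_vector.shape, memory.shape)   # (16,) (8, 16)
```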

Personal Thoughts

  • attention is the key to next-generation neural networks

Link : https://distill.pub/2016/augmented-rnns/
Authors: Olah et al. 2016
