
ai_book's Introduction

Yeonwoo Sung

Yeonwoo's GitHub stats

Skills

📚 Tech Stack 📚


Kaggle

Kaggle Competition Expert

Education

BSc, University of St Andrews (2016 - 2020)

NVIDIA, Deep Learning NLP certificate (2019)

NVIDIA, Rapid Application Development with Large Language Models (2024)

Coursera - Certification for the "Machine Learning" course (2018)

Awards

  • Dean's List for academic excellence in the academic year 2019/20
  • Medal for performance in Programming Projects module (2018)

ai_book's People

Contributors

dependabot[bot], trellixvulnteam, yeonwoosung


ai_book's Issues

Text Embeddings Reveal (Almost) As Much As Text

paper, code

Abstract

How much private information do text embeddings reveal about the original text? We investigate the problem of embedding *inversion*, reconstructing the full text represented in dense text embeddings. We frame the problem as controlled generation: generating text that, when reembedded, is close to a fixed point in latent space. We find that although a naïve model conditioned on the embedding performs poorly, a multi-step method that iteratively corrects and re-embeds text is able to recover 92% of 32-token text inputs exactly. We train our model to decode text embeddings from two state-of-the-art embedding models, and also show that our model can recover important personal information (full names) from a dataset of clinical notes.

Personal Thoughts

Reconstructing the original text from its embeddings can be considered an AI-based vulnerability, since it may cause unintended privacy leakage.

แ„‰แ…ณแ„แ…ณแ„…แ…ตแ†ซแ„‰แ…ฃแ†บ 2023-12-17 แ„‹แ…ฉแ„’แ…ฎ 9 49 25

The paper states that it is possible to decrease the recovery rate of Vec2Text by adding Gaussian noise directly to each embedding.

As the chart above shows, there is a point where we can maximize the gap between the Vec2Text recovery percentage and the retrieval performance (i.e., keep the retrieval performance while dropping the recovery probability drastically).

This reminds me that adding the "right" amount of noise to embeddings can actually improve AI-based systems!
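Below is a minimal sketch (my own illustration, not the paper's code) of this defense: add isotropic Gaussian noise to each embedding and check that similarity-based retrieval is largely preserved for a small noise scale. The `sigma` knob is a hypothetical parameter to tune.

```python
# Minimal sketch: Gaussian noise on embeddings as a defense against inversion.
import numpy as np

def add_gaussian_noise(embeddings: np.ndarray, sigma: float = 0.01) -> np.ndarray:
    """Add isotropic Gaussian noise to each embedding vector."""
    noise = np.random.normal(loc=0.0, scale=sigma, size=embeddings.shape)
    return embeddings + noise

# Toy check: cosine similarity between clean and noised embeddings stays high
# for small sigma, so nearest-neighbour retrieval is largely unaffected.
emb = np.random.randn(100, 768)
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
noised = add_gaussian_noise(emb, sigma=0.01)
cos = np.sum(emb * noised, axis=1) / (
    np.linalg.norm(emb, axis=1) * np.linalg.norm(noised, axis=1)
)
print(f"mean cosine(clean, noised) = {cos.mean():.4f}")
```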

Radioactive data: tracing through training

Abstract

  • Neural classifiers can improve their performance by training on more data
  • But given a trained classifier, it's difficult to tell what data it was trained on
  • This is especially relevant if you have proprietary or personal data and you want to make sure that other people don't use it to train their models
  • Train a CNN with vanilla data first, then train the CNN with radioactive data (i.e., subtly distorted images)

Basically, this paper introduces a method to mark a dataset with a hidden "radioactive" tag, such that any resulting classifier will clearly exhibit this tag, which can be detected.

Details

fig2

  • When you radioactively mark the data points, it simply adds a feature

    • Let's assume that there are 10 classes for our problem
      • We can imagine that there are 10 vectors (each vector is a unique axis that depicts a corresponding class)
      • The classifier plots the point in that data space and finds the class whose vector it is most aligned with
    • As another example, suppose an image classification problem with 4 classes.
      • In this case, the classifier has 4 axis vectors (w, x, y, z).
      • Strictly speaking, these are not fixed axes but vectors learned through training.
      • When this classifier classifies a data point, it places the point in the data space and assigns the class of the nearest vector.
  • So, here, we are introducing a fake class vector

    • Clearly, this is cheating!
  • By using this method, we modify the training data only

  • We give up a little bit of generalization capability, but this forces the model to pay attention to the fake (radioactive) features
    - This is something that you can later detect

  • For testing, they create random vectors in the augmented data space and compute the cosine similarity between the fake vector and each random vector (see the sketch below)
    - The authors state that if the data is marked well, then theoretically the distribution of the cosine between the fake vector and random vectors should follow a known distribution

distribution

  • The paper also shows the methods for re-aligning the feature spaces
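Here is a minimal sketch (my own illustration, not the authors' code) of the cosine-based detection test described above: in high dimension, the cosine between a fixed "carrier" direction and random directions concentrates around zero, so an unusually large cosine with a classifier's feature direction is evidence that the radioactive mark was learned.

```python
# Minimal sketch of cosine-based detection of a "radioactive" carrier direction.
import numpy as np

rng = np.random.default_rng(0)
dim = 512

def unit(v: np.ndarray) -> np.ndarray:
    return v / np.linalg.norm(v)

carrier = unit(rng.normal(size=dim))          # the radioactive (fake) direction
random_dirs = rng.normal(size=(10000, dim))   # null model: random feature directions
random_dirs /= np.linalg.norm(random_dirs, axis=1, keepdims=True)

null_cosines = random_dirs @ carrier          # reference distribution, ~0 +/- 1/sqrt(dim)

# Hypothetical classifier direction that partially absorbed the carrier.
suspect = unit(0.3 * carrier + rng.normal(size=dim))
print("null cosine std:", null_cosines.std())
print("suspect cosine :", float(suspect @ carrier))
```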

Personal Thoughts

Clearly, data is the modern gold. Neural classifiers can improve their performance by training on more data, but given a trained classifier, it's difficult to tell what data it was trained on. This is especially relevant if you have proprietary or personal data and you want to make sure that other people don't use it to train their models. This paper introduces a method to mark a dataset with a hidden "radioactive" tag, such that any resulting classifier will clearly exhibit this tag, which can be detected.

Modeling Recurrence for Transformer

Abstract

  • propose an additional "Attentive Recurrent Network (ARN)" alongside the Transformer encoder to leverage the strengths of both attention and recurrent networks
  • experiments on WMT14 En-De and WMT17 Zh-En demonstrate the effectiveness
  • the study reveals that a short-cut bridge with a shallow ARN outperforms its deep counterpart

Details

Main Approach

  • add an additional recurrent encoder on the source side

fig2

  • the recurrent model can be either (a) a simple RNN/GRU/LSTM, or (b) an Attentive Recurrent Network (ARN), where the context representation is generated via attention conditioned on the previous hidden state (see the sketch below)
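As a rough illustration of the ARN idea (my own reading, not the authors' code), one recurrent step could look like the following: the previous hidden state queries the encoder memory via attention, and the resulting context drives a GRU cell. The single-head dot-product attention and the tensor shapes are simplifying assumptions.

```python
# Minimal sketch of one "attentive recurrent" step.
import torch
import torch.nn as nn

class AttentiveRecurrentStep(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.query = nn.Linear(d_hidden, d_model)
        self.cell = nn.GRUCell(d_model, d_hidden)

    def forward(self, memory: torch.Tensor, h_prev: torch.Tensor) -> torch.Tensor:
        # memory: (batch, src_len, d_model), h_prev: (batch, d_hidden)
        q = self.query(h_prev).unsqueeze(1)              # (batch, 1, d_model)
        scores = torch.bmm(q, memory.transpose(1, 2))    # (batch, 1, src_len)
        attn = torch.softmax(scores, dim=-1)
        context = torch.bmm(attn, memory).squeeze(1)     # (batch, d_model)
        return self.cell(context, h_prev)                # next hidden state

# Usage: run a few ARN steps over the Transformer encoder output.
mem = torch.randn(2, 7, 512)
step = AttentiveRecurrentStep(d_model=512, d_hidden=256)
h = torch.zeros(2, 256)
for _ in range(8):  # the ablation below suggests ~8 recurrent steps
    h = step(mem, h)
print(h.shape)  # torch.Size([2, 256])
```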

Impact of Components

  • ablation study on the size of the additional recurrent encoder
    • a smaller BiARN encoder attached directly to the top of the decoder outperforms all others

table1

  • ablation study on number of recurrent steps in ARN
    • ~8 seems optimal

fig5

  • ablation study on how to integrate representation in the decoder side
    • stack on top outperformed all others

table2

fig4

Overall Result

  • with additional ARN encoder, BLEU scores improve with statistical significance

table3

Linguistic Analysis

  • what linguistic characteristics are models learning?
    • 1-Layer BiARN performs better on all syntactic and some semantic tasks
  • List of Linguistic Characteristics
    • SeLen : sentence length
    • WC : recover original words given its source embedding
    • TrDep : check whether encoder infers the hierarchical structure of sentences
    • ToCo : classify in terms of the sequence of top constituents
    • BShif : tests whether two consecutive tokens are inverted
    • Tense : predict tense of the main-clause verb
    • SubN : number of main-clause subjects
    • ObjN : number of the direct object of the main clause
    • SoMo : check whether some sentences are modified by replacing a random noun or verb
    • CoIn : two coordinate clauses with half the sentence inverted

table5

Personal Thoughts

  • Translation requires a complicated encoding function on the source side. The strengths of attention, RNNs, and CNNs can complement each other to produce a richer representation
  • This paper showed that there is still some room for improvement by letting an RNN encoder play a part alongside the Transformer encoder via the short-cut trick

Link: https://arxiv.org/pdf/1904.03092v1.pdf
Authors: Hao et al. 2019

Shortformer

O. Press et al. [1] challenge the conventional wisdom that scaling transformer language models to longer sequences always improves results. They show that by initially training on shorter sub-sequences and then progressing to longer ones via staged training, they can improve perplexity and reduce training time.

They additionally define a new method, position-infused attention, that enables caching and efficiently attending to previously computed representations. This method does not require large input sub-sequences.

What Does BERT Look At?

paper

Abstract

Large pre-trained neural networks such as BERT have had great recent success in NLP, motivating a growing body of research investigating what aspects of language they are able to learn from unlabeled data. Most recent analysis has focused on model outputs (e.g., language model surprisal) or internal vector representations (e.g., probing classifiers). Complementary to these works, we propose methods for analyzing the attention mechanisms of pre-trained models and apply them to BERT. BERT's attention heads exhibit patterns such as attending to delimiter tokens, specific positional offsets, or broadly attending over the whole sentence, with heads in the same layer often exhibiting similar behaviors. We further show that certain attention heads correspond well to linguistic notions of syntax and coreference. For example, we find heads that attend to the direct objects of verbs, determiners of nouns, objects of prepositions, and coreferent mentions with remarkably high accuracy. Lastly, we propose an attention-based probing classifier and use it to further demonstrate that substantial syntactic information is captured in BERT's attention.

In short

If you visualize the attention maps of each BERT layer, you will find that heads in the first few layers mostly attend to the [CLS] token, whereas heads in the later layers mostly attend to the [SEP] token.

Most papers say that all following tokens attend to the [CLS] token in self-attention, so the output representation of the [CLS] token contains the semantic information of the input sentence.
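A quick way to check this yourself, assuming the Hugging Face transformers API (my own example, not from the paper): extract the per-layer attention maps and measure how much attention mass lands on [CLS] and [SEP].

```python
# Minimal sketch: how much each BERT layer attends to [CLS] vs. [SEP].
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)
model.eval()

inputs = tokenizer("The cat sat on the mat.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
cls_idx, sep_idx = tokens.index("[CLS]"), tokens.index("[SEP]")

# outputs.attentions: one tensor per layer, shape (batch, heads, seq_len, seq_len).
for layer, attn in enumerate(outputs.attentions):
    to_cls = attn[0, :, :, cls_idx].mean().item()  # avg attention mass on [CLS]
    to_sep = attn[0, :, :, sep_idx].mean().item()  # avg attention mass on [SEP]
    print(f"layer {layer:2d}: to [CLS] = {to_cls:.3f}, to [SEP] = {to_sep:.3f}")
```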

Few-Shot Learning with Class Imbalance

E. Triantafillou et al. [1] ran experiments on few-shot learning with class imbalance to see whether class imbalance actually impacts the performance of few-shot learning methods.

Results

  1. compared to the balanced task, performance on the class-imbalanced counterparts always drops, by up to 18.0% for optimization-based methods and up to 8.4% for metric-based methods

  2. contrary to popular belief, meta-learning algorithms, such as MAML, do not automatically learn to balance by being exposed to imbalanced tasks during (meta-)training time

  3. strategies used to mitigate imbalance in supervised learning, such as oversampling, can offer a stronger solution to the class imbalance problem (see the sketch after this list)

  4. the effect of imbalance at the meta-dataset level is less significant than the effect at the task level with similar imbalance magnitude.
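A minimal sketch (my own illustration, not the paper's code) of the random-oversampling strategy mentioned in point 3, applied to an imbalanced support set:

```python
# Minimal sketch: random oversampling of an imbalanced few-shot support set.
import random
from collections import defaultdict

def oversample(support_set):
    """support_set: list of (example, label) pairs; returns a class-balanced list."""
    by_class = defaultdict(list)
    for example, label in support_set:
        by_class[label].append(example)
    max_count = max(len(v) for v in by_class.values())
    balanced = []
    for label, examples in by_class.items():
        balanced.extend((x, label) for x in examples)
        extra = max_count - len(examples)
        balanced.extend((random.choice(examples), label) for _ in range(extra))
    return balanced

# Usage: a 3-way support set with counts 5 / 2 / 1 becomes 5 / 5 / 5.
support = [(f"img_{i}", "A") for i in range(5)] + \
          [(f"img_{i}", "B") for i in range(2)] + \
          [("img_0", "C")]
print(len(oversample(support)))  # 15
```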

Reference

[1] Eleni Triantafillou, Tyler Zhu, Vincent Dumoulin, Pascal Lamblin, Utku Evci, Kelvin Xu, Ross Goroshin, Carles Gelada, Kevin Swersky, Pierre-Antoine Manzagol, Hugo Larochelle. Few-Shot Learning with Class Imbalance

Model parallelism to use a huge model in Colab

Understanding Retrieval Augmentation for Long-Form Question Answering

Understanding Retrieval Augmentation for Long-Form Question Answering

paper

Summary (tl;dr)

Explores retrieval-augmented language models on long-form question answering; finds that retrieval is an important component but evidence documents should be added to the LLM carefully; finds that attribution errors happen more frequently when the retrieved documents lack sufficient information/evidence for answering the question.

Abstract

We present a study of retrieval-augmented language models (LMs) on long-form question answering. We analyze how retrieval augmentation impacts different LMs, by comparing answers generated from models while using the same evidence documents, and how differing quality of retrieval document set impacts the answers generated from the same LM. We study various attributes of generated answers (e.g., fluency, length, variance) with an emphasis on the attribution of generated long-form answers to in-context evidence documents. We collect human annotations of answer attribution and evaluate methods for automatically judging attribution. Our study provides new insights on how retrieval augmentation impacts long, knowledge-rich text generation of LMs. We further identify attribution patterns for long text generation and analyze the main culprits of attribution errors. Together, our analysis reveals how retrieval augmentation impacts long knowledge-rich text generation and provide directions for future work.

Understanding Black-box Predictions via Influence Functions

Abstract

  • Use influence function to trace a model's prediction back to its training data.
  • An approximation of the influence function that requires only gradients and Hessian-vector products provides valuable information
  • Useful in debugging models and detecting dataset errors

Details

  • Using the influence function, one can ask questions such as "What would the model parameters be if certain training data were missing or altered?" without re-training the whole model
  • Useful in detecting adversarial examples
  • Useful in fixing mislabeled examples by providing good candidate lists, although the boost is limited compared to simply listing examples with the highest training loss
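For reference, the influence of upweighting a training point z on the loss at a test point (the quantity the paper approximates with gradients and Hessian-vector products, so that the Hessian never has to be inverted explicitly) can be written as:

```latex
\mathcal{I}_{\text{up,loss}}(z, z_{\text{test}})
  = -\nabla_\theta L(z_{\text{test}}, \hat{\theta})^{\top}
     H_{\hat{\theta}}^{-1}
     \nabla_\theta L(z, \hat{\theta}),
\qquad
H_{\hat{\theta}} = \frac{1}{n}\sum_{i=1}^{n} \nabla_\theta^2 L(z_i, \hat{\theta})
```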

Personal Thoughts

  • Understanding neural networks is difficult because the usual theoretical assumptions do not hold in non-convex, data-dependent, ... environments.
  • Good approximation methods are always powerful and applicable

Link: https://arxiv.org/pdf/1703.04730.pdf
Authors: Pang Wei Koh(Stanford), Percy Liang(Stanford)

Scaling the Transformer

While the transformer represents a massive leap forward in modeling long-range dependency, the models we have seen so far are still fundamentally limited by the size of the input. Since the size of the dot-product matrix grows quadratically in the sequence length, this quickly becomes the bottleneck as we try to extend the length of the input sequence.

As the length of the input sequence grows, both the computation time and the memory of the self-attention mechanism grow quadratically because of the dot-product matrix.
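A small sketch (my own illustration) that makes the quadratic growth concrete: the attention score matrix Q Kᵀ has shape (seq_len, seq_len), so doubling the sequence length quadruples its memory footprint.

```python
# Minimal sketch: the (seq_len x seq_len) attention matrix is the bottleneck.
import torch

d_model = 64
for seq_len in (512, 1024, 2048, 4096):
    q = torch.randn(seq_len, d_model)
    k = torch.randn(seq_len, d_model)
    scores = q @ k.T                                   # (seq_len, seq_len) dot products
    attn = torch.softmax(scores / d_model**0.5, dim=-1)
    mbytes = attn.numel() * attn.element_size() / 2**20
    print(f"seq_len={seq_len:5d} -> attention matrix {tuple(attn.shape)} ~ {mbytes:.1f} MiB")
```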

ExecuTorch

ExecuTorch
ExecuTorch Runtime Overview

ExecuTorch is a PyTorch platform that provides infrastructure to run PyTorch programs everywhere from AR/VR wearables to standard on-device iOS and Android mobile deployments. One of the main goals for ExecuTorch is to enable wider customization and deployment capabilities of the PyTorch programs.

Mistral-7b, Zephyr-7b-alpha

Mistral-7b-v0.1, Zephyr-7b-alpha

  • Mistral-7b outperformed Llama2-13b-hf and gpt-3.5-turbo
  • Zephyr-7b-alpha outperformed mistral-7b, and beat Llama2-70b

DPO vs PPO (DPO is better for finetuning?)

  • Zephyr-7b-alpha is a finetuned model of the Mistral-7b with DPO trainer.
  • The Hugging Face team found that PPO is fragile with respect to hyperparameters, while DPO is robust to them
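For intuition, here is a minimal sketch of the DPO loss itself (my own illustration, not the Zephyr training code): it needs only log-probabilities of the chosen and rejected responses under the policy and a frozen reference model, with `beta` as the single main hyperparameter, which is part of why it is less hyperparameter-sensitive than PPO.

```python
# Minimal sketch of the DPO objective on one preference pair.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta: float = 0.1):
    """All inputs are per-example summed log-probs, shape (batch,)."""
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    logits = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(logits).mean()

# Toy usage with made-up log-probabilities.
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.0]))
print(loss.item())
```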

Using Fast Weights to Attend to the Recent Past

The authors [1] propose "fast weights", a type of attention mechanism over the recent past that performs multiple steps of computation between each hidden-state update of an RNN. The authors evaluate their architecture on various tasks that require short-term memory, arguing that the fast-weights mechanism frees the hidden state from having to memorize recent history, leaving it available for other types of computation.
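A minimal sketch of the fast-weights idea (my own reading of the mechanism, not the authors' code): the fast weight matrix is a decaying sum of outer products of recent hidden states, and a short inner loop lets it act like attention to the recent past between two ordinary RNN steps. Layer normalization and the exact gating of the paper are omitted.

```python
# Minimal sketch: fast weights as a decaying outer-product memory.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h = 8, 16
W = rng.normal(scale=0.1, size=(d_h, d_h))   # slow recurrent weights
C = rng.normal(scale=0.1, size=(d_h, d_in))  # slow input weights
lam, eta, inner_steps = 0.95, 0.5, 3

h = np.zeros(d_h)
A = np.zeros((d_h, d_h))                     # fast weights
for t in range(20):
    x = rng.normal(size=d_in)
    A = lam * A + eta * np.outer(h, h)       # fast-weight update from the recent past
    pre = W @ h + C @ x                      # slow-weight "boundary condition"
    hs = np.tanh(pre)
    for _ in range(inner_steps):             # inner loop attends to recent history via A
        hs = np.tanh(pre + A @ hs)
    h = hs
print(h[:4])
```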

Reference:

[1] Jimmy Ba, Geoffrey Hinton, Volodymyr Mnih, Joel Z. Leibo, Catalin Ionescu, Using Fast Weights to Attend to the Recent Past

์ง€์†์ ์ธ ํ•™์Šต์ด ์–ด๋ ค์šด ์ด์œ 

์น˜๋ช…์ ์ธ ๋ง๊ฐ (Catastrophic Forgetting)

ํ•˜๋‚˜์˜ ๋ฐ์ดํ„ฐ์…‹์œผ๋กœ ํ•™์Šต ํ›„์— ์ƒˆ๋กœ์šด ๋ฐ์ดํ„ฐ์…‹์œผ๋กœ ๋‹ค์‹œ ํ•™์Šต์„ ํ•˜๋ฉด ๊ธฐ์กด ํ•™์Šต ๋‚ด์šฉ์„ ๋ง๊ฐํ•˜๋Š” ํ˜„์ƒ.

์˜ˆ๋ฅผ ๋“ค์–ด, 100๋งŒ ๊ฐœ์˜ ๋ฐ์ดํ„ฐ๋กœ ํ•™์Šต์„ ํ•œ ํ›„์— 10๋งŒ ๊ฐœ์˜ ๋ฐ์ดํ„ฐ๋ฅผ ์ถ”๊ฐ€ ํ•™์Šต ์‹œํ‚ค๋ ค๋ฉด 110๋งŒ ๊ฐœ์˜ ์ „์ฒด ๋ฐ์ดํ„ฐ๋กœ ๋‹ค์‹œ ํ•™์Šต์„ ์‹œํ‚ค๋Š” ๊ฒƒ์ด ์ผ๋ฐ˜์ ์ธ ๋ฐฉ์‹์ด๋‹ค. ์ด๋ ‡๊ฒŒ ํ•ด์•ผ๋งŒ ์ถ”๊ฐ€ ๋ฐ์ดํ„ฐ๋ฅผ ํฌํ•จํ•œ ๊ธฐ์กด ์„ฑ๋Šฅ์ผ ์œ ์ง€๋  ์ˆ˜ ์žˆ๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค.

Can Active Memory Replace Attention?

Abstract

  • Yes for the case of soft attention: somewhat mixed result across tasks.
  • Active memory operates on all of the memory in parallel in a uniform way, bringing improvements in algorithmic tasks, image processing, and generative modeling.
  • Does active memory perform well in machine translation? [YES]

Details

Attention

  • Only a small part of the memory changes at every step, or the memory remains constant.
  • An important limitation of the attention mechanism is that it can only focus on a single element of the memory, due to the nature of softmax.

Active Memory

  • Any model where every part of the memory undergoes an active change at every step.

NMT with Neural GPU

  • parallel encoding and decoding
  • BLEU < 5
  • conditional dependence between outputs is not considered

NMT with Markovian Neural GPU

  • parallel encoding and 1-step conditioned decoding
  • BLEU < 5
  • Perhaps, Markovian dependence of the outputs is too weak for this problem - a full recurrent dependence of the state is needed for good performance

NMT with Extended Neural GPU

  • parallel encoding and sequential decoding
  • BLEU = 29.6 (WMT 14 En-Fr)
  • the active memory decoder (d) holds the recurrent state of decoding, and the output tape tensor (p) holds past decoded logits, both going through CGRU^d.

CGRU

  • a convolutional operation followed by a gated recurrent update
  • stacking CGRUs expands the receptive field of the convolution operation
  • the output tape tensor acts as an external memory of decoded logits
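A minimal sketch of a CGRU cell (my own reading, not the paper's code): a GRU-style gated update whose transforms are convolutions applied to the whole memory tensor, so every position is updated in parallel at each step.

```python
# Minimal sketch: a convolutional GRU (CGRU) update over a 2D memory tensor.
import torch
import torch.nn as nn

class CGRU(nn.Module):
    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        pad = kernel_size // 2
        self.update = nn.Conv2d(channels, channels, kernel_size, padding=pad)
        self.reset = nn.Conv2d(channels, channels, kernel_size, padding=pad)
        self.candidate = nn.Conv2d(channels, channels, kernel_size, padding=pad)

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        # s: memory tensor of shape (batch, channels, width, height)
        u = torch.sigmoid(self.update(s))          # update gate
        r = torch.sigmoid(self.reset(s))           # reset gate
        cand = torch.tanh(self.candidate(r * s))   # candidate memory
        return u * s + (1.0 - u) * cand            # gated convolutional update

# Stacking a few CGRUs expands the receptive field of the convolution.
s = torch.randn(2, 32, 16, 8)
for layer in [CGRU(32) for _ in range(3)]:
    s = layer(s)
print(s.shape)  # torch.Size([2, 32, 16, 8])
```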

Personal Thoughts

  • Same architecture, but encoder and decoder hidden states may be doing different things
    • encoder: embed semantic locally
    • decoder : track how much it has decoded, use tape tensor to hold information of what it has decoded
  • Will it work for languages with different sentence order?
  • What part of the translation problem can we treat as convolutional?
  • Is "Transformer" a combination of attention and active memory?

Link: https://arxiv.org/pdf/1610.08613.pdf
Authors: Lukas Kaiser (Google Brain) et al. 2017

Similarity metrics

  1. Euclidean distance

Euclidean distance (often called L2 norm) is the most intuitive of the metrics.

[Image: Euclidean distance formula]

However, the Euclidean distance only considers magnitude, not orientation (the direction of the vectors). To overcome this issue, we can adopt either the dot product or cosine similarity.
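As a quick illustration (my own example), the L2 distance in NumPy:

```python
# Minimal sketch: Euclidean (L2) distance between two vectors.
import numpy as np

def euclidean_distance(u: np.ndarray, v: np.ndarray) -> float:
    return float(np.linalg.norm(u - v))

u = np.array([1.0, 2.0, 3.0])
v = np.array([2.0, 2.0, 5.0])
print(euclidean_distance(u, v))  # sqrt(1 + 0 + 4) ~ 2.236
```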

  2. Dot product

One drawback of Euclidean distance is that orientation is not considered in the calculation; it is based solely on magnitude. And this is where we can use our other two metrics. The first of those is the dot product.

The dot product considers direction (orientation) and also scales with vector magnitude.

We care about orientation because similar meaning (as we will often find) can be represented by the direction of the vector, not necessarily its magnitude.

For example, we may find that our vector's magnitude correlates with the frequency of a word that it represents in our dataset. Now, the word hi means the same as hello, and this may not be represented if our training data contained the word hi 1000 times and hello just twice.

So, the orientation of a vector is often seen as being just as important (if not more so) as its magnitude.

The dot product is calculated using:

[Image: dot product formula]

The dot product considers the angle between vectors: where the angle is ~0°, the cosθ component of the formula equals ~1. If the angle is nearer to 90° (orthogonal/perpendicular), the cosθ component equals ~0, and at 180° (opposite directions) it equals -1.

Therefore, the cosθ component increases the result when there is a smaller angle between the two vectors. So, a higher dot product correlates with closer alignment in orientation.

Clearly, the dot product calculation is straightforward (the simplest of the three), and this gives us benefits in terms of computation time.

However, there is one drawback. It is not normalized, meaning larger vectors will tend to score higher dot products despite being less similar.

So, in reality, the dot product is used to identify the general orientation of two vectors, because:

  1. Two vectors that point in a similar direction return a positive dot-product.

  2. Two perpendicular vectors return a dot-product of zero.

  3. Vectors that point in opposing directions return a negative dot-product.
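A quick illustration (my own example) of the three cases above:

```python
# Minimal sketch: the sign of the dot product reflects relative orientation.
import numpy as np

a = np.array([1.0, 1.0])
print(np.dot(a, np.array([2.0, 2.0])))    # positive: same direction
print(np.dot(a, np.array([1.0, -1.0])))   # zero: perpendicular
print(np.dot(a, np.array([-1.0, -1.0])))  # negative: opposing direction
```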

  3. Cosine similarity

Cosine similarity considers vector orientation, independent of vector magnitude.

[Image: cosine similarity formula]

The first thing we should be aware of in this formula is that the numerator is, in fact, the dot product, which considers both magnitude and direction.

In the denominator, we have the strange double vertical bars; these mean 'the length of'. So, we have the length of u multiplied by the length of v. The length, of course, considers magnitude.

When we take a function that considers both magnitude and direction and divide that by a function that considers just magnitude, those two magnitudes cancel out, leaving us with a function that considers direction independent of magnitude.

We can think of cosine similarity as a normalized dot product! And it clearly works.
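And a quick illustration (my own example) of cosine similarity as a length-normalized dot product:

```python
# Minimal sketch: cosine similarity depends only on direction, not magnitude.
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

u = np.array([1.0, 2.0, 3.0])
print(cosine_similarity(u, 10 * u))                      # ~1.0: scaling changes nothing
print(cosine_similarity(u, np.array([3.0, 2.0, 1.0])))   # < 1.0: different direction
```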

Open X-Embodiment: Robot Learning Datasets and RT-X Models

Open X-Embodiment: Robot Learning Datasets and RT-X Models

  • The largest open-source dataset of this kind
    • A dataset collected from 22 different robots through a collaboration of 21 institutions
    • Includes 527 skills (160 thousand tasks)
  • Two RT-X models trained on this data are released
    • RT-1-X: an efficient Transformer-based architecture designed for robot control
    • RT-2-X: a large vision-language model that outputs robot actions from natural language tokens

DeepSpeed Ulysses: System Optimizations for Training Transformer Models with Long Sequences

DeepSpeed Ulysses

  • Provides 4x longer sequence lengths than existing systems, enabling training on sequences with more than one million tokens
  • More than 10x less communication, improving throughput by up to 2.5x, with sustained throughput above 175 TFLOPs/GPU
  • Fully general and implementation-agnostic attention (also works with implementations such as FlashAttention 2)
  • Large-model training support: works together with ZeRO-3 to support large sequence and model sizes
  • Easy to use and highly portable, requiring minimal changes to existing frameworks

Deep Networks are kernel machines

Deep Neural Networks are often said to discover useful representations of the data. However, this paper challenges that prevailing view and suggests that, rather than representing the data, deep neural networks store superpositions of the training data in their weights and act as kernel machines at inference time. This is a theoretical paper with a main theorem and an understandable proof, and the result leads to many interesting implications for the field.

Every Model Learned by Gradient Descent Is Approximately a Kernel Machine

Axial-DeepLab: Stand-Alone Axial-Attention for Panoptic Segmentation

Abstract

  • The paper argues that attention-based models (especially Transformers) will replace ConvNets for image processing
  • Uses axial attention to make the Transformer trainable on images
    • Works well not only as a stand-alone image classification model but also as a backbone for panoptic segmentation
    • SOTA on the COCO dataset
    • comparable performance to 2-stage methods (panoptic segmentation)

Details

  • Uses the Self-Attention

self-attention
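A minimal sketch of the axial-attention idea (my own illustration, not the authors' implementation): attend along the height axis and then along the width axis instead of over all H×W positions at once, which cuts the cost from O((HW)²) to O(HW·(H+W)). The positional terms and gating of the paper are omitted.

```python
# Minimal sketch: axial attention = row attention followed by column attention.
import torch
import torch.nn as nn

class AxialAttention2d(nn.Module):
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.row_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.col_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, height, width, dim)
        b, h, w, d = x.shape
        # Attend along the width axis (each row is an independent sequence).
        rows = x.reshape(b * h, w, d)
        rows, _ = self.row_attn(rows, rows, rows)
        x = rows.reshape(b, h, w, d)
        # Attend along the height axis (each column is an independent sequence).
        cols = x.permute(0, 2, 1, 3).reshape(b * w, h, d)
        cols, _ = self.col_attn(cols, cols, cols)
        return cols.reshape(b, w, h, d).permute(0, 2, 1, 3)

x = torch.randn(2, 16, 16, 64)
print(AxialAttention2d(64)(x).shape)  # torch.Size([2, 16, 16, 64])
```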

Personal Thoughts

  • ConvNet have dominated image processing for the last decade, but transformers are quickly replacing traditional models.

Link: https://arxiv.org/abs/2003.07853
Authors: Huiyu Wang, Yukun Zhu, Bradley Green, Hartwig Adam, Alan Yuille, Liang-Chieh Chen

Attention and Augmented Recurrent Neural Networks

Abstract

  • Augmenting RNN with Attention is a new trend.
  • A human with a piece of paper is, in some sense, much smarter than a human without.
  • Since vectors are the natural language of neural networks, the memory is an array of vectors

Details

  • Neural Turing Machines

    • RNN with the external memory bank
    • reading and writing: instead of predicting a single discrete location to read/write, the model reads from and writes to the entire memory and simply learns the weights (see the sketch after this list)
  • Attentional Interfaces

    • Basic attention
  • Adaptive Computation Time

    • a way for RNNs to do different amounts of computation each step
  • Neural Programmers

    • learns to create programs in order to solve a task
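A minimal sketch (my own illustration, not a full NTM) of the soft read/write mentioned above: the controller emits a weighting over all memory slots, reads are weighted sums, and writes update every slot proportionally, which keeps the whole operation differentiable.

```python
# Minimal sketch: differentiable "soft" memory read and write.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

memory = np.random.randn(8, 16)          # 8 slots of dimension 16
scores = np.random.randn(8)              # similarity scores from the controller
w = softmax(scores)                      # soft addressing weights over all slots

read_vector = w @ memory                 # read: weighted sum over every slot

erase = np.random.rand(16)               # write: blend new content into every slot
add = np.random.randn(16)
memory = memory * (1 - np.outer(w, erase)) + np.outer(w, add)
print(read_vector.shape, memory.shape)   # (16,) (8, 16)
```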

Personal Thoughts

  • attention is the key to next-generation neural networks

Link : https://distill.pub/2016/augmented-rnns/
Authors: Olah et al. 2016
