
gemini's Introduction

Gemini

The open-source implementation of Gemini, the model that will "eclipse ChatGPT". It appears to work by feeding all modalities at once into a single transformer, with special decoders for text or image generation.

Join the Agora Discord channel to help with the implementation, and check out the project board.

The input sequences for Gemini consist of text, audio, images, and videos. These inputs are transformed into tokens, which are then processed by a transformer; conditional decoding then generates image outputs. Interestingly, Gemini's architecture resembles Fuyu's but is expanded to encompass multiple modalities: instead of using a visual transformer (ViT) encoder, Gemini simply feeds image embeddings directly into the transformer. The token inputs will likely be indicated by special modality tokens such as [IMG] or [AUDIO], with matching tokens marking the end of each modality segment. Codi, a component of Gemini, also employs conditional generation and makes use of the tokenized outputs. To implement this model effectively, I intend to focus first on the image embeddings to ensure their smooth integration, then proceed with audio embeddings and finally video embeddings.
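
For intuition, here is a minimal sketch of that Fuyu-style idea: raw image patches are linearly projected straight into the transformer's embedding space and concatenated with the text embeddings. The dimensions and the projection layer are illustrative assumptions, not the repository's exact code.

import torch
import torch.nn as nn

# Minimal sketch: project raw image patches directly into the model's
# embedding space (no separate ViT encoder). Dimensions are illustrative.
dim, patch = 512, 16
img = torch.randn(1, 3, 64, 64)  # [batch, channels, height, width]

# Cut the image into non-overlapping 16x16 patches and flatten each one
patches = img.unfold(2, patch, patch).unfold(3, patch, patch)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, -1, 3 * patch * patch)

proj = nn.Linear(3 * patch * patch, dim)  # linear patch projection
img_embeds = proj(patches)                # [batch, num_patches, dim]

text_embeds = torch.randn(1, 128, dim)    # stand-in for text token embeddings
sequence = torch.cat([img_embeds, text_embeds], dim=1)  # one fused sequence
print(sequence.shape)                     # torch.Size([1, 144, 512])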

Install

pip3 install gemini-torch

Usage

Gemini Transformer Usage

  • Base transformer
  • Multi-grouped query attention / flash attention
  • RoPE
  • ALiBi
  • xpos
  • QK norm
  • No positional embeddings
  • KV cache

import torch

from gemini_torch.model import Gemini

# Initialize model with smaller dimensions
model = Gemini(
    num_tokens=50432,
    max_seq_len=4096,  # Reduced from 8192
    dim=1280,  # Reduced from 2560
    depth=16,  # Reduced from 32
    dim_head=64,  # Reduced from 128
    heads=12,  # Reduced from 24
    use_abs_pos_emb=False,
    attn_flash=True,
    attn_kv_heads=2,
    qk_norm=True,
    attn_qk_norm=True,
    attn_qk_norm_dim_scale=True,
)

# Text: token ids of shape [batch, seq_len]
text = torch.randint(0, 50432, (1, 4096))  # Reduced seq_len from 8192

# Apply model to text
y = model(
    text,
)

# Output: logits of shape [batch, seq_len, num_tokens]
print(y)

Full Multi-Modal Gemini

  • Processes images and audio through a series of reshapes
  • Ready to train for production-grade usage
  • Hyper-optimized with flash attention, QK norm, and other methods

import torch

from gemini_torch.model import Gemini

# Initialize model with smaller dimensions
model = Gemini(
    num_tokens=10000,  # Reduced from 50432
    max_seq_len=1024,  # Reduced from 4096
    dim=320,  # Reduced from 1280
    depth=8,  # Reduced from 16
    dim_head=32,  # Reduced from 64
    heads=6,  # Reduced from 12
    use_abs_pos_emb=False,
    attn_flash=True,
    attn_kv_heads=2,
    qk_norm=True,
    attn_qk_norm=True,
    attn_qk_norm_dim_scale=True,
    post_fusion_norm=True,
    post_modal_transform_norm=True,
)

# Text: token ids of shape [batch, seq_len]
text = torch.randint(0, 10000, (1, 1024))  # Reduced seq_len from 4096

# Img shape: [batch, channels, height, width]
img = torch.randn(1, 3, 64, 64)  # Reduced height and width from 128

# Audio shape: [batch, audio_seq_len]
audio = torch.randn(1, 32)  # Reduced audio_seq_len from 64

# Apply model to text, img, and audio
y, _ = model(text=text, img=img, audio=audio)

# Output: logits of shape [batch, seq_len, num_tokens]
print(y)
print(y.shape)


# After much training, switch to inference
model.eval()

# tokenize/detokenize below are placeholders for your own tokenizer calls
text = tokenize(texts)
logits = model(text)
text = detokenize(logits)
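
A minimal greedy decoding loop, assuming the model returns next-token logits of shape [batch, seq_len, num_tokens] as in the examples above, might look like this sketch (not the library's built-in generation API):

import torch

@torch.no_grad()
def greedy_generate(model, prompt_ids, max_new_tokens=32):
    # Sketch: assumes model(ids) returns logits of shape [batch, seq, vocab]
    ids = prompt_ids
    for _ in range(max_new_tokens):
        logits = model(ids)                      # [batch, seq, vocab]
        next_id = logits[:, -1].argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=-1)  # append the greedy pick
    return ids

Sampling, temperature, or the consensus scheme described in the Todo section below can be layered on top of the same loop.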

LongGemini

An implementation of Gemini with Ring Attention; no multi-modality processing yet.

import torch
from gemini_torch import LongGemini

# Text tokens
x = torch.randint(0, 10000, (1, 1024))

# Create an instance of the LongGemini model
model = LongGemini(
    dim=512,  # Dimension of the input tensor
    depth=32,  # Number of transformer blocks
    dim_head=128,  # Dimension of the query, key, and value vectors
    long_gemini_depth=9,  # Number of long gemini transformer blocks
    heads=24,  # Number of attention heads
    qk_norm=True,  # Whether to apply layer normalization to query and key vectors
    ring_seq_size=512,  # The size of the ring sequence
)

# Apply the model to the input tensor
out = model(x)

# Print the output tensor
print(out)
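
For intuition, here is a minimal single-process sketch of the ring-attention idea: the sequence is split into chunks (ring_seq_size above plays a similar role), and each query chunk consumes key/value chunks one at a time using an online softmax, so the full attention matrix is never materialized. In a real ring setup the key/value chunks rotate between devices; this is an illustration, not the implementation used by LongGemini.

import math
import torch

def ring_attention_sketch(q, k, v, chunks=4):
    # q, k, v: [batch, seq, dim]; split the sequence into `chunks` blocks
    b, n, d = q.shape
    q_blocks = q.chunk(chunks, dim=1)
    kv_blocks = list(zip(k.chunk(chunks, dim=1), v.chunk(chunks, dim=1)))
    outs = []
    for qi in q_blocks:
        m = torch.full((b, qi.shape[1], 1), float("-inf"))  # running max
        den = torch.zeros(b, qi.shape[1], 1)                # running denominator
        acc = torch.zeros(b, qi.shape[1], d)                # running numerator
        for kj, vj in kv_blocks:  # in a real ring, these arrive from peer devices
            s = qi @ kj.transpose(-1, -2) / math.sqrt(d)
            m_new = torch.maximum(m, s.amax(-1, keepdim=True))
            p = (s - m_new).exp()
            scale = (m - m_new).exp()  # rescale old stats to the new max
            den = den * scale + p.sum(-1, keepdim=True)
            acc = acc * scale + p @ vj
            m = m_new
        outs.append(acc / den)
    return torch.cat(outs, dim=1)

x = torch.randn(1, 64, 32)
print(ring_attention_sketch(x, x, x).shape)  # torch.Size([1, 64, 32])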

Tokenizer

  • SentencePiece tokenizer
  • We use the same tokenizer as LLaMA, with special tokens denoting the beginning and end of each modality segment.
  • Does not yet fully process images, audio, or video; we need help with that.

from gemini_torch.tokenizer import MultimodalSentencePieceTokenizer

# Example usage
tokenizer_name = "hf-internal-testing/llama-tokenizer"
tokenizer = MultimodalSentencePieceTokenizer(tokenizer_name=tokenizer_name)

# Encoding and decoding examples
encoded_audio = tokenizer.encode("Audio description", modality="audio")
decoded_audio = tokenizer.decode(encoded_audio)

print("Encoded audio:", encoded_audio)
print("Decoded audio:", decoded_audio)

References

  • Combine reinforcement learning with a modular pretrained transformer and multi-modal capabilities (image, audio)
  • Self-improving mechanisms like RoboCat
  • PPO or MPO?
  • Get good at backtracking and exploring alternative paths
  • Speculative decoding
  • Algorithm of Thoughts
  • RLHF
  • Gemini Report
  • Gemini Landing Page

Todo

  • Check out the project board for more todos

  • Implement the image feature embedder, align images with text, and pass them into the transformer: "Gemini models are trained to accommodate textual input interleaved with a wide variety of audio and visual inputs, such as natural images, charts, screenshots, PDFs, and videos, and they can produce text and image outputs (see Figure 2). The visual encoding of Gemini models is inspired by our own foundational work on Flamingo (Alayrac et al., 2022), CoCa (Yu et al., 2022a), and PaLI (Chen et al., 2022), with the important distinction that the models are multimodal from the beginning and can natively output images using discrete image tokens (Ramesh et al., 2021; Yu et al., 2022b)."

  • Implement the audio processing using USM by Google: "In addition, Gemini can directly ingest audio signals at 16 kHz from Universal Speech Model (USM) (Zhang et al., 2023) features. This enables the model to capture nuances that are typically lost when the audio is naively mapped to a text input (for example, see the audio understanding demo on the website)."

  • Video Processing Technique: "Video understanding is accomplished by encoding the video as a sequence of frames in the large context window. Video frames or images can be interleaved naturally with text or audio as part of the model input."

  • Prompting Technique (a minimal sketch of this consensus scheme follows this list): "We find Gemini Ultra achieves highest accuracy when used in combination with a chain-of-thought prompting approach (Wei et al., 2022) that accounts for model uncertainty. The model produces a chain of thought with k samples, for example 8 or 32. If there is a consensus above a preset threshold (selected based on the validation split), it selects this answer; otherwise it reverts to a greedy sample based on maximum-likelihood choice without chain of thought. We refer the reader to the appendix for a detailed breakdown of how this approach compares with only chain-of-thought prompting or only greedy sampling."

  • Train 1.8B and 3.25B models: "Nano-1 and Nano-2 model sizes are only 1.8B and 3.25B parameters respectively. Despite their size, they show exceptionally strong performance on factuality, i.e. retrieval-related tasks, and significant performance on reasoning, STEM, coding, multimodal and …"
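
A minimal sketch of the chain-of-thought consensus scheme from the prompting bullet above; sample_fn, greedy_fn, and the threshold value are illustrative assumptions:

from collections import Counter

def consensus_answer(sample_fn, greedy_fn, k=8, threshold=0.5):
    # Sketch of the chain-of-thought consensus scheme described above.
    # sample_fn() -> one sampled chain-of-thought answer (stochastic)
    # greedy_fn() -> one maximum-likelihood answer without chain of thought
    answers = [sample_fn() for _ in range(k)]
    best, count = Counter(answers).most_common(1)[0]
    if count / k >= threshold:
        return best      # consensus above the preset threshold
    return greedy_fn()   # otherwise revert to the greedy sample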

gemini's People

Contributors

dependabot[bot], kyegomez


gemini's Issues

Add conversation archiving and foldering capabilities

Hi there,
Please add conversation archiving and foldering capabilities to Gemini. This feature would allow users to organize and manage their conversations more effectively.

Thanks

Upvote & Fund

  • We're using Polar.sh so you can upvote and help fund this issue.
  • We receive the funding once the issue is completed & confirmed by you.
  • Thank you in advance for helping prioritize & fund our backlog.
Fund with Polar

Is there a small model trainer, such as llama2.c?

Is there a small model trainer, such as llama2.c?

Upvote & Fund

  • We're using Polar.sh so you can upvote and help fund this issue.
  • We receive the funding once the issue is completed & confirmed by you.
  • Thank you in advance for helping prioritize & fund our backlog.
Fund with Polar

[BUG] Import Error

Describe the bug
Getting the following error when running the example code.

Code:

import torch
from gemini_torch import Gemini

# Initialize the model
model = Gemini(
    num_tokens=50432,
    max_seq_len=8192,
    dim=2560,
    depth=32,
    dim_head=128,
    heads=24,
    use_abs_pos_emb=False,
    alibi_pos_bias=True,
    alibi_num_heads=12,
    rotary_xpos=True,
    attn_flash=True,
    attn_kv_heads=2,
    qk_norm=True,
    attn_qk_norm=True,
    attn_qk_norm_dim_scale=True,
)

# Initialize random text tokens
x = torch.randint(0, 50432, (1, 8192))

# Apply model to x
y = model(x)

# Print logits
print(y)

Error below:

ImportError                               Traceback (most recent call last)
Input In [1], in <cell line: 2>()
      1 import torch
----> 2 from gemini_torch import Gemini
      4 # Initialize the model
      5 model = Gemini(
      6     num_tokens=50432,
      7     max_seq_len=8192,
    (...)
     20     attn_qk_norm_dim_scale=True,
     21 )

File E:\Anaconda3\lib\site-packages\gemini_torch\__init__.py:2, in <module>
      1 from gemini_torch.model import Gemini
----> 2 from gemini_torch.utils import ImgToTransformer, AudioToTransformer
      3 from gemini_torch.tokenizer import MultimodalSentencePieceTokenizer
      5 __all__ = [
      6     "Gemini",
      7     "ImgToTransformer",
      8     "AudioToTransformer",
      9     "MultimodalSentencePieceTokenizer",
     10 ]

ImportError: cannot import name 'AudioToTransformer' from 'gemini_torch.utils' (E:\Anaconda3\lib\site-packages\gemini_torch\utils.py)

To Reproduce
Steps to reproduce the behavior:

  1. Install the library by following the README.md
  2. Run example.py in a Jupyter notebook
  3. See error

Expected behavior
The program displays a matrix.

Screenshot attached.

Upvote & Fund

  • We're using Polar.sh so you can upvote and help fund this issue.
  • We receive the funding once the issue is completed & confirmed by you.
  • Thank you in advance for helping prioritize & fund our backlog.
Fund with Polar

how to train Gemini

Just wondering, is there anything special about training Gemini? For example, are there any differences when training on multimodal data?
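
A minimal sketch of one possible training step, assuming the multimodal forward returns logits of shape [batch, seq_len, num_tokens] aligned one-to-one with the text tokens, as in the README examples. Next-token prediction on the text stream is an illustrative choice, not the repository's documented recipe:

import torch
import torch.nn.functional as F

def train_step(model, optimizer, text, img, audio):
    # Assumes model(...) returns (logits, aux) with logits [batch, seq, vocab]
    # and that logits align position-for-position with the text tokens.
    logits, _ = model(text=text, img=img, audio=audio)
    # Shift by one: predict each text token from the tokens before it
    loss = F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        text[:, 1:].reshape(-1),
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()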

Upvote & Fund

  • We're using Polar.sh so you can upvote and help fund this issue.
  • We receive the funding once the issue is completed & confirmed by you.
  • Thank you in advance for helping prioritize & fund our backlog.
Fund with Polar

Remember user language preference

Gemini currently cannot remember a user's language preference across sessions. This means users need to explicitly specify their preferred language every time they use Gemini.

This can be inconvenient for users who always want the same language. It would be helpful if Gemini could remember the user's language preference and automatically apply it in future sessions.

Upvote & Fund

  • We're using Polar.sh so you can upvote and help fund this issue.
  • We receive the funding once the issue is completed & confirmed by you.
  • Thank you in advance for helping prioritize & fund our backlog.
Fund with Polar

[BUG] Expected size for first two dimensions of batch2 tensor to be: [1, 4] but got: [1, 512].

Describe the bug
I tried to run the example code, but it fails with this error.
To Reproduce
Steps to reproduce the behavior:
Run example code

import torch

from gemini_torch.model import Gemini

# Initialize model with smaller dimensions
model = Gemini(
    num_tokens=50432,
    max_seq_len=4096,  # Reduced from 8192
    dim=1280,  # Reduced from 2560
    depth=16,  # Reduced from 32
    dim_head=64,  # Reduced from 128
    heads=12,  # Reduced from 24
    use_abs_pos_emb=False,
    attn_flash=True,
    attn_kv_heads=2,
    qk_norm=True,
    attn_qk_norm=True,
    attn_qk_norm_dim_scale=True,
)

# Text: token ids of shape [batch, seq_len]
text = torch.randint(0, 50432, (1, 4096))  # Reduced seq_len from 8192

# Apply model to text
y = model(
    text,
)

# Output: logits of shape [batch, seq_len, num_tokens]
print(y)

Expected behavior
(I don't know which output to expect.)

Additional context

Traceback (most recent call last):
  File "C:\----\remove it.py", line 3, in <module>
    from gemini_torch.model import Gemini
  File "C:\Users\User\AppData\Local\Programs\Python\Python310\lib\site-packages\gemini_torch\__init__.py", line 1, in <module>
    from gemini_torch.long_gemini import LongGeminiTransformerBlock, LongGemini
  File "C:\Users\User\AppData\Local\Programs\Python\Python310\lib\site-packages\gemini_torch\long_gemini.py", line 3, in <module>
    from zeta.nn import FeedForward, OutputHead
  File "C:\Users\User\AppData\Local\Programs\Python\Python310\lib\site-packages\zeta\__init__.py", line 28, in <module>
    from zeta.nn import *
  File "C:\Users\User\AppData\Local\Programs\Python\Python310\lib\site-packages\zeta\nn\__init__.py", line 1, in <module>
    from zeta.nn.attention import *
  File "C:\Users\User\AppData\Local\Programs\Python\Python310\lib\site-packages\zeta\nn\attention\__init__.py", line 14, in <module>
    from zeta.nn.attention.mixture_attention import (
  File "C:\Users\User\AppData\Local\Programs\Python\Python310\lib\site-packages\zeta\nn\attention\mixture_attention.py", line 8, in <module>
    from zeta.models.vit import exists
  File "C:\Users\User\AppData\Local\Programs\Python\Python310\lib\site-packages\zeta\models\__init__.py", line 3, in <module>
    from zeta.models.andromeda import Andromeda
  File "C:\Users\User\AppData\Local\Programs\Python\Python310\lib\site-packages\zeta\models\andromeda.py", line 4, in <module>
    from zeta.structs.auto_regressive_wrapper import AutoregressiveWrapper
  File "C:\Users\User\AppData\Local\Programs\Python\Python310\lib\site-packages\zeta\structs\__init__.py", line 4, in <module>
    from zeta.structs.local_transformer import LocalTransformer
  File "C:\Users\User\AppData\Local\Programs\Python\Python310\lib\site-packages\zeta\structs\local_transformer.py", line 8, in <module>
    from zeta.nn.modules import feedforward_network
  File "C:\Users\User\AppData\Local\Programs\Python\Python310\lib\site-packages\zeta\nn\modules\__init__.py", line 47, in <module>
    from zeta.nn.modules.mlp_mixer import MLPMixer
  File "C:\Users\User\AppData\Local\Programs\Python\Python310\lib\site-packages\zeta\nn\modules\mlp_mixer.py", line 145, in <module>
    output = mlp_mixer(example_input)
  File "C:\Users\User\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\User\AppData\Local\Programs\Python\Python310\lib\site-packages\zeta\nn\modules\mlp_mixer.py", line 125, in forward
    x = mixer_block(x)
  File "C:\Users\User\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\User\AppData\Local\Programs\Python\Python310\lib\site-packages\zeta\nn\modules\mlp_mixer.py", line 63, in forward
    y = self.tokens_mlp(y)
  File "C:\Users\User\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\User\AppData\Local\Programs\Python\Python310\lib\site-packages\zeta\nn\modules\mlp_mixer.py", line 30, in forward
    y = self.dense1(x)
  File "C:\Users\User\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\User\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\nn\modules\linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: Expected size for first two dimensions of batch2 tensor to be: [1, 4] but got: [1, 512].

<!-- POLAR PLEDGE BADGE START -->
## Upvote & Fund

- We're using [Polar.sh](https://polar.sh/kyegomez) so you can upvote and help fund this issue.
- We receive the funding once the issue is completed & confirmed by you.
- Thank you in advance for helping prioritize & fund our backlog.

<a href="https://polar.sh/kyegomez/Gemini/issues/32">
<picture>
  <source media="(prefers-color-scheme: dark)" srcset="https://polar.sh/api/github/kyegomez/Gemini/issues/32/pledge.svg?darkmode=1">
  <img alt="Fund with Polar" src="https://polar.sh/api/github/kyegomez/Gemini/issues/32/pledge.svg">
</picture>
</a>
<!-- POLAR PLEDGE BADGE END -->

[BUG] Getting CPU allocation error after the update on 12/15/2023

Describe the bug
Already installed the update from 12/15/2023 via "pip3 install gemini-torch --upgrade". Running the example code generates the following error. Maybe Torch cannot be forced to run on the GPU? I have an NVIDIA GeForce RTX 2070.

"RuntimeError: [enforce fail at alloc_cpu.cpp:80] data. DefaultCPUAllocator: not enough memory: you tried to allocate 21152713932800 bytes."

To Reproduce
Steps to reproduce the behavior:

  1. Install Gemini-Torch per Readme.md
  2. Run the example.py
  3. See error

Expected behavior
The example completes without error and displays a matrix.

Screenshots
Please see screenshots attached.


Upvote & Fund

  • We're using Polar.sh so you can upvote and help fund this issue.
  • We receive the funding once the issue is completed & confirmed by you.
  • Thank you in advance for helping prioritize & fund our backlog.
Fund with Polar
