oztobuzz / vista

This is the official repository for the Vista dataset, a Vietnamese multimodal dataset containing more than 700,000 samples of conversations and images.

Home Page: https://huggingface.co/datasets/Vi-VLM/Vista

License: Other

Python 90.79% Shell 9.21%
dataset multi-modality multimodal open-source vietnamese vietnamese-nlp vision-language-model vista

Vista

"700,000 Vietnamese vision-language samples open-source dataset"

Overview

This dataset contains over 700,000 Vietnamese vision-language samples generated with Gemini Pro. We employed several prompt engineering techniques: few-shot learning, caption-based prompting, and image-based prompting.

  • For the COCO dataset, we generated data using LLaVA-style prompts.

  • For the ShareGPT4V dataset, we used translation prompts.

  • Caption-based prompting uses the accurate captions and bounding boxes from the original dataset.

  • Image-based prompting uses the images themselves to create captions and conversations.
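As a rough illustration of caption-based prompting, the sketch below assembles a text-only description of an image from its captions and bounding boxes. The function name `build_caption_prompt`, the box format, and the instruction text are all hypothetical and chosen only for this example; they are not the prompts used to build Vista.

```python
from typing import List, Tuple

# Hypothetical box format for illustration: (label, x_min, y_min, x_max, y_max),
# with coordinates normalized to [0, 1]. Not taken from the Vista code base.
Box = Tuple[str, float, float, float, float]


def build_caption_prompt(captions: List[str], boxes: List[Box], instruction: str) -> str:
    """Assemble a text-only context describing an image from captions and boxes."""
    caption_block = "\n".join(f"- {c.strip()}" for c in captions)
    box_block = "\n".join(
        f"- {label}: [{x0:.2f}, {y0:.2f}, {x1:.2f}, {y1:.2f}]"
        for label, x0, y0, x1, y1 in boxes
    )
    return (
        f"{instruction}\n\n"
        f"Captions:\n{caption_block}\n\n"
        f"Objects (normalized bounding boxes):\n{box_block}"
    )


# Example input; the caption, boxes, and instruction are made up for this sketch.
prompt = build_caption_prompt(
    captions=["A man rides a bicycle down a street."],
    boxes=[("person", 0.31, 0.20, 0.58, 0.85), ("bicycle", 0.28, 0.45, 0.62, 0.95)],
    instruction="Write a conversation in Vietnamese about the image described below.",
)
print(prompt)
```

Few-shot examples, if used, could be prepended to the instruction in the same plain-text form.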

The curation process involved removing any Han, Japanese, and Korean characters. The data was further refined by filtering out samples with high perplexity.
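A minimal sketch of the character-based step, assuming it means discarding samples that contain such characters. The Unicode ranges and the helper names `contains_cjk` and `drop_cjk_samples` are assumptions made for this example, not the repository's implementation.

```python
import re

# Assumed Unicode ranges for "Han, Japanese, and Korean characters".
CJK_PATTERN = re.compile(
    "["
    "\u4e00-\u9fff"  # CJK Unified Ideographs (Han)
    "\u3400-\u4dbf"  # CJK Unified Ideographs Extension A
    "\u3040-\u309f"  # Hiragana
    "\u30a0-\u30ff"  # Katakana
    "\uac00-\ud7af"  # Hangul Syllables
    "\u1100-\u11ff"  # Hangul Jamo
    "]"
)


def contains_cjk(text: str) -> bool:
    """Return True if the text contains any character in the assumed ranges."""
    return CJK_PATTERN.search(text) is not None


def drop_cjk_samples(samples):
    """Keep only the samples without Han, Japanese, or Korean characters."""
    return [s for s in samples if not contains_cjk(s)]


samples = [
    "Một con mèo đang ngủ trên ghế.",
    "Một con mèo 猫 đang ngủ.",
    "고양이가 자고 있다.",
]
clean = drop_cjk_samples(samples)
print(clean)
```

If the intent is instead to strip the characters and keep the rest of the sample, `CJK_PATTERN.sub("", text)` would be the corresponding variant.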

HuggingFace Dataset

Report: Coming Soon

Dataset Structure

The dataset is structured into 5 subsets:

Subset                        Split        Method                       Size
Vi-LLAVA conversation         train        caption-based                107,052
Vi-LLAVA conversation         validation   caption-based                4,550
Vi-LLAVA complex reasoning    train        caption-based                112,650
Vi-LLAVA complex reasoning    validation   caption-based                4,771
Vi-LLAVA detail description   train        caption-based                111,153
Vi-LLAVA detail description   validation   caption-based                4,714
Vi-ShareGPT4V                 -            translation                  96,913
Vi-WIT                        -            caption-based, image-based   264,831
Total                                                                   706,634
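The sizes above can be checked with a short tally. The dictionary layout is only an illustration, and `None` marks subsets for which the table lists no split.

```python
from collections import defaultdict

# Sizes copied from the table; keys are (subset, split).
subset_sizes = {
    ("Vi-LLAVA conversation", "train"): 107_052,
    ("Vi-LLAVA conversation", "validation"): 4_550,
    ("Vi-LLAVA complex reasoning", "train"): 112_650,
    ("Vi-LLAVA complex reasoning", "validation"): 4_771,
    ("Vi-LLAVA detail description", "train"): 111_153,
    ("Vi-LLAVA detail description", "validation"): 4_714,
    ("Vi-ShareGPT4V", None): 96_913,
    ("Vi-WIT", None): 264_831,
}

# Sum train and validation rows per subset.
per_subset = defaultdict(int)
for (subset, _split), size in subset_sizes.items():
    per_subset[subset] += size

total = sum(subset_sizes.values())
for subset, size in per_subset.items():
    print(f"{subset}: {size:,}")
print(f"Total: {total:,}")
```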

Data processing

Vi-LLAVA

Follow the instructions in the Vi-LLAVA/ folder.

Translate ShareGPT4V

bash scripts/translate_shareGPT4V.sh

WIT

Follow the instructions in the WIT/ folder.

Filtering perplexity

import os

from datasets import load_dataset
from perplexity.filtering import FilteringPerplexity

# Specify your own dataset
dataset = load_dataset("Specify your dataset", split="train")

# Set up perplexity filtering
perplexity_filtering = FilteringPerplexity(
    sentencepiece_model_path=os.path.join("path to sentencepiece model"),
    kenlm_model_path=os.path.join("path to kenlm model"),
)

# Compute perplexity for every sample
data_contains_perplex = perplexity_filtering.compute(dataset)

# Filter by perplexity
threshold = 100  # Set your own threshold if needed
data_filtered = perplexity_filtering.filter(data_contains_perplex, threshold=threshold)
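For intuition about the threshold, the sketch below shows how a perplexity value follows from a total log10 probability (the form in which KenLM reports sentence scores) and how a threshold filter might then be applied. The helper names and the keep-if-at-or-below rule are assumptions for illustration; `FilteringPerplexity` may compute and filter differently.

```python
from typing import Dict, List


def perplexity_from_log10(total_log10_prob: float, n_tokens: int) -> float:
    """Perplexity = 10 ** (-(sum of log10 token probabilities) / number of tokens)."""
    if n_tokens <= 0:
        raise ValueError("n_tokens must be positive")
    return 10.0 ** (-total_log10_prob / n_tokens)


def filter_by_perplexity(rows: List[Dict], threshold: float) -> List[Dict]:
    """Keep rows whose 'perplexity' is at or below the threshold (assumed rule)."""
    return [row for row in rows if row["perplexity"] <= threshold]


# Made-up scores: both samples have 10 tokens.
rows = [
    {"text": "sample A", "perplexity": perplexity_from_log10(-10.0, 10)},
    {"text": "sample B", "perplexity": perplexity_from_log10(-30.0, 10)},
]
kept = filter_by_perplexity(rows, threshold=100)
print([row["text"] for row in kept])
```

A lower threshold keeps only text the language model finds more predictable, so it is worth inspecting the perplexity distribution of your own data before fixing a value.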

Personal and Sensitive Information

  • The dataset does not contain any personal or sensitive information.

Bias, Risks, and Limitations

  • The dataset may contain biases due to the sources from which the data was collected.
  • Users should be aware of these potential biases when using the dataset.

Authors

Licensing Information

The dataset is released under the MIT license.

Additional Information

Citation Information

BibTeX:

@article{ViVLM_Vista_2024,
  title={Vista},
  author={Tran, Oanh Ngoc and Bui, Hop Van and Ha, Hoang Huy and Phan, Phuc Van},
  year={2024},
  month={May},
  url={https://huggingface.co/datasets/Vi-VLM/Vista}
}

Contributors

hahuyhoang411, hllj, oztobuzz, pphuc25
