
hashformers's Introduction

✂️ hashformers


Hashtag segmentation is the task of automatically adding spaces between the words in a hashtag.

Hashformers is the current state of the art for hashtag segmentation, as demonstrated in our paper accepted at LREC 2022.

Hashformers is also language-agnostic: you can use it to segment hashtags not just with English models, but also using any language model available on the Hugging Face Model Hub.

Basic usage

from hashformers import TransformerWordSegmenter as WordSegmenter

ws = WordSegmenter(
    segmenter_model_name_or_path="gpt2",
    segmenter_model_type="incremental",
    reranker_model_name_or_path="google/flan-t5-base",
    reranker_model_type="seq2seq"
)

segmentations = ws.segment([
    "#weneedanationalpark",
    "#icecold"
])

print(segmentations)

# [ 'we need a national park',
# 'ice cold' ]

It is also possible to use hashformers without a reranker by setting both reranker_model_name_or_path and reranker_model_type to None.

Installation

pip install hashformers

What models can I use?

Visit the HuggingFace Model Hub and choose your models for the WordSegmenter class.

You can use any model supported by the minicons library. Currently, hashformers supports the following model types for segmenter_model_type and reranker_model_type:

incremental

Auto-regressive models like GPT-2 and XLNet, or any model that can be loaded with AutoModelForCausalLM. This includes large language models (LLMs) such as Alpaca-LoRA ( chainyo/alpaca-lora-7b ) and GPT-J ( EleutherAI/gpt-j-6b ).

ws = WordSegmenter(
    segmenter_model_name_or_path="EleutherAI/gpt-j-6b",
    segmenter_model_type="incremental",
    reranker_model_name_or_path=None,
    reranker_model_type=None
)

masked

Masked language models like BERT, or any model that can be loaded with AutoModelForMaskedLM.

seq2seq

Seq2Seq models like FLAN-T5 ( google/flan-t5-base ), or any model that can be loaded with AutoModelForSeq2SeqLM.

Best results are usually achieved by using an incremental model as the segmenter_model_name_or_path and a masked or seq2seq model as the reranker_model_name_or_path.

A segmenter is always required; a reranker is optional.
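The segmenter-plus-reranker pipeline can be sketched with toy scoring tables: the segmenter proposes candidate segmentations with scores, and the reranker rescores them before the final choice. The scores and the `alpha` interpolation below are invented placeholders standing in for the language-model probabilities that hashformers computes via minicons.

```python
# Toy sketch of the two-stage pipeline. Both scoring tables are invented
# placeholders (pretend log-probabilities), not real model outputs.
segmenter_scores = {
    "ice cold": -2.0,
    "i ce cold": -7.5,
    "icec old": -9.1,
}
reranker_scores = {
    "ice cold": -1.2,
    "i ce cold": -8.0,
    "icec old": -9.5,
}

def pick_best(candidates, alpha=0.5):
    """Interpolate segmenter and reranker scores, return the best candidate."""
    def combined(c):
        return alpha * segmenter_scores[c] + (1 - alpha) * reranker_scores[c]
    return max(candidates, key=combined)

print(pick_best(list(segmenter_scores)))  # ice cold
```

The reranker only reorders a short list of candidates, so it can afford a stronger (slower) model than the segmenter that generated them.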

Contributing

Pull requests are welcome! Read our paper for more details on the inner workings of our framework.

If you want to develop the library, you can install hashformers directly from this repository ( or your fork ):

git clone https://github.com/ruanchaves/hashformers.git
cd hashformers
pip install -e .

Relevant Papers

This is a collection of papers that have utilized the hashformers library as a tool in their research.

hashformers v1.3

These papers have utilized hashformers version 1.3 or below.

Blog Posts

Citation

@misc{rodrigues2021zeroshot,
      title={Zero-shot hashtag segmentation for multilingual sentiment analysis}, 
      author={Ruan Chaves Rodrigues and Marcelo Akira Inuzuka and Juliana Resplande Sant'Anna Gomes and Acquila Santos Rocha and Iacer Calixto and Hugo Alexandre Dantas do Nascimento},
      year={2021},
      eprint={2112.03213},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

hashformers's People

Contributors

ruanchaves


hashformers's Issues

NER tutorial

Write a NER tutorial that applies hashformers to the HashSet dataset.

API Reference Documentation

Write API Reference documentation for every subpackage.

  • segmenter
  • experiments
  • evaluation
  • ensemble
  • beamsearch

Proposal: Replace mlm-scoring dependency with better-mlm-scoring

Please Describe The Problem To Be Solved

The current dependency on mlm-scoring introduces a challenge due to its mxnet-cu110 dependency. This is causing compatibility issues with Google Colab and prevents us from updating our software stack. Furthermore, there could be potential improvements in scoring quality.

(Optional): Suggest A Solution

I suggest replacing mlm-scoring with better-mlm-scoring. This library doesn't rely on mxnet-cu110, resolving our compatibility issues. It would also allow us to keep our stack current, and is reported to yield better scoring results. The main trade-off will be the time invested in the transition and subsequent testing.

Segmenters tutorial

Expand the standard tutorial to include the new segmenters.

  • tweet segmenter
  • unigram segmenter
  • word segmenter cascades

Proposal: Replace lm-scorer and mlm-scoring back-ends with minicons

Current Scenario

Hashformers currently uses two back-end systems: lm-scorer for GPT-2 and mlm-scoring for BERT. However, these packages have not been updated in the last three years, resulting in compatibility issues and performance limitations. For instance, our reranker based on mlm-scoring is now incompatible with Google Colab.

Proposed Solution

I propose we switch both back-end systems to minicons. This change offers several benefits that can help improve the functionality and performance of Hashformers:

  • Updated Software: minicons is an actively developed software, with the last commit made less than a week ago.
  • Model Flexibility: By transitioning to minicons, we are not limited to using just GPT-2 or BERT. Instead, we can use any Transformer model we prefer. This flexibility could potentially create a new SOTA for hashtag segmentation tasks, as we would have access to more powerful models.
  • Reduced Compatibility Issues: As minicons is built on the latest version of the transformers library, we can expect fewer compatibility problems.
  • Improved Algorithm: The existing algorithm behind mlm-scoring is possibly outdated, as indicated by the development of better-mlm-scoring. This improved scoring method is expected to be integrated soon into minicons through a PR.

In light of these advantages, I believe that transitioning to minicons will significantly enhance the efficiency and effectiveness of our library.
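For context, the scoring method behind mlm-scoring is pseudo-log-likelihood (PLL): each token is masked in turn and scored by the probability the masked language model assigns to it given the rest of the sentence. Here is a minimal sketch with a stand-in probability function; the heuristic probabilities are invented, whereas a real implementation queries BERT (or, after this proposal, minicons) for each masked position.

```python
import math

def fake_masked_prob(tokens, i):
    """Stand-in for a masked LM: an invented probability of tokens[i]
    given the sentence with position i masked out."""
    return 0.9 if len(tokens[i]) > 2 else 0.5  # invented heuristic

def pseudo_log_likelihood(tokens):
    """Sum of log P(token_i | sentence with position i masked) over all i."""
    return sum(math.log(fake_masked_prob(tokens, i)) for i in range(len(tokens)))

score = pseudo_log_likelihood(["ice", "cold"])  # sum of two log(0.9) terms
```

The better-mlm-scoring adjustment mentioned above changes which neighboring tokens get masked alongside the target, which is why it can be swapped in without altering this overall PLL structure.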

Impossible to install hashformers

I'm trying to install hashformers with pip install hashformers, but it gets stuck during package installation and then yields the following error:

Attempting uninstall: pandas
Found existing installation: pandas 2.2.2
Uninstalling pandas-2.2.2:
Successfully uninstalled pandas-2.2.2
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
visions 0.7.6 requires pandas>=2.0.0, but you have pandas 1.5.3 which is incompatible.
Successfully installed accelerate-0.25.0 aiohttp-3.9.5 aiosignal-1.3.1 frozenlist-1.4.1 fsspec-2024.6.0 hashformers-2.0.0 huggingface-hub-0.23.4 minicons-0.2.44 mpmath-1.3.0 multidict-6.0.5 nvidia-cublas-cu12-12.1.3.1 nvidia-cuda-cupti-cu12-12.1.105 nvidia-cuda-nvrtc-cu12-12.1.105 nvidia-cuda-runtime-cu12-12.1.105 nvidia-cudnn-cu12-8.9.2.26 nvidia-cufft-cu12-11.0.2.54 nvidia-curand-cu12-10.3.2.106 nvidia-cusolver-cu12-11.4.5.107 nvidia-cusparse-cu12-12.1.0.106 nvidia-nccl-cu12-2.20.5 nvidia-nvjitlink-cu12-12.5.40 nvidia-nvtx-cu12-12.1.105 openai-0.28.1 pandas-1.5.3 regex-2024.5.15
safetensors-0.4.3 sympy-1.12.1 tenacity-8.4.2 tokenizers-0.19.1 torch-2.3.1 transformers-4.41.2 triton-2.3.1 twitter-text-python-1.1.1 urllib3-1.26.19 yarl-1.9.4
