
numerizer's Introduction

numerizer

A Python module to convert natural language numerics into ints and floats. This is a port of the Ruby gem numerizer.

Installation

The numerizer library can be installed from PyPI as follows:

$ pip install numerizer

or from source as follows:

$ git clone https://github.com/jaidevd/numerizer.git
$ cd numerizer
$ pip install -e .

Usage

>>> from numerizer import numerize
>>> numerize('forty two')
'42'
>>> numerize('forty-two')
'42'
>>> numerize('four hundred and sixty two')
'462'
>>> numerize('one fifty')
'150'
>>> numerize('twelve hundred')
'1200'
>>> numerize('twenty one thousand four hundred and seventy three')
'21473'
>>> numerize('one million two hundred and fifty thousand and seven')
'1250007'
>>> numerize('one billion and one')
'1000000001'
>>> numerize('nine and three quarters')
'9.75'
>>> numerize('platform nine and three quarters')
'platform 9.75'
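
In the examples above, numerize returns strings. When a numeric value is needed, the result can be converted afterwards. The following is a minimal sketch, assuming the numerized text is a plain integer or decimal string such as '42' or '9.75'. The name to_number is a hypothetical helper and is not part of numerizer.

```python
def to_number(numerized):
    """Convert a numerized string such as '42' or '9.75' to int or float.

    Hypothetical helper, not part of numerizer. Raises ValueError for
    strings that still contain words, e.g. 'platform 9.75'.
    """
    text = numerized.strip()
    try:
        # Whole numbers such as '42' or '1250007' parse as int.
        return int(text)
    except ValueError:
        # Anything else is tried as a float, e.g. '9.75'.
        return float(text)
```

For mixed output such as 'platform 9.75', the number would have to be extracted from the string before conversion.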

Using the spaCy extension

Since version 0.2, numerizer is available as a spaCy extension.

Any named entities of a quantitative nature within a spaCy document can be numerized as follows:

>>> from spacy import load
>>> nlp = load('en_core_web_sm')  # or load any other spaCy model
>>> doc = nlp('The projected revenue for the next quarter is over two million dollars.')
>>> doc._.numerize()
{the next quarter: 'the next 1/4', over two million dollars: 'over 2000000 dollars'}

Users can specify which entity types are to be numerized, by using the labels argument in the extension function, as follows:

>>> doc._.numerize(labels=['MONEY'])  # only numerize entities of type 'MONEY'
{over two million dollars: 'over 2000000 dollars'}

The extension is available for tokens and spans as well.

>>> two_million = doc[-4:-2]  # span corresponding to "two million"
>>> two_million._.numerize()
'2000000'
>>> quarter = doc[6]  # token corresponding to "quarter"
>>> quarter._.numerized
'1/4'

Extras

For R users, a wrapper library has been developed by @amrrs. Try it out here.

numerizer's People

Contributors

amrrs, fchavat, jaidevd, nsj806

numerizer's Issues

Unable to numerize some numbers

Unable to numerize 225755. I tried to numerize it using:
numerize('two hundred twenty fife thousand seven hundred and fifty-five')
but the output is:
225007 hundred 55

Problems unifying numbers

Found some cases in which the library fails to unify numbers.
Some cases identified:

  • thirty two and forty one
    • Expected: 32 and 41
    • Gets: 32 and 40 1
  • thirty two and forty one thousand
    • Expected: 32 and 41000
    • Gets: 32 and 40 1000

Reason found:
There is a method called andition whose goal, as far as I can tell, is to unify numbers that are separated by whitespace when their magnitudes follow the normal left-to-right reading order.

A fix was implemented in #21. The tests run without any failures.
New tests were added to cover the issue.
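
The first failing case above ('32 and 40 1') can be illustrated with a small post-processing pass that merges a tens value followed by a lone digit. This is only a sketch of the idea. It is not the fix from #21 and not the library's andition method, and merge_tens_units is a hypothetical name. It deliberately leaves the second case ('40 1000') untouched.

```python
import re

# A multiple of ten from 20 to 90, whitespace, then a single digit 1-9.
# The word boundaries keep '140 1' and '40 1000' from matching.
_TENS_UNITS = re.compile(r"\b([2-9])0\s+([1-9])\b")


def merge_tens_units(text):
    """Merge fragments such as '40 1' into '41' (illustrative only)."""
    return _TENS_UNITS.sub(lambda m: m.group(1) + m.group(2), text)
```

A pass like this cannot tell a split number from two numbers that are genuinely separate, which is one reason a fix inside the parser is preferable to patching the output.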

Issue with numerize extension in spaCy version 3.6.1

Problem:

We have been using spaCy along with the numerize extension successfully to extract money amounts in string format and convert them into integers. However, after upgrading from spaCy version 3.5.0 to 3.6.1, we can no longer reproduce the previous result for a specific pattern.

import spacy

nlp = spacy.load("en_core_web_trf")
amount = nlp("55  thousand")._.numerize() #  two spaces between 55 and thousand
print(amount)

Expected result in spaCy 3.5.0 - {55 thousand: '55000'}
Different result in spaCy 3.6.1 - {55 : '55 '}

The code above correctly extracted the money amount as an integer (i.e., 55000) in the old spaCy version. However, after upgrading spaCy and en_core_web_trf to version 3.6.1, it fails to pick up the amount when there are two spaces between the digits and the magnitude word.

Environment:

spaCy Version: 3.6.1
docker image: python:3.11.4-slim-bullseye

Please let me know if this is the right place to raise this issue. I can move it to the spaCy GitHub repo if that's more appropriate.
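
One possible workaround for the report above is to collapse repeated whitespace before the text reaches the pipeline. The sketch below shows only the normalization step. Whether it restores the 3.5.0 result under spaCy 3.6.1 is an assumption that has not been verified here, and collapse_whitespace is a hypothetical helper name.

```python
import re


def collapse_whitespace(text):
    """Replace each run of whitespace with one space and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()
```

With spaCy and numerizer installed, the call from the report would become nlp(collapse_whitespace("55  thousand"))._.numerize().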

Only works with en_core_web_sm model

There are three different spaCy models.

  • en_core_web_sm
  • en_core_web_md
  • en_core_web_lg

Having one of them should be sufficient for the extension to work.
This part should be generalized:

nlp = spacy.load('en_core_web_sm')

Current error:

>>> from numerizer import numerize
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/USR/.local/lib/python3.10/site-packages/numerizer/__init__.py", line 1, in <module>
    from .numerizer import numerize, spacy_numerize  # NOQA: F401
  File "/home/USR/.local/lib/python3.10/site-packages/numerizer/numerizer.py", line 5, in <module>
    nlp = spacy.load('en_core_web_sm')
  File "/home/USR/.local/lib/python3.10/site-packages/spacy/__init__.py", line 51, in load
    return util.load_model(
  File "/home/USR/.local/lib/python3.10/site-packages/spacy/util.py", line 427, in load_model
    raise IOError(Errors.E050.format(name=name))
OSError: [E050] Can't find model 'en_core_web_sm'. It doesn't seem to be a Python package or a valid path to a data directory.

Workaround: Install the small model:

  • python3 -m spacy download en_core_web_sm
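
The generalization requested above could look roughly like the following: try each model name in turn and keep the first one that loads. This is a sketch and not the library's code. The name load_first_available and its loader parameter are hypothetical, and the only spaCy behavior it relies on is the one visible in the traceback, namely that spacy.load raises OSError when a model is missing.

```python
DEFAULT_MODELS = ("en_core_web_sm", "en_core_web_md", "en_core_web_lg")


def load_first_available(loader, names=DEFAULT_MODELS):
    """Return loader(name) for the first model name that loads.

    `loader` is expected to behave like spacy.load and raise OSError
    for a model that is not installed.
    """
    for name in names:
        try:
            return loader(name)
        except OSError:
            continue  # model not installed, try the next candidate
    raise OSError("None of these models could be loaded: " + ", ".join(names))
```

With spaCy installed, this would be called as nlp = load_first_available(spacy.load). Passing the loader in as an argument keeps the function testable without any model present.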

Problems with parentheses and slash characters

Hello,

I detected an issue with some inputs involving parentheses and slash characters.
For example, the text '2*(45+21)/6' is converted into '2*(45+211/6'.
I fixed the error by replacing line 266 with:
s = re.sub(r'(?:^|[a-zA-Z])\/(\d+)', r'1/\1', s)

Thanks
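
The effect of the proposed substitution can be checked in isolation. The snippet below copies the pattern from the report and applies it to a few inputs. It does not reproduce the original line 266, which is not shown in the report, and expand_leading_fraction is a hypothetical wrapper name.

```python
import re

# Pattern copied verbatim from the fix proposed in the report.
PROPOSED_PATTERN = r'(?:^|[a-zA-Z])\/(\d+)'


def expand_leading_fraction(s):
    """Apply the proposed substitution to a string."""
    return re.sub(PROPOSED_PATTERN, r'1/\1', s)


# The slash follows ')', which is neither the start of the string
# nor a letter, so the expression is left unchanged.
print(expand_leading_fraction('2*(45+21)/6'))  # 2*(45+21)/6

# A slash at the start of the string is expanded to a unit fraction.
print(expand_leading_fraction('/4'))  # 1/4
```

One thing to keep in mind is that the non-capturing group consumes the letter before the slash, so 'a/4' also becomes '1/4' and the letter is dropped.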

Error with 'a' being converted to 1

Hello,
I am using numerizer installed directly with pip install. It seems that one of its features converts every mention of 'a' in my text to 1, even when it does not denote the number one.

I see the same issue with 'some one'.

Is there any workaround or fix for this?

Float amounts yield wrong derived values

Great library, thank you for your work.

The issue is that floats are not matched as a whole. In the example below, the regex matches only the last digit, 2, of 1.2. The result is then concatenated, which yields a wrong value.

from numerizer import numerize

In [0]: numerize("1.2 million")
Out[0]: '1.2000000'
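
The expected behavior for this case can be sketched by matching the whole decimal before the magnitude word and multiplying, rather than concatenating digits. This is an illustration under stated assumptions and not how numerizer handles the input. The names scale_decimals and MAGNITUDES are hypothetical, and only three magnitude words are covered.

```python
import re
from decimal import Decimal

# Hypothetical lookup table covering a few magnitude words only.
MAGNITUDES = {"thousand": 10**3, "million": 10**6, "billion": 10**9}

# A whole number or decimal, whitespace, then a magnitude word.
_SCALED = re.compile(r"(\d+(?:\.\d+)?)\s+(thousand|million|billion)\b", re.IGNORECASE)


def scale_decimals(text):
    """Rewrite '1.2 million' as '1200000' by multiplying, not concatenating."""
    def _replace(match):
        value = Decimal(match.group(1)) * MAGNITUDES[match.group(2).lower()]
        if value == value.to_integral_value():
            return str(int(value))  # drop a trailing '.0'
        return format(value.normalize(), "f")  # keep a real fraction
    return _SCALED.sub(_replace, text)
```

Decimal is used instead of float so that values such as 1.2 are multiplied exactly.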

spaCy extensions for numerizer

Suppose my spaCy doc is as follows:

Narendra Modi is leading in Varanasi with one hundred and fifty three votes. He is contesting from two constituencies.

Then, we have two numerical entities here.

  1. one hundred and fifty three is a spaCy span.
  2. two is a spaCy token.

For both of them, we can add extensions as follows:

>>> token
two
>>> span
one hundred and fifty three
>>> token._.numerize
'2'
>>> span._.numerize
'153'
