python-arpa's People

Contributors

sfischer13

python-arpa's Issues

failure

import arpa
File "...\arpa\models\base.py", line 8
class ARPAModel(metaclass=ABCMeta):
^
SyntaxError: invalid syntax
Sorry, when I import arpa, it raises this error. Why?

Is there a way to list all possible ngrams for a given string?

The model's vocab only returns the unigrams:

>>> import arpa
>>> x = arpa.loadf('big.arpa')
>>> x[0].vocabulary()
['!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '<', '</s>', '<s>', '<unk>', '=', '>', '?', '@', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '[', ']', '^', '_', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '|', '~']

But the model contains 3- to 5-gram probabilities. Is it possible to list the 3- to 5-grams available for a sentence?

E.g. given a string:

T h e _ P r o j e c t _ G u t e n b e r g _ E B o o k _ o f _ T h e _ A d v e n t u r e s _ o f _ S h e r l o c k _ H o l m e s 
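
As a rough sketch of the enumeration step only, the candidate 3- to 5-grams of a tokenized string can be generated in plain Python. The helper name `ngrams` and its parameters are hypothetical and not part of python-arpa; checking which candidates are actually stored in the model would still depend on the library's own lookup, which is not shown here.

```python
def ngrams(tokens, min_order=3, max_order=5):
    """Yield every contiguous n-gram with min_order <= n <= max_order."""
    for n in range(min_order, max_order + 1):
        # The last valid start index for an n-gram is len(tokens) - n.
        for start in range(len(tokens) - n + 1):
            yield tuple(tokens[start:start + n])


# Character-level tokens, as in the example string above (shortened).
tokens = "T h e _ P r o j e c t".split()
candidates = list(ngrams(tokens))
print(len(candidates), candidates[0])
```

Each candidate is a tuple of tokens, so it can be passed on to whatever membership or scoring call the model exposes.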

import error

Hi,
Thanks for creating this toolkit; it seems to satisfy all my needs.
However, after finishing pip install and trying to import the package, I got this error. Any help would be appreciated.

Python 2.7.10 (default, Aug 22 2015, 20:33:39)
[GCC 4.2.1 Compatible Apple LLVM 7.0.0 (clang-700.0.59.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import arpa
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Python/2.7/site-packages/arpa/__init__.py", line 37, in <module>
    from .arpa import dump, dumpf, dumps, load, loadf, loads
  File "/Library/Python/2.7/site-packages/arpa/arpa.py", line 3, in <module>
    from .models.simple import ARPAModelSimple
  File "/Library/Python/2.7/site-packages/arpa/models/simple.py", line 3, in <module>
    from .base import ARPAModel
  File "/Library/Python/2.7/site-packages/arpa/models/base.py", line 8
    class ARPAModel(metaclass=ABCMeta):
                             ^
SyntaxError: invalid syntax
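
The traceback above comes from Python 2.7, and the `metaclass=` keyword in a class statement is Python 3 syntax, so a Python 2 interpreter rejects the file before running any of it. A minimal sketch of a version check that a script could run before attempting the import follows; the function name `supports_metaclass_keyword` is hypothetical and not part of python-arpa.

```python
import sys


def supports_metaclass_keyword(version_info=sys.version_info):
    """Return True if `class C(metaclass=...)` parses on this interpreter.

    The keyword form was introduced with Python 3; Python 2 uses the
    `__metaclass__` class attribute instead.
    """
    return version_info[0] >= 3


if not supports_metaclass_keyword():
    sys.exit("This package uses Python 3 class syntax; run it with python3.")
```

The check has to live in a file that itself avoids Python 3 only syntax, otherwise the interpreter fails while parsing and never reaches it.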

The efficiency of computing backoff

def log_p_raw(self, ngram):
    try:
        return self._log_p(ngram)
    except KeyError:
        if len(ngram) == 1:
            raise KeyError
        else:
            try:
                log_bo = self._log_bo(ngram[:-1])
            except KeyError:
                log_bo = 0
            return log_bo + self.log_p_raw(ngram[1:])

This try...except mechanism for implementing the backoff may not be efficient enough.
According to the python documentation:

A try/except block is extremely efficient if no exceptions are raised. Actually catching an exception is expensive.

However, it is common in a language model to encounter unseen n-grams and to back off to lower orders. Thus, I guess it may be more appropriate to implement this using if...else (as also suggested in the documentation) instead of try...except.
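
A minimal sketch of the if...else variant, assuming the probabilities and backoff weights are held in plain dicts keyed by n-gram tuples. The class `TinyBackoffModel` and its attributes `_ps` and `_bos` are hypothetical stand-ins, not the library's actual storage. The loop follows the same backoff rule as the recursive method above, but a missing n-gram is detected with `dict.get` rather than by catching `KeyError`.

```python
class TinyBackoffModel:
    """Hypothetical stand-in for an ARPA model backed by two dicts."""

    def __init__(self, log_probs, log_backoffs):
        self._ps = log_probs      # n-gram tuple -> log10 probability
        self._bos = log_backoffs  # n-gram tuple -> log10 backoff weight

    def log_p_raw(self, ngram):
        total_bo = 0.0
        while True:
            log_p = self._ps.get(ngram)
            if log_p is not None:
                return total_bo + log_p
            if len(ngram) == 1:
                # Unknown unigram: same outcome as the original code.
                raise KeyError(ngram)
            # A missing backoff weight counts as log10(1) = 0.
            total_bo += self._bos.get(ngram[:-1], 0.0)
            ngram = ngram[1:]


model = TinyBackoffModel(
    log_probs={("a",): -1.0, ("b",): -2.0, ("a", "b"): -0.5},
    log_backoffs={("a",): -0.3},
)
print(model.log_p_raw(("a", "b")))  # seen bigram, no backoff needed
print(model.log_p_raw(("a", "a")))  # unseen bigram, backs off to ("a",)
```

Whether this is measurably faster than the exception-based version would need profiling on a real model; the sketch only shows that the control flow does not require exceptions on the common path.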

important parser error

'(\t(-?\\d+(\\.\\d+)?)([eE]-?\\d+)?)?$')

The regular expression has an error here. Consider the case where the line is:
-2.310726 maybe when 9.609759e-05
The exponent in the backoff weight is not correctly parsed -- the e-05 will be missed.
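
The behavior can be reproduced with Python's `re` module using only the quoted tail of the pattern. This sketch assumes the fields of the line are tab-separated and that the backoff weight is read from the mantissa group; the full pattern used by the library is not shown here. The second pattern adds one group around mantissa and exponent, which is one way to capture the whole number.

```python
import re

# Tail of the pattern as quoted in this issue.
old_tail = re.compile(r"(\t(-?\d+(\.\d+)?)([eE]-?\d+)?)?$")
# Variant with one extra group wrapping mantissa and exponent.
new_tail = re.compile(r"(\t((-?\d+(\.\d+)?)([eE]-?\d+)?))?$")

line = "-2.310726\tmaybe when\t9.609759e-05"

old_match = old_tail.search(line)
new_match = new_tail.search(line)

# In the old pattern, group 2 holds only the mantissa and the
# exponent ends up separately in group 4.
print(old_match.group(2), old_match.group(4))
# In the variant, group 2 holds the complete number.
print(float(new_match.group(2)))
```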

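In the corrected version, there should be an extra bracket/group in the expression that wraps the mantissa and the exponent together. Like this (the spaces are presumably only for readability):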
"(\t ( (-?\d+(\.\d+)?)([eE]-?\d+)? ) )?$"

Weird versioning breaking pipenv

Pipfile:
[[source]]
url = "https://pypi.python.org/simple"
verify_ssl = true

[packages]
arpa = "*"

$ pipenv install
Warning: Your dependencies could not be resolved. You likely have a mismatch in your sub-dependencies.
You can use $ pipenv install --skip-lock to bypass this mechanism, then run $ pipenv graph to inspect the situation.
Could not find a version that matches arpa
Tried: 0.1.0a1, 0.1.0a1, 0.1.0a2, 0.1.0a2, 0.1.0a3, 0.1.0a4, 0.1.0a4, 0.1.0a5, 0.1.0a5, 0.1.0a6, 0.1.0a6, 0.1.0b1, 0.1.0b1

Whereas:
$ pip install arpa
No problems.
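
One possible reading of the "Tried" list is that every published version is an alpha or beta pre-release, which a resolver that excludes pre-releases by default would reject. The sketch below checks that reading with a simplified regular expression; it covers only the `aN`, `bN` and `rcN` suffixes and is not a full PEP 440 parser.

```python
import re

# Versions copied from the pipenv output above (duplicates included).
tried = [
    "0.1.0a1", "0.1.0a1", "0.1.0a2", "0.1.0a2", "0.1.0a3",
    "0.1.0a4", "0.1.0a4", "0.1.0a5", "0.1.0a5", "0.1.0a6",
    "0.1.0a6", "0.1.0b1", "0.1.0b1",
]

# Simplified pre-release check: release digits followed by a, b or rc.
PRE_RELEASE = re.compile(r"^\d+(\.\d+)*(a|b|rc)\d+$")

unique_versions = sorted(set(tried))
all_pre_releases = all(PRE_RELEASE.match(v) for v in unique_versions)
print(unique_versions, all_pre_releases)
```

If that reading is right, allowing pre-releases (for example with pipenv's `--pre` flag) is a likely workaround, though this has not been verified against the setup in this report.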

Performance issue

Hello,

I have just noticed that the library's running time is very long compared to similar tools. My profiler shows that the script spends most of its time in base.py:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   64.243   64.243 <ipython-input-447-6daa9270bd10>:6(main)
        1    0.000    0.000   64.243   64.243 <string>:1(<module>)
    64294    0.089    0.000    0.116    0.000 base.py:109(_check_input)
    64294    0.134    0.000   63.056    0.001 base.py:122(_replace_unks)
   192881    0.141    0.000   62.922    0.000 base.py:123(<genexpr>)
   128588    0.229    0.000   62.781    0.000 base.py:13(__contains__)
    64294    0.138    0.000   63.694    0.001 base.py:27(log_p)
...

That points to the lines:

def _replace_unks(self, words):
    return tuple((w if w in self else self._unk) for w in words)

I am a newbie in Python, so I am not sure why this is happening or how to fix it; I just wanted to let you know.

I am running Jupyter Notebook v5.4.0 with Python v3.5.4 on OS X.
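
In the profile above, `__contains__` accounts for about 62.8 s over 128588 calls, roughly half a millisecond per membership test, while its own time is only 0.229 s. That suggests each `w in self` does substantial work in something it calls, which is elided from the listing. The sketch below assumes, without having confirmed it in the library source, that the vocabulary is rebuilt on every test, and shows how caching it in a set makes each lookup cheap. Both classes are hypothetical.

```python
class SlowVocab:
    """Hypothetical model that rebuilds its vocabulary on every lookup."""

    def __init__(self, words):
        self._words = list(words)

    def vocabulary(self):
        # Sorting and de-duplicating costs O(n log n) per call.
        return sorted(set(self._words))

    def __contains__(self, word):
        return word in self.vocabulary()


class CachedVocab(SlowVocab):
    """Same interface, but membership uses a set built once."""

    def __init__(self, words):
        super().__init__(words)
        self._vocab_set = frozenset(self._words)

    def __contains__(self, word):
        return word in self._vocab_set  # average O(1)


def replace_unks(model, words, unk="<unk>"):
    """Standalone version of the method quoted above."""
    return tuple((w if w in model else unk) for w in words)


print(replace_unks(CachedVocab(["a", "b", "c"]), ["a", "z"]))
```

The cache would need to be invalidated whenever entries are added to the model, which this sketch does not handle.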