psolin / cleanco

Company Name Processor written in Python

License: MIT License

cleanco's Introduction

cleanco - clean organization names

What is it / what does it do?

This is a Python package that processes company names, providing cleaned versions of the names by stripping away terms indicating organization type (such as "Ltd." or "Corp").

Using a database of organization type terms, it also provides a utility to deduce the type of organization, in terms of US/UK business entity types (e.g. "limited liability company" or "non-profit").

Finally, the system uses the term information to suggest countries the organization could be established in. For example, the term "Oy" in a company name suggests it is established in Finland, whereas "Ltd" in a company name could mean the UK, the US or a number of other countries.

How do I install it?

Just use 'pip install cleanco' if you have pip installed (as most systems do). Or download the zip distribution from this site, unzip it and then:

Mac: cd into it, and enter sudo python setup.py install along with your system password.
Windows: Same thing but without sudo.

How does it work?

Let's look at some sample code. To get the base name of a business without legal suffix:

>>> from cleanco import basename
>>> business_name = "Some Big Pharma, LLC"
>>> basename(business_name)
'Some Big Pharma'

Note that sometimes a name may have e.g. two different suffixes after one another. The cleanco term data covers many of these, but you may want to run basename() twice on the name, just in case.
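
The "run it twice" advice can be generalized to "run it until nothing changes". The following is a minimal sketch of that idea, not part of cleanco: strip_until_stable() and toy_strip() are hypothetical names, and toy_strip() is only a stand-in so the example runs without the package installed.

```python
from typing import Callable


def strip_until_stable(name: str, strip: Callable[[str], str], max_passes: int = 5) -> str:
    """Apply `strip` repeatedly until the name stops changing."""
    for _ in range(max_passes):
        stripped = strip(name)
        if stripped == name:
            break
        name = stripped
    return name


def toy_strip(name: str) -> str:
    """Hypothetical stand-in for basename(): drops one trailing 'Ltd' or 'Co'."""
    parts = name.rstrip(" ,.").split()
    if parts and parts[-1].rstrip(".,").lower() in {"ltd", "co"}:
        parts = parts[:-1]
    return " ".join(parts).rstrip(" ,.")


print(strip_until_stable("Acme Co. Ltd.", toy_strip))
```

With cleanco installed, passing basename as the strip argument should behave the same way, assuming it takes the name as its only required argument as in the example above.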

If you want to use your own custom terms, see custom_basename(), which also provides other ways to adjust how the base name is produced.

To get the business type or country:

>>> from cleanco import typesources, matches
>>> classification_sources = typesources()
>>> matches("Some Big Pharma, LLC", classification_sources)
['Limited Liability Company']

To get the possible countries of jurisdiction:

>>> from cleanco import countrysources, matches
>>> classification_sources = countrysources()
>>> matches("Some Big Pharma, LLC", classification_sources)
['United States of America', 'Philippines']

Are there bugs?

See the issue tracker. If you find a bug or have an enhancement suggestion or a question, please file an issue and provide a PR if you can. For example, some of the company suffixes may be incorrect, or there may be suffixes missing.

To run tests, simply install the package and run python setup.py test. To run tests on multiple Python versions, install tox and run it (see the provided tox.ini).

Special thanks to:

Wikipedia's Types of Business Entity article, where I spent hours of research.
Contributors: Petri Savolainen

cleanco's Issues

Acronyms for the legal entity should be included

Abbreviations of legal entity types should be added so that names are classified correctly.

In [3]: matches("Relience Private Limited", classification_sources)                                       
Out[3]: 
['Hong Kong',
 'Israel',
 'New Zealand',
 'Pakistan',
 'United Kingdom',
 'United States of America']

India should have been in the output here.

Move data away from main class (into country-specific modules?)

To support, for example, abbreviation expansion for languages other than English, it would be better if the data was split into submodules rather than kept embedded in the class.

For example, add a "data" subpackage to contain language modules with names from the ISO 639-1 standard. So current ones would be in module "data/uk.py".

If this is ok, I can provide an implementation.

Drop Python 3.5, add GH actions, drop Travis CI, start preparations for 2.1

Done. Need someone to update changelog.

Croatian companies

A minor thing:

'Croatia': ['d.d.', 'd.d.o.', 'obrt'],

should be:

'Croatia': ['d.d.', 'd.o.o.', 'obrt'],

d.o.o = "drustvo ogranicene odgovornosti"; there is no d.d.o (but d.d. is OK, as it stands for "dionicko drustvo")

Add support for case, whitespace & separator normalization

I understand this may fall outside the scope, but it would be very convenient if cleanco also had this kind of simple normalization built-in:

standardizing lettercase (e.g., all lowercase)
standardizing separators (e.g., commas must be followed by spaces)
standardizing whitespace (e.g., converting all runs of whitespace to single spaces)
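
The three normalizations listed above can be sketched with the standard library alone. This is an illustration of the request, not cleanco code; normalize_name() is a hypothetical name.

```python
import re


def normalize_name(name: str) -> str:
    """Lowercase, put one space after each comma, and collapse whitespace."""
    name = name.lower()
    name = re.sub(r"\s*,\s*", ", ", name)   # each comma is followed by one space
    name = re.sub(r"\s+", " ", name)        # runs of whitespace become one space
    return name.strip()


print(normalize_name("  Some   Big Pharma ,LLC "))
```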

Could we please have 1.4 or 1.3.1 (or any release) on PyPI with the recent patches?

support "public" company distinction

In many countries, a "public" limited (liability) company has a distinction that its shares are publicly traded or -tradable. We don't have this distinction in cleanco currently.

error in Belgium

Hi guys,

There is an error in one of the Belgian types.
The correct type is CVBA: coöperatieve vennootschap met beperkte aansprakelijkheid (CVBA).

However, in your covered terms it is written as 'cbva':
https://pydoc.net/cleanco/1.3/termdata/

Thanks!

Use ISO3166 country names

This makes it easier to map the country-specific codes to country data in other systems. The names can be found for example in the python "iso3166" package.

Add some proper tests

Use unittest, or py.test or nose, whatever you prefer. I would recommend using py.test with https://pypi.python.org/pypi/hypothesis/

use spaces for indentation (change tabs to spaces)?

By convention, spaces are nowadays used, see PEP8: https://www.python.org/dev/peps/pep-0008/ - but not at the expense of consistency.

Cleanco not properly identifying Czech companies

Hello, thank you for your work on cleanco.

Cleanco does not seem to work properly for Czech companies, e.g.:

>>> c = cleanco("Company s.r.o.")
>>> c.type() is None
True
>>> c.country() is None
True
>>> c = cleanco("Company a.s.")
>>> c.type() is None
True
>>> c.country() is None
True

This happens even though 's.r.o.' and 'a.s.' are present in termdata.py in the right places.

One more detail: in the Czech Republic, 'spol. s r.o.' can be used instead of 's.r.o.'; the two are equivalent. 'spol. s r.o.' is not present in termdata.py.

switch to setuptools

Done with not being able to use "python setup.py develop" ....

optimization and simplification suggestions

switch to function-based API

it makes no sense to instantiate a class for each cleaned name; it's overcomplex, extra work and unnecessary, especially when most of setup code is now outside the class

switch to working on whitespace-separated name parts rather than full strings

In effect we would check for example in case of suffix for business_name.split()[-1] == term rather than business_name.endswith(' ' + term). Of course the splitting would be done just once in the beginning.

at the moment, the class is splitting and rejoining the name already, to get rid of extra whitespaces
at the moment, the code already looks for a prefix/suffix that's padded by a single whitespace, so in effect it's the same

If we can just handle the fact that some legal terms are "multi-part" (whitespace-separated), this would simplify the code and make it run faster since for example we'd only have to work on the last whitespace-separated name part for suffix, and just the first for prefix. There are other cases, too.

We would not have to presort the data, either.
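
A rough sketch of the parts-based approach, including multi-part terms, might look like the following. The term list is a toy stand-in for the real data, and TERMS_BY_LENGTH and strip_suffix() are hypothetical names; grouping terms by their number of parts is one assumed way to handle "co. ltd." style terms without presorting the whole list.

```python
from collections import defaultdict

# Toy term list; the real data lives in cleanco's term database.
TERMS = ["ltd", "llc", "co. ltd.", "pty ltd"]

# Group terms by how many whitespace-separated parts they have.
TERMS_BY_LENGTH = defaultdict(set)
for term in TERMS:
    term_parts = tuple(term.split())
    TERMS_BY_LENGTH[len(term_parts)].add(term_parts)


def strip_suffix(name: str) -> str:
    """Remove one trailing legal term by comparing name parts, not substrings."""
    parts = name.split()                          # split once, up front
    lowered = [p.lower().rstrip(",") for p in parts]
    # Try the longest terms first so "co. ltd." wins over plain "ltd.".
    for n in sorted(TERMS_BY_LENGTH, reverse=True):
        if len(parts) > n and tuple(lowered[-n:]) in TERMS_BY_LENGTH[n]:
            return " ".join(parts[:-n]).rstrip(",")
    return name


print(strip_suffix("Some Big Pharma, LLC"))
print(strip_suffix("Acme Co. Ltd."))
```

Only the last one or two parts are ever compared, and a name that consists solely of a term is left untouched.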

don't use both legal and countrywise suffixes in clean_name

there are a lot of duplicates, it should be enough to use just either (preferably countrywise data since that would allow dropping off countries easily)

Add license

Under what license is this library distributed? GPL2? BSD? Something else? Please can we have a LICENSE file added?

optimize (is horribly slow)

Due to the way cleanco currently works, quite intensive operations are taking place every time a name is cleaned (a class is instantiated every time a name is cleaned; see what happens in __init__).

This should be optimized so that the operations only take place once.
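
One way to make the setup run only once is to build the lookup lazily and cache it. The sketch below uses functools.lru_cache for that; the term data, term_index() and business_type() are hypothetical stand-ins, not the actual cleanco internals.

```python
from functools import lru_cache

# Toy stand-in for the term database.
TERMS_BY_TYPE = {
    "Limited Liability Company": ["llc", "l.l.c."],
    "Limited": ["ltd", "ltd."],
}


@lru_cache(maxsize=None)
def term_index() -> dict:
    """Build the term -> type lookup on first use; later calls reuse it."""
    return {term: kind for kind, terms in TERMS_BY_TYPE.items() for term in terms}


def business_type(name: str):
    """Return the type for the last word of `name`, or None if unknown."""
    words = name.lower().split()
    return term_index().get(words[-1]) if words else None


print(business_type("Some Big Pharma, LLC"))
```

Because the index is built once per process, cleaning a second name costs only a split and a dictionary lookup.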

Not working for 'p.c.'

I am trying to parse a business name that contains p.c. as a suffix, but x.type() returns None.

Example:

x = cleanco("dentistry for children, louis a. pollina, d.d.s., p.c.")
x.type()
Returns None

Getting rid of abbreviations

Just wanted to have some thoughts on this. They seem like they could go beyond the scope of the project.

tox and python setup.py test support

Need two things:

make tests runnable by 'python setup.py test'
support multiple version (2.7 & 3.5) testing using tox

drop Python2 support

support unicode, not just ascii

If a name ends with umlaut char such as 'ä', cleaning fails. To fix, re.search needs to be called with the re.UNICODE flag.

test against all suffixes / prefixes

We have a nice database of suffixes/prefixes. We should have a test that runs cleanco.clean_name() against the full database.
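
Such a test could be as simple as looping over every term and checking that it is stripped from a fixed base name. In the sketch below, SUFFIXES and clean_name() are toy stand-ins for the real term database and cleaner, and failing_suffixes() is a hypothetical helper.

```python
# Toy stand-ins: a tiny term list and a naive cleaner.
SUFFIXES = ["ltd", "llc", "oy", "gmbh"]


def clean_name(name: str) -> str:
    """Hypothetical cleaner: drop the last word if it is a known suffix."""
    parts = name.split()
    if len(parts) > 1 and parts[-1].lower() in SUFFIXES:
        parts = parts[:-1]
    return " ".join(parts)


def failing_suffixes(base: str = "Example Company") -> list:
    """Return every suffix that the cleaner fails to strip from `base`."""
    return [s for s in SUFFIXES if clean_name(f"{base} {s}") != base]


print(failing_suffixes())
```

Returning the list of failures rather than stopping at the first one makes it easier to see at a glance which terms are broken.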

country logic does not work for terms ending with '.'

business_name = "Some Big Pharma sh.a."
x = cleanco(business_name)

print(x.business_name)
print(x.string_stripper(x.business_name))
print(x.clean_name())
print(x.country())

prints:

Some Big Pharma sh.a.
Some Big Pharma sh.a
Some Big Pharma
None

sh.a. is in the Albania terms:

cleanco/termdata.py

Line 46 in 56ff654

'Albania': ['sh.a.', 'sh.p.k.'],

It is not being recognized as Albanian because the . at the end of sh.a. is removed in:

cleanco/cleanco.py

Line 56 in 56ff654

business_name = self.string_stripper(business_name)
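
One possible fix is to strip trailing punctuation from the terms as well as from the name before comparing them. The sketch below illustrates that idea only: the Albania entry is copied from termdata.py above, the Finland entry is added for contrast, and strip_trailing() and countries() are hypothetical names.

```python
COUNTRY_TERMS = {
    "Albania": ["sh.a.", "sh.p.k."],
    "Finland": ["oy", "oyj"],
}


def strip_trailing(text: str) -> str:
    """Lowercase and drop trailing dots, commas and spaces."""
    return text.lower().rstrip(" .,")


def countries(name: str) -> list:
    """List countries whose terms end the name, ignoring trailing punctuation."""
    cleaned = strip_trailing(name)
    return [
        country
        for country, terms in COUNTRY_TERMS.items()
        if any(cleaned.endswith(" " + strip_trailing(t)) for t in terms)
    ]


print(countries("Some Big Pharma sh.a."))
```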

Remove old API with 2.2

Suggest we drop it in 2.2, whenever that will be out. See README for description and disclosure of deprecation plans.

cleanco('AMBA').clean_name() is empty

broken handling of dots within suffixes

Sigh. It seems c.clean_name() fails for any suffix with dots within it, or something like that:

>>> c = cleanco("Company l.p.")
>>> c.clean_name()
'Company l.p'
>>> c = cleanco("Company l.p.p.")
>>> c.clean_name()
'Company l.p.p'

Readme clean_name different than code

Hi,

I like what you are doing with this module! I went to run x.clean_name() and received an error saying "cleanco instance has no attribute 'clean_name'".

I looked through your code and noticed that your actual method/attribute is cleanname(). Should be fixed in one of the locations. To me clean_name() makes more sense.

Error is on line 207 of cleanco.py. Thanks.

Clean_name to remove all items after a comma

I like the idea of this and think there is a lot of use to it. I think it would be more useful if it removed everything in the company name string after (and including) a ','. I'd add this to the clean_name function, similar to how you handle hyphens.

brackets handled incorrectly

When clean_name() is used in the following way:

>>> cleanco('company (country) Pvt. Ltd.').clean_name()
'company (country'

it strips more than the legal suffix: the closing bracket is removed as well.
The expected output would be: company (country)

Company extensions ending in punctuation

Currently, Inc is removed from the end of a name, but Inc. and Inc.. are not. Trailing punctuation after a company extension should be treated as optional, including repeated punctuation.

Release 2.1

The non-determinism was fixed in #54 in June, but the latest release (2.0.1) was in April. Is it possible to release a version fixing the non-determinism?

Incorrect detection of "Pty Limited" Suffix

>>> cleanco("Example Example Pty Ltd").clean_name() # CORRECT
'Example Example'
>>> cleanco("Example Example Pty Limited").clean_name() # Not so good
'Example Example Pty'

To give you a view of the scope of the problem: I'm working to normalise a database of around 900k company names which have been typed into an application over a 10 year period. The database contains primarily companies from anglophone countries. Of these, around 580 have a company name like this.

Do you see this as a problem also? If so, I'm happy to put together a patch.

SRL missing

Hi guys,
fantastic job, but one important ending is missing: "SRL" without the final dot.
name = cleanco("Unimarkt Handelsgesellschaft SRL").clean_name(prefix=True, suffix=True, middle=True, multi=True)
print(name)
Output:
Unimarkt Handelsgesellschaft SRL

fix multipart term checking

The recent 2.0 work ignored multi-part terms, i.e. "co. ltd." type terms that contain whitespace. A significant minority of the terms are multi-part, so this regression needs to be fixed.

improve string comparisons

The current implementation just does case-insensitive matching. Comparisons are however much more complex in Unicode world. See for example:

https://stackoverflow.com/questions/319426/how-do-i-do-a-case-insensitive-string-comparison

http://www.unicode.org/reports/tr15/#Normalization_Forms_Table
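
As a rough illustration of what the links above describe, the sketch below compares names after Unicode normalization and case folding, using only the standard library. It follows the canonical caseless matching recipe (normalize, casefold, normalize again); caseless_key() and same_name() are hypothetical names, and whether NFD or a compatibility form such as NFKD is the better fit for company names is left open.

```python
import unicodedata


def caseless_key(text: str) -> str:
    """Return a key for caseless comparison that ignores composition differences."""
    decomposed = unicodedata.normalize("NFD", text)
    return unicodedata.normalize("NFD", decomposed.casefold())


def same_name(a: str, b: str) -> bool:
    return caseless_key(a) == caseless_key(b)


# Precomposed "ö" versus "O" followed by a combining diaeresis.
print(same_name("Osakeyhti\u00f6", "OSAKEYHTIO\u0308"))
```

Unlike lower(), casefold() also maps characters such as "ß" to "ss", so "Straße" and "STRASSE" compare equal.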

use ISO 20275 data from GLEIF

See https://www.gleif.org/en. There's a lot of data that would help improve the legal affix database of cleanco.

add travis ci

so that any changes are automatically tested against. We should have better tests first, though.

test for cyrillic (Russian)

Recent code commits introduced improved Unicode support. However there are no tests to demonstrate it works.

Add SE and AG

Thanks for the great library!
Could we add SE (https://en.wikipedia.org/wiki/Societas_Europaea), "AKTIENGESELLSCHAFT" (the long form of AG) and EG (Erwerbsgesellschaft), and see if it is possible to also identify a dash as a separator between a company name and its type? Examples below:

SE:
('ALBA SE';'NEW YORKER SE') --> currently: ('ALBA SE';'NEW YORKER SE'), correct: ('ALBA';'NEW YORKER')

AKTIENGESELLSCHAFT:
'WIELAND-WERKE AKTIENGESELLSCHAFT' --> currently: ('WIELAND-WERKE AKTIENGESELLSCHAFT'), correct: ('WIELAND-WERKE AKTIENGESELLSCHAFT')

EG:
'REWE DORTMUND GROSSHANDEL EG'

-AG
'DEUTSCHE VERSICHERUNGS-AG' --> currently: ('DEUTSCHE VERSICHERUNGS-AG'), correct: ('DEUTSCHE VERSICHERUNGS-AG')

Handle prefixed (and in-middle), possibly multiple terms

In Finland, you sometimes see the format "Oy Corporation Ab" where "Oy" refers to limited liability (in Finnish) and "Ab" the same (in Swedish, the other official language of Finland).

In other words, the abbreviations can also appear in front of the company name - or both before and after.
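
A minimal sketch of stripping terms from both ends of a name, assuming the name has already been split on whitespace. TERMS is a toy list and strip_affixes() is a hypothetical name, not part of cleanco.

```python
TERMS = {"oy", "ab", "oyj"}


def strip_affixes(name: str) -> str:
    """Drop known terms from both ends, keeping at least one word."""
    parts = name.split()
    while len(parts) > 1 and parts[0].lower() in TERMS:
        parts.pop(0)        # leading term, e.g. "Oy"
    while len(parts) > 1 and parts[-1].lower() in TERMS:
        parts.pop()         # trailing term, e.g. "Ab"
    return " ".join(parts)


print(strip_affixes("Oy Corporation Ab"))
```

The length check keeps a name that consists only of terms from being reduced to an empty string.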

add translated legal entity names

The legal terms are to some extent translatable across jurisdictions. It would be useful if user could ask for the business types in their own native language.

For example, as a Finnish person, limited liability company is known as "osakeyhtiö" ("oy") to me, whilst a public(ly traded) limited liability company would be called "julkinen osakeyhtiö" ("oyj") in Finland.

Polish legal endings

Hi guys,
many of the Polish companies I received have full legal endings such as:
spółka z ograniczoną odpowiedzialnością
spółka Jawna
spółka komandytowa
spółka akcyjna
spółka cywilna
spółka komandytowa
spółka z ograniczoną odpowiedzialnością

Would it be possible to add them on the list?

Inconsistent parsing

The following two equivalent names parse differently when presumably they should parse the same:

cleanco('Hello World Company Limited').clean_name()

cleanco('Hello World Company Ltd').clean_name()

Package for PyPI

Seems like a good next step. If my tests with this software prove that it is a good fit for my project, I will gladly put in the work to push it up to PyPI.

Estonian entity types mostly missing

Estonian legal entity types such as OÜ, MTÜ, AS, UÜ, TÜ are missing – only FIE seems to be supported.

Could try adding them myself – is termdata.py the only place that needs to have these added in?

Problems parsing company names with punctuations

Hello,

Very nice module, but it doesn't always handle the human-entered company names we deal with a lot. Below are some obvious examples where the name is not parsed:

LIBGAS,LTD -> LIBGAS,LTD
AIRDAS USA,LLC -> AIRDAS USA,LLC
GF LOGISTICS.INC -> GF LOGISTICS.INC
HAKUTATZ.TECH.CO.,LTD. -> HAKUTATZ.TECH.CO.,LTD

Thanks
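
For names like the ones listed above, one option is to accept a comma or dot, not just whitespace, as the separator in front of the term. The regular expression below is a sketch of that idea with a toy term list; SUFFIX and strip_suffix() are hypothetical names and this is not how cleanco currently matches terms.

```python
import re

TERMS = ["ltd", "llc", "inc"]

# A term at the end of the name, preceded by any mix of spaces, commas or dots
# and optionally followed by trailing dots or commas.
SUFFIX = re.compile(
    r"[\s,.]+(?:" + "|".join(map(re.escape, TERMS)) + r")[.,]*$",
    re.IGNORECASE,
)


def strip_suffix(name: str) -> str:
    """Remove one trailing term together with the punctuation around it."""
    return SUFFIX.sub("", name.strip())


for raw in ["LIBGAS,LTD", "AIRDAS USA,LLC", "GF LOGISTICS.INC", "HAKUTATZ.TECH.CO.,LTD."]:
    print(raw, "->", strip_suffix(raw))
```

Requiring at least one separator before the term keeps ordinary words that merely end in the same letters, such as "Zinc", from being truncated.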

add more test data (company names)

@psolin , would you have any lists of company names that you want to see tested?

still alive?

Hello there!

Thanks for this incredible package! I am just wondering: is this package still alive? I haven't seen any update for about a year.

Thanks!

Remove build directory from version control

That should not be included.

stripping of suffix fails if it ends in full stop.

Same happens if the ending character is a comma. Example:

>>> from cleanco import cleanco
>>> y='posnegansett properties, llc.'
>>> ya=cleanco(y)
>>> print ya.type()
['Limited Liability Company']
>>> print ya.clean_name()
posnegansett properties, llc

The result should not have the 'llc' suffix.