cleanco's Introduction

cleanco - clean organization names

What is it / what does it do?

This is a Python package that processes company names, providing cleaned versions of the names by stripping away terms indicating organization type (such as "Ltd." or "Corp").

Using a database of organization type terms, it also provides a utility to deduce the type of organization, in terms of US/UK business entity types (e.g. "limited liability company" or "non-profit").

Finally, the system uses the term information to suggest countries the organization could be established in. For example, the term "Oy" in a company name suggests it is established in Finland, whereas "Ltd" could mean the UK, the US or a number of other countries.

How do I install it?

Just use 'pip install cleanco' if you have pip installed (as most systems do). Or download the zip distribution from this site, unzip it and then:

  • Mac: cd into it, and enter sudo python setup.py install along with your system password.
  • Windows: Same thing but without sudo.

How does it work?

Let's look at some sample code. To get the base name of a business without legal suffix:

>>> from cleanco import basename
>>> business_name = "Some Big Pharma, LLC"
>>> basename(business_name)
'Some Big Pharma'

Note that sometimes a name may have e.g. two different suffixes after one another. The cleanco term data covers many of these, but you may want to run basename() twice on the name, just in case.
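The "run it twice" advice can be generalised into a small helper that re-applies a cleaning function until the result stops changing. This is only a sketch: `strip_until_stable` and `toy_basename` are hypothetical names, and the toy function is a stand-in so the example runs without the package installed. In real use you would pass `cleanco.basename` as `clean`.

```python
def strip_until_stable(name, clean, max_passes=3):
    """Apply `clean` repeatedly until the name stops changing."""
    for _ in range(max_passes):
        cleaned = clean(name)
        if cleaned == name:
            break
        name = cleaned
    return name


# Toy stand-in for cleanco.basename (an assumption for illustration only):
# it removes at most ONE trailing term per call.
def toy_basename(name, terms=("llc", "ltd", "inc")):
    parts = name.rstrip(" ,.").split()
    if len(parts) > 1 and parts[-1].strip(",.").lower() in terms:
        return " ".join(parts[:-1]).rstrip(" ,.")
    return name


print(strip_until_stable("Some Big Pharma Ltd, LLC", toy_basename))
```

The `max_passes` cap keeps the loop bounded even if a cleaning function never settles.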

If you want to use your own custom terms, see custom_basename(), which also provides some other ways to adjust how the base name is produced.

To get the business type:

>>> from cleanco import typesources, matches
>>> classification_sources = typesources()
>>> matches("Some Big Pharma, LLC", classification_sources)
['Limited Liability Company']

To get the possible countries of jurisdiction:

>>> from cleanco import countrysources, matches
>>> classification_sources = countrysources()
>>> matches("Some Big Pharma, LLC", classification_sources)
['United States of America', 'Philippines']

Are there bugs?

See the issue tracker. If you find a bug or have an enhancement suggestion or a question, please file an issue and provide a PR if you can. For example, some of the company suffixes may be incorrect, or there may be suffixes missing.

To run tests, simply install the package and run python setup.py test. To run tests on multiple Python versions, install tox and run it (see the provided tox.ini).


cleanco's Issues

Acronyms for the legal entity should be included

Abbreviations of legal entity types should be added so that names are classified correctly.

In [3]: matches("Relience Private Limited", classification_sources)                                       
Out[3]: 
['Hong Kong',
 'Israel',
 'New Zealand',
 'Pakistan',
 'United Kingdom',
 'United States of America']

Here there should have been India in the output.

Move data away from main class (into country-specific modules?)

To support, for example, abbreviation expansion for languages other than English, it would be better if the data was split into submodules rather than kept embedded in the class.

For example, add a "data" subpackage to contain language modules with names from the ISO 639-1 standard. So current ones would be in module "data/uk.py".

If this is ok, I can provide an implementation.

Croatian companies

A minor thing:

'Croatia': ['d.d.', 'd.d.o.', 'obrt'],

should be:

'Croatia': ['d.d.', 'd.o.o.', 'obrt'],

d.o.o = "drustvo ogranicene odgovornosti"; there is no d.d.o (but d.d. is OK, as it stands for "dionicko drustvo")

Add support for case, whitespace & separator normalization

I understand this may fall outside the scope, but it would be very convenient if cleanco also had this kind of simple normalization built-in:

  • standardizing lettercase (e.g., all lowercase)
  • standardizing separators (e.g., commas must be followed by spaces)
  • standardizing whitespace (e.g., converting all runs of whitespace to single spaces)
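The three normalizations listed above could look roughly like this. It is a minimal sketch using only the standard library; `normalize_name` is a hypothetical helper, not part of cleanco.

```python
import re


def normalize_name(name):
    """Lowercase, tidy comma spacing and collapse whitespace runs."""
    name = name.lower()
    # A comma is followed by exactly one space and preceded by none.
    name = re.sub(r"\s*,\s*", ", ", name)
    # Collapse any run of whitespace (tabs, newlines, spaces) to one space.
    name = re.sub(r"\s+", " ", name)
    return name.strip()


print(normalize_name("  Some   BIG Pharma ,LLC "))  # some big pharma, llc
```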

support "public" company distinction

In many countries, a "public" limited (liability) company has a distinction that its shares are publicly traded or -tradable. We don't have this distinction in cleanco currently.

Use ISO3166 country names

This makes it easier to map the country-specific codes to country data in other systems. The names can be found for example in the python "iso3166" package.

Cleanco not properly identifying Czech companies

Hello, thank you for your work on cleanco.

Cleanco does not seem to work properly for Czech companies, e.g.:

>>> c = cleanco("Company s.r.o.")
>>> c.type() is None
True
>>> c.country() is None
True
>>> c = cleanco("Company a.s.")
>>> c.type() is None
True
>>> c.country() is None
True

Although I see that 's.r.o.' and 'a.s.' are present in termdata.py in the right places.

One more detail: in the Czech Republic it is also possible to use 'spol. s r.o.' instead of 's.r.o.'; they are equivalent. 'spol. s r.o.' is not present in termdata.py.

optimization and simplification suggestions

switch to function-based API

  • it makes no sense to instantiate a class for each cleaned name; it's overcomplex, extra work and unnecessary, especially when most of setup code is now outside the class

switch to working on whitespace-separated name parts rather than full strings

In effect we would check for example in case of suffix for business_name.split()[-1] == term rather than business_name.endswith(' ' + term). Of course the splitting would be done just once in the beginning.

  • at the moment, the class is splitting and rejoining the name already, to get rid of extra whitespaces
  • at the moment, the code already looks for a prefix/suffix that's padded by a single whitespace, so in effect it's the same

If we can just handle the fact that some legal terms are "multi-part" (whitespace-separated), this would simplify the code and make it run faster since for example we'd only have to work on the last whitespace-separated name part for suffix, and just the first for prefix. There are other cases, too.

We would not have to presort the data, either.
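The part-based comparison described above might be sketched as follows. The function name and the sample terms are assumptions for illustration. Multi-part terms are stored as tuples of parts, and candidates are tried longest first, so no presorting of the term data is needed.

```python
def strip_suffix_tokens(name, terms):
    """Remove one trailing legal term by comparing whitespace-separated parts."""
    parts = name.split()  # split once; this also discards extra whitespace
    # Store each term as a tuple of parts so multi-part terms compare directly.
    term_parts = {tuple(t.lower().split()) for t in terms}
    longest = max((len(t) for t in term_parts), default=0)
    # Try the longest tail first so "pty limited" wins over "limited".
    # At least one part is always left as the base name.
    for size in range(min(longest, len(parts) - 1), 0, -1):
        tail = tuple(p.lower() for p in parts[-size:])
        if tail in term_parts:
            return " ".join(parts[:-size])
    return " ".join(parts)


sample_terms = ["ltd", "limited", "pty ltd", "pty limited"]
print(strip_suffix_tokens("Example Example Pty Limited", sample_terms))
```

Each lookup is a set membership test on at most a few trailing parts, rather than a scan over every term with `endswith`.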

don't use both legal and countrywise suffixes in clean_name

  • there are a lot of duplicates, it should be enough to use just either (preferably countrywise data since that would allow dropping off countries easily)

Add license

Under what license is this library distributed? GPL2? BSD? Something else? Please can we have a LICENSE file added?

optimize (is horribly slow)

Due to the way cleanco currently works, quite intensive operations are taking place every time a name is cleaned (a class is instantiated every time a name is cleaned; see what happens in __init__).

This should be optimized so that the operations only take place once.
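One way to make the setup run only once is to cache the prepared term list at module level. The sketch below uses `functools.lru_cache` from the standard library. `SAMPLE_TERMS`, `prepared_terms` and `clean` are hypothetical names, and the data is a tiny made-up excerpt, not the package's real term data.

```python
import functools

# Hypothetical sample data; the real term data is much larger.
SAMPLE_TERMS = {"Finland": ["oy", "oyj"], "United Kingdom": ["ltd", "plc"]}


@functools.lru_cache(maxsize=None)
def prepared_terms():
    """Build the sorted term list once; later calls reuse the cached tuple."""
    terms = {t for country_terms in SAMPLE_TERMS.values() for t in country_terms}
    # Longest first, so longer terms are tried before their shorter tails.
    return tuple(sorted(terms, key=lambda t: (-len(t), t)))


def clean(name):
    """Strip one trailing term; no per-call setup beyond a cached lookup."""
    lowered = name.lower()
    for term in prepared_terms():
        if lowered.endswith(" " + term):
            return name[: -len(term) - 1]
    return name


print(clean("Nokia Oyj"))
```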

Not working for 'p.c.'

I am trying to parse a business name which contains p.c. as an extension, but when I try to use x.type() it returns None.

Example:

x = cleanco("dentistry for children, louis a. pollina, d.d.s., p.c.")
x.type()

Returns None.

support unicode, not just ascii

If a name ends with umlaut char such as 'ä', cleaning fails. To fix, re.search needs to be called with the re.UNICODE flag.
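A small sketch of what the flag changes: with `re.UNICODE`, `\w` also matches non-ASCII letters such as 'ä'. On Python 3 this is already the default for `str` patterns, so the explicit flag mainly matters on Python 2. `LAST_WORD` and `last_word` are hypothetical names used only for this illustration.

```python
import re

# Matches the final word of a name, ignoring trailing punctuation.
# With re.UNICODE, \w also covers non-ASCII letters such as 'ä'.
LAST_WORD = re.compile(r"(\w+)\W*$", re.UNICODE)


def last_word(name):
    """Return the last word of `name`, or None if there is none."""
    match = LAST_WORD.search(name)
    return match.group(1) if match else None


print(last_word("Yritys Sisältöä"))
```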

country logic does not work for terms ending with '.'

business_name = "Some Big Pharma sh.a."
x = cleanco(business_name)

print(x.business_name)
print(x.string_stripper(x.business_name))
print(x.clean_name())
print(x.country())

prints:

Some Big Pharma sh.a.
Some Big Pharma sh.a
Some Big Pharma
None

sh.a. is in the Albania terms:

'Albania': ['sh.a.', 'sh.p.k.'],

It is not being recognized as Albanian because the . at the end of sh.a. is removed in:

business_name = self.string_stripper(business_name)
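A possible fix is to normalise the terms the same way as the name before comparing them. The sketch below assumes a stripping step that removes leading and trailing dots, commas and spaces. `COUNTRY_TERMS`, `strip_edge_punctuation` and `countries_for` are hypothetical names, and the data is a small excerpt for illustration.

```python
# Hypothetical excerpt of country term data.
COUNTRY_TERMS = {"Albania": ["sh.a.", "sh.p.k."], "Finland": ["oy"]}


def strip_edge_punctuation(text):
    """Remove leading and trailing spaces, dots and commas."""
    return text.strip(" .,")


def countries_for(name):
    """Return countries whose terms end the name, ignoring trailing dots."""
    stripped = strip_edge_punctuation(name).lower()
    found = []
    for country, terms in COUNTRY_TERMS.items():
        for term in terms:
            # Normalise the term the same way as the name before comparing.
            if stripped.endswith(" " + strip_edge_punctuation(term).lower()):
                found.append(country)
                break
    return found


print(countries_for("Some Big Pharma sh.a."))
```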

Remove old API with 2.2

Suggest we drop it in 2.2, whenever that will be out. See README for description and disclosure of deprecation plans.

broken handling of dots within suffixes

Sigh. It seems c.clean_name() fails for any suffix with dots within it, or something like that:

>>> c = cleanco("Company l.p.")
>>> c.clean_name()
'Company l.p'
>>> c = cleanco("Company l.p.p.")
>>> c.clean_name()
'Company l.p.p'

Readme clean_name different than code

Hi,

I like what you are doing with this module! I went to run x.clean_name() and received an error saying "cleanco instance has no attribute 'clean_name'".

I looked through your code and noticed that your actual method/attribute is cleanname(). Should be fixed in one of the locations. To me clean_name() makes more sense.

Error is on line 207 of cleanco.py. Thanks.

Clean_name to remove all items after a comma

I like the idea of this and think there is a lot of use to it. I think it would be more useful if it removed everything in the company name string after (and including) a ','. I'd add this into the clean_name function, similar to how you handle hyphens.
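As a sketch of the suggested behaviour (not the package's implementation), cutting at the first comma needs only `str.partition`. `cut_at_comma` is a hypothetical name.

```python
def cut_at_comma(name):
    """Keep only the text before the first comma."""
    head, _sep, _rest = name.partition(",")
    return head.rstrip()


print(cut_at_comma("Some Big Pharma, LLC"))  # Some Big Pharma
```

Note the trade-off: a name such as "Dentistry for Children, Louis A. Pollina, D.D.S., P.C." would be reduced to "Dentistry for Children", so this is probably best kept optional.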

brackets handled incorrectly

When clean_name() is used in the following way:

>>> cleanco('company (country) Pvt. Ltd.').clean_name()
'company (country'

it strips more than just the organisation type: the closing bracket is lost as well.
The expected output would be: company (country)

Company extensions ending in punctuation

It currently removes Inc from the end of a name, but it is unable to remove Inc. or Inc.. (with trailing punctuation). Trailing punctuation, possibly repeated, should be treated as optional at the end of a company extension.
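One way to express "optional trailing punctuation" is a regular expression built per term. This is a sketch under the assumption that only dots and commas need to be tolerated. `suffix_pattern` and `strip_suffix` are hypothetical helpers, not cleanco functions.

```python
import re


def suffix_pattern(term):
    """Match `term` at the end of a name, allowing trailing '.' or ','."""
    core = re.escape(term.rstrip("."))
    # Separator before the term, then the term, then any run of dots/commas.
    return re.compile(r"[\s,]+" + core + r"[.,]*\s*$", re.IGNORECASE)


def strip_suffix(name, term):
    """Remove `term` (with any trailing punctuation) from the end of `name`."""
    return suffix_pattern(term).sub("", name)


for raw in ("Acme Inc", "Acme Inc.", "Acme Inc.."):
    print(strip_suffix(raw, "inc"))
```

Requiring a separator before the term keeps names like "Acme Zinc" intact.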

Release 2.1

The non-determinism was fixed in #54 in June, but the latest release (2.0.1) was in April. Is it possible to release a version fixing the non-determinism?

Incorrect detection of "Pty Limited" Suffix

>>> cleanco("Example Example Pty Ltd").clean_name() # CORRECT
'Example Example'
>>> cleanco("Example Example Pty Limited").clean_name() # Not so good
'Example Example Pty'

To give you a view of the scope of the problem: I'm working on normalising a database of around 900k company names which have been typed into an application over a 10-year period. The database contains primarily companies from anglophone countries. Of these, around 580 have a company name like this.

Do you see this as a problem also? If so, I'm happy to put together a patch.

SRL missing

Hi guys,
fantastic job, but one important ending is missing: "SRL" without the final dot.

name = cleanco("Unimarkt Handelsgesellschaft SRL").clean_name(prefix=True, suffix=True, middle=True, multi=True)
print(name)

Output:

Unimarkt Handelsgesellschaft SRL

fix multipart term checking

The recent 2.0 work ignored multi-part terms, i.e. "co. ltd." type terms that contain whitespace. A significant minority of the terms are multi-part, so this regression needs to be fixed.

add travis ci

so that any changes are automatically tested against. We should have better tests first, though.

test for cyrillic (Russian)

Recent code commits introduced improved Unicode support. However there are no tests to demonstrate it works.

Add SE and AG

Thanks for the great library!
Could we add SE (https://en.wikipedia.org/wiki/Societas_Europaea), "AKTIENGESELLSCHAFT" (the long form of AG) and EG (Erwerbsgesellschaft), and see if it is possible to also recognise a dash as a separator between a company name and its type? Examples below:

SE:
('ALBA SE';'NEW YORKER SE') --> currently: ('ALBA SE';'NEW YORKER SE'), correct: ('ALBA';'NEW YORKER')

AKTIENGESELLSCHAFT:
'WIELAND-WERKE AKTIENGESELLSCHAFT' --> currently: ('WIELAND-WERKE AKTIENGESELLSCHAFT'), correct: ('WIELAND-WERKE AKTIENGESELLSCHAFT')

EG:
'REWE DORTMUND GROSSHANDEL EG'

-AG
'DEUTSCHE VERSICHERUNGS-AG' --> currently: ('DEUTSCHE VERSICHERUNGS-AG'), correct: ('DEUTSCHE VERSICHERUNGS-AG')

Handle prefixed (and in-middle), possibly multiple terms

In Finland, you sometimes see the format "Oy Corporation Ab" where "Oy" refers to limited liability (in Finnish) and "Ab" the same (in Swedish, the other official language of Finland).

In other words, the abbreviations can also appear in front of the company name - or both before and after.
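Handling terms on both sides can be sketched by working on whitespace-separated parts and trimming from each end. `strip_both_ends` is a hypothetical helper and the term list is a made-up sample. In-middle terms are not covered here.

```python
def strip_both_ends(name, terms):
    """Drop legal terms found at the start and/or end of a name."""
    known = {t.lower() for t in terms}
    parts = name.split()
    # Keep at least one part so a name is never reduced to nothing.
    while len(parts) > 1 and parts[-1].lower() in known:
        parts.pop()
    while len(parts) > 1 and parts[0].lower() in known:
        parts.pop(0)
    return " ".join(parts)


print(strip_both_ends("Oy Corporation Ab", ["oy", "ab"]))  # Corporation
```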

add translated legal entity names

The legal terms are to some extent translatable across jurisdictions. It would be useful if user could ask for the business types in their own native language.

For example, as a Finnish person, limited liability company is known as "osakeyhtiö" ("oy") to me, whilst a public(ly traded) limited liability company would be called "julkinen osakeyhtiö" ("oyj") in Finland.

Polish legal endings

Hi guys,
many of the Polish companies I received have full legal endings such as:
spółka z ograniczoną odpowiedzialnością
spółka Jawna
spółka komandytowa
spółka akcyjna
spółka cywilna
spółka komandytowa
spółka z ograniczoną odpowiedzialnością

Would it be possible to add them on the list?

Inconsistent parsing

The following two equivalent names parse differently when presumably they should parse the same:

cleanco('Hello World Company Limited').clean_name()

cleanco('Hello World Company Ltd').clean_name()

Package for PyPI

Seems like a good next step. If my tests with this software prove that it is a good fit for my project, I will gladly put in the work to push it up to PyPI.

Problems parsing company names with punctuations

Hello,

Very nice module, but it doesn't always handle well some of the real, human-entered company names we deal with a lot. Below are some obvious examples where the name is not parsed:

LIBGAS,LTD -> LIBGAS,LTD
AIRDAS USA,LLC -> AIRDAS USA,LLC
GF LOGISTICS.INC -> GF LOGISTICS.INC
HAKUTATZ.TECH.CO.,LTD. -> HAKUTATZ.TECH.CO.,LTD

Thanks

still alive?

Hello there!

Thanks for this incredible package! I am just wondering: is this package still alive? I haven't seen any update for about a year.

Thanks!

stripping of suffix fails if it ends in full stop.

Same happens if the ending character is a comma. Example:

>>> from cleanco import cleanco
>>> y='posnegansett properties, llc.'
>>> ya=cleanco(y)
>>> print ya.type()
['Limited Liability Company']
>>> print ya.clean_name()
posnegansett properties, llc

The result should not have the 'llc' suffix.
