disambiguator's People

Contributors

doolin, edwardsun

disambiguator's Issues

Blocking on non-name attributes

The default implementation appears to have disabled blocking on set objects like coauthor or class (in attribute.h, about line 932 or so):

/**
 * Attribute_Set_Intermediary:
 * Third layer of the attribute hierarchy. Specifically
 * designed to handle the data storage issue.
 */
template < typename AttribType >
class Attribute_Set_Intermediary : public Attribute_Intermediary<AttribType> {

private:

   /**
    * Private:
    * static vector < const string * > temporary_storage:
    * static member temporarily used for data loading.
    */
    static vector < const string * > temporary_storage;

protected:

   /**
    * vector < const string * > & get_data_modifiable():
    * still used for data loading only.
    */
    vector < const string * > & get_data_modifiable() {
      return temporary_storage;
    }

public:

   /**
    * const vector < const string * > & get_data() const:
    * override the base function and throw an error,
    * indicating that this function should be forbidden
    * for this class and its child classes.
    */
    const vector < const string * > & get_data() const {
      throw cException_Invalid_Function("Function Disabled");
    }
};

This is problematic for blocking on anything other than a pure string value (e.g., blocking on patents that share at least 2 IPC codes, 2 coauthors, etc.).

Is re-implementing this feature possible/wise?
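To make the request concrete, here is a minimal sketch of what blocking on a set-valued attribute could look like. It is not taken from attribute.h: the helper names `count_shared` and `same_block` are hypothetical, and the sketch assumes the set values (coauthor names, IPC codes) are already available as plain strings.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical helper, not part of attribute.h: counts the distinct
// values that two set-valued attributes have in common.
std::size_t count_shared(std::vector<std::string> a,
                         std::vector<std::string> b) {
    // Sort and de-duplicate so each distinct value is counted once.
    auto normalize = [](std::vector<std::string>& v) {
        std::sort(v.begin(), v.end());
        v.erase(std::unique(v.begin(), v.end()), v.end());
    };
    normalize(a);
    normalize(b);

    std::vector<std::string> common;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::back_inserter(common));
    return common.size();
}

// Hypothetical blocking predicate: two records are candidates for the
// same block when they share at least min_shared values.
bool same_block(const std::vector<std::string>& a,
                const std::vector<std::string>& b,
                std::size_t min_shared = 2) {
    return count_shared(a, b) >= min_shared;
}
```

With this predicate, two patents listing the classes {166, 299, 405} and {166, 405} would land in the same block, while {166, 299} and {166, 405} would not. A pairwise predicate like this does not by itself give a blocking key, so a real implementation would still need some way to enumerate candidate pairs efficiently.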

Incorrect assignment of inventor IDs

I've been working with the Harvard Patent Dataverse 2010 datasets for quite a while now and have come across an issue with unique inventor identification for records whose assignee numbers start with A or H (e.g., H000000000158 for Shell Oil Company instead of the regular 10266734). The algorithm seems to assign different inventor IDs to records with such assignee numbers, even though the other characteristics of the record are very similar or identical to those of records listing the 'regular' assignee number.

Here's an example for one of Shell's key inventors:

HAROLD J VINEGAR BELLAIRE US 7631690 SHELL OIL COMPANY 10266734 166 04359687-1 2009
HAROLD J VINEGAR BELLAIRE US 7635023 SHELL OIL COMPANY 10266734 166 04359687-1 2009
HAROLD J VINEGAR BELLAIRE US 7635025 SHELL OIL COMPANY 10266734 166 04359687-1 2009

As you can see, these are OK. The inventor is correctly assigned with Invnum 04359687-1. However, the following records receive a different Invnum, while the inventor is of course the same based on the characteristics of the other data fields:

HAROLD J VINEGAR BELLAIRE US 7640980 SHELL OIL COMPANY H000000000158 166-268/166-302/166-369/405-52 07640980-0 2010
HAROLD J VINEGAR BELLAIRE US 7735935 SHELL OIL COMPANY H000000000158 299-5/166-2721/166-302/299-4 07735935-0 2010
HAROLD J VINEGAR BELLAIRE US 7681647 SHELL OIL COMPANY H000000000158 166-302/166-369 07681647-2 2010

For larger selections of data, this leads to many missing connections and to networks that appear less connected and less dense than they actually are. So far, I've manually corrected the Invnums for these records, but of course this is not the way to go for selections containing thousands of records ;-)
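As a rough way to find affected records in bulk, one could group records on the fields that agree and flag groups that carry more than one Invnum. The sketch below is only an illustration under stated assumptions: the `Record` layout, its field names, and the choice of grouping key are hypothetical and not part of the dataset's schema or the disambiguation code.

```cpp
#include <map>
#include <set>
#include <string>
#include <tuple>
#include <vector>

// Hypothetical record layout; field names are illustrative only.
struct Record {
    std::string first, last, city, country;
    std::string assignee_name, assignee_num, invnum;
};

// True when the assignee number starts with the letter A or H.
bool has_letter_prefix(const std::string& assignee_num) {
    return !assignee_num.empty() &&
           (assignee_num[0] == 'A' || assignee_num[0] == 'H');
}

using Key = std::tuple<std::string, std::string, std::string,
                       std::string, std::string>;

// Groups records by (first, last, city, country, assignee name) and
// returns the groups that hold more than one Invnum and include at
// least one A/H-prefixed assignee number.
std::map<Key, std::set<std::string>>
find_split_inventors(const std::vector<Record>& records) {
    std::map<Key, std::set<std::string>> invnums;
    std::set<Key> has_prefixed;
    for (const Record& r : records) {
        Key k = std::make_tuple(r.first, r.last, r.city, r.country,
                                r.assignee_name);
        invnums[k].insert(r.invnum);
        if (has_letter_prefix(r.assignee_num)) has_prefixed.insert(k);
    }

    std::map<Key, std::set<std::string>> result;
    for (const auto& entry : invnums) {
        if (entry.second.size() > 1 && has_prefixed.count(entry.first)) {
            result.insert(entry);
        }
    }
    return result;
}
```

Run on the six records above, this would report one group (HAROLD J VINEGAR, BELLAIRE, US, SHELL OIL COMPANY) with four distinct Invnums. The grouping key is deliberately strict, so it only surfaces candidates for manual review and would miss records whose name or city is spelled differently.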

Would it be possible to address this issue in the next release of the datasets? Please let me know if there's any other info I can provide to further clarify this issue.

Thanks,

André

Silverbrook disambiguation

From Monte Shaffer:

Based on some very basic "location" analysis,
I would report the following as being the same inventor...
We record the number of patents that have this name-location match:

0003474 Kia Silverbrook|Balmain,AU
0000152 Kia Silverbrook|Sydney,AU
0000038 Kia Silverbrook|Leichhardt,AU
0000017 Kia Silverbrook|Woollahra,AU
0000011 Kia Silverbrook|Balmain,NSW,AU
0000008 Silverbrook Kia|Balmain,AU
0000007 Kia Silverbrook|Wollahra,AU
0000005 Kia Silverbrook|Balmain NSW,AU
0000003 Kia Silverbrook|Blamain,AU
0000003 Kia Silverbrook|Balmain,NSW 2041,AU
0000003 Kia Silverbrook|Balmain,AT
0000002 Kia Siverbrook|Woollahra,AU
0000002 Kia Silverbrook|Leichhardt NSW,AU
0000002 Kia Silverbrook|Balmain,New South Wales,AU
0000001 Kia Sliverbrook|Balmain,AU
0000001 Kia Silverbrook|Woollhara,AU
0000001 Kia Silverbrook|New South Wales,AU
0000001 Kia Silverbrook|New South Wales,AT
0000001 Kia Silverbrook|Liechhardt,AU
0000001 Kia Silverbrook|Leichhardt NSW 2040,AU
0000001 Kia Silverbrook|Leichardt,AU
0000001 Kia Silverbrook|Balmain,NSW,2041,AU
0000001 Kia Silverbrook|Balmain NSW 2041,AU
0000001 Kia Silverbrook|Balmain 2041,AU
0000001 Kia Silverbook|Balmain,AU 
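
Many of the variants in this list differ from the dominant spelling by one or two characters. As a minimal sketch, assuming a plain Levenshtein edit distance is an acceptable similarity measure here (the source does not say which measure, if any, was used), such near-duplicates could be detected as follows. The function names and the threshold of 2 are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Levenshtein distance: the minimum number of single-character
// insertions, deletions and substitutions turning a into b.
std::size_t edit_distance(const std::string& a, const std::string& b) {
    std::vector<std::size_t> prev(b.size() + 1), cur(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = j;

    for (std::size_t i = 1; i <= a.size(); ++i) {
        cur[0] = i;
        for (std::size_t j = 1; j <= b.size(); ++j) {
            std::size_t cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
            cur[j] = std::min({prev[j] + 1,          // deletion
                               cur[j - 1] + 1,       // insertion
                               prev[j - 1] + cost}); // substitution
        }
        std::swap(prev, cur);
    }
    return prev[b.size()];
}

// Hypothetical matching rule: treat two strings as the same name or
// place when they are within max_edits edits of each other.
bool likely_same(const std::string& a, const std::string& b,
                 std::size_t max_edits = 2) {
    return edit_distance(a, b) <= max_edits;
}
```

Under this rule, "Siverbrook" and "Silverbook" are one edit away from "Silverbrook", and "Sliverbrook" and "Blamain" are two edits away from "Silverbrook" and "Balmain" respectively. It would not catch the reordered "Silverbrook Kia", nor location strings that add a state or postcode, so those would need separate normalization.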

Missing geographical data for records with grant year 2011-2013

For all records with grant year 2011 or 2012, the columns Street, City, State, Country, Zipcode, Longitude and Latitude contain no data at all. This holds for both the Full 2012 and the January 2013 disambiguations. Interestingly, about half of the records with grant year 2013 do contain this data. I checked whether the missing data affected the resulting disambiguations for the respective inventors, but it doesn't seem to: regardless of the missing data, the inventors are still properly disambiguated.

First picture attached shows a simple pivot table showing that about half of the 2013 records miss geographical data (country taken as an example); the second picture shows some examples of missing 2011 data.

[Image: country]
[Image: missing_data]

Over-consolidation of Inventor ID

I performed a simple check on inventor names, assigning each inventor ID one of three codes (nstr):
0 = I detect a middle name conflict
1 = a name is matched against a name without a middle name
2 = the names contain a middle name and the middle initial matches

nstr  invs       patents    avgpats (= patents/invs)
0     8,644      176,307    20.39
1     1,611,032  6,279,199  3.89
2     1,480,296  3,955,163  2.67

For example:
The nstr=0 includes the following (first 10 entries):

(inv_id, #patents, unique names clumped together)
03858572-2|31|JOHN F DYE,JOHN DYE,JOHN D DYE
03858760-1|45|ANTONIN GONCALVES,ANTONIN L GONCALVES,ANTONIN C GONCALVES
03858787-3|19|ROGER M FLOYD,ROGER N FLOYD
03859063-2|8|STEVEN I TAUB,STEVEN L TAUB
03859092-1|42|HENRY J GYSLING,HENRY L GYSLING,HENRY JAMES GYSLING
03859097-1|4|FREDRICK L HAMB,FREDERICK L HAMB,FREDERICK T HAMB,FREDERICK D HAMB
03859113-2|18|WILLIAM C STUMPHAUZER,WILLIAM S STUMPHAUZER
03859119-1|316|JAMES C FLETCHER,JAMES ADMINISTR FLETCHER,JAMES CORVIN FLETCHER,J CLINT FLETCHER,J CLINTON FLETCHER
03859298-1|72|JOHN H SELLSTEDT,JOHN H SELLSTED,JOHN M SELLSTEDT
03859356-1|109|WILLIAM J HOULIHAN,WILLIAM H HOULIHAN

As you can see here:

  • In the first record, John F Dye gets clumped with John D Dye, which is clearly incorrect.
  • Same idea for the remainder. The James Fletcher one is particularly concerning (316 patents) and looks to be at least 3 individuals mashed together (#03859119-1).
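
The middle-name check described above could be sketched roughly as follows. This is a reconstruction under assumptions, not the code actually used: it assumes names are whitespace-separated as FIRST [MIDDLE...] LAST, compares only the first letter of the first middle token, and assigns one code per inventor ID. The names `middle_initial` and `classify_nstr` are hypothetical.

```cpp
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Returns the middle initial of "FIRST [MIDDLE...] LAST", or '\0'
// when the name has fewer than three tokens (no middle name).
char middle_initial(const std::string& full_name) {
    std::istringstream in(full_name);
    std::vector<std::string> tokens;
    std::string tok;
    while (in >> tok) tokens.push_back(tok);
    if (tokens.size() < 3) return '\0';
    return tokens[1][0];
}

// Classifies the name variants grouped under one inventor ID:
//   0 = at least two different middle initials (conflict)
//   1 = no conflict, but some variant has no middle name
//   2 = every variant has a middle name and all initials agree
int classify_nstr(const std::vector<std::string>& names) {
    std::set<char> initials;
    bool missing_middle = false;
    for (const std::string& name : names) {
        char c = middle_initial(name);
        if (c == '\0') {
            missing_middle = true;
        } else {
            initials.insert(c);
        }
    }
    if (initials.size() > 1) return 0;
    if (missing_middle) return 1;
    return 2;
}
```

Applied to the variants listed for 03858572-2 (JOHN F DYE, JOHN DYE, JOHN D DYE), this yields 0 because F and D conflict. Comparing initials only is a simplification: two different middle names that share a first letter would not be flagged.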

While this is a relatively small % of all inventors identified, the avgpats for these individuals is extremely high compared to the others. I've run into these individuals when creating networks and they create some strange networks! That said, visually observing the data also suggests some interesting blocking mechanisms for further disambiguation, which I would love to share. I think the more we show these results visually via APIs, the more obvious some data issues may become.
