
Comments (6)

johnisanerd commented on May 28, 2024

I am pretty new to pandas, for what it's worth!

from patent_client.

johnisanerd commented on May 28, 2024

Also, just to give you a better idea of what's going on:

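import pandas as pd
from patent_client import USApplication

company_name = '3M Company'  # defined but not used below; the filter is hard-coded to 'Microsoft'

pd.DataFrame.from_records(
    (USApplication.objects
        .filter(first_named_applicant='Microsoft')
        .values('appl_id', 'patent_number', 'patent_title')[0:10]
    )
)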

Works great!

and this:

company_name = '3M Company'

pd.DataFrame.from_records(
    (USApplication.objects
        .filter(first_named_applicant='Microsoft')
        .values('appl_id', 'patent_number', 'patent_title')[0:1000]
    )
)

Fails.


parkerhancock commented on May 28, 2024

Thanks for reporting the issue! I'll take a look at it. I suspect it's an issue with how I implemented slicing in the manager.

In the meantime, pd.DataFrame.from_records can accept a generator as input, not just a list, so the first example should work fine without specifying a slice. That is:

company_name = '3M Company'

pd.DataFrame.from_records(
    (USApplication.objects
        .filter(first_named_applicant=company_name)
        .values('app_filing_date', 'patent_number', 'patent_title')
    )
)
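
To illustrate the point about generators, here is a minimal sketch that does not touch the USPTO API at all. The `fake_records` generator is a hypothetical stand-in for the lazily evaluated query, and the row contents are made up. It also shows `itertools.islice` as one possible way to cap the number of rows without relying on the manager's own slicing.

```python
import itertools

import pandas as pd


def fake_records(n):
    """Hypothetical stand-in for a lazy query: yields one dict per row."""
    for i in range(n):
        yield {"appl_id": str(1000 + i), "patent_title": f"Title {i}"}


# from_records consumes the generator directly; no intermediate list is needed.
df_all = pd.DataFrame.from_records(fake_records(5))

# islice stops pulling from the generator after the first 3 rows.
df_head = pd.DataFrame.from_records(itertools.islice(fake_records(5), 3))

print(len(df_all), len(df_head))
```

Whether `islice` avoids the failure reported above depends on how the library fetches results internally, so treat it as a workaround to try rather than a guaranteed fix.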


parkerhancock commented on May 28, 2024

Ah! One other thing. I know why your example that asks for the first 10 records works, but the first 1000 does not.

There's an issue with the USPTO's Patent Examination Data System API (which supports USApplication). The ordinary JSON API only returns 20 results, and although it has a pagination system, it's broken (it returns sets of 20, but paginates in sets of 25).
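
One way to read "returns sets of 20, but paginates in sets of 25" is that each page starts 25 records after the previous one but only carries 20 of them. Under that assumption (it is an interpretation, not something confirmed in this thread), the arithmetic below shows which record positions would never be returned. The constant and function names are illustrative only.

```python
RETURNED_PER_PAGE = 20  # records actually delivered per page (from the comment above)
PAGE_STRIDE = 25        # assumed offset between the starts of consecutive pages


def reachable_indices(total, pages):
    """Collect the record positions that would be seen across the given pages."""
    seen = []
    for page in range(pages):
        start = page * PAGE_STRIDE
        seen.extend(range(start, min(start + RETURNED_PER_PAGE, total)))
    return seen


seen = reachable_indices(total=60, pages=3)
missing = sorted(set(range(60)) - set(seen))
print(missing)  # [20, 21, 22, 23, 24, 45, 46, 47, 48, 49]
```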

If your query returns fewer than 20 results, it just parses the JSON and returns the result. Easy peasy. If your query returns more than 20 results, it has to request a bulk file in XML, download that file, and then parse the data out of the XML, which is why USApplication can be slow for large queries. (It does cache the bulk file, so subsequent identical queries execute quickly.)
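
The decision described above can be sketched roughly as follows. This is not patent_client's actual code: `fetch_records`, the two fetcher callbacks, and the dict-based cache are all hypothetical names for illustration, and treating exactly 20 results as the JSON case is an assumption, since the comment only covers "fewer than" and "more than" 20.

```python
JSON_RESULT_LIMIT = 20  # the JSON endpoint returns at most 20 results


def fetch_records(query, num_found, fetch_json, fetch_bulk_xml, cache):
    """Use the JSON path for small result sets, cached bulk XML otherwise."""
    if num_found <= JSON_RESULT_LIMIT:  # assumption: exactly 20 still fits in JSON
        return fetch_json(query)
    if query not in cache:
        # Slow path: request, download, and parse the bulk file once.
        cache[query] = fetch_bulk_xml(query)
    return cache[query]


# Fake fetchers that only count how often they are called.
calls = {"json": 0, "bulk": 0}


def fake_json(query):
    calls["json"] += 1
    return ["json-row"] * 10


def fake_bulk(query):
    calls["bulk"] += 1
    return ["xml-row"] * 1000


cache = {}
small = fetch_records("q-small", 10, fake_json, fake_bulk, cache)
large_first = fetch_records("q-large", 1000, fake_json, fake_bulk, cache)
large_again = fetch_records("q-large", 1000, fake_json, fake_bulk, cache)
print(calls)  # {'json': 1, 'bulk': 1}
```

The second large query is served from the cache, which matches the note that subsequent identical queries execute quickly.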

The issue is something with how the XML is parsed. When you make a big request (e.g. the first 1000 records), something in the XML parser is failing. This is something I do need to fix.


parkerhancock commented on May 28, 2024

Version 0.4.2 should fix the issue. I tried it with your examples above, and it worked great! Turns out it was a busted XML parser, not the slicing.

I hate that XML parser. If the USPTO ever fixes the pagination issue, I'm switching to that immediately and dropping it altogether. Too many moving parts to go wrong. Especially when the JSON is just so easy to deal with.

Let me know if you still have problems, and I'll take another look. Travis CI is testing the new code now, and I'll deploy to PyPI as soon as it comes back green.

Thanks for reporting the issue!


johnisanerd commented on May 28, 2024

Thanks so much for checking into this @parkerhancock !

I upgraded and reran the above code, and it now raises this error:


---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-3-d89bd37079ad> in <module>()
----> 1 from patent_client import USApplication, Assignment
      2 import pandas as pd
      3 
      4 pd.DataFrame.from_records(
      5     (USApplication.objects

~/anaconda3/lib/python3.6/site-packages/patent_client/__init__.py in <module>()
     28 SETTINGS = json.load(open(SETTINGS_FILE))
     29 
---> 30 from patent_client.epo_ops.models import Inpadoc, Epo  # isort:skip
     31 from patent_client.uspto_assignments import Assignment  # isort:skip
     32 from patent_client.uspto_exam_data.main import USApplication  # isort:skip

~/anaconda3/lib/python3.6/site-packages/patent_client/epo_ops/__init__.py in <module>()
      4 CACHE_DIR.mkdir(exist_ok=True)
      5 TEST_DIR = TEST_BASE / "epo"
----> 6 TEST_DIR.mkdir(exist_ok=True)

~/anaconda3/lib/python3.6/pathlib.py in mkdir(self, mode, parents, exist_ok)
   1244             self._raise_closed()
   1245         try:
-> 1246             self._accessor.mkdir(self, mode)
   1247         except FileNotFoundError:
   1248             if not parents or self.parent == self:

~/anaconda3/lib/python3.6/pathlib.py in wrapped(pathobj, *args)
    385         @functools.wraps(strfunc)
    386         def wrapped(pathobj, *args):
--> 387             return strfunc(str(pathobj), *args)
    388         return staticmethod(wrapped)
    389 

FileNotFoundError: [Errno 2] No such file or directory: '/Users/johncole/anaconda3/lib/python3.6/tests/fixtures/epo'

Curious. Since it's looking under "tests", maybe something was left out of the build? I completely uninstalled the library with pip and then reinstalled it, and that's when it started to throw this error.
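
For what it's worth, the traceback is consistent with how `pathlib.Path.mkdir` behaves when a parent directory is missing: `exist_ok=True` only tolerates the target already existing, while `parents=True` is what creates missing parents. The sketch below reproduces that in a temporary directory. The path components mirror the error message, but this is a guess at the cause, not a confirmed diagnosis of the package.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # "tests" and "fixtures" do not exist yet, as in the error message above.
    target = Path(tmp) / "tests" / "fixtures" / "epo"

    try:
        target.mkdir(exist_ok=True)  # same call shape as in the traceback
        raised = False
    except FileNotFoundError:
        raised = True

    # parents=True creates the missing intermediate directories as well.
    target.mkdir(parents=True, exist_ok=True)
    created = target.is_dir()

print(raised, created)  # True True
```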

