pypatent

pypatent is a tiny Python package to easily search for and scrape US Patent and Trademark Office Patent Data.

PyPI page

New in version 1.2

This version adds Selenium support for scraping. Previous versions used the requests library for all HTTP requests, but requests has recently had problems with the USPTO site: some users can use it without issue, while others get 4xx errors.

Version 1.2 adds a WebConnection object that lets you use a Selenium WebDriver in place of the requests library. It is optional; if used, pass it as an argument when initializing Search or Patent objects.

Use it in the following cases:

  • When you want to use Selenium instead of requests
  • When you want to use requests but with a custom user-agent or headers

See bottom of README for examples.

Requirements

Python 3, BeautifulSoup, requests, pandas, selenium, and re (part of the Python standard library)

Installation

pip install pypatent

If using Selenium for scraping (introduced in version 1.2), be sure to install a Selenium WebDriver. For Chrome, use chromedriver. For Firefox, use geckodriver. See the Selenium download page for more details and options.

Searching for patents

The Search object works similarly to the Advanced Search at the USPTO, with additional options.

Specifying patent criteria for your search

There are two methods to specify your search criteria, and you can use one or both.

Search Method 1: Using a custom string

You may search for a certain string in all fields of the patent:

pypatent.Search('microsoft') # Will return results matching 'microsoft' in any field

You may also specify complex search criteria as demonstrated on the USPTO site:

pypatent.Search('TTL/(tennis AND (racquet OR racket))')

Search Method 2: Specify USPTO search fields (see Field Codes below)

Alternatively, you can specify one or more Field Code arguments to search within the specified fields. Multiple Field Code arguments will create a search with AND logic. OR logic can be used within a single argument. For more complex logic, use a custom string.

pypatent.Search(pn='adobe', ttl='software') # Equivalent to Search('PN/adobe AND TTL/software')
pypatent.Search(pn='adobe or macromedia', ttl='software') # Equivalent to Search('PN/(adobe or macromedia) AND TTL/software')

Combining search methods 1 and 2

String criteria can be used in conjunction with Field Code arguments:

pypatent.Search('acrobat', pn='adobe', ttl='software') # Equivalent to Search('acrobat AND PN/adobe AND TTL/software')
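To make the AND logic concrete, here is a minimal sketch of how a free-text string and Field Code arguments could be combined into a single query string. The `build_query` helper is hypothetical and is not part of pypatent; it only mirrors the equivalences given in the comments above.

```python
def build_query(string=None, **field_args):
    """Join a free-text string and field-code arguments with AND.

    Illustrative helper only; pypatent's internal logic may differ.
    """
    parts = [string] if string else []
    for name, value in field_args.items():
        code = name.upper()  # e.g. ttl -> TTL
        # Parenthesize multi-word values so OR logic stays inside the field
        term = f'({value})' if ' ' in value else value
        parts.append(f'{code}/{term}')
    return ' AND '.join(parts)


print(build_query('acrobat', pn='adobe', ttl='software'))
# acrobat AND PN/adobe AND TTL/software
print(build_query(pn='adobe or macromedia', ttl='software'))
# PN/(adobe or macromedia) AND TTL/software
```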

The Field Code arguments have the same meaning as on the USPTO site.

Additional search options

Limit the number of results

The results_limit argument lets you change how many patent results are retrieved. The default is 50, equivalent to one page of results.

pypatent.Search('microsoft', results_limit=10) # Fetch 10 results only
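Because results come back one page (50 results) at a time, the number of result pages that must be loaded grows with results_limit. The sketch below shows that arithmetic; `pages_needed` is a hypothetical helper, and whether pypatent computes it this way is an assumption.

```python
import math

RESULTS_PER_PAGE = 50  # one page of results, per the default described above


def pages_needed(results_limit, per_page=RESULTS_PER_PAGE):
    """Return how many result pages cover results_limit results."""
    return math.ceil(results_limit / per_page)


for limit in (10, 50, 51, 120):
    print(limit, '->', pages_needed(limit))
```

Under this assumption, even a small limit such as 10 still requires loading one full results page.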

Specify whether to fetch details for each patent

By default, pypatent retrieves the details of every patent by visiting each patent's URL from the search results. This can take a long time since each page has to be scraped. If you just need the patent titles and URLs from the search results, set get_patent_details to False:

pypatent.Search('microsoft', get_patent_details=False) # Fetch patent numbers and titles only

Formatting your search results

pypatent has convenience methods to format the Search object into either a Pandas DataFrame or list of dicts.

Format as Pandas DataFrame:

pypatent.Search('microsoft').as_dataframe()

Format as list of dicts:

pypatent.Search('microsoft', get_patent_details=False).as_list()

Sample result (without patent details):

[{
     'title': 'Electronic device',
     'url': 'http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&u=%2Fnetahtml%2FPTO%2Fsearch-adv.htm&r=1&p=1&f=G&l=50&d=PTXT&S1=microsoft&OS=microsoft&RS=microsoft'
 },
 {'title': 'Portable electric device', ... }]
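If you want the results in a file rather than on screen, a list of dicts in this shape can be written out with the standard library alone. This is a sketch that uses sample data in place of a live search; the `save_results` helper, the file names, and the example.com URLs are placeholders.

```python
import csv
import json

# Sample rows in the shape returned by as_list() without patent details
results = [
    {'title': 'Electronic device', 'url': 'http://example.com/patent/1'},
    {'title': 'Portable electric device', 'url': 'http://example.com/patent/2'},
]


def save_results(rows, json_path, csv_path):
    """Write a list of dicts to a JSON file and a CSV file."""
    with open(json_path, 'w', encoding='utf-8') as f:
        json.dump(rows, f, indent=2)
    with open(csv_path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)


save_results(results, 'patents.json', 'patents.csv')
```

Writing to a file also avoids the shortened display that an interactive console may apply to long strings.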

The Patent class

The Search class uses the Patent class to retrieve and store patent details for a given patent URL. You can use it directly if you already know the patent URL (e.g. you ran a Search with get_patent_details=False).

# Create a Patent object
this_patent = pypatent.Patent(title='Base station device, first location management device, terminal device, communication control method, and communication system',
                              url='http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&u=%2Fnetahtml%2FPTO%2Fsearch-adv.htm&r=4&p=1&f=G&l=50&d=PTXT&S1=aaa&OS=aaa&RS=aaa')

# Fetch the details
this_patent.fetch_details()
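One way to combine the two classes is to run a fast search without details, pick the interesting hits by title, and fetch details only for those. The sketch below assumes the list-of-dicts shape shown earlier ('title' and 'url' keys); `select_hits` and `fetch_selected` are hypothetical helper names, not part of pypatent.

```python
def select_hits(hits, keyword):
    """Keep only the search hits whose title contains keyword (case-insensitive)."""
    return [h for h in hits if keyword.lower() in h['title'].lower()]


def fetch_selected(query, keyword):
    """Search without details, then fetch details for the matching titles only."""
    import pypatent  # imported here so the helpers above work without the package

    hits = pypatent.Search(query, get_patent_details=False).as_list()
    patents = []
    for hit in select_hits(hits, keyword):
        patent = pypatent.Patent(title=hit['title'], url=hit['url'])
        patent.fetch_details()  # one page scrape per selected patent
        patents.append(patent)
    return patents
```

Calling `fetch_selected('microsoft', 'device')` would then scrape only the patents whose titles mention "device", rather than every search result.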

Patent Attributes Retrieved:

Note that not all fields from the patent page are scraped. I hope to add more, and pull requests are appreciated :)

  • patent_num: Patent Number
  • patent_date: Issue Date
  • abstract: Abstract
  • inventors: List of Names of Inventors and Their Locations
  • applicant_name: Applicant Name
  • applicant_city: Applicant City
  • applicant_state: Applicant State
  • applicant_country: Applicant Country
  • assignee_name: Assignee Name
  • assignee_loc: Assignee Location
  • family_id: Family ID
  • applicant_num: Applicant Number
  • file_date: Filing date
  • claims: Claims Description (as a list)
  • description: Patent Description (as a list)

Field Code Arguments for Search Function

  • PN: Patent Number
  • ISD: Issue Date
  • TTL: Title
  • ABST: Abstract
  • ACLM: Claim(s)
  • SPEC: Description/Specification
  • CCL: Current US Classification
  • CPC: Current CPC Classification
  • CPCL: Current CPC Classification Class
  • ICL: International Classification
  • APN: Application Serial Number
  • APD: Application Date
  • APT: Application Type
  • GOVT: Government Interest
  • FMID: Patent Family ID
  • PARN: Parent Case Information
  • RLAP: Related US App. Data
  • RLFD: Related Application Filing Date
  • PRIR: Foreign Priority
  • PRAD: Priority Filing Date
  • PCT: PCT Information
  • PTAD: PCT Filing Date
  • PT3D: PCT 371 Date
  • PPPD: Prior Published Document Date
  • REIS: Reissue Data
  • RPAF: Reissued Patent Application Filing Date
  • AFFF: 130(b) Affirmation Flag
  • AFFT: 130(b) Affirmation Statement
  • IN: Inventor Name
  • IC: Inventor City
  • IS: Inventor State
  • ICN: Inventor Country
  • AANM: Applicant Name
  • AACI: Applicant City
  • AAST: Applicant State
  • AACO: Applicant Country
  • AAAT: Applicant Type
  • LREP: Attorney or agent
  • AN: Assignee Name
  • AC: Assignee City
  • AS: Assignee State
  • ACN: Assignee Country
  • EXP: Primary Examiner
  • EXA: Assistant Examiner
  • REF: Referenced By
  • FREF: Foreign References
  • OREF: Other References
  • COFC: Certificate of Correction
  • REEX: Re-Examination Certificate
  • PTAB: PTAB Trial Certificate
  • SEC: Supplemental Exam Certificate
  • ILRN: International Registration Number
  • ILRD: International Registration Date
  • ILPD: International Registration Publication Date
  • ILFD: Hague International Filing Date
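The field codes IN, IS, and AS collide with Python reserved words, so they cannot be written directly as keyword arguments. The Search signature visible in the traceback quoted in the issues further down lists them as in_, is_, and as_. The sketch below shows that naming rule; `argument_name` is a hypothetical helper used only for illustration.

```python
import keyword


def argument_name(field_code):
    """Map a USPTO field code to a valid Python keyword-argument name."""
    name = field_code.lower()
    # Reserved words such as 'in' get a trailing underscore
    return name + '_' if keyword.iskeyword(name) else name


for code in ('TTL', 'IN', 'IS', 'AS'):
    print(code, '->', argument_name(code))
# TTL -> ttl
# IN -> in_
# IS -> is_
# AS -> as_
```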

Changelog

New in version 1.2

This version adds Selenium support for scraping. Previous versions used the requests library for all HTTP requests, but requests has recently had problems with the USPTO site: some users can use it without issue, while others get 4xx errors.

Version 1.2 adds a WebConnection object that lets you use a Selenium WebDriver in place of the requests library. It is optional; if used, pass it as an argument when initializing Search or Patent objects. Use it in the following cases:

  • When you want to use Selenium instead of requests
  • When you want to use requests but with a custom user-agent or headers

An example using the Firefox WebDriver:

import pypatent
from selenium import webdriver

driver = webdriver.Firefox()  # Requires geckodriver in your PATH

conn = pypatent.WebConnection(use_selenium=True, selenium_driver=driver)

res = pypatent.Search('microsoft', get_patent_details=True, web_connection=conn)

print(res)

An example using the requests library with a custom user agent:

import pypatent

conn = pypatent.WebConnection(use_selenium=False, user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36')

res = pypatent.Search('microsoft', get_patent_details=True, web_connection=conn)

print(res)

An example using the requests library with the default user agent (WebConnection is not necessary here because the defaults are used):

import pypatent

res = pypatent.Search('microsoft', get_patent_details=True)

print(res)

New in version 1.1

This version makes searching and storing patent data easier:

  • Simplified to 2 objects: Search and Patent
  • A Search object searches the USPTO site and can output the results as a DataFrame or list. It can scrape the details of each patent, or just get the patent title and URL. Most users will only need to use this object.
  • A Patent object fetches and holds a single patent's info. Fetching the patent's details is now optional. This object should only be used when you already have the patent URL and aren't conducting a search.

pypatent's People

Contributors

daneads, danhydro


pypatent's Issues

Output recovery

I am unable to recover the output of the code:

pypatent.Search('rocket', results_limit=2).as_dataframe()

The function returns the information - but I want to collect it in a separate file. How can I gather the information?

Similarly, I am unable to recover the output of the code:

pypatent.Search('rocket', results_limit=2).as_list()

The function returns a 'Squeeze Text' box, which I can open, but doing so stalls IDLE, and I am unable to right-click the box to copy it. Ultimately I am trying to get the returned search results into a file. Any advice would be appreciated.

Future Development? Patent Client

Hey! This is the only way I can see to contact you, so here I go!

I'm the author and maintainer of patent_client, a library with a similar scope and feature set as your own. patent_client is under active development, and growing, so if you'd like, I'd love to have you contribute, or add a note on your readme pointing to it!

PyPI | GitHub | Docs

Thanks!

Parker

Description gets truncated

Some patents have very long descriptions and they are being truncated with the typical python ... to indicate that the string has been truncated. I set pandas max_colwidth to None and it had no effect on this behavior. This is causing me to lose a portion of the description. Any thoughts on how to correct this?

Too slow to get data

I ran the search with the code below. I want only 5 items, but the library still seems to load the whole first page (50 items), so getting the data is too slow.

import pypatent
from selenium import webdriver

with webdriver.Firefox() as driver:
    conn = pypatent.WebConnection(use_selenium=True, selenium_driver=driver)
    res = pypatent.Search('microsoft', results_limit=5,
                          web_connection=conn)
    print(res)

Syntax error - prevents search

I set up a virtual env and ran:

pip install pypatent

But, when I ran:

pypatent.Search('microsoft')

I received the following error:

-bash: syntax error near unexpected token `'microsoft''

I am not sure why the search won't work because I don't know what 'unexpected token' means. Indeed, I am not sure how to proceed and would appreciate any advice or suggestions.

Installing - re error

When installing using "pip install pypatent" the following error occurs:

"Could not find a version that satisfies the requirement re (from pypatent) (from version: )
No matching distribution found for re (from pypatent)"

However, re already comes with python (at least in the Anaconda distribution)

AttributeError: 'NoneType' object has no attribute 'find_next'

Error when running your example:

pypatent.Search('TTL/(tennis AND (racquet OR racket))')

AttributeError                            Traceback (most recent call last)
<ipython-input-2-a7c0dc5b3207> in <module>
----> 1 pypatent.Search('TTL/(tennis AND (racquet OR racket))')

/usr/local/lib/python3.7/site-packages/pypatent/__init__.py in __init__(self, string, results_limit, get_patent_details, pn, isd, ttl, abst, aclm, spec, ccl, cpc, cpcl, icl, apn, apd, apt, govt, fmid, parn, rlap, rlfd, prir, prad, pct, ptad, pt3d, pppd, reis, rpaf, afff, afft, in_, ic, is_, icn, aanm, aaci, aast, aaco, aaat, lrep, an, ac, as_, acn, exp, exa, ref, fref, oref, cofc, reex, ptab, sec, ilrn, ilrd, ilpd, ilfd)
    245         r = requests.get(url, headers=Constants.request_header).text
    246         s = BeautifulSoup(r, 'html.parser')
--> 247         total_results = int(s.find(string=re.compile('out of')).find_next().text.strip())
    248 
    249         patents = self.get_patents_from_results_url(url, limit=results_limit)

AttributeError: 'NoneType' object has no attribute 'find_next'

Patent limit not in line with USPTO Search Engine

Hi,

I've been testing pypatent lately and tried to retrieve USPTO patents containing the term "melting point". The USPTO Patent Full-Text and Image Database lists 270,000+ of them, but when running pypatent I could only get 2290. I ran this code: pypatent.Search('melting point', results_limit=300000).as_dataframe()

Is there a subtlety I did not understand?

Thank you in advance,
Best

ConnectionError

In July 2018, these two examples worked well:
pypatent.search("microsoft")
this_patent = patent('http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&u=%2Fnetahtml%2FPTO%2Fsearch-adv.htm&r=4&p=1&f=G&l=50&d=PTXT&S1=aaa&OS=aaa&RS=aaa')

As of November 2018, I am receiving the following error:
ConnectionError: ('Connection aborted.', TimeoutError(10060, 'A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond', None, 10060, None))

Dependent packages meet the requirements:
Python 3, BeautifulSoup, requests, re

Loved using this package over the summer :) I tried the script from the summer that worked well, but received the same error. Looking for help with connecting to the server. Thank you for your help in advance! Much appreciated! Any suggestions @daneads ?

ConnectionError: ('Connection aborted.', BadStatusLine('Error #2000\n',))

My script iterates through a list of patents I want to collect information on.
I initially received this error:
Exception is: ('Connection aborted.', error(10054, ''))
I introduced a time.sleep(2) between calls to the pypatent.Search function, which resolved this error.

In the 5th iteration of pypatent.Search(), I received this error:
ConnectionError: ('Connection aborted.', BadStatusLine('Error #2000\n',))

Any suggestions on remediating this error? Thank you for your help in advance!
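One possible approach, building on the time.sleep(2) throttling described above, is to retry a failed call after a growing pause. This is a generic sketch and not a pypatent feature; `call_with_retries` and its parameters are hypothetical, and there is no guarantee it resolves the 'Error #2000' response.

```python
import time


def call_with_retries(func, *args, retries=3, delay=2.0,
                      exceptions=(Exception,), **kwargs):
    """Call func, retrying on the given exceptions with a growing pause."""
    for attempt in range(1, retries + 1):
        try:
            return func(*args, **kwargs)
        except exceptions:
            if attempt == retries:
                raise  # give up after the last attempt
            time.sleep(delay * attempt)  # wait longer after each failure
```

For example, `call_with_retries(pypatent.Search, 'microsoft', results_limit=5, exceptions=(requests.exceptions.ConnectionError,))` would retry only on connection errors raised by requests.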

Archived?

Is this repo still connected to USPTO? Probably not, as connections take a ton of time without any response.

Is it a bug?

When I search for two keywords, the result is zero. I checked the code and think it would be more reasonable if
searchstring = searchstring.replace(' ', '+') were changed to searchstring = searchstring.replace('-', '+').
I am using version 1.1.0, and I found that both 1.2.0 and 1.1.0 report an error after downloading some patents.
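For reference, replacing spaces with '+' is the usual way to encode a space in a URL query string, and Python's standard library can do it along with the other characters that need escaping. The sketch below is not pypatent's code; it only shows what `urllib.parse.quote_plus` produces for a two-word search.

```python
from urllib.parse import quote_plus

# Spaces become '+'; other reserved characters are percent-encoded
print(quote_plus('melting point'))
# melting+point
print(quote_plus('TTL/(tennis AND racket)'))
# TTL%2F%28tennis+AND+racket%29
```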

Error in handling patents before 1975

Hi, it seems that pypatent can only handle patents issued after 1976.
When I search for patents issued before 1975 and specify ISD/1/$/1975, it always returns an error.
I noticed that on the USPTO website, the user can manually choose either the post-1976 database or the full database going back to 1790.
Can we manually choose the full database in pypatent?
