uspto-parser

Spark application to parse patents from the USPTO. It is mainly a wrapper around the USPTO parser USPTO/PatentPublicData, so thanks to them for the dirty work :)

Build

Use sbt docker to build a Docker image that already contains the jar:

sbt docker

Usage

Show usage:

docker run --rm asgard/uspto-parser:latest spark-submit /home/uspto-parser/uspto-parser-assembly-0.0.1.jar

  --folderPath <value>
        path to folder containing patent archives to process
  --outputPath <value>
        path to output parquet file
  --numPartitions <value>
        Number of partitions of rdd to process
  --test
        Flag to test the software, process only 2 patent archives
  --from <value>
        Starting date, in string format, will be inferred. For instance 20010101
  --to <value>
        Ending date, in string format: yyyyMMdd or yyMMdd
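
The help text above says the date format is inferred and names two layouts, yyyyMMdd and yyMMdd. How the application actually infers the format is not shown here. The following is a minimal sketch of one way to do it, choosing the pattern from the length of the string. The object DateArg and its parse method are hypothetical names, not part of uspto-parser.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import scala.util.Try

// Hypothetical helper: not the project's actual date handling.
object DateArg {
  private val longFormat  = DateTimeFormatter.ofPattern("yyyyMMdd")
  private val shortFormat = DateTimeFormatter.ofPattern("yyMMdd")

  /** Returns the parsed date, or None if the input matches neither layout. */
  def parse(raw: String): Option[LocalDate] = {
    val s = raw.trim
    if (!s.forall(_.isDigit)) None
    else s.length match {
      case 8 => Try(LocalDate.parse(s, longFormat)).toOption  // e.g. 20010101
      case 6 => Try(LocalDate.parse(s, shortFormat)).toOption // e.g. 010101
      case _ => None
    }
  }
}
```

Note that java.time resolves a two-digit yy year against a base of 2000, so 991231 parses as 2099-12-31. Whether uspto-parser treats short dates the same way is not stated.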

Data Schema

root
 |-- type: string (nullable = true)
 |-- kind: string (nullable = true)
 |-- patentId: string (nullable = true)
 |-- patentNb: string (nullable = true)
 |-- applicationId: string (nullable = true)
 |-- applicationDate: string (nullable = true)
 |-- publicationDate: string (nullable = true)
 |-- relatedIds: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- otherIds: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- abstract: string (nullable = true)
 |-- briefSummary: string (nullable = true)
 |-- detailedDescription: string (nullable = true)
 |-- inventors: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- name: struct (nullable = true)
 |    |    |    |-- type: string (nullable = true)
 |    |    |    |-- raw: string (nullable = true)
 |    |    |    |-- firstName: string (nullable = true)
 |    |    |    |-- middleName: string (nullable = true)
 |    |    |    |-- lastName: string (nullable = true)
 |    |    |    |-- abbreviated: string (nullable = true)
 |    |    |-- address: struct (nullable = true)
 |    |    |    |-- street: string (nullable = true)
 |    |    |    |-- city: string (nullable = true)
 |    |    |    |-- state: string (nullable = true)
 |    |    |    |-- country: string (nullable = true)
 |    |    |    |-- zipCode: string (nullable = true)
 |    |    |    |-- email: string (nullable = true)
 |    |    |    |-- phone: string (nullable = true)
 |    |    |-- residency: string (nullable = true)
 |    |    |-- nationality: string (nullable = true)
 |-- applicants: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- name: struct (nullable = true)
 |    |    |    |-- type: string (nullable = true)
 |    |    |    |-- raw: string (nullable = true)
 |    |    |    |-- firstName: string (nullable = true)
 |    |    |    |-- middleName: string (nullable = true)
 |    |    |    |-- lastName: string (nullable = true)
 |    |    |    |-- abbreviated: string (nullable = true)
 |    |    |-- address: struct (nullable = true)
 |    |    |    |-- street: string (nullable = true)
 |    |    |    |-- city: string (nullable = true)
 |    |    |    |-- state: string (nullable = true)
 |    |    |    |-- country: string (nullable = true)
 |    |    |    |-- zipCode: string (nullable = true)
 |    |    |    |-- email: string (nullable = true)
 |    |    |    |-- phone: string (nullable = true)
 |-- assignees: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- name: struct (nullable = true)
 |    |    |    |-- type: string (nullable = true)
 |    |    |    |-- raw: string (nullable = true)
 |    |    |    |-- firstName: string (nullable = true)
 |    |    |    |-- middleName: string (nullable = true)
 |    |    |    |-- lastName: string (nullable = true)
 |    |    |    |-- abbreviated: string (nullable = true)
 |    |    |-- address: struct (nullable = true)
 |    |    |    |-- street: string (nullable = true)
 |    |    |    |-- city: string (nullable = true)
 |    |    |    |-- state: string (nullable = true)
 |    |    |    |-- country: string (nullable = true)
 |    |    |    |-- zipCode: string (nullable = true)
 |    |    |    |-- email: string (nullable = true)
 |    |    |    |-- phone: string (nullable = true)
 |    |    |-- role: string (nullable = true)
 |    |    |-- roleDesc: string (nullable = true)
 |-- ipcs: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- type: string (nullable = true)
 |    |    |-- main: boolean (nullable = true)
 |    |    |-- normalized: string (nullable = true)
 |    |    |-- section: string (nullable = true)
 |    |    |-- class: string (nullable = true)
 |    |    |-- subClass: string (nullable = true)
 |    |    |-- group: string (nullable = true)
 |    |    |-- subGroup: string (nullable = true)
 |-- claims: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- type: string (nullable = true)
 |    |    |-- text: string (nullable = true)
 |    |    |-- parentIds: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |-- priorities: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- docNumber: string (nullable = true)
 |    |    |-- kind: string (nullable = true)
 |    |    |-- date: string (nullable = true)
 |    |    |-- country: string (nullable = true)
 |    |    |-- id: string (nullable = true)
 |-- publicationRef: struct (nullable = true)
 |    |-- docNumber: string (nullable = true)
 |    |-- kind: string (nullable = true)
 |    |-- date: string (nullable = true)
 |    |-- country: string (nullable = true)
 |    |-- id: string (nullable = true)
 |-- applicationRef: struct (nullable = true)
 |    |-- docNumber: string (nullable = true)
 |    |-- kind: string (nullable = true)
 |    |-- date: string (nullable = true)
 |    |-- country: string (nullable = true)
 |    |-- id: string (nullable = true)
 |-- citations: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- num: string (nullable = true)
 |    |    |-- documentId: struct (nullable = true)
 |    |    |    |-- docNumber: string (nullable = true)
 |    |    |    |-- kind: string (nullable = true)
 |    |    |    |-- date: string (nullable = true)
 |    |    |    |-- country: string (nullable = true)
 |    |    |    |-- id: string (nullable = true)
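
Every field in the schema is nullable, so code that consumes the parquet output has to cope with missing values. As a minimal sketch, the case classes below mirror a small slice of one inventors element, with Option standing in for nullable columns, and build a display label from whatever is present. The types Name, Address and Inventor, the object InventorLabel, and the fallback to the raw field are illustrative assumptions, not code from uspto-parser.

```scala
// Hypothetical types covering only part of an `inventors` element.
final case class Name(
  raw: Option[String],
  firstName: Option[String],
  middleName: Option[String],
  lastName: Option[String]
)
final case class Address(city: Option[String], state: Option[String], country: Option[String])
final case class Inventor(name: Option[Name], address: Option[Address])

object InventorLabel {
  /** Joins the structured name parts; falls back to `raw` when none are set. */
  def displayName(n: Name): Option[String] = {
    val parts = Seq(n.firstName, n.middleName, n.lastName).flatten.map(_.trim).filter(_.nonEmpty)
    if (parts.nonEmpty) Some(parts.mkString(" "))
    else n.raw.map(_.trim).filter(_.nonEmpty)
  }

  /** Builds "<name> [<country>]", omitting whatever is missing. */
  def label(inv: Inventor): String = {
    val who = inv.name.flatMap(displayName).getOrElse("(unknown)")
    inv.address.flatMap(_.country).fold(who)(c => s"$who [$c]")
  }
}
```

The same Option-based handling would apply to applicants and assignees, which share the name and address structure.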

Contributors

thomasopsomer

Issues

Null Pointer Exception for local run

Exception in thread "main" java.lang.NullPointerException
at scala.collection.mutable.ArrayOps$ofRef$.newBuilder$extension(ArrayOps.scala:112)
at scala.collection.mutable.ArrayOps$ofRef.newBuilder(ArrayOps.scala:108)
at scala.collection.TraversableLike$class.filter(TraversableLike.scala:262)
at scala.collection.mutable.ArrayOps$ofRef.filter(ArrayOps.scala:108)
at usptoparser.Helper$.getListOfFilesInFolder(Helper.scala:52)
at usptoparser.Helper$.getRecursiveListOfFilesInFolder(Helper.scala:58)
at usptoparser.SparkApp$.runLocal(SparkApp.scala:127)
at usptoparser.SparkApp$.run(SparkApp.scala:170)
at usptoparser.SparkApp$.main(SparkApp.scala:220)
at usptoparser.SparkApp.main(SparkApp.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
*** spark-submit exited with status 1.
*** Shutting down runit daemon (PID 8)...
*** Killing all processes...
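
The trace shows the exception being raised while filter is applied to an array inside Helper.getListOfFilesInFolder. One plausible cause, which the report does not confirm, is that the folder listing itself was null: java.io.File.listFiles returns null rather than an empty array when the path is not a readable directory, for example when --folderPath points to a location that does not exist inside the container. The sketch below shows a null-safe listing under that assumption. SafeListing and its methods are hypothetical and are not the project's Helper code.

```scala
import java.io.File

// Hypothetical replacement for a folder-listing helper; not the project's code.
object SafeListing {
  // File.listFiles returns null for a missing or unreadable directory,
  // so wrap it in Option before calling any collection method.
  private def entries(dir: File): Seq[File] =
    Option(dir.listFiles()).map(_.toSeq).getOrElse(Seq.empty)

  /** Regular files directly inside `dir`; empty if `dir` cannot be listed. */
  def listFiles(dir: File): Seq[File] =
    entries(dir).filter(_.isFile)

  /** Regular files inside `dir` and all of its subdirectories. */
  def listFilesRecursively(dir: File): Seq[File] = {
    val all = entries(dir)
    all.filter(_.isFile) ++ all.filter(_.isDirectory).flatMap(listFilesRecursively)
  }
}
```

With this shape a missing folder yields an empty list instead of a NullPointerException. A real fix might prefer to fail early with a message naming the offending path.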