
pluck's Introduction

pluck

Pluck text in a fast and intuitive way.

pluck makes text extraction intuitive and fast. You specify an extraction in nearly the same way you would instruct a person extracting the text by hand: "OK Bob, every time you find X and then Y, copy down everything you see until you encounter Z."

In pluck, X and Y are called activators and Z is called the deactivator. The file/URL being plucked is parsed (or streamed) byte-by-byte into a finite state machine. Once all activators are found, the following bytes are saved to a buffer, which is added to a list of results once the deactivator is found. Multiple queries are extracted simultaneously, and there is no requirement on the file format (e.g. XML/HTML), as long as it is text.
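The activator/deactivator mechanic can be sketched in a few lines of Go. This is an illustrative simplification (whole-string searching), not pluck's actual byte-by-byte state machine, and `pluckOne` and its signature are invented for this sketch:

```go
package main

import (
	"fmt"
	"strings"
)

// pluckOne is an illustrative sketch of the idea above: require each
// activator in order, then capture everything up to the deactivator,
// and repeat. The real pluck streams byte-by-byte through a finite
// state machine instead of searching whole strings.
func pluckOne(s string, activators []string, deactivator string) []string {
	var results []string
	for {
		// Advance past each activator, in order.
		for _, a := range activators {
			i := strings.Index(s, a)
			if i < 0 {
				return results
			}
			s = s[i+len(a):]
		}
		// Capture until the deactivator.
		j := strings.Index(s, deactivator)
		if j < 0 {
			return results
		}
		results = append(results, s[:j])
		s = s[j+len(deactivator):]
	}
}

func main() {
	html := `<a href="link1">1</a> <a href="link2">2</a>`
	fmt.Println(pluckOne(html, []string{"<", "href", `"`}, `"`)) // [link1 link2]
}
```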

Why?

pluck was made as a simple alternative to xpath and regexp. Through simple declarations, pluck allows complex procedures like extracting text in nested HTML tags, or extracting the content of an attribute of an HTML tag. pluck may not work in all scenarios, so do not consider it a replacement for xpath or regexp.

Doesn't regex already do this?

Yes, basically. Here is a (simple) example:

(?:(?:X.*Y)|(?:Y.*X))(.*)(?:Z)

Basically, this tries to match everything after both X and Y have been seen, in either order, and before a Z. It is not a complete example, but it shows the similarity.

The benefit of pluck is simplicity. You don't have to worry about escaping the right characters, nor do you need to know any regex syntax (which is not simple). Also, pluck is hard-coded to match this specific kind of pattern simultaneously, so there is no cost of generating a new deterministic finite automaton from multiple regexes.

Doesn't cascadia already do this?

Yes, there is already a command-line tool to extract structured information from XML/HTML. cascadia has many benefits; namely, you can do far more complex things with structured data. But if your data is not highly structured, pluck is advantageous (it extracts from any file). Also, with pluck you don't need to learn CSS selector syntax.

Getting Started

Install

If you have Go 1.7+:

go get github.com/schollz/pluck

or just download from the latest releases.

Basic usage

Let's say you want to find URLs in an HTML file.

$ wget nytimes.com -O nytimes.html
$ pluck -a '<' -a 'href' -a '"' -d '"' -l 10 -f nytimes.html
{
    "0": [
        "https://static01.nyt.com/favicon.ico",
        "https://static01.nyt.com/images/icons/ios-ipad-144x144.png",
        "https://static01.nyt.com/images/icons/ios-iphone-114x144.png",
        "https://static01.nyt.com/images/icons/ios-default-homescreen-57x57.png",
        "https://www.nytimes.com",
        "http://www.nytimes.com/services/xml/rss/nyt/HomePage.xml",
        "http://mobile.nytimes.com",
        "http://mobile.nytimes.com",
        "https://typeface.nyt.com/css/zam5nzz.css",
        "https://a1.nyt.com/assets/homepage/20170731-135831/css/homepage/styles.css"
    ]
}

The -a flag specifies activators and can be given multiple times. Once all activators are found, in order, the following bytes are captured. The -d flag specifies the deactivator; once it is found, capturing stops, the state resets, and searching begins again. The -l flag specifies an optional limit; after reaching it (10 in this example), searching stops.

Advanced usage

Parse URLs or Files

Files can be parsed with -f FILE and URLs can be parsed by instead using -u URL.

$ pluck -a '<' -a 'href' -a '"' -d '"' -l 10 -u https://nytimes.com

Use Config file

You can also specify multiple things to pluck, simultaneously, by listing the activators and the deactivator in a TOML file. For example, let's say we want to extract the title and the ingredients of a recipe. Make a file config.toml:

[[pluck]]
name = "title"
activators = ["<title>"]
deactivator = "</title>"

[[pluck]]
name = "ingredients"
activators = ["<label","Ingredient",">"]
deactivator = "<"
limit = -1

The title follows normal HTML, and the ingredients were determined by quickly inspecting the HTML source of the target site. Then, pluck with:

$ pluck -c config.toml -u https://goo.gl/DHmqmv
{
    "ingredients": [
        "1 pound medium (26/30) peeled and deveined shrimp, tails removed",
        "2 teaspoons chili powder",
        "Kosher salt",
        "2 tablespoons canola oil",
        "4 scallions, thinly sliced",
        "One 15-ounce can black beans, drained and rinsed well",
        "1/3 cup prepared chipotle mayonnaise ",
        "2 limes, 1 zested and juiced and 1 cut into wedges ",
        "One 14-ounce bag store-bought coleslaw mix (about 6 cups)",
        "1 bunch fresh cilantro, leaves and soft stems roughly chopped",
        "Sour cream or Mexican crema, for serving",
        "8 corn tortillas, warmed "
    ],
    "title": "15-Minute Shrimp Tacos with Spicy Chipotle Slaw Recipe | Food Network Kitchen | Food Network"
}

Extract structured data

Let's say you want to tell Bob: "OK Bob, first look for W. Then, every time you find X and then Y, copy down everything you see until you encounter Z. Also, stop if you see U, even if you are not at the end." In this case W, X, and Y are activators, but W is a "permanent" activator: once W is found, Bob no longer needs to look for it. U is a "finisher", which tells Bob to stop looking entirely and return whatever results were found.

You can extract information from blocks in pluck using two keywords: "permanent" and "finisher". The permanent number determines how many of the activators (from left to right) stay activated forever once activated. The finisher keyword is a string that, when found, retires the current plucker without capturing anything to the buffer.

For example, suppose you want to only extract link3 and link4 from the following:

<h1>Section 1</h1>
<a href="link1">1</a>
<a href="link2">2</a>
<h1>Section 2</h1>
<a href="link3">3</a>
<a href="link4">4</a>
<h1>Section 3</h1>
<a href="link5">5</a>
<a href="link6">6</a>

You can add "Section 2" as an activator and set permanent to 1 so that only the first activator ("Section 2") remains activated after the deactivator is found. You also want to finish the plucker when it hits "Section 3", so set the finisher to that string. Then this config.toml

[[pluck]]
activators = ["Section 2","a","href",'"']
permanent = 1     # designates that the first 1 activators will persist
deactivator = '"'
finisher = "Section 3"

will result in the following:

{
    "0": [
        "link3",
        "link4",
    ]
}
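As a sketch of these semantics, the permanent/finisher behavior can be modeled in Go roughly as below. This is an illustrative simplification, not pluck's actual streaming implementation, and `pluckSection` is an invented name:

```go
package main

import (
	"fmt"
	"strings"
)

// pluckSection sketches the "permanent"/"finisher" semantics described
// above. The first `permanent` activators are consumed once and never
// searched for again, and nothing at or past the finisher is captured.
func pluckSection(s string, activators []string, permanent int, deactivator, finisher string) []string {
	// Ignore everything at and after the finisher.
	if k := strings.Index(s, finisher); k >= 0 {
		s = s[:k]
	}
	// Consume the permanent activators exactly once, up front.
	for _, a := range activators[:permanent] {
		i := strings.Index(s, a)
		if i < 0 {
			return nil
		}
		s = s[i+len(a):]
	}
	var results []string
	for {
		// Require the remaining activators, in order, for each capture.
		for _, a := range activators[permanent:] {
			i := strings.Index(s, a)
			if i < 0 {
				return results
			}
			s = s[i+len(a):]
		}
		j := strings.Index(s, deactivator)
		if j < 0 {
			return results
		}
		results = append(results, s[:j])
		s = s[j+len(deactivator):]
	}
}

func main() {
	doc := `<h1>Section 1</h1><a href="link1">1</a><a href="link2">2</a>` +
		`<h1>Section 2</h1><a href="link3">3</a><a href="link4">4</a>` +
		`<h1>Section 3</h1><a href="link5">5</a><a href="link6">6</a>`
	fmt.Println(pluckSection(doc, []string{"Section 2", "a", "href", `"`}, 1, `"`, "Section 3"))
	// [link3 link4]
}
```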

More examples

See EXAMPLES.md for more examples.

Use as a Go package

Import pluck as "github.com/schollz/pluck/pluck" and you can use it in your own project. See the tests for more info.

Development

$ go get -u github.com/schollz/pluck/...
$ cd $GOPATH/src/github.com/schollz/pluck/pluck
$ go test -cover

Current benchmark

The state of the art for xpath is lxml, based on libxml2. Here is a comparison for plucking the same data from the same file, run on an Intel i5-4310U CPU @ 2.00GHz × 4. (Run the Python benchmark with cd pluck/test && python3 main.py.)

Language Rate
lxml (Python3.5) 300 / s
pluck 1270 / s

A real-world example I use pluck for is processing 1,200 HTML files, compared to running lxml in parallel:

Language Rate
lxml (Python3.6) 25 / s
pluck 430 / s

I'd like to benchmark a Perl regex, although I don't know how to write this kind of regex! Send a PR if you do :)

To Do

  • Allow OR statements (e.g. matching ' or ").
  • Quotes match to quotes (single or double)?
  • Allow piping from standard in?
  • API to handle strings, e.g. PluckString(s string)
  • Add parallelism

License

MIT

Acknowledgements

Graphics by: www.vecteezy.com

pluck's People

Contributors

schollz

pluck's Issues

Why is this slow?

@tscholl2

Why is it that when I call a method on a struct pointer, like

func (s *Object) process(inputs []byte) {
	for _, i := range inputs {
		// Lots of code
	}
}

it slows down a lot if I move // Lots of code into its own function? I.e., I reorganize the above program into

func (s *Object) process(inputs []byte) {
	for _, i := range inputs {
		s.processInput(i)
	}
}

func (s *Object) processInput(i byte) {
	// Lots of code
}

This new code runs 30% slower!

Why?

This matters because I'm in a situation in pluck where I need // Lots of code in two places. You can reproduce this in pluck by running

go get -u github.com/schollz/pluck
cd $GOPATH/src/github.com/schollz/pluck/pluck
git checkout ef1004f && go test -bench=Stream -run=z
git checkout 76c4e96 && go test -bench=Stream -run=z
git diff 76c4e96 ef1004f # shows that I replace lots of code with one function

Feature Request: Hierarchically structured plucked data

Assuming following config file:

# URL: https://www.pogdesign.co.uk/cat/
[[pluck]]
name = "day"
activators = ['href="./day/', 'title="']
deactivator = '"'

[[pluck]]
name = "show"
activators = ['<p data-episode', 'summary" >']
deactivator = '</a>'

[[pluck]]
name = "episode"
activators = ['<a href="', 'font-size: 0.65rem;">']
deactivator = '</a>'

It is possible to get the individual pieces of information from a webpage ("day", "show", and "episode"), but it is not hierarchically structured.

I would like to request a feature that allows getting the plucked data in a hierarchical structure.

Option to pass a string variable into PluckString?

Embarrassing, but I can admit it.

I suck...

Thanks for the brilliant code and apologies for opening an issue.

edit:

Well, I suck a little less than I thought.

I'm importing pluck as a package into some code I am writing, with github.com/chromedp/chromedp as my browser driver. PluckString is definitely getting the correct string content from a variable: if I print that variable to the console after loading a page with chromedp, it has the chunk or reactID that I need.

Any ideas for a newb? I can get a correct map from this (if I first write the same var to a file):

p.Load("config.toml")
p.PluckFile("testpluck.txt")
fmt.Println(p.Result())

But not from this:

//code that correctly loads desired string into var html here before and after using strconv.Quote.
p.PluckString(html)   // <------- PluckString won't take the string var html here for some reason  
fmt.Println(p.Result())

I am sure that I am just doing something stupid, but I can't figure out why PluckString works with a string literal, p.PluckString("here"), but won't take a string variable, p.PluckString(here).

Any advice for a newb?
