skrapeit / skrape.it Goto Github PK

A Kotlin-based testing/scraping/parsing library providing the ability to analyze and extract data from HTML (server & client-side rendered). It places particular emphasis on ease of use and a high level of readability by providing an intuitive DSL. It aims to be a testing lib, but can also be used to scrape websites in a convenient fashion.

Home Page: https://docs.skrape.it

License: MIT License

Kotlin 97.76% HTML 2.24%

skrape dom parse kotlin html-parser test-automation crawler scraper testing kotlin-dsl

skrape.it's Introduction

skrape{it}

🛎️ 🚨 Help wanted 🚨 🛎️
Looking for Co-Maintainer(s), please contact [email protected] if you are interested in helping to maintain and evolve skrape{it} ❤️

skrape{it} is a Kotlin-based HTML/XML testing and web scraping library that can be used seamlessly in Spring-Boot, Ktor, Android or other Kotlin-JVM projects. The ability to analyze and extract HTML including client-side rendered DOM trees and all other XML-related markup specifications such as SVG, UML, RSS,... makes it unique. It places particular emphasis on ease of use and a high level of readability by providing an intuitive DSL. First and foremost skrape{it} aims to be a testing tool (not tied to a particular test runner), but it can also be used to scrape websites in a convenient fashion.

Features

Parsing

Deserialization of HTML/XML from websites, local html files and html as string to data classes / POJOs.
Designed to deserialize HTML but can handle any XML-related markup specifications such as SVG, UML, RSS or XML itself.
DSL to select html elements as well as supporting CSS query-selector syntax by string invocation.

Http-Client

Http-Client without verbosity and ceremony to make requests and corresponding request options like headers, cookies etc. in a fluent style interface.
Pre-configure client regarding auth and other request settings
Can handle client side rendered web pages. Javascript execution results can optionally be considered in the response body.

Idiomatic

Easy to use, idiomatic and type-safe DSL to ensure a high level of readability.
Build-in matchers/assertions based on infix functions to archive a very high level of readability.
DSL is behaving like a Fluent-Api to make data extraction/scraping as comfortable as possible.

Compatibility

Not bind to a specific test-runner, framework or whatever.
Open to use any other assertion library of your choice.
Open to implement your own fetcher
Supports non-blocking fetching / Coroutine support

Extensions

In addition, extensions for well-known testing libraries are provided to extend them with the mentioned skrape{it} functionality. Currently available:

skrape{it} MockMvc extension
skrape{it} Ktor extension

Quick Start

Read the Docs

You'll always find the latest documentation, release notes and examples regarding official releases at https://docs.skrape.it. The README file you are reading right now provides example related to the latest master. Just use it if you won't wait for latest changes to be released. If you don't want to read that much or just want to get a rough overview on how to use skrape{it}, you can have a look at the Documentation by Example section which refers to the current master.

Installation

All our official/stable releases will be published to mavens central repository.

Add dependency

Gradle

dependencies {
    implementation("it.skrape:skrapeit:1.2.2")
}

Maven

<dependency>
    <groupId>it.skrape</groupId>
    <artifactId>skrapeit</artifactId>
    <version>1.2.2</version>
</dependency>

using bleeding edge features before official release

We are offering snapshot releases by publishing every successful build of a commit that has been pushed to master branch. Thereby you can just install the latest implementation of skrape{it}. Be careful since these are non-official releases and may be unstable as well as breaking changes can occur at any time.

Add experimental stuff

Gradle

repositories {
    maven { url = uri("https://oss.sonatype.org/content/repositories/snapshots/") }
}
dependencies {
    implementation("it.skrape:skrapeit:0-SNAPSHOT") { isChanging = true } // version number will stay - implementation may change ...
}

// optional
configurations.all {
    resolutionStrategy {
        cacheChangingModulesFor(0, "seconds")
    }
}

Maven

<repositories>
    <repository>
        <id>snapshot</id>
        <url>https://oss.sonatype.org/content/repositories/snapshots/</url>
    </repository>
</repositories>

...

<dependency>
    <groupId>it.skrape</groupId>
    <artifactId>skrapeit</artifactId>
    <version>0-SNAPSHOT</version>
</dependency>

Documentation by Example

(referring to current master)

You can find further examples in the projects integration tests.

Android

We have a working Android sample using jetpack-compose in our example projects as living documentation.

Parse and verify HTML from String

@Test
fun `can read and return html from String`() {
    htmlDocument("""
        <html>
            <body>
                <h1>welcome</h1>
                <div>
                    <p>first p-element</p>
                    <p class="foo">some p-element</p>
                    <p class="foo">last p-element</p>
                </div>
            </body>
        </html>""") {

            h1 {
                findFirst {
                    text toBe "welcome"
                }
            }
            p {
                withClass = "foo"
                findFirst {
                    text toBe "some p-element"
                    className toBe "foo"
                }
            }
            p {
                findAll {
                    text toContain "p-element"
                }
                findLast {
                    text toBe "last p-element"
                }
            }
        }
    }
}

Parse HTML and extract

data class MySimpleDataClass(
    val httpStatusCode: Int,
    val httpStatusMessage: String,
    val paragraph: String,
    val allParagraphs: List<String>,
    val allLinks: List<String>
)

class HtmlExtractionService {

    fun extract() {
        val extracted = skrape(HttpFetcher) {
            request {
                url = "http://localhost:8080"
            }

            response {
                MySimpleDataClass(
                    httpStatusCode = status { code },
                    httpStatusMessage = status { message },
                    allParagraphs = document.p { findAll { eachText } },
                    paragraph = document.p { findFirst { text } },
                    allLinks = document.a { findAll { eachHref } }
                )
            }
        }
        print(extracted)
        // will print:
        // MyDataClass(httpStatusCode=200, httpStatusMessage=OK, paragraph=i'm a paragraph, allParagraphs=[i'm a paragraph, i'm a second paragraph], allLinks=[http://some.url, http://some-other.url])
    }
}

Parse HTML and extract it

data class MyDataClass(
        var httpStatusCode: Int = 0,
        var httpStatusMessage: String = "",
        var paragraph: String = "",
        var allParagraphs: List<String> = emptyList(),
        var allLinks: List<String> = emptyList()
)

class HtmlExtractionService {

    fun extract() {
        val extracted = skrape(HttpFetcher) {
            request {
                url = "http://localhost:8080"
            }           

            extractIt<MyDataClass> {
                it.httpStatusCode = statusCode
                it.httpStatusMessage = statusMessage.toString()
                htmlDocument {
                    it.allParagraphs = p { findAll { eachText }}
                    it.paragraph = p { findFirst { text }}
                    it.allLinks = a { findAll { eachHref }}
                }
            }
        }
        print(extracted)
        // will print:
        // MyDataClass(httpStatusCode=200, httpStatusMessage=OK, paragraph=i'm a paragraph, allParagraphs=[i'm a paragraph, i'm a second paragraph], allLinks=[http://some.url, http://some-other.url])
    }
}

Testing HTML responses:

@Test
fun `dsl can skrape by url`() {
    skrape(HttpFetcher) {
        request {
            url = "http://localhost:8080/example"
        }       
        response {
            htmlDocument {
                // all official html and html5 elements are supported by the DSL
                div {
                    withClass = "foo" and "bar" and "fizz" and "buzz"

                    findFirst {
                        text toBe "div with class foo"

                        // it's possible to search for elements from former search results
                        // e.g. search all matching span elements within the above div with class foo etc...
                        span {
                            findAll {
                                // do something
                            }                       
                        }                   
                    }

                    findAll {
                        toBePresentExactlyTwice
                    }
                }
                // can handle custom tags as well
                "a-custom-tag" {
                    findFirst {
                        toBePresentExactlyOnce
                        text toBe "i'm a custom html5 tag"
                        text
                    }
                }
                // can handle custom tags written in css selctor query syntax
                "div.foo.bar.fizz.buzz" {
                    findFirst {
                        text toBe "div with class foo"
                    }
                }

                // can handle custom tags and add selector specificas via DSL
                "div.foo" {

                    withClass = "bar" and "fizz" and "buzz"

                    findFirst {
                        text toBe "div with class foo"
                    }
                }
            }
        }
    }
}

Scrape a client side rendered page:

fun getDocumentByUrl(urlToScrape: String) = skrape(BrowserFetcher) { // <--- pass BrowserFetcher to include rendered JS
    request { url = urlToScrape }
    response { htmlDocument { this } }
}


fun main() {
    // do stuff with the document
    println(getDocumentByUrl("https://docs.skrape.it").eachLink)
}

Scrape async

skrape{it}'s `AsyncFetcher` provides coroutine support

suspend fun getAllLinks(): Map<String, String> = skrape(AsyncFetcher) {
    request {
        url = "https://my-fancy.website"
    }
    response {
        htmlDocument { eachLink }
    }
}

Configure HTTP-Client:

class ExampleTest {
    val myPreConfiguredClient = skrape(HttpFetcher) {
        // url can be a plain url as string or build by #urlBuilder
        request {
            method = Method.POST // defaults to GET
            
            url = "" // you can  either pass url as String (defaults to 'http://localhost:8080')
            url { // or build url (will respect value from url as String param)
                // thereby you can pass a url and just override or add parts
                protocol = UrlBuilder.Protocol.HTTPS // defaults to given scheme from url param (HTTP if not set)
                host = "skrape.it" // defaults to given host from url param (localhost if not set)
                port = 12345  // defaults to given port from url param (8080 if not set explicitly - none port if given url param value does noit have port) - set to -1 to remove port
                path = "/foo" // defaults to given path from url param (none path if not set)
                queryParam { // can handle adding query parameters of several types (defaults to none)
                    "foo" to "bar" // add query paramter foo=bar
                    "aaa" to false // add query paramter aaa=false
                    "bbb" to .4711 // add query paramter bbb=0.4711
                    "ccc" to 42    // add query paramter ccc=42
                    "ddd" to listOf("a", 1, null) // add query paramter ddd=a,1,null
                    +"xxx"         // add query paramter xxx (just key, no value)
                }
            }
            timeout = 5000 // optional -> defaults to 5000ms
            followRedirects = true // optional -> defaults to true
            userAgent = "some custom user agent" // optional -> defaults to "Mozilla/5.0 skrape.it"
            cookies = mapOf("some-cookie-name" to "some-value") // optional
            headers = mapOf("some-custom-header" to "some-value") // optional
        }
    }
    
    @Test
    fun `can use preconfigured client`() {
    
        myPreConfiguredClient.response {
            status { code toBe 200 }
            // do more stuff
        }
    
        // slightly modify preconfigured client
        myPreConfiguredClient.apply {
            request {
                followRedirects = false
            }
        }.response {
            status { code toBe 301 }
            // do more stuff
        }
    }
}

send request body

1) plain as string

most low level option, needs to set content-type header "by hand"

skrape(HttpFetcher) {
    request {
        url = "https://www.my-fancy.url"
        method = Method.GET
        headers = mapOf("Content-Type" to "application/json")
        body = """{"foo":"bar"}"""
    }
    response {
        htmlDocument {
            ...

2) plain text with auto added content-type header that can be optionally overwritten

skrape(HttpFetcher) {
    request {
        url = "https://www.my-fancy.url"
        method = Method.POST
        body {
            data = "just a plain text" // content-type header will automatically set to "text/plain"
            contentType = "your-custom/content" // can optionally override content-type
        }
    }
    response {
        htmlDocument {
            ...

3) with helper functions for json or xml bodies

supports json and xml autocompletion when using intellij

skrape(HttpFetcher) {
    request {
        url = "https://www.my-fancy.url"
        method = Method.POST
        body {
            json("""{"foo":"bar"}""") // will automatically set content-type header to "application/json" 
            // or
            xml("<foo>bar</foo>") // will automatically set content-type header to "text/xml" 
            // or
            form("foo=bar") // will automatically set content-type header to "application/x-www-form-urlencoded" 
        }
    }
    response {
        htmlDocument {
            ...

4 with on the fly created json via dsl

skrape(HttpFetcher) {
    request {
        url = "https://www.my-fancy.url"
        method = Method.POST
        body {
            // will automatically set content-type header to "application/json"
            // will create {"foo":"bar","xxx":{"a":"b","c":[1,"d"]}} as request body
            json {
                "foo" to "bar"
                "xxx" to json {
                    "a" to "b"
                    "c" to listOf(1, "d")
                }
            }
        }
    }
    response {
        htmlDocument {
            ...

5 with on the fly created form via dsl

skrape(HttpFetcher) {
    request {
        url = "https://www.my-fancy.url"
        method = Method.POST
        body {
            // will automatically set content-type header to "application/x-www-form-urlencoded"
            // will create foo=bar&xxx=1.5 as request body
            form {
                "foo" to "bar"
                "xxx" to 1.5
            }
        }
    }
    response {
        htmlDocument {
            ...

Get in touch

If you need help, have questions on how to use skrape{it} or want to discuss features please don't hesitate to use the projects discussions section on GitHub or raise an issue if you found a bug.

Issues: You can raise issues on GitHub.
Discussions / Questions: Use the Discussions section or join the #skrape-it channel on the Kotlin Slack.
Twitter: Follow @skrape_it on Twitter for updates and release notifications.
Stackoverflow: post or search issues on Stackoverflow

💖 Support the project

Skrape{it} is and always will be free and open-source. I try to reply to everyone needing help using these projects. Obviously, the development, maintenance takes time.

However, if you are using this project and be happy with it or just want to encourage me to continue creating stuff or fund the caffeine and pizzas that fuel its development, there are few ways you can do it :-

Starring and sharing the project 🚀 to help make it more popular
Giving proper credit when you use skrape{it}, tell your friends and others about it 😃
Sponsor skrape{it} with a one-time donations via PayPal by just click this button → or use the GitHub sponsors program to support on a monthly basis 💖

skrape.it's People

Contributors

Stargazers

Watchers

skrape.it's Issues

org.jsoup.UncheckedIOException: java.net.SocketTimeoutException Read timeout

Once in a while I get this exception and I don't know how to prevent it from happening:
org.jsoup.UncheckedIOException: java.net.SocketTimeoutException: Read timeout

Caused by: java.net.SocketTimeoutException: Read timeout
	at org.jsoup.internal.ConstrainableInputStream.read(ConstrainableInputStream.java:58)
	at java.base/java.io.FilterInputStream.read(FilterInputStream.java:107)
	at org.jsoup.internal.ConstrainableInputStream.readToByteBuffer(ConstrainableInputStream.java:87)
	at org.jsoup.helper.DataUtil.readToByteBuffer(DataUtil.java:175)
	at org.jsoup.helper.HttpConnection$Response.prepareByteData(HttpConnection.java:863)
	... 9 more

Add dedicated element pickers for "base"-tag to the DSL

currently it is possible to to pick base-tag using

...
expect {
    element("base") {
        text() toBe "I'm the inner text"
     }
}

make it possible to have a more convenient and type safe way of picking elements.

e.g.:

...
expect {
    base {
        text() toBe "I'm the inner text"
     }
}

[IMPROVEMENT] migrate CI build from travis to GitHub action

[IMPROVEMENT] migrate the project from maven to gradle

the build tool of our choice should be gradle, therefore we want to migrate mavens pom.xml to a gradle.build.kts using the gradle Kotlin DSL!

ACs:

migrated build tooling from maven to gradle with Kotlin dsl
remove maven wrapper
add gradle wrapper
adjust travis build file

[FEATURE] make it possible to add specific properties to an element selection using the DSL

instead of picking elements that require a complex css query via query string like so:
element("div.myClass[myAttr=bar]") it should be possible to describe the selector using the DSL.

e.g.:

element("div") {
    class="myClass"
    attribute="myAttr" to "bar"
}

div {
    class="myClass"
    attribute="myAttr" to "bar"
}

old logic still has to be supported.

Add dedicated element pickers for "blockquote"-tag to the DSL

currently it is possible to to pick blockquote-tag using

...
expect {
    element("blockquote") {
        text() toBe "I'm the inner text"
     }
}

make it possible to have a more convenient and type safe way of picking elements.

e.g.:

...
expect {
    blockquote {
        text() toBe "I'm the inner text"
     }
}

Add dedicated element pickers for "p"-tag to the DSL

currently it is possible to to pick paragraph element ("p") using

...
expect {
    element("p") {
        text() toBe "i'm a paragraph"
     }
}

make it possible to have a more convenient and type safe way of picking elements.

e.g.:

...
expect {
    p {
        text() toBe "i'm a paragraph"
     }
}

[FEATURE] add support for DOM tree relevant assertions made with assertK

skrape{it} should support well-known assertion libraries by adding custom matchers that are relevant when expecting things in the DOM tree.
AssertK is an well-known and nicely extendable assertion library we want to support.

Please add the following custom matchers that can be used on Element objects

isPresent
isNotPresent
isPresentTimes(n)
hasText("")
hasTextContainig("")
hasTextStartingWith("")
hasTextEndingWith("")
hasClass("")
hasClassContainig("")
hasClassStartingWith("")
hasClassEndingWith("")
hasAttribute("" to "")
hasId("")
hasIdContainig("")
hasIdStartingWith("")
hasIdEndingWith("")
isDisabled
- check if element has attribute disabled
hasSrc("")
hasSrcContaining("")
hasSrcStartingWith("")
hasSrcEndingWith("")
hasTitle("")
hasTitleContainig("")
hasTitleStartingWith("")
hasTitleEndingWith("")
hasHref("")
hasHrefContainig("")
hasHrefStartingWith("")
hasHrefEndingWith("")

migrate this project from maven to gradle

We want to use gradle as our build tool of choice. Therefore the project should be migrated from maven to gradle

Add dedicated element picker for "noscript"-tag to the DSL

currently it is possible to to pick noscript-tag using

...
expect {
    element("noscript") {
        text() toBe "I'm the inner text"
     }
}

make it possible to have a more convenient and type safe way of picking elements.

e.g.:

...
expect {
    noscript {
        text() toBe "I'm the inner text"
     }
}

Add dedicated element pickers for "style"-tag to the DSL

currently it is possible to to pick style-tag using

...
expect {
    element("style") {
        text() toBe "I'm the inner text"
     }
}

make it possible to have a more convenient and type safe way of picking elements.

e.g.:

...
expect {
    style {
        text() toBe "I'm the inner text"
     }
}

Add dedicated element pickers for "footer"-tag to the DSL

currently it is possible to to pick footer-tag using

...
expect {
    element("footer") {
        text() toBe "I'm the inner text"
     }
}

make it possible to have a more convenient and type safe way of picking elements.

e.g.:

...
expect {
    footer {
        text() toBe "I'm the inner text"
     }
}

Add dedicated element pickers for "section"-tag to the DSL

currently it is possible to to pick section-tag using

...
expect {
    element("section") {
        text() toBe "I'm the inner text"
     }
}

make it possible to have a more convenient and type safe way of picking elements.

e.g.:

...
expect {
    section {
        text() toBe "I'm the inner text"
     }
}

Add dedicated element pickers for "address"-tag to the DSL

currently it is possible to to pick address-tag using

...
expect {
    element("address") {
        text() toBe "I'm the inner text"
     }
}

make it possible to have a more convenient and type safe way of picking elements.

e.g.:

...
expect {
    address {
        text() toBe "I'm the inner text"
     }
}

Add dedicated element pickers for "dt"-tag to the DSL

currently it is possible to to pick dt-tag using

...
expect {
    element("dt") {
        text() toBe "I'm the inner text"
     }
}

make it possible to have a more convenient and type safe way of picking elements.

e.g.:

...
expect {
    dt {
        text() toBe "I'm the inner text"
     }
}

[IMPROVEMENT] module split

Since skrape{it} is basically characterized by 3 functionalities that would also independently offer value for users, we also want to make them technically visible.

This will make it possible to separate/display functionalities individually at release as well as in the code.

for this reason a Gradle Multimodule project should be created:

the modules distinguish between:

core
html parser
assertions
http client
browser client

Create and release separate Maven artifacts

Add dedicated element pickers for "b"-tag to the DSL

currently it is possible to to pick b-tag using

...
expect {
    element("b") {
        text() toBe "I'm the inner text"
     }
}

make it possible to have a more convenient and type safe way of picking elements.

e.g.:

...
expect {
    b {
        text() toBe "I'm the inner text"
     }
}

Add dedicated element pickers for "figure"-tag to the DSL

currently it is possible to to pick figure-tag using

...
expect {
    element("figure") {
        text() toBe "I'm the inner text"
     }
}

make it possible to have a more convenient and type safe way of picking elements.

e.g.:

...
expect {
    figure {
        text() toBe "I'm the inner text"
     }
}

Add dedicated element pickers for "script"-tag to the DSL

currently it is possible to to pick script-tag using

...
expect {
    element("script") {
        text() toBe "I'm the inner text"
     }
}

make it possible to have a more convenient and type safe way of picking elements.

e.g.:

...
expect {
    script {
        text() toBe "I'm the inner text"
     }
}

Add dedicated element pickers for "strong"-tag to the DSL

currently it is possible to to pick strong-tag using

...
expect {
    element("strong") {
        text() toBe "I'm the inner text"
     }
}

make it possible to have a more convenient and type safe way of picking elements.

e.g.:

...
expect {
    strong {
        text() toBe "I'm the inner text"
     }
}

[FEATURE] add Oauth2 request option

I think it would be really awesome if skrape{it} would provide a convenient way to configure Oauth2 authentication via request options

Describe the solution you'd like

skrape {
    url = "http://some.url"
    auth {
        oauth2 {
            token = <access_token>
        }
    }
    expect {
        // do stuff with the response
    }
}

Additional context
It would be nice to have the config wrapped in an auth {} lambda to be open for further authentications like basic, ...

[IMPROVEMENT] make static code analysis work with gradle

we are currently using detekt as static source code analysis tool.
at the moment this is configured by using a maven plugin. make it work with gradle (it exists a plugin as well)

!! has to be done after #11

Add dedicated element pickers for "meta"-tag to the DSL

currently it is possible to to pick meta-tag using

...
expect {
    element("meta") {
        text() toBe "I'm the inner text"
     }
}

make it possible to have a more convenient and type safe way of picking elements.

e.g.:

...
expect {
    meta {
        text() toBe "I'm the inner text"
     }
}

Add dedicated element pickers for "figcaption"-tag to the DSL

currently it is possible to to pick figcaption-tag using

...
expect {
    element("figcaption") {
        text() toBe "I'm the inner text"
     }
}

make it possible to have a more convenient and type safe way of picking elements.

e.g.:

...
expect {
    figcaption {
        text() toBe "I'm the inner text"
     }
}

Add dedicated element pickers for "main"-tag to the DSL

currently it is possible to to pick main-tag using

...
expect {
    element("main") {
        text() toBe "I'm the inner text"
     }
}

make it possible to have a more convenient and type safe way of picking elements.

e.g.:

...
expect {
    main {
        text() toBe "I'm the inner text"
     }
}

MatchersKt cannot be represented in dex format.

This occurs when trying to build in android project. Here is the details error message
Method name 'to be' in class 'it.skrape.matchers.MatchersKt' cannot be represented in dex format.

Add dedicated element pickers for "nav"-tag to the DSL

currently it is possible to to pick nav-tag using

...
expect {
    element("nav") {
        text() toBe "I'm the inner text"
     }
}

make it possible to have a more convenient and type safe way of picking elements.

e.g.:

...
expect {
    nav {
        text() toBe "I'm the inner text"
     }
}

Add dedicated element pickers for "head"-tag to the DSL

currently it is possible to to pick head-tag using

...
expect {
    element("head") {
        text() toBe "I'm the inner text"
     }
}

make it possible to have a more convenient and type safe way of picking elements.

e.g.:

...
expect {
    head {
        text() toBe "I'm the inner text"
     }
}

Add dedicated element pickers for "div"-tag to the DSL

currently it is possible to to pick div-tag using

...
expect {
    element("div") {
        text() toBe "I'm the inner text"
     }
}

make it possible to have a more convenient and type safe way of picking elements.

e.g.:

...
expect {
    div {
        text() toBe "I'm the inner text"
     }
}

Add dedicated element pickers for "html"-tag to the DSL

currently it is possible to to pick html-tag using

...
expect {
    element("html") {
        text() toBe "I'm the inner text"
     }
}

make it possible to have a more convenient and type safe way of picking elements.

e.g.:

...
expect {
    html {
        text() toBe "I'm the inner text"
     }
}

Add dedicated element pickers for "span"-tag to the DSL

currently it is possible to to pick span-tag using

...
expect {
    element("span") {
        text() toBe "I'm the inner text"
     }
}

make it possible to have a more convenient and type safe way of picking elements.

e.g.:

...
expect {
    span {
        text() toBe "I'm the inner text"
     }
}

Add dedicated element pickers for "header"-tag to the DSL

currently it is possible to to pick header-tag using

...
expect {
    element("header") {
        text() toBe "I'm the inner text"
     }
}

make it possible to have a more convenient and type safe way of picking elements.

e.g.:

...
expect {
    header {
        text() toBe "I'm the inner text"
     }
}

(!) this requires to change the behavior of the current "header" function that is extraction the HTTP headers

Add dedicated element pickers for headline tags to the DSL

currently it is possible to to pick headline tags using

...
expect {
    element("h2") {
        text() toBe "I'm the inner text"
     }
}

make it possible to have a more convenient and type safe way of picking elements.
it should support all standard html5 headline tags (h1, h2, h3, h4, h5, h6)

e.g.:

...
expect {
    h2 {
        text() toBe "I'm the inner text"
     }
}

[FEATURE] add support for DOM tree relevant assertions made with assertJ

skrape{it} should support well-known assertion libraries by adding custom matchers that are relevant when expecting things in the DOM tree.
AssertJ is an well-known and nicely extendable assertion library we want to support.

Please add the following custom matchers that can be used on Element objects

isPresent
isNotPresent
isPresentTimes(n)
hasText("")
hasTextContainig("")
hasTextStartingWith("")
hasTextEndingWith("")
hasClass("")
hasClassContainig("")
hasClassStartingWith("")
hasClassEndingWith("")
hasAttribute("" to "")
hasId("")
hasIdContainig("")
hasIdStartingWith("")
hasIdEndingWith("")
isDisabled
check if element has attribute disabled
hasSrc("")
hasSrcContaining("")
hasSrcStartingWith("")
hasSrcEndingWith("")
hasTitle("")
hasTitleContainig("")
hasTitleStartingWith("")
hasTitleEndingWith("")
hasHref("")
hasHrefContainig("")
hasHrefStartingWith("")
hasHrefEndingWith("")

Add dedicated element pickers for "hr"-tag to the DSL

currently it is possible to to pick hr-tag using

...
expect {
    element("hr") {
        text() toBe "I'm the inner text"
     }
}

make it possible to have a more convenient and type safe way of picking elements.

e.g.:

...
expect {
    hr {
        text() toBe "I'm the inner text"
     }
}

Add dedicated element pickers for "link"-tag to the DSL

currently it is possible to to pick link-tag using

...
expect {
    element("link") {
        text() toBe "I'm the inner text"
     }
}

make it possible to have a more convenient and type safe way of picking elements.

e.g.:

...
expect {
    link {
        text() toBe "I'm the inner text"
     }
}

Add dedicated element pickers for "dd"-tag to the DSL

currently it is possible to to pick dd-tag using

...
expect {
    element("dd") {
        text() toBe "I'm the inner text"
     }
}

make it possible to have a more convenient and type safe way of picking elements.

e.g.:

...
expect {
    dd {
        text() toBe "I'm the inner text"
     }
}

[FEATURE] add basic-auth to request config

I think it would be really awesome if scrape{it} would provide a convenient way to configure basic authentication via request options.

Describe the solution you'd like

skrape {
    url = "http://some.url"
    auth {
        basic {
            username = "foo"
            password = "bar"
        }
    }
    expect {
        // do stuff with the response
    }
}

Describe alternatives you've considered
It is currently possible to archive basic auth by setting the related header manually.

skrape {
    url = "http://some.url"
    headers = mapOf("Authorization" to "Basic aHR0cHdhdGNoOmY=") // base64 encoded version of <username>:<password>
}
`
``
or by adding the credentials to the url
```kotlin
skrape {
    url = "http://<username>:<password>@some.url"
}

Additional context
It would be nice to have the config wrapped in an auth {} lambda to be open for further authentications like oauth2

Add dedicated element pickers for "li"-tag to the DSL

currently it is possible to to pick li-tag using

...
expect {
    element("li") {
        text() toBe "I'm the inner text"
     }
}

make it possible to have a more convenient and type safe way of picking elements.

e.g.:

...
expect {
    li {
        text() toBe "I'm the inner text"
     }
}

[FEATURE] add support for DOM tree relevant assertions made with Kotlintest

skrape{it} should support well-known assertion libraries by adding custom matchers that are relevant when expecting things in the DOM tree.
KotlinTest is a well-known testing and assertion library we want to support.

Please add the following custom matchers that can be used on Element objects

shouldBePresent
shouldNotBePresent
shouldBePresentTimes(n)
shouldHaveText("")
shouldHaveTextContainig("")
shouldHaveTextStartingWith("")
shouldHaveTextEndingWith("")
shouldHaveClass("")
shouldHaveClassContainig("")
shouldHaveClassStartingWith("")
shouldHaveClassEndingWith("")
shouldHaveAttribute("" to "")
shouldHaveId("")
shouldHaveIdContainig("")
shouldHaveIdStartingWith("")
shouldHaveIdEndingWith("")
shouldBeDisabled
- check if element has attribute disabled
shouldHaveSrc("")
shouldHaveSrcContaining("")
shouldHaveSrcStartingWith("")
shouldHaveSrcEndingWith("")
shouldHaveTitle("")
shouldHaveTitleContainig("")
shouldHaveTitleStartingWith("")
shouldHaveTitleEndingWith("")
shouldHaveHref("")
shouldHaveHrefContainig("")
shouldHaveHrefStartingWith("")
shouldHaveHrefEndingWith("")

Add dedicated element pickers for "ol"-tag to the DSL

currently it is possible to to pick ol-tag using

...
expect {
    element("ol") {
        text() toBe "I'm the inner text"
     }
}

make it possible to have a more convenient and type safe way of picking elements.

e.g.:

...
expect {
    ol {
        text() toBe "I'm the inner text"
     }
}

travis CI build one latest linux dist

the CI build (at travis.org) should run on latest linux dist (18.04 LTS, Bionic)

[IMPROVEMENT] replace http client (HttpFetcher)

we want to use KoHttp as our http client of choice because it has a nice Kotlin DSL and native support for non-blocking requests.

Add dedicated element pickers for "article"-tag to the DSL

currently it is possible to to pick article-tag using

...
expect {
    element("article") {
        text() toBe "I'm the inner text"
     }
}

make it possible to have a more convenient and type safe way of picking elements.

e.g.:

...
expect {
    article {
        text() toBe "I'm the inner text"
     }
}

Add dedicated element pickers for "dl"-tag to the DSL

currently it is possible to to pick dl-tag using

...
expect {
    element("dl") {
        text() toBe "I'm the inner text"
     }
}

make it possible to have a more convenient and type safe way of picking elements.

e.g.:

...
expect {
    dl {
        text() toBe "I'm the inner text"
     }
}

[FEATURE] add support for DOM tree relevant assertions made with strikt

skrape{it} should support well-known assertion libraries by adding custom matchers that are relevant when expecting things in the DOM tree.
Strikt is an well-known and nicely extendable assertion library we want to support.

Please add the following custom matchers that can be used on Element objects

isPresent
isNotPresent
isPresentTimes(n)
hasText("")
hasTextContainig("")
hasTextStartingWith("")
hasTextEndingWith("")
hasClass("")
hasClassContainig("")
hasClassStartingWith("")
hasClassEndingWith("")
hasAttribute("" to "")
hasId("")
hasIdContainig("")
hasIdStartingWith("")
hasIdEndingWith("")
isDisabled
- check if element has attribute disabled
hasSrc("")
hasSrcContaining("")
hasSrcStartingWith("")
hasSrcEndingWith("")
hasTitle("")
hasTitleContainig("")
hasTitleStartingWith("")
hasTitleEndingWith("")
hasHref("")
hasHrefContainig("")
hasHrefStartingWith("")
hasHrefEndingWith("")

[IMPROVEMENT] publish snapshots to jitpack

We want to publish all commits to the master that have passed the CI build to jitpack.
Thereby we are able to early test integration of potential release candidates and providing bleeding edge features to users.

Regular / stable releases will still go to mavens central repository

Add dedicated element pickers for "pre"-tag to the DSL

currently it is possible to to pick pre-tag using

...
expect {
    element("pre") {
        text() toBe "I'm the inner text"
     }
}

make it possible to have a more convenient and type safe way of picking elements.

e.g.:

...
expect {
    pre {
        text() toBe "I'm the inner text"
     }
}

Add dedicated element pickers for "ul"-tag to the DSL

currently it is possible to to pick ul-tag using

...
expect {
    element("ul") {
        text() toBe "I'm the inner text"
     }
}

make it possible to have a more convenient and type safe way of picking elements.

e.g.:

...
expect {
    ul {
        text() toBe "I'm the inner text"
     }
}

[QUESTION] support kotlin.js

is someone into building Kotlin multiplattform libraries and could give some advise on what to do / pitfalls to be aware of or would be open to help to make this project a multiplattform lib?

skrapeit / skrape.it Goto Github PK

skrape.it's Introduction

Features

Parsing

Http-Client

Idiomatic

Compatibility

Extensions

Quick Start

Read the Docs

Installation

Add dependency

using bleeding edge features before official release

Add experimental stuff

Documentation by Example

(referring to current master)

Android

Parse and verify HTML from String

Parse HTML and extract

Parse HTML and extract it

Testing HTML responses:

Scrape a client side rendered page:

Scrape async

skrape{it}'s AsyncFetcher provides coroutine support

Configure HTTP-Client:

send request body

1) plain as string

most low level option, needs to set content-type header "by hand"

2) plain text with auto added content-type header that can be optionally overwritten

3) with helper functions for json or xml bodies

supports json and xml autocompletion when using intellij

4 with on the fly created json via dsl

5 with on the fly created form via dsl

Get in touch

💖 Support the project

skrape.it's People

Contributors

Stargazers

Watchers

Forkers

skrape.it's Issues

Recommend Projects

Recommend Topics

Recommend Org

skrape{it}'s `AsyncFetcher` provides coroutine support