
scrapegraph-ai's Introduction

🕷️ ScrapeGraphAI: You Only Scrape Once


ScrapeGraphAI is a Python web scraping library that uses LLMs and direct graph logic to create scraping pipelines for websites, documents, and XML files. Just say what information you want to extract and the library will do it for you!


🚀 Quick install

The reference page for Scrapegraph-ai is available on the official PyPI page.

pip install scrapegraphai

🔍 Demo

An official Streamlit demo is available.

You can also try it directly on the web using Google Colab.

Follow the procedure at the following link to set up your OpenAI API key: link.

📖 Documentation

The documentation for ScrapeGraphAI can be found here.

Also check out the Docusaurus documentation.

💻 Usage

You can use the SmartScraper class to extract information from a website using a prompt.

The SmartScraper class is a direct graph implementation that uses the most common nodes present in a web scraping pipeline. For more information, please see the documentation.
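
To make the idea of a graph of nodes more concrete, here is a minimal, library-independent sketch. It is not ScrapeGraphAI's implementation: the node functions (`fetch_node`, `parse_node`, `extract_node`) and the `run_graph` helper are hypothetical names invented for illustration, and the LLM step is replaced by a plain keyword filter so the example runs offline.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects the non-empty text chunks of an HTML document."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


def fetch_node(state):
    # Hypothetical node: a real pipeline would download state["source"] here.
    # This sketch assumes the source is HTML that was already downloaded.
    state["html"] = state["source"]
    return state


def parse_node(state):
    # Hypothetical node: reduce the HTML to a list of text chunks.
    collector = TextCollector()
    collector.feed(state["html"])
    state["chunks"] = collector.chunks
    return state


def extract_node(state):
    # Hypothetical node: a keyword filter standing in for the LLM call.
    keyword = state["prompt"].lower()
    state["answer"] = [c for c in state["chunks"] if keyword in c.lower()]
    return state


def run_graph(nodes, edges, entry, state):
    """Walk a directed graph of nodes, following one outgoing edge per node."""
    current = entry
    while current is not None:
        state = nodes[current](state)
        current = edges.get(current)  # None when the node has no successor
    return state


nodes = {"fetch": fetch_node, "parse": parse_node, "extract": extract_node}
edges = {"fetch": "parse", "parse": "extract"}

final_state = run_graph(
    nodes,
    edges,
    entry="fetch",
    state={
        "prompt": "news",
        "source": "<h1>Daily news</h1><p>Weather report</p><p>Tech news today</p>",
    },
)
print(final_state["answer"])
```

Each node reads from and writes to a shared state dictionary, and the edges decide which node runs next, so swapping one step (for example, a different extractor) only means replacing one entry in `nodes`.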

Case 1: Extracting information using a local LLM

Note: before using the local model, remember to create the Docker container!

    docker-compose up -d
    docker exec -it ollama ollama run stablelm-zephyr

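You can use any model you like instead of stablelm-zephyr; just make sure the model you run in the container matches the one named in the configuration (the example below uses ollama/mistral).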

from scrapegraphai.graphs import SmartScraperGraph

graph_config = {
    "llm": {
        "model": "ollama/mistral",
        "temperature": 0,
        "format": "json",  # Ollama needs the format to be specified explicitly
        # "model_tokens": 2000, # set context length arbitrarily
    },
}

smart_scraper_graph = SmartScraperGraph(
    prompt="List me all the news with their description.",
    # also accepts a string with the already downloaded HTML code
    source="https://www.wired.com",
    config=graph_config
)

result = smart_scraper_graph.run()
print(result)
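
The comment in the example above notes that `source` also accepts a string with already downloaded HTML. A minimal sketch of one way to prepare such a value follows. The `resolve_source` helper is hypothetical and not part of ScrapeGraphAI; it assumes that anything not starting with `http://` or `https://` is a path to a local HTML file.

```python
import tempfile
from pathlib import Path


def resolve_source(source):
    """Return a URL unchanged, or the contents of a local HTML file.

    Hypothetical helper for illustration; not part of ScrapeGraphAI.
    """
    if source.startswith(("http://", "https://")):
        return source
    return Path(source).read_text(encoding="utf-8")


# Demo: a URL passes through untouched, a file path is read into a string.
remote_source = resolve_source("https://www.wired.com")

with tempfile.TemporaryDirectory() as tmp_dir:
    page = Path(tmp_dir) / "page.html"
    page.write_text("<p>hello</p>", encoding="utf-8")
    local_source = resolve_source(str(page))

print(remote_source)
print(local_source)
```

Either returned value could then be passed as the `source` argument shown in the example above.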

Case 2: Extracting information using an OpenAI model

from scrapegraphai.graphs import SmartScraperGraph

OPENAI_API_KEY = "YOUR_API_KEY"  # replace with your own key

graph_config = {
    "llm": {
        "api_key": OPENAI_API_KEY,
        "model": "gpt-3.5-turbo",
    },
}

smart_scraper_graph = SmartScraperGraph(
    prompt="List me all the news with their description.",
    # also accepts a string with the already downloaded HTML code
    source="https://www.wired.com",
    config=graph_config
)

result = smart_scraper_graph.run()
print(result)
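
Hard-coding the key is fine for a quick test, but a common alternative is to read it from an environment variable. The sketch below shows one way to do that. The `build_openai_config` helper and the `OPENAI_API_KEY` variable name are assumptions made for illustration; only the `llm`, `api_key`, and `model` keys come from the example above.

```python
import os


def build_openai_config(environ=os.environ, model="gpt-3.5-turbo",
                        env_var="OPENAI_API_KEY"):
    """Build a graph_config dict with the API key taken from the environment.

    Hypothetical helper for illustration; not part of ScrapeGraphAI.
    """
    api_key = environ.get(env_var)
    if not api_key:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return {"llm": {"api_key": api_key, "model": model}}


# Demo with a stand-in mapping so the sketch runs without a real key.
# In real use, call build_openai_config() to read from os.environ.
demo_env = {"OPENAI_API_KEY": "YOUR_API_KEY"}
graph_config = build_openai_config(environ=demo_env)
print(graph_config["llm"]["model"])
```

The resulting `graph_config` has the same shape as the dictionary in the example above, so it can be passed as `config` in the same way.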

Case 3: Extracting information using Gemini

from scrapegraphai.graphs import SmartScraperGraph

GOOGLE_APIKEY = "YOUR_API_KEY"  # replace with your own key

# Define the configuration for the graph
graph_config = {
    "llm": {
        "api_key": GOOGLE_APIKEY,
        "model": "gemini-pro",
    },
}

# Create the SmartScraperGraph instance
smart_scraper_graph = SmartScraperGraph(
    prompt="List me all the quotes, authors and tags",
    # also accepts a string with the already downloaded HTML code
    source="http://quotes.toscrape.com",
    config=graph_config
)

result = smart_scraper_graph.run()
print(result)

The output for all three cases will be a dictionary with the extracted information, for example:

{
    'titles': [
        'Rotary Pendulum RL'
    ],
    'descriptions': [
        'Open Source project aimed at controlling a real life rotary pendulum using RL algorithms'
    ]
}
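
As a rough illustration of working with such a result, the sketch below pairs each title with its description. It assumes the result has exactly the shape of the example above (parallel `titles` and `descriptions` lists); the actual keys depend on the prompt and the model, and `to_records` is a hypothetical helper, not part of the library.

```python
# Sample result copied from the example output above.
result = {
    "titles": ["Rotary Pendulum RL"],
    "descriptions": [
        "Open Source project aimed at controlling a real life rotary "
        "pendulum using RL algorithms"
    ],
}


def to_records(result):
    """Pair titles with descriptions; missing keys yield an empty list."""
    titles = result.get("titles", [])
    descriptions = result.get("descriptions", [])
    return [
        {"title": title, "description": description}
        for title, description in zip(titles, descriptions)
    ]


records = to_records(result)
for record in records:
    print(f"{record['title']}: {record['description']}")
```

Using `.get` with a default keeps the helper from raising a `KeyError` when the model returns different keys than expected.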

🤝 Contributing

Feel free to contribute and join our Discord server to discuss improvements and share your suggestions with us!

For more information, please see the contributing guidelines.


โค๏ธ Contributors


🎓 Citations

If you have used our library for research purposes, please cite us with the following reference:

  @misc{scrapegraph-ai,
    author = {Marco Perini and Lorenzo Padoan and Marco Vinciguerra},
    title = {Scrapegraph-ai},
    year = {2024},
    url = {https://github.com/VinciGit00/Scrapegraph-ai},
    note = {A Python library for scraping data from graphs}
  }

Authors


Contact Info
Marco Vinciguerra: LinkedIn
Marco Perini: LinkedIn
Lorenzo Padoan: LinkedIn

📜 License

ScrapeGraphAI is licensed under the MIT License. See the LICENSE file for more information.

Acknowledgements

  • We would like to thank all the contributors to the project and the open-source community for their support.
  • ScrapeGraphAI is meant to be used for data exploration and research purposes only. We are not responsible for any misuse of the library.

scrapegraph-ai's People

Contributors

vincigit00, perinim, lurenss, dpende, ftoppi, dependabot[bot], erjanmx
