scrapcrab's Introduction

ScrapCrab

Automated webscraping, timeseries database & dashboard.

Deployed in a docker-compose environment.

1-credit Master's project.

Goal

  • Build a service that retrieves data from the web (scraping, APIs, file downloads) [Python, BeautifulSoup, Requests, Selenium]
  • Store the data in a database ([InfluxDB] for time series, [Prometheus] for monitoring)
  • Create a dashboard to display the data [Grafana]
  • Deploy the application via docker-compose (optionally on a continuously running machine, a Chromebox)
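
The first two goals amount to a scrape-then-store step that can be run repeatedly. The following is only a minimal sketch of that shape, not the project's actual code: `Reading`, `run_once`, and the injected `fetch` and `store` callables are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Reading:
    """One scraped measurement (hypothetical structure)."""
    station: str
    level_cm: float
    measured_at: datetime


def run_once(
    fetch: Callable[[], Iterable[Reading]],
    store: Callable[[List[Reading]], None],
) -> int:
    """Fetch readings, hand them to the store, and return how many were stored."""
    readings = list(fetch())
    if readings:  # skip the write entirely when the scrape returned nothing
        store(readings)
    return len(readings)


if __name__ == "__main__":
    # Stand-ins for a real scraper and a real database writer.
    def fake_fetch() -> List[Reading]:
        now = datetime.now(timezone.utc)
        return [Reading(station="Bonn", level_cm=234.0, measured_at=now)]

    collected: List[Reading] = []
    count = run_once(fake_fetch, collected.extend)
    print(f"stored {count} reading(s)")
```

Passing the scraper and the writer in as callables keeps the loop testable without a browser or a running database; a scheduler would simply call `run_once` at a fixed interval.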

Data sources

Download files: https://www.pegelonline.wsv.de/webservices/files (temperature & water level)

Page content: https://pegel.bonn.de/php/rheinpegel.php (water level)

CHANGELOG

xx.11.22    - Started project
            - Researched websites for Rhine water-level data
            - Included Selenium & BeautifulSoup web scraping
            - 4 hours

01.12.22    - Added Jupyter notebook
            - Finished scraper
            - Researched InfluxDB time-series database
            - 4 hours

02.12.22    - Added InfluxDB write from a pandas DataFrame
            - Refactored Jupyter notebook
            - Investigated scheduled Python tasks: Advanced Python Scheduler
                Source: https://apscheduler.readthedocs.io/en/3.x/userguide.html
            - Added logging (INFO/WARNING levels, formatter)
                Source: https://stackoverflow.com/questions/16757578/what-is-pythons-default-logging-formatter
            - Added headless Chromedriver to Python. Required for running Selenium inside a Docker container.
                Setting the window size explicitly turned out to matter, for reasons not yet understood.
                Source: https://www.geeksforgeeks.org/driving-headless-chrome-with-python/
            - Added Chromedriver version handling in the Dockerfile.
            - Added .dockerignore for a smaller build context
            - Added docker-compose; scraper + InfluxDB work.
            - 7 hours
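
The headless Chromedriver setup from the entry above could look roughly like the sketch below. It is an assumption about the configuration, not the project's actual code: the flag list, the 1920x1080 default, and the helper names `headless_chrome_arguments` and `build_driver` are illustrative. Selenium is imported lazily so the argument list can be inspected without Selenium or Chrome installed.

```python
from typing import List


def headless_chrome_arguments(width: int = 1920, height: int = 1080) -> List[str]:
    """Return Chrome flags commonly used when running inside a container."""
    return [
        "--headless",               # no display server is available in the container
        "--no-sandbox",             # often needed when Chrome runs as root in Docker
        "--disable-dev-shm-usage",  # /dev/shm is small in many containers
        f"--window-size={width},{height}",  # explicit size, as noted in the changelog
    ]


def build_driver():
    """Create a headless Chrome driver (requires Selenium 4 and a matching chromedriver)."""
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    for argument in headless_chrome_arguments():
        options.add_argument(argument)
    return webdriver.Chrome(options=options)


if __name__ == "__main__":
    print(headless_chrome_arguments())
```

One possible reason the window size matters is that headless Chrome starts with a small default viewport, so responsive pages may render a different layout than the one the scraper expects. This is a guess, not something the changelog confirms.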

05.12.22    - Tried the monitoring/ docker-compose file (Grafana, Prometheus, node-exporter).
                Prometheus Error: ts=2022-12-05T17:23:46.631Z caller=dedupe.go:112 component=remote level=warn remote_name=37e61e
                 url=grafana msg="Failed to send batch, retrying" err="Post \"grafana\": unsupported protocol scheme \"\""
                The remote URL "grafana" carries no scheme such as http://, which is what the error message complains about.


22.12.22    - Error: raise NewConnectionError(
  | urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7f7f1ef17ac0>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution

            The write API cannot find InfluxDB for some reason, although localhost was accessible.
            A library function changed, so I had to add a type cast to int.

            - Got Grafana working with the metrics exported by Prometheus. Note that Grafana actually imports its metrics from Prometheus. The Prometheus push still fails.
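
A "Temporary failure in name resolution" inside docker-compose often means the hostname in the client URL does not match a service name on the shared network. Whether that was the cause here is not recorded. The sketch below shows one way to make the URL configurable and to check resolution before writing. The variable name `INFLUXDB_URL` and the service name `influxdb` are assumptions; 8086 is InfluxDB's default HTTP port.

```python
import os
import socket
from typing import Mapping, Optional
from urllib.parse import urlparse


def influx_url(env: Optional[Mapping[str, str]] = None) -> str:
    """Read the InfluxDB URL from the environment, defaulting to the compose service name."""
    env = os.environ if env is None else env
    # Inside a compose network the service name, not localhost, is the hostname.
    return env.get("INFLUXDB_URL", "http://influxdb:8086")


def can_resolve(url: str) -> bool:
    """Return True if the host part of the URL resolves from this machine or container."""
    host = urlparse(url).hostname
    if host is None:
        return False
    try:
        socket.getaddrinfo(host, None)
    except socket.gaierror:
        return False
    return True


if __name__ == "__main__":
    url = influx_url()
    print(url, "resolvable:", can_resolve(url))
```

Running this inside the scraper container gives a quick answer to whether the database hostname is visible from there at all, before any client library is involved.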

28.12.2022  - Prometheus vs. InfluxDB: pull-based vs. push-based system.
            - Combined monitoring with the scraping docker-compose.
            - Integrated the InfluxDB dashboard into Grafana
            - Solved Grafana file permission problems.
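
The pull/push distinction can be illustrated by what the application has to produce in each case. With InfluxDB the client formats a line-protocol record and sends it to the database. With Prometheus the application only exposes a text line and the server scrapes it on its own schedule. The sketch below builds both strings. The measurement, metric, and label names are made up for illustration, and escaping is simplified (float fields only, no escaping of measurement names or label values).

```python
from typing import Mapping


def _escape_tag(text: str) -> str:
    """Escape the characters that line protocol treats specially in tags and field keys."""
    for ch in (",", "=", " "):
        text = text.replace(ch, "\\" + ch)
    return text


def to_line_protocol(
    measurement: str,
    tags: Mapping[str, str],
    fields: Mapping[str, float],
    timestamp_ns: int,
) -> str:
    """Format one point as InfluxDB line protocol (push: the client sends this)."""
    tag_part = "".join(
        f",{_escape_tag(key)}={_escape_tag(value)}" for key, value in sorted(tags.items())
    )
    field_part = ",".join(
        f"{_escape_tag(key)}={float(value)}" for key, value in fields.items()
    )
    return f"{measurement}{tag_part} {field_part} {timestamp_ns}"


def to_prometheus_exposition(metric: str, labels: Mapping[str, str], value: float) -> str:
    """Format one sample in the Prometheus text format (pull: the server scrapes this)."""
    label_part = ",".join(f'{key}="{val}"' for key, val in sorted(labels.items()))
    return f"{metric}{{{label_part}}} {float(value)}"


if __name__ == "__main__":
    print(to_line_protocol("waterlevel", {"station": "Bonn"}, {"level_cm": 234.0}, 1670000000000000000))
    print(to_prometheus_exposition("rhine_water_level_cm", {"station": "Bonn"}, 234.0))
```

The line-protocol record carries its own timestamp, which suits scraped measurements with a known measurement time. The Prometheus line normally does not, because the scrape time is used instead.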

IDEAS

  • Grafana credentials are stored in plaintext inside prometheus.yaml. How can values from an env file be imported into the YAML file?
  • Deploy on the Chromebox and monitor the Chromebox's resources
  • Alternatively, use the Selenium standalone images: https://github.com/SeleniumHQ/docker-selenium
  • Add env variables & conditionals for logging levels
  • Check chromedriver compatibility
  • Find the best practice for loading .env variables into a Docker image
  • ngrok reverse proxy with the Cloudflare API (difficulties with the implementation in .js)
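
The idea of driving logging levels from environment variables could be sketched as follows. The variable name `LOG_LEVEL`, the logger name `scrapcrab`, and the format string are assumptions for illustration, not existing project settings.

```python
import logging
import os
from typing import Mapping, Optional


def configure_logging(env: Optional[Mapping[str, str]] = None) -> logging.Logger:
    """Configure a logger whose level comes from LOG_LEVEL, falling back to INFO."""
    env = os.environ if env is None else env
    level_name = env.get("LOG_LEVEL", "INFO").upper()
    level = getattr(logging, level_name, None)
    if not isinstance(level, int):
        # Unknown names (or names of non-level attributes) fall back to INFO.
        level = logging.INFO

    logger = logging.getLogger("scrapcrab")
    logger.setLevel(level)
    if not logger.handlers:  # avoid attaching duplicate handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
        )
        logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    log = configure_logging()
    log.info("scraper started")
    log.debug("only visible when LOG_LEVEL=DEBUG")
```

In a docker-compose setup the variable could be set per service in the `environment:` section, so the log level can change without rebuilding the image.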

scrapcrab's People

Contributors

thomasjon196
