URL Fishing - Team 14

About The Project

This is a project for the course Release Engineering for ML Applications (CS4295) at Delft University of Technology.

Getting Started

Installation

  1. Clone the repository
git clone https://github.com/nadinekuo/URL-Fishing-CS4295.git
  2. Set up (and activate) your environment
# using pip
pip install -r requirements.txt

# using Conda
conda create --name <env_name> --file requirements.txt

How To: Dependency Management with Poetry

In your virtual environment, run pip install pipx followed by python -m pipx install poetry. Then, run python -m pipx ensurepath and restart your terminal/IDE.

After reopening your terminal or IDE, you should be able to run poetry. Test this by running the command poetry without arguments, which should list the available commands. Because Poetry was installed through pipx, keep it up to date by running python -m pipx upgrade poetry.

To install the defined dependencies for this project, run:

poetry install

How To: Run DVC Pipeline

To run the DVC pipeline (as configured in dvc.yaml), first make sure DVC is installed in your working environment.

Then run the command dvc repro.

Check the DVC documentation for further details and additional possibilities.
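
The two steps above (check that DVC is available, then reproduce the pipeline) could be wrapped in a small Python helper, for instance when the pipeline is triggered from a script. This is only a sketch: the function names are hypothetical and not part of this repository, and the only DVC command it relies on is dvc repro.

```python
import shutil
import subprocess


def build_repro_command(executable="dvc"):
    """Return the argument list for `dvc repro` (hypothetical helper)."""
    return [executable, "repro"]


def run_pipeline(executable="dvc"):
    """Run `dvc repro`, failing early if the executable is not on PATH."""
    if shutil.which(executable) is None:
        raise RuntimeError(f"{executable!r} was not found on PATH; install DVC first")
    # check=True raises CalledProcessError if a pipeline stage fails.
    return subprocess.run(build_repro_command(executable), check=True)
```

Calling run_pipeline() from the repository root should behave like typing dvc repro in the terminal, with a clearer error when DVC is missing.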

How To: DVC Remotes

To download the data from the remote storage, you need an AWS access key ID and an AWS secret access key. Add them to a local .env file in the following format:

AWS_ACCESS_KEY_ID=<aws_access_key_id>
AWS_SECRET_ACCESS_KEY=<aws_secret_access_key>
AWS_BUCKET_NAME=<bucket_name>
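
To illustrate how a file in this format can be consumed, here is a minimal, standard-library-only sketch that reads the KEY=VALUE lines and reports which of the three keys are missing. The helper names are hypothetical, and the project itself may load the file differently (for example through a dedicated library).

```python
from pathlib import Path

# The three keys listed in the .env format above.
REQUIRED_KEYS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_BUCKET_NAME")


def load_env_file(path=".env"):
    """Parse KEY=VALUE lines from a .env file into a dict (hypothetical helper)."""
    values = {}
    for raw_line in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without a key/value separator.
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first "=" only, so values may contain "=" themselves.
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def missing_keys(values):
    """Return the required keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not values.get(key)]
```

A quick check such as missing_keys(load_env_file()) before pulling data can catch an incomplete .env file early. Keep the .env file out of version control, since it contains secrets.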

If you want to set up a remote, run:

dvc remote add -d <myremote-name> s3://<bucket>/<key>

Next, store your credentials in the local DVC configuration, which is not tracked by Git:

dvc remote modify --local <myremote-name> access_key_id '<aws_access_key_id>'
dvc remote modify --local <myremote-name> secret_access_key '<aws_secret_access_key>'

You can push artefacts to the remote by running dvc push. Similarly, pulling from the remote can be performed by running dvc pull.

How To: DVC Experiment Management

In this project, DVC is also used to report metrics and keep track of different experiments/models.

Run an experiment with dvc exp run. To compare the resulting metrics against the last commit, run dvc metrics diff.

After any change to the project, you can run a new experiment and inspect the experiment log with dvc exp show.

All metrics are written to an output file named metrics.json.
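
As an illustration of what producing metrics.json can look like, the sketch below computes a few binary-classification metrics in plain Python and writes them as JSON. The metric names and both function names are assumptions made for this example; the metrics actually computed by the training script may differ.

```python
import json


def compute_metrics(y_true, y_pred):
    """Compute accuracy, precision and recall for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    true_pos = sum(1 for truth, pred in pairs if truth == 1 and pred == 1)
    false_pos = sum(1 for truth, pred in pairs if truth == 0 and pred == 1)
    false_neg = sum(1 for truth, pred in pairs if truth == 1 and pred == 0)
    correct = sum(1 for truth, pred in pairs if truth == pred)
    return {
        "accuracy": correct / len(pairs) if pairs else 0.0,
        "precision": true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0,
        "recall": true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0,
    }


def write_metrics(metrics, path="metrics.json"):
    """Write the metrics dict as JSON so that DVC can track and diff it."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(metrics, handle, indent=2)
```

Because the file is a flat JSON object of numbers, dvc metrics diff can compare each entry between runs, provided the file is declared as a metrics output of the corresponding stage in dvc.yaml.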

Code Quality

This project uses the following linters to display code quality information:

  • Pylint
  • Flake8

NOTE: the code is rated 10.00/10 by Pylint and passes Flake8 without reported violations.

Pylint

To run Pylint on a specific file, use pylint src/<file_name>; to analyse the full directory, use pylint src/. The output should look like this:

--------------------------------------------------------------------
Your code has been rated at 10.00/10 (previous run: 10.00/10, +0.00)

In Pylint's configuration file pylintrc, we have thoroughly analysed linter rules and made the following modifications to adapt it to this specific ML project:

  • We allow the following commonly used variable names in ML: X_train, Y_train, X_test and Y_test
  • To discourage non-informative variable names, we defined a set of bad names: bad-names=foo, baz, toto, tutu, tata, x, y, z, var, vars
  • We extend the list of exceptions that will emit a warning with ArithmeticError, BufferError and LookupError - especially common in ML projects
  • We ignore files that are either auto-generated or do not contain Python code: ignore=CVS, .git, __pycache__, build, dist, .gitignore, requirements.txt, config.json
  • We only show warnings with high confidence levels and those that lead to inference errors (confidence=HIGH, INFERENCE_FAILURE)
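
To make the naming and exception rules above concrete, here is a small sketch of code written in the style they encourage. Both functions are hypothetical and do not come from this repository; whether Pylint reports anything for them depends on the exact contents of pylintrc.

```python
def split_features(rows, labels, test_fraction=0.25):
    """Split parallel lists into train and test parts, without shuffling."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    cutoff = int(len(rows) * (1 - test_fraction))
    # X_train, Y_train, X_test and Y_test are on the allowed-names list,
    # so the capital letter should not trigger a naming warning.
    X_train, X_test = rows[:cutoff], rows[cutoff:]
    Y_train, Y_test = labels[:cutoff], labels[cutoff:]
    return X_train, X_test, Y_train, Y_test


def positive_rate(labels):
    """Return the share of positive labels, or 0.0 for an empty list."""
    try:
        return sum(labels) / len(labels)
    # Catching ZeroDivisionError rather than its parent ArithmeticError
    # keeps the handler narrow, in line with the extended exception rule.
    except ZeroDivisionError:
        return 0.0
```

By contrast, naming the inputs x and y, or writing except ArithmeticError, would run against the bad-names list and the extended exception list described above.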

Flake8

To analyse our Python code using Flake8, we run flake8 --max-line-length 100. This sets the maximum allowed line length to 100 (in line with Pylint) instead of the Flake8 default of 79.

Contributors

  • Dani Rogmans
  • Justin Luu
  • Yang Li
  • Nadine Kuo

url-fishing-cs4295's Issues

Audit code quality using linters

Definition of done:

  • Configure Pylint properly into the project, following ML conventions
  • Improve code quality to reach perfect Pylint score (function docs, variable names etc.)
  • Integrate multiple linters, implement different ways to display code quality information
  • Critically analyse linter rules, propose new missing ML rules

Report metrics using DVC

Definition of done:

  • Modify the training script such that different metrics are computed (e.g. to a JSON file), beyond model correctness
  • Extend the "train" stage in the dvc.yaml file with metrics - ensure that dvc exp run and dvc metrics diff can run the ML experiment
  • Ensure all metrics are generated to an output file

Configure all relevant AI pipeline stages using DVC

Definition of done:

  • Include all pipeline stages in dvc.yaml e.g. "preprocess" and "train" with well-defined dependencies and output
  • Ensure dvc repro works without errors
  • Ensure DVC can push to and pull from the remote data storage (AWS)
  • Update README with clear steps on how to run the pipeline (using DVC)
  • Update README with clear steps on how to set up remotes (using DVC)
