dstoolkit's Introduction

DSToolkit - utilities for better analytics projects

A library of tools that I use to manage files, clean datasets and do exploratory data analysis

General Info

This library is a set of tools for managing files, cleaning data and doing exploratory data analysis.

This all started because I found myself creating lots and lots of versions of data files in various states of completeness. I would scrape some data, write it to a file (in the data/raw folder), then work on it some and save it to the data/processed folder. After a few iterations, I couldn't remember whether data/raw/scraped_page1.csv or data/raw/scraped_page101.csv was the latest. So I started naming the files with a timestamp suffix, such as scraped_page_01011850.csv (for a file created on Jan 1 at 6:50pm). That meant I needed a utility to create the timestamps and then get the latest version of a file. I copied this code so often that I decided to use it as a way to learn about creating real Python projects, GitHub hooks, Visual Studio Code, Docker containers and more.
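The timestamp naming scheme described above can be sketched in a few lines. This is only an illustration of the idea, not the library's actual implementation: `timestamped_name` and `latest_file` are hypothetical helpers, and the `MMDDHHMM` format is inferred from the `scraped_page_01011850.csv` example.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional


def timestamped_name(stem: str, suffix: str = ".csv",
                     now: Optional[datetime] = None) -> str:
    """Hypothetical helper: build '<stem>_<MMDDHHMM><suffix>'."""
    now = now or datetime.now()
    return f"{stem}_{now.strftime('%m%d%H%M')}{suffix}"


def latest_file(folder: Path, stem: str) -> Optional[Path]:
    """Hypothetical helper: return the '<stem>_*' file whose name sorts last.

    Because the timestamp is zero-padded, sorting by name is the same as
    sorting by time, as long as all files come from the same year.
    """
    candidates = sorted(Path(folder).glob(f"{stem}_*"), key=lambda p: p.name)
    return candidates[-1] if candidates else None
```

Since the suffix carries no year, name ordering breaks across a year boundary; a format such as `%Y%m%d%H%M` would avoid that if it matters for your data.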

Technologies

Usage

pip install -U mlderes.dstoolkit

In your module:

from mlderes.dstoolkit import get_latest_data_filename, DataFolder, make_ts_filename, write_data

data_folder = DataFolder('./data') # root data folder
DATA_RAW = data_folder.RAW
DATA_EXTERNAL = data_folder/'external'

# Get the path of the latest file matching foo* in the ./data/raw directory
fp = get_latest_data_filename(DATA_RAW, 'foo')

Contributions

This project was developed using Visual Studio Code and leverages the platform's support for developing in containers. If you have Docker Desktop installed, you should be able to fork this repo, clone a copy locally and open the folder in a container. All the dependencies are there: nothing to install, no need to worry about specific versions of libraries or creating venvs on your machine. Heck, you don't even need Python installed!

Contributions to documentation, utilities and issues are welcome. All pull requests must include unit tests, and all existing tests must pass before a pull request will be considered.

Todo

  • Generate documentation as part of the build
  • Add more samples to documentation

License

This work is licensed under the GPL, which guarantees end users the freedom to study, share, and modify the software.

dstoolkit's People

Contributors

mlderes

dstoolkit's Issues

Adjustments to Makefile

As of now, the Makefile has some targets, but it is unclear which of them are actually useful. Sorting this out will likely have to go together with a more deliberate approach to the build/release process.

Release/build process

The build/release process leaves room for improvement. This is going to require a top-down approach to determine what the build and release cycle will look like. Here are a few scenarios that need to be considered:

  • Adding a new feature or fixing a bug. Does this mean a new branch, and then what? Does every change require a pull request? If so, how do we handle the pull requests: is that where the tag gets applied and the package gets built and released, or should the build, test deployment and packages be produced before the pull request, so that the pull request only needs to update the label and push the build to PyPI?
