edapy's Introduction

edapy is a first resource for analyzing a new dataset.

Installation

$ pip install git+https://github.com/MartinThoma/edapy.git

For the PDF commands, you also need pdftotext, which is part of poppler-utils:

$ sudo apt-get install poppler-utils

Usage

$ edapy --help
Usage: edapy [OPTIONS] COMMAND [ARGS]...

  edapy is a tool for exploratory data analysis with Python.

  You can use it to get a first idea what a CSV is about or to get an
  overview over a directory of PDF files.

Options:
  --version  Show the version and exit.
  --help     Show this message and exit.

Commands:
  csv     Analyze CSV files.
  images  Analyze image files.
  pdf     Analyze PDF files.

The workflow is as follows:

  • edapy pdf find --path . --output results.csv creates a results.csv for you. This file contains metadata about all PDF files in the directory given by --path.
  • edapy csv predict --csv_path my-new.csv --types types.yaml starts or resumes an interactive process that leads the user through a series of questions. In those questions, the user decides which delimiter and quotechar the CSV uses and which types the columns have.
  • edapy generates a types.yaml file, which can be used to load the CSV in other applications with df = edapy.load_csv(csv_path, yaml_path).

Example types.yaml

For the Titanic Dataset, the resulting types.yaml looks as follows:

columns:
- dtype: other
  name: Name
- dtype: int
  name: Parch
- dtype: float
  name: Age
- dtype: other
  name: Ticket
- dtype: float
  name: Fare
- dtype: int
  name: PassengerId
- dtype: other
  name: Cabin
- dtype: other
  name: Embarked
- dtype: int
  name: Pclass
- dtype: int
  name: Survived
- dtype: other
  name: Sex
- dtype: int
  name: SibSp
csv_meta:
  delimiter: ','
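
As a rough illustration of how a file like this can drive typed loading, the sketch below casts CSV columns according to a dict that mirrors the YAML structure above. It is not edapy's own load_csv: the function name load_typed_rows, the CASTS mapping, and the two sample rows are made up for this example, and the YAML is assumed to be parsed into a dict already.

```python
import csv
import io

# Structure mirroring the types.yaml above, assumed to be already parsed
# into a dict (for example with a YAML library). Only three columns are shown.
TYPES = {
    "columns": [
        {"dtype": "int", "name": "PassengerId"},
        {"dtype": "float", "name": "Age"},
        {"dtype": "other", "name": "Name"},
    ],
    "csv_meta": {"delimiter": ","},
}

# Hypothetical mapping from dtype labels to converters; "other" stays a string.
CASTS = {"int": int, "float": float, "other": str}


def load_typed_rows(csv_text, types):
    """Parse CSV text and cast each listed column according to its dtype."""
    casts = {col["name"]: CASTS[col["dtype"]] for col in types["columns"]}
    delimiter = types["csv_meta"]["delimiter"]
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=delimiter)
    rows = []
    for raw in reader:
        row = {}
        for name, cast in casts.items():
            value = raw[name]
            # Empty cells become None instead of raising in int()/float().
            row[name] = cast(value) if value != "" else None
        rows.append(row)
    return rows


# Made-up sample data, not taken from the Titanic dataset.
SAMPLE = "PassengerId,Age,Name\n1,22,Alice\n2,,Bob\n"
rows = load_typed_rows(SAMPLE, TYPES)
print(rows)
```

Keeping the delimiter and the column types in one file means every consumer of the CSV parses it the same way, which is the role types.yaml plays here.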

A sample run would then look like this:

$ edapy csv predict --types types_titanik.yaml --csv_path train.csv
Number of datapoints: 891
2018-04-16 21:51:56,279 WARNING Column 'Survived' has only 2 different values ([0, 1]). You might want to make it a 'category'
2018-04-16 21:51:56,280 WARNING Column 'Pclass' has only 3 different values ([3, 1, 2]). You might want to make it a 'category'
2018-04-16 21:51:56,281 WARNING Column 'Sex' has only 2 different values (['male', 'female']). You might want to make it a 'category'
2018-04-16 21:51:56,282 WARNING Column 'SibSp' has only 7 different values ([0, 1, 2, 4, 3, 8, 5]). You might want to make it a 'category'
2018-04-16 21:51:56,283 WARNING Column 'Parch' has only 7 different values ([0, 1, 2, 5, 3, 4, 6]). You might want to make it a 'category'
2018-04-16 21:51:56,285 WARNING Column 'Embarked' has only 3 different values (['S', 'C', 'Q']). You might want to make it a 'category'

## Integer Columns
Column name: Non-nan    mean     std   min   25%   50%   75%   max
PassengerId:     891  446.00  257.35     1   224   446   668   891
Survived   :     891    0.38    0.49     0     0     0     1     1
Pclass     :     891    2.31    0.84     1     2     3     3     3
SibSp      :     891    0.52    1.10     0     0     0     1     8
Parch      :     891    0.38    0.81     0     0     0     0     6

## Float Columns
Column name: Non-nan   mean    std    min    25%    50%    75%     max
Age        :     714  29.70  14.53   0.42  20.12  28.00  38.00   80.00
Fare       :     891  32.20  49.69   0.00   7.91  14.45  31.00  512.33

## Other Columns
Column name: Non-nan   unique   top (count)
Name       :     891      891   Goldschmidt, Mr. George B (1)
Sex        :     891        2   male (577)
Ticket     :     891      681   347082 (7)
Cabin      :     204      148   C23 C25 C27 (4)
Embarked   :     889        4   S (644)
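
The "Other Columns" summary and the 'category' warnings in the run above can be approximated with a few lines of standard-library Python. This is a sketch under assumptions, not edapy's implementation: the function names, the max_unique threshold of 10, and the choice to ignore missing values when counting unique entries are all hypothetical, and the sample values are made up.

```python
from collections import Counter


def describe_other(values):
    """Return non-missing count, unique count, and the most frequent value."""
    present = [v for v in values if v is not None]
    counts = Counter(present)
    top, top_count = counts.most_common(1)[0] if counts else (None, 0)
    return {
        "non_nan": len(present),
        "unique": len(counts),
        "top": top,
        "top_count": top_count,
    }


def suggest_category(values, max_unique=10):
    """Flag a column whose non-missing values take only a few distinct values.

    The threshold is an assumption for this sketch, not edapy's setting.
    """
    present = {v for v in values if v is not None}
    return 0 < len(present) <= max_unique


# Made-up sample columns; None stands in for a missing cell.
embarked = ["S", "C", "S", None, "Q", "S"]
names = [f"passenger-{i}" for i in range(11)]

summary = describe_other(embarked)
print(summary)
print(suggest_category(embarked), suggest_category(names))
```

A column like the sample embarked list is flagged, while a column with a distinct value in every row is not.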


edapy's Issues

ValueError: invalid literal for int() with base 10: b'n'

Traceback (most recent call last):
[...]
  File "/home/math/.pyenv/versions/3.8.4/lib/python3.8/site-packages/edapy/pdf.py", line 55, in find
    data.append(get_pdf_info(pdf_path))
  File "/home/math/.pyenv/versions/3.8.4/lib/python3.8/site-packages/edapy/pdf.py", line 87, in get_pdf_info
    pdf_toread = PdfFileReader(fp, strict=False)
  File "/home/math/.pyenv/versions/3.8.4/lib/python3.8/site-packages/PyPDF2/pdf.py", line 1084, in __init__
    self.read(stream)
  File "/home/math/.pyenv/versions/3.8.4/lib/python3.8/site-packages/PyPDF2/pdf.py", line 1803, in read
    idnum, generation = self.readObjectHeader(stream)
  File "/home/math/.pyenv/versions/3.8.4/lib/python3.8/site-packages/PyPDF2/pdf.py", line 1667, in readObjectHeader
    return int(idnum), int(generation)
ValueError: invalid literal for int() with base 10: b'n'
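
The traceback shows a single unreadable PDF aborting the whole scan, because the exception propagates out of the loop that appends get_pdf_info(pdf_path). One possible way to keep the scan going is to catch the error per file and record the failure. The sketch below is only an illustration, not the project's fix: collect_pdf_info and fake_get_info are hypothetical names, and the real reader is replaced by a stand-in that raises the same ValueError.

```python
import logging

logger = logging.getLogger(__name__)


def collect_pdf_info(pdf_paths, get_info):
    """Call get_info on every path, skipping files that raise ValueError."""
    data, failed = [], []
    for path in pdf_paths:
        try:
            data.append(get_info(path))
        except ValueError as exc:
            # Only ValueError is handled because that is what the traceback
            # shows; other exception types would still propagate.
            logger.warning("Skipping %s: %s", path, exc)
            failed.append(path)
    return data, failed


def fake_get_info(path):
    """Stand-in for a real metadata reader (hypothetical, for the demo only)."""
    if path.endswith("broken.pdf"):
        raise ValueError("invalid literal for int() with base 10: b'n'")
    return {"path": path}


data, failed = collect_pdf_info(["a.pdf", "broken.pdf", "b.pdf"], fake_get_info)
print(data)
print(failed)
```

Returning the failed paths separately lets the caller decide whether to report them, retry them, or write them to the output with empty metadata.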

Use of mutation testing in edapy - Help needed

Hello there!

My name is Ana. I noted that you use the mutation testing tool in the project.
I am a postdoctoral researcher at the University of Seville (Spain), and my colleagues and I are studying how mutation testing tools are used in practice. With this aim in mind, we have analysed over 3,500 public GitHub repositories using mutation testing tools, including yours! This work has recently been published in a journal paper available at https://link.springer.com/content/pdf/10.1007/s10664-022-10177-8.pdf.

To complete this study, we are asking for your help to better understand how mutation testing is used in practice. We would be extremely grateful if you could contribute by answering a brief survey of 21 simple questions (no more than 6 minutes). The questionnaire is available at https://forms.gle/FvXNrimWAsJYC1zB9.

Drop me an e-mail if you have any questions or comments. Thank you very much in advance!
