is-classify's People

Contributors

am-cid, chaaals, marcusandrev, otentackles

is-classify's Issues

Write cleaned data from dictionary into file

Input

dict[str, set[str]]
where each key (str) is the category of the data and each value (set) holds the data for that category

format of .txt file

category_1
     data_1
     data_2
     ...
     data_n

category_2
     data_1
     data_2
     ...
     data_n

...

category_n
     data_1
     data_2
     ...
     data_n


sample input

print_to_txt = {
     "name": {"Yamada, Aiko D.", "Nakamura, Yumi R.", "Tanaka, Takeshi T.", "Kobayashi, Mika Y."},
     "birthday": {"1995-07-15", "1988-12-04", "2001-03-22", "1990-09-30"},
     "email": {"[email protected]", "[email protected]", "[email protected]", "[email protected]"},
     "cell no.": {"+63-9876543210", "+63-9012345678", "+63-8765432109", "+63-7654321098"}
}

sample output to text file

Name
     Yamada, Aiko D.
     Nakamura, Yumi R.
     Tanaka, Takeshi T.
     Kobayashi, Mika Y.

Birthday
     1995-07-15
     1988-12-04
     2001-03-22
     1990-09-30

Email
     [email protected]
     [email protected]
     [email protected]
     [email protected]

Cellphone No.
     +63-9876543210
     +63-9012345678
     +63-8765432109
     +63-7654321098
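
A minimal sketch of the writer described above, not a definitive implementation. The names `HEADINGS`, `format_categories`, and `write_categories` are hypothetical. The heading map is an assumption inferred from the sample output (for example "cell no." becoming "Cellphone No."), and values are sorted only because sets have no stable order, which the issue does not specify.

```python
from pathlib import Path

# Hypothetical mapping from dict keys to the headings in the sample output.
HEADINGS = {
    "name": "Name",
    "birthday": "Birthday",
    "email": "Email",
    "cell no.": "Cellphone No.",
}

INDENT = " " * 5  # the sample output indents each value by five spaces


def format_categories(data: dict[str, set[str]]) -> str:
    """Render each category heading followed by its indented values."""
    blocks = []
    for category, values in data.items():
        heading = HEADINGS.get(category, category)
        # Sets are unordered, so sort for a reproducible file (an assumption).
        lines = [heading] + [f"{INDENT}{value}" for value in sorted(values)]
        blocks.append("\n".join(lines))
    # Categories are separated by one blank line.
    return "\n\n".join(blocks) + "\n"


def write_categories(data: dict[str, set[str]], path: str) -> None:
    """Write the formatted categories to a text file."""
    Path(path).write_text(format_categories(data), encoding="utf-8")
```

Calling `write_categories(print_to_txt, "cleaned.txt")` (the file name is only an example) would produce the layout shown in the sample output, with the values of each category in sorted order.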

categorize data from set of raw data

preprocess

Input: list of sets containing data

  1. Check for redundancy between the sets
    • put each redundant value into a dict { "value" : count } and increment its count every time it is repeated
  2. After checking for redundancy, combine all sets in the list into one big set
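
The two preprocessing steps can be sketched as follows. The function name `preprocess` is hypothetical, and the sketch assumes that "count" means the number of input sets a value appears in; the issue leaves that open.

```python
from collections import Counter


def preprocess(raw_sets: list[set[str]]) -> tuple[dict[str, int], set[str]]:
    """Return (redundant value counts, union of all input sets)."""
    # Count how many of the input sets contain each value.
    counts = Counter(value for raw in raw_sets for value in raw)
    # Keep only values that occur in more than one set.
    redundant = {value: n for value, n in counts.items() if n > 1}
    # Merge everything into one big set.
    combined = set().union(*raw_sets)
    return redundant, combined
```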

process (if else statements in order)

iterate through values in set

cell number

format generated by faker instance: +63-xxxxxxxxxx
first part is country code (+63) and second part is the number with length of 10
check that the first part is +63 and that the second part is exactly 10 digits long
note: use regex
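
A possible regex for this check, as a sketch. `CELL_NO_PATTERN` and `is_cell_no` are hypothetical names, and the pattern assumes the second part consists of ASCII digits only.

```python
import re

# "+63", a hyphen, then exactly ten ASCII digits.
CELL_NO_PATTERN = re.compile(r"\+63-[0-9]{10}")


def is_cell_no(value: str) -> bool:
    """True if the whole string matches the +63-xxxxxxxxxx format."""
    return CELL_NO_PATTERN.fullmatch(value) is not None
```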

date of birth

format generated by faker instance: yyyy-mm-dd
can either follow the format above strictly or also accept other ways of writing dates
note: use datetime either way
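
A sketch of the strict variant, assuming only yyyy-mm-dd is accepted. `is_birthday` is a hypothetical name. The extra shape check is there because `datetime.strptime` also accepts months and days without zero padding, such as 1995-7-15.

```python
import re
from datetime import datetime

# Four-digit year, two-digit month, two-digit day.
DATE_SHAPE = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")


def is_birthday(value: str) -> bool:
    """True if value is a real calendar date written as yyyy-mm-dd."""
    if DATE_SHAPE.fullmatch(value) is None:
        return False
    try:
        # strptime rejects impossible dates such as 2001-02-30.
        datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        return False
    return True
```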

email

format generated by faker instance: [email protected]
check that the first part has at least 1 character, then that it is followed, in order, by "@", at least 1 character, ".", and at least two characters
note: use regex
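
The checks listed above map onto a single regex roughly like this. It is a sketch of those checks only, not a full email validator. `EMAIL_PATTERN` and `is_email` are hypothetical names, and excluding whitespace and extra "@" signs is an added assumption.

```python
import re

# 1+ characters, "@", 1+ characters, ".", then 2+ characters.
# Whitespace and additional "@" signs are excluded (an assumption).
EMAIL_PATTERN = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s.]{2,}")


def is_email(value: str) -> bool:
    """True if the whole string has the local@domain.tld shape."""
    return EMAIL_PATTERN.fullmatch(value) is not None
```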

name

format generated by faker instance: first_name middle_name last_name

  1. split into parts with whitespace as separator
  2. if 3 parts, check that each part has at least two characters and passes isalpha()
  3. if 4 parts or more, put Jr., III, and similar into suffixes; put Dr., Mr., Mrs., and similar into titles
    • default value of suffixes and titles should be an empty string
  4. if less than 3 parts, don't include it in the cleaned dataset
  5. reformat to "title last_name, first_name middle_initial. suffix"

note: use string methods

process into dict

  1. create dict with keys: name, email, birthday, cell_no
    • the value for each key will be the collection of names, emails, birthdays, and cell numbers respectively
  2. if raw data from the set passes validation, check whether a duplicate already exists in the appropriate collection in the dict
  3. if none, put it into the dict; otherwise raise a warning (put into a set of warnings to be printed out later???)
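
One way to sketch these steps, assuming each dict value is a set and that warnings are collected in a set to be printed later, which the issue only raises as an open question. `new_cleaned` and `add_validated` are hypothetical names.

```python
def new_cleaned() -> dict[str, set[str]]:
    """Create the empty result dict with the four expected keys."""
    return {"name": set(), "email": set(), "birthday": set(), "cell_no": set()}


def add_validated(
    cleaned: dict[str, set[str]],
    warnings: set[str],
    category: str,
    value: str,
) -> bool:
    """Add an already validated value; record a warning on a duplicate."""
    if value in cleaned[category]:
        warnings.add(f"duplicate {category}: {value}")
        return False
    cleaned[category].add(value)
    return True
```

Because the combined input is already a set, duplicates at this stage would mostly come from values that only become equal after reformatting, such as two raw names that normalize to the same string.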
