Project Description: In the DataPrepKit capstone project, students develop a Python package named "DataPrepKit," a comprehensive toolkit for preprocessing datasets. Using their knowledge of NumPy and Pandas, students create functions that read data from a variety of file formats, summarize datasets, handle missing values, and encode categorical data. The final goal is to publish the DataPrepKit package on PyPI, making it available to the wider Python community.
Data Reading:
- Objective: Implement functions that can read data from different file formats such as CSV, Excel, and JSON.
- Tools: Use Pandas for efficient data importing.
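One way to approach the reading objective is a single entry point that picks a Pandas reader from the file extension. The sketch below is only an illustration: the function name `read_data` and the `_READERS` mapping are hypothetical, not part of the project specification, and reading Excel files assumes an Excel engine such as openpyxl is installed.

```python
from pathlib import Path

import pandas as pd

# Hypothetical mapping from file extension to the Pandas reader that handles it.
_READERS = {
    ".csv": pd.read_csv,
    ".xlsx": pd.read_excel,
    ".xls": pd.read_excel,
    ".json": pd.read_json,
}


def read_data(path, **kwargs):
    """Read a CSV, Excel, or JSON file into a DataFrame, chosen by extension.

    Extra keyword arguments are passed through to the underlying Pandas reader.
    """
    suffix = Path(path).suffix.lower()
    try:
        reader = _READERS[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file format: {suffix!r}") from None
    return reader(path, **kwargs)
```

A call such as `read_data("sales.csv", sep=";")` would forward `sep` to `pd.read_csv`; the file name here is only a placeholder.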
Data Summary:
- Objective: Develop functions that print key statistical summaries of the data, such as the mean and the most frequent value (mode) of each column.
- Tools: Utilize NumPy and Pandas to generate these summaries.
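A summary function could, for example, collect the mean and most frequent value per column into a small table that is then printed. This is a minimal sketch under assumptions: the name `summarize`, the output column names, and the extra missing-value count are illustrative choices, not requirements of the project.

```python
import numpy as np
import pandas as pd


def summarize(df):
    """Return one row per column: mean (numeric columns only), mode, missing count."""
    rows = {}
    for name in df.columns:
        col = df[name]
        modes = col.mode(dropna=True)  # may hold several values when there is a tie
        rows[name] = {
            # The mean is only defined for numeric columns; others get NaN.
            "mean": col.mean() if pd.api.types.is_numeric_dtype(col) else np.nan,
            # On a tie, take the first mode that Pandas returns.
            "most_frequent": modes.iloc[0] if not modes.empty else np.nan,
            "missing": int(col.isna().sum()),
        }
    return pd.DataFrame.from_dict(rows, orient="index")
```

Calling `print(summarize(df))` then satisfies the "print" part of the objective while keeping the function itself easy to test, since it returns a DataFrame instead of writing to the console.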
Handling Missing Values:
- Objective: Create functions that handle missing values by either removing them or imputing them according to a chosen strategy.
- Tools: Use Pandas methods such as dropna and fillna, and return a new DataFrame rather than modifying the input so that the original data stays intact.
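The remove-or-impute choice can be sketched as one function with a `strategy` argument. Everything named here is an assumption for illustration: the function name `handle_missing`, the strategy names, and the decision to fall back to the mode for non-numeric columns when "mean" or "median" is requested.

```python
import pandas as pd


def handle_missing(df, strategy="drop"):
    """Return a new DataFrame with missing values dropped or imputed.

    strategy: "drop" removes rows containing any missing value;
    "mean", "median", or "mode" fill missing values column by column.
    """
    if strategy == "drop":
        return df.dropna()
    if strategy not in {"mean", "median", "mode"}:
        raise ValueError(f"Unknown strategy: {strategy!r}")

    result = df.copy()  # leave the caller's DataFrame untouched
    for name in result.columns:
        col = result[name]
        if not col.isna().any():
            continue
        # Assumption: non-numeric columns are always imputed with the mode.
        if strategy == "mode" or not pd.api.types.is_numeric_dtype(col):
            modes = col.mode(dropna=True)
            if modes.empty:
                continue  # column is entirely missing; nothing to impute from
            fill = modes.iloc[0]
        elif strategy == "mean":
            fill = col.mean()
        else:
            fill = col.median()
        result[name] = col.fillna(fill)
    return result
```

Returning a copy in every branch means the original data can always be compared against the cleaned result, which is useful when writing tests for the package.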
Categorical Data Encoding:
- Objective: Design functions that encode categorical data, converting it into numerical form for analysis.
- Tools: Use Pandas to implement encoding techniques such as label (integer) encoding and one-hot encoding.
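Two common encodings, label encoding and one-hot encoding, can both be built on Pandas alone. The sketch below is illustrative: the function name `encode_categorical` and the `method` values are hypothetical, and the caller is assumed to pass the categorical column names explicitly.

```python
import pandas as pd


def encode_categorical(df, columns, method="label"):
    """Return a new DataFrame with the given columns encoded as numbers.

    method="label" replaces each category with an integer code;
    method="onehot" replaces each column with one 0/1 column per category.
    """
    columns = list(columns)
    if method == "onehot":
        # get_dummies returns a new DataFrame; dtype=int yields 0/1 instead of booleans.
        return pd.get_dummies(df, columns=columns, dtype=int)
    if method == "label":
        result = df.copy()
        for name in columns:
            # sort=True assigns codes in sorted category order; missing values become -1.
            codes, _ = pd.factorize(result[name], sort=True)
            result[name] = codes
        return result
    raise ValueError(f"Unknown method: {method!r}")
```

Label encoding keeps the column count unchanged but implies an ordering between categories, while one-hot encoding avoids that at the cost of adding one column per category; which one fits depends on the analysis that follows.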
Package Deployment:
- Objective: Publish the DataPrepKit package on PyPI so that others can download and use it.
- Tools: Adhere to PyPI guidelines for package deployment.
Project Requirements:
- Proficient use of NumPy and Pandas for data analysis and manipulation.
- Robust function implementation for data reading, summary generation, missing value handling, and categorical data encoding.
- Successful registration and deployment of the package on PyPI.
Evaluation Criteria:
- Functionality and correctness of the data preprocessing features implemented.
- Quality and completeness of the documentation provided.
- Effectiveness of the test suite in ensuring the package's reliability.
- Successful deployment of the package on PyPI.
- Adherence to best practices in coding, packaging, and testing.
- Creativity and efficiency in managing different file formats and data preprocessing challenges.
Datasets:
- Project (coffee shop sales): https://www.kaggle.com/datasets/divu2001/coffee-shop-sales-analysis
- netflix_titles: https://www.kaggle.com/datasets/arnavvvvv/netflix-movies-and-tv-shows
Kaggle CLI download commands:
- Project (coffee shop sales): kaggle datasets download divu2001/coffee-shop-sales-analysis
- netflix_titles: kaggle datasets download arnavvvvv/netflix-movies-and-tv-shows