ubc-mds / 2023-dsci522-group22

UBC MDS DSCI-522 Group Project

Home Page: https://ubc-mds.github.io/2023-DSCI522-Group22/wine_color_classification_report.html

License: Other

Jupyter Notebook 15.44% Dockerfile 0.13% Python 15.67% HTML 27.72% JavaScript 17.50% CSS 22.43% TeX 0.61% Makefile 0.50%

2023-dsci522-group22's Issues

Summary of our script responsibilities

As we discussed in person

Script 1: Yingzi
Script 2: Chris
Script 3: Chun li
Script 4: Jordan

Finished by end of day on Friday, collaborate Saturday morning on how to proceed.

Done for night - passing on the jupyter book generation

Using the build_jupyter_book branch.

Created a jupyter book.
Moved a copy of our notebook to the report folder generated by jupyter book.
Commented out the model section of our notebook, deleted (some) unnecessary lines of code and added embedded .png files of our figures.

Here are some tips:

You'll have to conda install jupyter-book=0.15.1 or update from the yaml file.
https://jupyterbook.org/en/stable/basics/build.html <- how to build
https://jupyterbook.org/en/stable/content/executable/output-insert.html#gluing-variables-in-your-notebook <- glue
The model section still needs to be done.
I couldn't figure out how to get a self-contained html file built from our base /reports directory, similar to https://github.com/ttimbers/breast_cancer_predictor_py/tree/main/report

finished rough draft of script, need to collaborate to progress

This is mostly directed to Chun Li.
When you have finished your script, let me know so I can see whether I need to add more functionality to mine. I made an educated guess about everything my script was supposed to include, but I'm not 100% confident.

command line - how to run our scripts

python scripts/wine_classification_plot_script.py \
    --dataframe-csv results/tables/logistic_regression_feature_importance.csv \
    --red-wine-color red \
    --white-wine-color peachpuff \
    --output-path results/figures/predict_visualization.png

This is how to execute my script.
Not sure if we are storing this somewhere so we can update the readme.

Function about visualization

Hello team, I am working on converting a visualization into a function, but I'm not sure how to test it. It doesn't seem possible to test whether a chart is a bar chart or a scatter plot. Any ideas?
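
One option, assuming the charts are built with Altair (the project depends on altair_saver): a chart object can be converted to a Vega-Lite dictionary with `to_dict()`, and the mark type can be read from that dictionary. The sketch below works on plain dictionaries so it runs without Altair installed. `mark_type`, the field names and the two example specs are hypothetical, not taken from our scripts.

```python
def mark_type(spec):
    """Return the mark type of a Vega-Lite spec dict, e.g. 'bar' or 'point'."""
    mark = spec["mark"]
    # Vega-Lite allows the mark to be a plain string or an object with a "type" key.
    return mark["type"] if isinstance(mark, dict) else mark


# Hypothetical specs, shaped like the output of an Altair chart's to_dict().
bar_spec = {
    "mark": {"type": "bar"},
    "encoding": {"x": {"field": "feature"}, "y": {"field": "coefficient"}},
}
scatter_spec = {
    "mark": "point",
    "encoding": {"x": {"field": "alcohol"}, "y": {"field": "density"}},
}

print(mark_type(bar_spec))
print(mark_type(scatter_spec))
```

In a real test this might look like `assert mark_type(my_plot_function(df).to_dict()) == "bar"`, where `my_plot_function` stands for whichever function returns the chart. The encoded field names can be checked the same way.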

Runtime Error for eda.py

When I run the script, it gives me the following error. Please check it:
<img width="1173" alt="image" src="https://github.com/UBC-MDS/2023-DSCI522-Group22/assets/108684429/f21b7748-6ef6-40b8-80e9-1f60a64775d5">

Update of env yml and docker

I forgot to update the docker file and yml file with the new package: "click". Could someone approve my pull request and then perhaps add that? Thank you in advance :)

Combined Data File

I sent a pull request for the combined data file. Please check it.

Book is fully rendered properly

Pre-emptively anticipating feedback, on Monday I got our GitHub Pages site to render our report.
Today and tomorrow I'm dividing my attention between 573 and 522. I'll create the changelog and start fixing some of the suggestions.

I'm a little confused why group feedback isn't in our issues? I must have misinterpreted the instructions about the location.
I'll figure it out in 522 lecture today, I'll ask Tiff/TA.

Makefile

I've sent a pull request for the makefile. Please test it.

Docker is done!

Hi team

I'm happy to report that our report is now fully reproducible on Docker with the instructions from our README.md file. Here's what I did:

I made a new Dockerfile based on the information from the latest Environment file Chris made today.
I built a docker image, and pushed it to Docker Hub to be hosted. The link is here: https://hub.docker.com/repository/docker/lichunubc/wineclassifier/general
With the docker image hosted remotely, I have built a compose.yaml file (based on Tiff's sample file) and placed it under our project's root directory.
I restarted my PC, opened terminal, git cloned a fresh copy of our repo and ran docker compose up, which worked.
I clicked the 127.0.0.1 link on screen, which opened a Jupyter notebook. I opened a terminal, copied and pasted the commands from our README.md file and hit enter.
Everything worked, flawlessly.

test_download_failure_invalid_url Failed

Also, test_download_failure_invalid_url function failed too.

`__________________________________________ test_download_failure_invalid_url __________________________________________

def test_download_failure_invalid_url():
    runner = CliRunner()
    result = runner.invoke(main, ['--url', 'invalidurl://example.com/somefile.csv'])

>       assert result.exit_code != 0

E assert 0 != 0
E + where 0 = .exit_code

test_download_file.py:28: AssertionError
`

I have no idea how to fix it.
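
One possible cause, assuming the script catches the error and only prints a message: the command then returns normally, so the exit code stays 0 and the assertion fails. With click's `CliRunner`, an exception that propagates out of the command (or an explicit `sys.exit(1)`) produces a non-zero `exit_code`. The sketch below shows the idea with the standard library only; `validate_url` and this `main` are hypothetical stand-ins, not the code in download_file.py.

```python
import sys
from urllib.parse import urlparse


def validate_url(url):
    """Raise ValueError unless the URL uses http or https."""
    scheme = urlparse(url).scheme
    if scheme not in ("http", "https"):
        raise ValueError(f"unsupported URL scheme: {scheme!r}")


def main(url):
    """Hypothetical entry point: report the problem AND exit with a non-zero code."""
    try:
        validate_url(url)
    except ValueError as err:
        print(f"Error: {err}", file=sys.stderr)
        sys.exit(1)  # printing alone would leave the exit code at 0


# Capture the exit code the same way a test runner would.
try:
    main("invalidurl://example.com/somefile.csv")
    exit_code = 0
except SystemExit as exit_info:
    exit_code = exit_info.code

print(exit_code)
```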

Next step to complete our project

Ok, I just pull requested the tests/functions for my part. This will complete the modularization part of our project.

Tomorrow we need to double check things (verify the .yaml versions are correct and everyone's tests pass), make sure we've published the container image on DockerHub, and publish the final report from the readme.

Good work team!

test_download_http_error Failed

My test function gives the following error message:
`______________________________________________ test_download_http_error _______________________________________________

tmp_path = PosixPath('/private/var/folders/c3/4qxt4f9j08z9s0bydw9k0n800000gn/T/pytest-of-f1yingh3ap/pytest-6/test_download_http_error0')

def test_download_http_error(tmp_path):
    test_url = "https://opendata.vancouver.ca/api/explore/v2.1/catalog/datasets/issued-building-permits/exports/csv?lang=en&timezone=America%2FLos_Angeles&use_labels=true&delimiter=%3B"
    test_destination = tmp_path.as_posix()
    mock_response = Mock()
    mock_response.raise_for_status.side_effect = requests.exceptions.HTTPError

    with patch('download_file.requests.get', return_value=mock_response):
        runner = CliRunner()
        result = runner.invoke(main, ['--url', test_url, '--destination', test_destination])

>       assert result.exit_code != 0

E assert 0 != 0
E + where 0 = .exit_code

test_download_file.py:41: AssertionError
=============================================== short test summary info ===============================================
FAILED test_download_file.py::test_download_failure_invalid_url - assert 0 != 0
FAILED test_download_file.py::test_download_http_error - assert 0 != 0
============================================= 2 failed, 1 passed in 0.12s =============================================
`

Any idea how to fix it?
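
A guess at the cause, not verified against download_file.py: the mock only raises when `raise_for_status()` is actually called, so if the script never calls it (or catches the error and carries on), the command finishes with exit code 0. The patch target `download_file.requests.get` also only takes effect if the module is named `download_file` and uses `import requests`. The sketch below shows the pattern with the standard library; `download`, the injected `get` argument and `FakeHTTPError` are hypothetical stand-ins for the real function and for `requests.exceptions.HTTPError`.

```python
from unittest.mock import Mock


class FakeHTTPError(Exception):
    """Stand-in for requests.exceptions.HTTPError so the sketch needs no extra package."""


def download(url, get):
    """Fetch url with the injected `get` callable and return the response body."""
    response = get(url)
    # Without this call, a 4xx/5xx response is silently treated as success.
    response.raise_for_status()
    return response.content


# Build a response whose raise_for_status() raises, as in the failing test.
mock_response = Mock()
mock_response.raise_for_status.side_effect = FakeHTTPError("404 Not Found")
fake_get = Mock(return_value=mock_response)

try:
    download("https://example.com/missing.csv", get=fake_get)
    outcome = "no error"
except FakeHTTPError:
    outcome = "error propagated"

print(outcome)
```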

Workflow to populate the Docker File

Based on our prior agreement, to give each member a chance to add a few lines to the Docker File, we would have to make 4 branches and 4 Pull Requests, which seems tedious for something this trivial. How about we make the changes directly on the Main branch, for this task only?

trimming our final report

Hey all, wanted to let you know that I'm working on removing some unnecessary EDA steps from our Jupyter book that won't be useful in communicating our project thesis to a general audience. I'll make a sub branch and push it for review when I'm done.

To do list for Milestone 3

Here's a non-exhaustive to-do list for this week's group project, based on the differences between our current repo and the sample repo from Tiffany.

Produce a rendered final report in html without any code explicitly displayed to the audience. The report should have built-in references to figures and bibliographies powered by Jupyter book.
Create a new "Result" folder to house figures (in PNG), models (pipelines and preprocessors, in pickle), and tables (in csv).
Create a new "Script" folder to house all updated .py files (the ones we made last week) with the formats and features Tiffany demonstrated in Lecture 5.
Update the computational environment if additional libraries need to be installed.
Update the README.md file to reflect all the new changes.
...
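
A minimal sketch of the results layout from the list above, assuming the folder ends up being called `results` with `figures`, `models` and `tables` inside it. `save_results`, the file names and the placeholder model and table row are made up for illustration; a real run would pickle the fitted pipeline and write the real table.

```python
import csv
import pickle
import tempfile
from pathlib import Path


def save_results(root, model, header, rows):
    """Create the results sub-folders, pickle the model and write one CSV table."""
    root = Path(root)
    for sub in ("figures", "models", "tables"):
        (root / sub).mkdir(parents=True, exist_ok=True)

    model_path = root / "models" / "pipeline.pickle"  # hypothetical file name
    with open(model_path, "wb") as f:
        pickle.dump(model, f)

    table_path = root / "tables" / "example_table.csv"  # hypothetical file name
    with open(table_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)
    return model_path, table_path


# Demonstrate in a temporary directory so nothing is written into the repo.
with tempfile.TemporaryDirectory() as tmp:
    placeholder_model = {"C": 1.0}  # placeholder object, not a real pipeline
    model_path, table_path = save_results(
        Path(tmp) / "results",
        placeholder_model,
        header=("feature", "coefficient"),
        rows=[("example_feature", 0.0)],  # placeholder row, not a real result
    )
    with open(model_path, "rb") as f:
        restored = pickle.load(f)
    table_lines = table_path.read_text().splitlines()
    created = sorted(p.name for p in (Path(tmp) / "results").iterdir())
```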

Overloading Issue

Hey guys, I am writing the scripts for the EDA and I think this step might be overloaded. The script needs to read in 8 strings, which would be very inconvenient when we actually use it in terminal commands. I am thinking of breaking it down into two parts:

Basic Data Wrangling: where we a) read in the downloaded datasets (2 of them), b) concatenate them, c) save the concatenated output, d) train-test split them, and e) save the train and test df locally
- This requires five strings as inputs when we use the terminal command
EDA: where all the graphing happens
- This requires three strings as inputs when we use the terminal command
If no one has another opinion, I'll just continue with this.

Question about current pull request

I see that in the most recent pull request we are changing the helper function in addition to the main analysis code. I just want to make sure everything runs, because in our last milestone submission some of the test functions didn't run because of changes to the helper function.

Received chun-li's table, executing script

I'm bug fixing the script now, some minor issues so far.
Afterwards I'll push it.

We can work in parallel on the next steps when the script/table is added.

Later, I intend to polish it, add any missing functionality, revise the tests and improve the documentation.

Milestone #4 and to finish things up

Hi Team

I believe we are in very good shape right now. Our report in HTML is up and running via JupyterBook, our analysis is now fully reproducible on Docker with simple instructions in the README.md file, and we have obtained valuable feedback from peers.

Despite facing challenges last week, our resilience shone through, and I'm confident we've regained our momentum. This week, let's focus on refining our report by incorporating peer feedback. It's crucial that we meticulously record all modifications in our CHANGELOG.md file for transparency and tracking.

Keep in mind that each member must make at least one change that improves the project in a significant way, and we will be graded on the scale and quality of our changes. I encourage everyone to think outside the box and communicate openly, especially when changes impact multiple team members.

Regarding the Makefile, which accounts for 50% of our Milestone marks, Yingzi is currently leading this effort. Whoever has extra time to spend on our project is encouraged to actively collaborate with her on this task to ensure we all excel in this area.

In closing, I want to express how truly impressed I am with our teamwork and progress. Each challenge we've faced has been a stepping stone, helping us grow stronger and more cohesive as a group. Let's carry this positive momentum forward, learning and succeeding together. I'm excited about our journey ahead and confident in our collective ability to achieve great things.

Branch naming conventions

Hey all!
Relatively minor thing, but I wanted to remind everyone to name their branches after the type of functionality in their update, plus some sort of version code or time stamp to differentiate new branches from old ones.
It was mentioned in class and we confirmed later that it will affect our grades.

Environment and docker yaml file updated to include altair_saver 0.5.0

We need this package to save the PNG files locally, so it has to be added.

References.bib

If anyone has time, please finish this file and store it under the report folder for the Jupyter book build.

Feedback 1 and 10 resolved and documented

Hi team

I have made the following changes to our project to address issues raised by peer review, all of which are documented in changelog.md. I have fully tested this branch and the changes work on my PC.

Updated LICENSE file based on the feedback received from Peer Review
Updated the environment.yaml file to include the Make version 4.3
Updated the Dockerfile to include Make version 4.3
Rebuilt the Docker image and pushed it to Docker Hub remote repo. Link here (https://hub.docker.com/repository/docker/lichunubc/wine_color_classification/general)
Updated the compose.yaml file to include the latest version of Docker image

Issue with Docker image

When I ran "docker build --tag group22 ." to create a local docker image with our Dockerfile, I got an error message:

Dockerfile:3

1 | FROM quay.io/jupyter/minimal-notebook:2023-11-19
2 |
3 | >>> RUN conda install -y python=3.10
4 | RUN conda install -y pandas=2.1.3
5 |

ERROR: failed to solve: process "/bin/bash -o pipefail -c conda install -y python=3.10" did not complete successfully: exit code: 1

Any idea how to solve it?

Rough draft of scripts finished - Next steps

Ok, all the scripts/tables/figures are uploaded. My branch is pending merge.
Next major steps:

Bug fixing - someone run each script on their machine to see if it's reproducible
Modify our report to eliminate code and reference the tables/graphs in our repo.
More (see other github issues) 😄

I'm reasonably confident I can do a good job of the second task, so unless there are objections I'll work on that now.

Weird format for combined data file

I tried to combine wineread.csv and winewhite.csv to be a single csv file. The code for it is: `# Read the CSV files
df1 = pd.read_csv(file1, delimiter=';')
df2 = pd.read_csv(file2, delimiter=';', header=None)

# Concatenate the dataframes
combined_df = pd.concat([df1, df2])`

However, the combined file looks weird. Can anyone tell me how to fix it?
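
A likely cause, judging only from the snippet above: `header=None` on the second file makes pandas treat its header row as data and name the columns 0, 1, 2, ..., so `pd.concat` cannot line them up with the named columns of the first file and fills the gaps with NaN. A minimal sketch of a fix, using two tiny in-memory files with made-up values instead of the real CSVs; the `color` column is an assumption about how we want to label the rows.

```python
import io

import pandas as pd

# Tiny stand-ins for the two semicolon-delimited files (values are made up).
red_csv = io.StringIO("fixed acidity;pH\n7.4;3.51\n7.8;3.20\n")
white_csv = io.StringIO("fixed acidity;pH\n7.0;3.00\n6.3;3.30\n")

# Read both files the same way so the column names match.
df_red = pd.read_csv(red_csv, delimiter=";")
df_white = pd.read_csv(white_csv, delimiter=";")

# Label each row before combining so the wine type is not lost.
df_red["color"] = "red"
df_white["color"] = "white"

# ignore_index=True gives the combined frame a clean 0..n-1 index.
combined_df = pd.concat([df_red, df_white], ignore_index=True)
print(combined_df)
```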

Created the changelog - Addressed the feedback

I created a rough draft of the changelog and pushed it to its own branch.

I broke down what needs to be done and noted whose responsibility each fix is, based on our in-person discussion.

Each time you fix something, please edit this changelog with a quick explanation of what you did to fix the issue and where (file location) the fix took place.

Minor update to repo organization

We will move the analysis ipynb file into a new folder called notebook. The src folder will now house all the custom function scripts and we will make another new folder tests to house all the pytest scripts.

I have seen this kind of organization in a number of other repos and it seems to be standard good practice. Please let me know if this is a good idea.

Remember to populate the README.MD file

As mentioned in the Milestone 1 instructions, we should use proper grammar and full sentences in our README.md. Point form may occur, but should be less than 10% of our written documents.

This is the summary of our data project:

Our analysis aimed to develop a predictive model to distinguish between red and white wines based on various physicochemical properties. This study employed logistic regression, a model renowned for its balance between predictive power and interpretability.

The regression result suggested that residual sugar and total sulfur dioxide had high positive coefficients, indicating a strong association with white wine, whereas density showed the most substantial negative impact, followed by alcohol and volatile acidity, suggesting these are key indicators of red wine.

The logistic regression model not only achieved high accuracy but also provided valuable insights into the features most indicative of wine type. This model can assist vintners in quality control and classification tasks. Moreover, the interpretability of the model offers a foundation for further research into wine composition and its impact on sensory attributes. Future studies might explore more complex models or delve deeper into feature engineering to enhance predictive accuracy and understanding.

Functions/Test responsibilities

Hey all, I'm writing my assigned tests/functions now.
We decided that I would create functions for the EDA part of our project.
I wanted to double check that there was no overlap with anyone else's work.

The visualizations of the histograms, bar charts and correlation plot for the EDA are one area where I'm concerned there might be overlap.
I only loosely recall what Yingzi was working on, and that might be part of it.
Can anyone confirm? I'm working on the first part of the EDA, which is unambiguous, while I wait.

I'll go ahead and do the work anyway if I don't hear back by early tomorrow.
