Giter VIP home page Giter VIP logo

mindatdatasetschoolproject-old's Introduction

MinDat-Mineral-Image-Dataset

A dataset of +500,000 mineral images with labels taken from [mindat.org]. The dataset is in two CVS files. The first file img_url_list.csv contains a list of the image URLs and the raw labels corresponding to the title of the image as seen on mindat.org. The second file img_url_list_converted.csv contains the same image URLs with cleaned up labels and images with no labels removed. All scripts to generated these CVS files are provided alone with a script to download all the images. Generating the CVS file of images take around 10 hours and downloading all images requires around 24 hours with a good internet connection (>10mbps). All scripts are multi threaded.

Generating Dataset for Yourself

To generate the URL list run make_url_list.py and this will go through [mindat.org] and scrap all the images. This will save several hundred files in the img_urls directory. After this run the concat_url_files script to make one unified CVS file called img_url_list.csv. Now you can run convert_img_url_list.py to generate an image list with cleaned up labels. Right now this scrip just removes variation text. For example, many images are Quartz with variation of Capped Quartz, Chalcedony Quartz, etc. These extra labels get removed to just read Quartz. The script can be easily modified to clean the data in other ways though. The last script is download_images.py that downloads all the images in the img_url_list_converted.csv to a directory specified in the file. The total dataset is around 400G. Some images are extremely high resolution so the total space can be greatly reduce by resizing before saving however for my purposes I wanted the high resolution images and kept that in.

Sample images!

alt tag alt tag

mindatdatasetschoolproject-old's People

Contributors

nuilewis avatar loliverhennigh avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.