flockfysh is an open source, efficient 2D-image tool that combines web scraping with artificial intelligence to generate and curate top quality image / object detection datasets. Feed flockfysh a "mini-dataset" with only ~50 images for each label (in the train category), and get back a hundredfold!
We support your favorite tools such as Roboflow!
We are currently looking for open source contributors, and would love to work with you to further develop this promising tool!
Currently, the dependencies are known to work on Python 3.8 (the requirements.txt file was generated using a Python 3.8 virtual env). flockfysh will likely work on other versions as well, but you may have to sort through the dependencies (a bunch of pip installs and lookups for the appropriate versions).
For Users
flockfysh now has a CLI version on PyPI. Simply run pip install flockfysh to download the package, and then flockfysh input.yaml to run it.
We combine traditional object detection and classic imaging techniques, such as data augmentation, with data-gathering techniques like lightning-fast web scraping. The high-level algorithm, which is the same for most of our supported features, is described below:
General procedure for training and webscraping ("train-scrape")
- Train object detection models on the small sample of data provided (only images in the train folder will be considered)
- Scrape various websites for images (based on a small number of search queries supplied by the user), and use the model to pick out the most relevant, highest-quality images
- Download those images, and train an even better object detection model
- Repeat the previous two steps (scrape, then retrain) for some iterations, and then use the best model to gather the rest of the data
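The procedure above can be sketched as a short loop. This is only an illustrative outline, not flockfysh's actual code: the function name train_scrape, the train, scrape, and score callables, and the keep_threshold parameter are all hypothetical stand-ins, and the sketch simply keeps the latest model instead of tracking the best one.

```python
from typing import Callable, List, Sequence, Tuple

Image = str      # stand-in for an image path or URL
Model = object   # stand-in for a trained detector


def train_scrape(
    seed_images: Sequence[Image],
    queries: Sequence[str],
    train: Callable[[Sequence[Image]], Model],
    scrape: Callable[[str], List[Image]],
    score: Callable[[Model, Image], float],
    iterations: int = 3,
    keep_threshold: float = 0.5,
) -> Tuple[Model, List[Image]]:
    """Hypothetical outline of a train-scrape loop (all names are assumptions)."""
    dataset = list(seed_images)
    model = train(dataset)  # train on the small seed set
    for _ in range(iterations):
        # Scrape candidates and keep only those the current model rates highly.
        seen = set(dataset)
        accepted = []
        for query in queries:
            for img in scrape(query):
                if img not in seen and score(model, img) >= keep_threshold:
                    accepted.append(img)
                    seen.add(img)
        dataset.extend(accepted)
        # Retrain on the enlarged dataset.
        model = train(dataset)
    return model, dataset


# Toy stand-ins so the sketch runs without a network or a real model.
toy_web = {"bed bug bites": ["a.jpg", "b.jpg", "junk.png"]}
model, data = train_scrape(
    seed_images=["seed.jpg"],
    queries=["bed bug bites"],
    train=lambda imgs: len(imgs),              # the "model" is just the dataset size
    scrape=lambda q: toy_web.get(q, []),
    score=lambda m, img: 1.0 if img.endswith(".jpg") else 0.0,
    iterations=2,
)
print(data)
```

Passing the training, scraping, and scoring steps in as callables keeps the loop itself independent of any particular detector or scraper.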
flockfysh keeps the inputs for its basic tasks simple. We support multiple workflows to generate quality datasets. Our core tool expects the two things below to begin with, and then automatically generates the rest of the dataset:
- A dataset (in the format specified below)
- A .yaml input file
All of the flockfysh operations support datasets in the format shown below:
dataset/
  train/ (preferably put most of your images here)
  valid/ (validation set, please keep a small number of images here)
  test/ (not needed by flockfysh)
We chose this format because it is consistent with the majority of Machine Learning and Computer Vision workflows, and because its structure supports model training and testing right away.
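Before starting a run, it can help to confirm that the dataset folder has the expected layout. The snippet below is a minimal sketch using only the Python standard library; the helper name check_layout is hypothetical, and it checks only the train/ and valid/ folders described above, making no assumption about what is inside them.

```python
import tempfile
from pathlib import Path
from typing import List


def check_layout(dataset_dir: str) -> List[str]:
    """Return a list of problems found in the dataset folder (empty if none)."""
    root = Path(dataset_dir)
    problems = []
    # test/ is optional, so only train/ and valid/ are checked.
    for split in ("train", "valid"):
        if not (root / split).is_dir():
            problems.append(f"missing folder: {split}/")
    return problems


# Demo on a throwaway directory that has train/ but no valid/.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "train").mkdir()
    problems = check_layout(tmp)
print(problems)
```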
Note that the dataset labels should be in YOLO / YOLOv5 format.
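For reference, a YOLO-style label file has one line per object: a class index followed by the box center x, center y, width, and height, all normalized to the range 0 to 1. The sketch below parses such text into pixel coordinates. The Box type and parse_yolo_labels function are illustrative names, not part of flockfysh.

```python
from typing import List, NamedTuple


class Box(NamedTuple):
    class_id: int
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def parse_yolo_labels(text: str, img_w: int, img_h: int) -> List[Box]:
    """Convert YOLO label lines (normalized center/size) to pixel corner boxes."""
    boxes = []
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        if len(parts) != 5:
            raise ValueError(f"expected 5 fields, got {len(parts)}: {line!r}")
        class_id = int(parts[0])
        xc, yc, w, h = (float(p) for p in parts[1:])
        boxes.append(
            Box(
                class_id,
                (xc - w / 2) * img_w,
                (yc - h / 2) * img_h,
                (xc + w / 2) * img_w,
                (yc + h / 2) * img_h,
            )
        )
    return boxes


# One centered box covering half the width and a quarter of the height.
boxes = parse_yolo_labels("0 0.5 0.5 0.5 0.25\n", img_w=200, img_h=200)
print(boxes)
```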
In order to run, flockfysh requires a small amount of information from the user. More specifically, it needs:
- The names of your classes for your dataset
- The input directory name of your dataset folder (in the format above)
- A Python dictionary mapping each class name to a list of search queries that would get you the images you want
We can effectively provide this information to flockfysh by using an input YAML file. Additionally, there are customizable settings (such as training parameters and auxiliary options such as saving the bounding boxes for each image) that you can also toggle in this YAML format.
Flockfysh uses a general format in the input YAML as follows:
job1:
  job-type: 'train-scrape'
  input-dir:
  ...
job2:
  ...
jobn:
  ...
Each task ("train-scrape" is an example of a task or feature that flockfysh can perform) is treated as a separate job. flockfysh supports multiple jobs in one file and performs them sequentially (i.e., it runs job1, then job2, etc.). The identifier for each job (e.g. job1) can be changed to whatever the user wants, and flockfysh will notify the user in the terminal when it starts a job. For more information on the different kinds of jobs, see the description of job types below.
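The sequential behavior can be pictured with a small dispatcher. This is a sketch under stated assumptions, not flockfysh's implementation: the config dictionary stands in for the parsed YAML file, and run_jobs and the handlers mapping are hypothetical names.

```python
from typing import Any, Callable, Dict, List

JobSettings = Dict[str, Any]


def run_jobs(
    config: Dict[str, JobSettings],
    handlers: Dict[str, Callable[[JobSettings], None]],
    log: Callable[[str], None] = print,
) -> List[str]:
    """Run each job in file order, announcing it before it starts."""
    completed = []
    # Dictionaries preserve insertion order, so jobs run top to bottom.
    for name, settings in config.items():
        job_type = settings.get("job-type")
        if job_type not in handlers:
            raise ValueError(f"{name}: unknown job-type {job_type!r}")
        log(f"Starting job {name!r} ({job_type})")
        handlers[job_type](settings)
        completed.append(name)
    return completed


# Stand-in for a parsed input.yaml with two jobs.
config = {
    "job1": {"job-type": "download"},
    "job2": {"job-type": "train-scrape"},
}
ran: List[str] = []
log_lines: List[str] = []
handlers = {
    "download": lambda settings: ran.append("download"),
    "train-scrape": lambda settings: ran.append("train-scrape"),
}
completed = run_jobs(config, handlers, log=log_lines.append)
print(completed)
```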
For each job in the YAML, there is a set of mandatory settings to include (take a look at the 3 settings above, for example). If any mandatory setting is missing, the job will not complete and an error will be thrown. Each job also has other configurable settings whose default values are used when they are not specified. Specifying them in the YAML for a job overrides the defaults (for that job only).
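The interplay between mandatory settings and overridable defaults can be illustrated as follows. This is a sketch only: the MANDATORY tuple reflects the three settings listed earlier, while the values in DEFAULTS are made up for illustration and are not flockfysh's real defaults.

```python
from typing import Any, Dict

# Assumption: mandatory keys for a train-scrape job, taken from the three
# pieces of information listed earlier in this document.
MANDATORY = {
    "train-scrape": ("input-dir", "class-names", "class-search-queries"),
}

# Assumption: illustrative default values, NOT flockfysh's real defaults.
DEFAULTS = {
    "train-scrape": {"train-workers": 8, "images-per-label": 500},
}


def resolve_job(name: str, job: Dict[str, Any]) -> Dict[str, Any]:
    """Check mandatory settings, then layer the job's values over the defaults."""
    job_type = job.get("job-type")
    if job_type not in MANDATORY:
        raise ValueError(f"{name}: unknown job-type {job_type!r}")
    missing = [key for key in MANDATORY[job_type] if key not in job]
    if missing:
        raise ValueError(f"{name}: missing mandatory settings: {', '.join(missing)}")
    # Later entries win, so values given in the job override the defaults.
    return {**DEFAULTS[job_type], **job}


resolved = resolve_job(
    "job1",
    {
        "job-type": "train-scrape",
        "input-dir": "robo",
        "class-names": ["Tick"],
        "class-search-queries": {"Tick": ["tick bites"]},
        "images-per-label": 100,  # overrides the illustrative default of 500
    },
)
print(resolved["images-per-label"], resolved["train-workers"])
```

Because the merge is done per job, an override in one job leaves the defaults for every other job unchanged.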
To take a look at the default options / config for a specific job, check out the default settings folder.
Make sure to check that you have everything specified in link. For most dataset generations, one can easily adapt a sample YAML workflow instead of needing to write one from scratch.
Running flockfysh using the GitHub repo (latest code):
- Clone our repository by running the command git clone https://github.com/teamnebulaco/flockfysh.git
- Run cd flockfysh to enter the repo and pip install -r requirements.txt to install the dependencies.
- Export a YOLOv5 dataset (in the format specified above) into a folder inside the repository. If your dataset is on Roboflow, you have the option of exporting it and moving it into the directory, or adding a download job at the beginning of the YAML to automatically load it in for you.
  - Don't have a dataset? Check out our sample Roboflow dataset
- Create an input.yaml file (take a look at the sample input.yaml format)
- Run python run.py input.yaml to run flockfysh with the specified input file input.yaml
  - On machines that use the command python3 instead of python to execute Python 3, either alias python to python3 or run python3 run.py input.yaml instead
For the purposes of this sample, we will use a publicly available dataset within the Roboflow universe.
The sample YAML file for this workflow is the same as the example input.yaml.
- After cloning the repo and getting set up, add this into an input.yaml file at the base directory (same directory as run.py)
job1: # Download the dataset we want to train on
  job-type: 'download'
  api-name: 'roboflow'
  api-key: 'ENTER_YOUR_API_KEY_HERE'
  workspace-name: "sanka-madushankaresearch"
  project-name: "insectbite"
  project-version: 1
  output-dirname: 'robo'
job2: # job names such as job2 can be replaced with any name you prefer
  job-type: 'train-scrape'
  input-dir: robo
  class-names: ['Bed Bug', 'Fire ant', 'Tick', 'Wasp']
  class-search-queries: {'Bed Bug' : ['bed bug bites'], 'Fire ant' : ['fire ant bites'], 'Tick' : ['tick bites'], 'Wasp' : ['wasp bites']}
  train-workers: 8
  images-per-label: 500
  total-maximum-images: 7000
  image-dimensions: 200
  train-batch: 8
- Replace the api-key value with the API key you get from Roboflow (it can easily be found when exporting a dataset using code).
- Run python run.py input.yaml to run the workflow!
For the purposes of this sample, we will use a publicly available dataset within the Roboflow universe, but download it locally. The dataset should be in the format specified above.
- After cloning flockfysh and getting set up, run git clone https://github.com/teamnebulaco/sample-flockfysh-robo.git and move the robo folder inside it into the base directory
- Add the code below into an input.yaml file at the base directory (same directory as run.py)
job1: # job1 can be replaced with any name for the job you prefer
  job-type: 'train-scrape'
  input-dir: robo
  class-names: ['Bed Bug', 'Fire ant', 'Tick', 'Wasp']
  class-search-queries: {'Bed Bug' : ['bed bug bites'], 'Fire ant' : ['fire ant bites'], 'Tick' : ['tick bites'], 'Wasp' : ['wasp bites']}
  train-workers: 8
  images-per-label: 500
  total-maximum-images: 7000
  image-dimensions: 200
  train-batch: 8
- Run python run.py input.yaml to run the workflow!
You must specify the type of each job using a job-type attribute. Here are the different job types available.
The download job automatically downloads a dataset from a specific API. The currently supported APIs are listed below:
- Roboflow
- Kaggle
- HuggingFace
Note that each API also has API-specific information (API keys, secrets, etc) that flockfysh needs to utilize to download the dataset.
The train-scrape job uses object detection models in tandem with web scraping to generate an image dataset. Here are some relevant properties:
- input-dir: The path to the dataset folder
- class-names: An array of labels for the images you are trying to classify
- class-search-queries: A mapping from each class name to a list of search queries (imagine Googling the images yourself) used to download the images
- train-workers: Number of workers for training
- images-per-label: The main property for controlling dataset size: how many images you want for each of the class names specified in class-names
- total-maximum-images: Adds an upper limit on the number of images (most support is for images-per-label at the moment)
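A quick consistency check on these properties can catch mistakes before a long run. The sketch below uses hypothetical helper names, and the way image_budget combines images-per-label with total-maximum-images is only one plausible reading of the descriptions above, not confirmed flockfysh behavior.

```python
from typing import Any, Dict, List


def check_train_scrape(job: Dict[str, Any]) -> List[str]:
    """Report classes without queries and queries for unknown classes."""
    names = job["class-names"]
    queries = job["class-search-queries"]
    problems = []
    for name in names:
        if not queries.get(name):
            problems.append(f"no search queries for class {name!r}")
    for name in queries:
        if name not in names:
            problems.append(f"queries given for unknown class {name!r}")
    return problems


def image_budget(job: Dict[str, Any]) -> int:
    """Assumed reading: per-label target times class count, capped by the maximum."""
    wanted = job["images-per-label"] * len(job["class-names"])
    cap = job.get("total-maximum-images")
    return wanted if cap is None else min(wanted, cap)


# Values taken from the sample train-scrape job in this document.
job = {
    "class-names": ["Bed Bug", "Fire ant", "Tick", "Wasp"],
    "class-search-queries": {
        "Bed Bug": ["bed bug bites"],
        "Fire ant": ["fire ant bites"],
        "Tick": ["tick bites"],
        "Wasp": ["wasp bites"],
    },
    "images-per-label": 500,
    "total-maximum-images": 7000,
}
problems = check_train_scrape(job)
budget = image_budget(job)
print(problems, budget)
```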
We are extremely excited to open this repository to the community, and can't wait to see the future to which this project heads! Please consider joining our Discord server, which we use as our main platform to communicate, improve, and resolve issues.
Steps to begin development:
- Clone our repository by running the command git clone https://github.com/teamnebulaco/flockfysh.git
- Switch into our current dev-branch by running git checkout -b dev-branch
- Pull the code: git pull origin dev-branch
- Create a virtualenv by running the command python -m virtualenv venv (syntax may vary)
  - Note that the command above assumes you have the virtualenv package installed. If not, run the command pip install --upgrade virtualenv
- Run python run.py input.yaml to start developing! Happy coding!
This repository is licensed under the BSD-4 license. Note that some of the images from the scraper may be under artistic copyright, and should only, only, ONLY be used for ML & Training purposes. Under no circumstances should this tool be exploited to circumvent copyrights. Besides, it makes everyone's lives easier if we don't mooch off each other's copyrighted images. :))