A Scrapy spider, written in Python, that scrapes almost the entirety of the goBILDA website for each and every `.STEP` file, for use in any CAD software that supports the `.STEP` format.
- Install Scrapy (assuming you already have Python installed):

  ```
  pip install scrapy
  ```

  Tip: if the installation fails, you might want to try installing Scrapy in a virtualenv.
- Make sure you have a valid `FILES_STORE` defined in `settings.py`. By default, you will have to create a folder at the root of this repository called `models`:

  ```
  ├───gobilda_parts_bot
  │   ├───spiders
  │   │   └───__pycache__
  │   └───__pycache__
  └───models   <-- what you need to make
  ```

  Feel free to change `FILES_STORE` to whatever suits your needs (see the settings sketch after these steps).
- Run the `parts` spider:

  ```
  scrapy crawl parts
  ```
- Wait for the `FILES_STORE` to populate, and you are good to go!

  Note: there will be some empty folders left behind in the `models` folder, like `full` and some with `assembly` in the name. Don't worry, nothing went wrong; I just didn't bother trying to find a simple solution to delete them (a small cleanup snippet follows these steps).
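For reference, the `FILES_STORE` value mentioned in step 2 is just a path in `settings.py`. A minimal sketch, assuming the default `models` folder; the pipeline entry shown is the stock Scrapy files pipeline and is illustrative only, since this project's own pipeline class may differ:

```python
# settings.py (sketch)
FILES_STORE = 'models'  # where downloaded files land; change to taste

# A files pipeline must be enabled for file_urls to be downloaded.
# The exact pipeline class used by this project may differ.
ITEM_PIPELINES = {
    'scrapy.pipelines.files.FilesPipeline': 1,
}
```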
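If those leftover empty folders bother you, a quick standalone cleanup pass (not part of the spider; assumes the default `models` folder) can remove them:

```python
import os

# Walk bottom-up so child directories are checked before their parents.
for root, _dirs, _files in os.walk('models', topdown=False):
    if not os.listdir(root):  # directory is empty
        os.rmdir(root)
```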
The goBILDA website is organized with repetitive product grid items that a web scraper can easily detect and follow to an individual product page. The spider, responsible for crawling the website and downloading the `.STEP` files, starts at a few main URLs on the goBILDA website to download as many parts as possible while omitting repeats, merch, and kits.
```python
start_urls = [
    'https://www.gobilda.com/motion',
    'https://www.gobilda.com/structure/',
    'https://www.gobilda.com/electronics/',
    'https://www.gobilda.com/hardware/',
]
```
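For context, those URLs are attributes of the spider class living under `gobilda_parts_bot/spiders/`. A rough sketch of its shape (the class and file names are assumptions; the `name` attribute is what `scrapy crawl parts` looks up):

```python
import scrapy

class PartsSpider(scrapy.Spider):  # class name assumed for illustration
    name = 'parts'                 # matched by `scrapy crawl parts`
    start_urls = [...]             # the four catalog URLs listed above

    # parse() and parse_product_page(), shown below, live on this class too.
```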
For every page, the spider's `parse()` method is called:
```python
def parse(self, response):
    # Get the link attribute from a product box
    partsList = response.css('li.product a')
    if partsList:
        # Sometimes, catalog pages are within others, so we have to recursively
        # go through each page to get to actual parts
        yield from response.follow_all(partsList, self.parse)
    else:
        # We are at an actual part page, with the step file,
        # so we process the actual values scraped from the page
        yield self.parse_product_page(response)
```
Since some of the parts catalogs contain sub-catalogs within them, we have to do some recursion in order to get to each individual part.
Once we are at the product page, we get the name of the part, the SKU number (currently not used for anything), and the `.STEP` file:
```python
def parse_product_page(self, response):
    step_file = response.css('a.ext-zip::attr(href)').get()
    name = response.css('h1.productView-title::text').get()
    if step_file and 'Bundle' not in name:  # bundles will contain repeats, we don't want that
        loader = ItemLoader(Product(), response=response)
        loader.add_css('sku', 'span.productView-sku-input::text')
        loader.add_value('file_urls', [f'https://www.gobilda.com{step_file}'])
        loader.add_value('name', name)
        return loader.load_item()
```
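The `Product` item that the loader populates isn't shown here; a minimal sketch of what it might look like, with field names inferred from the `add_css`/`add_value` calls above (the `files` field is the standard one Scrapy's files pipeline writes download results into):

```python
# items.py (sketch): fields inferred from the loader usage above.
import scrapy

class Product(scrapy.Item):
    name = scrapy.Field()
    sku = scrapy.Field()
    file_urls = scrapy.Field()  # URLs handed to the files pipeline for download
    files = scrapy.Field()      # download metadata the pipeline writes back
```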
After that, it's a relatively simple matter of downloading the zipped `.STEP` file, extracting it, deleting the zip, and renaming the file. One thing to note: the file names don't look exactly like the names displayed on the website, as I had to get rid of colons, spaces, and other forbidden characters to get a clean filename.
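As a rough illustration only (not this repo's actual pipeline code), the unzip-and-rename step might look something like this, assuming each downloaded archive contains a single `.STEP` file:

```python
import re
import zipfile
from pathlib import Path

def clean_filename(name: str) -> str:
    # Strip characters that are forbidden or awkward in filenames
    # (colons, slashes, quotes, etc.) and replace runs of spaces with dashes.
    name = re.sub(r'[<>:"/\\|?*]', '', name)
    return re.sub(r'\s+', '-', name).strip('-')

def extract_step(zip_path: Path, part_name: str) -> Path:
    # Pull the first .STEP file out of the downloaded archive,
    # rename it after the cleaned part name, then delete the zip.
    with zipfile.ZipFile(zip_path) as archive:
        member = next(
            (m for m in archive.namelist() if m.lower().endswith('.step')), None
        )
        if member is None:
            raise ValueError(f'No .STEP file found in {zip_path}')
        extracted = Path(archive.extract(member, zip_path.parent))
    target = zip_path.parent / f'{clean_filename(part_name)}.STEP'
    extracted.rename(target)
    zip_path.unlink()
    return target
```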
Note: When I uploaded the complete parts folder to Fusion 360 (my CAD software of choice), it said 10 out of the 80 models failed to upload. I'm not sure which ones are the culprits, or if it is just Fusion being Fusion, but create an issue or pull request if you've found the problem and/or solution.