
bulk-bing-image-downloader's People

Contributors

farishijazi, mcallagher, nndurj, ostrolucky, sanghoon


bulk-bing-image-downloader's Issues

Using with a CSV file

I want to use the Bulk Bing Image Downloader to download around 100 images per keyword from a CSV list of 800 keywords. The CSV is structured as Product #;query_string (e.g. 149;Nespresso Machine).

I tried to add this snippet to the code:

import csv

def getTitles():
    with open('Coffee.csv') as csvfile:
        Coffee_names = []
        # the file is ';'-delimited per the format above
        csvrows = csv.reader(csvfile, delimiter=';', quotechar='"')
        for row in csvrows:
            Coffee_names.append(row)
        return Coffee_names

New_Array = getTitles()

Do you have any help regarding this?

Many thanks in advance!
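
As a hedged sketch of one approach (not the project's own code): parse the semicolon-delimited file first, then hand each query string to bbid. The sample rows below are invented for illustration:

```python
# Hedged sketch, not the project's own code: parse the ';'-delimited CSV
# described above and collect one query string per row. SAMPLE stands in
# for the real Coffee.csv and its rows are invented.
import csv
import io

SAMPLE = "149;Nespresso Machine\n150;Moka Pot\n"

def get_queries(csvfile):
    # column 0 is the product number, column 1 the query string
    return [row[1] for row in csv.reader(csvfile, delimiter=';') if len(row) > 1]

queries = get_queries(io.StringIO(SAMPLE))
# each query could then be handed to bbid, e.g.
#   subprocess.run(['python', 'bbid.py', '-s', query, '-o', 'out', '--limit', '100'])
```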

Dynamic output Directory

Hi, is it possible to add the domain the image was downloaded from as a subdirectory of the output directory (e.g. c:\dump\domain.com\filename.ext)?
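
A minimal sketch of how such a per-domain subdirectory could be derived with the standard library (this is not part of bbid; the example URL is made up):

```python
# Hedged sketch (not part of bbid): derive a per-domain subdirectory from
# the image URL with the standard library; the example URL is made up.
import os
import urllib.parse

def domain_subdir(output_dir, url):
    # netloc is the host part of the URL, e.g. 'example.com'
    domain = urllib.parse.urlsplit(url).netloc
    return os.path.join(output_dir, domain)

path = domain_subdir('dump', 'https://example.com/img/cat.jpg')
# the downloaded file would then be saved under dump/example.com/
```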

Error: socket.timeout: The read operation timed out

I get an error from the socket when invoking

python bbid.py -s "Hello World" -o Test

Traceback (most recent call last):
  File "bbid.py", line 134, in <module>
    fetch_images_from_keyword(pool_sema, args.search_string,output_dir, args.filters, args.limit)
  File "bbid.py", line 72, in fetch_images_from_keyword
    html = response.read().decode('utf8')
  File "C:\Tools\Anaconda3\envs\github\lib\http\client.py", line 471, in read
    s = self._safe_read(self.length)
  File "C:\Tools\Anaconda3\envs\github\lib\http\client.py", line 612, in _safe_read
    data = self.fp.read(amt)
  File "C:\Tools\Anaconda3\envs\github\lib\socket.py", line 669, in readinto
    return self._sock.recv_into(b)
  File "C:\Tools\Anaconda3\envs\github\lib\ssl.py", line 1241, in recv_into
    return self.read(nbytes, buffer)
  File "C:\Tools\Anaconda3\envs\github\lib\ssl.py", line 1099, in read
    return self._sslobj.read(len, buffer)
socket.timeout: The read operation timed out

My internet connection works, and I changed the user agent, timeout, etc. without effect. Any ideas?
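
One common workaround for transient read timeouts is an explicit timeout plus retries around the fetch. A minimal sketch, not bbid's actual code:

```python
# A minimal retry-with-timeout wrapper around urllib, sketching one way to
# ride out transient read timeouts; this is not bbid's actual code.
import socket
import urllib.request

def fetch(url, retries=3, timeout=10):
    last_err = None
    for _ in range(retries):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.read()
        except (socket.timeout, OSError) as err:
            last_err = err  # remember the failure and try again
    raise last_err
```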

Only around 40 queries could be downloaded

Hi,
I have a query file with a lot of queries, but every time I run it, it only downloads the first ~40 queries. I dug into the code but did not see any such check. Could you please help? Thank you.

The only flags I use are:
--limit=200
and bingcount=35.

Everything else is as is provided.

rename downloaded file by search string

It would be very nice to have an option to rename each downloaded file to the search string. So if the search term is pizza, any resulting images would be renamed to pizza.jpg, pizza1.jpg, etc.

That way if you're using a list it's a lot easier to find the downloaded image for each query.

Doesn't download more than 100 images!!

Hi,

I'm facing some issues here: this script is not downloading more than 100 images. I think that's because no web driver is specified (like the chromedriver we use with Selenium). If you have an updated script, can you please upload it? When I provide an input file, it also doesn't traverse the entire file; at a random point the script exits (without any error).

Thanks man, you saved me a lot of time writing a new script.

Add threading and paging command line option

Currently, to change the paging and the number of threads in bbid, the user needs to look into the source code and modify the values at the top.

It would be nice if options such as --threads=666 and --paging=300 could be passed to the program :).

Anyway, bbid is awesome, and is helping me a lot in my current project. Thanks :)
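
A sketch of how the two values could be exposed as flags with argparse; the option names --threads and --paging follow the suggestion above and are not necessarily bbid's actual CLI:

```python
# Sketch of exposing the two values as command-line flags with argparse.
# The names --threads and --paging follow the suggestion in this issue;
# bbid's real CLI may differ.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--threads', type=int, default=20,
                    help='number of concurrent download threads')
parser.add_argument('--paging', type=int, default=100,
                    help='number of results requested per Bing page')

args = parser.parse_args(['--threads', '666', '--paging', '300'])
```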

AttributeError: module 'signal' has no attribute 'SIGTSTP'

When I run the script I get:

  File "bbid.py", line 107, in <module>
    signal.signal(signal.SIGTSTP, backup_history)
AttributeError: module 'signal' has no attribute 'SIGTSTP'

A search didn't turn up much for this error. Any suggestions?
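
One plausible cause: SIGTSTP is a POSIX-only signal and is absent from the signal module on Windows. A portable guard could look like this sketch (backup_history here is a placeholder for bbid's handler):

```python
# SIGTSTP is a POSIX-only signal and does not exist in the signal module
# on Windows, one plausible cause of this AttributeError. A portable guard:
import signal

def backup_history(signum=None, frame=None):
    pass  # placeholder for bbid's actual history-saving handler

if hasattr(signal, 'SIGTSTP'):  # only register where the signal exists
    signal.signal(signal.SIGTSTP, backup_history)
```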

Duplicate Downloads

If I run the script with the same search string twice or more, it downloads the files again and again, appending a digit to the file name as the author intended.
It would be nice to avoid such copies: if the file is already present in the same directory and the URL is in tried_urls, we should not download the file again.

Update

New search string for line 80:
Replace:
links = re.findall('imgurl:&quot;(.*?)&quot;',html)
with:
links = re.findall('murl&quot;:&quot;(.*?)&quot;',html)
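
As a quick sanity check, the new pattern can be exercised against an invented fragment in the escaped-JSON shape it targets; this is not real Bing markup:

```python
# Quick sanity check of the replacement pattern against an invented
# fragment in the escaped-JSON shape it targets (not real Bing output).
import re

html = 'murl&quot;:&quot;https://example.com/a.jpg&quot;'
links = re.findall('murl&quot;:&quot;(.*?)&quot;', html)
```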

Keywords with spaces in between

e.g. bbid.py -s hello world is not working, whereas bbid.py -s hello+world works.

I tried to fix it using nargs="+", but in vain. Could you please look into it?
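
With nargs='+', argparse returns a list of tokens, so the tokens still need to be joined back into a single query string. A sketch (the flag name is assumed, not taken from bbid's source):

```python
# With nargs='+', argparse returns a list of tokens, so they still need
# to be joined back into one query string. The flag name is assumed.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('-s', '--search-string', nargs='+')

args = parser.parse_args(['-s', 'hello', 'world'])
query = ' '.join(args.search_string)
```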

Program not working any more?

When I run the program it gives me this error:

Traceback (most recent call last):
  File "./bbid.py", line 118, in <module>
    fetch_images_from_keyword(args.search_string,output_dir)
  File "./bbid.py", line 60, in fetch_images_from_keyword
    response=urllib.request.urlopen(request)
  File "/usr/lib/python3.4/urllib/request.py", line 161, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.4/urllib/request.py", line 469, in open
    response = meth(req, response)
  File "/usr/lib/python3.4/urllib/request.py", line 579, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python3.4/urllib/request.py", line 507, in error
    return self._call_chain(*args)
  File "/usr/lib/python3.4/urllib/request.py", line 441, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.4/urllib/request.py", line 587, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 404: Not Found

What's the problem?

Ceases to work when duplicate image is found

The script fails when trying to write a duplicate filename. Here is the error:

FAIL Image is a duplicate of 400.jpg, not saving 2100.jpg
history_dumped
history_dumped
Traceback (most recent call last):
  File "./bbid.py", line 152, in <module>
    if fetch_images_from_keyword(keyword,output_dir):
  File "./bbid.py", line 85, in fetch_images_from_keyword
    t.start()
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/threading.py", line 844, in start
    _start_new_thread(self._bootstrap, ())
RuntimeError: can't start new thread

Just doesn't work...at all

"Provide Either search string or path to file containing search strings" is the error I get when I enter a valid command such as bbid.py -s "hello world"

Maybe I'm using it wrong...
There should be a guide to each of the options in this software, because the usage of the output option isn't clear and I don't know what -h does.
I assume --output lets me name the folder the images go into, but I can't tell, because it won't download anything.

EDIT: After reading the code, I have concluded that -h is for "help". However, typing bbid.py -h or bbid.py -help gives me an error.

download more pic

I want to know how I can download more pictures.
Every time I run it, I only get about 130 pictures. What if I want more, say 500 pictures?
How can I do that?

error: [Errno 2] No such file or directory: 'build/scripts-3.8/__init__.py'

I tried to install this software but it failed.

My environment:

$ lsb_release --all
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 20.04.4 LTS
Release:	20.04
Codename:	focal
$ python --version
Python 3.8.10
$ pip --version
pip 20.0.2 from /usr/lib/python3/dist-packages/pip (python 3.8)

The complete output:

$ pip install git+https://github.com/FarisHijazi/Bulk-Bing-Image-downloader
Collecting git+https://github.com/FarisHijazi/Bulk-Bing-Image-downloader
  Cloning https://github.com/FarisHijazi/Bulk-Bing-Image-downloader to /tmp/pip-req-build-rrdmrfo4
  Running command git clone -q https://github.com/FarisHijazi/Bulk-Bing-Image-downloader /tmp/pip-req-build-rrdmrfo4
Building wheels for collected packages: bbid
  Building wheel for bbid (setup.py) ... error
  ERROR: Command errored out with exit status 1:
   command: /usr/bin/python3 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-req-build-rrdmrfo4/setup.py'"'"'; __file__='"'"'/tmp/pip-req-build-rrdmrfo4/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' bdist_wheel -d /tmp/pip-wheel-4pxc847h
       cwd: /tmp/pip-req-build-rrdmrfo4/
  Complete output (17 lines):
  running bdist_wheel
  running build
  running build_py
  creating build
  creating build/lib
  creating build/lib/bbid
  copying bbid/bbid.py -> build/lib/bbid
  copying bbid/__init__.py -> build/lib/bbid
  running build_scripts
  creating build/scripts-3.8
  copying setup.py -> build/scripts-3.8
  copying and adjusting bbid/bbid.py -> build/scripts-3.8
  warning: build_scripts: bbid/__init__.py is an empty file (skipping)
  
  changing mode of build/scripts-3.8/setup.py from 664 to 775
  changing mode of build/scripts-3.8/bbid.py from 664 to 775
  error: [Errno 2] No such file or directory: 'build/scripts-3.8/__init__.py'
  ----------------------------------------
  ERROR: Failed building wheel for bbid
  Running setup.py clean for bbid
Failed to build bbid
Installing collected packages: bbid
    Running setup.py install for bbid ... error
    ERROR: Command errored out with exit status 1:
     command: /usr/bin/python3 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-req-build-rrdmrfo4/setup.py'"'"'; __file__='"'"'/tmp/pip-req-build-rrdmrfo4/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /tmp/pip-record-8fencmsi/install-record.txt --single-version-externally-managed --user --prefix= --compile --install-headers /home/alexis/.local/include/python3.8/bbid
         cwd: /tmp/pip-req-build-rrdmrfo4/
    Complete output (17 lines):
    running install
    running build
    running build_py
    creating build
    creating build/lib
    creating build/lib/bbid
    copying bbid/bbid.py -> build/lib/bbid
    copying bbid/__init__.py -> build/lib/bbid
    running build_scripts
    creating build/scripts-3.8
    copying setup.py -> build/scripts-3.8
    copying and adjusting bbid/bbid.py -> build/scripts-3.8
    warning: build_scripts: bbid/__init__.py is an empty file (skipping)
    
    changing mode of build/scripts-3.8/setup.py from 664 to 775
    changing mode of build/scripts-3.8/bbid.py from 664 to 775
    error: [Errno 2] No such file or directory: 'build/scripts-3.8/__init__.py'
    ----------------------------------------
ERROR: Command errored out with exit status 1: /usr/bin/python3 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-req-build-rrdmrfo4/setup.py'"'"'; __file__='"'"'/tmp/pip-req-build-rrdmrfo4/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /tmp/pip-record-8fencmsi/install-record.txt --single-version-externally-managed --user --prefix= --compile --install-headers /home/alexis/.local/include/python3.8/bbid Check the logs for full command output.

File mode weirdness

I am using this command on Python 3.9.1:
python bbid.py -f words.txt -o downloaded --limit 5

It successfully downloads quite a few images for the first term in the file and then does not download anything after that. It just keeps printing "history dumped" forever.

Here is my words.txt: https://gist.github.com/simpleauthority/badd0aeff61a75a8e0d895e9eafb1c6b

Is there any extra information I can provide you? I've used this before, and it worked, but now it is not working.

Unable to bulk download from Win10

C:\Users\john\projects\Bulk-Bing-Image-downloader>bbid -o TestRun --limit 5 -s "Happy Clown"
Traceback (most recent call last):
  File "C:\Python310\lib\urllib\request.py", line 1348, in do_open
    h.request(req.get_method(), req.selector, req.data, headers,
  File "C:\Python310\lib\http\client.py", line 1276, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "C:\Python310\lib\http\client.py", line 1322, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "C:\Python310\lib\http\client.py", line 1271, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "C:\Python310\lib\http\client.py", line 1031, in _send_output
    self.send(msg)
  File "C:\Python310\lib\http\client.py", line 969, in send
    self.connect()
  File "C:\Python310\lib\http\client.py", line 1448, in connect
    self.sock = self._context.wrap_socket(self.sock,
  File "C:\Python310\lib\ssl.py", line 512, in wrap_socket
    return self.sslsocket_class._create(
  File "C:\Python310\lib\ssl.py", line 1070, in _create
    self.do_handshake()
  File "C:\Python310\lib\ssl.py", line 1341, in do_handshake
    self._sslobj.do_handshake()
ConnectionResetError: [WinError 10054] An existing connection was forcibly closed by the remote host

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\john\projects\Bulk-Bing-Image-downloader\bbid.py", line 159, in <module>
    fetch_images_from_keyword(pool_sema, img_sema, args.search_string, output_dir, args.filters, args.limit)
  File "C:\Users\john\projects\Bulk-Bing-Image-downloader\bbid.py", line 95, in fetch_images_from_keyword
    response = urllib.request.urlopen(request)
  File "C:\Python310\lib\urllib\request.py", line 216, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Python310\lib\urllib\request.py", line 519, in open
    response = self._open(req, data)
  File "C:\Python310\lib\urllib\request.py", line 536, in _open
    result = self._call_chain(self.handle_open, protocol, protocol +
  File "C:\Python310\lib\urllib\request.py", line 496, in _call_chain
    result = func(*args)
  File "C:\Python310\lib\urllib\request.py", line 1391, in https_open
    return self.do_open(http.client.HTTPSConnection, req,
  File "C:\Python310\lib\urllib\request.py", line 1351, in do_open
    raise URLError(err)
urllib.error.URLError: <urlopen error [WinError 10054] An existing connection was forcibly closed by the remote host>

C:\Users\john\projects\Bulk-Bing-Image-downloader>

It's not really clear what's wrong here. I'm able to use Bing Images in a browser, and there's no change if I use python to invoke the script per the directions.

Different performance on different computers

Hi, I've met a problem: on my Mac, bbid can download over 500 images at a time, while on a remote Ubuntu node it can only download fewer than 100 images, with the same query and parameters. Could someone explain this?

Thanks in advance!

limited image download

It downloads only a limited number of images, around 30.

Commands I use:

python3 bbid.py "yoga" -o "yoga"
python3 bbid.py "yoga" -o "yoga" --limit 100

I ran these multiple times, but it downloads only 27 images.


input file format

Hi, I was trying to use "-f" to import multiple search strings, but failed with several file formats.
I've tried csv, txt, and json; none of them worked. Could you give an example of the expected file format?
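
Judging from other issues in this thread (words.txt, classes.txt), -f most likely expects a plain text file with one search string per line; that is an assumption, not confirmed documentation. A minimal way to generate such a file:

```python
# Assumption based on other issues in this thread (words.txt, classes.txt):
# -f most likely expects a plain text file with one search string per line.
queries = ['Nespresso Machine', 'Moka Pot', 'French Press']
with open('queries.txt', 'w') as f:
    f.write('\n'.join(queries) + '\n')
# then: python bbid.py -f queries.txt -o output
```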

Unpredictable what number of images will be scraped

Hi, thanks for your Git repo, awesome tool. However, I'm facing some issues when scraping images.
If I use the same command, the number of images that gets downloaded varies every time. For some objects I get the number of pictures I want, but on average only around 500-600 when I ask for 750 images.
For instance, I want to scrape 750 images of 'bulldozers', so I'm using the following command (the filter is because I'm filtering on 'photographs' in Bing):

python bbid.py -s "buldozer" -o ./images_photo/bulldozer --limit 750 --filter +filterui:photo-photo&form=IRFLTR&first=1&tsc=ImageBasicHover

However, it only gives me 355 images.
It varies per object and per execution how many pictures are returned; do you have an explanation for that? I already tried playing with the 'sleep_time'. I'm using the Windows 10 operating system.
Thanks!

The value of user-agent affects search results

I found something interesting.

Sometimes, the downloader gave me fewer images than the search results on my web browser.
While comparing the difference between the two requests,
I found that modifying user-agent results in more search results.

Here is an example.

Run with the default user-agent (Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:94.0) Gecko/20100101 Firefox/94.0)

  • Finished w/ less than 20 search results
$ python bbid/bbid.py smiley -o output_smiley --filters +filterui:face-face
{'search_string': ['smiley'], 'search_file': False, 'output': 'output_smiley', 'adult_filter_off': False, 'filters': '+filterui:face-face', 'limit': None, 'threads': 20}
 OK : Man_Smiling_Emoji_Icon_ios10_grande.png
...
SKIP: Image is a duplicate of Nose-Piercing-60-650x650.jpg, not saving Nose-Piercing-60-650x650.jpg

$ ls output_smiley | wc -l                                      [21:51:35]
       9

Run with another user-agent (Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36 Edge/17.17134)

  • 318 search results
  • The results seem to be similar to the results on a web browser.
$ python bbid/bbid.py smiley -o output_smiley2 --filters +filterui:face-face
{'search_string': ['smiley'], 'search_file': False, 'output': 'output_smiley2', 'adult_filter_off': False, 'filters': '+filterui:face-face', 'limit': None, 'threads': 20}
 OK : smiley-emoticon-cartoon-with-v-sign-.jpg
...
 OK : 3595196bf47d96d3411a4faedc94a9cf.jpg

$ ls output_smiley2 | wc -l                                     [21:51:44]
     215

I'm not sure whether this behavior is dependent solely on a user-agent value or not.
(There might be something complicated I'm not aware of.)

Personally, I changed the user-agent value to the latter one, and it seems to work fine for now.
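
For reference, this is roughly how a custom User-Agent is attached to a urllib request (the header value is the one reported to work above; the URL is a placeholder):

```python
# Attaching a custom User-Agent to a urllib request. The header value is
# the one reported to work in this issue; the URL is a placeholder.
import urllib.request

ua = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
      '(KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36 Edge/17.17134')
request = urllib.request.Request('https://www.bing.com/images/async',
                                 headers={'User-Agent': ua})
```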

Error In Using Script

Hi there,

My issue is the same as this: #22
I've even changed IDE, and also tried reinstalling, but with the same outcome.

Here's what happened (generally):

  1. Opened bbid.py with IDLE IDE
  2. Ran it, and this showed up:
Python 3.8.1 (tags/v3.8.1:1b293b6, Dec 18 2019, 22:39:24) [MSC v.1916 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license()" for more information.
>>> 
= RESTART: C:\<directory path here>\bbid.py
usage: bbid.py [-h] [-s SEARCH_STRING] [-f SEARCH_FILE] [-o OUTPUT]
               [--adult-filter-on] [--adult-filter-off] [--filters FILTERS]
               [--limit LIMIT] [--threads THREADS]
bbid.py: error: Provide Either search string or path to file containing search strings
  3. Then tried these two:
>>> bbid.py -h
Traceback (most recent call last):
  File "<pyshell#0>", line 1, in <module>
    bbid.py -h
NameError: name 'bbid' is not defined
>>> bbid.py -s "dog"
SyntaxError: invalid syntax

Can you please help identify the cause?

Having the same IP to scrape consistent images

Thank you for the amazing code -- I'm testing the code and realized that the scraped image results differ from the results if I were to search the keyword directly on the bing website.

Is there a way to add some parameters in bbid.py so that the image results are consistent with what I'm actually seeing on a website from my IP?

I reckon this is more of a Bing issue, but figured I'd still ask for advice. Thanks!

Target file name

Thank you for your code! It's very useful!! Can you add an option to the "download" function, newFileName, defaulting to the keyword if null...

Here is the modified function:


def download(pool_sema: threading.Semaphore, url: str, output_dir: str, newname: str):
    global in_progress

    if url in tried_urls:
        return
    pool_sema.acquire()
    in_progress += 1
    path = urllib.parse.urlsplit(url).path
    filename = posixpath.basename(path).split('?')[0]
    name, ext = os.path.splitext(filename)
    name = newname
    filename = name + ext

    try:
        request = urllib.request.Request(url, None, urlopenheader)
        image = urllib.request.urlopen(request).read()
        if not imghdr.what(None, image):
            print('Invalid image, not saving ' + filename)
            return

        md5_key = hashlib.md5(image).hexdigest()
        if md5_key in image_md5s:
            print('Image is a duplicate of ' + image_md5s[md5_key] + ', not saving ' + filename)
            return

        i = 0
        while os.path.exists(os.path.join(output_dir, filename)):
            if hashlib.md5(open(os.path.join(output_dir, filename), 'rb').read()).hexdigest() == md5_key:
                print('Already downloaded ' + filename + ', not saving')
                return
            i += 1
            filename = "%s-%d%s" % (name, i, ext)

        image_md5s[md5_key] = filename
        imagefile = open(os.path.join(output_dir, filename), 'wb')
        imagefile.write(image)
        imagefile.close()
        print("OK: " + newname)
        tried_urls.append(url)
    except Exception:
        print("FAIL: " + filename)
    finally:
        pool_sema.release()
        in_progress -= 1

BUT the name of the new file has a space before the extension, so it looks like "image .jpg".

Can you correct that code? :) I'm using Python for the first time ;) trying to forget PHP ;)

problem

Provide Either search string or path to file containing search strings

Could you please set a limit per keyword?

Hi,
Hope you are doing well.
I have imported a keywords file and set a limit of 10 images per keyword, but the limit applies across all keywords overall. Could you please make this limit per keyword?

Cannot Download More Images

I have followed the issues above, but I can still only download fewer than 100 images. I have tried changing values such as time.sleep(50) and time.sleep(0.5), and more.

--limit combined with txt file with objects

Hi,

I'm trying to download 100 images for a bunch of object classes. I made a txt file 'classes.txt' with one object on each line. When I type 'python bbid.py -f classes.txt' on the command line, the code works fine. However, when I type 'python bbid.py -f classes.txt --limit 100', I only get 100 images of the first object in the list, and the code stops after that. I hope you can help me, thanks!

Bob
