
robobrowser's Introduction

RoboBrowser: Your friendly neighborhood web scraper

https://badge.fury.io/py/robobrowser.png https://travis-ci.org/jmcarp/robobrowser.png?branch=master https://coveralls.io/repos/jmcarp/robobrowser/badge.png?branch=master

Homepage: http://robobrowser.readthedocs.org/

RoboBrowser is a simple, Pythonic library for browsing the web without a standalone web browser. RoboBrowser can fetch a page, click on links and buttons, and fill out and submit forms. If you need to interact with web services that don't have APIs, RoboBrowser can help.

import re
from robobrowser import RoboBrowser

# Browse to Genius
browser = RoboBrowser(history=True)
browser.open('http://genius.com/')

# Search for Porcupine Tree
form = browser.get_form(action='/search')
form                # <RoboForm q=>
form['q'].value = 'porcupine tree'
browser.submit_form(form)

# Look up the first song
songs = browser.select('.song_link')
browser.follow_link(songs[0])
lyrics = browser.select('.lyrics')
lyrics[0].text      # \nHear the sound of music ...

# Back to results page
browser.back()

# Look up my favorite song
song_link = browser.get_link('trains')
browser.follow_link(song_link)

# Can also search HTML using regex patterns
lyrics = browser.find(class_=re.compile(r'\blyrics\b'))
lyrics.text         # \nTrain set and match spied under the blind...

RoboBrowser combines the best of two excellent Python libraries: Requests and BeautifulSoup. RoboBrowser represents browser sessions using Requests and HTML responses using BeautifulSoup, transparently exposing methods of both libraries:

import re
from robobrowser import RoboBrowser

browser = RoboBrowser(user_agent='a python robot')
browser.open('https://github.com/')

# Inspect the browser session
browser.session.cookies['_gh_sess']         # BAh7Bzo...
browser.session.headers['User-Agent']       # a python robot

# Search the parsed HTML
browser.select('div.teaser-icon')       # [<div class="teaser-icon">
                                        # <span class="mega-octicon octicon-checklist"></span>
                                        # </div>,
                                        # ...
browser.find(class_=re.compile(r'column', re.I))    # <div class="one-third column">
                                                    # <div class="teaser-icon">
                                                    # <span class="mega-octicon octicon-checklist"></span>
                                                    # ...

You can also pass a custom Session instance for lower-level configuration:

from requests import Session
from robobrowser import RoboBrowser

session = Session()
session.verify = False  # Skip SSL verification
session.proxies = {'http': 'http://custom.proxy.com/'}  # Set default proxies
browser = RoboBrowser(session=session)

RoboBrowser also includes tools for working with forms, inspired by WebTest and Mechanize.

from robobrowser import RoboBrowser

browser = RoboBrowser()
browser.open('http://twitter.com')

# Get the signup form
signup_form = browser.get_form(class_='signup')
signup_form         # <RoboForm user[name]=, user[email]=, ...

# Inspect its values
signup_form['authenticity_token'].value     # 6d03597 ...

# Fill it out
signup_form['user[name]'].value = 'python-robot'
signup_form['user[user_password]'].value = 'secret'

# Submit the form
browser.submit_form(signup_form)

Checkboxes:

from robobrowser import RoboBrowser

# Browse to a page with checkbox inputs
browser = RoboBrowser()
browser.open('http://www.w3schools.com/html/html_forms.asp')

# Find the form
form = browser.get_forms()[3]
form                            # <RoboForm vehicle=[]>
form['vehicle']                 # <robobrowser.forms.fields.Checkbox...>

# Checked values can be read and set like lists
form['vehicle'].options         # [u'Bike', u'Car']
form['vehicle'].value           # []
form['vehicle'].value = ['Bike']
form['vehicle'].value = ['Bike', 'Car']

# Values can also be set using input labels
form['vehicle'].labels          # [u'I have a bike', u'I have a car \r\n']
form['vehicle'].value = ['I have a bike']
form['vehicle'].value           # [u'Bike']

# Only values that correspond to checkbox values or labels can be set;
# this will raise a `ValueError`
form['vehicle'].value = ['Hot Dogs']

Uploading files:

from robobrowser import RoboBrowser

# Browse to a page with an upload form
browser = RoboBrowser()
browser.open('http://cgi-lib.berkeley.edu/ex/fup.html')

# Find the form
upload_form = browser.get_form()
upload_form                     # <RoboForm upfile=, note=>

# Choose a file to upload
upload_form['upfile']           # <robobrowser.forms.fields.FileInput...>
upload_form['upfile'].value = open('path/to/file.txt', 'rb')  # open in binary mode

# Submit
browser.submit_form(upload_form)

By default, creating a browser instantiates a new requests Session.

Requirements

  • Python >= 2.6 or >= 3.3

License

MIT licensed. See the bundled LICENSE file for more details.

robobrowser's People

Contributors

jamesmeneghello, jmcarp, mattdbr, mlitvk, pratyushmittal, rcutmore, sfall, stuntspt, voyageur



robobrowser's Issues

Disabled fields should be enableable

For forms where some fields are initially disabled but may be enabled later through JavaScript, robobrowser seems unable to replicate that behaviour: disabled is a read-only @property on the form field, so disabled fields cannot be enabled programmatically.

It's possible to work around this by adding a new field (form.add_field(Input('<input name=".." value="..">'))), but this seems rather clunky. It'd be better to be able to do something like this:

form['field'].disabled = False

I cannot log in to Facebook with RoboBrowser?

Could some one help me? Thank you!
from robobrowser import RoboBrowser

browser = RoboBrowser(user_agent='Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6')
browser.open('https://www.facebook.com/')

# Get the login form

form = browser.get_form(id='login_form')
form['email'].value = 'my_user'
form['pass'].value = 'my_pass'
browser.submit_form(form)
Error:
requests.exceptions.ConnectionError: ('Connection aborted.', ResponseNotReady('Request-sent',))

Add support for manipulating forms

Being able to add, remove, and edit the fields in the form object would be nice.
I'm currently trying to scrape a website where the form has two fields with the same name and one of them needs to be deleted. Are there already methods for that?

Add support for dynamically added form fields

Hi,

I'm trying to use a file-upload form, but unfortunately there is a field that gets added by JS before submitting.
Is there a way to simply add fields, or to disable the checks that raise the KeyError?

Forms and JavaScript/AJAX do not work?

Hi,

Nice work.
However, I am having some issues with certain forms and AJAX requests.
Can you confirm that it does not work if a form is submitted via JavaScript and the results are displayed on the same page via AJAX?

Thanks
Michael

How to press a button on webpages

I need to click a button on a webpage (no form). Is there a way to accomplish this?

This is a piece of the page:
<div class="alert alert-danger media_size">
<a href="verify.html?programmer_id=c1b7&ses=949cc7646c425c788d3f210c5c889b22" class="btn btn-primary">Verify now</a>
...

How do I tick/select a check box?

>>> url = 'https://bitbucket.org/repo/import'
>>> browser.open(url)
>>> import_form = browser.get_form(id='import-form')
>>> import_form
<RoboForm source_scm=, source=source-git, goog_project_name=, goog_scm=svn, sourceforge_project_name=, sourceforge_mount_point=, sourceforge_scm=svn, codeplex_project_name=, codeplex_scm=svn, url=, auth=[], username=, password=, owner=2039394, name=, description=, is_private=[None], forking=no_public_forks, no_forks=, no_public_forks=True, scm=git, has_issues=[], has_wiki=[], language=, csrfmiddlewaretoken=icwpCBLZdWAPht1rmnACawMHcYwtorNA>
>>> type(import_form['auth'])
<class 'robobrowser.forms.fields.Checkbox'>

The 'auth' field is a checkbox. How do I set it to checked? I couldn't find the required info in the documentation, so I'm asking here. Thank you!

Javascript support

I know this question isn't specifically related to robobrowser, but I'm using Joshua Carp's awesome RoboBrowser Python package. I was hoping to contact him directly, but I'm not sure whether that's possible on GitHub.

Anyway, I'm looking for a pure Python 3 headless JavaScript browser package and I can't seem to find any. All the ones I've researched have underlying dependencies, which makes them less appealing.

I'm relatively new to Python, and I'm using Python 3.x. Some of my code may run on a NAS server that has a Python 3.5 distribution but uses its own proprietary 64-bit Linux (specifically a Synology 716+ NAS), or as a standalone application.

I really want to use something that's self contained, isn't doing any RPC based controlling of a headless browser either, etc.

I can't find anything and I'm thinking nothing exists. Does anyone know of one?

Thanks

Fails on UTF

robobrowser is crashing on this value in the form.

<input type="hidden" id="_utf8" name="_utf8" value="☃">
>>> form
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/site-packages/robobrowser/forms/form.py", line 200, in __repr__
    for name, field in self.fields.items(multi=True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2603' in position 0: ordinal not in range(128)

Multiple submit..

Hello. I am using RoboBrowser from my Odoo instance to connect to an external platform that approves products for sale on the marketplace. The platform has a form with a button that fills a table with the products to be registered, and a separate general submit button used after the table is filled. The problem is with the button that fills the table.

When I do:
browser = RoboBrowser(history=True)
browser.open('https://....../')
form = browser.get_form(action='/....')
the form object contains all the fields, including the field where I enter the product name. But when I submit the form with:
browser.submit_form(form)
every field is recorded except the product, which is only applied when I click the Add Product button.
The submit_fields property of the form is an empty list, so I cannot use:
browser.submit_form(form, submit=submit)
Any proposal???

help signing into dropbox.com

Dear all,

I get a 403 response from submit_form when trying to sign into https://www.dropbox.com/login.

I adapted the third example from http://robobrowser.readthedocs.org/en/latest/readme.html (working with forms)

Any ideas on what I missed would be greatly appreciated.

Arye.

from robobrowser import RoboBrowser


browser = RoboBrowser(history=True, user_agent='RoboBrowser python robot')
browser.open('https://www.dropbox.com/login')
print browser.response

# Get the signup form
signup_form = browser.get_form(class_='login-form')

# Fill it out
signup_form['login_email'].value = 'user@example'
signup_form['login_password'].value = 'secret'

# Serialize it to JSON; is this really necessary?
# signup_form.serialize()

# And submit
browser.submit_form(signup_form)
print browser.response

InsecurePlatformWarning

I am logging into an HTTPS site, and robobrowser is working great. However, every time my script runs I am getting an 'InsecurePlatformWarning'. When I use my normal browser to log into this site, I am not getting any warnings at all. For reference the site I am logging into is http://didlogic.com

This is the warning:

/Library/Python/2.7/site-packages/requests-2.7.0-py2.7.egg/requests/packages/urllib3/util/ssl_.py:90: InsecurePlatformWarning: A true SSLContext object is not available. This prevents urllib3 from configuring SSL appropriately and may cause certain SSL connections to fail. For more information, see https://urllib3.readthedocs.org/en/latest/security.html#insecureplatformwarning.
InsecurePlatformWarning

While it's not impacting the script, it is somewhat annoying to see. A workaround I have found (via the URL in the warning) is to use the logging module to capture these warnings:

import logging
logging.captureWarnings(True)

This stops the warning from being displayed on screen and instead re-routes it to the logging module (where it is not being recorded anywhere).
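A narrower alternative (a sketch; disable_warnings is a real urllib3 helper, though whether to silence security warnings at all is a judgment call) is to suppress only urllib3's warnings:

```python
import urllib3

# Silence urllib3's HTTPWarning family (which includes
# InsecurePlatformWarning) instead of rerouting all warnings to logging.
# The real fix on Python 2 is `pip install requests[security]`, which
# installs pyOpenSSL and makes a proper SSLContext available.
urllib3.disable_warnings()
```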

Thanks,

Suggested Changes To Rap.Genius Example

The target website appears to have been changed, so the first example no longer works.
The following changes bring it back into line:

import re
from robobrowser import RoboBrowser
browser = RoboBrowser(history=True)
browser.open('http://genius.com/')
form = browser.get_form(action='/search')
form['q'].value = 'queen'
browser.submit_form(form)

songs = browser.select('.song_title')
browser.follow_link(songs[1]) # If you want Bohemian Rhapsody
lyrics = browser.select('.lyrics')
lyrics[0].text # \n[Intro]\nIs this the real life ...
browser.back()

Regards
Trevor

It's not an issue, but a suggestion:

Is there a way to introduce a "pyautogui"-like function? I want to scrape data that is loaded dynamically during navigation on a Google Maps service, and I haven't found a way to do that yet...
(Does someone have a clue for me, or a solution?)

forms dependency

I just installed robobrowser and get this message when using browser.get_form:

No module named forms.form

I have requests and bs4 installed.

Multiple submit buttons in forms

If I have the following HTML:

<form name="input" action="demo_form_action.asp" method="get">
Username: <input type="text" name="user">
<input type="submit" value="Action1" name="action1_name">
<input type="submit" value="Action2" name="action2_name">
</form> 

When I click the Action1 button in my browser (Firefox or Chrome), the following URL gets sent http://localhost:8001/demo_form_action.asp?user=asfdasd&action1_name=Action1. And a different URL gets sent for Action2: http://localhost:8001/demo_form_action.asp?user=asfdasd&action2_name=Action2. Using the submit_form method in robobrowser, puts the actions for both buttons in the request: http://localhost:8001/demo_form_action.asp?user=asdsafd&action2_name=Action2&action1_name=Action1.

I guess I can model the browser button pressing behaviour by deleting the buttons I do not want pressed from the robobrowser form object. However, it would be nice if the API for submit_form could be extended to include the button being pressed to submit the form.

InvalidNameError - Input field does not accept tag string

I found that robobrowser can add form field dynamically like this:

browser = RoboBrowser(history=True)
myform = browser.get_form('myform2')
new_field = Input('<input name="myname" value="" />')
myform.add_field( new_field )

But an InvalidNameError occurred in the Input class.

RoboBrowser's Input field raises InvalidNameError

Python 2.7.6 (default, Mar 22 2014, 22:59:56) 
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.

>>> from robobrowser.forms.fields import Input
>>> Input('<input name="myfield" value=""/>')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/robobrowser/forms/fields.py", line 72, in __init__
    super(Input, self).__init__(parsed)
  File "/usr/local/lib/python2.7/dist-packages/robobrowser/forms/fields.py", line 46, in __init__
    self.name = self._get_name(self._parsed)
  File "/usr/local/lib/python2.7/dist-packages/robobrowser/forms/fields.py", line 52, in _get_name
    raise exceptions.InvalidNameError
robobrowser.exceptions.InvalidNameError
>>> 

Packages version

$ pip search robobrowser
robobrowser               - Your friendly neighborhood web scraper
  INSTALLED: 0.5.1 (latest)
$ pip search beautifulsoup
beautifulsoup4            - Screen-scraping library
  INSTALLED: 4.3.2 (latest)
BeautifulSoup             - HTML/XML parser for quick-turnaround applications like screen-scraping.
  INSTALLED: 3.2.1 (latest)

Simple AJAX request

Hello. I'm trying to emulate a simple AJAX request.

Scheme:
[screenshot of the AJAX request omitted]

It must return JSON:
{phone: "8 xxx xxx-xx-xx"}

But it returns None for me.
I suspect the cause is cookies. Can someone help me with it?

# -*- coding: utf-8 -*-
import requests
import os, re, json, csv, sys
from robobrowser import RoboBrowser

class Aggregator(object):

    def __init__(self, config):
        main_url, output_file = [config.get(k) for k in sorted(config.keys())]
        self.main_url = main_url
        self.output_file = output_file

    def start_process(self):

        work_url = "https://m.avito.ru/sankt-peterburg/predlozheniya_uslug/almaznoe_burenie_almaznaya_rezka_usilenie_79225740"

        session = requests.Session()
        session.headers.update({
            ':host': 'm.avito.ru',
            ':method': 'GET',
            ':path': '/sankt-peterburg/predlozheniya_uslug/almaznoe_burenie_almaznaya_rezka_usilenie_79225740',
            ':scheme': 'https',
            ':version': 'HTTP/1.1',
            'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
            'accept-encoding': 'gzip, deflate, sdch',
            'accept-language': 'ru-RU,ru;q=0.8,en-US;q=0.6,en;q=0.4',
            'cache-control': 'no-cache',
            'pragma': 'no-cache',
            'upgrade-insecure-requests': '1',
            'user-agent': 'Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.71 Safari/537.36',
            'x-compress': 'null',
        })
        browser = RoboBrowser(session=session, history=True, parser='lxml')

        browser.open(work_url)

        phone_link = browser.find('a', {"class": "action-show-number"}).attrs['href'] + '?async'
        browser.session.headers[':path'] = phone_link
        browser.session.headers['accept'] = 'application/json, text/javascript, */*; q=0.01'
        browser.session.headers['referer'] = work_url
        browser.session.headers['upgrade-insecure-requests'] = ''
        browser.session.headers['x-requested-with'] = 'XMLHttpRequest'
        phone = browser.open(self.main_url + phone_link)

        print(self.main_url + phone_link, phone)
        print(browser.session.headers)
        print(browser.session.cookies)

if __name__ == '__main__':
    settings = { 'main_url': 'https://m.avito.ru', 'output_file': 'output.csv' }
    aggregator = Aggregator(settings)
    aggregator.start_process()

TypeError: __init__() got an unexpected keyword argument 'headers'

This is what I get when I run this within a Flask app like this:

from flask import Flask
app = Flask(__name__)

from robobrowser import RoboBrowser

@app.route("/")
def hello():

    browser = RoboBrowser(headers={'User-Agent': 'a python robot'})
    browser.open('http://google.com/')

    return 'Hello'

if __name__ == "__main__":
    app.run(host="localhost", debug=True)

What could be causing this?

I installed RoboBrowser in a virtualenv using pip 7.1.2, python 2.7.10, Windows 10

Checkboxes with the same name will not be grouped if other inputs exist between them

I found that checkboxes with the same name will not be grouped if other inputs exist between them.

from bs4 import BeautifulSoup
from robobrowser.forms.form import _parse_fields

html = '''
            <input type="checkbox" name="member" value="mercury" checked />vocals<br />
            <input type="checkbox" name="member" value="may" />guitar<br />
            <input type="text" /><br />
            <input type="checkbox" name="member" value="taylor" />drums<br />
            <input type="checkbox" name="member" value="deacon" checked />bass<br />
        '''
_fields = _parse_fields(BeautifulSoup(html, 'html.parser'))
for cbx in _fields:
    print(cbx.name, cbx.options)

which output

member ['mercury', 'may']
member ['taylor', 'deacon']

As seen, two robobrowser.forms.fields.Checkbox instances were created, with the same name but different options. I expected one instance with four options.

Maybe it's a bug? I have no idea.

relevant code

def _group_flat_tags(tag, tags):
    """Extract tags sharing the same name as the provided tag. Used to collect
    options for radio and checkbox inputs.
    :param Tag tag: BeautifulSoup tag
    :param list tags: List of tags
    :return: List of matching tags
    """
    grouped = [tag]
    name = tag.get('name', '').lower()
    while tags and tags[0].get('name', '').lower() == name: # <----  HERE
        grouped.append(tags.pop(0))
    return grouped

Installing via pip fails

Installing via pip produces an error. I'm using python 3.4.

TypeError: parse_requirements() missing 1 required keyword argument: 'session'
Complete output from command python setup.py egg_info:
Traceback (most recent call last):

  File "<string>", line 20, in <module>

  File "/.../.../robobrowser/setup.py", line 38, in <module>

    for requirement in parse_requirements('requirements.txt')

  File "/.../.../robobrowser/setup.py", line 37, in <listcomp>

    str(requirement.req)

  File "/.../.../python3.4/site-packages/pip/req/req_file.py", line 19, in parse_requirements

    "parse_requirements() missing 1 required keyword argument: "

TypeError: parse_requirements() missing 1 required keyword argument: 'session'

----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /.../.../robobrowser

multiplier defaults to None; 0 would be better

Sorry, I don't understand English well.
If you set the tries parameter, the multiplier parameter cannot be left as None (its default); defaulting it to 0 would be better.

error code :
import robobrowser
r = robobrowser.RoboBrowser(tries=3)
r.open("http://www.gosdfagle.com")  # any nonexistent site

To fix this, in browser.py, line 70:

def __init__(self, session=None, parser=None, user_agent=None,
             history=True, timeout=None, allow_redirects=True, cache=False,
             cache_patterns=None, max_age=None, max_count=None, tries=None,
             multiplier=None):

change to:

def __init__(self, session=None, parser=None, user_agent=None,
             history=True, timeout=None, allow_redirects=True, cache=False,
             cache_patterns=None, max_age=None, max_count=None, tries=None,
             multiplier=0):

Is there any way to set page encoding manually?

Sometimes robobrowser gets the wrong encoding for my page. I know that BeautifulSoup supports a manual encoding definition; can I set the encoding manually and pass it to BeautifulSoup? In my case it finds windows-1252 instead of UTF-8.

If I use just BeautifulSoup with requests, it works fine:

>>> from bs4 import BeautifulSoup
>>> from robobrowser import RoboBrowser
>>> import requests
>>> url = 'http://10x10.com.ua/televizor-bravis-led-32d3000-smart-t2-black-v-dnepropetrovske.html'
>>> rb = RoboBrowser(parser='lxml')
>>> rb.open(url)
>>> rb.select('.product-name h1').pop()
<h1>\xd1\u201a\xd0\xb5\xd0\xbb\xd0\xb5\xd0\xb2\xd0\xb8\xd0·\xd0\xbe\xd1\u20ac Bravis LED-32D3000 Smart +T2 black \xd0\xb2 \xd0\u201d\xd0\xbd\xd0\xb5\xd0\xbf\xd1\u20ac\xd0\xbe\xd0\xbf\xd0\xb5\xd1\u201a\xd1\u20ac\xd0\xbe\xd0\xb2\xd1\x81\xd0\xba\xd0\xb5</h1>
>>> bs = BeautifulSoup(requests.get(url).text, 'lxml')
>>> bs.select('.product-name h1').pop()
<h1>телевизор Bravis LED-32D3000 Smart +T2 black в Днепропетровске</h1>
>>>

Submit method="GET" must ignore query string

When submitting an HTML form with method="GET", the browser must discard the current URL's query string.

<!-- index.html -->
<form id="form" method="GET">
    <input type=text name=field />
    <button type="submit">Submit</button>
</form>
browser = RoboBrowser(parser="lxml")
browser.open("http://localhost:8000/")

form = browser.get_form("form")
form["field"] = "value"
browser.submit_form(form)
# browser.url == "http://localhost:8000/?field=value"

form = browser.get_form("form")
form["field"] = "other"
browser.submit_form(form)
# browser.url == "http://localhost:8000/?field=value&field=other"

The URL should be http://localhost:8000/?field=other, with the previous query string removed.

RoboBrowser and forms: .submit_form() fails when action="javascript:void(0)"

Setting action="javascript:void(0)" is admittedly a rather lousy strategy for staying on the same page after clicking on the submit button, but this is what the page I'm trying to scrape does, and robobrowser unfortunately fails:

>>> browser.submit_form(form)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/site-packages/robobrowser/browser.py", line 347, in submit_form
    response = self.session.request(method, url, **send_args)
  File "/usr/local/lib/python2.7/site-packages/requests/sessions.py", line 456, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python2.7/site-packages/requests/sessions.py", line 553, in send
    adapter = self.get_adapter(url=request.url)
  File "/usr/local/lib/python2.7/site-packages/requests/sessions.py", line 608, in get_adapter
    raise InvalidSchema("No connection adapters were found for '%s'" % url)
requests.exceptions.InvalidSchema: No connection adapters were found for 'javascript:void(0)'

Thanks.

Error with two Submits

Hello, I have the following form:

<form action="/composer/mbasic/?av=100003111991563&amp;refid=18" method="post"><input autocomplete="off" name="fb_dtsg" type="hidden" value="AQG1ETPXS38n:AQGO2WaIwVNU"/><input name="charset_test" type="hidden" value="€,´,€,´,水,Д,Є"/><input name="target" type="hidden" value="208910505938561"/><input name="c_src" type="hidden" value="group"/><input name="cwevent" type="hidden" value="composer_entry"/><input name="referrer" type="hidden" value="group"/><input name="ctype" type="hidden" value="inline"/><input name="cver" type="hidden" value="amber"/><input name="rst_icv" type="hidden"/><label class="bw" for="u_0_0">Escreva algo...</label><table class="l bx"><tbody><tr><td class="r"><div class="by bz"><table class="l ca cb"><tbody><tr><td class="m cc"><label for="u_0_0"><img alt="João-Wellmara Ribeiro" class="cd img" height="32" src="https://fbcdn-profile-a.akamaihd.net/hprofile-ak-xfa1/v/t1.0-1/cp0/e15/q65/c193.2.623.623/s64x64/69216_424777024302694_750875852_n.jpg?efg=eyJpIjoiYiJ9&amp;oh=4d6b037d8e5f05a45336e23c363281f3&amp;oe=574B42B8&amp;__gda__=1465173705_7290807370b55f707b242f68bf5e22f1" width="32"/></label></td><td class="r ce bj"><textarea class="cf cg ch ci cj" id="u_0_0" name="xc_message" rows="2"></textarea></td></tr></tbody></table></div></td><td class="m"><div class="ck"><input class="w x z" name="view_post" type="submit" value="Publicar"/></div></td></tr></tbody></table><div class="cl"><span class="cm cn co"><div class="cp"><table class="l cq bj"><tbody><tr><td class="u m"><label class="cr" for="u_0_1"><img class="cs ct img" height="12" src="https://fbstatic-a.akamaihd.net/rsrc.php/v2/yd/r/10cmx89F1gU.png" width="12"/></label></td><td class="u m"><input class="w cu cv cw" id="u_0_1" name="view_photo" type="submit" value="Adicionar fotos"/></td></tr></tbody></table></div> <div class="cp"><table class="l cq bj"><tbody><tr><td class="u m"><label class="cr" for="u_0_2"><img class="cs ct img" height="12" 
src="https://fbstatic-a.akamaihd.net/rsrc.php/v2/yg/r/upxxJ9A52e7.png" width="12"/></label></td><td class="u m"><input class="w cu cv cw" id="u_0_2" name="view_overview" type="submit" value="Mais"/></td></tr></tbody></table></div></span></div></form>

It has two submits, view_post and view_photo. I'm trying to send view_post, but without success.

>>> br.submit_form(form, form.fields['view_post'])
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "C:\Users\lmr21\AppData\Local\Programs\Python\Python35\lib\site-packages\robobrowser\browser.py", line 339, in submit_form
    payload = form.serialize(submit=submit)
  File "C:\Users\lmr21\AppData\Local\Programs\Python\Python35\lib\site-packages\robobrowser\forms\form.py", line 227, in serialize
    return Payload.from_fields(include_fields)
  File "C:\Users\lmr21\AppData\Local\Programs\Python\Python35\lib\site-packages\robobrowser\forms\form.py", line 118, in from_fields
    if not field.disabled:
AttributeError: 'str' object has no attribute 'disabled'

I'm getting this error .. what am I doing wrong?

exceptions.InvalidSubmitError with submit_form

urlopt = 'http://info512.taifex.com.tw/Future/OptQuote_Norl.aspx'
browser = RoboBrowser()
browser.open(urlopt)
form = browser.get_form()
form['ctl00$ContentPlaceHolder1$ddlFusa_SelMon'].value =form['ctl00$ContentPlaceHolder1$ddlFusa_SelMon'].options[1]
browser.submit_form(form)

but it returns:

robobrowser\forms\form.py, in prepare_fields(all_fields, submit_fields, submit):

    152      if len(list(submit_fields.items(multi=True))) > 1:
    153          if not submit:
--> 154              raise exceptions.InvalidSubmitError()
    155          if submit not in submit_fields.getlist(submit.name):
    156              raise exceptions.InvalidSubmitError()

InvalidSubmitError:

I want to select options[1], but it returns the error. How do I fix this?

browser.submit_form(form,submit=form['ctl00$ContentPlaceHolder1$ddlFusa_SelMon'])

I tried this as well, but it returns the same error. Can someone help me?
Thanks!

select / option without value attribute

On some page I've found this html:

<select id="hitsPerPage" onchange="equaliseSortForms(1);" name="hitsPerPage">
    <option selected="">10</option>
    <option>20</option>
    <option>50</option>
</select>

And robobrowser parsed this into ["sel", "sel", "sel"] options and sends "hitsPerPage=sel" when posting the form, but a browser (Firefox) sends the label instead ("hitsPerPage=10", for example), so I think it's better to do the same when the value attribute is omitted.

cannot install with pip

I'm getting the following error when I try to install with pip:

$ pip install robobrowser
Downloading/unpacking robobrowser
  Running setup.py egg_info for package robobrowser
    Traceback (most recent call last):
      File "<string>", line 14, in <module>
      File "/home/tsc/.virtualenvs/vm_export_tool/build/robobrowser/setup.py", line 19, in <module>
        for requirement in parse_requirements('requirements.txt')
      File "/home/tsc/.virtualenvs/vm_export_tool/local/lib/python2.7/site-packages/pip-1.1-py2.7.egg/pip/req.py", line 1240, in parse_requirements
        skip_regex = options.skip_requirements_regex
    AttributeError: 'NoneType' object has no attribute 'skip_requirements_regex'
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 14, in <module>
      File "/home/tsc/.virtualenvs/vm_export_tool/build/robobrowser/setup.py", line 19, in <module>
        for requirement in parse_requirements('requirements.txt')
      File "/home/tsc/.virtualenvs/vm_export_tool/local/lib/python2.7/site-packages/pip-1.1-py2.7.egg/pip/req.py", line 1240, in parse_requirements
        skip_regex = options.skip_requirements_regex
    AttributeError: 'NoneType' object has no attribute 'skip_requirements_regex'

----------------------------------------
Command python setup.py egg_info failed with error code 1 in /home/tsc/.virtualenvs/vm_export_tool/build/robobrowser
Storing complete log in /home/tsc/.pip/pip.log

Python 2.7
pip 1.1
Ubuntu 12.10
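A common workaround for this class of failure is for setup.py to read requirements.txt itself rather than calling pip's internal parse_requirements(), whose signature has changed across pip releases. A sketch of that approach (not the actual robobrowser setup.py):

```python
def parse_requirement_lines(lines):
    """Keep non-empty, non-comment lines as install_requires entries."""
    return [ln.strip() for ln in lines
            if ln.strip() and not ln.lstrip().startswith('#')]

# In setup.py one would then do something like (sketch):
#   install_requires=parse_requirement_lines(open('requirements.txt'))
sample = ['requests>=1.2.0\n', '# dev only\n', '\n', 'beautifulsoup4\n']
print(parse_requirement_lines(sample))  # ['requests>=1.2.0', 'beautifulsoup4']
```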

Errors running the example script from the documentation

Hi,

I'm trying to run the first script found here:
http://robobrowser.readthedocs.io/en/latest/readme.html
(the one that scrapes Genius.com)

And I'm getting the following error + warning:

/Users/saulfuhrmann/Computers/VirtualEnviroments/TinderHack/lib/python2.7/site-packages/bs4/__init__.py:181: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html.parser"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

The code that caused this warning is on line 9 of the file test_robo.py. To get rid of this warning, change code that looks like this:

 BeautifulSoup([your markup])

to this:

 BeautifulSoup([your markup], "html.parser")

  markup_type=markup_type))
Traceback (most recent call last):
  File "test_robo.py", line 11, in <module>
    form['q'].value = 'queen'
TypeError: 'NoneType' object has no attribute '__getitem__'

I'm using Python 2.7.10 and my OS is OS X El Capitan (10.11.4).

Am I doing something wrong?
Or did Genius.com change their interface, so that the script is out of date?

-- Saul
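The TypeError means browser.get_form(action='/search') returned None, most likely because Genius changed its markup and no form matched the selector. Checking for None before subscripting turns the cryptic failure into a clear message; a runnable stand-in (get_form_stub is a placeholder for the real browser.get_form call):

```python
def get_form_stub(found):
    # Placeholder for browser.get_form(action='/search'):
    # robobrowser returns None when no form on the page matches.
    return {'q': ''} if found else None

form = get_form_stub(found=False)
if form is None:
    message = 'search form not found; the page layout probably changed'
else:
    form['q'] = 'queen'
    message = 'ok'
print(message)  # search form not found; the page layout probably changed
```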

RoboBrowser and Wordpress

Hi!
I'm quite the newbie when it comes to robobrowser, but I want to post to WordPress using it.
I managed to log in, but since WP is AJAX-driven, I can't manage to create, fill out, and publish posts.

Is this possible at all with robobrowser, or is it not supported? If it is supported, can anyone tell me how to work with AJAX sites that refresh themselves and do auto-saves, like WP does?

Disabled <input> elements should not be submitted.

According to 1 and 2, the disabled attribute prevents a control's value from being submitted.

It seems that robobrowser does not check whether elements are disabled. In my case, a disabled checkbox in a form was submitted unexpectedly.
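The expected behavior can be sketched with the stdlib parser: any input carrying the disabled attribute is skipped when serializing the form, as browsers do. (This mimics the rule; it is not robobrowser's actual serialization code.)

```python
from html.parser import HTMLParser

class FormScanner(HTMLParser):
    """Collect the name/value pairs a browser would actually submit."""
    def __init__(self):
        super().__init__()
        self.payload = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        a = dict(attrs)
        if "disabled" in a:        # browsers never submit disabled controls
            return
        if a.get("name"):
            self.payload[a["name"]] = a.get("value", "")

html = ('<form><input name="a" value="1">'
        '<input name="b" value="2" disabled></form>')
s = FormScanner()
s.feed(html)
print(s.payload)  # {'a': '1'}
```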

The constructor for Input should accept a string

The following patch will allow the constructor for Input to accept a string, as its docstring says it will.

*** fields.py   2014-07-18 13:22:40.616011172 +0200
--- fields.py~  2014-07-15 16:17:32.000000000 +0200
***************
*** 39,45 ****
      def __init__(self, parsed):
          self._parsed = helpers.ensure_soup(parsed)
          self._value = None
!         self.name = self._get_name(self._parsed)

      def _get_name(self, parsed):
          return parsed.get('name')
--- 39,45 ----
      def __init__(self, parsed):
          self._parsed = helpers.ensure_soup(parsed)
          self._value = None
!         self.name = self._get_name(parsed)

      def _get_name(self, parsed):
          return parsed.get('name')
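What the patch enables is constructing a field directly from raw markup, e.g. Input('<input name="q" value="x" />'), because _get_name then receives the parsed soup rather than the original string. A stdlib-only stand-in for the name extraction, just to illustrate the behavior (not robobrowser's code):

```python
from html.parser import HTMLParser

def get_name(markup):
    """Extract the name attribute from a single <input> tag."""
    found = {}
    class P(HTMLParser):
        def handle_starttag(self, tag, attrs):
            if tag == "input":
                found.update(attrs)
    P().feed(markup)
    return found.get("name")

print(get_name('<input name="q" value="x" />'))  # q
```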

badly handle web pages with encoding errors

As the BeautifulSoup docs say: 'Sometimes a document is mostly in UTF-8, but contains Windows-1252 characters...'. For example, all pages from the http://www.programme-tv.com web site are like this (due to the Ô in 'France Ô').
In that case, RoboBrowser decodes the full document as Windows-1252, leaving all accented characters unreadable:

from robobrowser import RoboBrowser
browser = RoboBrowser()
browser.open('http://www.programme-tv.com')
browser.find('span', 'slogan1').text

The output is:
'Ne ratez plus vos Ã©missions favorites!'
instead of
'Ne ratez plus vos émissions favorites!'

Using self.response.text instead of self.response.content when calling BeautifulSoup solves the problem, but it probably has some drawbacks.

Cheers,
Loïc
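The mojibake described above is reproducible with the stdlib alone (assuming, as the report says, that UTF-8 bytes get decoded as Windows-1252):

```python
# 'é' is two bytes in UTF-8 (0xC3 0xA9); decoding them as Windows-1252
# yields the two characters 'Ã©' instead.
correct = "Ne ratez plus vos émissions favorites!"
mangled = correct.encode("utf-8").decode("cp1252")
print(mangled)  # Ne ratez plus vos Ã©missions favorites!
```

This is why passing response.content (raw bytes) to BeautifulSoup with a wrong encoding guess mangles the text, while response.text lets Requests apply its own encoding detection first.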

BS4 warns that no parser was explicitly specified

I got the following warning with the latest BeautifulSoup4 and robobrowser from PyPI, on Python 3.5:

UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html.parser"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

To get rid of this warning, change this:

BeautifulSoup([your markup])

to this:

BeautifulSoup([your markup], "html.parser")

I think robobrowser should specify the parser explicitly to get repeatable results.
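In the meantime, the parser can be pinned from user code, since RoboBrowser's constructor takes a parser argument that is forwarded to BeautifulSoup: RoboBrowser(parser='html.parser'). A stub version is shown below so the snippet runs without the library installed; with robobrowser present, the real constructor call in the comment is the whole fix.

```python
# With the real library:
#   from robobrowser import RoboBrowser
#   browser = RoboBrowser(parser='html.parser')
class RoboBrowserStub:
    """Stand-in: stores the parser name the way RoboBrowser does."""
    def __init__(self, parser=None):
        # robobrowser hands this value to BeautifulSoup for every
        # response it parses, silencing the warning and making the
        # parse repeatable across machines
        self.parser = parser

browser = RoboBrowserStub(parser='html.parser')
print(browser.parser)  # html.parser
```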
