
robobrowser's Introduction

RoboBrowser: Your friendly neighborhood web scraper

https://badge.fury.io/py/robobrowser.png https://travis-ci.org/jmcarp/robobrowser.png?branch=master https://coveralls.io/repos/jmcarp/robobrowser/badge.png?branch=master

Homepage: http://robobrowser.readthedocs.org/

RoboBrowser is a simple, Pythonic library for browsing the web without a standalone web browser. RoboBrowser can fetch a page, click on links and buttons, and fill out and submit forms. If you need to interact with web services that don't have APIs, RoboBrowser can help.

import re
from robobrowser import RoboBrowser

# Browse to Genius
browser = RoboBrowser(history=True)
browser.open('http://genius.com/')

# Search for Porcupine Tree
form = browser.get_form(action='/search')
form                # <RoboForm q=>
form['q'].value = 'porcupine tree'
browser.submit_form(form)

# Look up the first song
songs = browser.select('.song_link')
browser.follow_link(songs[0])
lyrics = browser.select('.lyrics')
lyrics[0].text      # \nHear the sound of music ...

# Back to results page
browser.back()

# Look up my favorite song
song_link = browser.get_link('trains')
browser.follow_link(song_link)

# Can also search HTML using regex patterns
lyrics = browser.find(class_=re.compile(r'\blyrics\b'))
lyrics.text         # \nTrain set and match spied under the blind...

RoboBrowser combines the best of two excellent Python libraries: Requests and BeautifulSoup. RoboBrowser represents browser sessions using Requests and HTML responses using BeautifulSoup, transparently exposing methods of both libraries:

import re
from robobrowser import RoboBrowser

browser = RoboBrowser(user_agent='a python robot')
browser.open('https://github.com/')

# Inspect the browser session
browser.session.cookies['_gh_sess']         # BAh7Bzo...
browser.session.headers['User-Agent']       # a python robot

# Search the parsed HTML
browser.select('div.teaser-icon')       # [<div class="teaser-icon">
                                        # <span class="mega-octicon octicon-checklist"></span>
                                        # </div>,
                                        # ...
browser.find(class_=re.compile(r'column', re.I))    # <div class="one-third column">
                                                    # <div class="teaser-icon">
                                                    # <span class="mega-octicon octicon-checklist"></span>
                                                    # ...

You can also pass a custom Session instance for lower-level configuration:

from requests import Session
from robobrowser import RoboBrowser

session = Session()
session.verify = False  # Skip SSL verification
session.proxies = {'http': 'http://custom.proxy.com/'}  # Set default proxies
browser = RoboBrowser(session=session)

RoboBrowser also includes tools for working with forms, inspired by WebTest and Mechanize.

from robobrowser import RoboBrowser

browser = RoboBrowser()
browser.open('http://twitter.com')

# Get the signup form
signup_form = browser.get_form(class_='signup')
signup_form         # <RoboForm user[name]=, user[email]=, ...

# Inspect its values
signup_form['authenticity_token'].value     # 6d03597 ...

# Fill it out
signup_form['user[name]'].value = 'python-robot'
signup_form['user[user_password]'].value = 'secret'

# Submit the form
browser.submit_form(signup_form)

Checkboxes:

from robobrowser import RoboBrowser

# Browse to a page with checkbox inputs
browser = RoboBrowser()
browser.open('http://www.w3schools.com/html/html_forms.asp')

# Find the form
form = browser.get_forms()[3]
form                            # <RoboForm vehicle=[]>
form['vehicle']                 # <robobrowser.forms.fields.Checkbox...>

# Checked values can be read and set like lists
form['vehicle'].options         # [u'Bike', u'Car']
form['vehicle'].value           # []
form['vehicle'].value = ['Bike']
form['vehicle'].value = ['Bike', 'Car']

# Values can also be set using input labels
form['vehicle'].labels          # [u'I have a bike', u'I have a car \r\n']
form['vehicle'].value = ['I have a bike']
form['vehicle'].value           # [u'Bike']

# Only values that correspond to checkbox values or labels can be set;
# this will raise a `ValueError`
form['vehicle'].value = ['Hot Dogs']

Uploading files:

from robobrowser import RoboBrowser

# Browse to a page with an upload form
browser = RoboBrowser()
browser.open('http://cgi-lib.berkeley.edu/ex/fup.html')

# Find the form
upload_form = browser.get_form()
upload_form                     # <RoboForm upfile=, note=>

# Choose a file to upload
upload_form['upfile']           # <robobrowser.forms.fields.FileInput...>
upload_form['upfile'].value = open('path/to/file.txt', 'rb')  # open in binary mode

# Submit
browser.submit_form(upload_form)

By default, creating a browser instantiates a new requests Session.

Requirements

  • Python >= 2.6 or >= 3.3

License

MIT licensed. See the bundled LICENSE file for more details.

robobrowser's People

Contributors

jamesmeneghello, jmcarp, mattdbr, mlitvk, pratyushmittal, rcutmore, sfall, stuntspt, voyageur



robobrowser's Issues

Disabled fields should be enableable

For forms where some fields are initially disabled but may be enabled later through JavaScript, robobrowser seems unable to replicate that behaviour: disabled is a read-only @property on the form field, so disabled fields cannot be enabled programmatically.

It's possible to work around this by adding a new field (form.add_field(Input('<input name=".." value="..">'))), but this seems rather clunky. It'd be better to be able to do something like this:

form['field'].disabled = False

I cannot log in to Facebook with RoboBrowser?

Could some one help me? Thank you!
from robobrowser import RoboBrowser

browser = RoboBrowser(user_agent='Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6')
browser.open('https://www.facebook.com/')

# Get the login form

form = browser.get_form(id='login_form')
form['email'].value = 'my_user'
form['pass'].value = 'my_pass'
browser.submit_form(form)
Error:
requests.exceptions.ConnectionError: ('Connection aborted.', ResponseNotReady('Request-sent',))

Add support for manipulating forms

Being able to add, remove, and edit the fields in the form object would be nice.
I'm currently trying to scrape a website where the form has two fields with the same name and one of them needs to be deleted. Are there already methods for that?

Add support for dynamically added form fields

Hi,

I'm trying to use a file-upload form, but unfortunately there is a field that gets added by JS before submitting.
Is there a way to simply add fields, or to disable the checks that raise the KeyError?

Forms and JavaScript/AJAX do not work?

Hi,

Nice work.
However, I am having some issues with certain forms and AJAX requests.
Can you confirm that it does not work if a form is submitted via JavaScript and the results are displayed on the same page via AJAX?

Thanks
Michael

How to press a button on webpages

I need to click a button on a webpage (no form). Is there a way to accomplish this?

This is a piece of the page:
<div class="alert alert-danger media_size">
<a href="verify.html?programmer_id=c1b7&ses=949cc7646c425c788d3f210c5c889b22" class="btn btn-primary">Verify now</a>
...

How do I tick/select a check box?

>>> url = 'https://bitbucket.org/repo/import'
>>> browser.open(url)
>>> import_form = browser.get_form(id='import-form')
>>> import_form
<RoboForm source_scm=, source=source-git, goog_project_name=, goog_scm=svn, sourceforge_project_name=, sourceforge_mount_point=, sourceforge_scm=svn, codeplex_project_name=, codeplex_scm=svn, url=, auth=[], username=, password=, owner=2039394, name=, description=, is_private=[None], forking=no_public_forks, no_forks=, no_public_forks=True, scm=git, has_issues=[], has_wiki=[], language=, csrfmiddlewaretoken=icwpCBLZdWAPht1rmnACawMHcYwtorNA>
>>> type(import_form['auth'])
<class 'robobrowser.forms.fields.Checkbox'>

The 'auth' field is a checkbox. How do I set it to checked? I couldn't find the required info in the documentation, so I'm asking here. Thank you!

Javascript support

I know this question isn't specifically related to robobrowser, but I'm using Joshua Carp's awesome RoboBrowser Python package. I was hoping to contact him directly, but I'm not sure whether that's possible on GitHub.

Anyway, I'm looking for a pure Python 3 headless JavaScript browser package and I can't seem to find any. All the ones I've researched have underlying dependencies, which makes them less appealing.

I'm relatively new to Python, and I'm using Python 3.x. Some of my code may run on a NAS server that has a Python 3.5 distribution but uses its own proprietary 64-bit Linux (specifically a Synology 716+ NAS), or as a standalone application.

I really want to use something that's self contained, isn't doing any RPC based controlling of a headless browser either, etc.

I can't find anything and I'm thinking nothing exists. Does anyone know of one?

Thanks

Fails on UTF

robobrowser is crashing on this value in the form.

<input type="hidden" id="_utf8" name="_utf8" value="☃">
>>> form
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/site-packages/robobrowser/forms/form.py", line 200, in __repr__
    for name, field in self.fields.items(multi=True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2603' in position 0: ordinal not in range(128)

Multiple submit..

Hello. I am using RoboBrowser from my Odoo instance to connect to an external platform that approves products for sale on the marketplace. The platform has a form with a button that fills a table with the products to be registered, and a separate general submit button used after the table is filled. The problem is with the button that fills the table.

When I do:
browser = RoboBrowser(history=True)
browser.open('https://....../')
form = browser.get_form(action='/....')
the form object contains all the fields, including the field where I enter the product name. But when I submit the form with:
browser.submit_form(form)
every field is recorded except the product, which is only applied when I click the Add Product button.
The submit_fields property of the form is an empty list, so I cannot use:
browser.submit_form(form, submit=submit)
Any proposal???

help signing into dropbox.com

Dear all,

I get a 403 response from submit_form when trying to sign into https://www.dropbox.com/login.

I adapted the third example from http://robobrowser.readthedocs.org/en/latest/readme.html (working with forms)

Any ideas on what I missed would be greatly appreciated.

Arye.

from robobrowser import RoboBrowser


browser = RoboBrowser(history=True, user_agent='RoboBrowser python robot')
browser.open('https://www.dropbox.com/login')
print browser.response

# Get the signup form
signup_form = browser.get_form(class_='login-form')

# Fill it out
signup_form['login_email'].value = 'user@example'
signup_form['login_password'].value = 'secret'

# Serialize it to JSON; is this really necessary?
# signup_form.serialize()

# And submit
browser.submit_form(signup_form)
print browser.response

InsecurePlatformWarning

I am logging into an HTTPS site, and robobrowser is working great. However, every time my script runs I am getting an 'InsecurePlatformWarning'. When I use my normal browser to log into this site, I am not getting any warnings at all. For reference the site I am logging into is http://didlogic.com

This is the warning:

/Library/Python/2.7/site-packages/requests-2.7.0-py2.7.egg/requests/packages/urllib3/util/ssl_.py:90: InsecurePlatformWarning: A true SSLContext object is not available. This prevents urllib3 from configuring SSL appropriately and may cause certain SSL connections to fail. For more information, see https://urllib3.readthedocs.org/en/latest/security.html#insecureplatformwarning.
InsecurePlatformWarning

While it's not impacting the script, it is somewhat annoying to see. A workaround I have found (via the URL in the warning) is to use the logging module to capture these warnings:

import logging
logging.captureWarnings(True)

This stops the warning from being displayed on screen and instead re-routes it to the logging module (where it is not being recorded anywhere).
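A narrower alternative (a sketch; disable_warnings is a real urllib3 helper, though whether to silence security warnings at all is a judgment call) is to suppress only urllib3's warnings:

```python
import urllib3

# Silence urllib3's HTTPWarning family (which includes
# InsecurePlatformWarning) instead of rerouting all warnings to logging.
# The real fix on Python 2 is `pip install requests[security]`, which
# installs pyOpenSSL and makes a proper SSLContext available.
urllib3.disable_warnings()
```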

Thanks,

Suggested Changes To Rap.Genius Example

The target website appears to have been changed, so the first example no longer works.
The following changes bring it back into line:

import re
from robobrowser import RoboBrowser
browser = RoboBrowser(history=True)
browser.open('http://genius.com/')
form = browser.get_form(action='/search')
form['q'].value = 'queen'
browser.submit_form(form)

songs = browser.select('.song_title')
browser.follow_link(songs[1]) # If you want Bohemian Rhapsody
lyrics = browser.select('.lyrics')
lyrics[0].text # \n[Intro]\nIs this the real life ...
browser.back()

Regards
Trevor

It's not an issue, but a suggestion:

Is there a way to introduce a "pyautogui"-like function? I want to scrape data that is loaded dynamically during navigation on a Google Maps service, and I haven't found a way to do that yet...
(Does someone have a clue for me, or a solution?)

forms dependency

I just installed robobrowser and get this message when using browser.get_form:

No module named forms.form

I have requests and bs4 installed.

Multiple submit buttons in forms

If I have the following HTML:

<form name="input" action="demo_form_action.asp" method="get">
Username: <input type="text" name="user">
<input type="submit" value="Action1" name="action1_name">
<input type="submit" value="Action2" name="action2_name">
</form> 

When I click the Action1 button in my browser (Firefox or Chrome), the following URL gets sent http://localhost:8001/demo_form_action.asp?user=asfdasd&action1_name=Action1. And a different URL gets sent for Action2: http://localhost:8001/demo_form_action.asp?user=asfdasd&action2_name=Action2. Using the submit_form method in robobrowser, puts the actions for both buttons in the request: http://localhost:8001/demo_form_action.asp?user=asdsafd&action2_name=Action2&action1_name=Action1.

I guess I can model the browser button pressing behaviour by deleting the buttons I do not want pressed from the robobrowser form object. However, it would be nice if the API for submit_form could be extended to include the button being pressed to submit the form.

InvalidNameError - Input field does not accept tag string

I found that robobrowser can add form field dynamically like this:

browser = RoboBrowser(history=True)
myform = browser.get_form('myform2')
new_field = Input('<input name="myname" value="" />')
myform.add_field( new_field )

But an InvalidNameError occurred in the Input class.

RoboBrowser's Input field raises InvalidNameError

Python 2.7.6 (default, Mar 22 2014, 22:59:56) 
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.

>>> from robobrowser.forms.fields import Input
>>> Input('<input name="myfield" value=""/>')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/robobrowser/forms/fields.py", line 72, in __init__
    super(Input, self).__init__(parsed)
  File "/usr/local/lib/python2.7/dist-packages/robobrowser/forms/fields.py", line 46, in __init__
    self.name = self._get_name(self._parsed)
  File "/usr/local/lib/python2.7/dist-packages/robobrowser/forms/fields.py", line 52, in _get_name
    raise exceptions.InvalidNameError
robobrowser.exceptions.InvalidNameError
>>> 

Packages version

$ pip search robobrowser
robobrowser               - Your friendly neighborhood web scraper
  INSTALLED: 0.5.1 (latest)
$ pip search beautifulsoup
beautifulsoup4            - Screen-scraping library
  INSTALLED: 4.3.2 (latest)
BeautifulSoup             - HTML/XML parser for quick-turnaround applications like screen-scraping.
  INSTALLED: 3.2.1 (latest)

Simple AJAX request

Hello. I'm trying to emulate a simple AJAX request.

Scheme:
[screenshot of the AJAX request omitted]

It must return JSON:
{phone: "8 xxx xxx-xx-xx"}

But it returns None for me.
I suspect the cause is cookies. Can someone help me with it?

# -*- coding: utf-8 -*-
import requests
import os, re, json, csv, sys
from robobrowser import RoboBrowser

class Aggregator(object):

    def __init__(self, config):
        main_url, output_file = [config.get(k) for k in sorted(config.keys())]
        self.main_url = main_url
        self.output_file = output_file

    def start_process(self):

        work_url = "https://m.avito.ru/sankt-peterburg/predlozheniya_uslug/almaznoe_burenie_almaznaya_rezka_usilenie_79225740"

        session = requests.Session()
        session.headers.update({
            ':host': 'm.avito.ru',
            ':method': 'GET',
            ':path': '/sankt-peterburg/predlozheniya_uslug/almaznoe_burenie_almaznaya_rezka_usilenie_79225740',
            ':scheme': 'https',
            ':version': 'HTTP/1.1',
            'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
            'accept-encoding': 'gzip, deflate, sdch',
            'accept-language': 'ru-RU,ru;q=0.8,en-US;q=0.6,en;q=0.4',
            'cache-control': 'no-cache',
            'pragma': 'no-cache',
            'upgrade-insecure-requests': '1',
            'user-agent': 'Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.71 Safari/537.36',
            'x-compress': 'null',
        })
        browser = RoboBrowser(session=session, history=True, parser='lxml')

        browser.open(work_url)

        phone_link = browser.find('a', {"class": "action-show-number"}).attrs['href'] + '?async'
        browser.session.headers[':path'] = phone_link
        browser.session.headers['accept'] = 'application/json, text/javascript, */*; q=0.01'
        browser.session.headers['referer'] = work_url
        browser.session.headers['upgrade-insecure-requests'] = ''
        browser.session.headers['x-requested-with'] = 'XMLHttpRequest'
        phone = browser.open(self.main_url + phone_link)

        print(self.main_url + phone_link, phone)
        print(browser.session.headers)
        print(browser.session.cookies)

if __name__ == '__main__':
    settings = { 'main_url': 'https://m.avito.ru', 'output_file': 'output.csv' }
    aggregator = Aggregator(settings)
    aggregator.start_process()

TypeError: __init__() got an unexpected keyword argument 'headers'

This is what I get when I run this within a Flask app like this:

from flask import Flask
app = Flask(__name__)

from robobrowser import RoboBrowser

@app.route("/")
def hello():

    browser = RoboBrowser(headers={'User-Agent': 'a python robot'})
    browser.open('http://google.com/')

    return 'Hello'

if __name__ == "__main__":
    app.run(host="localhost", debug=True)

What could be causing this?

I installed RoboBrowser in a virtualenv using pip 7.1.2, python 2.7.10, Windows 10

Checkboxes with the same name will not be grouped if other inputs exist between them

I found that checkboxes with the same name will not be grouped if other inputs exist between them.

from bs4 import BeautifulSoup
from robobrowser.forms.form import _parse_fields

html = '''
            <input type="checkbox" name="member" value="mercury" checked />vocals<br />
            <input type="checkbox" name="member" value="may" />guitar<br />
            <input type="text" /><br />
            <input type="checkbox" name="member" value="taylor" />drums<br />
            <input type="checkbox" name="member" value="deacon" checked />bass<br />
        '''
_fields = _parse_fields(BeautifulSoup(html, 'html.parser'))
for cbx in _fields:
    print(cbx.name, cbx.options)

which output

member ['mercury', 'may']
member ['taylor', 'deacon']

As seen, two robobrowser.forms.fields.Checkbox instances were created, with the same name but different options. I expected one instance with four options.

Maybe it's a bug? I have no idea.

relevant code

def _group_flat_tags(tag, tags):
    """Extract tags sharing the same name as the provided tag. Used to collect
    options for radio and checkbox inputs.
    :param Tag tag: BeautifulSoup tag
    :param list tags: List of tags
    :return: List of matching tags
    """
    grouped = [tag]
    name = tag.get('name', '').lower()
    while tags and tags[0].get('name', '').lower() == name: # <----  HERE
        grouped.append(tags.pop(0))
    return grouped

Installing via pip fails

Installing via pip produces an error. I'm using python 3.4.

TypeError: parse_requirements() missing 1 required keyword argument: 'session'
Complete output from command python setup.py egg_info:
Traceback (most recent call last):

  File "<string>", line 20, in <module>

  File "/.../.../robobrowser/setup.py", line 38, in <module>

    for requirement in parse_requirements('requirements.txt')

  File "/.../.../robobrowser/setup.py", line 37, in <listcomp>

    str(requirement.req)

  File "/.../.../python3.4/site-packages/pip/req/req_file.py", line 19, in parse_requirements

    "parse_requirements() missing 1 required keyword argument: "

TypeError: parse_requirements() missing 1 required keyword argument: 'session'

----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /.../.../robobrowser

multiplier defaults to None; 0 would be better

Sorry, I don't understand English well.
If you set the tries parameter, the multiplier parameter cannot be left as None (its default); defaulting it to 0 would be better.

error code :
import robobrowser
r = robobrowser.RoboBrowser(tries=3)
r.open("http://www.gosdfagle.com")  # any nonexistent site

To fix this, in browser.py, line 70:

def __init__(self, session=None, parser=None, user_agent=None,
             history=True, timeout=None, allow_redirects=True, cache=False,
             cache_patterns=None, max_age=None, max_count=None, tries=None,
             multiplier=None):

change to:

def __init__(self, session=None, parser=None, user_agent=None,
             history=True, timeout=None, allow_redirects=True, cache=False,
             cache_patterns=None, max_age=None, max_count=None, tries=None,
             multiplier=0):

Is there any way to set page encoding manually?

Sometimes robobrowser gets the wrong encoding for my page. I know that BeautifulSoup supports a manual encoding definition; can I set the encoding manually and pass it to BeautifulSoup? In my case it finds windows-1252 instead of UTF-8.

If I use just BeautifulSoup with requests, it works fine:

>>> from bs4 import BeautifulSoup
>>> from robobrowser import RoboBrowser
>>> import requests
>>> url = 'http://10x10.com.ua/televizor-bravis-led-32d3000-smart-t2-black-v-dnepropetrovske.html'
>>> rb = RoboBrowser(parser='lxml')
>>> rb.open(url)
>>> rb.select('.product-name h1').pop()
<h1>\xd1\u201a\xd0\xb5\xd0\xbb\xd0\xb5\xd0\xb2\xd0\xb8\xd0·\xd0\xbe\xd1\u20ac Bravis LED-32D3000 Smart +T2 black \xd0\xb2 \xd0\u201d\xd0\xbd\xd0\xb5\xd0\xbf\xd1\u20ac\xd0\xbe\xd0\xbf\xd0\xb5\xd1\u201a\xd1\u20ac\xd0\xbe\xd0\xb2\xd1\x81\xd0\xba\xd0\xb5</h1>
>>> bs = BeautifulSoup(requests.get(url).text, 'lxml')
>>> bs.select('.product-name h1').pop()
<h1>телевизор Bravis LED-32D3000 Smart +T2 black в Днепропетровске</h1>
>>>

Submit method="GET" must ignore query string

When submitting an HTML form with method="GET", the browser must discard the current URL's query string.

<!-- index.html -->
<form id="form" method="GET">
    <input type=text name=field />
    <button type="submit">Submit</button>
</form>
browser = RoboBrowser(parser="lxml")
browser.open("http://localhost:8000/")

form = browser.get_form("form")
form["field"] = "value"
browser.submit_form(form)
# browser.url == "http://localhost:8000/?field=value"

form = browser.get_form("form")
form["field"] = "other"
browser.submit_form(form)
# browser.url == "http://localhost:8000/?field=value&field=other"

The URL should be http://localhost:8000/?field=other, with the previous query string removed.

RoboBrowser and forms: .submit_form() fails when action="javascript:void(0)"

Setting action="javascript:void(0)" is admittedly a rather lousy strategy for staying on the same page after clicking on the submit button, but this is what the page I'm trying to scrape does, and robobrowser unfortunately fails:

>>> browser.submit_form(form)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/site-packages/robobrowser/browser.py", line 347, in submit_form
    response = self.session.request(method, url, **send_args)
  File "/usr/local/lib/python2.7/site-packages/requests/sessions.py", line 456, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python2.7/site-packages/requests/sessions.py", line 553, in send
    adapter = self.get_adapter(url=request.url)
  File "/usr/local/lib/python2.7/site-packages/requests/sessions.py", line 608, in get_adapter
    raise InvalidSchema("No connection adapters were found for '%s'" % url)
requests.exceptions.InvalidSchema: No connection adapters were found for 'javascript:void(0)'

Thanks.

Error with two Submits

Hello, I have the following form:

<form action="/composer/mbasic/?av=100003111991563&amp;refid=18" method="post"><input autocomplete="off" name="fb_dtsg" type="hidden" value="AQG1ETPXS38n:AQGO2WaIwVNU"/><input name="charset_test" type="hidden" value="€,´,€,´,水,Д,Є"/><input name="target" type="hidden" value="208910505938561"/><input name="c_src" type="hidden" value="group"/><input name="cwevent" type="hidden" value="composer_entry"/><input name="referrer" type="hidden" value="group"/><input name="ctype" type="hidden" value="inline"/><input name="cver" type="hidden" value="amber"/><input name="rst_icv" type="hidden"/><label class="bw" for="u_0_0">Escreva algo...</label><table class="l bx"><tbody><tr><td class="r"><div class="by bz"><table class="l ca cb"><tbody><tr><td class="m cc"><label for="u_0_0"><img alt="João-Wellmara Ribeiro" class="cd img" height="32" src="https://fbcdn-profile-a.akamaihd.net/hprofile-ak-xfa1/v/t1.0-1/cp0/e15/q65/c193.2.623.623/s64x64/69216_424777024302694_750875852_n.jpg?efg=eyJpIjoiYiJ9&amp;oh=4d6b037d8e5f05a45336e23c363281f3&amp;oe=574B42B8&amp;__gda__=1465173705_7290807370b55f707b242f68bf5e22f1" width="32"/></label></td><td class="r ce bj"><textarea class="cf cg ch ci cj" id="u_0_0" name="xc_message" rows="2"></textarea></td></tr></tbody></table></div></td><td class="m"><div class="ck"><input class="w x z" name="view_post" type="submit" value="Publicar"/></div></td></tr></tbody></table><div class="cl"><span class="cm cn co"><div class="cp"><table class="l cq bj"><tbody><tr><td class="u m"><label class="cr" for="u_0_1"><img class="cs ct img" height="12" src="https://fbstatic-a.akamaihd.net/rsrc.php/v2/yd/r/10cmx89F1gU.png" width="12"/></label></td><td class="u m"><input class="w cu cv cw" id="u_0_1" name="view_photo" type="submit" value="Adicionar fotos"/></td></tr></tbody></table></div> <div class="cp"><table class="l cq bj"><tbody><tr><td class="u m"><label class="cr" for="u_0_2"><img class="cs ct img" height="12" 
src="https://fbstatic-a.akamaihd.net/rsrc.php/v2/yg/r/upxxJ9A52e7.png" width="12"/></label></td><td class="u m"><input class="w cu cv cw" id="u_0_2" name="view_overview" type="submit" value="Mais"/></td></tr></tbody></table></div></span></div></form>

It has two submits, view_post and view_photo. I'm trying to send view_post, but without success.

>>> br.submit_form(form, form.fields['view_post'])
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "C:\Users\lmr21\AppData\Local\Programs\Python\Python35\lib\site-packages\robobrowser\browser.py", line 339, in submit_form
    payload = form.serialize(submit=submit)
  File "C:\Users\lmr21\AppData\Local\Programs\Python\Python35\lib\site-packages\robobrowser\forms\form.py", line 227, in serialize
    return Payload.from_fields(include_fields)
  File "C:\Users\lmr21\AppData\Local\Programs\Python\Python35\lib\site-packages\robobrowser\forms\form.py", line 118, in from_fields
    if not field.disabled:
AttributeError: 'str' object has no attribute 'disabled'

I'm getting this error .. what am I doing wrong?

exceptions.InvalidSubmitError with submit_form

urlopt = 'http://info512.taifex.com.tw/Future/OptQuote_Norl.aspx'
browser = RoboBrowser()
browser.open(urlopt)
form = browser.get_form()
form['ctl00$ContentPlaceHolder1$ddlFusa_SelMon'].value =form['ctl00$ContentPlaceHolder1$ddlFusa_SelMon'].options[1]
browser.submit_form(form)

but it returns:

robobrowser\forms\form.py, in prepare_fields(all_fields, submit_fields, submit):

    152      if len(list(submit_fields.items(multi=True))) > 1:
    153          if not submit:
--> 154              raise exceptions.InvalidSubmitError()
    155          if submit not in submit_fields.getlist(submit.name):
    156              raise exceptions.InvalidSubmitError()

InvalidSubmitError:

I want to select options[1], but it returns the error. How do I fix this?

browser.submit_form(form,submit=form['ctl00$ContentPlaceHolder1$ddlFusa_SelMon'])

I tried this as well, but it returns the same error. Can someone help me?
Thanks!

select / option without value attribute

On some page I've found this html:

<select id="hitsPerPage" onchange="equaliseSortForms(1);" name="hitsPerPage">
    <option selected="">10</option>
    <option>20</option>
    <option>50</option>
</select>

And robobrowser parsed this into ["sel", "sel", "sel"] options and sends "hitsPerPage=sel" when posting the form, but a browser (Firefox) sends the label instead ("hitsPerPage=10", for example), so I think it's better to do the same when the value attribute is omitted.

cannot install with pip

I'm getting the following error when I try to install with pip:

$ pip install robobrowser
Downloading/unpacking robobrowser
  Running setup.py egg_info for package robobrowser
    Traceback (most recent call last):
      File "<string>", line 14, in <module>
      File "/home/tsc/.virtualenvs/vm_export_tool/build/robobrowser/setup.py", line 19, in <module>
        for requirement in parse_requirements('requirements.txt')
      File "/home/tsc/.virtualenvs/vm_export_tool/local/lib/python2.7/site-packages/pip-1.1-py2.7.egg/pip/req.py", line 1240, in parse_requirements
        skip_regex = options.skip_requirements_regex
    AttributeError: 'NoneType' object has no attribute 'skip_requirements_regex'
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 14, in <module>
      File "/home/tsc/.virtualenvs/vm_export_tool/build/robobrowser/setup.py", line 19, in <module>
        for requirement in parse_requirements('requirements.txt')
      File "/home/tsc/.virtualenvs/vm_export_tool/local/lib/python2.7/site-packages/pip-1.1-py2.7.egg/pip/req.py", line 1240, in parse_requirements
        skip_regex = options.skip_requirements_regex
    AttributeError: 'NoneType' object has no attribute 'skip_requirements_regex'

----------------------------------------
Command python setup.py egg_info failed with error code 1 in /home/tsc/.virtualenvs/vm_export_tool/build/robobrowser
Storing complete log in /home/tsc/.pip/pip.log

Python 2.7
pip 1.1
Ubuntu 12.10
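A common workaround for this class of failure is for setup.py to read requirements.txt itself rather than calling pip's internal parse_requirements(), whose signature has changed across pip releases. A sketch of that approach (not the actual robobrowser setup.py):

```python
def parse_requirement_lines(lines):
    """Keep non-empty, non-comment lines as install_requires entries."""
    return [ln.strip() for ln in lines
            if ln.strip() and not ln.lstrip().startswith('#')]

# In setup.py one would then do something like (sketch):
#   install_requires=parse_requirement_lines(open('requirements.txt'))
sample = ['requests>=1.2.0\n', '# dev only\n', '\n', 'beautifulsoup4\n']
print(parse_requirement_lines(sample))  # ['requests>=1.2.0', 'beautifulsoup4']
```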

Errors running the example script from the documentation

Hi,

I'm trying to run the first script found here:
http://robobrowser.readthedocs.io/en/latest/readme.html
(the one that scrapes Genius.com)

And I'm getting the following error + warning:

/Users/saulfuhrmann/Computers/VirtualEnviroments/TinderHack/lib/python2.7/site-packages/bs4/__init__.py:181: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html.parser"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

The code that caused this warning is on line 9 of the file test_robo.py. To get rid of this warning, change code that looks like this:

 BeautifulSoup([your markup])

to this:

 BeautifulSoup([your markup], "html.parser")

  markup_type=markup_type))
Traceback (most recent call last):
  File "test_robo.py", line 11, in <module>
    form['q'].value = 'queen'
TypeError: 'NoneType' object has no attribute '__getitem__'

I'm using Python 2.7.10 and my OS is OS X El Capitan (10.11.4).

Am I doing something wrong?
Or did Genius.com change their interface, so that the script is out of date?

-- Saul
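The TypeError means browser.get_form(action='/search') returned None, most likely because Genius changed its markup and no form matched the selector. Checking for None before subscripting turns the cryptic failure into a clear message; a runnable stand-in (get_form_stub is a placeholder for the real browser.get_form call):

```python
def get_form_stub(found):
    # Placeholder for browser.get_form(action='/search'):
    # robobrowser returns None when no form on the page matches.
    return {'q': ''} if found else None

form = get_form_stub(found=False)
if form is None:
    message = 'search form not found; the page layout probably changed'
else:
    form['q'] = 'queen'
    message = 'ok'
print(message)  # search form not found; the page layout probably changed
```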

RoboBrowser and Wordpress

Hi!
I'm quite the newbie when it comes to robobrowser, but I want to post to WordPress using it.
I managed to log in, but since WP is AJAX-driven, I can't manage to create, fill out, and publish posts.

Is this possible at all with robobrowser, or is it not supported? If it is supported, can anyone tell me how to work with AJAX sites that refresh themselves and do auto-saves, like WP does?

Disabled <input> elements should not be submitted.

According to 1 and 2, the disabled attribute prevents a control's value from being submitted.

It seems that robobrowser does not check whether elements are disabled. In my case, a disabled checkbox in a form was submitted unexpectedly.
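The expected behavior can be sketched with the stdlib parser: any input carrying the disabled attribute is skipped when serializing the form, as browsers do. (This mimics the rule; it is not robobrowser's actual serialization code.)

```python
from html.parser import HTMLParser

class FormScanner(HTMLParser):
    """Collect the name/value pairs a browser would actually submit."""
    def __init__(self):
        super().__init__()
        self.payload = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        a = dict(attrs)
        if "disabled" in a:        # browsers never submit disabled controls
            return
        if a.get("name"):
            self.payload[a["name"]] = a.get("value", "")

html = ('<form><input name="a" value="1">'
        '<input name="b" value="2" disabled></form>')
s = FormScanner()
s.feed(html)
print(s.payload)  # {'a': '1'}
```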

The constructor for Input should accept a string

The following patch will allow the constructor for Input to accept a string, as its docstring says it will.

*** fields.py   2014-07-18 13:22:40.616011172 +0200
--- fields.py~  2014-07-15 16:17:32.000000000 +0200
***************
*** 39,45 ****
      def __init__(self, parsed):
          self._parsed = helpers.ensure_soup(parsed)
          self._value = None
!         self.name = self._get_name(self._parsed)

      def _get_name(self, parsed):
          return parsed.get('name')
--- 39,45 ----
      def __init__(self, parsed):
          self._parsed = helpers.ensure_soup(parsed)
          self._value = None
!         self.name = self._get_name(parsed)

      def _get_name(self, parsed):
          return parsed.get('name')
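What the patch enables is constructing a field directly from raw markup, e.g. Input('<input name="q" value="x" />'), because _get_name then receives the parsed soup rather than the original string. A stdlib-only stand-in for the name extraction, just to illustrate the behavior (not robobrowser's code):

```python
from html.parser import HTMLParser

def get_name(markup):
    """Extract the name attribute from a single <input> tag."""
    found = {}
    class P(HTMLParser):
        def handle_starttag(self, tag, attrs):
            if tag == "input":
                found.update(attrs)
    P().feed(markup)
    return found.get("name")

print(get_name('<input name="q" value="x" />'))  # q
```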

badly handle web pages with encoding errors

As the BeautifulSoup docs say: 'Sometimes a document is mostly in UTF-8, but contains Windows-1252 characters...'. For example, all pages from the http://www.programme-tv.com web site are like this (due to the Ô in 'France Ô').
In that case, RoboBrowser decodes the full document as Windows-1252, leaving all accented characters unreadable:

from robobrowser import RoboBrowser
browser = RoboBrowser()
browser.open('http://www.programme-tv.com')
browser.find('span', 'slogan1').text

The output is:
'Ne ratez plus vos Ã©missions favorites!'
instead of
'Ne ratez plus vos émissions favorites!'

Using self.response.text instead of self.response.content when calling BeautifulSoup solves the problem, but it probably has some drawbacks.

Cheers,
Loïc
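The mojibake described above is reproducible with the stdlib alone (assuming, as the report says, that UTF-8 bytes get decoded as Windows-1252):

```python
# 'é' is two bytes in UTF-8 (0xC3 0xA9); decoding them as Windows-1252
# yields the two characters 'Ã©' instead.
correct = "Ne ratez plus vos émissions favorites!"
mangled = correct.encode("utf-8").decode("cp1252")
print(mangled)  # Ne ratez plus vos Ã©missions favorites!
```

This is why passing response.content (raw bytes) to BeautifulSoup with a wrong encoding guess mangles the text, while response.text lets Requests apply its own encoding detection first.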

BS4 warns that no parser was explicitly specified

I got the following warning with the latest BeautifulSoup4 and robobrowser from PyPI, on Python 3.5:

UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html.parser"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

To get rid of this warning, change this:

BeautifulSoup([your markup])

to this:

BeautifulSoup([your markup], "html.parser")

I think robobrowser should specify the parser explicitly to get repeatable results.
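In the meantime, the parser can be pinned from user code, since RoboBrowser's constructor takes a parser argument that is forwarded to BeautifulSoup: RoboBrowser(parser='html.parser'). A stub version is shown below so the snippet runs without the library installed; with robobrowser present, the real constructor call in the comment is the whole fix.

```python
# With the real library:
#   from robobrowser import RoboBrowser
#   browser = RoboBrowser(parser='html.parser')
class RoboBrowserStub:
    """Stand-in: stores the parser name the way RoboBrowser does."""
    def __init__(self, parser=None):
        # robobrowser hands this value to BeautifulSoup for every
        # response it parses, silencing the warning and making the
        # parse repeatable across machines
        self.parser = parser

browser = RoboBrowserStub(parser='html.parser')
print(browser.parser)  # html.parser
```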
