
fbcrawl's People

Contributors

rugantio, stefanyohansson

fbcrawl's Issues

File contains no section headers

Hello there,
Please, I need help with this error. Whenever I try to run the crawl command, I get the error below. How can I fix it, please?
Thank you
(base) C:\Users\NoNo\Desktop\scrapy\fbcrawl>scrapy crawl fbcrawl -a email="[email protected]" -a password="10wnyu31" -a page="Nike" -a date="2019-01-01" -a lang="en" -o Trump.csv
Traceback (most recent call last):
File "C:\Users\NoNo\Anaconda3\Scripts\scrapy-script.py", line 10, in
sys.exit(execute())
File "C:\Users\NoNo\Anaconda3\lib\site-packages\scrapy\cmdline.py", line 110, in execute
settings = get_project_settings()
File "C:\Users\NoNo\Anaconda3\lib\site-packages\scrapy\utils\project.py", line 63, in get_project_settings
init_env(project)
File "C:\Users\NoNo\Anaconda3\lib\site-packages\scrapy\utils\conf.py", line 84, in init_env
cfg = get_config()
File "C:\Users\NoNo\Anaconda3\lib\site-packages\scrapy\utils\conf.py", line 98, in get_config
cfg.read(sources)
File "C:\Users\NoNo\Anaconda3\lib\configparser.py", line 696, in read
self._read(fp, filename)
File "C:\Users\NoNo\Anaconda3\lib\configparser.py", line 1079, in _read
raise MissingSectionHeaderError(fpname, lineno, line)
configparser.MissingSectionHeaderError: File contains no section headers.
file: 'C:\Users\NoNo\Desktop\scrapy\fbcrawl\scrapy.cfg', line: 7
'\n'
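For reference, a freshly generated scrapy.cfg contains nothing before its first [section] header; configparser raises MissingSectionHeaderError when it reaches a line (here, line 7) without having seen one, which usually means the file was edited or re-saved with a stray line or an unexpected encoding. A rough sketch of the file scrapy startproject generates, for comparison (regenerating the project is the quickest fix):

    # scrapy.cfg -- sketch of the default project file
    [settings]
    default = fbcrawl.settings

    [deploy]
    project = fbcrawl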

Email and password login issues

I'm trying to get comments from one public page for academic purposes, and it says that my email and password login is invalid, but when I try to log in on fb's page directly it's all fine. How can I handle this situation?

getting empty csv files from the updated comments crawler

Hi,

I am new to scrapy and am learning from running your code. I run in console:
scrapy crawl comments -a email=“XXXXXXX” -a password=“YYYYYY” -a page=“https://mbasic.facebook.com/DonaldTrump/posts/10162238538600725” -a lang=en -o DUMPFILE.csv

However, the csv files created are empty. Would you please point out what I might have got wrong? Here are the logs.

2019-03-04 14:02:02 [scrapy.utils.log] INFO: Scrapy 1.6.0 started (bot: fbcrawl)
2019-03-04 14:02:02 [scrapy.utils.log] INFO: Versions: lxml 4.1.1.0, libxml2 2.9.7, cssselect 1.0.3, parsel 1.5.1, w3lib 1.20.0, Twisted 18.9.0, Python 3.6.4 |Anaconda, Inc.| (default, Jan 16 2018, 12:04:33) - [GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)], pyOpenSSL 17.5.0 (OpenSSL 1.0.2o 27 Mar 2018), cryptography 2.1.4, Platform Darwin-18.2.0-x86_64-i386-64bit
2019-03-04 14:02:02 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'fbcrawl', 'CONCURRENT_REQUESTS': 1, 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter', 'FEED_EXPORT_ENCODING': 'utf-8', 'FEED_EXPORT_FIELDS': ['source', 'reply_to', 'date', 'reactions', 'text', 'url'], 'FEED_FORMAT': 'csv', 'FEED_URI': 'DUMPFILE.csv', 'NEWSPIDER_MODULE': 'fbcrawl.spiders', 'SPIDER_MODULES': ['fbcrawl.spiders'], 'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'}
2019-03-04 14:02:02 [scrapy.extensions.telnet] INFO: Telnet Password: fd22697acc9cc93e
2019-03-04 14:02:02 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats']
2019-03-04 14:02:02 [comments] INFO: Email and password provided, using these as credentials
2019-03-04 14:02:02 [comments] INFO: Page attribute provided, scraping "/DonaldTrump/posts/10162238538600725”"
2019-03-04 14:02:02 [comments] INFO: Year attribute not found, set scraping back to 2018
2019-03-04 14:02:02 [comments] INFO: Language attribute recognized, using "en" for the facebook interface
2019-03-04 14:02:02 [scrapy.core.engine] INFO: Spider opened
2019-03-04 14:02:02 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-03-04 14:02:02 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2019-03-04 14:02:02 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://mbasic.facebook.com> (referer: None)
2019-03-04 14:02:03 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://mbasic.facebook.com/login/?email=efsrc=https%3A%2F%2Fmbasic.facebook.com%2F&_rdr> from <POST https://mbasic.facebook.com/login/device-based/regular/login/?email=......&refsrcrefsrc=https%3A%2F%2Fmbasic.facebook.com%2F&lwv=100&refid=8>
2019-03-04 14:02:03 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://mbasic.facebook.com/login/?email=......&refsrc=https%3A%2F%2Fmbasic.facebook.com%2F&_rdr> (referer: https://mbasic.facebook.com)
2019-03-04 14:02:03 [comments] INFO: Scraping facebook page https://mbasic.facebook.com/DonaldTrump/posts/10162238538600725”
2019-03-04 14:02:03 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://mbasic.facebook.com/DonaldTrump/posts/10162238538600725%E2%80%9D> (referer: https://mbasic.facebook.com/login/?email=......&refsrc=https%3A%2F%2Fmbasic.facebook.com%2F&_rdr)
2019-03-04 14:02:03 [scrapy.core.engine] INFO: Closing spider (finished)
2019-03-04 14:02:03 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 2217,
'downloader/request_count': 4,
'downloader/request_method_count/GET': 3,
'downloader/request_method_count/POST': 1,
'downloader/response_bytes': 14263,
'downloader/response_count': 4,
'downloader/response_status_count/200': 3,
'downloader/response_status_count/302': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2019, 3, 4, 22, 2, 3, 978350),
'log_count/DEBUG': 4,
'log_count/INFO': 11,
'memusage/max': 52232192,
'memusage/startup': 52232192,
'request_depth_max': 2,
'response_received_count': 3,
'scheduler/dequeued': 4,
'scheduler/dequeued/memory': 4,
'scheduler/enqueued': 4,
'scheduler/enqueued/memory': 4,
'start_time': datetime.datetime(2019, 3, 4, 22, 2, 2, 184250)}
2019-03-04 14:02:03 [scrapy.core.engine] INFO: Spider closed (finished)
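One likely cause is visible in the log itself: the crawled URL ends in %E2%80%9D, the percent-encoding of a right curly quote (”), and the spider logs the page as .../10162238538600725”. The command was pasted with smart quotes, so the closing quote became part of the page argument, Facebook served a non-post page, and nothing was scraped. Re-typing the command with plain ASCII quotes should help:

scrapy crawl comments -a email="XXXXXXX" -a password="YYYYYY" -a page="https://mbasic.facebook.com/DonaldTrump/posts/10162238538600725" -a lang=en -o DUMPFILE.csv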

Profiles crawler

Thank you for the very useful crawler!
I saw there is a spider for profiles (the profiles.py file). Does it work? If so, what command starts it?

Can't crawl due to Non-ASCII character

I ran "scrapy crawl fb -a email="[email protected]" -a password="10wnyu31" -a page="DonaldTrump" -a date="2018-01-01" -a lang="it" -o Trump.csv" at cm but it didn't work
This error : " File "/Users/elchapo/fbcrawl-master/fbcrawl/spiders/comments.py", line 297
SyntaxError: Non-ASCII character '\xe2' in file /Users/elchapo/fbcrawl-master/fbcrawl/spiders/comments.py on line 297, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details"
Can you help me to fix this?
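This SyntaxError only occurs under Python 2, which assumes ASCII source files unless an encoding is declared; Python 3 defaults to UTF-8. Two options: run the spider under Python 3, or, as PEP 263 (linked in the error) describes, declare the encoding on the first line of comments.py:

    # -*- coding: utf-8 -*-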

No module named "items"

Hello,

I'm getting a problem while running the command.
It seems that there is no module named items to import. I don't know if it's a library I don't have installed or some other setting I didn't configure.

Please find the screenshot related to the error attached.


Thanks for your help.
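Since the screenshot isn't reproduced here, this is a guess: the spiders import their item definitions from fbcrawl/items.py, and that import only resolves when scrapy is run from the project root (the directory containing scrapy.cfg) with items.py present. A minimal sketch of what Scrapy expects in that file, with field names taken from the FEED_EXPORT_FIELDS visible in other logs on this page (the class name follows the scrapy startproject template and is an assumption):

    # fbcrawl/items.py -- minimal sketch
    import scrapy

    class FbcrawlItem(scrapy.Item):
        source = scrapy.Field()
        date = scrapy.Field()
        text = scrapy.Field()
        reactions = scrapy.Field()
        comments = scrapy.Field()
        url = scrapy.Field()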

Can't crawl older posts

Using this script, it's impossible to crawl posts older than 3 months. Every time it reaches 3 months back, it starts showing recent posts again instead of older results.

Is there any way to solve this problem?

Add method to save login

Facebook saves the login session in the c_user, fr and xs cookies. Is there any way we can save the login cookies, so we don't have to log in each time we run the scraper?
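A minimal sketch of how saved cookies could be reused with Scrapy's built-in cookie support, assuming you copy valid c_user, fr and xs values from a logged-in browser session (this is not part of fbcrawl itself, and the placeholder values are hypothetical):

    import scrapy

    class CookieLoginSpider(scrapy.Spider):
        name = 'cookie_login'

        def start_requests(self):
            # cookies copied from a logged-in browser session (placeholders)
            cookies = {'c_user': '<user-id>', 'fr': '<fr-token>', 'xs': '<xs-token>'}
            yield scrapy.Request('https://mbasic.facebook.com/',
                                 cookies=cookies,
                                 callback=self.parse)

        def parse(self, response):
            # if the cookies are still valid, this is the logged-in home page
            self.logger.info('Landed on %s', response.url)

Scrapy's CookiesMiddleware then carries the session across subsequent requests, so the email/password login step could be skipped entirely.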

Error in establishing the year.

Hello again. I have found that when I set the year I want posts from, the code scrapes that year and also the previous one. I only want 2019 scraped, but I also get 2018. Any way to fix it?
Thank you.

scrapy crawl fb -a email="[email protected]" -a password="xxxxxxx." -a page="MotoShopOnline.es" -a year="2019" -a lang="en" -o moto.csv
2019-03-23 12:33:15 [scrapy.utils.log] INFO: Scrapy 1.5.2 started (bot: fbcrawl)
2019-03-23 12:33:16 [scrapy.utils.log] INFO: Versions: lxml 4.3.0.0, libxml2 2.9.9, cssselect 1.0.3, parsel 1.5.1, w3lib 1.20.0, Twisted 18.9.0, Python 3.6.7 (default, Oct 22 2018, 11:32:17) - [GCC 8.2.0], pyOpenSSL 19.0.0 (OpenSSL 1.1.1a 20 Nov 2018), cryptography 2.1.4, Platform Linux-4.15.0-46-generic-x86_64-with-Ubuntu-18.04-bionic
2019-03-23 12:33:16 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'fbcrawl', 'FEED_EXPORT_ENCODING': 'utf-8', 'FEED_EXPORT_FIELDS': ['source', 'shared_from', 'date', 'text', 'reactions', 'likes', 'ahah', 'love', 'wow', 'sigh', 'grrr', 'comments', 'url'], 'FEED_FORMAT': 'csv', 'FEED_URI': 'moto.csv', 'LOG_LEVEL': 'INFO', 'NEWSPIDER_MODULE': 'fbcrawl.spiders', 'SPIDER_MODULES': ['fbcrawl.spiders'], 'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'}
2019-03-23 12:33:16 [scrapy.extensions.telnet] INFO: Telnet Password: e0f8938f46c492da
2019-03-23 12:33:17 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats']
2019-03-23 12:33:17 [fb] INFO: Email and password provided, using these as credentials
2019-03-23 12:33:17 [fb] INFO: Page attribute provided, scraping "MotoShopOnline.es"
2019-03-23 12:33:17 [fb] INFO: Year attribute found, set scraping back to 2019
2019-03-23 12:33:17 [fb] INFO: Language attribute recognized, using "en" for the facebook interface
2019-03-23 12:33:18 [scrapy.core.engine] INFO: Spider opened
2019-03-23 12:33:18 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-03-23 12:33:20 [fb] INFO: Got stuck in "save-device" checkpoint
2019-03-23 12:33:20 [fb] INFO: I will now try to redirect to the correct page
2019-03-23 12:33:22 [fb] INFO: Scraping facebook page https://mbasic.facebook.com/MotoShopOnline.es
2019-03-23 12:33:23 [fb] INFO: Parsing post n = 0
2019-03-23 12:33:23 [fb] INFO: Parsing post n = 1
2019-03-23 12:33:24 [fb] INFO: Parsing post n = 2
2019-03-23 12:33:24 [fb] INFO: Parsing post n = 3
2019-03-23 12:33:24 [fb] INFO: Parsing post n = 4
2019-03-23 12:33:24 [fb] INFO: FLAG DOES NOT ALWAYS REPRESENT ACTUAL YEAR
2019-03-23 12:33:24 [fb] INFO: First page scraped, click on more! Flag not set, default flag = 2019
2019-03-23 12:33:26 [fb] INFO: Parsing post n = 5
2019-03-23 12:33:26 [fb] INFO: Parsing post n = 6
2019-03-23 12:33:26 [fb] INFO: Parsing post n = 7
2019-03-23 12:33:26 [fb] INFO: Parsing post n = 8
2019-03-23 12:33:26 [fb] INFO: Parsing post n = 9
2019-03-23 12:33:26 [fb] INFO: Page scraped, click on more! flag = 2019
2019-03-23 12:33:28 [fb] INFO: Parsing post n = 10
2019-03-23 12:33:28 [fb] INFO: Parsing post n = 11
2019-03-23 12:33:28 [fb] INFO: Parsing post n = 12
2019-03-23 12:33:28 [fb] INFO: Parsing post n = 13
2019-03-23 12:33:29 [fb] INFO: Parsing post n = 14
2019-03-23 12:33:29 [fb] INFO: Page scraped, click on more! flag = 2019
2019-03-23 12:33:30 [fb] INFO: Parsing post n = 15
2019-03-23 12:33:30 [fb] INFO: Parsing post n = 16
2019-03-23 12:33:30 [fb] INFO: Parsing post n = 17
2019-03-23 12:33:31 [fb] INFO: Parsing post n = 18
2019-03-23 12:33:31 [fb] INFO: Parsing post n = 19
2019-03-23 12:33:31 [fb] INFO: Page scraped, click on more! flag = 2019
2019-03-23 12:33:35 [fb] INFO: Parsing post n = 20
2019-03-23 12:33:35 [fb] INFO: Parsing post n = 21
2019-03-23 12:33:35 [fb] INFO: Parsing post n = 22
2019-03-23 12:33:35 [fb] INFO: Parsing post n = 23
2019-03-23 12:33:35 [fb] INFO: Parsing post n = 24
2019-03-23 12:33:36 [fb] INFO: Page scraped, click on more! flag = 2019
2019-03-23 12:33:37 [fb] INFO: Parsing post n = 25
2019-03-23 12:33:37 [fb] INFO: Parsing post n = 26
2019-03-23 12:33:38 [fb] INFO: Parsing post n = 27
2019-03-23 12:33:38 [fb] INFO: Parsing post n = 28
2019-03-23 12:33:38 [fb] INFO: Parsing post n = 29
2019-03-23 12:33:38 [fb] INFO: Page scraped, click on more! flag = 2019
2019-03-23 12:33:39 [fb] INFO: Parsing post n = 30
2019-03-23 12:33:39 [fb] INFO: Parsing post n = 31
2019-03-23 12:33:39 [fb] INFO: Parsing post n = 32
2019-03-23 12:33:40 [fb] INFO: Parsing post n = 33
2019-03-23 12:33:40 [fb] INFO: Parsing post n = 34
2019-03-23 12:33:40 [fb] INFO: Page scraped, click on more! flag = 2019
2019-03-23 12:33:41 [fb] INFO: Parsing post n = 35
2019-03-23 12:33:42 [fb] INFO: Parsing post n = 36
2019-03-23 12:33:42 [fb] INFO: Parsing post n = 37
2019-03-23 12:33:42 [fb] INFO: Parsing post n = 38
2019-03-23 12:33:42 [fb] INFO: Parsing post n = 39
2019-03-23 12:33:42 [fb] INFO: Page scraped, click on more! flag = 2019
2019-03-23 12:33:43 [fb] INFO: Parsing post n = 40
2019-03-23 12:33:43 [fb] INFO: Parsing post n = 41
2019-03-23 12:33:43 [fb] INFO: Parsing post n = 42
2019-03-23 12:33:43 [fb] INFO: Parsing post n = 43
2019-03-23 12:33:44 [fb] INFO: Parsing post n = 44
2019-03-23 12:33:44 [fb] INFO: Page scraped, click on more! flag = 2019
2019-03-23 12:33:46 [fb] INFO: Parsing post n = 45
2019-03-23 12:33:46 [fb] INFO: Parsing post n = 46
2019-03-23 12:33:46 [fb] INFO: Parsing post n = 47
2019-03-23 12:33:46 [fb] INFO: Parsing post n = 48
2019-03-23 12:33:47 [fb] INFO: Parsing post n = 49
2019-03-23 12:33:47 [fb] INFO: Page scraped, click on more! flag = 2019
2019-03-23 12:33:49 [fb] INFO: Parsing post n = 50
2019-03-23 12:33:49 [fb] INFO: Parsing post n = 51
2019-03-23 12:33:49 [fb] INFO: Parsing post n = 52
2019-03-23 12:33:49 [fb] INFO: Parsing post n = 53
2019-03-23 12:33:49 [fb] INFO: Parsing post n = 54
2019-03-23 12:33:50 [fb] INFO: Page scraped, click on more! flag = 2019
2019-03-23 12:33:51 [fb] INFO: Parsing post n = 55
2019-03-23 12:33:51 [fb] INFO: Parsing post n = 56
2019-03-23 12:33:51 [fb] INFO: Parsing post n = 57
2019-03-23 12:33:51 [fb] INFO: Parsing post n = 58
2019-03-23 12:33:52 [fb] INFO: Parsing post n = 59
2019-03-23 12:33:52 [fb] INFO: Page scraped, click on more! flag = 2019
2019-03-23 12:33:53 [fb] INFO: Parsing post n = 60
2019-03-23 12:33:53 [fb] INFO: Parsing post n = 61
2019-03-23 12:33:54 [fb] INFO: There are no more, flag set at = 2019
2019-03-23 12:33:54 [fb] INFO: XPATH not found for year 2018
2019-03-23 12:33:54 [fb] INFO: Trying with previous year, flag=2018
2019-03-23 12:33:54 [fb] INFO: The previous year to crawl is less than the parameter year: 2018 < 2019
2019-03-23 12:33:54 [fb] INFO: This is not handled well, please re-run with -a year="2018" or less
2019-03-23 12:33:54 [fb] INFO: New page found with flag 2018
2019-03-23 12:33:54 [scrapy.core.scraper] ERROR: Spider error processing <GET https://mbasic.facebook.com/MotoShopOnline.es?sectionLoadingID=m_timeline_loading_div_1554101999_0_36_timeline_unit%3A1%3A00000000001536657425%3A04611686018427387904%3A09223372036854775746%3A04611686018427387904&unit_cursor=timeline_unit%3A1%3A00000000001536657425%3A04611686018427387904%3A09223372036854775746%3A04611686018427387904&timeend=1554101999&timestart=0&tm=AQAJ6WfaA8LJagg-&refid=17> (referer: https://mbasic.facebook.com/MotoShopOnline.es?sectionLoadingID=m_timeline_loading_div_1554101999_0_36_timeline_unit%3A1%3A00000000001536661899%3A04611686018427387904%3A09223372036854775751%3A04611686018427387904&unit_cursor=timeline_unit%3A1%3A00000000001536661899%3A04611686018427387904%3A09223372036854775751%3A04611686018427387904&timeend=1554101999&timestart=0&tm=AQAJ6WfaA8LJagg-&refid=17)
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/scrapy/utils/defer.py", line 102, in iter_errback
yield next(it)
File "/usr/local/lib/python3.6/dist-packages/scrapy/spidermiddlewares/offsite.py", line 30, in process_spider_output
for x in result:
File "/usr/local/lib/python3.6/dist-packages/scrapy/spidermiddlewares/referer.py", line 339, in
return (_set_referer(r) for r in result or ())
File "/usr/local/lib/python3.6/dist-packages/scrapy/spidermiddlewares/urllength.py", line 37, in
return (r for r in result or () if _filter(r))
File "/usr/local/lib/python3.6/dist-packages/scrapy/spidermiddlewares/depth.py", line 58, in
return (r for r in result or () if _filter(r))
File "/media/compartida/master/marzo/virgen/fbcrawl-master/fbcrawl/spiders/fbcrawl.py", line 179, in parse_page
new_page = response.urljoin(new_page[0])
IndexError: list index out of range
2019-03-23 12:33:55 [scrapy.core.engine] INFO: Closing spider (finished)
2019-03-23 12:33:55 [scrapy.extensions.feedexport] INFO: Stored csv feed (62 items) in: moto.csv
2019-03-23 12:33:55 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 303245,
'downloader/request_count': 142,
'downloader/request_method_count/GET': 140,
'downloader/request_method_count/POST': 2,
'downloader/response_bytes': 1203167,
'downloader/response_count': 142,
'downloader/response_status_count/200': 140,
'downloader/response_status_count/302': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2019, 3, 23, 11, 33, 55, 506133),
'item_scraped_count': 62,
'log_count/ERROR': 1,
'log_count/INFO': 94,
'memusage/max': 50860032,
'memusage/startup': 50860032,
'request_depth_max': 17,
'response_received_count': 140,
'scheduler/dequeued': 142,
'scheduler/dequeued/memory': 142,
'scheduler/enqueued': 142,
'scheduler/enqueued/memory': 142,
'spider_exceptions/IndexError': 1,
'start_time': datetime.datetime(2019, 3, 23, 11, 33, 18, 222031)}
2019-03-23 12:33:55 [scrapy.core.engine] INFO: Spider closed (finished)
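Separately from the year logic, the IndexError at the end can be avoided by guarding the empty selector list before indexing in parse_page; a sketch against the frame shown in the traceback, where SHOW_MORE_XPATH stands in for whatever expression fbcrawl already uses for the "more" link:

    # fbcrawl/spiders/fbcrawl.py, parse_page (around the line in the traceback)
    new_page = response.xpath(SHOW_MORE_XPATH).extract()
    if not new_page:
        self.logger.info('No next-page link found, stopping the crawl')
        return
    new_page = response.urljoin(new_page[0])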

Cannot crawl more than 60 pages comment

I tried to crawl a post with 2700 comments, but I can only get it to run up to page 60.

The post link:
m.facebook.com/story.php?story_fbid=2226458920929531&id=2226454927596597&p=60&av=100036884506828&eav=AfaGbbgANEKjTi_nwspovtij7sx25oDoiBkQDA3hr_fqX5KDrdqrLBeclI6ydoKspu8&refid=52

The command:
scrapy crawl comments -a email="MAIL" -a password="PASSWORD" -a post="m.facebook.com/story.php?story_fbid=2226458920929531&id=2226454927596597" -o comment_post.csv -a lang="en" -a date="2019-05-05"

Because of that, I could only get about 126 comments from this post.
Is there a way to improve this, or an alternative approach?
Any suggestions would be welcome.

Old posts returning errors

In old posts reactions were not implemented; parse_post and parse_reactions need to be adjusted accordingly.

1st error: following the full-text link ("notizia completa", i.e. "full story") returns:

[scrapy.core.scraper] ERROR: Spider error processing <GET https://mbasic.facebook.com/ninuxfirenze?sectionLoadingID=m_timeline_loading_div_1388563199_1357027200_8_timeline_unit%3A1%3A00000000001382986503%3A04611686018427387904%3A09223372036854775798%3A04611686018427387904&unit_cursor=timeline_unit%3A1%3A00000000001382986503%3A04611686018427387904%3A09223372036854775798%3A04611686018427387904&timeend=1388563199&timestart=1357027200&tm=AQBL9hUDXCpxoiCM&refid=17> (referer: https://mbasic.facebook.com/ninuxfirenze?sectionLoadingID=m_timeline_loading_div_1388563199_1357027200_8_timeline_unit%3A1%3A00000000001384117176%3A04611686018427387904%3A09223372036854775803%3A04611686018427387904&unit_cursor=timeline_unit%3A1%3A00000000001384117176%3A04611686018427387904%3A09223372036854775803%3A04611686018427387904&timeend=1388563199&timestart=1357027200&tm=AQBL9hUDXCpxoiCM&refid=17)
Traceback (most recent call last):
  File "/usr/lib/python3.7/site-packages/scrapy/utils/defer.py", line 102, in iter_errback
    yield next(it)
  File "/usr/lib/python3.7/site-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
    for x in result:
  File "/usr/lib/python3.7/site-packages/scrapy/spidermiddlewares/referer.py", line 339, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/usr/lib/python3.7/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/usr/lib/python3.7/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/home/rugantio/Downloads/fbcrawl/fbcrawl/spiders/fbcrawl.py", line 193, in parse_page
    new_page = response.urljoin(new_page[0])
IndexError: list index out of range

2nd error: Reaction page is empty

[scrapy.core.scraper] ERROR: Spider error processing <GET https://mbasic.facebook.com/ufi/reaction/profile/browser/?ft_ent_identifier=717978068231023&refid=17&_ft_=top_level_post_id.717978478230982%3Atl_objid.717978478230982%3Apage_id.717952914900205%3Aphoto_attachments_list.%5B717978291564334%2C717978641564299%2C717978478230982%5D%3Aphoto_id.717978291564334%3Astory_location.4%3Astory_attachment_style.new_album%3Apage_insights.%7B%22717952914900205%22%3A%7B%22role%22%3A1%2C%22page_id%22%3A717952914900205%2C%22post_context%22%3A%7B%22story_fbid%22%3A%5B717978754897621%2C717978068231023%5D%2C%22publish_time%22%3A1382985680%2C%22story_name%22%3A%22EntPhotoNodeBasedEdgeStory%22%2C%22object_fbtype%22%3A22%7D%2C%22actor_id%22%3A717952914900205%2C%22psn%22%3A%22EntPhotoNodeBasedEdgeStory%22%2C%22sl%22%3A4%2C%22dm%22%3A%7B%22isShare%22%3A0%2C%22originalPostOwnerID%22%3A0%7D%2C%22targets%22%3A%5B%7B%22page_id%22%3A717952914900205%2C%22actor_id%22%3A717952914900205%2C%22role%22%3A1%2C%22post_id%22%3A717978754897621%2C%22share_id%22%3A0%7D%2C%7B%22page_id%22%3A717952914900205%2C%22actor_id%22%3A717952914900205%2C%22role%22%3A1%2C%22post_id%22%3A717978068231023%2C%22share_id%22%3A0%7D%5D%7D%7D%3Athid.717952914900205%3A306061129499414%3A43%3A0%3A1556693999%3A-3054257213565457848&__tn__=%2AW-R#footer_action_list> (referer: https://mbasic.facebook.com/ninuxfirenze?sectionLoadingID=m_timeline_loading_div_1556693999_0_36_timeline_unit%3A1%3A00000000001383574409%3A04611686018427387904%3A09223372036854775776%3A04611686018427387904&unit_cursor=timeline_unit%3A1%3A00000000001383574409%3A04611686018427387904%3A09223372036854775776%3A04611686018427387904&timeend=1556693999&timestart=0&tm=AQBTYgKZm-RBkzwc&refid=17)
Traceback (most recent call last):
  File "/usr/lib/python3.7/site-packages/scrapy/utils/defer.py", line 102, in iter_errback
    yield next(it)
  File "/usr/lib/python3.7/site-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
    for x in result:
  File "/usr/lib/python3.7/site-packages/scrapy/spidermiddlewares/referer.py", line 339, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/usr/lib/python3.7/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/usr/lib/python3.7/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/home/rugantio/Downloads/fbcrawl/fbcrawl/spiders/fbcrawl.py", line 218, in parse_post
    reactions = response.urljoin(reactions[0].extract())
  File "/usr/lib/python3.7/site-packages/parsel/selector.py", line 61, in __getitem__
    o = super(SelectorList, self).__getitem__(pos)
IndexError: list index out of range
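For the 2nd error, a hedged sketch of how parse_post could tolerate old posts with no reactions block, using extract_first() (which returns None instead of raising) in place of the [0] indexing shown in the traceback; REACTIONS_XPATH and the item hand-off via meta are assumptions about the surrounding code:

    # fbcrawl/spiders/fbcrawl.py, parse_post
    reactions_url = response.xpath(REACTIONS_XPATH).extract_first()
    if reactions_url is None:
        # old post, reactions not implemented: yield the item without counts
        yield item
    else:
        yield scrapy.Request(response.urljoin(reactions_url),
                             callback=self.parse_reactions,
                             meta={'item': item})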

HTTP Error

Hey, I am getting an HTTP error when I try to run the scraper. Any suggestions?

2018-09-17 15:41:57 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <404 https://mbasic.facebook.com/login/save-device/SoCipe2L>: HTTP status code is not handled or not allowed

I uncommented the user agent setting and tried putting in other user agents, but I get the same thing.

Post ID changes

Last time I crawled, the post IDs looked like 145165568851292_2255025891198572, but when I re-crawled after a while, the post ID came out as 2.77E+15. Does anyone know how to get the older format (145165568851292_2255025891198572) back? Or has anything changed in the code?
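One possible explanation (an assumption, since the raw csv isn't shown): 2.77E+15 is how spreadsheets display a large integer, so the ID may still be intact in the file and only rendered in scientific notation. Loading the column as text avoids that; the 'post_id' column name is taken from the FEED_EXPORT_FIELDS visible in other logs on this page:

    import pandas as pd

    # keep IDs as strings so they are never coerced to floats
    df = pd.read_csv('DUMPFILE.csv', dtype={'post_id': str})
    print(df['post_id'].head())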

0 pages is being crawled

Hello,

I am new to scrapy and I have tried your code.
I tried to scrape the Donald Trump page.
I have this being displayed:
[scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
I can't figure out where the problem actually is.

Please find the entire output below:
2018-09-10 23:14:01 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: fbcrawl)
2018-09-10 23:14:01 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'fbcrawl', 'FEED_EXPORT_ENCODING': 'utf-8', 'FEED_EXPORT_FIELDS': ['source', 'date', 'text', 'reactions', 'likes', 'ahah', 'love', 'wow', 'sigh', 'grrr', 'comments', 'url'], 'LOG_LEVEL': 'INFO', 'NEWSPIDER_MODULE': 'fbcrawl.spiders', 'SPIDER_MODULES': ['fbcrawl.spiders']}
2018-09-10 23:14:01 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2018-09-10 23:14:02 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-09-10 23:14:02 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-09-10 23:14:02 [scrapy.middleware] INFO: Enabled item pipelines:
['fbcrawl.pipelines.FbcrawlPipeline']
2018-09-10 23:14:02 [scrapy.core.engine] INFO: Spider opened
2018-09-10 23:14:02 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-09-10 23:14:05 [fb] INFO: Parse function called on https://mbasic.facebook.com/DonaldTrump/?refid=46
2018-09-10 23:14:06 [scrapy.core.scraper] ERROR: Spider error processing <GET https://mbasic.facebook.com/DonaldTrump/?refid=46> (referer: https://mbasic.facebook.com/login/save-device/?login_source=login&refsrc=https%3A%2F%2Fmbasic.facebook.com%2F&refid=8&_rdr)

Could I create a pull request?

Hello,

I'm proposing a pull request. I've added error handling for getting blocked and for using incorrect or old passwords. I've added the capability to view more complex reaction data and profile data within comments.py, so you can get a single csv file with all of this information. I added some random time pauses to avoid blocking and extended the amount of data collected from profiles. I also noticed that the comment parser would only recurse backwards through comments, even though on mbasic long comment threads sometimes start in the middle of the thread, losing a lot of data. I'd like to create a pull request and have you look over my code if possible. Also, thank you so much for creating this code; it was incredibly helpful for my research into social media data.

no module named urllib2

Hi all,

I keep getting the error "no module named 'urllib2'". I am using Python 3.7 (I tried earlier versions too), and urllib2 is a Python 2 module as far as I can tell. Can anyone help me solve this?
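urllib2 was indeed removed in Python 3; its functionality now lives in urllib.request and urllib.error. Wherever the failing import appears, a Python 3 replacement looks like this (a generic sketch, not specific to fbcrawl):

    # Python 2:  import urllib2; urllib2.urlopen(url)
    # Python 3 equivalent:
    from urllib.request import urlopen

    response = urlopen('https://example.com')
    print(response.status)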

Redirect wrong url after login

I can't run the example command to crawl the Donald Trump page because the crawler redirects to the wrong page url after login. This is the log:

2019-07-22 14:23:59 [fb] INFO: Email and password provided, will be used to log in
2019-07-22 14:23:59 [fb] INFO: Date attribute provided, fbcrawl will stop crawling at 2018-01-01
2019-07-22 14:23:59 [fb] INFO: Language attribute recognized, using "it" for the facebook interface
2019-07-22 14:23:59 [scrapy.core.engine] INFO: Spider opened
2019-07-22 14:23:59 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-07-22 14:23:59 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2019-07-22 14:23:59 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://mbasic.facebook.com> (referer: None)
2019-07-22 14:24:12 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://mbasic.facebook.com/login/?email=barackobama%40gmail.com&li=j2Q1XXELbOA51hb-357uE9ux&e=1348028&refsrc=https%3A%2F%2Fmbasic.facebook.com%2F&_rdr> from <POST https://mbasic.facebook.com/login/device-based/regular/login/?refsrc=https%3A%2F%2Fmbasic.facebook.com%2F&lwv=100&refid=8>
2019-07-22 14:24:12 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://mbasic.facebook.com/login/?email=barackobama%40gmail.com&li=j2Q1XXELbOA51hb-357uE9ux&e=1348028&refsrc=https%3A%2F%2Fmbasic.facebook.com%2F&_rdr> (referer: https://mbasic.facebook.com)
2019-07-22 14:24:12 [fb] INFO: Scraping facebook page https://mbasic.facebook.com/login/DonaldTrump
2019-07-22 14:24:16 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://mbasic.facebook.com/login/DonaldTrump> (referer: https://mbasic.facebook.com/login/?email=barackobama%40gmail.com&li=j2Q1XXELbOA51hb-357uE9ux&e=1348028&refsrc=https%3A%2F%2Fmbasic.facebook.com%2F&_rdr)
2019-07-22 14:24:16 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <404 https://mbasic.facebook.com/login/DonaldTrump>: HTTP status code is not handled or not allowed
2019-07-22 14:24:16 [scrapy.core.engine] INFO: Closing spider (finished)
2019-07-22 14:24:16 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 2586,
 'downloader/request_count': 4,
 'downloader/request_method_count/GET': 3,
 'downloader/request_method_count/POST': 1,
 'downloader/response_bytes': 13791,
 'downloader/response_count': 4,
 'downloader/response_status_count/200': 2,
 'downloader/response_status_count/302': 1,
 'downloader/response_status_count/404': 1,
 'elapsed_time_seconds': 17.708235,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2019, 7, 22, 7, 24, 16, 859339),
 'httperror/response_ignored_count': 1,
 'httperror/response_ignored_status_count/404': 1,
 'log_count/DEBUG': 4,
 'log_count/INFO': 12,
 'memusage/max': 50221056,
 'memusage/startup': 50221056,
 'request_depth_max': 2,
 'response_received_count': 3,
 'scheduler/dequeued': 4,
 'scheduler/dequeued/memory': 4,
 'scheduler/enqueued': 4,
 'scheduler/enqueued/memory': 4,
 'start_time': datetime.datetime(2019, 7, 22, 7, 23, 59, 151104)}
2019-07-22 14:24:16 [scrapy.core.engine] INFO: Spider closed (finished)
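Two things stand out in the log. First, the POST to the login endpoint redirects back to /login/?email=...&e=1348028, which suggests the login itself did not complete (the e= parameter looks like an error code), so verifying the credentials in a browser is worth doing first. Second, the spider then resolves the page name relative to that login URL, producing /login/DonaldTrump and the 404; a defensive sketch would build the page URL absolutely instead (attribute names are assumptions about fbcrawl's internals):

    # build the target page URL absolutely rather than joining it
    # against the current (login) response
    page_url = 'https://mbasic.facebook.com/' + self.page.lstrip('/')
    yield scrapy.Request(page_url, callback=self.parse_page)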

Friend Pages Not Scraped

Hi

Thank you for your contribution, this repo is awesome!

When scraping a person's 'page', like a celebrity or public figure, the program works perfectly without a hitch.
The problem arises when scraping a friend's profile: scraping stops after only 1 page.
Also, when trying to scrape the limited 'profile' available of someone I am not friends with, nothing is scraped.

I am willing to fix this issue myself and contribute where I can; however, I mostly do ML and have little knowledge of the web.
If someone can help resolve this (saving me much precious time) or guide me, it would be highly appreciated.

Thank You!

Error from VMWare Windows Instance

I am running fbcrawl from a VMWare Windows 10 image. I have Python 3.7, and I see the following error when I run the command:

2020-03-11 19:31:57 [fb] INFO: Going through the "save-device" checkpoint
2020-03-11 19:32:03 [fb] INFO: Scraping facebook page https://mbasic.facebook.com/cnn
2020-03-11 19:32:07 [scrapy.core.scraper] ERROR: Spider error processing <GET https://mbasic.facebook.com/cnn> (referer: https://mbasic.facebook.com/?_rdr)
Traceback (most recent call last):
File "c:\users\user\appdata\local\programs\python\python37\lib\site-packages\twisted\internet\defer.py", line 1418, in _inlineCallbacks
result = g.send(result)
File "c:\users\user\appdata\local\programs\python\python37\lib\site-packages\scrapy\core\downloader\middleware.py", line 42, in process_request
defer.returnValue((yield download_func(request=request, spider=spider)))
File "c:\users\user\appdata\local\programs\python\python37\lib\site-packages\twisted\internet\defer.py", line 1362, in returnValue
raise _DefGen_Return(val)
twisted.internet.defer._DefGen_Return: <200 https://mbasic.facebook.com/cnn>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "c:\users\user\appdata\local\programs\python\python37\lib\site-packages\scrapy\utils\defer.py", line 55, in mustbe_deferred
result = f(*args, **kw)
File "c:\users\user\appdata\local\programs\python\python37\lib\site-packages\scrapy\core\spidermw.py", line 60, in process_spider_input
return scrape_func(response, request, spider)
File "c:\users\user\appdata\local\programs\python\python37\lib\site-packages\scrapy\core\scraper.py", line 148, in call_spider
warn_on_generator_with_return_value(spider, callback)
File "c:\users\user\appdata\local\programs\python\python37\lib\site-packages\scrapy\utils\misc.py", line 202, in warn_on_generator_with_return_value
if is_generator_with_return_value(callable):
File "c:\users\user\appdata\local\programs\python\python37\lib\site-packages\scrapy\utils\misc.py", line 187, in is_generator_with_return_value
tree = ast.parse(dedent(inspect.getsource(callable)))
File "c:\users\user\appdata\local\programs\python\python37\lib\ast.py", line 35, in parse
return compile(source, filename, mode, PyCF_ONLY_AST)
File "", line 1
def parse_page(self, response):
^
IndentationError: unexpected indent
2020-03-11 19:32:07 [scrapy.core.engine] INFO: Closing spider (finished)

Blocked after crawling

Don't use your personal facebook profile to crawl

Hello,
We're starting to experience some blockage by Facebook. After a certain number of "next pages" have been visited, the profile is temporarily suspended for about 1 hour.

If scrapy ends abruptly with this error, your account has been blocked:

  File "/fbcrawl/fbcrawl/spiders/fbcrawl.py", line 170, in parse_page
    if response.meta['flag'] == self.k and self.k >= self.year:
KeyError: 'flag'

This prevents you from visiting any page on mbasic.facebook.com during the blocking period; however, the blockage does not seem to be fully enforced on m.facebook.com and facebook.com, where you can still access public pages but not private profiles!


If you are experiencing this issue, in settings.py set:

CONCURRENT_REQUESTS = 1
DOWNLOAD_DELAY = 1

This forces sequential crawling, which noticeably slows the crawler down but assures a better final result. Increase DOWNLOAD_DELAY if you're still experiencing blockage.
More experiments are needed to assess the situation; please report your findings and suggestions here.
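Beyond the two settings above, Scrapy's built-in throttling options can space requests out further; a sketch for settings.py (the values are starting points, not tested against Facebook's limits):

    CONCURRENT_REQUESTS = 1
    DOWNLOAD_DELAY = 1                # raise this if you are still blocked
    RANDOMIZE_DOWNLOAD_DELAY = True   # jitter each delay between 0.5x and 1.5x
    AUTOTHROTTLE_ENABLED = True       # back off automatically when responses slow down
    AUTOTHROTTLE_START_DELAY = 1
    AUTOTHROTTLE_MAX_DELAY = 10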

error in crawling data

I tried the following command:
scrapy crawl comments -a email="[email protected]" -a password="lotus19650807" -a page="https://mbasic.facebook.com/XxSunJinxX" -o DUMPFILE.csv

but I got the error below:

2018-11-20 14:26:28 [scrapy.core.scraper] ERROR: Spider error processing <GET https://mbasic.facebook.com/XxSunJinxX> (referer: https://mbasic.facebook.com/login/save-device/?login_source=login&refsrc=https%3A%2F%2Fmbasic.facebook.com%2F&_rdr)
Traceback (most recent call last):
File "/usr/local/lib/python2.7/site-packages/scrapy/utils/defer.py", line 102, in iter_errback
yield next(it)
File "/usr/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/offsite.py", line 30, in process_spider_output
for x in result:
File "/usr/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/referer.py", line 339, in
return (_set_referer(r) for r in result or ())
File "/usr/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in
return (r for r in result or () if _filter(r))
File "/usr/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in
return (r for r in result or () if _filter(r))
File "/Users/chiangandy/fino/crawler/fbcrawl/fbcrawl/spiders/fbcrawler.py", line 87, in parse_page
temp_post = response.urljoin(post[0])
IndexError: list index out of range
2018-11-20 14:26:28 [scrapy.core.engine] INFO: Closing spider (finished)

I am using Python 2 with Scrapy 1.5.1. Could you please guide me on what's wrong here? Thanks.

Collection of "poll" posts

Hi guys, thanks for this very good tool. I was performing some tests with it, and I realized it cannot scrape "poll" posts properly.

Is this something you plan to solve in the future?

Thanks,

It's not exporting the items to a csv

I managed to run it, and I tried the exporting option

-o DUMPFILE.csv

However, it doesn't seem to be working. It just creates the file but leaves it empty.

Comment scraping for groups

It seems like the spider stops prematurely when scraping comments on posts that are posted in a group.

I ran it with this command scrapy crawl comments -a email="XXXXXXX" -a password="XXXXXXXX" -a page="https://www.facebook.com/groups/725870897781323?view=permalink&id=834512296917182" -o DUMPFILE.csv

The spider usually stops after around 34 comments are crawled.
I've tried links without '/groups/' and those seem to work great :)

LOGS:
2019-03-19 13:42:23 [scrapy.utils.log] INFO: Scrapy 1.5.2 started (bot: fbcrawl)
2019-03-19 13:42:23 [scrapy.utils.log] INFO: Versions: lxml 4.3.0.0, libxml2 2.9.9, cssselect 1.0.3, parsel 1.5.1, w3lib 1.20.0, Twisted 18.9.0, Python 3.7.2 (default, Jan 13 2019, 12:50:01) - [Clang 10.0.0 (clang-1000.11.45.5)], pyOpenSSL 19.0.0 (OpenSSL 1.1.1a 20 Nov 2018), cryptography 2.5, Platform Darwin-18.2.0-x86_64-i386-64bit
2019-03-19 13:42:23 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'fbcrawl', 'CONCURRENT_REQUESTS': 1, 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter', 'FEED_EXPORT_ENCODING': 'utf-8', 'FEED_EXPORT_FIELDS': ['source', 'reply_to', 'date', 'reactions', 'text', 'url'], 'FEED_FORMAT': 'csv', 'FEED_URI': 'DUMPFILE.csv', 'LOG_LEVEL': 'INFO', 'NEWSPIDER_MODULE': 'fbcrawl.spiders', 'SPIDER_MODULES': ['fbcrawl.spiders'], 'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'}
2019-03-19 13:42:23 [scrapy.extensions.telnet] INFO: Telnet Password: f02b39771d3538f9
2019-03-19 13:42:23 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats']
2019-03-19 13:42:23 [comments] INFO: Email and password provided, using these as credentials
2019-03-19 13:42:23 [comments] INFO: Page attribute provided, scraping "groups/725870897781323?view=permalink&id=834512296917182"
2019-03-19 13:42:23 [comments] INFO: Year attribute not found, set scraping back to 2018
2019-03-19 13:42:23 [comments] INFO: Language attribute not provided, I will try to guess it from the fb interface
2019-03-19 13:42:23 [comments] INFO: To specify, add the lang parameter: scrapy fb -a lang="LANGUAGE"
2019-03-19 13:42:23 [comments] INFO: Currently choices for "LANGUAGE" are: "en", "es", "fr", "it", "pt"
2019-03-19 13:42:23 [scrapy.core.engine] INFO: Spider opened
2019-03-19 13:42:23 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-03-19 13:42:25 [comments] INFO: Got stuck in "save-device" checkpoint
2019-03-19 13:42:25 [comments] INFO: I will now try to redirect to the correct page
2019-03-19 13:42:27 [comments] INFO: Language recognized: lang="en"
2019-03-19 13:42:27 [comments] INFO: Scraping facebook page https://mbasic.facebook.com/groups/725870897781323?view=permalink&id=834512296917182
2019-03-19 13:42:28 [comments] INFO: 1 nested comment @ page https://mbasic.facebook.com/comment/replies/?ctoken=834512296917182_841883949513350&count=3&curr&pc=1&ft_ent_identifier=834512296917182&gfid=AQCQx78VIonSP0ki&refid=18&__tn__=R
2019-03-19 13:42:28 [comments] INFO: Nested comments crawl finished, heading to proper page: https://mbasic.facebook.com/groups/725870897781323?view=permalink&id=834512296917182
2019-03-19 13:42:29 [comments] INFO: 2 nested comment @ page https://mbasic.facebook.com/comment/replies/?ctoken=834512296917182_841927396175672&count=10&curr&pc=1&ft_ent_identifier=834512296917182&gfid=AQDD4eOYntwp8ufh&refid=18&__tn__=R
2019-03-19 13:42:29 [comments] INFO: Nested comments crawl finished, heading to proper page: https://mbasic.facebook.com/groups/725870897781323?view=permalink&id=834512296917182
2019-03-19 13:42:30 [comments] INFO: 3 nested comment @ page https://mbasic.facebook.com/comment/replies/?ctoken=834512296917182_842051229496622&count=2&curr&pc=1&ft_ent_identifier=834512296917182&gfid=AQApAo67_snULAYR&refid=18&__tn__=R
2019-03-19 13:42:30 [comments] INFO: Nested comments crawl finished, heading to proper page: https://mbasic.facebook.com/groups/725870897781323?view=permalink&id=834512296917182
2019-03-19 13:42:31 [comments] INFO: 4 nested comment @ page https://mbasic.facebook.com/comment/replies/?ctoken=834512296917182_842259386142473&count=1&curr&pc=1&ft_ent_identifier=834512296917182&gfid=AQCjES4drsLkPo2W&refid=18&__tn__=R
2019-03-19 13:42:31 [comments] INFO: Nested comments crawl finished, heading to proper page: https://mbasic.facebook.com/groups/725870897781323?view=permalink&id=834512296917182
2019-03-19 13:42:31 [comments] INFO: 5 nested comment @ page https://mbasic.facebook.com/comment/replies/?ctoken=834512296917182_842265922808486&count=3&curr&pc=1&ft_ent_identifier=834512296917182&gfid=AQCC-ZHoxE7DnK6X&refid=18&__tn__=R
2019-03-19 13:42:31 [comments] INFO: Nested comments crawl finished, heading to proper page: https://mbasic.facebook.com/groups/725870897781323?view=permalink&id=834512296917182
2019-03-19 13:42:32 [comments] INFO: 6 nested comment @ page https://mbasic.facebook.com/comment/replies/?ctoken=834512296917182_842272482807830&count=1&curr&pc=1&ft_ent_identifier=834512296917182&gfid=AQCSYRmqHSZRd9Ai&refid=18&__tn__=R
2019-03-19 13:42:32 [comments] INFO: Nested comments crawl finished, heading to proper page: https://mbasic.facebook.com/groups/725870897781323?view=permalink&id=834512296917182
2019-03-19 13:42:33 [comments] INFO: 7 nested comment @ page https://mbasic.facebook.com/comment/replies/?ctoken=834512296917182_842291626139249&count=1&curr&pc=1&ft_ent_identifier=834512296917182&gfid=AQBIpjf2-ucvFcKl&refid=18&__tn__=R
2019-03-19 13:42:33 [comments] INFO: Nested comments crawl finished, heading to proper page: https://mbasic.facebook.com/groups/725870897781323?view=permalink&id=834512296917182
2019-03-19 13:42:34 [comments] INFO: 8 nested comment @ page https://mbasic.facebook.com/comment/replies/?ctoken=834512296917182_842308216137590&count=2&curr&pc=1&ft_ent_identifier=834512296917182&gfid=AQDzHPU2LEZh2EOM&refid=18&__tn__=R
2019-03-19 13:42:34 [comments] INFO: Nested comments crawl finished, heading to proper page: https://mbasic.facebook.com/groups/725870897781323?view=permalink&id=834512296917182
2019-03-19 13:42:35 [comments] INFO: 0 regular comment @ page https://mbasic.facebook.com/groups/725870897781323?view=permalink&id=834512296917182
2019-03-19 13:42:35 [comments] INFO: 1 regular comment @ page https://mbasic.facebook.com/groups/725870897781323?view=permalink&id=834512296917182
2019-03-19 13:42:35 [scrapy.core.engine] INFO: Closing spider (finished)
2019-03-19 13:42:35 [scrapy.extensions.feedexport] INFO: Stored csv feed (33 items) in: DUMPFILE.csv
2019-03-19 13:42:35 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 16578,
'downloader/request_count': 22,
'downloader/request_method_count/GET': 20,
'downloader/request_method_count/POST': 2,
'downloader/response_bytes': 193757,
'downloader/response_count': 22,
'downloader/response_status_count/200': 20,
'downloader/response_status_count/302': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2019, 3, 19, 20, 42, 35, 102554),
'item_scraped_count': 33,
'log_count/INFO': 34,
'memusage/max': 50237440,
'memusage/startup': 50233344,
'request_depth_max': 19,
'response_received_count': 20,
'scheduler/dequeued': 22,
'scheduler/dequeued/memory': 22,
'scheduler/enqueued': 22,
'scheduler/enqueued/memory': 22,
'start_time': datetime.datetime(2019, 3, 19, 20, 42, 23, 689448)}
2019-03-19 13:42:35 [scrapy.core.engine] INFO: Spider closed (finished)

404 returned

Hi @rugantio , thanks for this!
I have tried to use it now, but it is not working (screenshot of the 404 attached).

It returns a 404 as it tries to go to mbasic.facebook.com/login/pageid.
Looks like there is a problem similar to what you were mentioning in #2: I logged in with my browser, but that didn't solve it.

Am I doing something wrong or there is something to fix in the login process?
I also tried to change, in line 73 of fbcrawl.py, the href value to skip "/login/", but I get another error (screenshots attached).

cannot redirect to page

Hi,

I tried to run your code with the command
scrapy crawl fb -a email="facebook_email" -a password="facebook_password" -a page="KompasCOM" -o DUMPFILE.csv

I also tried changing the page parameter to
-a page="/KompasCOM"

but it gives an error:

INFO: Parse function called on https://mbasic.facebook.com/KompasCOM
ERROR: Spider error processing <GET https://mbasic.facebook.com/KompasCOM> (referer: https://mbasic.facebook.com/home.php?_rdr)
Traceback (most recent call last):
File "C:\ProgramData\Anaconda3\lib\site-packages\scrapy\utils\defer.py", line 102, in iter_errback
yield next(it)
File "C:\ProgramData\Anaconda3\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 30, in process_spider_output
for x in result:
File "C:\ProgramData\Anaconda3\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 339, in
return (_set_referer(r) for r in result or ())
File "C:\ProgramData\Anaconda3\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in
return (r for r in result or () if _filter(r))
File "C:\ProgramData\Anaconda3\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in
return (r for r in result or () if _filter(r))
File "D:\KERJA\Neviim\fbcrawl-master\fbcrawl\spiders\fbcrawl.py", line 87, in parse_page
temp_post = response.urljoin(post[0])
IndexError: list index out of range

How do I solve this?

EDIT

I tried the workaround you mentioned of logging in via a web browser, but it still gives me the same error. Also, I got no email about an unknown device login like you mentioned before.
Is there any other workaround I can try?

Thanks

error: crawling returns 0 results

Hi. I am a student and I am just starting out in the world of web spiders. I have a problem with the code: when I point it at the Donald Trump Facebook page, the error below is generated. On the other hand, if I give it a Facebook page that has few posts, the error does not appear, but nothing gets scraped either. Could you help me, please?

scrapy crawl fb -a email="[email protected]" -a password="--------" -a page="https://mbasic.facebook.com/DonaldTrump" -o donald.csv
2019-01-28 17:40:16 [scrapy.utils.log] INFO: Scrapy 1.5.2 started (bot: fbcrawl)
2019-01-28 17:40:16 [scrapy.utils.log] INFO: Versions: lxml 4.3.0.0, libxml2 2.9.9, cssselect 1.0.3, parsel 1.5.1, w3lib 1.20.0, Twisted 18.9.0, Python 3.6.7 (default, Oct 22 2018, 11:32:17) - [GCC 8.2.0], pyOpenSSL 19.0.0 (OpenSSL 1.1.1a 20 Nov 2018), cryptography 2.1.4, Platform Linux-4.15.0-43-generic-x86_64-with-Ubuntu-18.04-bionic
2019-01-28 17:40:16 [scrapy.crawler] INFO: Overridden settings: {'AUTOTHROTTLE_ENABLED': True, 'BOT_NAME': 'fbcrawl', 'CONCURRENT_REQUESTS': 32, 'CONCURRENT_REQUESTS_PER_DOMAIN': 16, 'COOKIES_ENABLED': False, 'DOWNLOAD_DELAY': 3, 'FEED_EXPORT_ENCODING': 'utf-8', 'FEED_EXPORT_FIELDS': ['source', 'date', 'text', 'reactions', 'likes', 'ahah', 'love', 'wow', 'sigh', 'grrr', 'comments', 'url'], 'FEED_FORMAT': 'csv', 'FEED_URI': 'volcado5.csv', 'HTTPCACHE_ENABLED': True, 'LOG_LEVEL': 'INFO', 'NEWSPIDER_MODULE': 'fbcrawl.spiders', 'SPIDER_MODULES': ['fbcrawl.spiders'], 'TELNETCONSOLE_ENABLED': False}
2019-01-28 17:40:16 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats',
'scrapy.extensions.throttle.AutoThrottle']
2019-01-28 17:40:16 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'fbcrawl.middlewares.FbcrawlDownloaderMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats',
'scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware']
2019-01-28 17:40:16 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'fbcrawl.middlewares.FbcrawlSpiderMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-01-28 17:40:16 [scrapy.middleware] INFO: Enabled item pipelines:
['fbcrawl.pipelines.FbcrawlPipeline']
2019-01-28 17:40:16 [scrapy.core.engine] INFO: Spider opened
2019-01-28 17:40:16 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-01-28 17:40:16 [fb] INFO: Spider opened: fb
2019-01-28 17:40:16 [fb] INFO: Spider opened: fb
2019-01-28 17:40:17 [fb] INFO: Parse function called on https://mbasic.facebook.com/DonaldTrump
2019-01-28 17:40:17 [scrapy.core.scraper] ERROR: Spider error processing <GET https://mbasic.facebook.com/DonaldTrump> (referer: https://mbasic.facebook.com/home.php?refsrc=https%3A%2F%2Fmbasic.facebook.com%2F&m_sess=c2VzczoxMDAwMTA3MDE5OTA1MDY6MzY6Z1BFNktaUWNwR1ZxZ2c6MjoxNTQ4NjkxNjQ1OjE1MDU1OjM5MDE6&_rdr)
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/scrapy/utils/defer.py", line 102, in iter_errback
yield next(it)
File "/usr/local/lib/python3.6/dist-packages/scrapy/spidermiddlewares/offsite.py", line 30, in process_spider_output
for x in result:
File "/home/paula/master/fbcrawl/middlewares.py", line 35, in process_spider_output
for i in result:
File "/usr/local/lib/python3.6/dist-packages/scrapy/spidermiddlewares/referer.py", line 339, in
return (_set_referer(r) for r in result or ())
File "/usr/local/lib/python3.6/dist-packages/scrapy/spidermiddlewares/urllength.py", line 37, in
return (r for r in result or () if _filter(r))
File "/usr/local/lib/python3.6/dist-packages/scrapy/spidermiddlewares/depth.py", line 58, in
return (r for r in result or () if _filter(r))
File "/home/paula/master/fbcrawl/spiders/fbcrawl.py", line 88, in parse_page
temp_post = response.urljoin(post[0])
IndexError: list index out of range
2019-01-28 17:40:17 [scrapy.core.engine] INFO: Closing spider (finished)
2019-01-28 17:40:17 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1625,
'downloader/request_count': 4,
'downloader/request_method_count/GET': 3,
'downloader/request_method_count/POST': 1,
'downloader/response_bytes': 29241,
'downloader/response_count': 4,
'downloader/response_status_count/200': 3,
'downloader/response_status_count/302': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2019, 1, 28, 16, 40, 17, 521668),
'httpcache/hit': 4,
'log_count/ERROR': 1,
'log_count/INFO': 10,
'memusage/max': 49934336,
'memusage/startup': 49934336,
'request_depth_max': 2,
'response_received_count': 3,
'scheduler/dequeued': 4,
'scheduler/dequeued/memory': 4,
'scheduler/enqueued': 4,
'scheduler/enqueued/memory': 4,
'spider_exceptions/IndexError': 1,
'start_time': datetime.datetime(2019, 1, 28, 16, 40, 16, 981032)}
2019-01-28 17:40:17 [scrapy.core.engine] INFO: Spider closed (finished)

comments crawl fail with IndexError: list index out of range

Hello

I found your project last night and installed it today. My primary interest lies in scraping comments. I ran the Trump comment crawl example, which fails. After reading related issues here, I ran the Trump fbcrawl example, which runs without any issues.

I have:
changed the Facebook interface language and tried both English and Italian;
checked whether they have sent me any e-mails about new devices etc., which they have not;
double- and triple-checked the Facebook language settings.

Command line and output from the fbcrawl Trump example, which works (I killed it with CTRL+C):

scrapy crawl fb -a email='redacted' -a password='redacted' -a page='DonaldTrump' -a date='2019-06-01' -a lang='en' -o Trump.csv
2019-06-14 22:06:18 [scrapy.utils.log] INFO: Scrapy 1.6.0 started (bot: fbcrawl)
2019-06-14 22:06:18 [scrapy.utils.log] INFO: Versions: lxml 4.3.4.0, libxml2 2.9.4, cssselect 1.0.3, parsel 1.5.1, w3lib 1.20.0, Twisted 19.2.1, Python 3.7.3 (default, Jun 14 2019, 20:59:39) - [GCC 6.3.0 20170516], pyOpenSSL 19.0.0 (Open
SSL 1.1.0j 20 Nov 2018), cryptography 2.7, Platform Linux-4.14.98+-armv6l-with-debian-9.8
2019-06-14 22:06:18 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'fbcrawl', 'DOWNLOAD_DELAY': 3, 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter', 'FEED_EXPORT_ENCODING': 'utf-8', 'FEED_EXPORT_FIELDS': ['source', 'shar
ed_from', 'date', 'text', 'reactions', 'likes', 'ahah', 'love', 'wow', 'sigh', 'grrr', 'comments', 'post_id', 'url'], 'FEED_FORMAT': 'csv', 'FEED_URI': 'Trump.csv', 'LOG_LEVEL': 'INFO', 'NEWSPIDER_MODULE': 'fbcrawl.spiders', 'SPIDER_MODU
LES': ['fbcrawl.spiders'], 'URLLENGTH_LIMIT': 99999, 'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'}
2019-06-14 22:06:18 [scrapy.extensions.telnet] INFO: Telnet Password: 24d0feccce794b1f
2019-06-14 22:06:19 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats']
2019-06-14 22:06:19 [fb] INFO: Email and password provided, will be used to log in
2019-06-14 22:06:19 [fb] INFO: Date attribute provided, fbcrawl will stop crawling at 2019-06-01
2019-06-14 22:06:19 [fb] INFO: Language attribute recognized, using "en" for the facebook interface
2019-06-14 22:06:21 [scrapy.core.engine] INFO: Spider opened
2019-06-14 22:06:21 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-06-14 22:06:21 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2019-06-14 22:06:30 [fb] INFO: Scraping facebook page https://mbasic.facebook.com/DonaldTrump
2019-06-14 22:06:34 [fb] INFO: Parsing post n = 1, post_date = 2019-06-14 21:48:10
2019-06-14 22:06:34 [fb] INFO: Parsing post n = 2, post_date = 2019-06-14 20:19:31
2019-06-14 22:06:34 [fb] INFO: Parsing post n = 3, post_date = 2019-06-14 17:57:49
2019-06-14 22:06:34 [fb] INFO: Parsing post n = 4, post_date = 2019-06-14 17:23:31
2019-06-14 22:06:35 [fb] INFO: Parsing post n = 5, post_date = 2019-06-14 15:24:22
2019-06-14 22:06:35 [fb] INFO: First page scraped, clicking on "more"! new_page = https://mbasic.facebook.com/DonaldTrump?sectionLoadingID=m_timeline_loading_div_1561964399_0_36_timeline_unit%3A1%3A00000000001560522262%3A04611686018427387904%3A09223372036854775803%3A04611686018427387904&unit_cursor=timeline_unit%3A1%3A00000000001560522262%3A04611686018427387904%3A09223372036854775803%3A04611686018427387904&timeend=1561964399&timestart=0&tm=AQDx6EGIN3RHRB9r&refid=17
2019-06-14 22:06:57 [fb] INFO: Parsing post n = 6, post_date = 2019-06-14 14:42:39
2019-06-14 22:06:57 [fb] INFO: Parsing post n = 7, post_date = 2019-06-14 13:11:33
2019-06-14 22:06:57 [fb] INFO: Parsing post n = 8, post_date = 2019-06-14 02:34:00
2019-06-14 22:06:57 [fb] INFO: Parsing post n = 9, post_date = 2019-06-14 01:23:37
2019-06-14 22:06:57 [fb] INFO: Parsing post n = 10, post_date = 2019-06-13 21:43:25
2019-06-14 22:06:57 [fb] INFO: Page scraped, clicking on "more"! new_page = https://mbasic.facebook.com/DonaldTrump?sectionLoadingID=m_timeline_loading_div_1561964399_0_36_timeline_unit%3A1%3A00000000001560458605%3A04611686018427387904%3A09223372036854775798%3A04611686018427387904&unit_cursor=timeline_unit%3A1%3A00000000001560458605%3A04611686018427387904%3A09223372036854775798%3A04611686018427387904&timeend=1561964399&timestart=0&tm=AQDx6EGIN3RHRB9r&refid=17
2019-06-14 22:07:21 [scrapy.extensions.logstats] INFO: Crawled 15 pages (at 15 pages/min), scraped 5 items (at 5 items/min)
2019-06-14 22:07:37 [fb] INFO: Parsing post n = 11, post_date = 2019-06-13 21:17:53
2019-06-14 22:07:37 [fb] INFO: Parsing post n = 12, post_date = 2019-06-13 19:52:54
2019-06-14 22:07:37 [fb] INFO: Parsing post n = 13, post_date = 2019-06-13 18:29:16
2019-06-14 22:07:37 [fb] INFO: Parsing post n = 14, post_date = 2019-06-13 16:52:00
2019-06-14 22:07:37 [fb] INFO: Parsing post n = 15, post_date = 2019-06-13 15:08:00
2019-06-14 22:07:37 [fb] INFO: Page scraped, clicking on "more"! new_page = https://mbasic.facebook.com/DonaldTrump?sectionLoadingID=m_timeline_loading_div_1561964399_0_36_timeline_unit%3A1%3A00000000001560434880%3A04611686018427387904%3A09223372036854775793%3A04611686018427387904&unit_cursor=timeline_unit%3A1%3A00000000001560434880%3A04611686018427387904%3A09223372036854775793%3A04611686018427387904&timeend=1561964399&timestart=0&tm=AQDx6EGIN3RHRB9r&refid=17
2019-06-14 22:08:21 [fb] INFO: Parsing post n = 16, post_date = 2019-06-13 14:20:06
2019-06-14 22:08:21 [scrapy.extensions.logstats] INFO: Crawled 31 pages (at 16 pages/min), scraped 10 items (at 5 items/min)
2019-06-14 22:08:21 [fb] INFO: Parsing post n = 17, post_date = 2019-06-13 00:18:00
2019-06-14 22:08:21 [fb] INFO: Parsing post n = 18, post_date = 2019-06-12 23:38:00
2019-06-14 22:08:21 [fb] INFO: Parsing post n = 19, post_date = 2019-06-12 23:02:00
2019-06-14 22:08:21 [fb] INFO: Parsing post n = 20, post_date = 2019-06-12 22:23:13
2019-06-14 22:08:21 [fb] INFO: Page scraped, clicking on "more"! new_page = https://mbasic.facebook.com/DonaldTrump?sectionLoadingID=m_timeline_loading_div_1561964399_0_36_timeline_unit%3A1%3A00000000001560374593%3A04611686018427387904%3A09223372036854775788%3A04611686018427387904&unit_cursor=timeline_unit%3A1%3A00000000001560374593%3A04611686018427387904%3A09223372036854775788%3A04611686018427387904&timeend=1561964399&timestart=0&tm=AQDx6EGIN3RHRB9r&refid=17
^C2019-06-14 22:08:55 [scrapy.crawler] INFO: Received SIGINT, shutting down gracefully. Send again to force
2019-06-14 22:08:55 [scrapy.core.engine] INFO: Closing spider (shutdown)
2019-06-14 22:09:01 [fb] INFO: Parsing post n = 21, post_date = 2019-06-12 21:04:00
2019-06-14 22:09:01 [fb] INFO: Parsing post n = 22, post_date = 2019-06-12 19:15:32
2019-06-14 22:09:01 [fb] INFO: Parsing post n = 23, post_date = 2019-06-12 17:42:59
2019-06-14 22:09:01 [fb] INFO: Parsing post n = 24, post_date = 2019-06-12 15:08:00
2019-06-14 22:09:01 [fb] INFO: Parsing post n = 25, post_date = 2019-06-12 14:22:11
2019-06-14 22:09:01 [fb] INFO: Page scraped, clicking on "more"! new_page = https://mbasic.facebook.com/DonaldTrump?sectionLoadingID=m_timeline_loading_div_1561964399_0_36_timeline_unit%3A1%3A00000000001560345731%3A04611686018427387904%3A09223372036854775783%3A04611686018427387904&unit_cursor=timeline_unit%3A1%3A00000000001560345731%3A04611686018427387904%3A09223372036854775783%3A04611686018427387904&timeend=1561964399&timestart=0&tm=AQDx6EGIN3RHRB9r&refid=17
2019-06-14 22:09:14 [scrapy.extensions.feedexport] INFO: Stored csv feed (19 items) in: Trump.csv
2019-06-14 22:09:14 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 102843,
'downloader/request_count': 47,
'downloader/request_method_count/GET': 46,
'downloader/request_method_count/POST': 1,
'downloader/response_bytes': 541012,
'downloader/response_count': 47,
'downloader/response_status_count/200': 46,
'downloader/response_status_count/302': 1,
'finish_reason': 'shutdown',
'finish_time': datetime.datetime(2019, 6, 14, 21, 9, 14, 758367),
'item_scraped_count': 19,
'log_count/INFO': 44,
'memusage/max': 58400768,
'memusage/startup': 37265408,
'request_depth_max': 7,
'response_received_count': 46,
'scheduler/dequeued': 47,
'scheduler/dequeued/memory': 47,
'scheduler/enqueued': 54,
'scheduler/enqueued/memory': 54,
'start_time': datetime.datetime(2019, 6, 14, 21, 6, 21, 254303)}
2019-06-14 22:09:14 [scrapy.core.engine] INFO: Spider closed (shutdown)

Command line and output from comments crawl Trump which does not work

scrapy crawl comments -a email='redacted' -a password='redacted' -a page='https://mbasic.facebook.com/story.php?story_fbid=10162169751605725&id=153080620724' -a lang='en' -o trump_comments.csv
2019-06-14 22:18:52 [scrapy.utils.log] INFO: Scrapy 1.6.0 started (bot: fbcrawl)
2019-06-14 22:18:52 [scrapy.utils.log] INFO: Versions: lxml 4.3.4.0, libxml2 2.9.4, cssselect 1.0.3, parsel 1.5.1, w3lib 1.20.0, Twisted 19.2.1, Python 3.7.3 (default, Jun 14 2019, 20:59:39) - [GCC 6.3.0 20170516], pyOpenSSL 19.0.0 (OpenSSL 1.1.0j 20 Nov 2018), cryptography 2.7, Platform Linux-4.14.98+-armv6l-with-debian-9.8
2019-06-14 22:18:52 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'fbcrawl', 'CONCURRENT_REQUESTS': 1, 'DOWNLOAD_DELAY': 3, 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter', 'FEED_EXPORT_ENCODING': 'utf-8', 'FEED_EXPORT_FIELDS': ['source', 'reply_to', 'date', 'reactions', 'text', 'source_url', 'url'], 'FEED_FORMAT': 'csv', 'FEED_URI': 'trump_comments.csv', 'LOG_LEVEL': 'INFO', 'NEWSPIDER_MODULE': 'fbcrawl.spiders', 'SPIDER_MODULES': ['fbcrawl.spiders'], 'URLLENGTH_LIMIT': 99999, 'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'}
2019-06-14 22:18:52 [scrapy.extensions.telnet] INFO: Telnet Password: da3a63c0631aefa3
2019-06-14 22:18:53 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats']
2019-06-14 22:18:53 [comments] INFO: Email and password provided, will be used to log in
2019-06-14 22:18:53 [comments] INFO: Date attribute not provided, scraping date set to 2004-02-04 (fb launch date)
2019-06-14 22:18:53 [comments] INFO: Language attribute recognized, using "en" for the facebook interface
2019-06-14 22:18:55 [scrapy.core.engine] INFO: Spider opened
2019-06-14 22:18:55 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-06-14 22:18:55 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2019-06-14 22:19:03 [comments] INFO: Scraping facebook page https://mbasic.facebook.com/story.php?story_fbid=10162169751605725&id=153080620724
2019-06-14 22:19:07 [comments] INFO: Parsing post n = 1, post_date = 2019-02-16 19:00:01
2019-06-14 22:19:07 [scrapy.core.scraper] ERROR: Spider error processing <GET https://mbasic.facebook.com/story.php?story_fbid=10162169751605725&id=153080620724> (referer: https://mbasic.facebook.com/home.php?refsrc=https%3A%2F%2Fmbasic.facebook.com%2F&_rdr)
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/scrapy/utils/defer.py", line 102, in iter_errback
yield next(it)
File "/usr/local/lib/python3.7/site-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
for x in result:
File "/usr/local/lib/python3.7/site-packages/scrapy/spidermiddlewares/referer.py", line 339, in
return (_set_referer(r) for r in result or ())
File "/usr/local/lib/python3.7/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in
return (r for r in result or () if _filter(r))
File "/usr/local/lib/python3.7/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in
return (r for r in result or () if _filter(r))
File "/home/pi/fbcrawl/fbcrawl/spiders/comments.py", line 62, in parse_page
temp_post = response.urljoin(post[0])
IndexError: list index out of range
2019-06-14 22:19:08 [scrapy.core.engine] INFO: Closing spider (finished)
2019-06-14 22:19:08 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 2345,
'downloader/request_count': 4,
'downloader/request_method_count/GET': 3,
'downloader/request_method_count/POST': 1,
'downloader/response_bytes': 30374,
'downloader/response_count': 4,
'downloader/response_status_count/200': 3,
'downloader/response_status_count/302': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2019, 6, 14, 21, 19, 8, 83522),
'log_count/ERROR': 1,
'log_count/INFO': 11,
'memusage/max': 37359616,
'memusage/startup': 37359616,
'request_depth_max': 2,
'response_received_count': 3,
'scheduler/dequeued': 4,
'scheduler/dequeued/memory': 4,
'scheduler/enqueued': 4,
'scheduler/enqueued/memory': 4,
'spider_exceptions/IndexError': 1,
'start_time': datetime.datetime(2019, 6, 14, 21, 18, 55, 248622)}
2019-06-14 22:19:08 [scrapy.core.engine] INFO: Spider closed (finished)

Any ideas?
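
Not a fix for the root cause, but the traceback shows parse_page in comments.py indexing an empty XPath result at line 62. A minimal defensive sketch (the XPath here is an assumption; the point is the emptiness check before post[0]):

# inside parse_page of comments.py; 'post' mirrors the traceback
post = response.xpath("//div[contains(@data-ft,'top_level_post_id')]//a/@href").extract()
if not post:
    self.logger.warning('No post link found on %s, skipping', response.url)
    return
temp_post = response.urljoin(post[0])

With a guard like this the spider would log and skip pages whose layout it doesn't recognize instead of dying on them.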

Comment scraper empty csv file

Hello!
First of all, thank you for your work. I'm able to scrape posts, but I can't scrape comments. I read that there are several related issues, but I'd like to be sure that my situation is the same as the other users'.
At the end of this message you can find the log.
I noticed this error:
ValueError: Error with output processor: field='date' value=['4 mar'] error='JSONDecodeError: Extra data: line 1 column 3 (char 2)'
2019-05-07 21:51:17 [scrapy.core.engine] INFO: Closing spider (finished)
Is it possible that they changed the format of the JSON data?
Thank you for your attention

2019-05-07 21:50:55 [scrapy.utils.log] INFO: Scrapy 1.6.0 started (bot: fbcrawl)
2019-05-07 21:50:55 [scrapy.utils.log] INFO: Versions: lxml 3.5.0.0, libxml2 2.9.3, cssselect 1.0.3, parsel 1.5.1, w3lib 1.20.0, Twisted 19.2.0, Python 3.5.2 (default, Nov 12 2018, 13:43:14) - [GCC 5.4.0 20160609], pyOpenSSL 19.0.0 (OpenSSL 1.1.1b 26 Feb 2019), cryptography 2.6.1, Platform Linux-4.4.0-141-generic-x86_64-with-Ubuntu-16.04-xenial
2019-05-07 21:50:55 [scrapy.crawler] INFO: Overridden settings: {'DOWNLOAD_DELAY': 3, 'FEED_EXPORT_FIELDS': ['source', 'reply_to', 'date', 'reactions', 'text', 'source_url', 'url'], 'USER_AGENT': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1', 'LOG_LEVEL': 'INFO', 'FEED_FORMAT': 'csv', 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter', 'BOT_NAME': 'fbcrawl', 'SPIDER_MODULES': ['fbcrawl.spiders'], 'FEED_EXPORT_ENCODING': 'utf-8', 'FEED_URI': 'DUMPFILE.csv', 'NEWSPIDER_MODULE': 'fbcrawl.spiders', 'CONCURRENT_REQUESTS': 1}
2019-05-07 21:50:55 [scrapy.extensions.telnet] INFO: Telnet Password: e3e13270b811ad1b
2019-05-07 21:50:55 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.corestats.CoreStats']
2019-05-07 21:50:55 [comments] INFO: Email and password provided, will be used to log in
2019-05-07 21:50:55 [comments] INFO: Date attribute not provided, scraping date set to 2004-02-04 (fb launch date)
2019-05-07 21:50:55 [comments] INFO: Language attribute not provided, fbcrawl will try to guess it from the fb interface
2019-05-07 21:50:55 [comments] INFO: To specify, add the lang parameter: scrapy fb -a lang="LANGUAGE"
2019-05-07 21:50:55 [comments] INFO: Currently choices for "LANGUAGE" are: "en", "es", "fr", "it", "pt"
2019-05-07 21:50:55 [scrapy.core.engine] INFO: Spider opened
2019-05-07 21:50:55 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-05-07 21:50:55 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2019-05-07 21:51:03 [comments] INFO: Going through the "save-device" checkpoint
2019-05-07 21:51:11 [comments] INFO: Language recognized: lang="it"
2019-05-07 21:51:11 [comments] INFO: Scraping facebook page https://mbasic.facebook.com/DonaldTrump/posts/10162238538600725
2019-05-07 21:51:14 [comments] INFO: 1 nested comment @ page https://mbasic.facebook.com/comment/replies/?ctoken=10162238538600725_10162238553370725&count=709&curr&pc=1&ft_ent_identifier=10162238538600725&gfid=AQDbVKdhk7pwAFpu&__tn__=R
2019-05-07 21:51:17 [scrapy.core.scraper] ERROR: Spider error processing <GET https://mbasic.facebook.com/comment/replies/?ctoken=10162238538600725_10162238553370725&count=709&curr&pc=1&ft_ent_identifier=10162238538600725&gfid=AQDbVKdhk7pwAFpu&__tn__=R> (referer: https://mbasic.facebook.com/DonaldTrump/posts/10162238538600725)
Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/scrapy/loader/init.py", line 125, in get_output_value
return proc(self._values[field_name])
File "/home/jacopo/Python_Projects/FB_scraper/fbcrawl/fbcrawl/items.py", line 87, in parse_date
d = json.loads(date[0]) #nested dict of features
File "/usr/lib/python3.5/json/init.py", line 319, in loads
return _default_decoder.decode(s)
File "/usr/lib/python3.5/json/decoder.py", line 342, in decode
raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 1 column 3 (char 2)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/scrapy/utils/defer.py", line 102, in iter_errback
yield next(it)
File "/usr/local/lib/python3.5/dist-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
for x in result:
File "/usr/local/lib/python3.5/dist-packages/scrapy/spidermiddlewares/referer.py", line 339, in
return (_set_referer(r) for r in result or ())
File "/usr/local/lib/python3.5/dist-packages/scrapy/spidermiddlewares/urllength.py", line 37, in
return (r for r in result or () if _filter(r))
File "/usr/local/lib/python3.5/dist-packages/scrapy/spidermiddlewares/depth.py", line 58, in
return (r for r in result or () if _filter(r))
File "/home/jacopo/Python_Projects/FB_scraper/fbcrawl/fbcrawl/spiders/comments.py", line 103, in parse_reply
yield new.load_item()
File "/usr/local/lib/python3.5/dist-packages/scrapy/loader/init.py", line 115, in load_item
value = self.get_output_value(field_name)
File "/usr/local/lib/python3.5/dist-packages/scrapy/loader/init.py", line 128, in get_output_value
(field_name, self._values[field_name], type(e).name, str(e)))
ValueError: Error with output processor: field='date' value=['4 mar'] error='JSONDecodeError: Extra data: line 1 column 3 (char 2)'
2019-05-07 21:51:17 [scrapy.core.engine] INFO: Closing spider (finished)
2019-05-07 21:51:17 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 4540,
'downloader/request_count': 7,
'downloader/request_method_count/GET': 5,
'downloader/request_method_count/POST': 2,
'downloader/response_bytes': 53673,
'downloader/response_count': 7,
'downloader/response_status_count/200': 5,
'downloader/response_status_count/302': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2019, 5, 7, 19, 51, 17, 840331),
'log_count/ERROR': 1,
'log_count/INFO': 15,
'memusage/max': 55971840,
'memusage/startup': 55971840,
'request_depth_max': 4,
'response_received_count': 5,
'scheduler/dequeued': 7,
'scheduler/dequeued/memory': 7,
'scheduler/enqueued': 7,
'scheduler/enqueued/memory': 7,
'spider_exceptions/ValueError': 1,
'start_time': datetime.datetime(2019, 5, 7, 19, 50, 55, 271390)}
2019-05-07 21:51:17 [scrapy.core.engine] INFO: Spider closed (finished)
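For what it's worth, the traceback shows json.loads in items.py being fed the literal string '4 mar' (a relative date) instead of the JSON blob it expects, so the format probably did change for nested comments. A tolerant sketch of parse_date that falls back on the raw string (the surrounding logic is assumed):

import json

def parse_date(date):
    raw = date[0]
    try:
        d = json.loads(raw)  # nested dict of features, as in the original
    except json.JSONDecodeError:
        return raw           # bare relative date such as '4 mar'
    return d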

Add crawling for events?

Is there an option to crawl events from Facebook?
If not, would it be easy to implement? I could assist if there is interest in that.
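
Not implemented as far as I can tell. If there is interest, a rough starting point could look like the sketch below; everything in it is an assumption (the /events path on mbasic and the '/events/' link pattern are guesses, and parse_event would still need to be written):

import scrapy

def parse_events_page(self, response):
    # assumed: a page's events live at mbasic.facebook.com/<page>/events
    # and each event link contains '/events/'
    for href in response.xpath("//a[contains(@href,'/events/')]/@href").extract():
        yield scrapy.Request(response.urljoin(href), callback=self.parse_event)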

scraping everything but video posts

I've been trying to figure out why the script only extracts posts that are images or text (even 360° posts worked just fine); when the posts are videos, it doesn't download them along with the rest of the data. At first I thought the html was different or that the attribute "top_level_post_id" didn't apply to them, but I looked into the mbasic fb feed and that doesn't seem to be the case.

Any idea? Thanks!
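
No answer yet, but a quick way to check whether video posts even reach the parser is to log every story container and inspect its data-ft attribute for top_level_post_id. A diagnostic sketch to drop into parse_page (the data-ft XPath is assumed):

# list every story container so video posts that the normal
# XPath misses become visible in the log
for story in response.xpath('//div[@data-ft]'):
    self.logger.debug('data-ft: %s', story.xpath('./@data-ft').extract_first())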

Comment crawler doesn't work

scrapy crawl comments -a email="[email protected]" -a password="password" -a page="DonaldTrump/story.php?story_fbid=2123087281087706&id=779444322118682" -a lang="en" -o Trump.csv
It is not working!
It gives the ERROR below:
2019-02-12 07:21:42 [scrapy.utils.log] INFO: Scrapy 1.6.0 started (bot: fbcrawl)
2019-02-12 07:21:42 [scrapy.utils.log] INFO: Versions: lxml 4.2.1.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.5.1, w3lib 1.20.0, Twisted 18.9.0, Python 3.6.5 |Anaconda, Inc.| (default, Mar 29 2018, 13:32:41) [MSC v.1900 64 bit (AMD64)], pyOpenSSL 18.0.0 (OpenSSL 1.0.2o 27 Mar 2018), cryptography 2.2.2, Platform Windows-10-10.0.17134-SP0
2019-02-12 07:21:42 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'fbcrawl', 'FEED_EXPORT_ENCODING': 'utf-8', 'FEED_FORMAT': 'csv', 'FEED_URI': 'chinese.csv', 'LOG_LEVEL': 'INFO', 'NEWSPIDER_MODULE': 'fbcrawl.spiders', 'SPIDER_MODULES': ['fbcrawl.spiders'], 'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'}
2019-02-12 07:21:42 [scrapy.extensions.telnet] INFO: Telnet Password: 3895d32ba798cd1e
2019-02-12 07:21:42 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats']
2019-02-12 07:21:43 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-02-12 07:21:43 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-02-12 07:21:43 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-02-12 07:21:43 [scrapy.core.engine] INFO: Spider opened
2019-02-12 07:21:43 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-02-12 07:21:43 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2019-02-12 07:21:47 [comments] INFO: Parse function called on https://mbasic.facebook.com/DonaldTrump/story.php?story_fbid=2123087281087706&id=779444322118682
2019-02-12 07:21:48 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <404 https://mbasic.facebook.com/DonaldTrump/story.php?story_fbid=2123087281087706&id=779444322118682>: HTTP status code is not handled or not allowed
2019-02-12 07:21:48 [scrapy.core.engine] INFO: Closing spider (finished)
2019-02-12 07:21:48 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 4038,
'downloader/request_count': 6,
'downloader/request_method_count/GET': 4,
'downloader/request_method_count/POST': 2,
'downloader/response_bytes': 40779,
'downloader/response_count': 6,
'downloader/response_status_count/200': 3,
'downloader/response_status_count/302': 2,
'downloader/response_status_count/404': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2019, 2, 12, 1, 21, 48, 207776),
'httperror/response_ignored_count': 1,
'httperror/response_ignored_status_count/404': 1,
'log_count/INFO': 11,
'request_depth_max': 3,
'response_received_count': 4,
'scheduler/dequeued': 6,
'scheduler/dequeued/memory': 6,
'scheduler/enqueued': 6,
'scheduler/enqueued/memory': 6,
'start_time': datetime.datetime(2019, 2, 12, 1, 21, 43, 376171)}
2019-02-12 07:21:48 [scrapy.core.engine] INFO: Spider closed (finished)
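
The 404 suggests the page argument got glued onto the page name: the spider requested https://mbasic.facebook.com/DonaldTrump/story.php?..., which doesn't exist. Passing the full story.php URL on its own, as in the other comments examples, should be closer to what the spider expects (credentials and output file are placeholders):

scrapy crawl comments -a email="[email protected]" -a password="password" -a page="https://mbasic.facebook.com/story.php?story_fbid=2123087281087706&id=779444322118682" -a lang="en" -o Trump.csv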

Please help. KeyError: 'flag'

Hello again
I have a problem: there are Facebook profiles for which I get the error shown below.
When it comes to scraping Donald Trump's Facebook page, for example, I have no problems and it works perfectly.

Thank you.

scrapy crawl fb -a email="[email protected]" -a password="-------" -a page="fiorela.alvaradoleon" -a year="2019" -a lang="en" -o 2.csv
2019-02-27 12:22:23 [scrapy.utils.log] INFO: Scrapy 1.5.2 started (bot: fbcrawl)
2019-02-27 12:22:23 [scrapy.utils.log] INFO: Versions: lxml 4.3.0.0, libxml2 2.9.9, cssselect 1.0.3, parsel 1.5.1, w3lib 1.20.0, Twisted 18.9.0, Python 3.6.7 (default, Oct 22 2018, 11:32:17) - [GCC 8.2.0], pyOpenSSL 19.0.0 (OpenSSL 1.1.1a 20 Nov 2018), cryptography 2.1.4, Platform Linux-4.15.0-45-generic-x86_64-with-Ubuntu-18.04-bionic
2019-02-27 12:22:23 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'fbcrawl', 'FEED_EXPORT_ENCODING': 'utf-8', 'FEED_EXPORT_FIELDS': ['source', 'shared_from', 'date', 'text', 'reactions', 'likes', 'ahah', 'love', 'wow', 'sigh', 'grrr', 'comments', 'url'], 'FEED_FORMAT': 'csv', 'FEED_URI': '2.csv', 'LOG_LEVEL': 'INFO', 'NEWSPIDER_MODULE': 'fbcrawl.spiders', 'SPIDER_MODULES': ['fbcrawl.spiders'], 'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'}
2019-02-27 12:22:23 [scrapy.extensions.telnet] INFO: Telnet Password: ---
2019-02-27 12:22:23 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats']
2019-02-27 12:22:23 [fb] INFO: Email and password provided, using these as credentials
2019-02-27 12:22:23 [fb] INFO: Page attribute provided, scraping "fiorela.alvaradoleon"
2019-02-27 12:22:23 [fb] INFO: Year attribute found, set scraping back to 2019
2019-02-27 12:22:23 [fb] INFO: Language attribute recognized, using "en" for the facebook interface
2019-02-27 12:22:24 [scrapy.core.engine] INFO: Spider opened
2019-02-27 12:22:24 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-02-27 12:22:26 [fb] INFO: Got stuck in "save-device" checkpoint
2019-02-27 12:22:26 [fb] INFO: I will now try to redirect to the correct page
2019-02-27 12:22:27 [fb] INFO: Scraping facebook page https://mbasic.facebook.com/fiorela.alvaradoleon
2019-02-27 12:22:28 [scrapy.core.scraper] ERROR: Spider error processing <GET https://mbasic.facebook.com/fiorela.alvaradoleon> (referer: https://mbasic.facebook.com/home.php?_rdr)
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/scrapy/utils/defer.py", line 102, in iter_errback
yield next(it)
File "/usr/local/lib/python3.6/dist-packages/scrapy/spidermiddlewares/offsite.py", line 30, in process_spider_output
for x in result:
File "/usr/local/lib/python3.6/dist-packages/scrapy/spidermiddlewares/referer.py", line 339, in
return (_set_referer(r) for r in result or ())
File "/usr/local/lib/python3.6/dist-packages/scrapy/spidermiddlewares/urllength.py", line 37, in
return (r for r in result or () if _filter(r))
File "/usr/local/lib/python3.6/dist-packages/scrapy/spidermiddlewares/depth.py", line 58, in
return (r for r in result or () if _filter(r))
File "/fbcrawl-master/fbcrawl/spiders/fbcrawl.py", line 158, in parse_page
if response.meta['flag'] == self.k and self.k >= self.year:
KeyError: 'flag'

2019-02-27 12:22:28 [scrapy.core.engine] INFO: Closing spider (finished)
2019-02-27 12:22:28 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 3869,
'downloader/request_count': 6,
'downloader/request_method_count/GET': 4,
'downloader/request_method_count/POST': 2,
'downloader/response_bytes': 30486,
'downloader/response_count': 6,
'downloader/response_status_count/200': 4,
'downloader/response_status_count/302': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2019, 2, 27, 11, 22, 28, 520792),
'log_count/ERROR': 1,
'log_count/INFO': 12,
'memusage/max': 50511872,
'memusage/startup': 50511872,
'request_depth_max': 3,
'response_received_count': 4,
'scheduler/dequeued': 6,
'scheduler/dequeued/memory': 6,
'scheduler/enqueued': 6,
'scheduler/enqueued/memory': 6,
'spider_exceptions/KeyError': 1,
'start_time': datetime.datetime(2019, 2, 27, 11, 22, 24, 170753)}
2019-02-27 12:22:28 [scrapy.core.engine] INFO: Spider closed (finished)
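
A minimal workaround sketch for the failing check: read the meta key with a default so that requests arriving without 'flag' (for instance right after the "save-device" redirect) don't raise. self.k and self.year are taken from the traceback; the rest of parse_page is assumed to continue unchanged:

# inside parse_page of fbcrawl.py
flag = response.meta.get('flag', self.k)  # default instead of KeyError
if flag == self.k and self.k >= self.year:
    pass  # the original year-handling logic continues here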

Crawl posts image as well?

Is it possible to include images as well?
I assume it would be added in the fbcrawl.py parse_post function, but I am not sure about the html hierarchy path or how the result should be stored in the csv. If someone can point out the likely path, I can write the required code and submit a PR for this.
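
In the meantime, a rough guess at what such an addition could look like; the XPath is an assumption against mbasic's markup, not verified:

def extract_images(response):
    # hypothetical helper for parse_post: returns absolute urls of the
    # images attached to a post; the data-ft container is a guess
    srcs = response.xpath("//div[@data-ft]//img/@src").extract()
    return [response.urljoin(s) for s in srcs]

The resulting list would also need an 'images' field in items.py and in FEED_EXPORT_FIELDS in settings.py to show up in the csv.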

unexpected end + traceback issue

Hi,

Thanks for your work on this project. I am using your scraper for a project where I am trying to scrape posts from a page that goes back to 2009. It's a lot of posts, and I understand that this might be causing some trouble. I am facing two issues I was hoping you could help with:

  1. The crawl ends unexpectedly, and I am not sure why. It does this after scraping a couple of months of posts; usually it never reaches 2017 before quitting with the following message:

2019-04-27 22:16:00 [scrapy.core.engine] INFO: Closing spider (finished)
2019-04-27 22:16:00 [scrapy.extensions.feedexport] INFO: Stored csv feed (417 items) in: jobbik3.csv
2019-04-27 22:16:00 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 2295668,
 'downloader/request_count': 981,
 'downloader/request_method_count/GET': 979,
 'downloader/request_method_count/POST': 2,
 'downloader/response_bytes': 12077130,
 'downloader/response_count': 981,
 'downloader/response_status_count/200': 979,
 'downloader/response_status_count/302': 2,
 'dupefilter/filtered': 40,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2019, 4, 27, 20, 16, 0, 363479),
 'item_scraped_count': 417,
 'log_count/DEBUG': 1438,
 'log_count/ERROR': 8,
 'log_count/INFO': 675,
 'request_depth_max': 98,
 'response_received_count': 979,
 'scheduler/dequeued': 981,
 'scheduler/dequeued/memory': 981,
 'scheduler/enqueued': 981,
 'scheduler/enqueued/memory': 981,
 'spider_exceptions/IndexError': 8,
 'start_time': datetime.datetime(2019, 4, 27, 18, 8, 37, 318807)}
2019-04-27 22:16:00 [scrapy.core.engine] INFO: Spider closed (finished)

There is no obvious error message, and it looks as though the crawl has finished, whereas in fact it has not. I was able to circumvent this error by using an adapted version of your script from here.

  2. The second issue is a traceback that occurs during the process. The script is not interrupted, but I feel this should still be handled:

2019-04-27 22:12:02 [scrapy.core.scraper] ERROR: Spider error processing <GET https://mbasic.facebook.com/ufi/reaction/profile/browser/?ft_ent_identifier=10156090261316405&refid=17&_ft_=top_level_post_id.10156090262701405%3Atl_objid.10156090262701405%3Apage_id.287770891404%3Aphoto_attachments_list.%5B10156090262436405%2C10156090262491405%2C10156090263331405%2C10156090261941405%5D%3Aphoto_id.10156090262436405%3Astory_location.4%3Astory_attachment_style.new_album%3Apage_insights.%7B%22287770891404%22%3A%7B%22role%22%3A1%2C%22page_id%22%3A287770891404%2C%22post_context%22%3A%7B%22story_fbid%22%3A%5B10156090262436405%2C10156090261316405%5D%2C%22publish_time%22%3A1524336339%2C%22story_name%22%3A%22EntPhotoNodeBasedEdgeStory%22%2C%22object_fbtype%22%3A22%7D%2C%22actor_id%22%3A287770891404%2C%22psn%22%3A%22EntPhotoNodeBasedEdgeStory%22%2C%22sl%22%3A4%2C%22dm%22%3A%7B%22isShare%22%3A0%2C%22originalPostOwnerID%22%3A0%7D%2C%22targets%22%3A%5B%7B%22page_id%22%3A287770891404%2C%22actor_id%22%3A287770891404%2C%22role%22%3A1%2C%22post_id%22%3A10156090262436405%2C%22share_id%22%3A0%7D%2C%7B%22page_id%22%3A287770891404%2C%22actor_id%22%3A287770891404%2C%22role%22%3A1%2C%22post_id%22%3A10156090261316405%2C%22share_id%22%3A0%7D%5D%7D%7D%3Athid.287770891404%3A306061129499414%3A43%3A1514793600%3A1546329599%3A4304607315176871197&__tn__=%2AW-R#footer_action_list> (referer: https://mbasic.facebook.com/JobbikMagyarorszagertMozgalom?sectionLoadingID=m_timeline_loading_div_1546329599_1514793600_8_timeline_unit%3A1%3A00000000001524508080%3A04611686018427387904%3A09223372036854775742%3A04611686018427387904&unit_cursor=timeline_unit%3A1%3A00000000001524508080%3A04611686018427387904%3A09223372036854775742%3A04611686018427387904&timeend=1546329599&timestart=1514793600&tm=AQCM-FSLJ77YlQat&refid=17)
Traceback (most recent call last):
  File "c:\program files\python\python37\lib\site-packages\scrapy\utils\defer.py", line 102, in iter_errback
    yield next(it)
  File "c:\program files\python\python37\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 29, in process_spider_output
    for x in result:
  File "c:\program files\python\python37\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 339, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "c:\program files\python\python37\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "c:\program files\python\python37\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "C:\Users\endre\Documents\GitHub\fbcrawl\fbcrawl\spiders\fbcrawl.py", line 221, in parse_post
    reactions = response.urljoin(reactions[0].extract())
  File "c:\program files\python\python37\lib\site-packages\parsel\selector.py", line 61, in __getitem__
    o = super(SelectorList, self).__getitem__(pos)
IndexError: list index out of range

I have changed the language of facebook to Italian. The page I am trying to scrape posts from is in Hungarian. Maybe that's an issue?

Otherwise I am using the following settings. Note that the relatively long delay is there to avoid fb blocking my account; I also did not want to overdo it with concurrent requests, given that the point is to go back 10 years on this page.

# -*- coding: utf-8 -*-

# Scrapy settings for fbcrawl project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'fbcrawl'

SPIDER_MODULES = ['fbcrawl.spiders']
NEWSPIDER_MODULE = 'fbcrawl.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 3

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 5
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 1
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'fbcrawl.middlewares.FbcrawlSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'fbcrawl.middlewares.FbcrawlDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    'fbcrawl.pipelines.FbcrawlPipeline': 300,
#}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
#FEED_EXPORT_FIELDS = ["source", "date", "text", "reactions","likes","ahah","love","wow","sigh","grrr","comments","url"] # specifies the order of the column to export as CSV
FEED_EXPORT_ENCODING = 'utf-8'
DUPEFILTER_DEBUG = True
#LOG_LEVEL = 'INFO'
LOG_LEVEL = 'DEBUG'
URLLENGTH_LIMIT = 999999999999

Thanks for taking a look!
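
On the second issue: a guard around the line that raises (fbcrawl.py line 221 in the traceback) would at least keep the crawl going when a post exposes no reactions link. A sketch, with the XPath as an assumption:

# inside parse_post of fbcrawl.py
reactions = response.xpath("//a[contains(@href,'reaction/profile')]/@href")
if reactions:
    reactions = response.urljoin(reactions[0].extract())
else:
    reactions = None  # some posts (e.g. videos) expose no reactions link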

Can not login now

Thank you for providing such a useful crawler.

Last week it worked well, but now it doesn't.

I logged response.text to see what was happening, and it says my username or password is invalid,
but I can log in with the same account and password through a browser.

Any suggestions?
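
One way to narrow it down is to dump whatever error box facebook renders after the login POST. A debugging sketch for the spider's post-login callback; the login_error id is a guess, not a verified selector:

error = response.xpath("//div[@id='login_error']//text()").extract_first()
if error:
    self.logger.error('Login failed with message: %s', error)
else:
    self.logger.info('No explicit error box on %s', response.url)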

Getting banned after crawling

Excuse me, the issue is not about the crawler itself, but about how to use it. I am creating new facebook accounts, and after a couple of crawls FB starts showing "We've recently noticed unusual activity from your account....".
I am out of phone numbers and don't want to send a photo of my ID. Is there a way to use this crawler without getting banned? I tried adding some DOWNLOAD_DELAY to settings.py without success. I also tried to disable cookies with COOKIES_ENABLED = False, but the crawler does not work with it.
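
No guarantee (facebook's anti-bot heuristics are opaque), but slowing down further and adding jitter through scrapy's built-in settings sometimes helps; these are all standard scrapy options, and the values are only a starting point:

CONCURRENT_REQUESTS = 1
DOWNLOAD_DELAY = 10              # seconds between requests
RANDOMIZE_DOWNLOAD_DELAY = True  # waits 0.5x-1.5x of DOWNLOAD_DELAY
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5
AUTOTHROTTLE_MAX_DELAY = 60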

Thanks.

feed list of links to scrape comments?

Hi,

I was wondering, is there a way to feed the list of post urls that the page crawler downloads to the comments scraper? It seems to me that the structure of the comments scraper requires individual links, which would mean re-running the command for every single post. Perhaps making the two scrapers compatible and adding support for a csv list of links to the comments scraper could be a way to avoid that. Or is there an obvious way to do this that I am missing? Thanks!
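
Until the two spiders are integrated, a crude workaround is to loop over the saved urls from the shell, appending everything to one csv (the file of urls, one per line, is an assumption; with scrapy 1.x, -o appends to an existing file):

while read url; do
    scrapy crawl comments -a email="[email protected]" -a password="password" -a page="$url" -a lang="en" -o all_comments.csv
done < post_urls.txt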

Comment crawler does not work

Hi Rugantio! I am using the comment crawler on comments from FB pages in Chinese (e.g., "govnews.hk"). The comment crawler returned zero comments although there should be several. But the crawler works well for comments on Trump's page. Could you please help?
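
One thing worth checking: the spider only recognizes "en", "es", "fr", "it" and "pt" as interface languages, and its XPaths key on those interface strings, so a Chinese-language account UI may be the problem rather than the page itself. Switching the crawling account's facebook interface to English and forcing the parameter may help (POST_ID and PAGE_ID are placeholders):

scrapy crawl comments -a email="[email protected]" -a password="password" -a page="https://mbasic.facebook.com/story.php?story_fbid=POST_ID&id=PAGE_ID" -a lang="en" -o govnews_comments.csv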

Crawled (404) not found

I ran into a problem where the crawler logs Crawled (404) <GET https://mbasic.facebook.com/gettingstarted/DonaldTrump> (referer: https://mbasic.facebook.com/gettingstarted/?_rdr).
How can I solve this problem?


Is this still supported?

csv file limit

Hi, I'm crawling the comments on one of a famous public figure's posts.

Some posts have 50,000+ comments, and I found that the scrapy engine stops crawling at about 10,000 comments.
I checked my fake fb account and it has not been banned.
Is there any limit on the csv file? If there is, is there any way to get around it?
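
Scrapy itself puts no size limit on a csv feed, so a stop around 10,000 comments usually points to a depth or close-spider cap, or to pagination breaking. Worth confirming that none of these caps are active in settings.py (all standard scrapy settings, shown with their 'disabled' values):

CLOSESPIDER_ITEMCOUNT = 0  # 0 = no item cap (the default)
CLOSESPIDER_PAGECOUNT = 0
CLOSESPIDER_TIMEOUT = 0
DEPTH_LIMIT = 0            # deep comment threads need unlimited depth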

I really appreciate this tool.
