cloudshao / birdcoop

Birdcoop is a distributed Twitter crawler

Shell 0.70% JavaScript 38.62% Python 60.67%

birdcoop's People

frankiea ebottabi zymitsky

birdcoop's Issues

Worker only returns up to 100 followers and 100 followees of each user

Right now, worker responses contain at most 100 followers and 100 followees per user.

When the worker asks the Twitter API for followers and followees, it should paginate, making as many requests as needed to get all of them, not just the first 100.
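
A minimal sketch of the pagination loop described above, assuming a cursor-style API in which each page carries a list of ids plus a cursor for the next page. The `fetch_page` callable, the `ids` and `next_cursor` keys, and the `-1`/`0` cursor conventions are assumptions for illustration, not the worker's actual code:

```python
def fetch_all_ids(fetch_page, user_id):
    """Collect every id by following cursors until no pages remain.

    fetch_page(user_id, cursor) is a hypothetical callable that performs
    one API request and returns a dict with 'ids' and 'next_cursor' keys.
    """
    ids = []
    cursor = -1  # assumed convention: -1 requests the first page
    while cursor != 0:  # assumed convention: 0 means no more pages
        page = fetch_page(user_id, cursor)
        ids.extend(page['ids'])
        cursor = page['next_cursor']
    return ids


# Stand-in for the real API call: three pages of fake follower ids.
_FAKE_PAGES = {
    -1: {'ids': [1, 2], 'next_cursor': 10},
    10: {'ids': [3, 4], 'next_cursor': 20},
    20: {'ids': [5], 'next_cursor': 0},
}


def fake_fetch_page(user_id, cursor):
    return _FAKE_PAGES[cursor]


followers = fetch_all_ids(fake_fetch_page, user_id=42)
print(followers)
```

The same loop would run once for followers and once for followees. Each extra page is another request, so it counts against the crawl rate limit.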

Workers should start making requests at different times

Right now all the workers start at the top of the hour and work as they can, so traffic spikes and then tapers off gradually over the rest of the hour. This is a bit overwhelming and not very good for testing.

The workers' start times should be spread out across the hour.
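
One possible way to space the workers out, sketched under the assumption that each worker knows its own index and the total number of workers (neither is stated in this issue). Every worker gets an evenly spaced offset within the hour and sleeps for that long before its first request:

```python
import time


def start_offsets(num_workers, period_seconds=3600):
    """Return one evenly spaced start offset (in seconds) per worker."""
    return [i * period_seconds // num_workers for i in range(num_workers)]


def wait_for_slot(worker_index, num_workers, period_seconds=3600):
    """Sleep until this worker's slot; hypothetical hook called at startup."""
    time.sleep(start_offsets(num_workers, period_seconds)[worker_index])


offsets = start_offsets(4)
print(offsets)
```

If workers do not know how many peers exist, a random offset within the hour would be a simpler alternative, at the cost of less even spacing.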

Master receives responses, but doesn't seem to persist them in the database

Steps:

1. Log on to reala and start/view the server app
2. Start 'sqlite3 awesomeDB' and note the number of users using 'select count(*) from user_table'
3. Run a worker a bunch of times, or wait an hour for the regular workers to check in
4. Start sqlite3 again and check the number of users

Expected:

Number of users increased
There is some output on the console that says
Beginning parsing data for user

Actual:

Number of users is unchanged
The only output on the console is:

Worker Connection Received
Followers received on server
loading json
Followers received on server

May be related to #1
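
The before/after user count from the steps above could also be scripted. This is only a sketch: it reuses the query from the steps, but the in-memory database and the one-column schema for `user_table` are stand-ins, since the real schema of awesomeDB is not shown here:

```python
import sqlite3


def count_users(conn):
    """Run the same count query used in the repro steps."""
    (count,) = conn.execute('select count(*) from user_table').fetchone()
    return count


# Stand-in database; against the real one this would be
# sqlite3.connect('awesomeDB').
conn = sqlite3.connect(':memory:')
conn.execute('create table user_table (id integer)')  # hypothetical schema
conn.execute('insert into user_table values (1)')

before = count_users(conn)
conn.execute('insert into user_table values (2)')  # simulates one persisted response
after = count_users(conn)
conn.close()

print(before, after)
```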

Worker should return null followers/followees when user was private/uncrawlable

Right now if a worker is assigned to crawl a user that is private, it will:

Hit an exception
Stop crawling, and
Never return to the master with anything

It should actually:

Return the following dict:
{
'user' : <user_id>
}
Keep crawling
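
A sketch of the requested behaviour, assuming the private-user failure surfaces as an exception the worker can catch. `UncrawlableUser`, the fetch callables, and the `followers`/`followees` keys are hypothetical names; only the `{'user': <user_id>}` shape comes from this issue:

```python
class UncrawlableUser(Exception):
    """Hypothetical error for a private or otherwise uncrawlable user."""


def crawl_user(user_id, fetch_followers, fetch_followees):
    """Return the full result, or just {'user': id} if the user is uncrawlable."""
    try:
        return {
            'user': user_id,
            'followers': fetch_followers(user_id),
            'followees': fetch_followees(user_id),
        }
    except UncrawlableUser:
        return {'user': user_id}


def fake_fetch(user_id):
    # Stand-in for the API call: user 2 is private.
    if user_id == 2:
        raise UncrawlableUser(user_id)
    return [user_id * 10]


# The loop keeps going after the private user instead of stopping.
results = [crawl_user(uid, fake_fetch, fake_fetch) for uid in (1, 2, 3)]
print(results)
```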

Master not replying with user to crawl

This is happening at the moment, but if you restart the server it'll probably go away, so try to capture as much debug info as you can before restarting...

Repro steps:

Run a worker with 'python worker.py' (can do it on your own machine)

Expected:

Worker script finishes after a while
Master prints something about connection received to console

Actual:

Worker never finishes
If you press ^C, you'll see it was stuck waiting to receive the user to crawl
Master doesn't print out anything new to console
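
This does not fix the master, but a receive timeout on the worker side would keep it from hanging forever and make the failure easier to report. A minimal sketch, assuming the worker talks to the master over a plain socket; the function name and the 4096-byte read size are illustrative:

```python
import socket


def receive_user_to_crawl(sock, timeout_seconds=30.0):
    """Wait for the master's reply; return None if nothing arrives in time."""
    sock.settimeout(timeout_seconds)
    try:
        return sock.recv(4096)
    except socket.timeout:
        return None


# A connected socket pair stands in for the worker/master connection.
worker_end, master_end = socket.socketpair()

silent_reply = receive_user_to_crawl(worker_end, timeout_seconds=0.2)

master_end.sendall(b'12345')  # simulated master reply
real_reply = receive_user_to_crawl(worker_end, timeout_seconds=0.2)

worker_end.close()
master_end.close()
print(silent_reply, real_reply)
```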

GNU screen takes up the majority of the CPU when master is parsing

When the master is doing its response parsing, this is the output of top -u cloud:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND            
11593 cloud     20   0  4916  976  572 S 55.3  0.0   2:09.97 screen             
11625 cloud     20   0  294m  28m 3088 S 35.6  1.4   1:19.59 python

I would expect python to use a lot of CPU while parsing, but GNU screen using more CPU than python suggests the master might be wasting cycles on printing to stdout.
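
If printing is indeed the cost, one option is to route console output through the standard `logging` module and raise the level, so per-user messages are dropped before they reach the terminal. A sketch under that assumption; the logger name is made up, and a `StringIO` stands in for stdout:

```python
import io
import logging


def make_logger(stream, level=logging.WARNING):
    """Build a logger that only emits messages at or above `level`."""
    logger = logging.getLogger('birdcoop.master')  # hypothetical logger name
    logger.handlers = []
    logger.propagate = False
    logger.addHandler(logging.StreamHandler(stream))
    logger.setLevel(level)
    return logger


out = io.StringIO()  # stand-in for sys.stdout
log = make_logger(out)

for user_id in range(1000):
    # Dropped at WARNING level, so nothing is written for each user.
    log.debug('Beginning parsing data for user %s', user_id)
log.warning('parsed %d users', 1000)

printed = out.getvalue()
print(printed)
```

Passing `logging.DEBUG` as the level restores the per-user output when it is needed for debugging.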

Worker should exit gracefully on 404 (When crawl rate exceeded)

When a worker tries to crawl faster than the allowed rate, Twitter returns a 404. The worker doesn't catch it, so it just dies on an exception. It should catch the error and exit gracefully, which will look better when we're being graded.
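
A sketch of catching the 404, assuming the worker makes its requests with `urllib` so the failure surfaces as `urllib.error.HTTPError` (the actual HTTP library is not named in this issue). `run_crawl` and the fake crawl function are hypothetical:

```python
import email.message
import io
import urllib.error


def run_crawl(crawl_once):
    """Run one crawl step and report a status instead of crashing on a 404."""
    try:
        crawl_once()
    except urllib.error.HTTPError as err:
        if err.code == 404:
            print('Crawl rate exceeded (HTTP 404); stopping worker.')
            return 'rate_limited'
        raise  # any other HTTP error is still unexpected
    return 'ok'


def fake_rate_limited_crawl():
    # Stand-in for a real request that the API rejects with a 404.
    raise urllib.error.HTTPError(
        'http://example.invalid/followers', 404, 'Not Found',
        email.message.Message(), io.BytesIO())


status = run_crawl(fake_rate_limited_crawl)
ok_status = run_crawl(lambda: None)
```

The caller can then exit cleanly, for example with `sys.exit(0)`, when the status is `'rate_limited'`.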

Master service rates drop the longer it runs

When a server is started, it seems to go at a rate of about 5000 requests serviced per hour. After the first two hours, the rate seems to go down drastically. This is the rate history output from Dec 13 (from typing 'rate'):

rates:
(5436, 5089)
(5458, 5447)
(911, 740)
(669, 508)
(648, 497)
(518, 350)
(506, 366)
(306, 279)
connections, responses this hour: 719, 411

This might be due to the "error: can't start new thread" error, which doesn't seem to happen when we first start the server but appears sometime later.
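
If the master starts a new thread per connection and those threads pile up, a fixed-size pool would cap the thread count. This is a sketch under that assumption; `handle_connection` and the pool size are placeholders, not the master's real code:

```python
from concurrent.futures import ThreadPoolExecutor

MAX_HANDLER_THREADS = 8  # assumed cap; tune for the host


def handle_connection(conn_id):
    """Hypothetical stand-in for the master's per-connection work."""
    return conn_id * 2


# At most MAX_HANDLER_THREADS threads exist at once; extra work waits
# in the pool's queue instead of starting a new thread.
with ThreadPoolExecutor(max_workers=MAX_HANDLER_THREADS) as pool:
    results = list(pool.map(handle_connection, range(100)))

print(len(results))
```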

Master is consuming too much memory

Our app on reala seems to be consuming 842 MB of virtual memory (775 MB resident), and it grows whenever there's a worker request.

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND           
27840 cloud     20   0  2364 1008  808 R  0.3  0.0   0:00.02 top               
15530 cloud     20   0  5048 1160  580 S  0.0  0.1 116:24.91 screen            
15531 cloud     20   0  4932 1576 1284 S  0.0  0.1   0:00.01 bash              
15655 cloud     20   0  4932 1572 1284 S  0.0  0.1   0:00.01 bash              
15690 cloud     20   0  842m 775m 3092 S  0.0 38.5  75:22.24 python
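
One way to find where the memory goes is to compare `tracemalloc` snapshots taken before and after handling requests. The sketch below assumes the master can run on Python 3.4 or later, where `tracemalloc` is available; the list allocation is only a stand-in for data retained per request:

```python
import tracemalloc


def top_growth(before, after, limit=5):
    """Return the source lines whose allocations changed the most."""
    return after.compare_to(before, 'lineno')[:limit]


tracemalloc.start()
before = tracemalloc.take_snapshot()

# Stand-in for data the master keeps around after a worker request.
retained = [str(i) * 10 for i in range(50000)]

after = tracemalloc.take_snapshot()
tracemalloc.stop()

growth = top_growth(before, after)
for stat in growth:
    print(stat)
```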
