cloudshao / birdcoop
Birdcoop is a distributed Twitter crawler
Right now worker responses contain at most 100 followers.
When the worker asks the Twitter API for followers, it should paginate and make as many requests as needed to retrieve all the followers and followees, not just the first 100.
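A minimal sketch of the pagination loop, assuming the endpoint is cursor-based: each page is a dict with an `ids` list and a `next_cursor` value, the first page is requested with cursor `-1`, and a `next_cursor` of `0` marks the last page. `fetch_page` is a hypothetical callable standing in for whatever the worker uses to make the HTTP request; it is not part of the existing code.

```python
def fetch_all_ids(fetch_page, max_pages=1000):
    """Follow cursors until the API reports that no pages remain.

    fetch_page is a hypothetical callable: cursor -> {"ids": [...], "next_cursor": int}.
    max_pages is a safety cap so a misbehaving response cannot loop forever.
    """
    ids = []
    cursor = -1  # assumed convention: -1 requests the first page
    for _ in range(max_pages):
        page = fetch_page(cursor)
        ids.extend(page["ids"])
        cursor = page["next_cursor"]
        if cursor == 0:  # assumed convention: 0 means no further pages
            break
    return ids
```

Each page costs one request against the rate limit, so a user with many followers will use up the worker's hourly allowance faster than the current single-request behaviour.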
Right now all the workers start at the top of the hour and work as fast as they can, so the traffic spikes and then tapers off over the rest of the hour. This is a bit overwhelming and makes testing harder.
The workers' start times should be spread out across the hour.
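One possible way to space the workers out is to give each one a fixed slot within the hour, optionally with some random jitter. This is only a sketch: `worker_index` and `num_workers` are hypothetical values that the master would have to hand out, and nothing in the current code is known to provide them.

```python
import random


def start_offset_seconds(worker_index, num_workers, period=3600, jitter=0.0):
    """Return how many seconds after the top of the hour a worker should start.

    worker_index and num_workers are assumed to be assigned by the master.
    jitter is a fraction of one slot (0.0 to 1.0) added at random.
    """
    slot = period / num_workers
    offset = worker_index * slot
    if jitter:
        offset += random.uniform(0, jitter * slot)
    return offset % period
```

With 4 workers and no jitter, the offsets are 0, 900, 1800 and 2700 seconds, so requests arrive in four evenly spaced batches instead of one spike.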
Steps:
Expected:
Actual:
Number of users is unchanged
The only output on the console is:
Worker Connection Received
Followers received on server
loading json
Followers received on server
May be related to #1
Right now if a worker is assigned to crawl a user that is private, it will:
It should actually:
This is happening at the moment, but if you restart the server it'll probably go away, so try to capture as much debug info as you can before restarting...
Repro steps:
Expected:
Actual:
When the master is doing its response parsing, this is the output of top -u cloud:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
11593 cloud 20 0 4916 976 572 S 55.3 0.0 2:09.97 screen
11625 cloud 20 0 294m 28m 3088 S 35.6 1.4 1:19.59 python
I would expect python to use a lot of CPU while parsing, but GNU screen using even more suggests that most of the cycles might be going to printing to stdout.
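One way to test this hypothesis is to route the per-response messages through the standard `logging` module at DEBUG level, so they can be switched off without touching the parsing code. The logger name and the `on_followers_received` function below are hypothetical, not taken from the existing code.

```python
import logging

# Hypothetical logger name; WARNING and above are kept, DEBUG chatter is dropped.
log = logging.getLogger("birdcoop.master")
log.setLevel(logging.WARNING)


def on_followers_received(count):
    """Record a per-response message without writing it to the terminal."""
    # With the level at WARNING this call returns before formatting the string.
    log.debug("Followers received on server: %d", count)
```

If screen's CPU usage drops once the messages are silenced, stdout was the bottleneck; if not, the cost lies elsewhere.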
When a worker tries to crawl faster than the allowed rate, Twitter returns a 404, which the worker doesn't catch, so it dies with an unhandled exception. It should catch the error and exit gracefully; that will also look better when we're being graded.
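A minimal sketch of the graceful handling, assuming Python 3 and that the worker makes its requests through `urllib`, which raises `urllib.error.HTTPError` for error status codes. `fetch` is a hypothetical zero-argument callable wrapping the actual request.

```python
import sys
import urllib.error


def safe_fetch(fetch):
    """Run fetch() and return its result, or None if the server answers with an HTTP error."""
    try:
        return fetch()
    except urllib.error.HTTPError as err:
        # Report the status code on stderr instead of dying with a traceback.
        print("crawl stopped: HTTP %d" % err.code, file=sys.stderr)
        return None
```

The caller can treat `None` as the signal to stop crawling and shut the worker down cleanly.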
When a server is started, it seems to service about 5000 requests per hour. After the first two hours, the rate drops drastically. This is the rate history output from Dec 13 (from typing 'rate'):
rates:
(5436, 5089)
(5458, 5447)
(911, 740)
(669, 508)
(648, 497)
(518, 350)
(506, 366)
(306, 279)
connections, responses this hour: 719, 411
This might be due to the "error: can't start new thread" error, which doesn't seem to occur when we first start the server but appears sometime later.
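If the master starts one thread per worker connection (an assumption; the issue does not say how threads are created), a bounded pool is one way to keep the thread count from growing until the OS refuses new ones. The sketch below uses the standard `concurrent.futures` module from Python 3; `handler` and `requests` are hypothetical stand-ins for the master's connection handling.

```python
from concurrent.futures import ThreadPoolExecutor


def serve_all(handler, requests, max_workers=8):
    """Process every request using at most max_workers threads.

    Requests beyond the pool size wait in a queue instead of spawning new threads.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(handler, requests))
```

Results come back in the same order as the requests, and the `with` block waits for all handlers to finish before returning.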
Our app on reala seems to be consuming 842MB of RAM, and the figure grows whenever there's a worker request.
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
27840 cloud 20 0 2364 1008 808 R 0.3 0.0 0:00.02 top
15530 cloud 20 0 5048 1160 580 S 0.0 0.1 116:24.91 screen
15531 cloud 20 0 4932 1576 1284 S 0.0 0.1 0:00.01 bash
15655 cloud 20 0 4932 1572 1284 S 0.0 0.1 0:00.01 bash
15690 cloud 20 0 842m 775m 3092 S 0.0 38.5 75:22.24 python
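To narrow down where the memory goes, one option is to compare allocation snapshots taken before and after handling a request. This sketch assumes Python 3, where the standard `tracemalloc` module is available; `action` is a hypothetical callable that handles one worker request.

```python
import tracemalloc


def top_growth(action, limit=5):
    """Run action() and return the source lines whose allocations changed the most."""
    tracemalloc.start()
    before = tracemalloc.take_snapshot()
    action()
    after = tracemalloc.take_snapshot()
    tracemalloc.stop()
    # Each entry carries a traceback and the size difference in bytes.
    return after.compare_to(before, "lineno")[:limit]
```

Lines that keep appearing at the top across many requests are the likeliest places where data is being retained.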