Comments (23)

mwpenny commented on August 22, 2024

Update (good news): from what I can tell, the mobile API is not rate limited, or at least, stressing it does not block me. I was able to make 30 requests in parallel with no delays and not get banned. I don't recommend being this aggressive, but using this API seems viable. Further, an obvious benefit is that it offers information in a much more easily parsable form than their HTML markup.

However, if I use their private API in this module, in time Kijiji will almost certainly crack down in the same way they have with scraping. It will very likely become a similar cat/mouse blocking game in the future to what we're seeing now. So now the question becomes: do I use it for short-term stability, or keep scraping the HTML and build in rate limiting to keep Kijiji happy. I have a feeling that if libraries like this one play nice then Kijiji won't kick up as much of a fuss. On the other hand, the API is very useful and well-structured.

So what I'm going to do is create a proof of concept version of this module that uses the mobile API and also build in configurable rate limiting and ban detection (for helpful error messages). This experiment will likely be toggleable or I'll stick it in another branch - I haven't decided yet.

from kijiji-scraper.

mwpenny commented on August 22, 2024

Hm the same thing just happened to me. I was able to access the site, ran the script a few times, and started getting the access denied message. I also got the message on a separate computer which I very rarely use for Kijiji and I could no longer use the mobile app. Clearing my cache and cookies did not help. Only after waiting ~10-15 minutes could I access Kijiji again.

This sounds like a temporary IP ban. I suspect Kijiji is doing some sort of blacklisting to discourage scraping if it detects "unusual" activity. I found that if I make repeated requests on the site in my browser, everything works fine no matter how many pages I request. However, if I change my user agent to node-fetch/1.0 (+https://github.com/bitinn/node-fetch) (which is what the scraper sends) then I immediately get a 403 after the first request.

After I am unbanned (it seems longer this time) I will update the module to send a User-Agent header that looks like a web browser and see if that fixes this.
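A minimal sketch of what sending a browser-like User-Agent could look like. The helper name `withBrowserUserAgent` and the exact Firefox string are illustrative assumptions, not the module's actual code.

```typescript
// Illustrative value only (assumption): any current desktop browser string would do.
const BROWSER_USER_AGENT =
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/115.0";

// Return a copy of `headers` with a browser-like User-Agent.
// Header names are case-insensitive, so any existing user-agent entry is dropped first.
function withBrowserUserAgent(
    headers: Record<string, string> = {},
    userAgent: string = BROWSER_USER_AGENT
): Record<string, string> {
    const merged: Record<string, string> = {};
    for (const name of Object.keys(headers)) {
        if (name.toLowerCase() !== "user-agent") {
            merged[name] = headers[name];
        }
    }
    merged["User-Agent"] = userAgent;
    return merged;
}
```

The result can be passed as the `headers` option of a fetch-style request, e.g. `fetch(url, { headers: withBrowserUserAgent() })`.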

brentdixon84 commented on August 22, 2024

I am getting this access denied error as well. I was using "User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.4147.135 Safari/537.36"

I tried it on my phone (connected via wifi) and it is banned too. Disconnected from wifi and reset the cache and it's back. They definitely appear to be doing targeted blocks.

mwpenny commented on August 22, 2024

@brentdixon84, that user agent string is what your browser sends, correct? Or are you saying you modified kijiji-scraper to send that and still got blocked?

mwpenny commented on August 22, 2024

@zabubu and @brentdixon84, please try the branch scraper-detection. I've modified the module to send a browser user agent string. I did some basic testing over mobile data (so my IP was different). I was able to scrape again many times on one specific ad and a few times on a search result page. However, after a few searches I got blocked again.

Kijiji could be looking at how frequently requests are made so it may be necessary for me to add a small random delay when pulling down the information for multiple ads while searching. I would appreciate it if you two could try out my changes and let me know how it goes so we can get a better sense for how sophisticated this new blocking mechanism is. I'll be able to look at this more when I'm unbanned 😛
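To make the "small random delay" idea concrete, here is a rough sketch. The names `jitteredDelayMs`, `fetchAdsWithJitter`, and the `fetchAd` callback are hypothetical, and the 500 ms base and jitter values are arbitrary placeholders rather than tested thresholds.

```typescript
// Pick a delay in [baseMs, baseMs + jitterMs). `random` is injectable so the
// function can be tested deterministically.
function jitteredDelayMs(
    baseMs: number,
    jitterMs: number,
    random: () => number = Math.random
): number {
    return baseMs + Math.floor(random() * jitterMs);
}

function sleep(ms: number): Promise<void> {
    return new Promise<void>((resolve) => setTimeout(resolve, ms));
}

// Load ads one at a time, pausing for a randomized interval between requests.
// `fetchAd` is a hypothetical callback standing in for whatever loads one ad.
async function fetchAdsWithJitter<T>(
    urls: string[],
    fetchAd: (url: string) => Promise<T>,
    baseMs: number = 500,
    jitterMs: number = 500
): Promise<T[]> {
    const ads: T[] = [];
    for (let i = 0; i < urls.length; i++) {
        if (i > 0) {
            await sleep(jitteredDelayMs(baseMs, jitterMs));
        }
        ads.push(await fetchAd(urls[i]));
    }
    return ads;
}
```

Randomizing the interval avoids a perfectly regular request rhythm, though whether that matters to Kijiji's detection is not known from this thread.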

paulwainwright commented on August 22, 2024

I was banned for over 12 hours.
Only a suggestion, but maybe test using the RSS feed (weblink). This only helps with the search results, not the individual ads. It might not provide enough info, and you could still get banned.
PS: I am experimenting with the feedly app and building a bunch of RSS web pages.

mwpenny commented on August 22, 2024

This module used to use the RSS feed but I switched away from it a while ago because paging was unreliable. It's possible they've made it better, so I can look into that. However, I'll still need to make requests for the individual ads on the results page in order to populate the details (scrapeResultDetails is on by default).

I have a few other ideas too which may be less detectable. I'll be experimenting with those as well

mwpenny commented on August 22, 2024

@paulwainwright it's very possible that the blocking is exponential

heysenbug commented on August 22, 2024

Hey @mwpenny, I tried running my script with the scraper-detection branch and got zero results back for all the queries. After running the script like 3 times, I got locked out of kijiji

scd31 commented on August 22, 2024

My bad, pretty sure this is my fault. I've been scraping their entire website every 24 hours for the past 6 months.

Currently I'm using 100 web proxies to scan the site (I only scan 3 categories now). One request every second was too much, and one request every 5 seconds seemed fine for a while, until they blocked my proxies. I also had the user agent set to look like a Chrome user. I'm not sure if I should send queries even less often, or if I should use something like headless Chrome to emulate a real user.

It would be great if someone could use a tool like mitmproxy to snoop on the app and see if they're using an API. I don't have a play store on my phone (I have removed all proprietary software), otherwise I'd gladly do it myself.

I could also offer a service for making API requests to Kijiji. If I bought some more proxies I could go back to scraping their entire site, and I could offer API access to my DB. It would probably be free for a certain number of requests, and then a buck or two a month after that - I'm not interested in making any money (especially as I kind of screwed you all over), I just want to cover the proxy costs.

Once again, sorry about that!

mwpenny commented on August 22, 2024

Update: I've done some more experimentation tonight and this bot detection seems pretty overzealous. I was able to get blocked without using the module at all! I went to an ad page in Firefox and refreshed several times in quick succession (~10 requests of the same page with ~0.5-1 second in between each). This was enough to get me locked out - I wasn't even running requests in parallel like the module does. My IP could be on a "suspicious" list by this point, but it still seems very trigger happy either way.

So I'm not sure how effective it will be to try to mimic a web browser since Kijiji comes down pretty hard on those too! It seems like they're being extra cautious. Maybe mimicking the Google crawler would work - I can test that tomorrow.

But since the method of bot/scraper detection is seemingly not very intelligent, if any way around this exists then it will probably come down to an oversight on Kijiji's end rather than getting smart with browser fingerprinting, etc. Of course, a workaround may not exist, and any that does could be patched at any time. Worst (and most likely) case I will have to make this module play by Kijiji's rules. That means adding delays between requests, making only one request at a time, lazily fetching search result pages, etc.
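As a rough illustration of "playing by the rules" (one request at a time, a delay between requests, and pages fetched only when needed), something like the following could work. `collectResults` and its `fetchPage` callback are hypothetical names for this sketch, not part of the module.

```typescript
// Fetch result pages strictly one after another, stopping as soon as enough
// results have been gathered, so later pages are never requested unnecessarily.
// `fetchPage` is a hypothetical callback that returns the items on one page
// (an empty array means there are no more pages).
async function collectResults<T>(
    fetchPage: (pageNumber: number) => Promise<T[]>,
    wanted: number,
    delayMs: number,
    sleep: (ms: number) => Promise<void> = (ms) =>
        new Promise<void>((resolve) => setTimeout(resolve, ms))
): Promise<T[]> {
    const results: T[] = [];
    for (let page = 1; results.length < wanted; page++) {
        if (page > 1) {
            await sleep(delayMs); // pause between consecutive page requests
        }
        const items = await fetchPage(page); // only one request in flight
        if (items.length === 0) {
            break; // ran out of pages
        }
        results.push(...items);
    }
    return results.slice(0, wanted);
}
```

Passing `sleep` in as a parameter keeps the function easy to test without real waiting.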

@scd31, why do you need to download the entire website? Also, I know you have good intentions here but I'd like to rely on official Kijiji servers and all of the load balancing/infrastructure that they have put into place. Your archive of Kijiji data will quickly become out of date so I don't think access to it will be very useful. If you buy more proxies, it's possible that they will block those ones and/or step up their scraper detection even more than they already have. And if you one day decide/have to shut down your server, then we will all be out of luck. So I think official servers are the way to go here.

I don't think this is solely your fault as I'm sure others scrape pretty aggressively as well. With such aggressive usage by many users, this was only a matter of time. With that said, I suggest that you ease off if you regain access. When I used to use this every day, I scraped <= 500 ads every hour with no troubles.

As for an API, I can confirm that the mobile app does in fact use a private/undocumented API. I looked into this about 2 years ago but sat on it since I knew it would turn into an inevitable cat/mouse game between me and Kijiji the moment I started using it in this module (kind of like what is happening now with the bot detection). I'm going to be looking into the feasibility of using it in the coming days. However, when I am blocked from the website I am also blocked from using the mobile app. That doesn't bode well. In the meantime, if people are curious about this API and look into it, I ask that you please don't abuse it. I've already documented it and would rather not use it at scale until I'm sure it can be used to work around the blocking.

I will update this thread with my findings!

scd31 commented on August 22, 2024

@mwpenny I was scraping the site every day so that I could keep track of ad changes and general site trends over time. The main thing I used this for was to power a browser extension which would show you the history of any ad (for example, if the price went down or if the description changed), but I had some other things in the works.

Now, I scan the house, apartment, and land categories to power my side project, which is a real estate aggregator. https://www.victr.ca

It does look like they mark IPs as "suspicious" - With my VPN on, I wasn't able to get it to lock me out by clicking around on their site.

I think the mobile API is the way to go - there's a lot of room for advanced browser fingerprinting (such as mouse tracking, etc.) but it's hard to do much more than just basic rate limiting with an API.

Also, web proxies are cheap. I pay $2.50/month for 100 of them. I don't think it would be the end of the world if they were required for scraping the site at more than a few requests per minute.

mwpenny commented on August 22, 2024

Trends in historical data is an interesting use case. Although Kijiji is probably actively working against things like that. If I were them, I'd want people to go to my site for information (and sweet sweet ad traffic), not others.

That's a good find on evidence for an "IP watchlist", thanks. As for the API, they could still block IPs like they're doing now if requests are too frequent. That's one of the things I'm going to look into, and if it's the case then we're not much better off. It's more reliable than scraping though.

If proxies are required then I'd place that burden on users of this module rather than building the proxy-hopping functionality into the module itself. That isn't its goal. Its goal is to simply provide an easy to use interface to retrieve Kijiji data. This allows the code to stay simple and reduces friction when trying to use it in different situations.

Due to the separation of concerns, it is trivial for calling code to proxy the requests in a way that's transparent to the module (as you have done). If I need to add rate limiting, I'll make it configurable so that consumers can crank it up if they want to. I could also add a parameter for a proxy server address.

With all of this said, I am still looking at ways to mitigate this problem. My last resort will be adding rate limiting to the module.

scd31 commented on August 22, 2024

Oh yes, I definitely agree that this package shouldn't be responsible for managing proxies.

mwpenny commented on August 22, 2024

Before I make any changes, I'm going to do some refactoring and add a few tests to make this and future large changes easier

scd31 commented on August 22, 2024

Any chance I could get a copy of the API documentation? I ported your library to Ruby for a project, so the JS implementation isn't as important to me. Of course, I don't mind waiting for you to implement it so that I can reverse engineer it if you'd prefer.

mwpenny commented on August 22, 2024

I'd rather wait until it's implemented to make sure we don't preemptively set off any red flags. The JS implementation is my priority. I'm going to be cautious with the implementation to try to make it look as much like normal traffic as I can and make the most of this opportunity. I'm not going to obfuscate my code or be deliberately confusing - it should be pretty easy to understand when I'm done.

Once I'm finished refactoring and testing it shouldn't take long to add a backend for the mobile API. But I'd like to do this right and not rush so I can be confident in the results and longevity of this method.

mwpenny commented on August 22, 2024

I've refactored this module to my satisfaction (#38; it is now written in TypeScript) and added many tests. These two things make it very easy to make changes with confidence. I will add support for the mobile API next which should be fairly quick due to the new project structure.

However, I tested my refactor against Kijiji several times within the past hour or so and was not banned. I didn't even fake my user agent. Their blocking policy may have become less aggressive. Can anybody else confirm similar results? Granted, I did my search with scrapeResultDetails set to false so I didn't hammer them with 40 requests at once - that could be why (I'm holding off on trying to get banned until I've finalized support for the mobile API).

mwpenny commented on August 22, 2024

I've finished my implementation. It's been merged and an updated version of the module is now in NPM (v6.0.0).

The mobile API will be used by default but you can use the old method if you want by using the new Scraper Options type. Ad.Get() takes it as an argument and search() accepts the scraperType property in the existing options argument. Note that some search parameters are affected by this change (e.g., params.keywords -> params.q).

When using the mobile API, I try to appear the same as the Android app by sending the same HTTP headers. When using the website, I send a User-Agent header indicating Firefox on Windows.

I've also added configurable rate limiting. When searching, pageDelayMs can be used to add a delay between each page of results. Its value is 1000 by default. When scrapeResultDetails is set to true, resultDetailsDelayMs can be used to add a delay in between each request for details (note that this forces the requests to be executed serially rather than in parallel like they were up until now - a value of 0 is the same as the old behavior). One nice thing about the mobile API is that result details are part of the search results so scrapeResultDetails is actually unnecessary if using the mobile API (if true in that case it will have no effect).
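To summarize how these options interact, here is a self-contained sketch. It does not use the module's real types: `SearchOptionsSketch`, `RequestPlan`, `planRequests`, and the `"api"`/`"html"` string values are hypothetical stand-ins. The defaults for `scraperType` and `pageDelayMs` follow the description above; the defaults shown for `scrapeResultDetails` and `resultDetailsDelayMs` are assumptions made for illustration.

```typescript
type ScraperTypeSketch = "api" | "html"; // hypothetical values

interface SearchOptionsSketch {
    scraperType?: ScraperTypeSketch;
    pageDelayMs?: number;
    scrapeResultDetails?: boolean;
    resultDetailsDelayMs?: number;
}

interface RequestPlan {
    scraperType: ScraperTypeSketch;
    pageDelayMs: number;             // delay between pages of search results
    fetchDetailsSeparately: boolean; // one extra request per ad?
    detailsInParallel: boolean;      // only when the details delay is 0
    resultDetailsDelayMs: number;
}

function planRequests(options: SearchOptionsSketch = {}): RequestPlan {
    const scraperType = options.scraperType ?? "api"; // mobile API by default
    const pageDelayMs = options.pageDelayMs ?? 1000;  // stated default
    const scrapeResultDetails = options.scrapeResultDetails ?? true; // assumed default
    const resultDetailsDelayMs = options.resultDetailsDelayMs ?? 0;  // placeholder default

    // API search results already contain the details, so per-ad requests are
    // only needed when scraping the HTML site.
    const fetchDetailsSeparately = scraperType === "html" && scrapeResultDetails;

    return {
        scraperType,
        pageDelayMs,
        fetchDetailsSeparately,
        // A non-zero delay forces the detail requests to run serially.
        detailsInParallel: fetchDetailsSeparately && resultDetailsDelayMs === 0,
        resultDetailsDelayMs,
    };
}
```

For example, `planRequests({ scraperType: "html", resultDetailsDelayMs: 250 })` describes a run that fetches each ad's details one at a time with a 250 ms pause between them.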

Finally, the module will detect if you get blocked and throw an error letting you know. I was unable to get myself banned using the mobile API but to be fair, I didn't try too hard. This is good to have anyway.
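A simple form of ban detection could look like the sketch below. It assumes that a block shows up as an HTTP 403 response, as reported earlier in this thread; the module's real check may differ, and `BlockedError` and `throwIfBlocked` are hypothetical names.

```typescript
// Error type carrying the URL that was refused, so callers can log or retry later.
class BlockedError extends Error {
    constructor(public readonly url: string) {
        super(
            `Kijiji denied access to ${url}. You are likely temporarily blocked; ` +
            `wait a while and consider increasing the delay between requests.`
        );
        this.name = "BlockedError";
    }
}

// Assumption: a 403 status is the signal for a block.
function throwIfBlocked(status: number, url: string): void {
    if (status === 403) {
        throw new BlockedError(url);
    }
}
```

Calling `throwIfBlocked(response.status, url)` right after each request turns a confusing parse failure into an explicit, actionable error.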

Sorry everyone for how long this took. I used this as an opportunity to do some work I've been meaning to do for quite a while (refactoring+tests). Future changes will be significantly easier and it should be much more difficult to introduce bugs. In any case, I'm happy with how this turned out.

Enjoy, and please try to be reasonable so that others can enjoy as well.

mwpenny commented on August 22, 2024

Correction: it's possible to be blocked from the API. It just happened to me. I'm not sure if it was due to how fast I scraped or not - I only made one request. Maybe it's my pattern of activity today, making the exact same request or one with slightly different parameters than the app normally uses. It could be that they were able to fingerprint me as an unofficial client (this is what I'm leaning towards). I was unblocked after a fairly short time.

Note that while blocked from the API I am not blocked from the website. Before, when I was blocked from the website I would be unable to access the API (I assumed the rule worked in reverse too). I haven't tested if this is still true but it seems like they're still tweaking their blocking mechanism.

Interestingly, after being blocked I was still able to successfully hit the API from my web browser (while the module and mobile app were both unable to). This makes me think they're blocking based on something like <IP, user agent> pair.

Anyway there are still some unknowns here but the new version should be much better in this regard. Please test it out and let me know how it goes. I had no issues with blocking for all of last week. If you end up digging into the API and finding something useful regarding blocking, please share your findings.

DudeManGuy0 commented on August 22, 2024

Why not set up an organized experiment so we can test this properly and make an inference on the algorithm they use to detect scrapers and then engineer ours to work with it? Also, how was the hidden kijiji API found in the first place? (if it was found at all).

mwpenny commented on August 22, 2024

I found the API a few years ago by intercepting network communications between the mobile app and Kijiji servers. I waited until recently to use it in this module because at the time of discovery I didn't find any other information about it online. Rather than publish information about the API and tip off Kijiji when my module already worked, I wanted it as a backup plan in case they ever cracked down on HTML scrapers - which they have now done.

Based on my experience poking around with both API and HTML scraping, I don't think the detection or amount of time to block are solely governed by simple rules like checking for HTTP headers. There's probably some AI classification going on that takes a number of factors into account and assigns a risk score.

So at the end of the day, I believe that they will rate limit you if you scrape "suspiciously". Because the definition of "suspicious" depends on not just the number but also the nature of the requests you're making (how many pages of data, time between requests for consecutive search result pages, etc.) I decided to ultimately leave it up to the user since they will know how "suspicious" their scraping is better than I ever can.

I've chosen what I believe are sane defaults to mitigate blocking in most cases (mobile API as the default backend, 1 second delay between requests for search result pages). Based on my testing, they allow a more than reasonable amount of scraping, but it's all configurable to suit the type of scraping you're doing.

With all of that said, I'm more than happy to incorporate ideas and feedback to improve the module in this area. Are you being blocked? If yes, how often and what kinds of requests are you making? Also, what kind of organized experiment did you have in mind? Part of the problem is that it's difficult to anticipate all of the different kinds of scraping different people will do when I'm testing this. I'd appreciate if people could share what their usage is like and how that translates to blocking (how often, for how long, how detection and block time changes after being blocked and unblocked more and more, what scraper settings are being used, etc.). I think that real-world data will be the most useful to me. With that, I may be able to do more here.

mwpenny commented on August 22, 2024

I'm going to close this now that mobile API support has been in for a while, along with options to scrape more reasonably (i.e., pageDelayMs). Please comment here or create a new issue if there are problems.
