
bernardro / actor-youtube-scraper


Apify actor to scrape YouTube search results. You can set the maximum number of videos to scrape per page as well as the date from which to start scraping.

Home Page: https://apify.com/bernardo/youtube-scraper

License: Apache License 2.0

Languages: JavaScript 98.02%, Dockerfile 1.98%
Topics: apifier, apify, crawler, pupetteer, search, youtube

actor-youtube-scraper's People

Contributors

bernardro, levent91, metalwarrior665, olehveselov92, pocesar, rajivm1991, x0r0x, zpelechova

actor-youtube-scraper's Issues

Each request fails with timeout error

Looks like they changed something on the site. The selector is still in the code, but each request fails with:
TimeoutError: waiting for XPath "//ytd-video-primary-info-renderer/div/h1/yt-formatted-string" failed: timeout 30000ms exceeded
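
For anyone debugging this locally, a minimal reproduction sketch in plain Puppeteer (outside the actor's crawler). The XPath is the one from the error above; the fallback CSS selector is only an assumption about the newer watch-page layout:

```js
// Reproduction sketch: wait for the title element the actor expects, then try a
// fallback selector for the redesigned watch-page markup (assumption, may need tweaking).
const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto('https://www.youtube.com/watch?v=BDZ6ujYN610', { waitUntil: 'networkidle2' });

    try {
        // The selector the actor currently waits for (from the error above).
        await page.waitForXPath('//ytd-video-primary-info-renderer/div/h1/yt-formatted-string', { timeout: 30000 });
    } catch (err) {
        // Hypothetical fallback for the newer layout.
        await page.waitForSelector('ytd-watch-metadata h1 yt-formatted-string', { timeout: 30000 });
    }

    const title = await page.$eval('h1 yt-formatted-string', (el) => el.textContent.trim());
    console.log('Title:', title);
    await browser.close();
})();
```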

Only 30 results and old videos missing

{
  "maxResults": 200,
  "postsFromDate": "15 years",
  "verboseLog": false,
  "extendOutputFunction": "async ({ data, item, page, request, customData }) => {\n return item; \n}",
  "extendScraperFunction": "async ({ page, request, requestQueue, customData, Apify, extendOutputFunction }) => {\n \n}",
  "handlePageTimeoutSecs": 3600,
  "proxyConfiguration": {
    "useApifyProxy": true
  },
  "startUrls": [
    {
      "url": "https://www.youtube.com/c/scooterofficial/videos"
    }
  ],
  "customData": {}
}

How to capture the type of subtitles

Hi guys, thanks for this great tool. We would like a new feature so that the actor can determine the type of subtitles provided, i.e. whether they are user-generated or auto-generated.

This info is available on the YouTube video page if the actor clicks the three dots to reveal the transcript popup.

See image here: https://ibb.co/dLYcJFv

Can you guys code this for us in the script? We can pay for your work.
Thank you!
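
Not part of the actor, only a possible direction: a sketch that reads the caption track list from the watch page's player response. The ytInitialPlayerResponse structure and the 'asr' kind marker for auto-generated tracks are assumptions about YouTube's internal player data, not a documented API.

```js
// Hypothetical helper: given a Puppeteer page already on a watch URL, report each
// caption track and whether it appears to be auto-generated ('asr' = automatic
// speech recognition). The shape of ytInitialPlayerResponse is an assumption.
async function getSubtitleTypes(page) {
    return page.evaluate(() => {
        const renderer = window.ytInitialPlayerResponse
            && window.ytInitialPlayerResponse.captions
            && window.ytInitialPlayerResponse.captions.playerCaptionsTracklistRenderer;
        const tracks = (renderer && renderer.captionTracks) || [];
        return tracks.map((t) => ({
            language: t.languageCode,
            autoGenerated: t.kind === 'asr',
        }));
    });
}
```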

Too few results

A general search input like this gave only 29 results and some errors:

  "searchKeywords": "makeup ",
  "maxResults": 999,
  "postsFromDate": "6 month ago",
  "startUrl": "https://www.youtube.com/",
  "proxyConfiguration": {
    "useApifyProxy": true
  },
  "verboseLog": true
}```

Stealth mode failing

It seems all attempts to scrape are failing.

Here are the errors:

"ERROR Stealth: StealthError: Failed to apply stealth trick reason: Refused to evaluate a string as JavaScript because 'unsafe-eval' is not an allowed source of script in the following Content Security Policy directive: "script-src 'report-sample' 'nonce-7PyoQsK7StgoaW6nndvU9g' 'unsafe-inline'".
2021-03-12T20:49:52.869Z ",

2021-03-12T20:50:03.949Z ERROR Request https://www.youtube.com/watch?v=BDZ6ujYN610 failed too many times
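
A minimal check, assuming plain Puppeteer rather than the actor's stealth integration: the CSP error means injected script evaluation is being blocked, and Puppeteer's setBypassCSP can confirm whether that is the only problem.

```js
// Diagnostic sketch: bypass the page's Content Security Policy so injected
// string evaluation runs. Not a fix inside the actor itself.
const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.setBypassCSP(true); // must be called before navigation
    await page.goto('https://www.youtube.com/watch?v=BDZ6ujYN610', { waitUntil: 'networkidle2' });
    // Without setBypassCSP, string evaluation like this can be refused under the 'unsafe-eval' rule.
    const title = await page.evaluate('document.title');
    console.log(title);
    await browser.close();
})();
```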

Old videos are not scraped

{
  "maxResults": 999999,
  "postsFromDate": "20 years",
  "verboseLog": false,
  "startUrls": [
    {
      "url": "https://www.youtube.com/user/dysonteam",
      "method": "GET"
    }
  ],
  "extendOutputFunction": "async ({ data, item, page, request, customData }) => {\n  return item; \n  \"title\"; \"likes\"; \"dislikes\"; \"url\"; \"upload date\"\n}",
  "extendScraperFunction": "async ({ page, request, requestQueue, customData, Apify, extendOutputFunction }) => {\n \n}",
  "proxyConfiguration": {
    "useApifyProxy": true
  },
  "customData": {}
}

It did not scrape 7-year-old videos from the channel.

How to go to the next page?

Hi there, I managed to get the actor working and returning 50 results. The channel we are scraping, however, has many more than 50 videos. How can we get the actor to step through ALL pages so that it scrapes every video on the channel, not just the first 50?
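
Not the actor's own pagination code, but a sketch of the underlying mechanics: channel video grids use infinite scroll rather than numbered pages, so scrolling until the rendered item count stops growing (or a limit is reached) is what exposes the remaining videos. The ytd-grid-video-renderer selector is an assumption about the current markup.

```js
// Hypothetical helper: scroll a channel /videos page until no new video tiles
// appear or maxResults is reached, then return how many tiles are rendered.
async function scrollForAllVideos(page, maxResults = 1000) {
    let previousCount = -1;
    let count = await page.$$eval('ytd-grid-video-renderer', (els) => els.length);
    while (count < maxResults && count !== previousCount) {
        previousCount = count;
        await page.evaluate(() => window.scrollTo(0, document.documentElement.scrollHeight));
        await page.waitForTimeout(2000); // give the next batch of tiles time to load
        count = await page.$$eval('ytd-grid-video-renderer', (els) => els.length);
    }
    return count;
}
```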

Multiple requests per video - ERROR PuppeteerCrawler handleRequestFunction failed

For some reason, when I try to scrape the videos of a channel, multiple requests are made per video because of the error mentioned in the title. Even though the data is collected on the first request, the error makes the request repeat until it has failed too many times. Unfortunately, this has an impact on time and cost, so I would greatly appreciate some feedback on whether this is a known problem or I am doing something wrong.

My input:

{
  "extendOutputFunction": "async ({ data, item, page, request, customData }) => {\n  return item; \n}",
  "extendScraperFunction": "async ({ page, request, requestQueue, customData, Apify, extendOutputFunction }) => {\n \n}",
  "handlePageTimeoutSecs": 3600,
  "proxyConfiguration": {
    "useApifyProxy": true
  },
  "startUrls": [
    {
      "url": "https://www.youtube.com/channel/UCCgVtpDnUeUgOjqKVB3bE_A"
    }
  ],
  "subtitlesLanguage": "en",
  "customData": {},
  "maxComments": 0
}
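
To rule out retry cost while this is investigated, here is a minimal Apify SDK (v1.x) sketch, not the actor's code, showing the crawler options that cap retries; the page-handling body is a placeholder.

```js
// Sketch: a PuppeteerCrawler that retries each failed request at most once,
// so a recurring handler error cannot repeat a video request many times.
const Apify = require('apify');

Apify.main(async () => {
    const requestQueue = await Apify.openRequestQueue();
    await requestQueue.addRequest({ url: 'https://www.youtube.com/channel/UCCgVtpDnUeUgOjqKVB3bE_A' });

    const crawler = new Apify.PuppeteerCrawler({
        requestQueue,
        maxRequestRetries: 1, // stop re-queuing a request after one retry
        handlePageTimeoutSecs: 3600,
        handlePageFunction: async ({ request }) => {
            // Placeholder: the real actor extracts video data here.
            Apify.utils.log.info(`Handled ${request.url}`);
        },
        handleFailedRequestFunction: async ({ request }) => {
            Apify.utils.log.warning(`Request ${request.url} failed too many times`);
        },
    });

    await crawler.run();
});
```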
