bernardro / actor-youtube-scraper

Apify actor to scrape Youtube search results. You can set the maximum videos to scrape per page as well as the date from which to start scraping.

Home Page: https://apify.com/bernardo/youtube-scraper

License: Apache License 2.0

Dockerfile 1.98% JavaScript 98.02%

apify apifier crawler search youtube pupetteer

actor-youtube-scraper's Issues

Actor gets stuck

The actor gets stuck on this selector: #button[aria-label="Search filters"]. Maybe this happens when YouTube serves a different language version? In Czech, for example, the label is "Filtry vyhledavani" instead of "Search filters". (Could this be caused by proxies or by a particular search term?) https://my.apify.com/view/runs/pYuyEZ4kJoqtdzFKX
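One way to avoid depending on a localized label is to request the English interface explicitly. A minimal sketch, assuming YouTube honors the `hl` query parameter for the interface language; the helper name `withInterfaceLanguage` is hypothetical and not part of the actor.

```javascript
// Hypothetical helper: pin the interface language on a YouTube URL so that
// aria-label based selectors such as "Search filters" stay stable.
// Assumption: YouTube respects the `hl` query parameter.
function withInterfaceLanguage(url, lang = 'en') {
    const parsed = new URL(url);
    parsed.searchParams.set('hl', lang);
    return parsed.toString();
}

console.log(withInterfaceLanguage('https://www.youtube.com/results?search_query=magic+arena'));
```

Proxies in other countries may still change other parts of the page, so this only addresses the label text.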

Node is either not visible or not an HTMLElement

Often it fails to open YouTube at all. https://my.apify.com/view/runs/QCIZW5HCCEJNz8JCb

actor does not return any data

This input did not return any data:
https://my.apify.com/view/runs/nMnbYA8zKm8iiAwWx

Duration is sometimes wrong (too low)

Look at the durations that are under 1 min
https://api.apify.com/v2/datasets/PWoOJRCpZZQvwUIrn/items?format=json&clean=1
Maybe it is showing the duration of an ad?
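To find the affected items in a dataset, something like the following could help. It is only a sketch: it assumes each item has a `duration` string such as "0:45" or "1:02:03", and both function names are made up for illustration.

```javascript
// Convert a duration label such as "0:45" or "1:02:03" into seconds.
// Returns null when the label does not look like a duration.
function durationToSeconds(label) {
    const text = String(label).trim();
    if (!/^\d+(:\d{1,2}){0,2}$/.test(text)) return null;
    return text.split(':').reduce((total, part) => total * 60 + Number(part), 0);
}

// Return the items whose duration is suspiciously short (under one minute by default).
// Assumption: each dataset item carries a `duration` string field.
function findSuspiciousDurations(items, minSeconds = 60) {
    return items.filter((item) => {
        const seconds = durationToSeconds(item.duration);
        return seconds !== null && seconds < minSeconds;
    });
}
```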

Gets stuck on videos that are yet to be uploaded (concerts in the future)

It times out on the movie_player span.ytp-time-duration selector because there is no duration yet.
Suggested changes: either skip those pages, or change the selector used with waitFor and distinguish between pages that have a video and pages that will have a video.
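The second suggestion could look roughly like this: wait for whichever selector appears first and branch on the result. This is a sketch, not the actor's code. `page` is expected to be a Puppeteer page, and `upcomingSelector` is a placeholder, since the real selector for not-yet-uploaded videos is not given in this issue.

```javascript
// Wait for either the duration element or an "upcoming" marker, whichever shows up first.
// Resolves to 'has-video', 'upcoming', or 'unknown' when neither appears before the timeout.
async function detectVideoState(page, { durationSelector, upcomingSelector, timeout = 30000 }) {
    const wait = (selector, state) => page
        .waitForSelector(selector, { timeout })
        .then(() => state);
    try {
        // Promise.any resolves with the first selector that is found.
        return await Promise.any([
            wait(durationSelector, 'has-video'),
            wait(upcomingSelector, 'upcoming'),
        ]);
    } catch (err) {
        // Both waits timed out or failed.
        return 'unknown';
    }
}
```

Returning 'upcoming' lets the caller skip the duration step instead of waiting for the full timeout.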

Result set not filtering correctly

I am running a task to get the result set dated yesterday. Previously I set the input parameter "Time frame" to "yesterday" and it gave the correct result set. A couple of days ago the task stopped accepting "yesterday" as input. Now I am running the task with the parameter set to "1 day", but it returns a lot of old data. Can you fix this issue?

dataset_YouTubeScrape-Beta_inutparamas1day.csv
dataset_YouTubeScrape-Beta_inputparamasyestrday.csv
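As a workaround, old items can be filtered out after the run. The sketch below is an assumption about how a time frame such as "1 day" could be turned into a cutoff date; it is not how the actor parses `postsFromDate`. It treats "yesterday" as the last 24 hours, approximates months and years, and assumes each item has a parseable `date` field.

```javascript
// Approximate length of each supported unit in milliseconds.
const UNIT_MS = {
    hour: 3600 * 1000,
    day: 24 * 3600 * 1000,
    week: 7 * 24 * 3600 * 1000,
    month: 30 * 24 * 3600 * 1000, // approximation
    year: 365 * 24 * 3600 * 1000, // approximation
};

// Turn "1 day", "15 years" or "6 month ago" into the earliest accepted date.
function cutoffDate(timeFrame, now = new Date()) {
    const text = timeFrame.trim().toLowerCase();
    if (text === 'yesterday') return new Date(now.getTime() - UNIT_MS.day);
    const match = /^(\d+)\s+(hour|day|week|month|year)s?(\s+ago)?$/.exec(text);
    if (!match) throw new Error(`Unrecognized time frame: ${timeFrame}`);
    return new Date(now.getTime() - Number(match[1]) * UNIT_MS[match[2]]);
}

// Keep only the items published on or after the cutoff.
// Assumption: `item.date` is a string that `new Date()` can parse.
function filterByTimeFrame(items, timeFrame, now = new Date()) {
    const cutoff = cutoffDate(timeFrame, now);
    return items.filter((item) => new Date(item.date) >= cutoff);
}
```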

Only 30 results and old videos missing

{
  "maxResults": 200,
  "postsFromDate": "15 years",
  "verboseLog": false,
  "extendOutputFunction": "async ({ data, item, page, request, customData }) => {\n return item; \n}",
  "extendScraperFunction": "async ({ page, request, requestQueue, customData, Apify, extendOutputFunction }) => {\n \n}",
  "handlePageTimeoutSecs": 3600,
  "proxyConfiguration": {
    "useApifyProxy": true
  },
  "startUrls": [
    {
      "url": "https://www.youtube.com/c/scooterofficial/videos"
    }
  ],
  "customData": {}
}

Incorrect information for videos with more than 1mio views

Example run: https://my.apify.com/view/runs/ZkRAZ8AY4QpcsqkBL
If a video has more than 1 million views, the result is truncated to thousands (it shows only 1k).
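One plausible cause, and this is a guess, is that the count text is converted to a number before all thousands separators are removed. A sketch of a parser that strips every non-digit first; `parseViewCount` is a hypothetical name, and it assumes the page shows the full count (for example "5,369,385 views") rather than an abbreviated one.

```javascript
// Parse a full view-count label into a number by dropping every non-digit character.
// Returns null when the label contains no digits (for example "No views").
// Assumption: the label holds the full count, not an abbreviation such as "5.3M".
function parseViewCount(text) {
    const digits = String(text).replace(/[^\d]/g, '');
    return digits.length > 0 ? Number(digits) : null;
}

console.log(parseViewCount('5,369,385 views'));
```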
"Except for one of the videos where the actual views on the page are 5,369,385 but the results show only 5369"

update input schema

Add the ability to scrape by URL as well: not only by search terms but also by multiple start URLs,
for example to be able to input https://www.youtube.com/results?search_query=tristan+eckerson
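Supporting start URLs means the actor has to tell the URL types apart. A rough sketch of such a classifier, covering only the URL shapes that appear in these issues; the function name and the returned labels are made up for illustration.

```javascript
// Classify a start URL as a search page, a video page, or a channel page.
function classifyStartUrl(url) {
    const { hostname, pathname, searchParams } = new URL(url);
    if (!/(^|\.)youtube\.com$/.test(hostname)) return 'unsupported';
    if (pathname === '/results' && searchParams.has('search_query')) return 'search';
    if (pathname === '/watch' && searchParams.has('v')) return 'video';
    if (/^\/(channel|c|user)\//.test(pathname)) return 'channel';
    return 'unknown';
}

console.log(classifyStartUrl('https://www.youtube.com/results?search_query=tristan+eckerson'));
```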

Collect total number of videos for a channel and retry if Google serves fewer videos

You can see the number of videos if you search for the channel, but I cannot see it on the channel detail page. So we could add a request for that count and retry in case the scroll doesn't work. It is an additional request, so it will make the runtime a little longer; we can make it an opt-in feature at the start.
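The retry idea could be sketched like this. `scrapeVideos` is a hypothetical async function that performs one scrolling pass and returns the videos it found; how the expected count is obtained is left out.

```javascript
// Repeat the scrape until the expected number of videos is reached or attempts run out.
// Keeps the largest result seen so far, so a worse retry never loses data.
async function scrapeWithRetry(scrapeVideos, expectedCount, maxAttempts = 3) {
    let best = [];
    for (let attempt = 1; attempt <= maxAttempts; attempt++) {
        const videos = await scrapeVideos(attempt);
        if (videos.length > best.length) best = videos;
        if (best.length >= expectedCount) break;
    }
    return best;
}
```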

Multiple requests per video - ERROR PuppeteerCrawler handleRequestFunction failed

For some reason when I try to scrape the videos of a channel, multiple requests are done per video due to the error mentioned in the title. Even though the data is collected on the first request, the error makes the request repeat until it has failed too many times. Unfortunately, this has an impact on time and cost so I would greatly appreciate some feedback on whether this is a known problem or I am doing something wrong.

My input:

{
  "extendOutputFunction": "async ({ data, item, page, request, customData }) => {\n  return item; \n}",
  "extendScraperFunction": "async ({ page, request, requestQueue, customData, Apify, extendOutputFunction }) => {\n \n}",
  "handlePageTimeoutSecs": 3600,
  "proxyConfiguration": {
    "useApifyProxy": true
  },
  "startUrls": [
    {
      "url": "https://www.youtube.com/channel/UCCgVtpDnUeUgOjqKVB3bE_A"
    }
  ],
  "subtitlesLanguage": "en",
  "customData": {},
  "maxComments": 0
}

Getting-viewCount-failed - Raw error: waiting for XPath `//yt-view-count-renderer/span[1]` failed: timeout 120000ms exceeded
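If the view count is the only field that fails, one option is to treat it as optional so that the request is not retried after the rest of the data has been collected. A minimal sketch; `extractOptional` is a hypothetical wrapper, not something the actor provides.

```javascript
// Run one extraction step and fall back to a default value instead of failing the request.
async function extractOptional(label, extractor, fallback = null) {
    try {
        return await extractor();
    } catch (err) {
        // Log and continue: the failure of one optional field should not trigger a retry.
        console.warn(`${label} failed: ${err.message}`);
        return fallback;
    }
}

// Usage inside a page handler (getViewCount is a placeholder for the real extraction step):
// const viewCount = await extractOptional('Getting-viewCount', () => getViewCount(page));
```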

Implement snapshotter to see where it gets stuck

Date is sometimes "Invalid date"

Update SDK

Each request fails with timeout error

Looks like they changed something on the site. Selector is there, but each request fails with:
TimeoutError: waiting for XPath \"//ytd-video-primary-info-renderer/div/h1/yt-formatted-string\" failed: timeout 30000ms exceeded

await is only valid in async function

Your example "const run = await Apify.call('bernardo/youtube-scraper', input);"
produces the error "await is only valid in async function"
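The error appears because `await` is used at the top level of a CommonJS script; the call has to sit inside an `async` function. A sketch of the wrapping, where `callActor` stands in for `Apify.call` so the snippet runs without the SDK installed:

```javascript
// `callActor` is a stand-in for Apify.call; pass the real function when using the SDK.
async function runScraper(callActor, input) {
    // `await` is valid here because it is inside an async function.
    const run = await callActor('bernardo/youtube-scraper', input);
    return run;
}

// Usage with the SDK (assumes the `apify` package is installed):
// const Apify = require('apify');
// Apify.main(async () => {
//     const run = await runScraper((id, input) => Apify.call(id, input), { searchKeywords: 'makeup' });
//     console.log(run);
// });
```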

"https://www.youtube.com/results?search_query=magic+arena" doesn't load videos

https://my.apify.com/view/runs/3lmUozrpL48Hv09Vo

Add option to scrape channel detail info

We can do either dropdown or checkboxes.

Add language the subtitles are in

For example from this video where the subs are in Czech: https://www.youtube.com/watch?v=c2eJql_OnHw

Add option to scrape comments

This might be harder so not high priority

Too few results

A general search input like this returned only 29 results and some errors:

{
  "searchKeywords": "makeup ",
  "maxResults": 999,
  "postsFromDate": "6 month ago",
  "startUrl": "https://www.youtube.com/",
  "proxyConfiguration": {
    "useApifyProxy": true
  },
  "verboseLog": true
}

How to know which video belongs to which keyword?

If I pass multiple keywords as input, how can I know which video belongs to which keyword?
The returned result is just a list of videos without the keyword.
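One possible approach is to carry the keyword along with each request and copy it into the output item. The sketch below builds one search request per keyword with the keyword stored in `userData`; it assumes the keyword is also passed on when video detail pages are enqueued, which is not confirmed here. Both function names are hypothetical.

```javascript
// Build one search request per keyword and remember the keyword in userData.
function buildSearchRequests(keywords) {
    return keywords.map((keyword) => {
        const url = new URL('https://www.youtube.com/results');
        url.searchParams.set('search_query', keyword.trim());
        return { url: url.toString(), userData: { keyword: keyword.trim() } };
    });
}

// Copy the keyword from the request into the scraped item.
function tagItem(item, request) {
    return { ...item, keyword: request.userData.keyword };
}
```

A function like `tagItem` could be called from `extendOutputFunction`, which receives both `item` and `request`.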

How to go to the next page?

Hi there, I managed to get the actor working and returning 50 results. However, the channel we are scraping has many more than 50 videos. How can we get the actor to step through ALL pages so that it scrapes all videos of the channel, not just the first 50?
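YouTube channel pages load more videos as you scroll rather than through numbered pages, so "next page" in practice means scrolling until nothing new appears. A sketch of that loop, assuming `page` is a Puppeteer page; `itemSelector` is a placeholder for whatever selector matches one video tile.

```javascript
// Scroll until `maxResults` items are loaded or the count stops growing.
// Returns the number of items found.
async function scrollUntilLoaded(page, itemSelector, maxResults, { maxIdleRounds = 3, delayMs = 1000 } = {}) {
    let previousCount = 0;
    let idleRounds = 0;
    while (idleRounds < maxIdleRounds) {
        const count = await page.$$eval(itemSelector, (elements) => elements.length);
        if (count >= maxResults) return count;
        // Reset the idle counter whenever new items appeared since the last round.
        idleRounds = count > previousCount ? 0 : idleRounds + 1;
        previousCount = count;
        await page.evaluate(() => window.scrollBy(0, window.innerHeight * 4));
        await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
    return previousCount;
}
```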

stealth mode failing

It seems all attempts to scrape are failing.

Here are the errors:

"ERROR Stealth: StealthError: Failed to apply stealth trick reason: Refused to evaluate a string as JavaScript because 'unsafe-eval' is not an allowed source of script in the following Content Security Policy directive: "script-src 'report-sample' 'nonce-7PyoQsK7StgoaW6nndvU9g' 'unsafe-inline'".
2021-03-12T20:49:52.869Z ",

2021-03-12T20:50:03.949Z ERROR Request https://www.youtube.com/watch?v=BDZ6ujYN610 failed too many times

Option for simplified output from scrolling that doesn't go to video details

Was suggested by a customer

The time filter works poorly

With the time filter set, the actor works worse and often doesn't return any results at all,
especially with the combination of limited results and a limited time frame.
But this may be a different issue altogether.
https://my.apify.com/view/runs/dZ0Y1kj72mkoJ3Gwq

Ability to search only channels, videos or both

Old videos are not scraped

{
  "maxResults": 999999,
  "postsFromDate": "20 years",
  "verboseLog": false,
  "startUrls": [
    {
      "url": "https://www.youtube.com/user/dysonteam",
      "method": "GET"
    }
  ],
  "extendOutputFunction": "async ({ data, item, page, request, customData }) => {\n  return item; \n  \"title\"; \"likes\"; \"dislikes\"; \"url\"; \"upload date\"\n}",
  "extendScraperFunction": "async ({ page, request, requestQueue, customData, Apify, extendOutputFunction }) => {\n \n}",
  "proxyConfiguration": {
    "useApifyProxy": true
  },
  "customData": {}
}

It did not scrape 7-year-old videos from the channel.

See image here: https://ibb.co/dLYcJFv

Can you guys code this for us in the script? We can pay for your work.
Thank you!
