A Kotlin/JVM library and CLI tool for scraping and downloading posts, attachments, and other metadata from more than 10 sources, without any authorization or full page rendering. Based on coroutines and JSoup.
The currently implemented sources are listed below.
Unfortunately, each website is subject to change without notice, so the tool may stop working correctly at any time. If that happens, please let me know via an issue or a message.
The CLI tool allows you to:
- download media from almost all supported sources (with the --media-only flag)
- scrape posts' meta information
Requirements:
- Java 1.8+
- Maven (optional)
Build the tool:
./mvnw clean package -DskipTests=true
Usage:
./skraper --help
usage: [-h] PROVIDER PATH [-n LIMIT] [-t TYPE] [-o OUTPUT] [-m]
[--parallel-downloads PARALLEL_DOWNLOADS]
optional arguments:
-h, --help show this help message and exit
-n LIMIT, --limit LIMIT posts limit (50 by default)
-t TYPE, --type TYPE output type, options: [log, csv, json, xml, yaml]
-o OUTPUT, --output OUTPUT output path
-m, --media-only scrape media only
--parallel-downloads PARALLEL_DOWNLOADS amount of parallel downloads for media items if
enabled flag --media-only (4 by default)
positional arguments:
PROVIDER skraper provider, options: [facebook, instagram,
twitter, youtube, twitch, reddit, ninegag, pinterest,
flickr, tumblr, ifunny, vk, pikabu]
PATH path to user/community/channel/topic/trend
Examples:
./skraper ninegag /hot
./skraper reddit /r/memes -n 5 -t csv -o ./reddit/posts
./skraper youtube /user/JetBrainsTV/videos --media-only -n 2
Maven:
<repositories>
<repository>
<id>jitpack.io</id>
<url>https://jitpack.io</url>
</repository>
</repositories>
<dependencies>
<dependency>
<groupId>com.github.sokomishalov.skraper</groupId>
<artifactId>skrapers</artifactId>
<version>0.5.1</version>
</dependency>
</dependencies>
Gradle kotlin dsl:
repositories {
maven { url = uri("https://jitpack.io") }
}
dependencies {
implementation("com.github.sokomishalov.skraper:skrapers:0.5.1")
}
You can take a look at library usage in this Android sample app or Telegram bot.
As mentioned before, the provider implementation list is:
- FacebookSkraper
- InstagramSkraper
- TwitterSkraper
- YoutubeSkraper
- TwitchSkraper
- RedditSkraper
- NinegagSkraper
- PinterestSkraper
- FlickrSkraper
- TumblrSkraper
- IFunnySkraper
- VkSkraper
- PikabuSkraper
After that, usage is as simple as:
val skraper = InstagramSkraper(client = ReactorNettySkraperClient())
Important: it is highly recommended not to use DefaultBlockingSkraperClient. There are more efficient, non-blocking, and resource-friendly implementations of SkraperClient; to use them, just add the required dependencies to the classpath.
Current http-client implementation list:
- DefaultBlockingSkraperClient - simple java.net.* blocking API implementation
- OkHttp3SkraperClient - okhttp3 implementation
- ReactorNettySkraperClient - reactor-netty implementation
- SpringReactiveSkraperClient - spring-webflux client implementation
- KtorSkraperClient - ktor-client-jvm implementation
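Every skraper receives its HTTP client by constructor injection, as in the ReactorNettySkraperClient example above. The sketch below illustrates the pluggable-client idea with a deliberately simplified, hypothetical interface (SimpleSkraperClient, BlockingClient, and pageTitle are inventions for illustration, not the library's real SkraperClient API):

```kotlin
// Hypothetical, simplified sketch of the pluggable-client design:
// scraping logic delegates raw HTTP fetching to a client interface,
// so swapping transport implementations never touches the scraper itself.
interface SimpleSkraperClient {
    fun fetch(url: String): ByteArray?
}

// A blocking java.net.* implementation, in the spirit of DefaultBlockingSkraperClient.
class BlockingClient : SimpleSkraperClient {
    override fun fetch(url: String): ByteArray? = runCatching {
        java.net.URL(url).openStream().use { it.readBytes() }
    }.getOrNull()
}

// A stub implementation: because callers depend only on the interface,
// tests can inject canned responses without any network access.
class StubClient(private val body: String) : SimpleSkraperClient {
    override fun fetch(url: String): ByteArray = body.toByteArray()
}

// Example consumer that works with any client implementation.
fun pageTitle(client: SimpleSkraperClient, url: String): String? =
    client.fetch(url)
        ?.toString(Charsets.UTF_8)
        ?.let { Regex("<title>(.*?)</title>").find(it)?.groupValues?.get(1) }
```

Because the consumer depends only on the interface, blocking and non-blocking transports can be swapped freely, which is the same design that lets the real skrapers accept any SkraperClient.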
Each scraper is a class which implements Skraper interface:
interface Skraper {
val baseUrl: URLString
val client: SkraperClient get() = DefaultBlockingSkraperClient
suspend fun getProviderInfo(): ProviderInfo?
suspend fun getPageInfo(path: String): PageInfo?
suspend fun getPosts(path: String, limit: Int = DEFAULT_POSTS_LIMIT): List<Post>
suspend fun resolve(media: Media): Media
}
Also, there are some provider-specific Kotlin extensions for the implementations. You can find them in the provider implementation packages.
To scrape the latest posts for a specific user, channel, or trend, use a skraper like this:
fun main() = runBlocking {
val skraper = FacebookSkraper()
val posts = skraper.getUserPosts(username = "memes", limit = 2) // extension for getPosts()
println(JsonMapper().writerWithDefaultPrettyPrinter().writeValueAsString(posts))
}
The returned data structure is the same across providers. Example output:
[
{
"id" : "5029851093699104",
"text" : "gotta love em!",
"publishedAt" : 1580744400000,
"rating" : 79,
"commentsCount" : 3,
"media" : [ {
"url" : "https://facebook.com/memes/posts/5029851093699104?__xts__%5B0%5D=68.ARA2yRI2YnlXQRKX7Pdphh8ztgvnP11aYE_bZFPNmqLpJZLhwJaG24gDPUTiKDLv-J_E09u2vLjCXalpmEuGSmVR0BkVtcng_i6QV8x5e-aZUv0Mkn1wwKLlhp5NNH6zQWKlqDqRjZrwvcKeUi0unzzulRCHRvDIrbz2leM6PLescFySwMYbMmKFc7ctqaC_F7nJ09Ya0lz9Pqaq_Rh6UsNKom6fqdgHAuoHV894a3QRuyY0BC6fQuXZLOLbRIfEVK3cF9Z5UQiXUYruCySF-WpQEV0k72x6DIjT6B3iovYFnBGHaji9VAx2PByZ-MDs33D1Hz96Mk-O1Pj7zBwO6FvXGhkUJgepiwUOVd0q-pV83rS5EhjtPFDylNoNO2xkDUSIi483p49vumVPWtmab8LX1V6w2anf55kh6pedCXcH3D8rBjz8DaTBnv995u9kk5im-1-HdAGQHyKrCZpaA0QyC-I4oGsCoIJGck3RO8u_SoHcfe2tKjTgPe6j9p1D&__tn__=-R",
"aspectRatio" : 0.864,
"duration" : 10860.000000000
} ]
}, {
"id" : "4990218157662398",
"text" : "Interesting",
"publishedAt" : 1580742000000,
"rating" : 3092,
"commentsCount" : 514,
"media" : [ {
"url" : "https://scontent.fhrk1-1.fna.fbcdn.net/v/t1.0-0/p526x296/52333452_10157743612509879_529328953723191296_n.png?_nc_cat=1&_nc_ohc=oNMb8_mCbD8AX-w9zeY&_nc_ht=scontent.fhrk1-1.fna&oh=ca8a719518ecfb1a24f871282b860124&oe=5E910D0C",
"aspectRatio" : 0.8960573476702509
} ]
}
]
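Since the structure is identical across providers, downstream processing can stay provider-agnostic. As a minimal sketch (SimplePost and topRated are simplified, hypothetical stand-ins for the library's real Post model, using only the fields visible in the output above):

```kotlin
// Simplified stand-in for the library's Post model, covering only the
// fields shown in the example output above.
data class SimplePost(
    val id: String,
    val text: String?,
    val rating: Int?,
    val commentsCount: Int?
)

// Example post-processing that works the same whatever the source:
// pick the n highest-rated posts, treating a missing rating as 0.
fun topRated(posts: List<SimplePost>, n: Int): List<SimplePost> =
    posts.sortedByDescending { it.rating ?: 0 }.take(n)
```

For instance, applied to the two example posts above, topRated would return the "Interesting" post first, since its rating of 3092 exceeds 79.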
You can see the full model structure for posts and other entities here.
It is also possible to scrape user/channel/trend info:
fun main() = runBlocking {
val skraper = TwitterSkraper()
val pageInfo = skraper.getUserInfo(username = "memes") // extension for `getPageInfo()`
println(JsonMapper().writerWithDefaultPrettyPrinter().writeValueAsString(pageInfo))
}
Output:
{
"nick" : "memes",
"name" : "Memes.com",
"description" : "http://memes.com is your number one website for the funniest content on the web. You will find funny pictures, funny memes and much more.",
"postsCount" : 10848,
"followersCount" : 154718,
"avatarsMap" : {
"SMALL" : {
"url" : "https://pbs.twimg.com/profile_images/824808708332941313/mJ4xM6PH_normal.jpg"
},
"MEDIUM" : {
"url" : "https://pbs.twimg.com/profile_images/824808708332941313/mJ4xM6PH_normal.jpg"
},
"LARGE" : {
"url" : "https://pbs.twimg.com/profile_images/824808708332941313/mJ4xM6PH_normal.jpg"
}
},
"coversMap" : {
"SMALL" : {
"url" : "https://abs.twimg.com/images/themes/theme1/bg.png"
},
"MEDIUM" : {
"url" : "https://abs.twimg.com/images/themes/theme1/bg.png"
},
"LARGE" : {
"url" : "https://abs.twimg.com/images/themes/theme1/bg.png"
}
}
}
Sometimes you need a direct media link:
fun main() = runBlocking {
val skraper = InstagramSkraper()
val info = skraper.resolve(Video(url = "https://www.instagram.com/p/B-flad2F5o7/"))
println(JsonMapper().writerWithDefaultPrettyPrinter().writeValueAsString(info))
}
Output:
{
"url" : "https://scontent-amt2-1.cdninstagram.com/v/t50.2886-16/91508191_213297693225472_2759719910220905597_n.mp4?_nc_ht=scontent-amt2-1.cdninstagram.com&_nc_cat=104&_nc_ohc=27bC52qar_oAX-7J2Zh&oe=5EC0BC52&oh=0aafee2860c540452b76e7b8e336147d",
"aspectRatio" : 0.8010012515644556,
"thumbnail" : {
"url" : "https://scontent-amt2-1.cdninstagram.com/v/t51.2885-15/e35/91435498_533808773845524_5302421141680378393_n.jpg?_nc_ht=scontent-amt2-1.cdninstagram.com&_nc_cat=100&_nc_ohc=8gPAcByc6YAAX_kDBWm&oh=5edf6b9d90d606f9c0e055b7dbcbfa45&oe=5EC0DDE8",
"aspectRatio" : 0.8010012515644556
}
}
There is a "static" method that allows downloading any media from all known implemented sources:
fun main() = runBlocking {
val tmpDir = Files.createTempDirectory("skraper").toFile()
val testVideo = Skraper.download(
media = Video("https://youtu.be/fjUO7xaUHJQ"),
destDir = tmpDir,
filename = "Gandalf"
)
val testImage = Skraper.download(
media = Image("https://www.pinterest.ru/pin/89509111320495523/"),
destDir = tmpDir,
filename = "Do_no_harm"
)
println(testVideo)
println(testImage)
}
Output:
/var/folders/sf/hm2h5chx5fl4f70bj77xccsc0000gp/T/skraper8377953374796527777/Gandalf.mp4
/var/folders/sf/hm2h5chx5fl4f70bj77xccsc0000gp/T/skraper8377953374796527777/Do_no_harm.jpg
Provider info can be scraped as well:
fun main() = runBlocking {
val skraper = InstagramSkraper()
val info = skraper.getProviderInfo()
println(JsonMapper().writerWithDefaultPrettyPrinter().writeValueAsString(info))
}
Output:
{
"name" : "Instagram",
"logoMap" : {
"SMALL" : {
"url" : "https://instagram.com/favicon.ico"
},
"MEDIUM" : {
"url" : "https://instagram.com/favicon.ico"
},
"LARGE" : {
"url" : "https://instagram.com/favicon.ico"
}
}
}
To use the bot, follow the link. You can also take a look at the bot's main logic code.