
Reader

Your LLMs deserve better input.

Reader converts any URL to an LLM-friendly input with a simple prefix https://r.jina.ai/. Get improved output for your agent and RAG systems at no cost.

Feel free to use Reader API in production. It is free, stable and scalable. We are maintaining it actively as one of the core products of Jina AI.


Updates

  • 2024-04-24: You now have more fine-grained control over Reader API using headers, e.g. forwarding cookies, using HTTP proxy.
  • 2024-04-15: Reader now supports image reading! It captions all images at the specified URL and adds Image [idx]: [caption] as an alt tag if they initially lack one. This enables downstream LLMs to use the images in reasoning, summarization, etc.

Usage

Simply prepend https://r.jina.ai/ to any URL. For example, to convert the URL https://en.wikipedia.org/wiki/Artificial_intelligence to an LLM-friendly input, use the following URL:

https://r.jina.ai/https://en.wikipedia.org/wiki/Artificial_intelligence

All images on that page that lack an alt tag are auto-captioned by a VLM (vision language model) and formatted as ![Image [idx]: [VLM_caption]](img_URL). This should give your downstream text-only LLM just enough hints to include those images in reasoning, selecting, and summarizing.
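The prefix scheme above can be wrapped in a few lines of code. Below is a minimal sketch in Python using only the standard library; the helper names (to_reader_url, fetch_llm_friendly) are illustrative, not part of any official client:

```python
from urllib.request import Request, urlopen

READER_PREFIX = "https://r.jina.ai/"

def to_reader_url(url: str) -> str:
    # Prepend the Reader prefix; the target URL is passed through verbatim.
    return READER_PREFIX + url

def fetch_llm_friendly(url: str, timeout: float = 60.0) -> str:
    # Fetch the LLM-friendly (markdown) rendering of the page.
    req = Request(to_reader_url(url), headers={"User-Agent": "curl/8.0"})
    with urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8")

# Performs a live network request, so it is left commented out:
# text = fetch_llm_friendly("https://en.wikipedia.org/wiki/Artificial_intelligence")
```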

Streaming mode

Streaming mode is useful when the standard mode gives you an incomplete result, because in streaming mode the Reader waits a bit longer until the page is stably rendered. Use the Accept header to toggle streaming mode:

curl -H "Accept: text/event-stream" https://r.jina.ai/https://en.m.wikipedia.org/wiki/Main_Page

The data arrives as a stream; each subsequent chunk contains more complete information, and the last chunk should be the most complete, final result. Note that this is different from LLMs' text-generation streaming, where each chunk appends new tokens rather than replacing the previous content.

For example, compare the two curl commands below. You can see that the streaming one eventually gives you the complete content, whereas standard mode does not. This is because content loading on this particular site is triggered by JavaScript after the page has fully loaded, so standard mode returns the page "too soon".

curl -H 'x-no-cache: true' https://r.jina.ai/https://access.redhat.com/security/cve/CVE-2023-45853
curl -H "Accept: text/event-stream" -H 'x-no-cache: true' https://r.jina.ai/https://access.redhat.com/security/cve/CVE-2023-45853

Note: -H 'x-no-cache: true' is used only for demonstration purposes to bypass the cache.

Streaming mode is also useful if your downstream LLM/agent system requires immediate content delivery, or needs to process data in chunks to interleave I/O and LLM processing time. This allows for quicker access and more efficient data handling:

Reader API:  streamContent1 ----> streamContent2 ----> streamContent3 ---> ... 
                          |                    |                     |
                          v                    |                     |
Your LLM:                 LLM(streamContent1)  |                     |
                                               v                     |
                                               LLM(streamContent2)   |
                                                                     v
                                                                     LLM(streamContent3)

Note that in terms of completeness: ... > streamContent3 > streamContent2 > streamContent1, each subsequent chunk contains more complete information.
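Since each data: event is a progressively more complete rendering, a consumer that only needs the final result can simply keep the last payload seen. A minimal sketch (the SSE lines here are simulated; real payload framing may differ):

```python
def last_complete_chunk(sse_lines):
    # Each subsequent "data:" event is a more complete rendering of the
    # page, so the final result is simply the last payload seen.
    last = None
    for line in sse_lines:
        if line.startswith("data:"):
            last = line[len("data:"):].lstrip()
    return last

# Simulated stream; in practice these lines come from the response to
# a request sent with "Accept: text/event-stream".
events = [
    "data: # Main Page",
    "data: # Main Page ... partial body",
    "data: # Main Page ... full body after JS rendering",
]
print(last_complete_chunk(events))
```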

Using request headers

As you have already seen above, one can control the behavior of the Reader API using request headers. Here is a complete list of supported headers.

  • You can ask the Reader API to forward cookie settings via the x-set-cookie header.
    • Note that requests with cookies will not be cached.
  • You can bypass readability filtering via the x-respond-with header, specifically:
    • x-respond-with: markdown returns markdown without going through readability
    • x-respond-with: html returns documentElement.outerHTML
    • x-respond-with: text returns document.body.innerText
    • x-respond-with: screenshot returns the URL of the webpage's screenshot
  • You can specify a proxy server via the x-proxy-url header.
  • You can bypass the cached page (lifetime 300s) via the x-no-cache header.
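These headers compose naturally, so a small helper can assemble them. This is a sketch based only on the header names listed above; the function itself is hypothetical, not part of any official client:

```python
def reader_headers(respond_with=None, no_cache=False, set_cookie=None, proxy_url=None):
    # Assemble optional Reader API request headers from the list above.
    headers = {}
    if respond_with is not None:
        # One of: markdown, html, text, screenshot
        headers["x-respond-with"] = respond_with
    if no_cache:
        # Bypass the cached page (cache lifetime is 300 s)
        headers["x-no-cache"] = "true"
    if set_cookie is not None:
        # Forward cookies; note that such requests are not cached
        headers["x-set-cookie"] = set_cookie
    if proxy_url is not None:
        headers["x-proxy-url"] = proxy_url
    return headers

# e.g. raw text without readability filtering, bypassing the cache:
# requests.get("https://r.jina.ai/https://example.com",
#              headers=reader_headers(respond_with="text", no_cache=True))
```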

JSON mode (super early beta)

This is still very early, and the result is not yet a truly "useful" JSON: it contains only three fields, url, title and content. Nonetheless, you can use the Accept header to control the output format:

curl -H "Accept: application/json" https://r.jina.ai/https://en.m.wikipedia.org/wiki/Main_Page
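Given the three documented fields, unpacking the response is straightforward. A sketch that assumes a flat object as described above; the sample payload is fabricated to show the shape:

```python
import json

def parse_reader_json(payload: str):
    # JSON mode currently returns only these three fields.
    doc = json.loads(payload)
    return doc["url"], doc["title"], doc["content"]

sample = '{"url": "https://en.m.wikipedia.org/wiki/Main_Page", "title": "Wikipedia", "content": "..."}'
url, title, content = parse_reader_json(sample)
print(title)  # Wikipedia
```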

Install

You will need the following tools to run the project:

  • Node v18 (The build fails for Node version >18)
  • Firebase CLI (npm install -g firebase-tools)

For the backend, clone the repository, go to the backend/functions directory, and install the npm dependencies:

git clone git@github.com:jina-ai/reader.git
cd reader/backend/functions
npm install

What is thinapps-shared submodule?

You might notice a reference to the thinapps-shared submodule, an internal package we use to share code across our products. It is not open-sourced and isn't integral to Reader's functionality; it mainly helps with decorators, logging, secrets management, etc. Feel free to ignore it for now.

That said, this is the single codebase behind https://r.jina.ai, so every time we commit here, we deploy the new version to https://r.jina.ai.

Having trouble on some websites?

Please raise an issue with the URL you are having trouble with. We will look into it and try to fix it.

License

Reader is backed by Jina AI and licensed under Apache-2.0.

reader's People

Contributors

nomagick, hanxiao, chxru, dependabot[bot]


reader's Issues

Timeouts for both Twitter and Reddit happening frequently

Around 50% of the time, this tweet and the few others I tried show a timeout error: https://r.jina.ai/https://twitter.com/EntreEden/status/1780771887624417315.

The response I get is this:

{"name":"TimeoutError",
"domainThrown":true,
"message":"Timed out after 10000 ms while waiting for the WS endpoint URL to appear in stdout!"}

Same with a reddit post I tried: https://www.reddit.com/r/cscareerquestions/comments/ufdyhd/after_4_years_of_working_im_slowly_learning_how/

I am in Canada, all on my home wifi and computer.

If you need help with scaling, I have a $5 Hetzner vps that I can pitch in. If I get guidance, I'm willing to help out with the issue as well.

Published time in JSON mode

Please add "Published Time" to JSON mode. We are investigating how to use the published time to detect updated content downstream and replace vectors when the published time has changed.

Add parameters to request full text (i.e. don't parse with @mozilla/readability)

Great tool, and I would love to make more use of it. However, I scrape a lot of website home pages, and those pages end up having far too much information removed by readability.js. I'd love a parameter that lets me use html-to-text instead. The code to add that is very simple:

import { convert } from 'html-to-text';

const options = {
  wordwrap: false,
  selectors: [
    // Keep link text but drop hrefs and anchor URLs; skip images entirely.
    { selector: 'a', options: { hideLinkHrefIfSameAsText: true, noAnchorUrl: true, ignoreHref: true, linkBrackets: false } },
    { selector: 'img', format: 'skip' },
    { selector: 'p', options: { leadingLineBreaks: 1, trailingLineBreaks: 1 } },
    { selector: 'pre', options: { leadingLineBreaks: 1, trailingLineBreaks: 1 } },
  ],
};

const text = convert(html, options);

That will get you clean, well-structured text from the HTML, though much longer for home pages like https://www.tanium.com. For that home page, the current Reader produces text with 327 tokens (using the GPT-4 tiktoken tokenizer), while html-to-text gives back 2554 tokens. That's far too much information loss for my use case, especially given that much of that information is critical to understanding what the business does.

Finally, while it wouldn't work in JS: given that you're willing to connect a vision model to interpret images, perhaps you'd consider implementing Trafilatura for articles and similar pages, as it slightly outperforms readability.js based on a 2023 analysis from this paper: https://downloads.webis.de/publications/papers/bevendorff_2023b.pdf


While adding a parameter like this conflicts a bit with the convenience of just dropping the URL at the end of https://r.jina.ai/, I think the added information is really critical. It's the difference between me being able to use this for my use case (which I don't think is an extreme edge case) and not being able to use it at all.

URL loses information in the conversion

Converting the URL https://www.eventbrite.com/e/brian-mclaren-wisdom-and-courage-for-a-world-falling-apart-tickets-823891721317 with Jina Reader loses several headings (Date and Time, Location, Refund Policy, etc.). The rendered result is:

Title: Brian McLaren | “Wisdom and Courage for a World Falling Apart"

URL Source: https://www.eventbrite.com/e/brian-mclaren-wisdom-and-courage-for-a-world-falling-apart-tickets-823891721317

Markdown Content:
Brian McLaren, author, speaker, activist, and public theologian notes that the challenge of living well and maintaining resilience in turbulent times requires new ways of thinking, becoming, and belonging. Facing nations, ecosystems, economies, religions, and other institutions in disarray, we are called to a spiritual transformation in our own lives that will express itself in transformation in our world.

Childcare for ages 2-12 will be provided by reservation through April 15.

Error fetching webpage content via code but successful via browser

When attempting to fetch webpage content via code, I encounter an error consistently. However, I've noticed that the same URL can be successfully accessed via a browser. The error message I'm receiving is as follows:

Error data from response: {
data: null,
path: 'url',
code: 400,
name: 'ParamValidationError',
status: 40001,
message: 'TypeError: Invalid URL',
readableMessage: 'ParamValidationError(url): TypeError: Invalid URL'
}

URL:https://www.trustpilot.com/review/cortexi.io

I've tried troubleshooting this issue, but so far, I haven't been able to pinpoint the exact cause. Any insights or suggestions on how to resolve this would be greatly appreciated. Thank you!

Error when accessing a URL

Accessing the URL produced this error: {"data":null,"cause":{},"code":422,"name":"AssertionFailureError","status":42206,"message":"Failed to goto https://ddjjjc.gov.cn/pages/news.asp?id=4173: Error: net::ERR_CERT_COMMON_NAME_INVALID at https://ddjjjc.gov.cn/pages/news.asp?id=4173","readableMessage":"AssertionFailureError: Failed to goto https://ddjjjc.gov.cn/pages/news.asp?id=4173: Error: net::ERR_CERT_COMMON_NAME_INVALID at https://ddjjjc.gov.cn/pages/news.asp?id=4173"}
I'm making the request with requests; here is the request code:

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7'
}
url = "https://r.jina.ai/https://www.ddjjjc.gov.cn/pages/news.asp?id=4173"
response = requests.get(url, headers=headers)
print(response.text)

Pile in reader format

Hi! This looks interesting. I wonder if you could convert the Pile dataset, fetched from its respective URLs, into the Jina Reader format to experiment with LLM pre-training?

Issue: Jina reader fails to parse URLs containing Chinese characters

Description:

We have encountered an issue where the Jina reader fails to parse URLs that contain Chinese characters. This issue is causing our application to throw errors and prevents us from properly extracting content from certain websites.

Steps to Reproduce:

  1. Make a request to the Jina reader API with a URL containing Chinese characters, such as https://zh.wikipedia.org/wiki/%E5%91%A8%E6%9D%B0%E5%80%AB.
  2. Observe that the Jina reader fails to parse the URL and returns an error.

Expected Behavior:

We expect the Jina reader to properly handle URLs containing Chinese characters and successfully parse the corresponding web pages. The reader should be able to decode the URL, retrieve the web page content, and return it as expected.

Actual Behavior:

When a URL containing Chinese characters is passed to the Jina reader, it fails to parse the URL and throws an error. The error message typically indicates that the reader is unable to read properties of undefined, specifically the 'parentNode' property.

Example Error Message:

Failed to fetch https://zh.wikipedia.org/wiki/%E5%91%A8%E6%9D%B0%E5%80%AB: {"code":500,"status":50000,"message":"Cannot read properties of undefined (reading 'parentNode')","name":"TypeError"}

Impact:

This issue prevents our application from properly extracting content from websites that have URLs containing Chinese characters. It limits the functionality of our application and affects the user experience when dealing with such websites.

Potential Causes:

  • The Jina reader may not be properly decoding the URL before making the request, leading to an invalid URL being passed to the underlying parsing logic.
  • The parsing logic within the Jina reader may not be handling URLs with Chinese characters correctly, resulting in the "Cannot read properties of undefined" error.

Workaround:

As a temporary workaround, we have implemented a filtering mechanism in our application to skip URLs that contain Chinese characters. However, this is not an ideal solution as it limits the functionality and coverage of our application.

Request:

We kindly request the Jina team to investigate this issue and provide a fix that allows the Jina reader to properly handle URLs containing Chinese characters. It would be greatly appreciated if you could provide an update on the progress and an estimated timeline for the resolution.

Additional Information:

  • We are currently using the official Jina reader API, not the open-source service.
  • We are in the process of setting up our own service, but we are unsure if the open-source service also has this issue.

Please let us know if you require any further information or if there are any specific details you need to investigate and resolve this issue.

Thank you for your attention to this matter. We look forward to your response and resolution.

Best regards,
Loki.W

Data Retrieval After Initial Page Load for Timely Updates

When using the API, there's a limitation: it only captures data loaded during the initial page load. However, certain data is loaded dynamically after a brief delay, making it inaccessible through the current implementation. Is there any option available for this?

How to start?

npm WARN EBADENGINE Unsupported engine {
npm WARN EBADENGINE package: undefined,
npm WARN EBADENGINE required: { node: '20' },
npm WARN EBADENGINE current: { node: 'v21.7.3', npm: '10.5.0' }
npm WARN EBADENGINE }

up to date, audited 1005 packages in 2s

146 packages are looking for funding
run npm fund for details

4 critical severity vulnerabilities

To address all issues (including breaking changes), run:
npm audit fix --force

Run npm audit for details.

Isn't this incomplete as an open-source project?

There is no local deployment documentation. Nothing is complete, so why does the README even include install commands? Do you actually want to open-source this or not? If you do, you shouldn't release something this unfinished.

Option to toggle the usage of Readability?

As I was experimenting with your API I noticed that it was a bit "too aggressive" on some pages, removing sections that I would want to keep in the final Markdown.

So I looked around both in the project code, as well as setting up an isolated test that only used turndown directly, but finally I found that the "culprit" was @mozilla/readability.

While this seems to do a great job of removing "irrelevant" content before it's converted to Markdown in most cases, I can definitely see how it might be a bit too greedy/aggressive in its cleanup strategy (i.e. not only in my specific case). Since I couldn't find any combination of Readability config options that kept the specific "hero" section on the page I was testing, I'd like to suggest adding the ability to simply enable/disable Readability completely.

As I'm not using the project by hosting the actual code locally or on my own server, but rather just using your public API, the ideal scenario would therefore be if this toggle could even exist as e.g. an extra parameter or alternative API endpoint.

Of course turndown should still be configured to remove things like <script> and <style> when not using Readability (if you don't already explicitly do this), but other than that I really think this alternative parsing option could be a very valuable addition!

npm run build failed because shared files are not found

Error:

$ npm run build

> build
> tsc -p .

src/cloud-functions/crawler.ts:3:79 - error TS2307: Cannot find module '../shared' or its corresponding type declarations.

3 import { CloudHTTPv2, Ctx, Logger, OutputServerEventStream, RPCReflect } from '../shared';
                                                                                ~~~~~~~~~~~

src/db/crawled.ts:2:33 - error TS2307: Cannot find module '../shared/lib/firestore' or its corresponding type declarations.

2 import { FirestoreRecord } from '../shared/lib/firestore';
                                  ~~~~~~~~~~~~~~~~~~~~~~~~~

src/db/crawled.ts:9:21 - error TS4112: This member cannot have an 'override' modifier because its containing class 'Crawled' does not extend another class.

9     static override collectionName = 'crawled';
                      ~~~~~~~~~~~~~~

src/db/crawled.ts:11:14 - error TS4112: This member cannot have an 'override' modifier because its containing class 'Crawled' does not extend another class.

11     override _id!: string;
                ~~~

src/db/crawled.ts:36:21 - error TS4112: This member cannot have an 'override' modifier because its containing class 'Crawled' does not extend another class.

36     static override from(input: any) {
                       ~~~~

src/db/crawled.ts:46:14 - error TS4112: This member cannot have an 'override' modifier because its containing class 'Crawled' does not extend another class.

46     override degradeForFireStore() {
                ~~~~~~~~~~~~~~~~~~~

src/index.ts:6:50 - error TS2307: Cannot find module './shared' or its corresponding type declarations.

6 import { loadModulesDynamically, registry } from './shared';
                                                   ~~~~~~~~~~

src/services/puppeteer.ts:4:24 - error TS2307: Cannot find module '../shared/services/logger' or its corresponding type declarations.

4 import { Logger } from '../shared/services/logger';
                         ~~~~~~~~~~~~~~~~~~~~~~~~~~~

src/services/puppeteer.ts:178:43 - error TS2339: Property 'fromFirestoreQuery' does not exist on type 'typeof Crawled'.

178             const cached = (await Crawled.fromFirestoreQuery(Crawled.COLLECTION.where('urlPathDigest', '==', digest).orderBy('createdAt', 'desc').limit(1)))?.[0];
                                              ~~~~~~~~~~~~~~~~~~

src/services/puppeteer.ts:178:70 - error TS2339: Property 'COLLECTION' does not exist on type 'typeof Crawled'.

178             const cached = (await Crawled.fromFirestoreQuery(Crawled.COLLECTION.where('urlPathDigest', '==', digest).orderBy('createdAt', 'desc').limit(1)))?.[0];
                                                                         ~~~~~~~~~~

src/services/puppeteer.ts:232:25 - error TS2339: Property 'save' does not exist on type 'typeof Crawled'.

232                 Crawled.save(
                            ~~~~

src/services/puppeteer.ts:240:26 - error TS7006: Parameter 'err' implicitly has an 'any' type.

240                 ).catch((err) => {
                             ~~~


Found 12 errors in 4 files.

Errors  Files
     1  src/cloud-functions/crawler.ts:3
     5  src/db/crawled.ts:2
     1  src/index.ts:6
     5  src/services/puppeteer.ts:4

To reproduce, go to backend/functions and run npm run build from an account that doesn't have access to thinapps-shared/backend.

It looks like the thinapps-shared backend mostly does logging, monitoring, and caching, but at this point the project doesn't build without it.

Not pulling image links correctly

Testing Jina on our Norwegian websites, we see that images are not pulled correctly:

example URL: https://mikalsenutvikling.no/
Jina URL: https://r.jina.ai/https://mikalsenutvikling.no

First image on the website by Jina is given as ![Image 1: Daglig Leder - André mikalsen](data:image/svg+xml,%3Csvg%20xmlns='http://www.w3.org/2000/svg'%20viewBox='0%200%201200%201307'%3E%3C/svg%3E)

But in the HTML you can clearly see there is a src attribute with a .png image that should be the URL given in the Jina Reader version:

src="https://mikalsenutvikling.no/wp-content/uploads/2022/11/Andre-Mikalsen-optimalisert.png"

This has been the same on all sites we have tested.

DOMException 500 Error

@hanxiao Could you please give some advice for this issue?

`https://r.jina.ai/https://www.msn.com/en-us/news/technology/the-best-ai-search-engines-and-tools-you-can-use-to-search-the-web/ar-BB1kbzFL`

{"code":500,"status":50000,"message":"Failed to execute 'setAttribute' on 'Element': '}' is not a valid attribute name.","name":"DOMException"}

Respect robots.txt and identify your system

Recently, some AI companies have given website administrators the option of opting out of AI training by using configuration options in robots.txt.

While this project is for prompting and RAG rather than training, I still think you should provide an option for website users to prevent their websites from becoming ad-hoc databases for or components of AI systems. It seems like you have made your software default to evading detection by using puppeteer's stealth plugin; the user-agent configuration that would allow website owners to identify your project's bots is commented out.

I think this default is deceptive and irresponsible. You should make sure users of your project respect these preferences by incorporating them into the software's defaults. Web administrators may not be inclined to support the additional traffic generated by people using their websites as a component of AI systems.

support docker deployment

Could you please add support for Docker deployment to streamline setting up and running the project?
