builderio / gpt-crawler
Crawl a site to generate knowledge files to create your own custom GPT from a URL
Home Page: https://www.builder.io/blog/custom-gpt
License: ISC License
Some pages take more than 1000 ms to load, which is the default timeout here. For these occasions it can be useful to be able to configure the default waitForSelector timeout in config.js.
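For reference, a sketch of a config with that timeout raised; the other values mirror the README example, and the key name matches the waitForSelectorTimeout the request handler already reads:

import { Config } from "./src/config";

export const defaultConfig: Config = {
  url: "https://www.builder.io/c/docs/developers",
  match: "https://www.builder.io/c/docs/**",
  selector: ".docs-builder-container",
  maxPagesToCrawl: 50,
  outputFileName: "output.json",
  // Raise this for slow pages; the handler passes it to page.waitForSelector.
  waitForSelectorTimeout: 5000, // milliseconds
};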
Hello from the crawlee team!
Just a small suggestion, I was taking a peek at the code and saw you do this to create a data bundle at the end.
Lines 115 to 129 in 27b65d3
We recently added a new helper that does exactly the same:
https://crawlee.dev/api/basic-crawler/class/BasicCrawler#exportData
So you could replace the whole function with a simple crawler.exportData(config.outputFileName) call, and it will support both JSON and CSV automatically (based on the file extension).
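In main.ts that would presumably collapse to something like:

await crawler.run([config.url]);
// Replaces the hand-rolled bundling; JSON vs. CSV is inferred from the extension.
await crawler.exportData(config.outputFileName);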
output.json supports markdown links. This is super useful to point users to further information.
Try crawling this Notion site for example (selector .layout) and prompt "How do I make the best of NC12". The answer will instruct you to join the Telegram chat and FB group, but the links for those are lost.
The crawler should convert <a> links to Markdown links.
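One possible approach, sketched with the Playwright Page the handler already has (the helper name is mine): rewrite anchors in the DOM before text extraction, so the hrefs survive.

import { Page } from "playwright";

// Rewrite <a> elements as Markdown links in place, so extracting the page
// text afterwards keeps the URLs.
async function linksToMarkdown(page: Page) {
  await page.evaluate(() => {
    document.querySelectorAll("a[href]").forEach((a) => {
      const anchor = a as HTMLAnchorElement;
      anchor.textContent = `[${anchor.textContent}](${anchor.href})`;
    });
  });
}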
I can't seem to get the crawler to crawl every extension of a website; it sometimes misses a lot. The site does have a sitemap.xml with every link I would want, though. Is there any way to use that? If so, then how?
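Not built into the config as far as I can tell, but newer crawlee releases ship a Sitemap helper that could seed the request queue directly. A sketch, assuming that helper is available in the installed version:

import { Sitemap } from "crawlee";

// Seed the crawler from sitemap.xml instead of relying on link discovery.
const { urls } = await Sitemap.load("https://example.com/sitemap.xml");
await crawler.run(urls);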
This repo looks very interesting.
Does this repo handle dynamic, JS-rendered content?
I've noticed you can't currently use any regex when defining URLs, but is there some other way to leverage different subdomains or wildcard characters? For example, I want to crawl multiple subdomains that follow a similar structure: https://pco[NAME].zendesk.com/. I was thinking I could change the match field to accept a string array, but I also couldn't use regex to wildcard the [NAME] piece of the subdomain. Is there some other way to achieve this?
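A sketch of that idea as a config, treating match as an array of minimatch-style globs; both the array form and the subdomain wildcard are hypothetical here, not confirmed features:

import { Config } from "./src/config";

export const defaultConfig: Config = {
  url: "https://pcoexample.zendesk.com/",
  // Hypothetical: an array of globs, with * standing in for the [NAME] piece.
  match: ["https://pco*.zendesk.com/**"],
  maxPagesToCrawl: 50,
  outputFileName: "output.json",
};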
Hi
Hi there,
I tried to set up this project on Fedora 38, and it failed because Playwright only supports Ubuntu and Debian.
microsoft/playwright#27890
It would be great if gpt-crawler could also support Puppeteer, so it could run on more OS distributions.
I can crawl other sites just fine, but for some reason any Zendesk site gives me a 403. Any advice on how to fix this? Our docs are completely in Zendesk 😬
Edit: User error.
Can someone add a function where you can also exclude specific directories? Like, don't crawl example.com/products/ (and all products deeper inside that path)?
Thanks for building this. Just wondering if there is an easier or more dynamic way to find the selector? This seems to be the part where it either breaks or I have difficulty.
My normal approach would be to visit the site I want to scrape, right-click the content I want to scrape, and click 'Inspect'. Then I right-click again to copy the 'selector'. But the result is quite long and specific to that page... (e.g. #app > div.article-box.grid.container > div:nth-child(2) > div.acticle-content > div:nth-child(2) > div.normal.system.article-body > p:nth-child(6))
Any suggestions on how to streamline or fix this? Thanks again.
Is it possible to use multiple selectors?
How do you crawl all the subfolders? E.g.
https://thoughtcatalog.com/trisha-bartle/
https://thoughtcatalog.com/trisha-bartle/page/2/
https://thoughtcatalog.com/trisha-bartle/page/3/
etc.
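Since the request handler passes match to enqueueLinks as a glob, a trailing ** should cover the paginated subfolders above. A sketch:

import { Config } from "./src/config";

export const defaultConfig: Config = {
  url: "https://thoughtcatalog.com/trisha-bartle/",
  // ** matches /page/2/, /page/3/, and anything else under this path.
  match: "https://thoughtcatalog.com/trisha-bartle/**",
  maxPagesToCrawl: 50,
  outputFileName: "output.json",
};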
Can a feature/flag be added to allow for crawling sites that need credentials for accessing specific pages?
Want to index the EdgeDB docs: https://www.edgedb.com/docs/datamodel/index. The docs live inside the class layout_docsContent__JzhPH, where the JzhPH hash changes from page to page. Currently the selector wants a fixed value. It would be nice to support [class^="layout_docsContent__"], essentially.
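Since the selector appears to be handed straight to page.waitForSelector, a CSS attribute prefix selector may already do the trick. An untested sketch:

export const defaultConfig: Config = {
  url: "https://www.edgedb.com/docs/datamodel/index",
  match: "https://www.edgedb.com/docs/**",
  // Matches any element whose class starts with the stable prefix.
  selector: '[class^="layout_docsContent__"]',
  maxPagesToCrawl: 50,
  outputFileName: "output.json",
};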
I just ran the original code against our platform pages, and it couldn't capture the full content of each page.
Therefore, I added an autoScroll function in main.ts, and it worked perfectly.
(I think this is better than increasing waitForSelectorTimeout.)
async function autoScroll(page: Page) {
  await page.evaluate(async () => {
    await new Promise<void>((resolve) => {
      let totalHeight = 0;
      const distance = 100; // pixels per scroll step
      const timer = setInterval(() => {
        const scrollHeight = document.body.scrollHeight;
        window.scrollBy(0, distance);
        totalHeight += distance;
        // Stop once we have scrolled past the full page height, giving
        // lazy-loaded content a chance to render along the way.
        if (totalHeight >= scrollHeight) {
          clearInterval(timer);
          resolve();
        }
      }, 100);
    });
  });
}
if (process.env.NO_CRAWL !== "true") {
  const crawler = new PlaywrightCrawler({
    async requestHandler({ request, page, enqueueLinks, log, pushData }) {
      try {
        if (config.cookie) {
          const cookie = {
            name: config.cookie.name,
            value: config.cookie.value,
            url: request.loadedUrl,
          };
          await page.context().addCookies([cookie]);
        }
        const title = await page.title();
        log.info(`Crawling ${request.loadedUrl}...`);
        await page.waitForSelector(config.selector, {
          timeout: config.waitForSelectorTimeout,
        });
        await autoScroll(page);
        const html = await getPageHtml(page);
        await pushData({ title, url: request.loadedUrl, html });
        if (config.onVisitPage) {
          await config.onVisitPage({ page, pushData });
        }
        await enqueueLinks({
          globs: [config.match],
        });
      } catch (error) {
        log.error(`Error crawling ${request.loadedUrl}: ${error}`);
      }
    },
    maxRequestsPerCrawl: config.maxPagesToCrawl,
    // headless: false,
  });
  await crawler.run([config.url]);
}
If you think this is good enough for crawling, I hope it will be helpful for other users.
Thank you for your work, btw!
I really appreciate it!
Thank you.
@steve8708 Gauging interest: I have made a big refactoring of the codebase to integrate these features:
output.json moved to output/data.json
I wanted to create a knowledge base for Godot, but I wanted to separate each section into its own file. I managed to do it with multiple configs. But with that done and the output I needed in hand, I am not interested in fixing the logging part. It was useful when I spotted errors, but not that helpful IMO.
So the current changes are big and 90% finished. Nonetheless, I think they are an improvement, just not a "fully stable" and completed one... Everything that was added is very functional, but I still have issues with the terminal output: if the lines get wrapped, the output gets ugly. Nx has a similar issue with their run-many CLI, so I don't know if it's VS Code, the terminal, or the lib... I'm just not interested in completing the feature.
> @builder.io/[email protected] build
> tsc
Crawling started.
████████████████████████████████████████ | about | 7/7 (L: 50, F: 7) | ETA: 0s | Completed
████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░ | getting_started | 10/33 (L: 50, F: 33) | ETA: 101s | /getting_started/step_by_step/instancing.html
███████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ | tutorials | 9/50 (L: 50, F: 327) | ETA: 268s | /tutorials/best_practices/godot_interfaces.html
████████████████████████████████████████ | community | 7/7 (L: 50, F: 7) | ETA: 0s | Completed
████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ | contributing | 9/47 (L: 50, F: 47) | ETA: 248s | /contributing/development/index.html INFO Statistics: null request statistics: {"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":6323,"requestsFinishedPerMinute":9,"requestsFailedPerMinute":0,"requestTotalDurationMillis":56909,"requestsTotal":9,"crawlerRuntimeMillis":60560,"retryHistogram":[9]}
████████████████████████████████████████ | about | 7/7 (L: 50, F: 7) | ETA: 0s | Completed
████████████████████████████████░░░░░░░░ | getting_started | 26/33 (L: 50, F: 33) | ETA: 28s | /getting_started/first_3d_game/03.player_movement_code.html
█████████████████████░░░░░░░░░░░░░░░░░░░ | tutorials | 26/50 (L: 50, F: 327) | ETA: 91s | /tutorials/editor/managing_editor_features.html
████████████████████████████████████████ | community | 7/7 (L: 50, F: 7) | ETA: 0s | Completed
█████████████████████░░░░░░░░░░░░░░░░░░░ | contributing | 26/50 (L: 50, F: 57) | ETA: 92s | /contributing/development/debugging/using_sanitizers.html INFO Statistics: null request statistics: {"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":4464,"requestsFinishedPerMinute":13,"requestsFailedPerMinute":0,"requestTotalDurationMillis":116054,"requestsTotal":26,"crawlerRuntimeMillis":120568,"retryHistogram":[26]}
████████████████████████████████████████ | about | 7/7 (L: 50, F: 7) | ETA: 0s | Completed
████████████████████████████████████████ | getting_started | 33/33 (L: 50, F: 33) | ETA: 0s | Completed
██████████████████████████████████████░░ | tutorials | 47/50 (L: 50, F: 327) | ETA: 8s | /tutorials/3d/procedural_geometry/arraymesh.html
████████████████████████████████████████ | community | 7/7 (L: 50, F: 7) | ETA: 0s | Completed
███████████████████████████████████░░░░░ | contributing | 44/50 (L: 50, F: 73) | ETA: 19s | /contributing/documentation/class_reference_primer.html INFO Sta ████████████████████████████████████████ | about | 7/7 (L: 50, F: 7) | ETA: 0s | Completed
████████████████████████████████████████ | getting_started | 33/33 (L: 50, F: 33) | ETA: 0s | Completed
████████████████████████████████████████ | tutorials | 50/50 (L: 50, F: 327) | ETA: 0s | Completed
████████████████████████████████████████ | community | 7/7 (L: 50, F: 7) | ETA: 0s | Completed
████████████████████████████████████████ | contributing | 50/50 (L: 50, F: 73) | ETA: 0s | Completed
I made this multi progress bar because, with concurrent crawling, the log was hard to follow. With this it's easier to follow, but when logging happens in the meantime (errors, info, and the like), it's a mess...
When this "type" of line appears from PlaywrightCrawler, it breaks the multi progress bar:
INFO Statistics: null request statistics: {"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":4525,"requestsFinishedPerMinute":12,"requestsFailedPerMinute":0,"requestTotalDurationMillis":113118,"requestsTotal":25,"crawlerRuntimeMillis":120511,"retryHistogram":[25]}
The multi progress bar display gets bugged. I don't understand terminals and Playwright well enough to know exactly what to change to fix this.
I have no interest in fixing the terminal part since I got what I wanted, but the changes as a whole are an improvement, so I was asking: could I make a PR and let someone else fix that issue in the PR and push it? I guess the concurrent part could be omitted, and that would "make the PR complete".
I use a "modern" Prettier config; my editor formats with my own config if none exists in the repo I'm working on. I set up Prettier because I was already changing formatting on save, but I'm OK with reverting this. I could also push config files that would enforce that formatting, though I wasn't planning to make big changes, so I'm willing to remove that too if there's no interest.
Line 6 in e9b7a82
The README says to specify a selector, but config.ts no longer has that key in defaultConfig.
Can you add a feature to ignore crawling for a given list of urls?
Sample:
match: [
  "https://www.builder.io/",
],
exclude: [
  "https://www.builder.io/blog/",
]
First of all, this sounds really cool!
How robust is the crawler, in the sense of what it can crawl?
As long as there is HTML it should work?
And what is the "storage" limit? Can I let it crawl the official Python docs? The limit might not be on the crawler's side but on the side of the LLM I plug the .json file into.
Cheers
There is a similar project; the general idea is to write all your local files (tree structure) into an output JSON file, recording the full path of each file as the JSON key and the file content as the value.
However, I found that doing so did not make the GPT application any smarter. Because the context length of GPT is limited, if the amount of data is relatively large (in fact, just a few HTML files is enough), the model will have difficulty processing it.
It's good to generate an output.json for a website, but the output.json can be a large file, which is hard for GPT-4 to read.
Playwright keeps reporting errors
I have a username and password. Does it support crawling my docs on this kind of site? Or is there any plan to support it? Thanks.
I would love to create a GPT out of a GitHub repo. Can you please add this?
K thx bai
Is it possible to add a system for pausing and resuming the task later?
Need to ignore a URL if the page has already been crawled; in my case the same URL is crawled several times.
Context:
The current generation of the JSON database for Custom GPT produces redundant data, particularly in common parts of HTML content.
Proposal:
Integrate a feature for optimization based on a hashing technique to minimize tokens in HTML. This approach should not only identify common parts of HTML but also find the most optimized hashes to reduce the total number of tokens.
Proposed Operation:
Each common part is subjected to a hashing function, generating a unique key. However, the hashing algorithm must be optimized to minimize the number of tokens. These hash keys are then stored in an array, while the original values of the title, URL, and HTML content in the JSON database refer to these keys.
Concrete Example:
Consider two articles with similar HTML content containing common parts: "Article on Artificial Intelligence" and "In-Depth Exploration of AI". Identify the common parts in the HTML and apply an optimized hashing algorithm to minimize tokens.
// Optimized database
[
  {
    "title": "Article on Artificial Intelligence",
    "url": "https://example.com/article1",
    "html_hash": ["a1b2c3", "d4e5f6", "g7h8i9", "j10k11l12", "m13n14o15", "p16q17r18"]
  },
  {
    "title": "In-Depth Exploration of AI",
    "url": "https://example.com/article2",
    "html_hash": ["a1b2c3", "s19t20u21", "g7h8i9", "j10k11l12", "v22w23x24"]
  }
]

// Table of hashed common phrases
{
  "a1b2c3": "Artificial intelligence",
  "d4e5f6": " (AI)",
  "g7h8i9": " is a discipline of computer science...",
  "j10k11l12": "...",
  "m13n14o15": "revolutionizing many sectors...",
  "p16q17r18": "...",
  "s19t20u21": "also known as AI...",
  "v22w23x24": "..."
}
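A rough sketch of that operation in TypeScript; the helper and table names are mine, and the paragraph-based chunking and 8-character keys are illustrative assumptions, not part of the proposal:

import { createHash } from "crypto";

// hash -> original chunk, shared across all pages
const commonPhrases = new Map<string, string>();

// Replace each HTML chunk with a short hash key, storing the original
// text once in the shared table so repeated chunks cost tokens only once.
function hashChunks(html: string): string[] {
  return html.split(/\n{2,}/).map((chunk) => {
    const key = createHash("sha256").update(chunk).digest("hex").slice(0, 8);
    commonPhrases.set(key, chunk);
    return key;
  });
}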
Hey, all! Awesome project! I loved it, it's very useful.
Do you have any interest in packaging it as a CLI as well? It occurs to me that it would be more straightforward/developer-friendly to use it this way:
gpt-crawler --url https://www.builder.io/c/docs/developers --match https://www.builder.io/c/docs/** --selector .docs-builder-container --maxPagesToCrawl 50 --outputFileName output.json
Let me know. I would love to discuss this contribution 😉
I got a working example on my fork: https://github.com/marcelovicentegc/gpt-crawler/blob/main/src/cli.ts
The link in the readme https://github.com/BuilderIO/gpt-crawler#running-as-a-cli doesn't work.
Is this feature still being worked on?
Inspecting Web Page Structure:
Open the target website (e.g., https://www.google.com.hk/webhp?hl=zh-CN&sourceid=cnhp/).
Right-click on the page element you wish to crawl (such as a specific text or area) and select "Inspect" to open the browser's developer tools.
Analyzing the Element:
In the developer tools, examine the HTML code of the element.
Look for attributes that uniquely identify the element or its container, such as class, id, or other attributes.
Building a CSS Selector:
Create a CSS selector based on the attributes you observed.
For example, if an element has class="content", the selector could be .content.
If the element has multiple classes, you can combine them like .class1.class2.
Testing the Selector:
In the "Console" tab of the developer tools, use document.querySelector('YOUR_SELECTOR') to test if the selector accurately selects the target element.
Applying the Selector:
Once a suitable selector is found, apply it in the selector field of your crawler configuration.
Ensure that the chosen CSS selector accurately reflects the content you wish to extract from the webpage. An incorrect selector might result in the crawler not being able to retrieve the desired data.
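For instance, once document.querySelector('.content') tests clean in the console, the crawler config might look like this (URL and limits illustrative):

export const defaultConfig: Config = {
  url: "https://example.com/",
  match: "https://example.com/**",
  // The CSS selector verified in the browser console.
  selector: ".content",
  maxPagesToCrawl: 50,
  outputFileName: "output.json",
};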
I'm still learning, but I was wondering if it's possible to crawl Discord channels and all their messages?
One issue I've been running into is that some pages, or sets of pages, that I am scraping are too large for one knowledge file, and ChatGPT complains there is too much text in the file for it to use. So I thought maybe there should be an option which allows me to split a knowledge file into multiple knowledge files, so that I can feed it several smaller files as opposed to one large one.
Maybe an option in the 'config.ts' file, where the default is 1 (so no splitting), but when you set it to any higher number, it will split a knowledge file into that many files with a number suffix.
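Something like the following, where the option name is purely hypothetical:

export const defaultConfig: Config = {
  // ...existing options...
  outputFileName: "output.json",
  // Hypothetical: 1 = no splitting (default); 3 = output-1.json ... output-3.json
  splitOutputInto: 3,
};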
The project is great, thank you very much for it.
Is there a way to do the final step, building a custom GPT, without OpenAI, to avoid misuse of private or sensitive data?
Thank you for making this.
I want to scrape the Europages website, which contains a lot of businesses. How can I make a list with my own preferences instead of making a chatbot?
A lot of websites have anti-crawler protection; maybe it's a good idea to add proxy support?
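Crawlee, which the crawler is built on, already has a proxy mechanism that could presumably be wired in. A sketch, with placeholder proxy URLs:

import { PlaywrightCrawler, ProxyConfiguration } from "crawlee";

// Placeholder proxy list; rotation across the URLs is handled by crawlee.
const proxyConfiguration = new ProxyConfiguration({
  proxyUrls: ["http://proxy-1.example.com:8000", "http://proxy-2.example.com:8000"],
});

const crawler = new PlaywrightCrawler({
  proxyConfiguration,
  async requestHandler({ page, request }) {
    // ...existing handler logic...
  },
});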
It would be great to set up GitHub Actions to run prettier --check and a build on each PR, to ensure those are passing.
I would like to use this on the expo documentation. Could you show me an example of how I am supposed to figure out which thing to use for the selector property of the builder tool?
https://docs.expo.dev/workflow/web/
Thank you very much
Hello,
As the title says, the general idea is to destructure an XML schema and convert it into JSON output, due to the limit on uploading XML via the OpenAI API.
Thanks.
Is there a simple way to select all text, images, or URLs, so the tool can be used for all sorts of websites without first inspecting each website for the correct elements?
Hi, thank you for providing this awesome solution!
I made a Python version to help you distribute to more communities :D
As a follow-up on this pull request: #38
I was wondering if it's possible to expose the service as an API. It would be a lot easier and simpler to run it locally, without the need to publish the GPT crawler. It would be perfect if it were containerized!
I'm no expert in JS. I tried to implement an Express server with the help of ChatGPT, but I had a lot of exceptions and errors, so I gave up ^^
This is my attempt:
// file: app/src/api.ts
import express from "express";
import cors from "cors";
import fileUpload, { UploadedFile } from "express-fileupload";
import { readFile } from "fs/promises";
import { startCrawling } from "./main";

// Create a new express application instance
const app = express();
const port = 3000; // You may want to make the port configurable

// Enable CORS, JSON bodies, and file uploads
app.use(cors());
app.use(express.json());
app.use(fileUpload());

// Define a POST route to accept a config file and run the crawler
app.post("/crawl", async (req, res) => {
  // Verify that we have the configuration in the request
  if (!req.files || !req.files.config) {
    return res.status(400).json({ message: "Config file is required." });
  }
  // Read the configuration file sent as form-data
  // (cast needed because express-fileupload also allows arrays of files)
  const configContent = (req.files.config as UploadedFile).data.toString("utf-8");
  const config = JSON.parse(configContent);
  try {
    await startCrawling(config);
    // Read the output file after crawling and send it in the response
    const outputFileContent = await readFile(config.outputFileName, "utf-8");
    res.contentType("application/json");
    return res.send(outputFileContent);
  } catch (error) {
    res.status(500).json({ message: "Error occurred during crawling", error });
  }
});

// Start the Express server
app.listen(port, () => {
  console.log(`API server listening at http://localhost:${port}`);
});

export default app;
I have successfully crawled a whole website and have the output file as JSON.
The problem is that the file size is 93MB and after uploading it to ChatGPT I get an error message stating that the file is too large.
Is there a known size limit for uploads, and can the output be chunked into different parts?
When running the 'gpt-crawler' Docker container, I encountered an error stating that the module '/home/myuser/dist/main.js' could not be found. This issue prevented the crawler from starting.
In order to address the issue and ensure the proper functioning of the 'gpt-crawler' in a Docker environment, the following changes were made:
Modification in package.json: updated the start:prod script to correctly reference the main JavaScript file generated by TypeScript. The original script was "start:prod": "node dist/main.js", which was incorrect because main.js ends up in the dist/src directory after TypeScript compilation. The updated script is "start:prod": "node dist/src/main.js".
Dockerfile Adjustments:
Testing the Solution:
Pushing Changes to Forked Repository: pushed the package.json change and any other relevant modifications made to ensure the functionality of the crawler in a Docker environment.
Issue title, description, and code fixes are generative work by ChatGPT plugins ("Recombinant AI", "MixerBox ChatVideo").
This issue and the related pull request are the submissions of an absolute open-source noob with no JavaScript development experience, so all feedback is welcome.
Is it possible to add a system for launching multiple tasks simultaneously? And also a system for a task list?
How do you match all URLs with a certain word in them?
E.g. the word "usa" anywhere within any URL on the ft.com site:
ft.com/usa/xyz
ft.com/today/opinion/usa
ft.com/today/articles/usa
Is this done with the selector? If so, how do you do it?
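If I read the config right, selector picks content within a page, while URL filtering happens on the match/enqueue side. Under the hood, crawlee's enqueueLinks also accepts regexps, so a "usa anywhere in the URL" rule could look like this sketch (not currently exposed through config.ts):

// Inside the request handler, instead of globs: [config.match]:
await enqueueLinks({
  regexps: [/https:\/\/(www\.)?ft\.com\/.*usa.*/],
});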
Currently the crawler seems to treat links as different if the query parameters differ. In some cases (e.g. utm_ trackers, Notion's pvs junk, and crap like that), the links should be cleaned up.
One way to address this would be to have an array of URL params in config.ts that should be removed in order to obtain the canonical URL for a page.
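A sketch of how that could work with the transformRequestFunction hook on crawlee's enqueueLinks; the param list and its config key are illustrative, not existing options:

// Hypothetical config value: query params to strip before deduplication.
const stripParams = ["utm_source", "utm_medium", "utm_campaign", "pvs"];

await enqueueLinks({
  globs: [config.match],
  transformRequestFunction(request) {
    const url = new URL(request.url);
    stripParams.forEach((p) => url.searchParams.delete(p));
    request.url = url.toString(); // canonical URL for deduplication
    return request;
  },
});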
import { Config } from "./src/config";

export const defaultConfig: Config = {
  url: "https://openai.com/",
  match: "gpt",
  maxPagesToCrawl: 100,
  outputFileName: "outputzyx.json",
};
INFO PlaywrightCrawler: Starting the crawler.
INFO PlaywrightCrawler: Crawling: Page 1 / 100 - URL: https://openai.com/...
INFO PlaywrightCrawler: All requests from the queue have been processed, the crawler will shut down.
INFO PlaywrightCrawler: Final request statistics: {"requestsFinished":1,"requestsFailed":0,"retryHistogram":[1],"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":6738,"requestsFinishedPerMinute":9,"requestsFailedPerMinute":0,"requestTotalDurationMillis":6738,"requestsTotal":1,"crawlerRuntimeMillis":6867}
INFO PlaywrightCrawler: Finished! Total 1 requests: 1 succeeded, 0 failed. {"terminal":true}