builderio / gpt-crawler
Crawl a site to generate knowledge files to create your own custom GPT from a URL
Home Page: https://www.builder.io/blog/custom-gpt
License: ISC License
Some pages take more than 1000 ms to load, which is the default timeout here. For these occasions it can be useful to be able to configure the default waitForSelector timeout in config.js.
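For reference, a sketch of a config with that timeout raised; the other values mirror the README example, and the key name matches the waitForSelectorTimeout the request handler already reads:

import { Config } from "./src/config";

export const defaultConfig: Config = {
  url: "https://www.builder.io/c/docs/developers",
  match: "https://www.builder.io/c/docs/**",
  selector: ".docs-builder-container",
  maxPagesToCrawl: 50,
  outputFileName: "output.json",
  // Raise this for slow pages; the handler passes it to page.waitForSelector.
  waitForSelectorTimeout: 5000, // milliseconds
};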
Hello from the crawlee team!
Just a small suggestion, I was taking a peek at the code and saw you do this to create a data bundle at the end.
Lines 115 to 129 in 27b65d3
We recently added a new helper that does exactly the same:
https://crawlee.dev/api/basic-crawler/class/BasicCrawler#exportData
So you could replace the whole function with a simple crawler.exportData(config.outputFileName) call, and it will support both JSON and CSV automatically (based on the file extension).
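In main.ts that would presumably collapse to something like:

await crawler.run([config.url]);
// Replaces the hand-rolled bundling; JSON vs. CSV is inferred from the extension.
await crawler.exportData(config.outputFileName);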
output.json supports markdown links. This is super useful to point users to further information.
Try crawling this Notion site for example (selector .layout) and prompt "How do I make the best of NC12". The answer will instruct you to join the Telegram chat and FB group, but the links for those are lost.
The crawler should convert <a> links to Markdown links.
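One possible approach, sketched with the Playwright Page the handler already has (the helper name is mine): rewrite anchors in the DOM before text extraction, so the hrefs survive.

import { Page } from "playwright";

// Rewrite <a> elements as Markdown links in place, so extracting the page
// text afterwards keeps the URLs.
async function linksToMarkdown(page: Page) {
  await page.evaluate(() => {
    document.querySelectorAll("a[href]").forEach((a) => {
      const anchor = a as HTMLAnchorElement;
      anchor.textContent = `[${anchor.textContent}](${anchor.href})`;
    });
  });
}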
I can't seem to get the crawler to crawl every extension of a website; it sometimes misses a lot. The site does have a sitemap.xml with every link I would want, though. Is there any way to use that? If so, then how?
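Not built into the config as far as I can tell, but newer crawlee releases ship a Sitemap helper that could seed the request queue directly. A sketch, assuming that helper is available in the installed version:

import { Sitemap } from "crawlee";

// Seed the crawler from sitemap.xml instead of relying on link discovery.
const { urls } = await Sitemap.load("https://example.com/sitemap.xml");
await crawler.run(urls);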
This repo looks very interesting.
Does this repo handle dynamic, JS-rendered content?
I've noticed you can't currently use any regex when defining URLs, but is there some other way to leverage different subdomains or wildcard characters? For example, I want to crawl multiple subdomains that follow a similar structure: https://pco[NAME].zendesk.com/. I was thinking I could change the match field to accept a string array, but I also couldn't use regex to wildcard the [NAME] piece of the subdomain. Is there some other way to achieve this?
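A sketch of that idea as a config, treating match as an array of minimatch-style globs; both the array form and the subdomain wildcard are hypothetical here, not confirmed features:

import { Config } from "./src/config";

export const defaultConfig: Config = {
  url: "https://pcoexample.zendesk.com/",
  // Hypothetical: an array of globs, with * standing in for the [NAME] piece.
  match: ["https://pco*.zendesk.com/**"],
  maxPagesToCrawl: 50,
  outputFileName: "output.json",
};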
Hi
Hi there,
I tried to set up this project on Fedora 38, and it failed because Playwright only supports Ubuntu and Debian.
microsoft/playwright#27890
It would be great if gpt-crawler could also support Puppeteer, so it could run on more OS distributions.
I can crawl other sites just fine, but for some reason any Zendesk site gives me a 403. Any advice on how to fix this? Our docs are completely in Zendesk 😬
Edit: User error.
Can someone add a function where you can also exclude specific directories? Like, don't crawl example.com/products/ (and all products deeper inside that path)?
Thanks for building this. Just wondering if there is an easier or more dynamic way to find the selector? This seems to be the part where it either breaks or I have difficulty.
My normal approach would be to visit the site I want to scrape, right-click the content I want to scrape, and click 'Inspect'. Then I right-click again to copy the 'selector'. But the result is quite long and specific to that page... (e.g. #app > div.article-box.grid.container > div:nth-child(2) > div.acticle-content > div:nth-child(2) > div.normal.system.article-body > p:nth-child(6))
Any suggestions on how to streamline or fix this? Thanks again.
Is it possible to use multiple selectors?
How do you crawl all the subfolders? E.g.
https://thoughtcatalog.com/trisha-bartle/
https://thoughtcatalog.com/trisha-bartle/page/2/
https://thoughtcatalog.com/trisha-bartle/page/3/
etc.
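Since the request handler passes match to enqueueLinks as a glob, a trailing ** should cover the paginated subfolders above. A sketch:

import { Config } from "./src/config";

export const defaultConfig: Config = {
  url: "https://thoughtcatalog.com/trisha-bartle/",
  // ** matches /page/2/, /page/3/, and anything else under this path.
  match: "https://thoughtcatalog.com/trisha-bartle/**",
  maxPagesToCrawl: 50,
  outputFileName: "output.json",
};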
Can a feature/flag be added to allow for crawling sites that need credentials for accessing specific pages?
Want to index the EdgeDB docs: https://www.edgedb.com/docs/datamodel/index. The docs live inside the class layout_docsContent__JzhPH, where the JzhPH hash changes from page to page. Currently the selector wants a fixed value. It would be nice to support [class^="layout_docsContent__"], essentially.
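Since the selector appears to be handed straight to page.waitForSelector, a CSS attribute prefix selector may already do the trick. An untested sketch:

export const defaultConfig: Config = {
  url: "https://www.edgedb.com/docs/datamodel/index",
  match: "https://www.edgedb.com/docs/**",
  // Matches any element whose class starts with the stable prefix.
  selector: '[class^="layout_docsContent__"]',
  maxPagesToCrawl: 50,
  outputFileName: "output.json",
};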
I just ran the original code against our platform pages, and it couldn't capture the full content of each page.
Therefore, I added an autoScroll function in main.ts, and it worked perfectly.
(I think this is better than increasing waitForSelectorTimeout.)
async function autoScroll(page: Page) {
  await page.evaluate(async () => {
    await new Promise<void>((resolve) => {
      let totalHeight = 0;
      const distance = 100; // pixels per scroll step
      const timer = setInterval(() => {
        const scrollHeight = document.body.scrollHeight;
        window.scrollBy(0, distance);
        totalHeight += distance;
        // Stop once we have scrolled past the full page height, giving
        // lazy-loaded content a chance to render along the way.
        if (totalHeight >= scrollHeight) {
          clearInterval(timer);
          resolve();
        }
      }, 100);
    });
  });
}
if (process.env.NO_CRAWL !== "true") {
  const crawler = new PlaywrightCrawler({
    async requestHandler({ request, page, enqueueLinks, log, pushData }) {
      try {
        if (config.cookie) {
          const cookie = {
            name: config.cookie.name,
            value: config.cookie.value,
            url: request.loadedUrl,
          };
          await page.context().addCookies([cookie]);
        }
        const title = await page.title();
        log.info(`Crawling ${request.loadedUrl}...`);
        await page.waitForSelector(config.selector, {
          timeout: config.waitForSelectorTimeout,
        });
        await autoScroll(page);
        const html = await getPageHtml(page);
        await pushData({ title, url: request.loadedUrl, html });
        if (config.onVisitPage) {
          await config.onVisitPage({ page, pushData });
        }
        await enqueueLinks({
          globs: [config.match],
        });
      } catch (error) {
        log.error(`Error crawling ${request.loadedUrl}: ${error}`);
      }
    },
    maxRequestsPerCrawl: config.maxPagesToCrawl,
    // headless: false,
  });
  await crawler.run([config.url]);
}
If you think this is good enough for crawling, I hope it will be helpful for other users.
Thank you for your work, btw!
I really appreciate it!
Thank you.
@steve8708 Gauging interest: I have made a big refactoring of the codebase to integrate these features:
output.json moved to output/data.json
I wanted to create a knowledge base for Godot, but I wanted to separate each section into its own file. I managed to do it with multiple configs. But with that done and the output I needed in hand, I am not interested in fixing the logging part. It was useful when I spotted errors, but not that helpful IMO.
So the current changes are big and 90% finished. Nonetheless, I think they are an improvement, just not a "fully stable" and completed one... Everything that was added is very functional, but I still have issues with the terminal output: if the lines get wrapped, the output gets ugly. Nx has a similar issue with their run-many CLI, so I don't know if it's VS Code, the terminal, or the lib... I'm just not interested in completing the feature.
> @builder.io/[email protected] build
> tsc
Crawling started.
████████████████████████████████████████ | about | 7/7 (L: 50, F: 7) | ETA: 0s | Completed
████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░ | getting_started | 10/33 (L: 50, F: 33) | ETA: 101s | /getting_started/step_by_step/instancing.html
███████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ | tutorials | 9/50 (L: 50, F: 327) | ETA: 268s | /tutorials/best_practices/godot_interfaces.html
████████████████████████████████████████ | community | 7/7 (L: 50, F: 7) | ETA: 0s | Completed
████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ | contributing | 9/47 (L: 50, F: 47) | ETA: 248s | /contributing/development/index.html INFO Statistics: null request statistics: {"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":6323,"requestsFinishedPerMinute":9,"requestsFailedPerMinute":0,"requestTotalDurationMillis":56909,"requestsTotal":9,"crawlerRuntimeMillis":60560,"retryHistogram":[9]}
████████████████████████████████████████ | about | 7/7 (L: 50, F: 7) | ETA: 0s | Completed
████████████████████████████████░░░░░░░░ | getting_started | 26/33 (L: 50, F: 33) | ETA: 28s | /getting_started/first_3d_game/03.player_movement_code.html
█████████████████████░░░░░░░░░░░░░░░░░░░ | tutorials | 26/50 (L: 50, F: 327) | ETA: 91s | /tutorials/editor/managing_editor_features.html
████████████████████████████████████████ | community | 7/7 (L: 50, F: 7) | ETA: 0s | Completed
█████████████████████░░░░░░░░░░░░░░░░░░░ | contributing | 26/50 (L: 50, F: 57) | ETA: 92s | /contributing/development/debugging/using_sanitizers.html INFO Statistics: null request statistics: {"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":4464,"requestsFinishedPerMinute":13,"requestsFailedPerMinute":0,"requestTotalDurationMillis":116054,"requestsTotal":26,"crawlerRuntimeMillis":120568,"retryHistogram":[26]}
████████████████████████████████████████ | about | 7/7 (L: 50, F: 7) | ETA: 0s | Completed
████████████████████████████████████████ | getting_started | 33/33 (L: 50, F: 33) | ETA: 0s | Completed
██████████████████████████████████████░░ | tutorials | 47/50 (L: 50, F: 327) | ETA: 8s | /tutorials/3d/procedural_geometry/arraymesh.html
████████████████████████████████████████ | community | 7/7 (L: 50, F: 7) | ETA: 0s | Completed
███████████████████████████████████░░░░░ | contributing | 44/50 (L: 50, F: 73) | ETA: 19s | /contributing/documentation/class_reference_primer.html INFO Sta ████████████████████████████████████████ | about | 7/7 (L: 50, F: 7) | ETA: 0s | Completed
████████████████████████████████████████ | getting_started | 33/33 (L: 50, F: 33) | ETA: 0s | Completed
████████████████████████████████████████ | tutorials | 50/50 (L: 50, F: 327) | ETA: 0s | Completed
████████████████████████████████████████ | community | 7/7 (L: 50, F: 7) | ETA: 0s | Completed
████████████████████████████████████████ | contributing | 50/50 (L: 50, F: 73) | ETA: 0s | Completed
I made this multi progress bar because, with concurrent crawling, the log was hard to follow. With this it's easier to follow, but when logging happens in the meantime (errors, info, and the like), it's a mess...
When this "type" of line appears from PlaywrightCrawler, it breaks the multi progress bar:
INFO Statistics: null request statistics: {"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":4525,"requestsFinishedPerMinute":12,"requestsFailedPerMinute":0,"requestTotalDurationMillis":113118,"requestsTotal":25,"crawlerRuntimeMillis":120511,"retryHistogram":[25]}
The multi progress bar display gets bugged. I don't understand terminals and Playwright well enough to know exactly what to change to fix this.
I have no interest in fixing the terminal part since I got what I wanted, but the changes as a whole are an improvement, so I was asking: could I make a PR and let someone else fix that issue in the PR and push it? I guess the concurrent part could be omitted, and that would "make the PR complete".
I use a "modern" Prettier config; my editor formats with my own config if none exists in the repo I'm working on. I set up Prettier because I was already changing formatting on save, but I'm OK with reverting this. I could also push config files that would enforce that formatting, though I wasn't planning to make big changes, so I'm willing to remove that too if there's no interest.
Line 6 in e9b7a82
The README says to specify a selector, but config.ts no longer has that key in defaultConfig.
Can you add a feature to ignore crawling for a given list of urls?
Sample:
match: [
  "https://www.builder.io/",
],
exclude: [
  "https://www.builder.io/blog/",
]
First of all, this sounds really cool!
How robust is the crawler, in the sense of what it can crawl?
As long as there is HTML it should work?
And what is the "storage" limit? Can I let it crawl the official Python docs? The limit might not be on the crawler's side but on the side of the LLM I plug the .json file into.
Cheers
There is a similar project; the general idea is to write all your local files (tree structure) into an output JSON file, recording the full path of each file as the JSON key and the file content as the value.
However, I found that doing so did not make the GPT application any smarter. Because the context length of GPT is limited, if the amount of data is relatively large (in fact, just a few HTML files is enough), the model will have difficulty processing it.
It's good to generate an output.json for a website, but the output.json can be a large file, which is hard for GPT-4 to read.
Playwright keeps reporting errors
I have a username and password. Does it support crawling my docs on this kind of site? Or is there any plan to support it? Thanks.
I would love to create a GPT out of a GitHub repo. Can you please add this?
K thx bai
Is it possible to add a system for pausing and resuming the task later?
Need to ignore a URL if the page has already been crawled; in my case the same URL is crawled several times.
Context:
The current generation of the JSON database for Custom GPT produces redundant data, particularly in common parts of HTML content.
Proposal:
Integrate a feature for optimization based on a hashing technique to minimize tokens in HTML. This approach should not only identify common parts of HTML but also find the most optimized hashes to reduce the total number of tokens.
Proposed Operation:
Each common part is subjected to a hashing function, generating a unique key. However, the hashing algorithm must be optimized to minimize the number of tokens. These hash keys are then stored in an array, while the original values of the title, URL, and HTML content in the JSON database refer to these keys.
Concrete Example:
Consider two articles with similar HTML content containing common parts: "Article on Artificial Intelligence" and "In-Depth Exploration of AI". Identify the common parts in the HTML and apply an optimized hashing algorithm to minimize tokens.
// Optimized database
[
  {
    "title": "Article on Artificial Intelligence",
    "url": "https://example.com/article1",
    "html_hash": ["a1b2c3", "d4e5f6", "g7h8i9", "j10k11l12", "m13n14o15", "p16q17r18"]
  },
  {
    "title": "In-Depth Exploration of AI",
    "url": "https://example.com/article2",
    "html_hash": ["a1b2c3", "s19t20u21", "g7h8i9", "j10k11l12", "v22w23x24"]
  }
]

// Table of hashed common phrases
{
  "a1b2c3": "Artificial intelligence",
  "d4e5f6": " (AI)",
  "g7h8i9": " is a discipline of computer science...",
  "j10k11l12": "...",
  "m13n14o15": "revolutionizing many sectors...",
  "p16q17r18": "...",
  "s19t20u21": "also known as AI...",
  "v22w23x24": "..."
}
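A rough sketch of that operation in TypeScript; the helper and table names are mine, and the paragraph-based chunking and 8-character keys are illustrative assumptions, not part of the proposal:

import { createHash } from "crypto";

// hash -> original chunk, shared across all pages
const commonPhrases = new Map<string, string>();

// Replace each HTML chunk with a short hash key, storing the original
// text once in the shared table so repeated chunks cost tokens only once.
function hashChunks(html: string): string[] {
  return html.split(/\n{2,}/).map((chunk) => {
    const key = createHash("sha256").update(chunk).digest("hex").slice(0, 8);
    commonPhrases.set(key, chunk);
    return key;
  });
}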
Hey, all! Awesome project! I loved it, it's very useful.
Do you have any interest in packaging it as a CLI as well? It occurs to me that it would be more straightforward/developer-friendly to use it this way:
gpt-crawler --url https://www.builder.io/c/docs/developers --match https://www.builder.io/c/docs/** --selector .docs-builder-container --maxPagesToCrawl 50 --outputFileName output.json
Let me know. I would love to discuss this contribution 😉
I got a working example on my fork: https://github.com/marcelovicentegc/gpt-crawler/blob/main/src/cli.ts
The link in the readme https://github.com/BuilderIO/gpt-crawler#running-as-a-cli doesn't work.
Is this feature still being worked on?
Inspecting Web Page Structure:
Open the target website (e.g., https://www.google.com.hk/webhp?hl=zh-CN&sourceid=cnhp/).
Right-click on the page element you wish to crawl (such as a specific text or area) and select "Inspect" to open the browser's developer tools.
Analyzing the Element:
In the developer tools, examine the HTML code of the element.
Look for attributes that uniquely identify the element or its container, such as class, id, or other attributes.
Building a CSS Selector:
Create a CSS selector based on the attributes you observed.
For example, if an element has class="content", the selector could be .content.
If the element has multiple classes, you can combine them like .class1.class2.
Testing the Selector:
In the "Console" tab of the developer tools, use document.querySelector('YOUR_SELECTOR') to test if the selector accurately selects the target element.
Applying the Selector:
Once a suitable selector is found, apply it in the selector field of your crawler configuration.
Ensure that the chosen CSS selector accurately reflects the content you wish to extract from the webpage. An incorrect selector might result in the crawler not being able to retrieve the desired data.
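For instance, once document.querySelector('.content') tests clean in the console, the crawler config might look like this (URL and limits illustrative):

export const defaultConfig: Config = {
  url: "https://example.com/",
  match: "https://example.com/**",
  // The CSS selector verified in the browser console.
  selector: ".content",
  maxPagesToCrawl: 50,
  outputFileName: "output.json",
};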
I'm still learning, but I was wondering if it's possible to crawl Discord channels and all their messages?
One issue I've been running into is that some pages, or sets of pages, that I am scraping are too large for one knowledge file, and ChatGPT complains there is too much text in the file for it to use. So I thought maybe there should be an option which allows me to split a knowledge file into multiple knowledge files, so that I can feed it several smaller files as opposed to one large one.
Maybe an option in the 'config.ts' file, where the default is 1 (so no splitting), but when you set it to any higher number, it will split a knowledge file into that many files with a number suffix.
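Something like the following, where the option name is purely hypothetical:

export const defaultConfig: Config = {
  // ...existing options...
  outputFileName: "output.json",
  // Hypothetical: 1 = no splitting (default); 3 = output-1.json ... output-3.json
  splitOutputInto: 3,
};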
The project is great, thank you very much for it.
Is there a way to do the final step, building a custom GPT, without OpenAI, to avoid misuse of private or sensitive data?
Thank you for making this.
I want to scrape the Europages website, which contains a lot of businesses. How can I make a list with my own preferences instead of making a chatbot?
A lot of websites have anti-crawler protection; maybe it's a good idea to add proxy support?
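Crawlee, which the crawler is built on, already has a proxy mechanism that could presumably be wired in. A sketch, with placeholder proxy URLs:

import { PlaywrightCrawler, ProxyConfiguration } from "crawlee";

// Placeholder proxy list; rotation across the URLs is handled by crawlee.
const proxyConfiguration = new ProxyConfiguration({
  proxyUrls: ["http://proxy-1.example.com:8000", "http://proxy-2.example.com:8000"],
});

const crawler = new PlaywrightCrawler({
  proxyConfiguration,
  async requestHandler({ page, request }) {
    // ...existing handler logic...
  },
});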
It would be great to set up GitHub Actions to run prettier --check and a build on each PR, to ensure those are passing.
I would like to use this on the expo documentation. Could you show me an example of how I am supposed to figure out which thing to use for the selector property of the builder tool?
https://docs.expo.dev/workflow/web/
Thank you very much
Hello,
As the title says, the general idea is to destructure an XML schema and convert it into JSON output, due to the limit on uploading XML via the OpenAI API.
Thanks.
Is there a simple way to select all text, images, or URLs, so the tool can be used for all sorts of websites without first inspecting each website for the correct elements?
Hi, thank you for providing this awesome solution!
I made a Python version to help you distribute to more communities :D
As a follow-up on this pull request: #38
I was wondering if it's possible to expose the service as an API. It would be a lot easier and simpler to run it locally, without the need to publish the GPT crawler. It would be perfect if it were containerized!
I'm no expert in JS. I tried to implement an Express server with the help of ChatGPT, but I had a lot of exceptions and errors, so I gave up ^^
This is my attempt:
// file: app/src/api.ts
import express from "express";
import cors from "cors";
import fileUpload, { UploadedFile } from "express-fileupload";
import { readFile } from "fs/promises";
import { startCrawling } from "./main";

// Create a new express application instance
const app = express();
const port = 3000; // You may want to make the port configurable

// Enable CORS, JSON bodies, and file uploads
app.use(cors());
app.use(express.json());
app.use(fileUpload());

// Define a POST route to accept a config file and run the crawler
app.post("/crawl", async (req, res) => {
  // Verify that we have the configuration in the request
  if (!req.files || !req.files.config) {
    return res.status(400).json({ message: "Config file is required." });
  }
  // Read the configuration file sent as form-data
  // (cast needed because express-fileupload also allows arrays of files)
  const configContent = (req.files.config as UploadedFile).data.toString("utf-8");
  const config = JSON.parse(configContent);
  try {
    await startCrawling(config);
    // Read the output file after crawling and send it in the response
    const outputFileContent = await readFile(config.outputFileName, "utf-8");
    res.contentType("application/json");
    return res.send(outputFileContent);
  } catch (error) {
    res.status(500).json({ message: "Error occurred during crawling", error });
  }
});

// Start the Express server
app.listen(port, () => {
  console.log(`API server listening at http://localhost:${port}`);
});

export default app;
I have successfully crawled a whole website and have the output file as JSON.
The problem is that the file size is 93MB and after uploading it to ChatGPT I get an error message stating that the file is too large.
Is there a known size limit for uploads, and can the output be chunked into different parts?
When running the 'gpt-crawler' Docker container, I encountered an error stating that the module '/home/myuser/dist/main.js' could not be found. This issue prevented the crawler from starting.
In order to address the issue and ensure the proper functioning of the 'gpt-crawler' in a Docker environment, the following changes were made:
Modification in package.json: updated the start:prod script to correctly reference the main JavaScript file generated by TypeScript. The original script was "start:prod": "node dist/main.js", which was incorrect because main.js ends up in the dist/src directory after TypeScript compilation. The updated script is "start:prod": "node dist/src/main.js".
Dockerfile Adjustments:
Testing the Solution:
Pushing Changes to Forked Repository: pushed the package.json change and any other relevant modifications made to ensure the functionality of the crawler in a Docker environment.
Issue title, description, and code fixes are generative work by ChatGPT plugins ("Recombinant AI", "MixerBox ChatVideo").
This issue and the related pull request are the submissions of an absolute open-source noob with no JavaScript development experience, so all feedback is welcome.
Is it possible to add a system for launching multiple tasks simultaneously? And also a system for a task list?
How do you match all URLs with a certain word in them?
E.g. the word "usa" anywhere within any URL on the ft.com site:
ft.com/usa/xyz
ft.com/today/opinion/usa
ft.com/today/articles/usa
Is this done with the selector? If so, how do you do it?
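If I read the config right, selector picks content within a page, while URL filtering happens on the match/enqueue side. Under the hood, crawlee's enqueueLinks also accepts regexps, so a "usa anywhere in the URL" rule could look like this sketch (not currently exposed through config.ts):

// Inside the request handler, instead of globs: [config.match]:
await enqueueLinks({
  regexps: [/https:\/\/(www\.)?ft\.com\/.*usa.*/],
});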
Currently the crawler seems to treat links as different if the query parameters differ. In some cases (e.g. utm_ trackers, Notion's pvs junk, and crap like that), the links should be cleaned up.
One way to address this would be to have an array of URL params in config.ts that should be removed in order to obtain the canonical URL for a page.
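A sketch of how that could work with the transformRequestFunction hook on crawlee's enqueueLinks; the param list and its config key are illustrative, not existing options:

// Hypothetical config value: query params to strip before deduplication.
const stripParams = ["utm_source", "utm_medium", "utm_campaign", "pvs"];

await enqueueLinks({
  globs: [config.match],
  transformRequestFunction(request) {
    const url = new URL(request.url);
    stripParams.forEach((p) => url.searchParams.delete(p));
    request.url = url.toString(); // canonical URL for deduplication
    return request;
  },
});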
import { Config } from "./src/config";

export const defaultConfig: Config = {
  url: "https://openai.com/",
  match: "gpt",
  maxPagesToCrawl: 100,
  outputFileName: "outputzyx.json",
};
INFO PlaywrightCrawler: Starting the crawler.
INFO PlaywrightCrawler: Crawling: Page 1 / 100 - URL: https://openai.com/...
INFO PlaywrightCrawler: All requests from the queue have been processed, the crawler will shut down.
INFO PlaywrightCrawler: Final request statistics: {"requestsFinished":1,"requestsFailed":0,"retryHistogram":[1],"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":6738,"requestsFinishedPerMinute":9,"requestsFailedPerMinute":0,"requestTotalDurationMillis":6738,"requestsTotal":1,"crawlerRuntimeMillis":6867}
INFO PlaywrightCrawler: Finished! Total 1 requests: 1 succeeded, 0 failed. {"terminal":true}