Comments (7)

janbuchar commented on June 27, 2024

Hello @austinbuckler! Do I understand correctly that the problem is that the sitemap content type is not detected correctly because the extension is followed by a query string?

If that's the case, the solution that you propose should help. If you want to contribute a patch, feel free to do so - we'll be grateful to accept that!
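To illustrate the problem described above, here is a minimal sketch. The URL is made up for illustration, and the snippet is not taken from the crawlee code base. It only shows why a suffix check on the raw URL string misses the extension once a query string follows it, while a check on the parsed path does not.

```typescript
// Made-up sitemap URL whose extension is followed by a query string.
const sitemapUrl = 'http://not-exists.com/sitemap_child.xml?from=1&to=2';

// A suffix check on the whole string fails: the string ends with "to=2".
const rawCheck = sitemapUrl.endsWith('.xml');

// Parsing with the WHATWG URL class separates the path from the query,
// so the same check on `pathname` succeeds.
const pathCheck = new URL(sitemapUrl).pathname.endsWith('.xml');

console.log(rawCheck, pathCheck); // prints: false true
```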

from crawlee.

janbuchar commented on June 27, 2024

> @janbuchar here is the patch 🥂

Thank you very much! I assumed you'd open a pull request so that we 1. see if it passes tests and 2. can discuss it better. Also, if we accept that PR, you'll be listed as a contributor 🙂 Care to do that?

B4nan commented on June 27, 2024

Also please don't update changelogs, they are generated.

janbuchar commented on June 27, 2024

> good to know, will adjust - the contribution document implies that it should be done manually 😅

Ah, it's great that somebody actually reads those 😁 We'll update it!

austinbuckler commented on June 27, 2024

> Do I understand correctly that the problem is that the sitemap content type is not detected correctly because the extension is followed by a query string?

You are correct!

> If that's the case, the solution that you propose should help. If you want to contribute a patch, feel free to do so - we'll be grateful to accept that!

Awesome, will submit a patch this week. Thank you for the prompt response!

austinbuckler commented on June 27, 2024

@janbuchar here is the patch 🥂

From: Austin <[email protected]>
Date: Thu, 18 Apr 2024 06:59:13 +0000
Subject: [PATCH] fix: malformed sitemap when child loc contains querystrings.

---
 packages/utils/CHANGELOG.md             | 12 ++++++++++++
 packages/utils/src/internals/sitemap.ts | 12 ++++++------
 packages/utils/test/sitemap.test.ts     |  8 +++++++-
 3 files changed, 25 insertions(+), 7 deletions(-)

diff --git a/packages/utils/CHANGELOG.md b/packages/utils/CHANGELOG.md
index 70181f66..dd1ec681 100644
--- a/packages/utils/CHANGELOG.md
+++ b/packages/utils/CHANGELOG.md
@@ -3,6 +3,18 @@
 All notable changes to this project will be documented in this file.
 See [Conventional Commits](https://conventionalcommits.org) for commit guidelines.
 
+
+## [3.9.3](https://github.com/apify/crawlee/compare/v3.9.2...v3.9.3) (2024-04-18)
+
+
+### Features
+
+* **sitemap:** Support for querystrings in sitemap child urls. ([#2420](https://github.com/apify/crawlee/issues/2420))
+
+
+
+
+
 ## [3.9.2](https://github.com/apify/crawlee/compare/v3.9.1...v3.9.2) (2024-04-17)
 
 
diff --git a/packages/utils/src/internals/sitemap.ts b/packages/utils/src/internals/sitemap.ts
index bf3c88e3..c25d891d 100644
--- a/packages/utils/src/internals/sitemap.ts
+++ b/packages/utils/src/internals/sitemap.ts
@@ -149,8 +149,8 @@ export class Sitemap {
         parsingState.sitemapUrls = Array.isArray(urls) ? urls : [urls];
 
         while (parsingState.sitemapUrls.length > 0) {
-            let sitemapUrl = parsingState.sitemapUrls.pop()!;
-            parsingState.visitedSitemapUrls.push(sitemapUrl);
+            let sitemapUrl = new URL(parsingState.sitemapUrls.pop()!);
+            parsingState.visitedSitemapUrls.push(sitemapUrl.toString());
             parsingState.resetContext();
 
             try {
@@ -163,19 +163,19 @@ export class Sitemap {
                 if (sitemapStream.response!.statusCode === 200) {
                     await new Promise((resolve, reject) => {
                         let stream: Duplex = sitemapStream;
-                        if (sitemapUrl.endsWith('.gz')) {
+                        if (sitemapUrl.pathname.endsWith('.gz')) {
                             stream = stream.pipe(createGunzip()).on('error', reject);
-                            sitemapUrl = sitemapUrl.substring(0, sitemapUrl.length - 3);
+                            sitemapUrl.pathname = sitemapUrl.pathname.substring(0, sitemapUrl.pathname.length - 3);
                         }
 
                         const parser = (() => {
                             const contentType = sitemapStream.response!.headers['content-type'];
 
-                            if (['text/xml', 'application/xml'].includes(contentType ?? '') || sitemapUrl.endsWith('.xml')) {
+                            if (['text/xml', 'application/xml'].includes(contentType ?? '') || sitemapUrl.pathname.endsWith('.xml')) {
                                 return Sitemap.createXmlParser(parsingState, () => resolve(undefined), reject);
                             }
 
-                            if (contentType === 'text/plain' || sitemapUrl.endsWith('.txt')) {
+                            if (contentType === 'text/plain' || sitemapUrl.pathname.endsWith('.txt')) {
                                 return new SitemapTxtParser(parsingState, () => resolve(undefined));
                             }
 
diff --git a/packages/utils/test/sitemap.test.ts b/packages/utils/test/sitemap.test.ts
index b3791c52..1b0087ca 100644
--- a/packages/utils/test/sitemap.test.ts
+++ b/packages/utils/test/sitemap.test.ts
@@ -7,7 +7,9 @@ describe('Sitemap', () => {
     beforeEach(() => {
         nock.disableNetConnect();
         nock('http://not-exists.com').persist()
-            .get('/sitemap_child.xml')
+            .get(url => {
+                return url === '/sitemap_child.xml' || url === '/sitemap_child_2.xml';
+            })
             .reply(200, [
                 '<?xml version="1.0" encoding="UTF-8"?>',
                 '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
@@ -59,6 +61,10 @@ describe('Sitemap', () => {
                 '<loc>http://not-exists.com/sitemap_child.xml</loc>',
                 '<lastmod>2004-12-23</lastmod>',
                 '</sitemap>',
+                '<sitemap>',
+                '<loc>http://not-exists.com/sitemap_child_2.xml?from=94937939985&amp;to=1318570721404</loc>',
+                '<lastmod>2004-12-23</lastmod>',
+                '</sitemap>',
                 '</sitemapindex>',
             ].join('\n'))
             .get('/not_actual_xml.xml')
-- 
2.44.0
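
The detection logic that the patch adjusts can be sketched in isolation roughly as follows. This is a standalone illustration, not the crawlee implementation: `classifySitemapUrl`, `SitemapKind` and `SitemapFormat` are hypothetical names introduced here, and the content-type comparison simply mirrors the exact-match checks visible in the diff.

```typescript
type SitemapFormat = 'xml' | 'txt' | 'unknown';

interface SitemapKind {
    gzipped: boolean;
    format: SitemapFormat;
}

// Hypothetical helper (not part of crawlee): decides how a sitemap should be
// parsed, looking only at the URL path so that query strings are ignored.
function classifySitemapUrl(rawUrl: string, contentType?: string): SitemapKind {
    let { pathname } = new URL(rawUrl);

    // A trailing ".gz" means the body must be gunzipped; strip the suffix so
    // the inner extension (".xml" or ".txt") can be inspected next.
    const gzipped = pathname.endsWith('.gz');
    if (gzipped) {
        pathname = pathname.slice(0, -'.gz'.length);
    }

    if (['text/xml', 'application/xml'].includes(contentType ?? '') || pathname.endsWith('.xml')) {
        return { gzipped, format: 'xml' };
    }

    if (contentType === 'text/plain' || pathname.endsWith('.txt')) {
        return { gzipped, format: 'txt' };
    }

    return { gzipped, format: 'unknown' };
}

// Usage: the query string no longer hides the extension.
console.log(classifySitemapUrl('http://not-exists.com/sitemap_child_2.xml?from=1&to=2'));
console.log(classifySitemapUrl('http://not-exists.com/sitemap.txt.gz?page=3'));
```

Working on a local copy of `pathname` rather than mutating the `URL` object keeps the original URL intact for the actual request; whether that matters in the real code depends on how `sitemapUrl` is used elsewhere, which the diff does not show.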

austinbuckler commented on June 27, 2024

> Thank you very much! I assumed you'd open a pull request so that we 1. see if it passes tests and 2. can discuss it better. Also, if we accept that PR, you'll be listed as a contributor 🙂 Care to do that?

sure, will get that done this evening.

> Also please don't update changelogs, they are generated.

good to know, will adjust - the contribution document implies that it should be done manually 😅
