Comments (7)
Hello @austinbuckler! Do I understand correctly that the problem is that the sitemap content type is not detected correctly because the extension is followed by a query string?
If that's the case, the solution that you propose should help. If you want to contribute a patch, feel free to do so - we'll be grateful to accept that!
from crawlee.
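For illustration, a standalone sketch of the failure mode (hypothetical URL, not Crawlee's actual code): an extension check on the raw URL string fails as soon as a query string follows the extension, while checking the `pathname` of a parsed `URL` still works.

```typescript
// A sitemap URL whose ".xml" extension is followed by a query string.
const sitemapUrl = "http://example.com/sitemap_child.xml?from=123&to=456";

// Naive check on the full string: false, because the string ends
// with the query string rather than ".xml".
console.log(sitemapUrl.endsWith(".xml")); // false

// Checking the pathname of a parsed URL ignores the query string.
console.log(new URL(sitemapUrl).pathname.endsWith(".xml")); // true
```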
> Do I understand correctly that the problem is that the sitemap content type is not detected correctly because the extension is followed by a query string?

You are correct!

> If that's the case, the solution that you propose should help. If you want to contribute a patch, feel free to do so - we'll be grateful to accept that!

Awesome, will submit a patch this week. Thank you for the prompt response!
@janbuchar here is the patch 🥂
From: Austin <[email protected]>
Date: Thu, 18 Apr 2024 06:59:13 +0000
Subject: [PATCH] fix: malformed sitemap when child loc contains querystrings.
---
packages/utils/CHANGELOG.md | 12 ++++++++++++
packages/utils/src/internals/sitemap.ts | 12 ++++++------
packages/utils/test/sitemap.test.ts | 8 +++++++-
3 files changed, 25 insertions(+), 7 deletions(-)
diff --git a/packages/utils/CHANGELOG.md b/packages/utils/CHANGELOG.md
index 70181f66..dd1ec681 100644
--- a/packages/utils/CHANGELOG.md
+++ b/packages/utils/CHANGELOG.md
@@ -3,6 +3,18 @@
All notable changes to this project will be documented in this file.
See [Conventional Commits](https://conventionalcommits.org) for commit guidelines.
+
+## [3.9.3](https://github.com/apify/crawlee/compare/v3.9.2...v3.9.3) (2024-04-18)
+
+
+### Features
+
+* **sitemap:** Support for querystrings in sitemap child urls. ([#2420](https://github.com/apify/crawlee/issues/2420))
+
+
+
+
+
## [3.9.2](https://github.com/apify/crawlee/compare/v3.9.1...v3.9.2) (2024-04-17)
diff --git a/packages/utils/src/internals/sitemap.ts b/packages/utils/src/internals/sitemap.ts
index bf3c88e3..c25d891d 100644
--- a/packages/utils/src/internals/sitemap.ts
+++ b/packages/utils/src/internals/sitemap.ts
@@ -149,8 +149,8 @@ export class Sitemap {
parsingState.sitemapUrls = Array.isArray(urls) ? urls : [urls];
while (parsingState.sitemapUrls.length > 0) {
- let sitemapUrl = parsingState.sitemapUrls.pop()!;
- parsingState.visitedSitemapUrls.push(sitemapUrl);
+ let sitemapUrl = new URL(parsingState.sitemapUrls.pop()!);
+ parsingState.visitedSitemapUrls.push(sitemapUrl.toString());
parsingState.resetContext();
try {
@@ -163,19 +163,19 @@ export class Sitemap {
if (sitemapStream.response!.statusCode === 200) {
await new Promise((resolve, reject) => {
let stream: Duplex = sitemapStream;
- if (sitemapUrl.endsWith('.gz')) {
+ if (sitemapUrl.pathname.endsWith('.gz')) {
stream = stream.pipe(createGunzip()).on('error', reject);
- sitemapUrl = sitemapUrl.substring(0, sitemapUrl.length - 3);
+ sitemapUrl.pathname = sitemapUrl.pathname.substring(0, sitemapUrl.pathname.length - 3);
}
const parser = (() => {
const contentType = sitemapStream.response!.headers['content-type'];
- if (['text/xml', 'application/xml'].includes(contentType ?? '') || sitemapUrl.endsWith('.xml')) {
+ if (['text/xml', 'application/xml'].includes(contentType ?? '') || sitemapUrl.pathname.endsWith('.xml')) {
return Sitemap.createXmlParser(parsingState, () => resolve(undefined), reject);
}
- if (contentType === 'text/plain' || sitemapUrl.endsWith('.txt')) {
+ if (contentType === 'text/plain' || sitemapUrl.pathname.endsWith('.txt')) {
return new SitemapTxtParser(parsingState, () => resolve(undefined));
}
diff --git a/packages/utils/test/sitemap.test.ts b/packages/utils/test/sitemap.test.ts
index b3791c52..1b0087ca 100644
--- a/packages/utils/test/sitemap.test.ts
+++ b/packages/utils/test/sitemap.test.ts
@@ -7,7 +7,9 @@ describe('Sitemap', () => {
beforeEach(() => {
nock.disableNetConnect();
nock('http://not-exists.com').persist()
- .get('/sitemap_child.xml')
+ .get(url => {
+ return url === '/sitemap_child.xml' || url.startsWith('/sitemap_child_2.xml');
+ })
.reply(200, [
'<?xml version="1.0" encoding="UTF-8"?>',
'<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
@@ -59,6 +61,10 @@ describe('Sitemap', () => {
'<loc>http://not-exists.com/sitemap_child.xml</loc>',
'<lastmod>2004-12-23</lastmod>',
'</sitemap>',
+ '<sitemap>',
+ '<loc>http://not-exists.com/sitemap_child_2.xml?from=94937939985&to=1318570721404</loc>',
+ '<lastmod>2004-12-23</lastmod>',
+ '</sitemap>',
'</sitemapindex>',
].join('\n'))
.get('/not_actual_xml.xml')
--
2.44.0
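For context, a minimal standalone sketch (hypothetical URL, not the patched Crawlee internals) of why the patch moves the `.gz` handling onto `URL.pathname`: rewriting the pathname strips the `.gz` suffix while the query string survives the rename, whereas the old check on the raw string never even fired when a query string was present.

```typescript
// Hypothetical gzipped sitemap URL with a query string.
const sitemapUrl = new URL("http://example.com/sitemap.xml.gz?from=1&to=2");

// The old string-based check `rawUrl.endsWith('.gz')` would be false
// here because of the query string; the pathname check still matches.
if (sitemapUrl.pathname.endsWith(".gz")) {
    // Strip ".gz" from the pathname only; the query string is untouched.
    sitemapUrl.pathname = sitemapUrl.pathname.slice(0, -".gz".length);
}

console.log(sitemapUrl.toString()); // http://example.com/sitemap.xml?from=1&to=2
```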
> Thank you very much! I assumed you'd open a pull request so that we 1. see if it passes tests and 2. can discuss it better. Also, if we accept that PR, you'll be listed as a contributor 🙂 Care to do that?

sure, will get that done this evening.
> Also please don't update changelogs, they are generated.

good to know, will adjust — the contribution document implies that it should be done manually 😅

Ah, it's great that somebody actually reads those 😁 We'll update it!