Comments (7)
Hello @austinbuckler! Do I understand correctly that the problem is that the sitemap content type is not detected correctly because the extension is followed by a query string?
If that's the case, the solution that you propose should help. If you want to contribute a patch, feel free to do so - we'll be grateful to accept that!
from crawlee.
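For illustration, a standalone sketch of the failure mode (hypothetical URL, not Crawlee's actual code): an extension check on the raw URL string fails as soon as a query string follows the extension, while checking the `pathname` of a parsed `URL` still works.

```typescript
// A sitemap URL whose ".xml" extension is followed by a query string.
const sitemapUrl = "http://example.com/sitemap_child.xml?from=123&to=456";

// Naive check on the full string: false, because the string ends
// with the query string rather than ".xml".
console.log(sitemapUrl.endsWith(".xml")); // false

// Checking the pathname of a parsed URL ignores the query string.
console.log(new URL(sitemapUrl).pathname.endsWith(".xml")); // true
```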
> Do I understand correctly that the problem is that the sitemap content type is not detected correctly because the extension is followed by a query string?

You are correct!

> If that's the case, the solution that you propose should help. If you want to contribute a patch, feel free to do so - we'll be grateful to accept that!

Awesome, will submit a patch this week. Thank you for the prompt response!
@janbuchar here is the patch 🥂
From: Austin <[email protected]>
Date: Thu, 18 Apr 2024 06:59:13 +0000
Subject: [PATCH] fix: malformed sitemap when child loc contains querystrings.
---
packages/utils/CHANGELOG.md | 12 ++++++++++++
packages/utils/src/internals/sitemap.ts | 12 ++++++------
packages/utils/test/sitemap.test.ts | 8 +++++++-
3 files changed, 25 insertions(+), 7 deletions(-)
diff --git a/packages/utils/CHANGELOG.md b/packages/utils/CHANGELOG.md
index 70181f66..dd1ec681 100644
--- a/packages/utils/CHANGELOG.md
+++ b/packages/utils/CHANGELOG.md
@@ -3,6 +3,18 @@
All notable changes to this project will be documented in this file.
See [Conventional Commits](https://conventionalcommits.org) for commit guidelines.
+
+## [3.9.3](https://github.com/apify/crawlee/compare/v3.9.2...v3.9.3) (2024-04-18)
+
+
+### Features
+
+* **sitemap:** Support for querystrings in sitemap child urls. ([#2420](https://github.com/apify/crawlee/issues/2420))
+
+
+
+
+
## [3.9.2](https://github.com/apify/crawlee/compare/v3.9.1...v3.9.2) (2024-04-17)
diff --git a/packages/utils/src/internals/sitemap.ts b/packages/utils/src/internals/sitemap.ts
index bf3c88e3..c25d891d 100644
--- a/packages/utils/src/internals/sitemap.ts
+++ b/packages/utils/src/internals/sitemap.ts
@@ -149,8 +149,8 @@ export class Sitemap {
parsingState.sitemapUrls = Array.isArray(urls) ? urls : [urls];
while (parsingState.sitemapUrls.length > 0) {
- let sitemapUrl = parsingState.sitemapUrls.pop()!;
- parsingState.visitedSitemapUrls.push(sitemapUrl);
+ let sitemapUrl = new URL(parsingState.sitemapUrls.pop()!);
+ parsingState.visitedSitemapUrls.push(sitemapUrl.toString());
parsingState.resetContext();
try {
@@ -163,19 +163,19 @@ export class Sitemap {
if (sitemapStream.response!.statusCode === 200) {
await new Promise((resolve, reject) => {
let stream: Duplex = sitemapStream;
- if (sitemapUrl.endsWith('.gz')) {
+ if (sitemapUrl.pathname.endsWith('.gz')) {
stream = stream.pipe(createGunzip()).on('error', reject);
- sitemapUrl = sitemapUrl.substring(0, sitemapUrl.length - 3);
+ sitemapUrl.pathname = sitemapUrl.pathname.substring(0, sitemapUrl.pathname.length - 3);
}
const parser = (() => {
const contentType = sitemapStream.response!.headers['content-type'];
- if (['text/xml', 'application/xml'].includes(contentType ?? '') || sitemapUrl.endsWith('.xml')) {
+ if (['text/xml', 'application/xml'].includes(contentType ?? '') || sitemapUrl.pathname.endsWith('.xml')) {
return Sitemap.createXmlParser(parsingState, () => resolve(undefined), reject);
}
- if (contentType === 'text/plain' || sitemapUrl.endsWith('.txt')) {
+ if (contentType === 'text/plain' || sitemapUrl.pathname.endsWith('.txt')) {
return new SitemapTxtParser(parsingState, () => resolve(undefined));
}
diff --git a/packages/utils/test/sitemap.test.ts b/packages/utils/test/sitemap.test.ts
index b3791c52..1b0087ca 100644
--- a/packages/utils/test/sitemap.test.ts
+++ b/packages/utils/test/sitemap.test.ts
@@ -7,7 +7,9 @@ describe('Sitemap', () => {
beforeEach(() => {
nock.disableNetConnect();
nock('http://not-exists.com').persist()
- .get('/sitemap_child.xml')
+ .get(url => {
+ return url === '/sitemap_child.xml' || url.startsWith('/sitemap_child_2.xml');
+ })
.reply(200, [
'<?xml version="1.0" encoding="UTF-8"?>',
'<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
@@ -59,6 +61,10 @@ describe('Sitemap', () => {
'<loc>http://not-exists.com/sitemap_child.xml</loc>',
'<lastmod>2004-12-23</lastmod>',
'</sitemap>',
+ '<sitemap>',
+ '<loc>http://not-exists.com/sitemap_child_2.xml?from=94937939985&to=1318570721404</loc>',
+ '<lastmod>2004-12-23</lastmod>',
+ '</sitemap>',
'</sitemapindex>',
].join('\n'))
.get('/not_actual_xml.xml')
--
2.44.0
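For context, a minimal standalone sketch (hypothetical URL, not the patched Crawlee internals) of why the patch moves the `.gz` handling onto `URL.pathname`: rewriting the pathname strips the `.gz` suffix while the query string survives the rename, whereas the old check on the raw string never even fired when a query string was present.

```typescript
// Hypothetical gzipped sitemap URL with a query string.
const sitemapUrl = new URL("http://example.com/sitemap.xml.gz?from=1&to=2");

// The old string-based check `rawUrl.endsWith('.gz')` would be false
// here because of the query string; the pathname check still matches.
if (sitemapUrl.pathname.endsWith(".gz")) {
    // Strip ".gz" from the pathname only; the query string is untouched.
    sitemapUrl.pathname = sitemapUrl.pathname.slice(0, -".gz".length);
}

console.log(sitemapUrl.toString()); // http://example.com/sitemap.xml?from=1&to=2
```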
> Thank you very much! I assumed you'd open a pull request so that we 1. see if it passes tests and 2. can discuss it better. Also, if we accept that PR, you'll be listed as a contributor 🙂 Care to do that?

sure, will get that done this evening.
> Also please don't update changelogs, they are generated.

good to know, will adjust — the contribution document implies that it should be done manually 😅

Ah, it's great that somebody actually reads those 😁 We'll update it!