viasite / site-audit-seo

Web service and CLI tool for SEO site audit: crawls a site, runs Lighthouse on all pages, and lets you view public reports in the browser. Also outputs to console, JSON, CSV, and XLSX.

Home Page: http://json-viewer.popstas.pro/scan

Topics: crawler, scraper, puppeteer, seo, cli, xlsx, audit, seo-audit, site-audit, lighthouse

site-audit-seo's Introduction


Web service and CLI tool for SEO site audit: crawls a site, runs Lighthouse on all pages, and lets you view public reports in the browser. Also outputs to console, JSON, and CSV.

Web report viewer: json-viewer.

Russian description below.

Demo:

site-audit-demo (animated demo image)

Using without installing

Open https://json-viewer.popstas.pro/. The public server allows scanning up to 100 pages at a time.

Features:

  • Crawls the entire site, collects links to pages and documents
  • Does not follow links outside the scanned domain (configurable)
  • Analyses each page with Lighthouse (see below)
  • Analyses the main page text with Mozilla Readability and Yake
  • Finds pages with SSL mixed content
  • Scans a list of URLs with --url-list (see the example after this list)
  • Lets you set default report fields and filters
  • Scan presets
  • Documents with the extensions doc, docx, xls, xlsx, ppt, pptx, pdf, rar, zip are added to the list with depth == 0
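
For example, scanning a page that contains a list of URLs might look like this (a minimal sketch, modeled on the InfluxDB example later in this README; the URL is a placeholder):

site-audit-seo -u https://example.com/url-list.txt --url-list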

Technical details:

  • Does not load images, css, js (configurable)
  • Each site is saved to a file named after its domain in ~/site-audit-seo/
  • Some URLs are ignored (preRequest in src/scrap-site.js)

Web viewer features:

  • Fixed table header and URL column
  • Add/remove columns
  • Column presets
  • Field groups by category
  • Filter presets (e.g. h1_count != 1)
  • Color validation
  • Verbose page details ("+" button)
  • Direct URL to the same report with selected fields, filters, and sort
  • Stats for all scanned pages, validation summary
  • Persistent report URL when using --upload
  • Switch between recently uploaded reports
  • Rescan the current report

Fields list (18.08.2020):

  • url
  • mixed_content_url
  • canonical
  • is_canonical
  • previousUrl
  • depth
  • status
  • request_time
  • redirects
  • redirected_from
  • title
  • h1
  • page_date
  • description
  • keywords
  • og_title
  • og_image
  • schema_types
  • h1_count
  • h2_count
  • h3_count
  • h4_count
  • canonical_count
  • google_amp
  • images
  • images_without_alt
  • images_alt_empty
  • images_outer
  • links
  • links_inner
  • links_outer
  • text_ratio_percent
  • dom_size
  • html_size
  • html_size_rendered
  • lighthouse_scores_performance
  • lighthouse_scores_pwa
  • lighthouse_scores_accessibility
  • lighthouse_scores_best-practices
  • lighthouse_scores_seo
  • lighthouse_first-contentful-paint
  • lighthouse_speed-index
  • lighthouse_largest-contentful-paint
  • lighthouse_interactive
  • lighthouse_total-blocking-time
  • lighthouse_cumulative-layout-shift
  • and 150 more Lighthouse audits!

Install

Zero-knowledge install

Requires Docker.

Windows: download and run install-run.bat.

The script will clone the repository to %LocalAppData%\Programs\site-audit-seo and run the service at http://localhost:5302.

Linux/MacOS:

curl https://raw.githubusercontent.com/viasite/site-audit-seo/master/install-run.sh | bash

The script will clone the repository to $HOME/.local/share/programs/site-audit-seo and run the service at http://localhost:5302.

The service will be available at http://localhost:5302.

Default ports:
  • Backend: 5301
  • Frontend: 5302
  • Yake: 5303

You can change these in the .env file or in docker-compose.yml.
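
For illustration only, overriding the ports might look like the lines below; the variable names are assumptions, not taken from the project, so check the repository's own .env for the real keys:

# hypothetical .env values; the variable names are illustrative assumptions
BACKEND_PORT=5301
FRONTEND_PORT=5302
YAKE_PORT=5303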

Install with NPM:

npm install -g site-audit-seo

For Linux users

npm install -g site-audit-seo --unsafe-perm=true

After installing on Ubuntu, you may need to change the owner of the Chromium directory from root to your user.

Run this (replace $USER with your username, or run it as your user rather than as root):

sudo chown -R $USER:$USER "$(npm prefix -g)/lib/node_modules/site-audit-seo/node_modules/puppeteer/.local-chromium/"

Install a developer instance with docker-compose

git clone https://github.com/viasite/site-audit-seo
cd site-audit-seo
git clone https://github.com/viasite/site-audit-seo-viewer data/front
docker-compose pull # to skip the build step
docker-compose up -d

Error details: Invalid file descriptor to ICU data received.

Command line usage:

$ site-audit-seo --help
Usage: site-audit-seo -u https://example.com

Options:
  -u --urls <urls>                  Comma separated url list for scan
  -p, --preset <preset>             Table preset (minimal, seo, seo-minimal, headers, parse, lighthouse,
                                    lighthouse-all) (default: "seo")
  -t, --timeout <timeout>           Timeout for page request, in ms (default: 10000)
  -e, --exclude <fields>            Comma separated fields to exclude from results
  -d, --max-depth <depth>           Max scan depth (default: 10)
  -c, --concurrency <threads>       Threads number (default: by cpu cores)
  --lighthouse                      Appends base Lighthouse fields to preset
  --delay <ms>                      Delay between requests (default: 0)
  -f, --fields <json>               Field in format --field 'title=$("title").text()' (default: [])
  --default-filter <defaultFilter>  Default filter when JSON viewed, example: depth>1
  --no-skip-static                  Scan static files
  --no-limit-domain                 Scan not only current domain
  --docs-extensions <ext>           Comma-separated extensions that will be add to table (default:
                                    doc,docx,xls,xlsx,ppt,pptx,pdf,rar,zip)
  --follow-xml-sitemap              Follow sitemap.xml (default: false)
  --ignore-robots-txt               Ignore disallowed in robots.txt (default: false)
  --url-list                        assume that --url contains url list, will set -d 1 --no-limit-domain
                                    --ignore-robots-txt (default: false)
  --remove-selectors <selectors>    CSS selectors for remove before screenshot, comma separated (default:
                                    ".matter-after,#matter-1,[data-slug]")
  -m, --max-requests <num>          Limit max pages scan (default: 0)
  --influxdb-max-send <num>         Limit send to InfluxDB (default: 5)
  --no-headless                     Show browser GUI while scan
  --remove-csv                      Delete csv after json generate (default: true)
  --remove-json                     Delete json after serve (default: true)
  --no-remove-csv                   No delete csv after generate
  --no-remove-json                  No delete json after serve
  --out-dir <dir>                   Output directory (default: "~/site-audit-seo/")
  --out-name <name>                 Output file name, default: domain
  --csv <path>                      Skip scan, only convert existing csv to json
  --json                            Save as JSON (default: true)
  --no-json                         No save as JSON
  --upload                          Upload JSON to public web (default: false)
  --no-color                        No console colors
  --partial-report <partialReport>
  --lang <lang>                     Language (en, ru, default: system language)
  --no-console-validate             Don't output validate messages in console
  --disable-plugins <plugins>       Comma-separated plugin list (default: [])
  --screenshot                      Save page screenshot (default: false)
  -V, --version                     output the version number
  -h, --help                        display help for command
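
Putting several of these options together, a typical invocation might look like this (example.com is a placeholder):

site-audit-seo -u https://example.com -d 3 -c 2 --lighthouse --upload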

Custom fields

Linux/Mac:

site-audit-seo -d 1 -u https://example -f 'title=$("title").text()' -f 'h1=$("h1").text()'
site-audit-seo -d 1 -u https://example -f noindex=$('meta[content="noindex,%20nofollow"]').length

Windows:

site-audit-seo -d 1 -u https://example -f title=$('title').text() -f h1=$('h1').text()
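
As one more illustration (my own example, not from the original docs), a custom field that counts links carrying UTM parameters could be defined like this:

site-audit-seo -d 1 -u https://example.com -f 'utm_links=$("a[href*=utm_]").length'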

Remove fields from results

This outputs the fields from the seo preset, excluding the canonical fields:

site-audit-seo -u https://example.com --exclude canonical,is_canonical

Lighthouse

Analyse each page with Lighthouse:

site-audit-seo -u https://example.com --preset lighthouse

Analyse the seo preset + Lighthouse:

site-audit-seo -u https://example.com --lighthouse

Config file

You can copy .site-audit-seo.conf.js to your home directory and tune the options.
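
A minimal sketch of what such a config might contain; apart from the influxdb block shown in the next section, the key names here are illustrative assumptions, so consult the bundled .site-audit-seo.conf.js for the real ones:

module.exports = {
  // Hypothetical defaults; the key names below are illustrative assumptions.
  preset: 'seo',
  maxDepth: 5,
};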

Send to InfluxDB

This is a beta feature. To configure it:

  1. Add this to ~/.site-audit-seo.conf:
module.exports = {
  influxdb: {
    host: 'influxdb.host',
    port: 8086,
    database: 'telegraf',
    measurement: 'site_audit_seo', // optional
    username: 'user',
    password: 'password',
    maxSendCount: 5, // optional, default send part of pages
  }
};
  2. Use --influxdb-max-send in the terminal.

  3. Create a command to scan your URLs:

site-audit-seo -u https://page-with-url-list.txt --url-list --lighthouse --upload --influxdb-max-send 100 >> ~/log/site-audit-seo.log

  4. Add the command to cron.
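
For example, a weekly crontab entry might look like this (the schedule and paths are illustrative):

0 3 * * 1 site-audit-seo -u https://page-with-url-list.txt --url-list --lighthouse --upload --influxdb-max-send 100 >> ~/log/site-audit-seo.log 2>&1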

Plugins

  • Readability - main page text length, reading time
  • Yake - keywords extraction from main page text

See CONTRIBUTING.md for details about plugin development.

Install plugins:

cd data
npm install site-audit-seo-readability
npm install site-audit-seo-yake

Disable plugins:

You can pass an argument such as --disable-plugins readability,yake. This is faster, but less data is extracted.

Credits

Based on headless-chrome-crawler (Puppeteer). Uses the forked version @popstas/headless-chrome-crawler.

Bugs

  1. Sometimes identical pages are written to the CSV. This happens in two cases:
     1.1. A redirect from another page to this one (solved by setting skipRequestedRedirect: true, hardcoded).
     1.2. Simultaneous requests for the same page in parallel threads.

Free audit tool alternatives

Free data scrapers

  • Web Scraper - browser extension, free for local use
  • Portia - self-hosted visual scraper builder, based on Scrapy
  • Crawlab - distributed web crawler admin platform, self-hosted with Docker
  • OutWit Hub - free edition, pro edition for $99
  • Octoparse - 10,000 records free
  • Parsers.me - 1,000 pages per run free
  • website-scraper - open source, CLI, downloads a site to a local directory
  • website-scraper-puppeteer - the same, but Puppeteer-based
  • Gerapy - distributed crawler management framework based on Scrapy, Scrapyd, Django and Vue.js

Russian

Scans one or several sites into a JSON file, with a web interface.

Features:

  • Crawls the entire site, collects links to pages and documents
  • Summary of results after scanning
  • Documents with the extensions doc, docx, xls, xlsx, pdf, rar, zip are added to the list with depth 0
  • Finds pages with SSL mixed content
  • Each site is saved to a file named after its domain
  • Does not follow links outside the scanned domain (configurable)
  • Does not load images, css, js (configurable)
  • Some URLs are ignored (preRequest in src/scrap-site.js)
  • Each page can be run through Lighthouse (see below)
  • Scans an arbitrary list of URLs, --url-list

Installation:

npm install -g site-audit-seo

If you are on Ubuntu

npm install -g site-audit-seo --unsafe-perm=true
npm run postinstall-puppeteer-fix

Or run this (replace $USER with your user, or run as your user, not as root):

sudo chown -R $USER:$USER "$(npm prefix -g)/lib/node_modules/site-audit-seo/node_modules/puppeteer/.local-chromium/"

Error details: Invalid file descriptor to ICU data received.

Usage

site-audit-seo -u https://example.com

Custom fields

You can pass additional fields like this:

site-audit-seo -d 1 -u https://example -f "title=$('title').text()" -f "h1=$('h1').text()"

Lighthouse

Run each page through Lighthouse:

site-audit-seo -u https://example.com --preset lighthouse

Regular SEO audit + Lighthouse:

site-audit-seo -u https://example.com --lighthouse

How to count content in the CSV

  1. Open it in a text editor
  2. Count documents by searching for ",0"
  3. Exclude pagination pages by searching for "?"
  4. Subtract 1 (the header row)

Bugs

  1. Sometimes identical pages are written to the CSV. This happens in two cases:
     1.1. A redirect from another page to this one (solved by setting skipRequestedRedirect: true, done).
     1.2. Simultaneous requests for the same page in parallel threads.

TODO:

site-audit-seo's People

Contributors

michalski-luc, popstas


site-audit-seo's Issues

Error: listen EADDRINUSE: address already in use :::3001

JSON file: http://localhost:3001/data.json

Dev viewer: http://localhost:3000/?url=http://localhost:3001/data.json

Online viewer: https://viasite.github.io/site-audit-seo-viewer/?url=http://localhost:3001/data.json

Finish: 7.9 mins (0.14 sec per page)
events.js:287
throw er; // Unhandled 'error' event
^

Error: listen EADDRINUSE: address already in use :::3001
at Server.setupListenHandle [as _listen2] (net.js:1313:16)
at listenInCluster (net.js:1361:12)
at Server.listen (net.js:1449:7)
at Function.listen (/usr/local/lib/node_modules/site-audit-seo/node_modules/express/lib/application.js:618:24)
at module.exports (/usr/local/lib/node_modules/site-audit-seo/src/actions/startViewer.js:43:9)
at finishScan (/usr/local/lib/node_modules/site-audit-seo/src/scrap-site.js:476:13)
at async tryFinish (/usr/local/lib/node_modules/site-audit-seo/src/scrap-site.js:491:7)
at async module.exports (/usr/local/lib/node_modules/site-audit-seo/src/scrap-site.js:509:3)
at async start (/usr/local/lib/node_modules/site-audit-seo/src/index.js:217:5)
Emitted 'error' event on Server instance at:
at emitErrorNT (net.js:1340:8)
at processTicksAndRejections (internal/process/task_queues.js:84:21) {
code: 'EADDRINUSE',
errno: 'EADDRINUSE',
syscall: 'listen',
address: '::',
port: 3001
}
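
A likely cause, judging only from the message (my reading, not an official fix): a report viewer from a previous run is still listening on port 3001. On Linux or macOS the process can usually be found and stopped with:

lsof -ti :3001 | xargs kill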

Error: connect ECONNREFUSED

Hi, I keep getting this error after the command: site-audit-seo -u https://example.com --lighthouse
error: node:internal/process/promises:288
triggerUncaughtException(err, true /* fromPromise */);
^

Error: connect ECONNREFUSED ::1:57468
at TCPConnectWrap.afterConnect [as oncomplete] (node:net:1300:16) {
errno: -4078,
code: 'ECONNREFUSED',
syscall: 'connect',
address: '::1',
port: 57468
}

Node.js v18.12.1

Upgrade packages

Is this package still being maintained?

I just did a fresh install and got the following warnings:

npm WARN deprecated [email protected]: Critical security vulnerability fixed in v0.21.1. For more information, see https://github.com/axios/axios/pull/3410
npm WARN deprecated [email protected]: request-promise has been deprecated because it extends the now deprecated request package, see https://github.com/request/request/issues/3142
npm WARN deprecated [email protected]: request has been deprecated, see https://github.com/request/request/issues/3142
npm WARN deprecated [email protected]: < 18.1.0 is no longer supported
npm WARN deprecated [email protected]: this library is no longer supported
npm WARN deprecated [email protected]: Please upgrade  to version 7 or higher.  Older versions may use Math.random() in certain circumstances, which is known to be problematic.  See https://v8.dev/blog/math-random for details.
npm WARN deprecated [email protected]: Please upgrade to @sentry/node. See the migration guide https://bit.ly/3ybOlo7
npm WARN deprecated [email protected]: We've written a new parser that's 6x faster and is backwards compatible. Please use @formatjs/icu-messageformat-parser
npm WARN deprecated [email protected]: request has been deprecated, see https://github.com/request/request/issues/3142
npm WARN deprecated [email protected]: Deprecated due to CVE-2021-21366 resolved in 0.5.0
npm WARN deprecated [email protected]: this library is no longer supported
npm WARN deprecated [email protected]: Please upgrade  to version 7 or higher.  Older versions may use Math.random() in certain circumstances, which is known to be problematic.  See https://v8.dev/blog/math-random for details.
npm WARN deprecated [email protected]: Debug versions >=3.2.0 <3.2.7 || >=4 <4.3.1 have a low-severity ReDos regression when used in a Node.js environment. It is recommended you upgrade to 3.2.7 or 4.3.1. (https://github.com/visionmedia/debug/issues/797)

sitemap index / robots.txt parsers

Hi,

Hope you are all well ! And merry Christmas first of all !

I was playing today with site-audit-seo and I was missing some features, like a robots.txt parser to find the available sitemaps for a website and a related sitemap extractor (handling sitemap indexes as well).

Do you think it is possible to add these 2 components easily?

Please find below some references that I found:

Thanks for any insights or inputs on that.

PS. Do you have a Telegram account? I have some questions for you and do not want to pollute this thread.
My handle is "deepocrates"

Cheers,
Luc Michalski

Docker Bug - Missing Folders and Module

Hey Guys,

I really like your solution to get some statistics about homepages and SEO-Stuff.

I tried your provided solution and it works perfectly.

After it I tried to run the docker-compose section and got these errors:

$ docker-compose up -d

ERROR: build path /var/docker/config/pagespeed-insights-lighthouse/site-audit-seo/data/front either does not exist, is not accessible, or is not a valid URL.

So I created the folder-path
$ mkdir -p data/front

$ docker-compose up -d

sas-backend is always restarting
$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
e1510237a6a4 viasite/site-audit-seo:latest "docker-entrypoint.s…" 16 seconds ago Restarting (1) 1 second ago sas-backend

$ docker logs sas-backend

[email protected] server /app
node src/server.js

loaded plugins: export-influxdb
Create empty package.json in data
internal/fs/utils.js:307
throw err;
^

Error: EACCES: permission denied, copyfile './package-data.json' -> 'data/package.json'
at Object.copyFileSync (fs.js:1991:3)
at Object.exports.initDataDir (/app/src/utils.js:22:8)
at Object. (/app/src/server.js:17:7)
at Module._compile (internal/modules/cjs/loader.js:1063:30)
at Object.Module._extensions..js (internal/modules/cjs/loader.js:1092:10)
at Module.load (internal/modules/cjs/loader.js:928:32)
at Function.Module._load (internal/modules/cjs/loader.js:769:14)
at Function.executeUserEntryPoint [as runMain] (internal/modules/run_main.js:72:12)
at internal/main/run_main_module.js:17:47 {
errno: -13,
syscall: 'copyfile',
code: 'EACCES',
path: './package-data.json',
dest: 'data/package.json'
}
npm ERR! code ELIFECYCLE
npm ERR! errno 1
npm ERR! [email protected] server: node src/server.js
npm ERR! Exit status 1
npm ERR!
npm ERR! Failed at the [email protected] server script.
npm ERR! This is probably not a problem with npm. There is likely additional logging output above.

npm ERR! A complete log of this run can be found in:
npm ERR! /home/node/.npm/_logs/2021-06-09T15_23_14_901Z-debug.log

I added the app path in the docker-compose file:
volumes:
- .:/app <------------
- ./data:/app/data
- ./data/reports:/app/data/reports
- ./data/db-docker.json:/app/data/db.json

$ docker-compose down
$ docker-compose up -d
$ docker logs sas-backend

After this I get this error

[email protected] server /app
node src/server.js

internal/modules/cjs/loader.js:883
throw err;
^

Error: Cannot find module 'lowdb'
Require stack:

  • /app/src/server.js
    at Function.Module._resolveFilename (internal/modules/cjs/loader.js:880:15)
    at Function.Module._load (internal/modules/cjs/loader.js:725:27)
    at Module.require (internal/modules/cjs/loader.js:952:19)
    at require (internal/modules/cjs/helpers.js:88:18)
    at Object. (/app/src/server.js:2:15)
    at Module._compile (internal/modules/cjs/loader.js:1063:30)
    at Object.Module._extensions..js (internal/modules/cjs/loader.js:1092:10)
    at Module.load (internal/modules/cjs/loader.js:928:32)
    at Function.Module._load (internal/modules/cjs/loader.js:769:14)
    at Function.executeUserEntryPoint [as runMain] (internal/modules/run_main.js:72:12) {
    code: 'MODULE_NOT_FOUND',
    requireStack: [ '/app/src/server.js' ]
    }
    npm ERR! code ELIFECYCLE
    npm ERR! errno 1
    npm ERR! [email protected] server: node src/server.js
    npm ERR! Exit status 1
    npm ERR!
    npm ERR! Failed at the [email protected] server script.
    npm ERR! This is probably not a problem with npm. There is likely additional logging output above.
    npm WARN Local package.json exists, but node_modules missing, did you mean to install?

npm ERR! A complete log of this run can be found in:
npm ERR! /home/node/.npm/_logs/2021-06-09T15_26_01_533Z-debug.log

Now I'm without ideas, I hope you can help me or fix the image :)

Thanks for your great work :)
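
A plausible explanation, offered as an assumption rather than a confirmed fix: the container runs as an unprivileged user, so the mounted ./data directory must be writable by it, and mounting the whole repository over /app hides the node_modules installed in the image (hence the missing lowdb module). Something like the following may help:

# uid 1000 is the default "node" user in official Node images (an assumption about this image)
sudo chown -R 1000:1000 ./data
# and remove the ".:/app" volume line so the image's own /app, including node_modules, is used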

Licensing recommendations

I see the license is listed as ISC. I would recommend using a copyleft license instead as successful permissively licensed projects tend to be used by companies within proprietary projects and they typically never give back. If you go with copyleft, I would recommend AGPL-3.0-or-later or at least GPL-3.0-or-later. If you stick with a permissive license, I would recommend Apache-2.0.

I would further recommend making a license file in the root directory of the repo. If you create a LICENSE file, this information will be displayed within the first page view for the project. GitHub walks you through the steps after naming a new file LICENSE.

[Features] logs fetch errors / audit time estimation / slack / queue in kafka or redis

Hi @popstas ,

Hope you are all well ! And Happy New Year to all !

I had a play with site-audit-seo and figured out some things to add:

  • Slack notification (I already started to implement that one), but something more pluggable would be nice (Slack, Jira, Email, ...)
  • Estimate the audit time when following the sitemap is enabled, assuming an average of 1 s per page
  • Create a queue of URLs to crawl in Redis or Kafka per audit, so tasks can resume if the server restarts
  • Log errors when a request fails; this can help identify failing URL rewriting rules


And again, you did an amazing job with this project. Thanks again and again :-)

Thanks for any insights or inputs on that

Cheers,
Luc Michalski

doesn't work

don't plan to use it unless it's a .ru site

A magical grafana/influxdb support :-)

Hi guys,

Hope you are all well !

Would it be possible to send your metrics into InfluxDB so we can display them in a Grafana dashboard?

Just check this example and you'll get what I mean: https://github.com/FeliceGeracitano/webperf-dashboard

That would be awesome, as it would be easy to integrate site-audit-seo-viewer into a dashboard.

I can help :-)

Thanks for any insights or inputs on that.

Cheers,
Luc Michalski

Make a Docker image

This is a very useful tool, and it would be very convenient to run it in a Docker container.

Scan stops without error or info

I'm running a scan with:

site-audit-seo -u https://domain.com -c 3 --preset=seo -m 1000

It never gets to 1000 URLs because it just stops after some time and then nothing happens.
Sometimes it's after 200 URLs, sometimes after just 10.
There's no error or other info, so I just have to kill the process and try again.

How can I debug this?
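
One way to narrow this down, suggested here only on the basis of the documented flags (not a confirmed fix): rerun with a single thread, a longer timeout, and a visible browser, and watch where it stalls:

site-audit-seo -u https://domain.com -c 1 -t 30000 --no-headless --preset=seo -m 1000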
