
katana's Introduction

katana

A next-generation crawling and spidering framework

Features • Installation • Usage • Scope • Config • Filters • Join Discord

Features


  • Fast And fully configurable web crawling
  • Standard and Headless mode
  • Active and Passive mode
  • JavaScript parsing / crawling
  • Customizable automatic form filling
  • Scope control - Preconfigured field / Regex
  • Customizable output - Preconfigured fields
  • INPUT - STDIN, URL and LIST
  • OUTPUT - STDOUT, FILE and JSON

Installation

katana requires Go 1.18 to install successfully. To install, just run the command below or download a pre-compiled binary from the release page.

go install github.com/projectdiscovery/katana/cmd/katana@latest

More options to install / run katana-

Docker

To install / update the docker image to the latest tag -

docker pull projectdiscovery/katana:latest

To run katana in standard mode using docker -

docker run projectdiscovery/katana:latest -u https://tesla.com

To run katana in headless mode using docker -

docker run projectdiscovery/katana:latest -u https://tesla.com -system-chrome -headless
Ubuntu

It's recommended to install the following prerequisites -

sudo apt update
sudo snap refresh
sudo apt install zip curl wget git
sudo snap install golang --classic
wget -q -O - https://dl-ssl.google.com/linux/linux_signing_key.pub | sudo apt-key add - 
sudo sh -c 'echo "deb http://dl.google.com/linux/chrome/deb/ stable main" >> /etc/apt/sources.list.d/google.list'
sudo apt update 
sudo apt install google-chrome-stable

Install katana -

go install github.com/projectdiscovery/katana/cmd/katana@latest

Usage

katana -h

This will display help for the tool. Here are all the switches it supports.

Katana is a fast crawler focused on execution in automation
pipelines offering both headless and non-headless crawling.

Usage:
  ./katana [flags]

Flags:
INPUT:
   -u, -list string[]  target url / list to crawl
   -resume string      resume scan using resume.cfg
   -e, -exclude string[]  exclude host matching specified filter ('cdn', 'private-ips', cidr, ip, regex)

CONFIGURATION:
   -r, -resolvers string[]       list of custom resolver (file or comma separated)
   -d, -depth int                maximum depth to crawl (default 3)
   -jc, -js-crawl                enable endpoint parsing / crawling in javascript file
   -jsl, -jsluice                enable jsluice parsing in javascript file (memory intensive)
   -ct, -crawl-duration value    maximum duration to crawl the target for (s, m, h, d) (default s)
   -kf, -known-files string      enable crawling of known files (all,robotstxt,sitemapxml), a minimum depth of 3 is required to ensure all known files are properly crawled.
   -mrs, -max-response-size int  maximum response size to read (default 9223372036854775807)
   -timeout int                  time to wait for request in seconds (default 10)
   -aff, -automatic-form-fill    enable automatic form filling (experimental)
   -fx, -form-extraction         extract form, input, textarea & select elements in jsonl output
   -retry int                    number of times to retry the request (default 1)
   -proxy string                 http/socks5 proxy to use
   -H, -headers string[]         custom header/cookie to include in all http request in header:value format (file)
   -config string                path to the katana configuration file
   -fc, -form-config string      path to custom form configuration file
   -flc, -field-config string    path to custom field configuration file
   -s, -strategy string          Visit strategy (depth-first, breadth-first) (default "depth-first")
   -iqp, -ignore-query-params    Ignore crawling same path with different query-param values
   -tlsi, -tls-impersonate       enable experimental client hello (ja3) tls randomization
   -dr, -disable-redirects       disable following redirects (default false)

DEBUG:
   -health-check, -hc        run diagnostic check up
   -elog, -error-log string  file to write sent requests error log

HEADLESS:
   -hl, -headless                    enable headless hybrid crawling (experimental)
   -sc, -system-chrome               use local installed chrome browser instead of katana installed
   -sb, -show-browser                show the browser on the screen with headless mode
   -ho, -headless-options string[]   start headless chrome with additional options
   -nos, -no-sandbox                 start headless chrome in --no-sandbox mode
   -cdd, -chrome-data-dir string     path to store chrome browser data
   -scp, -system-chrome-path string  use specified chrome browser for headless crawling
   -noi, -no-incognito               start headless chrome without incognito mode
   -cwu, -chrome-ws-url string       use chrome browser instance launched elsewhere with the debugger listening at this URL
   -xhr, -xhr-extraction             extract xhr request url,method in jsonl output

PASSIVE:
   -ps, -passive                   enable passive sources to discover target endpoints
   -pss, -passive-source string[]  passive source to use for url discovery (waybackarchive,commoncrawl,alienvault)

SCOPE:
   -cs, -crawl-scope string[]       in scope url regex to be followed by crawler
   -cos, -crawl-out-scope string[]  out of scope url regex to be excluded by crawler
   -fs, -field-scope string         pre-defined scope field (dn,rdn,fqdn) or custom regex (e.g., '(company-staging.io|company.com)') (default "rdn")
   -ns, -no-scope                   disables host based default scope
   -do, -display-out-scope          display external endpoint from scoped crawling

FILTER:
   -mr, -match-regex string[]       regex or list of regex to match on output url (cli, file)
   -fr, -filter-regex string[]      regex or list of regex to filter on output url (cli, file)
   -f, -field string                field to display in output (url,path,fqdn,rdn,rurl,qurl,qpath,file,ufile,key,value,kv,dir,udir)
   -sf, -store-field string         field to store in per-host output (url,path,fqdn,rdn,rurl,qurl,qpath,file,ufile,key,value,kv,dir,udir)
   -em, -extension-match string[]   match output for given extension (eg, -em php,html,js)
   -ef, -extension-filter string[]  filter output for given extension (eg, -ef png,css)
   -mdc, -match-condition string    match response with dsl based condition
   -fdc, -filter-condition string   filter response with dsl based condition

RATE-LIMIT:
   -c, -concurrency int          number of concurrent fetchers to use (default 10)
   -p, -parallelism int          number of concurrent inputs to process (default 10)
   -rd, -delay int               request delay between each request in seconds
   -rl, -rate-limit int          maximum requests to send per second (default 150)
   -rlm, -rate-limit-minute int  maximum number of requests to send per minute

UPDATE:
   -up, -update                 update katana to latest version
   -duc, -disable-update-check  disable automatic katana update check

OUTPUT:
   -o, -output string                file to write output to
   -sr, -store-response              store http requests/responses
   -srd, -store-response-dir string  store http requests/responses to custom directory
   -or, -omit-raw                    omit raw requests/responses from jsonl output
   -ob, -omit-body                   omit response body from jsonl output
   -j, -jsonl                        write output in jsonl format
   -nc, -no-color                    disable output content coloring (ANSI escape codes)
   -silent                           display output only
   -v, -verbose                      display verbose output
   -debug                            display debug output
   -version                          display project version

Running Katana

Input for katana

katana requires a URL or endpoint to crawl and accepts single or multiple inputs.

The input URL can be provided using the -u option, and multiple values can be provided as comma-separated input. Similarly, file input is supported using the -list option, and piped (stdin) input is also supported.

URL Input

katana -u https://tesla.com

Multiple URL Input (comma-separated)

katana -u https://tesla.com,https://google.com

List Input

$ cat url_list.txt

https://tesla.com
https://google.com
katana -list url_list.txt

STDIN (piped) Input

echo https://tesla.com | katana
cat domains | httpx | katana

Example running katana -

katana -u https://youtube.com

   __        __                
  / /_____ _/ /____ ____  ___ _
 /  '_/ _  / __/ _  / _ \/ _  /
/_/\_\\_,_/\__/\_,_/_//_/\_,_/ v0.0.1                     

      projectdiscovery.io

[WRN] Use with caution. You are responsible for your actions.
[WRN] Developers assume no liability and are not responsible for any misuse or damage.
https://www.youtube.com/
https://www.youtube.com/about/
https://www.youtube.com/about/press/
https://www.youtube.com/about/copyright/
https://www.youtube.com/t/contact_us/
https://www.youtube.com/creators/
https://www.youtube.com/ads/
https://www.youtube.com/t/terms
https://www.youtube.com/t/privacy
https://www.youtube.com/about/policies/
https://www.youtube.com/howyoutubeworks?utm_campaign=ytgen&utm_source=ythp&utm_medium=LeftNav&utm_content=txt&u=https%3A%2F%2Fwww.youtube.com%2Fhowyoutubeworks%3Futm_source%3Dythp%26utm_medium%3DLeftNav%26utm_campaign%3Dytgen
https://www.youtube.com/new
https://m.youtube.com/
https://www.youtube.com/s/desktop/4965577f/jsbin/desktop_polymer.vflset/desktop_polymer.js
https://www.youtube.com/s/desktop/4965577f/cssbin/www-main-desktop-home-page-skeleton.css
https://www.youtube.com/s/desktop/4965577f/cssbin/www-onepick.css
https://www.youtube.com/s/_/ytmainappweb/_/ss/k=ytmainappweb.kevlar_base.0Zo5FUcPkCg.L.B1.O/am=gAE/d=0/rs=AGKMywG5nh5Qp-BGPbOaI1evhF5BVGRZGA
https://www.youtube.com/opensearch?locale=en_GB
https://www.youtube.com/manifest.webmanifest
https://www.youtube.com/s/desktop/4965577f/cssbin/www-main-desktop-watch-page-skeleton.css
https://www.youtube.com/s/desktop/4965577f/jsbin/web-animations-next-lite.min.vflset/web-animations-next-lite.min.js
https://www.youtube.com/s/desktop/4965577f/jsbin/custom-elements-es5-adapter.vflset/custom-elements-es5-adapter.js
https://www.youtube.com/s/desktop/4965577f/jsbin/webcomponents-sd.vflset/webcomponents-sd.js
https://www.youtube.com/s/desktop/4965577f/jsbin/intersection-observer.min.vflset/intersection-observer.min.js
https://www.youtube.com/s/desktop/4965577f/jsbin/scheduler.vflset/scheduler.js
https://www.youtube.com/s/desktop/4965577f/jsbin/www-i18n-constants-en_GB.vflset/www-i18n-constants.js
https://www.youtube.com/s/desktop/4965577f/jsbin/www-tampering.vflset/www-tampering.js
https://www.youtube.com/s/desktop/4965577f/jsbin/spf.vflset/spf.js
https://www.youtube.com/s/desktop/4965577f/jsbin/network.vflset/network.js
https://www.youtube.com/howyoutubeworks/
https://www.youtube.com/trends/
https://www.youtube.com/jobs/
https://www.youtube.com/kids/

Crawling Mode

Standard Mode

Standard crawling mode uses the standard Go http library under the hood to handle HTTP requests and responses. This mode is much faster because it avoids browser overhead. However, it analyzes the HTTP response body as-is, without any JavaScript or DOM rendering, so it can miss endpoints that only appear after the DOM is rendered, or asynchronous endpoint calls that complex web applications make in response to, for example, browser-specific events.

Headless Mode

Headless mode hooks internal headless calls to handle HTTP requests/responses directly within the browser context. This offers two advantages:

  • The HTTP fingerprint (TLS and user agent) fully identifies the client as a legitimate browser
  • Better coverage, since endpoints are discovered by analyzing both the standard raw response (as in standard mode) and the browser-rendered response with JavaScript enabled.

Headless crawling is optional and can be enabled using the -headless option.

Here are other headless CLI options -

katana -h headless

Flags:
HEADLESS:
   -hl, -headless                    enable headless hybrid crawling (experimental)
   -sc, -system-chrome               use local installed chrome browser instead of katana installed
   -sb, -show-browser                show the browser on the screen with headless mode
   -ho, -headless-options string[]   start headless chrome with additional options
   -nos, -no-sandbox                 start headless chrome in --no-sandbox mode
   -cdd, -chrome-data-dir string     path to store chrome browser data
   -scp, -system-chrome-path string  use specified chrome browser for headless crawling
   -noi, -no-incognito               start headless chrome without incognito mode
   -cwu, -chrome-ws-url string       use chrome browser instance launched elsewhere with the debugger listening at this URL
   -xhr, -xhr-extraction             extract xhr requests

-no-sandbox

Runs the headless chrome browser with the no-sandbox option, useful when running as the root user.

katana -u https://tesla.com -headless -no-sandbox

-no-incognito

Runs the headless chrome browser without incognito mode, useful when using the local browser.

katana -u https://tesla.com -headless -no-incognito

-headless-options

When crawling in headless mode, additional chrome options can be specified using -headless-options, for example -

katana -u https://tesla.com -headless -system-chrome -headless-options --disable-gpu,proxy-server=http://127.0.0.1:8080

Scope Control

Crawling can be endless if not scoped; katana therefore comes with multiple options to define the crawl scope.

-field-scope

The handiest option to define scope is with a predefined field name, rdn being the default field scope.

  • rdn - crawling scoped to root domain name and all subdomains (e.g. *example.com) (default)
  • fqdn - crawling scoped to given sub(domain) (e.g. www.example.com or api.example.com)
  • dn - crawling scoped to domain name keyword (e.g. example)
katana -u https://tesla.com -fs dn

-crawl-scope

For advanced scope control, the -cs option can be used, which comes with regex support.

katana -u https://tesla.com -cs login

For multiple in-scope rules, a file containing one string / regex per line can be passed.

$ cat in_scope.txt

login/
admin/
app/
wordpress/
katana -u https://tesla.com -cs in_scope.txt

-crawl-out-scope

To define what not to crawl, the -cos option can be used, which also supports regex input.

katana -u https://tesla.com -cos logout

For multiple out-of-scope rules, a file containing one string / regex per line can be passed.

$ cat out_of_scope.txt

/logout
/log_out
katana -u https://tesla.com -cos out_of_scope.txt

-no-scope

Katana scopes crawls to *.domain by default; the -ns option can be used to disable this and also to crawl the internet.

katana -u https://tesla.com -ns

-display-out-scope

By default, when a scope option is used, it also applies to the links displayed in the output, so external URLs are excluded. To override this behavior, the -do option can be used to display all the external URLs found on in-scope URLs / endpoints.

katana -u https://tesla.com -do

Here are all the CLI options for scope control -

katana -h scope

Flags:
SCOPE:
   -cs, -crawl-scope string[]       in scope url regex to be followed by crawler
   -cos, -crawl-out-scope string[]  out of scope url regex to be excluded by crawler
   -fs, -field-scope string         pre-defined scope field (dn,rdn,fqdn) (default "rdn")
   -ns, -no-scope                   disables host based default scope
   -do, -display-out-scope          display external endpoint from scoped crawling

Crawler Configuration

Katana comes with multiple options to configure and control the crawl the way we want.

-depth

Option to define the depth to follow URLs for crawling; the greater the depth, the more endpoints are crawled and the longer the crawl takes.

katana -u https://tesla.com -d 5

-js-crawl

Option to enable JavaScript file parsing and crawling of the endpoints discovered in JavaScript files, disabled by default.

katana -u https://tesla.com -jc

-crawl-duration

Option to predefine the crawl duration, disabled by default.

katana -u https://tesla.com -ct 2

-known-files

Option to enable crawling of the robots.txt and sitemap.xml files, disabled by default.

katana -u https://tesla.com -kf robotstxt,sitemapxml

-automatic-form-fill

Option to enable automatic form filling for known / unknown fields. Known field values can be customized as needed by updating the form config file at $HOME/.config/katana/form-config.yaml.

Automatic form filling is an experimental feature.

katana -u https://tesla.com -aff
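
As a rough illustration, a form config could look like the following. This is only a sketch: the values mirror the DefaultFormFillData defaults quoted in the "Configurable form config data" issue further below, but the YAML key names are assumed here, so check the generated file at $HOME/.config/katana/form-config.yaml before editing.

# illustrative sketch -- key names assumed, verify against the generated default file
email: "katana@example.com"
color: "#e66465"
password: "katanaP@assw0rd1"
phone: "2124567890"
placeholder: "katana"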

Authenticated Crawling

Authenticated crawling involves including custom headers or cookies in HTTP requests to access protected resources. These headers provide authentication or authorization information, allowing you to crawl authenticated content / endpoints. You can specify headers directly on the command line or provide them as a file to katana to perform authenticated crawling.

Note: The user needs to manually perform the authentication and export the session cookie / header to a file to use with katana.

-headers

Option to add a custom header or cookie to the request.

Syntax of headers in the HTTP specification

Here is an example of adding a cookie to the request:

katana -u https://tesla.com -H 'Cookie: usrsess=AmljNrESo'

It is also possible to supply headers or cookies as a file. For example:

$ cat cookie.txt

Cookie: PHPSESSIONID=XXXXXXXXX
X-API-KEY: XXXXX
TOKEN=XX
katana -u https://tesla.com -H cookie.txt

There are more options to configure when needed, here is all the config related CLI options -

katana -h config

Flags:
CONFIGURATION:
   -r, -resolvers string[]       list of custom resolver (file or comma separated)
   -d, -depth int                maximum depth to crawl (default 3)
   -jc, -js-crawl                enable endpoint parsing / crawling in javascript file
   -ct, -crawl-duration int      maximum duration to crawl the target for
   -kf, -known-files string      enable crawling of known files (all,robotstxt,sitemapxml)
   -mrs, -max-response-size int  maximum response size to read (default 9223372036854775807)
   -timeout int                  time to wait for request in seconds (default 10)
   -aff, -automatic-form-fill    enable automatic form filling (experimental)
   -fx, -form-extraction         enable extraction of form, input, textarea & select elements
   -retry int                    number of times to retry the request (default 1)
   -proxy string                 http/socks5 proxy to use
   -H, -headers string[]         custom header/cookie to include in request
   -config string                path to the katana configuration file
   -fc, -form-config string      path to custom form configuration file
   -flc, -field-config string    path to custom field configuration file
   -s, -strategy string          Visit strategy (depth-first, breadth-first) (default "depth-first")

Connecting to Active Browser Session

Katana can also connect to an active browser session where the user is already logged in and authenticated, and use it for crawling. The only requirement for this is to start the browser with remote debugging enabled.

Here is an example of starting chrome browser with remote debugging enabled and using it with katana -

Step 1) Locate the path of the chrome executable

Operating System | Chromium Executable Location | Google Chrome Executable Location
Windows (64-bit) | C:\Program Files (x86)\Google\Chromium\Application\chrome.exe | C:\Program Files (x86)\Google\Chrome\Application\chrome.exe
Windows (32-bit) | C:\Program Files\Google\Chromium\Application\chrome.exe | C:\Program Files\Google\Chrome\Application\chrome.exe
macOS | /Applications/Chromium.app/Contents/MacOS/Chromium | /Applications/Google Chrome.app/Contents/MacOS/Google Chrome
Linux | /usr/bin/chromium | /usr/bin/google-chrome

Step 2) Start chrome with remote debugging enabled; it will return a websocket URL. For example, on macOS, you can start chrome with remote debugging enabled using the following command -

$ /Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome --remote-debugging-port=9222


DevTools listening on ws://127.0.0.1:9222/devtools/browser/c5316c9c-19d6-42dc-847a-41d1aeebf7d6

Now log in to the website you want to crawl and keep the browser open.

Step 3) Use the websocket URL with katana to connect to the active browser session and crawl the website

katana -headless -u https://tesla.com -cwu ws://127.0.0.1:9222/devtools/browser/c5316c9c-19d6-42dc-847a-41d1aeebf7d6 -no-incognito

Note: you can use the -cdd option to specify a custom chrome data directory to store browser data and cookies, but that does not preserve session data if the cookie is set to Session only or expires after a certain time.

Filters

-field

Katana comes with built-in fields that can be used to filter the output for the desired information; the -f option can be used to specify any of the available fields.

   -f, -field string  field to display in output (url,path,fqdn,rdn,rurl,qurl,qpath,file,key,value,kv,dir,udir)

Here is a table with examples of each field and expected output when used -

FIELD | DESCRIPTION | EXAMPLE
url | URL Endpoint | https://admin.projectdiscovery.io/admin/login?user=admin&password=admin
qurl | URL including query param | https://admin.projectdiscovery.io/admin/login.php?user=admin&password=admin
qpath | Path including query param | /login?user=admin&password=admin
path | URL Path | https://admin.projectdiscovery.io/admin/login
fqdn | Fully Qualified Domain name | admin.projectdiscovery.io
rdn | Root Domain name | projectdiscovery.io
rurl | Root URL | https://admin.projectdiscovery.io
ufile | URL with File | https://admin.projectdiscovery.io/login.js
file | Filename in URL | login.php
key | Parameter keys in URL | user,password
value | Parameter values in URL | admin,admin
kv | Keys=Values in URL | user=admin&password=admin
dir | URL Directory name | /admin/
udir | URL with Directory | https://admin.projectdiscovery.io/admin/

Here is an example of using the field option to display only the URLs with query parameters in them -

katana -u https://tesla.com -f qurl -silent

https://shop.tesla.com/en_au?redirect=no
https://shop.tesla.com/en_nz?redirect=no
https://shop.tesla.com/product/men_s-raven-lightweight-zip-up-bomber-jacket?sku=1740250-00-A
https://shop.tesla.com/product/tesla-shop-gift-card?sku=1767247-00-A
https://shop.tesla.com/product/men_s-chill-crew-neck-sweatshirt?sku=1740176-00-A
https://www.tesla.com/about?redirect=no
https://www.tesla.com/about/legal?redirect=no
https://www.tesla.com/findus/list?redirect=no

Custom Fields

You can create custom fields to extract and store specific information from page responses using regex rules. These custom fields are defined using a YAML config file and are loaded from the default location at $HOME/.config/katana/field-config.yaml. Alternatively, you can use the -flc option to load a custom field config file from a different location. Here is an example custom field:

- name: email
  type: regex
  regex:
  - '([a-zA-Z0-9._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9_-]+)'
  - '([a-zA-Z0-9+._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9_-]+)'

- name: phone
  type: regex
  regex:
  - '\d{3}-\d{8}|\d{4}-\d{7}'

When defining custom fields, the following attributes are supported:

  • name (required)

The value of the name attribute is used as the -field CLI option value.

  • type (required)

The type of the custom attribute; the currently supported option is regex.

  • part (optional)

The part of the response to extract the information from. The default value is response, which includes both the header and body. Other possible values are header and body.

  • group (optional)

You can use this attribute to select a specific matched group in regex, for example: group: 1
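
Putting these attributes together, a custom field that uses the optional part and group attributes could be defined as follows (the field name and regex here are only illustrative, not shipped defaults):

- name: csrf_token
  type: regex
  part: body
  group: 1
  regex:
  - 'name="csrf-token" content="([A-Za-z0-9+/=]+)"'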

Running katana using custom field:

katana -u https://tesla.com -f email,phone

-store-field

To complement the -field option, which is useful for filtering output at run time, there is the -sf, -store-field option, which works exactly like the field option except that instead of filtering, it stores all the information on disk under the katana_field directory, sorted by target URL.

katana -u https://tesla.com -sf key,fqdn,qurl -silent
$ ls katana_field/

https_www.tesla.com_fqdn.txt
https_www.tesla.com_key.txt
https_www.tesla.com_qurl.txt

The -store-field option can be useful for collecting information to build a targeted wordlist for various purposes, including but not limited to:

  • Identifying the most commonly used parameters
  • Discovering frequently used paths
  • Finding commonly used files
  • Identifying related or unknown subdomains

Katana Filters

-extension-match

Crawl output can easily be matched for specific extensions using the -em option, which ensures that only output containing the given extensions is displayed.

katana -u https://tesla.com -silent -em js,jsp,json

-extension-filter

Crawl output can easily be filtered for specific extensions using the -ef option, which removes all URLs containing the given extensions.

katana -u https://tesla.com -silent -ef css,txt,md

-match-regex

The -match-regex or -mr flag allows you to filter output URLs using regular expressions. When using this flag, only URLs that match the specified regular expression will be printed in the output.

katana -u https://tesla.com -mr 'https://shop\.tesla\.com/*' -silent

-filter-regex

The -filter-regex or -fr flag allows you to filter output URLs using regular expressions. When using this flag, it will skip the URLs that match the specified regular expression.

katana -u https://tesla.com -fr 'https://www\.tesla\.com/*' -silent

Advanced Filtering

Katana supports DSL-based expressions for advanced matching and filtering capabilities:

  • To match endpoints with a 200 status code:
katana -u https://www.hackerone.com -mdc 'status_code == 200'
  • To match endpoints that contain "default" and have a status code other than 403:
katana -u https://www.hackerone.com -mdc 'contains(endpoint, "default") && status_code != 403'
  • To match endpoints with PHP technologies:
katana -u https://www.hackerone.com -mdc 'contains(to_lower(technologies), "php")'
  • To filter out endpoints running on Cloudflare:
katana -u https://www.hackerone.com -fdc 'contains(to_lower(technologies), "cloudflare")'

DSL functions can be applied to any keys in the jsonl output. For more information on available DSL functions, please visit the dsl project.
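
For example, conditions can be combined with boolean operators and applied to multiple keys at once, using only the keys and functions shown in the examples above:

katana -u https://www.hackerone.com -mdc 'status_code == 200 && contains(endpoint, "login")'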

Here are additional filter options -

katana -h filter

Flags:
FILTER:
   -mr, -match-regex string[]       regex or list of regex to match on output url (cli, file)
   -fr, -filter-regex string[]      regex or list of regex to filter on output url (cli, file)
   -f, -field string                field to display in output (url,path,fqdn,rdn,rurl,qurl,qpath,file,ufile,key,value,kv,dir,udir)
   -sf, -store-field string         field to store in per-host output (url,path,fqdn,rdn,rurl,qurl,qpath,file,ufile,key,value,kv,dir,udir)
   -em, -extension-match string[]   match output for given extension (eg, -em php,html,js)
   -ef, -extension-filter string[]  filter output for given extension (eg, -ef png,css)
   -mdc, -match-condition string    match response with dsl based condition
   -fdc, -filter-condition string   filter response with dsl based condition

Rate Limit

It's easy to get blocked / banned while crawling if you don't respect the target website's limits; katana comes with multiple options to tune the crawl to go as fast or slow as we want.

-delay

Option to introduce a delay in seconds between each new request katana makes while crawling, disabled by default.

katana -u https://tesla.com -delay 20

-concurrency

Option to control the number of URLs per target to fetch at the same time.

katana -u https://tesla.com -c 20

-parallelism

Option to define the number of targets to process at the same time from list input.

katana -u https://tesla.com -p 20

-rate-limit

Option to define the maximum number of requests that can go out per second.

katana -u https://tesla.com -rl 100

-rate-limit-minute

Option to define the maximum number of requests that can go out per minute.

katana -u https://tesla.com -rlm 500

Here are all the long / short CLI options for rate limit control -

katana -h rate-limit

Flags:
RATE-LIMIT:
   -c, -concurrency int          number of concurrent fetchers to use (default 10)
   -p, -parallelism int          number of concurrent inputs to process (default 10)
   -rd, -delay int               request delay between each request in seconds
   -rl, -rate-limit int          maximum requests to send per second (default 150)
   -rlm, -rate-limit-minute int  maximum number of requests to send per minute

Output

Katana supports both file output in plain text format and JSON, which includes additional information like source, tag, and attribute name to correlate the discovered endpoint.

-output

By default, katana outputs the crawled endpoints in plain text format. The results can be written to a file by using the -output option.

katana -u https://example.com -no-scope -output example_endpoints.txt

-jsonl

katana -u https://example.com -jsonl | jq .
{
  "timestamp": "2023-03-20T16:23:58.027559+05:30",
  "request": {
    "method": "GET",
    "endpoint": "https://example.com",
    "raw": "GET / HTTP/1.1\r\nHost: example.com\r\nUser-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 11_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36\r\nAccept-Encoding: gzip\r\n\r\n"
  },
  "response": {
    "status_code": 200,
    "headers": {
      "accept_ranges": "bytes",
      "expires": "Mon, 27 Mar 2023 10:53:58 GMT",
      "last_modified": "Thu, 17 Oct 2019 07:18:26 GMT",
      "content_type": "text/html; charset=UTF-8",
      "server": "ECS (dcb/7EA3)",
      "vary": "Accept-Encoding",
      "etag": "\"3147526947\"",
      "cache_control": "max-age=604800",
      "x_cache": "HIT",
      "date": "Mon, 20 Mar 2023 10:53:58 GMT",
      "age": "331239"
    },
    "body": "<!doctype html>\n<html>\n<head>\n    <title>Example Domain</title>\n\n    <meta charset=\"utf-8\" />\n    <meta http-equiv=\"Content-type\" content=\"text/html; charset=utf-8\" />\n    <meta name=\"viewport\" content=\"width=device-width, initial-scale=1\" />\n    <style type=\"text/css\">\n    body {\n        background-color: #f0f0f2;\n        margin: 0;\n        padding: 0;\n        font-family: -apple-system, system-ui, BlinkMacSystemFont, \"Segoe UI\", \"Open Sans\", \"Helvetica Neue\", Helvetica, Arial, sans-serif;\n        \n    }\n    div {\n        width: 600px;\n        margin: 5em auto;\n        padding: 2em;\n        background-color: #fdfdff;\n        border-radius: 0.5em;\n        box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);\n    }\n    a:link, a:visited {\n        color: #38488f;\n        text-decoration: none;\n    }\n    @media (max-width: 700px) {\n        div {\n            margin: 0 auto;\n            width: auto;\n        }\n    }\n    </style>    \n</head>\n\n<body>\n<div>\n    <h1>Example Domain</h1>\n    <p>This domain is for use in illustrative examples in documents. You may use this\n    domain in literature without prior coordination or asking for permission.</p>\n    <p><a href=\"https://www.iana.org/domains/example\">More information...</a></p>\n</div>\n</body>\n</html>\n",
    "technologies": [
      "Azure",
      "Amazon ECS",
      "Amazon Web Services",
      "Docker",
      "Azure CDN"
    ],
    "raw": "HTTP/1.1 200 OK\r\nContent-Length: 1256\r\nAccept-Ranges: bytes\r\nAge: 331239\r\nCache-Control: max-age=604800\r\nContent-Type: text/html; charset=UTF-8\r\nDate: Mon, 20 Mar 2023 10:53:58 GMT\r\nEtag: \"3147526947\"\r\nExpires: Mon, 27 Mar 2023 10:53:58 GMT\r\nLast-Modified: Thu, 17 Oct 2019 07:18:26 GMT\r\nServer: ECS (dcb/7EA3)\r\nVary: Accept-Encoding\r\nX-Cache: HIT\r\n\r\n<!doctype html>\n<html>\n<head>\n    <title>Example Domain</title>\n\n    <meta charset=\"utf-8\" />\n    <meta http-equiv=\"Content-type\" content=\"text/html; charset=utf-8\" />\n    <meta name=\"viewport\" content=\"width=device-width, initial-scale=1\" />\n    <style type=\"text/css\">\n    body {\n        background-color: #f0f0f2;\n        margin: 0;\n        padding: 0;\n        font-family: -apple-system, system-ui, BlinkMacSystemFont, \"Segoe UI\", \"Open Sans\", \"Helvetica Neue\", Helvetica, Arial, sans-serif;\n        \n    }\n    div {\n        width: 600px;\n        margin: 5em auto;\n        padding: 2em;\n        background-color: #fdfdff;\n        border-radius: 0.5em;\n        box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);\n    }\n    a:link, a:visited {\n        color: #38488f;\n        text-decoration: none;\n    }\n    @media (max-width: 700px) {\n        div {\n            margin: 0 auto;\n            width: auto;\n        }\n    }\n    </style>    \n</head>\n\n<body>\n<div>\n    <h1>Example Domain</h1>\n    <p>This domain is for use in illustrative examples in documents. You may use this\n    domain in literature without prior coordination or asking for permission.</p>\n    <p><a href=\"https://www.iana.org/domains/example\">More information...</a></p>\n</div>\n</body>\n</html>\n"
  }
}
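
Since the output is JSONL, individual fields can be pulled out with jq. For example, to print just the discovered endpoints from the structure shown above:

katana -u https://example.com -jsonl -silent | jq -r '.request.endpoint'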

-store-response

The -store-response option allows for writing all crawled endpoint requests and responses to a text file. When this option is used, text files including the request and response will be written to the katana_response directory. If you would like to specify a custom directory, you can use the -store-response-dir option.

katana -u https://example.com -no-scope -store-response
$ cat katana_response/index.txt

katana_response/example.com/327c3fda87ce286848a574982ddd0b7c7487f816.txt https://example.com (200 OK)
katana_response/www.iana.org/bfc096e6dd93b993ca8918bf4c08fdc707a70723.txt http://www.iana.org/domains/reserved (200 OK)

Note:

-store-response option is not supported in -headless mode.

Here are additional CLI options related to output -

katana -h output

OUTPUT:
   -o, -output string                file to write output to
   -sr, -store-response              store http requests/responses
   -srd, -store-response-dir string  store http requests/responses to custom directory
   -j, -json                         write output in JSONL(ines) format
   -nc, -no-color                    disable output content coloring (ANSI escape codes)
   -silent                           display output only
   -v, -verbose                      display verbose output
   -version                          display project version

Katana as a library

katana can be used as a library by creating an instance of the Options struct and populating it with the same options that would be specified via the CLI. Using the options you can create crawlerOptions and then a standard or hybrid crawler. The crawler.Crawl method should be called to crawl the input.

package main

import (
	"math"

	"github.com/projectdiscovery/gologger"
	"github.com/projectdiscovery/katana/pkg/engine/standard"
	"github.com/projectdiscovery/katana/pkg/output"
	"github.com/projectdiscovery/katana/pkg/types"
)

func main() {
	options := &types.Options{
		MaxDepth:     3,             // Maximum depth to crawl
		FieldScope:   "rdn",         // Crawling Scope Field
		BodyReadSize: math.MaxInt,   // Maximum response size to read
		Timeout:      10,            // Timeout is the time to wait for request in seconds
		Concurrency:  10,            // Concurrency is the number of concurrent crawling goroutines
		Parallelism:  10,            // Parallelism is the number of urls processing goroutines
		Delay:        0,             // Delay is the delay between each crawl requests in seconds
		RateLimit:    150,           // Maximum requests to send per second
		Strategy:     "depth-first", // Visit strategy (depth-first, breadth-first)
		OnResult: func(result output.Result) { // Callback function to execute for result
			gologger.Info().Msg(result.Request.URL)
		},
	}
	crawlerOptions, err := types.NewCrawlerOptions(options)
	if err != nil {
		gologger.Fatal().Msg(err.Error())
	}
	defer crawlerOptions.Close()
	crawler, err := standard.New(crawlerOptions)
	if err != nil {
		gologger.Fatal().Msg(err.Error())
	}
	defer crawler.Close()
	var input = "https://www.hackerone.com"
	err = crawler.Crawl(input)
	if err != nil {
		gologger.Warning().Msgf("Could not crawl %s: %s", input, err.Error())
	}
}
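
For hybrid (headless) crawling as a library, a minimal sketch along the same lines is shown below. It assumes the hybrid engine lives at github.com/projectdiscovery/katana/pkg/engine/hybrid and that headless crawling is toggled via the Headless field of types.Options; verify both against the package version you are using.

package main

import (
	"math"

	"github.com/projectdiscovery/gologger"
	"github.com/projectdiscovery/katana/pkg/engine/hybrid"
	"github.com/projectdiscovery/katana/pkg/output"
	"github.com/projectdiscovery/katana/pkg/types"
)

func main() {
	options := &types.Options{
		MaxDepth:     2,             // Maximum depth to crawl
		FieldScope:   "rdn",         // Crawling Scope Field
		BodyReadSize: math.MaxInt,   // Maximum response size to read
		Timeout:      10,            // Time to wait for request in seconds
		Concurrency:  10,            // Number of concurrent crawling goroutines
		Parallelism:  10,            // Number of URL processing goroutines
		RateLimit:    150,           // Maximum requests to send per second
		Strategy:     "depth-first", // Visit strategy (depth-first, breadth-first)
		Headless:     true,          // Assumed option name for enabling headless crawling
		OnResult: func(result output.Result) { // Callback function to execute for result
			gologger.Info().Msg(result.Request.URL)
		},
	}
	crawlerOptions, err := types.NewCrawlerOptions(options)
	if err != nil {
		gologger.Fatal().Msg(err.Error())
	}
	defer crawlerOptions.Close()
	// hybrid.New is assumed to mirror standard.New for the headless engine
	crawler, err := hybrid.New(crawlerOptions)
	if err != nil {
		gologger.Fatal().Msg(err.Error())
	}
	defer crawler.Close()
	input := "https://www.hackerone.com"
	if err := crawler.Crawl(input); err != nil {
		gologger.Warning().Msgf("Could not crawl %s: %s", input, err.Error())
	}
}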

katana is made with ❤️ by the projectdiscovery team and distributed under MIT License.

Join Discord

katana's People

Contributors

0x123456789, apriil15, aristosmiliaressis, b34c0n5, c3l3si4n, danielintruder, dependabot[bot], dogancanbakir, edoardottt, ehsandeep, erikowen, geeknik, glaucocustodio, h4r5h1t, iamargus95, ice3man543, jen140, maik-s, mschader, mzack9999, niudaii, olearycrew, parthmalhotra, ramanareddy0m, rohantheprogrammer, shubhamrasal, tarunkoyalwar, toufik-airane, yuzhe-mortal, zy9ard3


katana's Issues

invalid / blank url being requested for crawl

katana version:

dev

Current Behavior:

blank url / non http/s protocol being requested.

Expected Behavior:

only crawl / request valid, http/s URL.

[ERR] Could not request seed URL: Get "javascript:window.print();": unsupported protocol scheme "javascript"
[ERR] Could not request seed URL: context deadline exceeded (Client.Timeout or context cancellation while reading body)

(2) Could not request seed URL - context deadline exceeded (timeout)

katana -u https://44.199.9.133/ -d 10 -sjr -is

Error :

[ERR] Could not request seed URL: GET http://44.199.9.133/search/ giving up after 2 attempts: Get "http://44.199.9.133/search/": context deadline exceeded (Client.Timeout exceeded while awaiting heade                                                  

The request timed out because port 80 was not open on the server: katana found an http URL while crawling, but the server only had port 443 open.

Possible solutions: auto-upgrade to https, or skip the URL and continue testing.

Could not request seed URL (stopped after 10 redirects)

./katana -u https://www.hackerone.com -csd hackerone.com -is -d 5
https://www.hackerone.com/vulnerability-management/vulnerability-assessment-i-complete-guide
https://www.hackerone.com/vulnerability-management/vulnerability-assessment-tools-top-tools-what-they-do
https://www.hackerone.com/vulnerability-management/bug-bounty-vs-vdp-which-program-right-you
[ERR] Could not request seed URL: Get "/vulnerability-management/critical-introducing-severity-cvss": stopped after 10 redirects

Command line suggestions

  1. Either set the default value of -cs flag to include only the current domain in crawl scope or add another flag -cscd ( current domain crawl scope) so that katana only crawls current domain.

  2. New parameter : -nqs ( No query string ) : When user doesn't want any query strings in output, Can be useful for further fuzzing purposes. Can be done easily otherwise but will be better if supported natively.

Output :
echo https://www.google.com | katana -d 1

https://policies.google.com/terms?hl=en-IN&fg=1
https://www.google.com/url?sa=t&rct=j&source=webhp&url=https://policies.google.com/terms%3Fhl%3Den-IN%26fg%3D1&ved=0ahUKEwjK7qb7mPz5AhVfUGwGHbDMC3gQ8qwCCB0
https://www.google.com/preferences?hl=en-IN&fg=1

echo https://www.google.com | katana -d 1 -nqs
Desired Output :

https://policies.google.com/terms
https://www.google.com/url
https://www.google.com/preferences

Implement CLI wrapper around non-headless Katana Engine

  • CLI Client using goflags
Usage:
  ./katana [flags]

Flags:

INPUT:
   -u, -list string[]  target url / list to crawl (single / comma separated / file input)

CONFIGURATIONS:
   -config string                   cli flag configuration file
   -d, -depth                       maximum depth to crawl (default 1)
   -ct, -crawl-duration int         maximum duration to crawl the target for
   -mrs, -max-response-size int     maximum response size to read (default 10 MB)
   -timeout int                     time to wait in seconds before timeout (default 5)
   -p, -proxy string[]              http/socks5 proxy list to use (single / comma separated / file input)
   -H, -header string[]             custom header/cookie to include in request (single / file input)

SCOPE:
   -cs, -crawl-scope string[]       in scope target to be followed by crawler (single / comma separated / file input) # regex input
   -cos, -crawl-out-scope string[]  out of scope target to exclude by crawler (single / comma separated / file input) # regex input
   -is, -include-sub                include subdomains in crawl scope (false)

RATE-LIMIT:
   -c, -concurrency int          number of concurrent fetchers to use  (default 300)
   -rd, -delay int               request delay between each request in seconds (default -1)
   -rl, -rate-limit int          maximum requests to send per second (default 150)
   -rlm, -rate-limit-minute int  maximum number of requests to send per minute

OUTPUT:
   -o, -output string        output file to write
   -json                     write output in JSONL(ines) format (false)
   -nc, -no-color            disable output content coloring (ANSI escape codes) (false)
   -silent                   display output only (false)
   -v, -verbose              display verbose output (false)
   -version                  display project version

Reference:

https://github.com/projectdiscovery/gocrawl
https://github.com/projectdiscovery/katana/tree/backup/pkg/engine/standard (improved)

JSON output improvement

Current output:

{
  "url": "https://www.hackerone.com/events/app-security-testing",
  "source": "a"
}

Updated output:

{
  "timestamp": "2022-08-22T04:46:23.405849+05:30"
  "endpoint": "https://www.hackerone.com/events/app-security-testing",  # endpoint is discovered url
  "source": "https://www.hackerone.com/events/"  # source is page url where the endpoint got discovered 
  "tag": "a", 
  "attribute": "href"
}

Errors in default run

Error information can be moved from default mode to verbose mode.

https://privacy.thewaltdisneycompany.com/app/themes/privacycenter/assets/dist/js/app-cfa6fbf0.min.js
https://privacy.thewaltdisneycompany.com/en/?s=katana&sentence=1
[ERR] Could not request seed URL: GET http://44.199.9.133/savings/ giving up after 2 attempts: Get "http://44.199.9.133/savings/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
[ERR] Could not request seed URL: GET http://44.199.9.133/membership/costs/ giving up after 2 attempts: Get "http://44.199.9.133/membership/costs/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
[ERR] Could not request seed URL: GET http://44.199.9.133/destinations/dvc-resorts/ giving up after 2 attempts: Get "http://44.199.9.133/destinations/dvc-resorts/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
[ERR] Could not request seed URL: GET http://44.199.9.133/explore-membership/ giving up after 2 attempts: Get "http://44.199.9.133/explore-membership/": no address found for host (Client.Timeout exceeded while awaiting headers)
[ERR] Could not request seed URL: GET http://44.199.9.133/destinations/explore-disney-destinations-and-resort-hotels/ giving up after 2 attempts: Get "http://44.199.9.133/destinations/explore-disney-destinations-and-resort-hotels/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
[ERR] Could not request seed URL: GET http://44.199.9.133/star-wars-galactic-starcruiser/ giving up after 2 attempts: Get "http://44.199.9.133/star-wars-galactic-starcruiser/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
[ERR] Could not request seed URL: GET http://44.199.9.133/discounts-perks-offers/ giving up after 2 attempts: Get "http://44.199.9.133/discounts-perks-offers/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
[ERR] Could not request seed URL: GET http://44.199.9.133/points-and-flexibility/ giving up after 2 attempts: Get "http://44.199.9.133/points-and-flexibility/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
[ERR] Could not request seed URL: GET http://44.199.9.133/membership-magic/ giving up after 2 attempts: Get "http://44.199.9.133/membership-magic/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

Configurable form config data

Please describe your feature request:

var DefaultFormFillData = FormFillData{
	Email:       "[email protected]",
	Color:       "#e66465",
	Password:    "katanaP@assw0rd1",
	PhoneNumber: "2124567890",
	Placeholder: "katana",
}

CLI Option:

   -fc, form-config string                path to the form configuration file

  • Custom form config input
  • Email input randomization by default

Execution context was destroyed

katana version:

dev | master

Current Behavior:

echo http://34.236.11.165 | ./katana -jc -headless -v

   __        __                
  / /_____ _/ /____ ____  ___ _
 /  '_/ _  / __/ _  / _ \/ _  /
/_/\_\\_,_/\__/\_,_/_//_/\_,_/ v0.0.1							 

		projectdiscovery.io

[WRN] Use with caution. You are responsible for your actions.
[WRN] Developers assume no liability and are not responsible for any misuse or damage.
[WRN] context canceled
[WRN] context canceled
[WRN] context canceled
[WRN] Could not request seed URL: {-32000 Execution context was destroyed. }

Investigate go-rod vs playwright-go vs others for crawler development

Description

Go has several headless libraries which are listed below -

We should investigate and decide on a library suitable for crawler design.

Metrics to consider:

  • Repository maintained
  • Community
  • Stability
  • Number of stars
  • Working correctly under heavy load

Create analyzer for scraping new navigation from headless page states

  • Anchor, Button, Embed, and Iframe for direct links.
  • Parse and fill HTML Forms as well optionally. (Login, Register, etc using these methods)
  • Scrape javascript / javascript files and collect links using regex.
  • Collect requests made by XHR/Javascript APIs as well.
  • Elements having event listeners can be navigated by querying the DOM or using JS hooks. (Decide on whether we want to use JS hooks or query the DOM)
  • Other relevant information can be decided in the future or depending upon demand.

meta links are not parsed correctly

katana version:

main/dev

Example response:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
<head>
<title>GETPAID</title>
<meta http-equiv="REFRESH" content="0;url=https://unitedcargobilling.ual.com/ngetpaid"></HEAD>
<BODY>
Redirecting
</BODY>
</HTML>

Current Behavior:

No results.

Expected Behavior:

https://unitedcargobilling.ual.com/ngetpaid should be parsed and crawled.

Steps To Reproduce:

echo https://unitedcargobilling.ual.com | ./katana

Anything else:

Related code:

func bodyMetaContentTagParser(resp navigationResponse, callback func(navigationRequest)) {
	resp.Reader.Find("meta[http-equiv='refresh']").Each(func(i int, item *goquery.Selection) {
		header, ok := item.Attr("content")
		if !ok {
			return
		}
		values := utils.ParseRefreshTag(header)
		if values == "" {
			return
		}
		callback(newNavigationRequestURL(values, resp.Resp.Request.URL.String(), "meta", "refresh", resp))
	})
}

Action item:

  • Fix
  • Test

Add Dockerfile

Please describe your feature request:

Dockerize katana; the container must pre-install a headless browser

Form Bug - GET Request Body

katana version:

v0.0.1

Current Behavior:

Just running katana over my website (pretty basic Wordpress site), https://wya.pl, there is a form to search posts via the /?s= parameter. When I proxy the crawler, I can see that the form is identified and the parameter is filled with the value of katana. However, I can see that the resultant request copies the parameter into the body of the request.


Expected Behavior:

The form submits a GET request normally, so I'd expect for a GET request with the filled out parameter to be only in the query string. Since this is a GET request, I'd expect for there to be an empty HTTP body.

Steps To Reproduce:

Here is what the HTML form looks like (I swapped my site to localhost here to limit spam):

<form role="search" method="get" class="search-form" action="http://localhost/">
	<label>
		<span class="screen-reader-text">Search for:</span>
		<input type="search" class="search-field" placeholder="Search …" value="" name="s" title="Search for:">
	</label>
	<button type="submit" class="search-submit"><span class="screen-reader-text">Search</span></button>
</form>

Anything else:

STDIN URL input support

Currently, input can be supplied with -u or -list option that can be extended to support stdin as well.

echo https://www.hackerone.com | ./katana 

   __        __                
  / /_____ _/ /____ ____  ___ _
 /  '_/ _  / __/ _  / _ \/ _  /
/_/\_\\_,_/\__/\_,_/_//_/\_,_/ v0.0.1							 

		projectdiscovery.io

[WRN] Use with caution. You are responsible for your actions.
[WRN] Developers assume no liability and are not responsible for any misuse or damage.
[FTL] Could not process: could not create runner: could not validate options: no inputs specified for crawler

Use Heuristic Inference to improve form-fill capability

Please describe your feature request:

Automatic form filling without context is a hard task. After implementing a series of robust standard rules, it would be interesting to investigate further strategies to infer the form category from the page:

  • What is the topic of the page - semantical analysis (form to book an airplane, form to subscribe to a newsletter, login form, etc)
  • Can we use prior knowledge to classify similar forms? (forms from web frameworks, forms taken from snippets on the web)
  • Form filling can be a multistep operation: we need an autonomous approach (eg. with a fitness function that rewards the most promising filling paths)

Leakless binary flagged as malicious by Windows Defender

katana version:

dev

Current Behavior:

Leakless binary is flagged as malicious by Windows Defender

Expected Behavior:

Headless instances cleanup

Steps To Reproduce:

> go run . -cs 127.0.0.1 -u http://127.0.0.1:8000 -headless > head.txt

   __        __
  / /_____ _/ /____ ____  ___ _
 /  '_/ _  / __/ _  / _ \/ _  /
/_/\_\\_,_/\__/\_,_/_//_/\_,_/ v0.0.1

                projectdiscovery.io

[WRN] Use with caution. You are responsible for your actions.
[WRN] Developers assume no liability and are not responsible for any misuse or damage.
[FTL] Could not process: could not execute crawling: could not create standard crawler: fork/exec C:\Users\user\AppData\Local\Temp\leakless-0c3354cd58f0813bb5b34ddf3a7c16ed\leakless.exe: Operation did not complete successfully because the file contains a virus or potentially unwanted software.
exit status 1

Notes: partially solved in https://github.com/projectdiscovery/nuclei/blob/1010cca84e62e04cd675debfce20ce96d2e9cd3c/v2/pkg/protocols/headless/engine/engine.go#L158

Create navigation module for headless crawler

  • Handle generic prompts (popups, alerts, etc)
  • Use go-rod to discover hooks and events and enqueue new discovered seed urls
  • Detect and handle navigation loops/edge cases with circuit breaker mechanisms: timeouts, errors

Add new option : -path-deny-list / -pdl to exclude paths from crawling

Add a new flag -path-deny-list / -pdl to exclude path(s) from crawling. It can be a single path, a list of paths, or comma-separated paths (via the command line). This will be useful for authenticated crawling, where the user doesn't want to make requests to logout paths to avoid cookie invalidation.

-H / -header not working as intended

-H is not working as intended :

root@bhramastra:/tmp/urldedupe# echo https://ylnhy1urfxmutnoat5qenl43hunkb9.oastify.com/ | katana -d 3 -o hk3 -c 100 -p 100 -rl 1500 -is -H "Cookie: ccc=ddd"

   __        __                
  / /_____ _/ /____ ____  ___ _
 /  '_/ _  / __/ _  / _ \/ _  /
/_/\_\\_,_/\__/\_,_/_//_/\_,_/ v0.0.1							 

		projectdiscovery.io

[WRN] Use with caution. You are responsible for your actions.
[WRN] Developers assume no liability and are not responsible for any misuse or damage.
[ERR] Could not request seed URL: GET https://ylnhy1urfxmutnoat5qenl43hunkb9.oastify.com/ giving up after 2 attempts: Get "https://ylnhy1urfxmutnoat5qenl43hunkb9.oastify.com/": net/http: invalid header field name "Cookie: ccc"

Scope syntax Improvements

katana version:

main
dev

Current Behavior:

scope doesn't support the include/exclude options:

  • host:port
  • ip
  • ip:port
  • :port
  • cidr

Expected Behavior:

Support the previous syntax

predefined fields to control output

Please describe your feature request:

CLI Option:

   -f, -field           field to display in output (fqdn,rdn,url,rurl,path,file,key,value,kv) (default url)

Example:

Field | Example
url (default) | https://policies.google.com/terms/file.php?hl=en-IN&fg=1
rurl (root url) | https://policies.google.com
path | /terms/file.php?hl=en-IN&fg=1
file | file.php
key | hl,fg
value | en-IN,1
kv | hl=en-IN&fg=1
fqdn | policies.google.com
rdn | google.com

Example run:

echo https://example.com | ./katana -f path -silent

/domains
/protocols
/numbers
/about
/go/rfc2606
/go/rfc6761
/http://www.icann.org/topics/idn/
/http://www.icann.org/
/domains/root/db/xn--kgbechtv.html
/domains/root/db/xn--hgbk6aj7f53bba.html
/domains/root/db/xn--0zwm56d.html
/domains/root/db/xn--g6w251d.html
/domains/root/db/xn--80akhbyknj4f.html
/domains/root/db/xn--11b5bs3a9aj6g.html
/domains/root/db/xn--jxalpdlp.html
/domains/root/db/xn--9t4b11yi5a.html
/domains/root/db/xn--deba0ad.html
/domains/root/db/xn--zckzah.html
/domains/root/db/xn--hlcj6aya9esc7a.html
/assignments/special-use-domain-names
/domains/root
/domains/root/db
/domains/root/files
/domains/root/manage
/domains/root/help
/domains/root/servers
/domains/int
/domains/int/manage
/domains/int/policy
/domains/arpa
/domains/idn-tables
/procedures/idn-repository.html
/dnssec
/dnssec/files
/dnssec/ceremonies
/dnssec/procedures
/dnssec/tcrs
/dnssec/archive
/domains/reserved
/abuse
/time-zones
/about/presentations
/reports
/performance
/reviews
/about/excellence
/contact
/_js/jquery.js
/_js/iana.js

This will be similar to uncover implementation of uncover - https://github.com/projectdiscovery/uncover#field-format

Describe the use case of this feature:

  • Control output as required for further processing / scanning / record

store field option

Please describe your feature request:

Similar to -f, -sf is a new option that lets us write the values of single or multiple fields into a txt file named after the scheme, host, and field key, i.e. scheme_host_field_name.txt

CLI Option:

   -sf, -store-field string  field to store in output (fqdn,rdn,url,rurl,path,file,key,value,kv)
  • The -sf option writes to the katana_output directory by default.

Example:

./katana -u https://example.com -f url -sf fqdn,key,dir

ls katana_output/

https_example.com_fqdn.txt
https_example.com_key.txt
https_example.com_dir.txt

Describe the use case of this feature:

This will allow us to write multiple type of url data into file that can be used further for various automation including

  • custom wordlist building
  • common query collection
  • common key values collection
  • common path collection
  • dns data collection and more.

Scope to mimic burp suite scope behavior

   -cs, -crawl-scope string[]       in scope target to be followed by crawler
   -cos, -crawl-out-scope string[]  out of scope target to be excluded by crawler

The -cs flag accepts a word/regex that will be applied to the URL component.

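For illustration, a minimal Go sketch of Burp-style in/out-of-scope matching with plain regexes; applying the patterns to the full URL is an assumption here, not katana's exact logic:

```go
package main

import (
	"fmt"
	"regexp"
)

// inScope: a URL is followed only if it matches at least one in-scope
// pattern (when any are set) and no out-of-scope pattern.
func inScope(rawURL string, include, exclude []*regexp.Regexp) bool {
	for _, re := range exclude {
		if re.MatchString(rawURL) {
			return false
		}
	}
	if len(include) == 0 {
		return true
	}
	for _, re := range include {
		if re.MatchString(rawURL) {
			return true
		}
	}
	return false
}

func main() {
	include := []*regexp.Regexp{regexp.MustCompile(`projectdiscovery\.io`)}
	exclude := []*regexp.Regexp{regexp.MustCompile(`logout`)}

	fmt.Println(inScope("https://projectdiscovery.io/about", include, exclude))  // true
	fmt.Println(inScope("https://projectdiscovery.io/logout", include, exclude)) // false
	fmt.Println(inScope("https://example.com/", include, exclude))               // false
}
```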

Basic headless navigation module

Description

  • Refactor the code to extract common functionality and define an interface for multiple crawling engines
  • Add basic headless crawling functionality that, in one pass, analyses DOM-rendered data and passively intercepts all JS-generated HTTP requests

Notes: extractors, analyzers and edge cases will be handled as part of #16

Create headless crawler design document

Description

A design document describing the functionality and the requirements of the headless variant of the crawler needs to be created. This will then be used to come up with the actual functionality of the crawler.

Investigate headless as a proxy

Please describe your feature request:

We need to investigate whether it's possible to use Chrome as a proxy for HTTP/HTTPS requests. At present, requests are performed with the Go HTTP client via go-rod hijacking.

Describe the use case of this feature:

HTTP requests would have native browser fingerprinting and full context
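
For context, a minimal go-rod sketch of the current hijacking approach, where intercepted requests are fulfilled by Go's HTTP client rather than the browser's network stack; the target URL is illustrative:

```go
package main

import (
	"fmt"
	"net/http"

	"github.com/go-rod/rod"
)

func main() {
	browser := rod.New().MustConnect()
	defer browser.MustClose()

	// Intercept every request the page makes.
	router := browser.HijackRequests()
	router.MustAdd("*", func(ctx *rod.Hijack) {
		fmt.Println("intercepted:", ctx.Request.URL().String())
		// Fetch the response with Go's HTTP client instead of the browser.
		_ = ctx.LoadResponse(http.DefaultClient, true)
	})
	go router.Run()

	browser.MustPage("https://projectdiscovery.io").MustWaitLoad()
}
```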

Endpoint parsing improvements

Project Version

dev

Please describe your feature request:

Regex improvements for endpoint extraction.

echo https://projectdiscovery.io | ./katana -jc -cs projectdiscovery.io

   __        __                
  / /_____ _/ /____ ____  ___ _
 /  '_/ _  / __/ _  / _ \/ _  /
/_/\_\\_,_/\__/\_,_/_//_/\_,_/ v0.0.1							 

		projectdiscovery.io

[WRN] Use with caution. You are responsible for your actions.
[WRN] Developers assume no liability and are not responsible for any misuse or damage.
https://projectdiscovery.io/app.js
https://projectdiscovery.io/
https://projectdiscovery.io/moment.js
https://projectdiscovery.io/Underscore.js
https://projectdiscovery.io/a/i
https://projectdiscovery.io/a/b
https://projectdiscovery.io/e.do
https://projectdiscovery.io/n.do
https://projectdiscovery.io/af
https://projectdiscovery.io/af.js
https://projectdiscovery.io/ar
https://projectdiscovery.io/ar-dz
https://projectdiscovery.io/ar-dz.js
https://projectdiscovery.io/ar-kw
https://projectdiscovery.io/ar-kw.js
https://projectdiscovery.io/ar-ly
https://projectdiscovery.io/ar-ly.js
https://projectdiscovery.io/ar-ma
https://projectdiscovery.io/ar-ma.js
https://projectdiscovery.io/ar-sa
https://projectdiscovery.io/ar-sa.js
https://projectdiscovery.io/ar-tn
https://projectdiscovery.io/ar-tn.js
https://projectdiscovery.io/ar.js
https://projectdiscovery.io/az
https://projectdiscovery.io/az.js
https://projectdiscovery.io/be
https://projectdiscovery.io/be.js
https://projectdiscovery.io/bg
https://projectdiscovery.io/bg.js
https://projectdiscovery.io/bm
https://projectdiscovery.io/bm.js
https://projectdiscovery.io/bn
https://projectdiscovery.io/bn-bd
https://projectdiscovery.io/bn-bd.js
https://projectdiscovery.io/bn.js
https://projectdiscovery.io/bo
https://projectdiscovery.io/bo.js
https://projectdiscovery.io/br
https://projectdiscovery.io/br.js
https://projectdiscovery.io/bs
https://projectdiscovery.io/bs.js
https://projectdiscovery.io/ca
https://projectdiscovery.io/ca.js
https://projectdiscovery.io/cs
https://projectdiscovery.io/cs.js
https://projectdiscovery.io/cv
https://projectdiscovery.io/cv.js
https://projectdiscovery.io/cy
https://projectdiscovery.io/cy.js
https://projectdiscovery.io/da
https://projectdiscovery.io/da.js
https://projectdiscovery.io/de
https://projectdiscovery.io/de-at
https://projectdiscovery.io/de-at.js
https://projectdiscovery.io/de-ch
https://projectdiscovery.io/de-ch.js
https://projectdiscovery.io/de.js
https://projectdiscovery.io/dv
https://projectdiscovery.io/dv.js
https://projectdiscovery.io/el
https://projectdiscovery.io/el.js
https://projectdiscovery.io/en-au
https://projectdiscovery.io/en-au.js
https://projectdiscovery.io/en-ca
https://projectdiscovery.io/en-ca.js
https://projectdiscovery.io/en-gb
https://projectdiscovery.io/en-gb.js
https://projectdiscovery.io/en-ie
https://projectdiscovery.io/en-ie.js
https://projectdiscovery.io/en-il
https://projectdiscovery.io/en-il.js
https://projectdiscovery.io/en-in
https://projectdiscovery.io/en-in.js
https://projectdiscovery.io/en-nz
https://projectdiscovery.io/en-nz.js
https://projectdiscovery.io/en-sg
https://projectdiscovery.io/en-sg.js
https://projectdiscovery.io/eo
https://projectdiscovery.io/eo.js
https://projectdiscovery.io/es
https://projectdiscovery.io/es-do
https://projectdiscovery.io/es-do.js
https://projectdiscovery.io/es-mx
https://projectdiscovery.io/es-mx.js
https://projectdiscovery.io/es-us
https://projectdiscovery.io/es-us.js
https://projectdiscovery.io/es.js
https://projectdiscovery.io/et
https://projectdiscovery.io/et.js
https://projectdiscovery.io/eu
https://projectdiscovery.io/eu.js
https://projectdiscovery.io/fa
https://projectdiscovery.io/fa.js
https://projectdiscovery.io/fi
https://projectdiscovery.io/fi.js
https://projectdiscovery.io/fil
https://projectdiscovery.io/fil.js
https://projectdiscovery.io/fo
https://projectdiscovery.io/fo.js
https://projectdiscovery.io/fr
https://projectdiscovery.io/fr-ca
https://projectdiscovery.io/fr-ca.js
https://projectdiscovery.io/fr-ch
https://projectdiscovery.io/fr-ch.js
https://projectdiscovery.io/fr.js
https://projectdiscovery.io/fy
https://projectdiscovery.io/fy.js
https://projectdiscovery.io/ga
https://projectdiscovery.io/ga.js
https://projectdiscovery.io/gd
https://projectdiscovery.io/gd.js
https://projectdiscovery.io/gl
https://projectdiscovery.io/gl.js
https://projectdiscovery.io/gom-deva
https://projectdiscovery.io/gom-deva.js
https://projectdiscovery.io/gom-latn
https://projectdiscovery.io/gom-latn.js
https://projectdiscovery.io/gu
https://projectdiscovery.io/gu.js
https://projectdiscovery.io/he
https://projectdiscovery.io/he.js
https://projectdiscovery.io/hi
https://projectdiscovery.io/hi.js
https://projectdiscovery.io/hr
https://projectdiscovery.io/hr.js
https://projectdiscovery.io/hu
https://projectdiscovery.io/hu.js
https://projectdiscovery.io/hy-am
https://projectdiscovery.io/hy-am.js
https://projectdiscovery.io/id
https://projectdiscovery.io/id.js
https://projectdiscovery.io/is
https://projectdiscovery.io/is.js
https://projectdiscovery.io/it
https://projectdiscovery.io/it-ch
https://projectdiscovery.io/it-ch.js
https://projectdiscovery.io/it.js
https://projectdiscovery.io/ja
https://projectdiscovery.io/ja.js
https://projectdiscovery.io/jv
https://projectdiscovery.io/jv.js
https://projectdiscovery.io/ka
https://projectdiscovery.io/ka.js
https://projectdiscovery.io/kk
https://projectdiscovery.io/kk.js
https://projectdiscovery.io/km
https://projectdiscovery.io/km.js
https://projectdiscovery.io/kn
https://projectdiscovery.io/kn.js
https://projectdiscovery.io/ko
https://projectdiscovery.io/ko.js
https://projectdiscovery.io/ku
https://projectdiscovery.io/ku.js
https://projectdiscovery.io/ky
https://projectdiscovery.io/ky.js
https://projectdiscovery.io/lb
https://projectdiscovery.io/lb.js
https://projectdiscovery.io/lo
https://projectdiscovery.io/lo.js
https://projectdiscovery.io/lt
https://projectdiscovery.io/lt.js
https://projectdiscovery.io/lv
https://projectdiscovery.io/lv.js
https://projectdiscovery.io/me
https://projectdiscovery.io/me.js
https://projectdiscovery.io/mi
https://projectdiscovery.io/mi.js
https://projectdiscovery.io/mk
https://projectdiscovery.io/mk.js
https://projectdiscovery.io/ml
https://projectdiscovery.io/ml.js
https://projectdiscovery.io/mn
https://projectdiscovery.io/mn.js
https://projectdiscovery.io/mr
https://projectdiscovery.io/mr.js
https://projectdiscovery.io/ms
https://projectdiscovery.io/ms-my
https://projectdiscovery.io/ms-my.js
https://projectdiscovery.io/ms.js
https://projectdiscovery.io/mt
https://projectdiscovery.io/mt.js
https://projectdiscovery.io/my
https://projectdiscovery.io/my.js
https://projectdiscovery.io/nb
https://projectdiscovery.io/nb.js
https://projectdiscovery.io/ne
https://projectdiscovery.io/ne.js
https://projectdiscovery.io/nl
https://projectdiscovery.io/nl-be
https://projectdiscovery.io/nl-be.js
https://projectdiscovery.io/nl.js
https://projectdiscovery.io/nn
https://projectdiscovery.io/nn.js
https://projectdiscovery.io/oc-lnc
https://projectdiscovery.io/oc-lnc.js
https://projectdiscovery.io/pa-in
https://projectdiscovery.io/pa-in.js
https://projectdiscovery.io/pl
https://projectdiscovery.io/pl.js
https://projectdiscovery.io/pt
https://projectdiscovery.io/pt-br
https://projectdiscovery.io/pt-br.js
https://projectdiscovery.io/pt.js
https://projectdiscovery.io/ro
https://projectdiscovery.io/ro.js
https://projectdiscovery.io/ru
https://projectdiscovery.io/ru.js
https://projectdiscovery.io/sd
https://projectdiscovery.io/sd.js
https://projectdiscovery.io/se
https://projectdiscovery.io/se.js
https://projectdiscovery.io/si
https://projectdiscovery.io/si.js
https://projectdiscovery.io/sk
https://projectdiscovery.io/sk.js
https://projectdiscovery.io/sl
https://projectdiscovery.io/sl.js
https://projectdiscovery.io/sq
https://projectdiscovery.io/sq.js
https://projectdiscovery.io/sr
https://projectdiscovery.io/sr-cyrl
https://projectdiscovery.io/sr-cyrl.js
https://projectdiscovery.io/sr.js
https://projectdiscovery.io/ss
https://projectdiscovery.io/ss.js
https://projectdiscovery.io/sv
https://projectdiscovery.io/sv.js
https://projectdiscovery.io/sw
https://projectdiscovery.io/sw.js
https://projectdiscovery.io/ta
https://projectdiscovery.io/ta.js
https://projectdiscovery.io/te
https://projectdiscovery.io/te.js
https://projectdiscovery.io/tet
https://projectdiscovery.io/tet.js
https://projectdiscovery.io/tg
https://projectdiscovery.io/tg.js
https://projectdiscovery.io/th
https://projectdiscovery.io/th.js
https://projectdiscovery.io/tk
https://projectdiscovery.io/tk.js
https://projectdiscovery.io/tl-ph
https://projectdiscovery.io/tl-ph.js
https://projectdiscovery.io/tlh
https://projectdiscovery.io/tlh.js
https://projectdiscovery.io/tr
https://projectdiscovery.io/tr.js

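The spurious entries above (e.g. /ar-dz, /Underscore.js) are moment.js locale identifiers picked up by loose quoted-string matching. The deliberately naive sketch below reproduces the problem; it is not katana's actual regex:

```go
package main

import (
	"fmt"
	"regexp"
)

func main() {
	// A loose pattern similar to common JS endpoint extractors: any quoted
	// path-like token is treated as an endpoint, so locale identifiers such
	// as "ar-dz" are reported alongside real paths.
	loose := regexp.MustCompile(`"([a-zA-Z0-9_.?=&/-]+)"`)

	js := `moment.locale("ar-dz"); fetch("/api/v1/users?id=1");`
	for _, m := range loose.FindAllStringSubmatch(js, -1) {
		fmt.Println("candidate endpoint:", m[1])
	}
	// candidate endpoint: ar-dz
	// candidate endpoint: /api/v1/users?id=1
}
```
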
default scope option update and no scope option

Please describe your feature request:

  • Adding host-based default scope (a minimal sketch follows below)
  • Removing extension-related filters e, extensions-allow-list, extensions-deny-list (not being done, as they wouldn't work separately)
  • Removing the csd, cosd options
  • Adding a -no-scope option to disable the default scope:
   -ns, -no-scope             disable host based default scope.
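
A minimal sketch (an assumption, not katana's code) of what host-based default scope with a -ns/-no-scope escape hatch could look like:

```go
package main

import (
	"fmt"
	"net/url"
)

// inDefaultScope keeps a discovered URL only if its hostname matches the
// seed's hostname, unless noScope (-ns) disables the check.
func inDefaultScope(seed, discovered string, noScope bool) bool {
	if noScope {
		return true
	}
	s, err1 := url.Parse(seed)
	d, err2 := url.Parse(discovered)
	if err1 != nil || err2 != nil {
		return false
	}
	return s.Hostname() == d.Hostname()
}

func main() {
	fmt.Println(inDefaultScope("https://example.com", "https://example.com/login", false))    // true
	fmt.Println(inDefaultScope("https://example.com", "https://cdn.other.com/app.js", false)) // false
	fmt.Println(inDefaultScope("https://example.com", "https://cdn.other.com/app.js", true))  // true
}
```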

Describe the use case of this feature:

  • Optimizing the default behavior to be target-specific.
  • Optionally, users can disable this behavior when required.
  • Removing options that are already covered by the -cs/-cos options.

Endpoint in JS is not crawled / following depth option

katana version:

main

Current Behavior:

Endpoints in JavaScript files are not crawled after initial detection / the depth option is not followed.

Expected Behavior:

The -depth option should be respected for JavaScript files (or any extension).

Steps To Reproduce:

echo https://projectdiscovery.io/app.js | ./katana -sjr | wc 
383

echo https://projectdiscovery.io/app.js | ./katana -sjr -d 5 | wc
383

echo https://projectdiscovery.io/app.js | ./katana -sjr -d 10 | wc
383
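
A sketch of the expected behaviour, where endpoints extracted from a JS file are re-enqueued with depth+1 and crawled again until -depth is reached, rather than stopping after the initial detection. The toy link graph stands in for real HTTP responses:

```go
package main

import "fmt"

type crawlItem struct {
	URL   string
	Depth int
}

func crawl(seed string, maxDepth int, extract func(string) []string) {
	queue := []crawlItem{{URL: seed, Depth: 0}}
	seen := map[string]bool{seed: true}

	for len(queue) > 0 {
		item := queue[0]
		queue = queue[1:]
		fmt.Println(item.Depth, item.URL)

		if item.Depth >= maxDepth {
			continue // depth budget spent for this branch
		}
		for _, u := range extract(item.URL) { // parse response body / JS for endpoints
			if !seen[u] {
				seen[u] = true
				queue = append(queue, crawlItem{URL: u, Depth: item.Depth + 1})
			}
		}
	}
}

func main() {
	links := map[string][]string{
		"https://example.com/app.js": {"https://example.com/api/v1", "https://example.com/api/v2"},
		"https://example.com/api/v1": {"https://example.com/api/v1/users"},
	}
	crawl("https://example.com/app.js", 3, func(u string) []string { return links[u] })
}
```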
