pydomainextractor's Introduction

Splunk> Phantom

Welcome to the open-source repository for Splunk> Phantom's intsights App.

Please have a look at our Contributing Guide if you are interested in contributing, raising issues, or learning more about open-source Phantom apps.

Legal and License

This Phantom App is licensed under the Apache 2.0 license. Please see our Contributing Guide for further details.

pydomainextractor's People

Contributors

nescobaraloplop, wavenator

pydomainextractor's Issues

extract_from_url should handle url without protocol-scheme

How to reproduce:
Call extract_from_url with //mail.google.com/mail as input.

The result is an Invalid Domain error.
The expected behavior is to handle the missing protocol and return {subdomain: mail, domain: google, suffix: com}.
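
One possible way to get at the host of a scheme-less URL before any suffix matching is the standard library's `urlsplit`, which accepts a leading `//` as the start of the authority part. This is only a sketch of the idea, not the library's implementation, and `host_from_url` is a hypothetical helper name.

```python
from urllib.parse import urlsplit


def host_from_url(url):
    """Return the lower-cased host of a URL, or None when there is no authority part.

    Hypothetical helper: it is not part of pydomainextractor.
    """
    return urlsplit(url).hostname


# A protocol-relative URL and a fully qualified URL yield the same host.
print(host_from_url('//mail.google.com/mail'))
print(host_from_url('https://mail.google.com/mail'))
```

Note that input without `//` (for example `mail.google.com/mail`) is treated by `urlsplit` as a path, so the helper returns None for it.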

query fragments are parsed incorrectly

            first=self.domain_extractor.extract_from_url(
                'http://google.com?q=cats',
            ),
            second={
                'subdomain': '',
                'domain': 'google',
>               'suffix': 'com',
            },
        )
E       AssertionError: {'subdomain': 'google', 'domain': 'com?q=cats', 'suffix': ''} != {'subdomain': '', 'domain': 'google', 'suffix': 'com'}
E       - {'domain': 'com?q=cats', 'subdomain': 'google', 'suffix': ''}
E       ?             ^ ^^^^^^^^                 ------
E       
E       + {'domain': 'google', 'subdomain': '', 'suffix': 'com'}
E       ?             ^ ^^^^                               +++

Expected behavior: the query string and fragment (the parts starting with ? or #) should be removed and the suffix recognized correctly:
{'subdomain': '', 'domain': 'google', 'suffix': 'com'}
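
The rule behind the expected behavior is that the authority part of a URL ends at the first `/`, `?` or `#`. A minimal sketch of that rule follows; `authority_of` is a hypothetical name, and the sketch assumes the first `//` in the input is the authority marker and ignores userinfo and ports.

```python
def authority_of(url):
    """Return the authority part of a URL, without path, query, or fragment.

    Hypothetical helper for illustration; not part of pydomainextractor.
    """
    # Drop an optional "scheme:" prefix together with the "//" marker.
    _, separator, rest = url.partition('//')
    if not separator:
        rest = url

    # The authority ends at the first '/', '?' or '#', whichever comes first.
    end = len(rest)
    for delimiter in '/?#':
        position = rest.find(delimiter)
        if position != -1:
            end = min(end, position)

    return rest[:end]


print(authority_of('http://google.com?q=cats'))
```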

Install error on macOS

macOS Big Sur 11.2.3

$ brew install gcc libidn2

succeeds, but the following fails:

$ pip install pydomainextractor
Looking in indexes: https://mirrors.aliyun.com/pypi/simple/
Collecting pydomainextractor
Using cached https://mirrors.aliyun.com/pypi/packages/70/81/1d4faabb5c7c14351ccc1e69f36f822e8d2958517ff95f3fe293bc1ed501/PyDomainExtractor-0.9.1.tar.gz (126 kB)
Using legacy 'setup.py install' for pydomainextractor, since package 'wheel' is not installed.
Installing collected packages: pydomainextractor
Running setup.py install for pydomainextractor ... error
ERROR: Command errored out with exit status 1:
command: /Users/xx/PycharmProjects/pythonProject/venv/bin/python -u -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/private/var/folders/b6/3p4psc6x6fnbv84226snn34c0000gn/T/pip-install-c3fcazi1/pydomainextractor_3e140deed58c463c9b158f3ab06ff95c/setup.py'"'"'; file='"'"'/private/var/folders/b6/3p4psc6x6fnbv84226snn34c0000gn/T/pip-install-c3fcazi1/pydomainextractor_3e140deed58c463c9b158f3ab06ff95c/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(file) if os.path.exists(file) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' install --record /private/var/folders/b6/3p4psc6x6fnbv84226snn34c0000gn/T/pip-record-j3la2jy6/install-record.txt --single-version-externally-managed --compile --install-headers /Users/xx/PycharmProjects/pythonProject/venv/include/site/python3.9/pydomainextractor
cwd: /private/var/folders/b6/3p4psc6x6fnbv84226snn34c0000gn/T/pip-install-c3fcazi1/pydomainextractor_3e140deed58c463c9b158f3ab06ff95c/
Complete output (27 lines):
running install
running build
running build_py
creating build
creating build/lib.macosx-11.2-x86_64-3.9
creating build/lib.macosx-11.2-x86_64-3.9/tests
copying tests/__init__.py -> build/lib.macosx-11.2-x86_64-3.9/tests
copying tests/test_pydomainextractor.py -> build/lib.macosx-11.2-x86_64-3.9/tests
creating build/lib.macosx-11.2-x86_64-3.9/pydomainextractor
copying pydomainextractor/__init__.py -> build/lib.macosx-11.2-x86_64-3.9/pydomainextractor
running egg_info
writing PyDomainExtractor.egg-info/PKG-INFO
writing dependency_links to PyDomainExtractor.egg-info/dependency_links.txt
writing top-level names to PyDomainExtractor.egg-info/top_level.txt
reading manifest file 'PyDomainExtractor.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
writing manifest file 'PyDomainExtractor.egg-info/SOURCES.txt'
copying pydomainextractor/pydomainextractor.pyi -> build/lib.macosx-11.2-x86_64-3.9/pydomainextractor
running build_ext
building 'extractor' extension
creating build/temp.macosx-11.2-x86_64-3.9
creating build/temp.macosx-11.2-x86_64-3.9/src
clang -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -I/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include -I/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include -Isrc -I/Users/xx/PycharmProjects/pythonProject/venv/include -I/Users/xx/.pyenv/versions/3.9.4/include/python3.9 -c src/extractor.cpp -o build/temp.macosx-11.2-x86_64-3.9/src/extractor.o -std=c++17
clang++ -bundle -undefined dynamic_lookup -L/usr/local/opt/readline/lib -L/usr/local/opt/readline/lib -L/Users/xx/.pyenv/versions/3.9.4/lib -L/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/lib -L/usr/local/opt/readline/lib -L/usr/local/opt/readline/lib -L/Users/xx/.pyenv/versions/3.9.4/lib -L/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/lib build/temp.macosx-11.2-x86_64-3.9/src/extractor.o -o build/lib.macosx-11.2-x86_64-3.9/pydomainextractor/extractor.cpython-39-darwin.so -lidn2 -Wl,--strip-all
ld: unknown option: --strip-all
clang: error: linker command failed with exit code 1 (use -v to see invocation)
error: command '/usr/bin/clang++' failed with exit code 1
----------------------------------------
ERROR: Command errored out with exit status 1: /Users/xx/PycharmProjects/pythonProject/venv/bin/python -u -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/private/var/folders/b6/3p4psc6x6fnbv84226snn34c0000gn/T/pip-install-c3fcazi1/pydomainextractor_3e140deed58c463c9b158f3ab06ff95c/setup.py'"'"'; file='"'"'/private/var/folders/b6/3p4psc6x6fnbv84226snn34c0000gn/T/pip-install-c3fcazi1/pydomainextractor_3e140deed58c463c9b158f3ab06ff95c/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(file) if os.path.exists(file) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' install --record /private/var/folders/b6/3p4psc6x6fnbv84226snn34c0000gn/T/pip-record-j3la2jy6/install-record.txt --single-version-externally-managed --compile --install-headers /Users/xx/PycharmProjects/pythonProject/venv/include/site/python3.9/pydomainextractor Check the logs for full command output.
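
The log shows the link step failing because Apple's `ld` does not recognize `--strip-all`, which is a GNU linker option. One way a build script could avoid this is to choose the linker flags per platform. The sketch below is an assumption about how that might look, not the project's actual setup.py; `extra_link_args_for` is a hypothetical helper.

```python
import sys


def extra_link_args_for(platform_name):
    """Return linker flags for the given sys.platform value.

    Hypothetical helper: '-Wl,--strip-all' is passed only where the
    GNU linker is expected, and omitted on macOS ('darwin').
    """
    if platform_name.startswith('linux'):
        return ['-Wl,--strip-all']
    return []


# In a setup.py this list could be passed as Extension(..., extra_link_args=...).
print(extra_link_args_for(sys.platform))
```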

Enhance the structure returned from the dictionary

Good morning,

You could enhance the returned dictionary by providing three more elements:

  • fqdn: the full subdomain.domain.suffix string
  • registered domain: the domain.suffix string
  • a combined subdomain.domain string (which would resolve the IP case issue)
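
Until something like this exists in the library, the three values can be derived from the dictionary that is already returned. The sketch below assumes the `subdomain`, `domain` and `suffix` keys shown elsewhere on this page; the function name `enrich` and the new key names are hypothetical.

```python
def enrich(parts):
    """Return a copy of an extraction result with three derived fields added.

    Hypothetical helper; the added key names are illustrative only.
    """
    def join(*labels):
        # Skip empty components so no leading or trailing dots appear.
        return '.'.join(label for label in labels if label)

    enriched = dict(parts)
    enriched['fqdn'] = join(parts['subdomain'], parts['domain'], parts['suffix'])
    enriched['registered_domain'] = join(parts['domain'], parts['suffix'])
    enriched['subdomain_and_domain'] = join(parts['subdomain'], parts['domain'])
    return enriched


print(enrich({'subdomain': 'mail', 'domain': 'google', 'suffix': 'com'}))
```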

Support Apple Silicon

It appears that even though version 0.11.0 is Rust-based, a wheel for Apple Silicon isn't available, so installing with pip throws an error.
Can you add an M1 (arm64) wheel to the release?

Skipping link: none of the wheel’s tags (cp39-cp39-macosx_10_7_x86_64) are compatible (run pip debug --verbose to show compatible tags): https://files.pythonhosted.org/packages/37/c1/ba1152e8b7457d66b035debe0c996c5a597aa1a9e438997dbd31b9e6c08d/pydomainextractor-0.11.0-cp39-cp39-macosx_10_7_x86_64.whl#sha256=0d0fd67f80b1772e279b00c1374639c882f583b7aa58bf81f54fbaaa92aaedf4 (from https://pypi.org/simple/pydomainextractor/) (requires-python:>=3.7)
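
The pip message rejects the wheel because of its platform tag, which is the last dash-separated field of a wheel filename. A small sketch for reading that tag follows; the helper names are hypothetical, and the check for `arm64` or `universal2` is an assumption about which macOS tags cover Apple Silicon.

```python
def wheel_tags(filename):
    """Split a wheel filename into (python tag, ABI tag, platform tag)."""
    stem = filename[:-len('.whl')] if filename.endswith('.whl') else filename
    # The last three dash-separated fields are the compatibility tags.
    python_tag, abi_tag, platform_tag = stem.rsplit('-', 3)[1:]
    return python_tag, abi_tag, platform_tag


def targets_apple_silicon(filename):
    """Return True when a macOS wheel is tagged for arm64 or universal2."""
    platform_tag = wheel_tags(filename)[2]
    return platform_tag.startswith('macosx_') and platform_tag.endswith(
        ('_arm64', '_universal2')
    )


name = 'pydomainextractor-0.11.0-cp39-cp39-macosx_10_7_x86_64.whl'
print(wheel_tags(name), targets_apple_silicon(name))
```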

infinite loop

import pydomainextractor
pydomainextractor.DomainExtractor().extract('.am')
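
When reproducing a hang like this, it can help to run the call in a child interpreter with a timeout so the reporting session stays usable. The following is a generic sketch using only the standard library; `run_isolated` is a hypothetical helper and the five second default is arbitrary.

```python
import subprocess
import sys


def run_isolated(snippet, timeout_seconds=5.0):
    """Run a Python snippet in a child interpreter.

    Returns 'ok' on exit code 0, 'failed' on any other exit code,
    and 'timeout' when the child had to be killed.
    """
    try:
        completed = subprocess.run(
            [sys.executable, '-c', snippet],
            capture_output=True,
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        return 'timeout'
    return 'ok' if completed.returncode == 0 else 'failed'


print(run_isolated('while True: pass', timeout_seconds=1.0))
```

The reproduction above could then be passed as the snippet string, so an endless loop shows up as a timeout rather than a frozen shell.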

Wrong URL extraction

In readme you provide a domain extraction example:

import pydomainextractor

# Loads the current supplied version of PublicSuffixList from the repository. Does not download any data.
domain_extractor = pydomainextractor.DomainExtractor()

domain_extractor.extract('http://google.com/')
>>> {
>>>     'subdomain': '',
>>>     'domain': 'google',
>>>     'suffix': 'com'
>>> }

whereas the current result of running the above input is:

>>> {
>>>     'suffix': '',
>>>     'domain': 'com/',
>>>     'subdomain': 'http://google'
>>> }

Also, if the URL has a path, the path is parsed as part of the domain.

domain_extractor.extract('https://google.com/some_path')
>>> {
>>>     'suffix': '',
>>>     'domain': 'com/some_path',
>>>     'subdomain': 'https://google'
>>> }

The module shares state across all imports

Problem

If two different parts of the code load different suffix lists, they override each other when they run in the same process.

Proposed Solution

Use classes to create independent instances.
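
To illustrate the proposal, here is a toy extractor in which each instance owns its suffix list, so two instances in the same process cannot affect each other. It is a deliberately simplified sketch, not the library's matching logic: `ToyExtractor` is a hypothetical class, it ignores wildcard and exception rules, and the fallback for unknown suffixes is an arbitrary choice.

```python
class ToyExtractor:
    """Toy suffix matcher with per-instance state (illustration only)."""

    def __init__(self, suffixes):
        # Each instance keeps its own immutable copy of the suffix list.
        self._suffixes = frozenset(suffixes)

    def extract(self, host):
        labels = host.split('.')
        # Try the longest candidate suffix first, leaving one label for the domain.
        for start in range(1, len(labels)):
            suffix = '.'.join(labels[start:])
            if suffix in self._suffixes:
                return {
                    'subdomain': '.'.join(labels[:start - 1]),
                    'domain': labels[start - 1],
                    'suffix': suffix,
                }
        # Arbitrary fallback when no known suffix matches.
        return {'subdomain': '', 'domain': host, 'suffix': ''}


first = ToyExtractor({'com', 'uk', 'co.uk'})
second = ToyExtractor({'uk'})
print(first.extract('www.example.co.uk'))
print(second.extract('www.example.co.uk'))
```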

Library crashes python on some invalid inputs

Hey,

I've noticed that for some inputs, the method is_valid_domain crashes python altogether.
To reproduce -

import pydomainextractor

domain_extractor = pydomainextractor.DomainExtractor()
domain_extractor.is_valid_domain('www.google.com.')

Thanks!
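
As a stopgap on the caller's side, input with an empty label (a leading, trailing, or doubled dot) could be screened out before it reaches the native code. This is only a defensive sketch under the assumption that empty labels are what triggers the crash; `has_empty_label` is a hypothetical helper, and whether a trailing dot should be rejected or stripped is a separate design decision.

```python
def has_empty_label(domain):
    """Return True when splitting on '.' yields an empty label."""
    return any(label == '' for label in domain.split('.'))


for candidate in ('www.google.com.', 'www.google.com'):
    print(candidate, has_empty_label(candidate))
```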
