privacy-scraper's Introduction

Privacy Scraper

Copyright

Creative Commons

How to install:

  1. Install the requirements with
pip install -r requirements.txt

  2. Create a file named .secrets.yaml in the project root with the following structure:
user: "<user email>"
pwd: "<password>"
  3. If necessary, check whether you need to run
playwright install
  4. In settings.yaml, check the path of the downloads folder. Use the full path to the folder, for example:
downloaddir: C:\downloads
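
The requirement that downloaddir be a full path can be checked before a run. The following is a minimal sketch, not part of the project: `is_full_path` and `check_settings` are hypothetical helpers, and the settings are assumed to have already been parsed into a plain dict.

```python
from pathlib import PurePosixPath, PureWindowsPath


def is_full_path(downloaddir: str) -> bool:
    """Return True if the value is absolute in Windows or POSIX form."""
    return (
        PureWindowsPath(downloaddir).is_absolute()
        or PurePosixPath(downloaddir).is_absolute()
    )


def check_settings(settings: dict) -> list[str]:
    """Collect readable problems found in an already-parsed settings dict.

    Assumption: the only key inspected is 'downloaddir', the one named
    in the install steps.
    """
    problems = []
    downloaddir = settings.get("downloaddir")
    if not downloaddir:
        problems.append("downloaddir is missing")
    elif not is_full_path(downloaddir):
        problems.append(f"downloaddir is not a full path: {downloaddir}")
    return problems


if __name__ == "__main__":
    print(check_settings({"downloaddir": r"C:\downloads"}))
    print(check_settings({"downloaddir": "downloads"}))
```

Using the pure path classes keeps the check independent of the operating system the script runs on, so a Windows-style path is recognised even when checked on Linux or macOS.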

How to use:

  1. Once everything is configured, just run
python main.py
  2. When the list of profiles appears, enter the number of the chosen profile, or 0 to scrape all profiles.
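
The menu rule described here (a number picks one profile, 0 picks all of them) could be expressed roughly as follows. This is an illustrative sketch only: `select_profiles` is a hypothetical name, and the way main.py actually reads the choice is not shown in this document.

```python
def select_profiles(profiles: list[str], choice: str) -> list[str]:
    """Map a menu entry to profiles: 0 means all, N means the N-th profile."""
    try:
        index = int(choice.strip())
    except ValueError:
        raise ValueError(f"not a number: {choice!r}") from None
    if index == 0:
        # 0 selects every listed profile.
        return list(profiles)
    if 1 <= index <= len(profiles):
        # The menu is assumed to be numbered from 1.
        return [profiles[index - 1]]
    raise ValueError(f"choice out of range: {index}")


if __name__ == "__main__":
    listed = ["alice", "bob", "carol"]  # made-up profile names
    print(select_profiles(listed, "2"))
    print(select_profiles(listed, "0"))
```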

privacy-scraper's People

Contributors

hammerheaddf, ishimarumakoto, keyaru38

privacy-scraper's Issues

Fetch all posts from a profile

As:
A user of the script

I would like to:
Not have to specify any profile, and have the script automatically pick up every profile the logged-in user subscribes to

So that:
The tool scrapes everything automatically.

Acceptance criteria:

  1. On startup, the script logs in and, instead of going straight to the profile given as a parameter, shows the subscribed profiles (perfil/seguindo, i.e. profile/following) with the option to scrape one of them individually or, by pressing 0, all of them

Profile list only fetches the first 30

The profile list loads dynamically, pulling 30 at a time. I, for example, follow 37 profiles and only the first 30 show up. We need to identify which XHR builds this list.
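
Once the request behind the list is identified, collecting every page could follow the usual offset loop. The sketch below is an assumption-heavy illustration: `fetch_page` is a hypothetical callable standing in for that unidentified XHR, and the offset/limit parameters are guesses about how the endpoint paginates.

```python
from typing import Callable


def fetch_all_profiles(
    fetch_page: Callable[[int, int], list[str]],
    page_size: int = 30,
) -> list[str]:
    """Call fetch_page(offset, limit) until a short or empty page arrives."""
    profiles: list[str] = []
    offset = 0
    while True:
        batch = fetch_page(offset, page_size)
        profiles.extend(batch)
        if len(batch) < page_size:
            # A page shorter than page_size is taken to be the last one.
            break
        offset += len(batch)
    return profiles


if __name__ == "__main__":
    followed = [f"profile{i}" for i in range(37)]  # stand-in data

    def fake_page(offset: int, limit: int) -> list[str]:
        return followed[offset:offset + limit]

    print(len(fetch_all_profiles(fake_page)))
```

With 37 followed profiles, the loop requests one full page of 30 and one short page of 7, then stops.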

Interest

Hello, does your download system work?

Getting an error message

Abrindo página de login...
Traceback (most recent call last):
File "/Users//privacy-scraper/main.py", line 493, in <module>
asyncio.run(main())
^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/asyncclick/core.py", line 1205, in __call__
return anyio.run(self._main, main, args, kwargs, **opts)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/anyio/_core/_eventloop.py", line 73, in run
return async_backend.run(func, args, {}, backend_options)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/anyio/_backends/_asyncio.py", line 2001, in run
return runner.run(wrapper())
^^^^^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/asyncio/base_events.py", line 687, in run_until_complete
return future.result()
^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/anyio/_backends/_asyncio.py", line 1989, in wrapper
return await func(*args)
^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/asyncclick/core.py", line 1208, in _main
return await main(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/asyncclick/core.py", line 1120, in main
rv = await self.invoke(ctx)
^^^^^^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/asyncclick/core.py", line 1485, in invoke
return await ctx.invoke(self.callback, **ctx.params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/asyncclick/core.py", line 824, in invoke
rv = await rv
^^^^^^^^
File "/Users//privacy-scraper/main.py", line 452, in main
await expect(user).to_be_editable()
File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/playwright/async_api/_generated.py", line 19095, in to_be_editable
await self._impl_obj.to_be_editable(editable=editable, timeout=timeout)
File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/playwright/_impl/_assertions.py", line 570, in to_be_editable
await self._expect_impl(
File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/playwright/_impl/_assertions.py", line 70, in _expect_impl
raise AssertionError(
AssertionError: Locator expected to be editable
Actual value: None
Call log:
LocatorAssertions.to_be_editable with timeout 5000ms

  • waiting for get_by_placeholder("E-mail")

Timeout

The login screen appears and the email and password are filled in, but it goes no further. If I click to log in, it signs in but nothing happens and the script exits with a timeout.
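
The traceback shows the `to_be_editable` assertion giving up after 5000ms with an `AssertionError`. One possible mitigation, offered only as a sketch and not as the author's fix, is to retry the assertion a few times before failing. `retry_assertion` is a hypothetical helper, and whether a longer wait actually resolves this report is an untested assumption.

```python
import asyncio


async def retry_assertion(check, attempts: int = 3, delay: float = 1.0):
    """Await check() until it stops raising AssertionError or attempts run out."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return await check()
        except AssertionError:
            if attempt == attempts:
                # Out of attempts: let the original error propagate.
                raise
            await asyncio.sleep(delay)


# Hypothetical use inside main(), with the `user` locator and `expect`
# that appear in the traceback:
#     await retry_assertion(lambda: expect(user).to_be_editable(timeout=15_000))
```

The helper only depends on asyncio, so it can wrap any awaitable check that signals failure with `AssertionError`.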

Broken status bars (tqdm)

The original code had only one bug with the tqdm status bars. Now several bars are broken. I will investigate when I have time.

KeyError: 'Content-Length'

It is showing the following error:

Traceback (most recent call last):
File "D:\privacy-scraper-playwright\main.py", line 498, in <module>
asyncio.run(main())
^^^^^^
File "C:\Users\User\AppData\Local\Programs\Python\Python312\Lib\site-packages\asyncclick\core.py", line 1205, in __call__
return anyio.run(self._main, main, args, kwargs, **opts)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\User\AppData\Local\Programs\Python\Python312\Lib\site-packages\anyio\_core\_eventloop.py", line 74, in run
return async_backend.run(func, args, {}, backend_options)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\User\AppData\Local\Programs\Python\Python312\Lib\site-packages\anyio\_backends\_asyncio.py", line 2034, in run
return runner.run(wrapper())
^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\User\AppData\Local\Programs\Python\Python312\Lib\asyncio\runners.py", line 118, in run
return self._loop.run_until_complete(task)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\User\AppData\Local\Programs\Python\Python312\Lib\asyncio\base_events.py", line 687, in run_until_complete
return future.result()
^^^^^^^^^^^^^^^
File "C:\Users\User\AppData\Local\Programs\Python\Python312\Lib\site-packages\anyio\_backends\_asyncio.py", line 2022, in wrapper
return await func(*args)
^^^^^^^^^^^^^^^^^
File "C:\Users\User\AppData\Local\Programs\Python\Python312\Lib\site-packages\asyncclick\core.py", line 1208, in _main
return await main(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\User\AppData\Local\Programs\Python\Python312\Lib\site-packages\asyncclick\core.py", line 1120, in main
rv = await self.invoke(ctx)
^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\User\AppData\Local\Programs\Python\Python312\Lib\site-packages\asyncclick\core.py", line 1485, in invoke
return await ctx.invoke(self.callback, **ctx.params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\User\AppData\Local\Programs\Python\Python312\Lib\site-packages\asyncclick\core.py", line 824, in invoke
rv = await rv
^^^^^^^^
File "D:\privacy-scraper-playwright\main.py", line 488, in main
await fetch_profiles(page, profile, backlog)
File "D:\privacy-scraper-playwright\main.py", line 110, in fetch_profiles
proc2 = multiprocessing.Process(target=await downloadLinks(page,jar,profile))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\privacy-scraper-playwright\main.py", line 318, in downloadLinks
await asyncio.gather(
*([retrieveLinks(mediastoDownload,medias)] + [requestLink(medias, cookiejar) for _ in range(4)]))
File "D:\privacy-scraper-playwright\main.py", line 354, in requestLink
if int(req.headers['Content-Length']) < 10000000: #10MB
~~~~~~~~~~~^^^^^^^^^^^^^^^^^^
File "C:\Users\User\AppData\Local\Programs\Python\Python312\Lib\site-packages\httpx\_models.py", line 228, in __getitem__
raise KeyError(key)
KeyError: 'Content-Length'
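
The failing line indexes `req.headers['Content-Length']` directly, which raises when the server omits that header (for example, with chunked transfer encoding). A defensive variant could look like the sketch below. `content_length` and `is_small_file` are hypothetical helpers, not code from main.py, and treating an unknown size as "not small" is an assumption about the desired behaviour.

```python
from typing import Mapping, Optional

SMALL_FILE_LIMIT = 10_000_000  # 10 MB, the threshold visible in the traceback


def content_length(headers: Mapping[str, str]) -> Optional[int]:
    """Return Content-Length as an int, or None if absent or malformed."""
    value = headers.get("Content-Length")
    if value is None:
        return None
    try:
        return int(value)
    except ValueError:
        return None


def is_small_file(headers: Mapping[str, str], limit: int = SMALL_FILE_LIMIT) -> bool:
    """True only when the size is known and below the limit.

    Assumption: a response with no usable Content-Length is handled
    like a large file instead of raising KeyError.
    """
    size = content_length(headers)
    return size is not None and size < limit


if __name__ == "__main__":
    print(is_small_file({"Content-Length": "2048"}))
    print(is_small_file({}))
```

The helpers accept any mapping with a `get` method. An httpx headers object matches names case-insensitively, while a plain dict, as used in the demo, requires the exact key.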

Finds the posts but not the media.

C:\pvc>python main.py github
Abrindo página de login...
Aguardando autenticação...
Procurando página de postagens do perfil github...
Buscando postagens com mídia...
Post 6726d365-2b67-42d8-8a2d-45737aa424c1: 100%|██████████████████████████████████████████| 340/340 [00:23, 14.73it/s]
325 postagens com texto e mídia, 0 mídias encontradas. Baixando 0 mídias.
Sem mídia para baixar.
Encerrado.

It opens Chromium in incognito mode, opens the Privacy login page, enters the login and password, and opens the chosen profile, but it finds the posts and does not download any media. Supposedly, this specific profile has 340 posts and 3,771 media items.

I saw that you have no time to apply a fix right now, so I will wait for one if you ever get the chance!
