dmhy-spider's Introduction

About DMHY-spider

ACGL | DMHY spider

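DMHY-spider crawls the 动漫花园 (DMHY) site and saves the web page and torrent of every anime entry uploaded by fansub groups, sorted by category, to the local file system and a database. The database is currently SQLite3, but in use the database file turned out to grow fairly quickly, so it may be replaced with PostgreSQL later. The final goal is to build a set of rules that can intelligently recognize whether a record belongs to a series we are following (based on the series title, while coping with Simplified Chinese, Traditional Chinese, Japanese, romaji, abbreviations of the title, and even misspellings). Feature overview:

  • Crawl the DMHY site: to reduce the load on the server, the access interval is set to 10 s;
  • Store to the file system and database: the web page and torrent of each record are saved to the file system, and the title, file size, type, and so on are recorded in the SQLite3 database for later mining;
  • Intelligently recognize series: not implemented yet, and I have not decided what it should look like. Contact me if you have ideas.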
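
The fixed access interval from the first item can be sketched as below. This is a minimal illustration, not the project's code: `fetch_pages`, the `fetch` callable, and the `sleep` parameter are hypothetical names, and the downloader is passed in so the loop can be tried without touching the site.

```python
import time


def fetch_pages(page_urls, fetch, delay_seconds=10, sleep=time.sleep):
    """Fetch each URL in order, pausing between consecutive requests.

    `fetch` is any callable that takes a URL and returns the page body
    (hypothetical; the real spider has its own downloader).
    """
    pages = []
    for index, page_url in enumerate(page_urls):
        if index > 0:
            # Wait before every request except the first one.
            sleep(delay_seconds)
        pages.append(fetch(page_url))
    return pages
```

Passing `sleep` as a parameter keeps the pause easy to replace in tests.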

Usage

I just wanted to crawl some web pages, so please don't come knocking on my door. -- Vijay

$ python DMHY_DataBase.py

I think the prompts my code prints at runtime are already plentiful...

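Enter a mode: 1. Update yesterday  2. Update a given date (format: 2016-08-23)
              3. Update a date range (format: [2016-08-22,2016-08-23])
              4. Automatic update mode  5. Update a fixed number of pages
              6. Update up to yesterday (safe mode) (do not use on the first run)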

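Yes, that is how you use it.

1: update yesterday.

2: update a given date.

3: update a date range (mind the input format; the first and last characters can be anything, I type [] to remind you that both ends of the interval are closed).

4: automatic update mode updates back to the most recent record currently in the database. If the database is empty, it keeps reading all the way back to 1970, which means crawling the entire site. However, at some point an exception is thrown because a page cannot be read, so do not use this mode on the first run, otherwise... you will crawl for months... and then a single exception leaves your database without one committed record. Yeah.

5: update just the first few pages.

6: start from the last record of the previous run and continue until all of yesterday's records have been downloaded.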
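
The mode 3 input format can be illustrated with a small parser. This is a sketch under assumptions, not the script's own parsing code: `parse_date_range` is a hypothetical helper that ignores the first and last characters, as described above, and treats both ends as inclusive.

```python
from datetime import datetime, timedelta


def parse_date_range(text):
    """Turn '[2016-08-22,2016-08-23]' into a list of dates, both ends included."""
    # The first and last characters are dropped, whatever they are.
    inner = text.strip()[1:-1]
    start_text, end_text = inner.split(",")
    start = datetime.strptime(start_text.strip(), "%Y-%m-%d").date()
    end = datetime.strptime(end_text.strip(), "%Y-%m-%d").date()
    if start > end:
        raise ValueError("start date must not be after end date")
    # Closed interval: the end date itself is part of the result.
    return [start + timedelta(days=offset) for offset in range((end - start).days + 1)]
```

For example, `parse_date_range("[2016-08-22,2016-08-23]")` yields the two dates 2016-08-22 and 2016-08-23.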

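Apart from **do not use 4 or 6 on the first run**, there are no other warnings. I recommend modes 1 and 3; updating only yesterday's records each day works best. Hm? Why yesterday? Because today is not over yet.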

Do not use 4 or 6 on the first run

Do not use 4 or 6 on the first run

Do not use 4 or 6 on the first run

Important things are said three times.

Database

create table if not exists DMHY_DataBase (
    id INTEGER PRIMARY KEY,
    date VARCHAR(20),
    type VARCHAR(10),
    title VARCHAR(255),
    link VARCHAR(255),
    magnet VARCHAR(255),
    size VARCHAR(10),
    uploader VARCHAR(30),
    HTML TEXT,
    attach VARCHAR(255),
    finish BOOLEAN
)

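id is the auto-incrementing primary key.

date is the date on which the fansub group uploaded the record.

type, title, link, magnet, size, and uploader are the resource's type, title, page link, magnet link, total size, and uploader, respectively.

The link field was added in version 2.0.0. It makes databases from 1.X versions unusable, so 1.X users should delete the database and rebuild it.

HTML is a database backup of the resource's web page. It takes up a fair amount of space, but I decided to keep it for now so that fields that are not extracted today can be mined later.

attach is the location on the local disk where the resource's web page and torrent are stored. If they were not downloaded because auto_download = False, this field is the empty string ''.

finish records whether I have been notified about the resource. It belongs to the third feature, which is not finished, so for now it is always False.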
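
As an illustration of the schema, the sketch below creates the table in an in-memory SQLite database and inserts one row with a parameterized query. The `insert_item` helper and every value in the sample row are placeholders invented for the example; they are not taken from the project or from the site.

```python
import sqlite3

CREATE_SQL = """
create table if not exists DMHY_DataBase (
    id INTEGER PRIMARY KEY,
    date VARCHAR(20),
    type VARCHAR(10),
    title VARCHAR(255),
    link VARCHAR(255),
    magnet VARCHAR(255),
    size VARCHAR(10),
    uploader VARCHAR(30),
    HTML TEXT,
    attach VARCHAR(255),
    finish BOOLEAN
)
"""

INSERT_SQL = (
    "insert into DMHY_DataBase "
    "(date, type, title, link, magnet, size, uploader, HTML, attach, finish) "
    "values (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)"
)


def insert_item(con, item):
    """Insert one record and return the id SQLite assigned to it."""
    cursor = con.execute(INSERT_SQL, item)
    con.commit()
    return cursor.lastrowid


con = sqlite3.connect(":memory:")  # in-memory database, for illustration only
con.execute(CREATE_SQL)
# Placeholder values; attach is '' and finish is False, as described above.
new_id = insert_item(con, (
    "2016-08-23", "example-type", "example title", "example-link",
    "example-magnet", "1GB", "example-uploader", "<html></html>", "", False,
))
```

Because id is an INTEGER PRIMARY KEY, SQLite fills it in when the column is omitted from the insert.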

Calling as a module

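# Import the class from DMHY_DataBase.py (assumes the class shares the module's name).
from DMHY_DataBase import DMHY_DataBase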
# url = r"https://share.dmhy.org/topics/list/page/"
# domain = r"https://share.dmhy.org"

# path = os.getcwd()

# # sqlite_db = r"D:\Data\Desktop\Workspace\test\DMHY\DMHY.db"
# sqlite_db = os.path.join(path, 'DMHY.db')
# time_delay = 0

# # warehouse = r'D:\Data\Desktop\Workspace\test\DMHY\Warehouse'
# warehouse = os.path.join(path, 'Warehouse')

# DataBase = DMHY_DataBase(mode, attr, url, domain, sqlite_db, time_delay, warehouse, auto_download)
DataBase = DMHY_DataBase(mode, attr)
DataBase.start_requests()

You can also write your own module that calls it.

DMHY_DataBase(mode, attr, url=None, domain=None, 
        sqlite_db=None, time_delay=None, warehouse=None,
        auto_download=None):

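mode and attr are required. All other parameters have default values.

url and domain default to DMHY's current domain, and time_delay defaults to 0. These three can be set in the configuration file DMHY_Configuration.cfg in the same directory.

The remaining paths, the database path sqlite_db and the torrent warehouse path warehouse, default to the current directory when not set. When set, the given values are used.

auto_download controls whether, besides inserting into the database, the web page and torrent are also saved locally when a page is checked. It is a Boolean and defaults to True, meaning they are saved locally.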

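Parameters

mode is one of the modes 1, 2, 3, 4, 5, 6 described above.

attr is the date, date range, or page count entered for modes 2, 3, and 5. For modes 1, 4, and 6, attr can be anything.

url is DMHY's list-page URL and domain is the site's domain.

sqlite_db is the path where the database is stored.

time_delay is the access interval.

warehouse is the path where web pages and torrents are stored.

auto_download controls whether web pages and torrents are downloaded to the local disk automatically.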

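Writing the configuration file

Blank lines are ignored, and lines starting with # are comments.

The format is key = value. Spaces are not read, so keys and values (paths in particular) must not contain spaces.

Paths must not contain spaces

Paths must not contain spaces

Paths must not contain spaces

url and domain are mandatory: each must be set in at least one of the two places, the call arguments or the configuration file. Call arguments take precedence over the configuration file.

time_delay is the access interval.

auto_download controls whether web pages and torrents are downloaded to the local disk automatically.

Setting https_security_certificate_check to False disables the HTTPS security check.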
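
The configuration rules above can be sketched as follows. This is not the project's parser: `parse_config` and `resolve` are hypothetical helpers, and dropping every space from a line is an assumption about what "spaces are not read" means, which would also explain why paths containing spaces break.

```python
def parse_config(text):
    """Parse 'key = value' lines; blank lines and # comments are skipped."""
    settings = {}
    for raw_line in text.splitlines():
        # Assumption: every space is dropped, so values cannot contain any.
        line = raw_line.replace(" ", "").strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        settings[key] = value
    return settings


def resolve(name, argument, settings):
    """Prefer the call argument; fall back to the configuration file."""
    if argument is not None:
        return argument
    if name in settings:
        return settings[name]
    raise ValueError("%s must be set in the call or in the configuration file" % name)
```

With this sketch, `resolve("url", None, settings)` returns the configured value, while any non-None argument wins over the file.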

dmhy-spider's People

Contributors

fno2010, vijayqin

dmhy-spider's Issues

Consistency between DMHY.db and Warehouse/

When an update task is interrupted accidentally, the torrent files that were already downloaded remain in Warehouse/, but the corresponding items are never committed to DMHY.db.
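
One possible way to narrow this gap, sketched under assumptions rather than taken from the project, is to commit each row right after its files are stored. The `items` table, `store_items`, and the `save_files` callable are hypothetical and simplified for the example.

```python
import sqlite3


def store_items(con, items, save_files):
    """Save each item's files, then insert and commit its row immediately.

    `save_files` is a hypothetical callable that writes the page and torrent
    to disk and returns the storage path.
    """
    stored = 0
    for item in items:
        attach = save_files(item)
        con.execute(
            "insert into items (title, attach) values (?, ?)",
            (item["title"], attach),
        )
        # Committing per item means an interruption affects at most the
        # item in progress instead of the whole run.
        con.commit()
        stored += 1
    return stored
```

Per-item commits are slower than one commit at the end, so this trades some speed for consistency between the database and the warehouse.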

Handle long path or filename problem

Older Windows versions may limit the length of paths and file names on NTFS. Windows 7 cannot handle a path and file name longer than 260 characters. Trying to process a long path leads to issue #3.

To handle this problem, there are several potential solutions:

  1. Use UUIDs to store the path and link them to items in the DataBase. (Unreadable but maintainable)
  2. Force to cut long filename. (Maybe readable)
  3. Extract keywords and regenerate filename automatically. (Readable but hard to implement)
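
Option 2 could look like the sketch below. `shorten_name` and its `max_length` default are hypothetical; the hash suffix is an addition that keeps two long titles with the same prefix from colliding.

```python
import hashlib


def shorten_name(name, max_length=100):
    """Cut `name` to at most `max_length` characters, keeping it unique.

    Names that already fit are returned unchanged. Longer names keep their
    readable prefix and gain a short hash of the full name as a suffix.
    `max_length` is assumed to be larger than 9.
    """
    if len(name) <= max_length:
        return name
    digest = hashlib.sha1(name.encode("utf-8")).hexdigest()[:8]
    keep = max_length - len(digest) - 1
    return name[:keep] + "_" + digest
```

This only bounds one path component; the full path can still exceed the limit if the parent folders are long.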

TypeError: not all arguments converted during string formatting

The following exception is thrown during execution:

Traceback (most recent call last):
  File "./DMHY_DataBase.py", line 298, in <module>
    DataBase.start_requests()
  File "./DMHY_DataBase.py", line 122, in start_requests
    self.parse_item(data[i], con)
  File "./DMHY_DataBase.py", line 189, in parse_item
    path = self.formulate_folder_path(item_date, item_type, item_title)
  File "./DMHY_DataBase.py", line 255, in formulate_folder_path
    item_title))
TypeError: not all arguments converted during string formatting

WindowsError [Error 3] is thrown because of folder path too long

When downloading page(20:39 Aug 02 2016)
https://share.dmhy.org/topics/view/438853_BDRIP_OVA_OVA_Sono_Hanabira_ni_Kuchizuke_wo_Anata_to_Koibito_Tsunagi_OVA_1920x1080_HEVC_10bit_FLAC_softSub_chi_jpn_eng_fre_ger_rus_ordered_chap.html
a WindowsError [Error 3] is thrown

The complete error message follows:
Traceback (most recent call last):
File "DMHY_DataBase.py", line 314, in <module>

File "DMHY_DataBase.py", line 122, in start_requests
self.parse_item(data[i], con)
File "DMHY_DataBase.py", line 191, in parse_item
os.makedirs(path)
File "C:\Python27\lib\os.py", line 157, in makedirs
mkdir(name, mode)
WindowsError: [Error 3] : u'D:\Data\Desktop\Workspace\test\DMHY\Warehouse\2016\08\02\u5b63\u5ea6\u5168\u96c6\2039_[BDRIP]\u82b1\u543b\u5728\u4e0a_\u4eb2\u543b\u90a3\u7247\u82b1\u74e3__\u604b\u4eba\u7684\u7f81\u7eca(OVA)\u305d\u306e\u82b1\u3073\u3089\u306b\u304f\u3061\u3065\u3051\u3092\u3042\u306a\u305f\u3068\u604b\u4eba\u3064\u306a\u304e(OVA)_Sono_Hanabira_ni_Kuchizuke_wo__Anata_to_Koibito_Tsunagi(OVA)(1920x1080_HEVC_10bit_FLAC_softSub(chi+jpn+eng+fre+ger+rus)_ordered_chap)[\u7b80\u7e41\u4e2d\u65e5\u82f1\u6cd5\u5fb7\u4fc4\u5b57\u5e55]'

sqlite3 ProgrammingError is thrown while downloading the animation from[2016-07-01,2016-08-27]

A sqlite3 ProgrammingError is thrown while downloading the animation from [2016-07-01,2016-08-27].
The error message is:
Traceback (most recent call last):
File "DMHY_DataBase.py", line 420, in <module>
DataBase.start_requests()
File "DMHY_DataBase.py", line 223, in start_requests
cu.execute(insert_sql, d)
sqlite3.ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings.
It seems like the exception is thrown while inserting into database.
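
The error message itself recommends switching to Unicode strings. A minimal sketch of that idea is to decode byte strings before they reach `cu.execute`; `to_text` and `to_text_row` are hypothetical helpers, and UTF-8 is an assumption about how the scraped bytes are encoded.

```python
def to_text(value, encoding="utf-8"):
    """Return `value` as a Unicode string, decoding byte strings first."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def to_text_row(row, encoding="utf-8"):
    """Apply to_text to every field of a row before inserting it."""
    return tuple(to_text(field, encoding) for field in row)
```

Values that are not byte strings, such as numbers or text that is already Unicode, pass through unchanged.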

It spits out this

File "DMHY_DataBase.py", line 45
print "type:", type
^
SyntaxError: Missing parentheses in call to 'print'

Although I am not sure whether the steps I took before this were right either
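
This SyntaxError is what Python 3 reports for a Python 2 style print statement, and the Windows traceback in another issue shows the script running from C:\Python27, so the script appears to target Python 2. Under Python 3 the same line needs the function form, roughly as in this sketch; `show_type` is a hypothetical wrapper added only so the line can be exercised.

```python
import sys


def show_type(item_type, stream=sys.stdout):
    """Print the item type using the Python 3 function form of print."""
    # Python 2 statement in the script: print "type:", type
    print("type:", item_type, file=stream)
```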

UnicodeEncodeError is thrown because of windows cmd 'gbk' codec problem

When downloading page(19:35 Aug 02 2016)
https://share.dmhy.org/topics/view/438777_160727_TV_Vol_2_320K.html
a UnicodeEncodeError is thrown
The complete error message follows:
Completed: 186/204
Traceback (most recent call last):
File "DMHY_DataBase.py", line 357, in <module>
DataBase.start_requests()
File "DMHY_DataBase.py", line 172, in start_requests
self.parse_item(update_list[i], con)
File "DMHY_DataBase.py", line 218, in parse_item
print u"[正在下载] " + item_title
UnicodeEncodeError: 'gbk' codec can't encode character u'\u30fb' in position 28: illegal multibyte sequence
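
One way to keep a progress message from aborting the run is to replace characters the console encoding cannot represent before printing. This is a sketch of that workaround, not the project's fix; `console_safe` is a hypothetical helper.

```python
import sys


def console_safe(text, encoding=None):
    """Replace characters the console encoding cannot represent with '?'."""
    # Fall back to UTF-8 when the stream does not report an encoding.
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "utf-8"
    return text.encode(encoding, errors="replace").decode(encoding)
```

Printing `console_safe(item_title)` instead of `item_title` trades exact output for a run that does not stop on an unencodable character.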
