justinzm / gopup

Data interfaces: Baidu, Google, Toutiao, and Weibo indices; macroeconomic data; interest-rate data; currency exchange rates; Qianlima and unicorn companies; Xinwen Lianbo transcripts; film box-office data; university lists; COVID-19 data…

Home Page: http://www.gopup.cn

Python 100.00%
covid19-data data data-analysis data-science datasets economic-data gopup index-data python

gopup's Issues

Baidu Index: request block

With the gp.baidu_search_index interface, the data returned by r = requests.get(url=url, headers=headers) is empty, with message: request block. Am I being blocked by anti-scraping?

Could someone take a look at what's going wrong here?

Traceback (most recent call last):
  File "E:/pythonProject2/01.17.py", line 3, in <module>
    df_index = g.weibo_user(keyword="雷军")
  File "E:\python\lib\site-packages\gopup\pro\client.py", line 48, in query
    raise Exception(result['msg'])
Exception: 'Response' object has no attribute 'msg'

Suggestion: add a 12306 station and train database

1. Download the station list from 12306

Inspecting 12306's site code reveals the URL for the nationwide station list:

https://kyfw.12306.cn/otn/resources/js/framework/station_name.js

2. Parse the station list

Parse the data from step 1 and output it in the following format:

ID  Telecode  Name    Pinyin      Initials  Pinyin code
0   BOP       北京北  beijingbei  bjb       bjb
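Step 2 might be sketched as below. The field order inside station_name.js is an assumption based on public write-ups of the file (pinyin code, name, telecode, full pinyin, initials, index); verify it against the live data.

```python
def parse_stations(js_text):
    """Parse station_name.js into (id, telecode, name, pinyin, initials, pycode) rows.

    Entries are '@'-separated and fields '|'-separated, e.g.
    '@bjb|北京北|BOP|beijingbei|bjb|0' (field order is an assumption).
    """
    rows = []
    for entry in js_text.split("@")[1:]:
        fields = entry.rstrip("';").split("|")  # strip the trailing quote/semicolon
        if len(fields) >= 6:
            pycode, name, telecode, pinyin, initials, idx = fields[:6]
            rows.append((int(idx), telecode, name, pinyin, initials, pycode))
    return rows

# Invented one-entry sample mirroring the table above.
sample = "var station_names ='@bjb|北京北|BOP|beijingbei|bjb|0';"
print(parse_stations(sample)[0])  # (0, 'BOP', '北京北', 'beijingbei', 'bjb', 'bjb')
```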

3. Download the train list from 12306

Inspecting 12306's site code reveals the URL for the nationwide train list. The file holds every train for the next 60 days and is roughly 35 MB.

https://kyfw.12306.cn/otn/resources/js/query/train_list.js

4. Parse the train list

Parse the data from step 3 and split it by date into the following format:

Type  Train number  Code  Origin  Destination
D     24000000D10R  D1    北京    沈阳

12306 divides trains into seven classes: C (intercity high-speed), D (EMU), G (high-speed), K (fast), T (express), Z (direct), and O (other). Here we extract only the C, D, and G classes.
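A sketch of step 4, keeping only the C/D/G classes. The file structure assumed here ('var train_list = {date: {class letter: [entries]}}' with a "station_train_code" like "D1(北京-沈阳)") comes from public write-ups, not from this issue; check it against the live file.

```python
import json
import re

def parse_trains(js_text, keep=("C", "D", "G")):
    """Parse train_list.js into (date, type, train_no, code, origin, dest) rows."""
    # Drop the 'var train_list =' prefix and trailing semicolon, then load as JSON.
    data = json.loads(js_text.split("=", 1)[1].strip().rstrip(";"))
    rows = []
    for date, classes in data.items():
        for cls in keep:  # only C/D/G, as stated above
            for item in classes.get(cls, []):
                # "D1(北京-沈阳)" -> code, origin, destination
                m = re.match(r"(.+?)\((.+)-(.+)\)", item["station_train_code"])
                if m:
                    code, origin, dest = m.groups()
                    rows.append((date, cls, item["train_no"], code, origin, dest))
    return rows

# Invented one-train sample in the assumed shape.
sample = ('var train_list = {"2023-01-01": {"D": [{"station_train_code": '
          '"D1(北京-沈阳)", "train_no": "24000000D10R"}]}};')
print(parse_trains(sample)[0])  # ('2023-01-01', 'D', '24000000D10R', 'D1', '北京', '沈阳')
```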

5. Build timetable URLs from the trains and stations

First merge the trains across all dates, keyed on (train code, train number), and deduplicate to get the complete train list.
Then use each station's telecode to construct the timetable download URL:

https://kyfw.12306.cn/otn/czxx/queryByTrainNo?train_no=<train number>&from_station_telecode=<origin telecode>&to_station_telecode=<destination telecode>&depart_date=<departure date>

Notes:
a) Some trains run only on certain dates (e.g. weekdays, weekends, holidays)
b) The same train code may have different running times and stops on different dates
c) The same train code with the same train number has identical running times and stops on every date
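The merge and URL construction of step 5 might look like this; the telecodes in the example are placeholders, and the query parameters are exactly the ones listed in the URL above.

```python
from urllib.parse import urlencode

def dedupe_trains(rows):
    """Merge (date, type, train_no, code, origin, dest) rows across dates,
    keyed on (code, train_no); safe per note c) above."""
    seen = {}
    for date, cls, train_no, code, origin, dest in rows:
        seen.setdefault((code, train_no), (cls, origin, dest))
    return seen

def timetable_url(train_no, from_telecode, to_telecode, depart_date):
    """Build the queryByTrainNo URL from the parameters listed above."""
    return ("https://kyfw.12306.cn/otn/czxx/queryByTrainNo?"
            + urlencode({"train_no": train_no,
                         "from_station_telecode": from_telecode,
                         "to_station_telecode": to_telecode,
                         "depart_date": depart_date}))

rows = [("2023-01-01", "D", "24000000D10R", "D1", "北京", "沈阳"),
        ("2023-01-02", "D", "24000000D10R", "D1", "北京", "沈阳")]
print(len(dedupe_trains(rows)))  # 1: duplicate (code, train_no) collapsed

# "AAA"/"BBB" are placeholder telecodes, not real station codes.
print(timetable_url("24000000D10R", "AAA", "BBB", "2023-01-01"))
```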

6. Download the timetables from 12306

Using the URLs from step 5, download all timetable data (in JSON format).

7. Parse the timetables

Parse the data from step 6 and output complete "station", "train", and "timetable" tables in the following CSV formats:

ID  Telecode  Name    Pinyin      Initials  Pinyin code
0   BOP       北京北  beijingbei  bjb       bjb

Code   Origin  Dest  Departure  Arrival  Class  Service
C1002  延吉西  长春  5:47       8:04     EMU    2

Code   Stop  Station  Arrival  Departure  Dwell  Active
C1002  1     延吉西   ----     6:20       ----   TRUE
       2     长春     8:25     8:25       ----   TRUE
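Step 7 might be sketched as below. The queryByTrainNo response shape assumed here (a "data" object wrapping a "data" list of stops with "station_name", "arrive_time", "start_time", "stopover_time", "isEnabled" keys) is taken from public write-ups; verify it against a live response.

```python
def parse_timetable(payload):
    """Flatten an assumed queryByTrainNo response into timetable rows:
    (stop number, station, arrival, departure, dwell, active)."""
    rows = []
    for i, stop in enumerate(payload["data"]["data"], start=1):
        rows.append((i, stop["station_name"], stop["arrive_time"],
                     stop["start_time"], stop["stopover_time"], stop["isEnabled"]))
    return rows

# Invented one-stop sample in the assumed shape, mirroring the table above.
sample = {"data": {"data": [
    {"station_name": "延吉西", "arrive_time": "----", "start_time": "6:20",
     "stopover_time": "----", "isEnabled": True}]}}
print(parse_timetable(sample)[0])  # (1, '延吉西', '----', '6:20', '----', True)
```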

Reference: https://github.com/metromancn/Parse12306

Calling marco_cmlrd raises "XLRDError"

import gopup as gp
df_index = gp.marco_cmlrd()
print(df_index)

The error is as follows:


XLRDError                                 Traceback (most recent call last)
in <module>
----> 1 df_index = gp.marco_cmlrd()
      2 print(df_index)

~\AppData\Roaming\Python\Python39\site-packages\gopup\economic\marco_cn.py in marco_cmlrd()
     20     """
     21     url = "http://114.115.232.154:8080/handler/download.ashx"
---> 22     excel_data = pd.read_excel(url, sheet_name="Data", header=0, skiprows=1)
     23     excel_data["Period"] = pd.to_datetime(excel_data["Period"]).dt.strftime("%Y-%m")
     24     excel_data.columns = [

~\AppData\Roaming\Python\Python39\site-packages\pandas\util\_decorators.py in wrapper(*args, **kwargs)
    294             )
    295             warnings.warn(msg, FutureWarning, stacklevel=stacklevel)
--> 296         return func(*args, **kwargs)
    297
    298     return wrapper

~\AppData\Roaming\Python\Python39\site-packages\pandas\io\excel\_base.py in read_excel(io, sheet_name, header, names, index_col, usecols, squeeze, dtype, engine, converters, true_values, false_values, skiprows, nrows, na_values, keep_default_na, na_filter, verbose, parse_dates, date_parser, thousands, comment, skipfooter, convert_float, mangle_dupe_cols)
    302
    303     if not isinstance(io, ExcelFile):
--> 304         io = ExcelFile(io, engine=engine)
    305     elif engine and engine != io.engine:
    306         raise ValueError(

~\AppData\Roaming\Python\Python39\site-packages\pandas\io\excel\_base.py in __init__(self, path_or_buffer, engine)
    865         self._io = stringify_path(path_or_buffer)
    866
--> 867         self._reader = self._engines[engine](self._io)
    868
    869     def __fspath__(self):

~\AppData\Roaming\Python\Python39\site-packages\pandas\io\excel\_xlrd.py in __init__(self, filepath_or_buffer)
     20         err_msg = "Install xlrd >= 1.0.0 for Excel support"
     21         import_optional_dependency("xlrd", extra=err_msg)
---> 22         super().__init__(filepath_or_buffer)
     23
     24     @property

~\AppData\Roaming\Python\Python39\site-packages\pandas\io\excel\_base.py in __init__(self, filepath_or_buffer)
    349             # N.B. xlrd.Book has a read attribute too
    350             filepath_or_buffer.seek(0)
--> 351             self.book = self.load_workbook(filepath_or_buffer)
    352         elif isinstance(filepath_or_buffer, str):
    353             self.book = self.load_workbook(filepath_or_buffer)

~\AppData\Roaming\Python\Python39\site-packages\xlrd\__init__.py in open_workbook(filename, logfile, verbosity, use_mmap, file_contents, encoding_override, formatting_info, on_demand, ragged_rows, ignore_workbook_corruption)
    168     # files that xlrd can parse don't start with the expected signature.
    169     if file_format and file_format != 'xls':
--> 170         raise XLRDError(FILE_FORMAT_DESCRIPTIONS[file_format]+'; not supported')
    171
    172     bk = open_workbook_xls(

XLRDError: Excel xlsx file; not supported
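The final line explains the failure: xlrd 2.x only reads legacy .xls files, and this server returns an .xlsx workbook. A possible workaround (assuming openpyxl is installed) is to pass engine="openpyxl" to pd.read_excel; the sketch below demonstrates the argument on an in-memory workbook standing in for the real URL.

```python
import io

import pandas as pd

# Build a tiny .xlsx in memory to stand in for the downloaded workbook.
buf = io.BytesIO()
pd.DataFrame({"Period": ["2020-01"]}).to_excel(
    buf, sheet_name="Data", index=False, engine="openpyxl")
buf.seek(0)

# Forcing the openpyxl engine bypasses xlrd entirely. The same argument
# works with the real URL:
# pd.read_excel(url, sheet_name="Data", header=0, skiprows=1, engine="openpyxl")
df = pd.read_excel(buf, sheet_name="Data", engine="openpyxl")
print(df["Period"][0])  # 2020-01
```

Until gopup passes the engine itself, patching the read_excel call in marco_cn.py (or pinning xlrd < 2.0) are the usual fixes.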

What date format does gp.lpr_data(startDate, endDate) expect?

Traceback (most recent call last):
  File "E:/g/tz/mlltz/for_gp.py", line 30, in <module>
    get_lprdata()
  File "E:/g/tz/mlltz/for_gp.py", line 21, in get_lprdata
    lpr.to_csv('./lpr.csv', encoding='gb2312')
AttributeError: 'NoneType' object has no attribute 'to_csv'

lpr_data is not covered in the Chinese docs, so I don't know what to pass for the dates.

FDI (foreign direct investment) data: the API raises an error

df_index = gp.get_fdi_data()
print(df_index)

    368 data_df['当月(亿元)'] = data_df['当月(亿元)'].map(lambda x: int(x)/100000)
--> 369 data_df['累计(亿元)'] = data_df['累计(亿元)'].map(lambda x: int(x)/100000)
ValueError: invalid literal for int() with base 10: ''
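The crash comes from calling int('') on an empty cell. A hedged sketch of a fix, with the column names taken from the traceback and the data invented for illustration: pd.to_numeric(errors="coerce") turns empty strings into NaN instead of raising.

```python
import pandas as pd

# Invented sample containing an empty cell like the one that triggers the error.
data_df = pd.DataFrame({"当月(亿元)": ["1234", ""], "累计(亿元)": ["", "5678"]})

for col in ["当月(亿元)", "累计(亿元)"]:
    # errors="coerce" maps non-numeric strings (including "") to NaN
    data_df[col] = pd.to_numeric(data_df[col], errors="coerce") / 100000

print(data_df["当月(亿元)"].tolist())  # [0.01234, nan]
```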

UnicodeDecodeError: 'gbk'… when fetching artist commercial value data

In [9]: df_index = gp.realtime_artist()
Exception in thread Thread-248:
Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\lib\threading.py", line 932, in _bootstrap_inner
    self.run()
  File "C:\ProgramData\Anaconda3\lib\threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "C:\ProgramData\Anaconda3\lib\subprocess.py", line 1366, in _readerthread
    buffer.append(fh.read())
UnicodeDecodeError: 'gbk' codec can't decode byte 0xaa in position 1444: illegal multibyte sequence

Baidu search index error: help appreciated

Using the Baidu search index interface with the sample code:

index_df = gp.baidu_search_index(word="口罩", start_date='2020-01-01', end_date='2020-03-01', cookie=cookie)

it raises the following error:

Traceback (most recent call last):
  File "<pyshell#10>", line 1, in <module>
    index_df = gp.baidu_search_index(word="口罩", start_date='2020-01-01', end_date='2020-03-01', cookie=cookie)
  File "C:\Users\zeyu_\AppData\Local\Programs\Python\Python39\lib\site-packages\gopup\index\index_baidu.py", line 264, in baidu_search_index
    all_data = data["userIndexes"][0][type]["data"]
TypeError: string indices must be integers

Any pointers appreciated, thanks!
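For what it's worth, this error usually means r.json()["data"] came back as a string (often empty) rather than a dict, because the cookie is expired or incomplete, or the request was blocked. A hypothetical guard around the failing line (extract_index is an invented name, not part of gopup) makes the failure explicit:

```python
def extract_index(payload, kind="all"):
    """Return the index series from a Baidu response, or fail loudly when
    Baidu returned its error payload instead of index data."""
    data = payload.get("data")
    if not isinstance(data, dict):
        # Blocked/unauthenticated responses carry a string here, not a dict.
        raise RuntimeError(
            "Baidu Index refused the request; re-copy a fresh, complete "
            "cookie from a logged-in browser session and retry")
    return data["userIndexes"][0][kind]["data"]

# Invented payloads illustrating both shapes.
ok = {"data": {"userIndexes": [{"all": {"data": "encrypted-series"}}]}}
print(extract_index(ok))  # encrypted-series
```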

The Baidu index numbers don't match the official site; what could be the cause?

import gopup as gp
cookie = 'my cookie data'
index_df = gp.baidu_info_index(word="共享经济", start_date='2018-01-01', end_date='2018-02-01', cookie=cookie)
print(index_df)

Running this yields:
共享经济
date
2018-01-01 1570
2018-01-02 3114
2018-01-03 0
2018-01-04 672
2018-01-05 840
2018-01-06 2367
2018-01-07 2594
2018-01-08 1040
2018-01-09 847
2018-01-10 3162
2018-01-11 109
2018-01-12 584
2018-01-13 1172
2018-01-14 589
2018-01-15 1130
2018-01-16 269
2018-01-17 1067
2018-01-18 1434
2018-01-19 917
2018-01-20 929
2018-01-21 452
2018-01-22 372
2018-01-23 607
2018-01-24 415
2018-01-25 75549
2018-01-26 21709
2018-01-27 43497
2018-01-28 55024
2018-01-29 45434
2018-01-30 4504
2018-01-31 2330
2018-02-01 2169

But these numbers don't match what the Baidu Index website shows.

Which cookie should I copy for the Baidu index?

I'm currently copying from the cookies of the request named index.html: the cookie whose name is BDUSS, using its Value. The result printed is None.

Which cookie should I be copying? I have little experience with cookies; any advice is appreciated, thanks!

Weibo index: certain keywords return no data

Calling the Weibo index interface to query "model Y" and "比亚迪唐" works fine, but querying "小鹏P7" and "model 3" reports no data.

I've already tried 小鹏 P7, 小鹏p7, 小鹏 p7, model3, Model3, model 3, and Model 3; all of them fail, even though the official page returns results for them. Could the parsing be broken for these particular keywords?

Weibo index time range

Hi, can the Weibo index interface fetch data for a custom time range? If so, what would I need to change? Thanks!

TypeError: string indices must be integers

import gopup as gp
cookie = '...'  # after assigning a valid cookie
index_df = gp.baidu_search_index(word="罩", start_date='2020-12-01', end_date='2020-12-25', cookie=cookie)
print(index_df)

The error:

TypeError                                 Traceback (most recent call last)
in <module>
      2 # how to find the cookie: https://jingyan.baidu.com/article/76a7e409284a80fc3a6e1566.html
      3 cookie = '...'
----> 4 index_df = gp.baidu_search_index(word="罩", start_date='2020-12-01', end_date='2020-12-25', cookie=cookie)
      5 print(index_df)

C:\ProgramData\Anaconda3\lib\site-packages\gopup\index\index_baidu.py in baidu_search_index(word, start_date, end_date, cookie, type)
    264     r = requests.get(url=url, params=params, headers=headers)
    265     data = r.json()["data"]
--> 266     all_data = data["userIndexes"][0][type]["data"]
    267     uniqid = data["uniqid"]
    268     ptbk = get_ptbk(uniqid, cookie)

TypeError: string indices must be integers

Index data returns None

The index data comes back as None; the keyword searched was 股票 (stocks).
Searching directly on the official site works, so I suspect the anti-scraping measures were upgraded.
Being a scraper engineer really is a hard life.

Bug when fetching Weibo topic heat data

When querying a Weibo topic's heat for a single day, the interface returns data points for times that have not happened yet. For example, calling it at 18:00 on September 18 for that day's heat returns a value for 22:00 on September 18. That makes no sense: 22:00 has not arrived yet, and the interface cannot provide forecasts.

Only part of the data is displayed

After scraping data with the gopup package, only part of it is shown when printed. How can I display all of it?
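Most likely pandas is truncating the printed DataFrame, not gopup returning partial data. Raising pandas' display limits shows every row and column:

```python
import pandas as pd

# None removes the row/column caps entirely.
pd.set_option("display.max_rows", None)
pd.set_option("display.max_columns", None)

df = pd.DataFrame({"x": range(100)})
print(len(repr(df).splitlines()))  # 101: header line plus all 100 rows
```

Alternatively, print(df.to_string()) renders the full frame without changing global options.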

Chinese docs: the Baidu search data cookie example should use single quotes

The example scraping code for the "Baidu search data" repository in the docs is incorrect.

Because the cookie obtained from the Baidu Index site contains double quotes, the cookie field in the example code should be assigned with single quotes.

The example should read:

import gopup as gp

cookie = 'paste the cookie from your logged-in Baidu Index session here'

index_df = gp.baidu_search_index(word="口罩", start_date='2020-01-01', end_date='2020-03-01', cookie=cookie)

print(index_df)
