justinzm / gopup

Data interfaces: Baidu, Google, Toutiao, and Weibo indices; macroeconomic data; interest-rate data; currency exchange rates; Qianlima and unicorn companies; Xinwen Lianbo transcripts; film box-office data; university lists; COVID-19 data…

Home Page: http://www.gopup.cn

Python 100.00%
covid19-data data data-analysis data-science datasets economic-data gopup index-data python

gopup's Issues

Baidu Index: request block

With the gp.baidu_search_index interface, the data returned by r = requests.get(url=url, headers=headers) is empty, with message: request block. Am I being blocked by anti-scraping?

Could someone take a look at what's going wrong here?

Traceback (most recent call last):
  File "E:/pythonProject2/01.17.py", line 3, in <module>
    df_index = g.weibo_user(keyword="雷军")
  File "E:\python\lib\site-packages\gopup\pro\client.py", line 48, in query
    raise Exception(result['msg'])
Exception: 'Response' object has no attribute 'msg'

Suggestion: add a 12306 station and train database

1. Download the station list from 12306

Inspecting 12306's site code reveals the URL for the nationwide station list:

https://kyfw.12306.cn/otn/resources/js/framework/station_name.js

2. Parse the station list

Parse the data from step 1 and output it in the following format:

ID  Telecode  Name    Pinyin      Initials  Pinyin code
0   BOP       北京北  beijingbei  bjb       bjb
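Step 2 might be sketched as below. The field order inside station_name.js is an assumption based on public write-ups of the file (pinyin code, name, telecode, full pinyin, initials, index); verify it against the live data.

```python
def parse_stations(js_text):
    """Parse station_name.js into (id, telecode, name, pinyin, initials, pycode) rows.

    Entries are '@'-separated and fields '|'-separated, e.g.
    '@bjb|北京北|BOP|beijingbei|bjb|0' (field order is an assumption).
    """
    rows = []
    for entry in js_text.split("@")[1:]:
        fields = entry.rstrip("';").split("|")  # strip the trailing quote/semicolon
        if len(fields) >= 6:
            pycode, name, telecode, pinyin, initials, idx = fields[:6]
            rows.append((int(idx), telecode, name, pinyin, initials, pycode))
    return rows

# Invented one-entry sample mirroring the table above.
sample = "var station_names ='@bjb|北京北|BOP|beijingbei|bjb|0';"
print(parse_stations(sample)[0])  # (0, 'BOP', '北京北', 'beijingbei', 'bjb', 'bjb')
```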

3. Download the train list from 12306

Inspecting 12306's site code reveals the URL for the nationwide train list. The file holds every train for the next 60 days and is roughly 35 MB.

https://kyfw.12306.cn/otn/resources/js/query/train_list.js

4. Parse the train list

Parse the data from step 3 and split it by date into the following format:

Type  Train number  Code  Origin  Destination
D     24000000D10R  D1    北京    沈阳

12306 divides trains into seven classes: C (intercity high-speed), D (EMU), G (high-speed), K (fast), T (express), Z (direct), and O (other). Here we extract only the C, D, and G classes.
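A sketch of step 4, keeping only the C/D/G classes. The file structure assumed here ('var train_list = {date: {class letter: [entries]}}' with a "station_train_code" like "D1(北京-沈阳)") comes from public write-ups, not from this issue; check it against the live file.

```python
import json
import re

def parse_trains(js_text, keep=("C", "D", "G")):
    """Parse train_list.js into (date, type, train_no, code, origin, dest) rows."""
    # Drop the 'var train_list =' prefix and trailing semicolon, then load as JSON.
    data = json.loads(js_text.split("=", 1)[1].strip().rstrip(";"))
    rows = []
    for date, classes in data.items():
        for cls in keep:  # only C/D/G, as stated above
            for item in classes.get(cls, []):
                # "D1(北京-沈阳)" -> code, origin, destination
                m = re.match(r"(.+?)\((.+)-(.+)\)", item["station_train_code"])
                if m:
                    code, origin, dest = m.groups()
                    rows.append((date, cls, item["train_no"], code, origin, dest))
    return rows

# Invented one-train sample in the assumed shape.
sample = ('var train_list = {"2023-01-01": {"D": [{"station_train_code": '
          '"D1(北京-沈阳)", "train_no": "24000000D10R"}]}};')
print(parse_trains(sample)[0])  # ('2023-01-01', 'D', '24000000D10R', 'D1', '北京', '沈阳')
```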

5. Build timetable URLs from the trains and stations

First merge the trains across all dates, keyed on (train code, train number), and deduplicate to get the complete train list.
Then use each station's telecode to construct the timetable download URL:

https://kyfw.12306.cn/otn/czxx/queryByTrainNo?train_no=<train number>&from_station_telecode=<origin telecode>&to_station_telecode=<destination telecode>&depart_date=<departure date>

Notes:
a) Some trains run only on certain dates (e.g. weekdays, weekends, holidays)
b) The same train code may have different running times and stops on different dates
c) The same train code with the same train number has identical running times and stops on every date
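The merge and URL construction of step 5 might look like this; the telecodes in the example are placeholders, and the query parameters are exactly the ones listed in the URL above.

```python
from urllib.parse import urlencode

def dedupe_trains(rows):
    """Merge (date, type, train_no, code, origin, dest) rows across dates,
    keyed on (code, train_no); safe per note c) above."""
    seen = {}
    for date, cls, train_no, code, origin, dest in rows:
        seen.setdefault((code, train_no), (cls, origin, dest))
    return seen

def timetable_url(train_no, from_telecode, to_telecode, depart_date):
    """Build the queryByTrainNo URL from the parameters listed above."""
    return ("https://kyfw.12306.cn/otn/czxx/queryByTrainNo?"
            + urlencode({"train_no": train_no,
                         "from_station_telecode": from_telecode,
                         "to_station_telecode": to_telecode,
                         "depart_date": depart_date}))

rows = [("2023-01-01", "D", "24000000D10R", "D1", "北京", "沈阳"),
        ("2023-01-02", "D", "24000000D10R", "D1", "北京", "沈阳")]
print(len(dedupe_trains(rows)))  # 1: duplicate (code, train_no) collapsed

# "AAA"/"BBB" are placeholder telecodes, not real station codes.
print(timetable_url("24000000D10R", "AAA", "BBB", "2023-01-01"))
```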

6. Download the timetables from 12306

Using the URLs from step 5, download all timetable data (in JSON format).

7. Parse the timetables

Parse the data from step 6 and output complete "station", "train", and "timetable" tables in the following CSV formats:

ID  Telecode  Name    Pinyin      Initials  Pinyin code
0   BOP       北京北  beijingbei  bjb       bjb

Code   Origin  Dest  Departure  Arrival  Class  Service
C1002  延吉西  长春  5:47       8:04     EMU    2

Code   Stop  Station  Arrival  Departure  Dwell  Active
C1002  1     延吉西   ----     6:20       ----   TRUE
       2     长春     8:25     8:25       ----   TRUE
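Step 7 might be sketched as below. The queryByTrainNo response shape assumed here (a "data" object wrapping a "data" list of stops with "station_name", "arrive_time", "start_time", "stopover_time", "isEnabled" keys) is taken from public write-ups; verify it against a live response.

```python
def parse_timetable(payload):
    """Flatten an assumed queryByTrainNo response into timetable rows:
    (stop number, station, arrival, departure, dwell, active)."""
    rows = []
    for i, stop in enumerate(payload["data"]["data"], start=1):
        rows.append((i, stop["station_name"], stop["arrive_time"],
                     stop["start_time"], stop["stopover_time"], stop["isEnabled"]))
    return rows

# Invented one-stop sample in the assumed shape, mirroring the table above.
sample = {"data": {"data": [
    {"station_name": "延吉西", "arrive_time": "----", "start_time": "6:20",
     "stopover_time": "----", "isEnabled": True}]}}
print(parse_timetable(sample)[0])  # (1, '延吉西', '----', '6:20', '----', True)
```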

Reference: https://github.com/metromancn/Parse12306

Calling marco_cmlrd raises "XLRDError"

import gopup as gp
df_index = gp.marco_cmlrd()
print(df_index)

The error is as follows:


XLRDError                                 Traceback (most recent call last)
in <module>
----> 1 df_index = gp.marco_cmlrd()
      2 print(df_index)

~\AppData\Roaming\Python\Python39\site-packages\gopup\economic\marco_cn.py in marco_cmlrd()
     20     """
     21     url = "http://114.115.232.154:8080/handler/download.ashx"
---> 22     excel_data = pd.read_excel(url, sheet_name="Data", header=0, skiprows=1)
     23     excel_data["Period"] = pd.to_datetime(excel_data["Period"]).dt.strftime("%Y-%m")
     24     excel_data.columns = [

~\AppData\Roaming\Python\Python39\site-packages\pandas\util\_decorators.py in wrapper(*args, **kwargs)
    294             )
    295             warnings.warn(msg, FutureWarning, stacklevel=stacklevel)
--> 296         return func(*args, **kwargs)
    297
    298     return wrapper

~\AppData\Roaming\Python\Python39\site-packages\pandas\io\excel\_base.py in read_excel(io, sheet_name, header, names, index_col, usecols, squeeze, dtype, engine, converters, true_values, false_values, skiprows, nrows, na_values, keep_default_na, na_filter, verbose, parse_dates, date_parser, thousands, comment, skipfooter, convert_float, mangle_dupe_cols)
    302
    303     if not isinstance(io, ExcelFile):
--> 304         io = ExcelFile(io, engine=engine)
    305     elif engine and engine != io.engine:
    306         raise ValueError(

~\AppData\Roaming\Python\Python39\site-packages\pandas\io\excel\_base.py in __init__(self, path_or_buffer, engine)
    865         self._io = stringify_path(path_or_buffer)
    866
--> 867         self._reader = self._engines[engine](self._io)
    868
    869     def __fspath__(self):

~\AppData\Roaming\Python\Python39\site-packages\pandas\io\excel\_xlrd.py in __init__(self, filepath_or_buffer)
     20         err_msg = "Install xlrd >= 1.0.0 for Excel support"
     21         import_optional_dependency("xlrd", extra=err_msg)
---> 22         super().__init__(filepath_or_buffer)
     23
     24     @property

~\AppData\Roaming\Python\Python39\site-packages\pandas\io\excel\_base.py in __init__(self, filepath_or_buffer)
    349             # N.B. xlrd.Book has a read attribute too
    350             filepath_or_buffer.seek(0)
--> 351             self.book = self.load_workbook(filepath_or_buffer)
    352         elif isinstance(filepath_or_buffer, str):
    353             self.book = self.load_workbook(filepath_or_buffer)

~\AppData\Roaming\Python\Python39\site-packages\xlrd\__init__.py in open_workbook(filename, logfile, verbosity, use_mmap, file_contents, encoding_override, formatting_info, on_demand, ragged_rows, ignore_workbook_corruption)
    168     # files that xlrd can parse don't start with the expected signature.
    169     if file_format and file_format != 'xls':
--> 170         raise XLRDError(FILE_FORMAT_DESCRIPTIONS[file_format]+'; not supported')
    171
    172     bk = open_workbook_xls(

XLRDError: Excel xlsx file; not supported
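The final line explains the failure: xlrd 2.x only reads legacy .xls files, and this server returns an .xlsx workbook. A possible workaround (assuming openpyxl is installed) is to pass engine="openpyxl" to pd.read_excel; the sketch below demonstrates the argument on an in-memory workbook standing in for the real URL.

```python
import io

import pandas as pd

# Build a tiny .xlsx in memory to stand in for the downloaded workbook.
buf = io.BytesIO()
pd.DataFrame({"Period": ["2020-01"]}).to_excel(
    buf, sheet_name="Data", index=False, engine="openpyxl")
buf.seek(0)

# Forcing the openpyxl engine bypasses xlrd entirely. The same argument
# works with the real URL:
# pd.read_excel(url, sheet_name="Data", header=0, skiprows=1, engine="openpyxl")
df = pd.read_excel(buf, sheet_name="Data", engine="openpyxl")
print(df["Period"][0])  # 2020-01
```

Until gopup passes the engine itself, patching the read_excel call in marco_cn.py (or pinning xlrd < 2.0) are the usual fixes.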

What date format does gp.lpr_data(startDate, endDate) expect?

Traceback (most recent call last):
  File "E:/g/tz/mlltz/for_gp.py", line 30, in <module>
    get_lprdata()
  File "E:/g/tz/mlltz/for_gp.py", line 21, in get_lprdata
    lpr.to_csv('./lpr.csv', encoding='gb2312')
AttributeError: 'NoneType' object has no attribute 'to_csv'

lpr_data is not covered in the Chinese docs, so I don't know what to pass for the dates.

FDI (foreign direct investment) data: the API raises an error

df_index = gp.get_fdi_data()
print(df_index)

    368 data_df['当月(亿元)'] = data_df['当月(亿元)'].map(lambda x: int(x)/100000)
--> 369 data_df['累计(亿元)'] = data_df['累计(亿元)'].map(lambda x: int(x)/100000)
ValueError: invalid literal for int() with base 10: ''
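The crash comes from calling int('') on an empty cell. A hedged sketch of a fix, with the column names taken from the traceback and the data invented for illustration: pd.to_numeric(errors="coerce") turns empty strings into NaN instead of raising.

```python
import pandas as pd

# Invented sample containing an empty cell like the one that triggers the error.
data_df = pd.DataFrame({"当月(亿元)": ["1234", ""], "累计(亿元)": ["", "5678"]})

for col in ["当月(亿元)", "累计(亿元)"]:
    # errors="coerce" maps non-numeric strings (including "") to NaN
    data_df[col] = pd.to_numeric(data_df[col], errors="coerce") / 100000

print(data_df["当月(亿元)"].tolist())  # [0.01234, nan]
```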

UnicodeDecodeError: 'gbk'… when fetching artist commercial value data

In [9]: df_index = gp.realtime_artist()
Exception in thread Thread-248:
Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\lib\threading.py", line 932, in _bootstrap_inner
    self.run()
  File "C:\ProgramData\Anaconda3\lib\threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "C:\ProgramData\Anaconda3\lib\subprocess.py", line 1366, in _readerthread
    buffer.append(fh.read())
UnicodeDecodeError: 'gbk' codec can't decode byte 0xaa in position 1444: illegal multibyte sequence

Baidu search index error: help appreciated

Using the Baidu search index interface with the sample code:

index_df = gp.baidu_search_index(word="口罩", start_date='2020-01-01', end_date='2020-03-01', cookie=cookie)

it raises the following error:

Traceback (most recent call last):
  File "<pyshell#10>", line 1, in <module>
    index_df = gp.baidu_search_index(word="口罩", start_date='2020-01-01', end_date='2020-03-01', cookie=cookie)
  File "C:\Users\zeyu_\AppData\Local\Programs\Python\Python39\lib\site-packages\gopup\index\index_baidu.py", line 264, in baidu_search_index
    all_data = data["userIndexes"][0][type]["data"]
TypeError: string indices must be integers

Any pointers appreciated, thanks!
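For what it's worth, this error usually means r.json()["data"] came back as a string (often empty) rather than a dict, because the cookie is expired or incomplete, or the request was blocked. A hypothetical guard around the failing line (extract_index is an invented name, not part of gopup) makes the failure explicit:

```python
def extract_index(payload, kind="all"):
    """Return the index series from a Baidu response, or fail loudly when
    Baidu returned its error payload instead of index data."""
    data = payload.get("data")
    if not isinstance(data, dict):
        # Blocked/unauthenticated responses carry a string here, not a dict.
        raise RuntimeError(
            "Baidu Index refused the request; re-copy a fresh, complete "
            "cookie from a logged-in browser session and retry")
    return data["userIndexes"][0][kind]["data"]

# Invented payloads illustrating both shapes.
ok = {"data": {"userIndexes": [{"all": {"data": "encrypted-series"}}]}}
print(extract_index(ok))  # encrypted-series
```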

The Baidu index numbers don't match the official site; what could be the cause?

import gopup as gp
cookie = 'my cookie data'
index_df = gp.baidu_info_index(word="共享经济", start_date='2018-01-01', end_date='2018-02-01', cookie=cookie)
print(index_df)

Running this yields:
共享经济
date
2018-01-01 1570
2018-01-02 3114
2018-01-03 0
2018-01-04 672
2018-01-05 840
2018-01-06 2367
2018-01-07 2594
2018-01-08 1040
2018-01-09 847
2018-01-10 3162
2018-01-11 109
2018-01-12 584
2018-01-13 1172
2018-01-14 589
2018-01-15 1130
2018-01-16 269
2018-01-17 1067
2018-01-18 1434
2018-01-19 917
2018-01-20 929
2018-01-21 452
2018-01-22 372
2018-01-23 607
2018-01-24 415
2018-01-25 75549
2018-01-26 21709
2018-01-27 43497
2018-01-28 55024
2018-01-29 45434
2018-01-30 4504
2018-01-31 2330
2018-02-01 2169

But these numbers don't match what the Baidu Index website shows.

Which cookie should I copy for the Baidu index?

I'm currently copying from the cookies of the request named index.html: the cookie whose name is BDUSS, using its Value. The result printed is None.

Which cookie should I be copying? I have little experience with cookies; any advice is appreciated, thanks!

Weibo index: certain keywords return no data

Calling the Weibo index interface to query "model Y" and "比亚迪唐" works fine, but querying "小鹏P7" and "model 3" reports no data.

I've already tried 小鹏 P7, 小鹏p7, 小鹏 p7, model3, Model3, model 3, and Model 3; all of them fail, even though the official page returns results for them. Could the parsing be broken for these particular keywords?

Weibo index time range

Hi, can the Weibo index interface fetch data for a custom time range? If so, what would I need to change? Thanks!

TypeError: string indices must be integers

import gopup as gp
cookie = '...'  # after assigning a valid cookie
index_df = gp.baidu_search_index(word="罩", start_date='2020-12-01', end_date='2020-12-25', cookie=cookie)
print(index_df)

The error:

TypeError                                 Traceback (most recent call last)
in <module>
      2 # how to find the cookie: https://jingyan.baidu.com/article/76a7e409284a80fc3a6e1566.html
      3 cookie = '...'
----> 4 index_df = gp.baidu_search_index(word="罩", start_date='2020-12-01', end_date='2020-12-25', cookie=cookie)
      5 print(index_df)

C:\ProgramData\Anaconda3\lib\site-packages\gopup\index\index_baidu.py in baidu_search_index(word, start_date, end_date, cookie, type)
    264     r = requests.get(url=url, params=params, headers=headers)
    265     data = r.json()["data"]
--> 266     all_data = data["userIndexes"][0][type]["data"]
    267     uniqid = data["uniqid"]
    268     ptbk = get_ptbk(uniqid, cookie)

TypeError: string indices must be integers

Index data returns None

The index data comes back as None; the keyword searched was 股票 (stocks).
Searching directly on the official site works, so I suspect the anti-scraping measures were upgraded.
Being a scraper engineer really is a hard life.

Bug when fetching Weibo topic heat data

When querying a Weibo topic's heat for a single day, the interface returns data points for times that have not happened yet. For example, calling it at 18:00 on September 18 for that day's heat returns a value for 22:00 on September 18. That makes no sense: 22:00 has not arrived yet, and the interface cannot provide forecasts.

Only part of the data is displayed

After scraping data with the gopup package, only part of it is shown when printed. How can I display all of it?
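Most likely pandas is truncating the printed DataFrame, not gopup returning partial data. Raising pandas' display limits shows every row and column:

```python
import pandas as pd

# None removes the row/column caps entirely.
pd.set_option("display.max_rows", None)
pd.set_option("display.max_columns", None)

df = pd.DataFrame({"x": range(100)})
print(len(repr(df).splitlines()))  # 101: header line plus all 100 rows
```

Alternatively, print(df.to_string()) renders the full frame without changing global options.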

Chinese docs: the Baidu search data cookie example should use single quotes

The example scraping code for the "Baidu search data" repository in the docs is incorrect.

Because the cookie obtained from the Baidu Index site contains double quotes, the cookie field in the example code should be assigned with single quotes.

The example should read:

import gopup as gp

cookie = 'paste the cookie from your logged-in Baidu Index session here'

index_df = gp.baidu_search_index(word="口罩", start_date='2020-01-01', end_date='2020-03-01', cookie=cookie)

print(index_df)
