
justinzm / gopup

Stars: 2.5K · Watchers: 43 · Forks: 384 · Size: 706 KB

Data APIs: Baidu, Google, Toutiao, and Weibo indexes; macroeconomic data; interest-rate data; currency exchange rates; "Qianlima" and unicorn companies; Xinwen Lianbo transcripts; film box-office data; university lists; COVID-19 data…

Home Page: http://www.gopup.cn

Python 100.00%
data-analysis covid19-data index-data economic-data datasets gopup python data-science data

gopup's Introduction

Hi there 👋

🧑‍💻 I’m good at Python and PHP.

🌱 I’m currently learning blockchain.


gopup's People

Contributors

justinzm


gopup's Issues

What date format does gp.lpr_data(startDate, endDate) expect?

```
Traceback (most recent call last):
  File "E:/g/tz/mlltz/for_gp.py", line 30, in <module>
    get_lprdata()
  File "E:/g/tz/mlltz/for_gp.py", line 21, in get_lprdata
    lpr.to_csv('./lpr.csv', encoding='gb2312')
AttributeError: 'NoneType' object has no attribute 'to_csv'
```

The Chinese docs for lpr_data don't document the date format, so I don't know what to pass.
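Whatever the expected format turns out to be, the crash itself can be avoided: gp.lpr_data appears to return None on failure, which is what triggers the AttributeError above. A minimal guard (save_if_data is a hypothetical helper name, not part of gopup):

```python
def save_if_data(df, path, encoding="gb2312"):
    """Write the result to CSV only if the API actually returned a DataFrame.

    gopup interfaces such as gp.lpr_data seem to return None when the request
    fails or the arguments are malformed, producing
    "'NoneType' object has no attribute 'to_csv'" if used unchecked.
    """
    if df is None:
        print(f"no data returned; skipped writing {path}")
        return False
    df.to_csv(path, encoding=encoding)
    return True

# usage sketch: lpr = gp.lpr_data(startDate=..., endDate=...)
#               save_if_data(lpr, "./lpr.csv")
```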

In the Chinese docs for the Baidu search data repository, the cookie field should be assigned with single quotes

The example crawling code for the Baidu search data repository in the docs is wrong.

Because the cookie obtained from the Baidu Index site contains double quotes, the cookie field in the example code should be assigned with single quotes.

The example should be:

```python
import gopup as gp

cookie = 'paste the cookie from your browser session logged in to Baidu Index here'

index_df = gp.baidu_search_index(word="口罩", start_date='2020-01-01', end_date='2020-03-01', cookie=cookie)
print(index_df)
```

Custom time ranges for the Weibo Index

Hi, can the Weibo Index fetcher retrieve data for a custom time range? If so, what should I change? Thanks!

The Baidu Index numbers don't match the official site — what could be the cause?

```python
import gopup as gp
cookie = 'my cookie data'
index_df = gp.baidu_info_index(word="共享经济", start_date='2018-01-01', end_date='2018-02-01', cookie=cookie)
print(index_df)
```

The output:

```
            共享经济
date
2018-01-01   1570
2018-01-02   3114
2018-01-03      0
2018-01-04    672
2018-01-05    840
2018-01-06   2367
2018-01-07   2594
2018-01-08   1040
2018-01-09    847
2018-01-10   3162
2018-01-11    109
2018-01-12    584
2018-01-13   1172
2018-01-14    589
2018-01-15   1130
2018-01-16    269
2018-01-17   1067
2018-01-18   1434
2018-01-19    917
2018-01-20    929
2018-01-21    452
2018-01-22    372
2018-01-23    607
2018-01-24    415
2018-01-25  75549
2018-01-26  21709
2018-01-27  43497
2018-01-28  55024
2018-01-29  45434
2018-01-30   4504
2018-01-31   2330
2018-02-01   2169
```

But this doesn't match the data shown on the official Baidu Index site.

Baidu Index: request block

With the gp.baidu_search_index interface, r = requests.get(url=url, headers=headers) comes back with empty data and message: request block. Am I being blocked by anti-crawling measures?

TypeError: string indices must be integers

```python
import gopup as gp
cookie = 。。。  # after assigning the correct value
index_df = gp.baidu_search_index(word="罩", start_date='2020-12-01', end_date='2020-12-25', cookie=cookie)
print(index_df)
```

The error:

```
TypeError                                 Traceback (most recent call last)
<ipython-input> in <module>
      2 # how to find the cookie: https://jingyan.baidu.com/article/76a7e409284a80fc3a6e1566.html
      3 cookie = 。。。
----> 4 index_df = gp.baidu_search_index(word="罩", start_date='2020-12-01', end_date='2020-12-25', cookie=cookie)
      5 print(index_df)

C:\ProgramData\Anaconda3\lib\site-packages\gopup\index\index_baidu.py in baidu_search_index(word, start_date, end_date, cookie, type)
    264     r = requests.get(url=url, params=params, headers=headers)
    265     data = r.json()["data"]
--> 266     all_data = data["userIndexes"][0][type]["data"]
    267     uniqid = data["uniqid"]
    268     ptbk = get_ptbk(uniqid, cookie)

TypeError: string indices must be integers
```
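The likely root cause: when Baidu rejects the request (stale cookie, rate limiting), r.json()["data"] is a plain string rather than a dict, so data["userIndexes"] indexes into a string and raises exactly this TypeError. A defensive sketch (extract_user_indexes is a hypothetical helper, not part of gopup):

```python
def extract_user_indexes(payload, kind="all"):
    """Pull the userIndexes series out of a Baidu Index response, failing loudly.

    When the request is blocked, payload["data"] is typically "" or an error
    string instead of a dict -- indexing a string with "userIndexes" is what
    produces "TypeError: string indices must be integers".
    """
    data = payload.get("data")
    if not isinstance(data, dict):
        raise RuntimeError(
            f"Baidu returned no usable data ({data!r}); "
            "refresh the cookie or slow down the requests"
        )
    return data["userIndexes"][0][kind]["data"]
```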

Some Weibo Index keywords return no data

Querying the Weibo Index for "model Y" and "比亚迪唐" works fine, but "小鹏P7" and "model 3" report no data.

I've already tried 小鹏 P7, 小鹏p7, 小鹏 p7, model3, Model3, model 3, and Model 3 — all fail, yet the official site returns results for them. Could the parsing be broken for these particular keywords?

Suggestion: add a 12306 station and train database

1. Download station data from 12306

Inspecting the 12306 site code reveals the URL for the nationwide station list:

https://kyfw.12306.cn/otn/resources/js/framework/station_name.js

2. Parse the station data

Parse the data from step 1 into the following format:

ID  Telecode  Name    Pinyin      Initials  Pinyin code
0   BOP       北京北  beijingbei  bjb       bjb

3. Download train data from 12306

Inspecting the site code also reveals the URL for the nationwide train list. The file holds all trains for the coming 60 days and is about 35 MB:

https://kyfw.12306.cn/otn/resources/js/query/train_list.js

4. Parse the train data

Parse the data from step 3, split by date, into the following format:

Type  Train ID      Train No.  Origin  Destination
D     24000000D10R  D1         北京    沈阳

12306 divides trains into 7 classes: C (intercity high-speed), D (EMU), G (high-speed rail), K (fast), T (express), Z (direct), and O (other). Here we only extract the C, D, and G classes.

5. Build timetable URLs from trains and stations

First merge the trains from all dates, keyed on train number plus train ID, and deduplicate to get the full train list. Then use each station's telecode to build the timetable download URL:

https://kyfw.12306.cn/otn/czxx/queryByTrainNo?train_no=<train ID>&from_station_telecode=<origin telecode>&to_station_telecode=<destination telecode>&depart_date=<departure date>

Notes:
a) Some trains only run on certain dates (e.g. weekdays, weekends, holidays).
b) The same train number may have different running times and stops on different dates.
c) The same train number with the same train ID has identical running times and stops on every date.

6. Download the timetables from 12306

Download all timetable data (JSON) using the URLs from step 5.

7. Parse the timetables

Parse the data from step 6 and output the complete "stations", "trains", and "timetables" in the following CSV formats:

ID  Telecode  Name    Pinyin      Initials  Pinyin code
0   BOP       北京北  beijingbei  bjb       bjb

Train No.  Origin  Destination  Departs  Arrives  Class  Service
C1002      延吉西  长春         5:47     8:04     EMU    2

Train No.  Stop  Station  Arrives  Departs  Dwell  Open
C1002      1     延吉西   ----     6:20     ----   TRUE
           2     长春     8:25     8:25     ----   TRUE

Reference: https://github.com/metromancn/Parse12306
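Steps 1–2 above can be sketched as follows. The assumed field order inside station_name.js (pinyin code | name | telecode | pinyin | initials | index, records separated by @) comes from inspecting the file and should be verified against a fresh download; the sample record mirrors the 北京北 row above:

```python
def parse_stations(js_text):
    """Parse the 12306 station_name.js payload into row dicts.

    Assumed payload shape (verify against a real download):
      var station_names ='@bjb|北京北|BOP|beijingbei|bjb|0@...';
    """
    payload = js_text.split("'")[1]  # strip the JS variable wrapper
    rows = []
    for record in payload.split("@"):
        if not record:
            continue  # the leading '@' yields one empty chunk
        pinyin_code, name, telecode, pinyin, initials, idx = record.split("|")[:6]
        rows.append({"ID": int(idx), "telecode": telecode, "name": name,
                     "pinyin": pinyin, "initials": initials,
                     "pinyin_code": pinyin_code})
    return rows

sample = "var station_names ='@bjb|北京北|BOP|beijingbei|bjb|0';"
print(parse_stations(sample))
```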

UnicodeDecodeError: 'gbk'... when fetching the artist commercial value report

```
In [9]: df_index = gp.realtime_artist()
Exception in thread Thread-248:
Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\lib\threading.py", line 932, in _bootstrap_inner
    self.run()
  File "C:\ProgramData\Anaconda3\lib\threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "C:\ProgramData\Anaconda3\lib\subprocess.py", line 1366, in _readerthread
    buffer.append(fh.read())
UnicodeDecodeError: 'gbk' codec can't decode byte 0xaa in position 1444: illegal multibyte sequence
```

Foreign direct investment (FDI) data: the API raises an error

```python
df_index = gp.get_fdi_data()
print(df_index)
```

```
    368 data_df['当月(亿元)'] = data_df['当月(亿元)'].map(lambda x: int(x)/100000)
--> 369 data_df['累计(亿元)'] = data_df['累计(亿元)'].map(lambda x: int(x)/100000)
ValueError: invalid literal for int() with base 10: ''
```
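The source table evidently contains empty cells, and int('') raises exactly this ValueError. A tolerant converter would sidestep it — a hedged sketch mirroring the library's int(x)/100000 conversion (safe_value is a hypothetical name, not gopup API):

```python
def safe_value(x):
    """Mirror gopup's int(x)/100000 conversion, but map blank cells to None
    instead of raising "ValueError: invalid literal for int()"."""
    s = str(x).strip()
    if not s:
        return None
    return int(s) / 100000

# usage sketch: data_df['累计(亿元)'] = data_df['累计(亿元)'].map(safe_value)
```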

Displaying the full dataset

After fetching data with gopup, only part of it is displayed. How can I display all of it?
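The data itself is complete; pandas just truncates what print() shows. Raising the display limits makes every row and column visible:

```python
import pandas as pd

# pandas abbreviates long DataFrames with "..." by default;
# setting these options to None makes print() show everything.
pd.set_option("display.max_rows", None)
pd.set_option("display.max_columns", None)
pd.set_option("display.width", None)  # don't wrap wide frames
```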

The arithmetic index data returns None

The arithmetic index data returns None; the keyword searched was 股票 (stocks).
Searching directly on the official site works, so I suspect the anti-scraping measures were upgraded.
Being a crawler engineer really is hard.

Which cookie should be copied for the Baidu Index?

I'm currently copying the Value of the cookie named BDUSS under the index.html entry, but the result printed is None.

Which cookie should I copy? I have little experience with cookies — any advice would be appreciated, thanks!

XLRDError when calling marco_cmlrd

```python
import gopup as gp
df_index = gp.marco_cmlrd()
print(df_index)
```

The error:

```
XLRDError                                 Traceback (most recent call last)
<ipython-input> in <module>
----> 1 df_index = gp.marco_cmlrd()
      2 print(df_index)

~\AppData\Roaming\Python\Python39\site-packages\gopup\economic\marco_cn.py in marco_cmlrd()
     20     """
     21     url = "http://114.115.232.154:8080/handler/download.ashx"
---> 22     excel_data = pd.read_excel(url, sheet_name="Data", header=0, skiprows=1)
     23     excel_data["Period"] = pd.to_datetime(excel_data["Period"]).dt.strftime("%Y-%m")
     24     excel_data.columns = [

~\AppData\Roaming\Python\Python39\site-packages\pandas\util\_decorators.py in wrapper(*args, **kwargs)
    294         )
    295         warnings.warn(msg, FutureWarning, stacklevel=stacklevel)
--> 296         return func(*args, **kwargs)
    297
    298     return wrapper

~\AppData\Roaming\Python\Python39\site-packages\pandas\io\excel\_base.py in read_excel(io, sheet_name, header, names, index_col, usecols, squeeze, dtype, engine, converters, true_values, false_values, skiprows, nrows, na_values, keep_default_na, na_filter, verbose, parse_dates, date_parser, thousands, comment, skipfooter, convert_float, mangle_dupe_cols)
    302
    303     if not isinstance(io, ExcelFile):
--> 304         io = ExcelFile(io, engine=engine)
    305     elif engine and engine != io.engine:
    306         raise ValueError(

~\AppData\Roaming\Python\Python39\site-packages\pandas\io\excel\_base.py in __init__(self, path_or_buffer, engine)
    865         self._io = stringify_path(path_or_buffer)
    866
--> 867         self._reader = self._engines[engine](self._io)
    868
    869     def __fspath__(self):

~\AppData\Roaming\Python\Python39\site-packages\pandas\io\excel\_xlrd.py in __init__(self, filepath_or_buffer)
     20         err_msg = "Install xlrd >= 1.0.0 for Excel support"
     21         import_optional_dependency("xlrd", extra=err_msg)
---> 22         super().__init__(filepath_or_buffer)
     23
     24     @property

~\AppData\Roaming\Python\Python39\site-packages\pandas\io\excel\_base.py in __init__(self, filepath_or_buffer)
    349             # N.B. xlrd.Book has a read attribute too
    350             filepath_or_buffer.seek(0)
--> 351             self.book = self.load_workbook(filepath_or_buffer)
    352         elif isinstance(filepath_or_buffer, str):
    353             self.book = self.load_workbook(filepath_or_buffer)

~\AppData\Roaming\Python\Python39\site-packages\xlrd\__init__.py in open_workbook(filename, logfile, verbosity, use_mmap, file_contents, encoding_override, formatting_info, on_demand, ragged_rows, ignore_workbook_corruption)
    168     # files that xlrd can parse don't start with the expected signature.
    169     if file_format and file_format != 'xls':
--> 170         raise XLRDError(FILE_FORMAT_DESCRIPTIONS[file_format]+'; not supported')
    171
    172     bk = open_workbook_xls(

XLRDError: Excel xlsx file; not supported
```
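The cause: xlrd 2.x dropped .xlsx support, and this download is an xlsx file. A hedged workaround until gopup switches readers is forcing the openpyxl engine (requires `pip install openpyxl`); the sketch below wraps the same call the traceback shows:

```python
import pandas as pd

def read_cmlrd_xlsx():
    """Re-run the failing pd.read_excel call with the openpyxl engine,
    which can read the .xlsx files that xlrd >= 2.0 refuses."""
    url = "http://114.115.232.154:8080/handler/download.ashx"
    return pd.read_excel(url, sheet_name="Data", header=0, skiprows=1,
                         engine="openpyxl")
```

Alternatively, pinning `xlrd==1.2.0` restores xlsx reading on older pandas versions.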

Baidu search index error — help requested

Using the Baidu search index interface with the sample code:

```python
index_df = gp.baidu_search_index(word="口罩", start_date='2020-01-01', end_date='2020-03-01', cookie=cookie)
```

raises the following error:

```
Traceback (most recent call last):
  File "<pyshell#10>", line 1, in <module>
    index_df = gp.baidu_search_index(word="口罩", start_date='2020-01-01', end_date='2020-03-01', cookie=cookie)
  File "C:\Users\zeyu_\AppData\Local\Programs\Python\Python39\lib\site-packages\gopup\index\index_baidu.py", line 264, in baidu_search_index
    all_data = data["userIndexes"][0][type]["data"]
TypeError: string indices must be integers
```

Any pointers appreciated, thanks!

Bug when fetching Weibo topic heat data

When querying a Weibo topic's heat for a single day, the interface returns data points for future timestamps. For example, calling it at 18:00 on Sep 18 returns the topic's heat at 22:00 on Sep 18 — a time that hasn't happened yet, and the interface cannot be providing forecasts.

Can anyone help me figure out what's going wrong here?

```
Traceback (most recent call last):
  File "E:/pythonProject2/01.17.py", line 3, in <module>
    df_index = g.weibo_user(keyword="雷军")
  File "E:\python\lib\site-packages\gopup\pro\client.py", line 48, in query
    raise Exception(result['msg'])
Exception: 'Response' object has no attribute 'msg'
```
