
stock-knowledge-graph's Introduction

stock-knowledge-graph

Build a small securities knowledge graph (knowledge base) from publicly available data on the web.

Welcome to watch, star or fork.

stock_graph_demo

Project directory structure

stock-knowledge-graph/
├── __init__.py
├── extract.py  # extract html pages for executives information
├── stock.py  # get stock industry and concept information
├── build_csv.py  # build csv files that can import neo4j
├── import.sh
├── data
│   ├── stockpage.zip
│   ├── executive_prep.csv
│   ├── stock_industry_prep.csv
│   ├── stock_concept_prep.csv
│   └── import  # import directory
│       ├── concept.csv
│       ├── executive.csv
│       ├── executive_stock.csv
│       ├── industry.csv
│       ├── stock.csv
│       ├── stock_concept.csv
│       └── stock_industry.csv
├── design.png
├── result.txt
├── img
│   ├── executive.png
│   └── executive_detail.png
├── import.report
├── README.md
└── requirements.txt

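Data sources

This project uses two data sources: information on company directors, and industry and concept information for each stock.

  • Company director information

    This data is in the stockpage archive under the data directory. Each file inside is named XXXXXX.html, where XXXXXX is the stock code. The pages were crawled from the individual-stock pages of 同花顺 (Tonghuashun); run unzip stockpage.zip to extract them. For example, the content of 600007.html comes from http://stockpage.10jqka.com.cn/600007/company/#manager

  • Stock industry and concept information

    This information is also publicly available on the web. Here we use the Tushare tool to obtain it; details are given in the tasks below.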
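The file-naming rule for the director pages can be sketched in Python as follows. This is a minimal illustration, assuming that every page in the archive is named with a six-digit code followed by .html; the function names are hypothetical and not taken from the project's extract.py.

```python
import re
import zipfile
from pathlib import PurePosixPath

# URL pattern taken from the 600007.html example in the text.
COMPANY_URL = "http://stockpage.10jqka.com.cn/{code}/company/#manager"


def stock_code_from_name(member_name):
    """Return the stock code for an archive member named XXXXXX.html, else None."""
    filename = PurePosixPath(member_name).name
    match = re.fullmatch(r"(\d{6})\.html", filename)  # assumes six-digit codes
    return match.group(1) if match else None


def list_stock_codes(zip_path):
    """List the stock codes of all XXXXXX.html pages inside the archive."""
    with zipfile.ZipFile(zip_path) as archive:
        codes = (stock_code_from_name(name) for name in archive.namelist())
        return sorted(code for code in codes if code is not None)


def source_url(code):
    """Build the page URL that a given stock code was crawled from."""
    return COMPANY_URL.format(code=code)


print(stock_code_from_name("stockpage/600007.html"))
print(source_url("600007"))
```

Calling list_stock_codes("data/stockpage.zip") would then give the codes of all companies whose pages are available, without unpacking the archive first.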

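Task 1: Extract board member information from the web pages

For each stock/company in the given HTML files, extract the board members' information. This consists of four fields: name, position, gender, and age. The name and position fields come from: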

executive

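There are 12 board members in total here, and all of them need to be extracted. The gender and age fields can also be extracted from the figure below: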

executive

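Finally, generate an executive_prep.csv file in the following format:

| Executive name | Gender | Age | Stock code | Position |
| 朴明志 | | 51 | 600007 | 董事长/董事 (Chairman/Director) |
| 高燕 | | 60 | 600007 | 执行董事 (Executive Director) |
| 刘永政 | | 50 | 600008 | 董事长/董事 (Chairman/Director) |
| ··· | ··· | ··· | ··· | ··· |

Note: it is recommended to use the corresponding English names for the table header.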
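Writing the extracted records to CSV might look like the sketch below. The English column names and the sample records are assumptions made for illustration; they are placeholders, not values taken from the real pages or from the project's extract.py.

```python
import csv
import io

# English header names, as the note recommends; the exact names are an assumption.
FIELDNAMES = ["name", "gender", "age", "code", "jobs"]


def write_executive_prep(records, out_file):
    """Write one CSV row per (executive, company) record."""
    writer = csv.DictWriter(out_file, fieldnames=FIELDNAMES, lineterminator="\n")
    writer.writeheader()
    for record in records:
        writer.writerow(record)


# Placeholder records for illustration only, not taken from the real pages.
sample = [
    {"name": "Executive A", "gender": "F", "age": 60, "code": "600007", "jobs": "执行董事"},
    {"name": "Executive B", "gender": "M", "age": 50, "code": "600008", "jobs": "董事长/董事"},
]

buffer = io.StringIO()
write_executive_prep(sample, buffer)
print(buffer.getvalue())
```

To write a real file instead of an in-memory buffer, open it with open("executive_prep.csv", "w", newline="", encoding="utf-8") and pass the file object as out_file.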

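Task 2: Obtain stock industry and concept information

This information can be obtained with the Tushare tool (official site: http://tushare.org/), which can be installed with pip. Once it is installed, the stock industry and concept information can be retrieved from Python. Reference: http://tushare.org/classifying.html#id2

The following code retrieves the stock industry information and stores the returned data directly in stock_industry_prep.csv.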

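import tushare as ts
# Fetch the industry classification of each stock as a DataFrame
df = ts.get_industry_classified()
# Save to "stock_industry_prep.csv"
df.to_csv("stock_industry_prep.csv", index=False)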

Similarly, the following code retrieves the stock concept information and stores it in stock_concept_prep.csv.

import tushare as ts
df = ts.get_concept_classified()
# Save to "stock_concept_prep.csv"
df.to_csv("stock_concept_prep.csv", index=False)

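Task 3: Design the knowledge graph

Design a graph like this:

  • Create a "Person" entity with a name, gender, and age

  • Create a "Company" entity that has a stock name in addition to the stock code

  • Create a "Concept" entity; each concept has a concept name

  • Create an "Industry" entity; each industry has an industry name

  • Mark "Company" entities with an "ST" flag, implemented as a LABEL

  • Create the relationship between "Person" and "Company"; this relationship carries roles such as chairman, executive director, and so on

  • Create the relationship between "Company" and "Concept"

  • Create the relationship between "Company" and "Industry"

Save the design diagram as design.png.

Note: entity and relationship names should be easy to understand. There is not necessarily a unique design for the requirements above; any design that covers them is acceptable. The "ST" flag marks a stock that is in a state of serious losses, and it can be determined from the prefix of the given stock name. For background, see the encyclopedia entry on ST stocks. The list of "ST" prefixes is ['*ST', 'ST', 'S*ST', 'SST'].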
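The prefix rule for the "ST" flag can be sketched as a small helper. The label names "Company" and "ST" and the function name are assumptions chosen for illustration, and the stock names in the usage lines are made up.

```python
# Prefix list as given in the note on the "ST" flag.
ST_PREFIXES = ("*ST", "ST", "S*ST", "SST")


def company_labels(stock_name):
    """Return the node labels for a company, adding "ST" when the name has an ST prefix."""
    name = stock_name.strip()
    # str.startswith accepts a tuple and checks each prefix in turn.
    if name.startswith(ST_PREFIXES):
        return ["Company", "ST"]
    return ["Company"]


# Hypothetical stock names, for illustration only.
for example in ["*ST Alpha", "SST Beta", "Gamma Holdings"]:
    print(example, company_labels(example))
```

Joining the returned list with ";" gives a value that could be placed in a label column of the import file, if the design stores the flag as an extra label.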

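Task 4: Create CSV files that can be imported into Neo4j

In the first two tasks we generated executive_prep.csv, stock_industry_prep.csv, and stock_concept_prep.csv, but these files cannot be imported into the Neo4j database directly. They need some processing to produce CSV files in a format that Neo4j can import. We need to generate the following files: executive.csv, stock.csv, concept.csv, industry.csv, executive_stock.csv, stock_industry.csv, stock_concept.csv. For the format requirements, see https://neo4j.com/docs/operations-manual/current/tutorial/import-tool/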
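As one possible shape for this step, the sketch below turns rows from executive_prep.csv into a node file and a relationship file. The header fields :ID, :LABEL, :START_ID, :END_ID and :TYPE follow the Neo4j import tool's CSV header format. Everything else is an assumption: the property names, the "Person" label, reusing the stock code as the company node ID, and deriving the person ID from a hash of name, gender and age so that namesakes with different attributes get separate nodes. It is not necessarily how build_csv.py does it.

```python
import csv
import hashlib
import io


def person_id(name, gender, age):
    """Derive a stable ID from name, gender and age (an assumed way to separate namesakes)."""
    key = f"{name}|{gender}|{age}".encode("utf-8")
    return hashlib.md5(key).hexdigest()


def build_import_csv(prep_rows, nodes_out, rels_out):
    """Write person nodes and person-to-company relationships in import format."""
    node_writer = csv.writer(nodes_out, lineterminator="\n")
    rel_writer = csv.writer(rels_out, lineterminator="\n")
    node_writer.writerow(["person_id:ID", "name", "gender", "age:int", ":LABEL"])
    rel_writer.writerow([":START_ID", ":END_ID", "jobs", ":TYPE"])
    seen = set()
    for row in prep_rows:
        pid = person_id(row["name"], row["gender"], row["age"])
        if pid not in seen:  # write each person node only once
            seen.add(pid)
            node_writer.writerow([pid, row["name"], row["gender"], row["age"], "Person"])
        # Assumes the stock code is also the ID of the company node in stock.csv.
        rel_writer.writerow([pid, row["code"], row["jobs"], "employee_of"])


# Placeholder rows for illustration only.
rows = [
    {"name": "Person A", "gender": "F", "age": "60", "code": "600007", "jobs": "Independent Director"},
    {"name": "Person A", "gender": "F", "age": "60", "code": "600008", "jobs": "Independent Director"},
    {"name": "Person A", "gender": "M", "age": "45", "code": "600008", "jobs": "Director"},
]
nodes, rels = io.StringIO(), io.StringIO()
build_import_csv(rows, nodes, rels)
print(nodes.getvalue())
print(rels.getvalue())
```

Writing each person only once avoids duplicate node IDs in the node file, while the relationship file still records every position the person holds.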

Task 5: Build the database from the CSV files above

neo4j_home$ bin/neo4j-admin import --id-type=STRING --nodes executive.csv --nodes stock.csv --nodes concept.csv --nodes industry.csv --relationships executive_stock.csv --relationships stock_industry.csv --relationships stock_concept.csv

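This command imports all the data into Neo4j. By default the data is stored in the graph.db folder. If graph.db already contains data, you can delete it first and then run the command.

After restarting the Neo4j service, you can view the knowledge graph at localhost:7474.

Note: the CSV files must be placed under ~/.config/Neo4j Desktop/Application/neo4jDatabases/database-xxxx/installation-4.0.4, that is, at the same level as the bin folder; otherwise absolute paths are required.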
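When the CSV files live somewhere else, the import command needs absolute paths. The sketch below assembles the same command as above with absolute paths, using only the options that appear in that command; the function name and the data/import location are assumptions, and option syntax can differ between Neo4j versions.

```python
from pathlib import Path

NODE_FILES = ["executive.csv", "stock.csv", "concept.csv", "industry.csv"]
RELATIONSHIP_FILES = ["executive_stock.csv", "stock_industry.csv", "stock_concept.csv"]


def build_import_command(csv_dir):
    """Return the neo4j-admin import command as an argument list with absolute CSV paths."""
    base = Path(csv_dir).resolve()
    command = ["bin/neo4j-admin", "import", "--id-type=STRING"]
    for name in NODE_FILES:
        command += ["--nodes", str(base / name)]
    for name in RELATIONSHIP_FILES:
        command += ["--relationships", str(base / name)]
    return command


# "data/import" is the import directory from the project tree.
print(" ".join(build_import_command("data/import")))
```

The list can be passed to subprocess.run with cwd set to the Neo4j home directory. Keeping each option and its value as separate list items also avoids stray spaces inside an option such as "-- nodes".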

Simple query commands

// Query nodes
MATCH (n:Concept) RETURN n LIMIT 25
// Query relationships
MATCH p=()-[r:industry_of]->() RETURN p LIMIT 100

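Task 6: Using the knowledge graph you built, write Cypher statements to answer the following questions

(1) How many companies are currently of the "ST" type?

(2) Among all the independent directors of company "600519", how many also serve as independent directors of other companies?

(3) How many companies both belong to the environmental protection industry and have a foreign-capital background?

(4) For all companies with the lithium battery concept, what proportion of the independent directors are women?

Provide the corresponding Cypher statements and answers, and write the results in result.txt.

Task 7: When building the Person entities, how exactly should duplicate names be handled?

Write a brief outline of your approach in result.txt.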

stock-knowledge-graph's People

Contributors

jakkwj, lemonhu


stock-knowledge-graph's Issues


stock.py reports an error

After using the tushare tool, I get the following two errors:
1. socket.gaierror: [Errno 11001] getaddrinfo failed
2. urllib.error.URLError: <urlopen error [Errno 11001] getaddrinfo failed>
Please help! Thanks a lot.
Traceback (most recent call last):
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python39\lib\urllib\request.py", line 1346, in do_open
h.request(req.get_method(), req.selector, req.data, headers,
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python39\lib\http\client.py", line 1255, in request
self._send_request(method, url, body, headers, encode_chunked)
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python39\lib\http\client.py", line 1301, in _send_request
self.endheaders(body, encode_chunked=encode_chunked)
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python39\lib\http\client.py", line 1250, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python39\lib\http\client.py", line 1010, in _send_output
self.send(msg)
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python39\lib\http\client.py", line 950, in send
self.connect()
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python39\lib\http\client.py", line 921, in connect
self.sock = self._create_connection(
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python39\lib\socket.py", line 822, in create_connection
for res in getaddrinfo(host, port, 0, SOCK_STREAM):
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python39\lib\socket.py", line 953, in getaddrinfo
for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno 11001] getaddrinfo failed

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "D:\stock.py", line 37, in <module>
df_industry = ts.get_industry_classified()
File "C:\Users\Administrator\AppData\Roaming\Python\Python39\site-packages\tushare\stock\classifying.py", line 49, in get_industry_classified
df = pd.read_csv(ct.TSDATA_CLASS%(ct.P_TYPE['http'], ct.DOMAINS['oss'], 'industry'),
File "C:\Users\Administrator\AppData\Roaming\Python\Python39\site-packages\pandas\io\parsers.py", line 610, in read_csv
return _read(filepath_or_buffer, kwds)
File "C:\Users\Administrator\AppData\Roaming\Python\Python39\site-packages\pandas\io\parsers.py", line 462, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
File "C:\Users\Administrator\AppData\Roaming\Python\Python39\site-packages\pandas\io\parsers.py", line 819, in __init__
self._engine = self._make_engine(self.engine)
File "C:\Users\Administrator\AppData\Roaming\Python\Python39\site-packages\pandas\io\parsers.py", line 1050, in _make_engine
return mapping[engine](self.f, **self.options) # type: ignore[call-arg]
File "C:\Users\Administrator\AppData\Roaming\Python\Python39\site-packages\pandas\io\parsers.py", line 1867, in __init__
self._open_handles(src, kwds)
File "C:\Users\Administrator\AppData\Roaming\Python\Python39\site-packages\pandas\io\parsers.py", line 1362, in _open_handles
self.handles = get_handle(
File "C:\Users\Administrator\AppData\Roaming\Python\Python39\site-packages\pandas\io\common.py", line 558, in get_handle
ioargs = _get_filepath_or_buffer(
File "C:\Users\Administrator\AppData\Roaming\Python\Python39\site-packages\pandas\io\common.py", line 289, in _get_filepath_or_buffer
req = urlopen(filepath_or_buffer)
File "C:\Users\Administrator\AppData\Roaming\Python\Python39\site-packages\pandas\io\common.py", line 195, in urlopen
return urllib.request.urlopen(*args, **kwargs)
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python39\lib\urllib\request.py", line 214, in urlopen
return opener.open(url, data, timeout)
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python39\lib\urllib\request.py", line 517, in open
response = self._open(req, data)
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python39\lib\urllib\request.py", line 534, in _open
result = self._call_chain(self.handle_open, protocol, protocol +
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python39\lib\urllib\request.py", line 494, in _call_chain
result = func(*args)
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python39\lib\urllib\request.py", line 1375, in http_open
return self.do_open(http.client.HTTPConnection, req)
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python39\lib\urllib\request.py", line 1349, in do_open
raise URLError(err)
urllib.error.URLError: <urlopen error [Errno 11001] getaddrinfo failed>

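Data acquisition code

Hello, may I ask whether you could also provide the code used to crawl the data, for example the code for crawling the 同花顺 (Tonghuashun) pages?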

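Stuck on Task 5

My CSV files are in the import directory, but the command-line import does not work. I run neo4j-admin import from the bin directory, and no matter how I change the paths after --nodes, it fails with errors such as "invalid options" and "invalid file". I then put the CSV files in the bin directory and it still fails, this time with an "unmatched arguments" error. I'm at my wits' end. Could you add me on QQ or WeChat? 417945175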

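Cannot query graphs with multiple relationship types

After importing the CSV data with neo4j-admin import, I can only query a single relationship type at a time, such as employee_of or industry_of. I cannot get a result that mixes several relationship types together, as shown in your demo image, and the relevant Cypher statements do not return it either.

I hope you can help. Thank you.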

Garbled characters in the files

Hello, why do all the .csv files I get after running the code contain garbled characters?

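A personal question

Hello, I'd like to ask a question of my own, if you don't mind answering. I have two entities, a person and a company, with two relationships between them: person invests in company, and company invests in company. Assuming both have the type invest, can I put both in a single CSV file and then import that file into Neo4j?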

Running stock.py produces the following error

Traceback (most recent call last):
File "D:/PycharmProjects/untitled/stock-knowledge-graph-master/stock.py", line 1, in <module>
import tushare as ts
File "D:\PycharmProjects\untitled\venv\lib\site-packages\tushare\__init__.py", line 11, in <module>
from tushare.stock.trading import (get_hist_data, get_tick_data,
File "D:\PycharmProjects\untitled\venv\lib\site-packages\tushare\stock\trading.py", line 15, in <module>
import pandas as pd
ImportError: No module named 'pandas'

How can I solve this?

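A question about retrieval

Hello, I'm new to knowledge graphs. Are knowledge graphs suitable for the following task?
Given a passage of text or some keywords as input, search a large collection of documents for related ones, the more closely related the better. Ideally it would also identify which document the input text appears in. Can a knowledge graph do this?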

Importing .csv files into Neo4j

I'm stuck on Task 5. I enter the shell command and keep getting an error. Any guidance would be appreciated.
unrecognized option: ''

usage: neo4j-admin import [--mode=csv] [--database=]
[--additional-config=]
[--report-file=]
[--nodes[:Label1:Label2]=<"file1,file2,...">]
[--relationships[:RELATIONSHIP_TYPE]=<"file1,file2,...">]
[--id-type=<STRING|INTEGER|ACTUAL>]
[--input-encoding=]
[--ignore-extra-columns[=<true|false>]]
[--ignore-duplicate-nodes[=<true|false>]]
[--ignore-missing-nodes[=<true|false>]]
[--multiline-fields[=<true|false>]]
[--delimiter=]
[--array-delimiter=]
[--quote=]
[--max-memory=]
[--f=]
[--high-io=<true/false>]
usage: neo4j-admin import --mode=database [--database=]
[--additional-config=]
[--from=]

environment variables:
NEO4J_CONF Path to directory which contains neo4j.conf.
NEO4J_DEBUG Set to anything to enable debug output.
NEO4J_HOME Neo4j home directory.
HEAP_SIZE Set JVM maximum heap size during command execution.
Takes a number and a unit, for example 512m.

Import a collection of CSV files with --mode=csv (default), or a database from a
pre-3.0 installation with --mode=database.

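A question about the command line

I'm a beginner. How exactly should the command line be written? It keeps failing and printing the usage message. I also put the CSV files under bin, so why are they not recognized? Thank you.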
:\neo4j-community-3.4.17\bin>neo4j-admin import --nodes executive.csv --nodes stock.csv -- nodes concept.csv --nodes industry.csv --relationships executive_stock.csv --relationships stock_industry.csv -- relationships stock_concept.csv
unrecognized option: ''

usage: neo4j-admin import [--mode=csv] [--database=]
[--additional-config=]
[--report-file=]
[--nodes[:Label1:Label2]=<"file1,file2,...">]
[--relationships[:RELATIONSHIP_TYPE]=<"file1,file2,...">]
[--id-type=<STRING|INTEGER|ACTUAL>]
[--input-encoding=]
[--ignore-extra-columns[=<true|false>]]
[--ignore-duplicate-nodes[=<true|false>]]
[--ignore-missing-nodes[=<true|false>]]
[--multiline-fields[=<true|false>]]
[--delimiter=]
[--array-delimiter=]
[--quote=]
[--max-memory=]
[--f=<File containing all arguments to this impo

Id 'xxx' is defined more than once in group 'global id space'

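Hello, when importing the generated files into Neo4j, I ran into the following problem:
image
I searched online and found that adding --ignore-duplicate-nodes resolves the duplicate IDs, but after that another problem appeared. What is going on?
I also have another question: build_executive produces duplicate personId values. Does this have any effect?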

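How are the ID(Executive) and ID(Concept) fields named?

Hello author,
I am trying to complete this project by following your steps. I now have the data from Task 2, but I don't know how you named the ID(Executive) and ID(Concept) fields in the import folder (see the figure).
I look forward to your reply. Thank you!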
2022-04-18_222334
