macks22 / dblp
Parse the dblp data into a structured format for experimentation.
License: MIT License
I was able to run the whole project, but I am not sure where to get the author ID to author name mapping.
Hi,
I want to parse the dblp dataset. I went through the documentation, but I am still struggling to get started. Can anyone please elaborate on the steps to get this running?
According to the documentation, there should be a task called BuildAllGraphData to build all the graphs; unfortunately, it does not exist in the code.
Thanks.
Link to the log file; running as per the README, I am getting this error.
hi,
after installing all dependencies, I am still unable to tell what went wrong: it did generate the CSV files but is not generating the GML file,
and the program is giving this error.
Could you help in resolving this issue? I would appreciate it if you could upload the GML files to the repo.
INFO: Task RemoveUniqueVenues__99914b932b died unexpectedly with exit code -9
/usr/local/lib/python2.7/dist-packages/luigi/parameter.py:259: UserWarning: Parameter None is not of type string.
warnings.warn("Parameter {0} is not of type string.".format(str(x)))
===== Luigi Execution Summary =====
Scheduled 25 tasks of which:
This progress looks :( because there were failed tasks
It would be good to find some sort of programmatic way to do this, such that the final output is a polished data dictionary which can be exported to Excel or converted to a PDF. Crawling the dependency tree sounds like the way to go in order to grab all possible output files. Then this list can be compared with a documentation file (perhaps doc.md or doc.csv) in order to determine coverage.
Of course, such an approach would need to ignore the optional year parameters. There's no sense in producing all possible files for all possible year ranges.
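The crawl could look something like the following. This is only a sketch: the `Task`/`Target` classes here are minimal stand-ins with a luigi-like `requires()`/`output()` interface, not the project's actual task classes.

```python
class Target(object):
    """Stand-in for a luigi Target: just carries an output path."""
    def __init__(self, path):
        self.path = path

class Task(object):
    """Stand-in with a luigi-like interface (requires/output)."""
    def __init__(self, path, deps=()):
        self._out = Target(path)
        self._deps = list(deps)
    def requires(self):
        return self._deps
    def output(self):
        return [self._out]

def crawl_outputs(task, seen=None):
    """Depth-first walk of the dependency tree, collecting every output
    path so the list can be diffed against a doc.md/doc.csv coverage file."""
    if seen is None:
        seen = set()
    seen.add(task)
    for dep in task.requires():
        if dep not in seen:
            crawl_outputs(dep, seen)
    return sorted(out.path for t in seen for out in t.output())
```

With real luigi tasks, the same walk works against `task.requires()` and `task.output()`, skipping the optional year parameters as noted above.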
Write a script that downloads the data from the AMiner site and places it in the proper location. This could be a luigi Task or a bash script. The task option might be better, because then it can be a programmatic dependency which can be resolved (run) by luigi, rather than an external one.
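A minimal sketch of the task-style option, as a plain idempotent download function: the URL below is a placeholder, not the real AMiner link, and the function names are hypothetical. Wrapping `fetch` in a luigi `Task` whose `output()` is the archive path would make it a resolvable dependency.

```python
import os
from urllib.request import urlretrieve

# Placeholder URL -- substitute the actual AMiner download link.
AMINER_URL = "https://example.org/aminer-dblp.zip"

def archive_path(data_dir, url=AMINER_URL):
    """Destination path for the archive inside the project's data directory."""
    return os.path.join(data_dir, os.path.basename(url))

def fetch(data_dir, url=AMINER_URL):
    """Download the archive only if it is not already present -- the same
    idempotence a luigi Task gets from checking its output()."""
    dest = archive_path(data_dir, url)
    if not os.path.exists(dest):
        urlretrieve(url, dest)
    return dest
```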
Could you please add descriptions for each file in the repdocs module. I'm trying to use this parser for my projects and am unclear what all the files contain and how they relate to each other.
For example, the dictionary created using gensim.corpora has a different number of documents than the tfidf matrix created.
While the AMiner group already provides a co-authorship network, it unfortunately does not allow filtering by year ranges, which is a key feature of this library. Therefore it would be useful to implement tasks, inheriting from YearFilterableTask, which construct such a network. These could then also be combined with the term attributes from the repdoc corpus files and the ground truth from the venues in order to produce a more useful network.
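The year-filtered construction itself could be sketched as below. The `(year, [authors])` record schema is an assumption for illustration, not the repo's actual CSV format.

```python
from collections import Counter
from itertools import combinations

def coauthor_edges(records, start, end):
    """Count weighted co-author edges from (year, [authors]) records,
    keeping only papers inside [start, end] -- the same year filtering
    the YearFilterableTask subclasses apply."""
    edges = Counter()
    for year, authors in records:
        if start <= year <= end:
            # sorted pairs give a canonical undirected edge key
            for a, b in combinations(sorted(set(authors)), 2):
                edges[(a, b)] += 1
    return edges
```

The edge weights (co-publication counts) and the author node set then feed directly into GML output or node attributes.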
It appears that Tasks with no output are supposed to implement a custom complete() method, since completion normally means all the output files exist. We should either make output() include all output files or add a custom complete() method that determines completion some other way.
More discussion here: https://groups.google.com/forum/#!topic/luigi-user/F8AAG91tZfk
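One common shape for such a custom complete() is a marker-file mixin; this is a sketch with hypothetical names (`MarkerComplete`, `mark_done`), which in practice would be mixed into a luigi.Task subclass and `mark_done()` called at the end of run().

```python
import os

class MarkerComplete(object):
    """Sketch: tasks with no natural output touch a marker file on
    success, and complete() checks for that file."""
    marker_dir = "."

    def marker_path(self):
        return os.path.join(self.marker_dir,
                            ".%s.done" % self.__class__.__name__)

    def mark_done(self):
        # Call this at the end of run().
        open(self.marker_path(), "a").close()

    def complete(self):
        return os.path.exists(self.marker_path())
```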
Currently the filtering module always grabs the files from base-csv to filter from. However, if a time range is given that is subsumed by another filtered file in filtered-csv, that file could be used instead. For instance, if we need to filter to 2011-2011, we can do that with the dataset for 2010-2012, since 2011 is subsumed by it.
This should be implemented for all YearFilterableTask subclasses (so probably as something generic on the base class).
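The generic lookup on the base class might reduce to a small pure function like this sketch (the name `find_subsuming` is hypothetical): given the ranges already materialized in filtered-csv, pick the tightest one that covers the requested range, or fall back to base-csv.

```python
def find_subsuming(available, start, end):
    """Return the tightest (start, end) range from `available` that
    subsumes [start, end], or None if only base-csv can serve it."""
    candidates = [(s, e) for s, e in available
                  if s <= start and end <= e]
    if not candidates:
        return None
    # prefer the smallest covering range: less data to re-filter
    return min(candidates, key=lambda r: r[1] - r[0])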
Hi Mack,
Thanks for closing the BuildAllGraph issue so promptly.
I also noticed that module 2 does not work as expected.
If I ran,
$ python filtering.py FilterAllCSVRecordsToYearRange --start 1990 --end 2000 --local-scheduler
I would receive the following errors.
ERROR: [pid 31790] Worker Worker(salt=969812214, workers=1, host=ubuntu, username=hello, pid=31790) failed FilterVenuesToYearRange(start=1990, end=1990)
Traceback (most recent call last):
File "/Users/hello/anaconda2/lib/python2.7/site-packages/luigi/worker.py", line 192, in run
new_deps = self._run_get_new_deps()
File "/Users/hello/anaconda2/lib/python2.7/site-packages/luigi/worker.py", line 130, in _run_get_new_deps
task_gen = self.task.run()
File "filtering.py", line 158, in run
with self.output().open() as afile:
File "/Users/hello/anaconda2/lib/python2.7/site-packages/luigi/local_target.py", line 152, in open
fileobj = FileWrapper(io.BufferedReader(io.FileIO(self.path, mode)))
IOError: [Errno 2] No such file or directory: '/Users/hello/dblp/data/processed/filtered-csv/venue-1990-1990.csv'
Many thanks in advance.
For a complete dataset, generate a summary of salient characteristics, such as:
I have followed the steps but am not able to find config.py.
Tests should rely on a small, representative portion of the real dataset.
I am facing a problem while running pipeline.py: I am getting an error in the config module.
This will be useful for graph algorithms that read features from the nodes rather than reading them from another file. One example is EDCAR.
When I tried to run the pipeline, paper.csv was generated from the AMiner paper text file (about 2.2G). The paper.csv file was too large (it exceeded 1.7T), but my computer has only about 2T of storage space, so it failed each time I ran the project. Do you know how to fix this?
Hi,
I am trying to use your parser, and while running the command pipeline admin$ python pipeline.py BuildDataset --start 2000 --end 2001 --local-scheduler
I am getting an error connected to "NameError: name 'basestring' is not defined" in utils.py. I looked at the code, but to be honest I struggle to understand what the variable basestring is supposed to be. Any indication of what I can check or how I can solve this would be much appreciated!
Error:
DEBUG: Checking if BuildDataset(start=2000, end=2001) is complete
/Users/admin/anaconda/lib/python3.5/site-packages/luigi/worker.py:328: UserWarning: Task BuildDataset(start=2000, end=2001) without outputs has no custom complete() method
is_complete = task.complete()
DEBUG: Checking if BuildLCCAuthorRepdocCorpusTfidf(start=2000, end=2001) is complete
INFO: Informed scheduler that task BuildDataset_2001_2000_429339e3d6 has status PENDING
WARNING: Will not run BuildLCCAuthorRepdocCorpusTfidf(start=2000, end=2001) or any dependencies due to error in complete() method:
Traceback (most recent call last):
File "/Users/admin/anaconda/lib/python3.5/site-packages/luigi/worker.py", line 328, in check_complete
is_complete = task.complete()
File "/Users/admin/anaconda/lib/python3.5/site-packages/luigi/task.py", line 533, in complete
outputs = flatten(self.output())
File "/Users/admin/Desktop/DBLP_parser/dblp-master/pipeline/util.py", line 39, in output
if isinstance(self.base_paths, basestring):
NameError: name 'basestring' is not defined
INFO: Informed scheduler that task BuildLCCAuthorRepdocCorpusTfidf_2001_2000_429339e3d6 has status UNKNOWN
INFO: Done scheduling tasks
INFO: Running Worker with 1 processes
DEBUG: Asking scheduler for work...
DEBUG: Done
DEBUG: There are no more tasks to run at this time
DEBUG: There are 1 pending tasks possibly being run by other workers
DEBUG: There are 1 pending tasks unique to this worker
DEBUG: There are 1 pending tasks last scheduled by this worker
INFO: Worker Worker(salt=038425231, workers=1, host=Monikas-MacBook-Pro.local, username=admin, pid=75675) was stopped. Shutting down Keep-Alive thread
INFO:
===== Luigi Execution Summary =====
Scheduled 2 tasks of which:
* 1 failed scheduling:
- 1 BuildLCCAuthorRepdocCorpusTfidf(start=2000, end=2001)
* 1 were left pending, among these:
* 1 had dependencies whose scheduling failed:
- 1 BuildDataset(start=2000, end=2001)
Did not run any tasks
This progress looks :( because there were tasks whose scheduling failed
===== Luigi Execution Summary =====
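The NameError above happens because `basestring` exists only in Python 2; the log shows the code running under Python 3.5, where `str` covers all strings. One common fix is a small compatibility shim like the following sketch (the `string_types` name is mine, not the repo's), used wherever util.py currently calls `isinstance(x, basestring)`:

```python
# Python 2/3 compatibility shim: basestring was removed in Python 3.
try:
    string_types = basestring  # Python 2
except NameError:
    string_types = str  # Python 3

def is_string(x):
    """True for any string, under either Python version."""
    return isinstance(x, string_types)
```

In util.py the failing check would then read `isinstance(self.base_paths, string_types)`.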
From the data, it appears the AMiner group did not perform any name disambiguation. This has led to a dataset with quite a few duplicate author records. This package currently does not address these issues.
The most obvious examples are those where the first or second name is abbreviated with a single letter in one place and spelled out fully in another. Use of dots and/or hyphens in some places also leads to different entity mappings. Another case that is quite common is when hyphenated names are spelled in some places with the hyphen and in some without.
There are also simple common misspellings, although these are harder to detect, since an edit distance of 1 or 2 could just as easily be a completely different name. One case which might be differentiated is when the edit is a deletion of a letter in a string of one or more of that same letter. For instance, "Acharya" vs. "Acharyya". Here it is likely that the second spelling simply has an extraneous y.
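That run-deletion case can be detected with a small heuristic; this is a sketch of the idea, not a deduplication the package currently performs.

```python
def differs_by_run_deletion(a, b):
    """True when the shorter name equals the longer one with a single
    letter removed from a run of repeated letters, e.g.
    'Acharyya' vs. 'Acharya'. Order of arguments does not matter."""
    if len(a) < len(b):
        a, b = b, a
    if len(a) - len(b) != 1:
        return False
    for i in range(len(b)):
        if a[i] != b[i]:
            # the extra letter must be adjacent to an identical letter
            in_run = (i > 0 and a[i] == a[i - 1]) or \
                     (i + 1 < len(a) and a[i] == a[i + 1])
            return in_run and a[:i] + a[i + 1:] == b
    # the extra letter sits at the very end of the longer name
    return len(a) >= 2 and a[-1] == a[-2]
```

Pairs flagged this way are much more likely to be the same author than arbitrary edit-distance-1 pairs, though a few real distinct-name collisions would still need manual review.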