
dblp's Issues

How to run this on the Aminer dataset?

Hi,

I want to parse the dblp dataset. I went through the documentation, but I am still struggling to get started. Can anyone please elaborate on the steps to get this running?

INFO: Task RemoveUniqueVenues__99914b932b died unexpectedly ERROR....

Hi,
After installing all dependencies, I am still unable to tell what went wrong: the run generates the CSV files but not the GML file, and the program gives the error below.
Could you help in resolving this issue? I would also appreciate it if you could upload the GML files to the repo.

INFO: Task RemoveUniqueVenues__99914b932b died unexpectedly with exit code -9
/usr/local/lib/python2.7/dist-packages/luigi/parameter.py:259: UserWarning: Parameter None is not of type string.
warnings.warn("Parameter {0} is not of type string.".format(str(x)))

===== Luigi Execution Summary =====

Scheduled 25 tasks of which:

  • 2 present dependencies were encountered:
    • 1 AminerNetworkAuthorships()
    • 1 AminerNetworkPapers()
  • 5 ran successfully:
    • 1 CSVPaperRecords()
    • 1 CSVRefsRecords()
    • 1 ParseAuthorshipsToCSV()
    • 1 ParsePapersToCSV()
    • 1 RemovePapersNoVenueOrYear()
  • 1 failed:
    • 1 RemoveUniqueVenues()
  • 17 were left pending, among these:
    • 17 had failed dependencies:
      • 1 AuthorCitationGraphLCCIdmap(start=2003, end=2004)
      • 1 BuildAuthorCitationGraph(start=2003, end=2004)
      • 1 BuildAuthorRepdocVectors(start=2003, end=2004)
      • 1 BuildDataset(start=2003, end=2004)
      • 1 BuildLCCAuthorRepdocCorpusTf(start=2003, end=2004)
        ...

This progress looks :( because there were failed tasks

Thoroughly document each output file.

It would be good to find some sort of programmatic way to do this, such that the final output is a polished data dictionary which can be exported to Excel or converted to a PDF. Crawling the dependency tree sounds like the way to go in order to grab all possible output files. Then this list can be compared with a documentation file (perhaps doc.md or doc.csv) in order to determine coverage.

Of course, such an approach would need to ignore the optional year parameters. There's no sense in producing all possible files for all possible year ranges.
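One possible starting point, sketched below assuming only the standard luigi task APIs (requires(), output(), luigi.task.flatten); the function name and the choice of root task are hypothetical:

    from luigi.task import flatten

    def collect_output_paths(task, seen=None):
        """Walk the dependency tree from a root task and gather every
        output path, visiting each task once."""
        seen = seen if seen is not None else set()
        if task.task_id in seen:
            return []
        seen.add(task.task_id)
        paths = [t.path for t in flatten(task.output()) if hasattr(t, 'path')]
        for dep in flatten(task.requires()):
            paths.extend(collect_output_paths(dep, seen))
        return paths

Each collected path could then be checked against the entries in doc.md or doc.csv to report documentation coverage.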

Add AMiner data retrieval script

Write a script that downloads the data from the AMiner site and places it in the proper location. This could be a luigi Task or a bash script. The task option might be better, because then it can be a programmatic dependency which can be resolved (run) by luigi, rather than an external one.
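A minimal sketch of the task option, assuming a single archive download; the URL, file name, and data directory below are placeholders that would need to match the pipeline's actual expectations:

    import os
    import luigi

    try:  # Python 3
        from urllib.request import urlretrieve
    except ImportError:  # Python 2
        from urllib import urlretrieve

    AMINER_URL = 'https://example.org/aminer-dataset.zip'  # placeholder URL
    DATA_DIR = 'data'  # placeholder; should match the pipeline's raw-data dir

    class DownloadAminerData(luigi.Task):
        """Fetch the raw AMiner dump so downstream tasks can require() it."""

        def output(self):
            return luigi.LocalTarget(os.path.join(DATA_DIR, 'aminer-dataset.zip'))

        def run(self):
            self.output().makedirs()  # create the data directory if missing
            urlretrieve(AMINER_URL, self.output().path)

Any task that reads the raw files could then list DownloadAminerData in its requires(), letting luigi resolve the download automatically.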

Repdocs Module Documentation

Could you please add descriptions for each file in the repdocs module? I'm trying to use this parser for my projects and am unclear about what all the files contain and how they relate to each other.
For example, the dictionary created using gensim.corpora has a different number of documents than the tf-idf matrix created from it.

Build co-authorship network

While the AMiner group already provides a co-authorship network, it unfortunately does not allow for filtering by year ranges, which is a key feature of this library. It would therefore be useful to implement tasks, inheriting from YearFilterableTask, that construct such a network. These could then be combined with the term attributes from the repdoc corpus files and the ground truth from the venues to produce a more useful network.
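YearFilterableTask's interface is defined by this repo, so the sketch below only illustrates the core construction, using networkx and hypothetical column layouts (the pipeline itself may use a different graph library):

    import csv
    import itertools
    from collections import defaultdict

    import networkx as nx

    def build_coauthorship_graph(authorship_csv, paper_year, start, end):
        """Connect every pair of authors sharing a paper published in
        [start, end]; edge weights count co-authored papers.

        Assumes (paper_id, author_id) rows and a paper_id -> year map.
        """
        authors_by_paper = defaultdict(set)
        with open(authorship_csv) as f:
            for paper_id, author_id in csv.reader(f):
                if start <= paper_year.get(paper_id, -1) <= end:
                    authors_by_paper[paper_id].add(author_id)

        g = nx.Graph()
        for authors in authors_by_paper.values():
            for u, v in itertools.combinations(sorted(authors), 2):
                w = g.get_edge_data(u, v, default={'weight': 0})['weight']
                g.add_edge(u, v, weight=w + 1)
        return g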

Improve filtering to use smaller dependencies if available

Currently the filtering module always grabs the files from base-csv to filter from. However, if a time range is given that is subsumed by another filtered file in filtered-csv, that file could be used instead. For instance, if we need to filter to 2011-2011, we can do that with the dataset for 2010-2012, since 2011 is subsumed by it.

This should be implemented for all YearFilterableTask subclasses (so probably something generic on the base class).
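A sketch of the lookup, assuming filtered files follow the prefix-start-end.csv naming seen elsewhere in this tracker (e.g. venue-1990-1990.csv); the function name and directory constant are hypothetical:

    import os
    import re

    FILTERED_DIR = 'data/processed/filtered-csv'  # assumption, from the traceback below

    def find_subsuming_file(prefix, start, end, directory=FILTERED_DIR):
        """Return the narrowest already-filtered file whose year range
        subsumes [start, end], or None to fall back to base-csv."""
        pattern = re.compile(r'^%s-(\d{4})-(\d{4})\.csv$' % re.escape(prefix))
        best = None
        for name in os.listdir(directory):
            m = pattern.match(name)
            if not m:
                continue
            lo, hi = int(m.group(1)), int(m.group(2))
            if lo <= start and end <= hi:
                if best is None or hi - lo < best[0]:
                    best = (hi - lo, os.path.join(directory, name))
        return best[1] if best else None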

python filtering.py FilterAllCSVRecordsToYearRange --start 1990 --end 2000 --local-scheduler does not work as documented

Hi Mack,

Thanks for closing the BuildAllGraph issue so promptly.

I also noticed that module 2 does not work as expected.

If I run

$ python filtering.py FilterAllCSVRecordsToYearRange --start 1990 --end 2000 --local-scheduler

I receive the following error:

ERROR: [pid 31790] Worker Worker(salt=969812214, workers=1, host=ubuntu, username=hello, pid=31790) failed FilterVenuesToYearRange(start=1990, end=1990)
Traceback (most recent call last):
File "/Users/hello/anaconda2/lib/python2.7/site-packages/luigi/worker.py", line 192, in run
new_deps = self._run_get_new_deps()
File "/Users/hello/anaconda2/lib/python2.7/site-packages/luigi/worker.py", line 130, in _run_get_new_deps
task_gen = self.task.run()
File "filtering.py", line 158, in run
with self.output().open() as afile:
File "/Users/hello/anaconda2/lib/python2.7/site-packages/luigi/local_target.py", line 152, in open
fileobj = FileWrapper(io.BufferedReader(io.FileIO(self.path, mode)))
IOError: [Errno 2] No such file or directory: '/Users/hello/dblp/data/processed/filtered-csv/venue-1990-1990.csv'

Many thanks in advance.

Tasks to summarize data

For a complete dataset, generate a summary of salient characteristics (see the sketch after this list), such as:

  • number of nodes and edges for each graph, diameter, avg. degree
  • number of documents, terms, and nonzeros in each corpus, quantiles on term count
  • proportion of papers with abstracts
  • ground truth stats: number of venues, quantiles on community size
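A rough sketch of how the graph statistics might be computed, assuming the graphs are read back from the pipeline's GML output with networkx:

    import networkx as nx  # assumption: the pipeline's GML files load cleanly here

    def summarize_graph(gml_path):
        """Report node/edge counts, average degree, and diameter for one graph."""
        g = nx.read_gml(gml_path)
        if g.is_directed():
            g = g.to_undirected()  # connectivity/diameter need an undirected view
        n, m = g.number_of_nodes(), g.number_of_edges()
        print('nodes: %d  edges: %d  avg degree: %.2f' % (n, m, 2.0 * m / max(n, 1)))
        # diameter is expensive (all-pairs BFS); restrict to the LCC on big graphs
        if nx.is_connected(g):
            print('diameter: %d' % nx.diameter(g))

Corpus statistics (documents, terms, nonzeros) could be gathered similarly from the gensim corpus files, and the abstract proportion from the paper records.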

paper.csv is too large to save in my computer

When I tried to run the pipeline, paper.csv was generated from AMiner-Paper.txt (about 2.2 GB), but paper.csv grew too large: it exceeded 1.7 TB, and my computer has only about 2 TB of storage, so the run failed every time. Do you know how to fix this?

Failed scheduling due to util.py: 'basestring' is not defined?

Hi,
I am trying to use your parser, and while running the command python pipeline.py BuildDataset --start 2000 --end 2001 --local-scheduler I get an error connected to "NameError: name 'basestring' is not defined" in util.py. I looked at the code, but to be honest I struggle to understand what the basestring variable is supposed to be. Any indication of what I can check or how I can solve this would be much appreciated! (See the compatibility sketch after the log below.)

Error:

DEBUG: Checking if BuildDataset(start=2000, end=2001) is complete
/Users/admin/anaconda/lib/python3.5/site-packages/luigi/worker.py:328: UserWarning: Task BuildDataset(start=2000, end=2001) without outputs has no custom complete() method
  is_complete = task.complete()
DEBUG: Checking if BuildLCCAuthorRepdocCorpusTfidf(start=2000, end=2001) is complete
INFO: Informed scheduler that task   BuildDataset_2001_2000_429339e3d6   has status   PENDING
WARNING: Will not run BuildLCCAuthorRepdocCorpusTfidf(start=2000, end=2001) or any dependencies due to error in complete() method:
Traceback (most recent call last):
  File "/Users/admin/anaconda/lib/python3.5/site-packages/luigi/worker.py", line 328, in check_complete
    is_complete = task.complete()
  File "/Users/admin/anaconda/lib/python3.5/site-packages/luigi/task.py", line 533, in complete
    outputs = flatten(self.output())
  File "/Users/admin/Desktop/DBLP_parser/dblp-master/pipeline/util.py", line 39, in output
    if isinstance(self.base_paths, basestring):
NameError: name 'basestring' is not defined

INFO: Informed scheduler that task   BuildLCCAuthorRepdocCorpusTfidf_2001_2000_429339e3d6   has status   UNKNOWN
INFO: Done scheduling tasks
INFO: Running Worker with 1 processes
DEBUG: Asking scheduler for work...
DEBUG: Done
DEBUG: There are no more tasks to run at this time
DEBUG: There are 1 pending tasks possibly being run by other workers
DEBUG: There are 1 pending tasks unique to this worker
DEBUG: There are 1 pending tasks last scheduled by this worker
INFO: Worker Worker(salt=038425231, workers=1, host=Monikas-MacBook-Pro.local, username=admin, pid=75675) was stopped. Shutting down Keep-Alive thread
INFO: 
===== Luigi Execution Summary =====

Scheduled 2 tasks of which:
* 1 failed scheduling:
    - 1 BuildLCCAuthorRepdocCorpusTfidf(start=2000, end=2001)
* 1 were left pending, among these:
    * 1 had dependencies whose scheduling failed:
        - 1 BuildDataset(start=2000, end=2001)

Did not run any tasks
This progress looks :( because there were tasks whose scheduling failed

===== Luigi Execution Summary =====
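For context, basestring exists only in Python 2, where it is the common base class of str and unicode; Python 3 removed it, which is why the isinstance check in pipeline/util.py fails under Python 3.5. A minimal compatibility shim (a sketch, not the package's own fix) near the top of util.py would be:

    try:
        string_types = basestring  # Python 2: covers both str and unicode
    except NameError:
        string_types = str  # Python 3: basestring no longer exists

    # The failing check at util.py line 39 then becomes:
    #     if isinstance(self.base_paths, string_types):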

Perform author name disambiguation to produce new mapping

From the data, it appears the AMiner group did not perform any name disambiguation. This has led to a dataset with quite a few duplicate author records. This package currently does not address these issues.

The most obvious examples are those where the first or second name is abbreviated with a single letter in one place and spelled out fully in another. Use of dots and/or hyphens in some places also leads to different entity mappings. Another case that is quite common is when hyphenated names are spelled in some places with the hyphen and in some without.

There are also simple common misspellings, although these are harder to detect, since an edit distance of 1 or 2 could just as easily indicate a completely different name. One case that can be distinguished is when the edit deletes a letter from a run of one or more of that same letter: for instance, "Acharya" vs. "Acharyya". Here it is likely that the second spelling simply has an extraneous y.
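A crude way to surface such merge candidates (an illustrative sketch only; blindly merging on this key would be wrong) is to normalize names before comparison:

    import re

    def name_key(name):
        """Normalize a name for duplicate detection: lowercase, drop dots,
        hyphens, and spaces, then collapse runs of a repeated letter so
        'Acharyya' and 'Acharya' map to the same key."""
        key = re.sub(r'[.\- ]', '', name.lower())
        return re.sub(r'(.)\1+', r'\1', key)  # collapse letter runs

    # Names sharing a key are merge *candidates* only: the collapse step
    # also conflates legitimate doubles, and abbreviated first names
    # ('J. Smith' vs 'John Smith') still need a separate comparison of
    # initials against full name tokens.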
