
dblp's Issues

How to run this on the Aminer dataset?

Hi,

I want to parse the dblp dataset. I went through the documentation, but I am still struggling to get started. Can anyone please elaborate on the steps to get this running?

INFO: Task RemoveUniqueVenues__99914b932b died unexpectedly ERROR....

Hi,
After installing all dependencies, I am still unable to tell what went wrong: the run generates the CSV files but not the GML file, and the program gives the error below.
Could you help in resolving this issue? I would also appreciate it if you could upload the GML files to the repo.

INFO: Task RemoveUniqueVenues__99914b932b died unexpectedly with exit code -9
/usr/local/lib/python2.7/dist-packages/luigi/parameter.py:259: UserWarning: Parameter None is not of type string.
warnings.warn("Parameter {0} is not of type string.".format(str(x)))

===== Luigi Execution Summary =====

Scheduled 25 tasks of which:

  • 2 present dependencies were encountered:
    • 1 AminerNetworkAuthorships()
    • 1 AminerNetworkPapers()
  • 5 ran successfully:
    • 1 CSVPaperRecords()
    • 1 CSVRefsRecords()
    • 1 ParseAuthorshipsToCSV()
    • 1 ParsePapersToCSV()
    • 1 RemovePapersNoVenueOrYear()
  • 1 failed:
    • 1 RemoveUniqueVenues()
  • 17 were left pending, among these:
    • 17 had failed dependencies:
      • 1 AuthorCitationGraphLCCIdmap(start=2003, end=2004)
      • 1 BuildAuthorCitationGraph(start=2003, end=2004)
      • 1 BuildAuthorRepdocVectors(start=2003, end=2004)
      • 1 BuildDataset(start=2003, end=2004)
      • 1 BuildLCCAuthorRepdocCorpusTf(start=2003, end=2004)
        ...

This progress looks :( because there were failed tasks

Thoroughly document each output file.

It would be good to find some sort of programmatic way to do this, such that the final output is a polished data dictionary which can be exported to Excel or converted to a PDF. Crawling the dependency tree sounds like the way to go in order to grab all possible output files. Then this list can be compared with a documentation file (perhaps doc.md or doc.csv) in order to determine coverage.

Of course, such an approach would need to ignore the optional year parameters. There's no sense in producing all possible files for all possible year ranges.
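One possible starting point, sketched below assuming only the standard luigi task APIs (requires(), output(), luigi.task.flatten); the function name and the choice of root task are hypothetical:

    from luigi.task import flatten

    def collect_output_paths(task, seen=None):
        """Walk the dependency tree from a root task and gather every
        output path, visiting each task once."""
        seen = seen if seen is not None else set()
        if task.task_id in seen:
            return []
        seen.add(task.task_id)
        paths = [t.path for t in flatten(task.output()) if hasattr(t, 'path')]
        for dep in flatten(task.requires()):
            paths.extend(collect_output_paths(dep, seen))
        return paths

Each collected path could then be checked against the entries in doc.md or doc.csv to report documentation coverage.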

Add AMiner data retrieval script

Write a script that downloads the data from the AMiner site and places it in the proper location. This could be a luigi Task or a bash script. The task option might be better, because then it can be a programmatic dependency which can be resolved (run) by luigi, rather than an external one.
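A minimal sketch of the task option, assuming a single archive download; the URL, file name, and data directory below are placeholders that would need to match the pipeline's actual expectations:

    import os
    import luigi

    try:  # Python 3
        from urllib.request import urlretrieve
    except ImportError:  # Python 2
        from urllib import urlretrieve

    AMINER_URL = 'https://example.org/aminer-dataset.zip'  # placeholder URL
    DATA_DIR = 'data'  # placeholder; should match the pipeline's raw-data dir

    class DownloadAminerData(luigi.Task):
        """Fetch the raw AMiner dump so downstream tasks can require() it."""

        def output(self):
            return luigi.LocalTarget(os.path.join(DATA_DIR, 'aminer-dataset.zip'))

        def run(self):
            self.output().makedirs()  # create the data directory if missing
            urlretrieve(AMINER_URL, self.output().path)

Any task that reads the raw files could then list DownloadAminerData in its requires(), letting luigi resolve the download automatically.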

Repdocs Module Documentation

Could you please add descriptions for each file in the repdocs module? I'm trying to use this parser for my projects and am unclear about what all the files contain and how they relate to each other.
For example, the dictionary created using gensim.corpora has a different number of documents than the tf-idf matrix created from it.

Build co-authorship network

While the AMiner group already provides a co-authorship network, it unfortunately does not allow for filtering by year ranges, which is a key feature of this library. It would therefore be useful to implement tasks, inheriting from YearFilterableTask, that construct such a network. These could then be combined with the term attributes from the repdoc corpus files and the ground truth from the venues to produce a more useful network.
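YearFilterableTask's interface is defined by this repo, so the sketch below only illustrates the core construction, using networkx and hypothetical column layouts (the pipeline itself may use a different graph library):

    import csv
    import itertools
    from collections import defaultdict

    import networkx as nx

    def build_coauthorship_graph(authorship_csv, paper_year, start, end):
        """Connect every pair of authors sharing a paper published in
        [start, end]; edge weights count co-authored papers.

        Assumes (paper_id, author_id) rows and a paper_id -> year map.
        """
        authors_by_paper = defaultdict(set)
        with open(authorship_csv) as f:
            for paper_id, author_id in csv.reader(f):
                if start <= paper_year.get(paper_id, -1) <= end:
                    authors_by_paper[paper_id].add(author_id)

        g = nx.Graph()
        for authors in authors_by_paper.values():
            for u, v in itertools.combinations(sorted(authors), 2):
                w = g.get_edge_data(u, v, default={'weight': 0})['weight']
                g.add_edge(u, v, weight=w + 1)
        return g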

Improve filtering to use smaller dependencies if available

Currently the filtering module always grabs the files from base-csv to filter from. However, if a time range is given that is subsumed by another filtered file in filtered-csv, that file could be used instead. For instance, if we need to filter to 2011-2011, we can do that with the dataset for 2010-2012, since 2011 is subsumed by it.

This should be implemented for all YearFilterableTask subclasses (so probably something generic on the base class).
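A sketch of the lookup, assuming filtered files follow the prefix-start-end.csv naming seen elsewhere in this tracker (e.g. venue-1990-1990.csv); the function name and directory constant are hypothetical:

    import os
    import re

    FILTERED_DIR = 'data/processed/filtered-csv'  # assumption, from the traceback below

    def find_subsuming_file(prefix, start, end, directory=FILTERED_DIR):
        """Return the narrowest already-filtered file whose year range
        subsumes [start, end], or None to fall back to base-csv."""
        pattern = re.compile(r'^%s-(\d{4})-(\d{4})\.csv$' % re.escape(prefix))
        best = None
        for name in os.listdir(directory):
            m = pattern.match(name)
            if not m:
                continue
            lo, hi = int(m.group(1)), int(m.group(2))
            if lo <= start and end <= hi:
                if best is None or hi - lo < best[0]:
                    best = (hi - lo, os.path.join(directory, name))
        return best[1] if best else None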

python filtering.py FilterAllCSVRecordsToYearRange --start 1990 --end 2000 --local-scheduler does not work as documented

Hi Mack,

Thanks for closing the BuildAllGraph issue so promptly.

I also noticed that module 2 does not work as expected.

If I run

$ python filtering.py FilterAllCSVRecordsToYearRange --start 1990 --end 2000 --local-scheduler

I receive the following error:

ERROR: [pid 31790] Worker Worker(salt=969812214, workers=1, host=ubuntu, username=hello, pid=31790) failed FilterVenuesToYearRange(start=1990, end=1990)
Traceback (most recent call last):
File "/Users/hello/anaconda2/lib/python2.7/site-packages/luigi/worker.py", line 192, in run
new_deps = self._run_get_new_deps()
File "/Users/hello/anaconda2/lib/python2.7/site-packages/luigi/worker.py", line 130, in _run_get_new_deps
task_gen = self.task.run()
File "filtering.py", line 158, in run
with self.output().open() as afile:
File "/Users/hello/anaconda2/lib/python2.7/site-packages/luigi/local_target.py", line 152, in open
fileobj = FileWrapper(io.BufferedReader(io.FileIO(self.path, mode)))
IOError: [Errno 2] No such file or directory: '/Users/hello/dblp/data/processed/filtered-csv/venue-1990-1990.csv'

Many thanks in advance.

Tasks to summarize data

For a complete dataset, generate a summary of salient characteristics (see the sketch after this list), such as:

  • number of nodes and edges for each graph, diameter, avg. degree
  • number of documents, terms, and nonzeros in each corpus, quantiles on term count
  • proportion of papers with abstracts
  • ground truth stats: number of venues, quantiles on community size
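A rough sketch of how the graph statistics might be computed, assuming the graphs are read back from the pipeline's GML output with networkx:

    import networkx as nx  # assumption: the pipeline's GML files load cleanly here

    def summarize_graph(gml_path):
        """Report node/edge counts, average degree, and diameter for one graph."""
        g = nx.read_gml(gml_path)
        if g.is_directed():
            g = g.to_undirected()  # connectivity/diameter need an undirected view
        n, m = g.number_of_nodes(), g.number_of_edges()
        print('nodes: %d  edges: %d  avg degree: %.2f' % (n, m, 2.0 * m / max(n, 1)))
        # diameter is expensive (all-pairs BFS); restrict to the LCC on big graphs
        if nx.is_connected(g):
            print('diameter: %d' % nx.diameter(g))

Corpus statistics (documents, terms, nonzeros) could be gathered similarly from the gensim corpus files, and the abstract proportion from the paper records.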

paper.csv is too large to save in my computer

When I tried to run the pipeline, paper.csv was generated from AMiner-Paper.txt (about 2.2 GB), but paper.csv grew too large: it exceeded 1.7 TB, and my computer has only about 2 TB of storage, so the run failed every time. Do you know how to fix this?

Failed scheduling due to util.py: 'basestring' is not defined?

Hi,
I am trying to use your parser, and while running the command python pipeline.py BuildDataset --start 2000 --end 2001 --local-scheduler I get an error connected to "NameError: name 'basestring' is not defined" in util.py. I looked at the code, but to be honest I struggle to understand what the basestring variable is supposed to be. Any indication of what I can check or how I can solve this would be much appreciated! (See the compatibility sketch after the log below.)

Error:

DEBUG: Checking if BuildDataset(start=2000, end=2001) is complete
/Users/admin/anaconda/lib/python3.5/site-packages/luigi/worker.py:328: UserWarning: Task BuildDataset(start=2000, end=2001) without outputs has no custom complete() method
  is_complete = task.complete()
DEBUG: Checking if BuildLCCAuthorRepdocCorpusTfidf(start=2000, end=2001) is complete
INFO: Informed scheduler that task   BuildDataset_2001_2000_429339e3d6   has status   PENDING
WARNING: Will not run BuildLCCAuthorRepdocCorpusTfidf(start=2000, end=2001) or any dependencies due to error in complete() method:
Traceback (most recent call last):
  File "/Users/admin/anaconda/lib/python3.5/site-packages/luigi/worker.py", line 328, in check_complete
    is_complete = task.complete()
  File "/Users/admin/anaconda/lib/python3.5/site-packages/luigi/task.py", line 533, in complete
    outputs = flatten(self.output())
  File "/Users/admin/Desktop/DBLP_parser/dblp-master/pipeline/util.py", line 39, in output
    if isinstance(self.base_paths, basestring):
NameError: name 'basestring' is not defined

INFO: Informed scheduler that task   BuildLCCAuthorRepdocCorpusTfidf_2001_2000_429339e3d6   has status   UNKNOWN
INFO: Done scheduling tasks
INFO: Running Worker with 1 processes
DEBUG: Asking scheduler for work...
DEBUG: Done
DEBUG: There are no more tasks to run at this time
DEBUG: There are 1 pending tasks possibly being run by other workers
DEBUG: There are 1 pending tasks unique to this worker
DEBUG: There are 1 pending tasks last scheduled by this worker
INFO: Worker Worker(salt=038425231, workers=1, host=Monikas-MacBook-Pro.local, username=admin, pid=75675) was stopped. Shutting down Keep-Alive thread
INFO: 
===== Luigi Execution Summary =====

Scheduled 2 tasks of which:
* 1 failed scheduling:
    - 1 BuildLCCAuthorRepdocCorpusTfidf(start=2000, end=2001)
* 1 were left pending, among these:
    * 1 had dependencies whose scheduling failed:
        - 1 BuildDataset(start=2000, end=2001)

Did not run any tasks
This progress looks :( because there were tasks whose scheduling failed

===== Luigi Execution Summary =====
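For context, basestring exists only in Python 2, where it is the common base class of str and unicode; Python 3 removed it, which is why the isinstance check in pipeline/util.py fails under Python 3.5. A minimal compatibility shim (a sketch, not the package's own fix) near the top of util.py would be:

    try:
        string_types = basestring  # Python 2: covers both str and unicode
    except NameError:
        string_types = str  # Python 3: basestring no longer exists

    # The failing check at util.py line 39 then becomes:
    #     if isinstance(self.base_paths, string_types):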

Perform author name disambiguation to produce new mapping

From the data, it appears the AMiner group did not perform any name disambiguation. This has led to a dataset with quite a few duplicate author records. This package currently does not address these issues.

The most obvious examples are those where the first or second name is abbreviated with a single letter in one place and spelled out fully in another. Use of dots and/or hyphens in some places also leads to different entity mappings. Another case that is quite common is when hyphenated names are spelled in some places with the hyphen and in some without.

There are also simple common misspellings, although these are harder to detect, since an edit distance of 1 or 2 could just as easily indicate a completely different name. One case that can be distinguished is when the edit deletes a letter from a run of one or more of that same letter: for instance, "Acharya" vs. "Acharyya". Here it is likely that the second spelling simply has an extraneous y.
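A crude way to surface such merge candidates (an illustrative sketch only; blindly merging on this key would be wrong) is to normalize names before comparison:

    import re

    def name_key(name):
        """Normalize a name for duplicate detection: lowercase, drop dots,
        hyphens, and spaces, then collapse runs of a repeated letter so
        'Acharyya' and 'Acharya' map to the same key."""
        key = re.sub(r'[.\- ]', '', name.lower())
        return re.sub(r'(.)\1+', r'\1', key)  # collapse letter runs

    # Names sharing a key are merge *candidates* only: the collapse step
    # also conflates legitimate doubles, and abbreviated first names
    # ('J. Smith' vs 'John Smith') still need a separate comparison of
    # initials against full name tokens.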
