brianified / jwpl
Automatically exported from code.google.com/p/jwpl
Templates are language-specific. That is why we should load whitelists and
blacklists in acceptTemplate() from the config file.
Four cases should be supported:
- Whitelist template begins with x
- Blacklist template begins with x
- Whitelist template equals x
- Blacklist template equals x
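A minimal sketch of how the four cases could be handled; the class, enum, and method names here are illustrative, not the actual JWPL API:

```java
import java.util.List;

// Sketch of the four matching modes for template filtering. Class and
// method names are stand-ins for whatever acceptTemplate() ends up using.
public class TemplateFilterSketch {

    public enum Mode { WHITELIST_PREFIX, BLACKLIST_PREFIX, WHITELIST_EXACT, BLACKLIST_EXACT }

    // Returns true if the template should be kept.
    public static boolean acceptTemplate(String template, Mode mode, List<String> patterns) {
        boolean matches = false;
        for (String p : patterns) {
            switch (mode) {
                case WHITELIST_PREFIX:
                case BLACKLIST_PREFIX:
                    if (template.startsWith(p)) matches = true;
                    break;
                case WHITELIST_EXACT:
                case BLACKLIST_EXACT:
                    if (template.equals(p)) matches = true;
                    break;
            }
        }
        // Whitelists keep only matches; blacklists drop matches.
        boolean whitelist = (mode == Mode.WHITELIST_PREFIX || mode == Mode.WHITELIST_EXACT);
        return whitelist ? matches : !matches;
    }

    public static void main(String[] args) {
        List<String> patterns = List.of("Infobox");
        System.out.println(acceptTemplate("Infobox person", Mode.WHITELIST_PREFIX, patterns)); // true
        System.out.println(acceptTemplate("Stub", Mode.WHITELIST_PREFIX, patterns));           // false
    }
}
```

The pattern lists would be read per language from the config file rather than hard-coded.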
Original issue reported on code.google.com by oliver.ferschke
on 5 Aug 2011 at 9:33
As reported by various users:
In some situations, an article contains characters (e.g. "\") that are not
properly escaped during preprocessing, causing errors.
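As an illustration, such characters could be escaped with a small helper before text is written into SQL statements; this is a sketch, not the actual JWPL preprocessing code:

```java
// Sketch of escaping characters that are special in MySQL string literals
// before emitting INSERT statements; illustrative only.
public class SqlEscapeSketch {

    public static String escape(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (char c : s.toCharArray()) {
            switch (c) {
                case '\\': sb.append("\\\\"); break; // the case from the report
                case '\'': sb.append("\\'");  break;
                case '"':  sb.append("\\\""); break;
                case '\n': sb.append("\\n");  break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("a\\b")); // backslash is doubled
    }
}
```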
Original issue reported on code.google.com by [email protected]
on 10 Jan 2011 at 9:28
If de.tudarmstadt.ukp.wikipedia.api.Wikipedia.getPages(PageQuery) returned an
unmodifiable collection instead of an iterable, it would be possible to get the
number of pages using size(). That would be helpful e.g. to display progress
information (x of y pages processed). Inheriting from AbstractCollection may be
helpful.
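A sketch of how such a wrapper might look; the types are stand-ins for the real JWPL Page/PageQuery classes, and the count is assumed to be known up front:

```java
import java.util.AbstractCollection;
import java.util.Collection;
import java.util.Iterator;
import java.util.List;

// Sketch of wrapping a page iterable into an unmodifiable Collection so
// that callers can query size() for progress reporting.
public class PageCollectionSketch {

    public static <T> Collection<T> asCollection(final Iterable<T> pages, final int count) {
        return new AbstractCollection<T>() {
            @Override public Iterator<T> iterator() { return pages.iterator(); }
            @Override public int size() { return count; }
            // AbstractCollection's add() already throws
            // UnsupportedOperationException, so the result is unmodifiable.
        };
    }

    public static void main(String[] args) {
        Collection<String> pages = asCollection(List.of("Page_A", "Page_B"), 2);
        System.out.println(pages.size()); // enables "x of y pages processed"
    }
}
```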
Original issue reported on code.google.com by [email protected]
on 21 Sep 2010 at 4:10
Database entries in revisions.ContributorName seem to have encoding problems.
Umlaut characters are not shown correctly.
Original issue reported on code.google.com by oliver.ferschke
on 27 Jul 2011 at 7:19
The IndexGenerator defines a PRIMARY KEY in the db schema - this makes
inserting millions of rows very slow.
The index should be created after inserting all rows.
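A minimal sketch of the suggested change (table and column names are illustrative, not the actual IndexGenerator schema):

```sql
-- Create the table without the key, bulk-insert, then build the index once.
CREATE TABLE index_table (
  id BIGINT NOT NULL,
  data VARCHAR(255)
);

-- ... millions of INSERTs run here without per-row index maintenance ...

ALTER TABLE index_table ADD PRIMARY KEY (id);
```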
Original issue reported on code.google.com by oliver.ferschke
on 13 Jun 2011 at 5:37
What steps will reproduce the problem?
1. Download wiki dump dated 2011-05-26 or 2011-05-04
2. Run JWPL_DATAMACHINE_0.6.0.jar with options english Categories
Disambiguation_pages
What is the expected output? What do you see instead?
Expected is parsing to be completed and output folder to be filled with parsed
content. I tried using bunzip2 to unzip pages-articles.xml.bz2, it worked fine.
But running JWPL_DATAMACHINE_0.6.0 fails. The same thing happens for both wiki
dumps dated 2011-05-26 and 2011-05-04.
Here is the complete stack trace
Loading XML bean definitions from class path resource
[context/applicationContext.xml]
parse input dumps...
Discussions are available
unexpected end of stream
org.apache.tools.bzip2.CBZip2InputStream.getAndMoveToFrontDecode(CBZip2InputStream.java:706)
org.apache.tools.bzip2.CBZip2InputStream.initBlock(CBZip2InputStream.java:289)
org.apache.tools.bzip2.CBZip2InputStream.setupNoRandPartA(CBZip2InputStream.java:846)
org.apache.tools.bzip2.CBZip2InputStream.setupNoRandPartB(CBZip2InputStream.java:902)
org.apache.tools.bzip2.CBZip2InputStream.read0(CBZip2InputStream.java:212)
org.apache.tools.bzip2.CBZip2InputStream.read(CBZip2InputStream.java:180)
org.apache.xerces.impl.XMLEntityManager$RewindableInputStream.read(Unknown Source)
org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
org.apache.xerces.impl.XMLEntityScanner.scanContent(Unknown Source)
org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanContent(Unknown Source)
org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source)
javax.xml.parsers.SAXParser.parse(SAXParser.java:195)
de.tudarmstadt.ukp.wikipedia.wikimachine.dump.xml.AbstractXmlDumpReader.readDump(AbstractXmlDumpReader.java:207)
de.tudarmstadt.ukp.wikipedia.datamachine.dump.xml.XML2Binary.<init>(XML2Binary.java:47)
de.tudarmstadt.ukp.wikipedia.datamachine.domain.DataMachineGenerator.processInputDump(DataMachineGenerator.java:65)
de.tudarmstadt.ukp.wikipedia.datamachine.domain.DataMachineGenerator.start(DataMachineGenerator.java:59)
de.tudarmstadt.ukp.wikipedia.datamachine.domain.JWPLDataMachine.main(JWPLDataMachine.java:57)
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
java.lang.reflect.Method.invoke(Method.java:616)
org.eclipse.jdt.internal.jarinjarloader.JarRsrcLoader.main(JarRsrcLoader.java:58)
What version of the product are you using? On what operating system?
The OS is Linux Ubuntu 10 and the JWPL version is 0.6.
Please reply soon with any suggestion/fix. We are unable to proceed. Can I use
JWPL for any wiki dump without any changes?
Thanks,
Shareeka
Original issue reported on code.google.com by [email protected]
on 27 Jun 2011 at 7:47
What steps will reproduce the problem?
1. I've imported an English Wikipedia dump 20110115
2. And I'm running the code from CategoryList.java (see attached)
What is the expected output? What do you see instead?
The expected output should be a list of all articles descending from one input
category as defined in the code (Finance)
What version of the product are you using? On what operating system?
I'm using the jwpl.jar downloaded from this site, as I couldn't manage to build
it with Maven. (Maven install requires running the tests, which fail because I
can't access the DB of TU Darmstadt, and I didn't figure out how to switch off
the tests.)
The IDE I am using is Eclipse on an OpenSUSE 64-bit machine.
MySQL is the latest version provided by OpenSUSE.
Please provide any additional information below.
I've attached the output file created with the attached code I was running.
There are some (from my point of view) weird things going on that I don't
understand.
Here is the thrown exception:
17:19:11,660 INFO SchemaUpdate:160 - schema update complete
17:21:44,121 ERROR PageDAO:107 - get failed
org.hibernate.PropertyAccessException: Null value was assigned to a property of
primitive type setter of
de.tudarmstadt.ukp.wikipedia.api.hibernate.Page.isDisambiguation
at org.hibernate.property.BasicPropertyAccessor$BasicSetter.set(BasicPropertyAccessor.java:85)
at org.hibernate.tuple.entity.AbstractEntityTuplizer.setPropertyValues(AbstractEntityTuplizer.java:337)
at org.hibernate.tuple.entity.PojoEntityTuplizer.setPropertyValues(PojoEntityTuplizer.java:200)
at org.hibernate.persister.entity.AbstractEntityPersister.setPropertyValues(AbstractEntityPersister.java:3566)
at org.hibernate.engine.TwoPhaseLoad.initializeEntity(TwoPhaseLoad.java:129)
at org.hibernate.loader.Loader.initializeEntitiesAndCollections(Loader.java:854)
at org.hibernate.loader.Loader.doQuery(Loader.java:729)
at org.hibernate.loader.Loader.doQueryAndInitializeNonLazyCollections(Loader.java:236)
at org.hibernate.loader.Loader.loadEntity(Loader.java:1860)
at org.hibernate.loader.entity.AbstractEntityLoader.load(AbstractEntityLoader.java:48)
at org.hibernate.loader.entity.AbstractEntityLoader.load(AbstractEntityLoader.java:42)
at org.hibernate.persister.entity.AbstractEntityPersister.load(AbstractEntityPersister.java:3044)
at org.hibernate.event.def.DefaultLoadEventListener.loadFromDatasource(DefaultLoadEventListener.java:395)
at org.hibernate.event.def.DefaultLoadEventListener.doLoad(DefaultLoadEventListener.java:375)
at org.hibernate.event.def.DefaultLoadEventListener.load(DefaultLoadEventListener.java:139)
at org.hibernate.event.def.DefaultLoadEventListener.proxyOrLoad(DefaultLoadEventListener.java:195)
at org.hibernate.event.def.DefaultLoadEventListener.onLoad(DefaultLoadEventListener.java:103)
at org.hibernate.impl.SessionImpl.fireLoad(SessionImpl.java:878)
at org.hibernate.impl.SessionImpl.get(SessionImpl.java:815)
at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:616)
at org.hibernate.context.ThreadLocalSessionContext$TransactionProtectionWrapper.invoke(ThreadLocalSessionContext.java:301)
at $Proxy0.get(Unknown Source)
at de.tudarmstadt.ukp.wikipedia.api.hibernate.PageDAO.findById(PageDAO.java:99)
at de.tudarmstadt.ukp.wikipedia.api.Page.fetchPage(Page.java:89)
at de.tudarmstadt.ukp.wikipedia.api.Page.<init>(Page.java:51)
at de.tudarmstadt.ukp.wikipedia.api.Wikipedia.getPage(Wikipedia.java:119)
at de.tudarmstadt.ukp.wikipedia.api.Category.getArticles(Category.java:287)
at uk.ac.uuc.cidbio.wikipedia.CategoryList.main(CategoryList.java:109)
Caused by: java.lang.IllegalArgumentException
at sun.reflect.GeneratedMethodAccessor11.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:616)
at org.hibernate.property.BasicPropertyAccessor$BasicSetter.set(BasicPropertyAccessor.java:42)
... 29 more
Exception in thread "main" org.hibernate.PropertyAccessException: Null value
was assigned to a property of primitive type setter of
de.tudarmstadt.ukp.wikipedia.api.hibernate.Page.isDisambiguation
at org.hibernate.property.BasicPropertyAccessor$BasicSetter.set(BasicPropertyAccessor.java:85)
at org.hibernate.tuple.entity.AbstractEntityTuplizer.setPropertyValues(AbstractEntityTuplizer.java:337)
at org.hibernate.tuple.entity.PojoEntityTuplizer.setPropertyValues(PojoEntityTuplizer.java:200)
at org.hibernate.persister.entity.AbstractEntityPersister.setPropertyValues(AbstractEntityPersister.java:3566)
at org.hibernate.engine.TwoPhaseLoad.initializeEntity(TwoPhaseLoad.java:129)
at org.hibernate.loader.Loader.initializeEntitiesAndCollections(Loader.java:854)
at org.hibernate.loader.Loader.doQuery(Loader.java:729)
at org.hibernate.loader.Loader.doQueryAndInitializeNonLazyCollections(Loader.java:236)
at org.hibernate.loader.Loader.loadEntity(Loader.java:1860)
at org.hibernate.loader.entity.AbstractEntityLoader.load(AbstractEntityLoader.java:48)
at org.hibernate.loader.entity.AbstractEntityLoader.load(AbstractEntityLoader.java:42)
at org.hibernate.persister.entity.AbstractEntityPersister.load(AbstractEntityPersister.java:3044)
at org.hibernate.event.def.DefaultLoadEventListener.loadFromDatasource(DefaultLoadEventListener.java:395)
at org.hibernate.event.def.DefaultLoadEventListener.doLoad(DefaultLoadEventListener.java:375)
at org.hibernate.event.def.DefaultLoadEventListener.load(DefaultLoadEventListener.java:139)
at org.hibernate.event.def.DefaultLoadEventListener.proxyOrLoad(DefaultLoadEventListener.java:195)
at org.hibernate.event.def.DefaultLoadEventListener.onLoad(DefaultLoadEventListener.java:103)
at org.hibernate.impl.SessionImpl.fireLoad(SessionImpl.java:878)
at org.hibernate.impl.SessionImpl.get(SessionImpl.java:815)
at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:616)
at org.hibernate.context.ThreadLocalSessionContext$TransactionProtectionWrapper.invoke(ThreadLocalSessionContext.java:301)
at $Proxy0.get(Unknown Source)
at de.tudarmstadt.ukp.wikipedia.api.hibernate.PageDAO.findById(PageDAO.java:99)
at de.tudarmstadt.ukp.wikipedia.api.Page.fetchPage(Page.java:89)
at de.tudarmstadt.ukp.wikipedia.api.Page.<init>(Page.java:51)
at de.tudarmstadt.ukp.wikipedia.api.Wikipedia.getPage(Wikipedia.java:119)
at de.tudarmstadt.ukp.wikipedia.api.Category.getArticles(Category.java:287)
at uk.ac.uuc.cidbio.wikipedia.CategoryList.main(CategoryList.java:109)
Caused by: java.lang.IllegalArgumentException
at sun.reflect.GeneratedMethodAccessor11.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:616)
at org.hibernate.property.BasicPropertyAccessor$BasicSetter.set(BasicPropertyAccessor.java:42)
... 29 more
From what I understand so far, the only two possible reasons for these
exceptions are problems with the database entries or a bug in the code. But I
may be wrong, of course.
Thank you for any support
Original issue reported on code.google.com by [email protected]
on 24 Feb 2011 at 5:34
Information about which page contains which templates is interesting for many
applications.
We should provide a tool that creates (optional) database tables containing
this information.
The access methods should be placed in a dedicated class, not the main
"Wikipedia" class.
Original issue reported on code.google.com by oliver.ferschke
on 3 Aug 2011 at 12:42
applicationContext.xml is not included in the jars with dependencies of the
DataMachine and the TimeMachine
Original issue reported on code.google.com by oliver.ferschke
on 17 Aug 2011 at 6:36
The JWPL project page needs a logo.
Original issue reported on code.google.com by oliver.ferschke
on 13 Aug 2011 at 9:21
I only need the mediawiki parser alone. There was one jar before, but now
there's none. Could you please put it back up?
Original issue reported on code.google.com by [email protected]
on 1 Jun 2011 at 10:47
[deleted issue]
Issue regarding: page filter in DiffToolThread, ArticleConsumer
The only implementation of AbstractNameChecker is currently
EnglishArticleNameChecker.
An ArticleNameChecker is used to whitelist articles with prefixes in the
article title (e.g. Talk:PageTitle) - usually, all pages with prefixes are
filtered.
Instead of adding an ArticleNameChecker for every language, there should be one
language-independent checker that reads the whitelist from the configuration.
Additionally:
It is also unclear why the filter object is created both in DiffToolThread and
in ArticleConsumer. This should only happen in one place.
Original issue reported on code.google.com by oliver.ferschke
on 20 May 2011 at 1:56
The "index table" index_articleID_rc_ts needs a primary key. It has to be
either created automatically upon schema creation or added later upon first
access (like the indexes):
ALTER TABLE index_articleID_rc_ts ADD PRIMARY KEY(ArticleID);
Original issue reported on code.google.com by oliver.ferschke
on 12 Jul 2011 at 8:58
[deleted issue]
Index tables are currently created as SQL files (INSERT statements).
This should be changed to data files that can be read using LOAD DATA INFILE
(http://dev.mysql.com/doc/refman/5.1/en/load-data.html).
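A minimal sketch of the intended import; the file path and field terminators are illustrative:

```sql
-- Bulk-load a tab-separated data file instead of replaying INSERT statements.
LOAD DATA INFILE '/path/to/index_articleID_rc_ts.txt'
INTO TABLE index_articleID_rc_ts
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n';
```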
Original issue reported on code.google.com by oliver.ferschke
on 12 Jun 2011 at 9:23
This is not a bug but rather an FAQ. Posting it here for the answer.
1. Is there any way to get the page content of the Category page ?
For example Category:Mathematics includes content as follows ....
************
The main article for this category is Mathematics.
Wikimedia Commons has media related to: Mathematics
See also: Category:Logic
Mathematics (colloquially, maths, or math), is the body of knowledge centered
on concepts such as quantity, structure, space, and change, and also the
academic discipline that studies them.
Mathematics can be further divided into smaller subcategories (such as geometry
and algebra), and thus, it includes many ideas and theories.
The main article for this category is Mathematics.
*****************
2. Does JWPL support WordNet-based near-similarity matches (hyponym-,
hypernym-, synonym-based)?
3. Is it possible to get an 'is-a' relation (which should be a meaningful
superclass of a real object, e.g. 'Automobile' should be returned for 'car')
for a given category using the JWPL API?
Please answer.
Thanks
Original issue reported on code.google.com by [email protected]
on 5 Jul 2011 at 2:34
JWPL needs a MySQL database, but does not come with the driver. It should be
documented that the driver needs to be installed and where to get it from
(website or via Maven).
Original issue reported on code.google.com by [email protected]
on 21 Sep 2010 at 4:12
Hi,
I stumbled across some strange results iterating over the dewiki-20070206.sql
database using the page.getPlainText() method and counting all words with JWPL:
the most common words in Wikipedia contained:
count: Word
842554: TEMPLATE
629957: Kategorie
438822: nbsp
256422: thumb
253580: Weblinks
Later on there were also some all-uppercase words that were probably part of
some template:
http://de.wikipedia.org/wiki/Vorlage:Personendaten
152323: NAME
131362: KURZBESCHREIBUNG
131333: GEBURTSDATUM
131300: GEBURTSORT
131237: ALTERNATIVNAMEN
and so on.
What is the expected output? What do you see instead?
I expected just normal German words to be the most common words like:
6424690: der; 5262887: und; 4316465: die; 3693242: in; 2713375: von; 1888157:
den; 1806153: des; 1578301: mit; 1509434: im; 1466509: ist; 1348733: Die;
1254947: zu; 1219600: das; 1218469: dem; 1110328: als; 1083261: für; 1077734:
auf; 1075940: eine; 1046970: ein; 1011403: wurde; 1009821: sich; 910366: er;
881106: auch; 842554: TEMPLATE; 814815: an; 714727: aus; 701011: war; 675874:
Der; 654112: nach; 629957: Kategorie; 616429: bei; 589324: wird; 581816: einer;
573699: werden; 547424: bis; 530476: sind; 529210: nicht; 525816: durch;
520091: oder; 518637: am; 503813: 1; 503254: zum; 481658: sie; 466585: es;
446827: Das; 438822: nbsp;
and so on. As you can see, the above-mentioned words got mixed into my results.
What version of the product are you using? On what operating system?
jwpl_v0.5, Ubuntu 10.04 LTS
Steps to reproduce:
1. Iterate over the whole database.
2. Tokenize every article with the Lucene standard tokenizer.
3. Count all tokens with a TreeMap<String, Integer>.
4. When finished, put all values in a MultiValueMap with the count as key
(basically <Integer, ArrayList<String>>).
5. Get a copy of the keySet, sort it, and use the top x values as keys to
retrieve words from the map.
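The counting procedure can be sketched as follows; a plain whitespace split stands in for the Lucene StandardTokenizer, and the MultiValueMap inversion is reduced to a sort over the entry set:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch of the word-counting procedure from the report; tokenization is
// simplified to whitespace splitting for illustration.
public class WordCountSketch {

    public static List<Map.Entry<String, Integer>> topWords(List<String> articles, int x) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String text : articles) {
            for (String token : text.split("\\s+")) {
                if (!token.isEmpty()) counts.merge(token, 1, Integer::sum);
            }
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> b.getValue() - a.getValue()); // highest count first
        return entries.subList(0, Math.min(x, entries.size()));
    }

    public static void main(String[] args) {
        List<String> articles = List.of("der und der", "und der");
        System.out.println(topWords(articles, 1)); // [der=3]
    }
}
```

Running this over plain text extracted by getPlainText() would surface the same TEMPLATE/Kategorie artifacts the report describes, since they are part of the returned text.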
Original issue reported on code.google.com by [email protected]
on 12 Jan 2011 at 1:25
Writing the revision dump directly to the database does not work at the moment
(see report below).
Either fix the issue or remove the option to write directly to the DB.
EXCEPTION:
de.tudarmstadt.ukp.wikipedia.revisionmachine.common.exceptions.SQLConsumerException: DIFFTOOL_SQLCONSUMER_DATABASEWRITER_EXCEPTION:
INSERT INTO revisions VALUES(null,
233192,1,233192,10,980061141000,?,'*',0,'99',1),(null,
233192,2,862220,10,1014669791000,?,'Automated conversion',1,'0',0),(null,
233192,3,15898945,10,1051323518000,?,'Fixing redirect',1,'7543',1),(null,
233192,4,56681914,10,1149368141000,?,'fix double redirect',1,'516514',1),(null,
233192,5,74466685,10,1157703364000,?,'cat rd',0,'750223',1),(null,
233192,6,133180268,10,1180032118000,?,'Robot: Automated text replacement
(-\\[\\[(.*?[\\:|\\|])*?(.+?)\\]\\] +\\g<2>)',1,'4477979',1),(null,
233192,7,133452289,10,1180127532000,?,'Revert edit(s) by
[[Special:Contributions/Ngaiklin|Ngaiklin]] to last version by
[[Special:Contributions/Rory096|Rory096]]',1,'241822',1),(null,
233192,8,381200179,10,1282875831000,?,null,0,'0',0),(null,
233192,9,381202555,10,1282876716000,?,'[[Help:Reverting|Reverted]] edits by
[[Special:Contributions/76.28.186.133|76.28.186.133]] ([[User
talk:76.28.186.133|talk]]) to last version by Gurch',1,'7181920',1);
0 106
1 40
2 17
3 28
4 26
5 7
6 11
7 479
8 63
at de.tudarmstadt.ukp.wikipedia.revisionmachine.common.exceptions.ErrorFactory.createSQLConsumerException(ErrorFactory.java:353)
at de.tudarmstadt.ukp.wikipedia.revisionmachine.difftool.consumer.sql.writer.SQLDatabaseWriter.process(SQLDatabaseWriter.java:244)
at de.tudarmstadt.ukp.wikipedia.revisionmachine.difftool.consumer.sql.writer.TimedSQLDatabaseWriter.process(TimedSQLDatabaseWriter.java:131)
at de.tudarmstadt.ukp.wikipedia.revisionmachine.difftool.DiffToolThread$TaskTransmitter.writeOutput(DiffToolThread.java:223)
at de.tudarmstadt.ukp.wikipedia.revisionmachine.difftool.DiffToolThread$TaskTransmitter.transmitDiff(DiffToolThread.java:195)
at de.tudarmstadt.ukp.wikipedia.revisionmachine.difftool.consumer.diff.calculation.DiffCalculator.transmitAtEndOfTask(DiffCalculator.java:326)
at de.tudarmstadt.ukp.wikipedia.revisionmachine.difftool.consumer.diff.calculation.TimedDiffCalculator.transmitAtEndOfTask(TimedDiffCalculator.java:149)
at de.tudarmstadt.ukp.wikipedia.revisionmachine.difftool.consumer.diff.calculation.DiffCalculator.process(DiffCalculator.java:523)
at de.tudarmstadt.ukp.wikipedia.revisionmachine.difftool.consumer.diff.calculation.TimedDiffCalculator.process(TimedDiffCalculator.java:192)
at de.tudarmstadt.ukp.wikipedia.revisionmachine.difftool.DiffToolThread.run(DiffToolThread.java:319)
at de.tudarmstadt.ukp.wikipedia.revisionmachine.difftool.DiffTool.main(DiffTool.java:56)
Caused by: java.sql.SQLException: Column count doesn't match value count at row 1
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:2926)
at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:1571)
at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:1666)
at com.mysql.jdbc.Connection.execSQL(Connection.java:2978)
at com.mysql.jdbc.Connection.execSQL(Connection.java:2902)
at com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:933)
at com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:1162)
at com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:1079)
at com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:1064)
at de.tudarmstadt.ukp.wikipedia.revisionmachine.difftool.consumer.sql.writer.SQLDatabaseWriter.process(SQLDatabaseWriter.java:206)
... 9 more
Original issue reported on code.google.com by oliver.ferschke
on 11 Aug 2011 at 5:18
[deleted issue]
Log4j is not correctly initialized when the executable jars are run.
Original issue reported on code.google.com by oliver.ferschke
on 17 Aug 2011 at 6:43
Working with JWPL 0.45b I encountered the following problem:
I created an own Wikipedia dump using German Wikipedia backup dump files of
November 11, 2009 from download.wikimedia.org. I followed the steps explained
on the JWPL documentation page and ran the transformation process using the new
DataMachine (Version 2), that was kindly provided to me by Mr. Zesch.
The creation of the SQL dump file was successful. I could retrieve pages and
process text, just as I could with the German Wikipedia SQL dump of 6 Feb 2007
provided on the JWPL homepage.
However, when comparing some results of the new Wikipedia dump with that of
2007, I could see that certain redirects, but not all, were missing. They were
however included in the online version of Wikipedia. I assumed that there was
some database mistake, but also in the output text files, namely
"page_redirects.txt", they did not appear. Some further investigation in the
online Wikipedia showed that the error was systematic:
Whenever a redirect page included a redirect link of the exact format "REDIRECT
[[...]]" (i.e. the capitalized REDIRECT keyword followed by a space), the
redirect did appear in the database.
But, whenever the format was slightly different, the redirect was missing.
Examples:
Missing space: REDIRECT[[...]]
not capitalized: Redirect [[...]]
German key word: WEITERLEITUNG [[...]]
I ran the DataMachine again, but the problem remained. Interestingly, in the
Wikipedia SQL dump of 6 Feb 2007, the problem does not appear.
Kind regards
Stephan Strohmaier
Original issue reported on code.google.com by [email protected]
on 21 Sep 2010 at 4:05
Hi, I am using MySQL 5.5 Community Server, 64-bit version, on Win 7 64-bit OS.
I have successfully imported a fresh Wikipedia 2011 DB; then, on my first query,
as:
>>>
DatabaseConfiguration dbConfig = new DatabaseConfiguration();
dbConfig.setHost("localhost");
........//other config
Wikipedia wiki = new Wikipedia(dbConfig);
System.out.println(wiki.getPage("Cat").getTitle());
>>>
I get an exception:
>>>
SEVERE: COLLATION 'utf8_bin' is not valid for CHARACTER SET 'utf8mb4'
Exception in thread "main" org.hibernate.exception.SQLGrammarException: could
not execute query
at org.hibernate.exception.SQLStateConverter.convert(SQLStateConverter.java:67)
at org.hibernate.exception.JDBCExceptionHelper.convert(JDBCExceptionHelper.java:43)
at org.hibernate.loader.Loader.doList(Loader.java:2223)
at org.hibernate.loader.Loader.listIgnoreQueryCache(Loader.java:2104)
at org.hibernate.loader.Loader.list(Loader.java:2099)
at org.hibernate.loader.custom.CustomLoader.list(CustomLoader.java:289)
at org.hibernate.impl.SessionImpl.listCustomQuery(SessionImpl.java:1695)
at org.hibernate.impl.AbstractSessionImpl.list(AbstractSessionImpl.java:142)
at org.hibernate.impl.SQLQueryImpl.list(SQLQueryImpl.java:152)
at org.hibernate.impl.AbstractQueryImpl.uniqueResult(AbstractQueryImpl.java:811)
at de.tudarmstadt.ukp.wikipedia.api.Page.fetchByTitle(Page.java:153)
at de.tudarmstadt.ukp.wikipedia.api.Page.<init>(Page.java:109)
at de.tudarmstadt.ukp.wikipedia.api.Wikipedia.getPage(Wikipedia.java:112)
at uk.ac.shef.oak.jwpltest.Test.main(Test.java:23)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:120)
Caused by: com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: COLLATION
'utf8_bin' is not valid for CHARACTER SET 'utf8mb4'
>>>
I have double checked my database's character set, which shows:
>>>
character-set-database = utf8
collation-database = utf8_general_ci
>>>
It seems that the DB character set config is right, but the error still
complains about it. Any suggestions, please?
Thanks
Original issue reported on code.google.com by [email protected]
on 4 Jun 2011 at 4:28
What steps will reproduce the problem:
Create a Wikipedia snapshot for 20090101 or 20080101 from the
20100130-Wikipedia Dump (http://dumps.wikimedia.org/enwiki/20100130/)
After Revision 7270000, the TimeMachine aborts with the following exception:
Exception in thread "xml2sql" java.lang.RuntimeException: java.io.IOException:
Invalid byte 2 of 4-byte UTF-8 sequence.
at de.tudarmstadt.ukp.wikipedia.timemachine.dump.xml.original.XMLDumpTableInputStreamThread.run(XMLDumpTableInputStreamThread.java:128)
Caused by: java.io.IOException: Invalid byte 2 of 4-byte UTF-8 sequence.
at org.mediawiki.importer.XmlDumpReader.readDump(XmlDumpReader.java:92)
at de.tudarmstadt.ukp.wikipedia.timemachine.dump.xml.original.XMLDumpTableInputStreamThread.run(XMLDumpTableInputStreamThread.java:123)
Caused by: org.apache.xerces.impl.io.MalformedByteSequenceException: Invalid
byte 2 of 4-byte UTF-8 sequence.
at org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source)
at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source)
at javax.xml.parsers.SAXParser.parse(SAXParser.java:198)
at org.mediawiki.importer.XmlDumpReader.readDump(XmlDumpReader.java:88)
... 1 more
Caused by: org.apache.xerces.impl.io.MalformedByteSequenceException: Invalid
byte 2 of 4-byte UTF-8 sequence.
at org.apache.xerces.impl.io.UTF8Reader.invalidByte(Unknown Source)
at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
at org.apache.xerces.impl.XMLEntityScanner.scanContent(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanContent(Unknown Source)
... 11 more
Write end dead
This is apparently caused by readDump() in org.mediawiki.importer.XmlDumpReader.
Original issue reported on code.google.com by oliver.ferschke
on 11 Feb 2011 at 9:08
JWPL doesn't use Hibernate to generate the database schema; it is set up during
the import of the SQL file that JWPL generates. Thus, Hibernate should be able
to operate in a pure read-only mode. This includes verifying the database
schema against the Hibernate mappings, but not trying to update the DB:
hibernate.hbm2ddl.auto=validate
Original issue reported on code.google.com by richard.eckart
on 17 Jul 2011 at 10:54
Currently the JWPL test cases require access to our DB host. This precludes
external users from running the tests (when JWPL is open sourced) and it also
prevents tests when the DB host is offline (as currently...).
Instead, an embedded DB like HSQLDB or Derby should be used for the test cases.
Original issue reported on code.google.com by [email protected]
on 21 Sep 2010 at 4:07
When working with multiple source archives, the DiffTool normally processes
them all sequentially and builds a single output file.
In order to speed up processing, multiple configuration files can be written to
allow for calculating revisions using several instances of the DiffTool. E.g.,
if we have 15 source archives, we might create 5 configuration files, each
defining 3 archives. We can then process the data with 5 instances of the tool
in parallel.
Output folders have to be adapted in case of multiple instances.
This is, of course, no replacement for REAL distributed processing. The
limiting factor here is probably the RAM.
Original issue reported on code.google.com by oliver.ferschke
on 6 Jun 2011 at 4:24
When processing the English Wikipedia dump 20110405, the DiffTool produced the
following insert statement
INSERT INTO revisions VALUES;
without giving any values.
It was impossible to import this sql file to the database. The line had to be
removed.
It should be ensured that only legal SQL statements are produced.
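The problem can be avoided with a simple guard when the statement is assembled. This is an illustrative sketch, not the actual SQLDatabaseWriter code:

```java
import java.util.List;

// Sketch of guarding against emitting an INSERT with an empty value list,
// which would produce the illegal "INSERT INTO revisions VALUES;" statement.
public class InsertGuardSketch {

    public static String buildInsert(List<String> valueTuples) {
        if (valueTuples.isEmpty()) {
            return null; // caller skips writing anything for this batch
        }
        return "INSERT INTO revisions VALUES " + String.join(",", valueTuples) + ";";
    }

    public static void main(String[] args) {
        System.out.println(buildInsert(List.of("(1,'a')", "(2,'b')")));
        System.out.println(buildInsert(List.of())); // null -> nothing written
    }
}
```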
Original issue reported on code.google.com by oliver.ferschke
on 11 Jun 2011 at 9:55
Hi,
I'm not sure whether this is intended or a bug, but when you access
all links in a page with, for example:
Map<String, Set<String>> anchorStrings = page.getOutlinkAnchors();
and iterate over them with:
for (String targetArticle : anchorStrings.keySet())
the target-article is sometimes not found by
Wikipedia.existPage(String title) or Wikipedia.getPage(String title)
because it contains a section link.
for example, "Geschichte_Norwegens#Der Bürgerkrieg" (German Wikipedia)
is returned as targetArticle, and the article "Geschichte_Norwegens"
exists, but Wikipedia.existPage returns false and Wikipedia.getPage
throws an exception.
I think in general it is valid that the method returns the full target
string of a link, because maybe some applications need that extra
section information, but the javadoc of the method
Page.getOutlinkAnchors() should warn you about this behavior, because
if you overlook it you might discard a lot of valid links.
Also it would be a nice feature if the Wikipedia.existPage() and
Wikipedia.getPage() methods would resolve section links automatically.
(return true or the Page, if the string before the "#" is a valid
title).
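The suggested resolution amounts to stripping the section anchor before the lookup; stripSection here is an illustrative helper, not part of the JWPL API:

```java
// Sketch of resolving a link target that may carry a section anchor before
// the page lookup, so existPage()/getPage() can find the base article.
public class SectionLinkSketch {

    public static String stripSection(String target) {
        int hash = target.indexOf('#');
        return hash >= 0 ? target.substring(0, hash) : target;
    }

    public static void main(String[] args) {
        System.out.println(stripSection("Geschichte_Norwegens#Der Bürgerkrieg"));
        // "Geschichte_Norwegens" is what existPage() could then resolve
    }
}
```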
Original issue reported on code.google.com by [email protected]
on 4 Jul 2011 at 12:08
All 0.6.0 jars only contain compiled classes.
Source code can only be obtained via SVN and Maven.
We should add the source code to the jars as well.
Original issue reported on code.google.com by oliver.ferschke
on 1 Jun 2011 at 6:37
In order to retrieve the user group a user belongs to, it is necessary to have
the user id of each user. At the moment, only the username is stored in the
revision db.
Actions to be taken:
1. Currently, the username is called "ContributorID" in the db. It should be
renamed to "ContributorName"
2. Add userid to revisions table as "ContributorId"
With this additional information, the user group assignments should be usable.
Access methods for retrieving user group info should be added to the
RevisionApi.
Original issue reported on code.google.com by oliver.ferschke
on 21 Jul 2011 at 1:14
The system seems to simply concatenate the path string with the filename
string. If a path does not have a trailing File.separator, the resulting
concatenated path will be wrong.
The ConfigGUI has been fixed to ensure that those paths have a trailing
File.separator.
However, if someone manually creates or alters a config file, this could
still be a source of bugs.
It would be best to find all occurrences where paths are built
(path + filename) and ensure at those points that a File.separator is put
in between.
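A defensive join along the lines suggested above might look like this sketch
(class and method names are illustrative, not existing JWPL code):

```java
import java.io.File;

// Sketch of a path-joining helper that inserts File.separator only when
// the directory part does not already end with one.
public class PathJoin {

    /** Joins a directory path and a file name safely. */
    public static String join(String dir, String name) {
        return dir.endsWith(File.separator) ? dir + name
                                            : dir + File.separator + name;
    }
}
```

Routing every path + filename concatenation through one such helper would
make the trailing-separator question irrelevant for hand-edited config files.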
Original issue reported on code.google.com by oliver.ferschke
on 20 May 2011 at 1:50
As reported by user:
I'm trying to use the config file below to import a small subset of
the latest (English) Wikipedia dump into TimeMachine. However, for
some reason I can't determine (I've tried debugging a bit),
TimeMachine tries to cast its generator object to a DataMachine and is
unable to do so. I'm running
de.tudarmstadt.ukp.wikipedia.timemachine.domain.JWPLTimeMachine
through Eclipse and I get the following output:
09:57:13,204 INFO XmlBeanDefinitionReader:315 - Loading XML bean
definitions from class path resource [context/applicationContext.xml]
09:57:13,592 INFO Log4jLogger:21 - parsing configuration file....
09:57:13,611 INFO Log4jLogger:21 - processing data ...
09:57:13,640 INFO Log4jLogger:21 - de.tudarmstadt.ukp.wikipedia.timemachine.domain.TimeMachineFiles cannot be cast to de.tudarmstadt.ukp.wikipedia.datamachine.domain.DataMachineFiles
de.tudarmstadt.ukp.wikipedia.datamachine.domain.DataMachineGenerator.setFiles(DataMachineGenerator.java:49)
de.tudarmstadt.ukp.wikipedia.timemachine.domain.JWPLTimeMachine.main(JWPLTimeMachine.java:77)
Have I configured something wrong?
Original issue reported on code.google.com by [email protected]
on 10 Aug 2011 at 2:30
The generation of the revision data with the DiffTool can take quite long for
large Wikipedias. It would be very helpful to be able to resume the generation
process if the process is interrupted or cancelled for any reason.
Original issue reported on code.google.com by oliver.ferschke
on 24 May 2011 at 4:57
Recent XML dumps are often split into several (compressed) XML files. At the
moment, all JWPL products can only work with single source files.
Original issue reported on code.google.com by oliver.ferschke
on 11 Apr 2011 at 3:59
The new ArticleFilter can now filter the pages that are included in the
revision dump according to their namespaces. The prefixes are read from the
siteinfo section in the xml dump.
Currently, the namespaces are hard-coded in the class DiffToolThread:
ArticleFilter nameFilter = new ArticleFilter(Arrays.asList(new Integer[]{0,1}));
The filter is set to include articles (namespace 0) and talk pages
(namespace 1) and reject everything else. It already works
language-independently.
The namespaces that are to be included in the revision db should be passed
to the filter via the configuration file (and thus be made configurable via
the ConfigGUI).
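Passing the namespace list through the configuration file could start with a
small parser like the following sketch; the comma-separated value format and
the class name are assumptions, not the actual config mechanism:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: turn a configuration value such as "0,1" into the namespace
// list that would be handed to the ArticleFilter constructor.
public class NamespaceConfig {

    /** Parses a comma-separated namespace list into integers. */
    public static List<Integer> parseNamespaces(String value) {
        List<Integer> result = new ArrayList<>();
        for (String part : value.split(",")) {
            result.add(Integer.parseInt(part.trim()));
        }
        return result;
    }
}
```

The parsed list could then feed new ArticleFilter(parseNamespaces(value))
instead of the hard-coded {0, 1}.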
Original issue reported on code.google.com by oliver.ferschke
on 22 Jul 2011 at 3:41
Currently, the DataMachine only includes pages from the namespaces 0 and 1 into
the JWPL dumps. This should be made configurable in the same fashion as in the
RevisionMachine.
Additionally, the language-specific namespace prefixes should be mapped to
the English prefixes in the JWPL database, so that we don't have to perform
the distinction in the API.
Original issue reported on code.google.com by oliver.ferschke
on 3 Aug 2011 at 10:59
The flag which indicates whether a revision is a minor revision seems to
always be set to true.
This is probably a bug in the WikipediaXMLReader.
Original issue reported on code.google.com by oliver.ferschke
on 1 Jun 2011 at 6:00
Create a tutorial for the RevisionMachine similar to the JWPL Core tutorial.
Original issue reported on code.google.com by oliver.ferschke
on 26 May 2011 at 8:27
The files page.bin, revision.bin and text.bin, which are generated during
the dump creation process, should be automatically deleted when processing
has finished. They are no longer needed.
Original issue reported on code.google.com by oliver.ferschke
on 8 Jun 2011 at 12:32
[deleted issue]
The dependency on trove 2.0.4 specified in the main POM is not contained in
the Maven central repository. Thus, it is not possible to complete the build
process without either manually installing the required artifact into the
local repository or into one's own Maven repository.
Version 1.1-beta-5 of trove is available from the central repository and at
least there seem to be no errors when building the projects with this version
(after changing the POM accordingly). However, I have not tested the
datamachine and timemachine tools when used with this version of trove.
Original issue reported on code.google.com by [email protected]
on 28 Sep 2010 at 8:53
In addition to the current discussion page for each article, there can be
multiple discussion page archives if the article is heavily discussed. We
have to add support for these archives, so that the default access methods
for discussions return the current and all archived discussions.
Original issue reported on code.google.com by oliver.ferschke
on 11 Apr 2011 at 3:57
The recent format of the Turkish categorylinks file differs from the other
language editions, as it contains more fields (which are irrelevant for our
purposes). Processing fails because the wrong values are read.
Original issue reported on code.google.com by [email protected]
on 1 Apr 2011 at 12:27
[deleted issue]
What steps will reproduce the problem?
Access to a running MySQL database with a converted version of Wikipedia is
required.
Connecting to the database is done as shown in the tutorials.
The following excerpt of code causes an endless loop:
String firstDomain = "Photosynthesis";
Category firstCat = wiki.getCategory(firstDomain);
CategoryGraph catGraph =
new CategoryGraph(wiki, firstCat.getDescendants());
System.out.println(catGraph.getNumberOfNodes());
What is the expected output? What do you see instead?
As Photosynthesis currently has about 7 subcategories, I expect to get the
number 7 in a reasonable time.
Instead I get no results and the program runs endlessly.
What version of the product are you using? On what operating system?
Current version on Open Suse 11.2
Please provide any additional information below.
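A plausible cause is a cycle in Wikipedia's category graph: a descendant
traversal without a visited set never terminates. The sketch below shows a
cycle-safe walk over a simplified adjacency map; it is an illustration of
the idea, not the actual JWPL Category API:

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch: collect all descendants of a category while tolerating cycles.
// Categories are plain strings and edges come from a child map here.
public class DescendantWalk {

    public static Set<String> descendants(String root,
                                          Map<String, List<String>> children) {
        Set<String> visited = new LinkedHashSet<>();
        Deque<String> stack = new ArrayDeque<>();
        stack.push(root);
        while (!stack.isEmpty()) {
            String cat = stack.pop();
            for (String child : children.getOrDefault(cat,
                                                      Collections.emptyList())) {
                // The visited set is what prevents the endless loop.
                if (visited.add(child)) {
                    stack.push(child);
                }
            }
        }
        return visited;
    }
}
```

If getDescendants() (or the CategoryGraph construction) lacks such a guard,
any cycle under "Photosynthesis" would explain the observed behavior.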
Original issue reported on code.google.com by [email protected]
on 14 Mar 2011 at 4:26
Currently, revision indexes can only be generated by altering the settings
in the main method of the IndexGenerator class (or by invoking index
generation from a custom class).
It should be possible to run the IndexGenerator from the command line.
DB settings and output path would then be set via command-line parameters or
a config file.
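A command-line entry point might parse key=value settings along these lines;
the argument format, keys, and class name are illustrative assumptions:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: parse "key=value" command-line arguments (e.g. host=localhost
// database=wiki user=jwpl outputPath=/tmp/index) into a settings map that
// a command-line IndexGenerator wrapper could consume.
public class IndexGeneratorCli {

    /** Parses "key=value" arguments; entries without '=' are ignored. */
    public static Map<String, String> parseArgs(String[] args) {
        Map<String, String> settings = new HashMap<>();
        for (String arg : args) {
            int eq = arg.indexOf('=');
            if (eq > 0) {
                settings.put(arg.substring(0, eq), arg.substring(eq + 1));
            }
        }
        return settings;
    }
}
```

The wrapper's main method would validate the required keys and then hand
them to the existing index-generation code.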
Original issue reported on code.google.com by oliver.ferschke
on 2 Jun 2011 at 2:09
Currently, all articles with prefixes in the title (like User:) are filtered by
the RevisionMachine unless the prefix appears in a whitelist.
This way, only "normal" articles appear in the db PLUS everything you
specifically define in the whitelist.
At the moment, a page is identified as having a prefix by looking for a
colon in the title. There are, however, a few pages that have a colon in the
title without using it for prefix demarcation. These pages are currently
lost (<0.20%).
We should therefore adjust the filter and maybe go back to a (language-
dependent) blacklist filter.
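The adjusted filter could check the text before the colon against a known
set of namespace prefixes instead of treating every colon as a prefix
marker. The prefix set below is an illustrative English subset, not JWPL's
actual (language-dependent) list:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Sketch of the proposed check: a title counts as prefixed only if the
// text before its first colon is a known namespace prefix.
public class PrefixFilter {

    private static final Set<String> PREFIXES =
        new HashSet<>(Arrays.asList("User", "Talk", "Category", "Template"));

    public static boolean hasNamespacePrefix(String title) {
        int colon = title.indexOf(':');
        return colon > 0 && PREFIXES.contains(title.substring(0, colon));
    }
}
```

With this check, a title like "Mission: Impossible" is kept as a normal
article even though it contains a colon.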
Original issue reported on code.google.com by oliver.ferschke
on 7 Jul 2011 at 10:28
From the user-mailinglist:
> I also have a question concerning blend links: how are they treated by
> the JWPL parser? Are endings outside of the link brackets included in
> the anchor texts of the links or not? I think it would be nice if they
> were included, because for me they are the anchor texts intended by
> the editor.
Blend links are not handled by the parser yet.
You are right that they are the true intended form.
However, handling these cases is non-trivial in the parser.
Please open an issue.
However, I cannot promise that it will be tackled quickly.
-Torsten
Original issue reported on code.google.com by [email protected]
on 9 Jul 2011 at 2:08