brianified / jwpl
Automatically exported from code.google.com/p/jwpl
Templates are language-specific. That is why we should load whitelists and
blacklists in acceptTemplate() from the config file.
Four cases should be supported:
- Whitelist template begins with x
- Blacklist template begins with x
- Whitelist template equals x
- Blacklist template equals x
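A minimal sketch of how the four cases could be handled; the class, enum, and method names here are illustrative, not the actual JWPL API:

```java
import java.util.List;

// Sketch of the four matching modes for template filtering. Class and
// method names are stand-ins for whatever acceptTemplate() ends up using.
public class TemplateFilterSketch {

    public enum Mode { WHITELIST_PREFIX, BLACKLIST_PREFIX, WHITELIST_EXACT, BLACKLIST_EXACT }

    // Returns true if the template should be kept.
    public static boolean acceptTemplate(String template, Mode mode, List<String> patterns) {
        boolean matches = false;
        for (String p : patterns) {
            switch (mode) {
                case WHITELIST_PREFIX:
                case BLACKLIST_PREFIX:
                    if (template.startsWith(p)) matches = true;
                    break;
                case WHITELIST_EXACT:
                case BLACKLIST_EXACT:
                    if (template.equals(p)) matches = true;
                    break;
            }
        }
        // Whitelists keep only matches; blacklists drop matches.
        boolean whitelist = (mode == Mode.WHITELIST_PREFIX || mode == Mode.WHITELIST_EXACT);
        return whitelist ? matches : !matches;
    }

    public static void main(String[] args) {
        List<String> patterns = List.of("Infobox");
        System.out.println(acceptTemplate("Infobox person", Mode.WHITELIST_PREFIX, patterns)); // true
        System.out.println(acceptTemplate("Stub", Mode.WHITELIST_PREFIX, patterns));           // false
    }
}
```

The pattern lists would be read per language from the config file rather than hard-coded.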
Original issue reported on code.google.com by oliver.ferschke
on 5 Aug 2011 at 9:33
As reported by various users:
In some situations, an article contains characters (e.g. "\") that are not
properly escaped during preprocessing, causing errors.
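As an illustration, such characters could be escaped with a small helper before text is written into SQL statements; this is a sketch, not the actual JWPL preprocessing code:

```java
// Sketch of escaping characters that are special in MySQL string literals
// before emitting INSERT statements; illustrative only.
public class SqlEscapeSketch {

    public static String escape(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (char c : s.toCharArray()) {
            switch (c) {
                case '\\': sb.append("\\\\"); break; // the case from the report
                case '\'': sb.append("\\'");  break;
                case '"':  sb.append("\\\""); break;
                case '\n': sb.append("\\n");  break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("a\\b")); // backslash is doubled
    }
}
```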
Original issue reported on code.google.com by [email protected]
on 10 Jan 2011 at 9:28
If de.tudarmstadt.ukp.wikipedia.api.Wikipedia.getPages(PageQuery) returned an
unmodifiable collection instead of an iterable, it would be possible to get the
number of pages using size(). That would be helpful e.g. to display progress
information (x of y pages processed). Inheriting from AbstractCollection may be
helpful.
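A sketch of how such a wrapper might look; the types are stand-ins for the real JWPL Page/PageQuery classes, and the count is assumed to be known up front:

```java
import java.util.AbstractCollection;
import java.util.Collection;
import java.util.Iterator;
import java.util.List;

// Sketch of wrapping a page iterable into an unmodifiable Collection so
// that callers can query size() for progress reporting.
public class PageCollectionSketch {

    public static <T> Collection<T> asCollection(final Iterable<T> pages, final int count) {
        return new AbstractCollection<T>() {
            @Override public Iterator<T> iterator() { return pages.iterator(); }
            @Override public int size() { return count; }
            // AbstractCollection's add() already throws
            // UnsupportedOperationException, so the result is unmodifiable.
        };
    }

    public static void main(String[] args) {
        Collection<String> pages = asCollection(List.of("Page_A", "Page_B"), 2);
        System.out.println(pages.size()); // enables "x of y pages processed"
    }
}
```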
Original issue reported on code.google.com by [email protected]
on 21 Sep 2010 at 4:10
Database entries in revisions.ContributorName seem to have encoding problems.
Umlaut characters are not shown correctly.
Original issue reported on code.google.com by oliver.ferschke
on 27 Jul 2011 at 7:19
The IndexGenerator defines a PRIMARY KEY in the db schema - this makes
inserting millions of rows very slow.
The index should be created after inserting all rows.
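A minimal sketch of the suggested change (table and column names are illustrative, not the actual IndexGenerator schema):

```sql
-- Create the table without the key, bulk-insert, then build the index once.
CREATE TABLE index_table (
  id BIGINT NOT NULL,
  data VARCHAR(255)
);

-- ... millions of INSERTs run here without per-row index maintenance ...

ALTER TABLE index_table ADD PRIMARY KEY (id);
```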
Original issue reported on code.google.com by oliver.ferschke
on 13 Jun 2011 at 5:37
What steps will reproduce the problem?
1. Download wiki dump dated 2011-05-26 or 2011-05-04
2. Run JWPL_DATAMACHINE_0.6.0.jar with options english Categories
Disambiguation_pages
What is the expected output? What do you see instead?
Expected is parsing to be completed and output folder to be filled with parsed
content. I tried using bunzip2 to unzip pages-articles.xml.bz2, it worked fine.
But running JWPL_DATAMACHINE_0.6.0 fails. The same thing happens for both wiki
dumps dated 2011-05-26 and 2011-05-04.
Here is the complete stack trace
Loading XML bean definitions from class path resource
[context/applicationContext.xml]
parse input dumps...
Discussions are available
unexpected end of stream
org.apache.tools.bzip2.CBZip2InputStream.getAndMoveToFrontDecode(CBZip2InputStream.java:706)
org.apache.tools.bzip2.CBZip2InputStream.initBlock(CBZip2InputStream.java:289)
org.apache.tools.bzip2.CBZip2InputStream.setupNoRandPartA(CBZip2InputStream.java:846)
org.apache.tools.bzip2.CBZip2InputStream.setupNoRandPartB(CBZip2InputStream.java:902)
org.apache.tools.bzip2.CBZip2InputStream.read0(CBZip2InputStream.java:212)
org.apache.tools.bzip2.CBZip2InputStream.read(CBZip2InputStream.java:180)
org.apache.xerces.impl.XMLEntityManager$RewindableInputStream.read(Unknown Source)
org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
org.apache.xerces.impl.XMLEntityScanner.scanContent(Unknown Source)
org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanContent(Unknown Source)
org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source)
javax.xml.parsers.SAXParser.parse(SAXParser.java:195)
de.tudarmstadt.ukp.wikipedia.wikimachine.dump.xml.AbstractXmlDumpReader.readDump(AbstractXmlDumpReader.java:207)
de.tudarmstadt.ukp.wikipedia.datamachine.dump.xml.XML2Binary.<init>(XML2Binary.java:47)
de.tudarmstadt.ukp.wikipedia.datamachine.domain.DataMachineGenerator.processInputDump(DataMachineGenerator.java:65)
de.tudarmstadt.ukp.wikipedia.datamachine.domain.DataMachineGenerator.start(DataMachineGenerator.java:59)
de.tudarmstadt.ukp.wikipedia.datamachine.domain.JWPLDataMachine.main(JWPLDataMachine.java:57)
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
java.lang.reflect.Method.invoke(Method.java:616)
org.eclipse.jdt.internal.jarinjarloader.JarRsrcLoader.main(JarRsrcLoader.java:58)
What version of the product are you using? On what operating system?
The OS is Linux Ubuntu 10 and the JWPL version is 0.6.
Please reply soon with any suggestion/fix. We are unable to proceed. Can I use
JWPL for any wiki dump without any changes?
Thanks,
Shareeka
Original issue reported on code.google.com by [email protected]
on 27 Jun 2011 at 7:47
What steps will reproduce the problem?
1. I've imported an English Wikipedia dump 20110115
2. And I'm running the code from CategoryList.java (see attached)
What is the expected output? What do you see instead?
The expected output should be a list of all articles descending from one input
category as defined in the code (Finance)
What version of the product are you using? On what operating system?
I'm using the jwpl.jar downloaded from this site, as I couldn't manage to build
it with Maven. (Maven install requires running the tests, which fail because I
can't access the DB of TU Darmstadt, and I didn't figure out how to switch off
the tests.)
The IDE I am using is Eclipse on an OpenSUSE 64-bit machine.
MySQL is the latest version provided by OpenSUSE.
Please provide any additional information below.
I've attached the output file created with the attached code I was running.
There are some (from my point of view) weird things going on that I don't
understand.
Here is the thrown exception:
17:19:11,660 INFO SchemaUpdate:160 - schema update complete
17:21:44,121 ERROR PageDAO:107 - get failed
org.hibernate.PropertyAccessException: Null value was assigned to a property of
primitive type setter of
de.tudarmstadt.ukp.wikipedia.api.hibernate.Page.isDisambiguation
at org.hibernate.property.BasicPropertyAccessor$BasicSetter.set(BasicPropertyAccessor.java:85)
at org.hibernate.tuple.entity.AbstractEntityTuplizer.setPropertyValues(AbstractEntityTuplizer.java:337)
at org.hibernate.tuple.entity.PojoEntityTuplizer.setPropertyValues(PojoEntityTuplizer.java:200)
at org.hibernate.persister.entity.AbstractEntityPersister.setPropertyValues(AbstractEntityPersister.java:3566)
at org.hibernate.engine.TwoPhaseLoad.initializeEntity(TwoPhaseLoad.java:129)
at org.hibernate.loader.Loader.initializeEntitiesAndCollections(Loader.java:854)
at org.hibernate.loader.Loader.doQuery(Loader.java:729)
at org.hibernate.loader.Loader.doQueryAndInitializeNonLazyCollections(Loader.java:236)
at org.hibernate.loader.Loader.loadEntity(Loader.java:1860)
at org.hibernate.loader.entity.AbstractEntityLoader.load(AbstractEntityLoader.java:48)
at org.hibernate.loader.entity.AbstractEntityLoader.load(AbstractEntityLoader.java:42)
at org.hibernate.persister.entity.AbstractEntityPersister.load(AbstractEntityPersister.java:3044)
at org.hibernate.event.def.DefaultLoadEventListener.loadFromDatasource(DefaultLoadEventListener.java:395)
at org.hibernate.event.def.DefaultLoadEventListener.doLoad(DefaultLoadEventListener.java:375)
at org.hibernate.event.def.DefaultLoadEventListener.load(DefaultLoadEventListener.java:139)
at org.hibernate.event.def.DefaultLoadEventListener.proxyOrLoad(DefaultLoadEventListener.java:195)
at org.hibernate.event.def.DefaultLoadEventListener.onLoad(DefaultLoadEventListener.java:103)
at org.hibernate.impl.SessionImpl.fireLoad(SessionImpl.java:878)
at org.hibernate.impl.SessionImpl.get(SessionImpl.java:815)
at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:616)
at org.hibernate.context.ThreadLocalSessionContext$TransactionProtectionWrapper.invoke(ThreadLocalSessionContext.java:301)
at $Proxy0.get(Unknown Source)
at de.tudarmstadt.ukp.wikipedia.api.hibernate.PageDAO.findById(PageDAO.java:99)
at de.tudarmstadt.ukp.wikipedia.api.Page.fetchPage(Page.java:89)
at de.tudarmstadt.ukp.wikipedia.api.Page.<init>(Page.java:51)
at de.tudarmstadt.ukp.wikipedia.api.Wikipedia.getPage(Wikipedia.java:119)
at de.tudarmstadt.ukp.wikipedia.api.Category.getArticles(Category.java:287)
at uk.ac.uuc.cidbio.wikipedia.CategoryList.main(CategoryList.java:109)
Caused by: java.lang.IllegalArgumentException
at sun.reflect.GeneratedMethodAccessor11.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:616)
at org.hibernate.property.BasicPropertyAccessor$BasicSetter.set(BasicPropertyAccessor.java:42)
... 29 more
Exception in thread "main" org.hibernate.PropertyAccessException: Null value
was assigned to a property of primitive type setter of
de.tudarmstadt.ukp.wikipedia.api.hibernate.Page.isDisambiguation
at org.hibernate.property.BasicPropertyAccessor$BasicSetter.set(BasicPropertyAccessor.java:85)
at org.hibernate.tuple.entity.AbstractEntityTuplizer.setPropertyValues(AbstractEntityTuplizer.java:337)
at org.hibernate.tuple.entity.PojoEntityTuplizer.setPropertyValues(PojoEntityTuplizer.java:200)
at org.hibernate.persister.entity.AbstractEntityPersister.setPropertyValues(AbstractEntityPersister.java:3566)
at org.hibernate.engine.TwoPhaseLoad.initializeEntity(TwoPhaseLoad.java:129)
at org.hibernate.loader.Loader.initializeEntitiesAndCollections(Loader.java:854)
at org.hibernate.loader.Loader.doQuery(Loader.java:729)
at org.hibernate.loader.Loader.doQueryAndInitializeNonLazyCollections(Loader.java:236)
at org.hibernate.loader.Loader.loadEntity(Loader.java:1860)
at org.hibernate.loader.entity.AbstractEntityLoader.load(AbstractEntityLoader.java:48)
at org.hibernate.loader.entity.AbstractEntityLoader.load(AbstractEntityLoader.java:42)
at org.hibernate.persister.entity.AbstractEntityPersister.load(AbstractEntityPersister.java:3044)
at org.hibernate.event.def.DefaultLoadEventListener.loadFromDatasource(DefaultLoadEventListener.java:395)
at org.hibernate.event.def.DefaultLoadEventListener.doLoad(DefaultLoadEventListener.java:375)
at org.hibernate.event.def.DefaultLoadEventListener.load(DefaultLoadEventListener.java:139)
at org.hibernate.event.def.DefaultLoadEventListener.proxyOrLoad(DefaultLoadEventListener.java:195)
at org.hibernate.event.def.DefaultLoadEventListener.onLoad(DefaultLoadEventListener.java:103)
at org.hibernate.impl.SessionImpl.fireLoad(SessionImpl.java:878)
at org.hibernate.impl.SessionImpl.get(SessionImpl.java:815)
at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:616)
at org.hibernate.context.ThreadLocalSessionContext$TransactionProtectionWrapper.invoke(ThreadLocalSessionContext.java:301)
at $Proxy0.get(Unknown Source)
at de.tudarmstadt.ukp.wikipedia.api.hibernate.PageDAO.findById(PageDAO.java:99)
at de.tudarmstadt.ukp.wikipedia.api.Page.fetchPage(Page.java:89)
at de.tudarmstadt.ukp.wikipedia.api.Page.<init>(Page.java:51)
at de.tudarmstadt.ukp.wikipedia.api.Wikipedia.getPage(Wikipedia.java:119)
at de.tudarmstadt.ukp.wikipedia.api.Category.getArticles(Category.java:287)
at uk.ac.uuc.cidbio.wikipedia.CategoryList.main(CategoryList.java:109)
Caused by: java.lang.IllegalArgumentException
at sun.reflect.GeneratedMethodAccessor11.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:616)
at org.hibernate.property.BasicPropertyAccessor$BasicSetter.set(BasicPropertyAccessor.java:42)
... 29 more
From what I understand so far, the only two possible reasons for these
exceptions are problems with the database entries or a bug in the code. But I
may be wrong, of course.
Thank you for any support
Original issue reported on code.google.com by [email protected]
on 24 Feb 2011 at 5:34
Information about which page contains which templates is interesting for many
applications.
We should provide a tool that creates (optional) database tables containing
this information.
The access methods should be placed in a dedicated class, not the main
"Wikipedia" class.
Original issue reported on code.google.com by oliver.ferschke
on 3 Aug 2011 at 12:42
applicationContext.xml is not included in the jars with dependencies of the
DataMachine and the TimeMachine
Original issue reported on code.google.com by oliver.ferschke
on 17 Aug 2011 at 6:36
The JWPL project page needs a logo.
Original issue reported on code.google.com by oliver.ferschke
on 13 Aug 2011 at 9:21
I only need the mediawiki parser alone. There was one jar before, but now
there's none. Could you please put it back up?
Original issue reported on code.google.com by [email protected]
on 1 Jun 2011 at 10:47
[deleted issue]
Issue regarding: page filter in DiffToolThread, ArticleConsumer
The only implementation of AbstractNameChecker is currently
EnglishArticleNameChecker.
An ArticleNameChecker is used to whitelist articles with prefixes in the
article title (e.g. Talk:PageTitle) - usually, all pages with prefixes are
filtered.
Instead of adding an ArticleNameChecker for every language, there should be one
language-independent checker that reads the whitelist from the configuration.
Additionally:
It is also unclear why the filter object is created both in DiffToolThread and
in ArticleConsumer. This should only happen in one place.
Original issue reported on code.google.com by oliver.ferschke
on 20 May 2011 at 1:56
The "index table" index_articleID_rc_ts needs a primary key. It has to be
either created automatically upon schema creation or added later upon first
access (like the indexes):
ALTER TABLE index_articleID_rc_ts ADD PRIMARY KEY(ArticleID);
Original issue reported on code.google.com by oliver.ferschke
on 12 Jul 2011 at 8:58
[deleted issue]
Index tables are currently created as SQL files (INSERT statements).
This should be changed to data files that can be read using LOAD DATA INFILE
(http://dev.mysql.com/doc/refman/5.1/en/load-data.html).
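A minimal sketch of the intended import; the file path and field terminators are illustrative:

```sql
-- Bulk-load a tab-separated data file instead of replaying INSERT statements.
LOAD DATA INFILE '/path/to/index_articleID_rc_ts.txt'
INTO TABLE index_articleID_rc_ts
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n';
```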
Original issue reported on code.google.com by oliver.ferschke
on 12 Jun 2011 at 9:23
This is not a bug but rather an FAQ. Posting it here for the answer.
1. Is there any way to get the page content of the Category page ?
For example Category:Mathematics includes content as follows ....
************
The main article for this category is Mathematics.
Wikimedia Commons has media related to: Mathematics
See also: Category:Logic
Mathematics (colloquially, maths, or math), is the body of knowledge centered
on concepts such as quantity, structure, space, and change, and also the
academic discipline that studies them.
Mathematics can be further divided into smaller subcategories (such as geometry
and algebra), and thus, it includes many ideas and theories.
The main article for this category is Mathematics.
*****************
2. Does JWPL support WordNet-based near-similarity matches (hyponym-,
hypernym-, synonym-based)?
3. Is it possible to get an 'is-a' relation (which should be a meaningful
superclass of a real object, e.g. 'Automobile' should be returned for 'car')
for a given category using the JWPL API?
Please answer.
Thanks
Original issue reported on code.google.com by [email protected]
on 5 Jul 2011 at 2:34
JWPL needs a MySQL database, but does not come with the driver. It should be
documented that the driver needs to be installed and where to get it from
(website or via Maven).
Original issue reported on code.google.com by [email protected]
on 21 Sep 2010 at 4:12
Hi,
I stumbled across some strange results iterating over the dewiki-20070206.sql
database using the page.getPlainText() method and counting all words with JWPL:
the most common words in Wikipedia contained:
count: Word
842554: TEMPLATE
629957: Kategorie
438822: nbsp
256422: thumb
253580: Weblinks
Later on there were also some all-uppercase words that were probably part of
some template:
http://de.wikipedia.org/wiki/Vorlage:Personendaten
152323: NAME
131362: KURZBESCHREIBUNG
131333: GEBURTSDATUM
131300: GEBURTSORT
131237: ALTERNATIVNAMEN
and so on.
What is the expected output? What do you see instead?
I expected just normal German words to be the most common words like:
6424690: der; 5262887: und; 4316465: die; 3693242: in; 2713375: von; 1888157:
den; 1806153: des; 1578301: mit; 1509434: im; 1466509: ist; 1348733: Die;
1254947: zu; 1219600: das; 1218469: dem; 1110328: als; 1083261: für; 1077734:
auf; 1075940: eine; 1046970: ein; 1011403: wurde; 1009821: sich; 910366: er;
881106: auch; 842554: TEMPLATE; 814815: an; 714727: aus; 701011: war; 675874:
Der; 654112: nach; 629957: Kategorie; 616429: bei; 589324: wird; 581816: einer;
573699: werden; 547424: bis; 530476: sind; 529210: nicht; 525816: durch;
520091: oder; 518637: am; 503813: 1; 503254: zum; 481658: sie; 466585: es;
446827: Das; 438822: nbsp;
and so on. As you can see, the above-mentioned words got mixed into my results.
What version of the product are you using? On what operating system?
jwpl_v0.5, Ubuntu 10.04 LTS
Steps to reproduce:
1. Iterate over the whole database.
2. Tokenize every article with the Lucene standard tokenizer.
3. Count all tokens with a TreeMap<String, Integer>.
4. When finished, put all values in a MultiValueMap with the count as key
(basically <Integer, ArrayList<String>>).
5. Get a copy of the keySet, sort it, and use the top x values as keys to
retrieve words from the map.
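The counting procedure can be sketched as follows; a plain whitespace split stands in for the Lucene StandardTokenizer, and the MultiValueMap inversion is reduced to a sort over the entry set:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch of the word-counting procedure from the report; tokenization is
// simplified to whitespace splitting for illustration.
public class WordCountSketch {

    public static List<Map.Entry<String, Integer>> topWords(List<String> articles, int x) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String text : articles) {
            for (String token : text.split("\\s+")) {
                if (!token.isEmpty()) counts.merge(token, 1, Integer::sum);
            }
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> b.getValue() - a.getValue()); // highest count first
        return entries.subList(0, Math.min(x, entries.size()));
    }

    public static void main(String[] args) {
        List<String> articles = List.of("der und der", "und der");
        System.out.println(topWords(articles, 1)); // [der=3]
    }
}
```

Running this over plain text extracted by getPlainText() would surface the same TEMPLATE/Kategorie artifacts the report describes, since they are part of the returned text.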
Original issue reported on code.google.com by [email protected]
on 12 Jan 2011 at 1:25
Writing the revision dump directly to the database does not work at the moment
(see report below).
Either fix the issue or remove the option to write directly to the DB.
EXCEPTION:
de.tudarmstadt.ukp.wikipedia.revisionmachine.common.exceptions.SQLConsumerException: DIFFTOOL_SQLCONSUMER_DATABASEWRITER_EXCEPTION:
INSERT INTO revisions VALUES(null,
233192,1,233192,10,980061141000,?,'*',0,'99',1),(null,
233192,2,862220,10,1014669791000,?,'Automated conversion',1,'0',0),(null,
233192,3,15898945,10,1051323518000,?,'Fixing redirect',1,'7543',1),(null,
233192,4,56681914,10,1149368141000,?,'fix double redirect',1,'516514',1),(null,
233192,5,74466685,10,1157703364000,?,'cat rd',0,'750223',1),(null,
233192,6,133180268,10,1180032118000,?,'Robot: Automated text replacement
(-\\[\\[(.*?[\\:|\\|])*?(.+?)\\]\\] +\\g<2>)',1,'4477979',1),(null,
233192,7,133452289,10,1180127532000,?,'Revert edit(s) by
[[Special:Contributions/Ngaiklin|Ngaiklin]] to last version by
[[Special:Contributions/Rory096|Rory096]]',1,'241822',1),(null,
233192,8,381200179,10,1282875831000,?,null,0,'0',0),(null,
233192,9,381202555,10,1282876716000,?,'[[Help:Reverting|Reverted]] edits by
[[Special:Contributions/76.28.186.133|76.28.186.133]] ([[User
talk:76.28.186.133|talk]]) to last version by Gurch',1,'7181920',1);
0 106
1 40
2 17
3 28
4 26
5 7
6 11
7 479
8 63
at de.tudarmstadt.ukp.wikipedia.revisionmachine.common.exceptions.ErrorFactory.createSQLConsumerException(ErrorFactory.java:353)
at de.tudarmstadt.ukp.wikipedia.revisionmachine.difftool.consumer.sql.writer.SQLDatabaseWriter.process(SQLDatabaseWriter.java:244)
at de.tudarmstadt.ukp.wikipedia.revisionmachine.difftool.consumer.sql.writer.TimedSQLDatabaseWriter.process(TimedSQLDatabaseWriter.java:131)
at de.tudarmstadt.ukp.wikipedia.revisionmachine.difftool.DiffToolThread$TaskTransmitter.writeOutput(DiffToolThread.java:223)
at de.tudarmstadt.ukp.wikipedia.revisionmachine.difftool.DiffToolThread$TaskTransmitter.transmitDiff(DiffToolThread.java:195)
at de.tudarmstadt.ukp.wikipedia.revisionmachine.difftool.consumer.diff.calculation.DiffCalculator.transmitAtEndOfTask(DiffCalculator.java:326)
at de.tudarmstadt.ukp.wikipedia.revisionmachine.difftool.consumer.diff.calculation.TimedDiffCalculator.transmitAtEndOfTask(TimedDiffCalculator.java:149)
at de.tudarmstadt.ukp.wikipedia.revisionmachine.difftool.consumer.diff.calculation.DiffCalculator.process(DiffCalculator.java:523)
at de.tudarmstadt.ukp.wikipedia.revisionmachine.difftool.consumer.diff.calculation.TimedDiffCalculator.process(TimedDiffCalculator.java:192)
at de.tudarmstadt.ukp.wikipedia.revisionmachine.difftool.DiffToolThread.run(DiffToolThread.java:319)
at de.tudarmstadt.ukp.wikipedia.revisionmachine.difftool.DiffTool.main(DiffTool.java:56)
Caused by: java.sql.SQLException: Column count doesn't match value count at row 1
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:2926)
at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:1571)
at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:1666)
at com.mysql.jdbc.Connection.execSQL(Connection.java:2978)
at com.mysql.jdbc.Connection.execSQL(Connection.java:2902)
at com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:933)
at com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:1162)
at com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:1079)
at com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:1064)
at de.tudarmstadt.ukp.wikipedia.revisionmachine.difftool.consumer.sql.writer.SQLDatabaseWriter.process(SQLDatabaseWriter.java:206)
... 9 more
Original issue reported on code.google.com by oliver.ferschke
on 11 Aug 2011 at 5:18
[deleted issue]
Log4j is not correctly initialized when the executable jars are run.
Original issue reported on code.google.com by oliver.ferschke
on 17 Aug 2011 at 6:43
Working with JWPL 0.45b I encountered the following problem:
I created an own Wikipedia dump using German Wikipedia backup dump files of
November 11, 2009 from download.wikimedia.org. I followed the steps explained
on the JWPL documentation page and ran the transformation process using the new
DataMachine (Version 2), that was kindly provided to me by Mr. Zesch.
The creation of the SQL dump file was successful. I could retrieve pages and
process text, just as I could with the German Wikipedia SQL dump of 6 Feb 2007
provided on the JWPL homepage.
However, when comparing some results of the new Wikipedia dump with that of
2007, I could see that certain redirects, but not all, were missing. They were
however included in the online version of Wikipedia. I assumed that there was
some database mistake, but also in the output text files, namely
"page_redirects.txt", they did not appear. Some further investigation in the
online Wikipedia showed that the error was systematic:
Whenever a redirect page included a redirect link of the exact format "REDIRECT
[[...]]" (i.e. the capitalized REDIRECT keyword followed by a space), the
redirect did appear in the database.
But, whenever the format was slightly different, the redirect was missing.
Examples:
Missing space: REDIRECT[[...]]
not capitalized: Redirect [[...]]
German key word: WEITERLEITUNG [[...]]
I ran the DataMachine again, but the problem remained. Interestingly, in the
Wikipedia SQL dump of 6 Feb 2007, the problem does not appear.
Kind regards
Stephan Strohmaier
Original issue reported on code.google.com by [email protected]
on 21 Sep 2010 at 4:05
Hi, I am using MySQL 5.5 Community Server, 64-bit version, on Win 7 64-bit OS.
I have successfully imported a fresh Wikipedia 2011 DB; then, on my first query,
as:
>>>
DatabaseConfiguration dbConfig = new DatabaseConfiguration();
dbConfig.setHost("localhost");
........//other config
Wikipedia wiki = new Wikipedia(dbConfig);
System.out.println(wiki.getPage("Cat").getTitle());
>>>
I get an exception:
>>>
SEVERE: COLLATION 'utf8_bin' is not valid for CHARACTER SET 'utf8mb4'
Exception in thread "main" org.hibernate.exception.SQLGrammarException: could
not execute query
at org.hibernate.exception.SQLStateConverter.convert(SQLStateConverter.java:67)
at org.hibernate.exception.JDBCExceptionHelper.convert(JDBCExceptionHelper.java:43)
at org.hibernate.loader.Loader.doList(Loader.java:2223)
at org.hibernate.loader.Loader.listIgnoreQueryCache(Loader.java:2104)
at org.hibernate.loader.Loader.list(Loader.java:2099)
at org.hibernate.loader.custom.CustomLoader.list(CustomLoader.java:289)
at org.hibernate.impl.SessionImpl.listCustomQuery(SessionImpl.java:1695)
at org.hibernate.impl.AbstractSessionImpl.list(AbstractSessionImpl.java:142)
at org.hibernate.impl.SQLQueryImpl.list(SQLQueryImpl.java:152)
at org.hibernate.impl.AbstractQueryImpl.uniqueResult(AbstractQueryImpl.java:811)
at de.tudarmstadt.ukp.wikipedia.api.Page.fetchByTitle(Page.java:153)
at de.tudarmstadt.ukp.wikipedia.api.Page.<init>(Page.java:109)
at de.tudarmstadt.ukp.wikipedia.api.Wikipedia.getPage(Wikipedia.java:112)
at uk.ac.shef.oak.jwpltest.Test.main(Test.java:23)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:120)
Caused by: com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: COLLATION
'utf8_bin' is not valid for CHARACTER SET 'utf8mb4'
>>>
I have double checked my database's character set, which shows:
>>>
character-set-database = utf8
collation-database = utf8_general_ci
>>>
It seems that the DB character set config is right, but the error still
complains about it. Any suggestions, please?
Thanks
Original issue reported on code.google.com by [email protected]
on 4 Jun 2011 at 4:28
What steps will reproduce the problem:
Create a Wikipedia snapshot for 20090101 or 20080101 from the
20100130-Wikipedia Dump (http://dumps.wikimedia.org/enwiki/20100130/)
After Revision 7270000, the TimeMachine aborts with the following exception:
Exception in thread "xml2sql" java.lang.RuntimeException: java.io.IOException:
Invalid byte 2 of 4-byte UTF-8 sequence.
at de.tudarmstadt.ukp.wikipedia.timemachine.dump.xml.original.XMLDumpTableInputStreamThread.run(XMLDumpTableInputStreamThread.java:128)
Caused by: java.io.IOException: Invalid byte 2 of 4-byte UTF-8 sequence.
at org.mediawiki.importer.XmlDumpReader.readDump(XmlDumpReader.java:92)
at de.tudarmstadt.ukp.wikipedia.timemachine.dump.xml.original.XMLDumpTableInputStreamThread.run(XMLDumpTableInputStreamThread.java:123)
Caused by: org.apache.xerces.impl.io.MalformedByteSequenceException: Invalid
byte 2 of 4-byte UTF-8 sequence.
at org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source)
at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source)
at javax.xml.parsers.SAXParser.parse(SAXParser.java:198)
at org.mediawiki.importer.XmlDumpReader.readDump(XmlDumpReader.java:88)
... 1 more
Caused by: org.apache.xerces.impl.io.MalformedByteSequenceException: Invalid
byte 2 of 4-byte UTF-8 sequence.
at org.apache.xerces.impl.io.UTF8Reader.invalidByte(Unknown Source)
at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
at org.apache.xerces.impl.XMLEntityScanner.scanContent(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanContent(Unknown Source)
... 11 more
Write end dead
This is apparently caused by readDump() in org.mediawiki.importer.XmlDumpReader.
Original issue reported on code.google.com by oliver.ferschke
on 11 Feb 2011 at 9:08
JWPL doesn't use Hibernate to generate the database schema; it is set up during
the import of the SQL file that JWPL generates. Thus, Hibernate should be able
to operate in a pure read-only mode. This includes verifying the database
schema against the Hibernate mappings, but not trying to update the DB:
hibernate.hbm2ddl.auto=validate
Original issue reported on code.google.com by richard.eckart
on 17 Jul 2011 at 10:54
Currently the JWPL test cases require access to our DB host. This precludes
external users from running the tests (when JWPL is open sourced) and it also
prevents tests when the DB host is offline (as currently...).
Instead, an embedded DB like HSQLDB or Derby should be used for the test cases.
Original issue reported on code.google.com by [email protected]
on 21 Sep 2010 at 4:07
When working with multiple source archives, the DiffTool normally processes
them all sequentially and builds a single output file.
In order to speed up processing, multiple configuration files can be written to
allow for calculating revisions using several instances of the DiffTool. E.g.,
if we have 15 source archives, we might create 5 configuration files, each
defining 3 archives. We can then process the data with 5 instances of the tool
in parallel.
Output folders have to be adapted in case of multiple instances.
This is, of course, no replacement for REAL distributed processing. The
limiting factor here is probably the RAM.
Original issue reported on code.google.com by oliver.ferschke
on 6 Jun 2011 at 4:24
When processing the English Wikipedia dump 20110405, the DiffTool produced the
following insert statement
INSERT INTO revisions VALUES;
without giving any values.
It was impossible to import this sql file to the database. The line had to be
removed.
It should be ensured that only legal SQL statements are produced.
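The problem can be avoided with a simple guard when the statement is assembled. This is an illustrative sketch, not the actual SQLDatabaseWriter code:

```java
import java.util.List;

// Sketch of guarding against emitting an INSERT with an empty value list,
// which would produce the illegal "INSERT INTO revisions VALUES;" statement.
public class InsertGuardSketch {

    public static String buildInsert(List<String> valueTuples) {
        if (valueTuples.isEmpty()) {
            return null; // caller skips writing anything for this batch
        }
        return "INSERT INTO revisions VALUES " + String.join(",", valueTuples) + ";";
    }

    public static void main(String[] args) {
        System.out.println(buildInsert(List.of("(1,'a')", "(2,'b')")));
        System.out.println(buildInsert(List.of())); // null -> nothing written
    }
}
```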
Original issue reported on code.google.com by oliver.ferschke
on 11 Jun 2011 at 9:55
Hi,
I'm not sure whether this is intended or a bug, but when you access
all links in a page with, for example:
Map<String, Set<String>> anchorStrings = page.getOutlinkAnchors();
and iterate over them with:
for (String targetArticle : anchorStrings.keySet())
the target-article is sometimes not found by
Wikipedia.existPage(String title) or Wikipedia.getPage(String title)
because it contains a section link.
for example, "Geschichte_Norwegens#Der Bürgerkrieg" (German Wikipedia)
is returned as targetArticle, and the article "Geschichte_Norwegens"
exists, but Wikipedia.existPage returns false and Wikipedia.getPage
throws an exception.
I think in general it is valid that the method returns the full target
string of a link, because maybe some applications need that extra
section information, but the javadoc of the method
Page.getOutlinkAnchors() should warn you about this behavior, because
if you overlook it you might discard a lot of valid links.
Also it would be a nice feature if the Wikipedia.existPage() and
Wikipedia.getPage() methods would resolve section links automatically.
(return true or the Page, if the string before the "#" is a valid
title).
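The suggested resolution amounts to stripping the section anchor before the lookup; stripSection here is an illustrative helper, not part of the JWPL API:

```java
// Sketch of resolving a link target that may carry a section anchor before
// the page lookup, so existPage()/getPage() can find the base article.
public class SectionLinkSketch {

    public static String stripSection(String target) {
        int hash = target.indexOf('#');
        return hash >= 0 ? target.substring(0, hash) : target;
    }

    public static void main(String[] args) {
        System.out.println(stripSection("Geschichte_Norwegens#Der Bürgerkrieg"));
        // "Geschichte_Norwegens" is what existPage() could then resolve
    }
}
```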
Original issue reported on code.google.com by [email protected]
on 4 Jul 2011 at 12:08
All 0.6.0 jars only contain compiled classes.
Source code can only be obtained via SVN and Maven.
We should add the source code to the jars as well.
Original issue reported on code.google.com by oliver.ferschke
on 1 Jun 2011 at 6:37
In order to retrieve the user group a user belongs to, it is necessary to have
the user id of each user. At the moment, only the username is stored in the
revision db.
Actions to be taken:
1. Currently, the username is called "ContributorID" in the db. It should be
renamed to "ContributorName"
2. Add userid to revisions table as "ContributorId"
With this additional information, the user group assignments should be usable.
Access methods for retrieving user group info should be added to the
RevisionApi.
Original issue reported on code.google.com by oliver.ferschke
on 21 Jul 2011 at 1:14
The system seems to simply concatenate the path string with the filename
string. If a path does not have a trailing File.separator, the resulting
concatenated path will be wrong.
The ConfigGUI has been fixed to ensure that those paths have a trailing
File.separator.
However, if someone manually creates or alters a config file, this could
still be a source of bugs.
It would be best to find all occurrences where paths are built
(path + filename) and ensure at those points that a File.separator is put
in between.
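A defensive join along the lines suggested above might look like this sketch
(class and method names are illustrative, not existing JWPL code):

```java
import java.io.File;

// Sketch of a path-joining helper that inserts File.separator only when
// the directory part does not already end with one.
public class PathJoin {

    /** Joins a directory path and a file name safely. */
    public static String join(String dir, String name) {
        return dir.endsWith(File.separator) ? dir + name
                                            : dir + File.separator + name;
    }
}
```

Routing every path + filename concatenation through one such helper would
make the trailing-separator question irrelevant for hand-edited config files.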
Original issue reported on code.google.com by oliver.ferschke
on 20 May 2011 at 1:50
As reported by user:
I'm trying to use the config file below to import a small subset of
the latest (English) Wikipedia dump into TimeMachine. However, for
some reason I can't determine (I've tried debugging a bit),
TimeMachine tries to cast its generator object to a DataMachine and is
unable to do so. I'm running
de.tudarmstadt.ukp.wikipedia.timemachine.domain.JWPLTimeMachine
through Eclipse and I get the following output:
09:57:13,204 INFO XmlBeanDefinitionReader:315 - Loading XML bean
definitions from class path resource [context/applicationContext.xml]
09:57:13,592 INFO Log4jLogger:21 - parsing configuration file....
09:57:13,611 INFO Log4jLogger:21 - processing data ...
09:57:13,640 INFO Log4jLogger:21 - de.tudarmstadt.ukp.wikipedia.timemachine.domain.TimeMachineFiles cannot be cast to de.tudarmstadt.ukp.wikipedia.datamachine.domain.DataMachineFiles
de.tudarmstadt.ukp.wikipedia.datamachine.domain.DataMachineGenerator.setFiles(DataMachineGenerator.java:49)
de.tudarmstadt.ukp.wikipedia.timemachine.domain.JWPLTimeMachine.main(JWPLTimeMachine.java:77)
Have I configured something wrong?
Original issue reported on code.google.com by [email protected]
on 10 Aug 2011 at 2:30
The generation of the revision data with the DiffTool can take quite long for
large Wikipedias. It would be very helpful to be able to resume the generation
process if the process is interrupted or cancelled for any reason.
Original issue reported on code.google.com by oliver.ferschke
on 24 May 2011 at 4:57
Recent XML dumps are often split into several (compressed) XML files. At the
moment, all JWPL products can only work with single source files.
Original issue reported on code.google.com by oliver.ferschke
on 11 Apr 2011 at 3:59
The new ArticleFilter can now filter the pages that are included in the
revision dump according to their namespaces. The prefixes are read from the
siteinfo section in the xml dump.
Currently, the namespaces are hard-coded in the class DiffToolThread:
ArticleFilter nameFilter = new ArticleFilter(Arrays.asList(new Integer[]{0,1}));
The filter is set to include articles (namespace 0) and talk pages
(namespace 1) and reject everything else. It already works
language-independently.
The namespaces that are to be included in the revision db should be passed
to the filter via the configuration file (and thus be made configurable via
the ConfigGUI).
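Passing the namespace list through the configuration file could start with a
small parser like the following sketch; the comma-separated value format and
the class name are assumptions, not the actual config mechanism:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: turn a configuration value such as "0,1" into the namespace
// list that would be handed to the ArticleFilter constructor.
public class NamespaceConfig {

    /** Parses a comma-separated namespace list into integers. */
    public static List<Integer> parseNamespaces(String value) {
        List<Integer> result = new ArrayList<>();
        for (String part : value.split(",")) {
            result.add(Integer.parseInt(part.trim()));
        }
        return result;
    }
}
```

The parsed list could then feed new ArticleFilter(parseNamespaces(value))
instead of the hard-coded {0, 1}.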
Original issue reported on code.google.com by oliver.ferschke
on 22 Jul 2011 at 3:41
Currently, the DataMachine only includes pages from the namespaces 0 and 1 into
the JWPL dumps. This should be made configurable in the same fashion as in the
RevisionMachine.
Additionally, the language-specific namespace prefixes should be mapped to
the English prefixes in the JWPL database, so that we don't have to perform
the distinction in the API.
Original issue reported on code.google.com by oliver.ferschke
on 3 Aug 2011 at 10:59
The flag which indicates whether a revision is a minor revision seems to
always be set to true.
This is probably a bug in the WikipediaXMLReader.
Original issue reported on code.google.com by oliver.ferschke
on 1 Jun 2011 at 6:00
Create a tutorial for the RevisionMachine similar to the JWPL Core tutorial.
Original issue reported on code.google.com by oliver.ferschke
on 26 May 2011 at 8:27
The files page.bin, revision.bin and text.bin, which are generated during
the dump creation process, should be automatically deleted when processing
has finished. They are no longer needed.
Original issue reported on code.google.com by oliver.ferschke
on 8 Jun 2011 at 12:32
[deleted issue]
The dependency on trove 2.0.4 specified in the main POM is not contained in
the Maven central repository. Thus, it is not possible to complete the build
process without either manually installing the required artifact into the
local repository or into one's own Maven repository.
Version 1.1-beta-5 of trove is available from the central repository and at
least there seem to be no errors when building the projects with this version
(after changing the POM accordingly). However, I have not tested the
datamachine and timemachine tools when used with this version of trove.
Original issue reported on code.google.com by [email protected]
on 28 Sep 2010 at 8:53
In addition to the current discussion page for each article, there can be
multiple discussion page archives if the article is heavily discussed. We
have to add support for these archives, so that the default access methods
for discussions return the current and all archived discussions.
Original issue reported on code.google.com by oliver.ferschke
on 11 Apr 2011 at 3:57
The recent format of the Turkish categorylinks file differs from the other
language editions, as it contains more fields (which are irrelevant for our
purposes). Processing fails because the wrong values are read.
Original issue reported on code.google.com by [email protected]
on 1 Apr 2011 at 12:27
[deleted issue]
What steps will reproduce the problem?
Access to a running MySQL database with a converted version of Wikipedia is
required.
Connecting to the database is done as shown in the tutorials.
The following excerpt of code causes an endless loop:
String firstDomain = "Photosynthesis";
Category firstCat = wiki.getCategory(firstDomain);
CategoryGraph catGraph =
new CategoryGraph(wiki, firstCat.getDescendants());
System.out.println(catGraph.getNumberOfNodes());
What is the expected output? What do you see instead?
As Photosynthesis currently has about 7 subcategories, I expect to get the
number 7 in a reasonable time.
Instead I get no results and the program runs endlessly.
What version of the product are you using? On what operating system?
Current version on Open Suse 11.2
Please provide any additional information below.
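A plausible cause is a cycle in Wikipedia's category graph: a descendant
traversal without a visited set never terminates. The sketch below shows a
cycle-safe walk over a simplified adjacency map; it is an illustration of
the idea, not the actual JWPL Category API:

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch: collect all descendants of a category while tolerating cycles.
// Categories are plain strings and edges come from a child map here.
public class DescendantWalk {

    public static Set<String> descendants(String root,
                                          Map<String, List<String>> children) {
        Set<String> visited = new LinkedHashSet<>();
        Deque<String> stack = new ArrayDeque<>();
        stack.push(root);
        while (!stack.isEmpty()) {
            String cat = stack.pop();
            for (String child : children.getOrDefault(cat,
                                                      Collections.emptyList())) {
                // The visited set is what prevents the endless loop.
                if (visited.add(child)) {
                    stack.push(child);
                }
            }
        }
        return visited;
    }
}
```

If getDescendants() (or the CategoryGraph construction) lacks such a guard,
any cycle under "Photosynthesis" would explain the observed behavior.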
Original issue reported on code.google.com by [email protected]
on 14 Mar 2011 at 4:26
Currently, revision indexes can only be generated by altering the settings
in the main method of the IndexGenerator class (or by invoking index
generation from a custom class).
It should be possible to run the IndexGenerator from the command line.
DB settings and output path would then be set via command-line parameters or
a config file.
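A command-line entry point might parse key=value settings along these lines;
the argument format, keys, and class name are illustrative assumptions:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: parse "key=value" command-line arguments (e.g. host=localhost
// database=wiki user=jwpl outputPath=/tmp/index) into a settings map that
// a command-line IndexGenerator wrapper could consume.
public class IndexGeneratorCli {

    /** Parses "key=value" arguments; entries without '=' are ignored. */
    public static Map<String, String> parseArgs(String[] args) {
        Map<String, String> settings = new HashMap<>();
        for (String arg : args) {
            int eq = arg.indexOf('=');
            if (eq > 0) {
                settings.put(arg.substring(0, eq), arg.substring(eq + 1));
            }
        }
        return settings;
    }
}
```

The wrapper's main method would validate the required keys and then hand
them to the existing index-generation code.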
Original issue reported on code.google.com by oliver.ferschke
on 2 Jun 2011 at 2:09
Currently, all articles with prefixes in the title (like User:) are filtered by
the RevisionMachine unless the prefix appears in a whitelist.
This way, only "normal" articles appear in the db PLUS everything you
specifically define in the whitelist.
At the moment, a page is identified as having a prefix by looking for a
colon in the title. There are, however, a few pages that have a colon in the
title without using it for prefix demarcation. These pages are currently
lost (<0.20%).
We should therefore adjust the filter and maybe go back to a (language-
dependent) blacklist filter.
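The adjusted filter could check the text before the colon against a known
set of namespace prefixes instead of treating every colon as a prefix
marker. The prefix set below is an illustrative English subset, not JWPL's
actual (language-dependent) list:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Sketch of the proposed check: a title counts as prefixed only if the
// text before its first colon is a known namespace prefix.
public class PrefixFilter {

    private static final Set<String> PREFIXES =
        new HashSet<>(Arrays.asList("User", "Talk", "Category", "Template"));

    public static boolean hasNamespacePrefix(String title) {
        int colon = title.indexOf(':');
        return colon > 0 && PREFIXES.contains(title.substring(0, colon));
    }
}
```

With this check, a title like "Mission: Impossible" is kept as a normal
article even though it contains a colon.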
Original issue reported on code.google.com by oliver.ferschke
on 7 Jul 2011 at 10:28
From the user-mailinglist:
> I also have a question concerning blend links: how are they treated by
> the JWPL parser? Are endings outside of the link brackets included in
> the anchor texts of the links or not? I think it would be nice if they
> were included, because for me they are the anchor texts intended by
> the editor.
Blend links are not handled by the parser yet.
You are right that they are the true intended form.
However, handling these cases is non-trivial in the parser.
Please open an issue.
However, I cannot promise that it will be tackled quickly.
-Torsten
Original issue reported on code.google.com by [email protected]
on 9 Jul 2011 at 2:08