heideltime / heideltime

A multilingual, cross-domain temporal tagger developed at the Database Systems Research Group at Heidelberg University.

License: GNU General Public License v3.0

Shell 0.33% Batchfile 0.06% Java 99.61%

heideltime's Introduction

HeidelTime can now also be used for English temponym tagging. For details, see our TempWeb'16 paper.

HeidelTime contains automatically created resources for 200+ languages in addition to manually created ones for 13 languages. For further details, take a look at our EMNLP 2015 paper.

About HeidelTime

HeidelTime is a multilingual, domain-sensitive temporal tagger developed at the Database Systems Research Group at Heidelberg University. It extracts temporal expressions from documents and normalizes them according to the TIMEX3 annotation standard. HeidelTime is available as a UIMA annotator and as a standalone version.

HeidelTime currently contains hand-crafted resources for 13 languages: English, German, Dutch, Vietnamese, Arabic, Spanish, Italian, French, Chinese, Russian, Croatian, Estonian and Portuguese. In addition, starting with version 2.0, HeidelTime contains automatically created resources for more than 200 languages. Although these resources are of lower quality than the manually created ones, temporal tagging of many of these languages has never been addressed before. Thus, HeidelTime can be used as a baseline for temporal tagging of all these languages or as a starting point for developing temporal tagging capabilities for them.

HeidelTime distinguishes between news-style documents and narrative-style documents (e.g., Wikipedia articles) in all languages. In addition, English colloquial texts (e.g., tweets and SMS) and scientific articles (e.g., clinical trials) are supported.

Want to see what it can do before you delve in? Take a look at our online demo.

Latest downloads

Our latest as well as past releases are always available on the Releases page.
The bleeding-edge version is available via our Git repository.
Our temporally annotated corpora and supplementary evaluation scripts can be found here.
If you want to receive notifications on updates of HeidelTime, please fill out this form.
You can also follow us on Twitter @HeidelTime.

Maven

A minimal set of dependencies is satisfied by these entries for your pom.xml:

		<dependency>
			<groupId>org.apache.uima</groupId>
			<artifactId>uimaj-core</artifactId>
			<version>2.8.1</version>
		</dependency>
		<dependency>
			<groupId>com.github.heideltime</groupId>
			<artifactId>heideltime</artifactId>
			<version>2.2</version>
		</dependency>

For some additional features, you will need to provide additional dependencies. See our Maven wiki page.

Publications

If you use HeidelTime, please cite the appropriate paper (in general, this would be the journal paper [4]; if you use HeidelTime with automatically created resources, please cite paper [10]; if you use HeidelTime for temponym tagging, please cite paper [11]):

[1] Strötgen, Gertz: HeidelTime: High Quality Rule-based Extraction and Normalization of Temporal Expressions. SemEval'10.
[2] Strötgen, Gertz: Temporal Tagging on Different Domains: Challenges, Strategies, and Gold Standards. LREC'12.
[3] Strötgen et al.: HeidelTime: Tuning English and Developing Spanish Resources for TempEval-3. SemEval'13.
[4] Strötgen, Gertz: Multilingual and Cross-domain Temporal Tagging. Language Resources and Evaluation, 2013.
[5] Strötgen et al.: Time for More Languages: Temporal Tagging of Arabic, Italian, Spanish, and Vietnamese. TALIP, 2014.
[6] Li et al.: Chinese Temporal Tagging with HeidelTime. EACL'14.
[7] Strötgen et al.: Extending HeidelTime for Temporal Expressions Referring to Historic Dates. LREC'14.
[8] Manfredi et al.: HeidelTime at EVENTI: Tuning Italian Resources and Addressing TimeML's Empty Tags. EVALITA'14.
[9] Strötgen: Domain-sensitive Temporal Tagging for Event-centric Information Retrieval. PhD Thesis.
[10] Strötgen, Gertz: A Baseline Temporal Tagger for All Languages. EMNLP'15.
[11] Kuzey, Strötgen, Setty, Weikum: Temponym Tagging: Temporal Scopes for Textual Phrases. TempWeb'16.

Language Resources

We want to thank the following researchers for their efforts to develop HeidelTime resources:

Dutch resources: Matje van de Camp, Tilburg University
French resources: Véronique Moriceau, LIMSI - CNRS
Russian resources: Elena Klyachko
Croatian resources: Luka Skukan, University of Zagreb
Portuguese resources: Zunsik Lim

Please feel free to use our automatically created resources as a starting point if you plan to manually address a language.

Tell me more!

HeidelTime was developed in Java with extensibility in mind -- especially in terms of language-specific resources, as well as in terms of programmatic functionality.

Get your hands dirty!

You'd like to reproduce HeidelTime's evaluation results described in our papers on several corpora? Download the heideltime-kit or clone our repository and check out our tutorial on reproducing evaluation results. This will also explain how to integrate the HeidelTime annotator into a UIMA pipeline.
You'd like to participate in the development of HeidelTime; maybe create an addon or improve functionality? Clone our repository and see how to set up Eclipse to develop HeidelTime. Then have a look at HeidelTime's architectural concepts and have a go at it!
You'd like to share some changes you've made, resources for a new language, or you think that HeidelTime could be improved in a specific way? Open up a pull request or an issue and let us know, we're eager to read your thoughts!

heideltime's Issues

Incorrectly resolved "2nd or 3rd century BC" -> 01 (AD)

This is the example for date_historic_5c-BCADhint, but it is incorrectly resolved.

The problem is the overlap handling. We have four matches here:

"2nd" as 2nd century BC
"2nd" as 2nd century (AD!)
"3rd century BC" as 3rd century BC
"3rd century" as 3rd century (AD!)

The tail is resolved correctly because the BC match is longer. For the head ("2nd"), however, the matching range covers only the beginning, so under the current logic the two matches are exact duplicates. I suggest preferring the longer timex value when the values differ, i.e. BC01 over 01, on the assumption that it stems from a more complex match:

else if (t1.getTimexValue().length() > t2.getTimexValue().length()) {
  hsTimexesToRemove.add(t2);
}

(for the diff in my branch, see: b637df3)
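
A minimal, self-contained sketch of the proposed tie-break. It assumes a simplified stand-in for HeidelTime's timex annotation: the `Timex` class and the `pickDuplicateToRemove` method are hypothetical, and only `getTimexValue()` and the length comparison are taken from the snippet above.

```java
import java.util.Objects;

public class DuplicateTieBreak {
    /** Hypothetical stand-in for a timex annotation: character span plus normalized value. */
    static final class Timex {
        final int begin;
        final int end;
        final String value;

        Timex(int begin, int end, String value) {
            this.begin = begin;
            this.end = end;
            this.value = Objects.requireNonNull(value);
        }

        String getTimexValue() {
            return value;
        }
    }

    /**
     * For two annotations covering the same span, returns the one to remove:
     * the one with the shorter value if the lengths differ, otherwise t2.
     */
    static Timex pickDuplicateToRemove(Timex t1, Timex t2) {
        if (t1.begin != t2.begin || t1.end != t2.end) {
            throw new IllegalArgumentException("not an exact-span duplicate");
        }
        if (t2.getTimexValue().length() > t1.getTimexValue().length()) {
            return t1; // t2 carries more information (e.g. the BC prefix)
        }
        return t2; // t1 is longer, or both are equally long
    }

    public static void main(String[] args) {
        Timex bc = new Timex(0, 3, "BC01"); // "2nd" read as 2nd century BC
        Timex ad = new Timex(0, 3, "01");   // "2nd" read as 2nd century AD
        System.out.println(pickDuplicateToRemove(ad, bc).getTimexValue()); // prints "01"
    }
}
```

Value length is only a heuristic for "more specific"; it happens to work here because the BC variant adds a prefix to the same century value.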

improper handling of newline when reading files

The main() method of class HeidelTimeStandalone reads its input with this loop:

    while ((line = fileReader.readLine()) != null)
       sb.append(System.getProperty("line.separator")+line);

This has the effect of adding a newline at the beginning and leaving the last line
unterminated.

This affects the tokenizer and POS tagger I am using: it gets an extra empty token
at the beginning, which causes a misalignment of tokens.

It should be changed to:

    while ((line = fileReader.readLine()) != null)
       sb.append(line + System.getProperty("line.separator"));
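
As a self-contained illustration of the corrected loop, the sketch below reads from any `Reader` and terminates every line, including the last one. The class and method names (`ReadLines`, `readAll`) are made up for this example and are not part of HeidelTime.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

public class ReadLines {
    /** Reads all lines and terminates each one, including the last, with the platform separator. */
    static String readAll(Reader in) {
        StringBuilder sb = new StringBuilder();
        String sep = System.lineSeparator();
        try (BufferedReader reader = new BufferedReader(in)) {
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line).append(sep); // separator goes after the line, not before it
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = readAll(new StringReader("first\nsecond"));
        System.out.println(text.startsWith("first"));              // prints "true"
        System.out.println(text.endsWith(System.lineSeparator())); // prints "true"
    }
}
```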

Original issue reported on code.google.com by attardi on 2014-10-18 20:20:20

HeidelTime (Chinese) throws Exception / Error

When attempting to run HeidelTime with Chinese, we get a FileNotFoundException. We are testing on the following Chinese text:

“西门商场（2013年9月31日武装分子在肯尼亚首都内罗毕袭击的购物中心）的时候我身在艺术咖啡厅，这次我又住在半岛酒店……凌晨4点前一切被炸成地狱的时候，我正准备离开我的房间。”

I'm running the following command:

java -jar de.unihd.dbs.heideltime.standalone.jar ~/chinese.txt -l chinese -t narrative -vv -pos treetagger

The config file points to the proper directory, however we get the below exception:

java.io.FileNotFoundException: /home/bisoldi/bin/heideltime/treetagger/chinese-tokenizer/zh-tokenise/segment-zh.pl (No such file or directory)

    at java.io.FileInputStream.open0(Native Method)
    at java.io.FileInputStream.open(FileInputStream.java:195)
    at java.io.FileInputStream.<init>(FileInputStream.java:138)
    at de.unihd.dbs.uima.annotator.treetagger.TreeTaggerProperties.getChineseTokenizationProcess(TreeTaggerProperties.java:81)
    at de.unihd.dbs.uima.annotator.treetagger.TreeTaggerWrapper.tokenizeChinese(TreeTaggerWrapper.java:302)
    at de.unihd.dbs.uima.annotator.treetagger.TreeTaggerWrapper.process(TreeTaggerWrapper.java:222)
    at de.unihd.dbs.heideltime.standalone.components.impl.TreeTaggerWrapper.process(TreeTaggerWrapper.java:43)
    at de.unihd.dbs.heideltime.standalone.HeidelTimeStandalone.establishPartOfSpeechInformation(HeidelTimeStandalone.java:406)
    at de.unihd.dbs.heideltime.standalone.HeidelTimeStandalone.establishHeidelTimePreconditions(HeidelTimeStandalone.java:339)
    at de.unihd.dbs.heideltime.standalone.HeidelTimeStandalone.process(HeidelTimeStandalone.java:499)
    at de.unihd.dbs.heideltime.standalone.HeidelTimeStandalone.process(HeidelTimeStandalone.java:448)
    at de.unihd.dbs.heideltime.standalone.HeidelTimeStandalone.main(HeidelTimeStandalone.java:765)

So, we renamed segment-zh.perl to segment-zh.pl and then re-executed HeidelTime and got the following:

[HeidelTime] HeidelTime has not found any sentence tokens in this document. HeidelTime needs sentence tokens tagged by a preprocessing UIMA analysis engine to do its work. Please check your UIMA workflow and add an analysis engine that creates these sentence tokens.

I then tried running TreeTagger with Chinese standalone and found that, while it looks for the correctly named segment-zh.perl, it looks for it in TreeTagger's cmd directory. However, there are no instructions to put the Chinese tokenizer there, and given the subdirectories created by the tokenizer's archive, it would not work anyway unless we moved the files manually.

So, I created symlinks to the actual locations and then tried running TreeTagger again with:

echo "西门商场（2013年9月31日武装分子在肯尼亚首都内罗毕袭击的购物中心）的时候我身在艺术咖啡厅，这次我又住在半岛酒店……凌晨4点前一切 被炸成地狱的时候，我正准备离开我的房间" | cmd/tree-tagger-chinese

And get the following:

reading parameters ...
Can't locate segmenter.pm in @INC (you may need to install the segmenter module) (@INC contains: ./cmd /etc/perl /usr/local/lib/perl/5.18.2 /usr/local/share/perl/5.18.2 /usr/lib/perl5 /usr/share/perl5 /usr/lib/perl/5.18 /usr/share/perl/5.18 /usr/local/lib/site_perl .) at ./cmd/segment-zh.perl line 5.
BEGIN failed--compilation aborted at ./cmd/segment-zh.perl line 5.
tagging ...
finished.

That's where I completed my troubleshooting. Hopefully there is a simple answer to this, but let me know if I need to do anything further.

Thanks!

availability of resources for Portuguese

Hi,
I'm a software engineer working on a project that uses HeidelTime for English, Spanish,
and Arabic.
I have now started developing resources for Portuguese, and I wonder whether there is any
preliminary work on Portuguese resources that could be shared, even if it has not been officially released.
Thanks very much.

Original issue reported on code.google.com by ZunsikLim on 2015-01-07 14:20:45

Pass config.properties as parameter

Being forced to keep config.properties in the same folder can be a problem: my software
already has a config.properties file. It would be better to be able to pass in
the config file we want.

So I added a "run" function to HeidelTimeStandalone that lets me provide the config file, the
input file, and the output file. I attach it in case it is useful for others ;)

Original issue reported on code.google.com by damien.palacio on 2012-05-25 08:55:46

- _Attachment: [HeidelTimeStandalone.java](https://storage.googleapis.com/google-code-attachments/heideltime/issue-3/comment-0/HeidelTimeStandalone.java)_

"Debug" mode

Another improvement that I think would be nice: an optional "debug" mode.

Not all of the printed trace output (errors aside) is necessarily useful, and if you run
HeidelTime on a big collection and want to check for problems, it produces a lot of text.

So there could be a field in config.properties to turn this option on or off.

This would require changing every place in the code that prints what is being done,
adding something like this:

    if (Config.get(Config.DEBUG).equals("true")) {
        System.out.println("trace");
    }
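
One way to avoid repeating that check at every print site is to read the flag once and route trace output through a small helper. The sketch below is only an illustration: the `DebugTrace` class and the `debug` property key are assumptions, not existing HeidelTime names.

```java
import java.util.Properties;

public class DebugTrace {
    private final boolean debug;

    /** Reads the (hypothetical) "debug" key once; anything other than "true" disables tracing. */
    DebugTrace(Properties config) {
        this.debug = Boolean.parseBoolean(config.getProperty("debug", "false"));
    }

    /** Prints the message only when debug mode is on; returns whether it was printed. */
    boolean trace(String message) {
        if (debug) {
            System.out.println(message);
        }
        return debug;
    }

    public static void main(String[] args) {
        Properties config = new Properties();
        config.setProperty("debug", "true");
        new DebugTrace(config).trace("trace"); // prints "trace"
    }
}
```

Errors would still be reported unconditionally; only informational output goes through `trace`.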

Original issue reported on code.google.com by damien.palacio on 2012-05-25 20:02:09

Singular noun vs. cardinal number > 1 (Arabic)

I ran into a situation where a BBC article in Arabic had the phrase:

قبل 6 ساعة

Which literally means "before 6 hour".

Stanford POS tags it as NN, CD, NN. This is incorrect, the first word (on the right) is a preposition.

HeidelTime interprets this as PT1H, presumably because it read the word "hour" and, since the noun is singular, ignored the cardinal number and defaulted to 1.

With the phrase:

قبل 6 ساعات

HeidelTime correctly picks up the cardinal number and the plural "hours" and interprets it as PT6H.

There is precedent for not having to specify the singular / plural form in Arabic, though I'm told only if it comes before the cardinal number. It's possible BBC is taking some liberty with that rule or my understanding is incorrect.

Either way, do you have any comment on this? Is this something that can be taken into account?

An example of the article is:
http://www.bbc.com/arabic/middleeast/2015/07/150730_yemen_fighting (see at the top, underneath the headline).

I posted the issue with Stanford mentioned above on SO:
http://stackoverflow.com/questions/31731575/corenlp-arabic-time-duration-misses-alpha-numeric-numbers

Thanks.

HeidelTime has not found any sentence tokens in this document.

I tried to reproduce the evaluation results on WikiWars. Following the wiki, I can reproduce the same results using v2.1. However, when I followed the same steps with other versions (I tried 1.3, 1.6, 1.7, and 1.8), I received
[de.unihd.dbs.uima.annotator.heideltime.HeidelTime] HeidelTime has not found any sentence tokens in this document. HeidelTime needs sentence tokens tagged by a preprocessing UIMA analysis engine to do its work. Please check your UIMA workflow and add an analysis engine that creates these sentence tokens.
every time.
I have changed the .bash_profile accordingly.
Are there any other adjustments I should have made when setting up the experiment? Thanks a lot.

Changing the license to Apache

Due to some requests, we are currently discussing changing HeidelTime's license from
GPL to Apache. Please participate in the discussion and tell us your thoughts on
this issue:
1. GNU: http://www.gnu.org/licenses/gpl.html
2. Apache: http://www.apache.org/licenses/LICENSE-2.0.html

Original issue reported on code.google.com by jannik.stroetgen on 2012-05-24 19:31:25

Erroneous Date Recognized

What steps will reproduce the problem?
1. Run the following:

    HeidelTimeStandalone hts_sci = new HeidelTimeStandalone(Language.ENGLISH, DocumentType.NARRATIVES,
            OutputType.TIMEML);
    String f = hts_sci.process("19-Nov-12", new Date(2012,01,05), new TimeMLResultFormatter());

It should pick out 19, November, 2012. Instead this is the result produced:

<?xml version="1.0"?>
<!DOCTYPE TimeML SYSTEM "TimeML.dtd">
<TimeML>
19-<TIMEX3 tid="t0" type="DATE" value="3912-11">Nov</TIMEX3>-12
</TimeML>

What is the work-around for this?
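
The year 3912 in the output is consistent with how the deprecated `java.util.Date(int, int, int)` constructor reads its arguments: the first one is the year minus 1900 and the month is zero-based, so `new Date(2012,01,05)` denotes 5 February 3912. The sketch below shows this and one way to build the intended document creation time. The helper names `of` and `yearOf` are made up for this example. It does not address whether a rule covers the "19-Nov-12" pattern itself.

```java
import java.util.Calendar;
import java.util.Date;
import java.util.GregorianCalendar;

public class DocumentCreationTime {
    /** Builds a Date from a calendar year, a 1-based month, and a day of month. */
    static Date of(int year, int month, int day) {
        return new GregorianCalendar(year, month - 1, day).getTime();
    }

    /** Returns the calendar year a Date falls in, using the default time zone. */
    static int yearOf(Date date) {
        Calendar cal = new GregorianCalendar();
        cal.setTime(date);
        return cal.get(Calendar.YEAR);
    }

    @SuppressWarnings("deprecation")
    public static void main(String[] args) {
        // Deprecated constructor: the year argument is an offset from 1900.
        Date asInReport = new Date(2012, 1, 5);
        System.out.println(yearOf(asInReport));     // prints 3912
        System.out.println(yearOf(of(2012, 1, 5))); // prints 2012
    }
}
```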

Original issue reported on code.google.com by shriphanip on 2013-03-04 08:06:18

German compounds consisting of weekday + time of day not extracted

Hi,

running HeidelTime on news texts, I encountered a type of temporal expression that
is currently not recognized: according to the German spelling reform, combinations
of a weekday (e.g., 'Montag') and a time of day (e.g., 'abend') are written as one word.
This holds for nouns and adverbs, for instance: http://www.duden.de/rechtschreibung/Montagmorgen

HeidelTime (tested version: 1.8) currently doesn't extract these temporal expressions.
In the following sentence, only 'Mittwoch' is extracted and correctly normalized; all
other temporal expressions are missed:

"Am Montagabend hat Peter telefoniert. Am Dienstagabend auch. Am Mittwoch auch. Montagmorgens
wird er ebenfalls telefonieren."
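
To make the requested pattern concrete, here is a rough sketch in plain Java regex rather than HeidelTime's rule syntax. The word lists are illustrative assumptions and far from complete, and normalization is not attempted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GermanCompounds {
    // Weekday + time of day written as one word, with an optional adverbial "s".
    private static final Pattern COMPOUND = Pattern.compile(
            "\\b(Montag|Dienstag|Mittwoch|Donnerstag|Freitag|Samstag|Sonntag)"
            + "(morgen|vormittag|mittag|nachmittag|abend|nacht)(s?)\\b");

    /** Returns every weekday/time-of-day compound found in the text, in order. */
    static List<String> find(String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = COMPOUND.matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String text = "Am Montagabend hat Peter telefoniert. Am Dienstagabend auch. "
                + "Am Mittwoch auch. Montagmorgens wird er ebenfalls telefonieren.";
        System.out.println(find(text)); // prints "[Montagabend, Dienstagabend, Montagmorgens]"
    }
}
```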

Attached you can find the entered command and HeidelTime's output.

Maybe you can find time to add this feature at some point :)

Original issue reported on code.google.com by boegel.thomas on 2015-01-13 09:17:33

- _Attachment: [heideltime_feature_request_german_compounds.txt](https://storage.googleapis.com/google-code-attachments/heideltime/issue-25/comment-0/heideltime_feature_request_german_compounds.txt)_

Run Heideltime without config.props

Hey folks. First, I'd like to thank you for this amazing tool.

I'd like to know whether I can run my Java code using HeidelTime without config.props or the TreeTagger files, i.e., as a truly standalone application.

The problem is that the application gets tied to a local config file and, worse, to a local executable. Is there any chance you could mavenize all of this into a single deliverable?

wrong sentence boundary detection avoids matching

Input: JAN. 27, 2017 is a date.
The text is split into two sentences, which prevents JAN. 27, 2017 from being matched as a temporal expression:

"JAN."
"27, 2017 is a date."

problems to run heideltime on ubuntu

I've tried for several hours to get HeidelTime Standalone to run on my Ubuntu system, but
it still doesn't work.
I followed the how-to-use instructions in the readme file exactly. I also installed
TreeTagger from http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/ with the
whole package (the tagging scripts, the installation script, and the parameter files
for the languages I use), and I set the path to the folder containing
TreeTagger in config.props under "treeTaggerHome" (treeTaggerHome = /home/chuulio/Dokumente/TreeTagger/).
I tried HeidelTime on a German text document about Moscow, and this is what I
got:

chuulio@chuulio-UX32VD:~/Dokumente/Temporal_Annotation/Standalone$ java -jar de.unihd.dbs.heideltime.standalone.jar
/home/chuulio/Dokumente/Moskau.txt -l german -vv
Aug 26, 2014 10:47:17 PM de.unihd.dbs.heideltime.standalone.HeidelTimeStandalone main
INFO: Verbosity: '-vv'; Logging level set to ALL.
Aug 26, 2014 10:47:17 PM de.unihd.dbs.heideltime.standalone.HeidelTimeStandalone main
INFO: Encoding '-e': NOT FOUND OR RECOGNIZED; set to 'UTF-8'
Aug 26, 2014 10:47:17 PM de.unihd.dbs.heideltime.standalone.HeidelTimeStandalone main
INFO: Output '-o': NOT FOUND OR RECOGNIZED; set to TIMEML
Aug 26, 2014 10:47:17 PM de.unihd.dbs.heideltime.standalone.HeidelTimeStandalone main
INFO: Language '-l': GERMAN
Aug 26, 2014 10:47:17 PM de.unihd.dbs.heideltime.standalone.HeidelTimeStandalone main
INFO: Type '-t': NOT FOUND OR RECOGNIZED; set to NARRATIVES
Aug 26, 2014 10:47:17 PM de.unihd.dbs.heideltime.standalone.HeidelTimeStandalone main
INFO: Document Creation Time '-dct': NOT FOUND; skipping.
Aug 26, 2014 10:47:17 PM de.unihd.dbs.heideltime.standalone.HeidelTimeStandalone main
INFO: Locale '-locale': NOT FOUND, set to environment locale: de_CH
Aug 26, 2014 10:47:17 PM de.unihd.dbs.heideltime.standalone.HeidelTimeStandalone main
INFO: Configuration path '-c': config.props
Aug 26, 2014 10:47:17 PM de.unihd.dbs.heideltime.standalone.HeidelTimeStandalone readConfigFile
INFO: trying to read in file config.props
Aug 26, 2014 10:47:17 PM de.unihd.dbs.heideltime.standalone.HeidelTimeStandalone main
INFO: POS Tagger '-pos': NOT FOUND OR RECOGNIZED; set to TREETAGGER
Aug 26, 2014 10:47:17 PM de.unihd.dbs.heideltime.standalone.HeidelTimeStandalone main
INFO: Interval Tagger '-it': NOT FOUND OR RECOGNIZED; set to false
Aug 26, 2014 10:47:17 PM de.unihd.dbs.heideltime.standalone.HeidelTimeStandalone main
INFO: Reading document using charset: UTF-8
Aug 26, 2014 10:47:17 PM de.unihd.dbs.heideltime.standalone.HeidelTimeStandalone initialize
INFO: HeidelTimeStandalone initialized with language german
Aug 26, 2014 10:47:17 PM de.unihd.dbs.heideltime.standalone.HeidelTimeStandalone initialize
INFO: HeidelTime initialized
Aug 26, 2014 10:47:17 PM de.unihd.dbs.heideltime.standalone.HeidelTimeStandalone initialize
INFO: JCas factory initialized
Aug 26, 2014 10:47:17 PM de.unihd.dbs.heideltime.standalone.HeidelTimeStandalone process
INFO: Processing started
[de.unihd.dbs.uima.annotator.heideltime.HeidelTime] HeidelTime has not found any sentence
tokens in this document. HeidelTime needs sentence tokens tagged by a preprocessing
UIMA analysis engine to do its work. Please check your UIMA workflow and add an analysis
engine that creates these sentence tokens.
Aug 26, 2014 10:47:19 PM de.unihd.dbs.heideltime.standalone.HeidelTimeStandalone process
INFO: Processing finished
Aug 26, 2014 10:47:19 PM de.unihd.dbs.heideltime.standalone.HeidelTimeStandalone process
INFO: Result formatted
<?xml version="1.0"?>
<!DOCTYPE TimeML SYSTEM "TimeML.dtd">
<TimeML>

Moskau (russisch Москва́ Zum Anhören bitte klicken! [mɐˈskva], Moskwa) ist die Hauptstadt
der Russischen Föderation und mit rund 11,55 Millionen Einwohnern (Stand 14. Oktober
2010)[1] die größte Stadt bzw. mit 15,1 Millionen (2012)[2] die größte Agglomeration
Europas. Am 1. Juli 2012 wurde Moskau durch Eingemeindung der beiden Verwaltungsbezirke
Nowomoskowski und Troizk im Südwesten der Stadt auf Kosten der Moskauer Oblast um 1480
km², d. h. um das 1,39-Fache, auf 2550 km² vergrößert. Durch die Eingliederung wuchs
die Moskauer Bevölkerung um etwa 235.000 Menschen.

Moskau ist das politische, wirtschaftliche und kulturelle Zentrum des Landes mit Hochschulen
und Fachschulen sowie zahlreichen Kirchen, Theatern, Museen, Galerien und dem 540 Meter
hohen Ostankino-Turm. Moskau ist Sitz der Russisch-Orthodoxen Kirche: Der Patriarch
residiert im Danilow-Kloster, das größte russisch-orthodoxe Kirchengebäude ist die
Moskauer Christ-Erlöser-Kathedrale. Es gibt im Stadtgebiet von Moskau über 300 Kirchen.[3]
Seit dem 16. Jahrhundert wird Moskau auch als Drittes Rom bezeichnet. Nach Ende des
Zweiten Weltkriegs erhielt Moskau die Auszeichnung einer „Heldenstadt“.

Der Kreml und der Rote Platz im Zentrum Moskaus stehen seit 1990 auf der UNESCO-Liste
des Weltkulturerbes. Mit acht Fernbahnhöfen, drei internationalen Flughäfen und drei
Binnenhäfen ist die Stadt wichtigster Verkehrsknoten und größte Industriestadt Russlands.

Geschichte
Ursprung
Denkmal für den Stadtgründer Juri Dolgoruki

Eine der Sagen kündet davon, dass der Fürst Juri Dolgoruki (1090–1157) im Land der
Wjatitschen eine hölzerne Stadt zu errichten befahl, und dass diese Stadt nach dem
Fluss benannt wurde, an dessen Ufern sie emporwuchs. Die erste schriftliche Erwähnung
Moskaus stammt aus dem Jahre 1147, das darum als das Gründungsjahr Moskaus gilt. Doch
schon lange davor gab es an der Stelle, wo heute Moskau steht, menschliche Niederlassungen.
Archäologische Ausgrabungen bezeugen, dass die ältesten von ihnen vor etwa 5000 Jahren
entstanden waren.

Um 1156 entstand eine erste, noch hölzerne Wehranlage des Kremls, in deren Schutz sich
der Marktflecken allmählich zu einer beachtlichen Ansiedlung entwickelte. Im Jahre
1238 ist die Stadt von den Mongolen erobert und niedergebrannt worden. 1263 wurde das
Umland zu einem Teilfürstentum im Großfürstentum Wladimir-Susdal, wenig später unter
Fürst Daniel ein eigenständiges Fürstentum. In der ersten Hälfte des 14. Jahrhunderts
– die Stadt zählte mittlerweile 30.000 Einwohner – erkannte der tatarische Großkhan
den Moskauer Großfürsten als (ihm allerdings tributpflichtiges) Oberhaupt von Russland
an.

Der Sieg über die Tataren in der Schlacht von Kulikowo am 8. September 1380, angeführt
durch den Moskauer Großfürsten Dmitri Donskoi, befreite zwar nicht von der Hegemonie
der Goldenen Horde (1382 wurde Moskau sogar abermals niedergebrannt und geplündert),
doch die Stadt festigte dadurch ihr politisches und militärisches Ansehen erheblich
und gewann mithin beständig an wirtschaftlicher Macht. 1480 konnte sie die Tatarenherrschaft
endgültig abschütteln und wurde zur Hauptstadt des russischen Reiches.

Der seit 1462 regierende Großfürst von Moskau Iwan III., der Große (1440–1505), heiratete
1472 die byzantinische Prinzessin Sofia (Zoe) Palaiologos, eine Nichte des letzten
oströmischen Kaisers Konstantin XI. Palaiologos, und übernahm von dort die autokratische
Staatsidee und ihre Symbole: den Doppeladler und das Hofzeremoniell. Seither gilt Moskau
als „Drittes Rom“ und Hort der Orthodoxie.
Moskau wird Großstadt
Moskau am Ende des 17. Jahrhunderts

In den beiden letzten Jahrzehnten des 15. Jahrhunderts begann der Ausbau des Kreml,
in dessen Umkreis sich nun in großer Zahl Handwerker und Kaufleute niederließen. Die
Einwohnerzahl stieg bald darauf auf mehr als 100.000, so dass um 1600 eine Ringmauer
um Moskau und eine Erdverschanzung hinzukamen, die die blühende Stadt fortan nach außen
abschirmten. 1571 war sie ein letztes Mal von den Tataren heimgesucht worden, als die
überwiegend aus Holz gebaute Stadt abbrannte. Bereits ein Jahr später war die Tatarengefahr
in der Schlacht von Molodi südlich von Moskau aber endgültig gebannt. In der Zeit der
Wirren, die durch unklare Thronfolgeverhältnisse ausgelöst wurde, rückten polnische
Truppen in die Stadt und versuchten, eigene Marionetten zu installieren. Eine Volksarmee
aus Nischni Nowgorod belagerte die Polen jedoch im Moskauer Kreml und zwang sie zur
Kapitulation. Diese Ereignisse ebneten den Weg für die Romanow-Dynastie auf den russischen
Thron.

Während die ersten Tuch-, Papier- und Ziegelmanufakturen, Glasfabriken und Pulvermühlen
entstanden, kulminierten die sozialen Gegensätze des Großreiches: 1667 erhoben sich
die Bauern im Wolga- und Dongebiet gegen die wachsende Unterdrückung, ihr Führer, Stepan
Rasin, wurde 1671 auf dem Roten Platz in Moskau hingerichtet. Im Jahre 1687 ist die
erste Hochschule Russlands, die „Slawisch-Griechische Akademie“ eröffnet worden, 1703
erschien die erste gedruckte russische Zeitung „Wedomosti“. Im Jahre 1712 ging unter
Zar Peter dem Großen (1672–1725) das Privileg der Hauptstadt auf das neu gegründete
Sankt Petersburg über, aber Moskau blieb das wirtschaftliche und geistig-kulturelle
Zentrum des Landes. 1755 wurde in Moskau mit der heutigen Lomonossow-Universität die
erste russische Universität eröffnet.
Der Brand von Moskau vor der Einnahme der Stadt durch Napoleon 1812
Twerskaja-Straße im 19. Jahrhundert

Mit dem Moskau des 18. Jahrhunderts ist das Schaffen hervorragender russischer Schriftsteller
und Dichter verknüpft wie Alexander Sumarokow, Denis Fonwisin, Nikolai Karamsin und
vieler anderer. In Moskau trat der große russische Gelehrte Michail Lomonossow seinen
Weg in die Wissenschaft an. Auch in späteren Zeiten lebten und wirkten in Moskau viele
berühmte russische Schriftsteller und Dichter, Wissenschaftler und Künstler, die durch
ihr Schaffen nicht nur zur russischen, sondern auch zur Weltkultur einen immensen Beitrag
geleistet haben.

Im Vaterländischen Krieg von 1812, als Napoleon Bonaparte (1769–1821) mit seiner „Großen
Armee“ auf Moskau zumarschierte, verlor die Stadt in einem Flächenbrand – die Bewohner
zündeten ihre Häuser an und flohen aus der Stadt – zwei Drittel ihrer Bausubstanz.
Aber in Moskau kam die französische Armee zum Stehen, hier wurde sie wegen Hunger und
Kälte zur Umkehr gezwungen, die mit ihrem Untergang endete.

Der im Frühjahr 1813 einsetzende großstilige Wieder- und Neuaufbau sprengte rasch den
alten städtischen Verteidigungsring und verschaffte der Stadt von der Mitte des 19.
Jahrhunderts an durch zügigen Straßen- und Bahnstreckenbau Anschluss an die wichtigsten
Städte des Landes. 1890 fuhren die ersten elektrischen Straßenbahnen; die erste Volkszählung
des Landes fand am 28. Januar 1897 statt, die Bevölkerung der Stadt war auf etwa eine
Milli


</TimeML>

java and ubuntu version:

chuulio@chuulio-UX32VD:~/Dokumente/Temporal_Annotation/Standalone$ lsb_release -a &&
java -version
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 12.10
Release:    12.10
Codename:   quantal
java version "1.7.0_25"
OpenJDK Runtime Environment (IcedTea 2.3.10) (7u25-2.3.10-1ubuntu0.12.10.2)
OpenJDK 64-Bit Server VM (build 23.7-b01, mixed mode)


any idea?

Original issue reported on code.google.com by julien.peter.thoma on 2014-08-26 20:54:12

"second half" matches too broad.

In particular, it occurs frequently in sports (and thus, in Wikipedia, news articles, books, ...).

E.g. https://en.wikipedia.org/wiki/1958_FIFA_World_Cup_Final

The 1958 FIFA World Cup Final took place in Råsunda Stadium, Solna (near Stockholm), Sweden on 29 June 1958. [...]
Sweden took the lead after only 4 minutes after an excellent finish by captain Nils Liedholm.
The lead didn't last long however as Vavá equalised just 5 minutes later.
On 32 minutes, Vavá scored a similar goal to his first to give Brazil a lead 2–1 at the break.
10 minutes into the second half, Brazil went further in front thanks to a brilliant goal scored by Pelé.

Here, "second half" will be interpreted as 1958-H2. I suggest disabling rule date_r10b because of these false positives. Also, "after only 4 minutes", "on 32 minutes", and "10 minutes into" should probably be relative time references rather than durations.

Also, e.g. in Wikipedia "Karl May":

May's first translated work was the first half of the Orient Cycle into a French daily in 1881.

Suggested rule updates for the time references:

RULENAME="date_r20a",EXTRACTION="\b(?:%reApproximate )?(?:several|a couple of|some|a few|many) %reUnitFine (?:later|into)",NORM_VALUE="FUTURE_REF"
RULENAME="date_r20b",EXTRACTION="\b(?:%reApproximate )?%reNumWord12D %reUnitFine (?:later|into)",NORM_VALUE="UNDEF-REF-%normUnit(group(3))-PLUS-%normDurationNumber(group(2))",NORM_MOD="%normApprox4Dates(group(1))"
RULENAME="date_r20c",EXTRACTION="\b(?:%reApproximate )?(\d+) %reUnitFine (?:later|into)",NORM_VALUE="UNDEF-REF-%normUnit(group(3))-PLUS-group(2)",NORM_MOD="%normApprox4Dates(group(1))"
RULENAME="date_r20d",EXTRACTION="\b(?:%reApproximate )?an? %reUnitFine (?:later|into)",NORM_VALUE="UNDEF-REF-%normUnit(group(2))-PLUS-1",NORM_MOD="%normApprox4Dates(group(1))"
RULENAME="date_r20e",EXTRACTION="\brecent %reUnit",NORM_VALUE="PAST_REF"
RULENAME="date_r20f",EXTRACTION="\b[Oo]n (?:%reApproximate )?(\d+) %reUnitFine",NORM_VALUE="UNDEF-REF-%normUnit(group(3))-PLUS-group(2)",NORM_MOD="%normApprox4Dates(group(1))"
RULENAME="date_r20g",EXTRACTION="\b[Oo]n (?:%reApproximate )?%reNumWord12D %reUnitFine",NORM_VALUE="UNDEF-REF-%normUnit(group(3))-PLUS-%normDurationNumber(group(2))",NORM_MOD="%normApprox4Dates(group(1))"

POS matching

It should be possible to use a regex to specify a POS_CONSTRAINT.

The POS tagger I am using provides morphological information, hence to detect a plural
(either Smp or Sfp), I need to use S.p

It is enough to change checkPosConstraint to use:

     if (pos_as_is.matches(pos))
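
A quick, self-contained check of what `String.matches` does with such a constraint. The method below is a simplified stand-in; its signature is an assumption and not HeidelTime's actual `checkPosConstraint`.

```java
public class PosConstraint {
    /** Treats the constraint as a regular expression that must match the whole tag. */
    static boolean checkPosConstraint(String actualTag, String constraint) {
        return actualTag.matches(constraint);
    }

    public static void main(String[] args) {
        System.out.println(checkPosConstraint("Smp", "S.p")); // prints "true"
        System.out.println(checkPosConstraint("Sfp", "S.p")); // prints "true"
        System.out.println(checkPosConstraint("Sms", "S.p")); // prints "false"
        System.out.println(checkPosConstraint("NN", "NN"));   // prints "true"
    }
}
```

Plain alphanumeric tags keep behaving as before, since a literal pattern matches itself. Tags that contain regex metacharacters (for example a `$`) would need escaping, e.g. with `Pattern.quote`, once constraints are interpreted as patterns.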

Thanks.

Original issue reported on code.google.com by attardi on 2014-10-20 11:26:48

Sharing resources for Russian

Hi!
I am a student of computational linguistics, and I am writing my master's thesis on
temporal expression for Russian. In my work, I use Heideltime, and I would like to
make my Russian resources publicly available. Do you think I could make a commit to
the main Heideltime development trunk?

My changes mainly consist of creating new resources for Russian, and I have also added
a function to calculate some Russian holidays.

Original issue reported on code.google.com by elenaklyachko on 2014-04-04 21:24:41

Inflected variants of "ein"(einer, einem) not recognized

What steps will reproduce the problem?
1. goto http://heideltime.ifi.uni-heidelberg.de/heideltime/
2. put strings like "mit einer Frist von einem Monat zum Monatsende." "in einer Woche."
into the input field
3. Press "Compute" 

What is the expected output? What do you see instead?
It should detect "einem Monat" and "einer Woche" as time periods,
but it does not.

What version of the product are you using? On what operating system?
online demo

Original issue reported on code.google.com by [email protected] on 2013-07-04 13:59:17

HeidelTime on Maven Central

It would be nice to be able to get the library from Maven Central.
Other users would also appreciate it (see e.g. this question)

Difference between UNDEF-REF- and UNDEF-REFUNIT-?

Can't we just remove REFUNIT and replace it with REF?
REFUNIT currently may only be used with year, and I don't fully understand the difference. Also this vs. REF? The documentation I found was not very clear on this difference.

As far as I can tell, REFUNIT-year and REF-year/this-year will differ in their "fuzziness". So 2017-06-07 will be mapped to 2016-06-07 or to 2016. But overall, this fuzziness is not consistently used. Usually, there exists UNDEF-this-<unit>-PLUS-1 and UNDEF-next-<unit>, which may or may not do the same thing, and additionally UNDEF-REFUNIT-year-PLUS-1 for year only...

Of course there is no golden rule whether "in a week" refers to exactly 7 days, or the week; and we may need to use heuristics (e.g. "next week" usually being the entire week, "in a week" usually being 7 days, but "in two weeks" being an entire week again), but they should be made explicit.
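
The two readings of "one year before 2017-06-07" mentioned above can be illustrated with java.time. This is only a sketch of the difference in granularity; which UNDEF- prefix should map to which reading is exactly the open question here, so the method names below are neutral and hypothetical:

```java
import java.time.LocalDate;
import java.time.Year;

// Illustrative only: these helpers are not HeidelTime code.
class RefGranularitySketch {
    // Day-precise reading: shift the full reference date by one year.
    static String exactYearBefore(LocalDate ref) {
        return ref.minusYears(1).toString();
    }

    // Unit-level reading: keep only the year of the reference date, then shift.
    static String unitYearBefore(LocalDate ref) {
        return Year.from(ref).minusYears(1).toString();
    }

    public static void main(String[] args) {
        LocalDate ref = LocalDate.of(2017, 6, 7);
        System.out.println(exactYearBefore(ref)); // keeps month and day
        System.out.println(unitYearBefore(ref));  // keeps the year only
    }
}
```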

next friday afternoon Problem

Hi.
HeidelTime is awesome.

I have a small problem.

version: 2.1 standalone
document type: News
base time: 2016-10-19

These inputs work well:

input: next friday
output: next friday
input: friday afternoon
output: friday afternoon

But:

input: next friday afternoon
output: next friday afternoon

Here "next" is not tagged.

How can I get the whole expression tagged together,
i.e. "next friday afternoon"?

wrong normalization

English rule date_r0f: RULENAME="date_r0f",EXTRACTION="%reYear4Digit%reMonthNumber%reDayNumber",NORM_VALUE="group(1)-group(2)-group(3)"
results in wrong normalizations, e.g., 2882015 --> 2882-01-5

In general, the rule should be more context-sensitive as it produces many false positives!
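
A sketch of how the false positive can arise and how boundaries plus fixed-width fields would avoid it. The sub-patterns below are assumed, simplified stand-ins for %reYear4Digit, %reMonthNumber and %reDayNumber; the actual English resources may be defined differently:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: the patterns are assumptions, not the HeidelTime resources.
class CompactDateSketch {
    static final String YEAR = "(\\d{4})";

    // Loose variant: one-digit months and days allowed, no word boundaries.
    static final Pattern LOOSE =
            Pattern.compile(YEAR + "(0?[1-9]|1[0-2])(0?[1-9]|[12]\\d|3[01])");

    // Stricter variant: two-digit month and day, anchored by word boundaries.
    static final Pattern STRICT =
            Pattern.compile("\\b" + YEAR + "(0[1-9]|1[0-2])(0[1-9]|[12]\\d|3[01])\\b");

    // Mimics NORM_VALUE="group(1)-group(2)-group(3)"; returns null if nothing matches.
    static String normalize(Pattern pattern, String text) {
        Matcher m = pattern.matcher(text);
        return m.find() ? m.group(1) + "-" + m.group(2) + "-" + m.group(3) : null;
    }

    public static void main(String[] args) {
        System.out.println(normalize(LOOSE, "2882015"));   // reproduces the reported value
        System.out.println(normalize(STRICT, "2882015"));  // no match
        System.out.println(normalize(STRICT, "20150828")); // well-formed compact date
    }
}
```

Even the stricter variant would still accept any eight-digit number that happens to look like a date, so additional context checks would probably still be needed.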

NullPointerException in TreeTaggerWrapper

What steps will reproduce the problem?
1. Simple test using the TreeTaggerWrapper (the environment is well configured with
the resources from here: http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/):

import java.util.Calendar;
import java.util.Date;

import de.unihd.dbs.heideltime.standalone.DocumentType;
import de.unihd.dbs.heideltime.standalone.HeidelTimeStandalone;
import de.unihd.dbs.heideltime.standalone.OutputType;
import de.unihd.dbs.heideltime.standalone.POSTagger;
import de.unihd.dbs.heideltime.standalone.components.impl.TimeMLResultFormatter;
import de.unihd.dbs.heideltime.standalone.exceptions.DocumentCreationTimeMissingException;
import de.unihd.dbs.uima.annotator.heideltime.resources.Language;

public class HeidelTest {

    public static void main(String[] args)
        throws DocumentCreationTimeMissingException {

    final String configProps = "/config.props";

    String configPath = HeidelTest.class.getResource(configProps).getFile();

    HeidelTimeStandalone heidel = new HeidelTimeStandalone(Language.FRENCH,
        DocumentType.NEWS, OutputType.TIMEML, configPath,
        POSTagger.TREETAGGER);

    Date documentCreationTime = Calendar.getInstance().getTime();

    String result = heidel.process(
"samedi 13 décembre 2014 à 20h00.",
        documentCreationTime, new TimeMLResultFormatter());

    System.out.println(result);
    }

}


What is the expected output? What do you see instead?

java.lang.NullPointerException
    at de.unihd.dbs.uima.annotator.treetagger.TreeTaggerWrapper.doTreeTag(TreeTaggerWrapper.java:494)
    at de.unihd.dbs.uima.annotator.treetagger.TreeTaggerWrapper.process(TreeTaggerWrapper.java:227)
    at de.unihd.dbs.heideltime.standalone.components.impl.TreeTaggerWrapper.process(TreeTaggerWrapper.java:43)
    at de.unihd.dbs.heideltime.standalone.HeidelTimeStandalone.establishPartOfSpeechInformation(HeidelTimeStandalone.java:388)
    at de.unihd.dbs.heideltime.standalone.HeidelTimeStandalone.establishHeidelTimePreconditions(HeidelTimeStandalone.java:336)
    at de.unihd.dbs.heideltime.standalone.HeidelTimeStandalone.process(HeidelTimeStandalone.java:481)
    at com.glue.feed.html.demo.dates.HeidelTest.main(HeidelTest.java:37)




What version of the product are you using? On what operating system?
HeidelTime 1.8 on Ubuntu 64

Please provide any additional information below.

Looking at the code in de.unihd.dbs.uima.annotator.treetagger.TreeTaggerWrapper class,
line 494:

 if ((!(token.getPos().equals(null))) &&
         (token.getPos().equals("EMPTYLINE"))){
         token.removeFromIndexes();
         }

token.getPos() cannot be compared to null by calling the equals method on it if it
is actually null...


So, the previous portion of code should be replaced by:

if (token.getPos() != null
            && token.getPos().equals("EMPTYLINE")) {
            token.removeFromIndexes();
        }
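
An equivalent null-safe variant, shown here as a standalone sketch, calls equals on the constant instead of on the possibly-null value. The class and method names are hypothetical and only the string "EMPTYLINE" is taken from the snippet above:

```java
// Illustrative only: isEmptyLine is a hypothetical helper, not part of TreeTaggerWrapper.
class NullSafeEqualsSketch {
    static boolean isEmptyLine(String pos) {
        // equals is invoked on the literal, so pos is never dereferenced when it is null.
        return "EMPTYLINE".equals(pos);
    }

    public static void main(String[] args) {
        System.out.println(isEmptyLine("EMPTYLINE"));
        System.out.println(isEmptyLine("NN"));
        System.out.println(isEmptyLine(null)); // no NullPointerException
    }
}
```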

Original issue reported on code.google.com by pascalgill3t on 2014-12-14 10:24:35

Installation issue

I'm trying to run heideltime on the commandline as follows: 

C:\Home\heideltime-standalone-1.4>java -jar de.unihd.dbs.heideltime.standalone.jar test.txt -vv 

jul 26, 2013 11:31:23 AM de.unihd.dbs.heideltime.standalone.HeidelTimeStandalone main
INFO: Verbosity: '-vv'; Logging level set to ALL.
jul 26, 2013 11:31:23 AM de.unihd.dbs.heideltime.standalone.HeidelTimeStandalone main
INFO: Encoding '-e': NOT FOUND OR RECOGNIZED; set to 'UTF-8'
jul 26, 2013 11:31:23 AM de.unihd.dbs.heideltime.standalone.HeidelTimeStandalone main
INFO: Output '-o': NOT FOUND OR RECOGNIZED; set to TIMEML
jul 26, 2013 11:31:23 AM de.unihd.dbs.heideltime.standalone.HeidelTimeStandalone main
INFO: Language '-l': NOT FOUND; set to ENGLISH
jul 26, 2013 11:31:23 AM de.unihd.dbs.heideltime.standalone.HeidelTimeStandalone main
INFO: Type '-t': NOT FOUND OR RECOGNIZED; set to NARRATIVES
jul 26, 2013 11:31:23 AM de.unihd.dbs.heideltime.standalone.HeidelTimeStandalone main
INFO: Document Creation Time '-dct': NOT FOUND; skipping.
jul 26, 2013 11:31:23 AM de.unihd.dbs.heideltime.standalone.HeidelTimeStandalone main
INFO: Locale '-locale': NOT FOUND, set to environment locale: en_US
jul 26, 2013 11:31:23 AM de.unihd.dbs.heideltime.standalone.HeidelTimeStandalone main
INFO: Configuration path '-c': config.props
jul 26, 2013 11:31:23 AM de.unihd.dbs.heideltime.standalone.HeidelTimeStandalone readConfigFile
INFO: trying to read in file config.props
jul 26, 2013 11:31:23 AM de.unihd.dbs.heideltime.standalone.HeidelTimeStandalone main
INFO: Reading document using charset: UTF-8
jul 26, 2013 11:31:23 AM de.unihd.dbs.heideltime.standalone.HeidelTimeStandalone initialize
INFO: HeidelTimeStandalone initialized with language english
Jul 26, 2013 11:31:23 AM de.unihd.dbs.heideltime.standalone.HeidelTimeStandalone initialize
INFO: HeidelTime initialized
Jul 26, 2013 11:31:23 AM de.unihd.dbs.heideltime.standalone.HeidelTimeStandalone initialize
INFO: JCas factory initialized
Jul 26, 2013 11:31:23 AM de.unihd.dbs.heideltime.standalone.HeidelTimeStandalone process
INFO: Processing started
Jul 26, 2013 11:31:24 AM de.unihd.dbs.heideltime.standalone.HeidelTimeStandalone process
INFO: Processing finished
Jul 26, 2013 11:31:24 AM de.unihd.dbs.heideltime.standalone.HeidelTimeStandalone process
INFO: Result formatted

INFO: Result formatted
<?xml version="1.0"?>
<!DOCTYPE TimeML SYSTEM "TimeML.dtd">
<TimeML>

∩╗┐Upcoming UIMA Seminars

April 7, 2004 Distillery Lunch Seminar
UIMA and its Metadata
12:00PM-1:00PM in HAW GN-K35.

Dave Ferrucci will give a UIMA overview and discuss the types of component metadata
that UIMA components provide.  Jon Lenchner will give a demo of the Text Analysis
Engine configurator tool.


April 16, 2004 KM & I Department Tea
Title: An Eclipse-based TAE Configurator Tool
3:00PM-4:30PM in HAW GN-K35 .

Jon Lenchner will demo an Eclipse plugin for configuring TAE descriptors, which
will be available soon for you to use.  No more editing XML descriptors by hand!



May 11, 2004 UIMA Tutorial
9:00AM-5:00PM in HAW GN-K35.

This is a full-day, hands-on tutorial on UIMA, covering the development of Text
Analysis Engines and Collection Processing Engines, as well as how to include these
components in your own applications.
</TimeML>



What version of the product are you using? On what operating system?
I am using the heideltime-standalone-1.4 on Windows. 

Please provide any additional information below.
No dates are tagged in the TimeML file, and I don't get any error message. It's the
first time I'm trying to use HeidelTime, so probably something went wrong in the installation,
but without an error message it's hard to fix. Any ideas what it could be?
Thanks!

Original issue reported on code.google.com by riannekaptein on 2013-07-26 09:37:48

StanfordPOSTaggerWrapper model path

Hi,

The initialize method of StanfordPOSTaggerWrapper class tests whether the file denoted
by model_path exists, and then attempts to instantiate a MaxentTagger object with it.
The Javadoc for the MaxentTagger constructor says that the modelFile parameter can
be interpreted as a URL if it starts with "https?://" or can be loaded directly from
the classpath as in "com/example/models/model.tagger".

I put my model file in my project's classpath (and I configured my config.props according
to this resource path). HeidelTime fails because of the check that the pathname exists.
If I remove this check, it works like a charm.

It would be nice to reflect the MaxentTagger specification in StanfordPOSTaggerWrapper.
I think StanfordPOSTaggerWrapper should only check that model_path is not null, and
should leave the responsibility of the other checks to MaxentTagger. What do you think?

Thank you for this great library and the work you have done so far!

Original issue reported on code.google.com by pascalgill3t on 2015-01-17 13:56:49

Dependency to TreeTagger / other POS taggers

We would like to include HeidelTime in the DBpedia information extraction framework (see dbpedia/extraction-framework#447 & dbpedia/extraction-framework#446).

A contributor tried to work on this integration but also added a lot of TreeTagger files into the repo,
as requested in your documentation PDF. We would like to avoid that if possible, so we'd like to ask:

Can we use another POS tagger instead (ideally one available as a Maven dependency)?
If yes: DBpedia parses text from Wikipedia only, but in many languages. Which tagger would you recommend in this case?

Thanks in advance for your help and guidance.

"Charset mismatch" when running the standalone version under Ubuntu

I have previously used HeidelTime's standalone version on Mac OSX and on a Lubuntu machine
without any problem. 
Today I tried it on Ubuntu (running in VirtualBox) and can't get rid of this error:


usr@usr-VirtualBox:~/Downloads/Temporal_Annotation_Initial/Standalone$ java -jar de.unihd.dbs.heideltime.standalone.initial.jar
/home/usr/Temporal_Annotation/lill_sample.txt -l german
Error: Unable to access jarfile de.unihd.dbs.heideltime.standalone.initial.jar
usr@usr-VirtualBox:~/Downloads/Temporal_Annotation_Initial/Standalone$ java -jar de.unihd.dbs.heideltime.standalone.jar
/home/usr/Temporal_Annotation/lill_sample.txt -l german
java.lang.RuntimeException: Opps! Could not find token fï¿½rbringen in JCas after tokenizing
with TreeTagger. Hmm, there may exist a charset missmatch! Default encoding is UTF-8
and should always be UTF-8 (use -Dfile.encoding=UTF-8). If input document is not UTF-8
use -e option to set it according to the input, additionally.
    at de.unihd.dbs.uima.annotator.treetagger.TreeTaggerWrapper.tokenize(TreeTaggerWrapper.java:262)
    at de.unihd.dbs.uima.annotator.treetagger.TreeTaggerWrapper.process(TreeTaggerWrapper.java:221)
    at de.unihd.dbs.heideltime.standalone.components.impl.TreeTaggerWrapper.process(TreeTaggerWrapper.java:43)
    at de.unihd.dbs.heideltime.standalone.HeidelTimeStandalone.establishPartOfSpeechInformation(HeidelTimeStandalone.java:388)
    at de.unihd.dbs.heideltime.standalone.HeidelTimeStandalone.establishHeidelTimePreconditions(HeidelTimeStandalone.java:336)
    at de.unihd.dbs.heideltime.standalone.HeidelTimeStandalone.process(HeidelTimeStandalone.java:481)
    at de.unihd.dbs.heideltime.standalone.HeidelTimeStandalone.process(HeidelTimeStandalone.java:430)
    at de.unihd.dbs.heideltime.standalone.HeidelTimeStandalone.main(HeidelTimeStandalone.java:728)
<?xml version="1.0"?>
<!DOCTYPE TimeML SYSTEM "TimeML.dtd">
<TimeML>

178. Frondienst der Wurmsbacher Lehenleute 
1605 <TIMEX3 tid="t4" type="DATE" value="XXXX-06-06">Juni 6</TIMEX3>. Rapperswil 
Uf fürbringen unnd clagen der frawen äbbtißin unnd convent des würdigen gozhuß Wurmbspach
gegen und wider jre lehenlüthen zu Wagen, jm Buech und in der Auw: Das dieselbigen
vermeinen, die wyl die fraw andere jres gozhuß güetter verlichen, also dz sy keiner
acherlüthen nit mer mangelbar und aber so sy die behallten, jnnen die ächer zebuwen
12 tag schuldig und sonnsten nit.
</TimeML>

It annotates the first occurrence of a temporal expression, and then stops... 
My input files are in UTF8:

usr@usr-VirtualBox:~/Temporal_Annotation$ file lill_sample.txt 
lill_sample.txt: UTF-8 Unicode text, with very long lines

I also tried to indicate -Dfile.encoding=UTF-8, as the error message says, but it doesn't
help a bit...

Original issue reported on code.google.com by natako88 on 2014-07-28 14:27:00

Problem with the dot

Back when HeidelTime 1.8 was current, I added some rules to the French resources to handle numeric dates such as "27.03.15", "27.03.2015" or "21.03" with a dot "." separator. This type of date is quite common in French, at least on websites.

I was also able to annotate other dates like "sam 3 jan" with abbreviated days and months.

My changes used to work, but they no longer do with the latest version 2.0.1.

Can you help me? Is this related to the new autogenerated language resources?

I attached my modified rules in a zip file that I renamed with the JPG extension, because otherwise I do not have permission to upload it to this repo...

Thank you,

Pascal

HeidelTime demo detects age references as date

I copied and pasted the text of the article found here:

http://www.cnn.com/2015/07/15/europe/germany-nazi-death-camp-verdict/?iid=ob_article_footer_expansion&iref=obnetwork

into the demo and tried it with all 4 "Document Types". It detected a reference to age (written in an informal manner) as a reference to a date.

For example, in the sentence "Groening, who's in his 90s, ...", "90s" was detected as TYPE: DATE and VALUE: 199.

When the age is written more formally (as one would expect from a professional, mainstream news source), as in "Groening, who is over 90 years old" (or something similar), it doesn't detect anything (as I assume it shouldn't). However, the informal form is too common a way of expressing age to be left unhandled, I would think.

NumberFormatException while parsing string as integer

int diff = Integer.parseInt(mr.group(5)); (line 899) results in NumberFormatException for long numbers in expressions such as "10000000000000 years ago"
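
A minimal sketch of one way to guard such a parse, assuming the rule could simply be skipped when the number does not fit into an int. The helper name parseIntOrEmpty is hypothetical and not part of HeidelTime:

```java
import java.util.OptionalInt;

// Illustrative only: parseIntOrEmpty is a hypothetical helper.
class SafeIntParseSketch {
    // Returns an empty OptionalInt when the text is not a valid int, e.g. because it overflows.
    static OptionalInt parseIntOrEmpty(String digits) {
        try {
            return OptionalInt.of(Integer.parseInt(digits));
        } catch (NumberFormatException e) {
            return OptionalInt.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(parseIntOrEmpty("5").isPresent());
        System.out.println(parseIntOrEmpty("10000000000000").isPresent());
    }
}
```

Whether an out-of-range offset should be dropped, clamped or handled with a wider numeric type is a design decision this sketch does not settle.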

"Standalone" is not standalone

What steps will reproduce the problem?
1. Download Heideltime
2. Try to run example on the front page

What is the expected output? 

A date.

What do you see instead?

stuff about perl

ryan@3G08:~/Downloads/heideltime-standalone-1.3$ pwd
/home/ryan/Downloads/heideltime-standalone-1.3
ryan@3G08:~/Downloads/heideltime-standalone-1.3$ cat cat.txt 
Jannik Strötgen, Julian Zell, and Michael Gertz: HeidelTime: Tuning English and Developing
Spanish Resources for TempEval-3. In SemEval13, 15-19, 2013
ryan@3G08:~/Downloads/heideltime-standalone-1.3$ java -jar de.unihd.dbs.heideltime.standalone.jar
-t news cat.txt 
[de.unihd.dbs.uima.annotator.treetagger.TreeTaggerWrapper] File missing to use TreeTagger
tokenizer: english-abbreviations
[de.unihd.dbs.uima.annotator.treetagger.TreeTaggerWrapper] File missing to use TreeTagger
tokenizer: english.par
[de.unihd.dbs.uima.annotator.treetagger.TreeTaggerWrapper] File missing to use TreeTagger
tokenizer: utf8-tokenize.perl
[de.unihd.dbs.uima.annotator.treetagger.TreeTaggerWrapper] 
Cannot find tree tagger (SET ME IN CONFIG.PROPS!/cmd/utf8-tokenize.perl). Make sure
that path to tree tagger is set correctly in config.props!
[de.unihd.dbs.uima.annotator.treetagger.TreeTaggerWrapper] 
If path is set correctly:

[de.unihd.dbs.uima.annotator.treetagger.TreeTaggerWrapper] Maybe you need to download
the TreeTagger tagger-scripts.tar.gz
[de.unihd.dbs.uima.annotator.treetagger.TreeTaggerWrapper] from ftp://ftp.ims.uni-stuttgart.de/pub/corpora/tagger-scripts.tar.gz
[de.unihd.dbs.uima.annotator.treetagger.TreeTaggerWrapper] Extract this file and copy
the missing file into the corresponding TreeTagger directories.
[de.unihd.dbs.uima.annotator.treetagger.TreeTaggerWrapper] If missing, copy english-abbreviations
into SET ME IN CONFIG.PROPS!/lib
[de.unihd.dbs.uima.annotator.treetagger.TreeTaggerWrapper] If missing, copy english.par
into SET ME IN CONFIG.PROPS!/lib
[de.unihd.dbs.uima.annotator.treetagger.TreeTaggerWrapper] If missing, copy utf8-tokenize.perl
into SET ME IN CONFIG.PROPS!/cmd
ryan@3G08:~/Downloads/heideltime-standalone-1.3$ 



What version of the product are you using? On what operating system?

standalone.jar

Please provide any additional information below.

Original issue reported on code.google.com by compton.ryan on 2013-06-21 02:02:31

Strange online result

What steps will reproduce the problem?
1. Browse to http://heideltime.ifi.uni-heidelberg.de/heideltime/
2. Enter the String "02/08/2012 4:49 PM" (w/o ") into the form field "Input"
3. Press "Compute"

What is the expected output?
complete string underlined w/ normalized date 2012-02-08 16:49:00

What do you see instead?
underlined "02/08/2012 4:49pan> PM" w/ normalized date 2012-02-08 4:49:00
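
For reference, the expected 24-hour value can be reproduced with java.time. This sketch assumes the US month/day order and an AM/PM marker in Locale.US; it only illustrates the expected conversion and is not how HeidelTime normalizes times:

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

// Illustrative only: toIso is a hypothetical helper.
class TwelveHourClockSketch {
    static String toIso(String text) {
        // "h" is the 12-hour clock hour and "a" the AM/PM marker.
        DateTimeFormatter in = DateTimeFormatter.ofPattern("MM/dd/yyyy h:mm a", Locale.US);
        // "HH" is the 24-hour clock hour.
        DateTimeFormatter out = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss", Locale.ROOT);
        return LocalDateTime.parse(text, in).format(out);
    }

    public static void main(String[] args) {
        System.out.println(toIso("02/08/2012 4:49 PM"));
    }
}
```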

What version of the product are you using? On what operating system?
online version

Original issue reported on code.google.com by [email protected] on 2013-06-25 13:11:48

strange rule matching error

Can someone explain why this rule

RULENAME="date_r15k",EXTRACTION="il %reDayNumber %reMonthLong %reThisNextLast",NORM_VALUE="UNDEF-%normThisNextLast(group(3))-%normMonthToEnglish(group(2))-%normDay(group(1))"

works fine and matches

  il 3 aprile prossimo

but if I add this other rule

RULENAME="date_r15k",EXTRACTION="%reThisNextLast %reUnit",NORM_VALUE="UNDEF-%normThisNextLast(group(1))-%normUnit(group(2))"

I get this error:

 DEBUGGING: tonormalize:UNDEF-%normThisNextLast(group(1))-%normUnit(group(2))
 DEBUGGING: mr.group():%normThisNextLast(group(1))
 DEBUGGING: mr.group(1):normThisNextLast
 DEBUGGING: mr.group(2):1
 DEBUGGING: m.group():il 3 aprile prossimo
 DEBUGGING: m.group(1):3
 DEBUGGING: hmR...:null
 -----------------------------------
 Maybe problem with normalization of the resource: normThisNextLast
 Maybe problem with part to replace? 3

The second rule by itself works fine and matches "prossima settimana".

Thank you.

Original issue reported on code.google.com by attardi on 2014-10-19 22:15:37

Incorrect value for decades/centuries?

I decided to update to 1.5 and noticed that something has changed.
If I submit this sentence:
"Near the southern end, signs saying 'Hatfield and the North' inspired the eponymous
1970s rock band Hatfield and the North."
the date "1970s" is tagged: <TIMEX3 tid="t86" type="DATE" value="197">1970s</TIMEX3>
whereas before it was <TIMEX3 tid="t85" type="DATE" value="197X">1970s</TIMEX3>, which makes
more sense (no ambiguity with the year 197). 

The same with centuries:
now: <TIMEX3 tid="t4" type="DATE" value="11">the 12th century</TIMEX3>
before: <TIMEX3 tid="t3" type="DATE" value="11XX">the 12th century</TIMEX3>

Original issue reported on code.google.com by damien.palacio on 2014-02-04 16:19:45

Regular expression

Hi,

I would like to process a series of French dates such as:

"lundi 20, mardi 21 mercredi 22 jeudi 23 vendredi 24 samedi 25 et dimanche 26 avril"

An equivalent in English would be:

"Monday 20, Tuesday 21 Wednesday 22 Thursday 23 Friday 24 Saturday 25 and Sunday, April
26"

where all the dates should apply ("be relative") to April.

I ended up writing the following rule:

RULENAME="date_r4d2",EXTRACTION="(%reWeekday %reDayNumber%reAndOrTo)+%reWeekday %reDayNumber
(%reMonthLong|%reMonthShort)",NORM_VALUE="UNDEF-year-%normMonth(group(7))-%normDay(group(3))",OFFSET="group(2)-group(3)"

Note: %reAndOrTo is ( et | ou | au |,\s|\s) in my case

And I get this result:

Monday, March 9, 2015 0:00
Tuesday, March 10, 2015 0:00
Wednesday, March 11, 2015 0:00
Thursday, March 12, 2015 0:00
Friday, March 13, 2015 0:00
Saturday, April 25, 2015 0:00
Sunday, April 26, 2015 0:00

The XML version:
<!DOCTYPE TimeML SYSTEM "TimeML.dtd">
<TimeML>
<TIMEX3 tid="t5" type="DATE" value="2015-03-09">lundi</TIMEX3> 20, <TIMEX3 tid="t6"
type="DATE" value="2015-03-10">mardi</TIMEX3> 21 <TIMEX3 tid="t7" type="DATE" value="2015-03-11">mercredi</TIMEX3>
22 <TIMEX3 tid="t8" type="DATE" value="2015-03-12">jeudi</TIMEX3> 23 <TIMEX3 tid="t9"
type="DATE" value="2015-03-13">vendredi</TIMEX3> 24 <TIMEX3 tid="t4" type="DATE" value="2015-04-25">samedi
25</TIMEX3> et <TIMEX3 tid="t3" type="DATE" value="2015-04-26">dimanche 26 avril</TIMEX3>
</TimeML>


As you can see, only the last two dates are correct.

The key here is that I have a repeatable group (%reWeekday %reDayNumber%reAndOrTo)+

I tried a lot of alternatives in my regular expression, like using a non-capturing
group as in \(hello\), etc.

Actually, I do not know whether it is an OFFSET issue or whether HeidelTime is unable to handle
such regular expressions.
Do you have any clue that could help me?
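
One general property of java.util.regex may be relevant here: a capturing group inside a repeated group only retains the text of its last iteration, so group(n) cannot address the earlier weekday/day pairs. The sketch below uses simplified, assumed patterns rather than the %re... resources, and whether this fully explains the result above is not confirmed:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: simplified patterns, not HeidelTime rules.
class RepeatedGroupSketch {
    // A quantified group keeps only the captures of its final iteration.
    static String lastIterationOnly(String text) {
        Matcher m = Pattern.compile("(?:(\\p{L}+) (\\d+),? )+").matcher(text);
        return m.find() ? m.group(1) + " " + m.group(2) : null;
    }

    // Iterating with find() over a single-item pattern yields every item.
    static List<String> allItems(String text) {
        List<String> items = new ArrayList<>();
        Matcher m = Pattern.compile("(\\p{L}+) (\\d+)").matcher(text);
        while (m.find()) {
            items.add(m.group(1) + " " + m.group(2));
        }
        return items;
    }

    public static void main(String[] args) {
        String text = "lundi 20, mardi 21 mercredi 22 avril";
        System.out.println(lastIterationOnly(text)); // only the final pair survives
        System.out.println(allItems(text));          // every pair is found
    }
}
```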

Thank you very much,

Pascal

Original issue reported on code.google.com by pascalgill3t on 2015-03-14 21:39:56

normalization issue in Arabic

وأشار مكاوي إلى أن «الجبرا كابيتال» تتوقع نمو سوق إدارة الأصول في دول مجلس التعاون الخليجي إلى نحو 200 مليار دولار في غضون السنوات الخمس القادمة، مؤكداً عزم شركته على احتلال مكانة رائدة في هذه السوق
[HeidelTime] Language: arabic
[HeidelTime] Re-running this sentence with DEBUGGING enabled...

DEBUGGING: tonormalize:P%normDurationNumber(group(6))%normUnit4Duration(group(4))
DEBUGGING: mr.group():%normDurationNumber(group(6))
DEBUGGING: mr.group(1):normDurationNumber
DEBUGGING: mr.group(2):6
DEBUGGING: m.group():في غضون السنوات الخمس القادمة
DEBUGGING: m.group(6):خمس
DEBUGGING: hmR...:5

DEBUGGING: tonormalize:P5%normUnit4Duration(group(4))
DEBUGGING: mr.group():%normUnit4Duration(group(4))
DEBUGGING: mr.group(1):normUnit4Duration
DEBUGGING: mr.group(2):4
DEBUGGING: m.group():في غضون السنوات الخمس القادمة
DEBUGGING: m.group(4):سنوات
DEBUGGING: hmR...:Y

DEBUGGING: tonormalize:%normApprox4Durations(group(1))
DEBUGGING: mr.group():%normApprox4Durations(group(1))
DEBUGGING: mr.group(1):normApprox4Durations
DEBUGGING: mr.group(2):1
DEBUGGING: m.group():في غضون السنوات الخمس القادمة
DEBUGGING: m.group(1):في غضون
DEBUGGING: hmR...:null

Maybe problem with normalization of the resource: normApprox4Durations
Maybe problem with part to replace? في غضون
[HeidelTime] Execution will now resume.

Problem with the intervals

Hi, it's me again (you'll think I'm starting to annoy you),

I have a problem with the latest version 2.0.1 when tagging interval expressions in French.

For two "dumb" cases, I get the following results:

First example

Source = "Le jeudi 24 décembre"

<!DOCTYPE TimeML SYSTEM "TimeML.dtd">
<TimeML>
<TIMEX3INTERVAL earliestBegin="2015-12-24T00:00:00" latestBegin="2015-12-24T23:59:59" earliestEnd="2015-12-24T00:00:00" latestEnd="2015-12-24T23:59:59"><TIMEX3 tid="t2" type="DATE" value="2015-12-24">le jeudi 24 décembre</TIMEX3></TIMEX3INTERVAL>.
</TimeML>

2nd example

Source = "Du lundi 21 décembre au vendredi 25 décembre"

<!DOCTYPE TimeML SYSTEM "TimeML.dtd">
<TimeML>
du <TIMEX3 tid="t5" type="DATE" value="2015-12-21">lundi 21 décembre</TIMEX3> au <TIMEX3 tid="t6" type="DATE" value="2015-12-25">vendredi 25 décembre</TIMEX3>.
</TimeML>

Regarding the first example, HeidelTime found an interval for the single date that spans the whole day, from 00:00:00 to 23:59:59.
This is correct, but I'm expecting a single TIMEX3 with the DATE type, which was the behavior before the project's mavenization.

As for the 2nd example (from ... to ...), I get two dates where I'm expecting an interval (which I also used to get before the mavenization).

My pom.xml:

...
        <!-- HeidelTime -->
        <dependency>
            <groupId>org.apache.uima</groupId>
            <artifactId>uimaj-core</artifactId>
            <version>2.8.1</version>
        </dependency>
        <dependency>
            <groupId>com.github.heideltime</groupId>
            <artifactId>heideltime</artifactId>
            <version>2.0.1</version>
        </dependency>
        <dependency>
            <groupId>edu.stanford.nlp</groupId>
            <artifactId>stanford-corenlp</artifactId>
            <version>3.3.1</version>
        </dependency>
...

I use Heideltime this way:

heidel = new HeidelTimeStandalone(Language.FRENCH, DocumentType.NEWS,
        OutputType.TIMEML, configPath, POSTagger.STANFORDPOSTAGGER,
        true);

...

Thanks again for your feedback,

Pascal

problem with the execution

I am working on the Persian language resources of HeidelTime and want to improve them. I am using the standalone kit. After I improved the resources, the first run from CMD detected many temporal expressions; I calculated recall and precision manually, and recall was 0.50 and precision was 0.98.
On the second run, however, it found fewer expressions than the first time, and it shows this error:

java -Dfile.encoding=UTF-8 -jar de.unihd.dbs.heideltime.standalone.jar biographies.txt -l auto-persian -pos no
[RuleManager] Cannot read the following line of rule resource daterules
[RuleManager] Line: // author: Jannik Strötgen
[HeidelTime] HeidelTime's execution has been interrupted by an exception that is likely rooted in faulty normalization resource files. Please consider opening an issue report containing the following information at our GitHub project issue tracker: https://github.com/HeidelTime/heideltime/issues - Thanks!
java.lang.NullPointerException
at java.lang.String.replace(Unknown Source)
at de.unihd.dbs.uima.annotator.heideltime.HeidelTime.applyRuleFunctions(HeidelTime.java:2268)
at de.unihd.dbs.uima.annotator.heideltime.HeidelTime.getAttributesForTimexFromFile(HeidelTime.java:2371)
at de.unihd.dbs.uima.annotator.heideltime.HeidelTime.findTimexes(HeidelTime.java:2195)
at de.unihd.dbs.uima.annotator.heideltime.HeidelTime.process(HeidelTime.java:223)
at de.unihd.dbs.heideltime.standalone.HeidelTimeStandalone.process(HeidelTimeStandalone.java:497)
at de.unihd.dbs.heideltime.standalone.HeidelTimeStandalone.process(HeidelTimeStandalone.java:443)
at de.unihd.dbs.heideltime.standalone.HeidelTimeStandalone.main(HeidelTimeStandalone.java:760)
[HeidelTime] Sentence [0-2384]:

[HeidelTime] Language: auto-persian
[HeidelTime] Re-running this sentence with DEBUGGING enabled...

DEBUGGING: tonormalize:%normYearBC(group(2))
DEBUGGING: mr.group():%normYearBC(group(2))
DEBUGGING: mr.group(1):normYearBC
DEBUGGING: mr.group(2):2
DEBUGGING: m.group():سال 1991
DEBUGGING: m.group(2):1991
DEBUGGING: hmR...:1991

t11EXTRACTION PHASE: found by:date_lrec1b-explicit text:1991
t11NORMALIZATION PHASE: found by:date_lrec1b-explicit text:1991 value:1991

DEBUGGING: tonormalize:%normYearBC(group(2))
DEBUGGING: mr.group():%normYearBC(group(2))
DEBUGGING: mr.group(1):normYearBC
DEBUGGING: mr.group(2):2
DEBUGGING: m.group():سال 2011
DEBUGGING: m.group(2):2011
DEBUGGING: hmR...:2011

t12EXTRACTION PHASE: found by:date_lrec1b-explicit text:2011
t12NORMALIZATION PHASE: found by:date_lrec1b-explicit text:2011 value:2011

DEBUGGING: tonormalize:%normYearBC(group(2))
DEBUGGING: mr.group():%normYearBC(group(2))
DEBUGGING: mr.group(1):normYearBC
DEBUGGING: mr.group(2):2
DEBUGGING: m.group():سال

2009
DEBUGGING: m.group(2):2009
DEBUGGING: hmR...:2009

t13EXTRACTION PHASE: found by:date_lrec1b-explicit text:2009
t13NORMALIZATION PHASE: found by:date_lrec1b-explicit text:2009 value:2009

DEBUGGING: tonormalize:%normYearBC(group(2))
DEBUGGING: mr.group():%normYearBC(group(2))
DEBUGGING: mr.group(1):normYearBC
DEBUGGING: mr.group(2):2
DEBUGGING: m.group():سال 2009
DEBUGGING: m.group(2):2009
DEBUGGING: hmR...:2009

t14EXTRACTION PHASE: found by:date_lrec1b-explicit text:2009
t14NORMALIZATION PHASE: found by:date_lrec1b-explicit text:2009 value:2009

t15EXTRACTION PHASE: found by:date_lrec1b-explicit text:91
t15NORMALIZATION PHASE: found by:date_lrec1b-explicit text:91 value:0091

DEBUGGING: tonormalize:%normYearBC(group(3))
DEBUGGING: mr.group():%normYearBC(group(3))
DEBUGGING: mr.group(1):normYearBC
DEBUGGING: mr.group(2):3
DEBUGGING: m.group():2009-12
DEBUGGING: m.group(3):2009
DEBUGGING: hmR...:2009

DEBUGGING: tonormalize:%normApprox4Dates(group(2))
DEBUGGING: mr.group():%normApprox4Dates(group(2))
DEBUGGING: mr.group(1):normApprox4Dates
DEBUGGING: mr.group(2):2
DEBUGGING: m.group():2009-12
DEBUGGING: m.group(2):null
DEBUGGING: hmR...:null

Empty part to normalize in normApprox4Dates
t16EXTRACTION PHASE: found by:date_lrec1e-explicit text:2009
t16NORMALIZATION PHASE: found by:date_lrec1e-explicit text:2009 value:2009

Empty part to normalize in normApprox4Dates
t17EXTRACTION PHASE: found by:date_lrec1e-explicit text:2009
t17NORMALIZATION PHASE: found by:date_lrec1e-explicit text:2009 value:2009

DEBUGGING: tonormalize:%normYearBC(group(3))
DEBUGGING: mr.group():%normYearBC(group(3))
DEBUGGING: mr.group(1):normYearBC
DEBUGGING: mr.group(2):3
DEBUGGING: m.group():2010-01
DEBUGGING: m.group(3):2010
DEBUGGING: hmR...:2010

DEBUGGING: tonormalize:%normApprox4Dates(group(2))
DEBUGGING: mr.group():%normApprox4Dates(group(2))
DEBUGGING: mr.group(1):normApprox4Dates
DEBUGGING: mr.group(2):2
DEBUGGING: m.group():2010-01
DEBUGGING: m.group(2):null
DEBUGGING: hmR...:null

Empty part to normalize in normApprox4Dates
t18EXTRACTION PHASE: found by:date_lrec1e-explicit text:2010
t18NORMALIZATION PHASE: found by:date_lrec1e-explicit text:2010 value:2010

DEBUGGING: tonormalize:%normYearBC(group(3))
DEBUGGING: mr.group():%normYearBC(group(3))
DEBUGGING: mr.group(1):normYearBC
DEBUGGING: mr.group(2):3
DEBUGGING: m.group():19-20
DEBUGGING: m.group(3):19
DEBUGGING: hmR...:0019

DEBUGGING: tonormalize:%normApprox4Dates(group(2))
DEBUGGING: mr.group():%normApprox4Dates(group(2))
DEBUGGING: mr.group(1):normApprox4Dates
DEBUGGING: mr.group(2):2
DEBUGGING: m.group():19-20
DEBUGGING: m.group(2):null
DEBUGGING: hmR...:null

Empty part to normalize in normApprox4Dates
t19EXTRACTION PHASE: found by:date_lrec1e-explicit text:19
t19NORMALIZATION PHASE: found by:date_lrec1e-explicit text:19 value:0019

DEBUGGING: tonormalize:%normYearBC(group(3))
DEBUGGING: mr.group():%normYearBC(group(3))
DEBUGGING: mr.group(1):normYearBC
DEBUGGING: mr.group(2):3
DEBUGGING: m.group():1850-58
DEBUGGING: m.group(3):1850
DEBUGGING: hmR...:1850

DEBUGGING: tonormalize:%normApprox4Dates(group(2))
DEBUGGING: mr.group():%normApprox4Dates(group(2))
DEBUGGING: mr.group(1):normApprox4Dates
DEBUGGING: mr.group(2):2
DEBUGGING: m.group():1850-58
DEBUGGING: m.group(2):null
DEBUGGING: hmR...:null

Empty part to normalize in normApprox4Dates
t20EXTRACTION PHASE: found by:date_lrec1e-explicit text:1850
t20NORMALIZATION PHASE: found by:date_lrec1e-explicit text:1850 value:1850

DEBUGGING: tonormalize:%normYearBC(group(7))-%normMonth(group(1))
DEBUGGING: mr.group():%normYearBC(group(7))
DEBUGGING: mr.group(1):normYearBC
DEBUGGING: mr.group(2):7
DEBUGGING: m.group():نوامبر

19
DEBUGGING: m.group(7):19
DEBUGGING: hmR...:0019

DEBUGGING: tonormalize:0019-%normMonth(group(1))
DEBUGGING: mr.group():%normMonth(group(1))
DEBUGGING: mr.group(1):normMonth
DEBUGGING: mr.group(2):1
DEBUGGING: m.group():نوامبر

19
DEBUGGING: m.group(1):نوامبر
DEBUGGING: hmR...:null

Maybe problem with normalization of the resource: normMonth
Maybe problem with part to replace? نوامبر
[HeidelTime] Execution will now resume.

2009-12-19,17:00:00

2009-12-19 17:00:00

12/29/2000 20:29

2010-01-29

1999/26/09

19-20 از ماه نوامبر

فوریه سال 1991

می و ژوئن،2011

می-ژوئن 2011

تا ماه ژوئن سال 2011

2009

از سال 2009

سال 91

1850-58

thank you so much.
fatemeh

Parse "the last week of April in 1867"

Example from https://en.wikipedia.org/wiki/Mary_Ann_Cotton:

All three children were buried in the last week of April and first week of May in 1867.

Ideally, this could be resolved to 1867-04-W4 and 1867-05-W1 (or 1867-W17, 1867-W18?)

For now, I have added a negative look-ahead in my codebase: (?! of) to avoid matching these, but they will not be resolved.
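The workaround above can be sketched as follows. The pattern is a hypothetical, simplified stand-in (it is not an actual HeidelTime rule); it only shows how a trailing `(?! of)` stops "last week" from matching when it is directly followed by " of".

```java
import java.util.regex.Pattern;

class LookaheadSketch {
    // Hypothetical, simplified pattern; not an actual HeidelTime rule.
    // The trailing (?! of) rejects a match that is directly followed by " of".
    static final Pattern RELATIVE_WEEK =
            Pattern.compile("\\b(last|first|next) week\\b(?! of)");

    static boolean matchesRelativeWeek(String text) {
        return RELATIVE_WEEK.matcher(text).find();
    }

    public static void main(String[] args) {
        // A plain relative expression is still matched.
        System.out.println(matchesRelativeWeek("We met last week."));
        // "last week of April" and "first week of May" are both skipped.
        System.out.println(matchesRelativeWeek(
                "buried in the last week of April and first week of May in 1867"));
    }
}
```

As noted in the post, this only suppresses the wrong match; the expressions themselves remain untagged.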

HeidelTimeStandalone default constructor is missing

Good afternoon,

I noticed that HeidelTimeStandalone has no default constructor. So if
you want to instantiate the class dynamically, it fails with java.lang.InstantiationException.

You just need to add
public HeidelTimeStandalone() {
    }

to HeidelTimeStandalone to avoid that.
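A minimal sketch of the failure mode, using two hypothetical stand-in classes rather than the real HeidelTimeStandalone. It uses `getDeclaredConstructor().newInstance()`, which reports the missing constructor as a NoSuchMethodException; the older `Class.newInstance()` reports the same situation as the InstantiationException mentioned above.

```java
class NoArgConstructorSketch {
    // Hypothetical stand-in: only a parameterized constructor is declared.
    static class WithoutDefault {
        WithoutDefault(String language) { }
    }

    // Hypothetical stand-in: an explicit no-argument constructor is added.
    static class WithDefault {
        WithDefault() { }
        WithDefault(String language) { }
    }

    // Tries to create an instance reflectively, as a framework would.
    static boolean canInstantiate(Class<?> type) {
        try {
            type.getDeclaredConstructor().newInstance();
            return true;
        } catch (ReflectiveOperationException e) {
            // No no-argument constructor, or it could not be invoked.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(canInstantiate(WithoutDefault.class));
        System.out.println(canInstantiate(WithDefault.class));
    }
}
```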

PS: Is a new release planned soon? The last one was in May.

Original issue reported on code.google.com by damien.palacio on 2012-12-11 13:07:29

24h time not matched

From https://en.wikipedia.org/wiki/Angela_Merkel

The first Cabinet of Angela Merkel was sworn in at 16:00 CET on 22 November 2005.

The date is matched, but the time isn't.

I added a :? to time_r7 and then it worked; cf. 16a3c5d

Sentence splitting bug in de.unihd.dbs.uima.annotator.stanfordtagger.StanfordPOSTaggerWrapper

Hi HeidelTime team,

I'm a Master's student at the University of Mannheim and am currently building a temporal
information extraction system that uses HeidelTime as its temporal tagger.
I encountered a bug in the StanfordPOSTaggerWrapper UIMA component.

What steps will reproduce the problem?
1. Check the attached file "Breaking_Sample.txt"; it's a plain text version of Apple's
Wikipedia article.
2. Apply de.unihd.dbs.uima.annotator.stanfordtagger.StanfordPOSTaggerWrapper on it
3. Check the JCas sentence annotations, or the sentence text you get when
taking substrings at the annotations' "begin" and "end" indexes.

What is the expected output? What do you see instead?
Expected: Sentences as shown in "Output_MyStanfordPOSTaggerWrapper.txt"
Actual: Sentences as shown in "Output_StanfordPOSTaggerWrapper.txt"
Issue starts with Sentence 117

What version of the product are you using? On what operating system?
1.7
OS X

Please provide any additional information below.
The results of my analysis are as follows:
The weakness of the current implementation is that it calculates its own offset value
and relies on searching the document text with ".indexOf(thisWord, offset)".

To fit my needs, I copied and reimplemented your component; the code can be found in
"MyStanfordPOSTaggerWrapper.java".
In my view, this implementation is more robust because it reuses the offsets calculated
by the Stanford Tokenizer.
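The difference between the two approaches can be sketched as below. The `Token` class and the normalized token text are assumptions made for illustration; this shows one way a text search can go wrong, not a confirmed diagnosis of the bug in the attached sample.

```java
class OffsetSketch {
    /** Hypothetical token: the form reported by a tokenizer plus its character offsets. */
    static final class Token {
        final String text;
        final int begin;
        final int end;

        Token(String text, int begin, int end) {
            this.text = text;
            this.begin = begin;
            this.end = end;
        }
    }

    // Fragile: re-locate the token by searching for its text from a running offset.
    static int locateBySearch(String document, Token token, int offset) {
        return document.indexOf(token.text, offset);
    }

    // More robust: trust the offsets that came with the token.
    static String coveredText(String document, Token token) {
        return document.substring(token.begin, token.end);
    }

    public static void main(String[] args) {
        String document = "Apple (AAPL) grew.";
        // Assumption: the tokenizer reports "(" in a normalized form.
        Token paren = new Token("-LRB-", 6, 7);
        System.out.println(locateBySearch(document, paren, 5)); // -1, text not found
        System.out.println(coveredText(document, paren));       // (
    }
}
```

Once a search misses, every offset derived from it is off, which would be consistent with sentence boundaries drifting from a certain sentence onward.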

If you have further questions please do not hesitate to contact me.

Bests
Norman

Original issue reported on code.google.com by norman.weisenburger on 2014-07-23 14:31:54

- _Attachment: [potential_StanfordPOSTaggerWrapper_bug.zip](https://storage.googleapis.com/google-code-attachments/heideltime/issue-16/comment-0/potential_StanfordPOSTaggerWrapper_bug.zip)_

overlapping timexes produce broken XML in the standalone

When the standalone version of heideltime is used to tag a document that contains multiple
overlapping temporal expressions, the TimeML writer module will produce invalid XML
code.

Due to the nature of inline tags in TimeML documents, this condition cannot be resolved
entirely satisfactorily; two overlapping timexes would produce overlapping XML tags
which would be semantically invalid.

The condition of two overlapping timexes should *ideally* never occur, since if a temporal
expression produces two overlapping timexes, this temporal expression should also be
representable by a single timex that spans both of the smaller timexes. The recognition
of temporal expressions however is subject to the utilized resources/rules and whether
they include such a "larger" rule.

Different domains such as poetry however can produce unexpected sentence syntax which
may elude any of the existing rulesets otherwise thought of as comprehensive.

To resolve the bug that produces broken XML/TimeML tags, we will, for overlapping timexes,
only create an XML tag for the first recognized timex, omitting all of the subsequent
timexes that overlap with the first one.
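The resolution described above could look roughly like this. `Span` is a hypothetical stand-in for a recognized timex, and the input list is assumed to be in recognition order; this is a sketch of the idea, not the standalone's actual writer code.

```java
import java.util.ArrayList;
import java.util.List;

class OverlapFilterSketch {
    /** Hypothetical stand-in for a timex: half-open character range [begin, end). */
    static final class Span {
        final int begin;
        final int end;

        Span(int begin, int end) {
            this.begin = begin;
            this.end = end;
        }
    }

    // Keeps a span only if it overlaps none of the spans kept before it.
    static List<Span> dropOverlapping(List<Span> recognized) {
        List<Span> kept = new ArrayList<>();
        for (Span candidate : recognized) {
            boolean overlaps = false;
            for (Span earlier : kept) {
                if (candidate.begin < earlier.end && earlier.begin < candidate.end) {
                    overlaps = true;
                    break;
                }
            }
            if (!overlaps) {
                kept.add(candidate);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Span> recognized = new ArrayList<>();
        recognized.add(new Span(0, 10));
        recognized.add(new Span(5, 15));   // overlaps the first span, dropped
        recognized.add(new Span(20, 25));
        System.out.println(dropOverlapping(recognized).size()); // 2
    }
}
```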

Our thanks go to Armin Hoenen for bringing this bug to our attention.

Original issue reported on code.google.com by juliii on 2012-07-31 00:19:44

I am new to Java. I have configured everything as described in the manual. I am using Windows. I am having a problem running the standalone version.

Can anyone please give me the steps to use HeidelTime?
I have already set up TreeTagger.
The error is: Exception in thread "main" java.lang.NoClassDefFoundError: de/unihd/dbs/heideltime/standalone/jar

Original issue reported on code.google.com by parul.patel.ns on 2013-10-09 06:16:54

Overflow in quarter logic

Hi,
As I have been refactoring (and simplifying) the disambiguation code, I have found the following problematic strings:

November 2015, 1 quarter later.
January 2015, 1 quarter earlier.

Which will yield 2015-Q5 and 2015-Q0.

The reason is that the quarter logic computes the mod-4 on the delta, not on the resulting value.

In my refactoring branch, I have a bug fix for this.
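One way to avoid the overflow is to apply the modulo to the resulting value instead of the delta. The helper below is a hypothetical sketch of that arithmetic and is not the fix from the refactoring branch.

```java
class QuarterShiftSketch {
    // Shifts a year/quarter pair by delta quarters, carrying into the year.
    static String shiftQuarter(int year, int quarter, int delta) {
        // Count quarters on one continuous axis, then split again.
        int index = year * 4 + (quarter - 1) + delta;
        int newYear = Math.floorDiv(index, 4);
        int newQuarter = Math.floorMod(index, 4) + 1;
        return newYear + "-Q" + newQuarter;
    }

    public static void main(String[] args) {
        // November 2015 lies in Q4; one quarter later.
        System.out.println(shiftQuarter(2015, 4, 1));  // 2016-Q1
        // January 2015 lies in Q1; one quarter earlier.
        System.out.println(shiftQuarter(2015, 1, -1)); // 2014-Q4
    }
}
```

`Math.floorDiv` and `Math.floorMod` are used so that negative deltas carry into the previous year instead of producing Q0 or a negative quarter.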

erroneous week calculation based on dctWeek/getxnextweek

Something is going wrong with the week values generated in HeidelTime.specifyAmbiguousValues(),
likely originating from system- or locale-specific behavior within DateCalculator.getXNextWeek().

This results in week numbers that are one lower than their gold standard
values, likely in connection with the relative value calculation based on dctWeek.
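To illustrate how week numbers can differ by one depending on the week definition, here is a small sketch using `java.time`. It does not reproduce DateCalculator; it only compares the ISO-8601 definition with a Sunday-start definition of the kind some locales use, which is an assumed explanation for the discrepancy.

```java
import java.time.LocalDate;
import java.time.temporal.WeekFields;

class WeekNumberSketch {
    // ISO-8601: weeks start on Monday, week 1 has at least 4 days in the year.
    static int isoWeek(LocalDate date) {
        return date.get(WeekFields.ISO.weekOfWeekBasedYear());
    }

    // Sunday-start weeks, week 1 is the week containing January 1.
    static int sundayStartWeek(LocalDate date) {
        return date.get(WeekFields.SUNDAY_START.weekOfWeekBasedYear());
    }

    public static void main(String[] args) {
        LocalDate date = LocalDate.of(2010, 1, 13);
        System.out.println(isoWeek(date));         // 2
        System.out.println(sundayStartWeek(date)); // 3
    }
}
```

Pinning the week definition explicitly, rather than taking it from the environment locale, makes the result independent of the machine it runs on.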

Original issue reported on code.google.com by juliii on 2012-05-23 15:57:27

OutputType parameter ignored except in main

While a HeidelTimeStandalone instance knows through a constructor argument which type
of output it should produce, it still requires the user to pass a ResultFormatter to
process. I added two overloads of process to fix that.

Original issue reported on code.google.com by larsmans on 2013-01-14 14:50:41

- _Attachment: [process-without-resultformatter.patch](https://storage.googleapis.com/google-code-attachments/heideltime/issue-7/comment-0/process-without-resultformatter.patch)_

heideltime standalone not working under my ubuntu system

I've spent several hours trying to get HeidelTime Standalone to run on my Ubuntu system, but
it still doesn't work.
I followed the how-to-use instructions in the readme file exactly, and I also installed
TreeTagger from http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/ with the
whole package: the tagging scripts, the installation script, and the parameter files
for the languages I use. I also set the path to the folder containing
TreeTagger in config.props, in "treeTaggerHome" (treeTaggerHome = /home/chuulio/Dokumente/TreeTagger/).
I tried HeidelTime on a German text document about Moscow, and this is what I
got:

chuulio@chuulio-UX32VD:~/Dokumente/Temporal_Annotation/Standalone$ java -jar de.unihd.dbs.heideltime.standalone.jar
/home/chuulio/Dokumente/Moskau.txt -l german -vv
Aug 26, 2014 10:47:17 PM de.unihd.dbs.heideltime.standalone.HeidelTimeStandalone main
INFO: Verbosity: '-vv'; Logging level set to ALL.
Aug 26, 2014 10:47:17 PM de.unihd.dbs.heideltime.standalone.HeidelTimeStandalone main
INFO: Encoding '-e': NOT FOUND OR RECOGNIZED; set to 'UTF-8'
Aug 26, 2014 10:47:17 PM de.unihd.dbs.heideltime.standalone.HeidelTimeStandalone main
INFO: Output '-o': NOT FOUND OR RECOGNIZED; set to TIMEML
Aug 26, 2014 10:47:17 PM de.unihd.dbs.heideltime.standalone.HeidelTimeStandalone main
INFO: Language '-l': GERMAN
Aug 26, 2014 10:47:17 PM de.unihd.dbs.heideltime.standalone.HeidelTimeStandalone main
INFO: Type '-t': NOT FOUND OR RECOGNIZED; set to NARRATIVES
Aug 26, 2014 10:47:17 PM de.unihd.dbs.heideltime.standalone.HeidelTimeStandalone main
INFO: Document Creation Time '-dct': NOT FOUND; skipping.
Aug 26, 2014 10:47:17 PM de.unihd.dbs.heideltime.standalone.HeidelTimeStandalone main
INFO: Locale '-locale': NOT FOUND, set to environment locale: de_CH
Aug 26, 2014 10:47:17 PM de.unihd.dbs.heideltime.standalone.HeidelTimeStandalone main
INFO: Configuration path '-c': config.props
Aug 26, 2014 10:47:17 PM de.unihd.dbs.heideltime.standalone.HeidelTimeStandalone readConfigFile
INFO: trying to read in file config.props
Aug 26, 2014 10:47:17 PM de.unihd.dbs.heideltime.standalone.HeidelTimeStandalone main
INFO: POS Tagger '-pos': NOT FOUND OR RECOGNIZED; set to TREETAGGER
Aug 26, 2014 10:47:17 PM de.unihd.dbs.heideltime.standalone.HeidelTimeStandalone main
INFO: Interval Tagger '-it': NOT FOUND OR RECOGNIZED; set to false
Aug 26, 2014 10:47:17 PM de.unihd.dbs.heideltime.standalone.HeidelTimeStandalone main
INFO: Reading document using charset: UTF-8
Aug 26, 2014 10:47:17 PM de.unihd.dbs.heideltime.standalone.HeidelTimeStandalone initialize
INFO: HeidelTimeStandalone initialized with language german
Aug 26, 2014 10:47:17 PM de.unihd.dbs.heideltime.standalone.HeidelTimeStandalone initialize
INFO: HeidelTime initialized
Aug 26, 2014 10:47:17 PM de.unihd.dbs.heideltime.standalone.HeidelTimeStandalone initialize
INFO: JCas factory initialized
Aug 26, 2014 10:47:17 PM de.unihd.dbs.heideltime.standalone.HeidelTimeStandalone process
INFO: Processing started
[de.unihd.dbs.uima.annotator.heideltime.HeidelTime] HeidelTime has not found any sentence
tokens in this document. HeidelTime needs sentence tokens tagged by a preprocessing
UIMA analysis engine to do its work. Please check your UIMA workflow and add an analysis
engine that creates these sentence tokens.
Aug 26, 2014 10:47:19 PM de.unihd.dbs.heideltime.standalone.HeidelTimeStandalone process
INFO: Processing finished
Aug 26, 2014 10:47:19 PM de.unihd.dbs.heideltime.standalone.HeidelTimeStandalone process
INFO: Result formatted
<?xml version="1.0"?>
<!DOCTYPE TimeML SYSTEM "TimeML.dtd">
<TimeML>

Moskau (russisch Москва́ Zum Anhören bitte klicken! [mɐˈskva], Moskwa) ist die Hauptstadt
der Russischen Föderation und mit rund 11,55 Millionen Einwohnern (Stand 14. Oktober
2010)[1] die größte Stadt bzw. mit 15,1 Millionen (2012)[2] die größte Agglomeration
Europas. Am 1. Juli 2012 wurde Moskau durch Eingemeindung der beiden Verwaltungsbezirke
Nowomoskowski und Troizk im Südwesten der Stadt auf Kosten der Moskauer Oblast um 1480
km², d. h. um das 1,39-Fache, auf 2550 km² vergrößert. Durch die Eingliederung wuchs
die Moskauer Bevölkerung um etwa 235.000 Menschen.

Moskau ist das politische, wirtschaftliche und kulturelle Zentrum des Landes mit Hochschulen
und Fachschulen sowie zahlreichen Kirchen, Theatern, Museen, Galerien und dem 540 Meter
hohen Ostankino-Turm. Moskau ist Sitz der Russisch-Orthodoxen Kirche: Der Patriarch
residiert im Danilow-Kloster, das größte russisch-orthodoxe Kirchengebäude ist die
Moskauer Christ-Erlöser-Kathedrale. Es gibt im Stadtgebiet von Moskau über 300 Kirchen.[3]
Seit dem 16. Jahrhundert wird Moskau auch als Drittes Rom bezeichnet. Nach Ende des
Zweiten Weltkriegs erhielt Moskau die Auszeichnung einer „Heldenstadt“.

Der Kreml und der Rote Platz im Zentrum Moskaus stehen seit 1990 auf der UNESCO-Liste
des Weltkulturerbes. Mit acht Fernbahnhöfen, drei internationalen Flughäfen und drei
Binnenhäfen ist die Stadt wichtigster Verkehrsknoten und größte Industriestadt Russlands.

Geschichte
Ursprung
Denkmal für den Stadtgründer Juri Dolgoruki

Eine der Sagen kündet davon, dass der Fürst Juri Dolgoruki (1090–1157) im Land der
Wjatitschen eine hölzerne Stadt zu errichten befahl, und dass diese Stadt nach dem
Fluss benannt wurde, an dessen Ufern sie emporwuchs. Die erste schriftliche Erwähnung
Moskaus stammt aus dem Jahre 1147, das darum als das Gründungsjahr Moskaus gilt. Doch
schon lange davor gab es an der Stelle, wo heute Moskau steht, menschliche Niederlassungen.
Archäologische Ausgrabungen bezeugen, dass die ältesten von ihnen vor etwa 5000 Jahren
entstanden waren.

Um 1156 entstand eine erste, noch hölzerne Wehranlage des Kremls, in deren Schutz sich
der Marktflecken allmählich zu einer beachtlichen Ansiedlung entwickelte. Im Jahre
1238 ist die Stadt von den Mongolen erobert und niedergebrannt worden. 1263 wurde das
Umland zu einem Teilfürstentum im Großfürstentum Wladimir-Susdal, wenig später unter
Fürst Daniel ein eigenständiges Fürstentum. In der ersten Hälfte des 14. Jahrhunderts
– die Stadt zählte mittlerweile 30.000 Einwohner – erkannte der tatarische Großkhan
den Moskauer Großfürsten als (ihm allerdings tributpflichtiges) Oberhaupt von Russland
an.

Der Sieg über die Tataren in der Schlacht von Kulikowo am 8. September 1380, angeführt
durch den Moskauer Großfürsten Dmitri Donskoi, befreite zwar nicht von der Hegemonie
der Goldenen Horde (1382 wurde Moskau sogar abermals niedergebrannt und geplündert),
doch die Stadt festigte dadurch ihr politisches und militärisches Ansehen erheblich
und gewann mithin beständig an wirtschaftlicher Macht. 1480 konnte sie die Tatarenherrschaft
endgültig abschütteln und wurde zur Hauptstadt des russischen Reiches.

Der seit 1462 regierende Großfürst von Moskau Iwan III., der Große (1440–1505), heiratete
1472 die byzantinische Prinzessin Sofia (Zoe) Palaiologos, eine Nichte des letzten
oströmischen Kaisers Konstantin XI. Palaiologos, und übernahm von dort die autokratische
Staatsidee und ihre Symbole: den Doppeladler und das Hofzeremoniell. Seither gilt Moskau
als „Drittes Rom“ und Hort der Orthodoxie.
Moskau wird Großstadt
Moskau am Ende des 17. Jahrhunderts

In den beiden letzten Jahrzehnten des 15. Jahrhunderts begann der Ausbau des Kreml,
in dessen Umkreis sich nun in großer Zahl Handwerker und Kaufleute niederließen. Die
Einwohnerzahl stieg bald darauf auf mehr als 100.000, so dass um 1600 eine Ringmauer
um Moskau und eine Erdverschanzung hinzukamen, die die blühende Stadt fortan nach außen
abschirmten. 1571 war sie ein letztes Mal von den Tataren heimgesucht worden, als die
überwiegend aus Holz gebaute Stadt abbrannte. Bereits ein Jahr später war die Tatarengefahr
in der Schlacht von Molodi südlich von Moskau aber endgültig gebannt. In der Zeit der
Wirren, die durch unklare Thronfolgeverhältnisse ausgelöst wurde, rückten polnische
Truppen in die Stadt und versuchten, eigene Marionetten zu installieren. Eine Volksarmee
aus Nischni Nowgorod belagerte die Polen jedoch im Moskauer Kreml und zwang sie zur
Kapitulation. Diese Ereignisse ebneten den Weg für die Romanow-Dynastie auf den russischen
Thron.

Während die ersten Tuch-, Papier- und Ziegelmanufakturen, Glasfabriken und Pulvermühlen
entstanden, kulminierten die sozialen Gegensätze des Großreiches: 1667 erhoben sich
die Bauern im Wolga- und Dongebiet gegen die wachsende Unterdrückung, ihr Führer, Stepan
Rasin, wurde 1671 auf dem Roten Platz in Moskau hingerichtet. Im Jahre 1687 ist die
erste Hochschule Russlands, die „Slawisch-Griechische Akademie“ eröffnet worden, 1703
erschien die erste gedruckte russische Zeitung „Wedomosti“. Im Jahre 1712 ging unter
Zar Peter dem Großen (1672–1725) das Privileg der Hauptstadt auf das neu gegründete
Sankt Petersburg über, aber Moskau blieb das wirtschaftliche und geistig-kulturelle
Zentrum des Landes. 1755 wurde in Moskau mit der heutigen Lomonossow-Universität die
erste russische Universität eröffnet.
Der Brand von Moskau vor der Einnahme der Stadt durch Napoleon 1812
Twerskaja-Straße im 19. Jahrhundert

Mit dem Moskau des 18. Jahrhunderts ist das Schaffen hervorragender russischer Schriftsteller
und Dichter verknüpft wie Alexander Sumarokow, Denis Fonwisin, Nikolai Karamsin und
vieler anderer. In Moskau trat der große russische Gelehrte Michail Lomonossow seinen
Weg in die Wissenschaft an. Auch in späteren Zeiten lebten und wirkten in Moskau viele
berühmte russische Schriftsteller und Dichter, Wissenschaftler und Künstler, die durch
ihr Schaffen nicht nur zur russischen, sondern auch zur Weltkultur einen immensen Beitrag
geleistet haben.

Im Vaterländischen Krieg von 1812, als Napoleon Bonaparte (1769–1821) mit seiner „Großen
Armee“ auf Moskau zumarschierte, verlor die Stadt in einem Flächenbrand – die Bewohner
zündeten ihre Häuser an und flohen aus der Stadt – zwei Drittel ihrer Bausubstanz.
Aber in Moskau kam die französische Armee zum Stehen, hier wurde sie wegen Hunger und
Kälte zur Umkehr gezwungen, die mit ihrem Untergang endete.

Der im Frühjahr 1813 einsetzende großstilige Wieder- und Neuaufbau sprengte rasch den
alten städtischen Verteidigungsring und verschaffte der Stadt von der Mitte des 19.
Jahrhunderts an durch zügigen Straßen- und Bahnstreckenbau Anschluss an die wichtigsten
Städte des Landes. 1890 fuhren die ersten elektrischen Straßenbahnen; die erste Volkszählung
des Landes fand am 28. Januar 1897 statt, die Bevölkerung der Stadt war auf etwa eine
Milli


</TimeML>

Java and Ubuntu version:

chuulio@chuulio-UX32VD:~/Dokumente/Temporal_Annotation/Standalone$ lsb_release -a &&
java -version
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 12.10
Release:    12.10
Codename:   quantal
java version "1.7.0_25"
OpenJDK Runtime Environment (IcedTea 2.3.10) (7u25-2.3.10-1ubuntu0.12.10.2)
OpenJDK 64-Bit Server VM (build 23.7-b01, mixed mode)


Any idea?

Original issue reported on code.google.com by julien.peter.thoma on 2014-08-26 20:52:56

Descriptors of Chinese text

Hi,

I would like to process Chinese documents. They are .txt files without any tokenization
or segmentation. Which reader and annotation descriptors should I choose?

I have tested the FileSystemCollectionReader from the UIMA example project, and the ACETernReader,
Eventi2014Reader, Tempeval2Reader and Tempeval3Reader from HeidelTime, with the StanfordPosTagger
and HeidelTime annotation descriptors, but I cannot get the right result with any of
these choices. The output contains no annotations at all.

When I copy the input text from the .txt file into the input field of the HeidelTime online
demo, it works. How does the demo generate the right result? Do I need to write
a new reader for my input myself?

My HeidelTime version is heideltime-kit 1.8,
and my system is Ubuntu 14.04.1.

By the way, my input is a piece of Chinese news with a creation time of January 29, 2014.
The input is as follows:
        中新网1月29日电 综合马来西亚、新加坡等媒体消息，马来西亚民航局总监阿兹哈鲁丁于29日下午6时，通过国营电视台TV1针对MH370事件最新进展作出汇报。
　　阿扎鲁丁说，在经过长时间“分析和推测”后，大马民航局今天正式确认马航MH370飞机“失事”，并推定机上所有239名乘客和机组人员已遇难。飞机目前位于印度洋南部偏远海底。
　　此前，大马方面原定于1月29日下午3时30分就马航MH370失联一事做出说明，但发布会先是因“技术问题”被延迟，随后又因“意外情况”取消。

The expected output is the same as that of the HeidelTime online demo, which annotates
"1月29日", "今天", "目前", "下午3时30分".

But the actual output with heideltime-kit 1.8 is identical to the input, without any
annotations.

Thank you very much!

Best, 

Lin

Original issue reported on code.google.com by eriney.gl on 2015-04-24 16:02:19

No clear documentation on initial setup and usage

Please provide a readme covering initial setup, installation, and usage.
The jar file has many dependencies, which are not straightforward