parlamint's Introduction

ParlaMint: Comparable Parliamentary Corpora

The CLARIN ParlaMint project is compiling comparable parliamentary corpora for a number of countries and languages.

ParlaMint corpora are interoperable, i.e. encoded to a very constrained common ParlaMint schema, a specialisation of the Parla-CLARIN recommendations, which are a customisation of the TEI Guidelines. Common scripts should process the common data in any ParlaMint corpus, despite the differing parliamentary systems of the countries, the kind of information included in the corpora, and, of course, language.

The latest version of ParlaMint is 4.1, which contains corpora for 29 countries and autonomous regions in their original languages as well as machine-translated to English; it is available from the CLARIN.SI repository.

Publications connected to ParlaMint are available at the ParlaMint project page.

The two most comprehensive publications on ParlaMint corpora are the LREV preprint describing version 4.1 and the LREV publication describing version 2.1:

  • Tomaž Erjavec, Matyáš Kopp, Nikola Ljubešić, Taja Kuzman, Paul Rayson, Petya Osenova, Maciej Ogrodniczuk, Çağrı Çöltekin, Danijel Koržinek, Katja Meden, Jure Skubic, Peter Rupnik, Tommaso Agnoloni, José Aires, Starkaður Barkarson, Roberto Bartolini, Núria Bel, María Calzada Pérez, Roberts Darģis, Sascha Diwersy, Maria Gavriilidou, Ruben van Heusden, Mikel Iruskieta, Neeme Kahusk, Anna Kryvenko, Noémi Ligeti-Nagy, Carmen Magariños, Martin Mölder, Costanza Navarretta, Kiril Simov, Lars Magne Tungland, Jouni Tuominen, John Vidler, Adina Ioana Vladu, Tanja Wissik, Väinö Yrjänäinen, Darja Fišer. ParlaMint II: Advancing Comparable Parliamentary Corpora Across Europe. (2024). DOI: 10.21203/rs.3.rs-4176128/v1.

  • Tomaž Erjavec, Maciej Ogrodniczuk, Petya Osenova, Nikola Ljubešić, Kiril Simov, Andrej Pančur, Michał Rudolf, Matyáš Kopp, Starkaður Barkarson, Steinþór Steingrímsson, Çağrı Çöltekin, Jesse de Does, Katrien Depuydt, Tommaso Agnoloni, Giulia Venturi, María Calzada Pérez, Luciana D. de Macedo, Costanza Navarretta, Giancarlo Luxardo, Matthew Coole, Paul Rayson, Vaidas Morkevičius, Tomas Krilavičius, Roberts Darǵis, Orsolya Ring, Ruben van Heusden, Maarten Marx & Darja Fišer. The ParlaMint corpora of parliamentary proceedings. Language Resources & Evaluation 57:415–448 (2023). DOI: 10.1007/s10579-021-09574-0.


This Git repository contains the ParlaMint XML schemas, the scripts used to validate and convert the ParlaMint TEI XML corpora to some useful derived formats, and samples of the ParlaMint corpora. Note that there are several branches for different parts of the development.

  • Contributing to the ParlaMint repository is described in the CONTRIBUTING.md file:
    • git and GitHub setup
    • installing prerequisites
  • Running make help in the repository root folder lists the available make targets with their descriptions.
  • The TEI folder contains the TEI ODD, i.e. the Guidelines for encoding ParlaMint corpora, with their HTML available on [ParlaMint project pages] and the formal TEI schema specification. TEI README provides more information.
  • The Schema folder contains the RelaxNG schemas for separately validating the four types of files present in the corpora. Schema README provides more information.
  • The Scripts folder contains the XSLT scripts and Perl wrappers used to:
    • validate the corpora (RNG + XSLT validation for consistency);
    • convert the TEI encoded corpora to derived formats;
    • add/change common information, currently for V3.0;
    • compute some statistics.
  • The Samples folder contains a directory for each country or autonomous region, which should include samples for all variants and formats of the ParlaMint corpora.
  • The Build folder contains the build environment for a release, and all associated data. This consists of the input (source) data, scripts, and a Makefile with targets to make a release. Note that the complete corpora are too large to store on GitHub, so most data files are gitignored. However, the directory or its subdirectories contain various associated resources, e.g. the automatically produced ParlaMint root files, common taxonomies, various metadata on the corpora etc.

parlamint's People

Contributors

5roop, adina-v, atomm, bartjongejan, cluljoseaires, coltekin, dimitrisgk-iel, gclux, hpreki, ivo-clark, jessededoes, katjameden, lnnoemi, maciej-ogrodniczuk, matthewcoole, matyaskopp, mindpetk, mrudolf, nemeek, ninpnin, nljubesi, osenova, repierre, rjzevallos, rubenvanheusden, starkadur, tomazerjavec, tungland, wissikt, yoge1

parlamint's Issues

Schema extension suggestion: new affiliation role values //affiliation/@role

I have unified person/affiliation/@role values:

chairman     ~ chairperson
minister     ~ minister of ...
viceChairman ~ viceChairperson 

so our final list of all affiliation roles is (I believe):

member
viceChairman
MP
candidate
chairman
verifier
minister
replacement
vicePresident
headOfDelegation
president
presidiumMember
observer
vicePublicDefenderOfRights
publicDefenderOfRights
alternateOfDelegation

list of new (or camelCased existing ones - with star) roles with number of occurrences in corpus:

*   662 viceChairman
    604 candidate
    332 verifier
     62 replacement
*    22 vicePresident
     18 headOfDelegation
      2 presidiumMember
      2 observer
      1 vicePublicDefenderOfRights
      1 publicDefenderOfRights
      1 alternateOfDelegation

I don't insist on these exact words - I can replace them (e.g. replacement ~ substitute), or just remove them.

Originally posted by @matyaskopp in #8 (comment)

schema: Multi-token words

As wrongly reported here: #36
It would be nice to allow @xml:id in all w elements.

Furthermore, it is a must to allow the @join attribute in all w elements. Consider this sentence from our corpus:

Ještě bych pochopil, kdyby, jak to včera říkal pan ministr průmyslu, obchodu, dopravy a nevím ještě čeho, ale určitě ne podnikatelů a podnikání, že v tom doplňkovém prodeji ty velké obchody mají nárůst tržeb jenom o 5 %.

The word kdyby consists of two tokens and can be represented as follows (I have temporarily removed @xml:id to "pass" validation):

                       <w join="right">kdyby<w xml:id="ParlaMint-CZ_2020-11-19_ps2017-070-01-001-001.u65.p7.s3.w6"
                             lemma="když"
                             msd="UposTag=SCONJ"
                             ana="pdt:J,-------------"
                             norm="když"/>
                          <w xml:id="ParlaMint-CZ_2020-11-19_ps2017-070-01-001-001.u65.p7.s3.w7"
                             lemma="být"
                             msd="UposTag=AUX|Mood=Cnd|Person=3|VerbForm=Fin"
                             ana="pdt:Vc-------------"
                             norm="by"/>
                       </w>
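
A consumer of this encoding has to tell the surface token apart from the syntactic words nested inside it. The following is a minimal Python sketch of that distinction, assuming the nested <w> layout shown above; the fragment is simplified (shortened IDs, no TEI namespace) and the function name syntactic_words is hypothetical, not part of the ParlaMint scripts.

```python
import xml.etree.ElementTree as ET

# Simplified fragment modelled on the example above (IDs shortened, no namespace).
FRAGMENT = (
    '<s>'
    '<w join="right">kdyby'
    '<w xml:id="w6" lemma="když" norm="když"/>'
    '<w xml:id="w7" lemma="být" norm="by"/>'
    '</w>'
    '</s>'
)

def syntactic_words(sentence):
    """Yield (surface, norm, lemma) for every syntactic word of a sentence."""
    for token in sentence.findall("w"):          # direct children only
        surface = (token.text or "").strip()
        parts = token.findall("w")
        if parts:
            # Multi-token word: the outer <w> carries the surface form,
            # the inner <w> elements carry the analysis.
            for part in parts:
                yield surface, part.get("norm"), part.get("lemma")
        else:
            yield surface, token.get("norm", surface), token.get("lemma")

rows = list(syntactic_words(ET.fromstring(FRAGMENT)))
```

Here rows holds one entry per inner word, each paired with the shared surface form kdyby.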

Schema: @lemma and @msd have to be optional

@TomazErjavec , please make @lemma and @msd optional

token.atts =
  id.att,
  attribute msd { xsd:string },
  attribute ana { anyURIs }?,
  attribute norm { xsd:string }?,
  attribute join { "right" }?
word.atts =
  attribute lemma { xsd:string },
  token.atts

As it stands, it is not possible to represent some multi-token words (cf. our email discussion from the beginning of December, subject: ParlaMint - TEI encoding - multi-word tokens).

ParlaClarin example: https://clarin-eric.github.io/parla-clarin/#sec-ananorm

CoNLL-U to ParlaMint

A free continuation of the email discussion [subj. UD & UPOS tags].

I have this script udpipe2.pl (the related Perl package is here: ParCzech::PipeLine::FileManager). It annotates a TEI file using the UDPipe2 LINDAT service.

For Czech, you can run it in this way:

perl -I ../lib udpipe2.pl --model=czech-pdt-ud-2.6-200830 --output-file test_data_big.ann.xml --input-file test_data_big.xml

In the default settings, it annotates all text children of <seg> elements and also allows annotation of one level of <ref> subelements (which should contain whole tokens).

You can try it. Use it wisely - please don't overload our service. It sends 100k requests and sleeps for 1 second after each request.

Development is still in progress and it is probably not fully compatible with ParlaMint. Bug reports are welcome.

Bugs in RO samples

Great first sample, the only error I see is u/@who: this must be a reference to an ID, and IDs cannot contain spaces, colons etc., so e.g. "#Florin-Claudiu Roman:" won't work. But I guess you will fix this once you have the root file with proper person descriptions.

Also, could you (redundantly) specify the type of corpus on the text element as well, like <text ana="#covid">?
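
The constraint on @who can be checked mechanically before a root file exists. Below is a rough Python sketch, assuming that a reference is a "#" followed by an XML ID. The pattern only approximates the NCName rule (the XML specification admits more Unicode ranges), and the names ID_PATTERN, is_plausible_id and check_who are hypothetical.

```python
import re

# Approximation of an XML NCName: starts with a letter or underscore,
# continues with letters, digits, '_', '.' or '-'. No spaces, colons or quotes.
ID_PATTERN = re.compile(r"[^\W\d][\w.\-]*")

def is_plausible_id(value):
    return ID_PATTERN.fullmatch(value) is not None

def check_who(who):
    """Return a list of problems found in a @who value."""
    problems = []
    for ref in who.split():   # whitespace always separates pointers
        if not ref.startswith("#"):
            problems.append(f"missing '#': {ref}")
        elif not is_plausible_id(ref[1:]):
            problems.append(f"invalid ID: {ref}")
    return problems
```

With the value quoted above, the space splits the name into two pointers, so the second part is reported as lacking its "#".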

Schema: additional named entities annotations

For annotating named entities we use the Czech CNEC 2.0 taxonomy, which contains 46 nested classes: http://ufal.mff.cuni.cz/~strakova/cnec2.0/ne-type-hierarchy.pdf

For ParlaMint purposes, we flatten the hierarchy and merge the categories into the CoNLL-2003 categories PER, LOC, ORG and MISC.
But we also want to keep our taxonomy.

Example of annotation (additional info is in XML comments):

<name xml:id="ParlaMint-CZ_2014-09-10_ps2013-016-01-000-000.ne17" type="PER"><!--ana="ne:p"--><!--name ana="ne:pf"-->
<w xml:id="ParlaMint-CZ_2014-09-10_ps2013-016-01-000-000.u1.p4.s1.w11"
lemma="František"
msd="UposTag=PROPN|Animacy=Anim|Case=Acc|Gender=Masc|NameType=Giv|Number=Sing|Polarity=Pos"
ana="pdt:NNMS4-----A----">Františka</w>
<!--name-->
<!--name ana="ne:ps"-->
<w xml:id="ParlaMint-CZ_2014-09-10_ps2013-016-01-000-000.u1.p4.s1.w12"
lemma="Laudát"
msd="UposTag=PROPN|Animacy=Anim|Case=Acc|Gender=Masc|NameType=Sur|Number=Sing|Polarity=Pos"
ana="pdt:NNMS4-----A----">Laudáta</w>
<!--name-->
</name>

Taxonomy starts here:

https://github.com/matyaskopp/ParlaMint/blob/9b4948532863562bd5da52421de7f1b2b613ac61/ParlaMint-CZ/ParlaMint-CZ.ana.xml#L566

An idea of schema modification

  • allow attribute ana in name element
  • allow nested names : name/name
  • make attribute type optional (at least for nested element name)

In the above-mentioned example, it distinguishes a forename and a surname which can be very useful for following annotations (linked-name entities). It would be a pity to lose this kind of information.

ParlaMint "root" files

There are now two overall ParlaMint corpus "root" files, one for the unannotated and the other for the .ana corpus. Both are automatically generated from the root files in the sample directories with the parlamint2root.xsl script.

The reason "root" is in quotes is that the files are not actually meant to be used as a proper root (we would get ID clashes, also, a huge XML document), but rather as another aid to validation, or, rather, harmonisation: here you can see one after the other the same elements from the teiHeaders of the corpora for each country (for those that are currently available), so if anybody is different (in a bad way) from the others, you could fix your corpus.

Of course, if anybody has suggestions for improving these two files, pls. post them here.

UD syntax: underscores in extended relations

I have moved to the annotation part of our project and looked at UD extended syntactic relations again (#5).
@TomazErjavec, I don't want to push you too obstinately, but I think that a solution with underscores _ in extended relations is not good and can be confusing for users who are used to colons :.

As I see, you are willing to make changes to the ParlaMint schema, so I am trying one more time.

If I understand you correctly, you don't want to use a : in the taxonomy because of a possible collision with prefixes. So I suggest using @type and @subtype in relational links.
Current:

 <link ana="ud-syn:obl_arg" target="#seg3.1.6 #seg3.1.8"/>

Suggestion:

 <link type="obl" subtype="arg" target="#seg3.1.6 #seg3.1.8"/>

My solution doesn't go against the UD standard; it just splits the relation and its extension. I don't think that we have to introduce a new standard for extended syntactic relations...

@TomazErjavec .
Please, can you look at it one more time?
Is it really necessary to create a new "standard" for UD syntax?
Can you figure out another better solution? (ideally where relation and extension would be in a single string... rel:ext)
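
Whichever encoding is chosen, the mapping between the forms is mechanical. The Python sketch below illustrates it, assuming that no plain UD relation label itself contains an underscore; the function names are hypothetical.

```python
def split_relation(label, sep="_"):
    """Split an extended relation into (relation, extension or None)."""
    relation, _, extension = label.partition(sep)
    return relation, extension or None

def to_ud(label):
    """Convert the underscore form (obl_arg) to the UD colon form (obl:arg)."""
    relation, extension = split_relation(label)
    return relation if extension is None else f"{relation}:{extension}"

def link_attributes(ud_label):
    """Attributes for the suggested <link type=... subtype=...> encoding."""
    relation, extension = split_relation(ud_label, sep=":")
    attrs = {"type": relation}
    if extension is not None:
        attrs["subtype"] = extension
    return attrs
```

A converter to CoNLL-U could call to_ud on the current @ana values, while link_attributes shows what the @type/@subtype variant would have to store.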

Validation details

Hi,

in the validation xsl (cf code below):

  • The first rule base-uri() = concat($id, '.xml') should probably read not (base-uri() = concat($id, '.xml')) ?
  • The rule matches(base-uri(), '_.+_') fires for me because of underscores in the path where my files live: file:/mnt/Projecten/Corpora/0-Ruwmateriaal/FederaalParlement/annotatie_lopend/both_parlamint_v7_trankit/ParlaMint-BE_2019-11-05-55-commissie-ic045x.xml
  • I get warnings like SnoyetdOppuersThérèse should be Snoyetd'OppuersThérèse, but the suggested improvement is not a valid XML ID.

  <xsl:template match="tei:TEI">
    <xsl:if test="base-uri() = concat($id, '.xml')">
      <xsl:call-template name="error">
        <xsl:with-param name="msg">TEI/@xml:id does not match filename</xsl:with-param>
      </xsl:call-template>
    </xsl:if>
    <xsl:if test="$level != 'component'">
      <xsl:call-template name="error">
        <xsl:with-param name="msg">Wrong filename for TEI component</xsl:with-param>
      </xsl:call-template>
    </xsl:if>
    <xsl:choose>
      <xsl:when test="not(matches(base-uri(), 'ParlaMint-.._'))">
        <xsl:call-template name="error">
          <xsl:with-param name="msg">
            <xsl:text>Component filenames should be ParlaMint-XX_...</xsl:text>
          </xsl:with-param>
        </xsl:call-template>
      </xsl:when>
      <xsl:when test="matches(base-uri(), '_.+_')">
        <xsl:call-template name="error">
          <xsl:with-param name="msg">
                  <xsl:text>Component filenames should have only one underscore</xsl:text>
          </xsl:with-param>
        </xsl:call-template>
      </xsl:when>
    </xsl:choose>
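
The second point above suggests testing only the file name and not the whole base URI. The Python sketch below illustrates that idea; it is not the XSLT used in the repository, and the function name is hypothetical. Unlike the unanchored matches() calls, it anchors the prefix test at the start of the file name and counts underscores directly.

```python
import posixpath
import re
from urllib.parse import unquote, urlparse

def component_filename_errors(base_uri):
    """Check only the last path segment of a base URI, not the directories."""
    name = posixpath.basename(unquote(urlparse(base_uri).path))
    errors = []
    if not re.match(r"ParlaMint-.._", name):
        errors.append("Component filenames should be ParlaMint-XX_...")
    elif name.count("_") != 1:
        errors.append("Component filenames should have only one underscore")
    return errors
```

With this restriction, underscores in directory names such as annotatie_lopend no longer trigger the second error.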

Organisation of directory with many (> 1000) files

Some countries have very many component files in their corpus. It is in general better not to have too many files in one directory, so in case a corpus has more than 1000 files, pls. separate them into directories per year, so you have e.g.

$ ls -l ParlaMint-NL.TEI/
drwxrwxr-x 2 tomaz tomaz  12288 Feb 22 03:07 2014
drwxrwxr-x 2 tomaz tomaz  73728 Feb 22 03:08 2015
drwxrwxr-x 2 tomaz tomaz  77824 Feb 22 03:08 2016
drwxrwxr-x 2 tomaz tomaz  65536 Feb 22 03:08 2017
drwxrwxr-x 2 tomaz tomaz  69632 Feb 22 03:09 2018
drwxrwxr-x 2 tomaz tomaz  69632 Feb 22 03:09 2019
drwxrwxr-x 2 tomaz tomaz  49152 Feb 22 03:09 2020
-rw-rw-r-- 1 tomaz tomaz 778265 Feb 22 12:09 ParlaMint-NL.xml

and each directory contains the component files for the given year.

Note that the XInclude statements in the root file must then also point to the files in directories, e.g.:

<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="2014/ParlaMint-NL_2014-04-16-tweedekamer-2.xml"/>
<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="2014/ParlaMint-NL_2014-04-16-tweedekamer-3.xml"/>
<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="2014/ParlaMint-NL_2014-04-16-tweedekamer-5.xml"/>
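
The year-prefixed href can be derived from the file name itself. The following is a small Python sketch, assuming component file names of the form ParlaMint-XX_YYYY-MM-DD...; the function names are hypothetical.

```python
import re

def xinclude_href(filename):
    """Return the year-prefixed href for a component file name."""
    match = re.match(r"ParlaMint-[^_]+_(\d{4})-\d{2}-\d{2}", filename)
    if match is None:
        raise ValueError(f"cannot find a date in {filename!r}")
    return f"{match.group(1)}/{filename}"

def xinclude_element(filename):
    """Serialise the XInclude statement for the root file."""
    return ('<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" '
            f'href="{xinclude_href(filename)}"/>')
```

The same year value can be used as the target directory when moving the component files.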

Capitalisation of UPosTag in annotated corpora

In the annotated corpora, the UD PoS tag is written as the first feature of the |-separated value of @msd, as in

msd="UPosTag=ADJ|Case=Nom|Degree=Pos|Gender=Fem|Number=Plur|VerbForm=Part"

However, CZ capitalises it differently, as "UposTag", cf.

<w xml:id="ParlaMint-CZ_2014-09-10-ps2013-016-01-000-000.u1.p1.s1.w1" lemma="vážený" msd="UposTag=ADJ|Case=Nom|Degree=Pos|Gender=Fem|Number=Plur|Polarity=Pos">Vážené</w>

So @matyaskopp (and any others who have wrong capitalisation) could you pls. change it to "UPosTag".
And, I guess, this is something else that the validation should check.
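
A check and a fix for this could look roughly like the Python sketch below. It assumes that the tag is always the first |-separated feature of @msd; the function names are hypothetical and this is not the validation actually used in the repository.

```python
def has_valid_upostag(msd):
    """True if the value starts with the correctly capitalised feature name."""
    return msd.startswith("UPosTag=")

def normalise_msd(msd):
    """Rewrite a mis-capitalised first feature name to 'UPosTag'."""
    first, sep, rest = msd.partition("|")
    name, eq, value = first.partition("=")
    if eq and name.lower() == "upostag":
        first = f"UPosTag={value}"
    return first + sep + rest
```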

tagUsage expected scope

I am not sure whether I am counting the numbers in tagUsage correctly. What is the expected scope of the tagUsage element?

Currently I count the value of @occurs over the whole TEI file: count(/TEI//element)
But the TEI Guidelines (https://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-tagUsage.html) state:

specifies the number of occurrences of this element within the text.

This probably means: within the <text> element.

So should I change the values to count(/TEI/text//element)?
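
The difference between the two scopes is easy to see on a toy document. The Python sketch below uses a made-up sample with a note in the header and another one in the text; the function name tag_usage is hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

TEI = "{http://www.tei-c.org/ns/1.0}"

# Made-up sample: one <note> in the header, one in the text.
SAMPLE = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0">'
    '<teiHeader><fileDesc><notesStmt><note>header note</note></notesStmt></fileDesc></teiHeader>'
    '<text><body><div><note>text note</note><u><seg>a</seg><seg>b</seg></u></div></body></text>'
    '</TEI>'
)

def tag_usage(root, whole_file=False):
    """Count elements below <text> (default) or below the <TEI> root."""
    scope = root if whole_file else root.find(TEI + "text")
    counts = Counter()
    for element in scope.iter():
        if element is scope:
            continue  # like XPath //, skip the context node itself
        counts[element.tag.replace(TEI, "")] += 1
    return counts

root = ET.fromstring(SAMPLE)
```

Counting over the whole file reports two note elements, while counting within text reports one.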

How to add links to photos?

GB would like to add links to photos of persons, as these can then be used by ParlaMeter. The question is how to encode photos.
The obvious solution would be to use <idno>, as we already do for links to homepages of people, e.g.

<person xml:id="AdamVojtech.1986">
   <persName>
      <surname>Vojtěch</surname>
      <forename>Adam</forename>
   </persName>
   <idno type="URI">https://www.psp.cz/sqw/detail.sqw?id=6491</idno>

But this is problematic, as the link to the photo is also a URI, so how do you distinguish the two?

The best I can come up with is to introduce the subtype attribute, and then have e.g.

   <idno type="URI" subtype="home">https://www.psp.cz/sqw/detail.sqw?id=6491</idno>
   <idno type="URI" subtype="photo">https://www.psp.cz/eknih/cdrom/2017ps/eknih/2017ps/poslanci/i6491.jpg</idno>

Any opinions on this?

Schema extension suggestion: new org role values //org/@role

(CZ) list of new (or camelCased existing ones - with star) org roles with number of occurrences in corpus:

    299 interparliamentaryFriendshipGroup
    280 subcommittee
    114 committee
     77 commission
     44 politicalGroup
     31 delegation
     13 senate
*    11 politicalParty
      9 institution
      7 supervisoryBoard
      5 workingGroup
      3 czechNationalCouncil
      3 chamberOfThePeople
      3 chamberOfTheNations
      2 europeanParliament
      1 president
      1 internationalOrganizations
      1 boardOfDirectors

Originally posted by @matyaskopp in #9 (comment)

Multiple speakers for one utterance?

Sometimes there are multiple chairs in a meeting in our (BE) data, and the utterance is marked only as spoken by chair.
e.g.

 <u who="#Lalieux_Karine #Bracke_Siegfried" ana="#chair">

This validates with parla-clarin, but not with the parlamint schema.

TEI (https://tei-c.org/release/doc/tei-p5-doc/en/html/ref-att.ascribed.html) has for @who:
Datatype | 1–∞ occurrences of teidata.pointer separated by whitespace

Could it become:

who.att = attribute who { anyURIs }

in the schema?

Allow @pos <w>

We have a local tagset which does not have a decomposition into feature structures. Could we use the pos attribute in <w>?

Bugs in CZ + validation with conversion to vertical

I have now modified my script that converts ParlaMint XML files to vertical files so that it can be used as further (semantic) validation of the corpora. It now produces some error messages, and inspecting the structural attributes (there are as yet no tokens in the vertical files) can also reveal errors.

I ran it on the latest CZ data, problems:

ERROR: cannot find speaker for ParlaMint-CZ_2014-09-10_ps2013-016-01-001-001:AndrejBabis6150
ERROR: cannot find speaker for ParlaMint-CZ_2020-11-19_ps2017-070-01-001-001:JanBlatny6683
ERROR: cannot find speaker for ParlaMint-CZ_2020-11-19_ps2017-070-01-001-001:JanHamacek5462
...

15 such in sample; it's because the "#" is missing in the u/@who.
(link checker does not catch them, as AndrejBabis6150 is a legal reference to a local file named AndrejBabis6150).

The <text> element in the vertical looks like this:

<text id="ParlaMint-CZ_2014-09-10_ps2013-016-01-000-000" subcorpus="Reference" term="ps2013" session="?" sitting="ps2013/016" from="2014-09-10" to="2014-09-10" title="Parliament of the Czech Republic, Chamber of Deputies">

Does this make sense to you? Note that these attributes are the ones we had in V1, it might make sense to change them now, esp. session and sitting, depending also on what the other corpora will have.

And this is what a speech looked like before I modified the script:

šek, Jan" speaker_role="chair" speaker_type="notMP" speaker_party="parliament.PSP7;ORGV;RRPNM;PNPUSV;SK;XNS;US;DE;HR;GB" speaker_party_name="Organizační výbor;Stálá komise pro rodinu, rovné příležitosti a národnostní menšiny;Podvýbor pro přípravu návrhů na propůjčení nebo udělení státních vyznamenání;Slovenská republika;Severské státy (Švédsko, Finsko, Norsko, Dánsko, Island);Spojené státy americké;Spolková republika Německo;Chorvatská republika;Spojené království Velké Británie a Severního Irska" speaker_gender="M" speaker_birth="?">

Most attributes seem ok, except for speaker_party(_name), which is a complete mess:

In V1 we had a simple situation:

<affiliation role="member" ref="#party.SDS.2" from="2014-08-01" to="2018-06-21" ana="#DZ.7"/>

and
<org xml:id="party.SDS.2" role="politicalParty">

So, a person could be a member of an organisation, and all the organisations were political parties. It was enough to check that they were a member on the date when they made their speech, and that is how the vertical file text/@speaker_party was computed.

In CZ you have many more organisations than just political parties. So I now check that a person has an affiliation with an org[@role="politicalParty"], but in the first run this resulted in all speaker_party being empty.

The reason was that persons in CZ are also not "member"s of their political parties, like in V1 corpora, but "candidateMP"s, and I'm not really sure what that means (despite the discussion in #16). Maybe you could change it back to member, unless there are good reasons for keeping it so?

I then changed the script again, to just ignore the person's affiliation role. So, if somebody is affiliated to an organisation that is a political party, they are ipso facto its member. So, with this change, it now looks ok I think:

<speech id="ParlaMint-CZ_2020-11-19_ps2017-070-01-001-001.u23" speaker_id="PetrFiala6074" speaker_name="Fiala" speaker_role="chair" speaker_type="MP" speaker_party="ODS" speaker_party_name="Občanská demokratická strana" speaker_gender="M" speaker_birth="?">
</speech>
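
The rule described above (any affiliation with a political party that is valid on the day of the speech, regardless of the affiliation role) can be sketched in Python as follows. This is an illustration only, not the conversion script: it assumes that @from and @to are full YYYY-MM-DD dates, treats a missing bound as open-ended, and uses hypothetical function names.

```python
from datetime import date

def active_on(affiliation, day):
    """True if the affiliation covers the given day; missing bounds are open."""
    start = affiliation.get("from")
    end = affiliation.get("to")
    return ((start is None or date.fromisoformat(start) <= day) and
            (end is None or day <= date.fromisoformat(end)))

def speaker_parties(affiliations, org_roles, day):
    """IDs of political parties the person is affiliated with on `day`,
    ignoring the affiliation role."""
    parties = []
    for aff in affiliations:
        org_id = aff["ref"].lstrip("#")
        if org_roles.get(org_id) == "politicalParty" and active_on(aff, day):
            parties.append(org_id)
    return parties
```

Here affiliations stands for the attributes of a person's affiliation elements and org_roles for a mapping from org IDs to their @role values.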

And, just for fun, here are the stats on your speaker parties per speech:

     24 _       _
     36 ANO2011 ANO2011
     95 CSSD    Česká strana sociálně demokratická
     12 KSCM    Komunistická strana Čech a Moravy
     24 ODS     Občanská demokratická strana
     18 Piráti  Česká pirátská strana
     21 SPD     Svoboda a prima demokracie - Tomio Okamura
     28 TOP09   TOP 09
      1 Usvit   Úsvit přímé demokracie Tomia Okamury

The new script is coming along in the next commit.

Bugs in CZ corpus samples

Thanks @matyaskopp for the samples, these are the current problems:

  • there is no need to send so many files, the root and 2 component files are sufficient
  • the corpus root main title should be "Czech parliamentary corpus ParlaMint-CZ [ParlaMint]", while what you have now as the main title could be the title[@type="sub"], cf. the existing examples
  • the meeting elements in the root should cover all the mandates of the corpus, currently you have 2013(!?) only; they should also be linked, via @ana, to the appropriate mandate from the org[@role="parliament"], cf. SI on how it is done.
  • pls. have the text content of your elements normalised, i.e. without leading or trailing spaces and line breaks, and without multiple spacing chars (e.g. in name[@type="org"])
  • component files: I noticed <u xml:id="ParlaMint-CZ_2013-11-25_ps2013-001-01-000-000.u1"> that does not have @who, which is a mistake (firstly in my schema, which should have u/@who as obligatory!). The text in this utterance looks more like a comment (i.e. note) anyway.
  • you now have links to audio, but are missing the recordingStmt giving the media files
  • you have non-breaking spaces (U+00A0) in your text, e.g. in ParlaMint-CZ_2013-11-25_ps2013-001-01-000-000.u2.p4 where it says "Z řad", pls. substitute them with ordinary spaces, cf. https://clarin-eric.github.io/parla-clarin/#sec-chars

So, if you could correct this and re-submit pls.
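
The two text-level points (normalised spacing and no non-breaking spaces) can be handled in one pass. The following is a minimal Python sketch, with the function name normalise_text being hypothetical:

```python
import re

def normalise_text(text):
    """Replace no-break spaces, collapse whitespace runs, trim both ends."""
    text = text.replace("\u00a0", " ")   # explicit, although \s also matches U+00A0
    return re.sub(r"\s+", " ", text).strip()
```

Applied to the text content of each element, this removes leading and trailing spaces, line breaks and runs of spacing characters.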

Allow "usage" in langUsage?

For our bilingual BE corpus, it might be nice to annotate the relative amounts for the two languages, e.g.

            <langUsage>
                <language xml:lang="en" ident="fr" usage="49.5">French</language>
                <language xml:lang="nl" ident="fr" usage="49.5">Frans</language>
                <language xml:lang="en" ident="nl" usage="50.5">Dutch</language>
                <language xml:lang="nl" ident="nl" usage="50.5">Nederlands</language>
            </langUsage>

The current schema does not allow @usage. Cf. https://tei-c.org/release/doc/tei-p5-doc/en/html/ref-language.html

Missing file for BE

Current BE files have XInclude to file ParlaMint-BE_2015-03-31_54-commissie-ic132x.xml (and .ana.xml) which does not in fact exist in the directory.

Named entities and element <name>

This issue is closely related to #41

Thinking about our elaborate taxonomy again.

The TEI element <name> is primarily intended for proper nouns: https://tei-c.org/release/doc/tei-p5-doc/en/html//CO.html#CONARS
But named-entity tasks can cover a wider range of expressions:

  • time expressions
  • numbers
  • or some artifact names

So I suggest using a different element for these, and using the name element only for proper nouns.

Possibilities:

Bug in V1.0 HR and SI corpora: <orgName> without @full

The ParlaMint schema insists on having the @full attribute on org/orgName but currently some in Version 1 do not have them
and so do not validate. This is fixed in d0c1b57 via the corpus2sample.xsl script that makes the samples, but should be corrected in the Version 2 of the HR and SI corpora.

Hierarchical structure of organizations

There are various types of organizations in our (CZ) parliament data, e.g. committee, subcommittee, workingGroup, politicalGroup, government.
They have a hierarchical structure. Would it be possible to allow //org/listOrg/org?

Change of handle

So far we have the handle where the corpus will be deposited as:

<idno type="handle">http://hdl.handle.net/11356/1388</idno>

On reflection, it would be better to have the tag as <idno type="URI" subtype="handle">. More importantly, if all the corpora in all the formats would be in one repository entry, it will be very cluttered, so I think it would be better if the "plain text" files were stored separately from the annotated, .ana files.

So, could everybody pls. change the idno handle tag to <idno type="URI" subtype="handle">, and, only in the .ana files, change the handle from http://hdl.handle.net/11356/1388 to http://hdl.handle.net/11356/1405.

Nested date validation problem

In our nested named entities, this can happen (formatted, with the w and pc elements removed for readability):

<s> Návrh zákona, kterým se mění  
  <name ana="ne:or" type="MISC">
     zákon č. 
    <num ana="ne:n_">353</num>
    /
    <date ana="ne:ty">2003</date>
     ..... 
  </name> 
  .... 
</s>

so s is not a parent of date and this validation fails:

<xsl:if test="not(parent::tei:s) and not(@when or @from or @to)">
<xsl:call-template name="error">
<xsl:with-param name="msg">Missing any temporal attribute on date</xsl:with-param>
</xsl:call-template>
</xsl:if>

@TomazErjavec, can you change parent axis to ancestor please?

Or you can change the condition to not(@when or @from or @to or @ana), as this can occur only in named entities and there should be some annotation (in CZ).
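
The effect of switching the axis can be reproduced outside XSLT. The Python sketch below re-implements the rule for illustration only, on a cut-down, namespace-free version of the fragment above; the function name flagged_dates is hypothetical.

```python
import xml.etree.ElementTree as ET

FRAGMENT = (
    '<s>Návrh zákona, kterým se mění '
    '<name ana="ne:or" type="MISC">zákon č. '
    '<num ana="ne:n_">353</num>/<date ana="ne:ty">2003</date>'
    '</name></s>'
)

TEMPORAL = ("when", "from", "to")

def flagged_dates(root, axis):
    """Return the <date> elements the rule reports, for axis 'parent' or 'ancestor'."""
    parent_of = {child: parent for parent in root.iter() for child in parent}

    def within_s(element):
        node = parent_of.get(element)
        while node is not None:
            if node.tag == "s":
                return True
            if axis == "parent":
                break             # only the immediate parent counts
            node = parent_of.get(node)
        return False

    return [d for d in root.iter("date")
            if not within_s(d) and not any(a in d.attrib for a in TEMPORAL)]

root = ET.fromstring(FRAGMENT)
```

With axis="parent" the nested date is reported; with axis="ancestor" it is accepted because an s element encloses it.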

Tokenization for romance languages

Cf also https://universaldependencies.org/fr/tokenization.html

We should agree on a common TEI convention for the encoding of clitic combinations like

  • l'auto <w join="right">l'</w> <w>auto</w>?
  • dammelo
  • au which should be à + le according to the guidelines

# text = Créée au cours du troisième trimestre 1915 comme escadrille MF 93, elle a disparu en même temps que la 30e Escadre de Chasse à laquelle elle était intégrée.
1       Créée   créer   VERB    _       Gender=Fem|Number=Sing|Tense=Past|VerbForm=Part 17      advcl   _       wordform=créée
2-3     au      _       _       _       _       _       _       _       _
2       à       à       ADP     _       _       4       case    _       _
3       le      le      DET     _       Definite=Def|Gender=Masc|Number=Sing|PronType=Art       4       det     _       _
4       cours   cours   NOUN    _       Gender=Masc|Number=Sing 1       obl:mod _       _
5-6     du      _       _       _       _       _       _       _       _
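
Any converter to TEI first has to pair each range line (such as 2-3 au) with the syntactic words it covers. The following is a minimal Python sketch of that grouping, assuming tab-separated CoNLL-U input and ignoring empty nodes; the function name read_tokens is hypothetical.

```python
def read_tokens(conllu_lines):
    """Group CoNLL-U rows into (surface form, [syntactic word forms]) pairs."""
    tokens = []
    open_until = 0   # ID of the last word covered by the current range, 0 if none
    for line in conllu_lines:
        if not line.strip():
            open_until = 0           # blank line: sentence boundary
            continue
        if line.startswith("#"):
            continue                 # comment line such as "# text = ..."
        ident, form = line.rstrip("\n").split("\t")[:2]
        if "." in ident:
            continue                 # empty nodes are ignored in this sketch
        if "-" in ident:
            # Range line: opens a multiword token covering the following IDs.
            open_until = int(ident.split("-")[1])
            tokens.append((form, []))
        elif int(ident) <= open_until:
            tokens[-1][1].append(form)
        else:
            tokens.append((form, [form]))
    return tokens
```

Each pair could then be serialised as one outer <w> holding the surface form, with one inner <w> per syntactic word, if that convention is agreed on.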

Bugs in CZ.ana

Some details:

  • for the .ana documents the top level title stamp is "[ParlaMint.ana]" (in all files)
  • as mentioned, you also need the standard NER taxonomy

I see you are committed to your ud-syn prefixDef:

replacementPattern="#xpath(//*[@xml:id = replace('$1',':','_')])">

Are you sure? Your corpus will be different from all the others, and conversions to other formats (like for the concordancer) won't work.

The pdt prefixDef is also of the same ilk:

replacementPattern="../pdt-fslib.xml#xpath(//fvLib/fs[./f/symbol/@value = '$1'])">

First, the ../pdt-fslib.xml certainly won't work for others. If you do want to include it with your corpus, then, sure, do, but then name the file just pdt-fslib.xml, as it will be in the same dir as the rest (you could even XInclude it in the root). In both cases, I don't think you need xpath, just have pdt-fslib.xml#1 (or if you XInclude, just #1).

And the component files look really nice.

Bugs in CZ

No-break space in teiCorpus file:

U+00A0   2 0.00 2 16.67 NO-BREAK SPACE

a chovatelství:

               <org xml:id="subcommittee.PMRVZ.1243" role="subcommittee">
                  <orgName full="yes" xml:lang="cs">Podvýbor pro myslivost, rybářství, včelařství, zahrádkářství a chovatelství</orgName>
                  <orgName full="yes" xml:lang="en">Subcommittee on Gamekeeping, Fisheries, Apiculture, Gardening and Animal Husbandry</orgName>
                  <orgName full="init">PMRVZ</orgName>
                  <event from="2014-02-26" to="2017-10-26">
                     <label xml:lang="en">existence</label>
                  </event>
               </org>

Bugs in CZ

In latest CZ data:

  • the main title of the component files should be unique in the corpus, so you should add a suffix that distinguishes it from the rest, e.g. "Czech parliamentary corpus ParlaMint-CZ, 2020-11-19 ps2017/070/01 [ParlaMint]"
  • on the assumption that you are pushing actual files from the corpus, and not samples, the publicationStmt/idno should not be to GitHub but <idno type="handle">http://hdl.handle.net/11356/1388</idno>

Getting ready for distribution

We should now get ready for distributing the corpora, i.e. preparing the meta-data and data for upload to the CLARIN.SI repository; the "plain text" version of the corpora will be available on http://hdl.handle.net/11356/1388 and the ".ana" one on http://hdl.handle.net/11356/1405. Below is the plan, comments welcome.

For the data:

  • Each version will have a .tgz file (bitstream) for each country corpus + 1 archive file of the ParlaMint git.
  • The country corpus for plain text will consist of 2 directories: the TEI version + the plain text version with TSV metadata.
  • The country corpus for .ana will consist of 3 directories: the TEI.ana version + CoNLL version with TSV metadata + vertical file version with registry file.

In the process of making the .tgz files I now:

  1. Do a finalization step over all your corpora, as I still noticed some slight inconsistencies (like the version number), wrong numbers of words in the extents, etc.
  2. This step also converts the polished TEI files to derived encodings, makes samples, and validates the files. It generates summary log files for: the finalization report (what has been changed in your corpora); the CoNLL-U conversion and validation report; and the standard validation report.

I will put the .tgz files and the logs on the tmp location, and let you know about it, so you can have a look.

Once we have the meta-data for the repo entries, I will send it around for you to check. For authorship of the datasets, I would take the people mentioned in the plain text and .ana root files, but - apart from the leads - in the order in which the final corpora have been submitted (this will make it easier to make subsequent versions).

As for the timing, I would publish the corpora that are ready soon, so that they are available for the DH Hackathon. In case some are still missing then, or if anybody finds horrible mistakes in their data, we could do another release at the end of the project. Specifically:

  1. 2020-05-05: data freeze for V2.0
  2. 2020-05-10: release of V2.0
  3. 2020-05-25: data freeze for V2.1
  4. 2020-05-31: release of V2.1

Anonymous speaker?

It might happen that for a certain speech (i.e. <u>) it is not known who the speaker was. Of the V1 corpora, this is only the case in BG, where this speaker was specified as:

<person xml:id="Anonymous">

However, it is not clear that this is the best way to do it. It would be more XMLish if such speeches simply did not have the @who attribute. This would mean fixing BG, undoing #21 and, especially, being aware in any user-facing applications that the attribute can be missing and that such cases must be processed appropriately (probably by showing "Anonymous").

Comments?
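To make the consequence for applications concrete, here is a minimal sketch of how a consumer could cope with a missing @who. It assumes that utterances are `u` elements in the TEI namespace and that @who, when present, is a local reference such as "#SomeSpeaker". The function name `speaker_labels` is made up for illustration.

```python
import xml.etree.ElementTree as ET
from typing import List

TEI = "{http://www.tei-c.org/ns/1.0}"


def speaker_labels(tei_xml: str, fallback: str = "Anonymous") -> List[str]:
    """Return one display label per utterance, in document order."""
    root = ET.fromstring(tei_xml)
    labels = []
    for u in root.iter(TEI + "u"):
        who = u.get("who")
        # A missing (or empty) @who means the speaker is not known.
        labels.append(who.lstrip("#") if who else fallback)
    return labels
```

A real application would resolve the reference to the person's name in the corpus root. The point here is only that the fallback has to be handled explicitly.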

The utterance element has an obligatory @ana attribute for the role of the speaker.

We have cases in which the speaker metadata is incomplete or inconsistent, mostly for invited speakers, who should be 'guest' or 'government'. We may fix these cases progressively. Meanwhile, how can we avoid validation errors? Is a default value such as "#NIL" OK? Is there also a generic rule for other attributes?

Missing action for schema validation

To make sure that the sample files are compliant with the schema we need to create and configure a GitHub action that will validate the files against the schema automatically.

Bug in V1.0 BG corpus (and in schema): bad affiliation/@role values

In the ParlaMint-teiCorpus schema there are a lot of values of affiliation/@role which are used only for BG. This is OK, but:

  • in addition to "deputyPrimeMinister" there is also "deputyPrimeMinster", which is obviously a typo
  • a lot of roles are like "candidate-chairman", but there are also roles like "deputyChief"; in order to make them consistent and in line with TEI naming conventions, the hyphenated ones should be changed to camel case
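Both fixes are mechanical, so they could be applied with a small helper. The following is a sketch only: the typo table contains just the one misspelling mentioned above, and the function names are made up for illustration.

```python
# Known misspellings of role values; only the one reported here is listed.
KNOWN_TYPOS = {"deputyPrimeMinster": "deputyPrimeMinister"}


def to_camel_case(role: str) -> str:
    """Turn a hyphenated role such as 'candidate-chairman' into camel case."""
    head, *tail = role.split("-")
    # Upper-case only the first letter of each later part, so that
    # capitals already inside a part (e.g. 'MP') are preserved.
    return head + "".join(part[:1].upper() + part[1:] for part in tail)


def normalize_role(role: str) -> str:
    """Correct known typos, then convert hyphenated values to camel case."""
    return to_camel_case(KNOWN_TYPOS.get(role, role))
```

With these definitions, `normalize_role("candidate-chairman")` gives `"candidateChairman"`, while values that are already in camel case, such as `"deputyChief"`, are returned unchanged.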

Problems with party affiliation

political party vs. political group

I have an example of Václav Klaus (the son of our former president), only relevant affiliations included:

          <person xml:id="VaclavKlaus6498">
            <persName>
              <surname>Klaus</surname>
              <forename>Václav</forename>
            </persName>
            <idno type="URI">https://www.psp.cz/sqw/detail.sqw?id=6498</idno>
            <sex value="M">mužské</sex>
            <affiliation ref="#parliament.PSP8" role="MP" from="2017-10-21"/>
            <affiliation ref="#politicalParty.ODS.155" role="candidateMP" from="2017-10-21"/>
            <affiliation ref="#politicalGroup.ODS.1295" role="member" from="2017-10-24T00:00:00" to="2019-03-17T00:00:00"/>
            <affiliation ref="#politicalGroup.Nezaraz.1500" role="member" from="2019-03-17T00:00:00"/>
          </person>

Related parties and political groups:
He was elected as a candidate of the political party ODS:

          <org xml:id="politicalParty.ODS.155" role="politicalParty">
            <orgName full="yes" xml:lang="cs">Občanská demokratická strana</orgName>
            <orgName full="yes" xml:lang="en">Civic Democratic Party</orgName>
            <orgName full="init">ODS</orgName>
            <event from="1900-01-01"><label xml:lang="en">existence</label></event>
          </org>

He joined the ODS political group in parliament:

          <org xml:id="politicalGroup.ODS.1295" role="politicalGroup">
            <orgName full="yes" xml:lang="cs">Poslanecký klub Občanské demokratické strany</orgName>
            <orgName full="yes" xml:lang="en">Political group Civic Democratic Party</orgName>
            <orgName full="init">ODS</orgName>
            <event from="2017-10-24"><label xml:lang="en">existence</label></event>
          </org>

Then he joined an independent MP group:

          <org xml:id="politicalGroup.Nezaraz.1500" role="politicalGroup">
            <orgName full="yes" xml:lang="cs">Nezařazení</orgName>
            <orgName full="init">Nezařaz</orgName>
            <event from="2019-03-15"><label xml:lang="en">existence</label></event>
          </org>

Then he created a new political party, "Trikolóra hnutí občanů" (2019-06-25), but there is no record of this event in the source data...

So the question is: which organization is relevant, the political party or the political group ( https://public.psp.cz/en/sqw/hp.sqw?k=193 )?

Originally posted by @matyaskopp in #31 (comment)
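Whichever organization is chosen, an application will have to pick the affiliations that were valid on the day of a speech. Below is a minimal sketch of such a lookup. It assumes that @from is inclusive and @to is exclusive (so that the two group memberships above do not overlap on 2019-03-17), and that @from and @to are ISO dates or date-times, as in the example. The function name `active_affiliations` is made up for illustration.

```python
from datetime import datetime
from typing import List, Optional
import xml.etree.ElementTree as ET


def active_affiliations(person_xml: str, when: str,
                        role: Optional[str] = None) -> List[str]:
    """Return the @ref targets of affiliations valid at `when`.

    `when` is an ISO date or date-time, e.g. "2018-01-01".
    If `role` is given, only affiliations with that @role are returned.
    """
    moment = datetime.fromisoformat(when)
    refs = []
    for el in ET.fromstring(person_xml).iter():
        # Compare local names, so the sketch works with or without
        # the TEI namespace on the fragment.
        if el.tag.rsplit("}", 1)[-1] != "affiliation":
            continue
        start, end = el.get("from"), el.get("to")
        if start and datetime.fromisoformat(start) > moment:
            continue  # not yet started
        if end and datetime.fromisoformat(end) <= moment:
            continue  # already ended (assumption: @to is exclusive)
        if role is not None and el.get("role") != role:
            continue
        refs.append(el.get("ref", "").lstrip("#"))
    return refs
```

For the person above, asking for `role="member"` on 2018-01-01 would select the ODS political group, and on 2019-06-01 the group of independent MPs. The party affiliation (role "candidateMP") has no @to, so it stays active throughout.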

Schema extension suggestion: allow child nodes in //note

Currently, only text nodes are allowed in <note>.

audio

We are storing audio links in notes this way:

               <note type="media">
                  <media mimeType="audio/mp3"
                         source="https://www.psp.cz/eknih/2013ps/audio/2017/07/13/2017071312581312.mp3"
                         url="2013ps/audio/2017/07/13/2017071312581312.mp3"/>
               </note>

This is related to page breaks, because one audio file corresponds to one page (#13).

time

We are parsing different formats of time, for example:

            <note type="time">
               <time from="2017-07-13T14:30:00">(Jednání pokračovalo ve 14.30 hodin.)</time>
            </note>
            <note type="time">
               <time to="2017-03-02T16:38:00">(Jednání skončilo v 16.38 hodin.)</time>
            </note>
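As an illustration of how such a @from or @to value can be derived from the note text, here is a minimal sketch. It only covers the "HH.MM hodin" pattern seen in the two examples and assumes that the date of the sitting is known from elsewhere. The function name `note_time` is made up, and this is not the actual CZ conversion code.

```python
import re
from datetime import date, datetime
from typing import Optional

# Matches clock times such as "14.30 hodin" or "16:38 hodin".
CLOCK = re.compile(r"\b(\d{1,2})[.:](\d{2})\s+hodin")


def note_time(text: str, sitting_day: date) -> Optional[datetime]:
    """Combine the clock time found in a note with the sitting date."""
    match = CLOCK.search(text)
    if match is None:
        return None
    hour, minute = int(match.group(1)), int(match.group(2))
    if hour > 23 or minute > 59:
        return None  # looks like a time but is not a valid one
    return datetime(sitting_day.year, sitting_day.month, sitting_day.day,
                    hour, minute)
```

Calling `.isoformat()` on the result gives a value in the same form as the attributes above, e.g. "2017-07-13T14:30:00" for the first note.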
