ace2005chinese_preprocess

Preprocessing of the ACE 2005 Chinese corpus for the event extraction task.

Prerequisites

  1. Prepare the ACE 2005 dataset.

    (Download: https://catalog.ldc.upenn.edu/LDC2006T06. Note that the ACE 2005 dataset is not free.)

  2. Install the packages.

    pip install beautifulsoup4 nltk tqdm
    
  3. Choose your data_list (the train/dev/test split); it is not covered by the steps above.
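Since the split is left to the user, one way to produce a random one is sketched below. This is only an illustration under stated assumptions: the `type,path` column layout is assumed and not confirmed for this repository's data_list.csv, and the 80/10/10 ratio, the seed, and the function names `make_split` and `write_data_list` are hypothetical.

```python
import csv
import random


def make_split(doc_ids, dev_ratio=0.1, test_ratio=0.1, seed=42):
    """Randomly assign each document id to dev, test, or train."""
    ids = sorted(doc_ids)              # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(ids)   # local RNG, does not touch global state
    n_dev = int(len(ids) * dev_ratio)
    n_test = int(len(ids) * test_ratio)
    rows = [("dev", d) for d in ids[:n_dev]]
    rows += [("test", d) for d in ids[n_dev:n_dev + n_test]]
    rows += [("train", d) for d in ids[n_dev + n_test:]]
    return rows


def write_data_list(rows, out_path="data_list.csv"):
    # Assumed layout: a header row, then one "type,path" row per document.
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["type", "path"])
        writer.writerows(rows)
```

Calling `write_data_list(make_split(doc_ids))` with your own list of document identifiers writes the file. Adjust the column layout to whatever main.py actually expects.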

Usage

Run:

sudo python main.py --data=./data/ace_2005/Chinese
  • The parsed data is then written to the output directory.

Output

Format

I follow the JSON format described in the nlpcl-lab/ace2005-preprocessing repository on GitHub, as in the sample below. Currently the output contains only the sentence, event mentions, and entity mentions; other information such as the dependency tree and POS tags will be added later. The data split (data_list.csv) was chosen randomly during the experiments; it is not the authoritative split for the ED task.

For details on event types and arguments, see the ACE 2005 event guidelines.

sample.json

[
  {
    "sentence": "两个星期来,藤森曾亲自带队搜捕 前情报顾问蒙特西诺斯,迄今蒙特西诺斯仍未落网",
    "golden-event-mentions": [
      {
        "arguments": [
          {
            "start": 29,
            "end": 34,
            "entity-type": "PER:Individual",
            "text": "蒙特西诺斯",
            "role": "Person"
          },
          {
            "start": 0,
            "end": 4,
            "entity-type": "TIM:time",
            "text": "两个星期",
            "role": "Time"
          }
        ],
        "trigger": {
          "start": 36,
          "end": 38,
          "text": "落网"
        },
        "event_type": "Justice:Arrest-Jail"
      }
    ],
    "golden-entity-mentions": [
      {
        "start": 16,
        "entity-type": "PER:Individual",
        "text": "前情报顾问",
        "end": 21,
        "phrase-type": "NOM"
      },
      {
        "start": 21,
        "entity-type": "PER:Individual",
        "text": "蒙特西诺斯",
        "end": 26,
        "phrase-type": "NAM"
      },
      {
        "start": 29,
        "entity-type": "PER:Individual",
        "text": "蒙特西诺斯",
        "end": 34,
        "phrase-type": "NAM"
      },
      {
        "start": 6,
        "entity-type": "PER:Individual",
        "text": "藤森",
        "end": 8,
        "phrase-type": "NAM"
      },
      {
        "start": 0,
        "entity-type": "TIM:time",
        "text": "两个星期",
        "end": 4,
        "phrase-type": "TIM"
      },
      {
        "start": 27,
        "entity-type": "TIM:time",
        "text": "迄今",
        "end": 29,
        "phrase-type": "TIM"
      }
    ]
  }
]
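In the sample, `start` and `end` are character offsets into `sentence`, with `end` exclusive, so `sentence[start:end]` reproduces `text`. The sketch below checks this on one record abridged from the sample (only two of the six entity mentions are kept, and some fields are dropped). The helper name `check_offsets` is hypothetical and not part of the repository.

```python
# One record, abridged from sample.json.
record = {
    "sentence": "两个星期来,藤森曾亲自带队搜捕 前情报顾问蒙特西诺斯,迄今蒙特西诺斯仍未落网",
    "golden-event-mentions": [
        {
            "arguments": [
                {"start": 29, "end": 34, "text": "蒙特西诺斯", "role": "Person"},
                {"start": 0, "end": 4, "text": "两个星期", "role": "Time"},
            ],
            "trigger": {"start": 36, "end": 38, "text": "落网"},
            "event_type": "Justice:Arrest-Jail",
        }
    ],
    "golden-entity-mentions": [
        {"start": 16, "end": 21, "text": "前情报顾问"},
        {"start": 6, "end": 8, "text": "藤森"},
    ],
}


def check_offsets(record):
    """Return every span whose text differs from sentence[start:end]."""
    sentence = record["sentence"]
    spans = list(record["golden-entity-mentions"])
    for event in record["golden-event-mentions"]:
        spans.append(event["trigger"])
        spans.extend(event["arguments"])
    return [s for s in spans if sentence[s["start"]:s["end"]] != s["text"]]


mismatches = check_offsets(record)

# event_type is "Type:Subtype"; split it once on the colon.
event_type, event_subtype = record["golden-event-mentions"][0]["event_type"].split(":", 1)
print(mismatches, event_type, event_subtype)
```

Running the same check over a full output file (loaded with `json.load`) is a quick way to confirm that offsets survived preprocessing.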

Reference

  • nlpcl-lab's ace2005-preprocessing repository (GitHub)
