
Comments (2)

tsproisl commented on June 7, 2024

You are right. Currently, as a developer, you need to implement parallel processing yourself, largely because parallel tagging was added quickly as an afterthought. However, cli.parallel_tagging does almost what you want: you only need to change a few parts so that it returns its output instead of printing it straight away.

Here I've turned it into a generator that yields tagged sentences, i.e. lists of (word, tag) tuples. If you use it on XML input, note that tuples corresponding to XML tags have length 1 (only the word, no POS tag).

#!/usr/bin/env python3

import multiprocessing
import threading

from someweta import ASPTagger
import someweta.cli


def parallel_tagging(corpus, asptagger, parallel, xml=False):
    """Tag file object `corpus` using `parallel` worker processes."""
    def output_result(data, xml):
        if xml:
            i, result, lines, word_indexes = data
            tags = {idx: (t[1],) for idx, t in zip(word_indexes, result)}
            return [(t,) + tags.get(idx, ()) for idx, t in enumerate(lines)]
        else:
            i, result = data
            return result

    sentinel = someweta.cli.Sentinel()
    processes = min(parallel, multiprocessing.cpu_count())
    input_queue = multiprocessing.Queue(maxsize=processes * 100)
    output_queue = multiprocessing.Queue(maxsize=processes * 100)
    producer = threading.Thread(
        target=someweta.cli.fill_input_queue,
        args=(input_queue, corpus, processes, sentinel, xml),
    )
    with multiprocessing.Pool(
        processes=processes,
        initializer=someweta.cli.process_input_queue,
        initargs=(asptagger.tag_sentence, input_queue, output_queue, sentinel, xml),
    ):
        producer.start()
        observed_sentinels = 0
        current = 0
        cached_results = {}
        while True:
            data = output_queue.get()
            if isinstance(data, someweta.cli.Sentinel):
                observed_sentinels += 1
                if observed_sentinels == processes:
                    break
                else:
                    continue
            i = data[0]
            cached_results[i] = data
            while current in cached_results:
                yield output_result(cached_results[current], xml)
                del cached_results[current]
                current += 1
        corpus_size = input_queue.get()
        producer.join()


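# Guard the entry point so that worker processes can import this module
# safely (needed when multiprocessing uses the "spawn" start method,
# e.g. on Windows and macOS).
if __name__ == "__main__":
    tagger = ASPTagger()
    tagger.load("german_web_social_media_2020-05-28.model")

    with open("input.txt", encoding="utf-8") as f:
        for sentence in parallel_tagging(f, tagger, parallel=4, xml=False):
            # One token per line, word and tag separated by a tab.
            print("\n".join(["\t".join(t) for t in sentence]))
            print()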
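
For XML input, the sentences yielded by the generator mix 2-tuples (word, tag) with 1-tuples (XML tags only), so downstream code may want to tell them apart. The following is only a rough sketch of one way to do that; the helper names `split_tokens_and_markup` and `format_tagged_sentence` and the sample sentence with its tags are made up for illustration and are not part of SoMeWeTa.

```python
def split_tokens_and_markup(sentence):
    """Separate (word, tag) pairs from 1-tuples that hold XML tags.

    Hypothetical helper: `sentence` is a list of tuples as yielded by
    the generator above when it is called with xml=True.
    """
    tokens = [item for item in sentence if len(item) == 2]
    markup = [item[0] for item in sentence if len(item) == 1]
    return tokens, markup


def format_tagged_sentence(sentence, sep="\t"):
    """Render one item per line; 1-tuples come out as the bare XML tag."""
    return "\n".join(sep.join(item) for item in sentence)


# Made-up sample data in the shape described above (not real tagger output).
sample = [("<s>",), ("Hallo", "ITJ"), ("Welt", "NN"), ("</s>",)]
tokens, markup = split_tokens_and_markup(sample)
print(format_tagged_sentence(sample))
```

Because `"\t".join` works on tuples of any length, the printing loop in the script above should also handle the XML case without changes; the split helper is only useful if you need the POS-tagged tokens on their own.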


g3rfx commented on June 7, 2024

Thank you so much for your prompt response. I really appreciate it!
