
Comments (3)

david-waterworth commented on August 20, 2024

@helpmefindaname - will do. I've uploaded the code to a fork here. I haven't had a chance to add tests yet.

from flair.

david-waterworth commented on August 20, 2024

This is my "patched" version (I removed some code from _add_label_to_sentence)

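import json
from pathlib import Path
from typing import Union

from flair.data import Sentence
from flair.datasets.sequence_labeling import JsonlDataset
from flair.tokenization import Tokenizer


class JsonlDatasetEx(JsonlDataset):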
    def __init__(
        self,
        path_to_jsonl_file: Union[str, Path],
        encoding: str = "utf-8",
        text_column_name: str = "text",
        label_column_name: str = "spans",
        metadata_column_name: str = "metadata",
        label_type: str = "ner",
        use_tokenizer: Union[bool, Tokenizer] = True,
    ) -> None:
        path_to_json_file = Path(path_to_jsonl_file)

        self.text_column_name = text_column_name
        self.label_column_name = label_column_name
        self.metadata_column_name = metadata_column_name
        self.label_type = label_type
        self.path_to_json_file = path_to_json_file

        self.sentences: list[Sentence] = []
        with path_to_json_file.open(encoding=encoding) as jsonl_fp:
            for line in jsonl_fp:
                current_line = json.loads(line)
                raw_text = current_line[text_column_name]
                current_labels = current_line[label_column_name]
                current_metadatas = current_line.get(self.metadata_column_name, [])
                current_sentence = Sentence(raw_text, use_tokenizer=use_tokenizer)

                self._add_labels_to_sentence(raw_text, current_sentence, current_labels)
                self._add_metadatas_to_sentence(current_sentence, current_metadatas)

                self.sentences.append(current_sentence)

    def _add_label_to_sentence(self, text: str, sentence: Sentence, start: int, end: int, label: str):
        # Search start and end token index for current span
        start_idx = -1
        end_idx = -1
        for token in sentence:
            if token.start_position <= start < token.end_position and start_idx == -1:
                start_idx = token.idx - 1

            if token.start_position < end <= token.end_position and end_idx == -1:
                end_idx = token.idx - 1

        # Raise an error if no valid token span was found (this also covers end_idx == -1)
        if start_idx == -1 or start_idx > end_idx:
            raise ValueError(
                "Could not create token span from char span.\n"
                f"Sen: {sentence}\nStart: {start}, End: {end}, Label: {label}\n"
                f"Ann: {text[start:end]}\nRaw: {text}\nCo: {start_idx}, {end_idx}"
            )

        sentence[start_idx : end_idx + 1].add_label(self.label_type, label)

Tested with:

class CharTokenizer(Tokenizer):
    def tokenize(self, text: str) -> list[str]:
        return list(text)
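
For readers without flair installed, the index search in _add_label_to_sentence can be sketched in plain Python. This is only an illustration, not the fork's code: `Tok`, `char_tokens`, and `char_span_to_token_span` are hypothetical names, and the sketch assumes every character (spaces included) becomes one token with a 1-based `idx`, which may differ from how flair builds tokens from a custom tokenizer.

```python
from dataclasses import dataclass


@dataclass
class Tok:
    """Hypothetical stand-in for a token: 1-based idx plus character offsets."""

    idx: int
    start_position: int
    end_position: int


def char_tokens(text: str) -> list[Tok]:
    # Assumption: one token per character, similar in spirit to CharTokenizer.
    return [Tok(i + 1, i, i + 1) for i in range(len(text))]


def char_span_to_token_span(tokens: list[Tok], start: int, end: int) -> tuple[int, int]:
    """Map a character span [start, end) to a half-open token slice."""
    start_idx = -1
    end_idx = -1
    for token in tokens:
        # First token whose character range contains the span start.
        if token.start_position <= start < token.end_position and start_idx == -1:
            start_idx = token.idx - 1
        # First token whose character range contains the span end.
        if token.start_position < end <= token.end_position and end_idx == -1:
            end_idx = token.idx - 1
    if start_idx == -1 or start_idx > end_idx:
        raise ValueError(f"Could not create token span from char span ({start}, {end})")
    return start_idx, end_idx + 1


text = "New York"
tokens = char_tokens(text)
token_start, token_end = char_span_to_token_span(tokens, 4, 8)
print(text[4:8], token_start, token_end)
```

With one token per character, the token slice coincides with the character span, so a span that starts or ends mid-word still maps onto exact token boundaries. A span that lies outside the text raises ValueError, just as in the patched method.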


helpmefindaname commented on August 20, 2024

Hi @david-waterworth,
I agree that this would be a good feature extension to flair. Since you have already invested some time, would you like to create a pull request?
