Update SpaCy support to cover new features #176

@frreiss

SpaCy 3.0's language models now produce some additional features that we don't currently translate to DataFrames. The parse tree now exposes each token's children and ancestors. An is_sent_start flag indicates whether a token begins a sentence. The vector field of Token holds the token's embedding. There are probably a few more; see https://spacy.io/api/token for the full list.
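For reference, here is a minimal sketch of reading these attributes off a Doc and flattening them into one record per token. The helper name token_feature_rows and the output column names are hypothetical, not part of text-extensions-for-pandas. The Doc is built by hand with explicit heads and deps so the sketch runs without downloading a model, which also means no word vectors are loaded, so only has_vector is recorded rather than the vector itself.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

# Hand-built parse so no trained pipeline is needed.
# In spaCy 3.x, `heads` holds absolute token indices; the root points at itself.
words = ["She", "ate", "the", "pizza", "."]
heads = [1, 1, 3, 1, 1]
deps = ["nsubj", "ROOT", "det", "dobj", "punct"]
doc = Doc(Vocab(), words=words, heads=heads, deps=deps)


def token_feature_rows(doc):
    """Hypothetical helper: one dict per token with the features named above."""
    rows = []
    for token in doc:
        rows.append({
            "id": token.i,
            "text": token.text,
            "head": token.head.i,
            # Immediate syntactic children, as token indices.
            "children": [child.i for child in token.children],
            # Ancestors from the direct head up to the root, as token indices.
            "ancestors": [anc.i for anc in token.ancestors],
            "is_sent_start": token.is_sent_start,
            # token.vector is only meaningful when the pipeline ships vectors.
            "has_vector": token.has_vector,
        })
    return rows


rows = token_feature_rows(doc)
```

Passing rows to pandas.DataFrame would give one row per token. List-valued columns such as children and ancestors are only one possible encoding; the actual DataFrame representation is an open design question.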

We should extend the existing SpaCy support in https://github.com/CODAIT/text-extensions-for-pandas/blob/master/text_extensions_for_pandas/io/spacy.py to cover these additional features when they are present.

With these additional features, the DataFrame representation of the full output of a SpaCy language model is getting a bit large. It would be a good idea to also add a facility to produce only the DataFrame columns that your application needs: say, an additional argument to make_tokens_and_features that replaces and generalizes the existing add_left_and_right argument and controls which columns appear in the output.
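One way the column selection could work is sketched below. Everything here is an assumption for illustration: the function name extract_columns, the columns argument, and the FEATURE_EXTRACTORS registry are hypothetical and do not reflect the current signature of make_tokens_and_features. The idea is to map each output column to a small extractor function, so that only the requested columns are ever computed.

```python
from types import SimpleNamespace

# Hypothetical registry: output column name -> function computing it from a token.
# The attributes read here (i, text, lemma_, pos_, head, is_sent_start)
# are documented on spaCy's Token.
FEATURE_EXTRACTORS = {
    "id": lambda t: t.i,
    "text": lambda t: t.text,
    "lemma": lambda t: t.lemma_,
    "pos": lambda t: t.pos_,
    "head": lambda t: t.head.i,
    "is_sent_start": lambda t: t.is_sent_start,
}


def extract_columns(tokens, columns=None):
    """Return {column name: list of values}; columns=None selects every known column."""
    names = list(FEATURE_EXTRACTORS) if columns is None else list(columns)
    unknown = [n for n in names if n not in FEATURE_EXTRACTORS]
    if unknown:
        raise ValueError(f"Unknown feature column(s): {unknown}")
    tokens = list(tokens)
    # Extractors for columns that were not requested are never called.
    return {n: [FEATURE_EXTRACTORS[n](t) for t in tokens] for n in names}


# Stand-in tokens so the sketch runs without spaCy; a real Doc iterates the same way.
root = SimpleNamespace(i=1, text="ate", lemma_="eat", pos_="VERB", is_sent_start=False)
root.head = root
subj = SimpleNamespace(i=0, text="She", lemma_="she", pos_="PRON",
                       is_sent_start=True, head=root)

selected = extract_columns([subj, root], columns=["id", "text", "head"])
```

The resulting dict of equal-length lists can be handed to pandas.DataFrame directly. Under this design, an option like add_left_and_right would become a matter of including or omitting the corresponding column names.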
