
Cleaning data before BPE #35

@marianelamin

Description

  • Create a set of data cleaning methods:

    • Convert text to lowercase
    • Replace accented vowels (á é í ó ú -> a e i o u) and ñ -> gn
    • Remove emojis
    • Remove mentions
    • Remove hashtags
    • Remove links
    • Remove punctuation: . - : , ?
    • Collapse repeated spaces into one
    • Strip leading and trailing whitespace
    • Stemming? (open question)
  • Create the Cleaning class, with each method above as a method of that class. It can become part of the c4v NLP cleaning library.
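
The steps above could be sketched roughly as follows. This is only a minimal illustration, not the c4v implementation: the method names, the regular expressions, and the order of the steps in `clean` are all assumptions, the emoji pattern covers only the most common Unicode emoji blocks, and removed items are replaced with a space so that neighbouring words do not merge. Stemming is left out because the issue leaves it open.

```python
import re


class Cleaning:
    """Hypothetical sketch of the proposed Cleaning class (names are assumptions)."""

    # á é í ó ú -> a e i o u and ñ -> gn, as listed in the issue
    _ACCENTS = str.maketrans(
        {"á": "a", "é": "e", "í": "i", "ó": "o", "ú": "u", "ñ": "gn"}
    )
    _LINK = re.compile(r"https?://\S+|www\.\S+")
    _MENTION = re.compile(r"@\w+")
    _HASHTAG = re.compile(r"#\w+")
    # Only the punctuation marks named in the issue: . - : , ?
    _PUNCT = re.compile(r"[.\-:,?]")
    _SPACES = re.compile(r"\s+")
    # Approximate emoji coverage (assumption): flags, pictographs,
    # misc symbols, dingbats, variation selector, zero-width joiner
    _EMOJI = re.compile(
        "[\U0001F1E6-\U0001F1FF\U0001F300-\U0001FAFF\u2600-\u27BF\uFE0F\u200D]"
    )

    @staticmethod
    def lowercase(text: str) -> str:
        return text.lower()

    @classmethod
    def replace_accents(cls, text: str) -> str:
        return text.translate(cls._ACCENTS)

    @classmethod
    def remove_emojis(cls, text: str) -> str:
        return cls._EMOJI.sub(" ", text)

    @classmethod
    def remove_mentions(cls, text: str) -> str:
        return cls._MENTION.sub(" ", text)

    @classmethod
    def remove_hashtags(cls, text: str) -> str:
        return cls._HASHTAG.sub(" ", text)

    @classmethod
    def remove_links(cls, text: str) -> str:
        return cls._LINK.sub(" ", text)

    @classmethod
    def remove_punctuation(cls, text: str) -> str:
        return cls._PUNCT.sub(" ", text)

    @classmethod
    def remove_extra_spaces(cls, text: str) -> str:
        return cls._SPACES.sub(" ", text)

    @staticmethod
    def strip(text: str) -> str:
        return text.strip()

    def clean(self, text: str) -> str:
        """Apply every step in sequence.

        Links go first so that '#' fragments and dots inside URLs are not
        touched by the hashtag and punctuation steps.
        """
        steps = (
            self.lowercase,
            self.remove_links,
            self.remove_mentions,
            self.remove_hashtags,
            self.remove_emojis,
            self.replace_accents,
            self.remove_punctuation,
            self.remove_extra_spaces,
            self.strip,
        )
        for step in steps:
            text = step(text)
        return text
```

For example, `Cleaning().clean("  Hola @juan: Mañana NO hay luz 😢 #SinLuz https://t.co/abc.  Está bien?  ")` returns `"hola magnana no hay luz esta bien"`. Keeping each step as its own method lets the steps be tested individually and reordered or dropped before BPE training.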

Labels

enhancement (New feature or request)
