@marianelamin
Collaborator

@marianelamin marianelamin commented Apr 22, 2021

Catching up.
The data cleaning class was created 5 months ago to deal with tweets. It offers several methods that can be applied directly to a str or a pd.Series to remove punctuation, hashtags, links, mentions, etc.
More details on issue #35

Changes in this PR:

  • src/c4v/data/data_sampler.py
    Make use of the data cleaner utility when sampling the data.
    Format the code with Black.
  • src/c4v/data/data_cleaner.py
    Create methods to "clean" text in various ways (removing links, hashtags, emojis, punctuation, extra whitespace, and mentions or tags; trimming; removing Spanish accents).
  • tests/data/test_data_cleaner.py
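
For readers who have not opened the diff, here is a minimal sketch of the kind of cleaning described above. It is not the code in src/c4v/data/data_cleaner.py: the function names `clean_tweet` and `remove_accents` and the regular expressions are assumptions made for illustration, and the sketch uses only the standard library.

```python
import re
import unicodedata

# Hypothetical patterns for illustration; not the actual c4v data_cleaner API.
_URL = re.compile(r"https?://\S+|www\.\S+")
_MENTION = re.compile(r"@\w+")
_HASHTAG = re.compile(r"#\w+")
_PUNCT = re.compile(r"[^\w\s]")  # anything that is neither a word char nor whitespace
_SPACES = re.compile(r"\s+")


def remove_accents(text: str) -> str:
    """Drop combining marks after NFD decomposition (note: also maps 'ñ' to 'n')."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def clean_tweet(text: str) -> str:
    """Remove links, mentions, hashtags and punctuation, then tidy whitespace."""
    # Order matters: links and mentions must go before punctuation is stripped,
    # otherwise "https://..." and "@user" would be broken into leftover words.
    for pattern in (_URL, _MENTION, _HASHTAG, _PUNCT):
        text = pattern.sub(" ", text)
    text = remove_accents(text)
    return _SPACES.sub(" ", text).strip()
```

Because `clean_tweet` takes and returns a plain str, it could also be applied to a pd.Series with `series.map(clean_tweet)`; whether the class in this PR works that way is not something this sketch claims.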

This utility can grow depending on the needs of the cleaning phase.
Feedback is encouraged!

@marianelamin marianelamin linked an issue Apr 22, 2021 that may be closed by this pull request
12 tasks
@marianelamin marianelamin requested a review from dieko95 April 23, 2021 02:15
@marianelamin marianelamin self-assigned this Apr 23, 2021
@marianelamin marianelamin added the enhancement New feature or request label Apr 23, 2021
@marianelamin marianelamin marked this pull request as ready for review April 23, 2021 02:16

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Cleaning data before BPE

2 participants