@dhruvil410

Fix #60
We could also fix the issue by replacing each \n with a space as soon as the sentences are received, that is, by adding this line at the start of rulebased_split_sentences(sentences): sentences = replace(sentences, r"\n" => Base.SubstitutionString(" ")). The committed code could also be extended to cover characters other than alphanumeric ones.
Which is the better way to fix this issue? Any other suggestions are welcome.
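
To make the first option concrete, here is a minimal sketch of the newline replacement on its own. The helper name normalize_newlines is hypothetical and not part of WordTokenizers.jl; it only illustrates the preprocessing step that would run before the existing splitting rules.

```julia
# Hypothetical helper (not part of WordTokenizers.jl):
# treat every newline as ordinary whitespace by replacing it with a space.
normalize_newlines(text::AbstractString) = replace(text, r"\n" => " ")

text = "This is a sentence\nsplit across lines. Here is another."
normalized = normalize_newlines(text)
# normalized == "This is a sentence split across lines. Here is another."
```

A plain " " is enough as the replacement here, since the substitution uses no capture groups.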

@triztian

triztian commented Apr 5, 2021

I think adding tests would help make this fix more robust. Also, since it changes the function's output, maybe make it an optional keyword argument, so that those who need this behavior enable it explicitly rather than having it change all of a sudden.

For example, updating rulebased_split_sentences:

function rulebased_split_sentences(sentences)

So that it can be called like this:

rulebased_split_sentences(sentence, collapse_newlines=true)

That way, multiple newlines are reduced to one newline and single newlines are removed.
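
A rough sketch of how such an opt-in flag could look, written as a wrapper so it runs without touching the package. The names collapse_newlines_text and split_with_option are hypothetical. The sketch also assumes that a "removed" single newline becomes a space, so that the words on either side do not merge.

```julia
# Hypothetical helper: collapse newlines as described above.
function collapse_newlines_text(text::AbstractString)
    # A single newline (no newline directly before or after it) becomes a space.
    # Assumption: a space rather than nothing, so adjacent words stay separate.
    text = replace(text, r"(?<!\n)\n(?!\n)" => " ")
    # A run of two or more newlines is reduced to exactly one newline.
    return replace(text, r"\n{2,}" => "\n")
end

# Hypothetical wrapper: the new behavior is opt-in and off by default,
# so existing callers keep the current output.
function split_with_option(splitter, text::AbstractString; collapse_newlines::Bool=false)
    prepared = collapse_newlines ? collapse_newlines_text(text) : text
    return splitter(prepared)
end

# With WordTokenizers loaded, the call would look like:
#   split_with_option(rulebased_split_sentences, text; collapse_newlines=true)
```

Keeping the default at false means the output only changes for callers who ask for it, and tests can cover both modes side by side.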

@dhruvil410
Author

I'm not familiar with the checks. Why didn't the code pass them?

Development

Successfully merging this pull request may close these issues.

Sentence tokenization must ignore newline as whitespace in the default mode.
