
Conversation

@ShaharNaveh (Contributor):

While trying to lex an incomplete source code that comes from a lazy iterator or a BufRead, I had trouble with:

  • knowing the offset of the last good token
  • configuring the Lexer to start from that offset

Unfortunately,

```rust
/// Create a new [`Lexer`] for the given source code and [`Mode`].
pub fn lex(source: &str, mode: Mode) -> Lexer<'_> {
```

doesn't let you do that.


Feel free to close this PR if this is an unwanted change.

@MichaReiser (Member) left a comment:

Can you tell me more about your use case?

The lexer is very much tied to our parser and not really intended to be public API.

```diff
 /// This means that the input source should be the complete source code and not the
 /// sliced version.
-pub(crate) fn new(source: &'src str, mode: Mode, start_offset: TextSize) -> Self {
+pub fn new(source: &'src str, mode: Mode, start_offset: TextSize) -> Self {
```
@MichaReiser (Member):

I'd be okay having a function next to lex_tokens that also takes an offset, but I'd rather keep the constructor pub(crate).

@ShaharNaveh (Contributor, Author) commented on Oct 25, 2025:

No problem. But it does feel a bit redundant: it would have the same signature and would call Lexer::new under the hood, so we'd end up with two identical methods, except that one is pub and the other is pub(crate).

I'm not even sure how to call it 😅
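
For illustration, the redundant pair would look something like this (the name `lex_starts_at` is only a placeholder):

```rust
/// Hypothetical public twin of the pub(crate) constructor; it would live
/// next to `lex` and simply forward its arguments.
pub fn lex_starts_at(source: &str, mode: Mode, start_offset: TextSize) -> Lexer<'_> {
    Lexer::new(source, mode, start_offset)
}
```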

@ShaharNaveh (Contributor, Author):

After your explanation here: #21074 (comment)

There's no need to have this one public. Will revert.

@github-actions (bot) commented on Oct 25, 2025:

ruff-ecosystem results

Linter (stable)

✅ ecosystem check detected no linter changes.

Linter (preview)

✅ ecosystem check detected no linter changes.

Formatter (stable)

✅ ecosystem check detected no format changes.

Formatter (preview)

✅ ecosystem check detected no format changes.

@ShaharNaveh (Contributor, Author) commented on Oct 25, 2025:

> Can you tell me more about your use case?

Of course :)

I'm trying to implement a Rust iterator that behaves like CPython’s internal, undocumented _tokenize.TokenizerIter class.
This class operates on any object that provides a .readline() method.

On each __next__ call, it:

  • May call .readline() on the given object, consuming it lazily as needed.
  • Yields a tuple that includes (among other elements) the line and column numbers where the token starts and ends, which is why I need access to the TextRange.

Because the input is consumed lazily, I need to keep track of my current position (offset) in the source. I don’t want to re-tokenize everything from the beginning every time a new line is read.

For example, imagine the first line is:

```python
def foo():
```

After processing the : token, a syntax error occurs because the source is incomplete. To continue, I call .readline() on the buffer, which gives me:

```python
    pass
```

Now my source looks like this:

```python
def foo():
    pass
```

I should be able to get the next token starting from where the : token ended (in this case, the next token would be Indent, then Pass).

Here’s a short Python snippet that illustrates the behavior:

```python
import io
import _tokenize

buf = io.StringIO(
"""
def func():
  pass

-)( ERROR $&-

for i in range(1):
  pass
""")

try:
  for tup in _tokenize.TokenizerIter(buf.readline, extra_tokens=False):
    # (token numeric value, token value, (line_start, col_start), (line_end, col_end), current_line)
    # Token numeric values from: https://github.com/python/cpython/blob/ebf955df7a89ed0c7968f79faec1de49f61ed7cb/Lib/token.py#L7-L79
    print(tup)
except BaseException as err:
  print(f"{err=}")

print(f"{buf.read()=}")  # Remaining buffer that wasn't touched.
```

@MichaReiser (Member):

Thanks for the explanation.

I don't think Lexer::new starting from a given offset is what you want in that case. The issue with constructing a new Lexer is that the Lexer tracks a lot of internal state (the number of open parentheses, the f-string nesting, ...) that you lose when you throw away the old lexer and create a new instance.

So what you really want is a way to update the underlying String (appending the new content) and then call next_token again. But I'm not even sure that will work, because the Lexer e.g. returns a String token even if it is unterminated. In that case, you'd have to check that the unterminated flag is set, then rewind the lexer to before the string.
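
A minimal illustration of that pitfall (assuming the lexer's next_token, which is internal today, were callable):

```rust
use ruff_python_parser::{lex, Mode, TokenKind};

// Lexing a truncated source still yields a String token for `'abc` even
// though the closing quote never arrived, so a resumable caller cannot
// tell "token complete" from "needs more input" without the token flags.
let mut lexer = lex("x = 'abc", Mode::Module);
assert_eq!(lexer.next_token(), TokenKind::Name);   // `x`
assert_eq!(lexer.next_token(), TokenKind::Equal);  // `=`
assert_eq!(lexer.next_token(), TokenKind::String); // `'abc`, unterminated
```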

@ShaharNaveh (Contributor, Author):

> Thanks for the explanation.
>
> I don't think Lexer::new starting from a given offset is what you want in that case. The issue with constructing a new Lexer is that the Lexer tracks a lot of internal state (the number of open parentheses, the f-string nesting, ...) that you lose when you throw away the old lexer and create a new instance.
>
> So what you really want is a way to update the underlying String (appending the new content) and then call next_token again. But I'm not even sure that will work, because the Lexer e.g. returns a String token even if it is unterminated. In that case, you'd have to check that the unterminated flag is set, then rewind the lexer to before the string.

Oh, good to know.

So, if I understand correctly: there's no benefit to starting the Lexer from a different offset; I could re-lex the entire source and grab the first token whose offset is larger than the one I previously had.
Unless there's an API to adjust the Lexer's cursor position that I'm not seeing (and even if there were, using it would feel criminally wrong).
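
A rough sketch of that re-lex-and-skip approach (assuming next_token and current_range, both internal today, were reachable):

```rust
use ruff_python_parser::{lex, Mode, TokenKind};
use ruff_text_size::{TextRange, TextSize};

/// Re-lex `source` from the beginning and return the first token that
/// starts at or after `resume`, the end offset of the last token already
/// handed out. Sketch only; not the crate's public API.
fn first_token_from(source: &str, resume: TextSize) -> Option<(TokenKind, TextRange)> {
    let mut lexer = lex(source, Mode::Module);
    loop {
        let kind = lexer.next_token();
        if kind == TokenKind::EndOfFile {
            return None;
        }
        // Skip everything we already saw in a previous round of lexing.
        if lexer.current_range().start() >= resume {
            return Some((kind, lexer.current_range()));
        }
    }
}
```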

And for this PR, I'd need to make the following adjustments:

Add TokenFlags to

```rust
pub use crate::token::{Token, TokenKind};
```

and make this pub:

```rust
pub(crate) fn current_flags(&self) -> TokenFlags {
```
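
Spelled out, the two proposed edits would be:

```diff
-pub use crate::token::{Token, TokenKind};
+pub use crate::token::{Token, TokenFlags, TokenKind};

-pub(crate) fn current_flags(&self) -> TokenFlags {
+pub fn current_flags(&self) -> TokenFlags {
```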

@MichaReiser (Member) commented on Oct 25, 2025:

It's not clear to me why you need the current_* methods over just calling next_token.

Adjusting the cursor location has the same problem as creating a new lexer: it doesn't account for the internal state.

@ShaharNaveh (Contributor, Author):

@MichaReiser After your explanation of:

> I don't think Lexer::new starting from a given offset is what you want in that case. The issue with constructing a new Lexer is that the Lexer tracks a lot of internal state (the number of open parentheses, the f-string nesting, ...) that you lose when you throw away the old lexer and create a new instance.

It seems like Lexer isn't exactly what I need. And if I intend to reparse the whole source after each new addition, then ruff_python_parser::Parser already gives me what I need.
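
A minimal sketch of that reparse-everything approach (using parse_unchecked so tokens are still produced for source that doesn't fully parse yet; error handling elided):

```rust
use ruff_python_parser::{parse_unchecked, Mode, ParseOptions};
use ruff_text_size::{Ranged, TextSize};

/// After each readline(), reparse the whole accumulated buffer and print
/// only the tokens that start at or beyond the offset already emitted.
fn print_new_tokens(source: &str, resume: TextSize) {
    let parsed = parse_unchecked(source, ParseOptions::from(Mode::Module));
    for token in parsed.tokens().iter() {
        if token.range().start() >= resume {
            println!("{:?} @ {:?}", token.kind(), token.range());
        }
    }
}
```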

tysm for the replies and explanations!

@ShaharNaveh deleted the expose-lexer-state branch on October 26, 2025 at 10:09.