
Conversation

@ShaharNaveh (Contributor):

While trying to lex an incomplete source code that comes from a lazy iterator or a BufRead, I had trouble with:

  • knowing the offset of the last good token
  • configuring the Lexer to start from that offset

Unfortunately,

```rust
/// Create a new [`Lexer`] for the given source code and [`Mode`].
pub fn lex(source: &str, mode: Mode) -> Lexer<'_> {
```

doesn't let you do that.


Feel free to close this PR if this is an unwanted change.

@MichaReiser (Member) left a comment:

Can you tell me more about your use case?

The lexer is very much tied to our parser and not really intended to be public API.

```diff
 /// This means that the input source should be the complete source code and not the
 /// sliced version.
-pub(crate) fn new(source: &'src str, mode: Mode, start_offset: TextSize) -> Self {
+pub fn new(source: &'src str, mode: Mode, start_offset: TextSize) -> Self {
```
@MichaReiser (Member):

I'd be okay having a function next to lex_tokens that also takes an offset, but I'd rather keep the constructor pub(crate).

@ShaharNaveh (Contributor, Author) commented on Oct 25, 2025:

No problem. But it does feel a bit redundant: it would have the same signature and would call Lexer::new under the hood, so we'd end up with two identical methods, except that one is pub and the other is pub(crate).

I'm not even sure how to call it 😅
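
For illustration, the redundant pair would look something like this (the name `lex_starts_at` is only a placeholder):

```rust
/// Hypothetical public twin of the pub(crate) constructor; it would live
/// next to `lex` and simply forward its arguments.
pub fn lex_starts_at(source: &str, mode: Mode, start_offset: TextSize) -> Lexer<'_> {
    Lexer::new(source, mode, start_offset)
}
```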

@ShaharNaveh (Contributor, Author):

After your explanation here: #21074 (comment)

There's no need to have this one public. Will revert.

@github-actions (bot) commented on Oct 25, 2025:

ruff-ecosystem results

Linter (stable)

✅ ecosystem check detected no linter changes.

Linter (preview)

✅ ecosystem check detected no linter changes.

Formatter (stable)

✅ ecosystem check detected no format changes.

Formatter (preview)

✅ ecosystem check detected no format changes.

@ShaharNaveh (Contributor, Author) commented on Oct 25, 2025:

> Can you tell me more about your use case?

Of course :)

I'm trying to implement a Rust iterator that behaves like CPython’s internal, undocumented _tokenize.TokenizerIter class.
This class operates on any object that provides a .readline() method.

On each __next__ call, it:

  • May call .readline() on the given object, consuming it lazily as needed.
  • Yields a tuple that includes (among other elements) the line and column numbers where the token starts and ends, which is why I need access to the TextRange.

Because the input is consumed lazily, I need to keep track of my current position (offset) in the source. I don’t want to re-tokenize everything from the beginning every time a new line is read.

For example, imagine the first line is:

```python
def foo():
```

After processing the : token, a syntax error occurs because the source is incomplete. To continue, I call .readline() on the buffer, which gives me:

```python
    pass
```

Now my source looks like this:

```python
def foo():
    pass
```

I should be able to get the next token starting from where the : token ended (in this case, the next token would be Indent, then Pass).

Here’s a short Python snippet that illustrates the behavior:

```python
import io
import _tokenize

buf = io.StringIO(
"""
def func():
  pass

-)( ERROR $&-

for i in range(1):
  pass
""")

try:
  for tup in _tokenize.TokenizerIter(buf.readline, extra_tokens=False):
    # (token numeric value, token value, (line_start, col_start), (line_end, col_end), current_line)
    # Token numeric values from: https://github.com/python/cpython/blob/ebf955df7a89ed0c7968f79faec1de49f61ed7cb/Lib/token.py#L7-L79
    print(tup)
except BaseException as err:
  print(f"{err=}")

print(f"{buf.read()=}")  # Remaining buffer that wasn't touched.
```

@MichaReiser (Member):

Thanks for the explanation.

I don't think Lexer::new starting from a given offset is what you want in that case. The issue with constructing a new Lexer is that the Lexer tracks a lot of internal state (the number of open parentheses, the f-string nesting, ...) that you lose when you throw away the old lexer and create a new instance.

So what you really want is a way to update the underlying String (appending the new content) and then call next_token again. But I'm not even sure that will work, because the Lexer e.g. returns a String token even if it is unterminated. In that case, you'd have to check that the unterminated flag is set, then rewind the lexer to before the string.
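
A minimal illustration of that pitfall (assuming the lexer's next_token, which is internal today, were callable):

```rust
use ruff_python_parser::{lex, Mode, TokenKind};

// Lexing a truncated source still yields a String token for `'abc` even
// though the closing quote never arrived, so a resumable caller cannot
// tell "token complete" from "needs more input" without the token flags.
let mut lexer = lex("x = 'abc", Mode::Module);
assert_eq!(lexer.next_token(), TokenKind::Name);   // `x`
assert_eq!(lexer.next_token(), TokenKind::Equal);  // `=`
assert_eq!(lexer.next_token(), TokenKind::String); // `'abc`, unterminated
```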

@ShaharNaveh (Contributor, Author):

> Thanks for the explanation.
>
> I don't think Lexer::new starting from a given offset is what you want in that case. The issue with constructing a new Lexer is that the Lexer tracks a lot of internal state (the number of open parentheses, the f-string nesting, ...) that you lose when you throw away the old lexer and create a new instance.
>
> So what you really want is a way to update the underlying String (appending the new content) and then call next_token again. But I'm not even sure that will work, because the Lexer e.g. returns a String token even if it is unterminated. In that case, you'd have to check that the unterminated flag is set, then rewind the lexer to before the string.

Oh, good to know.

So, if I understand correctly: there's no benefit to starting the Lexer from a different offset; I could re-lex the entire source and grab the first token whose offset is larger than the one I previously had.
Unless there's an API to adjust the Lexer's cursor position that I'm not seeing (and even if there were, using it would feel criminally wrong).
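
A rough sketch of that re-lex-and-skip approach (assuming next_token and current_range, both internal today, were reachable):

```rust
use ruff_python_parser::{lex, Mode, TokenKind};
use ruff_text_size::{TextRange, TextSize};

/// Re-lex `source` from the beginning and return the first token that
/// starts at or after `resume`, the end offset of the last token already
/// handed out. Sketch only; not the crate's public API.
fn first_token_from(source: &str, resume: TextSize) -> Option<(TokenKind, TextRange)> {
    let mut lexer = lex(source, Mode::Module);
    loop {
        let kind = lexer.next_token();
        if kind == TokenKind::EndOfFile {
            return None;
        }
        // Skip everything we already saw in a previous round of lexing.
        if lexer.current_range().start() >= resume {
            return Some((kind, lexer.current_range()));
        }
    }
}
```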

And for this PR, I'd need to make the following adjustments:

Add TokenFlags to

```rust
pub use crate::token::{Token, TokenKind};
```

and make this pub:

```rust
pub(crate) fn current_flags(&self) -> TokenFlags {
```
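
Spelled out, the two proposed edits would be:

```diff
-pub use crate::token::{Token, TokenKind};
+pub use crate::token::{Token, TokenFlags, TokenKind};

-pub(crate) fn current_flags(&self) -> TokenFlags {
+pub fn current_flags(&self) -> TokenFlags {
```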

@MichaReiser (Member) commented on Oct 25, 2025:

It's not clear to me why you need the current_* methods over just calling next_token.

Adjusting the cursor location has the same problem as creating a new lexer: it doesn't account for the internal state.

@ShaharNaveh (Contributor, Author):

@MichaReiser After your explanation of:

> I don't think Lexer::new starting from a given offset is what you want in that case. The issue with constructing a new Lexer is that the Lexer tracks a lot of internal state (the number of open parentheses, the f-string nesting, ...) that you lose when you throw away the old lexer and create a new instance.

It seems like Lexer isn't exactly what I need. And if I intend to reparse the whole source after each new addition, then ruff_python_parser::Parser already gives me what I need.
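
A minimal sketch of that reparse-everything approach (using parse_unchecked so tokens are still produced for source that doesn't fully parse yet; error handling elided):

```rust
use ruff_python_parser::{parse_unchecked, Mode, ParseOptions};
use ruff_text_size::{Ranged, TextSize};

/// After each readline(), reparse the whole accumulated buffer and print
/// only the tokens that start at or beyond the offset already emitted.
fn print_new_tokens(source: &str, resume: TextSize) {
    let parsed = parse_unchecked(source, ParseOptions::from(Mode::Module));
    for token in parsed.tokens().iter() {
        if token.range().start() >= resume {
            println!("{:?} @ {:?}", token.kind(), token.range());
        }
    }
}
```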

tysm for the replies and explanations!

@ShaharNaveh deleted the expose-lexer-state branch on October 26, 2025 at 10:09.