whisper-cli : align token timestamps with VAD ts #3218
b23c671 to 75db936
| Has this issue been resolved? It doesn't seem to have been merged into the main branch. Or has it already been fixed in the branch (vad-token-timestamp-alignment), so that I can use that instead? |
| No, it has not been resolved yet. I changed it to a draft (which might have sent a notification) because I noticed the token-level timestamps are still not correct and I need to revisit this. |
75db936 to 12e44a1
    | @chriswang- It would be great if you could try this out with the audio sample in your original issue report. | 
| @danbev Sorry, the issue was not submitted by me, but I can try to verify it. |
| @chriswang- Ah, my bad. I should have checked to be sure instead of just assuming. |
| subtitle-master-with-vad.json @danbev | 
12e44a1 to c5e33f4
This commit aligns the token timestamps with the VAD timestamps when VAD is enabled.
The motivation is that the token timestamps currently reported in the full JSON output are the timestamps that whisper sees after VAD has processed the audio. Whisper only sees the (possibly filtered) audio, so the token timestamps are relative to the filtered audio, not the original audio. The segment timestamps are already mapped back to the original timestamps, but this is not currently done for the token timestamps, which is what this commit aims to address.
Resolves: #3174
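
To illustrate the idea of mapping a timestamp from the filtered audio back to the original audio, here is a minimal Python sketch. It is not the implementation in this PR. The `SpeechSpan` type and the `to_original_time` function are hypothetical names, and the sketch assumes that each speech span kept by VAD has the same duration on both timelines, that spans are sorted by their start on the filtered timeline, and that all times are integers in units of 10 ms.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class SpeechSpan:
    """One speech region kept by VAD (hypothetical type, times in 10 ms units)."""
    orig_start: int  # start in the original audio
    orig_end: int    # end in the original audio
    vad_start: int   # start in the filtered audio that whisper sees
    vad_end: int     # end in the filtered audio that whisper sees


def to_original_time(t: int, spans: Sequence[SpeechSpan]) -> int:
    """Map a timestamp on the filtered timeline back to the original timeline."""
    if not spans:
        return t  # no VAD mapping available, nothing to adjust

    # Pick the last span that starts at or before t (spans sorted by vad_start).
    starts = [s.vad_start for s in spans]
    i = max(bisect_right(starts, t) - 1, 0)
    span = spans[i]

    # Offset of t inside the span, clamped to the span's length so that
    # timestamps outside every span snap to the nearest edge of the chosen span.
    offset = min(max(t - span.vad_start, 0), span.vad_end - span.vad_start)
    return span.orig_start + offset
```

For example, if the original audio contains speech at 1.00-3.00 s and 8.00-10.00 s, the filtered audio holds those spans at 0.00-2.00 s and 2.00-4.00 s. A token that whisper reports at 2.50 s falls 0.50 s into the second span, so under these assumptions it maps to 8.50 s in the original audio.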
Example of token level timestamps prior to this PR:
And with this PR: