argmaxinc / whisperkit
Swift native on-device speech recognition with Whisper for Apple Silicon
Home Page: https://takeargmax.com/blog/whisperkit
License: MIT License
error message
System Information
macOS Version 14.3.1 (Build 23D60)
Xcode 15.2 (22503) (Build 15C500b)
Timestamp: 2024-03-09T21:07:42+08:00
It would be great if the first text token's logprob could be used to discard a transcription draft as failed and start over. Starting over could mean either falling back to higher-temperature sampling or updating the audio buffer for streaming use cases.
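For illustration, the caller-side flow I have in mind might look like this sketch. firstTokenLogProb and the retry policy are hypothetical and not part of the current API; the DecodingOptions/transcribe usage is just my best understanding of the package.
import WhisperKit

// Hedged sketch: `firstTokenLogProb` is a hypothetical field on the result, and the
// threshold/temperature schedule below is just an example policy.
func transcribeWithEarlyReject(_ pipe: WhisperKit, audioPath: String) async throws -> String? {
    let logProbThreshold: Float = -1.5
    var temperature: Float = 0.0
    for _ in 0..<3 {
        let options = DecodingOptions(temperature: temperature)
        let result = try await pipe.transcribe(audioPath: audioPath, decodeOptions: options)
        // If the very first text token is already low confidence, discard this draft
        // and retry with a higher sampling temperature instead of finishing the window.
        if let firstLogProb = result?.firstTokenLogProb, firstLogProb < logProbThreshold {
            temperature += 0.2
            continue
        }
        return result?.text
    }
    return nil
}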
The example is unable to run on iPhone 11 Pro.
(The example runs fine on a Mac M1 Max.)
The following is a screenshot on iPhone 11 Pro with the Base model.
debug log:
[WhisperKit] --------------- DECODER INPUTS DEBUG ---------------
[WhisperKit] Cache Length: 2 Input Token: 50359
[WhisperKit] Key Cache | Val Cache | Update Mask | Decoder Mask | Position
[WhisperKit] -0.125732 | 0.048828 | 0 | 0 | 0
[WhisperKit] 0.308350 | -0.556641 | 0 | 0 | 1
[WhisperKit] 0.000000 | 0.000000 | 1 | 0 | 2
[WhisperKit] 0.000000 | 0.000000 | 0 | -10000 | 3
[WhisperKit] [0.00 --> 14.90]
[WhisperKit] ---- Transcription Timings ----
[WhisperKit] Audio Load: 0.00 ms / 1 runs ( 0.00 ms/run) 0.00%
[WhisperKit] Audio Processing: 0.41 ms / 1 runs ( 0.41 ms/run) 0.03%
[WhisperKit] Mels: 57.57 ms / 1 runs ( 57.57 ms/run) 3.96%
[WhisperKit] Encoding: 1171.59 ms / 1 runs ( 1171.59 ms/run) 80.56%
[WhisperKit] Matrices Init: 5.36 ms / 1 runs ( 5.36 ms/run) 0.37%
[WhisperKit] Prefill: 0.49 ms / 1 runs ( 0.49 ms/run) 0.03%
[WhisperKit] Decoding: 208.06 ms / 4 runs ( 52.01 ms/run) 14.31%
[WhisperKit] Non-inference: 7.49 ms / 4 runs ( 1.87 ms/run) 0.52%
[WhisperKit] - Sampling: 4.13 ms / 4 runs ( 1.03 ms/run) 0.28%
[WhisperKit] - Kv Caching: 3.91 ms / 4 runs ( 0.98 ms/run) 0.27%
[WhisperKit] - Windowing: 0.08 ms / 1 runs ( 0.08 ms/run) 0.01%
[WhisperKit] Fallbacks: 122.98 ms / 0 runs ( 0.00 ms/run) 8.46%
[WhisperKit] Decoding Full Loop: 1448.16 ms / 4 runs ( 362.04 ms/run) 99.57%
[WhisperKit] -------------------------------
[WhisperKit] Model Load Time: 6.60 seconds
[WhisperKit] Inference Duration: 1.45 seconds
[WhisperKit] - Decoding Loop: 1.45 seconds
[WhisperKit] Time to first token: 1.30 seconds
[WhisperKit] Total Tokens: 5
[WhisperKit] Tokens per Second: 2.76 tok/s
[WhisperKit] Real Time Factor: 0.10
[WhisperKit] Fallbacks: 0.0
[WhisperKit] [0.00 --> 14.90] <|endoftext|>
Segment level timestamps look good, great work guys.
Are token level timestamps currently supported somehow, or on the roadmap?
Hi folks, just wanted to check in and ask what would be entailed in adding support for older macOS versions, such as 13.0?
Hey! Thanks for making WhisperKit!
I hope I did not miss it in the documentation, but is it possible to provide a local URL to the model for WhisperKit instead of relying on its internal mechanism to load the model? Inside my app I already have a nice UI that allows downloading, suspending, and cancelling downloads, so it would be nice if I could then feed WhisperKit the local URL.
If there is no such functionality but you are considering adding it - I might try to help by making a PR.
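In case it helps frame the request, this is roughly the call shape I'm hoping for. The modelFolder:/download: parameter names are what I gathered from other issues, so treat this as a sketch rather than the confirmed API:
import Foundation
import WhisperKit

// Sketch: point WhisperKit at a folder my own downloader already populated,
// instead of letting it fetch the model from Hugging Face. Paths are examples.
let documentsDirectory = FileManager.default.urls(for: .documentDirectory, in: .userDomainMask)[0]
let localModelPath = documentsDirectory.appendingPathComponent("whisperkit-coreml/openai_whisper-base").path
let pipe = try await WhisperKit(modelFolder: localModelPath, download: false)
let result = try await pipe.transcribe(audioPath: audioFileURL.path)   // audioFileURL: some local recording
print(result?.text ?? "")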
Thanks.
Language detection here should be fairly simple with logits filters now; it will entail a single decoder pass that samples just the language tokens. However, this cannot be used when we are using a prefill prompt (i.e. forced decoder tokens), so that will need special handling.
Openai implementation: https://github.com/openai/whisper/blob/ba3f3cd54b0e5b8ce1ab3de13e32122d0d5f98ab/whisper/decoding.py#L19
WhisperKit inline todo:
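For illustration, a first cut at such a filter might look like the sketch below. The LogitsFiltering protocol name and the filterLogits signature are my recollection of the codebase and may not match exactly, so verify against main:
import CoreML

// Sketch: suppress every logit except the language tokens on the pass right after
// <|startoftranscript|>, so a single decoder step yields the detected language.
struct LanguageLogitsFilter: LogitsFiltering {
    let languageTokenRange: ClosedRange<Int>   // ids of <|en|>...<|su|> from the tokenizer
    let startOfTranscriptToken: Int

    func filterLogits(_ logits: MLMultiArray, withTokens tokens: [Int]) -> MLMultiArray {
        // Only constrain the token sampled immediately after <|startoftranscript|>.
        guard tokens.last == startOfTranscriptToken else { return logits }
        for index in 0..<logits.count where !languageTokenRange.contains(index) {
            logits[index] = NSNumber(value: -Float.infinity)
        }
        return logits
    }
}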
It would be worth adding support for React Native apps using Native Modules and exposing Swift APIs to JS.
Hey,
Apple just dropped MLX-Swift, a cross-platform (currently only iOS/macOS) MLX framework. Are there any plans to support it?
Thanks!
trash, used in the following make rule, is not part of a default macOS setup:
clean-package-caches:
	@trash ~/Library/Caches/org.swift.swiftpm/repositories
	@trash ~/Library/Developer/Xcode/DerivedData
I see three options to address this:
1. Use rm instead, and delete immediately.
2. Use mv, and move the files to the user's trash in ~/.Trash (this only works properly if the files are on the local disk; for external hard drives the trashes are at /Volumes/NAME_OF_EXTERNAL/.Trashes/USER_ID/, and to handle these cases it is probably better to go with option 3).
3. Install trash using Homebrew in the setup rule.
I'm trying the demo app on a MacBook Pro with Apple M1 Pro and 16 GB memory. The large-v3_turbo_1049MB model has been specializing for more than 30 minutes, but aned is still running and using a whole performance core. Have you guys tested the loading time on different devices?
Sadly, I am having trouble being able to develop and run with this.
I am running an AMD CPU Windows 11 PC. I am using VMware to get macOS; however, I am not able to run any macOS version after 12 because the AMD CPU does not support it. This in turn means that I cannot run the later versions of Xcode that support Swift 5.9.
Would you ever consider backporting some of this functionality to previous versions of Swift?
Currently, when using the default WhisperKit flow of auto-downloading models on transcribe, an internet connection is required even if the models have already been downloaded in the past, because swift-transformers fetches the filenames here.
This is a bit limiting: e.g. @pveugen was on a train with poor internet and couldn't transcribe audio even after downloading the model in the past (after #80 it would throw an error instead of crashing). I think we could get around this by manually downloading and specifying the path in setupModels via modelFolder:, but it would be nice if there was a way to avoid this HTTP GET by default.
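As a stopgap on the app side, something like this sketch could avoid the filename fetch when the model is already on disk. The folder layout and the modelFolder:/download: parameters are assumptions on my part:
import Foundation
import WhisperKit

// Sketch: reuse a previously downloaded model folder when it exists, otherwise fall
// back to the normal auto-download flow. Paths and parameter names are assumptions.
func makeWhisperKit(modelName: String) async throws -> WhisperKit {
    let localFolder = FileManager.default.urls(for: .documentDirectory, in: .userDomainMask)[0]
        .appendingPathComponent("whisperkit-coreml/\(modelName)")
    if FileManager.default.fileExists(atPath: localFolder.path) {
        // No network needed: point WhisperKit straight at the local folder.
        return try await WhisperKit(modelFolder: localFolder.path, download: false)
    }
    // Online path: let WhisperKit resolve and download the model as usual.
    return try await WhisperKit(model: modelName)
}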
This can be done with logit filters on the first loop, similar to detecting language. However, this cannot be used when we are using a prefill prompt (i.e. forced decoder tokens), so that will need special handling. Ideally, there'd be an option to ignore the prefill prompt for the first decoder loop to detect no speech; this costs one extra loop but may allow skipping the entire window if developers are expecting long stretches of silence in their input audio.
Openai implementation: https://github.com/openai/whisper/blob/ba3f3cd54b0e5b8ce1ab3de13e32122d0d5f98ab/whisper/decoding.py#L692-L693
WhisperKit inline todo:
WhisperKit/Sources/WhisperKit/Core/WhisperKit.swift
Lines 612 to 616 in 228630c
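Roughly, the first-pass check could look like the sketch below; the no-speech token id, the threshold value, and how the first-pass logits get surfaced are all assumptions rather than the current API:
import Foundation

// Sketch: after the first decoder pass over a window, compare the probability of the
// <|nospeech|> token against a threshold and skip the window if it dominates.
func shouldSkipWindow(firstPassLogits: [Float], noSpeechToken: Int,
                      noSpeechThreshold: Float = 0.6) -> Bool {
    // Softmax over the raw logits to get a probability for the no-speech token.
    let maxLogit = firstPassLogits.max() ?? 0
    let exps = firstPassLogits.map { exp($0 - maxLogit) }
    let noSpeechProb = exps[noSpeechToken] / exps.reduce(0, +)
    return noSpeechProb > noSpeechThreshold
}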
From what I understood, there are some limitations and degradations to the model quality, but it would still be nice to be able to support users on Ventura (and iOS 16).
It would be great if certain patterns in the newly added word timestamps (#38) could be leveraged to reduce the incidence rate of hallucinations. This change will require comprehensive re-evaluation of the models, since accurate words could also end up with zero length due to inaccurate word timestamps.
Timestamp rules are helpful to more consistently find reliable timestamps during decoding.
Important note: We have already brought over some of this logic into the SegmentSeeker, which runs at the end of a full decode loop to generate the segments. This feature will need to detangle any repeated logic between them.
Openai implementation: https://github.com/openai/whisper/blob/ba3f3cd54b0e5b8ce1ab3de13e32122d0d5f98ab/whisper/decoding.py#L441-L505
Hey folks! I'm trying to use the CLI, but it fails to load models:
Building for debugging...
Build complete! (0.07s)
Error: Unable to load model: file:///Users/usmanm/whisperkit/Models/whisperkit-coreml/openai_whisper-tiny/MelSpectrogram.mlmodelc/. Compile the model with Xcode or `MLModel.compileModel(at:)`.
The setup instructions seemed to have worked correctly:
➜ whisperkit git:(main) make setup
Setting up environment...
/opt/homebrew/bin/pip3
/opt/homebrew/bin/python3
Requirement already satisfied: huggingface_hub in /opt/homebrew/lib/python3.11/site-packages (0.20.3)
Requirement already satisfied: filelock in /opt/homebrew/lib/python3.11/site-packages (from huggingface_hub) (3.13.1)
Requirement already satisfied: fsspec>=2023.5.0 in /opt/homebrew/lib/python3.11/site-packages (from huggingface_hub) (2023.10.0)
Requirement already satisfied: requests in /opt/homebrew/lib/python3.11/site-packages (from huggingface_hub) (2.31.0)
Requirement already satisfied: tqdm>=4.42.1 in /opt/homebrew/lib/python3.11/site-packages (from huggingface_hub) (4.66.1)
Requirement already satisfied: pyyaml>=5.1 in /opt/homebrew/lib/python3.11/site-packages (from huggingface_hub) (6.0.1)
Requirement already satisfied: typing-extensions>=3.7.4.3 in /opt/homebrew/lib/python3.11/site-packages (from huggingface_hub) (4.8.0)
Requirement already satisfied: packaging>=20.9 in /opt/homebrew/lib/python3.11/site-packages (from huggingface_hub) (23.2)
Requirement already satisfied: charset-normalizer<4,>=2 in /opt/homebrew/lib/python3.11/site-packages (from requests->huggingface_hub) (3.3.2)
Requirement already satisfied: idna<4,>=2.5 in /opt/homebrew/lib/python3.11/site-packages (from requests->huggingface_hub) (3.4)
Requirement already satisfied: urllib3<3,>=1.21.1 in /opt/homebrew/lib/python3.11/site-packages (from requests->huggingface_hub) (2.0.7)
Requirement already satisfied: certifi>=2017.4.17 in /opt/homebrew/lib/python3.11/site-packages (from requests->huggingface_hub) (2023.7.22)
usmanm
Already logged in to Hugging Face.
➜ whisperkit git:(main) make download-models
Downloading compressed models...
Repository exists, pulling latest changes...
HEAD is now at 07ea546 Create config.json
Like OpenAI's Whisper, is it possible to pass a text prompt which could be used to improve the quality of the transcript?
Laurent
Beam search on CoreML will require some model changes to work according to the reference implementation. This is mainly due to CoreML static shapes requiring a new model for each possible beam_size. We have some plans to deal with this, so we will keep this issue here for tracking purposes.
Openai implementation: https://github.com/openai/whisper/blob/ba3f3cd54b0e5b8ce1ab3de13e32122d0d5f98ab/whisper/decoding.py#L301-L404
import SwiftUI
import WhisperKit

@main
struct WhisperKitApp: App {
    init() {
        Task {
            do {
                let pipe = try? await WhisperKit()
                let transcription = try? await pipe!.transcribe(audioPath: "Audio/output-lang.wav")?.text
                print(transcription)
            } catch {
                print("Error: \(error)")
            }
        }
    }

    var body: some Scene {
        WindowGroup {
            ContentView()
        }
    }
}
Then I get an error: Cannot call value of non-function type 'module<WhisperKit>'
What should I do to solve this problem? Thanks.
When filtering out special tokens in addWordTimestamps, word timings that contain a timing token followed by a hyphen aren't filtered out correctly. WordTiming.tokens correctly contains just [532], but WordTiming.word is "<|0.00|> -". This seems to occur most when multiple people are talking over each other in a recording; I guess it's Whisper's way of trying to label speakers.
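For reference, the kind of fix I had in mind is sketched below: rebuild the word text from the non-special tokens only. The specialTokenBegin cutoff and the decode closure are assumptions about the tokenizer API, not the real interface:
// Sketch: drop timestamp/special tokens before building WordTiming.word, so
// "<|0.00|> -" becomes just " -". Names here are assumptions, not the real API.
func cleanWordText(tokens: [Int], specialTokenBegin: Int, decode: ([Int]) -> String) -> String {
    let textTokens = tokens.filter { $0 < specialTokenBegin }
    return decode(textTokens)
}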
The advantage of this project is that it uses CoreML for a performance gain, so publishing benchmarks would solidify how large that advantage actually is.
How to fix this issue?
Task <60D6EF47-1009-4EFE-9E1B-5988A7FD6E4F>.<1> HTTP load failed, 0/0 bytes (error code: -1200 [3:-9816])
Task <60D6EF47-1009-4EFE-9E1B-5988A7FD6E4F>.<1> finished with error [-1200] Error Domain=NSURLErrorDomain Code=-1200 "An SSL error has occurred and a secure connection to the server cannot be made." UserInfo={NSErrorFailingURLStringKey=https://cdn-lfs-us-1.huggingface.co/repos/8f/fc/8ffc19694b8dfd29ebaafed41040596f15c2a6ee94d3e9f8a0bf0f1523bade3c/6ac1227740ecc2fd7a03df50ac6e2a7f7946acfa77069cf2c486ae0255356b95?response-content-disposition=attachment%3B+filename*%3DUTF-8%27%27coremldata.bin%3B+filename%3D%22coremldata.bin%22%3B&response-content-type=application%2Foctet-stream&Expires=1710121327&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTcxMDEyMTMyN319LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2RuLWxmcy11cy0xLmh1Z2dpbmdmYWNlLmNvL3JlcG9zLzhmL2ZjLzhmZmMxOTY5NGI4ZGZkMjllYmFhZmVkNDEwNDA1OTZmMTVjMmE2ZWU5NGQzZTlmOGEwYmYwZjE1MjNiYWRlM2MvNmFjMTIyNzc0MGVjYzJmZDdhMDNkZjUwYWM2ZTJhN2Y3OTQ2YWNmYTc3MDY5Y2YyYzQ4NmFlMDI1NTM1NmI5NT9yZXNwb25zZS1jb250ZW50LWRpc3Bvc2l0aW9uPSomcmVzcG9uc2UtY29udGVudC10eXBlPSoifV19&Signature=TR-WYiW9gDlLkJIYv-2TaU4UYLNidoOb9oE-OXvBkpsmBHYZ7%7ElhzAoGKa7aqBYGcUnDmmJG0HTJXVyz-6dYbX%7E6vlU8j3x83mJfi2DEPRKzW1RB0tjRx4HMOpuP1G5FMr9CWBvS8M-icXoz-Beyu%7EmyDcLzKISUPV-RFlw1Jm72PiLb5MvCpdw2cdlDfFYUbmzYYIyWsUZsK5YuB6R187AXqM00lIy05xzIOhmuwJzL1XSMzu5-D2WxnNfkBDP4NUiX6OtYhZgJVA9I2ELqmHhOs4qX6HNAXOkxz6KtnuWEpO3N8%7E-yZ%7EPPeNcOudyuAMKw1m2qp0L8JuUxhqCP8Q__&Key-Pair-Id=KCD77M1F0VK2B, NSLocalizedRecoverySuggestion=Would you like to connect to the server anyway?, _kCFStreamErrorDomainKey=3, _NSURLErrorFailingURLSessionTaskErrorKey=LocalDownloadTask <60D6EF47-1009-4EFE-9E1B-5988A7FD6E4F>.<1>, _NSURLErrorRelatedURLSessionTaskErrorKey=(
"LocalDownloadTask <60D6EF47-1009-4EFE-9E1B-5988A7FD6E4F>.<1>"
), NSLocalizedDescription=An SSL error has occurred and a secure connection to the server cannot be made., NSErrorFailingURLKey=https://cdn-lfs-us-1.huggingface.co/repos/8f/fc/8ffc19694b8dfd29ebaafed41040596f15c2a6ee94d3e9f8a0bf0f1523bade3c/6ac1227740ecc2fd7a03df50ac6e2a7f7946acfa77069cf2c486ae0255356b95?response-content-disposition=attachment%3B+filename*%3DUTF-8%27%27coremldata.bin%3B+filename%3D%22coremldata.bin%22%3B&response-content-type=application%2Foctet-stream&Expires=1710121327&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTcxMDEyMTMyN319LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2RuLWxmcy11cy0xLmh1Z2dpbmdmYWNlLmNvL3JlcG9zLzhmL2ZjLzhmZmMxOTY5NGI4ZGZkMjllYmFhZmVkNDEwNDA1OTZmMTVjMmE2ZWU5NGQzZTlmOGEwYmYwZjE1MjNiYWRlM2MvNmFjMTIyNzc0MGVjYzJmZDdhMDNkZjUwYWM2ZTJhN2Y3OTQ2YWNmYTc3MDY5Y2YyYzQ4NmFlMDI1NTM1NmI5NT9yZXNwb25zZS1jb250ZW50LWRpc3Bvc2l0aW9uPSomcmVzcG9uc2UtY29udGVudC10eXBlPSoifV19&Signature=TR-WYiW9gDlLkJIYv-2TaU4UYLNidoOb9oE-OXvBkpsmBHYZ7%7ElhzAoGKa7aqBYGcUnDmmJG0HTJXVyz-6dYbX%7E6vlU8j3x83mJfi2DEPRKzW1RB0tjRx4HMOpuP1G5FMr9CWBvS8M-icXoz-Beyu%7EmyDcLzKISUPV-RFlw1Jm72PiLb5MvCpdw2cdlDfFYUbmzYYIyWsUZsK5YuB6R187AXqM00lIy05xzIOhmuwJzL1XSMzu5-D2WxnNfkBDP4NUiX6OtYhZgJVA9I2ELqmHhOs4qX6HNAXOkxz6KtnuWEpO3N8%7E-yZ%7EPPeNcOudyuAMKw1m2qp0L8JuUxhqCP8Q__&Key-Pair-Id=KCD77M1F0VK2B, NSUnderlyingError=0x600000c7edf0 {Error Domain=kCFErrorDomainCFNetwork Code=-1200 "(null)" UserInfo={_kCFStreamPropertySSLClientCertificateState=0, _kCFNetworkCFStreamSSLErrorOriginalValue=-9816, _kCFStreamErrorDomainKey=3, _kCFStreamErrorCodeKey=-9816, _NSURLErrorNWPathKey=satisfied (Path is satisfied), interface: en0}}, _kCFStreamErrorCodeKey=-9816}
WhisperKit/WhisperKit.swift:194: Fatal error: Unexpectedly found nil while unwrapping an Optional value
It looks like the model is deleted when I use the FileManager removeItem(at:) method, but when I re-run the project the deleted model appears again.
FileManager.default.removeItem(at: URL.init(string: "file://" + "\(path)")!)
Thank you for your WORK!!!
I'm not a macOS developer, but a user. I want to know if it's possible to use the computer's audio output as the stream input, not just the microphone. The scenario is similar to simultaneous interpretation in meetings.
I look forward to your reply, thank you again!!!
When using the Swift CLI example and following the exact commands mentioned, I get the above error:
Chip: Apple M3 Max
OS: 14.3.1
Xcode: Version 15.2
Apple Swift version: 5.9.2
The app crashes after recording a few seconds of sound. It's being used on an iPhone 12 mini device that has been cold restarted, with Large-v2_1050MB.
The app “WhisperAX” has been killed by the operating system because it is using too much memory.
Domain: IDEDebugSessionErrorDomain
Code: 11
Recovery Suggestion: Use a memory profiling tool to track the process memory usage.
User Info: {
DVTErrorCreationDateKey = "2024-03-12 18:15:07 +0000";
IDERunOperationFailingWorker = DBGLLDBLauncher;
}
--
The app “WhisperAX” has been killed by the operating system because it is using too much memory.
Domain: IDEDebugSessionErrorDomain
Code: 11
Recovery Suggestion: Use a memory profiling tool to track the process memory usage.
User Info: {
IDERunOperationFailingWorker = DBGLLDBLauncher;
}
--
Event Metadata: com.apple.dt.IDERunOperationWorkerFinished : {
"device_isCoreDevice" = 1;
"device_model" = "iPhone13,1";
"device_osBuild" = "17.3.1 (21D61)";
"device_platform" = "com.apple.platform.iphoneos";
"dvt_coredevice_version" = "355.24";
"dvt_mobiledevice_version" = "1643.100.58";
"launchSession_schemeCommand" = Run;
"launchSession_state" = 2;
"launchSession_targetArch" = arm64;
"operation_duration_ms" = 968315;
"operation_errorCode" = 11;
"operation_errorDomain" = IDEDebugSessionErrorDomain;
"operation_errorWorker" = DBGLLDBLauncher;
"operation_name" = IDERunOperationWorkerGroup;
"param_debugger_attachToExtensions" = 0;
"param_debugger_attachToXPC" = 1;
"param_debugger_type" = 3;
"param_destination_isProxy" = 0;
"param_destination_platform" = "com.apple.platform.iphoneos";
"param_diag_MainThreadChecker_stopOnIssue" = 0;
"param_diag_MallocStackLogging_enableDuringAttach" = 0;
"param_diag_MallocStackLogging_enableForXPC" = 1;
"param_diag_allowLocationSimulation" = 1;
"param_diag_checker_tpc_enable" = 1;
"param_diag_gpu_frameCapture_enable" = 0;
"param_diag_gpu_shaderValidation_enable" = 0;
"param_diag_gpu_validation_enable" = 0;
"param_diag_memoryGraphOnResourceException" = 0;
"param_diag_queueDebugging_enable" = 1;
"param_diag_runtimeProfile_generate" = 0;
"param_diag_sanitizer_asan_enable" = 0;
"param_diag_sanitizer_tsan_enable" = 0;
"param_diag_sanitizer_tsan_stopOnIssue" = 0;
"param_diag_sanitizer_ubsan_stopOnIssue" = 0;
"param_diag_showNonLocalizedStrings" = 0;
"param_diag_viewDebugging_enabled" = 1;
"param_diag_viewDebugging_insertDylibOnLaunch" = 1;
"param_install_style" = 2;
"param_launcher_UID" = 2;
"param_launcher_allowDeviceSensorReplayData" = 0;
"param_launcher_kind" = 0;
"param_launcher_style" = 99;
"param_launcher_substyle" = 8192;
"param_runnable_appExtensionHostRunMode" = 0;
"param_runnable_productType" = "com.apple.product-type.application";
"param_structuredConsoleMode" = 1;
"param_testing_launchedForTesting" = 0;
"param_testing_suppressSimulatorApp" = 0;
"param_testing_usingCLI" = 0;
"sdk_canonicalName" = "iphoneos17.4";
"sdk_osVersion" = "17.4";
"sdk_variant" = iphoneos;
}
--
System Information
macOS Version 14.2.1 (Build 23C71)
Xcode 15.3 (22618) (Build 15E204a)
Timestamp: 2024-03-12T11:15:07-07:00
The goal is to leverage the high-quality word-level timestamps added in #38 as anchors to reliably seek the audio buffer forward at a higher frequency, compared to the current behavior of waiting until <|endoftext|> is generated or max_tokens tokens are generated.
The CLI executable should be able to stream directly from the microphone, similar to the WhisperAX example app. This enables use cases outside of an Xcode project.
WhisperAX streaming code:
WhisperKit/Examples/WhisperAX/WhisperAX/Views/ContentView.swift
Lines 997 to 1127 in 228630c
Needed for benchmarking the streaming functionality, as well as generally testing its accuracy and performance. A simple loop can be made to read a file in incremental n-second chunks, where the audio length increases by n seconds each loop, and the transcription is appended as the audio size increases.
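A simulated-stream harness along these lines might be enough. AudioProcessor.loadAudio/convertBufferToArray and transcribe(audioArray:) are how I understand the current API, so double-check the exact signatures and optionality:
import Foundation
import WhisperKit

// Sketch: feed the transcriber a growing n-second prefix of a file, emulating audio
// arriving over time. API names are my best understanding; verify against main.
func simulateStreaming(path: String, n: Int = 2) async throws {
    let sampleRate = 16_000                       // Whisper's expected sample rate
    let pipe = try await WhisperKit(model: "tiny")
    guard let buffer = AudioProcessor.loadAudio(fromPath: path) else { return }
    let fullAudio = AudioProcessor.convertBufferToArray(buffer: buffer)

    var transcript = ""
    var chunkEnd = n * sampleRate
    while chunkEnd <= fullAudio.count {
        // Transcribe a growing prefix of the file, as if audio were still arriving.
        let window = Array(fullAudio[0..<chunkEnd])
        let result = try await pipe.transcribe(audioArray: window)
        transcript = result?.text ?? transcript   // keep the latest hypothesis
        chunkEnd += n * sampleRate
    }
    print(transcript)
}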
Hey there!
First off, thanks so much for building this awesome library! It's a total pleasure to use and works great. Looking forward to the Metal update. In the meantime, I was curious if you all would accept a PR to allow AVCaptureSession to be used in the AudioProcessor class instead of AVAudioEngine.
I was thinking of creating a way to pass in a new setupEngine function that allowed for the captureOutput delegate to be used in place of the installTap function. The reason I want to do this is it makes it easier to change the microphone in app instead of relying on the system default.
- Is this something you would want in AudioProcessor? If so, I'm happy to come up with a clean interface proposal.
- Or should I subclass the AudioProcessor class and provide an alternate setupEngine function?
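To make the proposal concrete, here is a rough sketch of the capture side I have in mind; it is plain AVFoundation and not tied to the current AudioProcessor internals:
import AVFoundation

// Sketch: capture microphone audio via AVCaptureSession so the input device can be
// switched in-app, delivering sample buffers where installTap's callback is used today.
final class CaptureAudioSource: NSObject, AVCaptureAudioDataOutputSampleBufferDelegate {
    let session = AVCaptureSession()
    var onSampleBuffer: ((CMSampleBuffer) -> Void)?

    func start(device: AVCaptureDevice) throws {
        session.beginConfiguration()
        let input = try AVCaptureDeviceInput(device: device)
        if session.canAddInput(input) { session.addInput(input) }
        let output = AVCaptureAudioDataOutput()
        output.setSampleBufferDelegate(self, queue: DispatchQueue(label: "audio.capture"))
        if session.canAddOutput(output) { session.addOutput(output) }
        session.commitConfiguration()
        session.startRunning()
    }

    func captureOutput(_ output: AVCaptureOutput,
                       didOutput sampleBuffer: CMSampleBuffer,
                       from connection: AVCaptureConnection) {
        // Hand the buffer to whatever replaces installTap, e.g. conversion to [Float].
        onSampleBuffer?(sampleBuffer)
    }
}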
Seems like I cannot resolve the packages correctly with 0.2.0:
swift package update 🌱 main 📝 ×3 via 🐦 v5.9.2
Updating https://github.com/apple/swift-argument-parser
Updating https://github.com/argmaxinc/whisperkit
Updated https://github.com/argmaxinc/whisperkit (0.43s)
Updated https://github.com/apple/swift-argument-parser (0.43s)
Computing version for https://github.com/argmaxinc/whisperkit
error: Dependencies could not be resolved because root depends on 'whisperkit' 0.2.0..<1.0.0.
'whisperkit' >= 0.2.0 cannot be used because no versions of 'whisperkit' match the requirement 0.2.1..<1.0.0 and package 'whisperkit' is required using a stable-version but 'whisperkit' depends on an unstable-version package 'swift-transformers'.
The doc on SPM dependencies says:
packages which use commit-based dependency requirements can't be added as dependencies to packages that use version-based dependency requirements
Does it have a duration limit? I remember that Whisper limits the input file to 30 seconds, but when I tested it on macOS, the app could handle much longer duration audio files. Do you have to chunk the audio files before transcription?
Hi,
Is speaker diarization planned (espec. in realtime)?
Thx!
Models are taking a really long time to download. If the WiFi is off for a second, the download gets stuck. It would be great if the example had two more options: Delete and Retry.
Occasionally I'm seeing an index out of range crash on segmentLogProbs[index] after a long period of silence. https://github.com/argmaxinc/WhisperKit/blob/main/Sources/WhisperKit/Core/TextDecoder.swift#L518-L521
Swift/ContiguousArrayBuffer.swift:600: Fatal error: Index out of range
Two ways I could see guarding against this:
1. Zip the two arrays so iteration is limited to the segmentLogProbs count:
for (token, logProb) in zip(segmentTokens, segmentLogProbs) {
    tokenProbs.append([token: logProb])
}
2. Guard the index against the segmentLogProbs count:
for (index, token) in segmentTokens.enumerated() {
    if index < segmentLogProbs.count {
        tokenProbs.append([token: segmentLogProbs[index]])
    }
}
Happy to PR either one but unsure if I'm missing a reason for this being as is.
need it for real-time stuff like: https://arxiv.org/pdf/2307.14743.pdf
p.s. great project!
It would be great to start collecting reproducible performance benchmarks for supported hardware (e.g. A14+ and M1+). This should be a self-contained function that uses openai/whisper-base by default and optionally other versions that the benchmark submitter selects. Benchmarks should run on a standard set of audio files (jfk.wav for short-form, ted_60.wav and a sample clip from earnings22 for long-form transcriptions) and reports should be in a digestible and shareable format. Pseudo-code may look like the sketch below.
Open ASR leaderboard benchmarks: https://github.com/huggingface/open_asr_leaderboard
Nice script for collecting environment info: https://github.com/pytorch/pytorch/blob/main/torch/utils/collect_env.py
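Something like this (very rough, with result fields such as segments/tokens assumed from memory and wall-clock timing in place of the internal timings) is what I mean by the pseudo-code above:
import Foundation
import WhisperKit

// Rough pseudo-code for a self-contained benchmark run. Model names, audio files,
// and the report fields are placeholders/assumptions, not a final API.
struct BenchmarkReport: Codable {
    let device: String
    let model: String
    let audioFile: String
    let transcribeSeconds: Double
    let tokensPerSecond: Double
}

func runBenchmarks(models: [String] = ["openai_whisper-base"],
                   audioFiles: [String] = ["jfk.wav", "ted_60.wav"]) async throws -> [BenchmarkReport] {
    var reports: [BenchmarkReport] = []
    for model in models {
        let pipe = try await WhisperKit(model: model)
        for file in audioFiles {
            let start = Date()
            let result = try await pipe.transcribe(audioPath: file)
            let elapsed = Date().timeIntervalSince(start)
            // `segments`/`tokens` fields are assumed; map to whatever the result actually exposes.
            let tokenCount = result?.segments.flatMap { $0.tokens }.count ?? 0
            reports.append(BenchmarkReport(device: ProcessInfo.processInfo.hostName,
                                           model: model,
                                           audioFile: file,
                                           transcribeSeconds: elapsed,
                                           tokensPerSecond: Double(tokenCount) / max(elapsed, 0.001)))
        }
    }
    return reports
}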
It would be great if brew install whisperkit just works and the WhisperKit CLI target on macOS could become an out-of-the-box real-time transcription utility.
Hey guys! This looks great, unfortunately I'm having issues loading the models (both in my own code and the sample app)
I'm running this on an M1 Macbook Pro.
Many of the models don't load at all, even when given enough time (the progress bar usually gets stuck around specialization)
I've also tried downloading the models and using them manually, but I'm having trouble loading them that way as well.
Failed to read model package at file:///Users/puravmanot/Developer/Projects/WhisperTesting/WhisperTesting/whisper_large_v3_turbo. Error: A valid manifest does not exist at path: /Users/puravmanot/Developer/Projects/WhisperTesting/WhisperTesting/whisper_large_v3_turbo/Manifest.json
It also gets stuck sometimes while loading a pre-downloaded model
[WhisperKit] Loading models...
WhisperKit/Sources/WhisperKit/Core/AudioProcessor.swift
Lines 197 to 217 in fed90c7
Creating an AVAudioPCMBuffer for the whole input audio can easily surpass iOS memory limits.
Attempting to transcribe a 44100 Hz, 2-channel, ~1 hr long video crashes on iOS due to running out of memory. It would be nice if, instead of reading all the input audio into a buffer at once and converting it, the audio was read and converted in chunks to reduce memory usage.
Another less common issue that would be solved by chunking the audio is that AVAudioPCMBuffer has a max size of UInt32.max, which can be hit when transcribing a 1-2 hr, 16-channel, 44100 Hz audio file. This is a fairly typical audio file for a podcast recorded with a RODECaster Pro.
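A chunked read along these lines (plain AVFoundation, sketch only; the chunk size and 16 kHz mono target format are example choices) would keep peak memory bounded regardless of file length:
import AVFoundation

// Sketch: read and convert the source file in fixed-size chunks instead of one giant
// AVAudioPCMBuffer, appending 16 kHz mono Float samples as we go.
func loadAudioChunked(url: URL, chunkFrames: AVAudioFrameCount = 1_323_000 /* ~30 s at 44.1 kHz */) throws -> [Float] {
    let file = try AVAudioFile(forReading: url)
    let sourceFormat = file.processingFormat
    let targetFormat = AVAudioFormat(commonFormat: .pcmFormatFloat32, sampleRate: 16_000,
                                     channels: 1, interleaved: false)!
    let converter = AVAudioConverter(from: sourceFormat, to: targetFormat)!
    var samples: [Float] = []

    while file.framePosition < file.length {
        let inBuffer = AVAudioPCMBuffer(pcmFormat: sourceFormat, frameCapacity: chunkFrames)!
        try file.read(into: inBuffer, frameCount: chunkFrames)
        let outCapacity = AVAudioFrameCount(Double(inBuffer.frameLength) * 16_000 / sourceFormat.sampleRate) + 1
        let outBuffer = AVAudioPCMBuffer(pcmFormat: targetFormat, frameCapacity: outCapacity)!
        var fed = false
        _ = converter.convert(to: outBuffer, error: nil) { _, status in
            // Feed the current chunk exactly once per convert call.
            if fed { status.pointee = .noDataNow; return nil }
            fed = true
            status.pointee = .haveData
            return inBuffer
        }
        if let data = outBuffer.floatChannelData {
            samples.append(contentsOf: UnsafeBufferPointer(start: data[0], count: Int(outBuffer.frameLength)))
        }
    }
    return samples
}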
Hey guys, great work on the project and nice job with the word level timestamps recently!
Noticed that the package says the minimum target was recently changed to iOS 16 per https://github.com/argmaxinc/WhisperKit/blob/main/Package.swift, which is great.
What would be involved in bringing this down to iOS 15?
would be great
I am getting this error when trying to start WhisperKit in any simulator. Can someone say what it could be and how to fix it?
*** Terminating app due to uncaught exception 'com.apple.coreaudio.avfaudio', reason: 'required condition is false: IsFormatSampleRateAndChannelCountValid(format)'
*** First throw call stack:
(
0 CoreFoundation 0x00000001804bceec exceptionPreprocess + 172
1 libobjc.A.dylib 0x0000000180087068 objc_exception_throw + 56
2 CoreFoundation 0x00000001804bcd90 +[NSException raise:format:] + 0
3 AVFAudio 0x00000001c7789130 Z19AVAE_RaiseExceptionP8NSStringz + 48
4 AVFAudio 0x00000001c77e0b84 ZN17AUGraphNodeBaseV318CreateRecordingTapEmjP13AVAudioFormatU13block_pointerFvP16AVAudioPCMBufferP11AVAudioTimeE + 712
5 AVFAudio 0x00000001c78504d4 -[AVAudioNode installTapOnBus:bufferSize:format:block:] + 1324
6 languagelearn 0x0000000102091988 $s10WhisperKit14AudioProcessorC11setupEngine13inputDeviceIDSo07AVAudioF0CSSSg_tKF + 852
7 languagelearn 0x0000000102090c4c $s10WhisperKit14AudioProcessorC18startRecordingLive13inputDeviceID8callbackySSSg_ySaySfGcSgtKF + 224
8 languagelearn 0x0000000102090b2c $s10WhisperKit14AudioProcessorCAA0C10ProcessingA2aDP18startRecordingLive13inputDeviceID8callbackySSSg_ySaySfGcSgtKFTW + 24
9 languagelearn 0x000000010206bc04 $s13languagelearn11ContentViewV14startRecordingyySbFyyYaYbcfU_TY1 + 372
10 languagelearn 0x0000000102077ea5 $s13languagelearn11ContentViewV14startRecordingyySbFyyYaYbcfU_TATQ0 + 1
11 languagelearn 0x0000000102085369 $sxIeghHr_xs5Error_pIegHrzo_s8SendableRzs5NeverORs_r0_lTRTQ0 + 1
12 languagelearn 0x00000001020873cd $sxIeghHr_xs5Error_pIegHrzo_s8SendableRzs5NeverORs_r0_lTRTATQ0 + 1
13 libswift_Concurrency.dylib 0x000000020bfbf621 _ZL23completeTaskWithClosurePN5swift12AsyncContextEPNS_10SwiftErrorE + 1
)
libc++abi: terminating due to uncaught exception of type NSException