argmaxinc / whisperkit Goto Github PK

View Code? Open in Web Editor NEW

2.3K 24.0 191.0 1.01 MB

Swift native on-device speech recognition with Whisper for Apple Silicon

Home Page: https://takeargmax.com/blog/whisperkit

License: MIT License

Makefile 0.77% Swift 99.23%

inference ios pretrained-models speech-recognition swift whisper transformers macos visionos watchos

whisperkit's People

Contributors

Stargazers

Watchers

Forkers

jordibruin mz0in thecatfix 0x454447415244 furkanercan aditya-chandrabhatla princerumi lengocgiang 1-ashraful-islam suryatmodulus kilo-loco vshapenko analogpvt leetesla sezeranojchrisostome archiegoodwin jonah-zhang rsandagon navezjt insprd sagar4tech rajendharmendra tzxdtc yiming992 richardsonjf jansystemic srikalyan usmanm f901107 charliechap3 jkrukowski l0caldadmin gongli1231 xufan081667 xiayu1028 wanglongzheng0313 chensu0329 laiyi1129 samsgates bmorphism ssahgal ziligy kowsheek wujian010382 alejandrosuarez haltiaai qqq-tech positioner ezefranca cnmeeia tomchapin xgithubzero rhinojosa sanbast khankindle aheze decentralizedbug orijintech chidiwilliams aigerimmmm scosman hmthanh entn-at azuer iandundas seb-sep realityspectrum vital121 rexterity nathan0411 alexisgaro927 shaneholloman jonathankilian colordiver tomoyukiorita a7mad-magdy77 garikai22 turbo-agi sa1i beer008 jadegeek chuwoo menghutongji iopa788 metropol xiangweizheng rexsu spreamhachoneprep babixzbabydeckfunk reejoy33stamaha murant7 saberieguidema choneprep22chikdot n-tacticusal certready3grimbel wisellhenryal gossipakselfforlife cheaphitacticusal dekkolo-r eltociear

whisperkit's Issues

Crash when starting whisperKit in MacOS

error message

Could not launch “WhisperAX”
Domain: IDELaunchErrorDomain
Code: 20
Recovery Suggestion: Runningboard has returned error 5. Please check the system logs for the underlying cause of the error.
User Info: {
DVTErrorCreationDateKey = "2024-03-09 13:07:42 +0000";
DVTRadarComponentKey = 968756;
IDERunOperationFailingWorker = IDELaunchServicesLauncher;
}

The operation couldn’t be completed. Launch failed.
Domain: RBSRequestErrorDomain
Code: 5
Failure Reason: Launch failed.

Launchd job spawn failed
Domain: NSPOSIXErrorDomain
Code: 162

Event Metadata: com.apple.dt.IDERunOperationWorkerFinished : {
"device_model" = "Mac15,6";
"device_osBuild" = "14.3.1 (23D60)";
"device_platform" = "com.apple.platform.macosx";
"dvt_coredevice_version" = "355.7.7";
"dvt_mobiledevice_version" = "1643.60.2";
"launchSession_schemeCommand" = Run;
"launchSession_state" = 1;
"launchSession_targetArch" = arm64;
"operation_duration_ms" = 22;
"operation_errorCode" = 20;
"operation_errorDomain" = IDELaunchErrorDomain;
"operation_errorWorker" = IDELaunchServicesLauncher;
"operation_name" = IDERunOperationWorkerGroup;
"param_debugger_attachToExtensions" = 0;
"param_debugger_attachToXPC" = 1;
"param_debugger_type" = 3;
"param_destination_isProxy" = 0;
"param_destination_platform" = "com.apple.platform.macosx";
"param_diag_MainThreadChecker_stopOnIssue" = 0;
"param_diag_MallocStackLogging_enableDuringAttach" = 0;
"param_diag_MallocStackLogging_enableForXPC" = 1;
"param_diag_allowLocationSimulation" = 1;
"param_diag_checker_tpc_enable" = 1;
"param_diag_gpu_frameCapture_enable" = 0;
"param_diag_gpu_shaderValidation_enable" = 0;
"param_diag_gpu_validation_enable" = 0;
"param_diag_memoryGraphOnResourceException" = 0;
"param_diag_queueDebugging_enable" = 1;
"param_diag_runtimeProfile_generate" = 0;
"param_diag_sanitizer_asan_enable" = 0;
"param_diag_sanitizer_tsan_enable" = 0;
"param_diag_sanitizer_tsan_stopOnIssue" = 0;
"param_diag_sanitizer_ubsan_stopOnIssue" = 0;
"param_diag_showNonLocalizedStrings" = 0;
"param_diag_viewDebugging_enabled" = 1;
"param_diag_viewDebugging_insertDylibOnLaunch" = 1;
"param_install_style" = 0;
"param_launcher_UID" = 2;
"param_launcher_allowDeviceSensorReplayData" = 0;
"param_launcher_kind" = 0;
"param_launcher_style" = 99;
"param_launcher_substyle" = 8192;
"param_runnable_appExtensionHostRunMode" = 0;
"param_runnable_productType" = "com.apple.product-type.application";
"param_structuredConsoleMode" = 1;
"param_testing_launchedForTesting" = 0;
"param_testing_suppressSimulatorApp" = 0;
"param_testing_usingCLI" = 0;
"sdk_canonicalName" = "macosx14.2";
"sdk_osVersion" = "14.2";
"sdk_variant" = macos;
}

System Information

macOS Version 14.3.1 (Build 23D60)
Xcode 15.2 (22503) (Build 15C500b)
Timestamp: 2024-03-09T21:07:42+08:00

Reducing halllucinations with first text token logprob thresholding

It would be great if the first text token's logprob can be used to discard a transcription draft as failed and start over. Start over could mean either falling back to a higher temperature sampling or updating the audio buffer for streaming use cases.

the example is unable to run on iphone 11 pro.

the example is unable to run on iphone 11 pro.
(The example is running good on Mac m1 max )

The following is the screen shot on iphone 11 pro. Base Model

debug log:
[WhisperKit] --------------- DECODER INPUTS DEBUG ---------------
[WhisperKit] Cache Length: 2 Input Token: 50359
[WhisperKit] Key Cache | Val Cache | Update Mask | Decoder Mask | Position
[WhisperKit] -0.125732 | 0.048828 | 0 | 0 | 0
[WhisperKit] 0.308350 | -0.556641 | 0 | 0 | 1
[WhisperKit] 0.000000 | 0.000000 | 1 | 0 | 2
[WhisperKit] 0.000000 | 0.000000 | 0 | -10000 | 3
[WhisperKit] [0.00 --> 14.90]
[WhisperKit] ---- Transcription Timings ----
[WhisperKit] Audio Load: 0.00 ms / 1 runs ( 0.00 ms/run) 0.00%
[WhisperKit] Audio Processing: 0.41 ms / 1 runs ( 0.41 ms/run) 0.03%
[WhisperKit] Mels: 57.57 ms / 1 runs ( 57.57 ms/run) 3.96%
[WhisperKit] Encoding: 1171.59 ms / 1 runs ( 1171.59 ms/run) 80.56%
[WhisperKit] Matrices Init: 5.36 ms / 1 runs ( 5.36 ms/run) 0.37%
[WhisperKit] Prefill: 0.49 ms / 1 runs ( 0.49 ms/run) 0.03%
[WhisperKit] Decoding: 208.06 ms / 4 runs ( 52.01 ms/run) 14.31%
[WhisperKit] Non-inference: 7.49 ms / 4 runs ( 1.87 ms/run) 0.52%
[WhisperKit] - Sampling: 4.13 ms / 4 runs ( 1.03 ms/run) 0.28%
[WhisperKit] - Kv Caching: 3.91 ms / 4 runs ( 0.98 ms/run) 0.27%
[WhisperKit] - Windowing: 0.08 ms / 1 runs ( 0.08 ms/run) 0.01%
[WhisperKit] Fallbacks: 122.98 ms / 0 runs ( 0.00 ms/run) 8.46%
[WhisperKit] Decoding Full Loop: 1448.16 ms / 4 runs ( 362.04 ms/run) 99.57%
[WhisperKit] -------------------------------
[WhisperKit] Model Load Time: 6.60 seconds
[WhisperKit] Inference Duration: 1.45 seconds
[WhisperKit] - Decoding Loop: 1.45 seconds
[WhisperKit] Time to first token: 1.30 seconds
[WhisperKit] Total Tokens: 5
[WhisperKit] Tokens per Second: 2.76 tok/s
[WhisperKit] Real Time Factor: 0.10
[WhisperKit] Fallbacks: 0.0
[WhisperKit] [0.00 --> 14.90] <|endoftext|>

Word level timestamps

Segment level timestamps look good, great work guys.

Are token level timestamps currently supported somehow, or on the roadmap?

Support for MacOS 13.0

Hi folks, just wanted to check in and ask what would be entailed in adding support for older mac versions, such as 13.0?

Using locally saved models

Hey! Thanks for making WhisperKit!

I hope I did not miss it in the documentation. But is it possible to provide a local URL to the model for the WhisperKit instead of relying on its internal mechanism to load the model? Inside my app I already have a nice UI that allows to download, suspend, and cancel download progress so would be nice if I could then feed WhisperKit with the local URL.

If there is no such functionality but you are considering adding it - I might try to help by making a PR.

Thanks.

Language Detection

Language detection here should be fairly simple with logits filters now, it will entail a single decoder pass and sample just the language tokens. However, this cannot be used when we are using a prefill prompt (i.e. forced decoder tokens) so that will need special handling.

References

Openai implementation: https://github.com/openai/whisper/blob/ba3f3cd54b0e5b8ce1ab3de13e32122d0d5f98ab/whisper/decoding.py#L19
WhisperKit inline todo:

WhisperKit/Sources/WhisperKit/Core/TextDecoder.swift

Line 300 in 228630c

// TODO: implement

ReactNative Swift APIs

It would be worth adding support for ReactNative apps using Native Modules and expose Swift APIs to JS.

Any plans to support Android and WearOS in the future?

MLX

Hey,
Apple just dropped MLX-Swift, a cross-platform (currently only iOS/macOS) MLX framework. Are there any plans to support it?
Thanks!

Add `brew install trash` to `make setup` script

trash, used in the following make rule, is not part of a default macOS setup.

clean-package-caches:
	@trash ~/Library/Caches/org.swift.swiftpm/repositories
	@trash ~/Library/Developer/Xcode/DerivedData

I see three options to address this:

use rm instead, and delete immediately
use mv, and move the files to the user's trash in ~/.Trash (only works properly if the files are in the local disk; for external hard drives trashes are at /Volumes/NAME_OF_EXTERNAL/.Trashes/USER_ID/, and to handle these cases probably better go with option 3)
install trash using Homebrew in the setup rule.

Originally posted by @metropol in #47

Specialization takes a really long time

I'm trying the demo app on a MacBook Pro with Apple M1 Pro and 16 GB memory. The large-v3_turbo_1049MB model has been specializing for more than 30 minutes, but aned is still running and using a whole performance core. Have you guys tested the loading time on different devices?

Support with older swift version

I have problem being able to develop and run with this sadly.

I am running a AMD cpu windows 11 pc. I am using vmware to get MacOS, however I am not able to run any MacOS version after 12, due to amd cpu not supporting this. This in turn means that I cannot run the later versions of xcode that support swift 5.9.

Would you ever considering backporting some of this functionality for previous versions of swift?

Avoid requiring an internet connection to transcribe

Currently when using the default WhisperKit flow of auto downloading models on transcribe, an internet connection is required even if models have already been downloaded in the past due to swift-transformers fetching the filenames here.

This is a bit limiting, as e.g. @pveugen was on a train with poor internet and couldn't transcribe audio even after downloading the model in the past (after #80 it would throw an error instead of crashing). I think we could get around this by manually downloading and specifying the path in setupModels modelFolder:, but it would be nice if there was a way to avoid this HTTP get by default.

No Speech Detection

This can be done with logit filters on the first loop, similar to detecting language. However, this cannot be used when we are using a prefill prompt (i.e. forced decoder tokens) so that will need special handling. Ideally, there'd be an option to ignore the prefill prompt for the first decoder loop to detect no speech, which costs 1 extra loop but may allow skipping the entire window if developers are expecting some long stretches of silence in their input audio.

References

Openai implementation: https://github.com/openai/whisper/blob/ba3f3cd54b0e5b8ce1ab3de13e32122d0d5f98ab/whisper/decoding.py#L692-L693
WhisperKit inline todo:

WhisperKit/Sources/WhisperKit/Core/TextDecoder.swift

Line 497 in 228630c

noSpeechProb: 0, // TODO: implement no speech prob

WhisperKit/Sources/WhisperKit/Core/WhisperKit.swift

Lines 612 to 616 in 228630c

 if let threshold = options.noSpeechThreshold, 

 result.noSpeechProb > threshold 

 { 

 needsFallback = false // silence 

 }

Add support for macOS Ventura (13.0)

From what I understood there are some limitations and degradations to the model quality but it would still be nice to be able to support users on Ventura (and iOS 16)

Crash when starting whisperKit in Simulators iPhone 15 Pro 17.4

I found the sumOfbestIndicesResult value is NaN.

Reducing hallucinations by removing zero-length words based on word timestamps

It would be great if certain patterns in the newly added word timestamps (#38 ) can be leveraged to reduce the incidence rate of hallucinations. This change will require comprehensive re-evaluation of the models since accurate words could also have zero-length based on inaccurate word timestamps.

Timestamp Rules Logits Filter

Timestamp rules are helpful to more consistently find reliable timestamps during decoding.

Important note: We have already brought over some of this logic into the SegmentSeeker which runs at the end of a full decode loop to generate the segments. This feature will need to detangle any repeated logic between them.

References:

Openai implementation: https://github.com/openai/whisper/blob/ba3f3cd54b0e5b8ce1ab3de13e32122d0d5f98ab/whisper/decoding.py#L441-L505

Unable to load model in CLI

Hey folks! I'm trying to use the CLI, but it fails to load models:

Building for debugging...
Build complete! (0.07s)
Error: Unable to load model: file:///Users/usmanm/whisperkit/Models/whisperkit-coreml/openai_whisper-tiny/MelSpectrogram.mlmodelc/. Compile the model with Xcode or `MLModel.compileModel(at:)`.

The setup instructions seemed to have worked correctly:

➜  whisperkit git:(main) make setup
Setting up environment...
/opt/homebrew/bin/pip3
/opt/homebrew/bin/python3
Requirement already satisfied: huggingface_hub in /opt/homebrew/lib/python3.11/site-packages (0.20.3)
Requirement already satisfied: filelock in /opt/homebrew/lib/python3.11/site-packages (from huggingface_hub) (3.13.1)
Requirement already satisfied: fsspec>=2023.5.0 in /opt/homebrew/lib/python3.11/site-packages (from huggingface_hub) (2023.10.0)
Requirement already satisfied: requests in /opt/homebrew/lib/python3.11/site-packages (from huggingface_hub) (2.31.0)
Requirement already satisfied: tqdm>=4.42.1 in /opt/homebrew/lib/python3.11/site-packages (from huggingface_hub) (4.66.1)
Requirement already satisfied: pyyaml>=5.1 in /opt/homebrew/lib/python3.11/site-packages (from huggingface_hub) (6.0.1)
Requirement already satisfied: typing-extensions>=3.7.4.3 in /opt/homebrew/lib/python3.11/site-packages (from huggingface_hub) (4.8.0)
Requirement already satisfied: packaging>=20.9 in /opt/homebrew/lib/python3.11/site-packages (from huggingface_hub) (23.2)
Requirement already satisfied: charset-normalizer<4,>=2 in /opt/homebrew/lib/python3.11/site-packages (from requests->huggingface_hub) (3.3.2)
Requirement already satisfied: idna<4,>=2.5 in /opt/homebrew/lib/python3.11/site-packages (from requests->huggingface_hub) (3.4)
Requirement already satisfied: urllib3<3,>=1.21.1 in /opt/homebrew/lib/python3.11/site-packages (from requests->huggingface_hub) (2.0.7)
Requirement already satisfied: certifi>=2017.4.17 in /opt/homebrew/lib/python3.11/site-packages (from requests->huggingface_hub) (2023.7.22)
usmanm
Already logged in to Hugging Face.
➜  whisperkit git:(main) make download-models
Downloading compressed models...
Repository exists, pulling latest changes...
HEAD is now at 07ea546 Create config.json

[FEATURE REQUEST] custom prompt to pass vocabulary words

Like OpenAI's Whisper, is it possible to pass a text prompt which could be used to improve the quality of the transcript?

See https://cookbook.openai.com/examples/whisper_prompting_guide#pass-names-in-the-prompt-to-prevent-misspellings

Laurent

Beam Search

Beam search on CoreML will require some model changes to work according to the reference implementation. This is mainly due to CoreML static shapes requiring a new model for each possible beam_size. We have some plans to deal with this so will keep this issue here for tracking purposes.

References

Openai implementation: https://github.com/openai/whisper/blob/ba3f3cd54b0e5b8ce1ab3de13e32122d0d5f98ab/whisper/decoding.py#L301-L404

After Steps, I can't start my project.

I created a new project in xcode, named WhisperKit.
I added WhisperKit according to the steps.
I added the following code to WhisperKit/WhisperKit/WhisperKitApp

import SwiftUI
import WhisperKit

@main
struct WhisperKitApp: App {
    init() {
        Task {
            do {
                let pipe = try? await WhisperKit()
                let transcription = try? await pipe!.transcribe(audioPath: "Audio/output-lang.wav")?.text
                print(transcription)
            } catch {
                print("Error: \(error)")
            }
        }
    }
    
    var body: some Scene {
        WindowGroup {
            ContentView()
        }
    }
}

Then I get an error : Cannot call value of non-function type 'module<WhisperKit>'

What should I do to solve this problem? tks.

Some timing tokens are included in word timestamps

When filtering out special tokens in addWordTimestamps, word timings that contain a timing token followed by a hyphen aren't filtered out correctly. WordTiming.tokens correctly contains just [532], but WordTiming.word is "<|0.00|> -". This seems to occur most when multiple people are talking over each other in a recording, I guess it's Whisper's way of trying to label speakers.

show benchmarks

the advantage of this project is that it uses CoreML for a performance gain, so showing benchmarks would solidify how much this advantage is

download model failed

How to fix this issue?

Task <60D6EF47-1009-4EFE-9E1B-5988A7FD6E4F>.<1> HTTP load failed, 0/0 bytes (error code: -1200 [3:-9816])
Task <60D6EF47-1009-4EFE-9E1B-5988A7FD6E4F>.<1> finished with error [-1200] Error Domain=NSURLErrorDomain Code=-1200 "An SSL error has occurred and a secure connection to the server cannot be made." UserInfo={NSErrorFailingURLStringKey=https://cdn-lfs-us-1.huggingface.co/repos/8f/fc/8ffc19694b8dfd29ebaafed41040596f15c2a6ee94d3e9f8a0bf0f1523bade3c/6ac1227740ecc2fd7a03df50ac6e2a7f7946acfa77069cf2c486ae0255356b95?response-content-disposition=attachment%3B+filename*%3DUTF-8%27%27coremldata.bin%3B+filename%3D%22coremldata.bin%22%3B&response-content-type=application%2Foctet-stream&Expires=1710121327&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTcxMDEyMTMyN319LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2RuLWxmcy11cy0xLmh1Z2dpbmdmYWNlLmNvL3JlcG9zLzhmL2ZjLzhmZmMxOTY5NGI4ZGZkMjllYmFhZmVkNDEwNDA1OTZmMTVjMmE2ZWU5NGQzZTlmOGEwYmYwZjE1MjNiYWRlM2MvNmFjMTIyNzc0MGVjYzJmZDdhMDNkZjUwYWM2ZTJhN2Y3OTQ2YWNmYTc3MDY5Y2YyYzQ4NmFlMDI1NTM1NmI5NT9yZXNwb25zZS1jb250ZW50LWRpc3Bvc2l0aW9uPSomcmVzcG9uc2UtY29udGVudC10eXBlPSoifV19&Signature=TR-WYiW9gDlLkJIYv-2TaU4UYLNidoOb9oE-OXvBkpsmBHYZ7%7ElhzAoGKa7aqBYGcUnDmmJG0HTJXVyz-6dYbX%7E6vlU8j3x83mJfi2DEPRKzW1RB0tjRx4HMOpuP1G5FMr9CWBvS8M-icXoz-Beyu%7EmyDcLzKISUPV-RFlw1Jm72PiLb5MvCpdw2cdlDfFYUbmzYYIyWsUZsK5YuB6R187AXqM00lIy05xzIOhmuwJzL1XSMzu5-D2WxnNfkBDP4NUiX6OtYhZgJVA9I2ELqmHhOs4qX6HNAXOkxz6KtnuWEpO3N8%7E-yZ%7EPPeNcOudyuAMKw1m2qp0L8JuUxhqCP8Q__&Key-Pair-Id=KCD77M1F0VK2B, NSLocalizedRecoverySuggestion=Would you like to connect to the server anyway?, _kCFStreamErrorDomainKey=3, _NSURLErrorFailingURLSessionTaskErrorKey=LocalDownloadTask <60D6EF47-1009-4EFE-9E1B-5988A7FD6E4F>.<1>, _NSURLErrorRelatedURLSessionTaskErrorKey=(
"LocalDownloadTask <60D6EF47-1009-4EFE-9E1B-5988A7FD6E4F>.<1>"
), NSLocalizedDescription=An SSL error has occurred and a secure connection to the server cannot be made., NSErrorFailingURLKey=https://cdn-lfs-us-1.huggingface.co/repos/8f/fc/8ffc19694b8dfd29ebaafed41040596f15c2a6ee94d3e9f8a0bf0f1523bade3c/6ac1227740ecc2fd7a03df50ac6e2a7f7946acfa77069cf2c486ae0255356b95?response-content-disposition=attachment%3B+filename*%3DUTF-8%27%27coremldata.bin%3B+filename%3D%22coremldata.bin%22%3B&response-content-type=application%2Foctet-stream&Expires=1710121327&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTcxMDEyMTMyN319LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2RuLWxmcy11cy0xLmh1Z2dpbmdmYWNlLmNvL3JlcG9zLzhmL2ZjLzhmZmMxOTY5NGI4ZGZkMjllYmFhZmVkNDEwNDA1OTZmMTVjMmE2ZWU5NGQzZTlmOGEwYmYwZjE1MjNiYWRlM2MvNmFjMTIyNzc0MGVjYzJmZDdhMDNkZjUwYWM2ZTJhN2Y3OTQ2YWNmYTc3MDY5Y2YyYzQ4NmFlMDI1NTM1NmI5NT9yZXNwb25zZS1jb250ZW50LWRpc3Bvc2l0aW9uPSomcmVzcG9uc2UtY29udGVudC10eXBlPSoifV19&Signature=TR-WYiW9gDlLkJIYv-2TaU4UYLNidoOb9oE-OXvBkpsmBHYZ7%7ElhzAoGKa7aqBYGcUnDmmJG0HTJXVyz-6dYbX%7E6vlU8j3x83mJfi2DEPRKzW1RB0tjRx4HMOpuP1G5FMr9CWBvS8M-icXoz-Beyu%7EmyDcLzKISUPV-RFlw1Jm72PiLb5MvCpdw2cdlDfFYUbmzYYIyWsUZsK5YuB6R187AXqM00lIy05xzIOhmuwJzL1XSMzu5-D2WxnNfkBDP4NUiX6OtYhZgJVA9I2ELqmHhOs4qX6HNAXOkxz6KtnuWEpO3N8%7E-yZ%7EPPeNcOudyuAMKw1m2qp0L8JuUxhqCP8Q__&Key-Pair-Id=KCD77M1F0VK2B, NSUnderlyingError=0x600000c7edf0 {Error Domain=kCFErrorDomainCFNetwork Code=-1200 "(null)" UserInfo={_kCFStreamPropertySSLClientCertificateState=0, _kCFNetworkCFStreamSSLErrorOriginalValue=-9816, _kCFStreamErrorDomainKey=3, _kCFStreamErrorCodeKey=-9816, _NSURLErrorNWPathKey=satisfied (Path is satisfied), interface: en0}}, _kCFStreamErrorCodeKey=-9816}
WhisperKit/WhisperKit.swift:194: Fatal error: Unexpectedly found nil while unwrapping an Optional value

Unable to delete the model

It looks like the model is deleted when I use FIleManager removeAt method but when I re-run the project the deleted model appears again.

FileManager.default.removeItem(at: URL.init(string: "file://" + "(path)")!)

Stream with audio output

Thank you for your WORK!!!

I'm not a MacOS developer, but a user. I want to know if it's possible to use the computer's audio output in Stream, not just the microphone. The scenario is similar to simultaneous interpretation in meetings.

I look forward to your reply, thank you again!!!

Initializing models... Error: unknown error

When using Swift CLI example, followed the exact commands mentioned, getting above error:

Chip: Apple M3 Max
OS: 14.3.1
Xcode: Version 15.2
Apple Swift version: 5.9.2

The app “WhisperAX” has been killed by the operating system because it is using too much memory.

The app crashes after recording a few seconds of sound. It's being used on an iPhone 12 mini device that has been cold restarted, with Large-v2_1050MB.

The app “WhisperAX” has been killed by the operating system because it is using too much memory.
Domain: IDEDebugSessionErrorDomain
Code: 11
Recovery Suggestion: Use a memory profiling tool to track the process memory usage.
User Info: {
    DVTErrorCreationDateKey = "2024-03-12 18:15:07 +0000";
    IDERunOperationFailingWorker = DBGLLDBLauncher;
}
--
The app “WhisperAX” has been killed by the operating system because it is using too much memory.
Domain: IDEDebugSessionErrorDomain
Code: 11
Recovery Suggestion: Use a memory profiling tool to track the process memory usage.
User Info: {
    IDERunOperationFailingWorker = DBGLLDBLauncher;
}
--

Event Metadata: com.apple.dt.IDERunOperationWorkerFinished : {
    "device_isCoreDevice" = 1;
    "device_model" = "iPhone13,1";
    "device_osBuild" = "17.3.1 (21D61)";
    "device_platform" = "com.apple.platform.iphoneos";
    "dvt_coredevice_version" = "355.24";
    "dvt_mobiledevice_version" = "1643.100.58";
    "launchSession_schemeCommand" = Run;
    "launchSession_state" = 2;
    "launchSession_targetArch" = arm64;
    "operation_duration_ms" = 968315;
    "operation_errorCode" = 11;
    "operation_errorDomain" = IDEDebugSessionErrorDomain;
    "operation_errorWorker" = DBGLLDBLauncher;
    "operation_name" = IDERunOperationWorkerGroup;
    "param_debugger_attachToExtensions" = 0;
    "param_debugger_attachToXPC" = 1;
    "param_debugger_type" = 3;
    "param_destination_isProxy" = 0;
    "param_destination_platform" = "com.apple.platform.iphoneos";
    "param_diag_MainThreadChecker_stopOnIssue" = 0;
    "param_diag_MallocStackLogging_enableDuringAttach" = 0;
    "param_diag_MallocStackLogging_enableForXPC" = 1;
    "param_diag_allowLocationSimulation" = 1;
    "param_diag_checker_tpc_enable" = 1;
    "param_diag_gpu_frameCapture_enable" = 0;
    "param_diag_gpu_shaderValidation_enable" = 0;
    "param_diag_gpu_validation_enable" = 0;
    "param_diag_memoryGraphOnResourceException" = 0;
    "param_diag_queueDebugging_enable" = 1;
    "param_diag_runtimeProfile_generate" = 0;
    "param_diag_sanitizer_asan_enable" = 0;
    "param_diag_sanitizer_tsan_enable" = 0;
    "param_diag_sanitizer_tsan_stopOnIssue" = 0;
    "param_diag_sanitizer_ubsan_stopOnIssue" = 0;
    "param_diag_showNonLocalizedStrings" = 0;
    "param_diag_viewDebugging_enabled" = 1;
    "param_diag_viewDebugging_insertDylibOnLaunch" = 1;
    "param_install_style" = 2;
    "param_launcher_UID" = 2;
    "param_launcher_allowDeviceSensorReplayData" = 0;
    "param_launcher_kind" = 0;
    "param_launcher_style" = 99;
    "param_launcher_substyle" = 8192;
    "param_runnable_appExtensionHostRunMode" = 0;
    "param_runnable_productType" = "com.apple.product-type.application";
    "param_structuredConsoleMode" = 1;
    "param_testing_launchedForTesting" = 0;
    "param_testing_suppressSimulatorApp" = 0;
    "param_testing_usingCLI" = 0;
    "sdk_canonicalName" = "iphoneos17.4";
    "sdk_osVersion" = "17.4";
    "sdk_variant" = iphoneos;
}
--


System Information

macOS Version 14.2.1 (Build 23C71)
Xcode 15.3 (22618) (Build 15E204a)
Timestamp: 2024-03-12T11:15:07-07:00

Link to download models on-demand

As an example, It would be great if there is a way to download models on demand for example like this

Reduce redundant decoder forward passes by leveraging word-level timestamps

The goal is to leverage the high-quality word-level timestamps added in #38 as anchors to reliably seek the audio buffer forward at a higher frequency compared to current behavior:

Current behavior is to seek the audio forward if <|endoftext|> is generated or max_tokens tokens are generated.
Current behavior results in wasteful compute because each text token is re-decoded until the audio seeks beyond them.
This is up to 29 times redundant (worst case) for a 1 second audio refresh rate and a 30 second audio window for Whisper.

Streaming Microphone for CLI

The CLI executable should be able to stream directly from the microphone, similar to the WhisperAX example app. This enables use cases outside of an Xcode project.

Reference

WhisperAX streaming code:

WhisperKit/Examples/WhisperAX/WhisperAX/Views/ContentView.swift

Lines 997 to 1127 in 228630c

 // MARK: Streaming Logic 

 func realtimeLoop() { 

 transcriptionTask = Task { 

 while isRecording && isTranscribing { 

 do { 

 try await transcribeCurrentBuffer() 

 } catch { 

 print("Error: \(error.localizedDescription)") 

 break 

 } 

 } 

 } 

 } 

 func stopRealtimeTranscription() { 

 isTranscribing = false 

 transcriptionTask?.cancel() 

 } 

 func transcribeCurrentBuffer() async throws { 

 guard let whisperKit = whisperKit else { return } 

 // Retrieve the current audio buffer from the audio processor 

 let currentBuffer = whisperKit.audioProcessor.audioSamples 

 // Calculate the size and duration of the next buffer segment 

 let nextBufferSize = currentBuffer.count - lastBufferSize 

 let nextBufferSeconds = Float(nextBufferSize) / Float(WhisperKit.sampleRate) 

 // Only run the transcribe if the next buffer has at least 1 second of audio 

 guard nextBufferSeconds > 1 else { 

 await MainActor.run { 

 if currentText == "" { 

 currentText = "Waiting for speech..." 

 } 

 } 

 try await Task.sleep(nanoseconds: 100_000_000) // sleep for 100ms for next buffer 

 return 

 } 

 if useVAD { 

 // Retrieve the current relative energy values from the audio processor 

 let currentRelativeEnergy = whisperKit.audioProcessor.relativeEnergy 

 // Calculate the number of energy values to consider based on the duration of the next buffer 

 // Each energy value corresponds to 1 buffer length (100ms of audio), hence we divide by 0.1 

 let energyValuesToConsider = Int(nextBufferSeconds / 0.1) 

 // Extract the relevant portion of energy values from the currentRelativeEnergy array 

 let nextBufferEnergies = currentRelativeEnergy.suffix(energyValuesToConsider) 

 // Determine the number of energy values to check for voice presence 

 // Considering up to the last 1 second of audio, which translates to 10 energy values 

 let numberOfValuesToCheck = max(10, nextBufferEnergies.count - 10) 

 // Check if any of the energy values in the considered range exceed the silence threshold 

 // This indicates the presence of voice in the buffer 

 let voiceDetected = nextBufferEnergies.prefix(numberOfValuesToCheck).contains { $0 > Float(silenceThreshold) } 

 // Only run the transcribe if the next buffer has voice 

 guard voiceDetected else { 

 await MainActor.run { 

 if currentText == "" { 

 currentText = "Waiting for speech..." 

 } 

 } 

 // if nextBufferSeconds > 30 { 

 // // This is a completely silent segment of 30s, so we can purge the audio and confirm anything pending 

 // lastConfirmedSegmentEndSeconds = 0 

 // whisperKit.audioProcessor.purgeAudioSamples(keepingLast: 2 * WhisperKit.sampleRate) // keep last 2s to include VAD overlap 

 // currentBuffer = whisperKit.audioProcessor.audioSamples 

 // lastBufferSize = 0 

 // confirmedSegments.append(contentsOf: unconfirmedSegments) 

 // unconfirmedSegments = [] 

 // } 

 // Sleep for 100ms and check the next buffer 

 try await Task.sleep(nanoseconds: 100_000_000) 

 return 

 } 

 } 

 // Run transcribe 

 lastBufferSize = currentBuffer.count 

 let transcription = try await transcribeAudioSamples(Array(currentBuffer)) 

 // We need to run this next part on the main thread 

 await MainActor.run { 

 currentText = "" 

 unconfirmedText = [] 

 guard let segments = transcription?.segments else { 

 return 

 } 

 self.tokensPerSecond = transcription?.timings?.tokensPerSecond ?? 0 

 self.realTimeFactor = transcription?.timings?.realTimeFactor ?? 0 

 self.firstTokenTime = transcription?.timings?.firstTokenTime ?? 0 

 self.pipelineStart = transcription?.timings?.pipelineStart ?? 0 

 self.currentLag = transcription?.timings?.decodingLoop ?? 0 

 // Logic for moving segments to confirmedSegments 

 if segments.count > requiredSegmentsForConfirmation { 

 // Calculate the number of segments to confirm 

 let numberOfSegmentsToConfirm = segments.count - requiredSegmentsForConfirmation 

 // Confirm the required number of segments 

 let confirmedSegmentsArray = Array(segments.prefix(numberOfSegmentsToConfirm)) 

 let remainingSegments = Array(segments.suffix(requiredSegmentsForConfirmation)) 

 // Update lastConfirmedSegmentEnd based on the last confirmed segment 

 if let lastConfirmedSegment = confirmedSegmentsArray.last, lastConfirmedSegment.end > lastConfirmedSegmentEndSeconds { 

 lastConfirmedSegmentEndSeconds = lastConfirmedSegment.end 

 // Add confirmed segments to the confirmedSegments array 

 if !self.confirmedSegments.contains(confirmedSegmentsArray) { 

 self.confirmedSegments.append(contentsOf: confirmedSegmentsArray) 

 } 

 } 

 // Update transcriptions to reflect the remaining segments 

 self.unconfirmedSegments = remainingSegments 

 } else { 

 // Handle the case where segments are fewer or equal to required 

 self.unconfirmedSegments = segments 

 } 

 } 

 } 

 }

Streaming Emulation for Files

Needed for benchmarking the streaming functionality, as well as generally testing it's accuracy and performance. A simple loop can be made to read a file in incremental n second chunks, where the audio length increases by n seconds each loop, and the transcription is appended as the audio size increases.

Want to use AVCaptureSession buffers instead of AVAudioEngine

Hey there!

First off, thanks so much for building this awesome library! Its a total pleasure to use and works great. Looking forward to the Metal update. In the meantime, I was curious if you all would accept a PR to allow for AVCaptureSession to be used in the AudioProcessor class instead of AVAudioEngine.

I was thinking of creating a way to pass in a new setupEngine function that allowed for the captureOutput delegate to be used in place of the installTap function. The reason I want to do this is it makes it easier to change the microphone in app instead of relying on the system default.

Would it make sense to allow for this in the AudioProcessor? If so, Im happy to come up with a clean interface proposal.
If no, perhaps theres a way to override the AudioProcessor class and provide an alternate setupEngine function?

Dependency issue with v0.2.0

Seems like I cannot resolve the packages correctly with 0.20:

swift package update                                                                                                               🌱 main 📝 ×3 via 🐦 v5.9.2
Updating https://github.com/apple/swift-argument-parser
Updating https://github.com/argmaxinc/whisperkit
Updated https://github.com/argmaxinc/whisperkit (0.43s)
Updated https://github.com/apple/swift-argument-parser (0.43s)
Computing version for https://github.com/argmaxinc/whisperkit
error: Dependencies could not be resolved because root depends on 'whisperkit' 0.2.0..<1.0.0.
'whisperkit' >= 0.2.0 cannot be used because no versions of 'whisperkit' match the requirement 0.2.1..<1.0.0 and package 'whisperkit' is required using a stable-version but 'whisperkit' depends on an unstable-version package 'swift-transformers'.

The doc on SPM dependencies says:

packages which use commit-based dependency requirements can't be added as dependencies to packages that use version-based dependency requirements

Duration limit?

Does it have a duration limit? I remember that Whisper limits the input file to 30 seconds, but when I tested it on macOS, the app could handle much longer duration audio files. Do you have to chunk the audio files before transcription?

Speaker Diarization

Hi,
Is speaker diarization planned (espec. in realtime)?
Thx!

Feature request: In example, provide delete and re-download options

Models are taking really good time to download. If the WiFI is OFF for a second it gets struck - it would be great if the example has two more options - Delete and Retry.

Implement memory and latency regression tests

Implement tests to transcribe long audio files (at least several minutes worth) and measure the memory and latency over time. This is to guard against memory leaks or slowdowns potentially being introduced by new PRs (e.g. #40 fixed by #56 thanks to @finnvoor!)

Does WhisperKit support simulator? It emits only [silence] on Simulator but on-device is good.

Index out of range error in TextDecoder

Occasionally Im seeing an index out of range crash on the segmentLogProbs[index] after a long period of silence. https://github.com/argmaxinc/WhisperKit/blob/main/Sources/WhisperKit/Core/TextDecoder.swift#L518-L521

Swift/ContiguousArrayBuffer.swift:600: Fatal error: Index out of range

Two ways I could see guarding against this:

Use swift zip
Check the index against segmentLogProbs count.

for (token, logProb) in zip(segmentTokens, segmentLogProbs) {
    tokenProbs.append([token: logProb])
}

for (index, token) in segmentTokens.enumerated() {
  if index < segmentLogProbs.count {
      tokenProbs.append([token: segmentLogProbs[index]])
  }
}

Happy to PR either one but unsure if Im missing a reason for this being as is.

Ability to pass prompt text when transcribing

need it for real-time stuff like: https://arxiv.org/pdf/2307.14743.pdf
p.s. great project!

Benchmark for WhisperAX & CLI

It would be great to start collecting reproducible performance benchmarks for supported hardware (e.g. A14+ and M1+). This should be a self-contained function that uses openai/whisper-base by default and optionally other versions that the benchmark submitter selects. Benchmarks should run on a standard set of audio files and reports should be in a digestible and shareable format:

Psuedo-code may look like this:

Detect current hardware and load the models that the user has chosen to benchmark (single, multiple, or all available models)
Download standard audio files from Hugging (jfk.wav for short-form, ted_60.wav and a sample clip from earnings22 for long-form transcriptions)
Generate the transcriptions over several iterations and runtime tabulate statistics.
- Runs in streaming and file-based "offline" mode - this will require streaming emulation
- Completes short-form bench and presents results before moving to long-form bench which can potentially take several minutes to complete
- Will want to track: time to first token, RTF, inference timings (for encoder and decoder), total pipeline timings (model load -> transcription result)
Export these into a markdown table with relevant device info, and current commit hash, which can be posted to GitHub for public tracking

References

Open ASR leaderboard benchmarks: https://github.com/huggingface/open_asr_leaderboard
Nice script for collecting environment info: https://github.com/pytorch/pytorch/blob/main/torch/utils/collect_env.py

Related Issue

Publish WhisperKit CLI on Homebrew

It would be great if brew install whisperkit just works and the WhisperKit CLI target on macOS could become an out-of-the-box real-time transcription utility.

Unable to load models

Hey guys! This looks great, unfortunately I'm having issues loading the models (both in my own code and the sample app)

I'm running this on an M1 Macbook Pro.

Many of the models don't load at all, even when given enough time (the progress bar usually gets stuck around specialization)

I've tried to download the models and manually use it as well, but I'm having trouble loading them as well.

Failed to read model package at file:///Users/puravmanot/Developer/Projects/WhisperTesting/WhisperTesting/whisper_large_v3_turbo. Error: A valid manifest does not exist at path: /Users/puravmanot/Developer/Projects/WhisperTesting/WhisperTesting/whisper_large_v3_turbo/Manifest.json

It also gets stuck sometimes while loading a pre-downloaded model

[WhisperKit] Loading models...

Resample audio file in chunks to reduce memory usage

WhisperKit/Sources/WhisperKit/Core/AudioProcessor.swift

Lines 197 to 217 in fed90c7

 let newFrameLength = Int64((sampleRate / audioFile.fileFormat.sampleRate) * Double(audioFile.length)) 

 let outputFormat = AVAudioFormat(standardFormatWithSampleRate: sampleRate, channels: channelCount)! 

 guard let converter = AVAudioConverter(from: audioFile.processingFormat, to: outputFormat) else { 

 Logging.error("Failed to create audio converter") 

 return nil 

 } 

 let frameCount = AVAudioFrameCount(audioFile.length) 

 guard let inputBuffer = AVAudioPCMBuffer(pcmFormat: audioFile.processingFormat, frameCapacity: frameCount), 

 let outputBuffer = AVAudioPCMBuffer(pcmFormat: outputFormat, frameCapacity: AVAudioFrameCount(newFrameLength)) 

 else { 

 Logging.error("Unable to create buffers, likely due to unsupported file format") 

 return nil 

 } 

 do { 

 try audioFile.read(into: inputBuffer, frameCount: frameCount) 

 } catch { 

 Logging.error("Error reading audio file: \(error)") 

 return nil 

 }

Creating an AVAudioPCMBuffer for the whole input audio buffer can easily surpass iOS memory limits.

Attempting to transcribe a 44100hz, 2 channel, ~1hr long video crashes on iOS due to running out of memory. It would be nice if instead of reading all the input audio into a buffer at once and converting, the audio was read and converted in chunks to reduce the memory usage.

Another less common issue that would be solved by chunking the audio is that AVAudioPCMBuffer has a max size of UInt32.max, which can be hit when transcribing a 1-2hr, 16 channel, 44100hz audio file. This is a fairly typical audio file for a podcast recorded with a RODECaster Pro.

Any way to use with iOS 15 minimum target?

Hey guys, great work on the project and nice job with the word level timestamps recently!

Noticed that the package says the minimum target was recently changed to iOS 16 per https://github.com/argmaxinc/WhisperKit/blob/main/Package.swift, which is great.

What would be involved in bringing this down to iOS 15?

please add ipad mode

would be great

Crash when starting whisperKit in iOS simulator or visionPro simulator

I am getting this error when trying to start WhisperKit in any simulator. Can someone say what could it be and how to fix?

*** Terminating app due to uncaught exception 'com.apple.coreaudio.avfaudio', reason: 'required condition is false: IsFormatSampleRateAndChannelCountValid(format)'
*** First throw call stack:
(
0 CoreFoundation 0x00000001804bceec exceptionPreprocess + 172
1 libobjc.A.dylib 0x0000000180087068 objc_exception_throw + 56
2 CoreFoundation 0x00000001804bcd90 +[NSException raise:format:] + 0
3 AVFAudio 0x00000001c7789130 Z19AVAE_RaiseExceptionP8NSStringz + 48
4 AVFAudio 0x00000001c77e0b84 ZN17AUGraphNodeBaseV318CreateRecordingTapEmjP13AVAudioFormatU13block_pointerFvP16AVAudioPCMBufferP11AVAudioTimeE + 712
5 AVFAudio 0x00000001c78504d4 -[AVAudioNode installTapOnBus:bufferSize:format:block:] + 1324
6 languagelearn 0x0000000102091988 $s10WhisperKit14AudioProcessorC11setupEngine13inputDeviceIDSo07AVAudioF0CSSSg_tKF + 852
7 languagelearn 0x0000000102090c4c $s10WhisperKit14AudioProcessorC18startRecordingLive13inputDeviceID8callbackySSSg_ySaySfGcSgtKF + 224
8 languagelearn 0x0000000102090b2c $s10WhisperKit14AudioProcessorCAA0C10ProcessingA2aDP18startRecordingLive13inputDeviceID8callbackySSSg_ySaySfGcSgtKFTW + 24
9 languagelearn 0x000000010206bc04 $s13languagelearn11ContentViewV14startRecordingyySbFyyYaYbcfU_TY1 + 372
10 languagelearn 0x0000000102077ea5 $s13languagelearn11ContentViewV14startRecordingyySbFyyYaYbcfU_TATQ0 + 1
11 languagelearn 0x0000000102085369 $sxIeghHr_xs5Error_pIegHrzo_s8SendableRzs5NeverORs_r0_lTRTQ0 + 1
12 languagelearn 0x00000001020873cd $sxIeghHr_xs5Error_pIegHrzo_s8SendableRzs5NeverORs_r0_lTRTATQ0 + 1
13 libswift_Concurrency.dylib 0x000000020bfbf621 _ZL23completeTaskWithClosurePN5swift12AsyncContextEPNS_10SwiftErrorE + 1
)
libc++abi: terminating due to uncaught exception of type NSException

	if let threshold = options.noSpeechThreshold,
	result.noSpeechProb > threshold
	{
	needsFallback = false // silence
	}

	// MARK: Streaming Logic

	func realtimeLoop() {
	transcriptionTask = Task {
	while isRecording && isTranscribing {
	do {
	try await transcribeCurrentBuffer()
	} catch {
	print("Error: \(error.localizedDescription)")
	break
	}
	}
	}
	}

	func stopRealtimeTranscription() {
	isTranscribing = false
	transcriptionTask?.cancel()
	}

	func transcribeCurrentBuffer() async throws {
	guard let whisperKit = whisperKit else { return }

	// Retrieve the current audio buffer from the audio processor
	let currentBuffer = whisperKit.audioProcessor.audioSamples

	// Calculate the size and duration of the next buffer segment
	let nextBufferSize = currentBuffer.count - lastBufferSize
	let nextBufferSeconds = Float(nextBufferSize) / Float(WhisperKit.sampleRate)

	// Only run the transcribe if the next buffer has at least 1 second of audio
	guard nextBufferSeconds > 1 else {
	await MainActor.run {
	if currentText == "" {
	currentText = "Waiting for speech..."
	}
	}
	try await Task.sleep(nanoseconds: 100_000_000) // sleep for 100ms for next buffer
	return
	}

	if useVAD {
	// Retrieve the current relative energy values from the audio processor
	let currentRelativeEnergy = whisperKit.audioProcessor.relativeEnergy

	// Calculate the number of energy values to consider based on the duration of the next buffer
	// Each energy value corresponds to 1 buffer length (100ms of audio), hence we divide by 0.1
	let energyValuesToConsider = Int(nextBufferSeconds / 0.1)

	// Extract the relevant portion of energy values from the currentRelativeEnergy array
	let nextBufferEnergies = currentRelativeEnergy.suffix(energyValuesToConsider)

	// Determine the number of energy values to check for voice presence
	// Considering up to the last 1 second of audio, which translates to 10 energy values
	let numberOfValuesToCheck = max(10, nextBufferEnergies.count - 10)

	// Check if any of the energy values in the considered range exceed the silence threshold
	// This indicates the presence of voice in the buffer
	let voiceDetected = nextBufferEnergies.prefix(numberOfValuesToCheck).contains { $0 > Float(silenceThreshold) }

	// Only run the transcribe if the next buffer has voice
	guard voiceDetected else {
	await MainActor.run {
	if currentText == "" {
	currentText = "Waiting for speech..."
	}
	}

	// if nextBufferSeconds > 30 {
	// // This is a completely silent segment of 30s, so we can purge the audio and confirm anything pending
	// lastConfirmedSegmentEndSeconds = 0
	// whisperKit.audioProcessor.purgeAudioSamples(keepingLast: 2 * WhisperKit.sampleRate) // keep last 2s to include VAD overlap
	// currentBuffer = whisperKit.audioProcessor.audioSamples
	// lastBufferSize = 0
	// confirmedSegments.append(contentsOf: unconfirmedSegments)
	// unconfirmedSegments = []
	// }

	// Sleep for 100ms and check the next buffer
	try await Task.sleep(nanoseconds: 100_000_000)
	return
	}
	}

	// Run transcribe
	lastBufferSize = currentBuffer.count

	let transcription = try await transcribeAudioSamples(Array(currentBuffer))

	// We need to run this next part on the main thread
	await MainActor.run {
	currentText = ""
	unconfirmedText = []
	guard let segments = transcription?.segments else {
	return
	}

	self.tokensPerSecond = transcription?.timings?.tokensPerSecond ?? 0
	self.realTimeFactor = transcription?.timings?.realTimeFactor ?? 0
	self.firstTokenTime = transcription?.timings?.firstTokenTime ?? 0
	self.pipelineStart = transcription?.timings?.pipelineStart ?? 0
	self.currentLag = transcription?.timings?.decodingLoop ?? 0

	// Logic for moving segments to confirmedSegments
	if segments.count > requiredSegmentsForConfirmation {
	// Calculate the number of segments to confirm
	let numberOfSegmentsToConfirm = segments.count - requiredSegmentsForConfirmation

	// Confirm the required number of segments
	let confirmedSegmentsArray = Array(segments.prefix(numberOfSegmentsToConfirm))
	let remainingSegments = Array(segments.suffix(requiredSegmentsForConfirmation))

	// Update lastConfirmedSegmentEnd based on the last confirmed segment
	if let lastConfirmedSegment = confirmedSegmentsArray.last, lastConfirmedSegment.end > lastConfirmedSegmentEndSeconds {
	lastConfirmedSegmentEndSeconds = lastConfirmedSegment.end

	// Add confirmed segments to the confirmedSegments array
	if !self.confirmedSegments.contains(confirmedSegmentsArray) {
	self.confirmedSegments.append(contentsOf: confirmedSegmentsArray)
	}
	}

	// Update transcriptions to reflect the remaining segments
	self.unconfirmedSegments = remainingSegments
	} else {
	// Handle the case where segments are fewer or equal to required
	self.unconfirmedSegments = segments
	}
	}
	}
	}

	let newFrameLength = Int64((sampleRate / audioFile.fileFormat.sampleRate) * Double(audioFile.length))
	let outputFormat = AVAudioFormat(standardFormatWithSampleRate: sampleRate, channels: channelCount)!
	guard let converter = AVAudioConverter(from: audioFile.processingFormat, to: outputFormat) else {
	Logging.error("Failed to create audio converter")
	return nil
	}

	let frameCount = AVAudioFrameCount(audioFile.length)
	guard let inputBuffer = AVAudioPCMBuffer(pcmFormat: audioFile.processingFormat, frameCapacity: frameCount),
	let outputBuffer = AVAudioPCMBuffer(pcmFormat: outputFormat, frameCapacity: AVAudioFrameCount(newFrameLength))
	else {
	Logging.error("Unable to create buffers, likely due to unsupported file format")
	return nil
	}

	do {
	try audioFile.read(into: inputBuffer, frameCount: frameCount)
	} catch {
	Logging.error("Error reading audio file: \(error)")
	return nil
	}

argmaxinc / whisperkit Goto Github PK

whisperkit's People

Contributors

Stargazers

Watchers

Forkers

whisperkit's Issues

The operation couldn’t be completed. Launch failed. Domain: RBSRequestErrorDomain Code: 5 Failure Reason: Launch failed.

Launchd job spawn failed Domain: NSPOSIXErrorDomain Code: 162

References

References

References:

References

Reference

References

Related Issue

Recommend Projects

Recommend Topics

Recommend Org

The operation couldn’t be completed. Launch failed.
Domain: RBSRequestErrorDomain
Code: 5
Failure Reason: Launch failed.

Launchd job spawn failed
Domain: NSPOSIXErrorDomain
Code: 162