jetson-voice's Introduction

jetson-voice

jetson-voice is an ASR/NLP/TTS deep learning inference library for Jetson Nano, TX1/TX2, Xavier NX, and AGX Xavier. It supports Python and JetPack 4.4.1 or newer. The DNN models were trained with NeMo and deployed with TensorRT for optimized performance. All computation is performed using the onboard GPU.

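Currently the following capabilities are included:

  • Automatic Speech Recognition (ASR): streaming transcription (QuartzNet), command/keyword recognition (MatchboxNet), and voice activity detection (VAD MarbleNet)
  • Natural Language Processing (NLP): joint intent/slot classification, text classification (sentiment analysis), token classification (named entity recognition), and question/answering (QA)
  • Text-to-Speech (TTS): FastPitch with a HiFiGAN vocoder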

The NLP models use the DistilBERT transformer architecture for reduced memory usage and increased performance. For samples of the text-to-speech output, see the TTS Audio Samples section below.

Running the Container

jetson-voice is distributed as a Docker container due to the number of dependencies. Pre-built container images are available on DockerHub for JetPack 4.4.1 and newer:

dustynv/jetson-voice:r32.4.4    # JetPack 4.4.1 (L4T R32.4.4)
dustynv/jetson-voice:r32.5.0    # JetPack 4.5 (L4T R32.5.0) / JetPack 4.5.1 (L4T R32.5.1)
dustynv/jetson-voice:r32.6.1    # JetPack 4.6 (L4T R32.6.1)
dustynv/jetson-voice:r32.7.1    # JetPack 4.6.1 (L4T R32.7.1)
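
Note that JetPack 4.5.1 (L4T R32.5.1) shares the r32.5.0 image, so there is no r32.5.1 tag to pull. As a minimal sketch of that mapping (the tags are copied from the list above; the function name `container_for_l4t` is hypothetical and is not part of jetson-voice):

```python
# Tags copied from the list above; the lookup helper itself is illustrative only.
CONTAINER_TAGS = {
    "R32.4.4": "dustynv/jetson-voice:r32.4.4",  # JetPack 4.4.1
    "R32.5.0": "dustynv/jetson-voice:r32.5.0",  # JetPack 4.5
    "R32.5.1": "dustynv/jetson-voice:r32.5.0",  # JetPack 4.5.1 shares the r32.5.0 image
    "R32.6.1": "dustynv/jetson-voice:r32.6.1",  # JetPack 4.6
    "R32.7.1": "dustynv/jetson-voice:r32.7.1",  # JetPack 4.6.1
}


def container_for_l4t(l4t_version: str) -> str:
    """Return the container tag listed for an L4T version such as 'R32.5.1'."""
    key = l4t_version.strip().upper()
    if not key.startswith("R"):
        key = "R" + key  # accept '32.5.1' as well as 'R32.5.1'
    try:
        return CONTAINER_TAGS[key]
    except KeyError:
        raise ValueError(f"no pre-built container listed for L4T {l4t_version}") from None


print(container_for_l4t("R32.5.1"))
```

Versions that are not in the list raise a `ValueError` rather than guessing a tag.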

To download and run the container, you can simply clone this repo and use the docker/run.sh script:

$ git clone --branch dev https://github.com/dusty-nv/jetson-voice
$ cd jetson-voice
$ docker/run.sh

Note: if you want to use a USB microphone or speaker, plug it in before you start the container.

There are some optional arguments to docker/run.sh that you can use:

  • -r (--run) specifies a run command; otherwise the container starts in an interactive shell
  • -v (--volume) mounts a directory from the host into the container (/host/path:/container/path)
  • --dev starts the container in development mode, where all the source files are mounted for easy editing

The run script will automatically mount the data/ directory into the container, which stores the models and other data files. If you save files from the container there, they will also show up under data/ on the host.
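
If you launch the container from a Python script, the options above can be assembled into an argument list. This is only a sketch: `build_run_command` is a hypothetical helper, and it assumes that `--run` takes the command as the argument that follows it, which is why it is placed last.

```python
from typing import List, Optional, Sequence


def build_run_command(run: Optional[str] = None,
                      volumes: Sequence[str] = (),
                      dev: bool = False) -> List[str]:
    """Build an argument list for docker/run.sh from the options described above."""
    cmd = ["docker/run.sh"]
    if dev:
        cmd.append("--dev")
    for volume in volumes:
        # each volume is a '/host/path:/container/path' string
        cmd += ["--volume", volume]
    if run is not None:
        # assumption: --run is given last so the command cannot swallow other options
        cmd += ["--run", run]
    return cmd


print(build_run_command(run="examples/asr.py --wav data/audio/dusty.wav",
                        volumes=["/home/user/audio:/audio"]))
```

The resulting list can be passed to `subprocess.run` from the root of the cloned repo. The host path shown is a placeholder.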

Automatic Speech Recognition (ASR)

The speech recognition in jetson-voice is a streaming service, so it's intended to be used on live sources and transcribes the audio in 1-second chunks. It uses a QuartzNet-15x5 model followed by a CTC beam search decoder and language model to further refine the raw output of the network. It detects breaks in the audio to determine the end of sentences. For information about using the ASR APIs, please refer to jetson_voice/asr.py and see examples/asr.py.
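
To illustrate what processing audio in 1-second chunks means, here is a minimal sketch that splits a buffer of samples into chunks of one second each. It is not the jetson-voice implementation: the function name, the 16000 Hz sample rate in the usage line, and the zero-padding of the last chunk are all assumptions made for the example.

```python
from typing import Iterator, List


def one_second_chunks(samples: List[float], sample_rate: int) -> Iterator[List[float]]:
    """Yield consecutive chunks holding exactly one second of audio each."""
    for start in range(0, len(samples), sample_rate):
        chunk = samples[start:start + sample_rate]
        if len(chunk) < sample_rate:
            # assumption: pad the final partial chunk with silence
            chunk = chunk + [0.0] * (sample_rate - len(chunk))
        yield chunk


# 2.5 seconds of dummy audio at a hypothetical 16 kHz sample rate
chunks = list(one_second_chunks([1.0] * 40000, 16000))
print(len(chunks), [len(c) for c in chunks])
```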

After you start the container, first run a test audio file (wav/ogg/flac) through examples/asr.py to verify that the system is functional. Run this command (and all subsequent commands) inside the container:

$ examples/asr.py --wav data/audio/dusty.wav

hi
hi hi this is dust
hi hi this is dusty check
hi hi this is dusty check one two
hi hi this is dusty check one two three
hi hi this is dusty check one two three.

what's the weather or
what's the weather going to be tomorrow
what's the weather going to be tomorrow in pittsburgh
what's the weather going to be tomorrow in pittsburgh.

today is
today is wednesday
today is wednesday tomorrow is thursday
today is wednesday tomorrow is thursday.

i would like
i would like to order a large
i would like to order a large pepperoni pizza
i would like to order a large pepperoni pizza.

is it going to be
is it going to be cloudy tomorrow.

The first time you run each model, TensorRT will take a few minutes to optimize it.
This optimized model is then cached to disk, so the next time you run the model it will load faster.
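
In the sample output above, each sentence grows as partial transcripts arrive, and the finished sentence ends with a period. Assuming that convention holds (it is inferred from the sample output, not from the API), a small hypothetical helper can keep only the finished sentences from captured output:

```python
from typing import Iterable, List


def final_transcripts(lines: Iterable[str]) -> List[str]:
    """Keep only lines that end with a period, i.e. the finished sentences."""
    return [line.strip() for line in lines if line.strip().endswith(".")]


captured = [
    "today is",
    "today is wednesday",
    "today is wednesday tomorrow is thursday",
    "today is wednesday tomorrow is thursday.",
    "",
    "is it going to be",
    "is it going to be cloudy tomorrow.",
]
print(final_transcripts(captured))
```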

Live Microphone

To test the ASR on a mic, first list the audio devices in your system to get the audio device IDs:

$ scripts/list_audio_devices.sh

----------------------------------------------------
 Audio Input Devices
----------------------------------------------------
Input Device ID 1 - 'tegra-snd-t210ref-mobile-rt565x: - (hw:1,0)' (inputs=16) (sample_rate=44100)
Input Device ID 2 - 'tegra-snd-t210ref-mobile-rt565x: - (hw:1,1)' (inputs=16) (sample_rate=44100)
Input Device ID 3 - 'tegra-snd-t210ref-mobile-rt565x: - (hw:1,2)' (inputs=16) (sample_rate=44100)
Input Device ID 4 - 'tegra-snd-t210ref-mobile-rt565x: - (hw:1,3)' (inputs=16) (sample_rate=44100)
Input Device ID 5 - 'tegra-snd-t210ref-mobile-rt565x: - (hw:1,4)' (inputs=16) (sample_rate=44100)
Input Device ID 6 - 'tegra-snd-t210ref-mobile-rt565x: - (hw:1,5)' (inputs=16) (sample_rate=44100)
Input Device ID 7 - 'tegra-snd-t210ref-mobile-rt565x: - (hw:1,6)' (inputs=16) (sample_rate=44100)
Input Device ID 8 - 'tegra-snd-t210ref-mobile-rt565x: - (hw:1,7)' (inputs=16) (sample_rate=44100)
Input Device ID 9 - 'tegra-snd-t210ref-mobile-rt565x: - (hw:1,8)' (inputs=16) (sample_rate=44100)
Input Device ID 10 - 'tegra-snd-t210ref-mobile-rt565x: - (hw:1,9)' (inputs=16) (sample_rate=44100)
Input Device ID 11 - 'Logitech H570e Mono: USB Audio (hw:2,0)' (inputs=2) (sample_rate=44100)
Input Device ID 12 - 'Samson Meteor Mic: USB Audio (hw:3,0)' (inputs=2) (sample_rate=44100)

If you don't see your audio device listed, exit and restart the container.
USB devices should be attached before the container is started.
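
If you want to pick the device ID programmatically, the listing can be parsed with a regular expression. The sketch below assumes the line format shown in the sample output above; `parse_input_devices` and `find_device_id` are hypothetical helpers, not part of jetson-voice.

```python
import re
from typing import Dict, List, Optional

# Matches lines such as:
# Input Device ID 11 - 'Logitech H570e Mono: USB Audio (hw:2,0)' (inputs=2) (sample_rate=44100)
DEVICE_LINE = re.compile(
    r"^Input Device ID (\d+) - '(.*)' \(inputs=(\d+)\) \(sample_rate=(\d+)\)$")


def parse_input_devices(listing: str) -> List[Dict]:
    """Extract id, name, input count and sample rate from the device listing."""
    devices = []
    for line in listing.splitlines():
        match = DEVICE_LINE.match(line.strip())
        if match:
            devices.append({
                "id": int(match.group(1)),
                "name": match.group(2),
                "inputs": int(match.group(3)),
                "sample_rate": int(match.group(4)),
            })
    return devices


def find_device_id(devices: List[Dict], name_fragment: str) -> Optional[int]:
    """Return the ID of the first device whose name contains the fragment."""
    for device in devices:
        if name_fragment.lower() in device["name"].lower():
            return device["id"]
    return None


listing = """
Input Device ID 11 - 'Logitech H570e Mono: USB Audio (hw:2,0)' (inputs=2) (sample_rate=44100)
Input Device ID 12 - 'Samson Meteor Mic: USB Audio (hw:3,0)' (inputs=2) (sample_rate=44100)
"""
print(find_device_id(parse_input_devices(listing), "samson"))
```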

Then run the ASR example with the --mic <DEVICE> option, and specify either the device ID or name:

$ examples/asr.py --mic 11

hey
hey how are you guys
hey how are you guys.

# (Press Ctrl+C to exit)

ASR Classification

There are other ASR models included for command/keyword recognition (MatchboxNet) and voice activity detection (VAD MarbleNet). These models are smaller and faster, and classify chunks of audio as opposed to transcribing text.

Command/Keyword Recognition

The MatchboxNet model was trained on 12 keywords from the Google Speech Commands dataset:

# MatchboxNet classes
"yes",
"no",
"up",
"down",
"left",
"right",
"on",
"off",
"stop",
"go",
"unknown",
"silence"

You can run it through the same ASR example as above by specifying the --model matchboxnet argument:

$ examples/asr.py --model matchboxnet --wav data/audio/commands.wav

class 'unknown' (0.384)
class 'yes' (1.000)
class 'no' (1.000)
class 'up' (1.000)
class 'down' (1.000)
class 'left' (1.000)
class 'left' (1.000)
class 'right' (1.000)
class 'on' (1.000)
class 'off' (1.000)
class 'stop' (1.000)
class 'go' (1.000)
class 'go' (1.000)
class 'silence' (0.639)
class 'silence' (0.576)

The numbers printed on the right are the classification probabilities between 0 and 1.
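
Classifiers of this kind typically turn the raw network outputs (logits) into probabilities with a softmax. The sketch below shows that step under two assumptions that are not confirmed by the source: that softmax is what produces the printed probabilities, and that the model's output order matches the class list above.

```python
import math
from typing import List, Sequence, Tuple

# Class names copied from the list above; the ordering is assumed.
CLASSES = ["yes", "no", "up", "down", "left", "right",
           "on", "off", "stop", "go", "unknown", "silence"]


def softmax(logits: Sequence[float]) -> List[float]:
    """Convert logits to probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(logits: Sequence[float],
             classes: Sequence[str] = CLASSES) -> Tuple[str, float]:
    """Return the most likely class name and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return classes[best], probs[best]


example_logits = [0.0] * 12
example_logits[1] = 10.0  # made-up values that strongly favour 'no'
label, prob = classify(example_logits)
print(f"class '{label}' ({prob:.3f})")
```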

Voice Activity Detection (VAD)

The voice activity model (VAD MarbleNet) is a binary model that outputs background or speech:

$ examples/asr.py --model vad_marblenet --wav data/audio/commands.wav

class 'background' (0.969)
class 'background' (0.984)
class 'background' (0.987)
class 'speech' (0.997)
class 'speech' (1.000)
class 'speech' (1.000)
class 'speech' (0.998)
class 'background' (0.987)
class 'speech' (1.000)
class 'speech' (1.000)
class 'speech' (1.000)
class 'background' (0.988)
class 'background' (0.784)
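
A common next step is to merge the per-chunk labels into contiguous speech segments. The following is a minimal sketch of that post-processing; `speech_segments` is a hypothetical helper and the segments are expressed as chunk indices, since the chunk duration is not stated here.

```python
from typing import List, Sequence, Tuple


def speech_segments(labels: Sequence[str]) -> List[Tuple[int, int]]:
    """Return (start, end) chunk index pairs for each run of 'speech', end exclusive."""
    segments = []
    start = None
    for index, label in enumerate(labels):
        if label == "speech" and start is None:
            start = index                      # a speech run begins
        elif label != "speech" and start is not None:
            segments.append((start, index))    # the run ended on the previous chunk
            start = None
    if start is not None:
        segments.append((start, len(labels)))  # speech continued to the last chunk
    return segments


# Labels in the order printed in the sample output above
labels = (["background"] * 3 + ["speech"] * 4 + ["background"]
          + ["speech"] * 3 + ["background"] * 2)
print(speech_segments(labels))
```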

Natural Language Processing (NLP)

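There are two samples included for NLP:

  • examples/nlp.py (intent/slot, text classification, and token classification)
  • examples/nlp_qa.py (question/answering)

These each use a DistilBERT model which has been fine-tuned for its particular task. For information about using the NLP APIs, please refer to jetson_voice/nlp.py and see the samples above.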

Joint Intent/Slot Classification

Joint Intent and Slot classification is a task of classifying an Intent and detecting all relevant Slots (Entities) for this Intent in a query. For example, in the query: What is the weather in Santa Clara tomorrow morning?, we would like to classify the query as a weather Intent, and detect Santa Clara as a location slot and tomorrow morning as a date_time slot.

Intents and Slots names are usually task specific and defined as labels in the training data. The included intent/slot model was trained on the NLU-Evaluation-Data dataset - you can find the various intent and slot classes that it supports here. They are common things that you might ask a virtual assistant:

$ examples/nlp.py --model distilbert_intent

Enter intent_slot query, or Q to quit:

> What is the weather in Santa Clara tomorrow morning?

{'intent': 'weather_query',
 'score': 0.7165476,
 'slots': [{'score': 0.6280392, 'slot': 'place_name', 'text': 'Santa'},
           {'score': 0.61760694, 'slot': 'place_name', 'text': 'Clara'},
           {'score': 0.5439486, 'slot': 'date', 'text': 'tomorrow'},
           {'score': 0.4520608, 'slot': 'date', 'text': 'morning'}]}

> Set an alarm for 730am

{'intent': 'alarm_set',
 'score': 0.5713072,
 'slots': [{'score': 0.40017933, 'slot': 'time', 'text': '730am'}]}

> Turn up the volume

{'intent': 'audio_volume_up', 'score': 0.33523008, 'slots': []}

> What is my schedule for tomorrow?

{'intent': 'calendar_query',
 'score': 0.37434494,
 'slots': [{'score': 0.5732627, 'slot': 'date', 'text': 'tomorrow'}]}

> Order a pepperoni pizza from domino's

{'intent': 'takeaway_order',
 'score': 0.50629586,
 'slots': [{'score': 0.27558547, 'slot': 'food_type', 'text': 'pepperoni'},
           {'score': 0.2778827, 'slot': 'food_type', 'text': 'pizza'},
           {'score': 0.21785143, 'slot': 'business_name', 'text': 'dominos'}]}
	
> Where's the closest Starbucks?

{'intent': 'recommendation_locations',
 'score': 0.5438984,
 'slots': [{'score': 0.1604197, 'slot': 'place_name', 'text': 'Starbucks'}]}
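
As the results above show, slots are reported per word, so "Santa Clara" comes back as two place_name entries. A simple heuristic is to merge consecutive entries that share a slot name. This is a sketch, not part of the library: `merge_slots` is hypothetical, it assumes the result format printed above, and it would also merge two separate entities of the same type if they happen to be reported one after the other.

```python
from typing import Dict, List


def merge_slots(slots: List[Dict]) -> List[Dict]:
    """Merge consecutive slot entries that have the same slot name."""
    merged: List[Dict] = []
    for entry in slots:
        if merged and merged[-1]["slot"] == entry["slot"]:
            merged[-1]["text"] += " " + entry["text"]
            # keep the lowest word score as a conservative score for the phrase
            merged[-1]["score"] = min(merged[-1]["score"], entry["score"])
        else:
            merged.append(dict(entry))  # copy so the input is left untouched
    return merged


slots = [{'score': 0.6280392, 'slot': 'place_name', 'text': 'Santa'},
         {'score': 0.61760694, 'slot': 'place_name', 'text': 'Clara'},
         {'score': 0.5439486, 'slot': 'date', 'text': 'tomorrow'},
         {'score': 0.4520608, 'slot': 'date', 'text': 'morning'}]
print(merge_slots(slots))
```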

Text Classification

In this text classification example, we'll use the included sentiment analysis model that was trained on the Stanford Sentiment Treebank (SST-2) dataset. It will label queries as either positive or negative, along with their probability:

$ examples/nlp.py --model distilbert_sentiment

Enter text_classification query, or Q to quit:

> today was warm, sunny and beautiful out

{'class': 1, 'label': '1', 'score': 0.9985898}

> today was cold and rainy and not very nice

{'class': 0, 'label': '0', 'score': 0.99136007}

(class 0 is negative sentiment and class 1 is positive sentiment)
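
Since the labels are returned as '0' and '1', you may want to map them to names before displaying them. A minimal sketch, assuming the result dictionary has the format printed above (`describe_sentiment` is a hypothetical helper):

```python
from typing import Dict

# Mapping taken from the note above: class 0 is negative, class 1 is positive.
SENTIMENT_NAMES = {0: "negative", 1: "positive"}


def describe_sentiment(result: Dict) -> str:
    """Format a text classification result as 'name (score)'."""
    name = SENTIMENT_NAMES[result["class"]]
    return f"{name} ({result['score']:.3f})"


print(describe_sentiment({'class': 1, 'label': '1', 'score': 0.9985898}))
```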

Token Classification

Whereas text classification classifies entire queries, token classification classifies individual tokens (or words). In this example, we'll be performing Named Entity Recognition (NER), which is the task of detecting and classifying key information (entities) in text. For example, in a sentence: Mary lives in Santa Clara and works at NVIDIA, we should detect that Mary is a person, Santa Clara is a location and NVIDIA is a company.

The included token classification model for NER was trained on the Groningen Meaning Bank (GMB) and supports the following annotations in IOB format (short for inside, outside, beginning):

  • LOC = Geographical Entity
  • ORG = Organization
  • PER = Person
  • GPE = Geopolitical Entity
  • TIME = Time indicator
  • MISC = Artifact, Event, or Natural Phenomenon

$ examples/nlp.py --model distilbert_ner

Enter token_classification query, or Q to quit:
> Mary lives in Santa Clara and works at NVIDIA

Mary[B-PER 0.989] lives in Santa[B-LOC 0.998] Clara[I-LOC 0.996] and works at NVIDIA[B-ORG 0.967]

> Lisa's favorite place to climb in the summer is El Capitan in Yosemite National Park in California, U.S.

Lisa's[B-PER 0.995] favorite place to climb in the summer[B-TIME 0.996] is El[B-PER 0.577] Capitan[I-PER 0.483] 
in Yosemite[B-LOC 0.987] National[I-LOC 0.988] Park[I-LOC 0.98] in California[B-LOC 0.998], U.S[B-LOC 0.997].
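
In IOB format, a B- tag begins an entity and the I- tags that follow continue it, so multi-word entities such as "Santa Clara" can be reassembled from the per-token tags. The sketch below shows that grouping on (token, tag) pairs. `group_entities` is a hypothetical helper, and untagged tokens are represented here with the tag "O", which is an assumption about how you would feed in the data.

```python
from typing import List, Sequence, Tuple


def group_entities(tagged_tokens: Sequence[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Group IOB-tagged tokens into (entity text, label) pairs."""
    entities = []
    current = None  # [words, label] of the entity currently being built
    for token, tag in tagged_tokens:
        if tag.startswith("B-"):
            current = [[token], tag[2:]]
            entities.append(current)
        elif tag.startswith("I-") and current is not None and current[1] == tag[2:]:
            current[0].append(token)  # continue the open entity
        else:
            current = None  # 'O' tag, or an I- tag with no matching entity to continue
    return [(" ".join(words), label) for words, label in entities]


tagged = [("Mary", "B-PER"), ("lives", "O"), ("in", "O"),
          ("Santa", "B-LOC"), ("Clara", "I-LOC"), ("and", "O"),
          ("works", "O"), ("at", "O"), ("NVIDIA", "B-ORG")]
print(group_entities(tagged))
```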

Question/Answering

Question/Answering (QA) works by supplying a context paragraph from which the model extracts the best answer. The nlp_qa.py example allows you to select from several built-in context paragraphs (or supply your own) and to ask questions about these topics.

The QA model is flexible and doesn't need to be retrained on different topics, as it was trained on the SQuAD question/answering dataset, which allows it to extract answers from a variety of contexts. It essentially learns to identify the information most relevant to your query from the context passage, as opposed to learning the content itself.
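
Extractive QA models trained on SQuAD usually score every context token as a possible answer start and a possible answer end, and the answer is the span with the best combined score. The sketch below shows that selection step in plain Python. It illustrates the general technique, not the code in jetson-voice, and the scores and the `max_answer_len` limit are made up for the example.

```python
from typing import Sequence, Tuple


def best_span(start_scores: Sequence[float],
              end_scores: Sequence[float],
              max_answer_len: int = 30) -> Tuple[int, int]:
    """Return (start, end) token indices of the highest scoring span, end inclusive."""
    best = (float("-inf"), 0, 0)
    for start, start_score in enumerate(start_scores):
        last = min(len(end_scores), start + max_answer_len)
        for end in range(start, last):  # an answer cannot end before it starts
            score = start_score + end_scores[end]
            if score > best[0]:
                best = (score, start, end)
    return best[1], best[2]


tokens = ["with", "Brazil", "and", "Peru"]
start_scores = [0.1, 2.0, 0.3, 0.0]
end_scores = [3.0, 0.1, 1.5, 0.2]
start, end = best_span(start_scores, end_scores)
print(" ".join(tokens[start:end + 1]))
```

Note that the highest end score belongs to the first token, but pairing it with a later start would give an invalid span, which is why the search only considers ends at or after each start.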

$ examples/nlp_qa.py 

Context:
The Amazon rainforest is a moist broadleaf forest that covers most of the Amazon basin of South America. 
This basin encompasses 7,000,000 square kilometres (2,700,000 sq mi), of which 5,500,000 square kilometres 
(2,100,000 sq mi) are covered by the rainforest. The majority of the forest is contained within Brazil, 
with 60% of the rainforest, followed by Peru with 13%, and Colombia with 10%.

Enter a question, C to change context, P to print context, or Q to quit:

> How big is the Amazon?

Answer: 7,000,000 square kilometres
Score:  0.24993503093719482

> which country has the most?

Answer: Brazil
Score:  0.5964332222938538

To change the topic or create one of your own, enter C:

Enter a question, C to change context, P to print context, or Q to quit:
> C

Select from one of the following topics, or enter your own context paragraph:
   1. Amazon
   2. Geology
   3. Moon Landing
   4. Pi
   5. Super Bowl 55
> 3

Context:
The first manned Moon landing was Apollo 11 on July, 20 1969. The first human to step on the Moon was 
astronaut Neil Armstrong followed second by Buzz Aldrin. They landed in the Sea of Tranquility with their 
lunar module the Eagle. They were on the lunar surface for 2.25 hours and collected 50 pounds of moon rocks.

Enter a question, C to change context, P to print context, or Q to quit:

> Who was the first man on the moon?

Answer: Neil Armstrong
Score:  0.39105066657066345

Text-to-Speech (TTS)

The text-to-speech service uses an ensemble of two models: FastPitch to generate MEL spectrograms from text, and HiFiGAN as the vocoder (female English voice). For information about using the TTS APIs, please refer to jetson_voice/tts.py and see examples/tts.py.

The examples/tts.py app can output the audio to a speaker, wav file, or sequence of wav files. Run it with --list-devices to get a list of your audio devices.

$ examples/tts.py --output-device 11 --output-wav data/audio/tts_test

> The weather tomorrow is forecast to be warm and sunny with a high of 83 degrees.

Run 0 -- Time to first audio: 1.820s. Generated 5.36s of audio. RTFx=2.95.
Run 1 -- Time to first audio: 0.232s. Generated 5.36s of audio. RTFx=23.15.
Run 2 -- Time to first audio: 0.230s. Generated 5.36s of audio. RTFx=23.31.
Run 3 -- Time to first audio: 0.231s. Generated 5.36s of audio. RTFx=23.25.
Run 4 -- Time to first audio: 0.230s. Generated 5.36s of audio. RTFx=23.36.
Run 5 -- Time to first audio: 0.230s. Generated 5.36s of audio. RTFx=23.35.

Wrote audio to data/audio/tts_test/0.wav

Enter text, or Q to quit:
> Sally sells seashells by the seashore.

Run 0 -- Time to first audio: 0.316s. Generated 2.73s of audio. RTFx=8.63.
Run 1 -- Time to first audio: 0.126s. Generated 2.73s of audio. RTFx=21.61.
Run 2 -- Time to first audio: 0.127s. Generated 2.73s of audio. RTFx=21.51.
Run 3 -- Time to first audio: 0.126s. Generated 2.73s of audio. RTFx=21.68.
Run 4 -- Time to first audio: 0.126s. Generated 2.73s of audio. RTFx=21.68.
Run 5 -- Time to first audio: 0.126s. Generated 2.73s of audio. RTFx=21.61.

Wrote audio to data/audio/tts_test/1.wav
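
The RTFx values in the log are close to the generated audio duration divided by the time to first audio (for example, 5.36 / 1.820 ≈ 2.95), so values above 1 mean faster than real time. The sketch below parses a log line and recomputes that ratio. It assumes the line format shown above; the exact timing the script uses is not shown here, so expect small rounding differences. The function names are hypothetical.

```python
import re
from typing import Dict, Optional

RUN_LINE = re.compile(
    r"Run (\d+) -- Time to first audio: ([\d.]+)s\. "
    r"Generated ([\d.]+)s of audio\. RTFx=([\d.]+)\.")


def parse_run(line: str) -> Optional[Dict]:
    """Extract the timing fields from one 'Run N -- ...' log line."""
    match = RUN_LINE.search(line)
    if match is None:
        return None
    return {
        "run": int(match.group(1)),
        "first_audio_s": float(match.group(2)),
        "audio_s": float(match.group(3)),
        "rtfx": float(match.group(4)),
    }


def real_time_factor(audio_seconds: float, elapsed_seconds: float) -> float:
    """Seconds of audio produced per second of processing time."""
    return audio_seconds / elapsed_seconds


run = parse_run("Run 0 -- Time to first audio: 1.820s. Generated 5.36s of audio. RTFx=2.95.")
print(run, round(real_time_factor(run["audio_s"], run["first_audio_s"]), 2))
```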

TTS Audio Samples

Tests

There is an automated test suite included that will verify all of the models are working properly. You can run it with the tests/run_tests.py script:

$ tests/run_tests.py

----------------------------------------------------
 TEST SUMMARY
----------------------------------------------------
test_asr.py (quartznet)                  PASSED
test_asr.py (quartznet_greedy)           PASSED
test_asr.py (matchboxnet)                PASSED
test_asr.py (vad_marblenet)              PASSED
test_nlp.py (distilbert_qa_128)          PASSED
test_nlp.py (distilbert_qa_384)          PASSED
test_nlp.py (distilbert_intent)          PASSED
test_nlp.py (distilbert_sentiment)       PASSED
test_nlp.py (distilbert_ner)             PASSED
test_tts.py (fastpitch_hifigan)          PASSED

passed 10 of 10 tests
saved logs to data/tests/logs/20210610_1512

The logs of the individual tests are printed to the screen and saved to a timestamped directory.
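
If you capture the summary in a script, for example to find out which models failed, it can be parsed line by line. This is a sketch that assumes the summary format shown above, with each result line ending in PASSED or FAILED. `parse_summary` and `failed_models` are hypothetical helpers.

```python
import re
from typing import List, Tuple

# Matches lines such as: test_asr.py (quartznet)                  PASSED
RESULT_LINE = re.compile(r"^(\S+)\s+\((\w+)\)\s+(PASSED|FAILED)$")


def parse_summary(text: str) -> List[Tuple[str, str, str]]:
    """Return (test script, model, status) for each result line in the summary."""
    results = []
    for line in text.splitlines():
        match = RESULT_LINE.match(line.strip())
        if match:
            results.append((match.group(1), match.group(2), match.group(3)))
    return results


def failed_models(results: List[Tuple[str, str, str]]) -> List[str]:
    """List the models whose test did not pass."""
    return [model for _, model, status in results if status == "FAILED"]


summary = """
test_asr.py (quartznet)                  PASSED
test_asr.py (matchboxnet)                FAILED
test_tts.py (fastpitch_hifigan)          PASSED
"""
print(failed_models(parse_summary(summary)))
```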

jetson-voice's People

Contributors

dusty-nv

jetson-voice's Issues

rtsp sound classification?

Hi,

I am working on a voice classification. Just laugh and cry.
May I ask if there is a custom training sound classification? (Just like your object detection training)
If not, is there any python audio tutorial?
Also, I need to get the sound from IP cam RTSP, I assume I shall use pyaudio? Not really sure about the details....
Thx

Btw, I have asked similar question in the forum.
https://forums.developer.nvidia.com/t/is-there-any-step-by-step-python-example-of-audio-classification-using-nano/213881/6

Killed

Hi Dusty I was interested to see how your model compared to the official intent and slots example as it was not classifying the slots in the examples after 100 epochs (have started training with 250 epochs now) and performed poorly on additional inputs.

I experienced the following issue on Jetson Nano 4gb, Jetpack 4.5.1-b17

[TensorRT] VERBOSE: After vertical fusions: 250 layers
[TensorRT] VERBOSE: After final dead-layer removal: 250 layers
[TensorRT] VERBOSE: After tensor merging: 250 layers
[TensorRT] VERBOSE: After concat removal: 250 layers
[TensorRT] VERBOSE: Graph construction and optimization completed in 0.934568 seconds.
[TensorRT] VERBOSE: Constructing optimization profile number 0 [1/1].
[TensorRT] VERBOSE: *************** Autotuning format combination:  -> Float(1,768) ***************
[TensorRT] VERBOSE: *************** Autotuning format combination:  -> Half(1,768) ***************
Killed

Failed testing Result for matchboxnet and vad_marblenet

Hi Dusty
Thanks for the great repo and tutorial introducing ASR and NLP on Jetson devices. I followed the instructions and got errors for 2 of the 10 models. Here are my test results:


TEST SUMMARY

test_asr.py (quartznet) PASSED
test_asr.py (quartznet_greedy) PASSED
test_asr.py (matchboxnet) FAILED
test_asr.py (vad_marblenet) FAILED
test_nlp.py (distilbert_qa_128) PASSED
test_nlp.py (distilbert_qa_384) PASSED
test_nlp.py (distilbert_intent) PASSED
test_nlp.py (distilbert_sentiment) PASSED
test_nlp.py (distilbert_ner) PASSED
test_tts.py (fastpitch_hifigan) PASSED

Matchboxnet testing log

RUNNING TEST (ASR)

model: matchboxnet
config: data/tests/asr_keyword.json

binding 0 - 'audio_signal'
input: True
shape: (1, 64, -1)
dtype: DataType.FLOAT
size: -256
dynamic: True
profiles: [{'min': (1, 64, 10), 'opt': (1, 64, 150), 'max': (1, 64, 300)}]

binding 1 - 'logits'
input: False
shape: (1, 12)
dtype: DataType.FLOAT
size: 48
dynamic: False
profiles: []

Vad_marblenet testing log

RUNNING TEST (ASR)

model: vad_marblenet
config: data/tests/asr_vad.json

binding 0 - 'audio_signal'
input: True
shape: (1, 64, -1)
dtype: DataType.FLOAT
size: -256
dynamic: True
profiles: [{'min': (1, 64, 10), 'opt': (1, 64, 150), 'max': (1, 64, 300)}]

binding 1 - 'logits'
input: False
shape: (1, 2)
dtype: DataType.FLOAT
size: 8
dynamic: False
profiles: []

When running command "examples/asr.py --model matchboxnet --wav data/audio/commands.wav", I got an error as follows:
RuntimeError: shape '[1, 154, 2]' is invalid for input of size 79156

When running command "examples/asr.py --model vad_marblenet --wav data/audio/commands.wav", I got a similar error like this:
RuntimeError: shape '[1, 34, 2]' is invalid for input of size 17476

Have you ever encountered this issue before?

jetson-voice on L4T 35.4.1

Is it possible to run this container on 35.4.1.
If possible what changes do I need to make ?

Thank you
Sandeep

Bad asr prediction on audio with a bit of noise

Hi,
first of all, thank you for providing this repo! I was able to set up speech recognition on my Jetson Nano 2GB relatively easily with it.
However, the quality of the prediction with the microphone I'm using is quite poor:

First I checked the provided dusty.wav file with the asr.py example. The predicted full sentences are, just as in the readme, pretty good:

hi hi this is dusty check on two two three.
what's the weather going to be tomorrow in pittsburg.
today is wednesday tomorrow is thursday.
i would like to order a large pepperoni pizza.

Then I tried to play this audio on a speaker and record it with the microphone that I intend to use for detection. It produced this audio file. If you play it, you can hear some noise, but you can still hear the voice very clearly (apart from the first 5 seconds). Still, the prediction on it is pretty bad:

they're going to be.
dawned.
thursday.
larger.
i going tomorrow.
this.
chat.
so.
three.
what weather.
tomorrow pittsburgh.
today is wednesday.
rotary.
ron.
is going tomorrow.
this is dusty.
ca no.
the.
what the weather tomorrow in pittsburgh.
today is wednesday tomorrow's thursday.

When I talk myself, the prediction is similarly bad.

Do you have an idea what might be the cause of it? Maybe there is a relatively simple fix to the preprocessing pipeline or some configuration that I can try?
I noticed that my recording has a very tiny echo. Maybe it's worth a shot to augment the training data in a similar way and retrain it? If you think that might help, can you outline how I would be able to do that?
Or is there maybe a better version of the QuartzNet model out there? You mentioned RIVA in another issue. Sadly I cannot use that because I need to make it work on the Jetson Nano 2GB, and QuartzNet already uses 95% of the memory I have. So it would be nice to make it work.

Trying to get tts to load text from a file and size limitations

I'm interested in creating a wav like you did from the input, but it seems to be quite limited in the amount of text it can load. I'm also looking for a way to load the text from a file. I tried a larger text but got:

[TensorRT] ERROR: 3: [executionContext.cpp::setBindingDimensions::969] Error Code 3: Internal Error (Parameter check failed at: runtime/api/executionContext.cpp::setBindingDimensions::969, condition: profileMaxDims.d[i] >= dimensions.d[i]. Supplied binding dimension [1,80,2378] for bindings[0] exceed min ~ max range at index 2, maximum dimension in profile is 1024, minimum dimension in profile is 1, but supplied dimension is 2378.
)
Traceback (most recent call last):
File "tts2.py", line 90, in
audio = tts(args.text)
File "/jetson-voice/jetson_voice/models/tts/tts_engine.py", line 81, in call
audio = self.vocoder.execute(mels)
File "/jetson-voice/jetson_voice/backends/tensorrt/trt_model.py", line 114, in execute
setup_binding(self.bindings[idx], input)
File "/jetson-voice/jetson_voice/backends/tensorrt/trt_model.py", line 109, in setup_binding
binding.set_shape(input.shape)
File "/jetson-voice/jetson_voice/backends/tensorrt/trt_binding.py", line 80, in set_shape
raise ValueError(f"failed to set binding '{self.name}' with shape {shape}")
ValueError: failed to set binding 'mels' with shape (1, 80, 2378)
[TensorRT] INFO: [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 2308, GPU 3704 (MiB)

tts2.py is just copying the tts.py and adding a longer string and naming it something else.

How does one increase the size so it's not that limited?

Thanks for the help in advance.

Support for other languages

Is there a way you can add support for another language in the ASR, for instance Spanish, or give instructions on how to adapt it?

Running container Error

Nvidia Jetson Xavier NX | JetPack 4.5 (L4T 32.5.0)

Initially did run the container without any issue. Tested and everything was working fine. However, next day can't run it. Following error:
xtend_m2@m1b2-ai:~/jetson-voice$ docker/run.sh
ARCH: aarch64
reading L4T version from /etc/nv_tegra_release
L4T BSP Version: L4T R32.5.0
[sudo] password for xtend_m2:
CONTAINER: dustynv/jetson-voice:r32.5.0
DEV_VOLUME:
DATA_VOLUME: --volume /home/xtend_m2/jetson-voice/data:/jetson-voice/data
USER_VOLUME:
USER_COMMAND:
Unable to find image 'dustynv/jetson-voice:r32.5.0' locally
docker: Error response from daemon: Get https://registry-1.docker.io/v2/: dial tcp: lookup registry-1.docker.io on 127.0.0.53:53: read udp 127.0.0.1:37823->127.0.0.53:53: i/o timeout.
See 'docker run --help'.

Also, tried on another Xavier NX with clean installation. Same error.

I'd really appreciate any assistance.

Extra configuration files from Nemo model

I'm doing some ASR tests and I want to use a different model than the ones offered here. I used the nemo_export_onnx script which produces an onnx and a single json file. Any of the other jsons and binaries from the models you offer are not generated. Is this the expected behavior?

How can I check that the transformation is done correctly?

TTS model doesn't fit into Jetson Nano 2GB

I noticed that the provided fastpitch_hifigan model doesn't work with 2GB of RAM. Is anyone aware of a smaller model in NEMO that I can try to convert?
I also tried to run the model with TensorRT instead of the default onnxruntime, but some bugs in TensorRT prevent this.

Models not working

I can run tests and they pass but if I attempt to run anything else they fail.

@Jetson:/jetson-voice/examples# ./asr.py --wav data/audio/dusty.wav
Namespace(debug=False, default_backend='tensorrt', global_config=None, list_devices=False, list_models=False, log_level='info', mic=None, model='quartznet', model_dir='data/networks', model_manifest='data/networks/manifest.json', profile=False, verbose=False, wav='data/audio/dusty.wav')
Traceback (most recent call last):
File "./asr.py", line 25, in
asr = ASR(args.model)
File "/jetson-voice/jetson_voice/asr.py", line 18, in ASR
return load_resource(resource, factory_map, *args, **kwargs)
File "/jetson-voice/jetson_voice/utils/resource.py", line 57, in load_resource
manifest = download_model(resource)
File "/jetson-voice/jetson_voice/utils/resource.py", line 166, in download_model
manifest = find_model_manifest(name)
File "/jetson-voice/jetson_voice/utils/resource.py", line 143, in find_model_manifest
manifest = load_models_manifest()
File "/jetson-voice/jetson_voice/utils/resource.py", line 128, in load_models_manifest
with open(path) as file:
FileNotFoundError: [Errno 2] No such file or directory: 'data/networks/manifest.json'


PASSED TEST test_tts.py (fastpitch_hifigan) - return code 0


TEST SUMMARY

test_asr.py (quartznet) PASSED
test_asr.py (quartznet_greedy) PASSED
test_asr.py (matchboxnet) PASSED
test_asr.py (vad_marblenet) PASSED
test_nlp.py (distilbert_qa_128) PASSED
test_nlp.py (distilbert_qa_384) PASSED
test_nlp.py (distilbert_intent) PASSED
test_nlp.py (distilbert_sentiment) PASSED
test_nlp.py (distilbert_ner) PASSED
test_tts.py (fastpitch_hifigan) PASSED

passed 10 of 10 tests

problems

I am using a Jetson Xavier NX with JetPack 4.5.1. I have docker installed and my $USER is part of the docker group.

  1. I successfully pulled your repo
  2. executed from ~/jetson-voice : docker/run.sh
  3. here is the result:
    ARCH: aarch64
    reading L4T version from /etc/nv_tegra_release
    L4T BSP Version: L4T R32.5.1
    CONTAINER: dustynv/jetson-voice:r32.5.1
    DEV_VOLUME:
    DATA_VOLUME: --volume /home/rick/jetson-voice/data:/jetson-voice/data
    USER_VOLUME:
    USER_COMMAND:
    Unable to find image 'dustynv/jetson-voice:r32.5.1' locally
    docker: Error response from daemon: manifest for dustynv/jetson-voice:r32.5.1 not found: manifest unknown: manifest unknown.
  4. so i tried to pull your image first: docker pull dustynv/jetson-voice
  5. docker image ls shows the image is there
  6. i try to:
    docker/run.sh and get the same results
  7. i try to: docker run dustynv/jetsonvoice:r32.5.0 ($USER is part of the docker group)
    result:
    Unable to find image 'dustynv/jetsonvoice:r32.5.0' locally
    docker: Error response from daemon: pull access denied for dustynv/jetsonvoice, repository does not exist or may require 'docker login': denied: requested access to the resource is denied.

Please advise.
Thanks

Regarding Joint intent/slot classification - Wrong intents

I am trying to build a model for an HVAC and infotainment system using a Jetson Nano, but most of the predicted intents were wrong. The dataset does not have adequate labels for my project. Could anyone please suggest a solution for this issue, or recommend a pre-trained model suitable for this project?

These were the results of the queries :

Stop music | audio_volume_mute
Play music | play_music
Play next track | play_music
Play previous track | music_query
Volume up | audio_volume_up
Volume down | audio_volume_up
mute | audio_volume_mute
Unmute | audio_volume_mute
AC temp increase | weather_query
AC temp decrease | weather_query
Fan on | social post
turn on fan | Play radio
on fan | social query
Fan off | audio volume mute
Fan speed increase | audio volume up
Fan speed decrease | audio volume up

Running out of disk space

How much disk space is required for the full container? I keep running out of disk space on my 32GB SD card, which had about 6GB free before running docker/run.sh.
