
make-a-smart-speaker's Introduction

Voice Engine


The library is used to create voice interface applications. It includes building blocks such as KWS (keyword spotting) and DOA (direction of arrival). There are also elements to measure RMS levels (dBFS or dB(A)).

Requirements

  • pyaudio
  • numpy
  • snowboy

Installation

Install pyaudio, numpy, and snowboy, using virtualenv to create a virtual Python environment:

sudo apt install python-pyaudio python-numpy python-virtualenv
sudo apt-get install swig python-dev libatlas-base-dev build-essential make
git clone --depth 1 https://github.com/Kitt-AI/snowboy.git
cd snowboy
virtualenv --system-site-packages env
source env/bin/activate
python setup.py build
python setup.py bdist_wheel
pip install dist/snowboy*.whl
cd ..
git clone https://github.com/voice-engine/voice-engine.git
cd voice-engine
python setup.py bdist_wheel
pip install dist/*.whl

Get started

To record audio and search for the keyword "snowboy" (see also kws_snowboy.py):

import time
from voice_engine.kws import KWS
from voice_engine.source import Source

src = Source()
kws = KWS()
src.link(kws)

def on_detected(keyword):
    print('found {}'.format(keyword))
kws.on_detected = on_detected

kws.start()
src.start()
while True:
    try:
        time.sleep(1)
    except KeyboardInterrupt:
        break
kws.stop()
src.stop()

Building blocks

The library uses gstreamer-like elements which can be linked together as an audio pipeline. One element can connect to more than one other element.

The topology can be:

Source --> ChannelPicker --> KWS          Source --> ChannelPicker --> KWS --> Alexa
  |                          /\
  V                        /   \
 DOA                   Alexa   Google Assistant
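
For example, here is a minimal sketch of the left-hand topology, with one Source fanning out to both a ChannelPicker/KWS chain and a DOA element. The ChannelPicker and DOA module paths and constructor arguments below are assumptions based on the library's examples; check the voice_engine package for the classes matching your microphone array.

import time
from voice_engine.source import Source
from voice_engine.kws import KWS
# assumed module paths; verify them in the voice_engine package
from voice_engine.channel_picker import ChannelPicker
from voice_engine.doa_respeaker_4mic_array import DOA

src = Source(rate=16000, channels=4)        # a 4-mic array as the audio source
picker = ChannelPicker(channels=4, pick=0)  # pick one channel for KWS
kws = KWS()
doa = DOA(rate=16000)

src.link(picker)   # Source --> ChannelPicker --> KWS
picker.link(kws)
src.link(doa)      # the same Source also feeds DOA (fan-out)

def on_detected(keyword):
    print('found {}, direction {}'.format(keyword, doa.get_direction()))
kws.on_detected = on_detected

kws.start()
src.start()
try:
    while True:
        time.sleep(1)
except KeyboardInterrupt:
    pass
src.stop()
kws.stop()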
  


make-a-smart-speaker's Issues

Record & Play Audio on Linux

part of Smart Speaker from Scratch

To make a smart speaker, the first thing to do in terms of software is to learn how to record and play audio.
Most smart speakers, except Apple's, run Linux, so let's get started with audio on Linux.

[Figure: the Linux audio stack — applications on top, the ALSA library and its plugins, the PulseAudio/JACK sound servers, and the kernel sound card drivers below]

We mainly focus on ALSA (Advanced Linux Sound Architecture) here. ALSA has two parts:
one is the sound card drivers in the Linux kernel, which describe the audio hardware and provide a unified interface;
the other is the user-space ALSA library, which has a user-friendly API (compared with the ALSA kernel API), some utilities, and a plugin system. The plugin system has plugins such as plug, dmix, dsnoop, and pulse, which provide functions like resampling, mixing, and channel mapping.
ALSA alone is not powerful enough. For example, the ALSA dmix plugin lets multiple applications play through one sound card at the same time, but it is very limited: it is hard to adjust the volume of each application individually.
So we have PulseAudio, a sound server that manages all the audio sources and sinks. It can handle not just ALSA audio, but also Bluetooth audio and network audio. When a USB sound card is plugged in, PulseAudio will detect it and redirect playback to it (based on settings). Nowadays, PulseAudio is installed by default on most Linux desktop distributions. Similar to PulseAudio, we also have JACK and PipeWire. JACK mainly focuses on realtime, low-latency professional audio applications. PipeWire is an ongoing project which aims to support the use cases currently handled by both PulseAudio and JACK, and at the same time provide the same level of powerful handling of video input and output.

Since ALSA, PulseAudio, and JACK all have their own APIs, which one should we use?
The short answer is the ALSA API for most recording/playback applications, as PulseAudio and JACK both provide ALSA plugins that support applications written against the ALSA API. It's like APP3 in the above graph. The author of PulseAudio, Lennart Poettering, also recommended a safe ALSA subset API for basic PCM audio playback/capturing.

So let's start to use ALSA to record & play audio.

Record & Play Audio on Command Line

For raw audio or wav audio, we can just use aplay and arecord. They are actually the same program with different names; you can run ls -la /usr/bin/arecord to find that out. aplay and arecord have a bunch of parameters, but they are worth learning (we will use these parameters in code, in an extremely simple way, later). The best way to learn aplay and arecord is to read the help from aplay -h or man arecord and practice:

$ aplay -h
Usage: aplay [OPTION]... [FILE]...

-h, --help              help
    --version           print current version
-l, --list-devices      list all soundcards and digital audio devices
-L, --list-pcms         list device names
-D, --device=NAME       select PCM by name
-q, --quiet             quiet mode
-t, --file-type TYPE    file type (voc, wav, raw or au)
-c, --channels=#        channels
-f, --format=FORMAT     sample format (case insensitive)
-r, --rate=#            sample rate
-d, --duration=#        interrupt after # seconds
-M, --mmap              mmap stream
-N, --nonblock          nonblocking mode
-F, --period-time=#     distance between interrupts is # microseconds
-B, --buffer-time=#     buffer duration is # microseconds
    --period-size=#     distance between interrupts is # frames
    --buffer-size=#     buffer duration is # frames
-A, --avail-min=#       min available space for wakeup is # microseconds
-R, --start-delay=#     delay for automatic PCM start is # microseconds 
                        (relative to buffer size if <= 0)
-T, --stop-delay=#      delay for automatic PCM stop is # microseconds from xrun
-v, --verbose           show PCM structure and setup (accumulative)
-V, --vumeter=TYPE      enable VU meter (TYPE: mono or stereo)
-I, --separate-channels one file for each channel
-i, --interactive       allow interactive operation from stdin
-m, --chmap=ch1,ch2,..  Give the channel map to override or follow
    --disable-resample  disable automatic rate resample
    --disable-channels  disable automatic channel conversions
    --disable-format    disable automatic format conversions
    --disable-softvol   disable software volume control (softvol)
    --test-position     test ring buffer position
    --test-coef=#       test coefficient for ring buffer position (default 8)
                        expression for validation is: coef * (buffer_size / 2)
    --test-nowait       do not wait for ring buffer - eats whole CPU
    --max-file-time=#   start another output file when the old file has recorded
                        for this many seconds
    --process-id-file   write the process ID here
    --use-strftime      apply the strftime facility to the output file name
    --dump-hw-params    dump hw_params of the device
    --fatal-errors      treat all errors as fatal
Recognized sample formats are: S8 U8 S16_LE S16_BE U16_LE U16_BE S24_LE S24_BE U24_LE U24_BE S32_LE S32_BE U32_LE U32_BE FLOAT_LE FLOAT_BE FLOAT64_LE FLOAT64_BE IEC958_SUBFRAME_LE IEC958_SUBFRAME_BE MU_LAW A_LAW IMA_ADPCM MPEG GSM SPECIAL S24_3LE S24_3BE U24_3LE U24_3BE S20_3LE S20_3BE U20_3LE U20_3BE S18_3LE S18_3BE U18_3LE U18_3BE G723_24 G723_24_1B G723_40 G723_40_1B DSD_U8 DSD_U16_LE DSD_U32_LE DSD_U16_BE DSD_U32_BE
Some of these may not be available on selected hardware
The available format shortcuts are:
-f cd (16 bit little endian, 44100, stereo)
-f cdr (16 bit big endian, 44100, stereo)
-f dat (16 bit little endian, 48000, stereo)

To record 10 seconds of 48000 Hz, 2-channel, S16_LE audio:

arecord -d 10 -r 48000 -c 2 -f S16_LE audio.wav   # or arecord -d 10 -f dat audio.wav

When debugging a sound card, we want to check whether there is any sound; we can use -V mono or -V stereo to show a VU meter. To see the details of the plugins and settings in use, add -v. Or run a loopback:

arecord -v -f dat - | aplay -v -Vstereo -

To choose a PCM device, use -D {name}; run aplay -L to get a list of available PCM devices. PCM devices are different from sound cards: PCM devices can be created by ALSA plugins. When PulseAudio is installed, the ALSA pulse plugin creates a PCM device named pulse, which is set as the default PCM device. ALSA plugins are stackable. For example, we can stack plug on dsnoop to run two recorders in two terminals:

arecord -v -D plug:dsnoop:0 -f dat -V stereo /dev/null

And stack plug on dmix so that one sound card can play multiple audio streams:

aplay -v -D plug:dmix:0 music.wav

Next, we will use aplay and arecord in the code.

Record & Play Audio in the Code

The simplest way to record audio is to redirect arecord to a stdout pipe and then read the audio from that pipe. In C code, it is:
#include <stdio.h>

int main(int argc, char **argv) {
    /* record 16 kHz, 16-bit, mono raw audio from arecord's stdout */
    char *cmd = "arecord -D plughw:0 -f S16_LE -c 1 -r 16000 -t raw -q -";
    char buf[256];
    FILE *fp = popen(cmd, "r");
    if (fp == NULL) {
        perror("popen");
        return 1;
    }

    for (int i = 0; i < 16; i++) {
        size_t result = fread(buf, 1, sizeof(buf), fp);
        printf("read %zu bytes\n", result);
    }
    pclose(fp);
    return 0;
}

To play sound:

#include <stdio.h>

int main(int argc, char **argv) {
    /* feed 16 kHz, 16-bit, mono raw audio (silence here) to aplay's stdin */
    char *cmd = "aplay -D plughw:0 -f S16_LE -c 1 -r 16000 -t raw -q -";
    char buf[256] = {0};
    FILE *fp = popen(cmd, "w");   /* open for writing: we pipe audio into aplay */
    if (fp == NULL) {
        perror("popen");
        return 1;
    }

    for (int i = 0; i < 16; i++) {
        size_t result = fwrite(buf, 1, sizeof(buf), fp);
        printf("wrote %zu bytes\n", result);
    }
    pclose(fp);
    return 0;
}

In Python code, it is quite simple too:

import subprocess
import audioop

# read 16 kHz, 16-bit mono raw audio from arecord through a stdout pipe
cmd = ['arecord', '-D', 'plughw:0', '-f', 'S16_LE', '-c', '1', '-r', '16000', '-t', 'raw', '-q', '-']
process = subprocess.Popen(cmd, stdout=subprocess.PIPE)
for i in range(16):
    audio = process.stdout.read(160 * 2)  # 160 frames (10 ms) of 2-byte samples
    print(audioop.rms(audio, 2))          # RMS level of each chunk
process.stdout.close()
process.terminate()
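
Playback is the mirror image: start aplay and write raw samples to its stdin pipe. A minimal sketch, playing 160 ms of silence with the same device and format as above:

import subprocess

# write 16 kHz, 16-bit mono raw audio (silence) to aplay through a stdin pipe
cmd = ['aplay', '-D', 'plughw:0', '-f', 'S16_LE', '-c', '1', '-r', '16000', '-t', 'raw', '-q', '-']
process = subprocess.Popen(cmd, stdin=subprocess.PIPE)
silence = b'\x00' * (160 * 2)  # 10 ms of 16-bit mono silence
for i in range(16):
    process.stdin.write(silence)
process.stdin.close()
process.wait()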

We can also use the ALSA API directly, but then we have to learn a variety of API calls just to set parameters, which makes it a little hard to use. See pcm.c to learn how to play a sine signal in different ways, such as sync, async, and poll.
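
For a flavour of the same exercise without the C API, here is a minimal sketch that generates a sine tone in Python and pipes it to aplay; the rate, frequency, and amplitude are arbitrary choices, not taken from pcm.c:

import math
import struct
import subprocess

RATE = 16000     # sample rate in Hz
FREQ = 440.0     # sine frequency in Hz
SECONDS = 2

# generate SECONDS of a 440 Hz sine as 16-bit little-endian PCM at half amplitude
samples = (int(32767 * 0.5 * math.sin(2 * math.pi * FREQ * i / RATE))
           for i in range(RATE * SECONDS))
pcm = b''.join(struct.pack('<h', s) for s in samples)

# hand the raw PCM to aplay on stdin
cmd = ['aplay', '-f', 'S16_LE', '-c', '1', '-r', str(RATE), '-t', 'raw', '-q', '-']
subprocess.run(cmd, input=pcm)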

If you want cross-platform support, you may use PortAudio or libsoundio; there is a comparison between libsoundio and PortAudio.
For Python programs, libsoundio doesn't seem to have a mature Python binding, while PortAudio has pyaudio and python-sounddevice.
When using PortAudio (pyaudio or sounddevice), PortAudio will iterate over all PCM devices and try to open them, which produces some slightly annoying error logs.
Using pyaudio to record audio is simple; here is an example from https://people.csail.mit.edu/hubert/pyaudio/:

"""PyAudio example: Record a few seconds of audio and save to a WAVE file."""

import pyaudio
import wave

CHUNK = 1024
FORMAT = pyaudio.paInt16
CHANNELS = 2
RATE = 44100
RECORD_SECONDS = 5
WAVE_OUTPUT_FILENAME = "output.wav"

p = pyaudio.PyAudio()

stream = p.open(format=FORMAT,
                channels=CHANNELS,
                rate=RATE,
                input=True,
                frames_per_buffer=CHUNK)

print("* recording")

frames = []

for i in range(0, int(RATE / CHUNK * RECORD_SECONDS)):
    data = stream.read(CHUNK)
    frames.append(data)

print("* done recording")

stream.stop_stream()
stream.close()
p.terminate()

wf = wave.open(WAVE_OUTPUT_FILENAME, 'wb')
wf.setnchannels(CHANNELS)
wf.setsampwidth(p.get_sample_size(FORMAT))
wf.setframerate(RATE)
wf.writeframes(b''.join(frames))
wf.close()
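
Playback with pyaudio is symmetrical. A minimal sketch that reads back the wav file written above and writes its frames to an output stream:

import pyaudio
import wave

wf = wave.open("output.wav", 'rb')
p = pyaudio.PyAudio()
# open an output stream matching the wav file's format
stream = p.open(format=p.get_format_from_width(wf.getsampwidth()),
                channels=wf.getnchannels(),
                rate=wf.getframerate(),
                output=True)

data = wf.readframes(1024)
while data:
    stream.write(data)           # blocking write plays the chunk
    data = wf.readframes(1024)

stream.stop_stream()
stream.close()
p.terminate()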

IMHO, for a simple audio recording/playback program, the best way is just using aplay and arecord. No extra dependencies are required, the stdin/stdout pipe provides a buffering mechanism, and it is stable and multi-process (a big plus on multi-core ARM processors).

How can I make a wake word in my code :(

How can I add the wake word? I have tried many times.

The code:

# import libraries
import speech_recognition as sr # recognise speech
import playsound # to play an audio file
from gtts import gTTS # google text to speech
import random
from time import ctime # get time details
import webbrowser # open browser
import yfinance as yf # to fetch financial data
import ssl
import certifi
import time
import os # to remove created audio files

class person:
    name = ''
    def setName(self, name):
        self.name = name

# listening helper: check whether any of the terms is in the transcript
def there_exists(terms):
    for term in terms:
        if term in voice_data:
            return True

r = sr.Recognizer() # initialise a recogniser

# listen for audio and convert it to text
def record_audio(ask=False):
    with sr.Microphone() as source: # microphone as source
        if ask:
            speak(ask)
        audio = r.listen(source) # listen for the audio via source
        voice_data = ''
        try:
            voice_data = r.recognize_google(audio) # convert audio to text
        except sr.UnknownValueError: # error: recognizer does not understand
            speak('Can you repeat')
        except sr.RequestError:
            speak('Sorry, the service is down') # error: recognizer is not connected
        print(f">> {voice_data.lower()}") # print what user said
        return voice_data.lower()

# get a string, make an audio file to be played, and choose the language
def speak(audio_string):
    tts = gTTS(text=audio_string, lang='en') # text to speech(voice)
    r = random.randint(1, 20000000)
    audio_file = 'audio' + str(r) + '.mp3'
    tts.save(audio_file) # save as mp3
    playsound.playsound(audio_file) # play the audio file
    print(f"kiri: {audio_string}") # print what app said
    os.remove(audio_file) # remove audio file

wake = ["welcome", "hey sara"]

# respond to what was heard
def respond(voice_data):

    # 1: greeting
    if there_exists(['hey', 'hi', 'hello']):
        greetings = [f"hey, how can I help you {person_obj.name}", f"hey, what's up? {person_obj.name}", f"I'm listening {person_obj.name}", f"how can I help you? {person_obj.name}", f"hello {person_obj.name}"]
        greet = greetings[random.randint(0, len(greetings) - 1)]
        speak(greet)

    # 2: name
    if there_exists(["what is your name", "what's your name", "tell me your name"]):
        if person_obj.name:
            speak("my name is sara")
        else:
            speak("my name is sara. what's your name?")

    if there_exists(["my name is"]):
        person_name = voice_data.split("is")[-1].strip()
        speak(f"okay, i will remember that {person_name}")
        person_obj.setName(person_name) # remember name in person object

    # 3: greeting
    if there_exists(["how are you", "how are you doing"]):
        speak(f"I'm very well, thanks for asking {person_obj.name}")

    # 4: time (use a name other than `time` so the time module is not shadowed)
    if there_exists(["what's the time", "tell me the time", "what time is it"]):
        now = ctime().split(" ")[3].split(":")[0:2]
        if now[0] == "00":
            hours = '12'
        else:
            hours = now[0]
        minutes = now[1]
        speak(f'{hours} {minutes}')

    # 5: search google
    if there_exists(["search for"]) and 'youtube' not in voice_data:
        search_term = voice_data.split("for")[-1]
        url = f"https://google.com/search?q={search_term}"
        webbrowser.get().open(url)
        speak(f'Here is what I found for {search_term} on google')

    # 6: search youtube
    if there_exists(["youtube"]):
        search_term = voice_data.split("for")[-1]
        url = f"https://www.youtube.com/results?search_query={search_term}"
        webbrowser.get().open(url)
        speak(f'Here is what I found for {search_term} on youtube')

    # 7: get stock price
    if there_exists(["price of"]):
        search_term = voice_data.lower().split(" of ")[-1].strip() # strip removes whitespace before/after the term
        stocks = {
            "apple": "AAPL",
            "microsoft": "MSFT",
            "facebook": "FB",
            "tesla": "TSLA",
            "bitcoin": "BTC-USD"
        }
        try:
            stock = stocks[search_term]
            stock = yf.Ticker(stock)
            price = stock.info["regularMarketPrice"]
            speak(f'price of {search_term} is {price} {stock.info["currency"]} {person_obj.name}')
        except Exception:
            speak('oops, something went wrong')

    # 8: to stop
    if there_exists(["exit", "quit", "goodbye"]):
        speak("going offline")
        exit()

    # 9: thank you response
    if there_exists(["thank you", "thanks", "you are the best"]):
        speak("welcome sir")

    time.sleep(0.5)

person_obj = person()
while True:
    voice_data = record_audio() # get the voice input
    respond(voice_data) # respond

Sorry, I am new to GitHub.
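
One possible direction, sketched below using the functions defined above (this loop is a suggestion, not part of the original code): use the wake list to gate respond(), so the assistant only answers after one of the wake phrases is heard:

# a possible wake-word gate: listen, and only take a command after a wake phrase
person_obj = person()
while True:
    voice_data = record_audio()               # first utterance: listen for a wake phrase
    if any(w in voice_data for w in wake):
        voice_data = record_audio('yes?')     # prompt, then capture the actual command
        respond(voice_data)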

Add Snips to the resources

Hi, this is a very nice resource. Perhaps you could also add the Snips wake word engine and ASR engines to the list; we are open-sourcing them over time.

You can take a look at https://snips.ai: you can build your own assistant with the console, download the platform, use it with any audio frontend (beamforming, AEC) that you want, and code any skill that you like.

Add Rhasspy

Do you know Rhasspy? It's an open source, fully offline voice assistant toolkit for many languages that works well with Home Assistant, Hass.io, and Node-RED. Especially since Snips has been bought by Sonos, a lot of new users and developers have been joining the project, and development is moving rapidly.
