Rhubarb Lip-Sync

Rhubarb Lip-Sync is a command-line tool that automatically creates 2D mouth animation from voice recordings. You can use it for animating speech in computer games, animated cartoons, or any similar project.

Rhubarb Lip-Sync produces output files in various text formats (TSV/XML/JSON). If you’re a programmer, this makes it easy for you to use the output in whatever way you like. If you’re not a programmer, there is currently no direct way to import the result into your favorite animation tool. If this is what you need, feel free to create an issue telling me what tool you’re using. I might add support for a few popular animation tools in the future.

Demo video

Click the image for a demo video. This was generated using Rhubarb Lip-Sync 1.0.0; newer versions create even better animation!

Mouth shapes

Rhubarb Lip-Sync can use between six and nine different mouth positions. The first six mouth shapes (Ⓐ-Ⓕ) are the basic mouth shapes and the absolute minimum you have to draw for your character. These six mouth shapes were invented at the Hanna-Barbera studios for shows such as Scooby-Doo and The Flintstones. Since then, they have evolved into a de-facto standard for 2D animation, and have been widely used by studios like Disney and Warner Bros.

In addition to the six basic mouth shapes, there are three extended mouth shapes: Ⓖ, Ⓗ, and Ⓧ. These are optional. You may choose to draw all three of them, pick just one or two, or leave them out entirely.

Ⓐ		Closed mouth for the “P”, “B”, and “M” sounds. This is almost identical to the Ⓧ shape, but there is slight pressure between the lips.
Ⓑ		Slightly open mouth with clenched teeth. This mouth shape is used for most consonants (“K”, “S”, “T”, etc.). It’s also used for some vowels such as the “EE” sound in bee.
Ⓒ		Open mouth. This mouth shape is used for vowels like “EH” as in men, “AH” as in sun, and “EY” as in say. It’s also used for some consonants, depending on context. This shape is also used as an in-between when animating from Ⓐ or Ⓑ to Ⓓ. So make sure the animations ⒶⒸⒹ and ⒷⒸⒹ look smooth!
Ⓓ		Wide open mouth. This mouth shape is used for vowels like “AA” as in father, “AE” as in bat, and “AY” as in why.
Ⓔ		Slightly rounded mouth. This mouth shape is used for vowels like “AO” as in off. This shape is also used as an in-between when animating from Ⓒ or Ⓓ to Ⓕ. Make sure the mouth isn’t wider open than for Ⓒ. Both ⒸⒺⒻ and ⒹⒺⒻ should result in smooth animation.
Ⓕ		Puckered lips. This mouth shape is used for “UW” as in you, “OY” as in boy, and “W” as in way.
Ⓖ		Upper teeth touching the lower lip for the “F” and “V” sounds. Don’t overdo it: It shouldn’t look as if your character were actually biting on their lower lip. This extended mouth shape is optional. If your art style is detailed enough, it greatly improves the overall look of the animation. If you decide not to use it, you can specify so using the `extendedShapes` option.
Ⓗ		This shape should be identical to Ⓒ, except for the tongue showing. It is used for the “L” sound, so the tongue should touch behind the upper teeth. This extended mouth shape is optional. Depending on your art style and the angle of the head, the tongue may not be visible at all. In this case, there is no point in drawing this extra shape. If you decide not to use it, you can specify so using the `extendedShapes` option.
Ⓧ		Idle position. This mouth shape is used for pauses in speech. This should be the same mouth drawing you use when your character is walking around without talking. It is almost identical to Ⓐ, but without the pressure between the lips: For Ⓧ, the lips should be closed but relaxed. This extended mouth shape is optional. Whether there should be any visible difference between the rest position Ⓧ and the closed talking mouth Ⓐ depends on your art style and personal taste. If you decide not to use it, you can specify so using the `extendedShapes` option.
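
To illustrate how the six basic shapes and the optional extended shapes combine, here is a small Python sketch that lists the drawings a character needs for a given `extendedShapes` value. The function name `shapes_to_draw` is made up for this example and is not part of Rhubarb Lip-Sync.

```python
BASIC_SHAPES = "ABCDEF"    # always required
EXTENDED_SHAPES = "GHX"    # optional, selected via the extendedShapes option


def shapes_to_draw(extended_shapes="GHX"):
    """Return the mouth shapes needed for a given extendedShapes value."""
    unknown = set(extended_shapes) - set(EXTENDED_SHAPES)
    if unknown:
        raise ValueError(f"Unknown extended shapes: {sorted(unknown)}")
    # Keep the order G, H, X regardless of the order in the input string.
    chosen = [shape for shape in EXTENDED_SHAPES if shape in extended_shapes]
    return list(BASIC_SHAPES) + chosen


print(shapes_to_draw("GX"))   # basic shapes plus G and X
print(shapes_to_draw(""))     # basic shapes only
```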

How to run Rhubarb Lip-Sync

General usage

Rhubarb Lip-Sync is a command-line tool that is currently available for Windows and OS X.

Download the latest release and unzip the file anywhere on your computer.
Call `rhubarb`, passing it a WAVE file as argument, and redirecting the output to a file. In its simplest form, this might look like this: `rhubarb my-recording.wav > output.txt`. There are additional command-line options you can specify in order to get better results.
Rhubarb Lip-Sync will analyze the sound file and print the result to stdout. If you’ve redirected stdout to a file as above, you will now have a file containing the lip-sync data, in TSV format unless you chose a different export format. If an error occurs, Rhubarb Lip-Sync will print an error message to stderr and exit with a non-zero exit code.
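
If you want to call Rhubarb Lip-Sync from a script instead of a shell, the steps above could be sketched in Python roughly as follows. This is only a sketch: the helper name `run_rhubarb` is invented here, and it assumes the `rhubarb` executable can be found on your PATH (otherwise pass its full path as `executable`).

```python
import subprocess


def run_rhubarb(wav_path, output_path, executable="rhubarb"):
    """Run Rhubarb Lip-Sync on a WAVE file and save its stdout to output_path."""
    with open(output_path, "wb") as out:
        result = subprocess.run(
            [executable, wav_path],
            stdout=out,               # same effect as "> output.txt" in a shell
            stderr=subprocess.PIPE,   # progress and error messages
            text=True,
        )
    # A non-zero exit code means an error message was printed to stderr.
    if result.returncode != 0:
        raise RuntimeError(
            f"rhubarb failed with exit code {result.returncode}: "
            f"{result.stderr.strip()}"
        )
    return output_path


# Example call (needs the rhubarb executable and a real recording):
# run_rhubarb("my-recording.wav", "output.txt")
```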

Command-line options

The following is a complete list of available command-line options.

Option	Description
`-f` <format>, `--exportFormat` <format>	The export format. Options: `tsv` (tab-separated values, see details), `xml` (see details), `json` (see details). Default value: `tsv`
`-d` <path>, `--dialogFile` <path>	This option is meant for situations where you know the dialog text in advance. Specify a plain-text file (in ASCII or UTF-8 format) containing just the dialog of the audio file. Rhubarb Lip-Sync will still perform word recognition internally, but it will prefer words and phrases that occur in the dialog file. This leads to better recognition results and thus more reliable animation. For instance, let’s say you’re recording dialog for a computer game. The script says: “That’s all gobbledygook to me.” But actually, the voice artist ends up saying “That’s just gobbledygook to me,” slightly changing the dialog. If you specify a dialog file with the original line (“That’s all gobbledygook to me”), this will still allow Rhubarb Lip-Sync to produce better results. Rhubarb Lip-Sync will ignore the dialog file where it audibly differs from the recording, and benefit from it where it matches. It is always a good idea to specify the dialog text. This will usually lead to more reliable mouth animation, even if the text is not completely accurate.
`--extendedShapes` <string>	As described in Mouth shapes, Rhubarb Lip-Sync uses six basic mouth shapes and up to three extended mouth shapes, which are optional. Use this option to specify which extended mouth shapes should be used. For example, to use only the Ⓖ and Ⓧ extended mouth shapes, specify `GX`; to use only the six basic mouth shapes, specify an empty string: `""`. Default value: `GHX`
`--threads` <number>	Rhubarb Lip-Sync uses multithreading to speed up processing. By default, it creates as many worker threads as there are cores on your CPU, which results in optimal processing speed. You may choose to specify a lower number if you feel that Rhubarb Lip-Sync is slowing down other applications. Specifying a higher number is not recommended, as it won’t result in any additional speed-up. Note that for short audio files, Rhubarb Lip-Sync may choose to use fewer threads than specified. Default value: as many threads as your CPU has cores
`-q`, `--quiet`	By default, Rhubarb Lip-Sync writes a number of progress messages to `stderr`. If you’re using it as part of a batch process, this may clutter your console. If you specify the `--quiet` flag, there won’t be any output to `stderr` unless an error occurred.
`--logFile` <path>	Creates a log file with diagnostic information at the specified path.
`--logLevel` <level>	Sets the log level for the log file. Options: `trace`, `debug`, `info`, `warning`, `error`, `fatal`. Default value: `debug`
`--version`	Displays version information and exits.
`-h`, `--help`	Displays usage information and exits.
<input file>	The input file to be analyzed. Must be a sound file in WAVE format.
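
When Rhubarb Lip-Sync runs as part of a batch process, it can help to assemble the argument list in one place. The following Python sketch builds such a list from the options in the table above. The function `build_rhubarb_command` and its parameter names are hypothetical, and placing the options before the input file is an assumption of this example.

```python
def build_rhubarb_command(input_file, export_format="tsv", dialog_file=None,
                          extended_shapes=None, threads=None, quiet=False,
                          executable="rhubarb"):
    """Build an argument list for Rhubarb Lip-Sync from the documented options."""
    if export_format not in ("tsv", "xml", "json"):
        raise ValueError(f"Unsupported export format: {export_format}")
    cmd = [executable, "--exportFormat", export_format]
    if dialog_file is not None:
        cmd += ["--dialogFile", dialog_file]
    if extended_shapes is not None:
        # An empty string selects only the six basic mouth shapes.
        cmd += ["--extendedShapes", extended_shapes]
    if threads is not None:
        cmd += ["--threads", str(threads)]
    if quiet:
        cmd.append("--quiet")
    cmd.append(input_file)
    return cmd


print(build_rhubarb_command("my-recording.wav", export_format="json",
                            dialog_file="dialog.txt", extended_shapes="GX",
                            quiet=True))
```

The resulting list can be passed directly to `subprocess.run`, which avoids shell quoting problems with paths that contain spaces.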

How to use the output

The output of Rhubarb Lip-Sync is a file that tells you which mouth shape to display at what time within the recording. You can choose between three file formats — TSV, XML, and JSON. The following paragraphs show you what each of these formats looks like.

Tab-separated values (`tsv`)

TSV is the simplest and most compact export format supported by Rhubarb Lip-Sync. Each line starts with a timestamp (in seconds), followed by a tab, followed by the name of the mouth shape. The following is the output for a recording of a person saying 'Hi.'

0.00	X
0.05	D
0.27	C
0.31	B
0.43	X
0.47	X

Here’s how to read it:

At the beginning of the recording (0.00s), the mouth is closed (shape Ⓧ). The very first output will always have the timestamp 0.00s.
0.05s into the recording, the mouth opens wide (shape Ⓓ) for the “HH” sound, anticipating the “AY” sound that will follow.
The second half of the “AY” diphthong (0.31s into the recording) requires clenched teeth (shape Ⓑ). Before that, shape Ⓒ is inserted as an in-between at 0.27s. This allows for a smoother animation from Ⓓ to Ⓑ.
0.43s into the recording, the dialog is finished and the mouth closes again (shape Ⓧ).
The last output line in TSV format is special: Its timestamp is always the very end of the recording (truncated to a multiple of 0.01s) and its value is always a closed mouth (shape Ⓧ or Ⓐ, depending on your `extendedShapes` settings).
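
As a rough illustration of the reading rules above, the following Python sketch turns TSV output into a list of (start, end, shape) tuples. The function name `parse_tsv` is made up for this example. It treats the final line purely as an end marker, as described above.

```python
def parse_tsv(text):
    """Convert Rhubarb TSV output into a list of (start, end, shape) tuples."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        timestamp, shape = line.split("\t")
        rows.append((float(timestamp), shape.strip()))
    # Each shape lasts until the timestamp of the next line.
    # The last line only marks the end of the recording.
    return [(start, end, shape)
            for (start, shape), (end, _) in zip(rows, rows[1:])]


sample = "0.00\tX\n0.05\tD\n0.27\tC\n0.31\tB\n0.43\tX\n0.47\tX\n"
for cue in parse_tsv(sample):
    print(cue)
```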

XML format (`xml`)

XML format is rather verbose. The following is the output for a person saying 'Hi,' the same recording as above.

<?xml version="1.0" encoding="utf-8"?>
<rhubarbResult>
  <metadata>
    <soundFile>C:\Users\Daniel\Desktop\av\hi\hi.wav</soundFile>
    <duration>0.47</duration>
  </metadata>
  <mouthCues>
    <mouthCue start="0.00" end="0.05">X</mouthCue>
    <mouthCue start="0.05" end="0.27">D</mouthCue>
    <mouthCue start="0.27" end="0.31">C</mouthCue>
    <mouthCue start="0.31" end="0.43">B</mouthCue>
    <mouthCue start="0.43" end="0.47">X</mouthCue>
  </mouthCues>
</rhubarbResult>

The file starts with a metadata block containing the full path of the original recording and its duration (truncated to a multiple of 0.01s). After that, each mouthCue element indicates the start and end of a certain mouth shape, as explained for TSV format. Note that the end of each mouth cue is identical to the start of the following one. This is a bit redundant, but it means that we don’t need a special final element like in TSV format.
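
One possible way to read this format in Python is the standard library's `xml.etree.ElementTree` module. The sketch below uses a helper named `parse_xml`, which is invented for this example, and a sample in which the sound file path is abbreviated.

```python
import xml.etree.ElementTree as ET


def parse_xml(text):
    """Return (duration, cues) where cues is a list of (start, end, shape)."""
    root = ET.fromstring(text)
    duration = float(root.findtext("metadata/duration"))
    cues = [(float(cue.get("start")), float(cue.get("end")), cue.text)
            for cue in root.iter("mouthCue")]
    return duration, cues


# Same data as the example above, with the sound file path abbreviated.
sample_xml = """<rhubarbResult>
  <metadata>
    <soundFile>hi.wav</soundFile>
    <duration>0.47</duration>
  </metadata>
  <mouthCues>
    <mouthCue start="0.00" end="0.05">X</mouthCue>
    <mouthCue start="0.05" end="0.27">D</mouthCue>
    <mouthCue start="0.27" end="0.31">C</mouthCue>
    <mouthCue start="0.31" end="0.43">B</mouthCue>
    <mouthCue start="0.43" end="0.47">X</mouthCue>
  </mouthCues>
</rhubarbResult>"""

duration, cues = parse_xml(sample_xml)
print(duration, cues)
```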

JSON format (`json`)

JSON format is very similar to XML format. The choice mainly depends on the programming language you use, which may have built-in support for one format but not the other. The following is the output for a person saying 'Hi,' the same recording as above.

{
  "metadata": {
    "soundFile": "C:\\Users\\Daniel\\Desktop\\av\\hi\\hi.wav",
    "duration": 0.47
  },
  "mouthCues": [
    { "start": 0.00, "end": 0.05, "value": "X" },
    { "start": 0.05, "end": 0.27, "value": "D" },
    { "start": 0.27, "end": 0.31, "value": "C" },
    { "start": 0.31, "end": 0.43, "value": "B" },
    { "start": 0.43, "end": 0.47, "value": "X" }
  ]
}

There is nothing surprising here; everything said about XML format applies to JSON, too.
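
A typical use of the parsed data is to look up which mouth shape to show at a given playback time. The following Python sketch does this with a binary search over the cue start times. The helper names `load_cues` and `shape_at` are made up for this example, and the sound file path in the sample is abbreviated.

```python
import bisect
import json


def load_cues(json_text):
    """Return the list of mouth cues from Rhubarb JSON output."""
    return json.loads(json_text)["mouthCues"]


def shape_at(cues, t):
    """Return the mouth shape to display at time t (in seconds).

    Times before the first cue return the first shape; times after the
    last cue return the last shape.
    """
    starts = [cue["start"] for cue in cues]
    index = bisect.bisect_right(starts, t) - 1
    return cues[max(index, 0)]["value"]


# Same data as the example above, with the sound file path abbreviated.
sample_json = """{
  "metadata": { "soundFile": "hi.wav", "duration": 0.47 },
  "mouthCues": [
    { "start": 0.00, "end": 0.05, "value": "X" },
    { "start": 0.05, "end": 0.27, "value": "D" },
    { "start": 0.27, "end": 0.31, "value": "C" },
    { "start": 0.31, "end": 0.43, "value": "B" },
    { "start": 0.43, "end": 0.47, "value": "X" }
  ]
}"""

cues = load_cues(sample_json)
print(shape_at(cues, 0.10))
```

To drive a frame-based animation, you could call `shape_at(cues, frame / fps)` once per frame, where `fps` is your project's frame rate.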

Limitations

Rhubarb Lip-Sync has some limitations you should be aware of.

English only

Rhubarb Lip-Sync only produces good results when you give it recordings in English. You’ll get best results with American English.

2D animation only

Rhubarb Lip-Sync tries to imitate the animation style used in classic 2D animated cartoons. The results look stylized, and that’s intentional. If you’re working on a realistic 3D game or movie, Rhubarb Lip-Sync may not be the best choice.

Tell me what you think!

I’d love to hear from you! If you need help or have any suggestions, feel free to create an issue.
