We’ve been asked a few times now how we handle lip syncing in-game with Unity. There isn’t a lot of rocket science behind the way it works, but I hope I can make some sense of it here.
Since the original game was built on Source, it was already designed around a feature that helps us get more or less the same results: blend shapes. For a little behind-the-scenes context, Source uses flexes, or morph targets, to control facial expressions and lip sync. These expressions are built into each character model, mostly exported as separate vertex animation data files, and are based on FACS (the Facial Action Coding System), which is what makes the results look realistic. The expressions are stored as presets, where each preset has a set of flexes configured for a phoneme or expression. For example, a preset can have multiple flexes turned on at once to create the appropriate expression without authoring a new blend shape for every minuscule change.
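To make that a bit more concrete, here is a rough sketch of how such a preset could be represented in data. The phoneme labels, flex names and weights below are placeholders made up for illustration, not the actual values used by Source or our game.

```csharp
// A minimal sketch of a phoneme preset: one phoneme mapped to a handful of
// flex weights. All names and numbers here are illustrative assumptions.
using System.Collections.Generic;

public struct FlexSetting
{
    public string flexName;   // Name of the blend shape / flex to drive.
    public float weight;      // Target weight (0-100) for that flex.
}

public class PhonemePreset
{
    public string phoneme;                                  // e.g. "AA" (hypothetical label).
    public List<FlexSetting> flexes = new List<FlexSetting>();
}

public static class PresetLibrary
{
    // A single preset can switch on several flexes at once, so no new blend
    // shape has to be authored for every small variation.
    public static PhonemePreset ExampleOpenMouth()
    {
        return new PhonemePreset
        {
            phoneme = "AA",
            flexes =
            {
                new FlexSetting { flexName = "jaw_open",   weight = 80f },
                new FlexSetting { flexName = "lip_corner", weight = 20f },
            }
        };
    }
}
```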
Since version 4.3, Unity ships with blend shape support in the core engine, which lets us easily import the blend shapes from Source into the character mesh and drive them from script by calling SkinnedMeshRenderer.SetBlendShapeWeight(). This allows us to set any weight on each blend shape and recreate the appropriate flexes.
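As a small example of what that looks like in practice, something along these lines drives a single flex by name; the component layout and the blend shape name are placeholders, not our actual setup:

```csharp
// A minimal sketch of driving one blend shape from script.
using UnityEngine;

public class FlexDriver : MonoBehaviour
{
    [SerializeField] private SkinnedMeshRenderer faceRenderer;

    public void SetFlex(string blendShapeName, float weight)
    {
        // Blend shape indices are looked up by name on the shared mesh.
        int index = faceRenderer.sharedMesh.GetBlendShapeIndex(blendShapeName);
        if (index < 0)
        {
            return; // The mesh does not contain this blend shape.
        }

        // Weights are percentages: 0 = off, 100 = fully applied.
        faceRenderer.SetBlendShapeWeight(index, Mathf.Clamp(weight, 0f, 100f));
    }
}
```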
These flexes can overlap each other, and can be clamped to avoid unwanted results (a sketch of how that layering can work follows below). For example, we can control the eyelids, the tongue, or anything else on the face without needing a new model just for that purpose. Blend shapes are literally what the name says: shapes that blend between two states, which in our case means interpolating each vertex between two positions.
Note: the above screenshot shows an unoptimized mesh, which results in bland morph targets.
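Here is a minimal sketch of how overlapping flexes could be layered and clamped; accumulating weights per frame in a mixer like this is a simplification for the example, not necessarily how the final system is structured:

```csharp
// A minimal sketch of layering overlapping flexes with a clamp so combined
// weights never overshoot the useful range. Field names are assumptions.
using System.Collections.Generic;
using UnityEngine;

public class FlexMixer : MonoBehaviour
{
    [SerializeField] private SkinnedMeshRenderer faceRenderer;

    // Accumulated weights for the current frame, keyed by blend shape index.
    private readonly Dictionary<int, float> accumulated = new Dictionary<int, float>();

    public void AddFlex(string blendShapeName, float weight)
    {
        int index = faceRenderer.sharedMesh.GetBlendShapeIndex(blendShapeName);
        if (index < 0) return;

        accumulated.TryGetValue(index, out float current);
        accumulated[index] = current + weight;
    }

    private void LateUpdate()
    {
        // Apply everything at once, clamped so overlapping flexes (e.g. a
        // phoneme and a blink hitting the same shape) don't create artifacts.
        foreach (var pair in accumulated)
        {
            faceRenderer.SetBlendShapeWeight(pair.Key, Mathf.Clamp(pair.Value, 0f, 100f));
        }
        accumulated.Clear();
    }
}
```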
With these presets, any model can easily switch between blend shapes to produce the appropriate mouth movement and other facial animation based on the phoneme data attached to each sound file. As soon as a character talks, the system reads the lip-sync data and sets the appropriate flexes from those presets. The reason we went with presets is that it lets anyone create new voice-overs on top of the presets that already exist.
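A rough sketch of that playback loop could look like the following; the PhonemeCue format and the ApplyPreset hook are assumptions for the sake of the example, not our actual data pipeline:

```csharp
// A minimal sketch of reading timed phoneme data alongside an AudioSource and
// applying the matching preset as the clip plays.
using System.Collections.Generic;
using UnityEngine;

[System.Serializable]
public class PhonemeCue
{
    public float time;       // Seconds from the start of the clip.
    public string phoneme;   // Which preset to apply at that time.
}

public class LipSyncPlayer : MonoBehaviour
{
    [SerializeField] private AudioSource voiceSource;
    [SerializeField] private List<PhonemeCue> cues = new List<PhonemeCue>(); // Sorted by time.

    private int nextCue;

    private void Update()
    {
        if (!voiceSource.isPlaying) return;

        // Fire every cue whose timestamp has passed since the last frame.
        while (nextCue < cues.Count && voiceSource.time >= cues[nextCue].time)
        {
            ApplyPreset(cues[nextCue].phoneme);
            nextCue++;
        }
    }

    private void ApplyPreset(string phoneme)
    {
        // Placeholder: look up the preset and push its flex weights to the
        // SkinnedMeshRenderer, e.g. through something like the FlexMixer above.
        Debug.Log($"Applying preset for phoneme {phoneme}");
    }
}
```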
Because blend shapes have little performance cost, using them for facial expression gives a more realistic result than a facial rig alone: they are not tied to bones, but manipulate the mesh directly at the vertex level. To cut the cost even further, we can cache the mesh data before using it, so that runtime changes can do what they must without extra lookup overhead.
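For instance, the name-to-index lookup can be cached once at startup along these lines (again just a sketch with assumed names):

```csharp
// A minimal sketch of caching blend shape indices once so per-frame updates
// never have to search the mesh by name.
using System.Collections.Generic;
using UnityEngine;

public class FlexCache : MonoBehaviour
{
    [SerializeField] private SkinnedMeshRenderer faceRenderer;

    private readonly Dictionary<string, int> indexByName = new Dictionary<string, int>();

    private void Awake()
    {
        Mesh mesh = faceRenderer.sharedMesh;
        for (int i = 0; i < mesh.blendShapeCount; i++)
        {
            // GetBlendShapeName walks the mesh data, so we only do it once here.
            indexByName[mesh.GetBlendShapeName(i)] = i;
        }
    }

    public void SetFlex(string blendShapeName, float weight)
    {
        if (indexByName.TryGetValue(blendShapeName, out int index))
        {
            faceRenderer.SetBlendShapeWeight(index, weight);
        }
    }
}
```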
That’s it for now; we hope to bring a technical demo of the lip-sync test in Unity soon!


