Hey! Alexis here again, Technical Artist at Imaginary Spaces where I work on tools and demo content for cutting edge game engine features. This week’s blog is all about facial performance capture on a shoestring budget.
The hardest part of animation is the face and hands because our brains are wired to expect a lot of detail. Mocap used to require fancy suits and expensive cameras in a studio, often costing a cool $30k per day. Obviously, for the highest quality, that’s just what you need to do - but open-source projects, inexpensive Unity Asset Store packages and commodity hardware can get you a rough draft for next to nothing! That could help you make a goofy YouTube video or block out animation for better previz.
Facing the Problem
Can you have your facial animation cake and eat it too? Not really, honestly. Unless you're going for a really generic-looking bald guy, you're probably going to have to author your rig yourself and then spend a bunch of time animating it. While it's pretty easy in this day and age to get a bunch of humanoid biped animation off of sites like Mixamo and retarget them in your engine of choice, characters are often most discernible by their faces, and for that reason facial animation pipelines often vary highly per-project. All is not lost, however. By breaking things down into smaller tasks, it's possible to identify where we can save time and optimize our pipeline. So really, what do we want out of our facial animation system?
- With animation tools and facial recognition software alike, the specifics of a given face are often outlined as a series of points, known as facial landmarks, positionally related to each other. These often include upper and lower lips, eyebrows, etc.
- By comparing the offset of these points from a neutral face, it's possible to feed a bunch of data into various animation systems: How wide is my smile? Am I furrowing my eyebrows? All this information is often computed as either raw positional data or simple muscle group values called Action Units, which can then be sent right into the system of your choice - usually a rig composed of a mix of bones and blend shapes.
- Knowing what features you need animated in your face and with what amount of fidelity, you can then better plan your own facial rig "checklist" and not overengineer the system you eventually devise.
- After you create your rig and have it properly set up to author a whole bunch of animation, you then need to organize yourself to be able to churn out everything you need.
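To make the "offset from a neutral face" idea a bit more concrete, here's a minimal Python sketch that answers the "how wide is my smile?" question. The landmark names, the coordinates and the `max_stretch` calibration value are all made up for illustration; a real tracker would hand you its own points.

```python
import math

def distance(a, b):
    """Euclidean distance between two 2D points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

def smile_amount(neutral, current, max_stretch=0.25):
    """Return 0..1: how much wider the mouth is than in the neutral pose.

    Mouth width is divided by the distance between the eyes so the value
    does not change when the face moves closer to or away from the camera.
    max_stretch is an assumed calibration value: the relative widening
    that should count as a full smile.
    """
    def relative_width(points):
        eye_span = distance(points["eye_left"], points["eye_right"])
        return distance(points["mouth_left"], points["mouth_right"]) / eye_span

    widening = relative_width(current) / relative_width(neutral) - 1.0
    return min(1.0, max(0.0, widening / max_stretch))

# Hypothetical landmark positions in pixels.
neutral = {"eye_left": (30, 40), "eye_right": (70, 40),
           "mouth_left": (38, 80), "mouth_right": (62, 80)}
smiling = {"eye_left": (30, 40), "eye_right": (70, 40),
           "mouth_left": (36, 78), "mouth_right": (64, 78)}

print(smile_amount(neutral, neutral))
print(smile_amount(neutral, smiling))
```

The same pattern (measure, normalize by face size, compare to neutral, clamp) works for brows, jaw and pretty much anything else on your checklist.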
And this is where this article really comes in. Short of manually keying everything, what are the options when it comes to animating your face? Can you retarget the same animation onto different faces with palatable results? After spending a bunch of time authoring a pretty complicated rig, it only sounds fair that you could author at least a first pass of your animations without having to do everything by hand.
So What Now?
I put together an absurd little demo project showcasing a bunch of methods you can use to animate faces with various Open-Source tools foraged online or developed by yours truly. At most, you’ll need any old webcam, a Kinect or a bunch of coloured stickers from the dollar store!
Our demo features various denizens of a restroom freaking out because someone isn’t washing their hands after going to the loo. Who knew you could experiment while remaining topical? Anyhow - let’s head into it head-first!
Audio-Based Facial Animation
Holding off on more complicated methods for now, I decided to first explore a simple technique for facial animation: audio-based animation! Think about it:
- In many instances, the most important part of facial animation is getting the mouth to move properly when a character is talking - when working on something more stylized, everything else can pretty much be semi-random (eg. blinking).
- What happens when you talk? You output a mix of pitch and volume.
Based on that, here’s what I model in Maya and import into Unity: a super simple mouth composed of two blendshapes: one for pitch and one for volume!
After a little audio analysis trickery (thanks, aldonaletto!) we’re able to get pitch and volume estimates at runtime in-engine and drive the two blendshapes of our mouth model - now all that’s left is sticking everything in a Unity Timeline and we’re good to go! Worth noting: I used SmoothDamp to smooth out the quite noisy analysis results, which saved me from going back to Maya to denoise the animation.
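For the curious, here's roughly what that pipeline looks like, sketched in plain Python rather than the actual Unity script. A few loud assumptions: pitch is estimated by counting zero crossings (a cruder stand-in for the spectrum analysis used in the demo), the `Smoother` class is a critically damped smoother written in the spirit of SmoothDamp and not Unity's own code, and the input ranges passed to `to_weight` are arbitrary values picked for the example.

```python
import math

def rms_volume(samples):
    """Root-mean-square loudness of a block of audio samples (-1..1)."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def zero_crossing_pitch(samples, sample_rate):
    """Rough pitch estimate in Hz: a pure tone crosses zero twice per cycle."""
    crossings = sum(1 for a, b in zip(samples, samples[1:])
                    if (a < 0) != (b < 0))
    duration = (len(samples) - 1) / sample_rate
    return crossings / (2.0 * duration)

def to_weight(value, low, high):
    """Map value from [low, high] to a clamped 0..100 blendshape weight."""
    t = (value - low) / (high - low)
    return 100.0 * min(1.0, max(0.0, t))

class Smoother:
    """Critically damped spring that eases a value toward a moving target."""

    def __init__(self, smooth_time, value=0.0):
        self.smooth_time = smooth_time
        self.value = value
        self.velocity = 0.0

    def update(self, target, dt):
        omega = 2.0 / max(self.smooth_time, 1e-4)
        x = omega * dt
        # Polynomial approximation of exp(-x), cheap and stable.
        decay = 1.0 / (1.0 + x + 0.48 * x * x + 0.235 * x * x * x)
        change = self.value - target
        temp = (self.velocity + omega * change) * dt
        self.velocity = (self.velocity - omega * temp) * decay
        self.value = target + (change + temp) * decay
        return self.value

# Fake "microphone" input: 0.1 s of a 220 Hz tone at half amplitude.
SAMPLE_RATE = 44100
tone = [0.5 * math.sin(2 * math.pi * 220 * i / SAMPLE_RATE)
        for i in range(4410)]

volume_weight = to_weight(rms_volume(tone), 0.0, 0.5)
pitch_weight = to_weight(zero_crossing_pitch(tone, SAMPLE_RATE), 80.0, 400.0)

# Ease the mouth open over a few frames instead of snapping to the target.
mouth_open = Smoother(smooth_time=0.1)
for frame in range(5):
    print(round(mouth_open.update(volume_weight, dt=1 / 60), 2))
```

One smoother per blendshape is enough; a longer `smooth_time` gives a lazier mouth, a shorter one keeps more of the jitter.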
After a few seconds of watching our soap bar mumble uncomfortably it becomes obvious that this method won't cut it for every situation - specifically for parts of the face that don't move in relation to sound. Blinking here, for instance, is handled by a separate script that just does it semi-randomly. Ultimately it's as I wrote earlier: depending on your facial animation requirements you can get by with surprisingly little - for our soap this method will do, but anything more demanding will need a beefier approach.
To go further with audio-based mouth animation you’d need to do some phonetic analysis. Disney figured out a long time ago that just 12 phonemes are enough to credibly animate the mouth, as the rest of the variation in sounds we make is due to our vocal cords and tongue which aren’t normally visible. With a bunch more blendshapes, could a more fleshed-out audio-based animation system based on phoneme analysis be envisioned? Probably - and that sure sounds like a fun GitHub repo! For now you can pull the simple script present in the demo project and use it for your own purposes!
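Just to sketch what that phoneme-driven version might look like: assuming some external tool already gave you phonemes with start and end times (that part is the hard bit and is not shown), turning them into blendshape weights is mostly a lookup and a crossfade. The phoneme symbols, the mouth-shape names and the grouping below are placeholders I made up for the example, not the classic 12-shape chart.

```python
# Illustrative grouping only: symbols and shape names are placeholders.
PHONEME_TO_SHAPE = {
    "AA": "open", "AE": "open", "AH": "open",
    "IY": "wide", "EH": "wide",
    "UW": "round", "OW": "round",
    "M": "closed", "B": "closed", "P": "closed",
    "F": "lip_bite", "V": "lip_bite",
}

def shape_weights(timed_phonemes, time, blend=0.05):
    """Return {mouth shape: weight 0..100} for a given time in seconds.

    timed_phonemes is a list of (start, end, phoneme) tuples. Each shape
    ramps linearly across a window of `blend` seconds centred on its start
    and end, so two back-to-back phonemes crossfade into each other.
    """
    weights = {}
    for start, end, phoneme in timed_phonemes:
        shape = PHONEME_TO_SHAPE.get(phoneme)
        if shape is None:
            continue  # unknown phoneme: leave the mouth alone
        ramp = 0.5 + min(time - start, end - time) / blend
        weight = 100.0 * min(1.0, max(0.0, ramp))
        if weight > 0.0:
            weights[shape] = max(weights.get(shape, 0.0), weight)
    return weights

# Hypothetical timing for the word "ma".
timeline = [(0.0, 0.2, "M"), (0.2, 0.5, "AA")]

print(shape_weights(timeline, 0.1))  # fully inside the "M"
print(shape_weights(timeline, 0.2))  # halfway through the crossfade
```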
Alright, here’s what you’ll need for the second experiment: a webcam and a bunch of coloured stickers from the dollar store!
Before diving into anything more complicated I wanted to take a look at yet another, more traditional motion capture method: marker-based tracking! The idea of trying to cobble something of the sort together had been trotting through my head for a while now so I’m glad this little facial tracking bonanza is what pushed me to do it! As you will soon see, this experiment gave me a better appreciation for why professional setups are so expensive.
After an hour or two of looking at how webcam textures work in Unity I managed to cobble this little thing together: it finds points on the rendered webcam texture that match various preset coloured markers and automatically moves assigned in-scene game objects. Sounds useful?
Kinda. This is obviously an extremely rough version of what much more realized marker animation tools can do - but it’ll do for now. In our little demo scene, we’ll use this method to animate the faces of our electrical plugs - inside of Maya, we represent these faces with an incredibly simple rig - two bones for the eyes and one to deform the mouth. On the other hand, inside of Unity we record positional data off of three markers: two on my brows and one on my chin (a prospective fourth one unfortunately kept falling off my moustache).
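The core of a tracker like this is small enough to sketch in plain Python. This is not the Unity script from the demo, just the general idea under a few assumptions: the frame arrives as a flat, row-major list of RGB tuples, a marker is "found" by averaging the positions of every pixel close enough to its colour, and the `tolerance` value is arbitrary.

```python
def find_marker(pixels, width, height, marker_rgb, tolerance=40):
    """Locate a coloured marker in a frame.

    pixels: flat row-major list of (r, g, b) tuples with values 0..255.
    Returns the centre of all matching pixels as (x, y) in 0..1,
    or None when no pixel is close enough to marker_rgb.
    """
    target_r, target_g, target_b = marker_rgb
    sum_x = sum_y = count = 0
    for index, (r, g, b) in enumerate(pixels):
        if (abs(r - target_r) <= tolerance
                and abs(g - target_g) <= tolerance
                and abs(b - target_b) <= tolerance):
            sum_x += index % width
            sum_y += index // width
            count += 1
    if count == 0:
        return None
    # +0.5 so the result points at the centre of a pixel, not its corner.
    return ((sum_x / count + 0.5) / width,
            (sum_y / count + 0.5) / height)

# Tiny 4x4 test frame: grey background, red sticker in the top-right corner.
GREY, RED = (128, 128, 128), (250, 10, 10)
frame = [GREY, GREY, RED, RED,
         GREY, GREY, RED, RED,
         GREY, GREY, GREY, GREY,
         GREY, GREY, GREY, GREY]

print(find_marker(frame, 4, 4, (255, 0, 0)))
print(find_marker(frame, 4, 4, (0, 255, 0)))
```

The returned position can then be remapped to wherever the matching game object should sit. It also shows why this approach is fragile: anything else in the shot with a similar colour drags the average around.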
Recording in the engine, it ends up a little noisy and kind of humiliating. To fix this in post, we can refine the keys in Maya afterwards. If you’re wondering how I got an FBX out of it, I used the FBX Exporter/Recorder that we developed at Imaginary Spaces in partnership with Unity!
After some denoising in Maya the animation becomes a little more palatable. Overall, I probably could have keyed the animation by hand faster than this mocap setup let me animate. The team would need to put in a lot more work to get marker-based methods solid (and I'd need to shave my moustache or make markers that clip on to facial hair). At that point we probably should have just bought a professional kit - so how about we look at more complex methods that might be a little more serviceable?
Kinect Face Tracking API
Ever since I experimented with Kinect Motion Capture the thing has been hanging out on my desk. Incredibly cheap and easy to find online for a hundred bucks, in your local classifieds for thirty or in a shoebox at the back of your closet for free, the Kinect can’t seem to fade into oblivion and remains useful literally a decade after its release. The Kinect can output two 640×480 textures:
- A run-of-the-mill color map, a.k.a. what you would get out of a generic webcam.
- A surprisingly good depth map!
With the depth map (and the robust associated Kinect SDK codebase) comes a trove of good things like the ability to create world-scale geometry, animate full humanoid characters and most importantly for us, detect and track a myriad of facial landmarks at runtime!
With a bunch of Open-Source Unity integrations available to us, I ended up picking what seemed like the best one and went off into the deep end. With a complex bone-based mouth and eyebrows, the last character of the sample project is definitely the most complex one.
The Kinect face tracking API takes various features of your face and converts them to a list of something called Action Units, values essentially representing various muscle groups of the face. With names like the Inner Brow Raiser, the Lip Corner Puller and the Dimpler, these action units can be scaled and combined to create various expressions, a little like what I was talking about earlier with the 12 phonemes needed for mouth animation. Taking in these values that are conveniently represented by 0-1 sliders, it’s easy for users to pass them onto various parts of their rig, either as blend values for various blendshapes or raw bone positions!
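Here's a small sketch of that last step, passing 0-1 values onto a rig, written in plain Python instead of Unity code. Everything named here is hypothetical: the action unit keys, the blendshape and bone names, and the rest and raised bone positions are stand-ins for whatever your own rig uses. The 0 to 100 blendshape range is the usual Unity convention.

```python
def lerp(a, b, t):
    """Linear interpolation between a and b."""
    return a + (b - a) * t

# Hypothetical rig description.
BLENDSHAPE_FOR_AU = {
    "lip_corner_puller": "smile",
    "dimpler": "dimple",
}
BONE_RANGE_FOR_AU = {
    # action unit: (bone name, rest position, fully raised position)
    "inner_brow_raiser": ("brow_inner", (0.0, 0.0), (0.0, 1.5)),
}

def apply_action_units(action_units):
    """Turn {action unit: 0..1} into blendshape weights and bone positions."""
    blendshapes = {}
    bones = {}
    for name, value in action_units.items():
        value = min(1.0, max(0.0, value))  # trackers can overshoot a little
        if name in BLENDSHAPE_FOR_AU:
            blendshapes[BLENDSHAPE_FOR_AU[name]] = value * 100.0
        if name in BONE_RANGE_FOR_AU:
            bone, rest, raised = BONE_RANGE_FOR_AU[name]
            bones[bone] = tuple(lerp(r, f, value)
                                for r, f in zip(rest, raised))
    return blendshapes, bones

tracked = {"lip_corner_puller": 0.5, "dimpler": 1.4, "inner_brow_raiser": 1.0}
print(apply_action_units(tracked))
```

Keeping the mapping in data rather than code is what makes the same tracked values reusable across characters: a new face only needs a new pair of tables.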
Taking the system for a test-drive, it performs surprisingly well - if a little stiff in some parts of the face like the eyelids and jaw. Always impressive to see how eyebrows add so much personality to a character! Anyways - ready to capture my grimaces, I again use the Unity FBX recorder functionality and bake out the toothpaste character animation into something I can send to the Unity timeline.
Depending on the system you authored and whether you’re recording animation in-engine, it helps to remember that you can always combine various techniques for your final product. Not being able to get eyelid tracking to work properly, I had to animate the eyes via script - and that’s fine! Again, refer to the facial feature requirement checklist you planned out before freaking out because you can’t get millimeter-close accuracy for some facial expressions.
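Scripted eyes really don't need much. As a rough illustration (plain Python, not the script from the demo), here's one way to schedule semi-random blinks and turn them into an eyelid value. The gap range and blink length are guesses that happen to look okay, nothing more.

```python
import random

def blink_times(duration, min_gap=2.0, max_gap=6.0, seed=None):
    """Pick blink start times (seconds) with a random gap between blinks."""
    rng = random.Random(seed)
    times = []
    t = rng.uniform(min_gap, max_gap)
    while t < duration:
        times.append(t)
        t += rng.uniform(min_gap, max_gap)
    return times

def eyelid_closed(time, blinks, blink_length=0.15):
    """Return 0..1 eyelid closure: a quick close-then-open triangle."""
    for start in blinks:
        phase = (time - start) / blink_length
        if 0.0 <= phase <= 1.0:
            return 1.0 - abs(2.0 * phase - 1.0)
    return 0.0

# Passing a seed makes the "random" blinks repeatable between takes.
blinks = blink_times(30.0, seed=1)
print(len(blinks), "blinks in 30 seconds")
print(eyelid_closed(blinks[0] + 0.075, blinks))  # middle of the first blink
```

The eyelid value can drive a blendshape or a bone exactly like any tracked value would, so the scripted and captured parts of the face end up in the same recording.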
Ultimately, it still feels a little weird to me to use a ten-year-old piece of hardware for a decidedly still thriving field of experimentation. With the device out of production for years now, what happens in a few years when they all get too old to work properly with newer hardware? I needed to dive deeper into more modern facial motion capture processes to get a good grip of what was possible in 2020.
EmguCV / OpenCV
Complexifying things a little, let's head into the next experiment: OpenCV, a much more comprehensive (and open-source) computer vision solution. From the horse’s mouth:
OpenCV (Open Source Computer Vision Library) is an open source computer vision and machine learning software library. OpenCV was built to provide a common infrastructure for computer vision applications and to accelerate the use of machine perception in commercial products.
OpenCV is the workhorse underlying a lot of software with facial recognition like TikTok and Snapchat filters. It’s free and used by a surprising number of companies for all sorts of computer vision purposes - but it’s not easy to use on its own. For our purposes, we’re using its (also Open-Source) C# wrapper EmguCV which adds a fun layer of complexity!
Out of the box, OpenCV has an extensive face recognition module that can recognize sixty-eight crucial points on the face known as facial landmarks. Similarly to the mouth animation phonemes thing, research has shown that with these landmarks it’s possible to infer the large majority of human facial expression. Looking at how the landmarks are organized, it’s interesting to see how we can basically cherry-pick relevant landmarks and create a more stable version of our earlier marker-based animator... I mean, that's how most face tracking software does it! From the Kinect face tracking API to dlib, most facial recognition or animation tools will first convert your face into a series of landmarks and then either use that raw data with its own system or convert it to a more interoperable representation - like action units!
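Cherry-picking landmarks can be as simple as grabbing a few indices out of the list. The sketch below assumes the widely used 68-point ordering (0-based: outer eye corners at 36 and 45, mouth corners at 48 and 54, inner lip centres at 62 and 66), so double-check what your model actually outputs before trusting these numbers. The landmark coordinates are invented for the example.

```python
import math

# Indices assume the common 68-point layout, 0-based.
EYE_OUTER_A, EYE_OUTER_B = 36, 45
MOUTH_CORNER_A, MOUTH_CORNER_B = 48, 54
INNER_LIP_TOP, INNER_LIP_BOTTOM = 62, 66

def distance(a, b):
    """Euclidean distance between two 2D points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

def mouth_features(landmarks):
    """Return mouth openness and width relative to the eye-to-eye distance.

    landmarks: list of 68 (x, y) points. Dividing by the distance between
    the outer eye corners keeps the values stable when the face moves
    toward or away from the camera.
    """
    face_scale = distance(landmarks[EYE_OUTER_A], landmarks[EYE_OUTER_B])
    return {
        "open": distance(landmarks[INNER_LIP_TOP],
                         landmarks[INNER_LIP_BOTTOM]) / face_scale,
        "width": distance(landmarks[MOUTH_CORNER_A],
                          landmarks[MOUTH_CORNER_B]) / face_scale,
    }

# Made-up frame: only the landmarks we care about are filled in.
points = [(0.0, 0.0)] * 68
points[36], points[45] = (100.0, 100.0), (200.0, 100.0)
points[48], points[54] = (120.0, 190.0), (180.0, 190.0)
points[62], points[66] = (150.0, 180.0), (150.0, 200.0)

print(mouth_features(points))
```

Six points out of sixty-eight already give you a usable mouth, which is about where the sticker experiment was trying to get to.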
The fact that this is available to us is pretty impressive, especially considering that we can just use any old webcam out of the box to test-drive it.
Anyways, the stock EmguCV Unity release unfortunately had no face landmark recognition sample whatsoever and thus I had to fight with the thing for a few hours to get it to work (thanks, StackOverflow and Benoit)! This is where I eventually ended up:
Success! It works beautifully and while it’s still a bit noisy, I get excellent landmark tracking with only a cheap webcam and close to no initial setup. In spite of the great tracking, however, I had only allotted so much time to explore all possible facial tracking options for this article. I decided that while an EmguCV-based facial animator for Unity looked like a very promising option, it would definitely need more involved work to turn what I had from a fun tech demo into an actual production tool.
Until a future release, here’s the repository if you want to mess with the tool I developed for the demo. For now you can use it to smoothly detect and track facial landmarks with just a webcam, so for the sake of the demo I used it to make a hairbrush freak out!
OpenFace FACS Animation
After messing around with multiple options it becomes pretty clear to me that Action Units are the way to go if you want to author transposable facial animation. Since multiple face tracking frameworks already use them in some way or another, adopting them as a tool to universalize facial expressions sounds like the sensible way to go, considering that humanoid animation retargeting basically works the same way.
Looking a bit more into it, I learned that Action Units are a component of the Facial Action Coding System (FACS), a system to taxonomize human facial movements originally devised in 1978. As far as I know human faces haven’t changed all that much since then so we’re probably still good to use this system for our facial animation. So how do we crunch facial landmarks into FACS Action Units? There’s surely some software to do that for us already.
In comes OpenFace. A brainchild of the School of Computer Science at Carnegie Mellon University, OpenFace is a nifty little piece of software that, leveraging deep learning, allows you to visualize, record and stream facial landmarks, FACS data and other parameters like head orientation and gaze straight from a webcam! A few minutes of use made something clear: it’s basically everything I want or need in one lightweight app.
All I needed then was a way to stream that data into Unity. As usual, some beautiful people took care of that already and all that was needed then was to take the slightly dated repo and repurpose it for our needs. At the end of it, Unity now automatically ingests data streamed from an instance of OpenFace running alongside the editor and converts it at runtime to FACS compliant blendshape data. Handy!
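The conversion itself is the easy part. As an illustration, the Python sketch below reads OpenFace-style data from recorded CSV text rather than from the live stream used in the demo, and scales each action unit to a 0 to 100 blendshape weight. It assumes the usual OpenFace column naming (`AU12_r` for intensity on a 0 to 5 scale, `AU12_c` for presence, `success` for whether the face was tracked); the sample rows are invented.

```python
import csv
import io

def au_weights(csv_text):
    """Yield one {action unit: 0..100 weight} dict per tracked frame."""
    # OpenFace puts a space after each comma, hence skipinitialspace.
    reader = csv.DictReader(io.StringIO(csv_text), skipinitialspace=True)
    for row in reader:
        if row.get("success", "1").strip() != "1":
            continue  # tracking failed on this frame, skip it
        yield {
            name[:-2]: min(100.0, max(0.0, float(value) / 5.0 * 100.0))
            for name, value in row.items()
            if name.startswith("AU") and name.endswith("_r")
        }

# Invented sample in the OpenFace style, trimmed to a few columns.
sample = "\n".join([
    "frame, success, AU01_r, AU12_r, AU12_c",
    "1, 1, 0.00, 2.50, 1.00",
    "2, 0, 0.00, 0.00, 0.00",
    "3, 1, 1.25, 6.00, 1.00",
])

for weights in au_weights(sample):
    print(weights)
```

From there each action unit name just needs a matching blendshape on the rig, which is exactly why authoring blendshapes per action unit pays off.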
I was running out of things to tack faces onto in my scene so I decided to make my life a little easier and do some comparison work. I took essentially the same setup I had for the toothpaste character, authored a bunch of blendshapes on top of it that match various action units and started testing.
This is great! With only a cheap webcam, the end result is a surprisingly solid transfer of OpenFace FACS data to the Unity facial rig I authored, beating in performance and quality all other options I tried. I had to do a quite significant rewrite of the code to unbind it from the original demo content and to make it more multi-purpose - so I’m glad it ended up working so well after all.
That about concludes our facial capture experiments. I’m sure that we’ll be seeing more and more interesting computer vision tools as time goes by, considering that technologies like machine learning are getting more accessible. I’m definitely interested in going back and either wrapping up the EmguCV animation tool or repackaging the OpenFace/FACS integration to be more artist friendly. With that, anyone will be able to bind their rig animator values onto this simple system and we might finally get somewhere with transposable facial animation!
A Little Extra Hand-Waving
Now that we’re done with the facial animation, let’s add hands that shake off water in the sink. For that I had the idea to use the new Oculus Quest hand tracking feature that came out semi-recently. After a few minutes of setting up a new project using my template project, I just exported some gesticulating into an Alembic file, optimized it in Maya, reimported it back into Unity and that was it! It was surprisingly simple.
After the full integration process, I can safely say that while it won’t cut it for every use case, this simple Oculus-To-Alembic process is more than serviceable for cartoony stuff like this demo!
With that done, all elements of our scene are animated and after a little polish and audio work the demo is complete!
You can find the full sample project with code here. Happy grimacing!