For a final project in a course on embedded systems and microcontrollers, my buddy Jason and I decided to make a target-tracking, voice-activated camera. The idea was to create a tool that empowers home filmmakers and photographers to get the perfect shot without adjusting ten screws on a tripod or awkwardly waiting for self-timers. Instead, the device would automatically track the target, make fine adjustments based on verbal commands, and take the photo on a verbal cue. Per tech tradition, we named our device by picking an arbitrary word and spelling it wrong: Nibl.
Nibl sees through a webcam mounted on two servo motors. The bottom servo sits below the custom mount and can spin 360° to point the camera in any direction. The upper servo, connected directly to the webcam, points the camera up and down.
The boards, camera, battery, and servos are all mounted on a 3D-printed frame.
To pull off voice recognition, we needed a microcontroller powerful enough to take thousands of microphone samples per second and then perform a pile of linear algebra to classify each command. We picked the BeagleBone® Blue for this purpose, which is essentially a Linux computer that fits in your palm. Because we had to do so much math on the voice signals, we chose to program in Python. That came at a cost, however: fetching an analog-to-digital sample from the microphone input in Python takes a long detour through the operating system. As a result, between recording and classifying, the BeagleBone had no time left for any of the servo control we needed. So we connected a second microcontroller, the mbed, via UART and made it handle the PID servo control.
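The split left the mbed with a classic job: one PID loop per servo. As a rough sketch of what a single PID update involves (the class shape, gains, and timestep below are illustrative, not our actual firmware):

```python
# Sketch of one discrete PID update of the kind the mbed ran per servo.
# Gains (kp, ki, kd) and timestep dt are illustrative, not our tuned values.
class PID:
    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, setpoint, measured):
        error = setpoint - measured                       # distance from target
        self.integral += error * self.dt                  # accumulated error (I term)
        derivative = (error - self.prev_error) / self.dt  # rate of change (D term)
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative
```

Each iteration, the output nudges the servo toward its setpoint; offloading this timing-sensitive loop is what freed the BeagleBone to stay busy with audio.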
Jason and I had to lift our shirts up during a preliminary demo in order for a heat-seeking prototype to work at close range. Despite the sexy-factor, the teaching staff didn’t love that. So, we switched to processing camera images on a visual cue. We bought two bright, uniquely colored gift bags at CVS. We nicknamed them Cosmo and Wanda. From some pictures of Cosmo and Wanda, we figured out which ratios of RGB values are characteristic of each color. Using that information, we processed each image to find the pixels that looked like Cosmo or Wanda and computed their “center of mass.”
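The “center of mass” trick is simple enough to sketch in plain Python. The `looks_red` threshold below is made up for illustration, not our calibrated ratio for Cosmo or Wanda:

```python
# Flag pixels whose RGB ratios match a target color, then average the
# coordinates of the flagged pixels to get the color's "center of mass."
def center_of_mass(image, is_target):
    xs, ys, n = 0.0, 0.0, 0
    for y, row in enumerate(image):
        for x, (r, g, b) in enumerate(row):
            if is_target(r, g, b):
                xs += x
                ys += y
                n += 1
    if n == 0:
        return None            # target color not in frame
    return (xs / n, ys / n)    # (mean column, mean row)

# Illustrative ratio test for a strongly red target (threshold is made up):
def looks_red(r, g, b):
    total = (r + g + b) or 1   # avoid dividing by zero on black pixels
    return r / total > 0.5
```

Using ratios rather than raw RGB values makes the test somewhat robust to overall brightness changes, which is why it held up across the room.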
The algorithm works pretty nicely!
Mel Frequency Cepstral Coefficients
The naïve way to try voice recognition would be to save a bunch of Fourier transform spectra of various commands, compare the spectrum of each input to those labeled spectra, and pick the winner. However, this method would be very specific to one voice and would require a big chunk of storage.
We scoured the internet for ways to do lightweight voice recognition. What we settled on is called MFCC: Mel Frequency Cepstral Coefficients (cepstrum: like spectrum, but with the ‘spec’ backwards). MFCC is a way to break down a signal into coefficients that represent how much of the signal’s energy resides in different frequency ranges (designed to mimic the human cochlea!). The procedure is comically involved. Here are the SparkNotes:
- Filter the input signal
- Multiply it by a cosine-shaped window
- Take the discrete Fourier transform (DFT) of the windowed signal
- Calculate the power periodogram of the spectrum
- Make a series of 26 triangular filters based on Mel frequencies (which correspond to sensitivity of the human ear), then convert back to the normal frequency domain
- See how much energy lies in each of those triangular filters
- Take the log of all those energies
- Then take the DCT of all those energies
- Just throw out the last 14
- Finally, we have just 12 numbers to represent our signal
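For the curious, the whole pipeline above condenses to a short NumPy sketch. The specific parameters (0.97 pre-emphasis coefficient, 512-point FFT, the standard mel-scale conversion formulas) are common textbook choices, not necessarily the exact values we used:

```python
import numpy as np

def mfcc(frame, sr=6000, n_filters=26, n_keep=12):
    # 1) Pre-emphasis filter to boost high frequencies
    emph = np.append(frame[0], frame[1:] - 0.97 * frame[:-1])
    # 2) Cosine-shaped (Hamming) window
    windowed = emph * np.hamming(len(emph))
    # 3-4) Power periodogram of the DFT
    nfft = 512
    power = np.abs(np.fft.rfft(windowed, nfft)) ** 2 / nfft
    # 5) Triangular filters spaced evenly on the mel scale
    def hz_to_mel(f): return 2595 * np.log10(1 + f / 700)
    def mel_to_hz(m): return 700 * (10 ** (m / 2595) - 1)
    mels = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((nfft + 1) * mel_to_hz(mels) / sr).astype(int)
    fbank = np.zeros((n_filters, nfft // 2 + 1))
    for i in range(1, n_filters + 1):
        lo, c, hi = bins[i - 1], bins[i], bins[i + 1]
        fbank[i - 1, lo:c] = (np.arange(lo, c) - lo) / max(c - lo, 1)
        fbank[i - 1, c:hi] = (hi - np.arange(c, hi)) / max(hi - c, 1)
    # 6-7) Log of the energy caught by each filter
    energies = np.log(fbank @ power + 1e-10)
    # 8-9) DCT of the log energies; keep only the first 12 coefficients
    n = np.arange(n_filters)
    dct = np.array([np.sum(energies * np.cos(np.pi * k * (2 * n + 1) / (2 * n_filters)))
                    for k in range(n_filters)])
    return dct[:n_keep]
```

One second of audio in, twelve numbers out.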
Why go through all this trouble? I want to reemphasize the last point. If we sample at 6000 Hz, and take a 1 second window, then our input signal is represented with 6000 numbers. MFCC reduces the dimensionality of this classification problem from 6000 to 12. Wow! Check out the different characteristic peaks in MFCC plots of two commands.
Now that we have these 12 numbers, how do we identify which command was recorded? From scratch, Jason and I implemented a classification algorithm called quadratic discriminant analysis (QDA). The essence of QDA is partitioning data in its input basis with conic sections… or quadric surfaces, their analogue in 12 dimensions. To perform QDA, all we needed to store on the microcontroller was a mean vector and a 12×12 covariance matrix for each output class. Way better than storing dozens of Fourier transform vectors of length 6000!
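Here’s a minimal sketch of that store-and-score scheme, shown on toy 2-D data instead of our 12-dimensional MFCC vectors (the function names and the small regularization term are illustrative, not our implementation):

```python
import numpy as np

# Fit: per class, store only a mean vector and a covariance matrix.
def qda_fit(samples_by_class):
    params = {}
    for label, X in samples_by_class.items():
        X = np.asarray(X, dtype=float)
        mu = X.mean(axis=0)
        # Tiny ridge keeps the covariance invertible on small sample sets
        cov = np.cov(X, rowvar=False) + 1e-6 * np.eye(X.shape[1])
        params[label] = (mu, np.linalg.inv(cov), np.linalg.slogdet(cov)[1])
    return params

# Predict: score each class by Gaussian log-likelihood, pick the winner.
def qda_predict(params, x):
    x = np.asarray(x, dtype=float)
    best, best_score = None, -np.inf
    for label, (mu, cov_inv, logdet) in params.items():
        d = x - mu
        score = -0.5 * (logdet + d @ cov_inv @ d)  # log-likelihood up to a constant
        if score > best_score:
            best, best_score = label, score
    return best
```

The quadratic term `d @ cov_inv @ d` is what carves the input space into those conic-section regions.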
After a night of coding, we finally ran a preliminary test of QDA on MFCCs at around six in the morning. We didn’t believe the results at first. We had recorded a data set of 80 samples: 3 different commands recorded 20 times each, plus 20 samples of background noise. We used 3/4 of it for training and 1/4 for validation. QDA correctly identified every held-out sample. We almost cried.
In our final device, Nibl’s job was to recognize commands for moving left, right, up, down, and snapping a photo. To avoid similar sounding words, we opted for “Leff,” “Arrr,” “Go up,” “Submerge,” and “Hit Me,” respectively.
I’ll never forget the rush of the live-demo after two weeks of sleep-deprivation. Our PowerPoint was like poetry and Nibl was the best-behaved little robot. Our professor raised his hand and said, “Can I ask a non-technical question? Why are you two always having so much fun?”
Check out the video!