The button stays beautiful
A growing range (tens rather than hundreds) of high-end TV sets are likely to incorporate gesture- or voice-based controls in 2013, Deloitte predicts1. However, while the vast majority of consumers purchasing a TV set with gesture or voice capability will try out the functionality, more than 99 percent of those who do may, in the medium term, revert to a standard remote control because of the unreliability, impracticality or physical effort of using voice or gesture control technologies.
Manufacturers offer gesture and voice recognition for two main reasons. First, vendors need to differentiate their offerings, and the user interface is a key differentiator. Second, and related to this, it has become economically feasible: the cost of providing gesture and voice recognition is constantly falling, thanks to Moore’s Law.
Gesture and voice recognition work on similar principles: sensors detect arm movement or a viewer’s voice, and computing hardware and software then translate that input into a command to the TV. The devices compare motions or noises to a database. The larger the database, the quicker and more accurate recognition can be2. Processors get steadily faster, and memory gets ever bigger at the same price point. Moore’s Law matters particularly for gesture control, because movement is much more difficult for a computer to interpret than voice.
The computational challenge of voice and gesture recognition
Digital computers are optimized for precise and fast numerical calculations. Numbers and text are easiest for computers to process: they are 100 percent deterministic. Phrases and sentences are slightly less so. Next in the hierarchy of data is sound, including voice. Images are even less deterministic, and video is a whole other story because of motion and time. Getting a device to understand that one person wants to change the channel, while someone else in the room wants to pet the cat or the cat is chasing a fly, and that none of this is an instruction to raise the volume, is far more complex. It will most likely happen in time, but not imminently.
The challenge is algorithmic. Animal brains are constructed as neural networks, which are weaker than computers at precise numerical calculations but specialized in mapping something to ‘just like’. This is largely a consequence of avoiding predation: humans don’t have to see an entire lion or bear in a specific pose to know to run away. A machine optimized for arithmetic does less well with ‘fuzzy’ conclusions.
To give an idea of the scale of the gesture control challenge: computationally and algorithmically, problems generally scale in complexity much faster than data bandwidth. If video occupies 50 times more bandwidth than voice, significantly more than 50 times as much computing power will be required for video recognition as for voice recognition.
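The scaling argument can be made concrete with a small sketch. It assumes, purely for illustration, that recognition cost grows as a power of input bandwidth with an exponent above 1; the text says only that complexity grows faster than bandwidth, so the power-law model, the exponent values and the function name relative_compute are all hypothetical.

```python
def relative_compute(bandwidth_ratio: float, exponent: float) -> float:
    """Compute cost of the heavier task relative to the lighter one,
    under the assumed model cost ~ bandwidth ** exponent."""
    if bandwidth_ratio <= 0:
        raise ValueError("bandwidth_ratio must be positive")
    return bandwidth_ratio ** exponent


if __name__ == "__main__":
    ratio = 50  # video vs. voice bandwidth, the figure used in the text
    # Illustrative exponents only; none of these is a measured value.
    for exponent in (1.0, 1.5, 2.0):
        cost = relative_compute(ratio, exponent)
        print(f"exponent {exponent}: about {cost:,.0f}x the computing power")
```

Under this assumed model, any exponent above 1 turns the 50-fold bandwidth gap into a larger compute gap; an exponent of 2, for instance, would imply a 2,500-fold gap.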
The appeal to vendors of gesture and voice control is likely to increase over time, particularly as devices become more complex and the range of functionality accessible via a television set or home computer rises.
Standard remote controls, when used with the latest multi-function TVs, may oblige the viewer to navigate through multiple screens of electronic programming guides (EPGs) to get to the intended channel, or through numerous menus to access the desired function. Finding a specific program in a large library is even more cumbersome with a standard remote control.
Gesture control could also be used to interface with the television, for example allowing children to interact with educational programs, much in the same way as games console vendors have incorporated motion detectors for games play.
One weakness of the remote control is that it is easily misplaced, usually to be found under the sofa or behind a cushion. Voice and gesture, meanwhile, are always at hand.
It seems probable that in 2013, and most likely for many years to come, the remote control will retain majority (and often absolute) control of the television set, even if gesture and voice control are used and are successful in other areas of the living room. Why very few TV sets, including high-end models, will be controlled by voice or gesture comes down to three simple but fundamental factors: how most TV sets are likely to be used, accuracy and practicality.
It is easy to predict that at the same price point the 2013 model of a given TV set will boast an enhanced level of functionality versus the 2012 version. Most models of technological devices, from cars to irons, are improved each year through the addition of new features. This generally helps sell the latest model. But usage patterns change remarkably little. Deloitte’s expectation is that in 2013 the majority of TV sets sold, or used in living rooms, will be predominantly employed to watch television programs and movies. They will not primarily be used to browse the Internet, play app-based games or listen to music3.
Deloitte’s expectation is that the most commonly used applications for TV set controls will therefore be to change volume and channel, and that the median frequency of usage of the remote control will be dozens of times per hour, with the frequency changing in proportion to the quantity of ad breaks in the channel being watched. While TV remote controls typically have dozens of buttons, just four of these should be sufficient to provide the majority of control required. A standard remote control, with buttons ergonomically positioned to enable easy, accurate control of volume and channel, does the job. And not just a reasonable job; it almost never fails. A modern, standard remote control is 99.999 percent accurate4. If remote controls were not that accurate, we would be less inclined to moderate volume or change channel. In households with digital video recorders (DVRs), we would likely pause live TV less often and record fewer programs.
Deloitte estimates5 that in 2013 the rate of false positives or negatives for gesture control on televisions or other devices will be about 10 percent. That error rate is roughly four orders of magnitude, or ten thousand times, greater than that of a traditional remote control. Our view is that most consumers would not tolerate this level of inaccuracy for long. They would quickly go back to the standard, button-based remote control.
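The comparison follows directly from the two figures quoted: 99.999 percent accuracy for the standard remote control and an error rate of about 10 percent for gesture control. The sketch below reproduces the arithmetic using exact fractions; the figure of 24 commands per hour is a hypothetical stand-in for "dozens of times per hour", and the function names are illustrative.

```python
from fractions import Fraction
import math

REMOTE_ACCURACY = Fraction(99999, 100000)  # 99.999 percent, from the text
GESTURE_ERROR = Fraction(10, 100)          # about 10 percent, from the text


def error_ratio(gesture_error: Fraction, remote_accuracy: Fraction) -> Fraction:
    """How many times more often gesture control fails than the remote."""
    remote_error = 1 - remote_accuracy
    return gesture_error / remote_error


def expected_errors_per_hour(error_rate: Fraction, commands_per_hour: int) -> float:
    """Average number of failed commands in one hour of viewing."""
    return float(error_rate * commands_per_hour)


if __name__ == "__main__":
    ratio = error_ratio(GESTURE_ERROR, REMOTE_ACCURACY)
    print(f"gesture fails {ratio} times as often "
          f"({math.log10(ratio):.0f} orders of magnitude)")

    commands = 24  # hypothetical value for "dozens of times per hour"
    print("gesture errors per hour:",
          expected_errors_per_hour(GESTURE_ERROR, commands))
    print("remote errors per hour:",
          expected_errors_per_hour(1 - REMOTE_ACCURACY, commands))
```

At the assumed 24 commands per hour, a 10 percent error rate means a failed command every 25 minutes or so on average, whereas the remote control would fail roughly once in several thousand hours of viewing.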
Voice control can be far more accurate: as discussed earlier, interpreting spoken commands is a lesser computational task where large databases of a language exist. However, to reduce the incidence of false positives, where a fragment of a conversation is mistakenly interpreted as a command, the viewer may first need to speak a control phrase: a sequence of words that would not occur in normal conversation and that alerts the TV to listen out for a command. This would work well on an occasional basis, but not dozens of times an hour.
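The control-phrase idea can be sketched as a simple gate. This is a minimal illustration, not a description of any vendor's implementation: it assumes speech has already been transcribed to text, and the control phrase, the command list and the function name are all hypothetical.

```python
CONTROL_PHRASE = "hello television"  # hypothetical control phrase
COMMANDS = {"volume up", "volume down", "channel up", "channel down"}


def accepted_commands(utterances):
    """Return the commands a gated listener would act on.

    A command counts only when the immediately preceding utterance was
    the control phrase; everything else is treated as conversation.
    """
    accepted = []
    armed = False
    for utterance in utterances:
        text = utterance.strip().lower()
        if text == CONTROL_PHRASE:
            armed = True  # open the gate for the next utterance
            continue
        if armed and text in COMMANDS:
            accepted.append(text)
        armed = False  # the gate closes after one utterance
    return accepted


if __name__ == "__main__":
    heard = ["volume up", "Hello television", "channel down", "volume down"]
    print(accepted_commands(heard))
```

Only the command that directly follows the control phrase is acted on, which suppresses false positives from conversation but doubles the number of utterances needed for every command, the burden the paragraph above describes.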
Over time, gesture control and voice control will become increasingly accurate. The efficacy of gesture control in dimly lit rooms should steadily improve, and the need for viewers to be at a specific distance or angle from the TV set should lessen6. Gesture and voice may become the fastest way to access specialized functionality on a video-on-demand menu. But if gesture and voice are to be dedicated to specialized or rare tasks, the next challenge will be to train users to memorize specific commands or movements for each of the potentially hundreds of functions a modern TV set offers. Users may find it easier just to scroll through the menu.
Gesture control, like voice control, is not impossible. But in 2013 it may be a hard and possibly overly expensive challenge to solve. In addition, some might argue it is a problem that does not require an urgent solution. The television set should evolve constantly, but in 2013 the improvements and innovations that consumers may be most willing to pay for might relate to other aspects of the TV set, such as size, weight, depth, bezel, picture quality, sound or value for money.
Every improvement to a television adds cost. TV set vendors, and any other vendor considering incorporating voice or gesture control in its device, should carefully cost the impact that accurate gesture and voice recognition would have on a set’s bill of materials. Accurate gesture recognition that works in dimly lit conditions may require additional processing capability, new cameras and other sensors in the television. This could add tens of dollars to the cost of components. Customers may prefer a larger screen to gesture recognition. The incremental cost of the components required for gesture recognition may mean that only high-end sets, the price of which may absorb the cost of additional materials, will offer this functionality.
Gesture and voice control are excellent technologies, but they are useful only in the appropriate context. Voice recognition to control functionality, such as calling a single number from a list of hundreds, works well in cars because drivers’ hands are occupied with the wheel or gear shift. But on the couch at home viewers’ hands are typically free, and the standard remote control does the job just fine.
2 For more information on the mechanics of voice recognition, Source: How Speech Recognition Works, HowStuffWorks, 2011. See: http://electronics.howstuffworks.com/gadgets/high-tech-gadgets/speech-recognition.htm. For information on how Kinect works, Source: How Motion Detection Works in Xbox Kinect, WIRED, 3 November 2010. See: http://www.wired.com/gadgetlab/2010/11/tonights-release-xbox-kinect-how-does-it-work/all/
3 A minority of TVs, often in bedrooms, are likely to be used extensively to play console-based games.
4 The first remote controls used audio recognition as an input. The challenge with this approach, as is the case now, was false positives and negatives. Some of the very first remote controls used very small hammers hitting very small tuning forks. Their tones were then picked up by a microphone on the TV set. Manufacturers carefully selected frequencies outside the range of human voices and most common household sounds, but failed to factor in other household noises, from doorbells to dog collars.
5 For more analysis on gesture control, Source: DH Jung, UC Berkeley School of Information, 2012. See: http://people.ischool.berkeley.edu/~donghyuk-jung/?page_id=161
6 Source: Samsung Smart TV Voice, Gesture and Face Recognition Hands-on, SlashGear, 24 May 2012. See: http://www.slashgear.com/samsung-smart-tv-voice-gesture-and-face-recognition-hands-on-24229664/