Reality-based MT?

Here is an interesting article on the making of the “Matrix” sequel, focusing on the CG techniques used.

In one big fight scene, Keanu Reeves (who I occasionally see working out at the gym I go to, looking a bit smaller than he seems on the big screen) has to battle 100 clones of the bad guy. The approach they used, essentially, was to generate the basic mechanics and kinetics of the scene using traditional CG modeling algorithms (based on motion capture data), but then to generate the finished scene by “painting” or “molding” actual pictures of the actors’ faces onto the CG surfaces. This creates a much more realistic result, with much less effort, than trying to mathematically model the face in detail. We’ll call this “reality-based CG”.

This kind of thinking has been around for a while. Back in the mid-’90s, I was involved in a project to “paint” or “wrap” photographic images of actual kimonos onto models in different poses to create catalog images. Today, many online clothing stores use a simplified version of this technique to show potential buyers what a piece of clothing would look like when worn by someone of a particular body type.

As an example of a similar technique, the article mentions music synthesizers. Originally, the attempt was again to model the sound entirely and create it artificially, which, unsurprisingly, produced something that sounded, well, computer-generated. The alternative approach now in widespread use is to record actual instrument sounds and “paint” them onto computer-generated rhythm sequences. And I think we are all familiar with the analogous shift in the voice synthesis area, where approaches that weave together recorded snippets of actual human voices are supplanting the original approach of completely computer-generated speech, which ends up sounding like a robot.

Existing machine translation approaches still largely try to mathematically model and generate everything. Statistical and corpus-based approaches do somewhat presage the “paint reality onto the model” idea, but in practice they are still basically limited to post-processing (“smoothing”, in the CG analogy) model-based output, to building word- or phrase-level dictionaries, or to dealing with local problems such as disambiguation. We have “example-based MT”, but it has not yet reached the stage of being generally applicable.
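To make the “smoothing” role concrete, here is a minimal sketch, assuming a hypothetical setup in which a rule-based engine has produced several candidate translations and a bigram model trained on a tiny target-language corpus picks the most natural-sounding one. The corpus, the candidates, and the function names are all invented for illustration.

```python
# A minimal sketch of the "smoothing" role that corpus statistics play in
# current MT. Hypothetical setup: a rule-based engine has produced several
# candidate translations, and a bigram model trained on a tiny
# target-language corpus picks the most natural-sounding one.
# The corpus and candidates below are invented for illustration.
from collections import Counter

CORPUS = "the cat sat on the mat . the dog sat on the rug .".split()

unigrams = Counter(CORPUS)
bigrams = Counter(zip(CORPUS, CORPUS[1:]))

def fluency(sentence: str) -> float:
    """Product of add-one-smoothed bigram probabilities P(b | a)."""
    words = sentence.split()
    vocab_size = len(unigrams)
    score = 1.0
    for a, b in zip(words, words[1:]):
        score *= (bigrams[(a, b)] + 1) / (unigrams[a] + vocab_size)
    return score

# Candidates as a rule-based engine might emit them, one garbled.
candidates = ["the cat sat on the mat", "the cat on sat mat the"]
print(max(candidates, key=fluency))  # -> "the cat sat on the mat"
```

The point is only that the corpus supplies naturalness judgments after the model has done the structural work, which is exactly the post-processing limitation noted above.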

I was thinking about the implications of this sort of idea for natural language processing and machine translation: “reality-based MT”. I think you can map the CG approach used in the Matrix sequel to an MT approach quite easily. A mathematical-type grammar corresponds to the CG model. Corpuses correspond to photographic images taken of actual reality. Treebanks and other tagged databases correspond to the step in which photographic images are mapped to models through the use of “anchor” points, designating, for example, 10 key points around Keanu Reeves’s lips. The above elements are all well-known components of present-day MT solutions.

In reality-based MT, though, the transformations linking one grammar (say, that of the source language) to another (that of the target language) need to be recast as a particular type of motion capture. Remember that motion capture can identify not only simple linear movements, such as a walking leg, but also more structural transformations, such as someone doing a somersault. The totally new element in reality-based MT will be the “painting” or “wrapping” process, where “reality fragments” (of the target language) are “pasted over and around” the transformed model. A toy sketch of this pipeline follows.
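As a thought experiment, here is a minimal sketch of the whole pipeline under deliberately toy assumptions: abstract slot names stand in for the grammar model, a hand-written transfer table stands in for the grammar-to-grammar “motion capture”, and a small fragment bank stands in for the corpus of target-language “reality fragments”. Every name and data item below is invented for illustration.

```python
# Toy sketch of "reality-based MT". Hypothetical pipeline:
#   1. transfer(): the model-based step, mapping source-grammar slots to
#      target-grammar slots (the "motion capture" between grammars).
#   2. paint(): the "painting"/"wrapping" step, pasting real corpus
#      fragments over the transformed model.
# All slot names, tables, and fragments are invented for illustration.
from typing import Dict, List

# "Reality fragments": snippets of actual target-language text, indexed by
# the abstract slot they can fill (the analogue of photographs indexed by
# anchor points on an actor's face).
FRAGMENT_BANK: Dict[str, List[str]] = {
    "GREETING": ["good morning", "hello there"],
    "REQUEST": ["could you please send me", "I would appreciate receiving"],
    "OBJECT": ["the latest catalog", "your current price list"],
}

# Transfer table linking source-grammar slots to target-grammar slots.
TRANSFER_TABLE: Dict[str, str] = {
    "AISATSU": "GREETING",
    "IRAI": "REQUEST",
    "MONO": "OBJECT",
}

def transfer(source_slots: List[str]) -> List[str]:
    """Model-based step: produce a target-language skeleton."""
    return [TRANSFER_TABLE[slot] for slot in source_slots]

def paint(skeleton: List[str]) -> str:
    """Wrap real fragments onto the skeleton. A real system would choose
    fragments by context and corpus frequency; here we just take the first."""
    return " ".join(FRAGMENT_BANK[slot][0] for slot in skeleton)

if __name__ == "__main__":
    # A source sentence already analyzed into abstract grammar slots.
    source = ["AISATSU", "IRAI", "MONO"]
    print(paint(transfer(source)))
    # -> "good morning could you please send me the latest catalog"
```

Of course, the hard part this sketch waves away is exactly the interesting one: finding and fitting fragments at arbitrary granularity from a real corpus, rather than from a hand-built bank.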

Who will be the first to create a simple proof-of-concept prototype of reality-based MT?
