A new milestone in artificial intelligence: reconstructing a face from speech

An artificial-intelligence algorithm developed at the Massachusetts Institute of Technology (MIT), in the United States, can reconstruct the appearance of a person’s face from a recording of their voice.

The system, called Speech2Face, was trained on millions of audio clips from more than 100,000 different speakers, many of them taken from educational videos on YouTube.

The MIT Computer Science and Artificial Intelligence Laboratory (MIT CSAIL) published the tool, which can also infer attributes such as a person’s age, gender, and ethnicity.

The authors of the study said that their objective “is not to reconstruct a precise image of the person, but rather to recover physical characteristics that are correlated with speech”.

The project seeks to determine to what extent a person’s appearance can be inferred from their voice, and it is inspired by the way people build mental models of someone they know only by voice.

Speech2Face works through a deep neural network designed and trained on the open AVSpeech dataset, which contains more than 100,000 people speaking in short clips of six seconds.
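Conceptually, a system of this kind maps a short speech recording to a vector of facial features, which a separate decoder then renders as a frontal face image. The sketch below is an illustrative placeholder, not MIT's actual model: the spectrogram step, the encoder weights, and the 4096-dimensional feature size are all assumptions made for the example.

```python
import numpy as np

# Illustrative sketch of the Speech2Face pipeline idea (hypothetical,
# not the published architecture): speech waveform -> spectrogram ->
# voice encoder -> face feature vector (input to a face decoder).

rng = np.random.default_rng(0)

def spectrogram(waveform, frame=256, hop=128):
    """Magnitude spectrogram via a simplified short-time FFT."""
    frames = [waveform[i:i + frame]
              for i in range(0, len(waveform) - frame, hop)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=1))

def voice_encoder(spec, out_dim=4096):
    """Stand-in for a trained encoder: pools the spectrogram over
    time and projects it to a face feature vector (random weights
    here, purely for illustration)."""
    pooled = spec.mean(axis=0)                     # average over time
    W = rng.standard_normal((pooled.size, out_dim)) * 0.01
    return np.maximum(pooled @ W, 0.0)             # ReLU projection

# Six seconds of synthetic 16 kHz audio, matching AVSpeech clip length.
audio = rng.standard_normal(6 * 16000)
face_features = voice_encoder(spectrogram(audio))
print(face_features.shape)  # a feature vector a face decoder could consume
```

In the real system, the encoder is trained so that its output matches the face features produced by a pretrained face-recognition network on the speaker's video frame; here the weights are random, so only the data flow is meaningful.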

To demonstrate its results, the researchers also used the VoxCeleb database, made up of millions of video clips published on the Internet in which 7,000 celebrities appear in interviews, in short fragments of at least three seconds.

The generated image shows the person’s face from the front with a neutral expression; example outputs were displayed alongside real images of the celebrities in the videos to show the resemblance to the originals.

However, the algorithm still exhibits biases that reveal gaps in the dataset on which it was trained.

Speech2Face, for example, generates images of white men when it hears Asian speakers using English, although when they speak Chinese it identifies their ethnicity correctly.

“If a certain language does not appear in the training data, our reconstructions will not capture well the facial attributes that could be correlated with that language,” the MIT team clarified.

Speculation about the algorithm’s possible commercial use centers on generating a representative image of the person on the other end of a telephone call.

About the author


Ginger Baker