So when I realized that those same cues are also important for speaker identity, I had this idea.