In general, the content-to-time ratio of video is very bad compared to reading text. (That is a personal thing, because I read text faster than average. I've noticed that people who read text slowly often prefer video tutorials.) With artificially generated voices it's often much worse, because understanding the synthetic speech takes extra listening effort. So before I watch a video, I very much want to know what I'm getting into.
The titles you suggest tell me more about the format of the video ("It's about some characters performing a scene!") but only tease me about the content -- it's about Valentine's Day, but is it mocking? Is it comedic joking? Is it angsty pathos? Is it physical action?
One drawback of embedded video is that there's no hint as to the video length. If I decide to watch the video, am I making a 20-second commitment, or a 90-minute one? I can honestly say that after I clicked on the video the first thing I looked at was the duration.
For this specific video, something like "An interesting attitude about Valentine's Day in one minute" might work as a title for me -- it tells me what the approximate duration of the clip is, and teases that the video is probably going to have a mildly thought-provoking idea in it, but probably not crazy slapstick.
The thumbnail is pretty bad. Not only is it just a person standing there, but the person isn't even centered in the frame. It totally screams "automatically generated". If I were to guess at the content of the video based on the thumbnail alone, I would expect the whole video to be that avatar awkwardly staring at the camera. Since this is supposed to be a dialogue, a better thumbnail would at least show two characters.
Having better subtitles for the video than the automatic CC would improve the actual video-watching experience for me. Usually subtitles are a pain because someone needs to listen to the audio and transcribe it, but since you probably started with the text, this should be easier to do.
And, honestly, for me, I would much rather just read the text content of this dialogue straight. Having it delivered by little avatars does not improve the experience for me at all. As a cute little dialogue, I can read it in 10 seconds and imagine the conversation in my head spoken by real people. I have to do that anyway when I watch the video, but it now takes me 37 seconds and I have to strain the listening-comprehension part of my brain first.
If I actually wanted to give my users the best experience, what I would do is describe the content of the video in text within the post, so that users can either click straight to the video if they want, or read the content description below if they're not sure. For a reasonable example of this, check out this article: http://www.cracked.com/blog/the-10-most-brilliant-comedy-gems-hiding-youtube-pt.-8/