I am a bit annoyed by the recent trend of sucky evaluation in word-embedding papers: comparing models trained under vastly different conditions and drawing conclusions based on that.

Yesterday, I saw this tweet promising a new and exciting word-embedding model, outperforming word2vec by 11 absolute percentage points!

https://twitter.com/RichardSocher/statuses/497235079903473664

I rushed to read the paper, and was very disappointed by the evaluation section.

https://twitter.com/yoavgo/statuses/497290869163040768

Discovering that the authors had made the software available on their webpage (this is great! everyone should do it!) prompted me to run some experiments of my own.

My conclusions are quite different from the authors': the word2vec and GloVe models actually perform very similarly when evaluated under "fair" conditions.
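
To give a concrete sense of what "fair" means here, a minimal sketch (not my actual experiment code) of scoring two sets of pre-trained vectors under the exact same evaluation protocol, using gensim and a local copy of the Google analogy test set; the file names are hypothetical placeholders. The real point, of course, is that both models must also be trained on the same corpus, with the same vocabulary cut-off, dimensionality and context window, before the comparison means anything.

from gensim.models import KeyedVectors

ANALOGY_FILE = "questions-words.txt"  # Google analogy test set (assumed local copy)

def analogy_accuracy(vectors_path):
    """Load vectors in word2vec text format and return overall analogy accuracy."""
    kv = KeyedVectors.load_word2vec_format(vectors_path, binary=False)
    score, _sections = kv.evaluate_word_analogies(ANALOGY_FILE)
    return score

# Hypothetical file names. GloVe's text output may need a word2vec-style header
# line first (gensim's glove2word2vec script can add it).
for name, path in [("word2vec", "w2v_vectors.txt"),
                   ("GloVe", "glove_vectors.txt")]:
    print(f"{name}: {analogy_accuracy(path):.3f}")

The key design choice is that the evaluation code, test set and candidate vocabulary are identical for both models; any remaining difference then reflects the models themselves rather than the setup.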

My writeup is available here:
https://docs.google.com/document/d/1ydIujJ7ETSZ688RGfU5IMJJsbxAi-kRl8czSwpti15s/edit#

and I welcome discussion on this page.

I think the take-home message from this is that we should be much more careful when writing and reading these kinds of papers, both as authors and as reviewers.

And I again applaud the Stanford team for making their software available. This really helps advance the field.

I would also like to stress that I really did enjoy reading the GloVe paper; I think it is very well written and presents an interesting model. It is just the evaluation and the resulting unwarranted claims that ticked me off.