 
Why does Yelp completely refuse to provide service to Tor nodes (as in 403 for all queries)?

All their interaction is done through registered accounts, and they could always just refuse to register an account over Tor if there's some good reason they need users' IPs (though that's generally stupid too).
 
Because Yelp blocks Tor IPs at the load-balancer level; otherwise we'd run into resource issues from letting our web servers handle requests from scrapers. (We block non-Tor IPs that host scrapers as well in this fashion; it's not specific to proxies.)
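For readers unfamiliar with the idea, here is a minimal sketch of what blocking at the load-balancer level amounts to. It is not Yelp's actual configuration: the blocklist entries are documentation-only address ranges, and `BLOCKED_NETWORKS` and `lb_status` are hypothetical names chosen for illustration.

```python
import ipaddress

# Hypothetical blocklist. These are reserved documentation ranges (RFC 5737)
# standing in for real scraper hosts or Tor exit addresses.
BLOCKED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.17/32"),
]


def lb_status(client_ip: str) -> int:
    """Return 403 for a blocked client, or 200 meaning 'forward to a web server'."""
    addr = ipaddress.ip_address(client_ip)
    # The decision uses only the source address, so no application code runs.
    if any(addr in net for net in BLOCKED_NETWORKS):
        return 403
    return 200
```

The point of the design is that the check needs nothing but the client address, which is why a rejected request never costs any web server time.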

Sorry. :| Not all of our interaction is through registered accounts. (You can access reviews w/o an account.)
 
+Amber Yust Do scrapers typically run over Tor to such a degree that the burden on your service is great enough to justify refusing Tor altogether, rather than (say) using throttling or the like?

I don't see why merely viewing something (which is not 'interaction' in the active sense of user-generated content) should be restricted from Tor users.
 
+Sai . The vast majority of incoming traffic from Tor IPs to Yelp is from scrapers, because scrapers generate far more requests than regular users. There's no simple way to differentiate the single request from a Tor node that isn't a scraper from the many that are.

Even if we throttled it, you'd wind up with a fairly significant chance that your request was one of the ones throttled and one of the scraper's was the one let through.

So yes, blocking them at the load balancer layer is what makes sense for the situation.
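A rough way to see the throttling argument is the back-of-the-envelope model below. It assumes the throttle admits a fixed number of requests per window from the shared Tor exits and cannot tell scraper requests from human ones, so each request is equally likely to be admitted. The function name and all the numbers are illustrative assumptions, not measurements from Yelp.

```python
def admit_probability(limit: int, scraper_requests: int, human_requests: int = 1) -> float:
    """Chance that any single request is admitted when `limit` slots are
    shared blindly among all requests in one throttling window."""
    total = scraper_requests + human_requests
    # If demand is below the limit, everything gets through.
    return min(1.0, limit / total)


# Hypothetical window: 100 slots, 9,999 scraper requests, 1 human request.
chance = admit_probability(limit=100, scraper_requests=9_999)
```

Under those made-up numbers the human's request has about a 1% chance of being served, and most of the admitted traffic is still scraper traffic.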
 
+Amber Yust What if you allowed registered users to go through Tor? Then the default page is simply sign in / register, and voilà you have per-entity throttlability.
 
+Sai . That requires all of the requests to hit the web servers to find out whether they're from a registered user or not (plus potential DB queries to validate sessions, etc), which defeats the point of blocking at the load balancer in the first place.
 
+Amber Yust Is the load balancer incapable of doing something basic like cookie inspection?
 
+Sai . If all we did was glance at a cookie, the scrapers would just make up a cookie. Anything beyond just glancing at the cookie would significantly slow down the load balancer (which needs to be extremely quick).
 
Use captchas like Google does for tor users?
 
+Liz Fong-Jones Again, that requires hitting the web servers, and the entire point of blocking at the LB is not invoking the codebase on the web servers.
 
+Amber Yust The scrapers could make up a cookie, yes, but it would then get checked against sessions, and if it's false they'd get redirected to the login page, so it would be pointless from a scraping perspective. It doesn't even need to be a DB hit for the most part: assuming you aren't stupid and are signing your cookies, it's fairly simple to verify that a cookie is at least valid enough to bother checking further. (Recent timestamp, cryptosig valid. These are fairly fast and don't require a DB hit.)
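The kind of check meant here (recent timestamp, valid signature, no DB hit) could look roughly like the sketch below. The cookie layout `user_id|issued_at|signature`, the key, the one-hour lifetime, and the function names are all assumptions for illustration; nothing here reflects how Yelp's cookies actually work.

```python
import hashlib
import hmac
import time
from typing import Optional

SECRET_KEY = b"hypothetical-server-side-secret"  # assumption: known only to the site
MAX_AGE_SECONDS = 3600                           # assumption: cookies live one hour


def _signature(payload: str) -> str:
    return hmac.new(SECRET_KEY, payload.encode(), hashlib.sha256).hexdigest()


def sign_cookie(user_id: str, issued_at: int) -> str:
    """Build a cookie of the form user_id|issued_at|signature.
    Assumes user_id contains no '|' character."""
    payload = f"{user_id}|{issued_at}"
    return f"{payload}|{_signature(payload)}"


def cookie_looks_valid(cookie: str, now: Optional[int] = None) -> bool:
    """Cheap pre-check: signature matches and timestamp is recent. No DB lookup."""
    now = int(time.time()) if now is None else now
    parts = cookie.split("|")
    if len(parts) != 3:
        return False
    user_id, issued_at, sig = parts
    expected = _signature(f"{user_id}|{issued_at}")
    # Constant-time comparison, on bytes so arbitrary input cannot raise.
    if not hmac.compare_digest(sig.encode(), expected.encode()):
        return False
    try:
        age = now - int(issued_at)
    except ValueError:
        return False
    return 0 <= age <= MAX_AGE_SECONDS
```

A made-up cookie fails the signature comparison without touching the session store. Whether one HMAC per request is cheap enough to run on a load balancer is exactly the point under dispute in this thread.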

Tor doesn't have enough bandwidth to be useful for a DDoS, and anyway that's not a motivation for scrapers.

Plus there's the alternate option Liz mentioned: captcha to prevent scraping but still allow anonymous visitors. Which is basically just an anonymous variant of allowing people to sign in, something I would strongly support.
 
+A. Midlain … Yahweh does load balancing?

Sorry, but I'm blocking you as either a troll or someone who doesn't know boundaries. Either way, you're not welcome. Bye.
 
Amber: Has Yelp considered running a Tor Hidden Service for Tor users? The latency would make it impractical to run a scraper on the Tor HS.
 
+Arturo Filastò I can't really comment in an official manner, but in the sense of pure personal speculation, my guess is that it wouldn't currently be worth the time investment. Our development timelines are jam-packed as it is.
 
Amber: it requires very little effort to set this up and I could assist you in doing so. Do you have an email address I can write to in order to get in contact with the Yelp dev team? I am also in San Francisco now and could schedule a meeting with them.
 
+Arturo Filastò I think you underestimate the effort that would be involved in integrating it with the rest of the site infrastructure.

I'm also not sure what problem you think a hidden service would solve?
 
+Amber Yust Possibly, but I also believe it is not a good thing that Yelp does not allow Tor users to access its site. Tor Hidden Services integrate quite transparently into an existing infrastructure. To reduce complexity you could keep serving the resources from the CDN (since I see that you don't block Tor users from accessing *.ak.yelpcdn.com) and just have an HS set up for yelp.com. On the load balancer you could have a 302 redirect to the .onion for users that are detected to be using Tor.
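The redirect idea can be sketched as follows. This is only an illustration of the proposal: the .onion hostname is a placeholder, the set of exit addresses is assumed to come from some exit list the operator already maintains, and `lb_response` is a hypothetical name.

```python
from typing import Optional, Set, Tuple

ONION_HOST = "example.onion"  # placeholder, not a real hidden service address


def lb_response(client_ip: str, path: str, tor_exit_ips: Set[str]) -> Tuple[int, Optional[str]]:
    """Return (status, Location header). Tor clients get a 302 to the hidden
    service; everyone else gets 200, meaning 'serve normally'."""
    if client_ip in tor_exit_ips:
        return 302, f"http://{ONION_HOST}{path}"
    return 200, None


# Hypothetical exit list using a documentation-range address.
exits = {"192.0.2.10"}
status, location = lb_response("192.0.2.10", "/search?q=coffee", exits)
```

This keeps the decision at the load balancer, but it leaves open the question raised elsewhere in the thread of whether the hidden service itself would then need protection from scrapers.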

As I said, I am currently in SF, and it would bring me lots of joy to see Tor users being able to access your service. I am willing to assist in setting this up.
 
Suffice to say that there are complications that you are not aware of that would make this not be as simple as you seem to think it would be. I'm sorry I can't elaborate, and as much as I would like to see all legitimate users have access to Yelp, it's probably not something we're going to be able to do right now.
 
+Amber Yust A Tor Hidden Service would solve the problem of scrapers abusing the Tor network to scrape yelp.com. The setup latency and the number of hops required for an HS connection would make it impractical for scrapers to use the .onion address.
 
Scrapers generally don't care about latency.
 
With a Tor HS it is not uncommon to have a round trip time of 30 seconds. I think this would make Tor slow enough that a scraper would sooner just buy a cheap VPS or use some unblocked machine to run the scrape from.
 
Again, round trip time really doesn't matter.

I realize that you care about this, but please take me at my word when I say that it's simply not going to happen right now for a variety of reasons.
 
I would really like to know the actual negative impact that the Tor network has had on Yelp. I don't think blocking a whole demographic of users because of a few scrapers is a good decision.
 
+Arturo Filastò That is not something I'm really at liberty to discuss. You could always email pr@yelp.com if you're looking for an official statement.
 
I understand the desire to block abusive users from trying to rig ratings and bots scraping content. But blocking Tor nodes seems like using a hammer where only a feather is needed. Blocking by IP doesn't really work that well to begin with, and (human) scrapers have an enormous amount of resources to aid them in scraping sites.

Yelp's load-balancing situation makes sense -- it's a major site. But would allowing Tor exits through the LB really cause that much of a performance change? Maybe, maybe not -- it depends on what the overall traffic from these users actually is. I'd be quite surprised if it were statistically significant.
 
Once you've allowed a Tor user past the load balancer, it's possible to have a secondary filter for these types of users. Making a read-only version available would be a positive move from a user privacy perspective, even if it required a captcha or had a built-in time delay before being accessible. From another perspective, blocking privacy-seeking users who might have serious privacy concerns (e.g. medical) seems like a PR kerfuffle waiting to happen.
 
Presumably the blocking is based on the exit node DNSBL? This appears to be the issue that's going to prevent me from running an exit node, despite not allowing Yelp's /22. Ah well.
 
+Amber Yust Resurrecting a dead thread, because I've just gotten bitten by Yelp's policy of blocking all Tor exits (I assume because they're using something like Emerging Threats' Tor list). What Yelp should do is run their own Tor exit node and throttle connections from the Tor network as they see fit. If the Tor network sees an exit with the same address as the one being requested, it'll route all traffic through that exit instead of using other exits. This way, Yelp can have direct control over how much Tor traffic it will accept. This solution doesn't address the problem I have, however, which is that I run an exit node but don't use the Tor network myself. My IP is still blocked, even though I'm not trying to access Yelp via Tor.

I think, overall, the problem is that Yelp is, as +Griffin Boyce says, using a hammer where a feather might better serve. Simply blocking a network because of a few bad apples is a policy which, if universally applied, would leave Yelp in a very small box.

Finally, Amber, I know you don't work for Yelp anymore, but perhaps you could give us a way to contact someone there? The emails I've sent to support addresses aren't being replied to.
 
Hulu and mlb.TV are doing the same, albeit with extreme latency.
 
I'm blocked on our company VPN which originates traffic from our VPN access point on AWS.  This is really unfortunate.  It seems very heavy-handed.  

This thread can stay alive until someone at Yelp starts paying attention.  It's one of the top results on Google for the issue.

Cheers, folks.  Great conversation here. 
 
(Note: I no longer work at Yelp; I changed jobs about half a year ago.)

+John Hobart The AWS bit I can definitely answer... pretty much all AWS traffic to Yelp is scrapers. Obviously there's the occasional exception, but we're talking, like, far less than a percentage point of legitimate traffic out of AWS.

To put it frankly, the amount of Yelp engineering time required to implement (and perhaps even more importantly, maintain) one of the supposed solutions mentioned in various comments here probably isn't worth it to Yelp for the amount of benefit provided.

Obviously it's hard to get solid numbers on how many Tor users there are, but https://metrics.torproject.org/users.html?graph=direct-users&start=2012-11-29&end=2013-02-27&country=us&events=off#direct-users estimates the number of users in the US at around 80,000. Only a fraction of those are likely to care about Yelp, and a fraction of those probably use Tor only for some services (e.g. BitTorrent) and not others.

Compared to the overall number of Yelp users, the resulting figure is very small.
 
I understand the issue with Tor and AWS. But Yelp blocks our company IP. For some reason it thinks our IP belongs to Dreamhost, which it does not. I have tried contacting Yelp many times to address this issue, to no avail. Is there any other channel to get us unblocked?
 
Really, we've told our users that there are better options than Yelp, because we find it more important to run a Tor exit node than to keep asking Yelp to unblock our IP. It's too bad Yelp doesn't just take notice from IP owners and whitelist the IP until there is abuse.
 
+Ken Montenegro You can blacklist Yelp in your torrc; Tor detection is only "is this an exit node to me", not "is this an exit node in general".
 
+Sai I don't think Yelp does exit node detection themselves.