First off: I like your product and pay for it each month. I also have no problems with you personally (seeing as we have never spoken).
Now: the concept of what you want to do is fine, but you had better make damned sure you get it right before you release this. And I'd be lying if I said I didn't have my apprehensions.
You're already collecting data on our sites without permission (we can't block inclusion in your index through robots.txt). You preach transparency, but that fiasco a few years back over whether or not you really have a crawler makes me raise a skeptical eyebrow each time you use the word. You've kind of conditioned me to believe that you're transparent when it benefits you (i.e., showing off traffic stats, large indexes, and revenue), and there's nothing wrong with that. It's good marketing, and it buys you goodwill among your fans. And yes, you're 'transparent' when you screw up (like with the delays on the last update). But is that transparency? Or is it just good customer relations and marketing?
Now you're considering creating your own rules to decide what spam is, and volunteering us all into a potential reputation management problem. I don't spam. Google knows I don't spam. But what will your algorithm think? I don't know. And to be honest, I don't really trust you to decide that.
Here are a few of my concerns:
1. It's impressive that you're indexing as many links as you are. But in an effort to improve the quality of your link index, you decided not to index lower quality pages. Aren't those the pages most likely to be the spam? Without those, what's the point?
2. You haven't figured out how to fix basic problems with your core products. For instance, Open Site Explorer has no idea that the www and non-www URLs are different versions of the same page. As far as I can tell, it fails at all canonicalization issues (apart from 301s).
3. The latest link index shows search results pages from some search engines as backlinks (I'm seeing tons from cox.net, for instance). That's a pretty basic screw-up.
4. Your index is very stale. I find 404s in 'fresh' indexes all the time, as well as recorded links that haven't existed for six months or more. Like I said earlier, I find your product useful for a lot of things, but how am I supposed to trust it when it comes to things like spam, when spam can literally be turned on and off with the push of a button? How can you catch that with such a slow index?
And even if you do get it right, what actionable thing am I supposed to do with this data?
I don't mean to be an ass, but I really don't see how this is a good idea. Unless you just want a lot of free PR and links.