Google's lack of transparency.

I've fought with the urge to produce this piece for some time – but after much deliberation, I've decided to go ahead with it. I'll try not to make it a "hater" piece, and instead go with the "we should know" angle (though to be honest, some of you will still rush to their aid :D)

Google's Transparency Report – is basically of no real use or interest to almost anyone.
It's not a real transparency report: it doesn't contain any vital data, and it doesn't permit the public to scrutinize Google's actions etc.
At best, it's a "look at us" tool masquerading as a transparency report

Google's Transparent Posts – their blog, their employees' blogs/sites ... again, little real transparency here. Most of the time you either get confirmation of what was already known or strongly apparent, or some suggestions that some things may be of influence, or may not be.

So what is my beef with Google and their claims of transparency?
Well, I want to see the "real data" – the stuff that would not only permit the public to see G's statistics, but to allow them to see just how bad some things can be.

Yes, I want to see the "fun and fluff" figures. Things like search request volume by country, number of returned results over a year, the amount of energy consumed per quarter etc.
Some of the more detailed information, such as product/platform usage, costs and earnings, would be nice too.

But – what I'm really wanting is, of course, the dirt!

Where are the DMCA report details?
Being told how many DMCAs were received/acted on is one thing, but being told how often G had to actually revoke/reverse a decision is another.
Why are we not told what % of DMCAs were denied incorrectly, and how many were granted when they shouldn't have been?

Where are the removal statistics?
Google pushes some sites out of the SERPs via filters, and some out of the public index ... and possibly even some out of its real index. Where are those figures? Why can we not see how many sites/pages G have (roughly?) indexed, how many it refrains from displaying, and how many it has actually booted?

Where is the Algorithm effect data?
G like pushing out all these algorithms and making tweaks – they like telling us it will only affect n% etc. But we don't get to see the actual data.
Why can we not be told that out of a total of X sites/pages, Y% of sites were affected negatively and Z% affected positively? Why are we not given actual figures (think of it, 1% of a billion is a huge number!)?
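To make the "1% of a billion" point concrete, here's a trivial sketch – every figure in it is invented purely for illustration, not a real Google statistic:

```python
# Illustration only: a "small" percentage of a huge index is still a
# huge absolute number. TOTAL_PAGES is an invented figure.
TOTAL_PAGES = 1_000_000_000  # assume one billion indexed pages

for pct in (0.1, 1.0, 5.0):
    affected = int(TOTAL_PAGES * pct / 100)
    print(f"{pct}% of {TOTAL_PAGES:,} pages = {affected:,} pages affected")
```

Even the smallest of those percentages works out to a million pages – which is exactly why "it only affects n%" tells us so little.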

Where are the "since we implemented" numbers?
Google likes to promote its own services, and alter the SERPs display. This obviously has an impact on what the Click Thru Rates are, and what the click dispersal patterns are.
Why can we not be told how things have changed since G included its credit card comparison block, or how the click rates changed when they shifted to a 3 pack etc.?

Where are the "oops" figures?
This one is possibly my biggest bugbear – Google refuses to publicly acknowledge when they get something wrong in regards to deranking or deindexing a site.
I want to know just how many sites G thinks it may have accidentally shot down with Algo-X.
I want to see figures (even estimates) on what % of affected sites shouldn't have been hit.
I want to know what G thinks is an acceptable "collateral damage" figure.

Now, the reasons for not doing these things will obviously be numerous and complex. 
I'll utterly ignore the "they'd look bad" and the "could go legal" ones.  Instead, I'll show a little empathy and understanding.
* Complicated and convoluted – the data will vary depending on country, search type, time of year etc.
* Then we have to consider the "competition" – some of the "others" out there have no qualms about pointing the finger of disgust at Google (even if they are guilty of the same/worse themselves). G would simply be supplying them with additional ammo.
* There's also the "cost" - I doubt if all of those figures are "ready" - they'd have to be dug out and put together ... and that is costly in time and effort - for no real gain (bad business sense straight away).  Further, it could impact other avenues and slow down development etc. elsewhere.

So no – I don't expect hard, solid, specific data.
But a darn good rough estimate, some sort of ball-park figure, some sort of review/audit and publication should be done.

That would be real transparency!

#googletransparencyreport #googletransparency  
"Where is the algorithm effect data?"

I suspect one reason they don't provide figures in the format that you'd prefer is due to the fluid nature of what they are measuring & how it'd easily be misconstrued.

As an example, in January 2011 they announced that algorithm A will affect 1% of queries – or alternatively a real number, 1 million (for simplicity's sake). Google continue to update algorithm A every 2 months, but at the same time the known Internet is expanding. In March 2011 they announce they've made some changes; it is no more severe, but because the Internet has expanded they report 1.1 million queries. There is a big risk that lots of people are going to think Google is being harsh because the number increased, when that isn't the case at all.
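The arithmetic behind that risk is easy to demonstrate – a constant severity rate produces a growing absolute count as the base grows. All numbers below are the invented ones from the example above:

```python
# Illustration only: the same 1% rate yields a larger absolute number
# as total query volume grows; the algorithm itself got no "harsher".
RATE = 0.01  # algorithm A affects 1% of queries, unchanged over time

total_queries = {
    "Jan 2011": 100_000_000,
    "Mar 2011": 110_000_000,  # the known Internet expanded
}

for month, total in total_queries.items():
    print(f"{month}: {RATE:.0%} of {total:,} queries = {int(total * RATE):,} affected")
```

The reported count rises from 1 million to 1.1 million even though the rate never moved.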

Next you might say: well, report the number of query phrases that are affected – let's say that number is initially 750,000, because we know some queries get used all the time. The problem with this approach is that Google have reported that every day they see x% of queries for the first time.

I'm not necessarily being pro-Google above – though I do love/appreciate the amazing feats they do every day – but I'm trying to illustrate that it isn't going to be as straightforward in this particular item as you might want/like.

I'd be quite positive that Google think carefully about how they report certain figures to avoid confusion in the marketplace. If they could reasonably report a hard number for the above, I think they'd want to because they can say we have just removed 51,123,456,789 websites that were low quality and weren't bringing any value to the Internet.

All the above being said, I'm confident that Google could produce the figures as Lyndon suggested, in some form - they absolutely know who gets what volume/percentage of traffic/queries. I suppose another obvious reason they might not want to report hard figures is because it makes them look, even more so, like the hand of God. Google have had lots of criticism over their search results; there have been anti-trust investigations, posturing that Google should release its algorithms and more. If Google reported hard figures, it'd be clear that Google have just affected X websites and indirectly Y businesses. That might be a pretty strong motivator for them to use percentages: it provides an insight into the severity of a change without laying out the hard data that could be used against them - when that isn't their intention when they are trying to improve the quality of the search results.
As I said - I understand that it isn't straightforward and that it could be quite complex ... but that doesn't stop them from publishing a rolling chart.
They could easily show data per period, per month etc.

January - AlgoX - out of N million sites, 50k affected (X%).
March - AlgoX - out of N+ million sites, 35k affected (X+%).

At least people would be able to start envisioning the scope of things, and how far reaching some of these "little" changes really are.

It's not all bad either - I think it would do G good if people could see the sort of volume and turnaround of Spam Reports, Reconsideration Requests etc.
(Though I have my beef with the Spam Report squads, I also realise they have a somewhat sucky job (I wouldn't do it :D))
energy used per quarter, by the servers saving our deathless prose? I worry about that.
I think asking for the "big picture transparency" is commendable. However, if it is transparency that we are talking about I would rather Google focuses on transparency in Webmaster Tools. Having the big picture is nice but give me data that is actionable. If we can get better informed on the micro side of things then maybe the same thing can happen on the macro level.
* Google's Transparency Report
 Well, Google (Twitter too) is leading the industry with regard to transparency reports. Is there one for Apple (for example)?

* Google's Transparent Posts
 Maybe you don't get the answer you liked, but at least they are presenting themselves. Can you present a different company that does more than Google?

* Where is the Algorithm effect data?
 At least they are talking about it and explaining it. For example, Facebook changed its "Edgerank" with regard to pages (unless you pay). As far as I know no numbers were ever mentioned. So isn't Google leading here too?

* Where are the "oops" figures?
There might not be any figures. But high-level Googlers did state that they watch individual sites and that they will make changes if a high-quality content site were hit (Daniweb). 

You make some great points, and it can be better for sure. But looking at other companies Google is doing a great job. And I guess you need someone to disagree ;-).

P.S. Since transparency is closely related to "how open a company is": in other areas Google is opening up more and more, with huge benefits to the community as a whole. For example, consider this quote:

"... Google used to be this black hole. They would hire kernel engineers and they would completely disappear from the face of the earth. They would work inside Google, and nobody would ever hear from them again, because they'd do this Google-specific stuff, and Google didn't really feed back much.

That has improved enormously... Now they're way more active, people don't disappear there any more. It turns out the kernel got better, to the point where a lot of their issues just became details instead of being huge gaping holes.
+Alistair Lattimore - ah, now I get a better picture ... and yes, I agree.
(see below)

+Diana Studer - I think G have done some reports on things like their energy consumption and their green efforts (hats off to them ... they are very "green" and push hard to be efficient (considering their electric bill, I'm not surprised :D)).

Ah, +Rob Wagner - that's another topic and post I'm working over :D

+W.E. Jonk
Yes, G is more transparent than many ... but what they are transparent with tends to be of little real value/import.
I see it as little more than "look how great we are" and "look at the bright shiny light - ignore the funny smell and what looks like a body in the corner" type of behaviour.

But again - yes ... they are far more open than many ... just not in the ways I'd like :D


To be honest, I know it may seem/sound/read like a gripe post ... it isn't.
I really can see some of the negatives for G in such information provision ... and much of the grief they would get would be uncalled for, misread, abused etc.
Providing competitors with sticks and stones would be a somewhat foolish thing to do as well.

But my concern is the whole "wizard of oz" approach.
It's all smoke, mirrors, pretty lights and big booming voices ... very little tangible, useful or detailed information.

This leaves me wondering just how bad things may actually be.
You see - there is the possibility that G only cripples 50 sites a year that shouldn't be hit ... and that the CTRs etc. are good in most cases, and their tweaks actually boost traffic for the deserving sites... but we simply don't know because they don't tell us :(

(Then again, if I'm honest ... even if they do tell us ... I'm not sure how much I'd believe it :D)
Hrm... the post seemed too tepid to me... how about areas like:

Bidding on PPC and ranking in SEO gives an incremental boost:
Oh really? For branded terms too, right? Can I see and test the data... oh right, not provided :(

Not even going to touch GWT... that's a whole other can of worms...

And why did they not explain their name change to "Googles"?
Imagine if someone got their hands on that data and then calculated loss of rankings and the relevant net loss companies worldwide?! That would be awesome. But not for Google.

Remember, if someone loses out - someone else is winning.

+Alistair Lattimore 
Remember, the american legal system is screwed up. Just because there is no valid reason to sue, doesn't mean you can't sue and win! :)
I did reference that sort of issue +Joe Ford ... and I think it is a serious problem, both ways.
As it currently stands, G are unassailable.  It's their toy, they can do as they like ... but some of the errors/mistakes/bugs are "costly" ... and I think they should be accountable.
That said - I don't think they should be faced with massive fines/penalties/fiscal losses ... not if they are taking action to correct.

And therein lies my big beef with them - I know that they don't move that quickly in some cases.
(Seriously, what sort of idiot builds a fully automated system that doesn't permit manual correction?  A highly paid, highly egotistical one!  I think that is negligent to the point of being criminal - and it should be corrected.)
We don't know if they do or don't have that ability. Google have said in the past that the results are algorithmic, but who is to say that they don't have the tools to rectify issues temporarily when they arise, until the algorithms are adjusted to correct it automatically.

I can understand why you think it is a massive issue that they might not have that facility, but manual adjustment like that doesn't scale to billions of websites. Google's market dominance has been provided through providing highly relevant results for as many queries as possible and that isn't possible when 'manual' is needed -- that's what bit pushers are good for after all.

They've said they don't.
They've said no manual intervention is possible.
They've said it's 100% automated.
(I think I even had a Googler apologise for it in the past)

The worst part is - from what I've seen ... they don't even have flags to tell them what a site has been hit for ... only indicators they have to interpret ... which means they cannot readily/easily identify what is wrong with a site without manually looking.
(Again - I find that somewhat ridiculous ... I myself would want a DB of symptoms and causes, with specific pointers ... else how am I to double check that it's doing it right?)
I think you're completely off the mark with the last part. Google are the supreme leader in big data management, mass computational analysis and algorithms. To think that a Google employee with the appropriate access can't delve into the depths of what their system can see to do with a given site is short sighted in my opinion.

That is like suggesting that they make changes with an algorithm and don't understand its implications - I don't believe that for a second; they know, and would have very detailed data about, every change they make. I would go as far as saying that they'd know that certain changes are good for certain groups of users based on a raft of different factors such as location, preferences, site usage, browser, search frequency/history, device, internet usage (not in Google but via DoubleClick, Google+, ..).

Maybe we're both wrong and it's somewhere in between ;p
The system is far too big for any person to have a full understanding of all the implications of any change. They would need to perform such extensive testing if they did. The system simply has too many variables for someone to hold all of it in their head.
+Alistair Lattimore - I'm not wrong, trust me :D
As +Steven Lockey points out ... the system is huge.

Each Engineer has access to their own data set for testing and tweaking their bit of code and to see the potential results - but that is as far as they can see.
Indeed, they know the good/bad - within that set, within that scope ... within those limited, restricted and partial ranges.

When it goes out into the wild - it has been tested (heavily) ... but they are never 100% certain on how much of an impact it will have ... there is always the risk that it will go a little wrong, that it won't be as effective, or that it will be overly effective.

If you want proof of the pudding, look to the Google Raters.
What do you think their job actually is?
They are the backup - they are the monitor to spot if an algo is a little too heavy.
Have you not seen the occasional rollback every so often?
Have you not seen the SERPs go one way, then revert a few days later?

G have a collateral damage line ... and so long as they don't exceed it, they seem happy.
The unfortunate truth of that is - it means some will be hit/harmed when they shouldn't be.
I'm a realist - I don't expect them to not do that.  It's nigh impossible, they'd never make improvements.
But I think them not ensuring there is a way to rapidly correct the issue is disgusting. 

How many sites do you think are "innocent" that get clobbered? 
How many do you think is "acceptable"?
Do you think it "right" that G have made sure there is no way for those sites to recover?
(Think about it - you get hit by AlgoX - for doing X wrong ... but as you haven't, you cannot fix it .... thus your site is permanently sunk)
Interesting find - and yes, odd that the suggest and the SERP don't match.
Actually I think you may have argued yourself out of an argument there Lyndon. Allowing manual correction, when no-one really knows the exact effect of a change, could be really, really bad. 

I think there is quite a big difference between overview and direct interference and I think part of their 'hands-off' approach is due to not wanting to have a whole layer of manual changes to work through each time they make a change, not to mention if you manually 'mark up' a site that the algorithm would be marking down, the site can then be used for dodgy things and would still rank highly, at least till the next manual review or until the manual ranking was removed so the algorithm could judge it properly again.

Remember they do make some manual changes, aka manual penalties.
Not in the least +Steven Lockey ... though I can see where you are coming from.

Over the years, I've seen several sites that are suffering for XYZ, and seen Googlers go "erm".
And yet there was nothing they could do about it.
It always boiled down to getting an engineer to look at it, and try to push a fix.

That means that they have all this lovely automated stuff, and they know it occasionally hits someone that it shouldn't ... and they have no "switch" to rectify things for that site.

This leaves me wondering just how much control Google actually have over their SERPs.  I don't think it's that "fine" to be honest.
+Lyndon NA: To play devil's advocate: How would you prevent an errant spam control person from using the automated explanation of why your site was penalized for nefarious purposes post-Google? Basically how does Google protect its IP (like it does by having extremely few Engineers knowing the full algo) without fear of it fully leaking?
This is where it gets complicated :D

I can understand the approach G have taken in regards to "1 person, 1 part" - it's sensible, and I'd have done it myself.

But there is a distinct difference between someone creating an algo (or making tweaks), and someone going in and removing a "flag" from a site.

It would appear that G has no granular "call back" mechanism.  They can only apply across the board (or per sector), or roll the whole thing back (or per sector).
What is required is a feature that would permit them to identify a site that is incorrectly "hit", and remove the "hit".

Yet I think, due to the way they have built it up over the years, they cannot actually spot such errors - let alone go in and manually correct them.
Instead, it's a bunch of signals - and they have to look at the patterns and flags showing, and then decide which pattern/algo it suits to identify the type of "hit" .... then they would have to examine the site to see what has caused that hit (as I don't think they get to see that in the system).

Unfortunately - without someone from Google stepping in ... we will never know.
But I at least hope the above is the case (else it means G could be fixing individual casualties and hasn't been, which would be worse!)
Would you say the same is true for the more machine-learning algos like Panda? That when sites have been hit by Panda, and it's been noted that they shouldn't have been, they cannot put site A over into the "shouldn't have been hit, it's a good site" set and re-run the algo?

The actual method of how they do the teaching, and how they do the assigning/annotating - no idea.
The fact that some sites get hit and don't recover fast enough suggests that either they don't do it, or that it's more complicated than it would seem, or as +Steven Lockey implied - fixing it for 1 site may result in breaking it for 500 others.

And that there is the problem with the machine based algos - there is no singular method available ... G have built a system that is heavy handed and there is no recourse.
Yeah, I think many of us agree that machine based algos have their faults... whether there is a better choice is another thing :/
Given the chances of getting a decent and up-to-date human (as opposed to machine) search engine are minimal, I guess we are stuck with the flaws of the system :)
I don't knock the general approach.
The only real issue I have with it is the lack of manual intervention for specific cases.
Fix that - and I think it nigh on perfect.
I don't know if that'd fix it. The intrinsic problem of machine learning is to scale it up and set a bar for "innocent bystanders". To have a manual intervention means you would be back at the problem of how to scale that up.
Not really.
They have the Raters - they must find some sites that shouldn't be hit.
They have the Reconsideration Requests as well.
Then you get the odd case that shows in the Forums, or is flagged to Googlers privately.

So G is aware of the fringe cases.
The problem is they are unable (apparently) to do anything about them with anything resembling speed ... and in some cases, not without having a negative impact elsewhere due to applying it algorithmically.

If they had included a more granular and manual approach, it wouldn't be a problem.
And it's not like they should use the "resource intensive" excuse - as there shouldn't be more than a few sites ... right?
The issue isn't of scalability - the issue is of refinement of indicators and scoring of thresholds.
If they are unable to identify SiteA as good and SiteB as bad, due to a strong similarity in signals, ... that's one thing.
If they mislabel due to different signals, but the same resulting weights - that's a problem of their programming and scoring ... and that needs correction.

Whether this means they need to refine their scoring, or run sub routines to identify fringe cases ... I don't know.
It would all depend on how they built it in the first place .
Isn't the inability to do so at speed just one of the factors of "unable to scale"? :) I think requiring people to submit reconsideration requests is a hallmark of "unable to scale".

I come down on the side where more sites are hit than they can scale with manual raters. :)
So you suggest that G have a system where they can identify their own mistakes?
How else are they to have the learning set?
How else are they to identify their failures (FP/FN)?

They need the manual input for identification.
From there they could make it automated and try to distinguish further separation signals ... and then refine the parent system or introduce sub-systems.

But, they need that slow, manual approach at the start.  That will have some resource costs, that will be somewhat slow - but it is technically unavoidable.
That would theoretically help make the data sets less biased as reconsideration requests are a biased learning set from those that submit. Important difference that I think is too often not considered (squeaky wheel and all).

Remember, devil's advocating here. My real view is that you don't use machine learning on a site level as you nuke good pages and raise bad pages at the expense of true relevance.
And that there is what we originally saw with Panda - there were good sites with some bad bits that got shut down ... whilst some crap floated to the top.

What I don't understand is how come we then see G refine the system and nuke the floaters ... but we don't seem to see the good sites come back?

Personally - I don't think it should have been site-wide.
Only in rare/extreme cases (eg. the entire site was crap).
Instead, G should have nuked the bad Pages, not the Site.
Each Engineer has access to their own data set for testing and tweaking their bit of code and to see the potential results - but that is as far as they can see.
Indeed, they know the good/bad - within that set, within that scope ... within those limited, restricted and partial ranges.
When it goes out into the wild - it has been tested (heavily) ... but they are never 100% certain on how much of an impact it will have

Have to disagree with you Lyndon: after it's been through the manual quality testing stage, and before it's released into the wild as such, it's applied to one of their DCs where typically 1% of users will get those modified results, following which a statistical analysis is carried out before any decision is made whether to roll it out fully or not. In addition, a dedicated search quality analyst is assigned to study the impact of that change.

PS: the tweaking and twiddling you describe is an iterative process of the above ;)
Those are valid and correct points.
But I was responding to the bit about the Engineers knowing what they are doing and the reach of the impact.
At the time of them tweaking and twiddling - they don't know.

The knowledge of reach and impact comes after - and again, only on a somewhat limited set.
Only when fully live do we see the full effect - and that takes a bit of time to get the bigger picture.

This is how come we've seen G do the occasional rollback in the past.  What was created and tested worked fine - it was only when let loose that it was noted as being a little too much, or having adverse effects that hadn't been noted/predicted - so G clawed it back.

I'm trying to be as fair as possible here.
I don't envy the engineers - attempting to make something qualitative quantitative is hard work.  Trying to differentiate between accidental and intentional spam is a nightmare.  Trying to fathom whether two of the same score, but acquired via different routes, should be treated the same must suck.

But the base line remains the same - why build a system that denies manual correction?
They have knowingly and intentionally created an approach that can (and does) harm sites that should not be harmed - with no method of those sites being saved barring the long, slow and arduous method of refinement.

Again - I don't hold them accountable for collateral damage.
It's nigh impossible to avoid ... and they do go to some lengths to reduce it (apparently).
But they have ensured that those in the collateral range are not just injured, but crippled.

That's negligent and inexcusable.
+Lyndon NA: Because Google doesn't think of them as good sites? Heh.

I think it comes down to the aspect that Google found it could accurately nuke bad sites at (made up number) 90% confidence, but never more without making the system worse by trying to tweak out exceptions (and there are always exceptions to everything when sites number in the billions).
I know - it sucks, doesn't it!

I think, at the end of the day ... my real issue with G has always been .... I expect more from them.
I think it's because I know they are smart ... and yet I see them make some damn stupid mistakes - it's frustrating as hell.
Then throw in that they stop short, or occasionally overdo .... arg!


But, if G had to publish the stats - I think there would be the potential for enough of a backlash to force them to make changes.
The downside is, the first change they would make would be to retract and stop publishing data :(
Let's be clear here: it's applied to a limited set of user queries, and not just a limited set of the index.
Even smart people make mistakes, are naive (I've had my run-ins with many naive Engineers), and fear transparency (way too many examples for that) in the sake of protecting X.
Again, yes - thus how come they don't realise the longer/further effects (which is what I said :D).
It's not the engineers who make the decisions on whether a change is applied or not; there is a separate Search Evaluation Team who use statistical analysis on a representative sample of the current query stream ;)
More than likely - but the fact still remains that the engineers don't know the full effects,
and that there is still no manual rectification option included :D
Maybe we're both wrong and it's somewhere in between

+Lyndon NA +Alistair Lattimore  I don't think so... Looking at the discussion both arguments seem to come from different angles. One is from individual data and one is from aggregated data. Sure in a perfect world both end up in the middle. But we don't live in a perfect world. 

Google Search understand that and they try to offset this by manual penalties (this is not really scalable and not 100% bulletproof). But manual penalties only remove one side of the error, and you still end up with what is called "collateral damage" here. 

I am pretty sure G is hiring econometricians, but at the moment they are not able to solve this dilemma.
"But, if G had to publish the stats - I think there would be the potential for enough of a backlash to force them to make changes.
The downside is, the first change they would make would be to retract and stop publishing data :("

Uhhhh Lyndon... can we go with a massive "duhhhh" factor here? And now you made me spit out my soda onto my screen. ROFLMAO!

Personally the things I've been trying to push to you and everyone on the boards here about SEO:

1) Google is a business.

2) It doesn't matter what ANYONE thinks - they're going to do what they want - there is no need for transparency when the actual company itself is an individual with the rights of an individual.

3) Google has access - and has had access to any machine that ever hooked up online - and any data those machines had on them. It has had this ability for almost every moment since it began breathing as an entity.

4) The changes that are made seem to imply that common SEO tricks are not a tool - instead it appears the big "G" sees SEO as a cheat, as trickery of sorts. If you look at SEO as the implementation of specific coding or tweaks as a way to defeat other pages for rankings, then you're missing what G is looking for.

5) Google's information on what they have purchased and are using - or will use in the near future - is public knowledge, and I implore anyone wanting to know what "G" is looking for to look at the list of acquisitions and determine what they have deemed important - in order to understand what the algos are capable of understanding, and what Google is looking for by providing search services in the first place ... think here. Stop thinking "code" per se - instead think about your footprint in advertising revenue as information is bought and sold.

6) If you want to defeat any pages on the search results - you have to be better than them. Find the pattern for the topic/niche - and go one better.

7) What they have a vested interest in could possibly be pushed to higher ranking... Who cares - they own the search business - and they have the right to do whatever they want..... RIGHT?

Suggest if you want to rank:

Have more information that is valid and unique.

Have a better site that is simple to navigate, is informative and is easy for ANY consumer to use should they land on it.......

Sorry - Coca-Cola is going to outrank "try my new seltzer lemon water pop that I make 10 cases of a year" - simply because of the volume of visitors and the footprint they make for advertising revenue.

They tweak the algos to meet those standards - and remove "SEO TRICKS"

They create info for you to follow -- look at the top sites and figure out why they beat the #2 the #3 and so on... and why #2 beat #4 --- etc.

Other than that - it's pretty damned transparent.