"The major limitation of the test is that it only checks whether or not a browser recognises the syntax .....WebKit, for example, will recognise the values round and repeat for background-repeat and subsequently pass the test for those features, even though it doesn’t yet understand or process these values. " In other words, you'll get a higher score by pretending to understand something. Great.
 
There is a big fat disclaimer about this in the sidebar. Also, whoever doesn't play fair will be shamed in the sidebar, under "Cheaters". I wrote more about this here: http://robert.ocallahan.org/2012/02/problem-with-counting-browser-features.html?showComment=1328258569892#c7451942119663698927

Instead of being sarcastic and dismissive, you could actually help by pointing out these cases, if you really care that much and think the issue is that extensive. But alas, who cares about being constructive? Writing a snarky remark is easier and faster, and makes you feel better about yourself with little effort.
 
Thanks for the helpful lecture. I'm relieved you know so much about sarcasm and how I feel about myself. As I don't believe arbitrary tests of this kind are helpful or constructive, and as the disclaimer indicates this one reports even less meaningful results than previous attempts, I am highly unlikely to provide support. Writing testcases for Level 3 specs would be far more helpful to implementors and the WG. Or does promoting yet another superficial browser score make you feel better about yourself?
 
> as the disclaimer indicates this one reports even less meaningful results than previous attempts
Did you even read the blog post I linked to? Even though Rob criticized the test, he agreed that it provides much more meaningful results than similar attempts. Did you even look at the tests it runs? It might not test how things render, but I think it at least tests what's parsed pretty decently.

> Or does promoting yet another superficial browser score make you feel better about yourself?
Given that I don't work for a browser vendor or have even stated a personal preference, I don't see how any browser's score would make me feel better. The whole incentive behind writing this test was to raise awareness that a) WebKit is not as amazing and far ahead of other engines as people think, and b) there are tons of Level 3 features that aren't yet supported by any browser, or only by less popular browsers like Opera and IE.

"Raising awareness" about something presupposes that it's interesting and easy enough to attract a large audience. Being quick and effortless matters, so making testcases that need human intervention was out of the question, and frankly, I couldn't think of a way to make testcases that don't. Pretty much every feature detection script on the web, including Modernizr works in the same way as my test or even more superficially.

Superficial browser scores are better than no scores at all. I'd be glad to work with others on something deeper that provides more accurate results while staying easily comprehensible to the public, but I haven't seen anything of the sort. A few months ago there was an attempt to make the W3C testsuite results more useful to the public; I have no idea what happened to it. Back then, I helped with that by putting in quite a lot of actual work hours.

Sarcasm is easy. Actually doing something is much harder, and whether or not you agree with what I do, at least I try to do something, and my intentions are good. Trashing other people's work with snarky remarks is neither hard nor well-intentioned. Also, given that your remark was about something that's pretty much the most obvious thing on the page after the score itself, it wasn't even useful.
 
I'm not questioning your intentions, I'm questioning the consequences - especially the unintended ones - and the worth of the exercise. The number of so-called 'HTML5' and 'CSS3' 'tests' where IE9 lost points as we fixed bugs is non-zero. And every single time, we got feedback reporting the 'bugs' in the new build. Because, you know, the test successfully 'raised awareness'. The folks who make these tests are of course always very happy with themselves and their contribution to humanity, so that seems to be at least as easy as sarcasm. But I have yet to see any compelling evidence that these tests have any meaningful impact in 'raising awareness' or improving interop in a substantial manner (at least the kind of interop that really makes a difference for authors).

It is absolutely true that running a W3C test suite is a challenge. A huge one, even, because testing this stuff correctly and completely is in fact super hard; coming up with a way to run these tests at the push of a button is even harder. Orders of magnitude harder than creating yet another score test, which is probably why these things get written instead. Given that they're always popular - and I understand why they would be - they're also a pretty cool opportunity for self-promotion. (Nothing wrong with that either.)

I just completely disagree that it's helpful, useful or meaningful. That is just my feedback, based on a few years working on a browser. But if the discussion is going to be about how the message is delivered instead of its substance, I guess we can adjourn the discussion here.
 
Pretty much anything someone makes can contribute to self-promotion; does that mean it's the primary incentive behind everything?

I can't possibly see how fixing a bug would make a browser lose points in my test. I didn't write these tests from memory; I carefully went through each of these specs, read the grammars, and wrote the tests based on that. The only case I can think of where something like that might happen is if I misunderstood something and wrote a wrong test. But in that case, anyone could just send a pull request on GitHub and fix it.

I won't accept criticism for other people's mistakes. If what they did doesn't apply to my test, it's wrong (and unfair) to just assume it does without even checking.
 
Where did I suggest self-promotion was the incentive behind everything you do? You're not fixing any bugs. You're reporting them. It's a safe assumption you can report them to browser vendors in far less time than it took you to set up this test. So there has to be some other kind of benefit to doing it that way, or the exercise makes little sense. Unless maybe there is evidence bug reports go nowhere? Then the next step is submitting a testcase, as those contribute to standardization and are scrutinized by everyone.

Last, I never said the test was 'wrong'. I said it was unhelpful and not meaningful. Not the same thing at all. Having good intentions and meaning well does not prove the outcome is useful or positive for everyone. But since you're above criticism, that will be my last comment. Thanks.
 
What makes you think I haven't reported browser bugs that this test helped me discover? One of them is actually already fixed! But discovering browser bugs is not the only or even the main purpose of this. It's encouraging people to start caring about the Level 3 features that aren't very popular and thus don't have implementations at all, or only have a few (examples: attr(), viewport-relative units, tab-size, box-decoration-break, calc(), etc.). Once the public realizes how these features could help their work, they will start asking for them, so more implementations will come. Being helpful and meaningful has to do with the purpose, and I'm afraid you've completely misunderstood what this test's purpose was.

You've hardly looked at it well enough; you keep projecting gripes you've had with other, similarly targeted tests onto it, and then you say I don't accept criticism because I won't accept issues that aren't there! Sheer madness. Or just an easy way out of the conversation. Even if your "criticism" were actually relevant and informed, whether someone accepts criticism has to do with how politely it's expressed, the tone, the intentions, etc. Few people accept sarcasm as "criticism", and I'm not one of them.
 
If this was impolite feedback, you never ever want to work on IE :)
 
LOL, I can imagine. But that's no excuse. The Opera guys also get tons of nastygrams, but they are always nice and friendly.
Also, I sort of expect rude, non-constructive feedback from morons or the ignorant. You are neither, so I hold you to higher standards :)
 
It's not an excuse. If we only considered the feedback we deem constructive, we'd still be shipping IE6. As for higher standards, that is exactly the point: I didn't expect yet another 'CSS3' score page from you.
 
Well thank you. :)

I'd honestly be interested in your proposed solution for what css3test attempted to do. To recap:
- Raise awareness among authors that CSS3 has much more to it than the fancy stuff (rounded corners, gradients, transforms and the like)
- Show authors that even their favorite engines only support a relatively small portion of CSS3
- Show authors that WebKit isn't such a huge standards champion when it comes to CSS3. Even if it's ahead, it's only by a little.
- Show authors that IE(10) and Opera aren't as bad as they think when it comes to CSS3 support.
- Show authors that stuff like -webkit-mask or -webkit-box-reflect is non-standard and thus NOT CSS3
- Show authors that just because an engine superficially claims to support a CSS3 feature, it doesn't mean it implements everything in the specification for that feature (examples: CSS gradients in non-background properties, border-image longhands, tab-size: <length>, various computations in calc(), etc.). A rough sketch of how this could be probed follows this list.
- Give people an easy to find and pass around metric to measure the breadth of CSS3 implementations in the various engines
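To make the superficial-support point above more concrete, here is a hypothetical example (not code lifted from css3test): the same kind of parse check can be aimed at individual value forms, so partial implementations only get credit for what they actually parse.

```js
// Hypothetical helper: true only if the browser's CSS parser kept the value.
function accepts(property, value) {
  var el = document.createElement('div');
  el.style.setProperty(property, value);
  return el.style.getPropertyValue(property) !== '';
}

// Testing each grammar production separately exposes partial support, e.g. an
// engine may accept the <number> form of tab-size but not the <length> form:
accepts('tab-size', '4');   // <number>
accepts('tab-size', '4em'); // <length> - rejected by engines that only parse the basic form
```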

What would you do? I'm really interested in your response.

Also, does your sentiment about css3test extend to http://www.css3.info/selectors-test/ ? It kinda was my inspiration for css3test.
 
Well, I find tests that focus on a specific feature to be much more useful than those that focus on ill-defined targets such as 'CSS3', actually. One foot wide and a mile deep is way more useful - to me at least - than the one-mile-wide and one-foot-deep tests. But while that makes sense for an implementor, I completely understand how the latter model 'sells' better for users.

While the desire for a Consumer Reports-like score is perfectly understandable, CR also tests products intended for specific uses, e.g. much of what a range of scores means for a set of vacuum cleaners is implied in what a vacuum cleaner is used for. As a vac is not a platform from which you build any household tool you want, things are relatively easy to score and the result can be made pretty meaningful to potential buyers. 'CSS3' is not such a canned product; if I look at a site like Airbnb and a browser with a score of 90% on css3test, what does that score mean in terms of how easy it will be to author that design for that browser? Or, if I wrote Airbnb for a 100% browser, what happens in a 95% browser? A minor cosmetic issue like the loss of the logo jiggle on hover? Is some layout broken? Both? There is no way to tell from the score, of course, because that's not a goal and it can't be captured that way.

But CSS is not some collection of equally valuable spare parts flying in close formation; if a browser scores 90% on every single module tested, the real-world interaction of 0.9 x 0.9 x 0.9 x 0.9 support could turn out to be significantly less useful for a given design than a very different distribution with a lower average. So is the issue really how many property-value pairs are supported, or what can be achieved? Are all property-value pairs equal in importance for all designs? I don't think so.

Bottom line: sure, your goals are positive. But so what? Are they sufficient? And if the test makes the expected difference, what improves? What is fixed? How do we measure that success? It's rather hard to tell. Is having such a metric very important? Does it really solve author problems? I think the answer is generally either no, 'not really' or at best 'maybe'. And I also think that giving more meaningful answers is hard enough that we keep falling back on what's easier: looping through some small set of asserts and incrementing counters. I'm not at all convinced the resulting stats are anywhere near as valuable as they are popular. Does that make more sense?
 
Yes, much better, that's way more constructive than sarcasm :)

I completely agree that depth is better. After all, the deeper the tests, the more accurate the score! The reason I didn't do that was purely technical: there doesn't seem to be a way to do deep automated testing in the browser. The W3C testsuite (which is a perfect example of deep testing) requires human interaction for every test, either to judge the result ("is this green?") or even to perform the test in some cases ("hover over this; is it green?"). So while this is fine for implementors, I can't require that kind of effort from the average author. :(

Regarding CSS3, I'm using it as shorthand for "a set of the most stable, non-abandoned Level 3 CSS specs". Yes, not every spec was born equal, but the same could be said of any subset. For example, not every feature within a given spec is equal either: supporting @font-face is much more important than supporting old-style OpenType numerals, for example. However, this gives me a very good idea: adding weights to each feature wouldn't be hard, and would make the score more valuable - though I'm afraid the hyped features would get the highest weights, which is exactly what I wanted to avoid. Any ideas?
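Mechanically, the change would be trivial: the score becomes a weighted fraction of passes instead of a flat count. A rough sketch with made-up features and weights (none of this is actually in css3test):

```js
// Each tested feature gets a pass/fail result plus a weight reflecting how
// badly its absence would hurt real designs. The weights below are invented
// purely for illustration.
var results = [
  { feature: 'flexbox',                    passed: true,  weight: 5 },
  { feature: 'box-shadow',                 passed: true,  weight: 2 },
  { feature: 'old-style OpenType numerals', passed: false, weight: 1 }
];

function weightedScore(results) {
  var earned = 0, total = 0;
  results.forEach(function (r) {
    total += r.weight;
    if (r.passed) { earned += r.weight; }
  });
  return earned / total; // 7/8 = 0.875 for the sample data above
}
```

The arithmetic is the easy part; agreeing on the weights is the hard one.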

Btw, speaking of the W3C testsuite, I'd love to contribute, but I can't find anything to guide me on how to do so. Is it even possible, or can only implementors contribute?
 
Actually, W3C testsuites such as the one for CSS2.1 maximize spec coverage, i.e. they mean to 'test the spec'. Each normative statement is tested individually, and we check whether two implementations pass each such statement. That's it. Even with 10,000 such testcases, experience shows that is not sufficient for real-world interop, because the real world combines all those things together in interesting and sometimes crazy ways. For implementors, they're super helpful. So, quite selfishly, I love them! For authors, they have huge potential as an educational/learning tool. As a proof of real-world interop - where real-world involves messy combos of HTML, CSS and JS - their usefulness is far more limited.

I see css3test as shooting for the comfy, compromised middle. It's not so narrow as to be overly specialized, it's not so deep as to find lots of interesting bugs... but it also doesn't combine features in any meaningful real-world sense. Each compromise is individually reasonable, but the combination feels like one of those designs where the three core components individually aimed for 70% impact/usefulness and the combo ends up being 0.70^3. So if I had to narrow down my frustration with these tests, it's that they don't take any real stand, nor do they express enough of a point of view on what matters to move the needle in any lasting way. I guess I'd rather get a real something for someone than too little for everyone.

The weighting idea is interesting, but I'd assume the weights would depend on your end goal, e.g. maybe you start from a bunch of designs - blog, retail site, social page, news... - and rank features by their importance for achieving that type of pattern? This is starting to sound like some new take on the CSS Zen Garden: instead of one-content/many-stylesheets we have one-feature-set/many-designs.

As far as contributing testcases goes, start here: http://wiki.csswg.org/test. It's rather raw; feedback welcome.
 
Thanks for the info about the testsuite! Is there any list of which specs don't have tests, besides this: http://www.w3.org/Style/CSS/Test/Overview.en.html ?

> the combination feels like one of those designs where the three core components individually aimed for 70% impact/usefulness and the combo ends up being 0.70^3.

Ha, maybe. But have you noticed that 0.7^3 ≈ 0.343, which is still more than 1/3? ;)

> The weighting idea is interesting, but I'd assume the weights would depend on your end goal e.g. maybe you start from a bunch of designs

How so? For example, the absence of layout stuff would break everything far more badly than, say, the absence of box-shadow. Even in the lightest flexbox cases and the heaviest uses of box-shadow, the impact doesn't come close. So I think a crude approximation could be specified per feature. The question is, who decides, and on what basis?
 
It's your test, you decide! Take a stand. It's sure to get some lively attention and interesting feedback. 