Improving GStreamer quality

I've been using +GStreamer for a long time now, and I feel it's a very strong technology. I think GStreamer 1.0 has what it takes to be the industry-leading technology for professional video editing. However, there is a certain class of quality issues that I've long observed in GStreamer and (more often) the various codec libraries that GStreamer wraps (libavcodec, libvpx, libtheora, etc).

Testing multimedia code is hard because you need to test against a rather exhaustive set of real-world files. But for video especially, the files are huge and the tests take a long time to run. Changes are rarely made without being tested against at least some video files, but the problem is that no one is currently testing against a set of files large enough to be truly representative. We simply don't have a practical way to test against such an exhaustive set of files. So there's a constant quality ebb-and-flow, where fixes and improvements for one variety of media file inadvertently cause regressions for another. This bug is a perfect example: "H264 MOV files from Canon DSLR resolution reported as 1920x1088".

Although I don't know the details of that bug, a Google search suggests it has happened before (including in some proprietary software), and I bet it was caused by a change that actually fixed or improved something for another kind of file.

One consequence of this ebb-and-flow is that at any point in time, there are likely serious quality issues that affect at least some common types of media files. And perhaps a bigger problem is that developers are wasting huge amounts of time chasing down this constant flow of regressions, time that could be better spent solving hard engineering problems (or catching up on sleep).

Currently the feedback loop is far too slow. It can easily take months until a developer realizes they broke something for certain media files they personally don't test with. By the time a regression is discovered, many changes have been made, and the change that broke things won't be fresh in the developer's mind. For GStreamer to conquer the world of pro video, we need to reduce this feedback loop from months to hours. We need a way to do continuous integration testing against an exhaustive set of real-world files, for every commit made to trunk, for all the GStreamer components and all the major codec libraries.

For the last year or so I've been musing on how to do this, and I've come up with a clear plan. I think this is one of the most valuable contributions that +Novacut can make to the GStreamer ecosystem. It will benefit every app that uses GStreamer, and every app and framework that uses any of the codec libraries we test via GStreamer. My plan has two parts:

Media Database

We're going to store the actual media files in +Dmedia and host them on some public cloud. Often I get asked why I wrote Dmedia instead of just using git, and this is a perfect way to explain why: we need the ability for a single computer or VM to grab just some small, arbitrary subset of these files. This media database will easily reach hundreds of terabytes. If you want to `git clone` that, be my guest :P

However, the metadata about these files will be under version control (we're going to use bzr, hosted on Launchpad). For each file in the media database, there will be a JSON file containing something similar to the CouchDB doc that Dmedia stores for each file.

The point of the metadata isn't just to list the files, but to codify various properties of each file, so we can compare them to what GStreamer says when we run the tests. For example, the video resolution (height and width) will be in here. So we'd immediately catch the above bug, where a 1920x1080 video file is reported as 1920x1088.

As another example, Novacut absolutely needs to deliver perfect frame accuracy, in every seek, in every render, without fail. So we might store what (we think) are the correct timestamps for every video buffer in the file. We'll store the framerate, the exact number of frames, the samplerate, the exact number of samples, and so on.
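For example, the per-file metadata might look something like this (the field names here are hypothetical, just a sketch of the idea, not a finalized schema; the timestamp list shows only the first few values):

```json
{
    "filename": "MVI_1234.MOV",
    "content_type": "video/quicktime",
    "width": 1920,
    "height": 1080,
    "framerate": {"num": 30000, "denom": 1001},
    "frames": 1452,
    "samplerate": 48000,
    "samples": 2323200,
    "buffer_timestamps": [0, 33366666, 66733333]
}
```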

This metadata won't be manually entered, but we do want it to be human readable and easy to edit manually if needed. We're going to build tools for adding files into the database, which will generate this metadata using what we think is correctly working code. Over time, people can do all sorts of out-of-band checks on this metadata and reason about its correctness. When corrections are needed, updates will be made. Just propose a merge.

The point is that as we take the time to deeply understand the correct properties of these files, we want to store metadata describing them, so that we have an automated way to ensure that GStreamer and the codec libraries continue to deliver correct behavior, version after version.

The media database won't be tied to Novacut or GStreamer (although we are going to build the related tools atop GStreamer). So there's no reason other projects can't use it. It would be awesome if, say, MLT ran tests against this database too. I'd also love to see Lightworks run tests against this database, hint hint.

Running the Tests

We want the quality feedback loop to be as tight as possible. To test such an exhaustive set of files yet get the feedback time down to mere hours, we need to spread the tests across a large number of nodes working in parallel. And I think the cloud is the easiest, most cost effective way to do this.

Right from the start I want to test a wide range of files in aggressive playback scenarios. Not just testing whether the file plays without crashing: does a full play-through deliver the expected number of frames and the expected number of samples? Are things like the width, height, and framerate correctly reported? This is the only way we can help reduce the quality ebb-and-flow. These are fairly simple tests that we need only write once, and then run across every file in the media database.
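As a rough sketch (the function and field names are hypothetical, not actual test-suite code), the per-file check could be as simple as comparing the stored metadata against what a full play-through actually observed:

```python
def check_playback(expected, observed):
    """Compare what a full play-through actually delivered against the
    metadata stored for this file in the media database.  Returns a list
    of mismatch descriptions; an empty list means the file passed."""
    problems = []
    for key in ('width', 'height', 'framerate', 'frames', 'samples'):
        if observed.get(key) != expected.get(key):
            problems.append('{}: expected {!r}, got {!r}'.format(
                key, expected.get(key), observed.get(key)))
    return problems

# The 1920x1088 bug above would be caught immediately:
expected = {'width': 1920, 'height': 1080, 'frames': 1452}
observed = {'width': 1920, 'height': 1088, 'frames': 1452}
print(check_playback(expected, observed))  # → ['height: expected 1080, got 1088']
```

The simplicity is the point: write this once, and the hard part becomes running it across hundreds of terabytes of files fast enough.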

For NLE scenarios, things get a lot more complex. There are some files in the wild for which we'll never be able to deliver perfect frame accurate editing and a good user experience. Some files are just plain broken and will need to first be transcoded (and possibly fixed/conformed) before any NLE can edit them. So this is a place where we must choose our battles carefully.

Initially Novacut must focus on the media files, scenarios, and platforms that reflect the market reality for our target users. Apple has given us a "Windows Vista" sort of moment, and we don't want to squander it.

For NLE scenarios, at first we are only focusing on video produced by Canon HDSLR cameras like the 5D Mark II. Testing edit scenarios is a giant step up in complexity from testing the playback of a single file. So it will take time to develop a deep understanding of what's going on with just the files from Canon HDSLRs. We want to be industry leading when it comes to editing these files before we dilute our efforts with other cameras. We will add other cameras into these tests, but it will take time (and help from others).

Most of our target users currently use OSX, and so realistically when these users make the leap to Linux, it will be to Ubuntu, most likely because they bought hardware with Ubuntu pre-installed. Canonical is the only distro-sponsor I see putting the time and money into building the OEM relationships needed to make great Linux desktop hardware a reality. And Ubuntu is the only distro I see that is clearly focused on reaching the 99% of computer users not yet using Linux, even if it means some of the existing 1% of Linux users stop using Ubuntu.

Note that our loyalty is to our users, not to Ubuntu. If the tables turn and another distro does a better job for our users, we'll switch. But as it stands today, Ubuntu has earned this position, and I'm guessing they'll continue to do so. Also note that I hope other distros run these same tests, and in whatever way makes the most sense for them. Novacut just can't invest any time or money in making this happen right now.

Anyway... so we're going to do builds in a Launchpad PPA, and we'll do an appropriate series of rebuilds whenever a commit is made to any of the trunks we're tracking. We want to test with packages (rather than an install from source) because we want to test the exact scenario in which our target users will use the software. If the build fails, we stop there and report this upstream.

If the build succeeds, then we run the integration tests. I want it to be easy for anyone to run these tests, on any public cloud or a local cluster or even a single machine. So I think deploying the test controller and the test workers with +juju is a great idea. +Marco Ceppi, I know you offered to help us get Charm-savvy, and this might be the first place we take you up on the offer.

In terms of data collection, we want a lot more than just log files. We need structured data, and so my plan is to store this in CouchDB because JSON is a great fit for what we need. For example, if we don't get the expected timestamps on every video buffer as we play through a file, we want to store the timestamps we did get. I've also been doing some interesting experiments where I do a sha1sum of the video buffer data, which seems especially useful as a cheap but brutally accurate way of testing NLE scenarios. Is gnonlin giving us the exact frames from the exact clips we expect, all delivered with the exact timestamps we expect?
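A minimal sketch of the digest idea (hypothetical helpers, not actual Novacut code): hash each buffer's raw data as it comes out of the pipeline, then compare against the digests stored in the media database.

```python
import hashlib

def frame_digest(data):
    """Hex SHA-1 of one video buffer's raw data."""
    return hashlib.sha1(data).hexdigest()

def check_frames(buffers, expected_digests):
    """Compare each decoded buffer against the stored digest for that
    frame.  Returns a (frame_number, expected, got) tuple for every
    mismatch; any mismatch means we got the wrong pixels for that frame."""
    mismatches = []
    for i, (data, want) in enumerate(zip(buffers, expected_digests)):
        got = frame_digest(data)
        if got != want:
            mismatches.append((i, want, got))
    return mismatches
```

The same trick works for NLE scenarios: render an edit, hash the frames that come out, and check that gnonlin handed us exactly the frames from exactly the clips we expect, in exactly the order we expect.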

I also like CouchDB because it makes it easy to build a nice web UI that will allow anyone to visualize and study the data that comes out of these tests. Plus it makes it easy to incrementally sync new results to a local CouchDB for local analysis. And we want easy ways for results to be reported back upstream, and for upstream to be able to query the results in productive ways, including using the results as part of their own continuous integration.
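To give a feel for it, here's a sketch of what one structured result document might look like (the schema and names are illustrative, not final; CouchDB stores plain JSON documents, so anything that round-trips through JSON works):

```python
import json
import time

def result_doc(file_id, test_name, passed, details):
    """Build one structured test-result document, ready to save into
    CouchDB.  `details` holds whatever structured evidence the test
    collected, e.g. the actual timestamps we got."""
    return {
        '_id': '{}.{}'.format(file_id, test_name),  # deterministic doc id
        'file_id': file_id,
        'test': test_name,
        'passed': passed,
        'time': time.time(),
        'details': details,
    }

doc = result_doc('MVI_1234', 'playback', False,
                 {'expected_frames': 1452, 'got_frames': 1451})
assert json.loads(json.dumps(doc)) == doc  # round-trips cleanly as JSON
```

Because every result is a plain JSON document, CouchDB views can slice them per file, per test, or per commit, and replication gives you the incremental sync to a local instance for free.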


Lots more to come...