Back when we did the iPhone discussion on Late Night Cocoa, I made a point of distinguishing the iPhone’s media frameworks, specifically Core Audio and friends (Audio Queue Services, Audio Session, etc.), from “document-based” media frameworks like QuickTime.
This reflects some thinking I’ve been doing over the last few months, and I don’t think I’m done, but it does reflect a significant change in how I see things and invalidates some of what I’ve written in the past.
Let me explain the breakdown. In the past, I saw a dichotomy between simple media playback frameworks, and those that could do more: mix, record, edit, etc. While there are lots of media frameworks that could enlighten me (I’m admittedly pretty ignorant of both Flash and the Windows’ media frameworks), I’m now organizing things into three general classes of media framework:
Playback-only – this is what a lot of people expect when they first envision a media framework: they’ve got some kind of audio or audio/video source and they just care about rendering to screen and speakers. As generally implemented, the source is generally opaque, so you don’t have to care about the contents of the “thing” you’re playing (AVI vs. MOV? MP3 vs. AAC? DKDC!), but you also can’t generally do anything with the source other than play it. Your control may be limited to play (perhaps at a variable rate), stop, jump to a time, etc.
Stream-based – In this kind of API, you see the media as a stream of data, meaning that you act on the media as it’s being processed or played. You generally get the ability to mix multiple streams, and add your own custom processing, with the caveat that you’re usually acting in realtime, so anything you do has to finish quickly for fear you’ll drop frames. It makes a lot of sense to think of audio this way, and this model fits two APIs I’ve done significant work with: Java Sound and Core Audio. Conceptually, video can be handled the same way: you can have a stream of A/V data that can be composited, effected, etc. Java Media Framework wanted to be this kind of API, but it didn’t really stick. I suspect there are other examples of this that work; the Slashdot story NVIDIA Releases New Video API For Linux describes a stream-based video API in much the same terms: ‘The Video Decode and Presentation API for Unix (VDPAU) provides a complete solution for decoding, post-processing, compositing, and displaying compressed or uncompressed video streams. These video streams may be combined (composited) with bitmap content, to implement OSDs and other application user interfaces.’.
Document-based – No surprise, in this case I’m thinking of QuickTime, though I strongly suspect that a Flash presentation uses the same model. In this model, you use a static representation of media streams and their relationships to one another: rather than mixing live at playback time, you put information about the mix into the media document (this audio stream is this loud and panned this far to the left, that video stream is transformed with this matrix and here’s its layer number in the Z-axis), and then a playback engine applies that mix at playback time. The fact that so few people have worked with such a thing recalls my example of people who try to do video overlays by trying to hack QuickTime’s render pipeline rather than just authoring a multi-layer movie like an end-user would.
I used to insist that Java needed a media API that supported the concept of “media in a stopped state”… clearly that spoke to my bias towards document-based frameworks, specifically QuickTime. Having reached this mental three-way split, I can see that a sufficiently capable stream-based media API would be powerful enough to be interesting. If you had to have a document-based API, you could write one that would then use the stream API as its playback/recording engine. Indeed, this is how things are on the iPhone for audio: the APIs offer deep opportunities for mixing audio streams and for recording, but doing something like audio editing would be a highly DIY option (you’d basically need to store edits, mix data, etc., and then perform that by calling the audio APIs to play the selected file segments, mixed as described, etc.).
But I don’t think it’s enough anymore to have a playback-only API, at least on the desktop, for the simple reason that HTML5 and the
<video> tag commoditizes video playback. On JavaPosse #217, the guys were impressed by a blog claiming that a JavaFX media player had been written in just 15 lines. I submit that it should take zero lines to write a JavaFX media player: since JavaFX uses WebKit, and WebKit supports the HTML5
<video> tag (at least on Mac and Windows), then you should be able to score video playback by just putting a web view and an appropriate bit of HTML5 in your app.
One other thing that grates on me is the maxim that playback is what matters the most because that’s all that the majority of media apps are going to use. You sort of see this thinking in QTKit, the new Cocoa API for QuickTime, which currently offers very limited access to QuickTime movies as documents: you can cut/copy/paste, but you can’t access samples directly, insert modifier tracks like effects and tweens, etc.
Sure, 95% of media apps are only going to use playback, but most of them are trivial anyways. If we have 99 hobbyist imitations of WinAmp and YouTube for every one Final Cut, does that justify ignoring the editing APIs? Does it really help the platform to optimize the API for the trivialities? They can just embed WebKit, after all, so make sure that playback story is solid for WebKit views, and then please Apple, give us grownups segment-level editing already!
So, anyways, that’s the mental model I’m currently working with: playback, stream, document. I’ll try to make this distinction clear in future discussions of real and hypothetical media APIs. Thanks for indulging the big think, if anyone’s actually reading.