A Menagerie of Image File Formats

This is a follow-up to my recent post on parsing NITF files that contain JPEG data. It’s basically a crash course in how the guts of image file formats are organized. If I were ever asked to be an expert witness in a trial, it would probably be about file formats.* This is my area of expertise.

You can divide the world of image file formats into different kingdoms based on their structure. There is some overlap between these categories, but for the most part image formats are (1) tag/record-based, (2) structure-like, (3) marker/stream-based, (4) textual, (5) card-like, (6) raw, or (7) opaque.

TIFF, DNG, and DICOM are examples of tag/record-based formats. A unique tag identifies the entity in the file and its meaning. For example, a particular hexadecimal tag (0x0106 in TIFF) might indicate that this is the “photometric interpretation” record. The datatype of this record either explicitly appears after the tag or appears in a data dictionary that’s known to the application developer. Almost always, these records explicitly tell the length of their data, which makes it easy to skip ahead to the tag of the next record.
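To make the tag/record idea concrete, here is a minimal sketch in Python that walks the first directory of records in a TIFF byte string. It assumes the classic TIFF layout (a byte-order mark, the number 42, an offset to the directory, then 12-byte entries); the function name `read_ifd_entries` is made up for this post, and the sketch doesn't follow offsets to values that are too big to fit inside an entry.

```python
import struct

def read_ifd_entries(data: bytes):
    """Return (tag, type, count, value_or_offset) for each entry in the first IFD."""
    # "II" means little-endian, "MM" means big-endian.
    order = {b"II": "<", b"MM": ">"}[data[:2]]
    magic, ifd_offset = struct.unpack(order + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a classic TIFF file")
    (count,) = struct.unpack_from(order + "H", data, ifd_offset)
    entries = []
    for i in range(count):
        # Every entry is exactly 12 bytes, so record i is found by arithmetic alone.
        entry_offset = ifd_offset + 2 + 12 * i
        tag, field_type, n_values, value = struct.unpack_from(
            order + "HHII", data, entry_offset)
        # 'value' is the raw 4-byte value-or-offset field, read here as one integer.
        entries.append((tag, field_type, n_values, value))
    return entries

# A hand-built, little-endian header with two records:
# ImageWidth (tag 256) and PhotometricInterpretation (tag 262, i.e. 0x0106).
sample = (b"II" + struct.pack("<HI", 42, 8)
          + struct.pack("<H", 2)
          + struct.pack("<HHII", 256, 3, 1, 640)
          + struct.pack("<HHII", 262, 3, 1, 2)
          + struct.pack("<I", 0))
entries = read_ifd_entries(sample)
```

The reader never has to understand a record in order to step over it, which is a big part of why unknown tags are harmless in this kind of format.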

Microsoft was (for a time) very fond of making structure-like formats. In these formats, the file looks a lot like the memory representation of a C/C++ data structure. These formats are easy to describe and easy to read if you have the structure definition; simply fread() the data into a variable and reference the data members by name. The problems should be pretty clear. You need to be using a programming language that supports C structs. And you need to know the layout of the struct. And once you define the layout of the struct, it’s fixed. (Well, not exactly. Microsoft changed the data layout in its BMP family of formats across several releases of Windows, and used the size field at the start of the header as a “magic” value to tell readers which struct to use.) All told, it’s a very brittle kind of format.
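Here is a rough sketch of that approach in Python, where the `struct` module stands in for fread() into a C struct. The field layouts are the 14-byte BMP file header and the 40-byte BITMAPINFOHEADER; the function name `read_bmp_header` and the dictionary keys are my own inventions, and the sketch only decodes the header variants that begin with the same fields as BITMAPINFOHEADER.

```python
import struct

# The layouts are baked into the reader, exactly as a C struct definition would be.
FILE_HEADER = struct.Struct("<2sIHHI")       # signature, file size, 2 reserved, pixel offset
INFO_HEADER = struct.Struct("<IiiHHIIiiII")  # the 40-byte BITMAPINFOHEADER

# The header's leading size field tells the reader which struct follows.
DIB_NAMES = {12: "BITMAPCOREHEADER", 40: "BITMAPINFOHEADER",
             108: "BITMAPV4HEADER", 124: "BITMAPV5HEADER"}

def read_bmp_header(data: bytes) -> dict:
    signature, _size, _res1, _res2, pixel_offset = FILE_HEADER.unpack_from(data, 0)
    if signature != b"BM":
        raise ValueError("not a BMP file")
    (dib_size,) = struct.unpack_from("<I", data, FILE_HEADER.size)
    info = {"pixel_offset": pixel_offset,
            "dib_header": DIB_NAMES.get(dib_size, "unknown")}
    if dib_size in (40, 108, 124):
        # The V4 and V5 headers start with the same fields as BITMAPINFOHEADER.
        _, width, height, _planes, bpp, *_rest = INFO_HEADER.unpack_from(
            data, FILE_HEADER.size)
        info.update(width=width, height=height, bits_per_pixel=bpp)
    return info

# A hand-built header for a 2x2, 24-bit image (no pixel data attached).
sample = (FILE_HEADER.pack(b"BM", 70, 0, 0, 54)
          + INFO_HEADER.pack(40, 2, 2, 1, 24, 0, 16, 2835, 2835, 0, 0))
header = read_bmp_header(sample)
```

Notice that every offset and field width lives in the reader rather than in the file. If the reader meets a size it doesn't know, it has nothing to fall back on.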

JPEG is the prototypical (but certainly not the only) marker-based format. Markers are special combinations of bytes that, like a tag, tell what data is coming next in the stream. But, very much like struct-based formats and very unlike tagged formats, the data that follows the marker can be heterogeneous. In JPEG, the data that appears after the SOF (Start of Frame) marker is a record, while the data that follows an RSTn marker is just a stream of compressed bytes. The SOI and EOI (Start/End of Image) markers don’t even have any bytes that follow them. In marker-based formats, semantics and syntax are rather carelessly jumbled together.

It’s very difficult to quickly parse marker-based formats, because often markers don’t specify how much data appears before the next marker. These are very much “streams” of bytes that you’re forced to read until you come to the next marker. Consequently the number and appearance of markers is very limited and this limitation ripples through to the data that they contain. JPEG markers all begin with the 0xFF byte followed by another byte, which taken together specify which marker it is. Consequently, any 0xFF byte that appears in the compressed data has to be escaped by following it with a zero byte (0x00) so that it’s not mistaken for the start of the next marker.
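The sketch below shows what this means for a reader. It is a simplified Python walk over a JPEG byte string, not a complete parser: the names `walk_jpeg` and `skip_entropy_data` are made up for this post, and it assumes a well-formed stream. Segments that carry a length get skipped in one jump; the compressed data after SOS and RSTn has to be scanned byte by byte, stepping over each escaped 0xFF 0x00 pair.

```python
import struct

SOI, EOI, SOS, TEM = 0xD8, 0xD9, 0xDA, 0x01
RST = range(0xD0, 0xD8)  # RST0 through RST7

def skip_entropy_data(data: bytes, pos: int) -> int:
    """Scan compressed bytes; return the offset of the next 0xFF that is not escaped."""
    while pos + 1 < len(data):
        if data[pos] == 0xFF and data[pos + 1] != 0x00:
            return pos
        pos += 1
    return len(data)

def walk_jpeg(data: bytes):
    """Return a list of (offset, marker_code) pairs for the markers in the stream."""
    markers, pos = [], 0
    while pos + 1 < len(data):
        if data[pos] != 0xFF:
            raise ValueError(f"expected a marker at offset {pos}")
        code = data[pos + 1]
        if code == 0xFF:        # a fill byte before the real marker
            pos += 1
            continue
        markers.append((pos, code))
        pos += 2
        if code == EOI:
            break
        if code in (SOI, TEM):  # standalone markers: nothing follows them
            continue
        if code in RST:         # no length; compressed data follows directly
            pos = skip_entropy_data(data, pos)
            continue
        (length,) = struct.unpack_from(">H", data, pos)  # length counts its own 2 bytes
        if length < 2:
            raise ValueError("bad segment length")
        pos += length
        if code == SOS:         # compressed data follows the scan header
            pos = skip_entropy_data(data, pos)
    return markers

# A toy stream: SOI, a 2-byte DQT payload, an empty SOS header, compressed bytes
# containing an escaped 0xFF, a restart marker, more compressed bytes, and EOI.
sample = bytes.fromhex("ffd8" "ffdb00040000" "ffda0002" "12ff0034"
                       "ffd0" "56ff00" "ffd9")
markers = walk_jpeg(sample)
```

The only way to find the restart marker and EOI in that toy stream is to look at every compressed byte in front of them.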

Textual formats, such as XML, have the benefit of being self-describing and readable by both humans and machines. Their main drawbacks are the inflated size of the data they contain (even when represented in a semi-binary CDATA chunk) and the inability to quickly skip through them with binary I/O routines.

FITS is a fairly prototypical “card-like” format. As the name implies, these are fixed-length records like one might have encountered on a punch card. For example, in a format with 120-character records, the first n characters are reserved for the “variable name” part of the equation, while the remaining 120-n characters are the textual representation of the value of the record. They are frequently text-only for the descriptive part of the format with a binary payload at the end. These are easy to read, but a pain to parse, since the “right hand side” values often have to be interpreted.
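As a rough illustration of the “easy to read, pain to parse” point, here is a Python sketch for FITS-style cards. FITS itself uses 80-character cards with the keyword in the first 8 characters. The helper names `parse_cards` and `interpret` are mine, and the value handling is deliberately naive: it ignores doubled quotes inside strings, complex values, and Fortran-style `D` exponents.

```python
CARD_LEN = 80  # FITS header cards are 80 characters long

def interpret(raw: str):
    """Guess the type of a card's right-hand side from its text alone."""
    raw = raw.strip()
    if raw.startswith("'"):
        # Quoted string: take everything up to the next quote.
        end = raw.index("'", 1)
        return raw[1:end].rstrip()
    raw = raw.split("/", 1)[0].strip()  # drop a trailing "/ comment"
    if raw in ("T", "F"):
        return raw == "T"
    try:
        return int(raw)
    except ValueError:
        return float(raw)

def parse_cards(header: str) -> dict:
    values = {}
    for start in range(0, len(header), CARD_LEN):
        card = header[start:start + CARD_LEN]
        keyword = card[:8].strip()
        if keyword == "END":
            break
        if card[8:10] != "= ":
            continue  # commentary cards such as COMMENT or HISTORY carry no value
        values[keyword] = interpret(card[10:])
    return values

# A tiny hand-built header; each line is padded to a full card.
lines = ["SIMPLE  =                    T / conforms to the standard",
         "BITPIX  =                   16",
         "NAXIS   =                    2",
         "OBJECT  = 'M31     '",
         "COMMENT made up for this example",
         "END"]
cards = parse_cards("".join(line.ljust(CARD_LEN) for line in lines))
```

Slicing out the cards is trivial. Nearly all of the code (and all of the caveats) is in `interpret`, which has to guess each value's type from its text.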

Raw and opaque formats aren’t very easy to describe because they’re so varied. In a “raw” format (and there are dozens or hundreds . . . possibly more) all of the bytes are jumbled together in a payload-only file. A separate file may have a header that describes the data and helps a reader/parser make sense of the payload. Or not. These are almost always completely free of any helpful description within the file.
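A small Python sketch shows why that outside description matters. The function `decode_raw` and its parameters are hypothetical; they stand in for whatever a separate header file (or tribal knowledge) would tell you about the payload.

```python
import struct

def decode_raw(payload: bytes, width: int, height: int,
               byte_order: str = "<", sample_code: str = "H"):
    """Interpret a payload-only buffer as rows of samples, using outside knowledge."""
    n_samples = width * height
    fmt = f"{byte_order}{n_samples}{sample_code}"
    if struct.calcsize(fmt) != len(payload):
        raise ValueError("the description does not match the payload size")
    flat = struct.unpack(fmt, payload)
    return [list(flat[row * width:(row + 1) * width]) for row in range(height)]

payload = bytes([1, 0, 2, 0, 3, 0, 4, 0, 5, 0, 6, 0])

# The same twelve bytes under three equally plausible descriptions.
as_3x2_16bit = decode_raw(payload, 3, 2)                  # 3 wide, 2 high, 16-bit samples
as_2x3_16bit = decode_raw(payload, 2, 3)                  # 2 wide, 3 high, 16-bit samples
as_4x3_8bit = decode_raw(payload, 4, 3, sample_code="B")  # 4 wide, 3 high, 8-bit samples
```

All three calls succeed, and nothing in the payload says which one is right.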

Raw formats shouldn’t be confused with opaque formats, such as HDF, CDF, or netCDF. These formats are completely defined by their API, which for all intents and purposes, you have to use to access the data within the file. This allows for a lot of richness in handling the data contents, which can be organized in highly optimized ways. The downside is that you can only interact with your data through mechanisms someone else has defined. And data permanence can suffer, since if the tool chain changes (or goes out of existence) you don’t really have a way to get at your data.

Practically, each format style has its pros and cons. But tagged formats (which might incorporate features of the record style) are the most durable and easiest for third parties to work with.


* — Cue awesome “CSI” + “Law and Order” + “House” mashup daydream. *DOINK DOINK*

