About a year ago I started working on a side project to create a free tool for .lzh (LHA) archive files. I recently released the first beta version of this tool, which I've named Lhasa [1].

Eric Raymond once noted that "Every good work of software starts by scratching a developer's personal itch". In this case my personal itch was annoyance at the fact that the canonical tool for .lzh archives is in Debian's non-free repository [2] and that there was no free tool for creating or extracting them (I later discovered that there is one, but it's written in Java).

The history of the .lzh format and the LHA tool is an interesting one. The format was developed through essentially an open-source process, and the source code for the tool has always been publicly available [3]. However, the authors failed to agree on a proper, consistent license for the code. The GNU website's licenses page says: "The lha license must be considered non-free because it is so vague that you cannot be sure what permissions you have" [4].

The Unix LHA tool is therefore stuck in an inconvenient position: while there are free compression tools even for obscure formats like "ZOO", nobody bothered to write a free replacement for LHA, because the existing tool was free enough for most people. Still, it seemed a shame for it to be missing from Debian main, considering that it was once quite a popular file format, and apparently remains so to this day in Japan. It also didn't seem like it would be a lot of work to write a free replacement, and I figured it would make an interesting side project.

When any file format becomes even moderately popular, people write tools for different platforms, add their own extensions, and sometimes add their own subtle incompatibilities (both by mistake and deliberately). The .lzh format has formed in layers, rather like those in an archaeological dig. The original format was first used in a DOS tool called "Larc", which generated files with a .lzs extension. Later, "LHarc" reused the file container format with a new compression scheme, and was subsequently renamed "LHA". The original header format (retconned to be named the "level 0" format) was extended in a backwards-compatible way to the level 1, 2 and 3 header formats.
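
As a rough illustration of that layering: in the commonly documented header layout, the five-byte compression method ID (for example "-lh5-" or "-lzs-") sits at offset 2 and the header level byte at offset 20, for all of levels 0 to 3. The sketch below is in Python and is not Lhasa's actual code. The function name peek_header is hypothetical, and the offsets come from public format notes rather than from anything stated in this post.

```python
def peek_header(data: bytes):
    """Return (method_id, header_level) from the start of an LHA file header.

    Assumes the commonly documented layout: method ID at bytes 2..6 and
    header level at byte 20. Returns None if the data does not look like
    a header.
    """
    if len(data) < 21:
        return None
    method = data[2:7]
    # Method IDs look like b"-lh5-" or b"-lzs-".
    if not (method.startswith(b"-l") and method.endswith(b"-")):
        return None
    level = data[20]
    if level > 3:
        return None
    return method.decode("ascii", errors="replace"), level
```

A real parser would go on to read the sizes, timestamp and filename, whose positions differ between header levels; this only shows that the same two fields can be read before deciding which layout to use.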

Obviously wishing to support as many archives as possible, I took the approach of "retracing history", starting by implementing the oldest header formats and compression algorithms, and working through the various extensions in turn. It was quite fun to reach a certain point with my code and be able to say, "I've now reached 1991". To deal with the many different platforms, I used emulators to run various old computers and operating systems and generated archives on each, building up a suite of files for regression testing. Platforms I emulated include MS-DOS, Amiga, Atari ST, Sharp X68000, MSX-DOS (not a typo) and OS/2. Getting all these old systems up and running was rather fun in itself. There are still a few other platforms I am planning to investigate.

Having a suite of test archives like this [5] proved to be very useful. I've found that the non-free LHA tool will fail on certain archives. Tools on some platforms generate broken archives that I can nonetheless handle correctly. I made a deliberate effort to ensure that self-extracting (executable) archive files can be read correctly, as this proved to be a common corner case.
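
One way to cope with self-extracting archives is to skip the executable stub by scanning for the first thing that looks like a method ID. The Python sketch below illustrates that idea under stated assumptions; it is not necessarily how Lhasa does it. The name find_archive_start, the 64 KiB scan limit and the exact pattern are all hypothetical choices.

```python
import re

# Matches method IDs such as b"-lh5-", b"-lzs-" or b"-lhd-".
METHOD_ID = re.compile(rb"-l[hz][0-9a-z]-")


def find_archive_start(data: bytes, limit: int = 64 * 1024):
    """Return the offset of the first likely header, or None.

    The method ID is assumed to sit 2 bytes into a header, so the header
    is taken to begin 2 bytes before the match.
    """
    for match in METHOD_ID.finditer(data, 0, limit):
        start = match.start() - 2
        if start >= 0:
            return start
    return None
```

A pattern match alone can produce false positives inside the stub, so a more careful reader would also validate the candidate header (for example its checksum) before trusting it.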

With the current version in git HEAD, I'm fairly confident that Lhasa can extract almost any .lzh archive that you can find in the wild. At the very least, it should be far more resilient to invalid files. My experience has been that it isn't very difficult to crash the old non-free tool, and I'm pretty sure that there are several exploitable holes in it. If you care about security, don't use it.

In many cases the LHA program is used in a programmatic way, invoked by some higher-level program to access .lzh archives. An example is Gnome's file-roller archiving tool, which provides a user-friendly frontend, running LHA in the background to list and extract the contents of archives. It was therefore important that my tool should act as a drop-in replacement.

To achieve this I captured the output from the non-free tool when performing various operations (listing archives, extracting them, etc.) and used this as a base for comparison. It was important that the output be byte-for-byte identical, as any difference could affect compatibility with existing tools. Certain option flags can subtly affect the output, so I made a point of carefully documenting the effect of each option and reproducing it.
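
The comparison itself can be very simple. The following Python sketch shows one possible shape for such a check, assuming the reference output has already been captured as bytes. The helper names first_difference and check_output are hypothetical, and this is not the test harness Lhasa actually uses.

```python
import subprocess


def first_difference(expected: bytes, actual: bytes):
    """Return the offset of the first differing byte, or None if identical."""
    for offset, (a, b) in enumerate(zip(expected, actual)):
        if a != b:
            return offset
    if len(expected) != len(actual):
        # One output is a strict prefix of the other.
        return min(len(expected), len(actual))
    return None


def check_output(command, expected: bytes):
    """Run `command` and compare its stdout with a captured reference."""
    result = subprocess.run(command, stdout=subprocess.PIPE, check=False)
    return first_difference(expected, result.stdout)
```

For example, check_output(["lha", "l", "test.lzh"], reference) would return None only if the listing matched the captured reference exactly; the command line and file name here are purely illustrative. Reporting the offset of the first mismatch makes it easier to spot which column or option caused a difference.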

My work isn't yet complete. The biggest limitation at present is that Lhasa can only extract existing archives, not create new ones. There are also a couple of "ghost" compression formats (lh2, lh3) that aren't supported - the non-free tool can extract them, but despite extensive searching, I haven't been able to find any program capable of generating archives that use them. My suspicion is that these are "beta" compression algorithms that were tested and abandoned. Ideally I'd still like to support them anyway, just in case there are a few archives out there that use them.

[1] - http://fragglet.github.com/lhasa/
[2] - http://packages.debian.org/sid/lha
[3] - http://oku.edu.mie-u.ac.jp/~okumura/compression/history.html
[4] - http://www.gnu.org/licenses/license-list.html#Lha
[5] - https://github.com/fragglet/lhasa/tree/master/test/archives