Computer Languages as Artistic Medium

April 23rd, 2008

Not many people have the ability to look at the few fundamental elements of a system (such as constructs of a new computer language) and extrapolate the long-term implications of their use. But rather than jump in and tackle the difficult question of understanding the expressive power and limits of computation, I’ll start by talking about a simpler problem. Don’t panic: it involves crayons and coloring!

How many crayons does it take to color a map?

This example is going to seem pretty easy on the surface, and you may be familiar with it already. It involves trying to find a way to color in maps so that no areas which share a border are going to be the same color. Here’s an example of a simple coloring with 5 crayons, and then finding a method for doing it with only 4 (taken from Wikipedia’s article on the subject):

fourcolor_wikipedia

Without prior knowledge, the average Joe can only guess at how many crayons he’d need to reliably fill in any map he is given. Just by trying a few cases, he might get an intuition that a small box of different colors will do the trick. But if Joe knew the Devil was going to be presenting him with an arbitrary map and getting his soul if he couldn’t color it properly, he’d probably buy Crayola’s biggest box…120 colors! (Plus, because he doesn’t know for sure that even THAT would be enough, he’d also probably try to round up the discontinued colors off of eBay!)

Yet there is salvation from this kind of paranoia. Early formalisms of the map-coloring-problem from mathematicians comforted us with proof that it *never* took more than 5 different colors…and that 3 will certainly not always be enough. After considerable debate and experimentation over the years, we ultimately found that we only need four unique colors for any map the Devil—or anyone else—makes.

(Note: Depending on your purposes, discovering it was 4 vs. 5 might be somewhat academic. I myself feel that the more crucial step was bringing in math to save us from needing to pack a potentially infinite number of crayons.)

The programming analogue to the map coloring problem

Now…what if the “Devil’s Map” you must solve is instead an arbitrary selection from mankind’s constant creation of new ideas for software? That raises a lot of questions about what a programmer needs in their backpack to be confident they could correctly “color in” any program specification with a working implementation. Here are a few issues that might come to mind:

  • Does one need to pack a language with, say, multiplication built-in… or should you just add a number to itself repeatedly in a loop?
  • If you don’t actually *need* multiplication, then doesn’t that make the need for addition and loops seem suspicious?
  • Is there anything that computers could never solve efficiently? (Or if that ever happens, is it just the case of the programmer just “missing a crayon” that the language designer accidentally forgot to put in his backpack?)

These are the kinds of great conundrums that preoccupied early computer scientists. Smart guys like Turing and Church came up with pretty deep answers. These folks were busy defining mathematical abstractions of computers and “computer languages” that were extraordinary simple, but could do anything that any other generic computer could do. I think Turing’s own words on this subject are surprisingly easy for the layman (and better than many textbook rehashes).

But trying to program in a Turing Machine would be somewhat analogous to facing the Devil’s Map-Coloring Challenge with 4 crayons that had been put through a Cuisinart! You can have complete confidence that your bag of gray-looking dust will let you color in areas just as you could with four whole crayons. But you’ll have to spend an unreasonable amount of time sifting the particles and sorting it into piles with your hands, then rubbing it into the shapes with your fingers. On top of that difficulty, you might find that Beelzebub picked a map that’s really hard to color with just 4—sometimes the 4-color solution takes a lot of time to find!

The fact that someone theoretically could find a way to overcome these obstacles doesn’t mean you’ll have the patience. So if crayons are cheap, you might want to pack the 120 color box anyway…even if it’s overkill. At least that way, any map with 120 states or fewer will be easy to color: pick out a crayon, color a region, and just don’t use that one again!

Now, a more slippery issue…

In the previous sections I discussed somewhat mathematically precise questions. But what if the challenge was a bit more abstract? Let’s say the Devil was going to ask you to produce a work with actual artistic merit for any given thing the he might ask you to draw—be that a landscape, or a space battle, or a unicorn. How many crayons would you need to make sure you could meet his challenge?

I don’t think there’d be much controversy among artists in saying that this has very little to do with how many colors you are given. Subtle differences arise in works by those who had green crayons vs. those who only had blue and yellow in their set. But it’s still “crayon art”, and most critics believe that that anyone with talent should use something different (like oil paint).

Arguments about medium can become very passionate. Still, people agree that it is useful to speak of Crayons and Oil Paint and LEGOs as being markedly different…despite some similarities in how a creative mind might apply them. Yet few would suggest that a box with more or fewer crayons should be labeled a “new” medium.

Programming is the “math problem” AND the “art problem”

I chose to start with map coloring because it’s a rigorous problem-solving domain with artistic overtones. That’s because I wanted to segue way to the idea that when you are given a programming challenge, doing the work well is not just solving math equations as some people think. Those immersed in the field start to see it more as being commissioned to make a work of art. And every artist has a medium they enjoy working in (sometimes to the exclusion of others)!

LISP and C++ are considered different mediums by most everyone. A programmer needs to use a different state of mind to program in them entirely, and that gives rise to different program types. Yet C++ and Java are very similar in their overall nature with only a few strategic decisions made to differentiate the languages.

What does it mean for a language to be a truly new medium, as opposed to a variation on a theme? That’s rather debatable. A language I have been looking at lately is called REBOL, and its advocates believe it to be a unique new medium. This is despite the fact that on first glance, it is an incredibly small box of crayons with a special color or two added, and lots of “ugly” colors taken out.

Yet to them it’s also not like any other language that already exists. So if it were a product being introduced at Toys R Us, it wouldn’t be a LEGO set with square plug-knobs instead of round ones….nor would it be finger paint that was merely a little stickier or a little easier to clean up. Beyond it’s new-ness, its champions say it also has the property of good-ness: a hopeful new way to build fun art.

Bold claims! But I’ve become interested in the way REBOL has simplified the language and interpreter so that it yields a somewhat homogenous computing substrate. It’s hands-on, and doesn’t have properties that will suddenly surprise you as you dig in. I will post another article on REBOL and its benefits soon, and link to it from here.

Addendum: note on inspiration for this article

The phrasing in this article was motivated by a quote from a post written by Joel Neely, in which he wrote:

COBOL is alphabet blocks.
C is a set of tinkertoys.
Java is an Erector Set (IIRC similar to Meccano in Europe?)
    with all kinds of braces, motors, girders, nuts, and bolts.
Perl is a Swiss-army chainsaw.
REBOL is modeling clay.

Learn to use each one according to its own nature.

When Sockets Attack: DNS & DHCP

February 20th, 2008

Somehow I made it up to 2007 without ever writing code that opened a raw network connection or pulled apart a TCP/IP packet. Naturally, I had some hand-wavy notions of more-or-less what was going on under the hood—in college I took a 400 level CS course based on an old edition of Tanenbaum’s excellent Introduction to Computer Networks. But our homework was all theoretical, and I didn’t delve very deeply into the practicality of what powers our modern Internet.

But this year I was tasked to port some Windows networking code to Linux and OS/X. Looking at what was there, I guessed it was no big deal and said I’d do it. After all…there is a layer of abstraction known as “Berkeley sockets” which is practically identical to the UNIX API for reading and writing files. How hard could it be to recompile that on a new platform?

Though sometimes it might not be that hard, this case turned out to include the functionality of a DHCP server. The Windows version was able to work at the socket level by broadcasting UDP packets to 255.255.255.255, but the OS/X and Linux versions fundamentally couldn’t do it this way. Their semantics are different for how the socket calls are translated into packets.

I want to talk about some of the high-level gotchas to watch out for if you’ve just never run up against these particular dark alleys. Because I’m talking about “weirdness” only, I’m not going to explain basic mundane socket programming—because there are many guides explaining that. (One of the best I found was Beej’s Guide to Network Programming—so check that out if you’re interested.)

How IP addresses are doled out

Most everyone has a basic idea what an IP (version 4) address looks like. The whole amazing story of the Internet is all the wacky mechanics to get a message from the computer that identifies itself as A.B.C.D to another machine that identifies itself as W.X.Y.Z, taking evolving paths. This is where companies like Cisco make their money. I’ll skip over the exciting topic of how traffic is routed and talk about the much more bone-headed idea of how addresses are allocated in the first place.

Let’s just quickly go to the command line and see what the IP address is for google.com if I ping them:

PING google.com (72.14.207.99): 56 data bytes
64 bytes from 72.14.207.99: icmp_seq=0 ttl=234 time=85.093 ms
64 bytes from 72.14.207.99: icmp_seq=1 ttl=234 time=85.008 ms
64 bytes from 72.14.207.99: icmp_seq=2 ttl=234 time=80.797 ms
^C
--- google.com ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max/stddev = 80.797/83.633/85.093/2.005 ms

We see that somehow the textual name google.com was converted into 72.14.207.99. Once we get a number, that’s when the routing optimization magic starts kicking in. But you might be thinking: “There’s a control panel where I can set my IP address to be any number I want. What’s to stop me from picking that number, and suddenly intercepting the traffic people intended to send to Google.com?”

At an abstract level, nothing is stopping you from doing this. On your own isolated network (where you control the routing) you can do just that. But your ISP is providing your connectivity to the greater world, and is thus responsible for limiting the nature of your incoming/outgoing traffic. If they assign a number to you like A.B.C.12, you might be able to get away with lying and saying you were A.B.C.13—but the farther outside your “block” you try, the more likely a high-level router in the Internet is going to ignore you.

“Officially” recognized IP addresses have been divvied up in a ridiculous process that roughly parallels the Oklahoma Land Rush. One cartoon version breaking it down was featured on XKCD:

map_of_the_internet

An even more interesting map (to me) is one that lets you drill down and navigate through which addresses are assigned to organizations or countries that have been blacklisted by many popular mail servers.

If you dig down into this map you’ll notice that 72.14.207.99 is assigned to an organization called ARIN. ARIN in turn has given their seal of approval that this number belongs to Google Inc.. Because ARIN is considered a legitimate authority by the average ISP, few routers would ever forward a packet intended for 72.14.207.99 from one of their customers to anyone other than Google.

(Hackers are constantly trying to subvert this, of course. Though I’ll point out that since who-gets-what-number is based more on money and cybersquatting than it is on actual legitimacy, it’s not like you should have all that much faith in the powers-that-be. In the words of the X-files: Trust No One—and that includes your ISP…especially if it’s Comcast!)

When reading socket sample code, you often find two routines for finding machines on the net. One is known as gethostbyname and the other is gethostbyaddr. You might expect that if you have a number that works from the command line in ping for finding a machine, that gethostbyaddr would work for that address as well…but this is not necessarily so.

The reason is that just because something has an IP address doesn’t mean it has a host record with your domain name server. If you feel like bypassing the DNS, you might still be able to open a socket connection to it via a directly coded sockaddr_in (as ping does). But by doing so, you better know what you’re doing…because you’re bypassing any checking that your DNS servers might have been doing to make sure that IP address is who they are claiming to be!

Dynamic IP Addresses

We know there’s such a thing as a “static IP address” where you tell the machine what address has been assigned to it on the Internet map. There’s another thing known as a “dynamic IP address”…where each time you connect your computer a server tells it what address to use for that session. If you turn your computer off you lose it and have to get it again, and it might be different.

Just because an IP address is dynamic doesn’t mean it gets to ignore the rules of the Internet allocation strategy. If you want to talk to other computers, then the dynamic address must fit into the map. But what if you’re setting up a local network of thousands of computers and you don’t want to interfere with the rest of the world? Are any addresses reserved that are not meaningful on the global internet, but you can use locally and not be considered a liar?

Well, there were, and you run into them rather often…such as these private network ranges:

  • 10.0.0.0 — 10.255.255.255
  • 172.16.0.0 — 172.31.255.255
  • 192.168.0.0 — 192.168.255.255

So let’s say you were Google and you own thousands of machines, but the only address you have registered through ARIN is 72.14.207.99. You can use that address as a bridge for connecting to the broader Internet, while your corporate machines all have private addresses to talk to each other. It’s kind of like when people inside a company are using extension #s off of a main line instead of having their own full telephone number. All traffic intended to come in and out of the global internet will go through the main number and then be forwarded to the right local address.

If you’re the average internet user, chances are that you have a private network address right now. You’re probably connected over Wi-Fi or a router that lets you share an Internet connection with the public internet number handed out by an ISP. For instance, on my machine right now connected via Wi-Fi I am 192.168.1.3:

bash> ifconfig
...
en1: flags=8863 mtu 1500
    inet 192.168.1.3 netmask 0xffffff00 broadcast 192.168.1.255
    ether xx:xx:xx:xx:xx:xx
    media: autoselect status: active
    supported media: autoselect
…

But if I go to http://whatismyipaddress.com, it reports a public number belonging to my ISP.

If you are enjoying the study of these magic numbers and how they are handed out, you might be interested to know that a poor technical decision led to the loss of all the 127.xxx.xxx.xxx numbers save one—127.0.0.1. This is an alias for your own machine (known as Loopback). Wikipedia’s article on the topic explains how a similar oversight in inconsistent implementations led 0.0.0.0 to be unused.

The DHCP Protocol

So let’s say your machine gets a dynamic address from whatever router or coffee shop you happen to wander into. How does your machine find out what number it’s supposed to use? There is a chicken-and-egg problem: the server that hands out addresses can’t tell you who you are with a message addressed from itself (A.B.C.D) to you (W.X.Y.Z), since you do not yet recognize that traffic for W.X.Y.Z is meant for your machine!

Special IP address formats, such as 255.255.255.255, were reserved to mean “send this message to everyone on the network, whether they have an assigned IP address yet or not”. So a machine looking for an IP address would thus broadcast a magic number to 255.255.255.255 and say “if anyone listening on my network would like to give me an IP address, then please broadcast back to 255.255.255.255 the magic non-IP-address number I just sent (so I know you’re talking to me), along with the IP address I should use.”

You might be guessing that the semantics of these “special” addresses are going to be a little wonky. In other words, don’t expect to telnet to 255.255.255.255 and have anything interesting happen, such as simultaneously connecting to every computer in the world:

bash> telnet 255.255.255.255
Trying 255.255.255.255...
telnet: connect to address 255.255.255.255: Permission denied
telnet: Unable to connect to remote host

Given the fact that you clearly can’t use these magic addresses anywhere you might put an ordinary IP address, then when *can* you use them? And since you don’t expect all the millions of computers worldwide to get your message… which subset of computers will actually get it?

Remember how when you try to send a message pretending to be Google’s IP address and it gets squashed once you bubble that up toward the Greater Internet? Any messages you broadcast to 255.255.255.255 will be squashed at least by then, maybe sooner. The lost messages can even happen inside your computer, such as one with multiple network cards—the socket layer might send the outgoing data to only one of the cards, or to neither!

So if you are reading socket code and see any IP addresses with 255’s in them, start worrying. The broadcasting phenomenon is handled differently on various platforms. You might get the desired behavior on one platform or when only one network card is plugged in, but to get the behavior you want you might have to use a lower level API than the socket layer.

Going underneath the socket layer

Socket tutorials on the web give the surface impression of being everything you need…and certainly they’re adequate for a lot of basic programs. But the coolest network apps and utilities need more. The code inside of a router itself is probably not done with sockets, Skype is basically hacking and shaping on the packets themselves with their application—that’s how they work magic.

Some great exploratory network tools are based on simple socket layer, such as netcat. But if you’re in a case where you need to go deeper, such as what I found with DHCP, you need something more powerful.

Where to start? One good place is libpcap for receiving packets off of specific network cards in your machine (including the ability to do “hackerish” things like read packets not intended for your address). This is the foundation of the more powerful tool known as tcpdump.

Especially important for cross-platform network code dealing with packets is the libnet library. Not only does it have some APIs that make it a little less tedious to build various packet types (e.g. by implementing checksums for you), it can find out the hardware Ethernet address of an adapter in a cross-platform fashion. Plus, it has defined structures for every networking struct you can imagine!

(Note: The libnet project’s homepage is a little out of date. Refer to the updated documentation inside the archive.)

Special thanks

I’d like to thank Peter Sichel from the Apple Network Programming mailing list, who was very helpful while I was dealing with the DHCP porting issue. He makes a product called IpNetRouter for OS/X. Here’s our conversation thread:

HOWTO: a “Perfect Reconstruction” Graphic Equalizer

December 16th, 2007

If you’ve ever looked at a hi-fi component stereo system you’ve probably seen a graphic equalizer. Traditional analog ones look like this one sold by Crutchfield:

A graphic equalizer stereo component

Each slider lets you raise or lower the volume of certain sounds in your music. If you know that a flute is in a frequency range of 250 Hz - 2500 Hz, you could make the flute part louder by raising the sliders that corresponded to that range. (As you see from this acoustics chart, frequency ranges for instruments do overlap—so raising a flute can possibly also affect other things like Oboes or vocals.)

Cheaper stereo systems usually only offer “bass” and “treble” knobs. Yet nearly every MP3 player has a digital EQ. Even iTunes has one, if you click Show Equalizer on the View menu:

The equalizer in iTunes showing a peak for flute

But here’s a problem: because graphic equalizers break your signals into pieces and then put them together again, they introduce some distortion. There’s a simple way to get a sense that this distortion exists: you can set all the sliders to “flat”… run the equalizer algorithm… and see that you don’t get the same data out that you put in.

Back in 1996, I came up with a way you can design a digital graphic equalizer so the algorithm will exactly output the original signal when sliders are flat, even after running your audio through a barrage of filtering mathematics. Practically speaking, you would want to bypass the equalizer in that case for performance reasons. Yet there is an alluring elegance to an approach which would converge to the original signal if you ran it… and it strongly suggests you’d have overall less distortion when the sliders weren’t at zero. (This turns out to be true!)

I employ only two digital filters: a specially matched highpass and lowpass. Recursively applying these filters on downsampled versions of the data creates the exact logarithmic bands that are found in graphic equalizers. The process additionally leads to a phenomenon known as “perfect reconstruction”. I explain the details below.

How do other equalizers work?

There are many different ways you might choose to implement a graphic equalizer. Here’s a brief survey of methods I’m aware of:

  1. Frequency Sampling: A graphic equalizer can be seen as a simple digital filter which is redesigned each time the equalization settings change. When looked at this way, the obvious method for creating a graphic equalizer is to incorporate a filter design routine into your digital device, and change the desired frequency response Hd(ω) according to the equalization levels. Since frequency sampling design is as simple as an inverse discrete fourier transform (IDFT) of the desired filter curve, it is entirely possible to implement this on inexpensive hardware.

    The advantage to this approach is that it is relatively easy to understand and implement. Only one filter is required at any given time, so it uses memory effectively. Unfortunately, since a complex filter design algorithm is not feasible on inexpensive hardware, there will undoubtedly be ripple and Gibbs phenomena from the IDFT. Reducing this ripple will require filters of somewhat extreme length. More importantly, because the low frequency bands have so few points relative to the high frequency bands, the response in these regions will always be relatively poorer.

  2. Filter Banks: Equalization using filter banks means that you would design a high quality band-pass filter for each frequency range, then run the data through each filter in parallel and take the weighted sum of the outputs to form your final signal. The weights used in the sum would be derived from the equalization settings. This method is directly related to analog graphic equalizers, where a potentiometer is used to change the contribution of a different band-pass filter for each band.

    This approach could quite possibly achieve a better response than the frequency sampling technique. Because the band-pass filters are designed in advance, they could be made using a more sophisticated algorithm than the IDFT. Also, no run-time filter redesign process is needed, thus simplifying the DSP routine. However, since all the filters would still be the same length, the fundamental problem would remain of the low frequency bands being more poorly filtered than the high frequency bands. In addition, if you are going to have N levels of equalization, this technique will need to apply N filters and hence need N times the number of mathematical operations that the single-filter frequency sampling technique required!

  3. Algorithmic Modification: Rather than applying several different filters to the data and then constructing a weighted sum from the outputs, it may be possible to determine the relationship between an equalization slider value and the impulse response of the filter to be designed. This way, whenever a user modifies an equalization level, the impulse response of the filter can be affected in a predetermined way for that modification. One can imagine the impulse response starting out as a unit impulse when the equalization levels are flat, and adjustments to the equalization levels causing the filter’s impulse response to undergo mathematical corrections.

    This approach could achieve much the same result as the filter bank example, except that it would only generate a single filter (as in the frequency sampling case). Additionally, the frequency response could be perfectly flat if all the equalization levels were flat. The biggest disadvantage of this approach is the complexity involved in figuring out if such an algorithm exists, and if so…exactly what it is.

My Downsampling Approach to Graphic Equalization

The method I came up with involves using only two pre-created filters to provide any number of equalization levels. The signal is recursively split into its high half and low half of the frequency spectrum—and the low half is then downsampled to pull down the band edge to half the former Nyquist rate. Then the high half is weighted according to the appropriate equalization setting and added into the total result. I’ve drawn out this procedure in the figure below (Eq1..EqN represent multiplication by the weightings of the frequency bands from the graphic equalization levels.):

Downsampling equalizer block diagram

The recursive binary division of the low half of the frequency spectrum results in a logarithmic drop in the bandwidth as the frequency becomes lower. The low-pass filter causes the data to only occupy the lower half of the frequency spectrum, and we take advantage of this by cutting the sampling rate in half. Then the same high and low pass filters can be applied to the data again.

It would seem that since it is not possible to create an ideal high pass or low pass filter, that this technique will always throw out some important data. However, if the high and low pass filters are exactly complementary, the theory is that this process will provide the original signal when the equalizer weightings are at 1.

If this is the case, then this technique should provide a completely flat frequency response graph when there are flat equalization levels—and the output will merely be a time-delayed version of the input. There also are not resolution-limiting problems seen in the low frequency (narrow-width) bands from the other techniques, since the same size filter is being used to produce each band.

This method is somewhat more complicated than the others. Yet depending on the order of the filters used it could represent a drastic increase in equalizer quality.

Implementation of the Downsampling Equalizer

In order to code the block diagram as an actual program, we need to clearly define what the ↓2 and ↑N boxes are doing. Though we theoretically only need half of the samples to represent the information once it has been low-pass filtered, this cannot be suitably accomplished in this algorithm by throwing away every other sample and then restoring samples through a hold function. Full interpolation would be required to restore the missing samples.

Instead of using interpolation, I used an alternate method to to downsample, filter, and then upsample again. Two levels of recursion are shown, and the time domain representations of each graphic equalizer band are shown as the partial results (band 1 and band 2). If this procedure is applied, then these arrays merely need to be weighted and then added together to produce the output signal:

A visualization of the downsampling graphic equalizer implementation

It may appear that a lot of calculation is being performed to produce these equalization bands. However, the parallelism and logarithmic decrease of the array widths being operated upon makes it so that each level really only corresponds to the application of a low pass and a high pass filter to an array the size of the original data. Also, the “odd/even split” steps and the “interleave” steps can actually done in-place—and require no processing time.

The “lowpass” and “highpass” functions were performed by convolving the data with a linear phase filter created using the Remez Algorithm. The filter was specified in two bands—one from 0 to 0.45 and the other from 0.55 to 1.0. The bands were set at values of 0 and 1, appropriate to whether it was high or low pass.

The frequency response of this equalizer could not be directly calculated, so it was performed by measuring the response to a unit impulse. Applying a “midrange boost” ([1 2 3 2 1]) forms a very good response using length 32 lowpass and highpass filters, and has a respectable response even with length 16 filters.

Applying the algorithm using a level set of equalization values was the true test of whether or not it held up to one of the few tangible criteria for a graphic equalizer, and the results were encouraging. Despite several levels of data operations (5 applied levels of recursion) the response was perfectly flat for filters of length 32 and 16.

Investigations into the limits of the equalizer show that it performs quite well with almost any number of bands, and the bass bands do not degrade terribly even though they become quite narrow. A truly fair comparison of this technique with other possible methods would involve a detailed measure of the number of operations required as well as psycho-acoustics, but this filter nevertheless performs remarkably well.

Postscript: Why is it called a “graphic equalizer”?

Looking at the arrangement of sliders you can see kind of a “graph”, whose X-axis is the frequency you want to modify and whose Y-axis is whether you are amplifying that range or cutting it out. This is how it gets the name “graphic equalizer”. Because of that troublesome word “graphic”, while researching for this article I found that a lot of people confuse this with the dancing animated bars while you play music. Those are called spectrum analyzers, and if you’re geeky you can buy a shirt with one on it:

A spectrum analyzer shirt (not a graphic equalizer shirt!)

Spectrum analyzers are related to graphic equalizers, because they show you what frequencies are currently playing in your music. Bars closer to the left represent how much bass is playing, and bars closer to the right correspond to the treble. So if you were playing a flute, you would expect to see the bars jump in the range that corresponds to the flute. However, a spectrum analyzer doesn’t let you change how much flute you hear… it just displays it.

As a sidenote, many flash-animated spectrum analyzers are fake and have nothing to do with what music is playing. For instance, even when you’re listening to dead silence in the Myspace music player, the bars will still hop along. This kind of thing upsets me because it complicates technical education by showing people random nonsense to seem advanced.

Valid XHTML 1.0 Transitional

Mouse Placeholders for when Programs Lose Focus

November 25th, 2007

Programmers typically assume that a MouseUp() message won’t happen unless they had previously received a MouseDown()… and that these signals will come in precise pairs. Yet in almost every modern system that runs multiple applications at once, you will hit edge cases that send your program a MouseUp() when no MouseDown() ever happened—or two MouseDown() messages in a row.

There is no well-studied example of how to write your mouse handling code in a way that accounts for these cases. As a consequence, merely task-switching while still in the middle of a mouse operation will send a lot of applications into unexpected conditions! Many will crash or assert when you return… and those that don’t crash often do something bizarre. So I thought I would make a screencast of a prototype I made in 2002 which insulates programmers from these concerns. Even if you’re not a programmer and don’t want to read the whole article about the implementation, you might think the feature itself is unique, so check it out!

(Note: I use alt-tab to take the focus away in mid-mouse stroke, because that was easy to choreograph. But of course the technique is more compelling when an application in the background jumps forward and “steals” the focus. )

As you can see, my library took control of the mouse message pump and reduced the concerns of the programmer. If a mouse gesture was interrupted somehow I didn’t cancel the operation (nor did I just pretend the mouse button had been released and commit it). Instead, the library put the application into a suspended state with a placeholder icon at the last known mouse position. Clicking inside the placeholder restores your cursor to the previous coordinates and resumes the mouse operation, while pressing escape lets you abort.

Use Command objects that only run on MouseUp

In the past I’ve written about the importance of designing your program’s command processor in such a way that undo and redo operate consistently. One of the rules I mentioned was that your command processor shouldn’t be modifying the user’s document on MouseDown() or MouseMove(), but should accumulate the state in a Command object that is only submitted to your Undo/Redo queue when the MouseUp() is finally reached.

The good news is that if you’ve done that part right, then most of the necessary support work for this feature is already done!

A Command object not only makes things clearer for undo and redo—it also gives you a fantastic way of holding a mouse operation in “suspended animation”. The counterintuitive aspect is that pending commands must participate in the rendering process—since the document’s state alone is not enough to draw the view.

Build on Drag&Drop APIs, not mouse messages

One issue that I really had to grapple with was how to retain some control of the mouse cursor even when it had left my application. Sadly, running SetCapture() on Windows still means that the cursor will turn into the default arrow after the focus is lost. The way I found to work around this on Windows was with the Drag&Drop APIs—which turned out to be “tighter” than the default mouse API.

On Windows and other platforms, the Drag&Drop methods are precise analogues to the mouse messages we are familiar with:

  1. DragStart() = MouseDown()
  2. DragOver() = MouseMove()
  3. Drop() = MouseUp()

Read the rest of this entry »

Tying Undo/Redo Actions to a Single User Event

November 25th, 2007

Windows Explorer has a rather odd quirk when you rename a file. If you happen to have several items selected, all those selected files are given the exact same name. I don’t like the behavior and it only ever happens to me on accident, but Microsoft documented it… so I’ll let it slide for now.

What I *will* complain about is what happens when I try to undo one of these multi-renames. Despite only running one command, you have to press undo multiple times to restore your state! You have to actually hit undo for each file you had selected. (Adding injury to insult, Windows only keeps 10 undo items at a time—so you *can’t* get back to your initial state if you had 11 or more files selected.)

I demonstrate this defect for you in the video below. Windows Explorer isn’t the only program that does this, so after the video I’ve written about how we can architect our software so that it will never require more than one undo per “user command”:

Why would this happen?

To understand why this happens, you have to know a little bit about how a typical undo manager works. Most applications simply have a list of objects representing recent commands that have been executed. These “Command” objects usually have the following methods:

  1. “Run” (or “Do”): to execute the command initially, while storing enough information inside the command object that it may revert the effect
  2. “Undo”: remove the changes the command made by using the stored information (assuming the relevant application state is in the precise condition after the command was finished)
  3. “Redo”: bring back the effects of the command using the stored information (assuming the relevant application state is in the precise condition as when the command was initially run)

It might seem that “Run” and “Redo” are redundant, but they actually perform different functions. Just imagine a command that inserts the current time into your word processing document. If someone runs this at 4:00 and then undoes it at 5:00, they expect a redo moments later to restore the “4:00″ they just undid…not inject “5:01″! The undo and redo methods merely playback the action data that was stored during the command—by design.

Once you understand that a simple command processor will only undo one “Command” at a time, it becomes easy to intuit what Windows Explorer’s problem must be. Somewhere in the shell there is code which looks vaguely like this:

class RenameCommand : public Command
  {
  string filenameNew;
  string filenameOld;
  FileHandle fh;
 
  RenameCommand(FileHandle fh, string filenameNew)
    {
    this->fh = fh;
    this->filenameNew = filenameNew;
    }
 
  void Run()
    {
    filenameOld = fh->GetCurrentName();
    fh->SetName(filenameNew);
    }
 
  void Undo()
    {
    assert(fh->GetCurrentName() == filenameNew);
    fh->SetName(filenameOld);
    }
 
  void Redo()
    {
    assert(fh->GetCurrentName() == filenameOld);
    fh->SetName(filenameNew);
    }
  };

Then when it came time to implement the multiple renaming facility, the Microsoft programmers didn’t modify RenameCommand to take a list of files. They simply told the command processor to invoke several of them in series:

string filenameNew = Shell.GetNameFromUser();
FileHandle fh;
foreach (fh, Shell.GetSelectedFileHandles())
  {
  Command* cmd = new RenameCommand(fh, fileNameNew);
  commandProcessor.RunUndoableCommand(cmd);
  }

This is where they get into trouble. They’ve added multiple undo items to the command processor’s list for what conceptually should have been a single command.

Tying transactions directly to user events

A naive workaround would be to add the idea of a transaction that encapsulates multiple commands into an “undo group”:

commandProcessor.BeginUndoGroup();
 
string nameNew = Shell.GetNameFromUser();
FileHandle fh;
foreach (fh, Shell.GetSelectedFileHandles())
  {
  Command* cmd = new RenameCommand(fh, newName);
  commandProcessor.RunCommand(cmd);
  }
 
commandProcessor.EndUndoGroup();

Yet exposing an API like this from your command processor will not protect against the bug in any general sense. What will solve the general problem is if you only permit a few spots in the main UI loop to access BeginUndoGroup() and EndUndoGroup(). Those special places are the points where the user triggered an event that they consider “conceptually atomic”. Some good examples of these privileged moments are:

  1. a key press or running an accelerator key
  2. the selection of a menu item
  3. the pushing of a toolbar button
  4. the release of the mouse after a dragging operation

Obviously each individual mouse movement shouldn’t generate an undo group—in a paint program you wouldn’t want to have to undo each pixel from a brush stroke! Yet simply calling BeginUndoGroup() when the mouse goes down and then EndUndoGroup() when the mouse goes up isn’t the ideal solution. The problem is better solved by not allowing the mouse action to submit a command to the command processor until the mouse button is released. Until that time, the program just accumulates state from the mouse’s movement that will ultimately be used by the command’s Run() method.

There are so many benefits to deferring the calls to BeginUndoGroup() and EndUndoGroup() until the mouse button is released that I’d have a hard time condensing them all here. The features this enables warrant articles of their own! Savvy GUI developers can probably guess that most of the practical benefits relate to not having to pump UI messages while the command processor is inside a “transaction”. Yet there are more fun results, such as the ability to gracefully suspend a mouse operation when an application loses focus.

Protecting against stray document modifications

One way to get even more power out of this architectural pattern is to add runtime checks to ensure that none of your documents can be written to unless a BeginUndoGroup() is in effect. This way you protect yourself from writing a program that persistently modifies a user’s document while they are merely hovering over the application’s window, or running in the idle loop. During these times there would be no user event with which to associate the effects, so the undo behavior is going to seem random.

I know many programs as they are currently written would choke on these strict rules—but look closely and ask yourself how you expected the undo/redo to work otherwise? You probably are just hiding bugs very much like the one in Explorer above. Ensuring that a user-motivated undo group is always in effect before invoking a command with the power to make document modifications will protect against a number of awkward scenarios.

(Note for C++ programmers: to take this even farther and protect against stray document modifications at compile-time, try my suggestion of “extreme” use of const’s transitive power! If the only place your application gives out non-const pointers is as a parameter to the Command.Run() method, you guard against all kinds of accidents.)

Final thoughts

I do want to add in the caveat that there are probably scenarios where you want to hit undo fewer times than the number of user events. In Microsoft Word, typing a sentence and hitting Ctrl-Z will remove the whole sentence—and that’s sometimes what the user wants. Yet even in these cases of providing higher-order undo commands (which I approve of), the user should still always be allowed able to undo on a per-operation basis.

Also, there is no way to avoid the multiple-undo situation when a third-party software tool is directly manipulating the user interface of an application on a user’s behalf. A devil’s advocate might even argue that the bug in the Windows Shell happens because renaming multiple files isn’t an intrinsic function of Windows Explorer, but rather a convenience provided by a separate “Windows Shell Extension Tool”. Yet this is hogwash, since any system advertising an extensible architecture should be able to handle those plug-ins gracefully within its undo model.

In summary: I am convinced that requiring undo more times than the number of user events is a sign of poor design. Your undo/redo model will be clean and solid if you manage your undo groupings according to the guidelines above. I’d love to hear any success or failure stories people have of working with this approach, so please comment.


Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported
Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported