Improving Anti-Aliasing

I was able to improve the antialiasing rasterizer significantly. So far I calculated the coverage on ‘intersection pixels’ by scanning each scanline 4 times (or any other power of 2 for that matter) and counting the intersections within each pixels. The coverage (delta) for each pixel is then calculated as count / max. For example, if we count 4 intersections in a pixel when we did 8 scans, then we have a coverage delta of 0.5 for that pixel.

So far so good. But it quickly turned out that this approach is flawed. Imagine for example a straight vertical line right through the middle of a pixel. The above calculation would result in the maximum number of intersections, and thus the coverage delta is 1.0 rather than the expected 0.5. Visually this looked like vertical edges are ‘cut off’, in contrast to the horizontal edges which are smooth.

Luckily this problem can be solved without any penalty. I only needed to account for the actual intersection point inside the pixel. So now I not only count the intersections, but also add up the ‘fractions’ that the intersections create. I can increase the accuracy by a factor of 16 (or more, but I reserve only 4 binary digits accuracy for the ‘horizonal’ coverage, and leave up to another 4 digits for the ‘vertical’ coverage). Of course, the actual math behind all that is slightly more complex than this, but you get the idea.

This makes the rasterizer good enough for rendering arbitrary shapes smoothly. Even TrueType glyphs can be rendered nicely with it. The following screenshot uses 4x scanning only with 16x horizontal resolution.

I proposed this rasterizer as possible replacement for the encumbered graphics rasterizer in OpenJDK. Let’s see if it turns out to be feasible. If yes, then it’s going to be interesting to solve the legal issues around it. I think the worst (and likely) case would be to do the copyright grantback dance with the FSF and provide the code to OpenJDK via the SCA. If not it’s going to be used anyway in the Escher AWT peers, in the JamaicaVM graphics backends and font rasterizer and hopefully soon in the Odonata toolkit (which has been proposed as new OpenJDK project, which is an exciting idea for itself).

I’m off now for two weeks, celebrating my birthday tomorrow and my marriage in a week, hurray! Must be my lucky year. Absolutely incredible so far. 😀

Efficient Anti-Aliasing

These days I did what I wanted to do for a long time, that is implement an efficient anti-aliasing for the rasterizer of the Escher peers (this rasterizer can be used in arbitrary other scenarios as well though). This was mainly triggered by OpenJDK’s need for a replacement for their encumbered rasterizer, and I hope that mine is probably a good fit. Let me outline how it works.

Before doing anti-aliasing, let me show how shapes are generally rendered to a raster in general:

Rendering a glyph

Basically, we scan the shape line by line, and at the intersection points we ‘turn on’ or ‘turn off’ the pixel at that intersection point. In order to do this efficiently, we first sort the edges of the shape according to their Y and X coordinates.

Ok, but what about anti-aliasing? Basically it’s the same, only that we scan each line more than once. Usually each line is scanned a -power-of-2 times, for example 8. Depending on how often we intersect the shape in a pixel, we can calculate a coverage value. This coverage value is then used as alpha value when filling the actual pixel.

The interesting part here is how the coverage values are managed.  A simple and naive approach is of course to store them in an array, one entry for each pixel. Now if we scan a line we proceed just like in non-AA rasterization and turn on (add 1) to the entry in this array that corresponds to the pixel when we ‘enter’ the shape, do the same for all pixel entries in between until we ‘leave’ the shape for that scanline. This is repeated for the same pixel line 2^N times. In the end we have an array that has the correct coverage values for each pixel.

Of course, iterating over the array and adding up the coverage entries isn’t exactly efficient. A better idea is to only store deltas. So, instead of adding 1 to all the pixel entries ‘inside’ the shape, we only add 1 at the entry pixel and substract 1 at the exit pixel. This is much better.

We must still iterate over the whole array though when we want to display it to the destination surface, to find consecutive chunks of equal pixels. Imagine that for a simple 600 pixels wide rectangle. Pretty bad. So what I did today and the last couple of days is design a datastructure that manages the coverage values in a much more efficient way. Basically, it is a simple linked list of deltas together with their X coordinates. This list is always sorted according to the X coordinates. Each time we scan a line, we insert one entry into that list for each intersection point. In order to find the insertion point, we do a linear search. A linear search? Haven’t we all learned that this is pretty inefficient. Yes. But here we make use of the fact that the scanline is always scanned in a linear fashion, from left to right, and that the list is always sorted. The data structure has a pointer to the last item that got inserted, and starting from there, it is not possible (or at least highly unlikely) that the next insertion point is more than 2 items away. When the next scanline inside the same Y pixel row is scanned, we ‘rewind’ the data structure to point to the first item again, and start searching there. Finally, when doing the actual rendering, we can simply iterate this list, and fill complete segments in one go. This can be quite efficiently implemented on most backends (for example, for a framebuffer it’s basically a simple memset operation when the color is opaque).

A nice side effect of this technique is that non-anti-aliased rendering can now be handled as a special case of anti-aliased rendering, with no performance penalty. Another little detail of the implementation is that the datastructure doesn’t throw away and re-allocate its entries, but keeps unused items for later reuse. This way, the scanline converter usually needs allocation, after some initial setup of its tables and structures.

MP3 Vinyl

Hey cool. As mentioned on /., there are now record companies selling Vinyls with the option to download the songs as MP3. For me that’s a great move, I love vinyls, I always prefer buying Vinyl (sometimes DVD-audio) copies of music over CDs when available because of the much better quality, but usually make a copy on a CD for listening in the car or in the train. This is usually of quite poor quality because I don’t have the high-end hardware for doing proper analog->digital transfers. Beeing able to download the songs makes this problem go away completely, yay. Now if they’d add the option to download OGGs of the songs, the world would be perfect. I’ll ask the companies right away. Let’s see what they say.

Colors all around

Just a quick update, I got the ColorConvertOp plus java.awt.color from GNU Classpath working nicely in OpenJDK, and could run the Java2Demo with it. Now one more piece to go is the ColorModel class, which has some references to Sun’s internal cms classes.

GNU Classpath CMS on OpenJDK

CMSing around

I am still looking at fixing that color management encumbrance. Right now the GNU Classpath CMS seems the most attractive choice (to me). I did some quick comparisons:

Performance

I wrote a small test program that converts 1000000 pixels from gray to sRGB. Here are the results:

JDK6: 26301 ms

plain OpenJDK: 13173 ms

OpenJDK with Classpath CMS: 422 ms

Yes that’s right! Only about half a second for 1000000 pixels. Apparently it’s not a good idea to do a million very small JNI calls with hotspot.

Accuracy

I took the above program and adjusted it to do comparisons of accuracy. There’s one program that does 1000000 random pixel conversions and writes the results to stdout, and another program that reads in these reference values and does the same thing on the current build and creates some stats about the divergences. Here are the results:

JDK6:

average difference from reference: 0.0
maximum difference from reference: 0.0

Apparently this is the reference 😉

plain OpenJDK:

average difference from reference: 0.00380017
maximum difference from reference: 0.038417127
Take into accout that the value range is 0.0 -> 1.0. This confirms one of the outstanding bugs with littlecms and OpenJDK.

OpenJDK with Classpath CMS:

average difference from reference: 0.00280433
maximum difference from reference: 0.0042832294

Much better, isn’t it?

TODO

From an implementation POV, the next obvious thing to do (AFAICT) is that some classes (namely ColorConvertOp and ColorModel) in java.awt.image use internal implementation classes of the CMS, rather than using java.awt.color. These need to be reworked to work with GNU Classpath’s CMS. This is not a big deal though. I’m on it.

Then this code would need much more testing for conversions other than GRAY->RGB. Unfortunately, OpenJDK has no tests at all for the CMS stuff. I asked the Sun developers to run the package against their testsuites. Let’s see if they do it and how it performs.

The other problem is the legal side of the story. While there’s nothing against mixing GPL’ed code in general, Sun and the FSF have some special requirements with such things. AFAIK, the FSF and Sun are talking about how to solve the problem of integrating each other’s code easily.

Color encumbrance fixed

… well at least for me. I simply included GNU Classpath’s java.awt.color implementation into the OpenJDK tree, compiled, and voila! It seems to work well here for those tests that I have here. Unfortunately, the OpenJDK doesn’t seem to have own tests for the color management stuff, so I couldn’t thoroughly test the stuff. It is a 100% Java implementation of the color management library so it should be easy to handle.

For those who want to try it, I have bundled the Classpath sources for the related packages. Simply drop it into the OpenJDK tree, adjust the Makefile a little to include the gnu/java/awt/color packages and there you go.

Update: This, btw, fixes the problem seen with littleCMS. The other two problems I couldn’t test quickly, due to lack of testcases. I’d think that these could be fixed by this ‘patch’ too.

OpenJDK

Glad to see most of the JDK code released under GPL now. Great news and reason to celebrate a little with a bottle of wine! Cheers!

I just finished downloading the OpenJDK source code. I’m glancing over it now. Some pieces that I have immediate interest right now (because they’re bugging me at work) unfortunately aren’t there, in this case java.awt.image.***SampleModel.java. Too bad. I’m trying to mix in some GNU Classpath classes and see how that works. Besides that, I quickly suscribed to a couple of mailing lists. In the next weeks I’ll check if and how I could possibly help out.

It is my hope that the contribution process will pan out nicely, so that it is not too hard for non-Sun contributors to get in the game. That should be a high priority for Sun too, because what is it worth to free the code without building a healthy community around it.

Thanks Sun! That is certainly a great milestone that is likely to change the software landscape significantly.

Nice to see Dalibor in the governance board.

PS: That feels so good. And Classpath shines in there too 😉

/*
* Copyright 1995-2007 Sun Microsystems, Inc.  All Rights Reserved.
* DO NOT ALTER OR REMOVE COPYRIGHT NOTICES OR THIS FILE HEADER.
*
* This code is free software; you can redistribute it and/or modify it
* under the terms of the GNU General Public License version 2 only, as
* published by the Free Software Foundation.  Sun designates this
* particular file as subject to the "Classpath" exception as provided
* by Sun in the LICENSE file that accompanied this code.
*
* This code is distributed in the hope that it will be useful, but WITHOUT
* ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
* FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
* version 2 for more details (a copy is included in the LICENSE file that
* accompanied this code).
*
* You should have received a copy of the GNU General Public License version
* 2 along with this work; if not, write to the Free Software Foundation,
* Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA.
*
* Please contact Sun Microsystems, Inc., 4150 Network Circle, Santa Clara,
* CA 95054 USA or visit www.sun.com if you need additional information or
* have any questions.
*/

Escher by numbers

I did some preliminary and very simplistic benchmarks with the current development version of Escher vs. the released version 0.2.3.

The benchmark is doing 1,000,000 operations per run (e.g. 1,000,000 lines or rectangles) with random color and random coordinates.

This is measured using JDK6 Hotspot VM.

Test SVN Local SVN TCP 0.2.3 TCP
lines 6765 7938 9291
rects 6812 8274 9363
fill rects 58072 59625 113316

For comparison, the following is measured with JamVM:

Test SVN Local SVN TCP 0.2.3 TCP
lines 31992 35851 232868
rects 32240 36250 232809
fill rects 59171 59746 231549

Update : I’ve written an equivalen benchmark for AWT and did a test on JDK6:

lines 6240 ms
rects 6332 ms
fill rects 44512

That’s still slightly better than Escher, but pretty close already. With some more protocol enhancements and the usage of NIO, there is hope that Escher can catch up with JDK6 performance though.

Mysterious segfault

When you get a segfault in a function, all input arguments look ok, the state of the program looks ok, and it’s probably even working in certain circumstances, just not in _this_ one (for example, different libraries, different drivers or different Java VMs). That is, the segfault is completely mysterious. It might just be the case that the calling stack has an overflow. This can be caused by really deeply nesting function calls or passing large datastructures, or simply by a stack that is too small. Another indication of that problem is when gdb points you to a source line that has a function signature.

You can probably solve the problem by making sure that you don’t pass large datastructures around in function calls (only pointers to them!) or – if that is not the case – by enlarging the stack. This can be achieved either by setting some flag in the compiler or linker (in ld, it’s –stack IIRC), or in the program when creating the thread.

My case was especially annoying, the program was segfaulting when using the proprietary nvidia driver, but was running well with X.org’s ati and intel chipset drivers. I could trace the problem exactly to the segfaulting OpenGL call (glXCreateContext), the ingoing parameters all looked OK and I just couldn’t figure out what went wrong. Unfortunately I couldn’t even ‘look deeper’ in the stack, because the nvidia driver has no source and no debug information. It was a different program (that suffered the same problem) that brought me to the stack size thing. Apparently, the nvidia driver does call deeper into the stack and/or passes some bigger datastructures around. (And yes, the program was compiled with a smaller default stack size that usual).

Escheritis

I almost finished the first step in my attempt to make Escher more efficient. This was by far the biggest part I supposed. I had to restructure almost all of the Escher classes to use the new protocol implementation. What I did was this: The old protocol implementation created a new Request object for each little request (unless, when multiple requests, like many lines one after the other, get collapsed), each with its own buffer. This buffer then got copied to the socket. You can imagine that creating and disposing buffers while doing heavy rendering (like many simple things like rectangles and lines) put quite a heavy load on the GC. The new implementation uses only one buffer (ok, two; one for input and one for output), and the protocol implementation writes into this one buffer. There is no (de-)allocation at all now. However, after the core of Escher has been adjusted this way, I needed to go through all classes and adjust every bit of Escher to this new implementation. This is now finished and SVN trunk now completely compiles again (it didn’t for a couple of months).

The next step is to shake out any bugs that sneaked in with this rewrite. After this, I might release a new version of Escher.

Things I am pondering to implement for further performance improvement are:

  • Remove synchronization. The idea is that programs that don’t do multithreading shouldn’t be punished by the synchronization. And, after all, applications usually can do this for themselves much more effectively (for example, locking based on ‘frames’ rather than for each graphics primitive). I am not sure if I should ditch the synchronization altogether, or if I should put in some lock() and unlock() methods, which are no-ops when no locking is required, or which do locking via java.util.concurrent.locks, if an application chooses to do so. But my guess is that this not worth the effort or even couterproductive on non-optimizing VMs, a method call usually beeing more expensive than a monitor enter/exit.
  • Implement the communication using NIO Channels and Buffers. This way, the above described buffer would be a direct ByteBuffer and this could be sent directly to the X server, without the hidden copy that is usually made by the SocketOutputStream. This would require to extend the LocalSocket implementation in classpath to implement SocketChannel. I think this shouldn’t be too hard. Communicating over a local socket channel to the X server is almost or exactly like using shared memory area. (My guess is that the buffer is passed through the kernel directly to the receiving process to read from, but I might be wrong here). Either way, this should certainly maximize rendering throughput for OpenGL.