Re: PCB's internals




DJ Delorie wrote:
I think the key to the gtk pcb's sudden slowdown can be found in
queuing theory. As you move the mouse, the X server generates events.
They come at a certain pace, and you deal with then in a certain time.
The size of the queue of events is determined by the input rate and
completion rate. One interesting rule - when the input rate exceeds
the completion rate, the queue eventually becomes infinite. This
"trip point" happens when the redraw exceeds a certain complexity,
such as the sample board, and depends on your hardware speed too.

This is right on, and the queuing theory issues are a fundamental partt
of understanding the hysteresis problems presented by exceeding working
set. If it takes 0.1 seconds (as you state below) to redraw the screen
while running cleanly out of l1/L2/L3 cache, then you can accept mouse
events at 100 per second and keep up with the users motion without
creating a backlog that grows queue length. The problem starts when the
working set is exceeded, processor bandwidth drops due to cache
faulting, and all of a sudden it starts taking 10 time longer to redraw
the screen.

The lesstif HID was designed for my 400MHz laptop, so I went to great
lengths to avoid this problem (having been stung by it before). It
does two things to avoid the queue.

I actually used to use PCB on a 233mhz Compaq a few years back when I
was traveling, and it was pretty usable under RH9, a 2.2 kernel, and
96mb of memory. Light paging traffic, mostly caused by crond which I
would normally turn off to get rid of the jerkiness.

First, I combine motion events. When I get a motion event I remove
all motion events from the queue and query the *current* mouse
pointer. Thus, if the system is busy, I end up skipping events to
keep up.

Bravo, that is the first step in countering working set problems ...
reduce the total work load linearly, and work harder in each cache
context before faulting to another. By combining motion events you slow
down the rate of context switching to the X server, and do more work
per context switch.

Second, I redraw the screen only when the program is otherwise idle.
The event handlers only update the data structures, they don't usually
draw on the screen, just set a flag to do so later. When the event
queue is empty, I test the flag and, if set, *then* I redraw the
screen. The net result is, if the redraw takes 0.1 second, the screen
will be done redrawing 0.1 second after you stop moving the mouse.

AKA latency hiding, by defering work to a less critical time.

Also, note that PCB's core uses an rtree to store the data it needs
for a redraw. If you're looking at the whole board, you have no
choice but to go through the whole data list. However, if you zoom
in, the working set shrinks to only those objects that are visible.

That was visible in the first Gtk release last year, that zooming in
would reduce the latency lag, and at some point it would suddenly
become realtime again.

Linked lists and trees frequently have a very poor memory usage
efficiency with small data structures and lots of pointer overhead,
combined with a kitchen sink problem (everything that is related to an
object is tossed into the same structure). FpgaC suffers from this
pretty badly.

Let me explain ... the problem is that to get to one or two words of
data, we frequently reference a structure that has maybe a dozen
related variables that are not used for every operation - plus a couple
pointers for linking the objects, all without cache line alignment.
When working sets start thrashing caches, there are smarter ways to
conserve working set by getting better memory utilization:

1) separate out variables which a heavily searched/used from those that
are not critical, so large working set, latency critical operations
fetch from memory only what is needed. Using this strategy, the
attributes necessary for drawing are in one object, and other
non-critical attributes are in a secondary object. It might even be
useful to compact some of these attributes in the latency critical
object, and keep a non-compact native form in the non-critical
attribute object.

2) use segmented tables (arrays, vectors) instead of single object
linked lists where possible to avoid the pointer overheads. Using this
strategy there may still be linked lists and trees, but each leaf node
includes a dozen or more objects in a table. Thus the ratio of usable
data to pointer overhead greatly improves.

3) Use some care in designing and allocation of your objects so that
they do not span multiple cache lines. Since a full cache line is
read/written as a memory operation, when an object uses the end of one
and the beginning of another cache line, two cache lines are partially
used, which cuts memory bandwidth in half.

using these strategies can improve working set performance by a factor
of 3 to 10, and application performace once the working set exceeds
cache sizes by 3 to 20 times.

One last tidbit ... dynamically linked shared libraries have signficant
working set bloat and poor cache line balancing ... it's sometimes
useful to statically link to get better cache performance ... but that
is another long discussion about why.

Toy student applications don't need to worry about these problems most
of the time. Larger production applications where interactive
performace and batch run times are import, frequently can not avoid
these optimizations.

My two cents worth from 30 years of performance engineering experiece
from fixing bloated applications.

.



Relevant Pages

  • Re: Has anyone produced a board using Kicad?
    ... On larger designs, memory is being pushed to maintain lists and objects ... application running concurrently with equally large working set, ... provoke substantial cache thrashing, which will show up as memory ...
    (sci.electronics.cad)
  • Re: How does one stop cache flushing?
    ... cache size drops dramaticly and unnesssarily). ... The working set is the memory that has been recently used so Windows expects that it is used again in the near future. ...
    (microsoft.public.windowsxp.general)
  • Re: POWER7 information - global shared memory, eDRAM for cache
    ... and their last level, on-die, cache is eDRAM and probably around 16MB. ... memory, and novel cache architectures. ... I'm a subscriber to the Seymour Cray dictum about *bandwidth* (not ... You don't need the whole working set, ...
    (comp.arch)
  • Re: Memory Sizing Advice
    ... The working set of indexes simply don't fit in cache all that well. ... that you're acting as though cache memory isn't an analagous ... Oracle restricts block cleanout for a transaction of any size to 10% ...
    (comp.databases.oracle.server)
  • Re: Has anyone produced a board using Kicad?
    ... As for the unfairly maligned GTK port of PCB: ... architecture, cache usage, and stuff like that. ... dictate design to the PCB project which existed long before gEDA. ... to over a day and a half, simply by exceeding working set sizes for L2 ...
    (sci.electronics.cad)