Thursday, October 27, 2011

Another chart

Here is a chart of the number of words in use after each garbage collection (for MIT/GNU Scheme compiling itself with a heap of 1935360 words):
This is the smallest heap that is big enough for the compilation to complete. The limiting factor seems to be the big spike on the left at about 0.9×109. We obviously wouldn't want to run with the minimum size heap; 3935 garbage collections are needed and almost a third of the total time is spent in GC. But the more frequently you collect, the better resolution you get.

Wednesday, October 26, 2011

Garbage collection gives us the illusion of nearly unlimited memory for temporary storage. To pull off this illusion, we must be able to satisfy every allocation request (at least up to the point where we throw up our hands and raise an “out of memory” condition). In other words, there must be free or unused memory available whenever we need some. Since the point of automatic memory management is to free the programmer from having to think about it, we expect to be able to allocate temporary storage at any time we wish during program execution.

A data structure can be considered “in use” if modifications to it can change the future behavior or result of a program. Since we cannot in general statically determine the future behavior, we compute the transitive closure of the data structures under the data access primitives. This is a superset of the data “in use”. That is, if you can't access the data, it cannot be “in use”.
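To make the idea concrete, here is a minimal sketch of reachability as a transitive closure. The object graph and the references procedure below are made up for illustration; this is not how the real collector walks the heap.

;; A minimal sketch: reachability is the transitive closure of a
;; "references" relation starting from the roots.  The object graph here
;; is a hypothetical association list, not the real heap.
(define (reachable roots references)
  ;; REFERENCES maps an object to the list of objects it points to.
  (let loop ((pending roots) (seen '()))
    (cond ((null? pending) seen)
          ((memq (car pending) seen)
           (loop (cdr pending) seen))
          (else
           (loop (append (references (car pending)) (cdr pending))
                 (cons (car pending) seen))))))

;; Example: a tiny object graph where c and d reference each other but
;; are unreachable from the root a, so they are garbage.
(define graph '((a b) (b) (c d) (d c)))
(define (refs obj) (cdr (assq obj graph)))
(reachable '(a) refs)   ; => (b a)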

In order for garbage collection to succeed as a storage management strategy, there must be at all times (anytime we want to allocate) some storage that is either unallocated or unreachable. The size of the heap must always exceed the size of the reachable data. Or we could say that the amount of reachable data must at all times be less than the size of the heap. (By ‘size of the heap’ I mean the total amount of storage both in use and free. For example, I could have a five megabyte heap, but only have 3 words in use.)

Any allocated data structure has but one of two ultimate fates: it remains reachable for the remainder of the run of the program or it becomes unreachable at some point (and we may re-use the storage). Assuming the heap starts empty, the size of reachable storage at any point in time is the difference between the total amount of all allocation and the total amount of all storage that has become unreachable up to that point. For example, if my program allocates sixty megabytes over its lifetime and it abandons fifty-nine megabytes, then the heap must contain the remaining megabyte.

We now consider the differential case, which is to say, the rates of allocation, collection, and heap growth or shrinkage. If the rate of allocation exceeds the rate of collection, the amount of storage in use in the heap must grow to meet the demand. Conversely, if the amount of reachable storage in the heap is decreasing, then we are abandoning storage at a faster rate than we are allocating it. But there are hard boundaries on the size of the heap. We can never have a negative amount of storage in use, and we have the requirement (see above) that the heap must be larger than the reachable storage at all times. The heap size is limited, so the rate of increase in reachable storage must be quite small over the long term if we are not to run out of heap.

The rate of storage allocation must be nearly equal to the rate of deallocation (or abandonment) almost all the time.

(This is only slightly different from the “steady-state” assumption that the rates are essentially equal. We're simply admitting the possibility that the reachable storage grows indefinitely and that the program would eventually crash if it doesn't finish first.)
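Here is a toy illustration of that rate argument in Scheme. The heap size and rates below are invented numbers; the point is just that any positive net growth in reachable storage eventually exhausts a fixed heap.

;; Toy sketch of the rate argument.  ALLOC-RATE and ABANDON-RATE are
;; words per unit of work; the numbers below are made up.
(define (steps-until-full heap-size alloc-rate abandon-rate)
  ;; Returns #f if reachable storage never outgrows the heap.
  (let ((net (- alloc-rate abandon-rate)))
    (and (> net 0)
         (ceiling (/ heap-size net)))))

(steps-until-full 4194304 1000 999)    ; => 4194304 -- a slow leak, but still fatal
(steps-until-full 4194304 1000 1000)   ; => #f -- rates balance, runs indefinitely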

Monday, October 24, 2011

John McCarthy

1927 - 2011

Boring charts

Here is a chart that shows the amount of free space remaining after each garbage collection (MIT/GNU Scheme compiling itself, heap size 1935360 words):
Compare that chart with this one, which shows the amount of free space remaining after each GC with heaps of 8388608, 16777216, 33554432, and 100663296 words. Note that the y axis is now displayed on a log scale in order to fit the wide variation in heap size on one chart.
As we move to larger and larger heap sizes, the proportion of memory that is freed after each GC increases. This makes sense because the program allocates memory in the same way regardless of how much heap is present (a program that adjusts its allocation profile in response to the heap size would look much, much different). When the heap is small, large allocations of temporary storage (like the one at about 8.5×108) make a significant difference in the amount of free space, but when the heap is big, these differences virtually disappear.

It is almost disappointing that the detail is lost when we use larger heap sizes. The charts for the large heap sizes are essentially flat. At large heap sizes, the variations in temporary storage usage become insignificant. In other words, the number of garbage collections becomes less dependent upon the moment-by-moment allocation of the program being run and approaches the simple limit of the total allocation divided by the heap size.
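As a rough check of that limit, here is the estimate next to the measured counts. The total-allocation figure is borrowed from the total freed for the smallest heap in the October 4 table, so treat it as an approximation:

;; The limiting estimate: total allocation divided by heap size.
(define (estimated-gc-count total-allocation heap-size)
  (ceiling (/ total-allocation heap-size)))

;; Using ~6.21e9 words of total allocation (approximated by the total
;; freed in the smallest-heap run from the table of October 4):
(estimated-gc-count 6213003931 1935360)     ; => 3211, versus 3935 actual (retained storage accounts for the gap)
(estimated-gc-count 6213003931 268435456)   ; => 24, versus 25 actual for the largest heap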

Friday, October 21, 2011

Scheming and plotting

Another interesting thing to plot is the amount of space that remains in use after the garbage collector runs. This is simply the total size of the heap minus the amount the collector frees (on average). For example, if we start MIT/GNU Scheme with --heap 4096, we'll get a heap of (* 4096 1024) = 4194304 words. With this size heap, the garbage collector will run 1589 times and recycle 6204751035 words, or 3904815 words per collection. This gives us (- 4194304 3904815) = 289489 words that are still in use (averaging over all 1589 collections, of course). If we do this calculation for various heap sizes and plot the results, we get this:
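The same arithmetic, packaged as a small procedure (the figures are the ones quoted above):

;; Average words still in use after a collection: heap size minus the
;; average amount recycled per GC.  Figures are from the --heap 4096 run.
(define (average-retained heap-words gc-count total-words-freed)
  (- heap-words (quotient total-words-freed gc-count)))

(average-retained (* 4096 1024) 1589 6204751035)   ; => 289489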

Sunday, October 16, 2011

Adjusting for extra collections

Here is a log-log plot of garbage collection counts as a function of heap size. The green line is the predicted number of collections assuming that the product of the heap size and GC count is constant. The red points are actual GC counts for MIT/GNU Scheme compiling itself.
As you can see, the line matches pretty well in the middle, but not so well at the ends. It took me a while, but I figured out that the discrepancy at the lower right of the curve (where the heap size is very large and the number of collections quite small) can be attributed to a few extra garbage collections that are triggered for the purposes of compaction. If we take these into account, our graph looks more like this:
That takes care of the lower end, but it is the discrepancy at the upper end of the curve (smaller memory and more frequent GC) that is more interesting.

Dividing the total amount of memory used by the size of the heap gives us a pretty good estimate of the number of garbage collections provided that each garbage collection frees the entire heap. This is generally not the case in a real program. We should be dividing the total amount of memory used by the average of the amount of storage freed by the garbage collector. In this next chart, we adjust for both the extra collections and the space in use that cannot be collected.
In this chart, we can see the effect of not reclaiming 100% of the storage on each collection. When the heap size is small, the memory that survives garbage collection becomes a larger fraction of that heap. This causes the GC count to increase faster than it would if it could free all the storage. There is a vertical asymptote to the line. This is where the heap size is equal to the averaged amount of storage retained at each GC. Basically, when the heap is only a little larger than the amount of storage retained, you have to do a lot of extra collection. This next chart is simply a close-up of the previous chart showing the divergence better.
It is interesting to see that retaining a modest amount of storage (empirically, it is about 400K words) can have a noticeable effect even when the heap is tens of times larger. (Remember, the vertical scale is logarithmic, so differences that appear small are actually quite a bit larger.)

If it is possible to use a heap ten, twenty, fifty, or even a hundred times larger than the averaged amount of storage retained, then we'll be beyond the worst part of the curve and down in the linear region. Another way to state this is that you want the amount of storage retained at each GC to be only a few percent of the entire heap.
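Here is a sketch of the adjusted estimate described above, with the vertical asymptote visible in the numbers. The allocation total, retained amount, and extra-collection count are round illustrative numbers, not measurements:

;; Adjusted estimate: divide total allocation by the average amount
;; actually freed per collection (heap size minus retained storage),
;; then add the few extra compaction collections.
(define (adjusted-gc-count total-allocation heap-size retained extra-gcs)
  (+ extra-gcs
     (ceiling (/ total-allocation (- heap-size retained)))))

;; With about 400K words retained, the count blows up as the heap size
;; approaches the retained amount (round numbers for illustration):
(adjusted-gc-count 6.2e9 20000000 400000 4)   ; => about 321 collections
(adjusted-gc-count 6.2e9  2000000 400000 4)   ; => about 3879
(adjusted-gc-count 6.2e9   500000 400000 4)   ; => about 62004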

Thursday, October 13, 2011

Demystified

I discovered why the total amount of memory used does not seem approximately constant across the different heap sizes. I had forgotten to take into account that a garbage collection cycle can be run for reasons other than reclaiming storage. In particular, it can be run as part of a compaction phase that precedes dumping a binary object or moving an object out of the collected heap into “pure” storage. These are not particularly interesting operations; they contribute little to the total resource usage or total time. They just cause an overestimate of the amount of storage used in very large heaps. I could correct for this, but I'm far more interested in what happens on the other end of the curve when you have a relatively small heap. More to come.

Thursday, October 6, 2011

Confusing

I found some of the data in my GC timings to be a little confusing. I can understand how different runs of the compiler might have different space usages even when compiling the same thing (things like timestamps are going to change), but the variation is too large to explain away like this. I think the problem is that by instrumenting the code to measure the timings I have inadvertently changed the behavior of the code beyond simply collecting the data.

Tuesday, October 4, 2011

Some GC stats

I invested some time making MIT/GNU Scheme compile itself with various sizes of heap. I did one set of runs with the variable compiler:preserve-data-structures? set to #f and another with it set to #t. The MIT/GNU Scheme compiler was written in a way that allows the garbage collector to reap the intermediate storage for the compiler after each phase of compilation. When compiler:preserve-data-structures? is set to #t, however, the intermediate data structures are retained until the next compilation. This should not change the total amount of storage used by the compiler, but it should reduce the amount of storage recovered by each GC cycle. In the table below, the GC count with the data structures preserved is noticeably larger at the smaller heap sizes; when the heap is huge, the extra storage in use has little effect on the GC count.

The Total Free column is the total amount of memory freed by garbage collection, summed over the entire compilation.
Memory Size              Discard Data Structures     Preserve Data Structures
(blocks)    (words)      GC Count   Total Free       GC Count   Total Free
   1890     1935360        3935     6213003931            -             -
   1900     1945600        3882     6211940217            -             -
   1920     1966080        3818     6213153480            -             -
   2048     2097152        3502     6210260535            -             -
   3072     3145728        2178     6205257571            -             -
   4096     4194304        1589     6207697668         1626    6200942665
   5000     5120000        1283     6206081325         1303    6198903924
   6144     6291456        1033     6210675904         1043    6205324016
   8192     8388608         766     6212253195          771    6207156928
  16384    16777216         377     6222514645          379    6233468540
  32768    33554432         188     6257358354          188    6244416384
  40000    40960000         154     6266232939          154    6257300338
  50000    51200000         124     6316240152          124    6311996423
  65536    67108864          94     6283277196           94    6276744843
  98304   100663296          63     6324424897           63    6320187112
 131072   134217728          48     6430202774           48    6426705701
 262144   268435456          25     6704242312           25    6702940561
Some of my thoughts about this (with graphs) in a bit. (It becomes very interesting when the data seriously disagrees with the math, but it is tedious to collect enough data to either support or refute the mathematical model.)