Diffstat (limited to 'boehm-gc/doc/scale.html')
-rw-r--r-- | boehm-gc/doc/scale.html | 210 |
1 files changed, 210 insertions, 0 deletions
diff --git a/boehm-gc/doc/scale.html b/boehm-gc/doc/scale.html
new file mode 100644
index 000000000..2e70148df
--- /dev/null
+++ b/boehm-gc/doc/scale.html
@@ -0,0 +1,210 @@
+<HTML>
+<HEAD>
+<TITLE>Garbage collector scalability</TITLE>
+</HEAD>
+<BODY>
+<H1>Garbage collector scalability</h1>
+In its default configuration, the Boehm-Demers-Weiser garbage collector
+is not thread-safe. It can be made thread-safe for a number of environments
+by building the collector with the appropriate
+<TT>-D</tt><I>XXX</i><TT>-THREADS</tt> compilation
+flag. This has primarily two effects:
+<OL>
+<LI> It causes the garbage collector to stop all other threads when
+it needs to see a consistent memory state.
+<LI> It causes the collector to acquire a lock around essentially all
+allocation and garbage collection activity.
+</ol>
+Since a single lock is used for all allocation-related activity, only one
+thread can be allocating or collecting at any one time. This inherently
+limits performance of multi-threaded applications on multiprocessors.
+<P>
+On most platforms, the allocator/collector lock is implemented as a
+spin lock with exponential back-off. Longer wait times are implemented
+by yielding and/or sleeping. If a collection is in progress, the pure
+spinning stage is skipped. This has the advantage that uncontested, and
+thus most uniprocessor, lock acquisitions are very cheap. It has the
+disadvantage that the application may sleep for small periods of time
+even when there is work to be done, and that threads may be unnecessarily
+woken up for short periods. Nonetheless, this scheme empirically
+outperforms native queue-based mutual exclusion implementations in most
+cases, sometimes drastically so.
+<H2>Options for enhanced scalability</h2>
+Version 6.0 of the collector adds two facilities to enhance collector
+scalability on multiprocessors. As of 6.0alpha1, these are supported
+only under Linux on X86 and IA64 processors, though ports to other
+otherwise supported Pthreads platforms should be straightforward.
+They are intended to be used together.
+<UL>
+<LI>
+Building the collector with <TT>-DPARALLEL_MARK</tt> allows the collector to
+run the mark phase in parallel in multiple threads, and thus on multiple
+processors. The mark phase typically consumes the large majority of the
+collection time. Thus this largely parallelizes the garbage collector
+itself, though not the allocation process. Currently the marking is
+performed by the thread that triggered the collection, together with
+<I>N</i>-1 dedicated
+threads, where <I>N</i> is the number of processors detected by the collector.
+The dedicated threads are created once at initialization time.
+<P>
+A second effect of this flag is to switch to a more concurrent
+implementation of <TT>GC_malloc_many</tt>, so that free lists can be
+built, and memory can be cleared, by more than one thread concurrently.
+<LI>
+Building the collector with <TT>-DTHREAD_LOCAL_ALLOC</tt> adds support for
+thread-local allocation. It does not, by itself, cause thread-local
+allocation to be used. It simply allows the use of the interface in
+<TT>gc_local_alloc.h</tt>.
+<P>
+Memory returned from thread-local allocators is completely interchangeable
+with that returned by the standard allocators. It may be used by other
+threads. The only difference is that, if the thread allocates enough
+memory of a certain kind, it will build a thread-local free list for
+objects of that kind, and allocate from that. This greatly reduces
+locking. The thread-local free lists are refilled using
+<TT>GC_malloc_many</tt>.
+<P>
+An important side effect of this flag is that the default
+spin-then-sleep lock is replaced by a spin-then-queue based implementation.
+This <I>reduces performance</i> for the standard allocation functions,
+though it usually improves performance when thread-local allocation is
+used heavily, since the number of short-duration lock acquisitions
+is then greatly reduced.
+</ul>
+<P>
+The easiest way to switch an application to thread-local allocation is to:
+<OL>
+<LI> Define the macro <TT>GC_REDIRECT_TO_LOCAL</tt>,
+and then include the <TT>gc.h</tt>
+header in each client source file.
+<LI> Invoke <TT>GC_thr_init()</tt> before any allocation.
+<LI> Allocate using <TT>GC_MALLOC</tt>, <TT>GC_MALLOC_ATOMIC</tt>,
+and/or <TT>GC_GCJ_MALLOC</tt>.
+</ol>
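+<P>
+As a concrete illustration of these three steps, here is a minimal client
+sketch. It assumes a collector built with the flags discussed above; the
+<TT>GC_LINUX_THREADS</tt> definition (so that <TT>gc.h</tt> can wrap
+<TT>pthread_create</tt> for thread registration), the explicit declaration
+of <TT>GC_thr_init</tt>, and the loop bounds are illustrative assumptions
+that may need adjustment for a particular platform and collector version.
+<PRE>
+#define GC_LINUX_THREADS        /* assumption: lets gc.h wrap pthread_create */
+#define GC_REDIRECT_TO_LOCAL    /* step 1: redirect GC_MALLOC and friends    */
+#include "gc.h"                 /*         to the thread-local allocators    */
+#include &lt;pthread.h&gt;
+
+void GC_thr_init(void);         /* step 2 entry point; declared here in case */
+                                /* the installed gc.h does not expose it     */
+
+static void *client(void *unused)
+{
+    long i;
+    void *p = 0;
+
+    for (i = 0; i != 1000000; ++i) {
+        /* Step 3: after a brief warm-up, objects of this size are served */
+        /* from this thread's local free list with no lock acquisition.   */
+        p = GC_MALLOC(16);
+    }
+    return p;
+}
+
+int main(void)
+{
+    pthread_t t[2];
+    int i;
+
+    GC_thr_init();                            /* step 2: before any allocation */
+    for (i = 0; i != 2; ++i)
+        pthread_create(t + i, 0, client, 0);
+    for (i = 0; i != 2; ++i)
+        pthread_join(t[i], 0);
+    return 0;
+}
+</pre>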
+<H2>The Parallel Marking Algorithm</h2>
+We use an algorithm similar to
+<A HREF="http://www.yl.is.s.u-tokyo.ac.jp/gc/">that developed by
+Endo, Taura, and Yonezawa</a> at the University of Tokyo.
+However, the data structures and implementation are different,
+and represent a smaller change to the original collector source,
+probably at the expense of extreme scalability. Some of
+the refinements they suggest, <I>e.g.</i> splitting large
+objects, were also incorporated into our approach.
+<P>
+The global mark stack is transformed into a global work queue.
+Unlike the usual case, it never shrinks during a mark phase.
+The mark threads remove objects from the queue by copying them to a
+local mark stack and changing the global descriptor to zero, indicating
+that there is no more work to be done for this entry.
+This removal
+is done with no synchronization. Thus it is possible for more than
+one worker to remove the same entry, resulting in some work duplication.
+<P>
+The global work queue grows only if a marker thread decides to
+return some of its local mark stack to the global one. This
+is done if the global queue appears to be running low, or if
+the local stack is in danger of overflowing. It does require
+synchronization, but should be relatively rare.
+<P>
+The sequential marking code is reused to process local mark stacks.
+Hence the amount of additional code required for parallel marking
+is minimal.
+<P>
+It should be possible to use generational collection in the presence of the
+parallel collector, by calling <TT>GC_enable_incremental()</tt>.
+This does not result in fully incremental collection, since parallel mark
+phases cannot currently be interrupted, and doing so may be too
+expensive.
+<P>
+Gcj-style mark descriptors do not currently mix with the combination
+of local allocation and incremental collection. They should work correctly
+with one or the other, but not both.
+<P>
+The number of marker threads is set on startup to the number of
+available processors (or to the value of the <TT>GC_NPROCS</tt>
+environment variable). If only a single processor is detected,
+parallel marking is disabled.
+<P>
+Note that setting <TT>GC_NPROCS</tt> to 1 also causes some lock acquisitions
+inside the collector to immediately yield the processor instead of
+busy-waiting first. If the machine is in fact a multiprocessor and the
+client has multiple simultaneously runnable threads, this may have
+disastrous performance consequences (e.g. a factor of 10 slowdown).
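+<P>
+The following deliberately simplified, single-threaded sketch is intended
+only to make the above division of work concrete. The entry layout, the
+names (<TT>mark_entry</tt>, <TT>grab_work</tt>, <TT>return_work</tt>) and the
+fixed sizes are invented for illustration; they do not correspond to the
+collector's actual source, which interleaves this logic with the real
+marking code and runs it from several marker threads at once.
+<PRE>
+#include &lt;stdio.h&gt;
+#include &lt;string.h&gt;
+#include &lt;pthread.h&gt;
+
+#define QUEUE_SIZE 1024
+#define LOCAL_SIZE 64
+
+/* One unit of mark work: an object (or range) plus a descriptor.  */
+/* A zero descriptor marks an entry that has already been claimed. */
+struct mark_entry {
+    void *base;
+    unsigned long descr;
+};
+
+static struct mark_entry global_queue[QUEUE_SIZE];
+static int queue_top;          /* grows during a mark phase, never shrinks */
+static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;
+
+/* Claim entries with no synchronization: copy first, then zero the */
+/* descriptor. Two markers may occasionally claim the same entry,   */
+/* which merely duplicates a little marking work.                   */
+static int grab_work(struct mark_entry *local_stack, int max)
+{
+    int i, n = 0;
+
+    for (i = 0; i != queue_top; ++i) {
+        if (n == max) break;
+        if (global_queue[i].descr != 0) {
+            local_stack[n++] = global_queue[i];
+            global_queue[i].descr = 0;
+        }
+    }
+    return n;
+}
+
+/* The rare, locked path: push part of an overflowing local stack   */
+/* (or extra work for idle markers) back onto the global queue.     */
+/* (No overflow check here; this is only a sketch.)                 */
+static void return_work(const struct mark_entry *entries, int n)
+{
+    pthread_mutex_lock(&amp;queue_lock);
+    memcpy(global_queue + queue_top, entries, n * sizeof *entries);
+    queue_top += n;
+    pthread_mutex_unlock(&amp;queue_lock);
+}
+
+int main(void)
+{
+    struct mark_entry local_stack[LOCAL_SIZE];
+    int i, n;
+
+    for (i = 0; i != 8; ++i) {          /* seed the queue with fake work */
+        global_queue[i].base = global_queue + i;
+        global_queue[i].descr = 1;
+    }
+    queue_top = 8;
+
+    n = grab_work(local_stack, LOCAL_SIZE);
+    printf("claimed %d entries\n", n);
+    return_work(local_stack, 2);        /* pretend two entries overflowed */
+    printf("queue_top is now %d\n", queue_top);
+    return 0;
+}
+</pre>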
+<H2>Performance</h2>
+We conducted some simple experiments with a version of
+<A HREF="gc_bench.html">our GC benchmark</a> that was slightly modified to
+run multiple concurrent client threads in the same address space.
+Each client thread does the same work as the original benchmark, but they
+share a heap.
+This benchmark involves very little work outside of memory allocation.
+It was run with GC 6.0alpha3 on a dual processor Pentium III/500 machine
+under Linux 2.2.12.
+<P>
+Running with a thread-unsafe collector, the benchmark ran in 9
+seconds. With the simple thread-safe collector,
+built with <TT>-DLINUX_THREADS</tt>, the execution time
+increased to 10.3 seconds, or 23.5 elapsed seconds with two clients.
+(The times for the <TT>malloc/free</tt> version
+with glibc <TT>malloc</tt>
+are 10.51 (standard library, pthreads not linked),
+20.90 (one thread, pthreads linked),
+and 24.55 seconds (two threads, pthreads linked), respectively.
+The benchmark favors a
+garbage collector, since most objects are small.)
+<P>
+The following table gives execution times for the collector built
+with parallel marking and thread-local allocation support
+(<TT>-DGC_LINUX_THREADS -DPARALLEL_MARK -DTHREAD_LOCAL_ALLOC</tt>). We tested
+the client using either one or two marker threads, and running
+one or two client threads. Note that the client uses thread-local
+allocation exclusively. With <TT>-DTHREAD_LOCAL_ALLOC</tt> the collector
+switches to a locking strategy that is better tuned to less frequent
+lock acquisition. The standard allocation primitives thus perform
+slightly worse than without <TT>-DTHREAD_LOCAL_ALLOC</tt>, and should be
+avoided in time-critical code.
+<P>
+(The results using <TT>pthread_mutex_lock</tt>
+directly for allocation locking would have been worse still, at
+least for older versions of linuxthreads.
+With <TT>THREAD_LOCAL_ALLOC</tt>, we first repeatedly try to acquire the
+lock with <TT>pthread_mutex_trylock()</tt>, busy-waiting between attempts.
+After a fixed number of attempts, we use <TT>pthread_mutex_lock()</tt>.)
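+<P>
+In outline, that acquisition path looks like the sketch below. The retry
+bound and the busy-wait loop here are illustrative stand-ins, not the
+values or the code actually used in the collector.
+<PRE>
+#include &lt;pthread.h&gt;
+
+#define SPIN_TRIES 10                /* illustrative bound, not the real one */
+
+static pthread_mutex_t alloc_lock = PTHREAD_MUTEX_INITIALIZER;
+
+/* Try a few times without blocking, busy-waiting briefly between  */
+/* attempts; only then fall back to a blocking pthread_mutex_lock. */
+static void acquire_alloc_lock(void)
+{
+    int i;
+    volatile int pause;
+
+    for (i = 0; i != SPIN_TRIES; ++i) {
+        if (pthread_mutex_trylock(&amp;alloc_lock) == 0)
+            return;                  /* the common, uncontended case  */
+        for (pause = 0; pause != 1000; ++pause)
+            ;                        /* brief busy-wait between tries */
+    }
+    pthread_mutex_lock(&amp;alloc_lock); /* still contended: block */
+}
+
+int main(void)
+{
+    acquire_alloc_lock();
+    pthread_mutex_unlock(&amp;alloc_lock);
+    return 0;
+}
+</pre>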
+<P>
+These measurements do not use incremental collection, nor was prefetching
+enabled in the marker. We used the C version of the benchmark.
+All measurements are in elapsed seconds on an unloaded machine.
+<P>
+<TABLE BORDER ALIGN="CENTER">
+<TR><TH>Number of client threads</th><TH>1 marker thread (secs.)</th>
+<TH>2 marker threads (secs.)</th></tr>
+<TR><TD>1 client</td><TD ALIGN="CENTER">10.45</td><TD ALIGN="CENTER">7.85</td>
+<TR><TD>2 clients</td><TD ALIGN="CENTER">19.95</td><TD ALIGN="CENTER">12.3</td>
+</table>
+<P>
+The execution time for the single-threaded case is slightly worse than with
+simple locking. However, even the single-client benchmark runs faster than
+the thread-unsafe version if a second processor is available.
+The execution time for two clients with thread-local allocation is
+only 1.4 times the sequential execution time for a single thread in a
+thread-unsafe environment, even though it involves twice the client work.
+That represents close to a
+factor of 2 improvement over the 2 client case with the old collector.
+The old collector clearly
+still suffered from some contention overhead, in spite of the fact that the
+locking scheme had been fairly well tuned.
+<P>
+Full linear speedup (i.e. the same execution time for 1 client on one
+processor as 2 clients on 2 processors)
+is probably not achievable on this kind of
+hardware even with such a small number of processors,
+since the memory system is
+a major constraint for the garbage collector;
+the processors usually share a single memory bus, and thus
+the aggregate memory bandwidth does not increase in
+proportion to the number of processors.
+<P>
+These results are likely to be very sensitive to both hardware and OS
+issues. Preliminary experiments with an older Pentium Pro machine running
+an older kernel were far less encouraging.
+
+</body>
+</html>