Home About USENIX Events Membership Publications Students
ALS 2001 Paper    [ALS 2001 Technical Program]

Pp. 153-164 of the Proceedings

A Study in Malloc: A Case of Excessive Minor Faults

 

Phillip Ezolt

Compaq Computer Corporation

Abstract

 

GNU libc's default setting for malloc can cause a significant performance penalty for applications that use it extensively, such as Compaq's high performance extended math library, CXML.  The default malloc tuning can cause a significant number of minor page faults, and result in application performance of only half of the true potential. This paper describes how to remove the performance penalty using environmental variables and the method used to discover the cause of the malloc performance penalty.

 


1. Why?

 

When a performance problem is discovered the first question asked is usually "How can it be fixed?". Although the solution to the performance problem is valuable, the method used to diagnose and fix the problem is also valuable. An explanation can teach the inexperienced engineer the thought process of the experience engineer, and give the inexperienced engineer a method for finding and fixing future performance problems.

 

This paper describes how the performance problem of GNU libc's malloc was diagnosed and how a solution was discovered. The performance hunt is documented to demonstrate the methods used to find and fix a performance problem.

 

2. What?

 

A customer running a chemistry benchmark on an Alpha system reported a radically different application run time between Tru64 UNIX and Linux/Alpha on the same hardware. Since the hardware of the two test systems was identical, the runtime difference had to be caused by software. Fortunately, the customer used the Compaq Fortran compiler, the Compaq Portable Math Library (CPML), and the Compaq Extended Math Library  (CXML) on Tru64 UNIX and Linux/Alpha. This meant that the compiler subsystem was also the same. The main difference was the operating system.

 

The program was run on both systems, and the "time" command showed the runtime of both. The customer reported that the user time was roughly the same on both operating systems, but system time on Linux/Alpha was much greater than on Tru64 UNIX.

 

 

    

 

User   

System 

Elapsed  

CPU 

Linux 

256.284u 

209.641s 

7:46.35

99.9%

Tru64 UNIX

257.027u   

3.176s 

4:29.85

96.4%

 

This difference pointed to a possible performance problem in the Linux operating system. To determine where in Linux the time was spent, DCPI[1] (an alpha profiling system) and was used to extract the following data[2]:

cycles[3]      

dtbmiss   

Image

8116120

1062

System Total

6979695

0

/vmlinux

6044706

0

  cpu_idle

386675

0

  do_anonymous_page

87264

0

  __free_page

79269

0

  __get_free_pages

60127

0

  __copy_user

45412

0

  EntMM

36035

0

  do_page_fault

1110642

1052

Xvcc

226983

183

  Dgemm_nt

213498

86

  Dgemm_nn

187057

47

  Icopy_

92613

153

  Dgemm_tt


This profile showed that a large amount of the non idle kernel cycles was spent in the 'do_anonymous_page' kernel function.  It also showed that a large number of dtbmisses occurred in xvcc, the customer’s chemistry code. 

 

The function of ‘do_anonymous_page’ was not immediately clear, but further investigation revealed that it was part of the Linux kernel's memory management routines (in /usr/src/linux/mm/memory.c), and that all calls to it ultimately began with the page fault handler ‘handle_pte_fault’. Therefore, if ‘do_anonymous_page’ was called a large number of times, the page fault handler was also being called a large number of times.

 

In addition to DCPI, the "time" command was also used to measure where time was spent. As a side effect, it revealed that a large amount of minor page faults occurred.

 

..

(168major+23099385minor)pagefaults

..

 

It was unclear at this point what a minor page fault was, and whether a high number of them could cause a performance problem. However, if Linux displayed a high number of minor faults, and Tru64 UNIX did not, it could have been an indication of the problem.

 

It was known that a minor fault was a type of pagefault, and when a pagefault occurred a dtbmiss[4] or itbmiss must also have occurred.  The high number of page faults that the "time" command reported corresponded nicely with DCPI's report of a high number of dtbmisses.

 

To determine if the number of minor faults was different on the two Alpha operating systems, the customer ran the following script on both Linux/Alpha and Tru64 UNIX:

 

(findfault.sh)

#!/bin/sh

 

COMMAND=$1

#Print command with headers.

 

ps -a -o vsize,rss,minflt,majflt,cmd | grep -e $COMMAND -e CMD | grep -v grep | grep -v $0

 

while (true)

do

    sleep 1

    #Print command without headers.

    ps -a -o vsize, rss, minflt, \

    majflt, cmd | grep -e $COMMAND |\

    grep -v grep |grep -v $0

done

 

This script showed the size of the virtual and resident set as well as the number of major and minor page faults for a specified process.

 

The customer reported a significant number of minor faults on Linux/Alpha, but nearly none on Tru64 UNIX.

The result of the test is reported in graphical form below.

 

 

(Notice the difference in the scale of the minor faults)

 

Linux/Alpha had an ever-increasing number of minor faults, while Tru64 UNIX's fault count stayed nearly constant. Linux/Alpha's virtual set size fluctuated, while Tru64 UNIX's stayed nearly constant.

 

3. Faults are at fault

 

The high minor fault count on Linux/Alpha pointed to a significant difference between Tru64 UNIX and Linux/Alpha. This was the first piece of the puzzle. However, to understand what a high minor fault count meant, it was necessary to understand what a minor fault was.

 

A google[5] search of "minor fault" and "linux", revealed the following information about the different types of page faults.

 

In Linux and Unix, page faults are either minor or major. A major fault requires an I/O operation to complete such as a page swap from disk. Minor faults can be handled without an I/O such as a Copy on Write (COW) request or a request for a zeroed page.

 

A linux kernel website[6]gave the following definitions:

Major fault

A major page fault occurs when an attempt to access a page not currently present in physical memory was made. The page must be swapped in to physical memory by the fault fix-up code.

 

Minor fault

A minor page fault occurs when an attempt to access a page present in physical memory, but without the correct permissions. An example is the first write to a second reference to a shared page, when the kernel must perform the copy-on-write and allow the task to update the copied page.

 

On Compaq’s OpenVMS[7], a high number of minor faults usually indicated that a process's working set was larger than its allowed working set. Every attempt to use a new page would result in an old one being kicked out of its working set, and the program would spend a significant amount of time faulting in new pages.

 

It was assumed that this was what was happening on Linux.

 

The resident set of the customer's program hovered around 131 megabytes of memory, which seemed suspiciously close to a 128 megabytes limit. The Linux kernel code was searched for such a hard coded limit, but unfortunately, it was a dead end.

 

By running the following program, it was determined that a program could allocate 256 megabytes of memory, and touch every page without taking a minor fault:

 

#include <unistd.h>

#include <malloc.h>

#include <stdio.h>

#include <stdlib.h>

 

int main(int argc, char *argv[])

{

 int num_byte;

 char *buffer, *p;

 

 num_byte = atoi(argv[1])*1024*1024;

 buffer = malloc(num_byte);

 

 while (1){

   for (p = buffer;

                p < (buffer+num_byte);

        p += getpagesize())

        {*p= 0;}

    }

}

 

This lack of faults did not match the behavior of the customer's program.

 

4. Mmap Maker, Make Me a Map

 

The author of this paper would have been puzzled had he not remembered that a member of the Compaq Math library team reported a similar problem months ago. The problem of the Math Library team member and that of the customer appeared to be very similar. A message to the Math library team member revealed that he had found more information about the problem, but had not found a solution.

 

His message stated that:

"The problem involving minor page faults in DGEMM on Linux/Alpha is caused by the way Linux does heap management (i.e., malloc and free). Allocation of large buffers is done via mmap, and when they are freed, they are unmapped via munmap. The buffer allocated by DGEMM falls into this category. Thus, for each call to DGEMM, address space for the buffer is created, buffer pages are faulted into the resident set and then the buffer, and the address space, is deleted. "


This changed the focus of the search, and also allowed for the creation of a smaller test program which showed similar behavior to the original chemistry code: an ever increasing number of minor page faults on Linux, and a small number of page faults on Tru64 UNIX.

 

For those not fortunate enough to have a colleague who experienced a similar problem, the kernel's minor page fault handler could have been instrumented to print the address of instructions that cause more than 1000 minor page faults. Using this to find the guilty instruction, one could then use 'nm' and 'gdb' to determine which function or line of code caused the minor faults. Although this would not be a general-purpose solution, the availability and modifiability of Linux kernel source makes this instrumentation possible.

 

5. Minor Fault in Allocation

 

Since it appeared that memory allocation was the cause of the problem, it could be tested independently of the customer's chemistry program. This was fortunate because the customer's chemistry program had many modules and a long compile time.

 

The following simple program could reproduce the high number of minor faults; it allocates a piece of memory, and then immediately frees it.

 

#include <malloc.h>

#include <stdio.h>

#include <stdlib.h>

 

int main(int argc, char *argv[])

{

  int number_of_meg, num_byte;

  char *buffer;

 

  num_byte = atoi(argv[1])*1024*1024;

 

  while (1){

    buffer = malloc(num_byte);

    free(buffer);

    }

}

 

5.1 Linux/Alpha

 

An strace[8] of the test on Linux/Alpha confirms what the math library engineer had said. ‘mmap’ is used when mallocing large amounts of memory. Linux/Alpha had an ever-increasing amount of minor faults.

 

strace ./malloc_test 1

....

mmap(0, 1056768, PROT_NONE,

       0 /* MAP_??? */, 0, 0) = 0x20000456000

munmap(0x20000456000, 1056768) = 0

mmap(0, 1056768, PROT_NONE,

       0 /* MAP_??? */, 0, 0) = 0x20000456000

munmap(0x20000456000, 1056768) = 0

mmap(0, 1056768, PROT_NONE,

       0 /* MAP_??? */, 0, 0) = 0x20000456000

munmap(0x20000456000, 1056768) = 0

mmap(0, 1056768, PROT_NONE,

       0 /* MAP_??? */, 0, 0) = 0x20000456000

munmap(0x20000456000, 1056768) = 0

....

5.2 Tru64 UNIX

 

Tracing the same program on Tru64 UNIX yielded two interesting facts:

 

Few minor page faults occurred on Tru64 UNIX, and

Tru64 UNIX used obreak() (a system call which increases the processes heap size) instead of mmap() (a system call which allocates system wide resources to a program) to malloc memory.

 

...

obreak (0x140108000) = 0

obreak (0x14020a000) = 0

obreak (0x140108000) = 0

obreak (0x14020a000) = 0

....

5.3 Intel/Linux

 

An Intel/Linux system behaved the same as an Linux/Alpha system, with a fluctuating mmap() value and an increasing number of faults.

 

strace ./malloc_test_i386 1

....

old_mmap(NULL, 1052672,                  PROT_READ|PROT_WRITE,

         MAP_PRIVATE|MAP_ANONYMOUS,

         -1, 0) = 0x4024f000

munmap(0x4024f000, 1052672) = 0

old_mmap(NULL, 1052672,

         PROT_READ|PROT_WRITE,

         MAP_PRIVATE|MAP_ANONYMOUS,

         -1, 0) = 0x4024f000

munmap(0x4024f000, 1052672) = 0

old_mmap(NULL, 1052672,

         PROT_READ|PROT_WRITE,

         MAP_PRIVATE|MAP_ANONYMOUS,

         -1, 0) = 0x4024f000

munmap(0x4024f000, 1052672) = 0

.....

5.4 FreeBSD

 

To test this malloc issue on another operating system, FreeBSD was installed on VMware[9], a very fast i386 virtual machine.

 

FreeBSD did not display the increasing number of page faults. FreeBSD used break() to set memory limits, much like Tru64 UNIX.

 

 ...

2814 pagefault CALL break(0x4000)

2814 pagefault RET break 0

2814 pagefault CALL break(0x104000)

2814 pagefault RET break 0

2814 pagefault CALL break(0x14000)

2814 pagefault RET break 0

2814 pagefault CALL break(0x114000)

2814 pagefault RET break 0

2814 pagefault CALL break(0x14000)

2814 pagefault RET break 0

2814 pagefault CALL break(0x114000)

2814 pagefault RET break 0

2814 pagefault CALL break(0x14000)

2814 pagefault RET break 0

....

 

 

Linux

Tru64 UNIX

Linux

FreeBSD

Architecture

Alpha

Alpha

Intel

Intel

Allocation

mmap

obreak

mmap

Break

Changing Allocation Amount

Yes

Yes

Yes

Yes

Large # of Minor Faults

Yes

No

Yes

No

 

It appeared as if this problem was unique to Linux, and possibly mmap.

 

6. Memory Management in Linux/Unix

 

To understand why different system calls were used (mmap() & break()) when the same lib memory allocation routine (malloc()) was called, it is necessary to understand how memory management in Linux/Unix works.

 

A Linux/Unix process can have three types of memory allocated on its behalf: stack, heap and mmaped memory.

 

Stack memory is managed by the operating system, and is not generally managed by individual processes. Stack memory (or "the stack") usually contains local variables, and information saved during a function call.

 

Stack memory is automatically allocated by the operating system, when a process needs more. Stack memory is a temporary storage space, which is not guaranteed to remain allocated for the life of a process.

Heap and mmaped memory are more permanent areas of memory and remain allocated for the life of a process. Normally, heap and mmapped memory are managed through malloc, but they can also be managed independently. (A process can call the memory allocation system calls directly to bypass malloc.)

 

Heap memory (or "the heap") is managed by the brk() system call. The brk() system call takes one argument which sets the "end of heap" for a process. If brk() is passed a value greater than the process's current brk() value, the size of a process's heap grows to the new value, and the operating system reserves more memory for the process. If the value passed to brk() is less than the current brk() value, the size of a process's heap shrinks to the new value, and the operating system will free memory from the process. (break(), brk() and obrk() are different names for the same system call ‘brk()’)

 

Mmaped memory is managed by the mmap() and munmap() system calls. When a piece of mmapped memory is to be allocated, mmap() is called the with size of the requested memory. A pointer to the memory is returned, which is used by the process. When the memory is to be deallocated, the pointer is passed to the munmap() system call, and the operating system deallocates the memory.

 

Use of mmap()/munmap() is more flexible than brk(), but it has more size restrictions and a higher overhead per allocation. If a piece of memory allocated with brk() is not at the end of the heap when it is freed, it can not be released back to the system as free memory, because the brk() interface only allows the end of heap memory to be specified. mmap() & munmap do not suffer this problem.

 

Some mallocs, GNU libc's in particular, use both heap memory and stack memory to fulfill allocation. Which type of memory is used depends on the size of the allocation request.

 

7. Focusing on the Linux problem

 

It was reasoned that at some point below one megabyte allocations, malloc would start to behave more like a traditional malloc(), using brk() instead of mmap(). As a result, a test program was rewritten to allow kilobytes to be specified as an allocation amount instead of megabytes.

 

#include <stdlib.h>

#include <stdio.h>

int main(int argc, char *argv[]){

  char *buffer;

  int num_byte;

 

  num_byte = atoi(argv[1])*1024;

 

  while(1)

    {buffer=malloc(num_byte);

     free(buffer);}

}

 

After further investigation under linux, it appeared that 128k was an important malloc threshold. Three memory allocations close to 128k in size (126k, 127k and 128k) yielded very different results.

 

7.1  126k allocation

 

When malloc was called with an allocation request of 126k, brk() was used to allocate the memory. free() did not release the memory; once the end of the heap was set to "0x8069000", it did not change. Minor page faults did not occur.

 

strace ./pagefault 126

....

brk(0) = 0x804965c

brk(0x8068e74) = 0x8068e74

brk(0x8069000) = 0x8069000

(Nothing further)

 

7.2 127k allocation

 

When malloc was called with an allocation request of 127k, brk() was used to allocate the memory. free() released the memory; the end of the heap fluctuated between 0x806a000 and 0x804a000. A large number of minor page faults occurred.

 

strace ./pagefault 127

....

brk(0) = 0x804965c

brk(0x8069274) = 0x8069274

brk(0x806a000) = 0x806a000

brk(0x804a000) = 0x804a000

brk(0x806a000) = 0x806a000

brk(0x804a000) = 0x804a000

...

 

7.3 128k allocation

 

When malloc was called with an allocation request of 128k, mmap was used to allocate the memory. free() released the memory, as the repeated calls to mmap showed. A large number of minor page faults occurred.

 

strace ./pagefault 128

old_mmap(NULL, 135168,

         PROT_READ|PROT_WRITE,

         MAP_PRIVATE|MAP_ANONYMOUS,

         -1, 0) = 0x40116000

munmap(0x40116000, 135168) = 0

old_mmap(NULL, 135168,

         PROT_READ|PROT_WRITE,

         MAP_PRIVATE|MAP_ANONYMOUS,

         -1, 0) = 0x40116000

munmap(0x40116000, 135168) = 0

old_mmap(NULL, 135168,

         PROT_READ|PROT_WRITE,

         MAP_PRIVATE|MAP_ANONYMOUS,

         -1, 0) = 0x40116000

munmap(0x40116000, 135168) = 0

old_mmap(NULL, 135168,

         PROT_READ|PROT_WRITE,

         MAP_PRIVATE|MAP_ANONYMOUS,

         -1, 0) = 0x40116000

munmap(0x40116000, 135168) = 0