Shared memory is an interprocess communication (IPC) facility that exists in every major version of Unix available today. It is ubiquitous in its use by applications developed for Unix systems and is used extensively by commercial relational database management systems (RDBMS) as a means of implementing a cache. This month we'll look at the implementation of shared memory internally in SunOS.
Note that a non-goal of this month's column is a programmer's "how-to" discussion on writing applications that use shared memory. Information on how to code using the shared memory interfaces is available from numerous sources, including the Solaris Developers Kit (SDK) documentation (see below).
Shared memory provides an extremely efficient means of sharing data between multiple processes on a Solaris system because the data need not actually be moved from one process's address space to another. As the name implies, shared memory is exactly that: the sharing of the same physical RAM pages by multiple processes, such that each process has mappings to the same physical pages and can access the memory through pointer dereferencing in code. The use of shared memory in an application requires implementing just a few interfaces bundled into the standard C library, /usr/lib/libc. These interfaces are listed in Table 1 below. Consult the man pages for more detailed information. In the following sections, we'll examine what these interfaces do from a kernel implementation standpoint.
The kernel implementation of shared memory requires two loadable kernel modules: the shmsys module, which contains the kernel support routines for the shared memory library calls (Table 1), and the ipc module, which contains two kernel routines, ipcget() and ipcaccess(), that apply to all the interprocess communication (IPC) facilities. The shmsys module is located in the directory /kernel/sys, and the ipc module in /kernel/misc. (See the kernel(1M) man page for information on loadable kernel modules.)
Table 1: Shared memory interfaces
shmget(2)
    Arguments: key, size, flags
    Returns: shared memory identifier
    Description: Creates a shared segment if one with a matching key does not exist (and the appropriate flags are set), or locates an existing segment based on key.
shmat(2)
    Arguments: identifier, address, flags
    Returns: pointer to the shared segment
    Description: Attaches the shared segment to the process's address space.
shmdt(2)
    Arguments: address
    Returns: 0 or -1 (success or failure)
    Description: Detaches a shared segment from a process's address space.
shmctl(2)
    Arguments: identifier, command, status structure
    Returns: 0 or -1 (success or failure)
    Description: Allows for some basic shared memory control functions, such as getting statistics, setting permissions, etc.
These modules are not loaded automatically by SunOS at boot time. The kernel dynamically loads a required module when a call is made that requires it. Thus, if the shmsys and ipc modules are not loaded, the first time an application makes a shared memory system call (e.g., shmget(2)), the kernel will load the modules and execute the system call. A module remains loaded until it is explicitly unloaded via the modunload(1M) command or the system reboots. This explains an FAQ on shared memory: why, when the ipcs(1) command is executed, it sometimes comes back with:
IPC status from <running system> as of Tue Jul 22 21:49:34 1997
Message Queue facility not in system.
Shared Memory facility not in system.
The "facility not in system" message means the module is not loaded. You can tell the operating system to load the module during bootup by using the forceload operation in the /etc/system file:
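A minimal sketch of such an entry, assuming the module location given earlier (/kernel/sys for shmsys). The path is written relative to the kernel module directory, and because loading shmsys pulls in the ipc module through its dependency, a single line should be enough:

```
* Load the shared memory module at boot instead of on first use.
forceload: sys/shmsys
```

After a reboot, ipcs(1) should then report the shared memory facility as present, and modinfo(1M) should list the module.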
Also, you can use the modload(1M) command, which allows a root user to load any loadable kernel module from the command line. The modinfo(1M) command can be used to see which loadable modules are currently loaded in the kernel. Note that SunOS is smart enough not to allow the unloading (modunload(1M)) of a loadable module that is in use. Note also that the code is written to be aware of dependencies, such that loading the shmsys module will also cause the ipc module to be loaded.
Shared memory tunable parameters
The kernel maintains certain resources for the implementation of shared memory. Specifically, a shared memory identifier (shmid) is initialized and maintained by the operating system whenever a shmget(2) system call is executed successfully (recall from Table 1 that shmget(2) returns a shared memory identifier upon successful completion). The shmid identifies a shared segment, which has two components -- the actual shared RAM pages and a data structure that maintains information about the shared segment, the shmid_ds data structure, detailed in Table 3.
The system allocates kernel memory for some number of shmid_ds structures at boot time, based on the shared memory tunable parameter called shmmni. Altogether, there are only four tunable parameters associated with shared memory. They are listed in Table 2, with a description, default, data type, and minimum and maximum values.
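Table 2: Shared memory tunable parameters
shmmax
    Maximum value: 4294967295 (4 GB)
    Description: Maximum size for a shared segment
shmmin
    Maximum value: 4294967295 (4 GB)
    Description: Minimum size for a shared segment
shmmni
    Maximum value: 2147483648 (2 GB)
    Description: Maximum number of shared memory identifiers
shmseg
    Maximum value: 32767 (32 K)
    Description: Maximum number of shared segments per process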
Table 3: The shmid_ds data structure (field descriptions, in structure order)
Embedded ipc_perm structure. Generic structure for IPC facilities that maintains permission information. Corresponding ipcs(1) columns: see the ipc_perm table (Table 4)
Size in bytes of the shared segment
Pointer to corresponding anon_map structure
Number of locks on the shared segment
PID of last process that did a shared memory operation
PID of process that created the shared segment
Number of attaches to the shared segment
Creator attaches? Not currently used
Time of last attach to shared segment
Time of last detach from shared segment
Time of last change to shmid_ds structure
A kernel condition variable. Not currently used
Pointer to address space structure. Used with ISM for managing shared page tables (translation tables)
When the system first loads the shared memory module, it allocates kernel memory to support the shmid_ds structures and other required kernel support structures. The kernel memory required is based on the shmmni tunable, since that defines the number of unique shared memory identifiers the system maintains. Each shmid_ds structure is 112 bytes in size and has a corresponding kernel mutex lock, which is an additional eight bytes. Thus, the amount of kernel memory required by the system to support shared memory can be calculated as ((shmmni * 112) + (shmmni * 8)). The default value of 100 for shmmni requires the system to allocate 12,000 bytes (about 12 kilobytes) of kernel memory for these structures. The system makes some attempt at protecting itself against allocating too much kernel memory for shared memory support by checking for the maximum available kernel memory, dividing that value by four, and using the resulting value as a limit for allocating resources for shared memory. Simply put, the system will not allow more than 25 percent of available kernel memory to be allocated.
Note that the above applies to Solaris 2.5, 2.5.1, and 2.6. Prior releases, up to and including Solaris 2.4, did not impose a 25 percent limit check. Nor did they require the additional eight bytes per shmid_ds for a kernel mutex lock, because shared memory used very coarse-grained locking in the earlier releases and implemented only one kernel mutex in the shared memory code. Beginning in 2.5, finer-grained locking was implemented, allowing for greater potential parallelism of applications using shared memory.
It should be clear that one should not set shmmni to an arbitrarily large value simply to ensure sufficient resources. There are limits as to how much kernel memory the system supports. On sun4m-based platforms, the limits are on the order of 128 megabytes (MB) prior to Solaris 2.5, and 256 MB for 2.5, 2.5.1, and 2.6. On sun4d systems (SS1000 and SC2000), the limits are about 576 MB in 2.5 and later. On UltraSPARC (sun4u)-based systems, the kernel has its own four gigabyte (GB) address space, so it's much less constrained. Still, keep in mind that the kernel is not pageable, and thus whatever kernel memory is needed remains resident in RAM, reducing available memory for user processes. Given the fact that Sun ships systems today with very large RAM capacities, this may not be an issue, but it should be considered nonetheless. (sun4m, sun4d, and sun4u are Sun nomenclature for different "kernel architectures." Every operating system has some hardware-independent and hardware-dependent components to it -- the hardware-dependent components are at the lower levels, where hardware registers and the like get touched. Different kernel architectures exist for different Sun desktop and server systems and vary due to processor technology (SuperSPARC, UltraSPARC, etc.) and system infrastructure (Mbus, XDbus, Gigaplane, etc.).) Use uname(1) with the -m flag to determine your system's kernel architecture:
% uname -m
Note that the maximum value for shmmni listed in Table 2 is 2 GB. This is a theoretical limit, based on the data type (a signed integer) and should not be construed as something configurable today. Applying the math from above, you see that two billion shared memory identifiers would require over 200 GB of kernel memory! One should assess to the best of their ability the number of shared memory identifiers required by the application and set shmmni to that value plus 10 percent or so for headroom.
The remaining three shared memory tunables are quite simple in their meaning. Shmmax defines the maximum size a shared segment can be. The size of a shared memory segment is determined by the second argument to the shmget(2) system call. When the call is executed, the kernel checks to ensure that the size argument is not greater than shmmax. If it is, an error is returned. Setting shmmax to its maximum value does not affect the kernel size -- no kernel resources get allocated based on shmmax, so this can be tuned to its maximum value of 4 GB (0xffffffff), as in either of these (/etc/system) entries:
* hexadecimal
set shmsys:shminfo_shmmax=0xffffffff
* decimal
set shmsys:shminfo_shmmax=4294967295
Actually, the 4 GB size applies only to Solaris 2.5.1 and 2.6. Prior to those releases, the maximum value is 2 GB:
* hexadecimal
set shmsys:shminfo_shmmax=0x80000000
* decimal
set shmsys:shminfo_shmmax=2147483648
The maximum size change is due in part to changing the shmmax data type from a signed integer to an unsigned integer in the kernel code.
Keep in mind that SunOS today supports a maximum virtual address space of 4 GB, due to the current 32-bit implementation. Since shared memory mappings apply to a process's virtual address space, one could never address a full 4 GB of shared memory. Every process has some address space used for text (execution code), stack space, and data, which all get charged against the 4-GB total, leaving something less than 4 GB for shared memory.
The shmmin tunable defines the smallest possible size a shared segment can be, as per the size argument passed in the shmget(2) call. There's no compelling reason to change this from the default value of 1. Lastly, there's shmseg, which defines the number of shared segments a process can attach (map pages) to. Processes may attach to multiple shared memory segments for application purposes, and this tunable determines how many mapped shared segments a process can have attached at any one time. Again, the 32-K limit (maximum value) in Table 2 is based on the data type (short), and does not necessarily reflect a value that will provide application performance that meets business requirements if some number of processes attach to 32,000 shared memory segments. Things like shared segment size and system size (amount of RAM, number/speed of processors, etc.) will all factor into determining the extent to which you can push the boundaries of this facility.
Intimate shared memory
Intimate shared memory (ISM) is an optimization introduced first in Solaris 2.2. It allows for the sharing of the translation tables involved in the virtual to physical address translation for shared memory pages, as opposed to just sharing the actual physical memory pages. Typically, non-ISM systems maintain a per-process mapping for the shared memory pages. With many processes attaching to shared memory, this creates a lot of redundant mappings to the same physical pages that the kernel must maintain. Additionally, all modern processors implement some form of a translation lookaside buffer (TLB), which is (essentially) a hardware cache of address translation information. SPARC processors are no exception, and, just like an instruction and data cache, the TLB has limits as to how many translations it can maintain at any one time. As processes get context switched in and out, we can reduce the effectiveness of the TLB. If those processes are sharing memory, and we can share the memory mappings also, we can make more effective use of the hardware TLB.
The actual mapping structures differ across processors. UltraSPARC (SPARC V9) processors implement translation tables, composed of translation table entries (TTEs). SuperSPARC (SPARC V8) systems implement page tables, which contain page table entries (PTEs). They both do essentially the same thing -- provide a means of mapping virtual to physical addresses. However, the two SPARC architectures differ pretty substantially in MMU (memory management unit) implementation. (The MMU is the part of the processor chip dedicated to the address translation process.) SPARC V8 defines the SPARC Reference MMU (SRMMU) and provides implementation details. SPARC V9 does not define an MMU implementation, but rather provides some guidelines and boundaries for the chip designers to follow. The actual MMU implementation is left to the chip design folks.
Additionally, there is a significant amount of kernel code dedicated to the address translation process (such as the creation and management of the translation tables). The actual details of translating a virtual address to a physical address and tying the hardware and software pieces together make for YAITOWA (yet another interesting thing to write about). Figure 1 provides a diagram of the virtual-to-physical address translation tables, with and without ISM. The diagram is of course an over-simplification, as a process will have other address mappings for its text, data, etc., in addition to the shared (or unshared) shared memory segment mappings. Also, the diagram is generic, not specific to a particular platform (as above, that would require a different diagram for several processor/server combinations).
Let's consider just one simple example of how ISM can save kernel space. Oracle uses shared memory for its Shared Global Area (SGA), which is how Oracle does its caching of data, indexes, stored procedures, etc. Assume Oracle is configured with a 2-GB SGA, and there are 400 Oracle processes (each attaching to the shared segment holding the SGA) running concurrently on the system at any point in time. 2 GB of RAM equates to 262,144 8-K pages. Assuming that the kernel needs to maintain eight bytes of information for each page mapping (two four-byte pointers), that's about 2 MB of kernel space needed to hold the translation information for one process. Without ISM, those mappings get replicated for each process, so multiply the number times 400, and we now need 800 MB of kernel space just for those mappings. With ISM, the mappings get shared, so we only need the 2 MB of space, regardless of how many processes attach.
In addition to the translation table sharing, ISM also provides another feature. When ISM is used, the shared pages are locked down in memory, such that they'll never get paged out. This feature was added for the RDBMS vendors. As we said earlier, shared memory is used extensively by commercial RDBMS systems to cache data (among other things, such as stored procedures). Non-ISM implementations treat shared memory just like any other chunk of anonymous memory -- it gets backing store allocated from the swap device, and the pages themselves are fair game to get paged out if memory contention becomes an issue.
The effects of paging out shared memory pages that are part of a database cache would be disastrous from a performance standpoint (RAM shortages are never good for performance...). Because a vast majority of customers that purchase Sun servers use them for database applications and because database applications make extensive use of shared memory, addressing this issue with ISM was an easy decision.
Memory page locking is implemented in SunOS by setting some bits in the memory page's page structure. Every page of memory has a corresponding page structure that contains information about the memory page. Page sizes vary across different hardware platforms. UltraSPARC-based systems implement an 8-K memory page size, which means that 8 K is the smallest unit of memory that can be allocated and mapped to a process's address space. The page structure contains several fields, among which are p_cowcnt and p_lckcnt, the page copy-on-write count and page lock count, respectively. Copy-on-write tells the system that this page can be shared as long as it's only being read; once a write to the page is executed, the system makes a copy of the page and maps it to the process that is doing the write. Lock count maintains a count of how many times page locking was done for this page. Because many processes can share mappings to the same physical page, the page may be locked from several sources. The system maintains a count to ensure that processes that complete and exit will not result in a page being unlocked that has mappings from other processes. The system's pageout code, which runs if free memory gets low, checks the status of the page's p_cowcnt and p_lckcnt fields. If either of these fields is non-zero, the page is considered locked in memory and thus not marked as a candidate for freeing. Shared memory pages using the ISM facility do not use the copy-on-write lock (that would make for a non-shared page after a write). Pages locked via ISM use the p_lckcnt page structure field.
Even though ISM locks pages in memory such that they'll never get paged out, Solaris still treats ISM shared segments the same way it treats non-ISM shared segments and other anonymous memory pages -- it makes sure there is sufficient backing store in swap before completing the page mapping on behalf of the requesting process. While this seems superfluous for ISM pages (allocating disk swap space for pages that can't be swapped out), it makes the implementation cleaner. Solaris 2.6 changes this somewhat, and in 2.6 swap is not allocated for ISM pages. The net effect of this is that allocation of shared segments using ISM requires sufficient available swap space for the allocation to succeed, at least until Solaris 2.6.
Using ISM requires setting a flag in the shmat(2) system call. Specifically, the SHM_SHARE_MMU flag must be set in the shmflg argument passed in the shmat(2) call to instruct the system to set the shared segment up as intimate shared memory. Otherwise, the system will create the shared segment as a non-ISM shared segment.
Note that memory pages can be locked through other means. A root user can use the mlock(3) library routine or the memcntl(2) system call (with the MLOCK flag) to lock pages in memory. One other note: In case you're not familiar with SunOS Virtual Memory nomenclature, anonymous memory is any memory page that does not have a corresponding named location in the file system. Things like files, executables, and shared libraries originate as files in the file system, and thus can be restored to memory if the memory page they were mapped to gets freed and used for another object. Things like heap space (malloc(3), sbrk(2) calls) and shared memory pages have no corresponding named location, and thus swap disk must be allocated in case the page must be pushed out to make room for something else.
In this section we will look at the flow of kernel code that executes when the shared memory system calls are called.
Applications first call shmget(2) to get a shared memory identifier. A key value is passed in the call that the kernel uses to locate (or create) a shared segment.
(application) shmget(key, size, flags (PRIVATE or CREATE))
    if (key equals IPC_PRIVATE)
        create shared segment
        return unique shm_id
    else if (key exists)
        return shm_id of the existing segment
    else if (key does not exist AND IPC_CREAT is set)
        create shared segment
        return unique shm_id
    if (this is a new shared segment)
        check size against min & max tunables
        get resources for anonymous memory mapping