Operating System Concepts

While some consider databases the center of the application universe, the reality is that operating systems (OS) are the true center. Nothing happens in your application without one or more operating systems being involved: access to memory, disk, network and CPU is all handled by the OS. If there is a bottleneck in the OS, the application suffers the consequences of that bottleneck.

Every OS has its own strengths, weaknesses and quirks. Complicating things further are the custom drivers and packages, which vary from vendor to vendor, for different types of disk subsystems and network adapters. In many cases below I speak in general terms that may have to be translated into the terms used by your specific vendor.

Hello... NUMA

NUMA stands for non-uniform memory access, and usually the results are just as pleasant as the name implies. If you are purchasing new hardware and have a choice between NUMA and non-NUMA systems, I would always lean towards the non-NUMA option if they are otherwise equivalent.

Traditional SMP systems (UMA) allow each CPU to access all of the available memory through a common memory bus. This is a distinct advantage for database applications, but the shared bus becomes a point of contention once you reach a certain number of CPUs.

NUMA systems attempt to bypass the limitations of UMA by splitting the system into zones (nodes), each consisting of a subset of the available CPUs and memory. The core concept is that access to memory in the local zone is much faster than it would be on a comparable UMA system.

The downside is that when a CPU needs access to memory outside of its zone (remote memory), it must communicate with the other zone(s) to request the memory contents. Because OpenEdge is not currently NUMA aware, this can cause some unexpected performance issues, the most common being inconsistent performance depending on whether a process runs in the same zone as the database shared memory.

Other issues can occur when your database shared memory spans zones. Several times clients have increased their buffer pool sizes and actually ended up with a reduction in performance because the larger buffer pool spanned zones.
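On Linux you can see the zone layout, and whether memory is being served from remote zones, with the tools that ship in the numactl package:

  # Show the NUMA zones, their CPUs and their memory
  numactl --hardware

  # Per-zone allocation counters; a steadily climbing numa_miss means
  # processes are being fed from remote-zone memory
  numastat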

NUMA Affinity

If you currently have a NUMA system you should consider configuring zone affinity for your OpenEdge database. Most operating systems will have options available to either pin certain processes to a zone or at least make it more likely they are assigned to a zone.

The OpenEdge processes to consider "zoning" are the core database processes: the database server and broker processes, the background writers (APWs, BIW and AIW) and probkup. Because each zone also has a certain number of CPUs associated with it, you have to balance the number of processes assigned to make sure you are not over- or under-committing those CPUs.
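As a minimal Linux sketch, assuming the numactl package and a hypothetical database path and port, the primary broker could be pinned like this:

  # Start the broker with its CPUs and memory pinned to zone 0
  # (database path and port are hypothetical)
  numactl --cpunodebind=0 --membind=0 proserve /db/mydb -S 20000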

Back To Client Server?

In some cases it might make sense to switch from shared memory connections to client-server connections in a NUMA environment. This assumes that all of the broker processes are pinned to the same zone as the buffer pools.

This has proven effective with certain combinations of applications and NUMA systems, but I would only consider this as a last resort after all of your other issues have been resolved.
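For reference, the difference is only in the connection parameters (the host, port and path below are hypothetical):

  # Shared memory connection: the client attaches directly to the
  # database shared memory and must run on the database host
  pro /db/mydb

  # Client-server connection: the client talks to a broker over TCP
  # and never touches the database shared memory itself
  pro /db/mydb -H dbhost -S 20000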

Memory

By default most modern operating systems will happily use all of the available memory to buffer file system blocks. In some circumstances this can be a good thing, especially if you have large amounts of RAM and are running 32-bit OpenEdge, which limits how large your buffer pools can grow.

In most applications you are much better off using that memory for the database buffer pools, or for client startup options that reduce IO from temp-tables, r-code access and sorting.
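As an illustration only, and not a recommendation, these are the kinds of client startup parameters involved (the values and paths are hypothetical):

  # -Bt       temp-table buffers (in blocks)
  # -tmpbsize temp-table block size in KB
  # -mmax     maximum memory for r-code segments in KB
  # -T        directory for client temp files
  pro /db/mydb -Bt 5000 -tmpbsize 8 -mmax 8192 -T /fastdisk/tmp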

How Much Free Memory Do I Really Have?

Most of the standard tools used to report free memory only count memory that is completely unused. Different operating systems require different methods to determine how much memory you could borrow from the OS buffer cache. In addition, certain operating systems such as AIX and HP-UX may enforce a minimum OS cache size and require specific kernel options to change it.

OS         Commands to display current OS cache usage
AIX        nmon, topas
HP-UX      kmeminfo, glance
Linux      free, cat /proc/meminfo, top
Windows    Task Manager Performance tab (Available Memory includes free memory and unmodified cached pages)
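On Linux, for example, recent kernels (3.14+) publish a MemAvailable estimate that already accounts for reclaimable OS cache:

  # Estimate of memory claimable without swapping, including
  # reclaimable page cache
  grep -E 'MemFree|MemAvailable|^Cached' /proc/meminfo

  # Recent versions of free report the same figure in the
  # "available" column
  free -m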

IMPORTANT NOTE: Tuning the OS cache is not for the faint of heart and you should refer to the specific documentation and best practices for the exact version of your OS.

In Sync

The OS buffer cache allows a block to be modified in memory multiple times before the changes are written to disk. All changed memory blocks are written to disk on a predetermined sync schedule (typically every 60 seconds) or when a process issues a sync call (an OpenEdge checkpoint). The sync process blocks all other writes until it completes.

Depending on the number of changed blocks, the size of the OS buffer cache and the speed of your disks this can take anywhere from a few milliseconds to a few minutes. If you notice meaningful pauses during an OpenEdge checkpoint without promon reporting "buffers flushed" then you may have an OS sync issue.

There are a few options available to try and reduce the sync times. Depending on your OS you might consider doing one or more of the following:

  • Modify the sync daemon (syncd) to run the sync process on a more frequent basis (on Linux, see the sysctl sketch after this list)
  • Reduce the size of the OS buffer cache and instead use that memory for OpenEdge
  • Mount your database file systems with options that prevent or reduce OS caching (always check with Progress about support for your specific mount options)
  • Add -directio to your database startup scripts (DO NOT USE on HP-UX)
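On Linux the equivalent of tuning syncd is the vm.dirty_* family of sysctls; a sketch with illustrative values that should be tested before adopting:

  # Start background writeback earlier and run the flusher more often,
  # so each sync has less dirty data to push to disk
  sysctl -w vm.dirty_background_ratio=5      # begin background writes at 5% of RAM
  sysctl -w vm.dirty_ratio=10                # hard limit before writers block
  sysctl -w vm.dirty_expire_centisecs=1000   # pages count as "old" after 10 seconds
  sysctl -w vm.dirty_writeback_centisecs=250 # wake the flusher every 2.5 seconds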

File Systems

OS file systems have multiple tuning options available, and in general the default values are not optimal for high-volume OpenEdge systems. Some of these options may not be available on older UNIX-based operating systems or on Windows.

File System Types

Choosing the proper file system type is an important decision for the stability and performance of your application. Typically you want to choose a relatively modern file system but not the "latest and greatest" version. The suggestions below are considered safe choices but may need to be altered based on the actual SAN you are using.

OS         Suggested file system
AIX        jfs2
HP-UX      VxFS
Linux      ext3 or VxFS
Windows    NTFS

Block Size

Set your file system block size to match your database block size for optimal performance. This is typically handled at file system creation time and cannot be changed afterwards.
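On Linux, for example, the block size is chosen when the file system is built (the device name is hypothetical; note that ext3 tops out at 4 KB, so match your database block size as closely as the file system allows):

  # Create an ext3 file system with a 4 KB block size (ext3's maximum)
  mkfs.ext3 -b 4096 /dev/sdb1

  # Confirm the block size after creation
  tune2fs -l /dev/sdb1 | grep 'Block size'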

Kernel Options

Depending on your file system there will be a number of options that determine how IO is handled on that file system. Some are much more useful for OpenEdge applications than others.

Read Ahead
Controls how the OS reads the next blocks in a sequential chain. For example, if the file system detects that 2 sequential blocks have been read, it assumes the next 16 blocks will be needed and pre-reads those blocks into the OS cache.
OpenEdge suggestion: Read ahead should be limited or eliminated for most OpenEdge applications. OpenEdge reads blocks far more randomly than databases built around full table scans and hash joins, so in most cases the extra reads result in wasted IO and a flushed OS cache. Limiting read ahead might slow maintenance events and backups, so test changes with both your normal application workload and your maintenance workload.

File System Buffers
Limits how many concurrent requests the OS can handle for the file system.
OpenEdge suggestion: Monitor your file system buffers with the tools provided by your OS vendor; AIX reports this through vmstat -v, and other vendors have similar tools. If you see IO blocked by buffer waits, gradually increase the number of buffers available until the waits disappear.

IO Pacing/Limits
Limits how much IO a single process (or all processes) can perform at one time.
OpenEdge suggestion: Either disable the limits or increase them to more reasonable values.

Write Behind
Enables random or sequential write behind for modified pages, allowing certain types of IO to be written to disk before a sync event happens.
OpenEdge suggestion: Depending on your write activity you might want to enable write behind. Be careful about setting it too aggressively, since it can cause more physical writes if you frequently modify the same blocks.
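As one concrete example, Linux exposes device-level read ahead through blockdev (the device name is hypothetical):

  # Show the current read ahead setting (units are 512-byte sectors)
  blockdev --getra /dev/sdb

  # Reduce read ahead to 8 KB for a randomly accessed database device
  blockdev --setra 16 /dev/sdb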

Disk Adapters

Disk adapters handle the communication between the OS and the physical disk drives or the SAN system. There are typically only two settings to worry about when it comes to disk adapters.

Maximum Concurrent Requests

Depending on your OS and disk adapters, this setting can live at the OS, SAN or hardware level. Tuning is further complicated by different vendors using different terms for the same concept (max requests, number of command elements, adapter queue depth, etc.).

Once you have identified the current limits for your adapter, monitor your adapters during peak times to determine if you are approaching the limit (nmon, top, topas, glance, etc.). If you are approaching or hitting the limits you have three options: increase the limit, install additional adapters and use load balancing, or replace your adapter with a newer model that supports more IO requests.
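On AIX, for example, the limit on a Fibre Channel adapter is the num_cmd_elems attribute (the adapter name and new value are illustrative):

  # Display the current command element limit for adapter fcs0
  lsattr -El fcs0 -a num_cmd_elems

  # Raise the limit; -P defers the change until the next reboot or
  # device reconfiguration
  chdev -l fcs0 -a num_cmd_elems=1024 -P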

Load Balancing

Load balancing in this context is spreading your IO requests among multiple disk adapters. Even if you have multiple adapters in your system that does not always mean that load balancing is properly configured.

Identify which adapters are servicing your application disks and monitor those adapters during peak times to verify that the load is properly balanced. Assuming each adapter services the same set of disks, the IO activity should be consistent (within a few percentage points) across the adapters.

If the activity is unbalanced then it is likely that load balancing is not properly configured or the adapters are not servicing the exact same set of disks. Getting this corrected can increase the performance of your application (especially if one or more adapters are approaching their limits) and will definitely increase the fault tolerance of your system.
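On Linux systems using device-mapper multipath, for instance, you can confirm that each LUN really does have multiple working paths:

  # List each multipath LUN and the state of its paths; every LUN
  # should show multiple active paths spread across the adapters
  multipath -ll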

Disk Access (Queue Depth)

For purposes of this section a disk refers to either 1) a physical locally attached disk or 2) the logical representation of a disk or set of disks on a SAN. The OS treats either type the same from a performance perspective.

Queue Depth

Queue depth is the maximum number of concurrent IO operations a disk can service. In most cases the default values are much too low compared to the capabilities of modern drives, especially when SAN storage is being used. Consult your OS documentation for how to display the queue depth of an individual disk.

Monitor your disk drives during peak operations with a tool that provides detailed disk statistics, specifically statistics relating to queue depth and service/wait times. On most UNIX-based systems you can use iostat -x -d to show these stats; on Windows use PerfMon. Both will show whether requests are being blocked by queue depth issues, as well as general service times for reads, writes and totals (which usually include wait times).
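A Linux sketch of both the monitoring and the adjustment (device names are hypothetical):

  # Extended per-disk statistics refreshed every 5 seconds; watch the
  # queue length (avgqu-sz, or aqu-sz on newer sysstat), await (average
  # wait + service time in ms) and %util
  iostat -x -d 5

  # Display and raise the queue depth for a SCSI disk
  cat /sys/block/sdb/device/queue_depth
  echo 64 > /sys/block/sdb/device/queue_depth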

Load Balancing

Assuming your disks are set up properly you should see very similar IO rates for each disk that services your database. If you see meaningful variances in IO rates there is a configuration issue to address.

Specific configuration details are discussed on the Disk Subsystem page, but here are some highlights of proper disk configuration.

  • Use one file system for your database extents
  • Stripe disks together instead of concatenating them (see the LVM sketch after this list)
  • Use consistent disk or LUN sizes
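As a hedged Linux LVM sketch of striping rather than concatenating (the volume group, sizes and names are hypothetical):

  # Stripe a logical volume across 4 physical volumes with a
  # 64 KB stripe size
  lvcreate -i 4 -I 64 -L 200G -n dbvol datavg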

Process Scheduling And Priority

Typically you should avoid manually setting the priority of a process through renice or the Windows equivalent. You are much better served by letting the scheduler do its job with proper guidance from you (although I admit this is next to impossible on some systems).

The single biggest issue with many UNIX implementations is process aging: as a process accumulates run time, its priority degrades and it receives fewer and fewer CPU cycles. This can cause performance problems for the database server processes, AppServers and other long-running background jobs.

On some UNIX variants you can choose a different scheduler, or at least fine-tune the parameters that put a process to "sleep". As a last resort you can change your startup scripts to modify which scheduler a process runs under (SCHED_NOAGE, for example). This generally gives better overall results than trying to change the priority of the processes, because aging will still apply to a process whose priority has merely been changed.
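On HP-UX, for example, SCHED_NOAGE is selected with rtsched (the priority value and database details are illustrative):

  # Start the broker under SCHED_NOAGE so its priority does not
  # degrade as it accumulates CPU time
  rtsched -s SCHED_NOAGE -p 178 proserve /db/mydb -S 20000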