Heavy swapping observed on system in last 5 mins

While refreshing a dev database and checking its alert log, I noticed the message in the title of this post: "Heavy swapping observed on system in last 5 mins".

To make sure it wasn’t caused by the refresh, I looked at the alert logs and history of other databases and found that this message has been appearing in all of them for at least the last 6 months. In the database I was looking into, it consistently appeared at 40 minutes past the hour. So what is happening on the system in the preceding few minutes to cause this message?
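
To confirm a pattern like "always at 40 minutes past the hour" across many alert logs, a tally by minute of the hour helps. The following is a minimal sketch, not what I actually ran. It assumes an 11g-style text alert log where each entry is preceded by a timestamp line such as `Mon Jan 05 10:40:12 2015`; the function name `swap_warning_minutes` is made up for illustration.

```python
import re
from collections import Counter

# Assumption: 11g-style timestamp lines, e.g. "Mon Jan 05 10:40:12 2015".
TIMESTAMP = re.compile(r"^\w{3}\s+\w{3}\s+\d{1,2}\s+(\d{2}):(\d{2}):(\d{2})\s+\d{4}$")
MESSAGE = "Heavy swapping observed on system"


def swap_warning_minutes(lines):
    """Count swap warnings by the minute-of-hour of the preceding timestamp line."""
    counts = Counter()
    minute = None
    for line in lines:
        line = line.strip()
        match = TIMESTAMP.match(line)
        if match:
            # Remember the minute of the most recent timestamp line.
            minute = int(match.group(2))
        elif MESSAGE in line and minute is not None:
            counts[minute] += 1
    return counts


if __name__ == "__main__":
    # Illustrative lines only, not taken from a real alert log.
    sample = [
        "Mon Jan 05 10:40:12 2015",
        "WARNING: Heavy swapping observed on system in last 5 mins.",
        "Mon Jan 05 11:02:47 2015",
        "Thread 1 advanced to log sequence 101",
        "Mon Jan 05 11:40:09 2015",
        "WARNING: Heavy swapping observed on system in last 5 mins.",
    ]
    print(swap_warning_minutes(sample).most_common())
```

In practice you would pass an open file handle for each alert log instead of the sample list.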

The server hosts about 30 databases, so going through each one to see what activity there was would have been tricky and time consuming. There were also no cron jobs scheduled at that time. I suspected the issue might be related to the EM12c agent or to the database instances using DISM.

There are some notes on Metalink relating to this message, including a bug on AIX, but this server runs Solaris and I couldn’t find anything useful.

So let’s do some investigating. The database reports OS stats in v$osstat, and two statistics are of interest: VM_IN_BYTES and VM_OUT_BYTES. Let’s use a little script that captures the delta of these every n seconds. I’ve slightly modified James Koopman’s script from this blog to suit my needs. These stats are system wide, so I’m comfortable getting them from just one database.
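
The delta idea can be sketched as follows. This is not James Koopman’s script or my modified version of it, just a minimal Python illustration of sampling two cumulative counters and reporting the change per interval. The `read_stats` callable is a hypothetical hook: it is assumed to return the current VM_IN_BYTES and VM_OUT_BYTES values, for example by running a query against v$osstat through whatever database driver you use.

```python
import time

# Assumed query behind read_stats (v$osstat exposes STAT_NAME and VALUE):
#   SELECT stat_name, value FROM v$osstat
#   WHERE stat_name IN ('VM_IN_BYTES', 'VM_OUT_BYTES')


def vm_deltas(read_stats, interval_s=5.0, samples=3, sleep=time.sleep):
    """Yield (vm_in_delta, vm_out_delta) in bytes for each sampling interval.

    read_stats is a hypothetical callable returning a dict with the current
    cumulative 'VM_IN_BYTES' and 'VM_OUT_BYTES' values.
    """
    previous = read_stats()
    for _ in range(samples):
        sleep(interval_s)
        current = read_stats()
        # The counters are cumulative, so the difference between two
        # consecutive samples is the activity during the interval.
        yield (
            current["VM_IN_BYTES"] - previous["VM_IN_BYTES"],
            current["VM_OUT_BYTES"] - previous["VM_OUT_BYTES"],
        )
        previous = current
```

Passing `sleep` as a parameter is only there so the loop can be exercised without waiting; in normal use the default `time.sleep` applies.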

The database reports a fair bit of paging in.

How about we drill down using DTrace? I’m using Brendan Gregg’s DTraceToolkit, specifically the vmbypid.d script.

As suspected, it looks like the EM12c Java agent is doing a fair bit of work at this time, with copy-on-write faults being the main culprit.

So it could be the agent spawning its children to do all the gathering at a particular point in time. Checking out its schedule with:

I can see the metric gathering for all 30 databases starting at approximately 35 minutes past the hour. Combine this with the work already happening across the 30 databases and it might just push page-ins over the threshold at which the database logs the warning.

As a test, let’s stop all the metric collections for the databases using DISM.

After disabling the metric collections for the databases using DISM, the paging in has eased and we are now below the threshold that triggers the warning. There are no more messages in the alert log. Is this the long-term fix? Possibly not. We may need to turn off DISM on these instances, but that requires setting sga_target and sga_max_size to the same value, which in turn needs a reboot. I’m not sure the developers want that to happen right now, and as this isn’t causing any performance issues we’ll let it rest until a later date.

Once DISM is switched off on the remaining instances I’ll turn the agent schedule back on and see if we are on the right track.
