As I was refreshing a dev database and checking the alert log, I noticed the following message:
```
WARNING: Heavy swapping observed on system in last 5 mins.
pct of memory swapped in [3.00%] pct of memory swapped out [0.00%].
Please make sure there is no memory pressure and the SGA and PGA
are configured correctly. Look at DBRM trace file for more details.
```
To make sure it wasn't related to the refresh, I looked at the other databases' alert logs and their history, and found that this message had been appearing constantly in every log for at least the last six months. In the database I was looking at, it reliably appeared at 40 minutes past the hour. So what was happening on the system in the preceding few minutes to trigger this message?
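For what it's worth, spotting the pattern across a host's alert logs needs nothing fancier than a loop like this (a sketch; paths assume the standard 11g+ diag layout):

```
# List the timestamps at which the warning fired in every alert log on the
# host. Adjust the path for your environment.
for f in $ORACLE_BASE/diag/rdbms/*/*/trace/alert_*.log
do
  echo "== $f"
  grep -B1 'Heavy swapping observed' "$f" \
    | grep -E '^(Mon|Tue|Wed|Thu|Fri|Sat|Sun)'   # the timestamp line precedes the WARNING
done
```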
The server hosts about 30 databases, so going through each one to see what activity was running would have been tricky and time consuming. There were also no cron jobs scheduled at that time. I suspected the issue might be related to the EM12c agent or to DISM on the database instances.
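The DISM half of that suspicion is at least cheap to check: on Solaris, every instance running with DISM has a dedicated ora_dism_&lt;SID&gt; watcher process.

```
# DISM-enabled instances each run an ora_dism_<SID> process on Solaris.
ps -ef | grep ora_dism | grep -v grep
```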
There are some notes on Metalink relating to this message, and a bug logged for AIX, but this server runs Solaris and I couldn't find anything useful.
So let's do some investigating. The database reports OS statistics in v$osstat, and there are two statistics of interest: VM_IN_BYTES and VM_OUT_BYTES. Let's use a little script that captures the delta of these every n seconds. I've slightly modified James Koopman's script from this blog to suit my needs. These stats are system wide, so I'm comfortable gathering them from just the one database.
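The modified script isn't reproduced here, but the core idea is a loop that samples the two statistics and prints the difference between consecutive samples. A minimal sketch (my own illustration, not the actual script; the real one also reports the CPU tick columns you'll see in the output below):

```
#!/bin/bash
# Minimal sketch of the idea behind swap.sh: sample V$OSSTAT every
# INTERVAL seconds and print the per-interval deltas of the two VM stats.
# Assumes a local "/ as sysdba" connection; this is not the original script.
INTERVAL=${1:-5}; COUNT=${2:-10}
prev_in=0; prev_out=0
for ((i = 0; i <= COUNT; i++)); do
  vals=$(sqlplus -s "/ as sysdba" <<EOF
set heading off feedback off pagesize 0
select max(decode(stat_name, 'VM_IN_BYTES',  value)),
       max(decode(stat_name, 'VM_OUT_BYTES', value))
from   v\$osstat
where  stat_name in ('VM_IN_BYTES', 'VM_OUT_BYTES');
EOF
)
  read cur_in cur_out <<< "$vals"
  if ((i > 0)); then
    echo "$(date '+%Y%m%d:%H:%M:%S') swapin=$((cur_in - prev_in)) swapout=$((cur_out - prev_out))"
  fi
  prev_in=$cur_in; prev_out=$cur_out
  sleep "$INTERVAL"
done
```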
```
$ ./swap.sh 5 50 ***/***@***/***
time               user  swapin     swapout  system  iowait  idle
20140319:08:37:56  486   190000000  0        2400    0       9157
20140319:08:38:03  459   90000000   0        1196    0       10398
20140319:08:38:09  549   90000000   0        1353    0       10143
20140319:08:38:15  774   240000000  0        2468    0       8805
20140319:08:38:21  931   70000000   0        1615    0       9526
20140319:08:38:27  1108  90000000   0        2648    0       8285
20140319:08:38:33  698   40000000   0        754     0       10592
20140319:08:38:40  822   140000000  0        1404    0       9825
20140319:08:38:46  760   140000000  0        2621    0       8748
20140319:08:38:52  1095  240000000  0        4134    0       6835
20140319:08:38:58  750   160000000  0        2161    0       9144
20140319:08:39:05  433   40000000   0        674     0       10943
20140319:08:39:11  522   0          0        262     0       11258
20140319:08:39:17  604   0          0        596     0       10845
20140319:08:39:23  1229  20000000   0        1871    0       8943
20140319:08:39:29  457   30000000   0        698     0       10895
20140319:08:39:35  763   0          0        356     0       10918
20140319:08:39:42  708   0          0        1268    0       10064
20140319:08:39:48  541   0          0        235     0       11263
20140319:08:39:54  349   20000000   0        1452    0       10244
20140319:08:40:00  759   80000000   0        2162    0       9130
```
The database reports a fair bit of paging in.
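As a sanity check, the OS-level view should agree; on Solaris the pi column of vmstat reports kilobytes paged in per second:

```
# Cross-check the database's VM_IN_BYTES figures against the OS itself.
vmstat 5 5
```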
Let's drill down using DTrace. I'm using Brendan Gregg's DTraceToolkit, specifically the vmbypid.d script.
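Under the hood, vmbypid.d is essentially an aggregation over DTrace's vminfo provider, roughly this one-liner (run as root; the counts print on Ctrl-C):

```
# Roughly what vmbypid.d does: count vminfo events (pgin, maj_fault,
# cow_fault, ...) per process.
dtrace -n 'vminfo::: { @[execname, pid, probename] = count(); }'
```

Running the script during the window in question: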
```
./vmbypid.d
..snip..
oracle   24841  pgfrec       14970
oracle   24841  pgrec        14970
oracle   18162  as_fault     16716
oracle   18162  maj_fault    17121
oracle   23720  maj_fault    27475
oracle   23720  pgin         27476
oracle   23720  fspgin       27523
oracle   23720  pgpgin       27523
oracle   24841  maj_fault    33052
oracle   24841  pgin         33990
oracle   18162  pgin         34245
oracle   24841  fspgin       38272
oracle   24841  pgpgin       38272
java     28608  as_fault     86635
tnslsnr  7790   cow_fault    153376
oracle   18162  pgfrec       240070
oracle   18162  pgrec        240070
oracle   18162  fspgin       257553
oracle   18162  pgpgin       257553
java     28608  prot_fault   341975
java     28608  cow_fault    2319941
```
As suspected, it looks like the EM12c Java agent is doing a fair bit of work at this time, with copy-on-write faults being the main culprit:
```
COW Faults (#/s)

Description
COW (copy-on-write) Faults is the number of copy-on-write faults per second.
If one of the processes sharing the page attempts to write to the page, a
copy-on-write page fault occurs. Another page is taken from the free list
and the original page is copied.
```
So it could be the agent spawning its children to do all the gathering at a particular point in time. Let's check out its schedule with:
```
emctl status agent scheduler
..snip..
2014-03-19 11:37:50.669 : oracle_database:DB2:DatabaseVaultPolicyChanges_collection
2014-03-19 11:37:52.963 : oracle_database:DB2:DatabaseVaultRealmViolation_collection
2014-03-19 11:37:52.971 : oracle_database:DB3:DatabaseVaultCommandRuleConfigurationIssue_collection
2014-03-19 11:37:53.561 : oracle_database:DB4:DatabaseVaultRealmConfigurationIssue_collection
2014-03-19 11:37:54.160 : oracle_database:DB4:DatabaseVaultCommandRuleConfigurationIssue_collection
2014-03-19 11:37:54.160 : oracle_database:DB4:DatabaseVaultCommandRuleViolation_collection
2014-03-19 11:37:54.160 : oracle_database:DB4:DatabaseVaultPolicyChanges_collection
2014-03-19 11:37:54.160 : oracle_database:DB4:DatabaseVaultRealmViolation_collection
2014-03-19 11:37:56.790 : oracle_database:<>:oracle_dbconfig
2014-03-19 11:37:57.447 : oracle_database:<>:DatabaseVaultRealmConfigurationIssue_collection
2014-03-19 11:37:57.457 : oracle_database:<>:DatabaseVaultCommandRuleViolation_collection
2014-03-19 11:37:57.457 : oracle_database:<>:DatabaseVaultRealmConfigurationIssue_collection
2014-03-19 11:37:57.457 : oracle_database:<>:DatabaseVaultRealmViolation_collection
2014-03-19 11:37:59.457 : oracle_database:<>:DatabaseVaultPolicyChanges_collection
2014-03-19 11:37:59.862 : oracle_database:<>V:DatabaseVaultCommandRuleViolation_collection
2014-03-19 11:37:59.862 : oracle_database:<>:DatabaseVaultPolicyChanges_collection
..snip..
```
I can see the metric gathering for all 30 databases kicking off at approximately 35 minutes past the hour. Combine this with the regular work happening across those 30 databases and it might be just enough to push the page-in rate over the database metric's threshold.
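A quick way to quantify the pile-up is to bucket the scheduled collections by minute, a rough sketch:

```
# Count scheduled collections per minute across all database targets.
emctl status agent scheduler | grep oracle_database \
  | awk '{ print substr($2, 1, 5) }' | sort | uniq -c
```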
As a test, let's stop all the metric collections for the databases that use DISM.
```
$ for x in `ps -fe | grep dism | perl -anF/_/ -e 'print "$F[2]";'`
> do
> emctl stopschedule agent $x,oracle_database
> done
Oracle Enterprise Manager Cloud Control 12c Release 2
Copyright (c) 1996, 2012 Oracle Corporation. All rights reserved.
Stop Schedule Target succeeded
Oracle Enterprise Manager Cloud Control 12c Release 2
Copyright (c) 1996, 2012 Oracle Corporation. All rights reserved.
Stop Schedule Target succeeded
Oracle Enterprise Manager Cloud Control 12c Release 2
Copyright (c) 1996, 2012 Oracle Corporation. All rights reserved.
Stop Schedule Target succeeded
Oracle Enterprise Manager Cloud Control 12c Release 2
Copyright (c) 1996, 2012 Oracle Corporation. All rights reserved.
Stop Schedule Target succeeded
```
After disabling the metric collections for the databases using DISM, the paging in has dropped off and we are now under the threshold that triggers the warning; there are no more messages in the alert log. Is this the long-term fix? Possibly not. We may need to turn off DISM on these instances, but that requires setting sga_target and sga_max_size to the same value, which in turn needs an instance restart. I'm not sure the developers want that to happen right now, and as this isn't causing any performance issues, we'll let it rest for now and revisit it at a later date.
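For reference, switching DISM off would look something like the following (a sketch only, with placeholder sizes): on Solaris an instance uses DISM when sga_max_size is set above sga_target, so aligning the two and restarting drops it back to ISM.

```
# Placeholder sizes -- use the instance's real SGA size.
sqlplus -s "/ as sysdba" <<EOF
alter system set sga_max_size = 8g scope = spfile;
alter system set sga_target   = 8g scope = spfile;
EOF
# ...then bounce the instance for the spfile change to take effect.
```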
Once DISM is switched off on the remaining instances, I'll turn the agent schedules back on and see if we're on the right track.
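Re-enabling should just be the stop loop in reverse, assuming the agent exposes a matching startschedule verb (worth confirming with emctl -h before relying on it):

```
# Assumed counterpart to the stopschedule loop above; verify the verb
# exists on your agent version first.
for x in `ps -fe | grep dism | perl -anF/_/ -e 'print "$F[2]";'`
do
  emctl startschedule agent $x,oracle_database
done
```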