[20230809]ora-04030问题分析整理.txt
--//生产系统同事使用toad连接经常出现ora-04030错误。
ORA-04030: out of process memory when trying to allocate 123416 bytes (QERHJ hash-joi,kllcqas:kllsltba)
--//仔细看joi确实不是join,开始猜测可能某个程序的sql语句选择hash-join,导致pga消耗太大。
--//同事给了我alert日志的截图,提示都是icare_s001_20087.trc的进程,难道全部使用共享连接模式报错!!
--//icare_s001_20087.trc的进程报如下错误,对应进程里面应该能看到sql语句。
ORA-04030: out of process memory when trying to allocate 82456 bytes (pga heap,control file i/o buffer)
--//原始的分析版本写的太乱,事后重新做一下整理,主要还是想记录一下当时的分析过程,估计还是很乱!!
1.环境:
# cat /etc/redhat-release
redhat 4
#Red Hat Enterprise Linux Server release 6.0 (Santiago)
# free -m
total used free shared buffers cached
Mem: 15953 15775 178 0 4 5003
-/+ buffers/cache: 10767 5186
Swap: 15999 15999 0
--//昏,swap内存都全部使用了.
[email protected]:1521/icare> @ ver1
PORT_STRING VERSION BANNER
------------------------------ -------------- ----------------------------------------------------------------
x86_64/Linux 2.4.xx 10.2.0.4.0 Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - 64bi
--//哇!很古老的版本.
[email protected]:1521/icare:DEDICATED> select * from v$instance;
INSTANCE_NUMBER INSTANCE_NAME HOST_NAME VERSION STARTUP_TIME STATUS PAR THREAD# ARCHIVE LOG_SWITCH_WAIT LOGINS SHU DATABASE_STATUS INSTANCE_ROLE ACTIVE_ST BLO
--------------- ---------------- --------------- ----------------- ------------------- ------------ --- ---------- ------- --------------- ---------- --- ----------------- ------------------ --------- ---
1 icare DZBL_DB_101.132 10.2.0.4.0 2021-08-24 08:24:17 OPEN NO 1 STARTED ALLOWED NO ACTIVE PRIMARY_INSTANCE NORMAL NO
--//出现问题后没有重启过机器!!顺便说明一下,这台机器很少访问,基本作为历史数据库,偶尔有应用连上就是查询数据.
2.分析:
[email protected]:1521/icare> show sga
Total System Global Area 6442450944 bytes
Fixed Size 2093584 bytes
Variable Size 2181041648 bytes
Database Buffers 4244635648 bytes
Redo Buffers 14680064 bytes
[email protected]:1521/icare> show parameter pga
PARAMETER_NAME TYPE VALUE
-------------------- ----------- -------
pga_aggregate_target big integer 3G
[email protected]:1521/icare> @ spid
SID SERIAL# PROCESS SERVER SPID PID P_SERIAL# C50
---------- ---------- ------------ --------- ------ ------- ---------- --------------------------------------------------
855 22839 31462 SHARED 1811 18 237 alter system kill session '855,22839' immediate;
--//噢!以前管理的数据库不合理,选择的连接模式是SHARED模式,spid=1811.
--//这样可能更好理解为什么alert日志都是记录都是S001的进程.
$ grep -B1 "ORA-04030" alert_icare.log | grep Error | sort | uniq -c
1 Errors in file /opt/oracle/admin/icare/bdump/icare_s000_18301.trc:
1 Errors in file /opt/oracle/admin/icare/bdump/icare_s000_26348.trc:
1968 Errors in file /opt/oracle/admin/icare/bdump/icare_s001_20087.trc:
--//很明显问题问题出在共享连接的进程,也就是可以估计应用段的进程都是以共享模式连接的.
[email protected]:1521/icare> show parameter disp
PARAMETER_NAME TYPE VALUE
---------------- -------- ---------------
dispatchers string (PROTOCOL=TCP)
max_dispatchers integer
--//奇怪没有任何配置.难道同事建立的配置连接串没有(SERVER = DEDICATED)设置.缺省共享模式优先.
$ lsnrctl service
LSNRCTL for Linux: Version 10.2.0.4.0 - Production on 10-AUG-2023 08:25:53
Copyright (c) 1991, 2007, Oracle. All rights reserved.
Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=IPC)(KEY=EXTPROC1)))
Services Summary...
Service "icare" has 2 instance(s).
Instance "icare", status UNKNOWN, has 1 handler(s) for this service...
Handler(s):
"DEDICATED" established:1419 refused:0
LOCAL SERVER
Instance "icare", status READY, has 2 handler(s) for this service...
Handler(s):
"DEDICATED" established:1458 refused:0 state:ready
LOCAL SERVER
"D000" established:7472195 refused:0 current:109 max:1022 state:ready
DISPATCHER <machine: DZBL_DB_101.132, pid: 26346>
(ADDRESS=(PROTOCOL=tcp)(HOST=DZBL_DB_101.132)(PORT=45587))
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Service "icaredg4" has 1 instance(s).
Instance "icare", status READY, has 2 handler(s) for this service...
Handler(s):
"DEDICATED" established:1458 refused:0 state:ready
LOCAL SERVER
"D000" established:7472195 refused:0 current:109 max:1022 state:ready
DISPATCHER <machine: DZBL_DB_101.132, pid: 26346>
(ADDRESS=(PROTOCOL=tcp)(HOST=DZBL_DB_101.132)(PORT=45587))
Service "icaredg4_XPT" has 1 instance(s).
Instance "icare", status READY, has 2 handler(s) for this service...
Handler(s):
"DEDICATED" established:1458 refused:0 state:ready
LOCAL SERVER
"D000" established:7472195 refused:0 current:109 max:1022 state:ready
DISPATCHER <machine: DZBL_DB_101.132, pid: 26346>
(ADDRESS=(PROTOCOL=tcp)(HOST=DZBL_DB_101.132)(PORT=45587))
The command completed successfully
--//icare服务存在两者模式,缺省连接串没有配置,优先选择共享模式连接.
[email protected]:1521/icare> column Program format a30
[email protected]:1521/icare> @pga 10
NAME VALUE_MB
------------------------------ ----------
aggregate PGA target parameter 3072
total PGA inuse 17655.4385
total PGA allocated 19551.1279
over allocation count 18030955
PROFILE CNT PERCENTAGE
------------------------------- ---------- ----------
workarea executions - optimal 74202449 100
workarea executions - onepass 89372 0
workarea executions - multipass 3 0
SPID SID SERIAL# MACHINE CLIENT_INFO Program PGA_USED_MEM PGA_ALLOC_MEM PGA_MAX_MEM
------ ---------- ---------- ---------------- -------------------- ------------------------------ ------------ ------------- -----------
2290 1014 34111 H3C-APM 192.168.101.212 JDBC Thin Client 6862056397 7554874813 7555005885
1811 855 22839 gxqyydg4 192.168.100.78 sqlplus@gxqyydg4 (TNS V1-V3) 2603811029 2875538877 2881830333
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
15102 977 9825 DZBL_DB_101.132 oracle@DZBL_DB_101.132 (LNS1) 38893197 79486477 79486477
26437 1089 3 DZBL_DB_101.132 oracle@DZBL_DB_101.132 (ARC0) 22414453 47438477 48683661
26332 1101 1 DZBL_DB_101.132 oracle@DZBL_DB_101.132 (LGWR) 10946301 28613773 33987725
26439 1086 3 DZBL_DB_101.132 oracle@DZBL_DB_101.132 (ARC1) 10814885 24238733 25483917
26342 1096 1 DZBL_DB_101.132 oracle@DZBL_DB_101.132 (MMON) 1770773 3662333 3793405
26346 833 21702 192.168.100.51 w3wp.exe 933309 933309 1761861
26346 835 18802 192.168.100.51 w3wp.exe 933309 933309 1761861
26346 839 27183 192.168.100.51 w3wp.exe 933309 933309 1761861
10 rows selected.
--//看看那个进程消耗很大的PGA。
--//注意看下划线是我诊断的连接进程.我不可能怎么没做就使用这么多PGA,一定是连接上共享模式的进程导致的问题.
--//Program=w3wp.exe的spid=26346,说明也是共享模式连接的进程.
--//后记:我这里分析有错!!实际上spid=2290进程占用的pga内存可能是多个回话共享的,因为当前仅仅1个连上,其他用户估计很难登陆上.
--//如果多个用户使用共享模式连上该进程,后面的PGA_USED_MEM都都等于6862056397.看看spid=26346,就很容易明白了.
--//也就是我写pga.sql脚本查询涉及共享模式使用pga的情况有点问题.
# ps -fp 2290,1811,26346
UID PID PPID C STIME TTY TIME CMD
oracle 1811 1 0 2022 ? 08:41:28 ora_s002_icare
oracle 2290 1 0 2022 ? 16:15:39 ora_s000_icare
oracle 26346 1 54 2021 ? 391-21:33:26 ora_d000_icare
--//都是共享模式相关的进程.
--//后记:这里出现一个ora_d000_icare进程,我当时并没有多想,实际上应用程序的执行卡在这里,还停留在ora_d000_icare进程,正常成
--//功的化,应该出现的是ora_s00N_icare的进程.当时没有考虑周全,经验不足.
--//如果安装smem软件包,可以执行如下:
# smem -tk -U oracle -P "ora_|oracleicare"
--//检查内核参数/proc/sys/vm/max_map_count:
# cat /proc/sys/vm/max_map_count
65530
--//来自链接 https://www.josip-pojatina.com/en/bulk-collect-and-memory-limits/
To increase a number of the map entries from 4 GB (65530 x 64 / 1024 to get the amount in GB) into the 16 GB, you need
to execute the following command as a root user:
--//65530*64/1024 = 4095.62500000000000000000
--//2^16 = 65536
--//如何知道每个64K?
# sysctl -w vm.max_map_count=262144
vm.max_map_count = 262144
--//这个我没有执行,目前不是这方面的问题.
--//顺便查看ash的情况:
[email protected]:1521/icare:DEDICATED> @ tpt/ash/ashtop10 sql_id 1=1 &day
Total
Seconds AAS %This SQL_ID FIRST_SEEN LAST_SEEN
--------- ------- ------- ------------- ------------------- -------------------
791 .0 37% | 2023-08-09 08:58:00 2023-08-10 08:52:13
200 .0 9% | 8szmwam7fysa3 2023-08-09 22:00:06 2023-08-09 22:03:26
154 .0 7% | 93jgxvdzsx4y1 2023-08-09 08:57:58 2023-08-10 08:27:37
93 .0 4% | 5r14h528vkacs 2023-08-09 08:58:03 2023-08-10 08:27:41
47 .0 2% | b0nsh8xz40nhr 2023-08-09 08:58:01 2023-08-10 08:27:38
42 .0 2% | 19nrxkxw2b8j1 2023-08-09 08:58:07 2023-08-10 08:27:44
30 .0 1% | dqxfcr2c8265u 2023-08-09 09:06:36 2023-08-10 08:14:24
30 .0 1% | g97w6fq8yw5br 2023-08-09 08:58:11 2023-08-10 08:27:48
29 .0 1% | gy2ncu3j1x25j 2023-08-09 09:56:59 2023-08-10 08:44:33
25 .0 1% | 6q6tgwwt6bku9 2023-08-09 09:58:02 2023-08-10 08:27:40
23 .0 1% | 7dsgq5k08yymk 2023-08-09 12:28:06 2023-08-10 08:27:47
22 .0 1% | 6wcywxctbbvs8 2023-08-09 11:17:26 2023-08-10 06:43:46
21 .0 1% | 27m1sf1nknfz2 2023-08-09 08:58:10 2023-08-10 06:27:48
21 .0 1% | 39nd965zn01yq 2023-08-09 09:26:37 2023-08-10 07:44:18
19 .0 1% | 3vsh63tqr0sr3 2023-08-09 08:54:35 2023-08-10 07:41:41
15 rows selected.
[email protected]:1521/icare:DEDICATED> @ tpt/ash/ashtop10 event 1=1 &day
Total
Seconds AAS %This EVENT FIRST_SEEN LAST_SEEN
--------- ------- ------- ----------------------------- ------------------- -------------------
1053 .0 50% | 2023-08-09 08:56:29 2023-08-10 08:52:13
412 .0 19% | db file sequential read 2023-08-09 08:57:59 2023-08-10 08:48:57
245 .0 12% | log file parallel write 2023-08-09 08:58:00 2023-08-10 08:52:09
139 .0 7% | control file parallel write 2023-08-09 09:14:32 2023-08-10 08:51:24
93 .0 4% | log file sync 2023-08-09 08:58:25 2023-08-10 08:52:09
86 .0 4% | control file sequential read 2023-08-09 08:58:04 2023-08-10 08:14:24
26 .0 1% | null event 2023-08-09 09:30:51 2023-08-10 08:41:22
23 .0 1% | LNS wait on SENDREQ 2023-08-09 09:13:47 2023-08-10 08:41:36
14 .0 1% | log file sequential read 2023-08-09 10:05:24 2023-08-10 08:42:45
11 .0 1% | os thread startup 2023-08-09 09:00:13 2023-08-10 00:00:49
10 .0 0% | db file scattered read 2023-08-09 22:00:19 2023-08-09 22:00:59
3 .0 0% | db file parallel read 2023-08-09 16:31:31 2023-08-10 08:52:09
3 .0 0% | log file switch completion 2023-08-09 22:00:05 2023-08-10 08:42:42
2 .0 0% | Log archive I/O 2023-08-09 09:15:04 2023-08-09 12:30:26
1 .0 0% | latch free 2023-08-10 07:03:11 2023-08-10 07:03:11
15 rows selected.
--//数据库性能基本没有怎么问题.
3.临时解决方法:
--//先停止192.168.101.212 上的应用.修改配置连接参数加上
(SERVER = DEDICATED)
--//192.168.100.51 类似.其他地址类似.
--//剩下的是如何回收内存的工作.
[email protected]:1521/icare:DEDICATED> alter system kill session '1014,34113' immediate;
System altered.
# free -m
total used free shared buffers cached
Mem: 15953 15781 172 0 3 5002
-/+ buffers/cache: 10775 5178
Swap: 15999 15999 0
--//前面的执行情况,为了对比方便我复制下来:
# free -m
total used free shared buffers cached
Mem: 15953 15775 178 0 4 5003
-/+ buffers/cache: 10767 5186
Swap: 15999 15999 0
--//昏基本没有任何改变.
[email protected]:1521/icare:DEDICATED> @ pga 10
NAME VALUE_MB
------------------------------ ----------
aggregate PGA target parameter 3072
total PGA inuse 17656.9668
total PGA allocated 19554.3701
over allocation count 18031763
PROFILE CNT PERCENTAGE
-------------------------------- ---------- ----------
workarea executions - optimal 74204980 100
workarea executions - onepass 89374 0
workarea executions - multipass 3 0
SPID SID SERIAL# MACHINE CLIENT_INFO Program PGA_USED_MEM PGA_ALLOC_MEM PGA_MAX_MEM
------ ---------- ---------- --------------- -------------------- ------------------------------ ------------ ------------- -----------
15102 977 9825 DZBL_DB_101.132 oracle@DZBL_DB_101.132 (LNS1) 38893197 79486477 79486477
26437 1089 3 DZBL_DB_101.132 oracle@DZBL_DB_101.132 (ARC0) 22414453 47438477 48683661
26332 1101 1 DZBL_DB_101.132 oracle@DZBL_DB_101.132 (LGWR) 10946301 28613773 33987725
26439 1086 3 DZBL_DB_101.132 oracle@DZBL_DB_101.132 (ARC1) 10814885 24238733 25483917
26342 1096 1 DZBL_DB_101.132 oracle@DZBL_DB_101.132 (MMON) 1770773 3662333 3793405
11755 855 22871 gxqyydg4 192.168.100.78 sqlplus@gxqyydg4 (TNS V1-V3) 1617021 2220613 5497413
26346 833 21702 192.168.100.51 w3wp.exe 933309 933309 1761861
26346 835 18802 192.168.100.51 w3wp.exe 933309 933309 1761861
26346 839 27183 192.168.100.51 w3wp.exe 933309 933309 1761861
26346 840 28987 192.168.100.51 w3wp.exe 933309 933309 1761861
10 rows selected.
--//后记:这是我才反应过来前面alter system kill session '1014,34113' immediate;不可能完全回收s00N占用的内存的.我必须
--//kill对应的进程.这里也说明一点,如果这个时候使用pag.sql脚本查询,很难判断问题在那里.
--//先看看那些进程占用内存.
[root@DZBL_DB_101 app]# top
top - 09:30:22 up 762 days, 13:58, 5 users, load average: 1.08, 1.14, 1.10
Tasks: 271 total, 2 running, 269 sleeping, 0 stopped, 0 zombie
Cpu(s): 2.9%us, 4.4%sy, 0.0%ni, 92.4%id, 0.3%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 16336568k total, 16169608k used, 166960k free, 4940k buffers
Swap: 16383992k total, 16383992k used, 0k free, 5137576k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
26346 oracle 20 0 6292m 313m 311m R 100.0 2.0 564370:57 oracle
2792 root 20 0 154m 27m 1816 S 5.9 0.2 25322:45 Xorg
10132 root 20 0 4470m 1.1g 2568 S 2.0 6.8 30280:47 gnome-system-mo
1 root 20 0 19244 896 676 S 0.0 0.0 1:23.96 init
2 root 20 0 0 0 0 S 0.0 0.0 0:00.04 kthreadd
--//不熟悉top的使用,如果安装我熟悉的smem,也许问题定位会更快一些.实际上缺省按照CPU 使用排序的.spid=26346,CPU使用率100%,实
--//际上可能通过共享模式登陆还是报错.从以上输出可以发现启动到图形模式5,可以回到多用户模式3,回收一些内存,切换到多用户模式
--//,不启动图形模式X11,注意同时修改/etc/inittab文件:
--//id:3:initdefault:
[root@DZBL_DB_101 app]# who -r
run-level 5 2021-07-08 19:31
[root@DZBL_DB_101 app]# init 3
[root@DZBL_DB_101 app]# who -r
run-level 3 2023-08-10 09:33 last=5
# free -m
total used free shared buffers cached
Mem: 15953 14582 1371 0 9 5068
-/+ buffers/cache: 9503 6449
Swap: 15999 12577 3422
--//回收了一些内存.swap free增加3422M.
[root@DZBL_DB_101 app]# top
top - 09:57:24 up 762 days, 14:06, 2 users, load average: 1.03, 1.33, 1.23
Tasks: 225 total, 2 running, 223 sleeping, 0 stopped, 0 zombie
Cpu(s): 2.9%us, 4.4%sy, 0.0%ni, 92.4%id, 0.3%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 16336568k total, 14947360k used, 1389208k free, 11068k buffers
Swap: 16383992k total, 12879480k used, 3504512k free, 5199580k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
26346 oracle 20 0 6292m 313m 311m R 100.0 2.0 564378:32 oracle
3956 root 20 0 1394m 30m 3992 S 2.0 0.2 0:41.88 infoeyeagent_mq
1 root 20 0 19244 992 752 S 0.0 0.0 1:24.00 init
2 root 20 0 0 0 0 S 0.0 0.0 0:00.04 kthreadd
--//spid=26346还是占用许多内存,分析有误,实际上它不是占用最多的.
--//后记:我这里再次犯了严重错误,没有按照内存使用排序,如果排序能很快定位问题.不过我当时想既然spid=26346进程对应
--//ora_d000_icare,kill -9应该下面ora_s00N_icare进程也会一起kill掉,重新建立新的进程.
--//再尝试前,我想看看先kill对应的session如何.
# ps -fp 26346
UID PID PPID C STIME TTY TIME CMD
oracle 26346 1 54 2021 ? 391-22:19:20 ora_d000_icare
--//后记:实际上这时我才想起共享连接不应该停在这里,实际上前面的w3wp.exe执行卡在这里,无法继续下去.
--//取消kill这些连接应该没有任何问题,注实际上很多,差不多有1XX个.
[email protected]:1521/icare:DEDICATED> alter system kill session '833,21702,@1' immediate -- ICARE@ (w3wp.exe);
alter system kill session '833,21702,@1' immediate -- ICARE@ (w3wp.exe)
*
ERROR at line 1:
ORA-00026: missing or invalid session ID
[email protected]:1521/icare:DEDICATED> alter system kill session '833,21702' immediate -- ICARE@ (w3wp.exe);
System altered.
--//昏!10g版本竟然不支持@1这样的写法.
[email protected]:1521/icare:DEDICATED> @ killi program='w3wp.exe'
COMMANDS_TO_VERIFY_AND_RUN
--------------------------------------------------------------------------------------
alter system kill session '1010,44238' immediate -- ICARE@WORKGROUP\DELL70 (w3wp.exe);
...
--//拷贝然后执行,再查看:
[root@DZBL_DB_101 app]# free -m
total used free shared buffers cached
Mem: 15953 14722 1230 0 13 5136
-/+ buffers/cache: 9572 6381
Swap: 15999 12557 3442
--//变化不是很大.
top - 10:07:41 up 762 days, 14:16, 2 users, load average: 1.10, 1.20, 1.19
Tasks: 227 total, 2 running, 225 sleeping, 0 stopped, 0 zombie
Cpu(s): 2.9%us, 4.4%sy, 0.0%ni, 92.4%id, 0.3%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 16336568k total, 15085796k used, 1250772k free, 14212k buffers
Swap: 16383992k total, 12858468k used, 3525524k free, 5262764k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
26346 oracle 20 0 6292m 316m 314m R 98.9 2.0 564388:39 oracle
1 root 20 0 19244 992 752 S 0.0 0.0 1:24.02 init
2 root 20 0 0 0 0 S 0.0 0.0 0:00.04 kthreadd
--//简单一点直接kill掉,说明一下:在执行前我看过select * from V$BGPROCESS where PADDR<>'00';的输出,
--//ora_d000_icare,ora_s00N_icare居然没有出现在输出列表里面.有点奇怪!!
[root@DZBL_DB_101 app]# kill -9 26346
[root@DZBL_DB_101 app]# free -m
total used free shared buffers cached
Mem: 15953 14812 1141 0 15 5196
-/+ buffers/cache: 9599 6353
Swap: 15999 12552 3447
--//对比前面基本没有任何变化.
[oracle@DZBL_DB_101 ~]$ ps -ef -u oracle | grep d0[0]
oracle 25655 1 0 10:08 ? 00:00:00 ora_d001_icare
root 30812 2 0 Jan19 ? 00:00:00 [kslowd000]
root 30813 2 0 Jan19 ? 00:00:00 [kslowd001]
--//启动1个新的d001进程.
$ lsnrctl service
LSNRCTL for Linux: Version 10.2.0.4.0 - Production on 10-AUG-2023 10:16:38
Copyright (c) 1991, 2007, Oracle. All rights reserved.
Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=IPC)(KEY=EXTPROC1)))
Services Summary...
Service "icare" has 2 instance(s).
Instance "icare", status UNKNOWN, has 1 handler(s) for this service...
Handler(s):
"DEDICATED" established:1419 refused:0
LOCAL SERVER
Instance "icare", status READY, has 2 handler(s) for this service...
Handler(s):
"DEDICATED" established:3043 refused:0 state:ready
LOCAL SERVER
"D001" established:7 refused:0 current:3 max:1022 state:ready
DISPATCHER <machine: DZBL_DB_101.132, pid: 25655>
(ADDRESS=(PROTOCOL=tcp)(HOST=DZBL_DB_101.132)(PORT=33030))
Service "icaredg4" has 1 instance(s).
Instance "icare", status READY, has 2 handler(s) for this service...
Handler(s):
"DEDICATED" established:3043 refused:0 state:ready
LOCAL SERVER
"D001" established:7 refused:0 current:3 max:1022 state:ready
DISPATCHER <machine: DZBL_DB_101.132, pid: 25655>
(ADDRESS=(PROTOCOL=tcp)(HOST=DZBL_DB_101.132)(PORT=33030))
Service "icaredg4_XPT" has 1 instance(s).
Instance "icare", status READY, has 2 handler(s) for this service...
Handler(s):
"DEDICATED" established:3043 refused:0 state:ready
LOCAL SERVER
"D001" established:7 refused:0 current:3 max:1022 state:ready
DISPATCHER <machine: DZBL_DB_101.132, pid: 25655>
(ADDRESS=(PROTOCOL=tcp)(HOST=DZBL_DB_101.132)(PORT=33030))
The command completed successfully
--//使用d001替换原来的d000.
[oracle@DZBL_DB_101 bdump]$ ps -ef | grep s0[0]
oracle 1811 1 0 2022 ? 08:41:30 ora_s002_icare
oracle 2290 1 0 2022 ? 16:15:53 ora_s000_icare
oracle 20087 1 0 2022 ? 21:44:42 ora_s001_icare
oracle 26354 1 0 2021 ? 01:23:30 ora_s003_icare
--//这才发现ora_s00N_icare进程还在并没有删除.
[oracle@DZBL_DB_101 bdump]$ top -u oracle
top - 10:26:51 up 762 days, 14:35, 2 users, load average: 0.01, 0.06, 0.36
Tasks: 229 total, 1 running, 228 sleeping, 0 stopped, 0 zombie
Cpu(s): 2.9%us, 4.4%sy, 0.0%ni, 92.4%id, 0.3%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 16336568k total, 15241824k used, 1094744k free, 19756k buffers
Swap: 16383992k total, 12854104k used, 3529888k free, 5416796k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1811 oracle 20 0 9057m 3.9g 2.8g S 0.0 25.1 521:30.70 oracle
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2290 oracle 20 0 13.2g 7.4g 2.5g S 0.0 47.5 975:53.23 oracle
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
5500 oracle 20 0 180m 3064 1736 S 0.0 0.0 79:26.09 tnslsnr
15102 oracle 20 0 6338m 32m 6068 S 0.0 0.2 867:40.38 oracle
20087 oracle 20 0 15.3g 4.6g 2.7g S 0.0 29.3 1304:42 oracle
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
23203 oracle 20 0 6291m 33m 28m S 0.0 0.2 0:00.08 oracle
25655 oracle 20 0 6290m 19m 15m S 0.0 0.1 0:00.07 oracle
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ =>d0001
26324 oracle 20 0 6290m 14m 13m S 0.0 0.1 67:26.29 oracle
26326 oracle 20 0 6289m 3588 3440 S 0.0 0.0 30:28.83 oracle
26328 oracle 20 0 6289m 943m 942m S 0.0 5.9 60:37.91 oracle
26330 oracle 20 0 6302m 1.7g 1.7g S 0.0 11.0 91:47.32 oracle
26332 oracle 20 0 6312m 19m 18m S 0.0 0.1 114:29.21 oracle
26334 oracle 20 0 6295m 193m 188m S 0.0 1.2 197:47.28 oracle
26336 oracle 20 0 6292m 1.2g 1.2g S 0.0 7.7 17:08.92 oracle
26338 oracle 20 0 6289m 19m 19m S 0.0 0.1 0:10.50 oracle
26340 oracle 20 0 6296m 152m 147m S 0.0 1.0 115:34.77 oracle
26342 oracle 20 0 6294m 225m 222m S 0.0 1.4 137:20.15 oracle
26344 oracle 20 0 6289m 34m 34m S 0.0 0.2 638:24.48 oracle
26354 oracle 20 0 6290m 1.6g 1.6g S 0.0 10.3 83:30.18 oracle
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
26437 oracle 20 0 6315m 17m 6324 S 0.0 0.1 10:20.44 oracle
26439 oracle 20 0 6304m 16m 5456 S 0.0 0.1 5:52.89 oracle
26455 oracle 20 0 6289m 8480 8320 S 0.0 0.1 9:42.74 oracle
26522 oracle 20 0 6291m 31m 30m S 0.0 0.2 0:03.41 oracle
26950 oracle 20 0 6291m 134m 129m S 0.0 0.8 0:00.39 oracle
30800 oracle 20 0 6289m 3832 3700 S 0.0 0.0 4:56.27 oracle
31446 oracle 20 0 105m 1900 1420 S 0.0 0.0 0:00.02 bash
31831 oracle 20 0 6290m 30m 26m S 0.0 0.2 0:00.03 oracle
32310 oracle 20 0 15072 1204 812 R 0.0 0.0 0:00.00 top
--//昏,没有按照内存使用排序,可以发现spid=1811,2290,20087,26354(注意坎%MEM列) 占用很大内存.
--//后记:自己对top使用的一些细节实在太不熟悉,还是最近干活太少的原因.另外写一篇使用top如何按照内存使用排序的文章.
--//顺便说一下按c可以可以看到COMMAND更加详细的内容.
--//找同事要安装盘没找到,管理混乱,想安装自己熟悉的smem包看看,6.0版本在网上已经很难找到,安装1个rh5的版本smem,居然可以安装
--//成功.
[root@DZBL_DB_101 rpm_tgz]# smem -tk -U oracle -P "ora_|oracleicare"
PID User Command Swap USS PSS RSS
26326 oracle ora_psp0_icare 3.2M 160.0K 319.0K 3.5M
30800 oracle ora_q002_icare 3.3M 164.0K 328.0K 3.7M
26455 oracle ora_qmnc_icare 3.3M 200.0K 698.0K 8.3M
26324 oracle ora_pmon_icare 3.2M 628.0K 1.8M 14.0M
26338 oracle ora_reco_icare 3.7M 372.0K 1.9M 19.6M
26522 oracle ora_q000_icare 3.6M 1.0M 4.1M 31.7M
26332 oracle ora_lgwr_icare 5.5M 4.7M 6.4M 19.1M
25655 oracle ora_d001_icare 0 4.4M 7.2M 19.8M
3510 oracle oracleicare (LOCAL=NO) 0 4.7M 9.6M 36.5M
26344 oracle ora_mmnl_icare 3.1M 1.2M 10.7M 34.4M
31831 oracle oracleicare (LOCAL=NO) 0 4.9M 10.9M 48.9M
26439 oracle ora_arc1_icare 3.5M 11.2M 11.5M 16.5M
26437 oracle ora_arc0_icare 14.3M 11.9M 12.4M 17.9M
26334 oracle ora_ckpt_icare 3.5M 4.6M 25.3M 193.0M
26340 oracle ora_cjq0_icare 4.2M 8.6M 26.8M 152.1M
15102 oracle ora_lns1_icare 3.2M 26.8M 27.2M 32.5M
26342 oracle ora_mmon_icare 3.8M 4.5M 33.5M 225.6M
26950 oracle ora_j000_icare 0 20.5M 40.6M 137.5M
26328 oracle ora_mman_icare 3.2M 44.7M 206.7M 943.1M
26336 oracle ora_smon_icare 3.3M 12.6M 237.0M 1.2G
26354 oracle ora_s003_icare 2.9M 5.8M 324.4M 1.6G
26330 oracle ora_dbw0_icare 10.0M 182.6M 558.5M 1.7G
1811 oracle ora_s002_icare 1.5G 1.2G 1.8G 3.9G
20087 oracle ora_s001_icare 7.3G 1.9G 2.5G 4.6G
2290 oracle ora_s000_icare 2.1G 4.9G 5.5G 7.4G
-------------------------------------------------------------------------------
25 1 11.1G 8.4G 11.4G 22.3G
--//注意看s00X进程占用的内存空间.而且还使用了大量的swap内存.
[email protected]:1521/icare:DEDICATED> select * from V$BGPROCESS where PADDR<>'00';
PADDR PSERIAL# NAME DESCRIPTION ERROR
---------------- ---------- ----- ------------------------------ ---------
00000001FE637BB8 1 PMON process cleanup 0
00000001FB641E88 1 PSP0 process spawner 0 0
00000001FE6383A0 1 MMAN Memory Manager 0
00000001FB642670 1 DBW0 db writer process 0 0
00000001FE63DA98 2 ARC0 Archival Process 0 0
00000001FB647D68 1 ARC1 Archival Process 1 0
00000001FB648550 1 ARC2 Archival Process 2 448
00000001FB645DC8 55 LNS1 Network Server 1 0
00000001FE63E280 1 LNS4 Network Server 4 448
00000001FE638B88 1 LGWR Redo etc. 0
00000001FB642E58 1 CKPT checkpoint 0
00000001FE639370 1 SMON System Monitor Process 0
00000001FB643640 1 RECO distributed recovery 0
00000001FE639B58 1 CJQ0 Job Queue Coordinator 0
00000001FE63E280 5 QMNC AQ Coordinator 0
00000001FB643E28 1 MMON Manageability Monitor Process 0
00000001FE63A340 1 MMNL Manageability Monitor Process 0
17 rows selected.
--//居然S00N不是后台进程应该可以kill掉.
--//后记:我在另外1台11g查看11g上有显示.
[root@DZBL_DB_101 rpm_tgz]# kill -9 1811
[root@DZBL_DB_101 rpm_tgz]# ps -ef | grep s0[0]
oracle 2290 1 0 2022 ? 16:15:55 ora_s000_icare
oracle 14542 1 0 11:03 ? 00:00:00 ora_s002_icare
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
oracle 20087 1 0 2022 ? 21:44:42 ora_s001_icare
oracle 26354 1 0 2021 ? 01:23:30 ora_s003_icare
--//发现重新建立新的ora_s002_icare进程.
--//全部删除ora_s00N_icare进程.
# ps -ef --no-header | grep s0[0] | awk '{print $2}' | xargs -IQ kill -9 Q
--//简单说明:grep 过滤s0[0],使用awk取第2个字段(就是进程号),通过xargs -IQ的模式传参调用kill -9执行.
[root@DZBL_DB_101 rpm_tgz]# ps -ef --no-header | grep s0[0]
oracle 15769 1 0 11:08 ? 00:00:00 ora_s000_icare
oracle 15771 1 0 11:08 ? 00:00:00 ora_s001_icare
oracle 15773 1 0 11:08 ? 00:00:00 ora_s002_icare
oracle 15775 1 0 11:08 ? 00:00:00 ora_s003_icare
--//OK,已经重新启动ora_s00N_icare进程,从时间列上也可以看出.
[root@DZBL_DB_101 rpm_tgz]# smem -tk -U oracle -P "ora_|oracleicare"
PID User Command Swap USS PSS RSS
26326 oracle ora_psp0_icare 3.2M 164.0K 333.0K 3.5M
30800 oracle ora_q002_icare 3.3M 172.0K 358.0K 3.7M
26455 oracle ora_qmnc_icare 3.3M 200.0K 836.0K 8.3M
26338 oracle ora_reco_icare 3.7M 400.0K 2.7M 19.6M
26324 oracle ora_pmon_icare 3.2M 1.2M 3.2M 14.6M
15773 oracle ora_s002_icare 0 3.2M 4.6M 16.7M
15771 oracle ora_s001_icare 0 3.2M 4.6M 16.7M
15775 oracle ora_s003_icare 0 3.2M 4.6M 16.7M
15769 oracle ora_s000_icare 0 3.2M 4.6M 16.7M
26522 oracle ora_q000_icare 3.6M 1.2M 6.3M 31.7M
25655 oracle ora_d001_icare 0 4.7M 6.7M 19.9M
26332 oracle ora_lgwr_icare 5.5M 6.9M 8.6M 19.1M
26439 oracle ora_arc1_icare 3.5M 11.2M 11.7M 16.8M
26437 oracle ora_arc0_icare 14.3M 11.7M 12.4M 17.9M
31831 oracle oracleicare (LOCAL=NO) 0 6.0M 13.8M 51.0M
3510 oracle oracleicare (LOCAL=NO) 0 8.7M 14.4M 40.3M
26344 oracle ora_mmnl_icare 3.1M 17.9M 20.4M 34.4M
15102 oracle ora_lns1_icare 3.2M 26.8M 27.3M 32.5M
26334 oracle ora_ckpt_icare 3.5M 4.8M 39.5M 193.0M
26340 oracle ora_cjq0_icare 4.2M 13.9M 47.2M 152.1M
26342 oracle ora_mmon_icare 3.8M 16.1M 67.1M 225.6M
26950 oracle ora_j000_icare 0 156.6M 233.9M 433.4M
26328 oracle ora_mman_icare 3.2M 365.8M 570.2M 943.1M
26336 oracle ora_smon_icare 3.3M 208.3M 614.1M 1.2G
26330 oracle ora_dbw0_icare 10.0M 783.5M 1.2G 1.7G
-------------------------------------------------------------------------------
25 1 77.9M 1.6G 2.8G 5.2G
--//对比前面已经回收了大量的内存,swap仅仅占用77.9M.
[root@DZBL_DB_101 rpm_tgz]# free -m
total used free shared buffers cached
Mem: 15953 6521 9432 0 36 5534
-/+ buffers/cache: 951 15002
Swap: 15999 1298 14701
--//对比问题出现前的情况,估计在观察几天看看内存使用情况.下面的输出时最开始看到的内存使用情况.
# free -m
total used free shared buffers cached
Mem: 15953 15775 178 0 4 5003
-/+ buffers/cache: 10767 5186
Swap: 15999 15999 0
4.总结:
--//配置10g的数据库连接串时一定要显式地加上使用专用模式连接或者共享模式.
--//我估计使用太久了,共享进程s00N占用内存存在bug,导致对应进程内存没有释放,出现这样的情况.有机会看看dump.
--//估计很久没做这类分析工作,top不熟悉,当时不知道如何实现按照内存使用排序,浪费一些时间.
--//对于共享模式连接也有点生疏,想当然认为kill ora_d000_icare进程,对应的ora_s00N_icare进程也会kill,os操作系统没学好.
--//如果服务器安装我熟悉的smem工具,也许就没有必要走这个大的弯路定位问题了.
--//惯性思维,以为alter system kill session '855,22839' immediate;对应进程就被kill掉.
--//解决问题时间过长,大约从8:30开始到11:30才定位解决问题,花了3个小时,事后我想我的问题思路是否有错,
--//ora-04030 => pga相关问题 -> pga.sql查询,如果当时这里看不出来,也许时间会更长...
--//另外顺手的工具很重要,实际上大部分时间浪费在上网找smem的安装包.
--//同事安装时应该不管怎么选择使用hugepages模式,减少页面表的使用.这部分内存占用1G.
5.附件脚本代码:
$ cat pga
column name format a30
column machine format a30
SELECT NAME, VALUE/1024/1024 VALUE_MB
FROM V$PGASTAT
WHERE NAME IN ( 'aggregate PGA target parameter',
'total PGA allocated',
'total PGA inuse')
union all
SELECT NAME, VALUE
FROM V$PGASTAT
WHERE NAME IN ('over allocation count');
SELECT name profile, cnt, decode(total, 0, 0, round(cnt*100/total)) percentage
FROM (SELECT name, value cnt, (sum(value) over ()) total
FROM V$SYSSTAT WHERE name like 'workarea exec%');
SELECT *
FROM ( SELECT p.spid,
s.sid,
s.serial#,
s.machine,
s.client_info,
DECODE (s.program, NULL, p.program, s.program) AS "Program",
p.pga_used_mem,
p.pga_alloc_mem,
p.pga_max_mem
FROM v$process p, v$session s
WHERE s.paddr = p.addr
ORDER BY p.pga_used_mem DESC)
WHERE ROWNUM <= &1;
$ cat spid.sql
column sid new_value newsid
column serial# new_value newserial
column spid new_value newspid
set verify off
SELECT s.sid
, s.serial#
, s.process
, s.server
, p.spid
, p.pid
, p.serial# p_serial#
, 'alter system kill session '''||s.sid||' , '||s.serial#||''''||' immediate;' c50
FROM v$session s
, v$process p
WHERE s.sid in ( SELECT sid FROM v$mystat WHERE rownum = 1)
AND s.paddr = p.addr;