Pages

Wednesday, March 1, 2017

Correct Corrupted Undo Tablespace Segment

=====================
Issue
=====================
Due to a power failure Oracle server was abruptly shutdown.
After Linux server was started, the database could not start.
It seems that data block(s) in UNTO Tables are now corrupt.
The solution to fix this issue was of several steps.

=====================
First Error:
=====================
ORA-00600 kcratr1_lastbwr

This error was fixed following Oracle Metalink: 
ORA-600 [kcratr1_lastbwr] (Doc ID 1200478.1)

After a disk failure that caused the database to crash, the instance fails to start up with ORA-00600: arguments: [kcratr1_lastbwr].

The alert log file shows the following entries :
Completed: ALTER DATABASE MOUNT
Tue Sep 19 09:43:03 2006
ALTER DATABASE OPEN
Block change tracking file is current.
Tue Sep 19 09:43:04 2006
Beginning crash recovery of 1 threads
parallel recovery started with 2 processes
Tue Sep 19 09:43:04 2006
Started redo scan
Tue Sep 19 09:43:05 2006
Errors in file gns80_ora_9936.trc:
ORA-00600: internal error code, arguments: [kcratr1_lastbwr], [], [], [], [],[], [], []
Tue Sep 19 09:43:06 2006
Aborting crash recovery due to error 600

CHANGES
There was a disk problem that caused the database to crash.

CAUSE
Oracle is unable to perform instance recover but it works when is invoked manually.

SOLUTION
Mount the database and issue a recover statement

STARTUP MOUNT;
RECOVER DATABASE;
ALTER DATABASE OPEN;

=====================
Second Error:
=====================
ORA-00600: internal error code, arguments: [4194], [46], [8], [], [], [], [], [], 

This error is addressed in Oracle Metalink: 
Step by step to resolve ORA-600 4194 4193 4197 on database crash (Doc ID 1428786.1)

These are the errors I have faced
Tue Feb 28 12:59:56 2017
Errors in file /software/oracle/diag/rdbms/igt/igt/trace/igt_mmon_6581.trc  (incident=156131):
ORA-00353: log corruption near block 65536 change 6002234962 time 02/28/2017 02:03:40
ORA-00334: archived log: '/oracle_db/db2/db_igt/arch/arch0001_188396_669127092.arc'
Errors in file /software/oracle/diag/rdbms/igt/igt/trace/igt_mmon_6581.trc:
ORA-00368: checksum error in redo log block
ORA-00353: log corruption near block 65536 change 6002234962 time 02/28/2017 02:03:40
ORA-00334: archived log: '/oracle_db/db2/db_igt/arch/arch0001_188396_669127092.arc'
Errors in file /software/oracle/diag/rdbms/igt/igt/trace/igt_mmon_6581.trc  (incident=156132):
ORA-00600: internal error code, arguments: [4194], [46], [8], [], [], [], [], [], [], [], [], []
Incident details in: /software/oracle/diag/rdbms/igt/igt/incident/incdir_156132/igt_mmon_6581_i156132.trc
Errors in file /software/oracle/diag/rdbms/igt/igt/trace/igt_ora_28127.trc  (incident=129763):
ORA-00600: internal error code, arguments: [4194], [46], [8], [], [], [], [], [], [], [], [], []
Incident details in: /software/oracle/diag/rdbms/igt/igt/incident/incdir_129763/igt_ora_28127_i129763.trc

This is Oracle Solution
The following error is occurring in the alert.log right before the database crashes.

ORA-00600: internal error code, arguments: [4194], [#], [#], [], [], [], [], []

This error indicates that a mismatch has been detected between redo records and rollback (undo) records.

ARGUMENTS:

Arg [a] - Maximum Undo record number in Undo block
Arg [b] - Undo record number from Redo block

Since we are adding a new undo record to our undo block, we would expect that the new record number is equal to the maximum record number in the undo block plus one. Before Oracle can add a new undo record to the undo block it validates that this is correct. If this validation fails, then an ORA-600 [4194] will be triggered.

CHANGES

This issue generally occurs when there is a power outage or hardware failure that initially crashes the database. On startup, the database does the normal roll forward (redo) and then rollback (undo), this is where the error is generated on the rollback.

CAUSE

This also can be cause by the following defect

Bug 8240762 Abstract: Undo corruptions with ORA-600 [4193]/ORA-600 [4194] or ORA-600 [4137] after SHRINK

Details: 
Undo corruption may be caused after a shrink and the same undo block may be used 
for two different transactions causing several internal errors like:
ORA-600 [4193] / ORA-600 [4194] for new transactions
ORA-600 [4137] for a transaction rollback

SOLUTION
Best practice to create a new undo tablespace.
This method includes segment check.

Create pfile from spfile to edit
>create pfile from spfile;

1. Shutdown the instance

2. set the following parameters in the pfile
    undo_management = manual
    event = '10513 trace name context forever, level 2'

3. >startup restrict pfile=<initsid.ora>

4. >select tablespace_name, status, segment_name from dba_rollback_segs where status != 'OFFLINE';

This is critical - we are looking for all undo segments to be offline - System will always be online.

If any are 'PARTLY AVAILABLE' or 'NEEDS RECOVERY' - Please open an issue with Oracle Support or update the current SR.  There are many options from this moment and Oracle Support Analyst can offer different solutions for the bad undo segments.

If all offline then continue to the next step

5. Create new undo tablespace - example
>create undo tablespace <new undo tablespace> datafile <datafile> size 2000M;

6. Drop old undo tablespace
>drop tablespace <old undo tablespace> including contents and datafiles;

7. >shutdown immediate;

8 >startup nomount;  --> Using your Original spfile

9 modify the spfile with the new undo tablespace name

  Alter system set undo_tablespace = '<new tablespace created in step 5>' scope=spfile;

10. >shutdown immediate;

11. >startup;  --> Using spfile

The problem is that step 4. "select tablespace_name, status, segment_name from dba_rollback_segs where status != 'OFFLINE';" returned a segment, with status = PARTLY AVAILABLE

TABLESPACE_NAME                STATUS               SEGMENT_NAME
------------------------------ -------------------- --------------------
SYSTEM                         ONLINE               SYSTEM
UNDOTBS                        PARTLY AVAILABLE     _SYSSMU10_1225024740$

Per Oracle Documentation "This is critical - we are looking for all undo segments to be offline - 
If any are 'PARTLY AVAILABLE' or 'NEEDS RECOVERY' - Please open an issue with Oracle Support.  There are many options from this moment and Oracle Support Analyst can offer different solutions for the bad undo segments."



=====================
Third Error:
=====================
Same error as before, 
ORA-00600: internal error code, arguments: [4194], [46], [8], [], [], [], [], [], 
and _SYSSMU10_1225024740$ segment in UNDO Tablespace being in status PARTLY AVAILABLE.

Following steps are not available from Oracle Metalink, nut were provided by Oracle Support.
This is what worked for me.
These is internal Oracle stuff, and not an official solution. From here, proceed at your own risk

Solution:
The goal of this solution is to clear the faulted segment from UNDO tablespace, and recreate the undo tablespace. 

1. Create pfile
create pfile '/software/oracle/admin/igt/pfile/pfile_20170227.ora' from spfile; 

2. Shutdown the instance 
SHUTDOWN IMMEDIATE;

3. Edit the pfile, set the following parameters in the pfile:
undo_management = manual 
event = '10513 trace name context forever, level 2' 
_corrupted_rollback_segments = ('_SYSSMU10_1225024740$') 

4. Startup using pfile
startup restrict pfile='/software/oracle/admin/igt/pfile/pfile_20170227.ora'

5. Verify segment is now in  'NEEDS RECOVERY' status
SELECT tablespace_name, status, segment_name 
  FROM dba_rollback_segs 
 WHERE status != 'OFFLINE'; 

Now the segment in question should be in 'NEEDS RECOVERY'
If so, continue to the next step.

Indeed, now segment _SYSSMU10_1225024740$ status has changed from 'PARTLY AVAILABLE' to 'NEEDS RECOVERY'



TABLESPACE_NAME                STATUS               SEGMENT_NAME
------------------------------ -------------------- --------------------
SYSTEM                         ONLINE               SYSTEM
UNDOTBS                        NEEDS RECOVERY      _SYSSMU10_1225024740$


6. Create new undo tablespace 
CREATE UNDO TABLESPACE UNDOTBS_02 datafile '/oracle_db/db1/db_igt/ora_undotbs_02.dbf' size 4000M;

7. Drop old undo tablespace 
DROP TABLESPACE UNDOTBS INCLUDING CONTENTS AND DATAFILES;

8. Shutdown Database
SHUTDOWN IMMEDIATE;

9. Startup mount; 
STARTUP MOUNT;

10. Modify the spfile with the new undo tablespace name 
ALTER SYSTEM SET UNDO_TABLESPACE = UNDOTBS_02 SCOPE=SPFILE;

11. Shutdown Database
SHUTDOWN IMMEDIATE;

12. Start Database; 
STARTUP;

Optionally, rollback to the original name and file.
This is useful, if database is running in cluster mode, and you do not want to update all the nodes to the new name.

CREATE UNDO TABLESPACE UNDOTBS DATAFILE '/oracle_db/db1/db_igt/ora_undotbs_01.dbf' SIZE 4000M;

ALTER SYSTEM SET UNDO_TABLESPACE = UNDOTBS SCOPE=BOTH;

DROP TABLESPACE UNDOTBS_02 INCLUDING CONTENTS AND DATAFILES;

No comments:

Post a Comment