Exadata Cloud at Customer VM Cluster update failed using the OCI console: how to troubleshoot and resolve it

Hi people,

It’s been three weeks since I updated the OS of the VMs of one of our client’s Exadata Cloud at Customer (ExaCC) systems at their DR site. The update took the VMs from Exadata image 19.2.x to 22.1.10.0.0.230422 and was performed from the OCI console without any issues. It was a huge leap in image version because the client had never updated the system since its deployment in 2020. All prerequisite checks passed and the update was applied successfully afterwards.

That was not the case when I applied the same image update to the production ExaCC VM cluster at the primary site: the prerequisite checks reported an error. The OCI console gives you no logs or any other useful information, so you either have to log in to the VMs, find the patchmgr logs and read them yourself to troubleshoot, or open an SR with My Oracle Support to get some help.

Troubleshooting means reading the patchmgr logs on the node that started the patchmgr session. With ExaCC the update is started from the OCI console, which kicks off a process that pushes the dbnodeupdate.sh zip package and the Exadata ISO image to the /u02 file system. The patchmgr session itself runs from the node that is not being patched, i.e. VM 1 is updated first, so the patchmgr session runs from VM 2. Log in to VM 2 to find the patchmgr.log and patchmgr.trc files under the /u02/dbserver-patch*/ directory.
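If you are curious about what the console actually staged, a quick look at /u02 on either node shows the pushed artifacts. Something like the command below (hedged; the exact directory name varies with the target image and the run) should list a dbserver.patch.zip_exadata_ol7_* directory matching the image you selected:

[root@dc-a-exaccdb02 ~]# ls -d /u02/dbserver*
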
How do you find the patchmgr logs and traces? Easy: connect to both nodes, check which one is running a patchmgr session, and grab the log location from the -log_dir argument:

[root@dc-a-exaccdb02 ~]# ps -fe|grep patchmgr
root     211103      1  0 00:52 ?        00:00:01 /bin/bash ./patchmgr -dbnodes node_list -upgrade -iso_repo exadata_ol7_22.1.10.0.0.230422_Linux-x86-64.zip -target_version 22.1.10.0.0.230422 -rolling -log_dir /u02/dbserver.patch.zip_exadata_ol7_22.1.10.0.0.230422_Linux-x86-64.zip/dbserver_patch_221130/patchmgr_log_90da5e82-fe43-4d25-9ddb-c9befc574c5d -allow_active_network_mounts
root     224028 211103  0 00:52 ?        00:00:00 /bin/bash ./patchmgr -dbnodes node_list -upgrade -iso_repo exadata_ol7_22.1.10.0.0.230422_Linux-x86-64.zip -target_version 22.1.10.0.0.230422 -rolling -log_dir /u02/dbserver.patch.zip_exadata_ol7_22.1.10.0.0.230422_Linux-x86-64.zip/dbserver_patch_221130/patchmgr_log_90da5e82-fe43-4d25-9ddb-c9befc574c5d -allow_active_network_mounts
root     224040 224028  0 00:52 ?        00:00:00 python /u02/dbserver.patch.zip_exadata_ol7_22.1.10.0.0.230422_Linux-x86-64.zip/dbserver_patch_221130/dcli -s -o ServerAliveInterval=550 -o ServerAliveCountMax=157 -o BatchMode=yes -m hmac-sha1,hmac-sha2-256,hmac-sha2-512 -l root -c dc-a-exaccdb01.loredata.com.br cd /u01/dbnodeupdate.patchmgr; ./dbnodeupdate.sh  -g -P 260523005215 -b -q   -a
root     224054 224040  0 00:52 ?        00:00:00 /bin/sh -c /usr/bin/ssh -o ServerAliveInterval=550 -o ServerAliveCountMax=157 -o BatchMode=yes -m hmac-sha1,hmac-sha2-256,hmac-sha2-512 -l root dc-a-exaccdb01.loredata.com.br '( cd /u01/dbnodeupdate.patchmgr; ./dbnodeupdate.sh  -g -P 260523005215 -b -q   -a   ) 2>&1' 2>/tmp/banner_3hIorA.dc-a-exaccdb01.loredata.com.br
root     224055 224054  0 00:52 ?        00:00:00 /usr/bin/ssh -o ServerAliveInterval=550 -o ServerAliveCountMax=157 -o BatchMode=yes -m hmac-sha1,hmac-sha2-256,hmac-sha2-512 -l root dc-a-exaccdb01.loredata.com.br ( cd /u01/dbnodeupdate.patchmgr; ./dbnodeupdate.sh  -g -P 260523005215 -b -q   -a   ) 2>&1
root     339103 149235  0 00:59 pts/0    00:00:00 grep --color=auto patchmgr
[root@dc-a-exaccdb02 ~]#

Go to that directory and open the log using vi, tail, grep or whatever method you prefer.

[root@dc-a-exaccdb02~]# cd /u02/dbserver.patch.zip_exadata_ol7_22.1.10.0.0.230422_Linux-x86-64.zip/dbserver_patch_221130/patchmgr_log_90da5e82-fe43-4d25-9ddb-c9befc574c5d
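
From there you can follow the run live and pull out failures quickly. A hedged example (the file names are the ones patchmgr writes into that log directory):

[root@dc-a-exaccdb02 patchmgr_log_90da5e82-fe43-4d25-9ddb-c9befc574c5d]# tail -f patchmgr.log
[root@dc-a-exaccdb02 patchmgr_log_90da5e82-fe43-4d25-9ddb-c9befc574c5d]# grep -iE 'error|fail|warning' patchmgr.log patchmgr.trc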

From those logs I identified the first issue: /u01 was over 80% utilized. I cleared some old logs there and ran the prerequisite checks again, which failed this time because a JRE RPM was installed on VM 1:

dc-a-exaccdb01.loredata.com.br: ERROR: JRE rpm is detected on this database node and should be removed. When needed replace with JDK.
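
The space issue is easy to confirm from the shell on the node being patched. A hedged check (the du pipeline below is just one way to find the biggest consumers):

[root@dc-a-exaccdb01 ~]# df -h /u01
[root@dc-a-exaccdb01 ~]# du -xsh /u01/* 2>/dev/null | sort -h | tail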

I removed the JRE RPM:

[Screenshot: JRE RPM removal]
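
In case the screenshot is hard to read, the removal boils down to something like this (a hedged sketch; confirm the exact package name reported by rpm -qa on your node before removing anything):

[root@dc-a-exaccdb01 ~]# rpm -qa | grep -i jre
[root@dc-a-exaccdb01 ~]# rpm -e <jre-package-name-from-the-previous-command>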

After releasing some space on /u01 and removing the JRE RPM from VM 1, I reran the prerequisite checks, which this time completed successfully.

[Screenshot: prerequisite checks completed successfully]

With the prerequisite checks cleared, I was good to go: I clicked Apply update in the OCI console to start updating the VMs’ operating system.

The update started and I monitored it closely through patchmgr.log. The OS was updated successfully and the new Exadata image was activated,

[Screenshot: imageinfo showing the new image active]

but after the reboot the update was reported as failed in the OCI console and the patchmgr session was terminated.

[Screenshot: update failed in the OCI console]

I then started having trouble staying connected to VM 1, the node that had just been patched, while trying to check the image status, file systems and other things. That was when I saw a message in the patchmgr.log on VM 2 about the SSH configuration in /etc/ssh/sshd_config: ClientAliveInterval was not set to the expected 600 seconds, but to 30 seconds. That is why my connection to VM 1 was dropping after 30 seconds of inactivity. Note that this SSH setting had been changed by the patching session itself (and the same issue happened again when VM 2 was patched later).

Knowing how the patching process works on an Exadata, I changed ClientAliveInterval from 30 to 600, restarted the sshd daemon and resumed the patching process via dbnodeupdate.sh. It is important to mention that dbnodeupdate.sh is executed locally on the VM being patched, not remotely as patchmgr does it. This was just to complete the post-patching steps and to enable and bring up the GI and DB resources automatically.
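
For reference, the fix on the patched node amounted to something like this (a hedged sketch; the sed pattern assumes ClientAliveInterval is already present and uncommented in sshd_config, so check with grep first and edit by hand if the format differs):

[root@dc-a-exaccdb01 ~]# grep -i clientaliveinterval /etc/ssh/sshd_config
ClientAliveInterval 30
[root@dc-a-exaccdb01 ~]# sed -i 's/^ClientAliveInterval 30$/ClientAliveInterval 600/' /etc/ssh/sshd_config
[root@dc-a-exaccdb01 ~]# systemctl restart sshd

With sshd stable again, I completed the post-patching steps by rerunning dbnodeupdate.sh with the -c (completion) flag on VM 1: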

[root@dc-a-exaccdb01 ~]# /u01/dbnodeupdate.patchmgr/dbnodeupdate.sh -c
  (*) 2023-05-26 07:32:21: Initializing logfile /var/log/cellos/dbnodeupdate.log
##########################################################################################################################
#                                                                                                                        #
# Guidelines for using dbnodeupdate.sh (rel. 21.221130):                                                                 #
#                                                                                                                        #
# - Prerequisites for usage:                                                                                             #
#         1. Refer to dbnodeupdate.sh options. See MOS 1553103.1                                                         #
#         2. Always use the latest release of dbnodeupdate.sh. See patch 21634633                                        #
#         3. Run the prereq check using the '-v' flag.                                                                   #
#         4. Carefully run prereq check with the '-M' flag to allow conflicting rpms being removed or preupdated.        #
#                                                                                                                        #
#   I.e.:  ./dbnodeupdate.sh -u -l /u01/my-iso-repo.zip -v  (may see rpm conflicts)                                      #
#          ./dbnodeupdate.sh -u -l http://my-yum-repo -v -M (resolves known rpm comflicts)                               #
#                                                                                                                        #
# - Prerequisite rpm dependency check failures can happen due to customization:                                          #
#     - The prereq check detects dependency issues that need to be addressed prior to running a successful update.       #
#     - Customized rpm packages may fail the built-in dependency check and system updates cannot proceed until resolved. #
#     - Prereq check may fail because -M flag was not used and known conflicting rpms were not removed.                  #
#                                                                                                                        #
# - As part of the update, a number of rpms will be removed.                                                             #
#                                                                                                                        #
# - In case of any problem when filing an SR, upload the following:                                                      #
#      - /var/log/cellos/dbnodeupdate.log                                                                                #
#      - /var/log/cellos/dbnodeupdate.<runid>.diag                                                                       #
#      - where <runid> is the unique number of the failing run.                                                          #
#                                                                                                                        #
#                                                                                                                        #
##########################################################################################################################


  WARNING: Your dbnodeupdate.sh is more than 60 days old. Check MOS 1553103.1 and make sure you have the most recent version.

Continue ? [y/n] y

  (*) 2023-05-26 07:32:25: Unzipping helpers (/u01/dbnodeupdate.patchmgr/dbupdate-helpers.zip) to /opt/oracle.SupportTools/dbnodeupdate_helpers
  (*) 2023-05-26 07:32:27: Collecting system configuration settings. This may take up to 5 seconds.
  (*) 2023-05-26 07:33:17: Service status and file attribute report created in the background and will become available in: /etc/exadata/reports

Active Image version   : 22.1.10.0.0.230422
Active Kernel version  : 4.14.35-2047.518.4.3.el7uek
Active LVM Name        : /dev/mapper/VGExaDb-LVDbSys1
Inactive Image version : 19.2.13.0.0.200428
Inactive LVM Name      : /dev/mapper/VGExaDb-LVDbSys2
Current user id        : root
Action                 : finish-post (validate image status, fix known issues, cleanup and enable crs to auto-start)
Shutdown EM agents     : Yes
Shutdown stack         : No (Currently stack is down)
Logfile                : /var/log/cellos/dbnodeupdate.log (runid: 260523073221)
Diagfile               : /var/log/cellos/dbnodeupdate.260523073221.diag
Server model           : Exadata Virtual Machine
dbnodeupdate.sh rel.   : 21.221130 (always check MOS 1553103.1 for the latest release of dbnodeupdate.sh)


The following known issues will be checked for but require manual follow-up:
  (*) - Yum rolling update requires fix for 11768055 when Grid Infrastructure is below 11.2.0.2 BP12


Continue ? [y/n] y

  (*) 2023-05-26 07:34:04: Verifying GI and DB's are shutdown
  (*) 2023-05-26 07:34:04: Executing plugin /u01/dbnodeupdate.patchmgr/dbnu-plugin.sh with arguments 260523073221 start-execfinish
  (*) 2023-05-26 07:34:05: Verifying firmware updates/validations. Maximum wait time: 60 minutes.
  (*) 2023-05-26 07:34:05: If the node reboots during this firmware update/validation, re-run './dbnodeupdate.sh -c' after the node restarts..
  (*) 2023-05-26 07:34:07: No rpms to remove
  (*) 2023-05-26 07:34:18: EM agent in /u02/app/oracle/product/agent_13.5/agent_13.5.0.0.0 stopped
  (*) 2023-05-26 07:34:18: Service acpid enabled to autostart at boot
  (*) 2023-05-26 07:34:19: Not Relinking oracle homes
  (*) 2023-05-26 07:34:19: Executing plugin /u01/dbnodeupdate.patchmgr/dbnu-plugin.sh with arguments 260523073221 before-relink
  (*) 2023-05-26 07:34:20: Starting Grid Infrastructure (/u01/app/19.0.0.0/grid)
  (*) 2023-05-26 07:36:25: Sleeping another 60 seconds while stack is starting (1/15)
  (*) 2023-05-26 07:36:25: Stack started
  (*) 2023-05-26 07:36:25: TFA Started
  (*) 2023-05-26 07:36:25: Enabling stack to start at reboot. Disable this when the stack should not be starting on a next boot
  (*) 2023-05-26 07:36:39: EM agent in /u02/app/oracle/product/agent_13.5/agent_13.5.0.0.0 started
  (*) 2023-05-26 07:36:40: Purging any extra jdk packages.
  (*) 2023-05-26 07:36:40: No jdk package cleanup needed. Retained jdk package installed: jdk1.8-1.8.0_361.x86_64
  (*) 2023-05-26 07:36:40: Retained the required kernel-transition package: kernel-transition-3.10.0-0.0.0.2.el7
  (*) 2023-05-26 07:36:41: Disabling diagsnap in 12.2 and later Grid Infrastructure (24900613)
  (*) 2023-05-26 07:36:51: All post steps are finished.

The Exadata OS image was now updated on VM 1, but remember that the update had failed in the OCI console, which coordinates everything, so at this point the process would not be kicked off for VM 2. When the update fails in the OCI console you are presented with two options: Rollback or Retry update. I hit Retry update. That way the process could start over, identify that VM 1 was already at the desired state and move on to updating VM 2. The update on VM 2 then failed for the same reason: ClientAliveInterval set to 30 seconds instead of 600. I followed the same steps of changing the parameter, restarting the sshd daemon and finishing the update with dbnodeupdate.sh.
After running dbnodeupdate.sh, the GI and DB resources were enabled and started without any issues. At this point the whole cluster was back online and fully functional, but the OCI console still showed a failed state and kept presenting the same options: Rollback or Retry update.

[Screenshot: update still showing as failed in the OCI console]
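
Before retrying, it is worth double-checking that both nodes really are at the target image and that the clusterware is healthy. A hedged sketch (the Grid Infrastructure home below is the one from this environment; adjust it to yours):

[root@dc-a-exaccdb01 ~]# imageinfo | grep -i 'image version'
[root@dc-a-exaccdb01 ~]# /u01/app/19.0.0.0/grid/bin/crsctl check cluster -all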

Again I hit Retry update, hoping the OCI console would identify that both VMs were already at the desired state, clear the failed status and turn it back to Available and green.

[Screenshot: VM Cluster status back to Available (green)]

That worked! The system is now updated and showing a clean green status!

[Screenshot: imageinfo confirming the successful update]

I hope this blog post gives you some clarity on the ExaCC patching process, and if you ever face similar issues I believe it will be of good help!

See you again soon,

Franky