Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition
Solution Type: Problem Resolution Sure

Solution 1007790.1: Sun Fire[TM] 12K/15K/E20K/E25K: System Controller (SC) platform messages file reports "FRAD chkpt WRITE failed. session id: 128, return code: 8" errors.
Previously Published As 210776

Symptoms

In the /var/opt/SUNWSMS/adm/platform/messages file on a Sun Fire[TM] 12K/15K/E20K/E25K SC, all components in the platform report FruAccess errors and "FRAD chkpt WRITE failed" messages such as the following:

Oct 4 07:00:43 2004 sc0 frad[660]: [10009 1379176584473204 ERR FRADFailoverService.cc 237] FRAD chkpt WRITE failed. session id: 128, return code: 8
Oct 4 07:00:43 2004 sc0 esmd[1422]: [1994 1379176638578801 ERR FruAccess.cc 554] Failed to update the power summary record of fru FT5: rc=-2
Oct 4 07:00:43 2004 sc0 esmd[1422]: [1994 1379176639358700 ERR DynamicFru.cc 256] Failed to update the power summary record of fru FT5: rc=-2
Oct 4 07:00:43 2004 sc0 frad[660]: [10009 1379176775710362 ERR FRADFailoverService.cc 237] FRAD chkpt WRITE failed. session id: 128, return code: 8
Oct 4 07:00:43 2004 sc0 esmd[1422]: [1991 1379176829667639 ERR FruAccess.cc 473] Failed to write the power event record of fru FT5: rc=-2
Oct 4 07:00:43 2004 sc0 esmd[1422]: [1992 1379176830622863 ERR DynamicFru.cc 394] Failed to write the power event record, STILL_ON, of fru FT5: rc=-2
Oct 4 07:03:43 2004 sc0 frad[660]: [10009 1379357012147050 ERR FRADFailoverService.cc 237] FRAD chkpt WRITE failed. session id: 128, return code: 8
Oct 4 07:03:43 2004 sc0 esmd[1422]: [1994 1379357084158464 ERR FruAccess.cc 554] Failed to update the power summary record of fru SB14: rc=-2
Oct 4 07:03:43 2004 sc0 esmd[1422]: [1994 1379357085045960 ERR DynamicFru.cc 256] Failed to update the power summary record of fru SB14: rc=-2
Oct 4 07:03:43 2004 sc0 frad[660]: [10009 1379357173410163 ERR FRADFailoverService.cc 237] FRAD chkpt WRITE failed. session id: 128, return code: 8
Oct 4 07:03:43 2004 sc0 esmd[1422]: [1993 1379357221559279 ERR FruAccess.cc 655] Failed to update the temperature summary record of fru SB14(sensor=0): rc=-2
Oct 4 07:03:43 2004 sc0 esmd[1422]: [1993 1379357222339133 ERR DynamicFru.cc 210] Failed to update the temperature summary record of fru SB14(sensor=0): rc=-2
Oct 4 07:03:43 2004 sc0 frad[660]: [10009 1379357334767963 ERR FRADFailoverService.cc 237] FRAD chkpt WRITE failed. session id: 128, return code: 8
Oct 4 07:03:43 2004 sc0 esmd[1422]: [1993 1379357388884910 ERR FruAccess.cc 655] Failed to update the temperature summary record of fru SB14(sensor=1): rc=-2
Oct 4 07:03:43 2004 sc0 esmd[1422]: [1993 1379357389675546 ERR DynamicFru.cc 210] Failed to update the temperature summary record of fru SB14(sensor=1): rc=-2
Oct 4 07:03:43 2004 sc0 frad[660]: [10009 1379357502066976 ERR FRADFailoverService.cc 237] FRAD chkpt WRITE failed. session id: 128, return code: 8

The showenvironment command reports that all temperature and voltage status checks are fine for all components, and nothing appears to be wrong on the platform. So why are these messages occurring, and how can they be stopped?

Resolution

The error message indicates that FRAD, the FRU Access Daemon, cannot write to a checkpoint file.

Q: Why can't a daemon, or a user for that matter, write to certain files?
A: Because the daemon or user does not have permission to write to the file.

In the case of the FRAD chkpt error, the file in question is located in the /var/opt/SUNWSMS/data/.failover/chkpt directory on the SC. This checkpoint file is used as a reference by FOMD (Failover Management Daemon) for file propagation between SCs. If the permissions on this chkpt file are incorrect, the SMS daemon cannot write to it and the error messages appear.
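A read-only look at the mode, owner, and group of the checkpoint directory and its files is a reasonable first check. The following is a minimal sketch and not part of the original solution: show_access is a hypothetical helper name, and it assumes a POSIX-style shell with a standard ls. It changes nothing on the SC.

```shell
#!/bin/sh
# Hypothetical helper (not an SMS command): print "mode owner group path"
# for a directory and for every non-hidden entry directly inside it.
show_access() {
    dir=$1
    for entry in "$dir" "$dir"/*; do
        # Skip the unexpanded glob pattern when the directory is empty.
        [ -e "$entry" ] || continue
        # ls -ld prints: mode links owner group size date... name
        ls -ld "$entry" | while read mode links owner group rest; do
            echo "$mode $owner $group $entry"
        done
    done
}

# Example invocation, using the path named in this article:
# show_access /var/opt/SUNWSMS/data/.failover/chkpt
```

Run as root on the SC, this prints one line per entry, which can then be compared against the same output from a known-good SC.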
So, a possible "fix" for this issue would be to simply open up the permissions on this file or directory so that the daemons can write to the chkpt file as root does:

    chmod -R 777 /var/opt/SUNWSMS/data/.failover/chkpt

However, this is not a good solution, because the permissions on the chkpt directory may not be the real root cause, and there may be more problems that need to be resolved. If the directory /var/opt/SUNWSMS/SMS1.4.1/data/.failover has the wrong group ownership, its subdirectories are not writable by the SMS daemons, and the error messages above will occur. Changing only the permissions on the chkpt files or chkpt directory is not the correct course of action, because the parent directory must first be ruled out as the real root cause. The ownership of the whole directory structure needs to be corrected to head off possible future issues:

BAD CONFIGURATION (NOTE: ".cod" and ".failover" directories should be root:sms)

sms-svc> cd /var/opt/SUNWSMS/SMS1.4.1/data/
sms-svc> ls -la
total 54
drwxrwxr-x+ 23 root sms 512 Oct 4 14:15 .
drwxr-xr-x+ 8 root sys 512 Oct 2 00:52 ..
drwxrwxr-x 2 root bin 512 Jun 18 2002 .cod
drwxrwxr-x 6 root bin 512 Jun 18 2002 .failover
-r-------- 1 root sys 17 Sep 16 17:46 .remotesc
drwxr-xr-x 2 root sms 512 Oct 2 01:55 .wcapp
drwxrwx---+ 2 root sms 512 Oct 2 01:55 A
drwxrwx--- 2 root sms 512 Sep 12 02:45 B
drwxrwx--- 2 root sms 512 Sep 12 02:45 C
drwxrwx--- 2 root sms 512 Sep 12 02:45 D
drwxrwx--- 2 root sms 512 Sep 12 02:45 E
drwxrwx--- 2 root sms 512 Sep 12 02:45 F
drwxrwx--- 2 root sms 512 Sep 12 02:45 G
drwxrwx--- 2 root sms 512 Sep 12 02:45 H
drwxrwx--- 2 root sms 512 Sep 12 02:45 I
drwxrwx--- 2 root sms 512 Sep 12 02:45 J
drwxrwx--- 2 root sms 512 Sep 12 02:45 K
drwxrwx--- 2 root sms 512 Sep 12 02:46 L
drwxrwx--- 2 root sms 512 Sep 12 02:46 M
drwxrwx--- 2 root sms 512 Sep 12 02:46 N
drwxrwx--- 2 root sms 512 Sep 12 02:46 O
drwxrwx--- 2 root sms 512 Sep 12 02:46 P
drwxrwx--- 2 root sms 512 Sep 12 02:46 Q
drwxrwx--- 2 root sms 512 Sep 12 02:46 R
-rw-r----- 1 sms-dsmd sms 288 Oct 2 02:04 dsmd_domain_info
srwxrwxrwx 1 sms-efe sms 0 Oct 2 02:03 efeSock
-rw-r--r-- 1 sms-osd sms 72 Oct 2 00:11 osdTimeDeltas
-rw-r--r-- 1 root root 4 Oct 2 01:52 ssd_loop.pid

GOOD CONFIGURATION

sms-svc> pwd
/var/opt/SUNWSMS/SMS1.4.1/data
sms-svc> ls -la
total 54
drwxrwxr-x+ 23 root sms 512 Oct 2 16:30 .
drwxr-xr-x+ 8 root sys 512 Sep 22 11:47 ..
drwxrwxr-x 2 root sms 512 Sep 22 11:51 .cod
drwxrwxr-x 6 root sms 512 Sep 22 11:46 .failover
-r-------- 1 root sys 17 Sep 23 12:08 .remotesc
drwxr-xr-x 2 root sms 512 Oct 1 11:00 .wcapp
drwxrwx---+ 2 root sms 512 Oct 1 11:00 A
drwxrwx---+ 2 root sms 512 Sep 29 14:17 B
drwxrwx---+ 2 root sms 512 Sep 27 10:27 C
drwxrwx---+ 2 root sms 512 Sep 27 10:27 D
drwxrwx---+ 2 root sms 512 Sep 22 11:51 E
drwxrwx---+ 2 root sms 512 Sep 22 11:51 F
drwxrwx---+ 2 root sms 512 Sep 22 11:51 G
drwxrwx---+ 2 root sms 512 Sep 22 11:51 H
drwxrwx---+ 2 root sms 512 Sep 22 11:51 I
drwxrwx---+ 2 root sms 512 Sep 22 11:51 J
drwxrwx---+ 2 root sms 512 Sep 22 11:51 K
drwxrwx---+ 2 root sms 512 Sep 22 11:51 L
drwxrwx---+ 2 root sms 512 Sep 22 11:51 M
drwxrwx---+ 2 root sms 512 Sep 22 11:51 N
drwxrwx---+ 2 root sms 512 Sep 22 11:51 O
drwxrwx---+ 2 root sms 512 Sep 22 11:51 P
drwxrwx---+ 2 root sms 512 Sep 30 14:21 Q
drwxrwx---+ 2 root sms 512 Sep 22 11:51 R
-rw-r----- 1 sms-dsmd sms 288 Oct 1 21:01 dsmd_domain_info
srwxrwxrwx 1 sms-efe sms 0 Oct 1 11:02 efeSock
-rw-r--r-- 1 sms-osd bin 72 Oct 1 17:58 osdTimeDeltas
-rw-r--r-- 1 root root 5 Oct 1 10:58 ssd_loop.pid
sms-svc> cd .failover
sms-svc> ls -la
total 12
drwxrwxr-x 6 root sms 512 Sep 22 11:46 .
drwxrwxr-x+ 23 root sms 512 Oct 2 16:30 ..
drwxrwxr-x 2 root sms 512 Oct 5 10:15 chkpt
drwxrwxr-x 2 root sms 512 Sep 22 11:51 fomd
drwxrwxr-x 2 root sms 512 Sep 22 11:46 local
drwxrwxrwx 2 root sms 512 Oct 5 10:55 tmp
sms-svc> cd chkpt
sms-svc> ls -la
total 10
drwxrwxr-x 2 root sms 512 Oct 5 10:15 .
drwxrwxr-x 6 root sms 512 Sep 22 11:46 ..
-rw-r--r-- 1 root other 544 Oct 1 17:32 2.128.1.0
-rw-r--r-- 1 root other 544 Oct 1 11:03 2.130.1.0
-rw-rw-rw- 1 root other 434 Oct 5 10:15 chkpt.list

Ultimately, changing the permissions on only the /var/opt/SUNWSMS/SMS1.4.1/data/.failover/chkpt directory would allow SMS to write to the particular chkpt file, but correcting the actual root cause, the bad group ownership of the top-level directories, may also resolve other problems that have not yet surfaced. So, the fix is to issue the following commands as root:

    cd /var/opt/SUNWSMS/SMS1.4.1/data
    chgrp -R sms .failover
    chgrp -R sms .cod

Please see Additional Information for more suggestions.

Additional Information

The group ownership issue could be the result of a tar or cpio restore that did not preserve the original group and owner settings, or of someone having manually changed the ownership or permissions for some reason. It would be hard to prove either way after the fact. If one directory or file is configured incorrectly, assume all are. Confirming that the configuration is correct should be the next step: obtain access to a separate "known good" SC to compare configurations, or log a case with Sun[TM] Support for help in making sure permissions and ownership are correct.

It is also a good idea to confirm that the SMS daemons have the correct UIDs. From /etc/passwd, the UID is as follows for the various daemons:

sms-codd:x:10:54:SMS Capacity On Demand Daemon::

Product

Sun Fire E25K Server
Sun Fire E20K Server
Sun Fire 15K Server
Sun Fire 12K Server

Internal Comments

Reference Apollo Escalation 1-4203640, Radiance case ID 64289130

frad, esmd, sms, fru, fruaccess, chkpt, checkpoint, write failure

Previously Published As 78507

Change History

Date: 2004-10-05
User Name: 7058
Action: Approved
Comment: Fixed document format with STM. Fixed a few grammar errors. Added technology area metatags. OK to publish now.
Version: 3

Date: 2004-10-05
User Name: 7058
Action: Accept
Comment:
Version: 0

Date: 2004-10-05
User Name: 146765
Action: Approved
Comment: Good document with great details. Please publish.
Version: 0

Product_uuid

d842dd03-059b-11d8-84cb-080020a9ed93|Sun Fire E25K Server
1404a2d3-059a-11d8-84cb-080020a9ed93|Sun Fire E20K Server
29e4659c-0a18-11d6-9fa1-e67bbc033df8|Sun Fire 15K Server
077fd4c5-df8f-4320-ad69-7d01603a674d|Sun Fire 12K Server

Attachments

This solution has no attachment
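
The ownership check recommended in the Resolution and Additional Information sections can be partly automated. The sketch below is an illustration only and was not part of the original solution: list_wrong_group is a hypothetical helper name, and it relies on the standard find primaries -type and -group. It checks directories only, because the known-good listing in this article shows the checkpoint files themselves with group "other". It reports ownership and does not change it.

```shell
#!/bin/sh
# Hypothetical helper (not an SMS command): list every directory at or
# below a starting point whose group differs from the expected group.
# The group may be given by name (for example sms) or by numeric GID.
list_wrong_group() {
    top=$1
    expected_group=$2
    find "$top" -type d ! -group "$expected_group" -print
}

# Example invocations, using the paths and group named in this article:
# list_wrong_group /var/opt/SUNWSMS/SMS1.4.1/data/.failover sms
# list_wrong_group /var/opt/SUNWSMS/SMS1.4.1/data/.cod sms
```

No output means every directory under the starting point already has the expected group. Any path that is printed is a candidate for comparison against a known-good SC before ownership is changed.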