Sun System Handbook - ISO 4.1 October 2012 Internal/Partner Edition
Solution Type: Problem Resolution Sure

Solution 1007790.1: Sun Fire[TM] 12K/15K/E20K/E25K: System Controller (SC) platform messages file reports "FRAD chkpt WRITE failed. session id: 128, return code: 8" errors.
 
Previously Published As 210776
Applies to:

Sun Fire 15K Server
Sun Fire E20K Server
Sun Fire 12K Server
Sun Fire E25K Server
All Platforms

Symptoms

In the /var/opt/SUNWSMS/adm/platform/messages file on a Sun Fire[TM] 12K/15K/E20K/E25K SC, all components in the platform report FruAccess errors and "FRAD chkpt WRITE failed" messages such as the following:

    Oct 4 07:00:43 2004 sc0 frad[660]: [10009 1379176584473204 ERR FRADFailoverService.cc 237] FRAD chkpt WRITE failed. session id: 128, return code: 8

The showenvironment command reports that all temperature and voltage status checks are fine for all components, and nothing appears to be wrong on the platform. So why are the messages occurring, and how can they be stopped?

Cause

The error message indicates that FRAD, the FRU Access Daemon, cannot write to a checkpoint file.

Q: Why can't a daemon, or a user for that matter, write to certain files?
A: Because the daemon or user doesn't have permissions to the file.

Solution

In the case of the FRAD chkpt error, the file in question is located in the /var/opt/SUNWSMS/data/.failover/chkpt directory on the SC. This checkpoint file is used as a reference by FOMD (Failover Management Daemon) for file propagation between SCs.

If the permissions on this chkpt file are incorrect, the SMS daemon cannot write to it and the error messages appear. A possible "fix" would be to simply open up the permissions on this file or directory, as root, so that the daemons can write to the chkpt file:

    chmod -R 777 /var/opt/SUNWSMS/data/.failover/chkpt

BUT this is not a good solution, because the file permissions may not be the real root cause, and there might be more problems that need to be resolved. If the directory /var/opt/SUNWSMS/SMS1.4.1/data/.failover has the wrong group ownership, its subdirectories are not writable by the SMS daemons, and the error messages above will appear.
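The group ownership condition described above can be checked without changing anything on the SC. The following is a minimal read-only sketch, not an SMS tool: the helper names group_of, check_group and audit_sms_dirs are hypothetical, the SMS_DATA path and the expected group "sms" are taken from the listings in this document, and the script assumes that the fourth field of "ls -ld" output is the group name. It uses backticks rather than newer shell syntax, but it has not been tested on an SC.

```shell
#!/bin/sh
# Read-only audit sketch: report paths whose group differs from the expected one.
# ASSUMPTIONS: SMS_DATA and the "sms" group come from the listings in this
# document; group_of, check_group and audit_sms_dirs are hypothetical helpers.

SMS_DATA=${SMS_DATA:-/var/opt/SUNWSMS/SMS1.4.1/data}

# Print the group field (4th column of "ls -ld") for one path.
group_of() {
    ls -ld "$1" | awk '{ print $4 }'
}

# Return 0 if the path's group matches $2; otherwise print a warning, return 1.
check_group() {
    actual=`group_of "$1"`
    if [ "$actual" = "$2" ]; then
        return 0
    fi
    echo "MISMATCH: $1 has group '$actual', expected '$2'"
    return 1
}

# Check the two directories named in this document, if present on this host.
audit_sms_dirs() {
    status=0
    for d in .failover .cod; do
        if [ -d "$SMS_DATA/$d" ]; then
            check_group "$SMS_DATA/$d" sms || status=1
        fi
    done
    return $status
}
```

After loading these functions into a shell on the SC, running audit_sms_dirs prints one MISMATCH line for each directory whose group is not sms and returns a non-zero status if any were found. It only reads directory metadata, so it is safe to run before deciding on a fix.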
Changing just the permissions on the chkpt files or chkpt directory is not the correct course of action, because the parent directory may be the real root cause. The whole directory structure needs its ownership configuration corrected to head off possible future issues.

BAD CONFIGURATION (NOTE: ".cod" and ".failover" directories should be root:sms)

    sms-svc> cd /var/opt/SUNWSMS/SMS1.4.1/data/
    sms-svc> ls -la
    total 54
    drwxrwxr-x+ 23 root sms 512 Oct 4 14:15 .
    drwxr-xr-x+ 8 root sys 512 Oct 2 00:52 ..
    drwxrwxr-x 2 root bin 512 Jun 18 2002 .cod
    drwxrwxr-x 6 root bin 512 Jun 18 2002 .failover
    -r-------- 1 root sys 17 Sep 16 17:46 .remotesc
    drwxr-xr-x 2 root sms 512 Oct 2 01:55 .wcapp
    drwxrwx---+ 2 root sms 512 Oct 2 01:55 A
    drwxrwx--- 2 root sms 512 Sep 12 02:45 B
    drwxrwx--- 2 root sms 512 Sep 12 02:45 C
    drwxrwx--- 2 root sms 512 Sep 12 02:45 D
    drwxrwx--- 2 root sms 512 Sep 12 02:45 E
    drwxrwx--- 2 root sms 512 Sep 12 02:45 F
    drwxrwx--- 2 root sms 512 Sep 12 02:45 G
    drwxrwx--- 2 root sms 512 Sep 12 02:45 H
    drwxrwx--- 2 root sms 512 Sep 12 02:45 I
    drwxrwx--- 2 root sms 512 Sep 12 02:45 J
    drwxrwx--- 2 root sms 512 Sep 12 02:45 K
    drwxrwx--- 2 root sms 512 Sep 12 02:46 L
    drwxrwx--- 2 root sms 512 Sep 12 02:46 M
    drwxrwx--- 2 root sms 512 Sep 12 02:46 N
    drwxrwx--- 2 root sms 512 Sep 12 02:46 O
    drwxrwx--- 2 root sms 512 Sep 12 02:46 P
    drwxrwx--- 2 root sms 512 Sep 12 02:46 Q
    drwxrwx--- 2 root sms 512 Sep 12 02:46 R
    -rw-r----- 1 sms-dsmd sms 288 Oct 2 02:04 dsmd_domain_info
    srwxrwxrwx 1 sms-efe sms 0 Oct 2 02:03 efeSock
    -rw-r--r-- 1 sms-osd sms 72 Oct 2 00:11 osdTimeDeltas
    -rw-r--r-- 1 root root 4 Oct 2 01:52 ssd_loop.pid

GOOD CONFIGURATION

    sms-svc> pwd
    /var/opt/SUNWSMS/SMS1.4.1/data
    sms-svc> ls -la
    total 54
    drwxrwxr-x+ 23 root sms 512 Oct 2 16:30 .
    drwxr-xr-x+ 8 root sys 512 Sep 22 11:47 ..
    drwxrwxr-x 2 root sms 512 Sep 22 11:51 .cod
    drwxrwxr-x 6 root sms 512 Sep 22 11:46 .failover
    -r-------- 1 root sys 17 Sep 23 12:08 .remotesc
    drwxr-xr-x 2 root sms 512 Oct 1 11:00 .wcapp
    drwxrwx---+ 2 root sms 512 Oct 1 11:00 A
    drwxrwx---+ 2 root sms 512 Sep 29 14:17 B
    drwxrwx---+ 2 root sms 512 Sep 27 10:27 C
    drwxrwx---+ 2 root sms 512 Sep 27 10:27 D
    drwxrwx---+ 2 root sms 512 Sep 22 11:51 E
    drwxrwx---+ 2 root sms 512 Sep 22 11:51 F
    drwxrwx---+ 2 root sms 512 Sep 22 11:51 G
    drwxrwx---+ 2 root sms 512 Sep 22 11:51 H
    drwxrwx---+ 2 root sms 512 Sep 22 11:51 I
    drwxrwx---+ 2 root sms 512 Sep 22 11:51 J
    drwxrwx---+ 2 root sms 512 Sep 22 11:51 K
    drwxrwx---+ 2 root sms 512 Sep 22 11:51 L
    drwxrwx---+ 2 root sms 512 Sep 22 11:51 M
    drwxrwx---+ 2 root sms 512 Sep 22 11:51 N
    drwxrwx---+ 2 root sms 512 Sep 22 11:51 O
    drwxrwx---+ 2 root sms 512 Sep 22 11:51 P
    drwxrwx---+ 2 root sms 512 Sep 30 14:21 Q
    drwxrwx---+ 2 root sms 512 Sep 22 11:51 R
    -rw-r----- 1 sms-dsmd sms 288 Oct 1 21:01 dsmd_domain_info
    srwxrwxrwx 1 sms-efe sms 0 Oct 1 11:02 efeSock
    -rw-r--r-- 1 sms-osd bin 72 Oct 1 17:58 osdTimeDeltas
    -rw-r--r-- 1 root root 5 Oct 1 10:58 ssd_loop.pid
    sms-svc> cd .failover
    sms-svc> ls -la
    total 12
    drwxrwxr-x 6 root sms 512 Sep 22 11:46 .
    drwxrwxr-x+ 23 root sms 512 Oct 2 16:30 ..
    drwxrwxr-x 2 root sms 512 Oct 5 10:15 chkpt
    drwxrwxr-x 2 root sms 512 Sep 22 11:51 fomd
    drwxrwxr-x 2 root sms 512 Sep 22 11:46 local
    drwxrwxrwx 2 root sms 512 Oct 5 10:55 tmp
    sms-svc> cd chkpt
    sms-svc> ls -la
    total 10
    drwxrwxr-x 2 root sms 512 Oct 5 10:15 .
    drwxrwxr-x 6 root sms 512 Sep 22 11:46 ..
    -rw-r--r-- 1 root other 544 Oct 1 17:32 2.128.1.0
    -rw-r--r-- 1 root other 544 Oct 1 11:03 2.130.1.0
    -rw-rw-rw- 1 root other 434 Oct 5 10:15 chkpt.list

Ultimately, changing the permissions on only the /var/opt/SUNWSMS/SMS1.4.1/data/.failover/chkpt directory would allow SMS to write to the particular chkpt file, but it would leave the actual root cause in place: the bad group ownership of the top-level directories. Correcting that ownership may also resolve other problems that have not yet shown up. So, the fix is to issue the following commands as root:

    cd /var/opt/SUNWSMS/SMS1.4.1/data
    chgrp -R sms .failover
    chgrp -R sms .cod

Please see Additional Information for more suggestions.

Additional Information

It is important to note that the group ownership issue could be the result of a tar or cpio restore that did not preserve the original group and owner settings, or it might be the result of someone having manually set the ownership or permissions for some reason. It would be hard to prove either way after the fact.

If one directory or file is configured incorrectly, assume all are. Confirming that the configuration is correct should be the next step. Obtain access to a separate "known good" SC to compare the configuration, or log a case with Sun[TM] Support to obtain help in making sure that permissions and ownership are correct.

It's also a good idea to confirm that the SMS daemons have the correct UIDs.
From /etc/passwd, the UIDs for the various daemons are as follows:

    sms-codd:x:10:54:SMS Capacity On Demand Daemon::
    sms-dca:x:11:54:SMS Domain Configuration Agent::
    sms-dsmd:x:12:54:SMS Domain Status Monitoring Daemon::

Sun Fire E25K Server
Sun Fire E20K Server
Sun Fire 15K Server
Sun Fire 12K Server

Internal Section

Reference Apollo Escalation 1-4203640, Radiance case ID 64289130

Keywords: frad, esmd, sms, fru, fruaccess, chkpt, checkpoint, write failure

Previously Published As 78507

Product_uuid
d842dd03-059b-11d8-84cb-080020a9ed93|Sun Fire E25K Server
1404a2d3-059a-11d8-84cb-080020a9ed93|Sun Fire E20K Server
29e4659c-0a18-11d6-9fa1-e67bbc033df8|Sun Fire 15K Server
077fd4c5-df8f-4320-ad69-7d01603a674d|Sun Fire 12K Server

Attachments
This solution has no attachment
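To make the UID check against a "known good" SC easier, the passwd entries for the SMS daemons can be reduced to name, UID and GID so that the output from two SCs can be compared line by line. The following is a minimal sketch only: list_sms_accounts is a hypothetical helper name, and the "sms-" account prefix is an assumption based on the entries quoted in this document.

```shell
#!/bin/sh
# List SMS daemon accounts from a passwd-format file as "name uid gid".
# ASSUMPTIONS: list_sms_accounts is a hypothetical helper; the "sms-" prefix
# is taken from the account names quoted in this document.
list_sms_accounts() {
    # Field 1 is the login name, field 3 the UID, field 4 the GID.
    awk -F: '$1 ~ /^sms-/ { print $1, $3, $4 }' "${1:-/etc/passwd}"
}
```

For the three entries quoted in this document, list_sms_accounts prints "sms-codd 10 54", "sms-dca 11 54" and "sms-dsmd 12 54". Saving the output on each SC and comparing the two files with diff shows any account whose UID or GID differs.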