0005971: autofs fails during first boot after install

We install CentOS 6.3 using a kickstart file (attached). During first reboot after a successful install, autofs does not work correctly, e.g., failing to mount home directories (or any other directories) in our LDAP-resident maps. We get the following message in the system log:

Sep 19 16:51:19 automount[6554]: bind_ldap_simple: lookup(ldap): Unable to bind to the LDAP server: (default), error Can't contact LDAP server

Rebooting "seems" to make the problem go away, although I have personally witnessed the following sequence of events in one log file:

<First boot after install>
...
... Can't contact LDAP server
...
<Second reboot>
...
<Third reboot>
...
... Can't contact LDAP server

My concern is that this can potentially be a repeatable problem, not just during the "first boot" after installation. If I reboot a dozen times after first boot, it does not seem to fail, but it always fails during first boot. The subsequent failure listed in the log sketch above probably resulted from some sort of manual action I took, e.g., bringing the interface down and then up again, restarting NetworkManager, etc. Unfortunately, I do not know what I did that made it occur during that subsequent boot. I can say unequivocally, however, that it *has* happened during at least one later boot.

Looking at the autofs source code, I observed that this error message is produced within lookup_ldap.c, by the failure of a single call to openldap:

rv = ldap_simple_bind_s(ldap, ...);

My first approach to tracking this down further was to prepare a special version of the autofs RPM that enabled the debugging trace output features of the openldap library. All this showed was that the failure was happening at the point where the ldap_connect_to_host() routine (in openldap-2.4.23/libraries/libldap/os-ip.c) performs a call to getaddrinfo(), with the following message in the debugging output:

ldap_connect_to_host: getaddrinfo failed: Name or service not known

Since the service is specified as a numeric port number, that leaves the host name that is failing to resolve. The source code for getaddrinfo() is quite gnarly and there was no way to easily enable debugging of it (during first boot, that is). My hunch was that the /etc/resolv.conf file was messed up, preventing the resolver from looking up the IP address for our LDAP server "ldap-tls.group-w-inc.com". This turned out not to be the case.

My second approach was more successful. I modified my kickstart file to alter the /etc/rc.d/init.d/autofs file so that the /usr/sbin/automount program is invoked using strace.
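Concretely, the alteration amounts to something like the following sketch (not my literal %post fragment; the log path and strace options are plausible choices, and swapping in a wrapper binary avoids editing the init script internals):

    %post
    # Sketch: replace automount with a tracing wrapper so that first boot
    # leaves a syscall trace behind in /var/log.
    mv /usr/sbin/automount /usr/sbin/automount.real
    {
      echo '#!/bin/sh'
      echo '# -f follows forks; -tt stamps each syscall with microseconds,'
      echo '# which lets the trace be lined up against syslog entries.'
      echo 'exec /usr/bin/strace -f -tt -o /var/log/automount.strace \'
      echo '    /usr/sbin/automount.real "$@"'
    } > /usr/sbin/automount
    chmod 755 /usr/sbin/automount
    %end

(Undoing the change afterwards is just the reverse mv.)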
The fragments of strace output that proved enlightening are as follows:

...

6554 16:51:19.063975 open("/etc/resolv.conf", O_RDONLY) = 7
6554 16:51:19.064149 fstat(7, {st_dev=makedev(253, 1), st_ino=9830415, st_mode=S_IFREG|0644, st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=8, st_size=80, st_atime=2012/09/19-16:50:54, st_mtime=2012/09/19-16:50:53, st_ctime=2012/09/19-16:50:53}) = 0
6554 16:51:19.064239 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f0f6c483000
6554 16:51:19.064291 read(7, "; generated by /sbin/dhclient-script\nsearch group-w-inc.com\nnameserver 10.1.1.1\n", 4096) = 80

...

6554 16:51:19.069263 socket(PF_INET, SOCK_DGRAM|SOCK_NONBLOCK, IPPROTO_IP) = 7
6554 16:51:19.069480 connect(7, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.1.1.1")}, 16) = -1 ENETUNREACH (Network is unreachable)
6554 16:51:19.069552 close(7) = 0
6554 16:51:19.069606 socket(PF_INET, SOCK_DGRAM|SOCK_NONBLOCK, IPPROTO_IP) = 7
6554 16:51:19.069656 connect(7, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.1.1.1")}, 16) = -1 ENETUNREACH (Network is unreachable)

...

So the /etc/resolv.conf file is actually reasonable:

; generated by /sbin/dhclient-script
search group-w-inc.com
nameserver 10.1.1.1

The "eth0" interface is supposed to be on the "10.1.1.x" network (i.e., our DNS server is on the directly connected LAN), so if the *network* is unreachable, it must be that the network interface itself is "down", has had its address wiped, or is otherwise being mucked with in an unsavory manner.

Note that this appears to be a transient (though perhaps not short in duration) state during "first boot." Once the system has finished booting, I am indeed able, e.g., to "ssh" into this system as root, so by this time the network interface is definitely back in a reasonable state. (Users other than root don't fare as well, since they don't have home directories.)

This is weird. The "autofs" service is being started about 15 *seconds* after the "bringing up eth0" message, and perhaps 10 seconds after NetworkManager starts. I have no idea what could be making the interface invalid this long after these system init events. I know that our DHCP server is not that slow.

I'm not too sure how to track this down any further. For example, I could start a process early on during first boot that repeatedly does the equivalent of an "ifconfig eth0" and "netstat -rn", together with high-resolution time stamps, to observe the approximate moments in time when the interface is getting altered. I could even bracket the ldap_simple_bind_s() call with similar timestamps in the debugging output to assure me that autofs is indeed attempting its LDAP binding during a time window where the interface is out to lunch. But this would give me no clue as to *which* process is actually modifying the interface.
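Such a watcher could be as simple as the following sketch, started from an early init script (the log path, iteration count, and polling interval are arbitrary choices):

    #!/bin/sh
    # Sketch of a first-boot interface watcher: snapshot eth0 and the
    # routing table with sub-second timestamps for roughly the first
    # minute of the boot (300 iterations at 0.2 s each).
    LOG=/var/log/eth0-watch.log
    i=0
    while [ $i -lt 300 ]; do
        echo "=== $(date '+%H:%M:%S.%N') ===" >> "$LOG"
        ifconfig eth0 >> "$LOG" 2>&1
        netstat -rn >> "$LOG" 2>&1
        sleep 0.2
        i=$((i + 1))
    done

(iproute2's "ip monitor link addr route" would capture the same changes event by event, though it likewise does not name the responsible process.)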
Perhaps there is some sort of kernel cmdline flag that turns on debugging info within the network stack that would provide visibility into which processes are doing what alterations to interface configurations, routing tables, etc.? If I knew such flags, I could easily make my kickstart file modify the grub.conf file so that first boot was invoked with the necessary additional kernel incantations.

I have not had much luck finding documentation regarding the architecture of NetworkManager: how it works, what you can expect it to do, when, and why. NetworkManager is at the top of my list of suspects, but the apparent lack of documentation is making it hard for me to proceed here. Is it really the case that 15 seconds after "bringing up eth0" and 10 seconds after NetworkManager is started, the interface is still in a bogus state (assuming essentially instantaneous response from the DHCP server, which is what we really do get here)? Does NetworkManager have some massive amount of extra work that it does during first boot that it doesn't have to repeat during subsequent boots?

My fear is that this is not really just a "first boot" problem: literally any boot can have this race condition between various startup processes, causing the automounter to fail. I do not want to roll out many replications of this system (I'm using kickstart for a reason) without having a much clearer understanding of what is happening here and why.

This is, of course, made much worse by the fact that automount apparently never retries binding with the LDAP server if the connection ever fails. Once it fails, the automounter seems to be permanently dead (until somebody comes along and manually either restarts autofs or reboots the system, in the hope that the race condition will not recur). This problem also needs to be fixed quite urgently.
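Until that is fixed, the manual workaround (restarting autofs) can at least be scripted. A sketch, assuming our site-specific LDAP host name from above; this papers over the missing retry rather than fixing it:

    #!/bin/sh
    # Stopgap sketch: wait until the LDAP server's name actually resolves
    # (the step that fails inside getaddrinfo above), then bounce autofs
    # so automount re-attempts its LDAP bind. Run once, e.g. from rc.local.
    until getent hosts ldap-tls.group-w-inc.com >/dev/null 2>&1; do
        sleep 2
    done
    service autofs restart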
