We install CentOS 6.3 using a kickstart file (attached). During the first
reboot after a successful install, autofs does not work correctly,
e.g., it fails to mount home directories (or any other directories) from
our LDAP-resident maps. We get the following message in the system log:

Sep 19 16:51:19 automount[6554]: bind_ldap_simple: lookup(ldap): Unable
to bind to the LDAP server: (default), error Can't contact LDAP server

Rebooting "seems" to make the problem go away -- although I have
personally witnessed the following sequence of events in one log file:

<First boot after install>
...
... Can't contact LDAP server
...
<Second reboot>
...
<Third reboot>
...
... Can't contact LDAP server

My concern is that this can potentially be a repeatable problem, not
just during the "first boot" after installation. If I reboot a dozen
times after first boot, it does not seem to fail -- but it always fails
during first boot. The subsequent failure listed in the log sketch
above probably resulted from some sort of manual action I took,
e.g., bringing the interface down and then up again, restarting
NetworkManager, etc. Unfortunately, I do not know what I did that made
it occur during that subsequent boot. I can say unequivocally, however,
that it *has* happened during at least one later boot.

Looking at the autofs source code, I observed that this error message is
produced within lookup_ldap.c, by the failure of a single call to
openldap:

rv = ldap_simple_bind_s (ldap, ...);

My first approach to tracking this down further was to prepare a special
version of the autofs RPM that enabled the debugging trace output
features of the openldap library. All this showed was that the failure
was happening at the point where the ldap_connect_to_host() routine (in
openldap-2.4.23/libraries/libldap/os-ip.c) performs a call to
getaddrinfo(), with the following message in the debugging output:

ldap_connect_to_host: getaddrinfo failed: Name or service not known

Since the service is specified as a numeric port number, that leaves the
host name that is failing to resolve. The source code for getaddrinfo()
is quite gnarly, and there was no easy way to enable debugging of it
(during first boot, that is). My hunch was that the /etc/resolv.conf
file was messed up, preventing the resolver from looking up the IP
address for our LDAP server "ldap-tls.group-w-inc.com". This turned out
not to be the case.
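
For anyone wanting to reproduce just this step: the resolution call can be
probed in isolation with a small standalone program that mirrors what
ldap_connect_to_host() does (the hostname and port below are illustrative
defaults, not taken from the autofs source). Running something like this from
the init script at the right moment would distinguish a resolver failure from
everything else:

```c
/* getaddrinfo probe: mimics the resolution step in ldap_connect_to_host().
 * Default host/port are examples; substitute the real LDAP server. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netdb.h>

/* Returns 0 on success, or the getaddrinfo error code (see gai_strerror). */
int try_resolve(const char *host, const char *port)
{
    struct addrinfo hints, *res = NULL;
    int rv;

    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_INET;        /* IPv4, as seen in the strace output */
    hints.ai_socktype = SOCK_STREAM;

    rv = getaddrinfo(host, port, &hints, &res);
    if (rv != 0) {
        fprintf(stderr, "getaddrinfo(%s, %s) failed: %s\n",
                host, port, gai_strerror(rv));
        return rv;
    }
    freeaddrinfo(res);
    return 0;
}

int main(int argc, char **argv)
{
    const char *host = argc > 1 ? argv[1] : "localhost";
    const char *port = argc > 2 ? argv[2] : "389";
    return try_resolve(host, port) ? EXIT_FAILURE : EXIT_SUCCESS;
}
```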

My second approach was more successful. I modified my kickstart file
to alter the /etc/rc.d/init.d/autofs file so that the
/usr/sbin/automount program is invoked using strace. The fragments of
strace output that proved enlightening are as follows:

...

6554 16:51:19.063975 open("/etc/resolv.conf", O_RDONLY) = 7
6554 16:51:19.064149 fstat(7, {st_dev=makedev(253, 1), st_ino=9830415, st_mode=S_IFREG|0644, st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=8, st_size=80, st_atime=2012/09/19-16:50:54, st_mtime=2012/09/19-16:50:53, st_ctime=2012/09/19-16:50:53}) = 0
6554 16:51:19.064239 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f0f6c483000
6554 16:51:19.064291 read(7, "; generated by /sbin/dhclient-script\nsearch group-w-inc.com\nnameserver 10.1.1.1\n", 4096) = 80

...

6554 16:51:19.069263 socket(PF_INET, SOCK_DGRAM|SOCK_NONBLOCK, IPPROTO_IP) = 7
6554 16:51:19.069480 connect(7, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.1.1.1")}, 16) = -1 ENETUNREACH (Network is unreachable)
6554 16:51:19.069552 close(7) = 0
6554 16:51:19.069606 socket(PF_INET, SOCK_DGRAM|SOCK_NONBLOCK, IPPROTO_IP) = 7
6554 16:51:19.069656 connect(7, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.1.1.1")}, 16) = -1 ENETUNREACH (Network is unreachable)

...

So the /etc/resolv.conf file is actually reasonable:

; generated by /sbin/dhclient-script
search group-w-inc.com
nameserver 10.1.1.1

The "eth0" interface is supposed to be on the "10.1.1.x" network (i.e.,
our DNS server is on the directly connected LAN), so if the *network* is
unreachable, it must be that the network interface itself is "down", has
had its address wiped, or is otherwise being mucked with in an unsavory
manner.

Note that this appears to be a transient (though perhaps not short in
duration) state during "first boot." Once the system has finished
booting, I am indeed able, e.g., to "ssh" into this system as root, so
by this time the network interface is definitely back in a reasonable
state. (Users other than root don't fare as well, since they don't have
home directories.)

This is weird. The "autofs" service is being started about 15 *seconds*
after the "bringing up eth0" message, and perhaps 10 seconds after
NetworkManager starts. I have no idea what could be making the
interface invalid this long after these system init events. I know that
our DHCP server is not that slow.

I'm not too sure how to track this down any further. For example, I
could start a process early on during first boot that repeatedly does
the equivalent of an "ifconfig eth0" and "netstat -rn", together with
high-resolution time stamps, to observe the approximate moments in time
when the interface is getting altered. I could even bracket the
ldap_simple_bind_s() call with similar timestamps in the debugging
output to assure me that autofs is indeed attempting its LDAP binding
during a time window where the interface is out to lunch. But this
would give me no clue as to *which* process is actually modifying the
interface.

Perhaps there is some sort of kernel cmdline flag that turns on
debugging info within the network stack that would provide visibility
into which processes are doing what alterations to interface
configurations, routing tables, etc.? If I knew such flags, I could
easily make my kickstart file modify the grub.conf file so that first
boot was invoked with the necessary additional kernel incantations.

I have not had much luck finding documentation on the architecture of
NetworkManager -- how it works, what you can expect it to do, when, and
why. NetworkManager is at the top of my list of suspects, but the
apparent lack of documentation is making it hard for me to proceed
here. Is it really the case that, 15 seconds after "bringing up eth0"
and 10 seconds after NetworkManager is started, the interface is still
in a bogus state (assuming essentially instantaneous response from the
DHCP server, which is what we really do get here)? Does NetworkManager
have some massive amount of extra work that it does during first boot
that it doesn't have to repeat during subsequent boots?

My fear is that this is not really just a "first boot" problem -- that
literally any boot can hit this race condition between various startup
processes, causing the automounter to fail. I do not want to roll out
many replications of this system (I'm using kickstart for a reason)
without a much clearer understanding of what is happening here and why.

This is, of course, made much worse by the fact that automount
apparently never retries binding with the LDAP server if the connection
ever fails. Once it fails, the automounter seems to be permanently dead
(until somebody comes along and either manually restarts autofs or
reboots the system, in the hope that the race condition will not recur).
This problem also needs to be fixed quite urgently.