We install CentOS 6.3 using a kickstart file (attached). During the first
reboot after a successful install, autofs does not work correctly,
e.g., it fails to mount home directories (or any other directories) from
our LDAP-resident maps. We get the following message in the system log:

Sep 19 16:51:19 automount[6554]: bind_ldap_simple: lookup(ldap): Unable
to bind to the LDAP server: (default), error Can't contact LDAP server

Rebooting "seems" to make the problem go away -- although I have
personally witnessed the following sequence of events in one log file:

<First boot after install>
...
... Can't contact LDAP server
...
<Second reboot>
...
<Third reboot>
...
... Can't contact LDAP server

My concern is that this can potentially be a repeatable problem, not
just during the "first boot" after installation. If I reboot a dozen
times after first boot, it does not seem to fail -- but it always fails
during first boot. The subsequent failure listed in the log sketch
above probably resulted from some sort of manual action I took,
e.g., bringing the interface down and then up again, restarting
NetworkManager, etc. Unfortunately, I do not know what I did that made
it occur during that subsequent boot. I can say unequivocally, however,
that it *has* happened during at least one later boot.

Looking at the autofs source code, I observed that this error message is
produced within lookup_ldap.c, by the failure of a single call to
openldap:

rv = ldap_simple_bind_s (ldap, ...);

My first approach to tracking this down further was to prepare a special
version of the autofs RPM that enabled the debugging trace output
features of the openldap library. All this showed was that the failure
was happening at the point where the ldap_connect_to_host() routine (in
openldap-2.4.23/libraries/libldap/os-ip.c) performs a call to
getaddrinfo(), with the following message in the debugging output:

ldap_connect_to_host: getaddrinfo failed: Name or service not known

Since the service is specified as a numeric port number, that leaves the
host name that is failing to resolve. The source code for getaddrinfo()
is quite gnarly, and there was no easy way to enable debugging of it
(during first boot, that is). My hunch was that the /etc/resolv.conf
file was messed up, preventing the resolver from looking up the IP
address for our LDAP server "ldap-tls.group-w-inc.com". This turned out
not to be the case.
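
For anyone wanting to reproduce just this step: the resolution call can be
probed in isolation with a small standalone program that mirrors what
ldap_connect_to_host() does (the hostname and port below are illustrative
defaults, not taken from the autofs source). Running something like this from
the init script at the right moment would distinguish a resolver failure from
everything else:

```c
/* getaddrinfo probe: mimics the resolution step in ldap_connect_to_host().
 * Default host/port are examples; substitute the real LDAP server. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netdb.h>

/* Returns 0 on success, or the getaddrinfo error code (see gai_strerror). */
int try_resolve(const char *host, const char *port)
{
    struct addrinfo hints, *res = NULL;
    int rv;

    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_INET;        /* IPv4, as seen in the strace output */
    hints.ai_socktype = SOCK_STREAM;

    rv = getaddrinfo(host, port, &hints, &res);
    if (rv != 0) {
        fprintf(stderr, "getaddrinfo(%s, %s) failed: %s\n",
                host, port, gai_strerror(rv));
        return rv;
    }
    freeaddrinfo(res);
    return 0;
}

int main(int argc, char **argv)
{
    const char *host = argc > 1 ? argv[1] : "localhost";
    const char *port = argc > 2 ? argv[2] : "389";
    return try_resolve(host, port) ? EXIT_FAILURE : EXIT_SUCCESS;
}
```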

My second approach was more successful. I modified my kickstart file
to alter the /etc/rc.d/init.d/autofs file so that the
/usr/sbin/automount program is invoked using strace. The fragments of
strace output that proved enlightening are as follows:

...

6554 16:51:19.063975 open("/etc/resolv.conf", O_RDONLY) = 7
6554 16:51:19.064149 fstat(7, {st_dev=makedev(253, 1), st_ino=9830415, st_mode=S_IFREG|0644, st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=8, st_size=80, st_atime=2012/09/19-16:50:54, st_mtime=2012/09/19-16:50:53, st_ctime=2012/09/19-16:50:53}) = 0
6554 16:51:19.064239 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f0f6c483000
6554 16:51:19.064291 read(7, "; generated by /sbin/dhclient-script\nsearch group-w-inc.com\nnameserver 10.1.1.1\n", 4096) = 80

...

6554 16:51:19.069263 socket(PF_INET, SOCK_DGRAM|SOCK_NONBLOCK, IPPROTO_IP) = 7
6554 16:51:19.069480 connect(7, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.1.1.1")}, 16) = -1 ENETUNREACH (Network is unreachable)
6554 16:51:19.069552 close(7) = 0
6554 16:51:19.069606 socket(PF_INET, SOCK_DGRAM|SOCK_NONBLOCK, IPPROTO_IP) = 7
6554 16:51:19.069656 connect(7, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.1.1.1")}, 16) = -1 ENETUNREACH (Network is unreachable)

...

So the /etc/resolv.conf file is actually reasonable:

; generated by /sbin/dhclient-script
search group-w-inc.com
nameserver 10.1.1.1

The "eth0" interface is supposed to be on the "10.1.1.x" network (i.e.,
our DNS server is on the directly connected LAN), so if the *network* is
unreachable, it must be that the network interface itself is "down", has
had its address wiped, or is otherwise being mucked with in an unsavory
manner.

Note that this appears to be a transient (though perhaps not short in
duration) state during "first boot." Once the system has finished
booting, I am indeed able, e.g., to "ssh" into this system as root, so
by this time the network interface is definitely back in a reasonable
state. (Users other than root don't fare as well, since they don't have
home directories.)

This is weird. The "autofs" service is being started about 15 *seconds*
after the "bringing up eth0" message, and perhaps 10 seconds after
NetworkManager starts. I have no idea what could be making the
interface invalid this long after these system init events. I know that
our DHCP server is not that slow.

I'm not too sure how to track this down any further. For example, I
could start a process early on during first boot that repeatedly does
the equivalent of an "ifconfig eth0" and "netstat -rn", together with
high-resolution time stamps, to observe the approximate moments in time
when the interface is getting altered. I could even bracket the
ldap_simple_bind_s() call with similar timestamps in the debugging
output to assure me that autofs is indeed attempting its LDAP binding
during a time window where the interface is out to lunch. But this
would give me no clue as to *which* process is actually modifying the
interface.

Perhaps there is some sort of kernel cmdline flag that turns on
debugging info within the network stack that would provide visibility
into which processes are doing what alterations to interface
configurations, routing tables, etc.? If I knew such flags, I could
easily make my kickstart file modify the grub.conf file so that first
boot was invoked with the necessary additional kernel incantations.

I have not had much luck finding documentation on the architecture of
NetworkManager -- how it works, what you can expect it to do, when, and
why. NetworkManager is at the top of my list of suspects, but the
apparent lack of documentation is making it hard for me to proceed
here. Is it really the case that, 15 seconds after "bringing up eth0"
and 10 seconds after NetworkManager is started, the interface is still
in a bogus state (assuming essentially instantaneous response from the
DHCP server, which is what we really do get here)? Does NetworkManager
have some massive amount of extra work that it does during first boot
that it doesn't have to repeat during subsequent boots?

My fear is that this is not really just a "first boot" problem -- that
literally any boot can hit this race condition between various startup
processes, causing the automounter to fail. I do not want to roll out
many replications of this system (I'm using kickstart for a reason)
without a much clearer understanding of what is happening here and why.

This is, of course, made much worse by the fact that automount
apparently never retries binding with the LDAP server if the connection
ever fails. Once it fails, the automounter seems to be permanently dead
(until somebody comes along and either manually restarts autofs or
reboots the system, in the hope that the race condition will not recur).
This problem also needs to be fixed quite urgently.