Saturday, May 7, 2016

qsub: Bad UID for job execution MSG=User does not exist in server password file

Scenario:
A High Performance Computing cluster running Scientific Linux. The head nodes and compute nodes are members of a Windows Server 2012 R2 Active Directory domain. Users log in to client nodes with their AD login and then SSH (Kerberos-enabled passwordless login) in to the head node to submit PBS job scripts to the cluster.

The kinit and id commands work on the head node and display the relevant information for the domain user, which means that the head node can connect to Active Directory and resolve domain usernames.

Issue:
When submitting PBS jobs, users see the following error:
[domainuser@myorg.org.au@HPC torque]$ qsub FirstJob.pbs 
qsub: submit error (Bad UID for job execution MSG=User domainuser does not exist in server password file


When a job is submitted, Torque calls the system function getpwnam_r to retrieve information about the submitting user. The error message is misleading, because it suggests that getpwnam_r looks for the user only in the "server password file". According to the man page, however, when so configured it also searches NIS and LDAP for the given user:

“The getpwnam() function returns a pointer to a structure containing the broken-out fields of the record in the password database (e.g., the local password file /etc/passwd, NIS, and LDAP) that matches the username name.
The getpwuid() function returns a pointer to a structure containing the broken-out fields of the record in the password database that matches the user ID uid.”

Reason:
The error is caused by a mismatch between the fully qualified name domainuser@myorg.org.au and the short name domainuser when getpwnam_r searches the authentication databases, in this case the local passwd file as well as Active Directory.
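The mismatch can be checked from the shell, because getent queries the same NSS databases that getpwnam_r consults. The following is a minimal sketch; the helper name check_name is hypothetical and not part of Torque or SSSD, and the names passed to it are the example names used in this post.

```shell
#!/bin/sh
# check_name: report whether a login name resolves through NSS
# (the lookup path behind getpwnam_r: /etc/passwd, NIS, LDAP, SSSD).
# Hypothetical helper for illustration only.
check_name() {
    if getent passwd "$1" >/dev/null 2>&1; then
        echo "resolves: $1"
    else
        echo "NOT found: $1"
        return 1
    fi
}
```

On the head node, run check_name domainuser and check_name domainuser@myorg.org.au. According to the sssd.conf man page, when use_fully_qualified_names is True only the fully qualified form resolves, whereas the qsub error message refers to the short form.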

Fix:
In the /etc/sssd/sssd.conf file, set the option use_fully_qualified_names to False, so that SSSD resolves the short username (domainuser) without the domain suffix.
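The setting can be inspected before and after the edit. The sketch below assumes a systemd-based host, and the helper name fqn_setting is hypothetical; it only reads the file and never modifies it.

```shell
#!/bin/sh
# fqn_setting: print each use_fully_qualified_names line found in an
# sssd.conf-style file. Hypothetical helper; it only reads the file.
fqn_setting() {
    grep -E '^[[:space:]]*use_fully_qualified_names[[:space:]]*=' "$1"
}

# Typical use on the head node (assumes systemd and root privileges):
#   fqn_setting /etc/sssd/sssd.conf
#   sudo systemctl restart sssd    # SSSD reads sssd.conf at startup
#   sudo sss_cache -E              # invalidate all cached entries
```

The option belongs in the [domain/...] section of sssd.conf. SSSD normally has to be restarted before the new value takes effect, and clearing the cache avoids stale entries in the old name format.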

Torque Make error: mom_mach.h: No such file or directory

Issue: 
While running make to compile the TORQUE Resource Manager, you may encounter the error:
site_mom_chu.c:25:22: fatal error: mom_mach.h: No such file or directory
#include "mom_mach.h"


Environment: 
OS: Scientific Linux 7.2
TORQUE Resource Manager: 6.0.1

Fix: 
To fix this issue, give full access permissions to all the files in the torque folder:

chmod 777 -R *

Then run make again.
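To see whether the header is present in the source tree and what permissions it carries, a small search can help. This is only a sketch; the helper name find_header is hypothetical and not part of the TORQUE build system.

```shell
#!/bin/sh
# find_header: list every copy of a header file below a directory.
# Hypothetical helper for illustration only.
find_header() {
    find "$1" -name "$2"
}

# Typical use from the top of the torque source folder:
#   find_header . mom_mach.h
#   ls -l "$(find_header . mom_mach.h | head -n 1)"
```

If the search prints a path but the compiler still cannot open the file, permissions on the file or on one of its parent directories are a likely cause.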

The only other available web resource on this issue is here: