System Administrator

Full Time
Employment Information

Role: Sr. HPC Administrator

Desired Experience Range: 7 - 12 yrs

Notice Period: Immediate to 60 Days only

Location of Requirement: Bangalore

JOB DESCRIPTION

  • Strong experience in providing support for Linux HPC clusters.
  • Strong working knowledge of the following:
  • IBM Platform LSF 9 and 10 administration.
  • Red Hat Enterprise Linux administration.
  • Lustre parallel file system.
  • Mellanox InfiniBand connectivity.
  • Cluster Manager Administration (HPCM or xCAT)
  • SSSD & NIS Authentication mechanisms.
  • Bash & Python scripting.
  • Ansible playbooks.
  • Experience with Abaqus and CFD applications (Fluent, StarCCM+, etc.).
  • Strong knowledge of application installations and version management on shared file systems.
  • IT infrastructure technical operations management under the ITIL framework.
  • Security compliance and remediation management.

Intermediate Level

  • DevOps, ITIL, Agile, SAFe (certifications are desirable)

Responsibilities

  • Installation, configuration, troubleshooting and administration of Linux HPC clusters (compute, storage, and network) and applications in support of CAE environments.

  • Monitor and analyze LSF job queues and resource utilization to optimize workload management.
  • Troubleshoot and resolve any issues with LSF and its components, including master servers, compute nodes, and resource managers.

  • Collaborate with users to understand their HPC requirements and design LSF job workflows to meet their needs.

  • Develop and maintain LSF documentation, including standard operating procedures, installation guides, and troubleshooting procedures.

  • Develop and maintain LSF scripts for automation and task scheduling.
  • Diagnose and troubleshoot complex RHEL OS, application and HPC cluster technical problems.
  • Interact with hardware and software vendors for external support.
  • Develop and maintain technical solution documents (TSDs) and standard operating procedures (SOPs).
  • Keep all HPC infrastructure systems, servers, and devices up to date and in working condition to support business continuity.

  • Design and implement HPC network topology, including Mellanox connectivity.
  • Create and maintain HPC capacity plans and periodic cluster utilization reports.
  • Troubleshoot Abaqus, StarCCM+ and Fluent applications, and resolve any issues in a timely manner.
  • Develop and maintain Python and Bash scripts for automation and task scheduling.