We are seeking a Senior HPC Systems Engineer to maintain G42's state-of-the-art computational and data science infrastructure.
As a member of our HPC Team, you will lead and participate in the deployment, management, and optimization of systems and processes. You will work with G42's community to identify and provide solutions and technical support that enable our cloud customers to develop and deploy their AI applications at scale.
Responsibilities and Duties:
• Provide in-depth tier-3 technical O&M support and administration for a 24x7x365, always-available production environment
• Configure, install, maintain and upgrade HPC clusters (compute, storage, and network) and applications in support of research computing environments
• Lead and collaborate on projects to maintain and enhance system functionality in areas such as systems monitoring, scheduling and resource management, configuration management, backups, HPC system management utilities/tools, HPC cluster performance and resiliency
• Diagnose, isolate and resolve complex application and system technical problems (hardware, software, network)
• Develop scripts and automation to enhance operational services and service quality
• Perform system tuning based upon proactive performance analysis
• Build, install, and support scientific software (commercial and open source)
• Develop and maintain technical documentation for customer use and contribute to the internal knowledge base
Qualifications:
• Solid experience in configuring, managing, and optimizing large Linux clusters and servers
• Expert-level experience with workload managers and job schedulers (e.g. PBS, Slurm, Moab, TORQUE)
• Experience configuring, managing, and optimizing distributed and parallel file systems such as Lustre, GPFS, NFS, and Ceph, and storage protocols such as FC, iSCSI, NFS, and CIFS
• Knowledge of networks, routers, switches, and firewalls, and familiarity with high-performance networks such as InfiniBand
• Strong scripting/programming capabilities (e.g. Python, Bash, Perl)
• Experience managing virtualization platforms (VMware, KVM, oVirt)
• Extensive knowledge of Red Hat- or Debian-based distributions and strong experience with maintaining, upgrading, and tuning the Linux kernel
• Experience with system configuration management and provisioning tools such as Puppet, Ansible, Chef, and Cobbler
• Experience with monitoring/alerting tools (e.g. Ganglia, Nagios, Zabbix, Grafana)
• Strong experience with tools for compiling and building packages (e.g. Spack, Conda, EasyBuild)
• Strong experience using containerized workflows based on Docker, Singularity, and Kubernetes
• Solid experience configuring, installing, and troubleshooting MPI
• Demonstrated ability to research, quickly identify and correct problems (debug) using system utilities and diagnostics
• Demonstrated ability to perform complex performance analysis of system processes, I/O subsystems, networks, and other related components
• Experience with performance benchmarking using profilers and debuggers to recommend code improvements for scalability and performance
• Experience with NVIDIA DGX servers and NVIDIA tools
• Experience with Linux kernel development and the Linux development community
• Experience with on-prem cloud technologies such as OpenStack
• Working knowledge of one or more programming languages such as C or C++