GPU Server & Cluster Engineer


FREYR TECHNOLOGY AI PTE. LTD.
Posted date: 16 days ago
Minimum level: N/A
Job category: Other

Position Overview

We are seeking a GPU Server & Cluster Engineer to design, deploy, and optimize high-performance GPU computing environments. The role focuses on large-scale AI model training, GPU cluster management, and infrastructure performance tuning. The ideal candidate has a strong background in GPU hardware, HPC networking, and system optimization, and holds a Bachelor's degree in Computer Science, Electrical Engineering, or a related field.

Key Responsibilities

GPU Server Deployment & Management
  • Install, configure, and maintain high-performance GPU servers (H200, H100, A100) for AI and HPC workloads.
  • Ensure optimal thermal management, power distribution, and hardware reliability in GPU clusters.
  • Perform hardware diagnostics, troubleshooting, and firmware updates to maintain cluster stability.

Cluster Architecture & Performance Optimization
  • Design and manage multi-node, multi-GPU clusters, optimizing for distributed AI model training.
  • Tune CUDA, NCCL, and NVLink/NVSwitch configurations to maximize GPU efficiency.
  • Implement low-latency networking and storage fabrics (InfiniBand, RDMA, NVMe-oF) to enhance high-performance computing workflows.
  • Optimize job scheduling and workload balancing using Kubernetes, Slurm, or Ray.

Infrastructure Monitoring & Automation
  • Develop and maintain real-time monitoring solutions for GPU utilization, power efficiency, and system health.
  • Automate cluster provisioning, configuration, and scaling using Ansible, Terraform, and Bash/Python scripting.
  • Implement predictive maintenance models to prevent downtime and improve GPU resource management.

Security, Compliance & Documentation
  • Ensure secure access controls and data integrity across GPU clusters.
  • Maintain compliance with applicable information security standards and data protection regulations (ISO 27001, GDPR, HIPAA).
  • Document hardware configurations, troubleshooting guides, and operational procedures for internal teams.

Qualifications

Education
  • Bachelor's degree or higher in:
    • Computer Science / Electrical Engineering
    • High-Performance Computing (HPC) / Cloud Computing
    • AI Infrastructure / Data Center Engineering

Technical Expertise
  • 3+ years of experience in GPU server administration, HPC infrastructure, or AI computing.
  • Hands-on experience with NVIDIA GPUs and the NVIDIA software stack (CUDA, TensorRT, cuDNN, NCCL).
  • Strong knowledge of Linux system administration (Ubuntu, CentOS, Rocky Linux).
  • Experience with cluster scheduling, orchestration, and container tools (Kubernetes, Slurm, Ray, MPI, Singularity).
  • Familiarity with networking (Ethernet, InfiniBand, RDMA) and storage solutions (Ceph, Lustre, NVMe-oF).
  • Scripting skills in Python, Bash, or Go for automation and system tuning.

Soft Skills & Travel Requirements
  • Proficient in English (spoken and written) for global collaboration.
  • Strong problem-solving and analytical skills, with a proactive approach to system optimization.
  • Willingness to travel internationally for data center setup, maintenance, and upgrades.

Preferred Qualifications
  • Relevant NVIDIA, Linux, networking, or HPC certifications (e.g., NVIDIA Certified Professional, RHCE, AWS Certified Advanced Networking - Specialty).
  • Experience with LLM training and AI model optimization on large-scale clusters.
  • Contributions to open-source HPC or AI infrastructure projects.
JOB SUMMARY
Location: Singapore
Employment type: Full-time