GPU Server & Cluster Engineer
FREYR TECHNOLOGY AI PTE. LTD.
Position Overview
We are seeking a GPU Server & Cluster Engineer to design, deploy, and optimize high-performance GPU-based computing environments. This role focuses on large-scale AI model training, GPU cluster management, and infrastructure performance tuning. The ideal candidate has a strong background in GPU hardware, HPC networking, and system optimization, and holds a Bachelor's degree in Computer Science, Electrical Engineering, or a related field.
Key Responsibilities
GPU Server Deployment & Management
- Install, configure, and maintain high-performance GPU servers (H200, H100, A100) for AI and HPC workloads.
- Ensure optimal thermal management, power distribution, and hardware reliability in GPU clusters.
- Perform hardware diagnostics, troubleshooting, and firmware updates to maintain cluster stability.
Cluster Architecture & Performance Optimization
- Design and manage multi-node, multi-GPU clusters, optimizing for distributed AI model training.
- Tune CUDA, NCCL, and NVLink/NVSwitch configurations to maximize GPU efficiency.
- Implement low-latency networking and storage fabrics (InfiniBand, RDMA, NVMe-oF) to enhance high-performance computing workflows.
- Optimize job scheduling and workload balancing using Kubernetes, Slurm, or Ray.
Infrastructure Monitoring & Automation
- Develop and maintain real-time monitoring solutions for GPU utilization, power efficiency, and system health.
- Automate cluster provisioning, configuration, and scaling using Ansible, Terraform, and Bash/Python scripting.
- Implement predictive maintenance models to prevent downtime and improve GPU resource management.
Security, Compliance & Documentation
- Ensure secure access controls and data integrity across GPU clusters.
- Maintain compliance with data center and AI security standards and regulations (ISO 27001, GDPR, HIPAA).
- Document hardware configurations, troubleshooting guides, and operational procedures for internal teams.
Qualifications
Education
- Bachelor's degree or higher in one of the following fields:
  - Computer Science / Electrical Engineering
  - High-Performance Computing (HPC) / Cloud Computing
  - AI Infrastructure / Data Center Engineering
Technical Expertise
- 3+ years of experience in GPU server administration, HPC infrastructure, or AI computing.
- Hands-on experience with NVIDIA GPUs and the NVIDIA software stack (CUDA, TensorRT, cuDNN, NCCL).
- Strong knowledge of Linux system administration (Ubuntu, CentOS, Rocky Linux).
- Experience with cluster management tools (Kubernetes, Slurm, Ray, MPI, Singularity).
- Familiarity with networking (Ethernet, InfiniBand, RDMA) and storage solutions (Ceph, Lustre, NVMe-oF).
- Scripting skills in Python, Bash, or Go for automation and system tuning.
Soft Skills & Travel Requirements
- Proficient in English (spoken and written) for global collaboration.
- Strong problem-solving and analytical skills, with a proactive approach to system optimization.
- Willingness to travel internationally for data center setup, maintenance, and upgrades.
Preferred Qualifications
- Certifications in NVIDIA, Linux, networking, or HPC technologies (e.g., NVIDIA Certified Professional, RHCE, AWS Certified Advanced Networking - Specialty).
- Experience with LLM training and AI model optimization on large-scale clusters.
- Contributions to open-source HPC or AI infrastructure projects.
JOB SUMMARY
Singapore
Full-time