GPU Server & Cluster Engineer

FREYR TECHNOLOGY AI PTE. LTD.

Location: Singapore

Posted date: 16 days ago

Minimum level: N/A

Job type: Full-time

Job category: Other

Position Overview

We are seeking a GPU Server & Cluster Engineer to design, deploy, and optimize high-performance GPU-based computing environments. The role focuses on large-scale AI model training, GPU cluster management, and infrastructure performance tuning. The ideal candidate has a strong background in GPU hardware, HPC networking, and system optimization, and holds a Bachelor's degree in Computer Science, Electrical Engineering, or a related field.

Key Responsibilities

GPU Server Deployment & Management

Install, configure, and maintain high-performance GPU servers (H200, H100, A100) for AI and HPC workloads.
Ensure optimal thermal management, power distribution, and hardware reliability in GPU clusters.
Perform hardware diagnostics, troubleshooting, and firmware updates to maintain cluster stability.

Cluster Architecture & Performance Optimization

Design and manage multi-node, multi-GPU clusters, optimizing for distributed AI model training.
Tune CUDA, NCCL, and NVLink/NVSwitch configurations to maximize GPU efficiency.
Implement low-latency networking and storage fabric solutions (InfiniBand, RDMA, NVMe-oF) to enhance high-performance computing workflows.
Optimize job scheduling and workload balancing using Kubernetes, Slurm, or Ray.

Infrastructure Monitoring & Automation

Develop and maintain real-time monitoring solutions for GPU utilization, power efficiency, and system health.
Automate cluster provisioning, configuration, and scaling using Ansible, Terraform, and Bash/Python scripting.
Implement predictive maintenance models to prevent downtime and improve GPU resource management.

Security, Compliance & Documentation

Ensure secure access controls and data integrity across GPU clusters.
Maintain compliance with applicable security and data protection standards and regulations (ISO 27001, GDPR, HIPAA).
Document hardware configurations, troubleshooting guides, and operational procedures for internal teams.

Qualifications

Education

Bachelor's degree or higher in:
- Computer Science / Electrical Engineering
- High-Performance Computing (HPC) / Cloud Computing
- AI Infrastructure / Data Center Engineering

Technical Expertise

3+ years of experience in GPU server administration, HPC infrastructure, or AI computing.
Hands-on experience with NVIDIA GPUs and the NVIDIA software stack (CUDA, TensorRT, cuDNN, NCCL).
Strong knowledge of Linux system administration (Ubuntu, CentOS, Rocky Linux).
Experience with cluster scheduling, orchestration, and container tools (Kubernetes, Slurm, Ray, MPI, Singularity).
Familiarity with networking (Ethernet, InfiniBand, RDMA) and storage solutions (Ceph, Lustre, NVMe-oF).
Scripting skills in Python, Bash, or Go for automation and system tuning.

Soft Skills & Travel Requirements

Proficient in English (spoken and written) for global collaboration.
Strong problem-solving and analytical skills, with a proactive approach to system optimization.
Willingness to travel internationally for data center setup, maintenance, and upgrades.

Preferred Qualifications

Certifications in NVIDIA, Linux, or HPC (e.g., NVIDIA Certified Professional, RHCE, AWS Certified Advanced Networking - Specialty).
Experience with LLM training and AI model optimization on large-scale clusters.
Contributions to open-source HPC or AI infrastructure projects.
