System Engineer (GPU Infrastructure & Platform Engineering)

RAKUTEN ASIA PTE. LTD.

Location: Singapore

Posted date: a day ago

Minimum level: N/A

Employment type: Full-time

Job category: Engineering

About Rakuten

Rakuten Group has almost 100 million customers in Japan and 1 billion globally, and provides more than 70 services spanning e-commerce, payment services, financial services, telecommunications, media, sports, and more.

Division Introduction

The AI & Data Division (AIDD) spearheads data science and AI initiatives by leveraging data from across Rakuten Group. We build a platform for large-scale field experimentation using cutting-edge technologies, providing critical insights that enable faster and better contributions to our business. Our division boasts an international culture created by talented employees from around the world. Following the strategic vision "Rakuten as a data-driven membership company", AIDD is expanding its data and AI activities across multiple Rakuten Group companies.

About the Role

As a System Engineer (GPU Infrastructure & Platform Engineering), you will build, scale, and optimize the GPU cluster infrastructure that supports both training (e.g., ranking models, LLMs) and inference workloads. Your focus will be on designing and building a GPU platform with sophisticated scheduling, elasticity, and quota management, ensuring efficient utilization, scalability, and stability for Rakuten's AI workloads.

Key Responsibilities

Optimize Kubernetes (K8s) for GPU workloads, including scheduling policies, autoscaling, and multi-tenant resource isolation.
Deploy and maintain inference serving platforms (e.g., NVIDIA Triton, vLLM, SGLang) for high-throughput and low-latency model deployment.
Automate cluster provisioning, monitoring, and recovery to maximize uptime and GPU utilization.
Collaborate with ML engineers to troubleshoot GPU-related issues in training jobs (e.g., NCCL errors, OOM) and inference bottlenecks.
Implement observability tools (Prometheus, Grafana) to track GPU utilization, job performance, and cluster health.
Develop infrastructure-as-code (IaC) solutions for reproducible GPU environments (e.g., Terraform, Ansible).

Mandatory Qualifications

3+ years of experience in DevOps/MLOps, GPU infrastructure, or distributed computing.
Deep expertise in Kubernetes (K8s) for GPU workload orchestration (e.g., Kubeflow, Volcano, custom schedulers).
Strong programming skills in Go or Python for platform development, automation and tooling.
Proficiency in Linux system administration, performance tuning, and networking (e.g., RDMA, InfiniBand).
Experience with IaC tools (Terraform, Ansible) and CI/CD pipelines (GitHub Actions, Jenkins).
Bachelor's or higher degree in Computer Science, Engineering, or a related field.
Strong teamwork and communication skills, with a passion for solving infrastructure challenges.

Nice-to-Have Skills

Familiarity with distributed training frameworks (e.g., PyTorch DDP, FSDP, DeepSpeed).
Familiarity with the NVIDIA Triton serving framework or a similar framework, and with tuning serving parameters to balance latency against throughput.
Hands-on experience with GPU clusters, including troubleshooting NVIDIA drivers, CUDA, and NCCL issues.
Knowledge of high-performance storage (Lustre, WekaFS) for large-scale training data.
Experience with LLM training/inference stacks (e.g., Megatron-LM, TensorRT-LLM).

Why Join Us?

Build and scale cutting-edge GPU infrastructure for ranking models, LLMs, and real-time AI.
Work with global AI/ML teams to solve high-impact infrastructure challenges.
Opportunity to shape the future of Rakuten's GPU platform for scalability and efficiency.
