System Administrator

REVE CLOUD PTE. LTD.
Position Summary:
We are seeking a skilled HPC System Administrator to manage and maintain high-performance computing (HPC) systems. The ideal candidate will be responsible for system administration, user support, software integration, and collaboration with research teams to optimize computational workflows.
Key Responsibilities:
1. HPC System Management and Maintenance
• Install, configure, integrate, and maintain high-performance compute clusters and associated hardware
• Monitor system performance, troubleshoot issues, and ensure security compliance
• Process and document change management procedures
2. User Support and Consultation
• Assist users with computational jobs and optimize workflows for efficient resource utilization
• Provide training sessions and resolve user issues related to HPC environments
3. Software and Application Support
• Install, configure, and maintain scientific and engineering HPC software solutions
• Support software development for parallel computing and performance optimization
4. Collaboration with Research Teams
• Understand research project requirements and recommend appropriate HPC solutions
• Assist in designing and optimizing computational workflows for researchers
5. Resource Allocation and Scheduling
• Manage resource allocation and job scheduling within the HPC environment
• Implement policies for job queuing, resource limits, and workload balancing
• Enforce operational best practices and implementation plans
6. System and Network Optimization
• Configure and maintain high-speed networks for optimal data transfer within the HPC infrastructure
• Conduct performance benchmarking and optimization efforts
7. Documentation and Reporting
• Maintain detailed system documentation, configuration guides, and user manuals
• Generate reports on system performance, resource utilization, and operational efficiency
Qualifications and Skills:
• Strong experience with HPC system administration, Linux-based environments, and cluster management tools.
• Proficiency in job scheduling and resource management frameworks (e.g., Slurm, PBS, Grid Engine).
• Hands-on experience with networking protocols, security policies, and data transfer optimizations.
• Familiarity with scientific computing software and parallel programming techniques.
• Ability to troubleshoot complex system and application issues effectively.
• Strong communication skills to collaborate with researchers and support teams.
JOB SUMMARY
System Administrator

REVE CLOUD PTE. LTD.
Singapore
8 days ago
Full-time