Senior Site Reliability Engineer - GPU Cloud
Location
Bangalore
Compensation Tier
Senior
Category
Engineering
Company Overview
An organization at the forefront of accelerated computing and AI works with a leading developer of safety-certified, real-time operating systems and software for embedded systems, particularly within the automotive sector. This collaboration focuses on integrating high-performance computing hardware with robust, secure software solutions to power advanced applications, such as autonomous driving and other intelligent systems. Together, the companies are enabling the development of next-generation intelligent devices that demand both intensive processing power and the highest standards of reliability.
Position Overview
We are looking for experienced Senior Site Reliability Engineers (SREs) to manage and automate large-scale GPU cloud infrastructure for AI, HPC, and distributed computing. The role involves designing, deploying, and maintaining scalable and reliable systems using tools like Terraform, Kubernetes, and cloud platforms. Candidates should have 8+ years of experience, strong coding skills (Go/Python/C++/Java), and expertise in troubleshooting and automation. The ideal candidate can simplify complex problems, ensure platform reliability, and build efficient solutions for NVIDIA's next-gen cloud services.
Responsibilities
- Provide scalable and robust service-oriented infrastructure automation, monitoring, and analytics solutions for NVIDIA's on-prem and cloud-based GPU infrastructure.
- Own the whole life cycle of new tools and services, from requirements gathering through design documentation, validation, and deployment.
- Provide customer support on a rotation basis.
Skills & Experience
- Experience: 10-16 years.
- Minimum of 8 years of experience in automating and handling large-scale distributed system software deployments in on-prem/cloud environments.
- Proficiency in at least one programming language: Go, Python, Perl, C++, Java, or C.
- Strong command of Terraform, Kubernetes, and cloud infrastructure administration.
- Excellent debugging and troubleshooting skills.
- Ability to design simple and reliable systems that can work without much support.
- Outstanding teammate who can collaborate and influence in a multifaceted environment.
- Excellent interpersonal and written communication skills.
- M.Sc or B.E in Computer Science or a related technical field involving coding (e.g., physics or mathematics).