Posted: June 10, 2025 | Location: Bangalore | Job type: Full-time


Experience: 5 Years

We are seeking a highly skilled MLOps Engineer with 5+ years of experience and strong expertise in AWS cloud services to join our engineering team. In this role, you will be responsible for architecting, building, and managing end-to-end ML pipelines, enabling scalable, reproducible, and production-ready ML solutions.
You will work with AWS SageMaker, EKS, EC2, S3, and related services, applying containerization, distributed training, and CI/CD best practices. The ideal candidate is passionate about building robust ML infrastructure, ensuring operational excellence, and collaborating with cross-functional teams to deliver high-impact ML products.


Key Responsibilities:

  • Architect, develop, and maintain end-to-end machine learning pipelines with CI/CD practices to ensure reproducibility and scalability.
  • Design, optimize, and manage ML training and inference workflows on AWS SageMaker and Kubernetes (EKS).
  • Manage GPU-based distributed training clusters, leveraging the Volcano scheduler on EKS for efficient resource utilization.
  • Implement ML model versioning, experiment tracking, and a model registry using tools like MLflow or SageMaker Experiments.
  • Build, deploy, and manage secure containerized ML applications using Docker and Kubernetes.
  • Set up robust monitoring, logging, and alerting frameworks for ML models and infrastructure via AWS CloudWatch and related tools.
  • Enable comprehensive data lineage tracking, data versioning, and manage data lakes with optimal storage formats such as Parquet and ORC.
  • Automate infrastructure deployment and management using Infrastructure as Code (IaC) tools like Terraform or CloudFormation.
  • Collaborate closely with Data Scientists, DevOps, and Product teams to productionize, scale, and maintain ML solutions in production environments.
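The versioning and experiment-tracking responsibilities above come down to one discipline: for every training run, record which parameters and data produced which model artifact. A minimal sketch of that idea using only the Python standard library (the `register_run` helper and its schema are illustrative inventions, not the MLflow or SageMaker Experiments API):

```python
import hashlib
import json
import time

def register_run(registry: dict, params: dict, model_bytes: bytes) -> str:
    """Record one training run: its parameters plus a content hash of the model.

    Hypothetical helper for illustration; real tools (MLflow, SageMaker
    Experiments) persist far richer metadata to a tracking server.
    """
    model_hash = hashlib.sha256(model_bytes).hexdigest()
    run_id = f"run-{len(registry) + 1}"
    registry[run_id] = {
        "params": params,
        "model_sha256": model_hash,   # ties the run to an exact artifact
        "timestamp": time.time(),
    }
    return run_id

# Usage: register a run, then look its parameters back up by run id.
registry: dict = {}
run_id = register_run(registry, {"lr": 0.01, "epochs": 10}, b"model-weights")
print(json.dumps(registry[run_id]["params"]))
```

Hashing the serialized model means any two runs producing byte-identical artifacts are detectably the same, which is the core guarantee a model registry provides.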


Required Skills & Qualifications:

MLOps & ML Lifecycle Management:

  • Deep understanding of ML lifecycle, real-world deployment challenges, and best practices in model retraining, monitoring, and evaluation.
  • Hands-on experience with ML model tracking, versioning, and automated retraining pipelines.

AWS MLOps Expertise:

  • Proficiency with AWS services including SageMaker, EKS, EC2, S3, IAM, CloudWatch, and ECR.
  • Familiarity with AWS security and scalability best practices for ML workloads.

GPU & Distributed Training:

  • Experience configuring and managing NVIDIA GPU environments on Kubernetes (EKS).
  • Knowledge of distributed training frameworks and job scheduling with the Volcano scheduler.

Containerization & Orchestration:

  • Advanced skills in Docker and Kubernetes tailored for ML workloads.
  • Proven ability to deploy and scale ML models in multi-node GPU clusters.

Large-Scale Data Management:

  • Experience with data versioning tools (e.g., DVC, LakeFS), data lineage, and distributed storage systems.
  • Familiarity with big data file formats such as Parquet, Avro, and related ecosystem tools.
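Data versioning tools such as DVC and LakeFS track datasets by content hash rather than storing copies in git: a small pointer record stands in for the data, and a changed hash signals a changed dataset. A stdlib-only sketch of that mechanism (the `snapshot` record is a simplification; real DVC metadata uses a different schema):

```python
import hashlib
import tempfile
from pathlib import Path

def snapshot(path: Path) -> dict:
    """Produce a DVC-style pointer record: path, size, and content hash.

    Illustrative only; real tools also handle directories, caches, and remotes.
    """
    data = path.read_bytes()
    return {
        "path": path.name,
        "size": len(data),
        "md5": hashlib.md5(data).hexdigest(),
    }

# Usage: snapshot a file, modify it, and detect the change by re-hashing.
with tempfile.TemporaryDirectory() as d:
    p = Path(d) / "train.csv"
    p.write_text("id,label\n1,0\n")
    before = snapshot(p)
    p.write_text("id,label\n1,0\n2,1\n")
    after = snapshot(p)
    print(before["md5"] != after["md5"])  # changed data yields a new hash
```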

CI/CD & Infrastructure as Code:

  • Expertise in designing and implementing ML-specific CI/CD pipelines.
  • Skilled in Terraform or CloudFormation for automated infrastructure provisioning and management.

Programming & Scripting:

  • Strong proficiency in Python and Bash scripting.
  • Comfortable authoring and debugging YAML configuration and deployment files.
  • Familiarity with popular ML frameworks such as TensorFlow, PyTorch, and scikit-learn, including their deployment patterns.
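Debugging YAML deployment files often starts with a sanity check that required keys are present before anything is applied to a cluster. A toy sketch of that workflow in stdlib Python (it parses only a flat `key: value` subset; real manifests need a full YAML parser such as PyYAML, and the image name below is hypothetical):

```python
def parse_flat_yaml(text: str) -> dict:
    """Parse a flat subset of YAML: 'key: value' lines with '#' comments.

    Illustrative only; not a real YAML parser.
    """
    config = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        key, _, value = line.partition(":")   # split on the first colon only
        config[key.strip()] = value.strip()
    return config

def check_required(config: dict, required: tuple) -> list:
    """Return the required keys missing from the config."""
    return [k for k in required if k not in config]

manifest = """
image: my-registry/model-server:1.2.0   # hypothetical image name
replicas: 3
"""
cfg = parse_flat_yaml(manifest)
print(check_required(cfg, ("image", "replicas", "gpu_limit")))  # -> ['gpu_limit']
```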


Application

Please send your CV to hr@nyxses.com