Experience: 5 Years
We are seeking a highly skilled MLOps Engineer with 5+ years of experience and strong expertise in AWS cloud services to join our engineering team. In this role, you will be responsible for architecting, building, and managing end-to-end ML pipelines, enabling scalable, reproducible, and production-ready ML solutions.
You will work with AWS SageMaker, EKS, EC2, S3, and related services, leveraging containerization, distributed training, and CI/CD best practices. The ideal candidate is passionate about building robust ML infrastructure, ensuring operational excellence, and collaborating with cross-functional teams to deliver high-impact ML products.
Key Responsibilities:
- Architect, develop, and maintain end-to-end machine learning pipelines with CI/CD practices to ensure reproducibility and scalability.
- Design, optimize, and manage ML training and inference workflows on AWS SageMaker and Kubernetes (EKS).
- Manage GPU-based distributed training clusters on EKS, leveraging the Volcano scheduler for efficient resource utilization.
- Implement ML model versioning, experiment tracking, and a model registry using tools like MLflow or SageMaker Experiments.
- Build, deploy, and manage secure containerized ML applications using Docker and Kubernetes.
- Set up robust monitoring, logging, and alerting frameworks for ML models and infrastructure via AWS CloudWatch and related tools.
- Enable comprehensive data lineage tracking and data versioning, and manage data lakes using efficient columnar storage formats such as Parquet and ORC.
- Automate infrastructure deployment and management using Infrastructure as Code (IaC) tools like Terraform or CloudFormation.
- Collaborate closely with Data Scientists, DevOps, and Product teams to productionize, scale, and maintain ML solutions in production environments.
Required Skills & Qualifications:
MLOps & ML Lifecycle Management:
- Deep understanding of the ML lifecycle, real-world deployment challenges, and best practices in model retraining, monitoring, and evaluation.
- Hands-on experience with ML model tracking, versioning, and automated retraining pipelines.
AWS MLOps Expertise:
- Proficiency with AWS services including SageMaker, EKS, EC2, S3, IAM, CloudWatch, and ECR.
- Familiarity with AWS security and scalability best practices for ML workloads.
GPU & Distributed Training:
- Experience configuring and managing NVIDIA GPU environments on Kubernetes (EKS).
- Knowledge of distributed training frameworks and job scheduling with the Volcano scheduler.
Containerization & Orchestration:
- Advanced skills in Docker and Kubernetes tailored for ML workloads.
- Proven ability to deploy and scale ML models in multi-node GPU clusters.
Large-Scale Data Management:
- Experience with data versioning tools (e.g., DVC, LakeFS), data lineage, and distributed storage systems.
- Familiarity with big data file formats such as Parquet, Avro, and related ecosystem tools.
CI/CD & Infrastructure as Code:
- Expertise in designing and implementing ML-specific CI/CD pipelines.
- Skilled in Terraform or CloudFormation for automated infrastructure provisioning and management.
Programming & Scripting:
- Strong proficiency in Python and Bash scripting.
- Comfortable authoring and debugging YAML configuration and deployment files.
- Familiarity with popular ML frameworks such as TensorFlow, PyTorch, and scikit-learn, including their deployment patterns.
Application:
Please send your CV to hr@nyxses.com