Experience: 5 Years
We are seeking a highly skilled MLOps Engineer with 5+ years of experience and strong expertise in AWS cloud services to join our engineering team. In this role, you will be responsible for architecting, building, and managing end-to-end ML pipelines, enabling scalable, reproducible, and production-ready ML solutions.
You will work with AWS SageMaker, EKS, EC2, S3, and related services, leveraging containerization, distributed training, and CI/CD best practices. The ideal candidate is passionate about building robust ML infrastructure, ensuring operational excellence, and collaborating with cross-functional teams to deliver high-impact ML products.
Key Responsibilities:
- Architect, develop, and maintain end-to-end machine learning pipelines with CI/CD practices to ensure reproducibility and scalability.
- Design, optimize, and manage ML training and inference workflows on AWS SageMaker and Kubernetes (EKS).
- Manage GPU-based distributed training clusters on EKS, leveraging the Volcano scheduler for efficient resource utilization.
- Implement ML model versioning, experiment tracking, and a model registry using tools like MLflow or SageMaker Experiments.
- Build, deploy, and manage secure containerized ML applications using Docker and Kubernetes.
- Set up robust monitoring, logging, and alerting frameworks for ML models and infrastructure via AWS CloudWatch and related tools.
- Enable comprehensive data lineage tracking and data versioning, and manage data lakes using efficient columnar storage formats such as Parquet and ORC.
- Automate infrastructure deployment and management using Infrastructure as Code (IaC) tools like Terraform or CloudFormation.
- Collaborate closely with Data Scientists, DevOps, and Product teams to productionize, scale, and maintain ML solutions in production environments.
Required Skills & Qualifications:
MLOps & ML Lifecycle Management:
- Deep understanding of the ML lifecycle, real-world deployment challenges, and best practices in model retraining, monitoring, and evaluation.
- Hands-on experience with ML model tracking, versioning, and automated retraining pipelines.
AWS MLOps Expertise:
- Proficiency with AWS services including SageMaker, EKS, EC2, S3, IAM, CloudWatch, and ECR.
- Familiarity with AWS security and scalability best practices for ML workloads.
GPU & Distributed Training:
- Experience configuring and managing NVIDIA GPU environments on Kubernetes (EKS).
- Knowledge of distributed training frameworks and job scheduling with the Volcano scheduler.
Containerization & Orchestration:
- Advanced skills in Docker and Kubernetes tailored for ML workloads.
- Proven ability to deploy and scale ML models in multi-node GPU clusters.
Large-Scale Data Management:
- Experience with data versioning tools (e.g., DVC, LakeFS), data lineage, and distributed storage systems.
- Familiarity with big data file formats such as Parquet, Avro, and related ecosystem tools.
CI/CD & Infrastructure as Code:
- Expertise in designing and implementing ML-specific CI/CD pipelines.
- Skilled in Terraform or CloudFormation for automated infrastructure provisioning and management.
Programming & Scripting:
- Strong proficiency in Python and Bash scripting.
- Comfortable authoring and debugging YAML configuration and deployment files.
- Familiarity with popular ML frameworks such as TensorFlow, PyTorch, and scikit-learn, including their deployment patterns.
Application:
Please send your CV to hr@nyxses.com