Key Responsibilities:
- Deploying machine learning models to production environments,
- Monitoring the performance of models in real-time,
- Creating and maintaining CI/CD pipelines,
- Configuring and managing Docker containers and Kubernetes clusters,
- Setting up and expanding system monitoring using tools such as Prometheus, Thanos, and Grafana,
- Implementing A/B testing and integrating with service mesh (e.g. Istio),
- Working with model management platforms such as Kubeflow and OpenDataHub,
- Automating tasks using scripts (e.g. Bash, Python),
- Resolving technical issues and optimizing system performance,
- Collaborating with Data Science, DevOps, and engineering teams to ensure deployment stability.
Desired skills & experience:
- Proven experience deploying and monitoring machine learning models in production environments,
- Minimum of 5 years' experience working with Docker, Kubernetes, Helm, and CI/CD pipelines,
- At least 5 years of experience with monitoring tools such as Prometheus, Thanos, or Grafana,
- Familiarity with model monitoring tools such as Arize, Evidently AI, or Alibi Detect,
- Experience in conducting A/B tests and working with service mesh software (e.g., Istio),
- Proficiency in using platforms like Kubeflow or OpenDataHub for deploying and managing models,
- Strong understanding of infrastructure monitoring concepts and best practices,
- Excellent problem-solving skills and the ability to diagnose complex technical issues,
- Experience working with cloud platforms (AWS, Google Cloud Platform, or Azure),
- Knowledge of scripting languages and automation (e.g., Bash, Python),
- Proficiency in English, enabling fluent verbal and written communication.
We offer:
- Paid leave,
- Permanent position,
- Work in an international, English-speaking environment.