- Improve the service monitoring system so that it detects system problems early and prevents AI project failures.
- Discover and evaluate new tools and technologies that improve the team's development and operations.
- Communicate with the team through design docs, tech talks, and code reviews.
- Participate in solution design and advise other developers to build scalable, maintainable, and efficient systems.
- Build working monitoring and logging infrastructure tailored to distributed systems.
- Design, implement, and maintain infrastructure for application CI/CD and for machine learning and deep learning deployment pipelines.
- Have fun as part of an awesome team.
- 1+ years with UNIX/Linux systems administration.
- 1+ years of production experience with Docker and Kubernetes.
- Experienced with a public cloud (GCP, AWS, Azure); GCP is a big plus.
- Experienced with Bash scripting or Python.
- Experienced with at least one monitoring tool (Thanos/Prometheus/Grafana is a big plus).
- Experienced with at least one log gathering tool (ELK/EFK/Loki+Promtail+Grafana).
- Experienced with CI/CD pipelines.
- Experienced with Git.
- Experience with, or a background in, machine learning or data science.
- Experience with AI-related production systems or project development.
- Experience with CNCF projects such as Helm, Istio, Argo, Thanos, or others.
- Experience in relational database administration.
To apply for this job, email your details to email@example.com