Join Mistral, a leading AI platform company, as a Systems Engineer (HPC). In this hands-on, hybrid role, you will design, operate, and scale the infrastructure behind our AI platforms. You will work closely with infrastructure, HPC, and research teams to ensure our clusters and platforms run reliably at scale. Responsibilities include operating and maintaining large-scale Linux environments, monitoring system health, troubleshooting incidents, and improving performance and resource utilization. You will also automate operational tasks and contribute to system design and architecture decisions.
About the role
- Solid troubleshooting skills across systems, hardware, and networks
- Strong Linux systems administration experience (core requirement)
- Experience working in large-scale environments (HPC clusters or cloud infrastructure)
- Experience with some of the following:
  - Job schedulers (e.g. Slurm)
  - Storage systems (e.g. Ceph, Lustre, NFS)
  - Infrastructure as Code / automation tooling
  - GPU or AI/ML workloads
  - Networking fundamentals (Ethernet; InfiniBand is a plus)
  - Containers / orchestration (e.g. Kubernetes)
- We are not expecting everything; strong depth in one area is valuable
- Low-ego, collaborative, and hands-on
- Pragmatic problem solver who can operate in fast-scaling environments
- Comfortable working across multiple domains (“Swiss army knife” mindset)
- Able to go deep in one area while learning others
Key missions
- Operate and maintain large-scale Linux environments, monitor system health, and ensure high availability.
- Automate operational tasks using tools such as Python, Bash, Ansible, or Terraform, and improve system lifecycle management.
- Collaborate closely with infrastructure, HPC, and research teams to ensure our clusters and platforms run reliably at scale.