DevOps Engineer (GCP & Kubernetes)

Required Qualifications

3+ years of experience in DevOps, Cloud Engineering, Site Reliability Engineering, or similar roles.
Hands-on experience with Google Cloud Platform (GCP).
Strong understanding of core GCP services, including:
- Compute Engine
- Cloud Run
- App Engine
- Google Kubernetes Engine (GKE)
Production experience managing Kubernetes environments.
Experience configuring Kubernetes resources such as Deployments, Services, Ingress, ConfigMaps, and Secrets, as well as autoscaling.
Solid understanding of Kubernetes health checks, including readiness and liveness probes.
Experience with Infrastructure as Code using Terraform.
Understanding of Terraform state management and multi-environment infrastructure design.
Strong Linux administration and troubleshooting skills.
Good understanding of networking concepts, including:
- VPCs
- Subnets
- Firewall rules
- Load balancing
- Private networking
Experience with monitoring, logging, and observability platforms.
Experience investigating and resolving production incidents.
Understanding of reliability concepts such as SLAs, SLOs, and SLIs.
Strong verbal and written English communication skills.

Preferred Qualifications

Experience designing highly available and globally distributed applications in GCP.
Knowledge of zero-downtime deployment strategies.
Experience supporting large-scale production environments.
Experience with multi-tenant architectures.
Scripting experience using Python, Bash, or similar languages.
Experience working in hybrid cloud/on-premises environments.
Experience participating in severity-based (SEV) incident management.
Familiarity with capacity planning and performance tuning.

Technology Stack

Cloud: Google Cloud Platform (GCP)
Containers: Kubernetes, GKE
Infrastructure as Code: Terraform
Monitoring & Observability: Grafana, Prometheus, logging platforms
Operating Systems: Linux
Incident Management: PagerDuty, ServiceNow, Slack (or equivalent tools)

Working Requirements

Availability to work within Central Time (CT) business hours.
Participation in an on-call rotation that includes coverage for one weekend day when scheduled.

What Success Looks Like

Reliable operation of production systems during periods of high traffic and critical business activity.
Fast and effective incident response and troubleshooting.
Well-automated, maintainable infrastructure managed through Infrastructure as Code.
Strong collaboration with development teams to improve reliability, scalability, and operational efficiency.

Responsibilities

Deploy, maintain, and improve cloud infrastructure in Google Cloud Platform (GCP).
Operate and support Kubernetes environments, including GKE.
Build and maintain Infrastructure as Code using Terraform.
Monitor production systems and proactively identify reliability risks.
Troubleshoot infrastructure, networking, application, and performance issues.
Participate in incident response, root cause analysis, and postmortem activities.
Implement and maintain observability solutions, dashboards, and alerting systems.
Collaborate with software engineering teams to improve deployment processes and operational excellence.
Support highly available and scalable production environments.
Contribute to automation initiatives that reduce operational overhead and improve reliability.

With over 10 years of experience driving innovation for companies ranging from startups to large corporations in the United States and Latin America, Devsu has developed high-impact solutions in industries such as entertainment, banking, healthcare, retail, education, and insurance.

At Devsu, you'll work alongside top-tier professionals, with the opportunity for continuous learning and participation in challenging, high-impact projects for global clients. Our team is present in more than 18 countries, collaborating on a variety of software products and solutions.

We are looking for a hands-on Semi-Senior DevOps Engineer to join a high-impact project supporting a global-scale sports event. This role is ideal for someone who enjoys working closely with production systems, troubleshooting complex issues, automating infrastructure, and ensuring platform reliability in mission-critical environments.