SRE Architect for GenAI Integration Platform
We are seeking a highly skilled Site Reliability Engineer/Architect (SRE) to join our innovative and fast-paced team.
In this role, you will be responsible for designing and implementing modern SRE practices to enhance the reliability and scalability of our enterprise-grade Generative AI (GenAI) integration platform. You will play a vital role in driving operational excellence by adopting advanced methodologies and tools while collaborating with key stakeholders across technical and business units.
Responsibilities
- Define Service Level Objectives (SLOs) and Service Level Indicators (SLIs) to establish reliability standards and monitor system health
- Architect resilient production systems using methodologies like canary deployments, shadow traffic, and testing-in-production
- Develop incident management strategies and automate on-call operations to minimize downtime and improve system stability
- Enhance observability frameworks with logging, tracing, and monitoring for real-time visibility and proactive troubleshooting
- Automate tasks related to scalability, performance optimization, and operational processes for improved efficiency
- Collaborate with engineering teams to integrate SRE principles into system design and development
- Provide strategic leadership for implementing site reliability solutions in multi-cloud, multi-tenant environments for enterprise applications
- Advise executive stakeholders, providing insights and recommendations that align SRE strategies with organizational goals
- Promote a culture of innovation and operational reliability through mentoring and industry-leading best practices
- Ensure the platform’s infrastructure supports high availability and scalability in partnership with architecture and DevOps teams
- Drive continuous improvement by identifying opportunities for process innovation and optimization
Requirements
- 10+ years of professional experience in SRE, DevOps, or related areas, including managing production systems
- Expertise in SRE practices such as SLOs, SLIs, canary testing, and incident management
- Proficiency with cloud technologies like AWS, Google Cloud Platform, or Azure, with hands-on experience in multi-cloud setups
- Background in observability tools such as Prometheus, Grafana, or ELK Stack, as well as monitoring distributed systems
- Skills in automation and orchestration tools such as Terraform, Ansible, or Kubernetes to support infrastructure-as-code adoption
- Familiarity with programming languages like Python, Go, or Bash for building automation solutions
- Strong understanding of CI/CD pipelines, containerization technologies, and orchestration frameworks
- Competency in system architecture for fault tolerance, redundancy, and performance optimization
- History of collaborating effectively with diverse stakeholders, from technical teams to executive management
- Background in managing enterprise-scale systems and multi-tenant platform deployments
Nice to have
- Knowledge of Generative AI platforms and integration techniques
- Understanding of managed database services, including Amazon RDS, Google Spanner, or Azure SQL
- Familiarity with security practices for enterprise platforms and multi-cloud infrastructures
- Background in contributing to technical roadmaps for distributed systems at scale
- Capability to lead initiatives involving Chaos Engineering or disaster recovery strategies