Senior Site Reliability Engineer Senior Manager
Source: remoteok
About the Role
At Accenture Federal Services, we are dedicated to helping the US federal government strengthen the nation, enhance safety, and improve the lives of its citizens. Our team of 13,000+ professionals is united by a shared purpose: to harness the potential of technology and ingenuity for our clients across defense, national security, public safety, civilian, and military health organizations.
Join Accenture Federal Services, a technology company and part of global Accenture, to contribute to meaningful work within a collaborative and supportive community. You'll have opportunities to grow, learn, and thrive through hands-on experience, certifications, industry training, and more. Help us drive positive, lasting change that moves missions and the government forward!
We are seeking a Senior Site Reliability Engineer (SRE) with deep expertise in building and maintaining reliable, scalable systems and a passion for optimizing the performance, reliability, and efficiency of technical infrastructure. The ideal candidate will have a strong background in site reliability engineering principles, extensive experience with automation, and a proven ability to collaborate across teams to ensure seamless service delivery.
Responsibilities
- Design, build, and maintain reliable, scalable, and high-performance infrastructure and services to support business needs.
- Implement and advocate for SRE best practices, including automation, CI/CD pipelines, monitoring, and incident management.
- Collaborate with cross-functional teams to develop systems that meet high availability, performance, and reliability standards.
- Drive incident management processes, including root cause analysis, mitigation strategies, and long-term preventive measures.
- Establish, monitor, and refine service level objectives (SLOs), service level agreements (SLAs), and key performance indicators (KPIs) to ensure systems adhere to reliability and performance targets.
- Automate repetitive tasks to improve operational efficiency and reduce manual intervention.
- Build and maintain robust monitoring, logging, and alerting systems to ensure visibility into system performance and reliability.
- Provide technical mentorship and guidance to team members, fostering a culture of knowledge sharing and continuous improvement.
- Act as a technical leader by driving solutions to complex challenges, ensuring alignment with organizational goals.
- Prepare and deliver performance and reliability reports to stakeholders, offering insights and recommendations for improvements.
Requirements
- Proven experience in site reliability engineering or a similar role, with a focus on application and infrastructure scalability, reliability, and performance.
- Strong knowledge of ITSM principles and incident management processes.
- Expertise in automation tools, scripting, and infrastructure-as-code (IaC) technologies.
- Proficiency with monitoring and observability tools (e.g., Prometheus, Grafana, Datadog, Splunk).
- Experience with cloud platforms (e.g., AWS, Azure, GCP) and container technologies (e.g., Docker, Kubernetes).
- Strong analytical and problem-solving skills, with the ability to troubleshoot complex systems.
- Excellent communication and collaboration abilities, with a focus on cross-team partnerships.
- A passion for continuous learning, innovation, and driving improvements.