Lead Platform Engineer (IC3)
Source: reddit-r-forhire
About the Role:
Above and Beyond Talent is seeking a Lead Platform Engineer (IC3).
Requirements
- Education: Bachelor’s degree in computer science, IT, or a related field.
- 5+ years of experience in software, systems, or reliability engineering roles, with multiple years of hands-on experience owning production observability, monitoring, and SLOs in distributed systems.
- Deep experience building scalable, reliable monitoring and observability solutions, including instrumentation, alerting, dashboarding, and configuration across large, complex environments.
- Hands-on expertise with modern monitoring and observability tools (e.g., OpsRamp, Grafana, Elastic, CloudWatch, Azure Monitor, BigPanda (AIOps)), and strong knowledge of metrics, logs, traces, and OpenTelemetry.
- Strong scripting and programming capability (Bash, PowerShell, and one or more languages such as Python, C-family, or JavaScript) to automate telemetry, alerting, and platform workflows.
- Strong expertise with cloud platforms (AWS and/or Azure), containers (Docker), and container orchestration (Kubernetes).
- Deep hands-on experience with Elastic Observability (APM, Logs, Metrics, Traces).
- Understanding of distributed systems fundamentals, including networking, security, databases, DevSecOps principles, and performance/capacity engineering.
- Strong communication skills, with the ability to clearly explain complex technical topics to both technical and non-technical audiences.
- Exceptional problem-solving and troubleshooting abilities, especially in high-pressure or time-sensitive environments.
- Effective prioritization and multitasking, able to manage competing deadlines while maintaining quality and focus.
- Proven cross-functional collaboration, working seamlessly with diverse teams in large, complex IT environments and driving continuous improvement across systems.
Performance Objectives / What You'll Be Doing
- Architect, deploy, and manage OpenTelemetry-based observability solutions, including instrumentation, telemetry pipelines, distributed tracing, and integrations with Elastic, Grafana, CloudWatch, Azure Monitor, OpsRamp, and BigPanda.
- Define and implement SRE best practices, including SLIs/SLOs, error budgets, reliability dashboards, alerting standards, and proactive monitoring for critical systems.
- Collaborate with engineering and product teams to embed observability, reliability, and failure-awareness into the SDLC, release processes, and business-critical workflows.
- Build and optimize automation frameworks for monitoring, incident response, self-healing workflows, capacity planning, and operational efficiency to reduce toil and MTTR.
- Develop and maintain enterprise observability documentation, dashboards, runbooks, and governance standards while mentoring engineers and promoting operational excellence.
Location
Considering candidates who live within 50 miles (commutable distance) of our client's corporate offices in Evansville, IN; Baltimore, MD; Wilmington, DE; Charlotte, NC; and Irving, TX.
Compensation
Pay Rate: $65.00 to $70.00
Other
- Must be authorized to work in the US
To Apply
Email your resume and availability to apply@aandbtalent.com