Job Description

Our client is seeking a Senior Site Reliability Engineer to help define, lead, and continuously improve operational best practices using modern Site Reliability Engineering principles, with a strong emphasis on AWS-based cloud infrastructure. This role partners closely with engineering, production support, and technology leadership to design, implement, and operate highly reliable, secure, scalable, and cost-effective systems supporting a complex application ecosystem and software delivery lifecycle.

The Senior SRE will influence cloud architecture decisions, lead complex infrastructure initiatives, and drive long-term improvements in reliability, observability, automation, and cost efficiency. This is a senior individual contributor role with broad technical ownership and organizational influence.

Key Responsibilities

- Contribute to the design, evolution, and operational health of a large-scale AWS environment, including architecture standards and best practices
- Design, implement, and optimize AWS-based infrastructure using services such as EC2, ECS/EKS, Lambda, RDS, S3, IAM, VPC, and CloudWatch
- Build and manage cloud infrastructure using Infrastructure as Code tools such as Terraform, CloudFormation, or equivalent
- Lead new platform implementations and major reliability initiatives as a subject-matter expert in AWS and SRE practices
- Monitor, analyze, and optimize cloud spend, balancing performance, reliability, and cost efficiency
- Apply and mature SRE principles to improve system availability, scalability, performance, security, and observability
- Design and implement automation to reduce operational toil and improve system efficiency
- Provide advanced operational support for cloud-hosted and hybrid platforms
- Define and improve monitoring, alerting, logging, and incident response practices
- Lead complex production incidents, perform root cause analysis, and drive corrective and preventive actions
- Mentor junior and mid-level engineers through technical guidance and best-practice leadership, without direct people management
- Collaborate with engineering, QA, security, and business teams to embed reliability throughout the software delivery lifecycle
- Ensure systems and data handling meet applicable legal, regulatory, and security requirements
- Improve production engineering processes including change and configuration management, observability, incident response, disaster recovery, capacity planning, performance tuning, and deployment automation
- Participate in a sustainable on-call rotation and help reduce alert fatigue over time
- Act as a change agent for long-term technical strategy, identifying risks, dependencies, and improvement opportunities

Required Qualifications

- Bachelor's degree in Computer Science, Information Systems, Engineering, or equivalent practical experience
- Seven or more years of experience delivering technical solutions in production environments
- Three or more years of hands-on Site Reliability Engineering experience
- Extensive experience designing, operating, and scaling production AWS environments
- Strong expertise with Infrastructure as Code and modern cloud deployment patterns
- Proven ability to diagnose and resolve complex issues in distributed systems
- Experience leading incidents and driving post-incident improvements
- Ability to work independently, prioritize effectively, and manage multiple initiatives
- Strong written and verbal communication skills with both technical and non-technical stakeholders

Preferred Qualifications

- AWS certifications such as Solutions Architect, DevOps Engineer, or SysOps Administrator
- Experience working in regulated industries such as healthcare, financial services, or similar environments
- Familiarity with modern application stacks and supporting tools including CI/CD pipelines, version control systems, observability platforms, containerization, orchestration technologies, and identity and access management solutions
- Experience working in Agile or Scrum-based delivery environments

What Success Looks Like

- Production systems are stable, observable, and resilient
- Incidents are handled effectively and result in measurable reliability improvements
- Infrastructure scales predictably while remaining cost-conscious
- Engineering teams are supported by clear standards, automation, and reliability tooling
- Reliability is embedded into the delivery process rather than treated as an afterthought

Job Title

Company : Oxenham Group LLC

Location : Melbourne, FL

Created : 2026-04-13

Job Type : Full Time