Site Reliability Engineer

Company

iCIMS

Function

Engineering

Level

Location

Holmdel, New Jersey

Job Summary

We are seeking a skilled Site Reliability Engineer (SRE) to contribute to the reliability, scalability, and performance of our multi-cloud SaaS platform, which serves thousands of customers worldwide. This role involves hands-on technical work in incident response, system monitoring, automation, and continuous improvement of platform reliability. The successful candidate will work within a global SRE team to ensure optimal system performance and customer satisfaction.

Responsibilities

System Monitoring & Reliability:
- Monitor multi-cloud infrastructure (AWS, Azure, GCP) using New Relic, Grafana, and Sumo Logic
- Maintain reliability of AWS resources, Auth0/Okta authentication, databases, and legacy applications
- Implement monitoring, alerting, and dashboards for assigned systems

Incident Management & Response:
- Respond to alerts and incidents within SLA timeframes
- Perform root cause analysis and document findings
- Create and maintain runbooks and troubleshooting procedures
- Participate in 24/7 on-call rotation

Automation & Improvement:
- Develop scripts to reduce manual operational overhead
- Build monitoring and alerting solutions
- Support infrastructure-as-code initiatives
- Implement automated remediation where possible

Success Metrics:
- Customer Impact: Reduced MTTR and improved customer satisfaction scores
- Reliability: Achievement of 99.9%+ uptime SLAs across all products and regions
- Proactive Prevention: Reduction in incident frequency through automated detection and prevention
- Cross-functional Collaboration: Improved partnership metrics with Product, Engineering, and Customer Success teams
- Automation Delivery: Completion of assigned automation projects that reduce manual tasks
- Knowledge Sharing: Contributions to the team knowledge base and mentoring of junior engineers

Qualifications

- 4+ years of experience in SRE, DevOps, or Infrastructure Engineering
- Hands-on experience with AWS (required) and Azure (preferred)
- Strong Linux system administration skills
- Experience with monitoring tools (New Relic, Grafana, Prometheus)
- Scripting skills in Python, Bash, or a similar language
- Knowledge of databases (SQL Server, PostgreSQL, MongoDB)
