Site Reliability Engineer
Company
iCIMS
Function
Engineering
Level
Location
Holmdel, New Jersey
Job Summary
We are seeking a skilled Site Reliability Engineer (SRE) to contribute to the reliability, scalability, and performance of our multi-cloud SaaS platform serving thousands of customers worldwide. This role involves hands-on technical work in incident response, system monitoring, automation, and continuous improvement of platform reliability. The successful candidate will work within a global SRE team to ensure optimal system performance and customer satisfaction.
Responsibilities
- System Monitoring & Reliability:
  - Monitor multi-cloud infrastructure (AWS, Azure, GCP) using New Relic, Grafana, and Sumo Logic
  - Maintain reliability of AWS resources, Auth0/Okta authentication, databases, and legacy applications
  - Implement monitoring, alerting, and dashboards for assigned systems
- Incident Management & Response:
  - Respond to alerts and incidents within SLA timeframes
  - Perform root cause analysis and document findings
  - Create and maintain runbooks and troubleshooting procedures
  - Participate in 24/7 on-call rotation
- Automation & Improvement:
  - Develop scripts to reduce manual operational overhead
  - Build monitoring and alerting solutions
  - Support infrastructure-as-code initiatives
  - Implement automated remediation where possible
- Success Metrics:
  - Customer Impact: Reduced MTTR and improved customer satisfaction scores
  - Reliability: Achievement of 99.9%+ uptime SLAs across all products and regions
  - Proactive Prevention: Reduction in incident frequency through automated detection and prevention
  - Cross-functional Collaboration: Improved partnership metrics with Product, Engineering, and Customer Success teams
  - Automation Delivery: Complete assigned automation projects to reduce manual tasks
  - Knowledge Sharing: Contribute to team knowledge base and mentor junior engineers
Qualifications
- 4+ years of experience in SRE, DevOps, or Infrastructure Engineering
- Hands-on experience with AWS (required) and Azure (preferred)
- Strong Linux system administration skills
- Experience with monitoring tools (New Relic, Grafana, Prometheus)
- Scripting skills in Python, Bash, or similar
- Knowledge of databases (SQL Server, PostgreSQL, MongoDB)