Site Reliability Engineer

Company

iCIMS

Function

Engineering

Level

Location

Holmdel, New Jersey

Job Summary

We are seeking a skilled Site Reliability Engineer (SRE) to contribute to the reliability, scalability, and performance of our multi-cloud SaaS platform, which serves thousands of customers worldwide. This role involves hands-on technical work in incident response, system monitoring, automation, and continuous improvement of platform reliability. The successful candidate will work within a global SRE team to ensure optimal system performance and customer satisfaction.

Responsibilities

  • System Monitoring & Reliability:
    • Monitor multi-cloud infrastructure (AWS, Azure, GCP) using New Relic, Grafana, and Sumo Logic
    • Maintain reliability of AWS resources, Auth0/Okta authentication, databases, and legacy applications
    • Implement monitoring, alerting, and dashboards for assigned systems
  • Incident Management & Response:
    • Respond to alerts and incidents within SLA timeframes
    • Perform root cause analysis and document findings
    • Create and maintain runbooks and troubleshooting procedures
    • Participate in 24/7 on-call rotation
  • Automation & Improvement:
    • Develop scripts to reduce manual operational overhead
    • Build monitoring and alerting solutions
    • Support infrastructure-as-code initiatives
    • Implement automated remediation where possible
  • Success Metrics:
    • Customer Impact: Reduced MTTR and improved customer satisfaction scores
    • Reliability: Achievement of 99.9%+ uptime SLAs across all products and regions
    • Proactive Prevention: Reduction in incident frequency through automated detection and prevention
    • Cross-functional Collaboration: Improved partnership metrics with Product, Engineering, and Customer Success teams
    • Automation Delivery: Completion of assigned automation projects that reduce manual tasks
    • Knowledge Sharing: Contributions to the team knowledge base and mentoring of junior engineers

Qualifications

  • 4+ years of experience in SRE, DevOps, or Infrastructure Engineering
  • Hands-on experience with AWS (required) and Azure (preferred)
  • Strong Linux system administration skills
  • Experience with monitoring tools (New Relic, Grafana, Prometheus)
  • Scripting skills in Python, Bash, or similar
  • Knowledge of databases (SQL Server, PostgreSQL, MongoDB)