The Site Reliability Engineering team is part of the Digital Enterprise Technology Platform Engineering organization. It maintains and develops the IT monitoring and log analytics platform that keeps Enterprise IT services reliable.

We're looking for a self-starter who can take ownership of tasks, work under pressure, and balance multiple assignments simultaneously while maintaining a positive outlook. You'll contribute ideas and feedback on the vision for IT monitoring systems and provide expertise for IT projects and enhancements across various IT organizations.

Responsibilities:

Manage, assess, plan, and support core observability platform operations
Lead process changes and implementations related to the monitoring platform
Provide escalation support for configuration and platform issues, participating in on-call schedules to resolve major incidents
Collaborate with key stakeholders (Service Managers, Product Managers, Application Architects, Business Support, and Operations) to gather and develop requirements
Develop AI, automation, and integrations to deliver custom monitoring requirements
Work with third-party vendors and partners to address platform-related enhancements
Support and manage the introduction of new monitoring tools and orchestrate migrations as aging software is retired
Periodically present reports on monitoring event and correlation metrics to the Enterprise Operations team
Work within an Agile Scrum methodology and provide guidance to junior team members
Create standard operating procedures and share them with the team for effective execution

Minimum Qualifications:

Bachelor's degree in Computer Science or related technical field, or equivalent experience in technical leadership
5-8 years of experience designing and implementing distributed systems to handle large-scale telemetry and log data
Demonstrable ability in Bash/PowerShell, Python, and JavaScript (Node.js), especially program comprehension
Understanding of REST-based API design principles and best practices
Experience with server administration (Linux and Windows)
Knowledge of monitoring tools like Zabbix, Splunk, Grafana, New Relic, or ThousandEyes
Experience with AWS public cloud and VMware vSphere
Knowledge of configuration management and orchestration tools like Puppet, Ansible, or Terraform
Experience with Docker and containerized applications
Strong troubleshooting and debugging skills (reading log files, analyzing memory leaks)
Strong analytical skills and ability to gather and synthesize data for review
Ability to problem-solve in a fast-paced environment and shift gears effectively
Subject matter expertise in at least one monitoring and telemetry product

Preferred Qualifications:

Experience with AI and machine learning applications in operations
Experience with predictive monitoring and auto-healing solutions
Master's degree in Computer Science or related field
Experience translating technical concepts into visual representations

For roles in San Francisco and Los Angeles: Pursuant to the San Francisco Fair Chance Ordinance and the Los Angeles Fair Chance Initiative for Hiring, Salesforce will consider for employment qualified applicants with arrest and conviction records.
