As a Slack Proactive Monitoring (ProM) Engineer, you will operate at the intersection of platform engineering, site reliability, and strategic customer success. You are a highly technical, customer-centric specialist dedicated to safeguarding the health and performance of Slack’s largest and most complex global enterprise deployments (Enterprise Grid).

Rather than waiting for customers to report problems, you will continuously monitor Slack workspace metrics, performance thresholds, API integrations, enterprise security events, and custom app behavior. You will proactively detect anomalies, triage system exceptions, and orchestrate rapid mitigation steps—frequently initiating preemptive customer outreach before their internal users even experience a slowdown.

Key Responsibilities:

Continuously monitor dashboards, alerting systems, and telemetry data (error rates, latency spikes, API failures, deployment anomalies) for early signals of degradation.
Triage and correlate alerts from multiple sources (Splunk, internal tools, etc.) to identify patterns before customers report issues.
Actively monitor Slack platform health dashboards, network latency signals, message delivery queues, and database capacities for high-frequency workspaces.
Monitor critical custom automations, Slack Workflow Builder runs, Enterprise Key Management (EKM) operations, and Identity Provider (IDP) authentication syncs.
Identify customers potentially affected by degraded service conditions and coordinate proactive outreach with Customer Success and Support teams.
Partner with the Incident Management team to escalate signals that meet incident-threshold criteria.
Partner with Customer Success Managers and Success Architects to deliver annual technical health check reviews, assessing platform metrics, configuration limits, and custom integration health.
Perform root cause analysis (RCA) on proactively detected issues, documenting findings in internal case and incident management systems.
Work closely with Engineering and SRE teams to drive rapid remediation of identified issues.
Intervene in low-risk system exceptions (e.g., advising clients on misconfigured Slack Webhooks, API rate limit exhaustion, or broken Salesforce-Slack app connections) before they trigger widespread downtime.
Build and maintain Slack-based automations and workflows to streamline proactive monitoring operations.

Requirements:

2+ years of experience in technical support, site reliability engineering, or a related operations role.
Hands-on experience with observability and monitoring tools (e.g., Grafana, Splunk, Datadog, PagerDuty, or equivalent).
Strong understanding of cloud-based SaaS architecture, APIs, and common failure modes.
Proficiency in reading and analyzing logs, metrics, and traces.
Excellent written and verbal communication skills; ability to clearly convey technical findings to both technical and non-technical audiences.
Demonstrated ability to leverage modern AI tools to optimize workflows, conduct research, and enhance daily productivity.

Preferred Requirements:

Experience working with the Slack platform (Slack API, Slack workflows, Bolt framework).
Familiarity with Salesforce Service Cloud / OrgCS case management.
Scripting or automation experience (Python, JavaScript, Bash).
Experience in a customer-facing support engineering or reliability role at a SaaS company.
ITIL, SRE, or similar certification.

