Site Reliability Engineering Manager
Stord
Stord is The Consumer Experience Company, powering seamless checkout through delivery for today's leading brands. Stord is growing rapidly and is on track to double its revenue in the next 18 months. To meet and exceed this target, we are strategically scaling teams across the entire company and seeking energetic experts to help us achieve our mission.
By combining comprehensive commerce-enablement technology with high-volume fulfillment services, Stord provides brands a platform to compete with retail giants. Stord manages over $10 billion of commerce annually through its fulfillment, warehousing, transportation, and operator-built software suite including OMS, Pre- and Post-Purchase, and WMS platforms. Stord is leveling the playing field for all brands to deliver the best consumer experience at scale.
With Stord, brands can increase cart conversion, improve unit economics, and drive sustained customer loyalty. Stord’s end-to-end commerce solutions combine best-in-class omnichannel fulfillment and shipping with leading technology to ensure fast shipping, reliable delivery promises, easy access to more channels, and improved margins on every order.
Hundreds of leading DTC and B2B companies like AG1, True Classic, Native, Seed Health, quip, goodr, Sundays for Dogs, and more trust Stord to deliver industry-leading consumer experiences on every order. Stord is headquartered in Atlanta with facilities across the United States, Canada, and Europe. Stord is backed by top-tier investors including Kleiner Perkins, Franklin Templeton, Founders Fund, Strike Capital, Baillie Gifford, and Salesforce Ventures.
We are seeking an experienced and strategic SRE Manager to lead our growing site reliability engineering team. This role combines deep technical expertise with strong leadership skills to drive the reliability, scalability, and performance of our production systems at scale. You'll be responsible for building and mentoring a high-performing team of SREs while setting the technical vision and strategy for our infrastructure platform.
This position requires someone who can balance hands-on technical contributions with people management, strategic planning, and cross-functional collaboration in our fast-paced, high-autonomy environment.
What You'll Do:
Team Leadership & People Management
Build, lead, and scale a team of SREs
Provide career development, mentoring, and technical guidance to team members
Establish hiring practices and interview processes to attract top SRE talent
Foster a culture of reliability, automation, and continuous improvement
Manage team performance, conduct reviews, and facilitate professional growth
Define on-call practices and ensure sustainable operational load across the team
Strategic Planning & Technical Vision
Develop and execute the long-term infrastructure and reliability strategy
Establish reliability standards, SLOs, and engineering practices across the organization
Drive architectural decisions for scalable, multi-region infrastructure on GCP
Partner with engineering leadership to align infrastructure roadmap with business objectives
Evaluate and introduce new technologies, tools, and practices to improve team effectiveness
Lead capacity planning and infrastructure cost optimization initiatives
Cross-Functional Collaboration
Work closely with development teams to embed reliability practices into the software development lifecycle
Collaborate with Product, Security, and Compliance teams on infrastructure requirements
Represent the SRE team in engineering leadership meetings and strategic planning sessions
Drive incident response processes and lead major incident coordination
Establish SLAs and communication protocols with internal stakeholders
Technical Excellence & Oversight
Maintain hands-on technical involvement in critical infrastructure decisions
Review and approve major architectural changes and infrastructure proposals
Ensure implementation of best practices for Infrastructure as Code, monitoring, and automation
Drive the adoption of chaos engineering, disaster recovery, and business continuity practices
Oversee security hardening and compliance efforts across infrastructure systems
What You'll Need:
Leadership & Management Experience
3+ years of experience managing and leading technical teams (5+ people)
Proven track record of building and scaling SRE, platform, or infrastructure teams
Experience with hiring, performance management, and career development of technical staff
Strong ability to balance hands-on technical work with people management responsibilities
Experience leading incident response and managing high-stakes technical escalations
Technical Expertise
8+ years of experience in site reliability, platform engineering, or infrastructure roles
Deep expertise with cloud platforms, particularly Google Cloud Platform (GCP)
Strong proficiency in multiple programming languages (Python, Go, Java, etc.)
Extensive experience with containerization (Docker), orchestration (Kubernetes), and microservices
Expert-level knowledge of Infrastructure as Code (Terraform, CloudFormation, Pulumi)
Advanced understanding of monitoring, observability, and distributed systems architecture
Experience with CI/CD pipelines, automation frameworks, and DevOps practices
Strategic & Communication Skills
Ability to translate technical concepts into business value and communicate with executive leadership
Experience developing technical roadmaps and long-term strategic plans
Strong project management skills and experience with agile/scrum methodologies
Excellent written and verbal communication skills for technical and non-technical audiences
Experience with budget management and vendor relationships
Preferred Qualifications:
Experience managing teams in high-growth startup or scale-up environments
Background in managing distributed teams and remote-first engineering cultures
Advanced GCP certifications (Professional Cloud Architect, Cloud DevOps Engineer)
Experience with multi-cloud architectures and cloud migration strategies
Knowledge of modern data infrastructure (BigQuery, streaming platforms, data pipelines)
Previous experience as a technical lead or principal engineer before transitioning to management
Familiarity with functional programming languages and event-driven architectures