Site Reliability Engineer (Intermediate)

Holistic Approach to Systems | Operations is a Software Problem | Team-Oriented Communication

DigitalEd has a simple and resonant purpose: to shape the world through digital learning. We are a SaaS company in the online learning market, and our Möbius platform is a comprehensive solution designed for the unique needs of teaching science, technology, engineering, and mathematics (STEM).

A Site Reliability Engineer at DigitalEd is responsible for ensuring our production systems meet our customers' uptime and service needs, using software engineering tools and capabilities. They are pragmatic, objective, and articulate, with strong communication and teamwork capabilities. They create effective tooling and automation that enable our teams to give our customers a compelling and seamless experience with the Möbius platform.

The SRE team designs, deploys, and manages DigitalEd’s internal Private Cloud Infrastructure as well as our customer-facing Google Cloud Platform SaaS application infrastructure. We anticipate that the person in this role will spend no more than 30 to 50% of their time on “ops”-related work, and the rest of their time on software development to improve the scalability, reliability, and availability of the Möbius application.

Outcomes and Key Responsibilities: The Impact You'll Have

  • System Design: Engage in and improve the whole lifecycle of our service — from inception and design, through deployment, operation and refinement. Identify areas of opportunity to programmatically automate cloud deployment, administration, and monitoring tasks.

  • System Support: Support our service through system design consulting, developing software platforms and frameworks, and capacity planning. Investigate and troubleshoot cloud component performance. Leverage your experience to find the root causes of defects and work to proactively address them. Practice sustainable incident response and blameless incident reviews.

  • System Maintenance: Maintain our service by measuring and monitoring availability, latency, and overall system health; support on-call rotations with operational duties that have not been addressed with automation.

Measures of Performance: How You Know You're Doing Well

  • Process Execution: Every project, automation task, and incident is executed well and completely. We ensure that all work in our system is done to the best of our ability given our knowledge, tooling, and experience.

  • Customer Satisfaction: A desire to ensure a high quality of service to provide the best customer experience, by continually finding the next problem to solve, and solving it well.

  • Effective Cooperation: Working with Engineering and Customer Support continually to ensure our customers' needs are met and exceeded.

Experience & Competencies: The 'Stuff' that Makes You Great at This

  • An understanding that system failure is normal, and the ability to embrace risk as part of the job.

  • Demonstrated success in working through blameless incident review processes, using techniques such as “the infinite hows.”

  • Systematic problem-solving approach, coupled with strong communication skills and a sense of ownership and drive.

  • Ability to debug and optimize code and automate routine tasks.

  • Ability to see the system as a whole and treat its interconnections with as much attention and respect as the components themselves.

  • Strong desire to automate.

  • A commitment to never stop learning, staying current with the latest DevOps/SRE technologies and functionalities, as well as with the ever-evolving needs of our customers.

The Technical Piece: The Knowledge and Exposure that this Role Can't Operate Without

  • Several years in an operational role, be it DevOps, SRE, or traditional network/server management.

  • Advanced expertise with at least one programming language, with a preference for Java and Python; polyglot preferred.

  • Extensive experience in Linux.

  • Experience with cloud platforms, preference for GCP.

  • Experience with containers and orchestration (Docker, Kubernetes).

  • Experience with database management (PostgreSQL).

  • Experience with infrastructure as code (IaC) (Terraform, Puppet, Git).

  • Experience with general networking concepts and protocols, and storage fundamentals.

Tech Stack - What You'll Be Using

  • Apache, Confluence, Docker, Docker Swarm, GCE, GCP, GCR, Git Bash, GKE, HAProxy, Java, Jira, Kubernetes (K8s), Linux, Opsgenie, PostgreSQL, Puppet, Python, Terraform, and yes, we felt it was necessary to alphabetize this list.

The Culture Piece: What it's Like to Work Here

The spirit of our culture is rooted in ‘No Deposit, No Return’. If you don’t put anything into your professional experience, you won’t get anything out of it. We are a team working towards one goal: a better learning experience for students everywhere. To bring this to life, we lean on our core values of Customer Orientation, Curiosity, Teamwork, Adaptability, Ownership and Coaching. If any of these words strike a chord, then we’ve got something in common.

The majority of our team is located in or around Waterloo, Ontario, Canada, but we also have team members throughout the UK. We are currently operating as a remote workforce, and intend to re-open the Waterloo office for a hybrid working model when it is safe to do so. In terms of work location, we are open to a remote or in-person team member, within the Eastern Standard Time zone, in Ontario.

Lastly, we welcome individuals of all backgrounds, experiences, and perspectives to apply. If you require any form of accommodation during the application process, don’t hesitate to let us know and we’ll work to ensure it’s a positive experience for you.

Read through this posting and not sure if you’re qualified? Apply anyways. You never know where it could go, and we promise to read and review every application that comes through—with a magnifying glass we like to call the ‘Potential’ Detector. Everyone has a great story, and we’d love to hear yours.

Send your resume to and include a few words on why this role caught your eye. Within 7 days, you'll find out if you're moving forward in the process or not. All interviews will be held via Zoom video conference, and candidates can expect to meet various members of our team as we embark on a remote recruitment process to find the next great Site Reliability Engineer to join DigitalEd.

Sound like a good fit? E-mail us at to apply
