The SRE’s primary goal is to ensure the platform’s reliability, scalability, and availability. Key success metrics include system health and availability, the ability to scale infrastructure quickly and appropriately, the cost of running services, service quality, and platform health.
- Develop internal automation for monitoring, setup, and statistics
- Set up automated systems to control infrastructure
- Monitor the health of live production systems
- Act as first responder to infrastructure and platform failures
- Resolve problems in collaboration with the operations and development teams
- Help developers with deployment and integration
- Participate in the on-call rotation