Site Reliability Engineer
Job description
We are seeking a Site Reliability Engineer responsible for designing, building, running, and monitoring public cloud infrastructure to support a variety of mission-critical services. This is a highly technical, hands-on role that requires expertise in supporting systems at enterprise scale. You will deliver innovative solutions in engineering, reliability, monitoring, automation and orchestration.
Experience: 2+ years, with a bachelor's degree in any discipline.
Location: Onsite/Remote
Responsibilities:
- Engineer and support cloud platform IaaS and PaaS services
- Partner with application teams to provision scalable workloads reliably across distributed compute resources
- Provide engineering and operational support for distributed systems and network-based information security tools, including configuration management and provisioning
- Implement and maintain security controls
- Work closely with development teams to understand application performance and behaviour patterns to proactively monitor, tune and correct issues before they occur
- Identify opportunities to improve the reliability, performance and security of security tooling
- Develop tools and automation to eliminate manual and repetitive efforts
Skills Required:
- 2+ years of experience in Software Engineering and Systems Engineering to manage operations
- Experience supporting infrastructure and services in public and private cloud environments (Azure, AWS, GCP, OpenStack etc.)
- Proficiency in programming languages such as Python, Java, Ruby, Perl or Go, and in build tooling such as Make, for building automation or integrating with APIs
- Experience with common data formats such as JSON and YAML, and with compression utilities
- Expertise with monitoring or log aggregation tools (Prometheus, Grafana, Splunk, ELK, etc.)
- Expertise in key SRE skills (scalability, reliability and observability) and 24x7 on-call processes
- Familiarity with CI/CD tools and deployment processes
- Solid understanding and experience with centralized configuration management, coordination and provisioning technologies, such as Ansible, Chef, Puppet, etc.
- Experience implementing and working with open source frameworks
- Excellent communication skills; must be capable of working with cross-functional technical and business teams and varying levels of management
- Understanding of Agile methodologies such as Scrum, and the ability to work in a fast-paced environment
- Strong project management skills, including excellent presentation skills
- Must be capable of writing detailed solution specifications, diagrams, best practices/standards documentation, operating procedures, test plans/test reports, etc.
- Solid understanding of Linux/Unix system internals, including kernel tuning
- Experience with failure testing and chaos engineering
- Working knowledge of network protocols and network-based services, including routing and network load balancing
- Experience building and supporting containerized applications on platforms such as GKE, EKS and ECS