Skills
Job Description
Job Description: Datacenter Observability and Site Reliability Engineer
Roles and Responsibilities:
Observability and Monitoring:
• Design, implement, and maintain observability solutions for datacenter infrastructure.
• Develop, deploy, and maintain the operational and reliability components of a large-scale Observability and Telemetry collection platform, emphasizing performance at scale, real-time monitoring, logging, and alerting.
• Participate in and enhance the entire lifecycle of services, from inception and design to deployment, operation, and refinement.
• Develop and optimize monitoring systems to ensure high availability and performance.
• Create and manage dashboards, alerts, and reports to provide visibility into system health and performance.
Site Reliability Engineering (SRE):
• Implement SRE best practices to improve the reliability, scalability, and performance of datacenter services.
• Develop and maintain automation scripts for infrastructure provisioning, monitoring, and management.
• Conduct root cause analysis and post-mortem reviews to prevent recurrence of incidents.
Performance Optimization:
• Analyze and optimize the performance of datacenter systems and applications.
• Implement best practices for resource utilization and efficiency.
Collaboration:
• Work closely with other engineering teams to understand and meet their observability and reliability requirements.
• Collaborate with hardware and software vendors to evaluate and integrate new technologies.
Security and Compliance:
• Ensure that observability and reliability solutions comply with security policies and industry standards.
• Implement and maintain security measures to protect data and infrastructure.
Troubleshooting and Support:
• Provide support for observability and reliability-related issues, including debugging and resolving hardware and software problems.
• Develop and maintain documentation for troubleshooting procedures and best practices.
Continuous Improvement:
• Stay updated with the latest advancements in observability and SRE technologies and integrate them into the infrastructure.
• Continuously improve the reliability, scalability, and performance of datacenter services.
Job Requirements
Technical Skills:
• Proficiency in observability tools and technologies (e.g., Prometheus, Grafana, ELK Stack).
• Experience with SRE practices and tools (e.g., Kubernetes, Docker, Terraform).
• Strong programming and scripting skills (e.g., Python, Go, Bash).
• Familiarity with cloud platforms (AWS, Azure, GCP) and their observability and reliability services.
Soft Skills:
• Strong problem-solving skills and attention to detail.
• Excellent communication and collaboration skills.
• Ability to work in a fast-paced, dynamic environment.
Working Hours
Weekdays: Monday to Friday
Daily: 08:30 to 18:00
Candidate Benefits
- No probationary period; full-time position with 100% salary.
- Opportunity to work in teams with many leading experts in the IT field domestically and internationally.
- Opportunity to carry out ambitious projects in many countries, access the latest technologies and learn from talented colleagues.
- Work in a young, dynamic, modern, and multicultural environment, with regular social activities and holiday events.
- Opportunity to advance according to ability with corresponding rank and salary increases.
- Access to soft-skills training courses (logical thinking, creative thinking, communication, project management, negotiation, etc.) and Japanese language classes.
- And many other attractive benefits...
Work Location
Remote