About the Company
We’re Arcules: an innovative, bold member of the Canon family. We move fast, operate on trust, and value our employees. Our engineering team is passionate about what they do at work and play. So come as you are, and join us on this path to transform video into intelligence with cloud-native development and bleeding-edge technologies. Let’s grow together.
Arcules offers excellent benefits, including a top-tier PPO medical plan, four weeks of vacation, three weeks of sick leave, a 401(k) plan after three months of employment (4% company match), an on-site gym and game pavilion, an awesome work environment, and more.
Overview of the Job
The Site Reliability Engineering team at Arcules provides leadership, direction and accountability for platform architecture, system design and end-to-end implementation to meet and exceed the product's non-functional requirements, including quality, security, reliability, availability and performance. SREs allow Product Development teams to focus on shipping a product with reliable velocity.
The Site Reliability Engineer (SRE) level 3 on our dynamic SRE team drives the SRE charter by using software engineering to enable automation and efficiency in all aspects of platform change management and operations. The main responsibilities include, but are not limited to, optimizing day-to-day activities and processes through automation to reliably support product rollout and operation, and mentoring other SRE staff to adopt and implement the DevOps culture.
You will identify opportunities to design, build and implement innovative solutions to unique platform and infrastructure problems, enhancing developer workflow and production stability for the products. You will collaborate with other senior team members to evangelize the SRE mindset and system design, and to implement technology solutions that maximize the performance and availability of our environment.
Responsibilities
- Design and implement orchestration and tooling solutions so that repetitive administration tasks are performed efficiently and free of defects
- Design and implement monitoring and recovery tools to provide for site high availability (HA) and disaster recovery (DR)
- Design and develop highly available infrastructure and platform components to meet the needs of our growing and evolving product lines
- Design and implement security engineering best practices across all our deployed platforms and environments
- Triage alerts, diagnose and resolve critical issues, and manage the implementation of changes
- Manage the coordination, documentation, and tracking of critical incidents, ensuring rapid and complete issue resolution and closing the loop with customers and other key stakeholders
- Develop a continuous integration/continuous deployment (CI/CD) orchestration system to reduce friction in delivering software to production
- Evangelize the DevOps culture and SRE mindset, and mentor others about reliability and best practices.
- Identify opportunities for automation, signal noise reduction, and the elimination of recurring issues, and work with engineering to implement them, in order to reduce time to mitigate service-impacting events and increase the productivity of cloud operations and development resources
- Maintain a strong understanding of IaaS, PaaS, and SaaS offerings for building and maintaining a state-of-the-art, cloud-based environment for massive-scale data processing
- Ensure that implementations and solutions are fully documented and deployed with fully operationalized processes that support the solution lifecycle
- Other tasks as assigned
Qualifications
- 7+ years of experience in infrastructure, system engineering, QA/testing automation
- Demonstrable subject matter expertise in testing methodology and testing automation frameworks
- Full stack software engineering experience with a solid foundation in at least 2-3 of the following frontend and backend technologies: ReactJS (or similar frameworks), Go, Python, SQL, RDBMS or NoSQL databases.
- A systematic problem-solving approach, coupled with strong communications skills and a sense of ownership and drive.
- Experience in designing, analyzing, scaling, and troubleshooting medium-scale distributed systems.
- Well-versed in SRE methodologies and passionate about solving operational problems through automation and software engineering.
- Demonstrated written and verbal communication skills, with the ability to communicate effectively both vertically and horizontally within the organization.
- Intermediate to advanced level of knowledge of Kubernetes and Docker, including experience in Docker image optimization and managing the Docker image lifecycle
- Strong experience in at least 2 of the following sets of logging and monitoring tools: ELK stack, Prometheus, Grafana, Stackdriver, New Relic, Datadog, Dynatrace
- Advanced level of Linux/Unix experience.
- Experience working with Google Cloud is preferred, but experience with other public cloud providers will be considered
- Microservices lifecycle management (integration, testing, deployment)
- API and frontend testing automation.
- Semantic versioning and semantic-release.
- Experience with load, performance and stress testing tools
- Intermediate to advanced level of knowledge of software release tooling, including but not limited to GitLab, Jenkins, and Spinnaker.