The Lead Systems Engineer - Computing Technology contributes to the design, leads the implementation, and provides Level 3 expert support for large-scale private cloud and/or HPC infrastructure, with a specific emphasis on computing technologies, including the hardware layer, operating system, hypervisor, and orchestration services.
• Co-design, lead implementation, and manage hybrid virtualization and containerized platforms based on OpenStack, VMware VCF, and/or Red Hat OpenShift, ensuring platform stability, performance, and compliance with industry standards and best practices.
• Define and oversee the implementation of the roadmap for all Virtualization and HPC platforms across the company.
• Collaborate with architecture and engineering teams on technology stack component evaluation and selection, ensuring solutions are designed following best practices and optimized from both functional and non-functional perspectives.
• Lead regular capacity planning exercises to anticipate and accommodate the growing demands on the virtualized environment and HPC infrastructure, ensuring it meets current and future requirements.
• Develop and oversee plans to enhance the reliability of the computing infrastructure, addressing potential points of failure and ensuring high availability of services.
• Lead regular performance assessments and implement improvements based on findings in collaboration with relevant teams.
• Define and oversee the execution of disaster recovery strategies, ensuring system integrity, availability, and protection across all platforms and environments.
• Design and enhance the observability stack in collaboration with the infrastructure operations team, ensuring monitoring coverage and accuracy.
• Provide L3 expert support, including on-call shifts, and act as the final tier of resolution for L2 support teams through problem analysis and communication with vendors' technical support.
• Lead the analysis and implementation of performance optimization strategies for the cloud computing and/or HPC environment to maximize efficiency and resource utilization.
• Lead and mentor a team of engineers and collaborate with other infrastructure engineering and systems architect teams on solution design and delivery.
• Collaborate with security management teams to ensure that systems are safe and secure against cybersecurity threats.
• Write and maintain relevant documentation, ensuring completeness and quality.
• Work closely with process management and operational teams, contributing to process development, standardizing the collaboration framework, and improving collaboration efficiency.
• Participate in the hiring process by conducting technical interviews and contributing to the team’s growth and expertise.
• Bachelor’s or master’s degree in Computer Science, Engineering, Software Engineering, or a related technology field.
• 2+ years of experience leading a team of 3+ engineers, holding accountability for quality and timely delivery of infrastructure projects.
• 7+ years of experience and deep expertise in designing, implementing, and managing private cloud stacks with a focus on compute and virtualization technologies.
• Extensive hands-on experience with at least one of the following platforms/stacks: OpenStack, Apache CloudStack, VMware VCF, or Red Hat OpenShift, and with related computing technologies such as x86 hardware, OS, KVM/ESXi, and orchestration services.
• 7+ years of hands-on experience in Linux environments and 3+ years of experience in a senior systems or infrastructure engineering role.
• Profound understanding of hardware architecture and components (x86 and ARM, NUMA, memory types and channels, NIC types, etc.).
• Good understanding of network and storage types and architecture.
• Good understanding of Cloud Native concepts and technologies.
• Experience in managing large-scale public or private cloud environments and/or working in a cloud service provider environment is highly desirable.
• Advanced programming and scripting skills in Python and/or Golang, and Bash.
• Good knowledge of data center network design and related technologies (OSI model, TCP/IP stack, routing, VLAN/VXLAN, etc.).
• Understanding of storage types, architecture, and protocols, such as object, block, and file storage, NFS/SMB, iSCSI, and FC.
• Experience with integration of identity management, access management, and authorization solutions (PKI, LDAP, OAuth, OpenID).
• Hands-on experience with monitoring and observability tools such as Zabbix or Nagios, Prometheus, Grafana, and the ELK Stack (Elasticsearch, Logstash, Kibana).
• Understanding of CI/CD principles, the Infrastructure as Code (IaC) approach, and software-defined infrastructure solutions.
• Experience with database management and optimization for both SQL and NoSQL databases such as MySQL, PostgreSQL, MongoDB, or Cassandra is highly desirable.
• Experience with ITSM tools such as Jira, Redmine, or ServiceNow.
• Relevant certifications in Linux, virtualization, and cloud computing are a plus.
• Knowledge of and experience working with GPU hardware and AI hardware accelerators are a plus.
• Strong organizational skills with the ability to multitask and prioritize.
• A proactive approach to problem-solving and decision-making.