Lead Engineer - Software & HPC Engineering
First Light Fusion
Software & HPC Engineering
We run and maintain our own High Performance Computing resource to ensure constant availability for the computations that underpin our approach. The workload varies from precision single runs to thousands of simulations in parallel, enabling exhaustive design space exploration and robust statistical assurance. Our multi-scale, multi-physics toolsets model hydrodynamics, thermal conduction, material strength, radiation transport, and resistive magnetohydrodynamics, backed by proprietary atomic-scale models for extreme accuracy. AI integration accelerates design cycles, automates optimisation, and continuously calibrates models with experimental data for maximum predictive confidence.
Job Description
Role Purpose
We are looking for a Lead HPC Engineer, or a Senior HPC Engineer with the ambition to take on greater responsibility, who has a strong understanding of designing, implementing, and maintaining HPC platforms. You will support our existing HPC cluster of over 10,000 cores and its associated storage. You will be a member of the Software & HPC Engineering team, which works closely with the Computational Physics and Data-Driven Engineering departments to provide the hardware and software platform for running our codes. This includes maintaining a consistent environment for our various codes to run in, quality assurance, builds, continuous integration, quality control, deployment, monitoring, and general software development. Focusing on the HPC function, you will also work closely with the business IT team.
Current Systems:
- Two server rooms, mostly for HPC but shared with business IT.
- The whole HPC complex is currently air-gapped, which requires on-site working.
- Hardware: AMD EPYC Dell servers, Intel Xeon Supermicro and Intel servers, 4TB and 8TB SSDs, 100Gb LAN.
- Base software: AlmaLinux 8.10, Ubuntu 22.04, Lustre 2.15 and 2.16, GlusterFS 11, NFS Ganesha 5, Ansible, Ganglia, Slurm.
- Developer software: G++ 14, Clang C++ 20, Fortran 2008+, MPICH 4.3, OpenMPI 5.0, Python 3, VTK, ParaView.
- Tools: Bitbucket, JIRA, Jenkins (maintained by business IT).
Accountabilities and deliverables
- Maintain hardware, cooperating with vendor support.
- Maintain base software.
- Monitor status and performance and fix issues.
- Back up critical data and configuration.
- Perform maintenance according to user activity.
- Profile performance and seek improvements.
- Provide status updates to team leaders and user representatives.
- Communicate major issues or developments to the user community.
- Record major user requests and user issues and provide resolutions.
- Provide documentation on the HPC cluster and its operation to other systems engineers and to the users.
- Solicit requirements and provide budget plans for upgrades and replacements.
- Negotiate with vendors for purchases.
- Assist with other work in the Software & HPC Engineering team as needed.
- Perform other tasks compatible with skills as needed.
Core skills, knowledge and attributes
Essential
- Degree in Computer Science or another STEM subject, or equivalent experience.
- Good knowledge of Linux, high performance computing, high performance storage, and high-speed interconnect networking.
- Familiarity with MPI, C++, Fortran 2008+ programming.
- Understanding of scheduling systems.
- High-level scripting experience (configuration management, shell, Python), using Git for version control.
- Understanding of specifying a system to meet simulation requirements through to implementation and ongoing support.
- Fast and effective problem-solving skills and a methodical approach to work.
- Strong communication and interpersonal skills.
Desirable
- In-depth knowledge of MPI, C++, Fortran 2008+.
- Experience with Ansible configuration management.
- Experience of specifying a system to meet simulation requirements through to implementation and ongoing support.
- Profiling existing codes and optimising hardware/software to deliver the best performance.
- Containerisation (Singularity, Apptainer).
- Familiarity with system deployment tools.
- Prior work on air-gapped networks.
- SQL DBMS queries and experience with HPC accounting databases.
- Education and training of end users.