Machine Learning Software Engineer

Fathom Radiant


 90k - 180k
 Full-Time
 United States  (Colorado)
 Remote   

About us

We are searching for talented individuals who are driven to tackle the most ambitious goal of our time - building the computer hardware that enables the development of safe artificial general intelligence. See more at fathomradiant.co/aboutus
 
Above all, we value kindness, a scout mindset, a focus on improvement, and prioritising so that the right things get done. We aim to help build one of the most transformative technologies in the world, with massive social and ethical implications. We think this makes representation even more important, and we are actively striving to have a range of diverse perspectives on our team.
 

This role

As a Machine Learning Software Engineer, you'll work closely with our performance architect and silicon team to optimize the performance of ML training on our system. You'll test our systems and share ongoing insights and benchmarks with the broader team to inform our product decisions.
 
This role is open to fully remote candidates.

Areas of contribution:

    • Contribute to the implementation of the graph compiler and the runtime components of Fathom’s software stack, enabling distributed training of large AI models over novel networking topologies and Fathom’s interconnect fabric.
    • Work closely with other engineering and architecture teams and Fathom’s vendors to optimize performance metrics such as resource utilization, latency, and power consumption.
    • Collaborate closely with our partners and customers to train large-scale neural networks on Fathom’s computing fabric, benchmark performance metrics, and optimize accordingly.
    • Extend distributed systems collectives libraries for novel network architectures.
    • Develop tools and frameworks to analyze and visualize performance, hardware utilization, placement, traffic patterns, and power consumption.

Requirements (necessary skills for this role):

    • Mastery of programming languages such as C/C++ and Python.
    • Demonstrated experience in developing and optimizing highly parallel AI models.
    • Solid foundation in algorithms and data structures.
    • Strong analytical, problem-solving, and communication skills.

Nice-to-haves (we will prioritize candidates that also have these skills):

    • Master's or PhD degree in computer engineering, computer science, or a related field, and 5+ years of experience in system-level performance modeling, analysis, and optimization.
    • Strong understanding of AI/ML algorithms and toolsets.
    • Experience with distributed systems collectives such as NCCL or OpenMPI.
    • Knowledge of networking stacks and protocols, including Ethernet and InfiniBand fabrics.

Indicative salary range for this role: $90,000 - $180,000. The salary for this role will depend on the location and the experience level of the candidate.
 
For all roles, we target market salaries, with an additional benefits package. Our comprehensive benefits include startup equity and medical expenses coverage (including extra coverage for employees with a family).