)
Location: Remote (San Francisco Bay Area / North or South America)
Experience Level: 3+ years in ML engineering or research
About the role:
The Machine Learning Systems Engineer contributes to ML infrastructure and the open-source ScalarLM codebase. The role sits at the intersection of high-performance computing, distributed systems, and advanced machine learning research.
Key responsibilities:
• Develop and optimise distributed training algorithms for large language models (LLMs).
• Implement high-performance inference engines and optimisation techniques.
• Work on integrations between the vLLM, Megatron-LM and HuggingFace ecosystems.
• Build tools for seamless model training, fine-tuning and deployment.
• Optimise performance across advanced GPU architectures.
• Research and implement new techniques for self-improving AI agents.
Required technical expertise:
• Proficiency in both C/C++ and Python.
• Deep understanding of HPC concepts, including MPI programming and distributed computing across multiple GPUs/nodes.
• Experience with transformer architectures and distributed training techniques (data parallelism, model parallelism).
• Experience with large-scale distributed training frameworks (Megatron-LM, DeepSpeed) and inference-optimisation frameworks (vLLM, TensorRT).
BlueStone Solutions B.V. is certified in accordance with NEN 4400-1 and is a recognised sponsor with the IND.
