Hyperparam ML Engineer – Dataset Curation Product Seattle, WA · Full time

Work on research and development of tools and techniques for ML dataset curation

Description

What is the key to building the most advanced AI models? Data quality.


Hyperparam’s goal is to build a tool so efficient that one engineer can curate large ML datasets single-handedly. We believe the way to accomplish this is twofold: 1) build a highly scalable, interactive frontend experience that enables exploration and curation of massive ML datasets in the browser, and 2) run dataset-scale inference that uses models to reflect on their own training sets to assist with curation. We're building the next generation of tools for ML dataset curation, helping make LLM dataset curation orders of magnitude more efficient than current approaches. By creating the highest-quality datasets, we will enable the creation of the world’s most capable models.


This is a hybrid, in-person role in Seattle at a seed-stage startup. You would be one of the very first employees, working side-by-side with an experienced team building a new kind of dataset curation tool. The role requires the intense work ethic, dedication, creativity, and independence that are necessary at an early-stage startup. For the right candidate, this is a unique opportunity to help build a company from the earliest idea stage to a product used by real customers.


Responsibilities:

  • Dataset Curation: Analyze, process, and clean large-scale datasets to ensure they meet quality and usability standards for machine learning applications.
  • Heuristic Development: Identify and design robust heuristics to filter, rank, and enhance datasets based on specific requirements.
  • Agent Development: Create and deploy intelligent agents that autonomously perform data cleaning, labeling, and curation tasks.
  • Quality Metrics: Define, track, and continuously improve dataset quality metrics, working towards tangible improvements in ML model performance.
  • Product Feedback: Collaborate closely with product and engineering teams, using the curation product to provide actionable feedback and prioritize enhancements.


You might be a great fit if you have:

  • Deep experience building products with LLMs. You should have experience using APIs from Anthropic, OpenAI, etc., and be familiar with tool calls and other advanced API features.
  • Experience with Data: Strong proficiency in working with structured and unstructured data; hands-on experience with data cleaning, processing, and transformation, as well as evaluation of ML datasets.
  • Algorithm Development: Proven ability to design and implement effective heuristics for data-related challenges.
  • Quality-Driven Mindset: Strong attention to detail and an obsession with "making the quality number go up" through iteration and experimentation.
  • Deep experience working with LLMs (e.g., GPT, Claude). If you aren’t using OpenAI and/or Anthropic models almost daily, it’s probably not a good fit.
  • Agent Creation: Experience iteratively developing agentic systems to perform tasks. Experience with agentic frameworks like LangGraph, AutoGen, etc. is a plus.
  • Familiarity with active learning, synthetic data generation, or semi-supervised learning techniques.
  • Excellent problem-solving abilities and attention to detail.
  • Ability to operate independently in a small startup environment.
  • Passion for staying current with emerging technologies and best practices.


What We Offer:

  • Get in on the ground level of a funded seed-stage startup.
  • Work side-by-side with experienced entrepreneurs who care deeply about advancing AI.
  • A collaborative and close-knit work environment with a small team of highly motivated engineers, located in-person in Seattle.
  • Competitive salary, equity, and comprehensive benefits package.
  • Opportunity to work on groundbreaking projects in the ML and data visualization space.


The ideal candidate is deeply passionate about accelerating AI progress. You're excited by the potential of using LLMs as tools to improve the quality and efficiency of dataset curation, seeing this as a key lever for advancing ML capabilities. You think critically about dataset quality and how it impacts model performance, and you're motivated by the challenge of building automated systems that can help create better training data at scale. You likely have hands-on experience with ML models and understand firsthand how dataset quality influences model behavior. You don’t need to know frontend development, but you should be excited about the prospect of a more interactive, frontend-centric ML data platform. Most importantly, you're eager to work on systems that could help unlock the next generation of more capable AI models through better training data.

Salary

$180,000 - $240,000 per year