Machine Learning Engineer / Senior Machine Learning Engineer (AI Platform)

Workday about 2 months ago

Toronto, ON, Canada

Senior Level

Full-Time

About the role

Do you want to build impactful AI features and solutions that will be used by millions of end users? We are the AI Platform organization at Workday, and we solve meaningful problems at the intersection of machine learning and enterprise-scale software.
We build advanced AI solutions that power the core Workday software by modeling user behavior and providing intelligent automation. Come join us and help make work easier and more balanced for millions of Workday users!
This role is focused on building the systems and tooling required to host and scale agent-based applications powered by LLMs. You will work across the platform stack to create reusable capabilities for agent execution, workflow orchestration, observability, evaluation, reliability, and developer experience.
You’ll partner closely with applied AI, product, and infrastructure teams to define how agents are built and operated across the organization.
We are looking for a Machine Learning Engineer to help design and build our Agent Platform, the core infrastructure that enables teams to develop, deploy, orchestrate, and operate AI agents in production.
Design and build the core platform capabilities required to develop, host, and operate AI agents at scale
Develop infrastructure and services for agent execution, orchestration, state management, and runtime reliability
Build reusable abstractions, frameworks, and workflows in Python to support agent development patterns across teams
Design and implement systems for tool use, memory, retrieval, workflow coordination, and human-in-the-loop interactions
Build and maintain services deployed on Kubernetes, with a focus on scalability, resiliency, and operational excellence
Develop capabilities for evaluation, tracing, observability, debugging, and performance monitoring of agent behavior in production
Improve platform performance across latency, throughput, fault tolerance, and cost efficiency
Create internal APIs, SDKs, and developer tooling that make it easier for engineering teams to build on the platform
Partner with cross-functional teams to productionize new agent use cases and establish common platform patterns and best practices
Contribute to technical architecture and help define the roadmap for agent infrastructure and platform evolution
This is an ideal role for someone who enjoys solving hard engineering problems in a fast-evolving technical space and wants to shape the foundation for the next generation of AI applications.
3+ years experience designing systems with a focus on scalability, reliability, observability, and maintainability
3+ years of experience as part of a data science or machine learning software development team, or relevant work in a PhD or equivalent program
6+ years of software engineering experience, including experience building and operating production-grade backend, ML, or platform systems
8+ years experience in Python and experience building reliable, maintainable production services
3+ years experience with distributed systems, APIs, asynchronous workflows, and service-oriented architecture
5+ years experience with distributed systems, APIs, asynchronous workflows, and service-oriented architecture
5+ years experience in Python and experience building reliable, maintainable production services
5+ years experience designing systems with a focus on scalability, reliability, observability, and maintainability
Experience building or supporting agent platforms, AI infrastructure, or internal developer platforms
Experience building and deploying machine learning or LLM-powered applications in production
Familiarity with LLM application patterns, including tool calling, retrieval-augmented generation (RAG), memory and context management, multi-step workflows and orchestration, and human-in-the-loop systems
Experience designing and implementing evaluation frameworks for LLM or agent quality
Familiarity with vector databases, model serving, prompt/version management, and experimentation tooling
Solid knowledge of data science principles and their application in NLP
Ability to work across ambiguity, make strong technical tradeoffs, and drive projects from concept to production
Experience running services in Kubernetes-based environments
Strong communication and collaboration skills, with the ability to partner effectively across engineering, product, and AI teams

About Workday

Software Development
