Machine Learning Engineer / Senior Machine Learning Engineer (AI Platform)
Toronto, ON, Canada
Senior Level
Full-Time
About the role
- Do you want to build impactful AI features and solutions that will be used by millions of end users? We are in the AI Platform organization at Workday, and we solve meaningful problems at the intersection of machine learning and enterprise-scale software!
- We build advanced AI solutions that power the core Workday software by modeling user behavior and providing intelligent automation. Come join us and make work easier and more balanced for millions of Workday users!
- This role is focused on building the systems and tooling required to host and scale agent-based applications powered by LLMs. You will work across the platform stack to create reusable capabilities for agent execution, workflow orchestration, observability, evaluation, reliability, and developer experience.
- You’ll partner closely with applied AI, product, and infrastructure teams to define how agents are built and operated across the organization
- We are looking for a Machine Learning Engineer to help design and build our Agent Platform—the core infrastructure that enables teams to develop, deploy, orchestrate, and operate AI agents in production
- Design and build the core platform capabilities required to develop, host, and operate AI agents at scale
- Develop infrastructure and services for agent execution, orchestration, state management, and runtime reliability
- Build reusable abstractions, frameworks, and workflows in Python to support agent development patterns across teams
- Design and implement systems for tool use, memory, retrieval, workflow coordination, and human-in-the-loop interactions
- Build and maintain services deployed on Kubernetes, with a focus on scalability, resiliency, and operational excellence
- Develop capabilities for evaluation, tracing, observability, debugging, and performance monitoring of agent behavior in production
- Improve platform performance across latency, throughput, fault tolerance, and cost efficiency
- Create internal APIs, SDKs, and developer tooling that make it easier for engineering teams to build on the platform
- Partner with cross-functional teams to productionize new agent use cases and establish common platform patterns and best practices
- Contribute to technical architecture and help define the roadmap for agent infrastructure and platform evolution
- This is an ideal role for someone who enjoys solving hard engineering problems in a fast-evolving technical space and wants to shape the foundation for the next generation of AI applications
- 3+ years of experience designing systems with a focus on scalability, reliability, observability, and maintainability
- 3+ years of experience on a data science or machine learning software development team, or relevant work in a PhD or equivalent program
- 6+ years of software engineering experience, including experience building and operating production-grade backend, ML, or platform systems
- 8+ years of experience in Python and experience building reliable, maintainable production services
- 3+ years of experience with distributed systems, APIs, asynchronous workflows, and service-oriented architecture
- 5+ years of experience with distributed systems, APIs, asynchronous workflows, and service-oriented architecture
- 5+ years of experience in Python and experience building reliable, maintainable production services
- 5+ years of experience designing systems with a focus on scalability, reliability, observability, and maintainability
- Experience building or supporting agent platforms, AI infrastructure, or internal developer platforms
- Experience building and deploying machine learning or LLM-powered applications in production
- Familiarity with LLM application patterns, including:
  - Tool calling
  - Retrieval-augmented generation (RAG)
  - Memory and context management
  - Multi-step workflows and orchestration
  - Human-in-the-loop systems
- Experience designing and implementing evaluation frameworks for LLM or agent quality
- Familiarity with vector databases, model serving, prompt/version management, and experimentation tooling
- Solid knowledge of data science principles and their application in NLP
- Ability to work across ambiguity, make strong technical tradeoffs, and drive projects from concept to production
- Experience running services in Kubernetes-based environments
- Strong communication and collaboration skills, with the ability to partner effectively across engineering, product, and AI teams