Objective
MLOps Team Lead with over a decade of hands-on experience architecting and developing production systems and turning ML research into systems that run at scale. I lead teams the way I write code: with clarity, ownership, and a bias for shipping. I’ve built and shipped across industries where scale is the starting point. Seeking a challenging role where I can leverage my technical proficiency and leadership abilities to make a meaningful difference.
Experience
- Leading a team of 8 engineers (including 1 Data Scientist), owning the end-to-end lifecycle of production ML systems for large-scale IoT tracking across global supply chains.
- Architected and scaled real-time streaming ML pipelines (Apache Flink, Kafka, Kubernetes) from hundreds to 100K+ events/sec, enabling deployment across 4,000+ physical stores within one year.
- Designed and operated end-to-end ML pipelines (training and inference), processing ~3 TB/day and supporting ~20 production models.
- Built complex real-time inference pipelines spanning data ingestion, decryption, enrichment, and fan-out model serving.
- Reduced system latency by up to 43% and infrastructure costs by ~15% through autoscaling and pipeline optimization.
- Established production-grade monitoring for data drift, model performance, and system health, maintaining a 99.99% uptime SLA.
- Improved team productivity and development velocity by promoting modern tooling and AI-assisted development workflows.
- Partnered closely with Data Science, DevOps, PM and external stakeholders to deliver scalable ML solutions in production.
- Responsible for the architecture, design, development, and maintenance of several core components of the BLADE system.
- Led a team of 4 engineers (including 1 Data Analyst).
- Mentored junior developers, providing guidance and fostering their professional growth.
- Collaborated closely with the data science team to bring machine learning models into production, protecting Intuit’s APIs against malicious activity from both security and fraud perspectives.
- Deployed several ML models that, at their peak, handled ~1 million TPS while delivering strong F1 scores, p99 latency, and throughput.
- Leveraged Apache Flink expertise to build a high-performance, scalable, and secure stream-processing solution.
- Owned full project planning and management, including cost, timeline, and risk.
- Played a key role in the redesign, architecture, development, and maintenance of the Risk Control System (RCS) 2.0.
- Spearheaded the migration from a legacy batch-processing infrastructure, which was prone to crashes and suffered from poor code quality, to a modern, highly available, and maintainable system built on a Java Spring Boot microservices architecture. Post-migration, RCS served its clients at 99.999% uptime with an NPS of 86.
- Collaborated with cross-functional teams to gather requirements, design system architecture, and implement robust solutions.
Design, Development and Integration of a cloud-based Java Spring system that extends the capabilities of ISE (Identity Services Engine) into the cloud. Key responsibilities included:
- Migrated existing product runtime to the AWS ECS Fargate platform.
- Integrated peripheral services with their AWS counterparts (Cognito, RDS, etc.).
- Developed and maintained three layers of testing (UT, FT, ST) to ensure product quality.
- Coordinated with offsite international dev/PO/PM teams to co-develop features, tests, and pipelines, and to handle bugs.
- Developed the back-office user management and analytics system.
- Responsible for onboarding new engineers into the Scrum team.
Design, Development and Integration, from scratch, of a cloud-based Java Spring system that gives farmers worldwide command and control of their fields and crops. Key responsibilities included:
- Developed peripheral tools for the main app, such as a watchdog and a mathematical crop model.
- Evaluated, introduced, and taught the team new technologies, including FOTA, DB sync and migrations, and AOP.
- Created a CI/CD infrastructure using Microsoft Fabric.
- Improved performance and stability of the system deployed on Apache Tomcat servers.
- Served as Netafim’s liaison, reviewing code, feature quality, and project status.
Development, maintenance, and debugging of the backend for a TV, picture, and voice recognition API.
- Developed REST web services with RESTEasy and Undertow.
- Wrote unit and integration tests (BDD) using Cucumber.
- Practiced continuous integration and delivery using Jenkins.
- Developed on top of core Amazon Web Services offerings such as EC2, S3, and DynamoDB.