What It’s Really Like to Be an AI/ML Engineer
3 real-world examples of AI/ML Engineers → their stack, challenges, and day-to-day focus!
This week’s newsletter is sponsored by Atono.
How we built an MCP server from scratch
MCP servers have quietly become the thing every engineering team is adding to their roadmap. Not because they are trendy, but because they finally give AI a safe and predictable way to work with the systems you have already built.
In Atono’s latest webinar, senior engineer Lex walks through how their team built an MCP server from scratch. He shares what the process looked like, where things got strange, and how LLMs behave once they are connected to real workflows instead of clean examples.
In this walkthrough, you’ll learn:
How MCP gives AI a predictable way to interact with your systems — even when your product has years of features, fixes, and tech debt baked in.
Why we built and shipped our MCP server as a Dockerized service — and how that choice simplified security, setup, and compatibility across tools like Zed, VS Code, and Copilot.
What happens when different LLMs start invoking your tools — including the unexpected behaviors, model quirks, and debugging patterns that only show up once everything is wired into real workflows.
Whether you manage a team or write the code, this walkthrough shows how MCP works in practice.
Thanks to Atono for sponsoring this newsletter. Now, let’s get back to this week’s thought!
Intro
With the rise of AI, roles such as AI Engineer and ML Engineer are becoming more and more popular.
To understand these roles better, I’ve asked 3 AI/ML Engineers to share their real-world experience:
What their day-to-day looks like
What technologies do they work with
Specific challenges they are facing
How the work differs from Software Engineering
These are the 3 engineers I had the pleasure of talking to:
If you like these kinds of articles, where I talk to various engineers & engineering leaders and share their insights, you’ll love these 2 articles as well:
Let’s start!
1. Working on large-scale ML systems at Meta with a focus on defending against bad actors
Shared by Shivam Anand, Staff ML Engineer at Meta.
His work focuses on large-scale machine learning systems, with a particular emphasis on adversarial ML → building resilient systems to defend against bad actors.
Before Meta, he spent over seven years at Google, leading ML efforts in ads spam and fraud, video ranking, and search quality.
What attracted him to AI/ML was how fast the field evolves. He always loved learning, and ML is one of the few areas in engineering where the landscape can shift dramatically year to year. That constant state of reinvention is what kept him engaged.
At Meta, his day-to-day varies depending on the time of year. During planning phases, the focus is on alignment and defining technical strategy.
In execution phases, it shifts to mentoring other engineers and hands-on work → building infra, training models, iterating quickly.
He uses PyTorch extensively, tightly integrated with Meta’s internal stack. A major focus area has been deploying Llama models to handle problems where labels are scarce and adversarial behavior is dynamic.
One of the hardest problems he tackled recently involved scaling LLMs to handle billions of unknown examples with only a handful of positive labels.
It’s an extreme version of the class imbalance problem, where success depends not just on modeling, but on optimizing data pipelines, evaluation strategies, and inference performance across infrastructure.
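To make that concrete, here is a minimal PyTorch sketch of one standard mitigation for extreme class imbalance: weighting the rare positive class in the loss. The numbers are illustrative assumptions, not a description of Meta’s actual systems, which combine techniques like this with data pipeline and evaluation work at far greater scale.

```python
import torch
import torch.nn as nn

# With a handful of positives among millions of negatives, an unweighted
# loss lets the model score well by predicting "negative" everywhere.
# One common mitigation: scale up the loss contribution of positives.
# (Illustrative counts; real systems tune this against validation metrics.)
num_negatives = 1_000_000
num_positives = 50

pos_weight = torch.tensor([num_negatives / num_positives])
loss_fn = nn.BCEWithLogitsLoss(pos_weight=pos_weight)

logits = torch.randn(8, 1)                    # raw model outputs for a batch
labels = torch.randint(0, 2, (8, 1)).float()  # 0 = negative, 1 = positive
loss = loss_fn(logits, labels)
```

On its own, loss weighting rarely solves extreme imbalance; it’s usually paired with sampling strategies and careful evaluation, which is exactly where the pipeline and measurement discipline he describes comes in.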
Compared to earlier roles in software engineering, ML introduces more uncertainty. In traditional engineering, outcomes for a given effort are often easier to estimate, even if timelines slip.
In ML, the outcome itself is often unclear until you try, especially in adversarial settings. This makes iteration velocity, measurement discipline, and expectation management all critical parts of the job.
A common misconception about AI is that it’s full of magical breakthroughs. In practice, it’s empirical, iterative, and full of hard trade-offs. Most progress comes from structured experimentation and relentless tuning, not cleverness alone.
For those looking to break into ML roles at Big Tech companies:
Understand that interviews are highly structured. Prepare deliberately, but also develop real depth in the problem domains you’re interested in. That combination is what makes people stand out.
2. Senior AI/ML Engineer on the Central ML team
Shared by Alex Razvant, Senior AI Engineer at Everseen.
He works as a Senior AI/ML Engineer at Everseen AI, the market leader in Vision AI solutions for retail. Everseen’s product stack includes solutions for smart checkout, loss prevention, theft prevention, supply chain monitoring, and more.
When he started, he joined the company as an Applied AI Researcher, working closely on model architectures, training, and evaluation. That phase involved studying and integrating concepts from research papers, testing for quality and performance metrics, and building the software components (Expert Systems) that interpret the predictions of these models.
Currently, he is part of the Central ML team, which serves as the core enabler of ML initiatives across the company. The team collaborates closely with Applied AI Researchers, Data Engineers, DevOps, and Software Development teams and works on each component of the ML Lifecycle.
Occasionally, they research and propose new tools or frameworks to streamline their ML processes. For example, if they consider switching their Experiment Tracking system, they’d do the research, create architectural decision records (ADRs) and diagrams, and coordinate discussions with other teams to plan a smooth migration.
On a typical day, his work is highly cross-functional:
He’ll work with Data Engineers on porting new or improving existing data pipelines, ensuring lineage between raw collected videos and images, which amount to terabytes of data.
With AI Research teams, he’d work on training and evaluating models and providing the infrastructure for these workloads. Here, he might work on fine-tuning Vision Language Models (VLMs) for image/video captioning of key video events, or sample training datasets for downstream task adaptation of the models they have in production, depending on the edge cases they encounter.
Additionally, alongside Dev and Ops teams, he’d work mainly on benchmarking and optimizing inference latency, ensuring their distributed training (centralised or federated) and evaluation jobs work as expected.
To summarize his role, it’s a mix of MLOps and AI Engineering: working mainly on the infrastructure for internal AI research, while also collaborating on components of the lifecycle.
Technologies
The technologies he works with are quite broad. He relies heavily on Python and PyTorch for ML development, including performance-critical components with Cython, CuPy, Numba, DeepStream, or even C++ at times.
He uses multiple tools from NVIDIA’s AI stack, including Triton Inference Server, TensorRT, Nsight, NVFlare, NCCL, and CUDA, and he also works with distributed computing frameworks like Ray and Flower for distributed model training.
For internal RAG or embedding-search systems, he uses FAISS or Qdrant for vector search (a minimal sketch follows below). This also involves staying up to date with open-source GenAI models, testing VLMs and cross-modality embedding models, and, less often, LLMs.
The model architectures span from classical deep learning ones, such as CNNs and ResNets, to multi-modal transformer-based ones such as CLIP, ViT, Florence, DINOv2, SAM (Segment Anything), and LLaVA, among others.
On the infrastructure side, his toolset includes FastAPI, Docker/Podman, Mongo, Ray, Kubernetes, MLflow/TensorBoard, Airflow, Kubeflow, Prometheus, Grafana, the ELK stack, and cloud platforms like Azure and Google Cloud with AzureML Studio and Vertex AI for MLOps workflows.
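To ground the embedding-search part of that stack, here is a minimal FAISS sketch of the kind of vector search such systems rely on. The encoder, dimensions, and data are made up for illustration, not taken from Everseen’s systems:

```python
import numpy as np
import faiss  # pip install faiss-cpu

# Toy corpus: 1,000 embeddings of dimension 384 (e.g., the output
# size of a small sentence or image encoder).
dim = 384
corpus = np.random.rand(1000, dim).astype("float32")

# Exact L2 search; production deployments often use approximate
# indexes (e.g., IVF or HNSW) to trade a little accuracy for speed.
index = faiss.IndexFlatL2(dim)
index.add(corpus)

# Embed a query with the same encoder, then retrieve the top 5 neighbors.
query = np.random.rand(1, dim).astype("float32")
distances, ids = index.search(query, 5)
print(ids[0])  # indices of the 5 closest corpus vectors
```

The same pattern applies with Qdrant, which adds persistence, filtering, and an API server on top of the raw index.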
Challenges
In terms of challenges, he’d say AI/ML engineering differs significantly from traditional software development. Prior to AI/ML, he worked as a full-stack web developer, where he found the processes to be more robust and well-defined.
ML workflows involve many iterative loops and a degree of uncertainty, especially when both quality and efficiency are critical, as in Edge Computing scenarios.
One major challenge is keeping the ML process standardised enough when adding a new AI initiative into the loop, while also handling the reliability of models in production: optimizing for edge cases, retraining, evaluating, and benchmarking.
Going from 0 to 80% in model performance is straightforward, and he thinks most would agree with that. Even with enough training data, reaching quality levels of 90%+ is much harder: edge cases grow significantly, and each deep learning training loop costs time and money, so ensuring lineage and having specific plans for improvements matter a lot.
That is particularly true with AI applied to vision systems.
As a closing note, he mentions that AI/ML engineering demands a hybrid mindset, where balancing the stability of engineering with the dynamic behaviour of ML experimentation is mandatory.
Apart from that, AI is a fast-moving, deeply technical field that’s constantly changing; thus, to stay relevant, one must keep up to date with the advancements while avoiding the hype, which is an exciting yet tricky balance to strike.
3. Designing and scaling AI systems primarily with RAG
Shared by Bhuvaneshwaran Ponnusamy Ilanthirayan, AI Engineer & Head of AI at contexxt.ai.
He works for contexxt.ai as an AI Engineer, where he’s involved in designing and scaling intelligent systems, primarily around Retrieval-Augmented Generation (RAG) systems, agents, and LLM pipelines.
On a typical day, he begins by reading through a few AI newsletters he’s subscribed to for about 15–20 minutes. He highlights articles that seem worth a deeper read and revisits them over the weekend.
Then he quickly checks the system’s health by reviewing metrics (such as latency, retrieval quality, and load patterns) across the AI models, backend services, and external dependencies, to catch any anomalies early.
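As an illustration of what a “retrieval quality” metric might look like, here is a simple recall@k computation over a labeled evaluation set. The helper and data are hypothetical, not contexxt.ai’s actual tooling:

```python
def recall_at_k(retrieved_ids, relevant_ids, k=5):
    """Fraction of queries whose top-k retrieved documents
    contain at least one known-relevant document."""
    hits = 0
    for retrieved, relevant in zip(retrieved_ids, relevant_ids):
        if relevant & set(retrieved[:k]):
            hits += 1
    return hits / len(retrieved_ids)

# Hypothetical eval set: what the retriever returned per query,
# and which documents a reviewer marked as relevant.
retrieved = [["doc3", "doc7", "doc1"], ["doc2", "doc9", "doc4"]]
relevant = [{"doc1"}, {"doc5"}]
print(recall_at_k(retrieved, relevant, k=3))  # -> 0.5
```

Tracking a metric like this over time is what turns “retrieval quality” from a gut feeling into an anomaly you can actually catch.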
Following this, they have their daily scrum, where the team briefly syncs up to discuss progress and blockers. Most of his day revolves around building or improving features, often related to enhancing answer quality in their RAG systems.
Occasionally, he dives into deeper research on ideas they aim to implement in the future, especially when evaluating new model architectures or retrieval strategies. He’s also responsible for architectural decisions and optimizations, particularly when it comes to performance or stability.
Throughout the day, he might have check-ins with his direct reports to help them out, usually in areas related to backend development, DevOps, or security.
Before wrapping up, he logs his hours and categorizes his time using Rize for later productivity analysis. He then ends the day by outlining some small tasks or focus areas for the following day.
Technologies
He works with a wide range of technologies, including (but not limited to): Python, FastAPI, LangChain, LangGraph, Hugging Face Transformers, spaCy, PyTorch, TensorRT, Milvus, Neo4j, Redis, Elasticsearch, Docker, Kubernetes, Terraform, OVHcloud, Langfuse, Grafana, and Slurm (primarily for experimental training and using HPC clusters).
Challenges
Each day brings something new. A major challenge in AI engineering is that many tools and frameworks, while promising, are not fully production-ready.
The field is evolving rapidly, but the research-to-production gap still exists. This means he has to consider many unpredictable variables while designing and implementing solutions.
Compared to his previous experience in software engineering, where the work tends to be more deterministic and predictable, AI engineering is far more stochastic. The results can vary depending on the model, data, or even deployment infrastructure.
But rather than being overwhelming, he finds this pace of innovation exciting.
Watching models become more accessible and powerful over time reinforces his belief that AI will empower a broader range of people and use cases in the near future.
My Key Takeaways
Let me share my 3 main takeaways:
1. AI/ML Engineering is fast-moving, iterative, and uncertain
Unlike traditional software engineering (deterministic, predictable), AI/ML work is highly experimental and full of unknowns.
Success often depends on iteration speed, data quality, evaluation methods, and infrastructure, not just clever algorithms. The field evolves rapidly. Staying relevant requires continuous learning and avoiding hype.
2. The role is highly cross-functional
AI/ML engineers collaborate with:
Data engineers (pipelines, lineage)
Research scientists (model development, evaluation)
DevOps/MLOps teams (infra, deployment)
Product/Backend teams (integration, performance)
The work spans research, engineering, infrastructure, and operations. There’s a big need for understanding many different concepts, technologies, and processes.
3. Success depends on data quality, edge cases, and evaluation
The hard part of AI/ML engineering is not the model → it’s curating, labeling, cleaning, and understanding data.
Real progress comes from identifying and fixing edge cases, defining better evaluation metrics, and building feedback loops for continuous improvement.
Getting to “good enough” accuracy can be done fast, but pushing from 80% → 90%+ requires a lot more effort, with a focus on error analysis and targeted iteration.
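As a sketch of what that error analysis can look like in practice, here is a minimal slice-based evaluation: group results by condition and compare accuracy per slice to see where the remaining errors concentrate. The slices and outcomes are made up for illustration:

```python
from collections import defaultdict

# Hypothetical evaluation records: (slice label, was the prediction correct?)
results = [
    ("daylight", True), ("daylight", True), ("daylight", True),
    ("low_light", False), ("low_light", True),
    ("occlusion", False), ("occlusion", False),
]

# Per-slice accuracy shows where the "last 10%" is hiding and
# which edge cases are worth collecting more data for.
by_slice = defaultdict(list)
for slice_name, correct in results:
    by_slice[slice_name].append(correct)

for slice_name, outcomes in sorted(by_slice.items()):
    acc = sum(outcomes) / len(outcomes)
    print(f"{slice_name}: {acc:.0%} ({len(outcomes)} examples)")
```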
Last words
Special thanks to Shivam Anand, Alex Razvant, and Bhuvaneshwaran Ponnusamy Ilanthirayan for sharing their insights on this important topic!
If you made it to the end of the article, let me close with the following:
Whether you are an AI/ML Engineer, a Software Engineer, or any other kind of engineer, one thing is common across all fields: curiosity and the need to constantly learn new things. And that has become more important than ever.
You got this!
Liked this article? Make sure to 💙 click the like button.
Feedback or addition? Make sure to 💬 comment.
Know someone who would find this helpful? Make sure to 🔁 share this post.
Whenever you are ready, here is how I can help you further
Join the Cohort course Senior Engineer to Lead: Grow and thrive in the role here.
Interested in sponsoring this newsletter? Check the sponsorship options here.
Take a look at the cool swag in the Engineering Leadership Store here.
Want to work with me? You can see all the options here.
Get in touch
You can find me on LinkedIn, X, YouTube, Bluesky, Instagram or Threads.
If you wish to request a particular topic you would like to read about, you can send me an email at info@gregorojstersek.com.
This newsletter is funded by paid subscriptions from readers like yourself.
If you aren’t already, consider becoming a paid subscriber to receive the full experience!
You are more than welcome to find whatever interests you here and try it out in your particular case. Let me know how it went! Topics are normally about all things engineering related, leadership, management, developing scalable products, building teams etc.