Software is an art. Period.
Even when it's structured, its most fundamental contract is simple: you give it an input, it returns an output.
Picture this: You type your credit card into an e-commerce checkout, and the system processes a payment. Clean. Predictable. Done.
But the real craft lives in the middle. That's where the real sexy things happen:
- How do you make it secure?
- Cost-efficient? Observable? Resilient?
- How do you handle edge cases, failures, scale?
- What are the best practices to follow? What language to use? What framework to use? What libraries to use? What tools to use?
- The list goes on and on...
That's where software stops being mechanical and starts becoming intentional. And that's a whole spectrum of alternatives; we software developers have been discussing and polishing it for decades.
---
Machine learning as a discipline: software development for data
Just like any other discipline, software has specializations. One of them, probably the most transformative in recent years, is machine learning.
It started with a very practical question:
Can we stop telling computers exactly what to do… and instead let them learn from data?
That's the shift. So devs started to think: OK, how do we make computers learn from data, so they can produce results without us spelling out every rule?
So exciting stuff, let's dive into it!
The origin: learning from data, not rules
Early software systems were deterministic. They still are, but back then determinism was king.
But reality isn't. In fact, reality is so complex that I don't want to start talking about it now; I'd probably end up talking about aliens, time travel, and other sexy stuff.
But OK, back to the point: things like fraud don't follow fixed rules. User behavior changes. Language is ambiguous. Patterns don't fit neatly into if/else branches.
Read that again: patterns don't fit neatly into if/else branches. Patterns are messy.
So the question becomes: how do we model messy reality in a way a machine can compute?
That's where machine learning comes in.
At its core, machine learning is about training a model to map inputs to outputs using data.
Instead of explicitly defining the rules, the system learns them. 😱
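To make that concrete, here's a minimal sketch of letting a model derive the rules from data instead of hard-coding them. The dataset and labels are completely made up, just to illustrate the idea with scikit-learn:

```python
from sklearn.tree import DecisionTreeClassifier

# Tiny illustrative dataset: [transaction_amount, is_foreign_country]
X = [[20, 0], [15, 0], [900, 1], [1200, 1], [30, 0], [1500, 0]]
y = [0, 0, 1, 1, 0, 1]  # 0 = legit, 1 = fraud (made-up labels)

# Nobody writes the if/else branches here: the model derives them from the data
model = DecisionTreeClassifier().fit(X, y)

# The learned mapping applied to a new, unseen input
print(model.predict([[1000, 1]]))
```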
Models: the engine behind the scenes
Start simple. Think Linear Regression or Decision Trees.
These models were designed to solve narrow problems:
- Predicting prices
- Classifying emails
- Detecting fraud
Nothing flashy. Well, it is flashy, but AI has set the bar so high now that, yes, those count as "simple".
But here's the catch:
The real power of machine learning is in the data.
Feature engineering: where the real work happens
You're not just passing data. You're shaping reality for the model, and this IS crucial: that decision alone can make or break everything.
There's a reason people say:
Garbage in, garbage out
If your data is weak, noisy, biased, or flat-out wrong, the model has no way of "figuring it out." It will learn exactly what you give it.
Train a model on fake news, and it won't question it. It will learn it as truth and operate accordingly.
The model is not responsible for truth. It's responsible for consistency with data, and the data defines the "truth".
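To give a feel for what that shaping looks like, here's a tiny sketch with hypothetical transaction data; every column name and value is invented purely for illustration:

```python
import pandas as pd

# Hypothetical raw transactions (columns are made up for illustration)
raw = pd.DataFrame({
    "timestamp": pd.to_datetime(["2024-01-05 02:13", "2024-01-05 14:40", "2024-01-06 23:55"]),
    "amount": [12.50, 980.00, 45.00],
    "user_avg_amount": [20.0, 35.0, 50.0],
})

# Shaping reality for the model: turn raw fields into signals it can learn from
features = pd.DataFrame({
    "hour_of_day": raw["timestamp"].dt.hour,                    # time-of-day pattern
    "is_night": raw["timestamp"].dt.hour.isin(range(0, 6)),     # late-night flag
    "amount_vs_usual": raw["amount"] / raw["user_avg_amount"],  # deviation from the user's norm
})
print(features)
```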
Training vs Inference: when learning happens vs when it works
OK, so at this point, once we have the data and the model, now what?
It's time to train the model.
Training is where the model learns. You feed it data, it adjusts internal parameters, it minimizes error. It's iterative and usually happens offline.
Once the parameters settle into the configuration that minimizes the loss function, training is complete, and the model is ready for the inference phase.
So...
Training optimizes for learning and accuracy. Inference optimizes for latency, cost, and reliability.
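Here's a minimal sketch of the two phases using a plain linear regression on synthetic data; everything here is illustrative, the point is just where learning stops and serving begins:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# --- Training: happens offline, optimizes for fit to the data ---
rng = np.random.default_rng(0)
X_train = rng.uniform(0, 10, size=(1000, 1))
y_train = 3.0 * X_train[:, 0] + rng.normal(scale=1.0, size=1000)

model = LinearRegression().fit(X_train, y_train)  # parameters are adjusted here

# --- Inference: parameters are frozen; we just apply the learned mapping ---
# This is the part that has to be fast, cheap, and reliable in production
new_input = np.array([[4.2]])
print(model.predict(new_input))
```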
Fundamental concepts about machine learning
Now that we have a super general overview and intro to ML as a discipline, let's mention some fundamental things I believe are necessary to understand, or at least be aware of, when hands-on time comes.
Not all learning is the same
When we say "machine learning," there are three main groups or paradigms:
Different problems require different approaches. There is no universal method and we will discuss later how Generative AI fits into this paradigms.
Overfitting vs Generalization: the real game
Here's a classic trap. A model that performs perfectly on training data… might be useless.
Why?
Because it memorized instead of learned.
Concept of generalization
Generalization is when the model learns the underlying patterns of the data and can make predictions on new data.
That's the real objective of machine learning. Not perfection but adaptability.
Train, Validation, Test: don't fool yourself
If you train and evaluate on the same data, you're lying to yourself.
That's why we split datasets:
- Training set: where the model learns
- Validation set: where you tune decisions
- Test set: where you measure true performance
Usually, we split the dataset using the 80/20 rule: 80% for training, 20% for validation and testing.
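Here's a minimal sketch of that split with scikit-learn on random toy data; the 80/10/10 breakdown is one common way of applying the rule, not the only one:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy dataset: 1,000 samples, 5 features (purely illustrative)
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Carve out 20% for validation + test, keeping 80% for training
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.2, random_state=42)

# Split the remaining 20% in half: 10% validation, 10% test
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 800 100 100
```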
This separation is what gives credibility to your results. Without it, metrics are just illusions, because how do you know if the model is actually learning or just memorizing or overfitting? You can't.
Bias vs Variance: finding the balance
Think of models as a spectrum.
- Too simple, and they can't capture the complexity of the problem. High bias.
- Too complex, and they memorize noise. High variance.
Machine learning is about navigating that tension. Not too simple. Not too complex.
Just enough to generalize, and here is where the real challenge comes in. Machine learning engineers spend a lot of time tuning models to find the balance: trying hypotheses, testing, back and forth, trial and error.
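As a rough illustration, here's a sketch that sweeps a decision tree's depth on noisy synthetic data. Shallow trees underfit (high bias), very deep trees memorize the noise (high variance); the gap between training and validation error is the tell. All data and numbers are made up:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Noisy sine wave: a simple but nonlinear relationship (illustrative data)
rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=300)

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# Sweep model complexity: watch how train error keeps dropping
# while validation error eventually gets worse
for depth in [1, 3, 10, None]:
    model = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(X_train, y_train)
    train_err = mean_squared_error(y_train, model.predict(X_train))
    val_err = mean_squared_error(y_val, model.predict(X_val))
    print(f"max_depth={depth}: train MSE={train_err:.3f}, val MSE={val_err:.3f}")
```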
Data leakage: the silent killer
This one is subtle, and dangerous.
Concept of data leakage
Data leakage happens when your model has access to information during training that it wouldn't have in the real world.
The result? Amazing performance during testing but a total failure in production. Why? It doesn't break loudly. It just quietly invalidates everything. The model is reflecting the training data, not the real-world data it will face.
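A classic, easy-to-miss example is preprocessing before splitting. Here's a sketch contrasting a leaky setup with a safe one on toy data; in this tiny example the effect is negligible, but with target-derived features or time-ordered data the same mistake can be dramatic:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 10))
y = (X[:, 0] > 0).astype(int)

# LEAKY: the scaler sees statistics from the whole dataset,
# including rows that will later become the test set
X_scaled = StandardScaler().fit_transform(X)
X_tr, X_te, y_tr, y_te = train_test_split(X_scaled, y, random_state=1)
leaky = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# SAFE: split first, then let a pipeline fit the scaler on training data only
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)
safe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)).fit(X_tr, y_tr)
```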
Machine learning is not a model, it's a system: the pipeline
One of the biggest misconceptions is thinking ML = model.
It's not.
A real machine learning system is a pipeline:
- Data ingestion: collect and bring data from various sources (databases, files, APIs) into your pipeline.
- Cleaning: remove errors, inconsistencies, and irrelevant information; handle missing values and outliers.
- Feature engineering: select, transform, or create the variables (features) that help the model learn meaningful patterns.
- Training: feed prepared data into a model so it learns relationships and patterns by adjusting parameters.
- Evaluation: test the trained model on unseen data to measure its performance and diagnose strengths or weaknesses.
- Deployment: integrate the validated model into a real-world application or production system for use.
- Monitoring: continuously observe performance, detect problems or drift, and decide when retraining is needed.
The model is just one component in that flow. Sometimes it's not even the most complex one. I always struggle with data ingestion: getting good quality data is the key to success, but it's often difficult to get.
We'll cover data engineering in depth in another post.
Monitoring and drift: nothing stays static
You trained a model. It works. You deploy it.
Done? Not even close.
The world changes. User behavior shifts. Data evolves.
Concept of drift
This is called drift. It happens when the data changes in a way the model can no longer generalize to.
And when it happens, your model slowly becomes less accurate. So your model needs to evolve with the data.
That's why monitoring is not optional.
Concept of living system
You need to observe performance, detect degradation, and retrain when necessary. Machine learning systems are living systems.
A lot of the effort in good machine learning pipelines goes into monitoring: measuring performance, detecting degradation, and retraining when necessary. Implementing this is a whole discipline; the art of measuring and improving is still an open and wild topic for me.
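One simple (and by no means complete) way to start is to compare the distribution of incoming features against what the model saw at training time. Here's a sketch using a two-sample Kolmogorov-Smirnov test from SciPy on made-up data; the threshold and the shifted feature are arbitrary choices for illustration:

```python
import numpy as np
from scipy.stats import ks_2samp

def feature_drift_report(reference: np.ndarray, live: np.ndarray, threshold: float = 0.05):
    """Compare each feature's live distribution to the training-time reference.
    Small p-values suggest the distribution has shifted (possible drift)."""
    drifted = []
    for i in range(reference.shape[1]):
        result = ks_2samp(reference[:, i], live[:, i])
        if result.pvalue < threshold:
            drifted.append((i, result.pvalue))
    return drifted

# Illustrative data: live traffic where feature 0 has shifted
rng = np.random.default_rng(3)
reference = rng.normal(size=(2000, 3))
live = rng.normal(size=(500, 3))
live[:, 0] += 0.8  # simulated behavior change

print(feature_drift_report(reference, live))  # likely flags feature 0
```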
The art of measuring and improving
The important mindset I want you to have is: you cannot improve what you cannot measure.
Start simple: baseline before complexity
There's a tendency to jump into complex models too early. Deep learning, ensembles, advanced architectures.
But good engineering starts with a baseline.
A simple model gives you a reference point. It tells you if complexity is actually buying you something, or just adding cost.
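A sketch of that idea: put a trivial "most frequent class" baseline next to your candidate model, so every later improvement has a number to beat. The dataset here is synthetic and purely illustrative:

```python
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score

# Synthetic classification task (illustrative only)
X, y = make_classification(n_samples=1000, n_features=20, random_state=5)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=5)

# Baseline: always predict the most frequent class
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)

# Candidate: a simple linear model
candidate = LogisticRegression(max_iter=1000).fit(X_train, y_train)

print("baseline accuracy: ", accuracy_score(y_test, baseline.predict(X_test)))
print("candidate accuracy:", accuracy_score(y_test, candidate.predict(X_test)))
# Anything fancier now has a number to beat
```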
KISS principle: Keep it simple, stupid.
Please don't take this literally. It is a joke. But it is a good reminder.
Simple first. Complexity only if justified, iteratively. Iteration is key.
Outputs are probabilities, not truths
Concept of probabilities
Most models don't give you answers. They give you probabilities.
There's always uncertainty. Understanding that uncertainty, and designing systems that can handle it, is part of building reliable ML applications.
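A small sketch of what that looks like in practice: the model emits probabilities, and the surrounding system decides what to do with them. The 0.9 confidence threshold and the "route to a human" fallback are arbitrary illustrations, not a recommendation:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=2)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# The model returns a probability per class, not a verdict
probs = model.predict_proba(X_test[:1])[0]
print("P(class 0) =", round(probs[0], 3), "| P(class 1) =", round(probs[1], 3))

# The system, not the model, decides what to do with that uncertainty:
# e.g., only act automatically when the model is confident, otherwise escalate
if probs.max() >= 0.9:
    print("confident prediction:", probs.argmax())
else:
    print("low confidence, route to a human reviewer")
```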
A note on data responsibility: the real impact
Concept of biases
Models inherit the biases of the data they are trained on.
Not philosophically. Mechanically, everything is a reflection of the data. And the quality of the data is the quality of the model.
If your data is skewed, incomplete, or biased, your outputs will be too. This is not a side concern. It's a system design constraint. So, always be aware of the data you are feeding into the model.