What is Machine Learning?
Machine learning (ML) is a subset of artificial intelligence (AI) in which algorithms learn patterns from data and use them to make predictions or decisions, rather than following rules that a programmer wrote by hand for each case.
In this article, we’ll explore supervised learning, which uses labeled data to train models, and Scikit-learn, one of the most popular libraries for implementing machine learning in Python.
Installing Scikit-learn
To begin using Scikit-learn, you need to install it:
pip install scikit-learn
After installing, you can import it into your script:
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
Understanding Supervised Learning
Supervised learning involves training a model on labeled data, where each set of input features is paired with a known output label. The model’s goal is to learn the mapping from features to labels so that it can predict the output for unseen data. Supervised learning tasks fall into two main types:
- Regression: Predicting a continuous value (e.g., predicting house prices).
- Classification: Predicting a category or class (e.g., predicting if an email is spam or not).
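To make the distinction concrete, here is a minimal sketch that fits one model of each type. The data is synthetic and made up purely for illustration (it is not a real housing or email dataset), and the variable names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import accuracy_score, mean_squared_error

# Synthetic data for illustration only: one feature in the range [0, 10)
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))

# Regression: the target is a continuous number (a noisy straight line)
y_continuous = 3.0 * X[:, 0] + 2.0 + rng.normal(0, 0.5, size=100)
reg = LinearRegression().fit(X, y_continuous)
mse = mean_squared_error(y_continuous, reg.predict(X))

# Classification: the target is a discrete class (0 or 1)
y_class = (X[:, 0] > 5).astype(int)
clf = LogisticRegression().fit(X, y_class)
acc = accuracy_score(y_class, clf.predict(X))

print(f"Regression MSE: {mse:.3f}")
print(f"Classification accuracy: {acc:.2f}")
```

The regression model is scored with an error measure on continuous values, while the classifier is scored by how often it picks the right class. Both scores here are computed on the training data, so they only illustrate the API, not how well either model generalizes.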
Step-by-Step ML Workflow
Here’s a typical machine learning process:
- Load the Data – You usually start with a dataset.
- Preprocess the Data – Clean the data, handle missing values, and normalize if needed.
- Split the Data – Divide the data into training and test sets.
- Train the Model – Choose an algorithm and train it on the training set.
- Evaluate the Model – Test the model on the test set to see how well it performs.
- Tune and Improve – Adjust hyperparameters, select features, or try different models to improve performance.
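The preprocessing step is the only one in this list that the Iris example does not need, because that dataset is already clean. As a rough sketch of what it can look like, the snippet below fills a missing value and standardizes the features. The tiny feature matrix is hypothetical, and mean imputation plus standard scaling is just one reasonable choice among several.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy feature matrix with one missing value (np.nan)
X_raw = np.array([
    [1.0, 200.0],
    [2.0, np.nan],
    [3.0, 400.0],
    [4.0, 600.0],
])

preprocess = make_pipeline(
    SimpleImputer(strategy="mean"),  # replace NaN with the column mean
    StandardScaler(),                # rescale each column to zero mean, unit variance
)
X_clean = preprocess.fit_transform(X_raw)
print(X_clean)
```

In a real project you would fit these preprocessing steps on the training set only and then apply them to the test set, so that no information from the test data leaks into training.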
Loading Data
For simplicity, we’ll use the famous Iris dataset to predict the species of a flower based on its measurements (sepal length, sepal width, petal length, petal width).
from sklearn.datasets import load_iris
data = load_iris()
X = data.data    # Features (input)
y = data.target  # Labels (output)
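Before going further, it can help to look at what was loaded. The following sketch prints the shape of the feature matrix, the feature and class names, and the number of samples per class.

```python
import numpy as np
from sklearn.datasets import load_iris

data = load_iris()

print(data.data.shape)            # (150, 4): 150 flowers, 4 measurements each
print(data.feature_names)         # names of the four measurement columns
print(data.target_names)          # ['setosa' 'versicolor' 'virginica']
print(np.bincount(data.target))   # [50 50 50]: samples per class
```

The labels in `data.target` are the integers 0, 1, and 2, which index into `data.target_names`.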
Splitting Data
To evaluate your model, you should split your data into a training set (for training the model) and a test set (for testing how well the model generalizes).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
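Here, 80% of the data is used for training, and 20% is reserved for testing. For the 150-sample Iris dataset, that means 120 training samples and 30 test samples. Setting random_state fixes the shuffle, so the same split is produced on every run.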
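A quick way to sanity-check a split is to print the resulting shapes. The sketch below also passes `stratify=y`, which is an optional addition not used in the call above; it asks `train_test_split` to keep the class proportions the same in both sets.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# stratify=y is an optional extra: it preserves class balance in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)  # (120, 4) (30, 4)
print(np.bincount(y_test))          # [10 10 10]
```

Without stratification, a small test set can end up with noticeably more samples of one class than another, which makes the evaluation less representative.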
Building a Model
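Let’s use a Logistic Regression model, which, despite its name, is a classification algorithm. It will learn to classify the Iris species based on the four features. Setting max_iter=200 raises the solver's iteration limit above the default of 100, which gives it more room to converge.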
from sklearn.linear_model import LogisticRegression
# Create a model
model = LogisticRegression(max_iter=200)
# Train the model
model.fit(X_train, y_train)
Making Predictions
Once the model is trained, you can make predictions on the test data:
y_pred = model.predict(X_test)
print(y_pred)  # Predicted labels
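The predicted labels are integers, which are not very readable on their own. As a sketch, the snippet below repeats the steps so far in one place, then maps the predictions back to species names and asks the model for class probabilities through `predict_proba`.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)
model = LogisticRegression(max_iter=200).fit(X_train, y_train)

y_pred = model.predict(X_test)

# Integer labels index into target_names, giving readable species names
predicted_species = data.target_names[y_pred]

# One row per test sample, one column per class; each row sums to 1
probabilities = model.predict_proba(X_test)

print(predicted_species[:5])
print(probabilities[:5].round(3))
```

Looking at the probabilities shows how confident the model is in each prediction, which a bare label hides.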
Evaluating the Model
Now that we have predictions, it’s time to evaluate how well our model performed.
For classification problems, we can use accuracy (the percentage of correct predictions):
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy * 100:.2f}%")
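Accuracy is simple enough to compute by hand, which is a useful check on what `accuracy_score` returns. The sketch below uses a small set of hypothetical labels rather than the Iris predictions.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical true and predicted labels, for illustration only
y_true = np.array([0, 1, 2, 2, 1, 0, 1, 2])
y_hat = np.array([0, 1, 2, 1, 1, 0, 2, 2])

# Fraction of positions where prediction and truth agree (6 of 8 here)
manual_accuracy = np.mean(y_true == y_hat)

print(manual_accuracy)                 # 0.75
print(accuracy_score(y_true, y_hat))   # 0.75
```

Note that `accuracy_score` returns a fraction between 0 and 1, which is why it gets multiplied by 100 when printed as a percentage.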
We can also use confusion matrices and classification reports to better understand how the model performs:
from sklearn.metrics import confusion_matrix, classification_report
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
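Reading a confusion matrix takes a little practice. In scikit-learn, rows correspond to the true classes and columns to the predicted classes, so correct predictions sit on the diagonal. The sketch below uses hypothetical labels to show how accuracy and per-class recall fall out of the matrix.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical true and predicted labels, for illustration only
y_true = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 2])
y_hat = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2, 1])

cm = confusion_matrix(y_true, y_hat)
print(cm)
# [[3 0 0]
#  [0 2 1]
#  [0 1 3]]

# Diagonal entries are correct predictions, so accuracy = trace / total
accuracy_from_cm = np.trace(cm) / cm.sum()

# Recall per class = correct predictions for that class / its row total
recall_from_cm = np.diag(cm) / cm.sum(axis=1)

print(accuracy_from_cm)  # 0.8
print(recall_from_cm)
print(recall_score(y_true, y_hat, average=None))
```

The off-diagonal entries show which classes get confused with each other. In this made-up example, classes 1 and 2 are mixed up once in each direction, while class 0 is always predicted correctly.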
Final Thoughts on Machine Learning
Machine learning isn’t a one-size-fits-all process. Different problems require different algorithms, and tuning these models to perfection can take time. However, Scikit-learn makes it easy to experiment with different algorithms and evaluate their performance, giving you a solid foundation for learning and building ML systems.
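As one illustration of that kind of experimentation, the sketch below trains three different classifiers on the same Iris split and prints each one's test accuracy. The choice of models and their settings is arbitrary and made for illustration, not a recommendation.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Arbitrary candidate models; all share the same fit/score interface
candidates = {
    "logistic regression": LogisticRegression(max_iter=200),
    "decision tree": DecisionTreeClassifier(random_state=42),
    "k-nearest neighbors": KNeighborsClassifier(n_neighbors=5),
}

results = {}
for name, estimator in candidates.items():
    estimator.fit(X_train, y_train)
    # For classifiers, score() returns accuracy on the given data
    results[name] = estimator.score(X_test, y_test)
    print(f"{name}: {results[name]:.3f}")
```

Because every estimator exposes the same `fit` and `score` methods, swapping algorithms only means changing one line. With a test set of just 30 samples, small differences between the scores should not be read as one model being better than another.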
Next Steps
In future articles, we’ll dive deeper into more advanced machine learning topics, like model evaluation techniques, hyperparameter tuning, and advanced algorithms (e.g., decision trees, random forests, and neural networks).