Introduction to Machine Learning with Scikit-Learn

What is Machine Learning?

Machine learning (ML) is a subset of artificial intelligence (AI) in which algorithms learn patterns from data and use them to make predictions or decisions, rather than following rules written by hand for each case.

In this article, we’ll explore supervised learning, which uses labeled data to train models, and Scikit-learn, one of the most popular libraries for implementing machine learning in Python.


Installing Scikit-learn

To begin using Scikit-learn, you need to install it:

pip install scikit-learn

After installing, you can import it into your script:

import sklearn
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

Understanding Supervised Learning

Supervised learning involves training a model on labeled data, where each set of input features is paired with a known output label. The model’s goal is to learn the mapping from features to labels so that it can predict the output for unseen data. Supervised tasks fall into two main types:

  • Regression: Predicting a continuous value (e.g., predicting house prices).
  • Classification: Predicting a category or class (e.g., predicting if an email is spam or not).
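To make the distinction concrete, here is a minimal sketch that fits one model of each type on the same toy feature. The data is invented for illustration (a noise-free line for regression, a simple threshold for classification), and the variable names are assumptions, not part of any dataset used later.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import accuracy_score, mean_squared_error

# Hypothetical toy data: a single feature with 50 evenly spaced values.
X_toy = np.linspace(0, 10, 50).reshape(-1, 1)

# Regression: the target is a continuous value (here y = 3x + 2, no noise).
y_continuous = 3.0 * X_toy.ravel() + 2.0
reg = LinearRegression().fit(X_toy, y_continuous)
reg_mse = mean_squared_error(y_continuous, reg.predict(X_toy))
print(f"Regression MSE: {reg_mse:.6f}")

# Classification: the target is a category (1 if the feature exceeds 5, else 0).
y_category = (X_toy.ravel() > 5).astype(int)
clf = LogisticRegression().fit(X_toy, y_category)
clf_accuracy = accuracy_score(y_category, clf.predict(X_toy))
print(f"Classification accuracy: {clf_accuracy:.2f}")
```

Both scores are computed on the training data, which is fine for showing the two output types but is not a fair measure of how well a model generalizes.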

Step-by-Step ML Workflow

Here’s a typical machine learning process:

  1. Load the Data – Read your dataset into memory, for example from a file, a database, or one of Scikit-learn’s built-in datasets.
  2. Preprocess the Data – Clean the data, handle missing values, and normalize if needed.
  3. Split the Data – Divide the data into training and test sets.
  4. Train the Model – Choose an algorithm and train it on the training set.
  5. Evaluate the Model – Test the model on the test set to see how well it performs.
  6. Tune and Improve – Adjust hyperparameters, select features, or try different models to improve performance.
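The preprocessing step is the one this article does not revisit later, so here is a small sketch of what it might look like. It assumes a hypothetical feature matrix with one missing value, fills the gap with the column mean using SimpleImputer, and standardizes each column with StandardScaler. Other strategies (median imputation, min-max scaling) may suit your data better.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical raw features; np.nan marks a missing measurement.
X_raw = np.array([
    [1.0, 200.0],
    [2.0, np.nan],
    [3.0, 400.0],
    [4.0, 600.0],
])

# Fill missing entries with the column mean, then rescale each column
# to zero mean and unit variance.
preprocess = make_pipeline(SimpleImputer(strategy="mean"), StandardScaler())
X_clean = preprocess.fit_transform(X_raw)
print(X_clean)
```

In a full workflow you would fit these transformers on the training set only and reuse them to transform the test set, so that no information from the test data leaks into training.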

Loading Data

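For simplicity, we’ll use the well-known Iris dataset, which ships with Scikit-learn. It contains 150 flowers from three species, each described by four measurements (sepal length, sepal width, petal length, petal width). Our task is to predict the species of a flower from its measurements.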

from sklearn.datasets import load_iris
data = load_iris()
X = data.data  # Features (input)
y = data.target  # Labels (output)
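Before going further, it can help to look at what was loaded. The following sketch prints the array shapes, the feature and class names, and the first few rows, using attributes of the object that load_iris returns.

```python
from sklearn.datasets import load_iris

data = load_iris()
X = data.data
y = data.target

print(X.shape)             # (number of samples, number of features)
print(data.feature_names)  # names of the four measurement columns
print(data.target_names)   # species names; y stores their integer indices
print(X[:3])               # first three rows of features
print(y[:3])               # first three labels
```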

Splitting Data

To evaluate your model, you should split your data into a training set (for training the model) and a test set (for testing how well the model generalizes).

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

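Here, 80% of the data is used for training, and 20% is reserved for testing. With the 150 Iris samples, that means 120 for training and 30 for testing. Setting random_state fixes the shuffle, so the split is the same every time you run the script.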
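As a quick check, the sketch below prints the sizes of the two sets and the number of samples per class in each. It also passes stratify=y, which is an addition not used in the article's own split: it asks train_test_split to keep the class proportions the same in both sets, which can matter for small or imbalanced datasets.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# stratify=y is an optional extra: it preserves class proportions in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)  # training and test set sizes
print(np.bincount(y_train))         # samples per class in the training set
print(np.bincount(y_test))          # samples per class in the test set
```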


Building a Model

Let’s use a Logistic Regression model (often used for classification tasks). This algorithm will learn to classify the Iris species based on the features.

from sklearn.linear_model import LogisticRegression

# Create a model; max_iter=200 raises the solver's iteration limit
# above the default of 100 to give it more room to converge
model = LogisticRegression(max_iter=200)

# Train the model
model.fit(X_train, y_train)
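After fitting, the model exposes what it learned through attributes that end in an underscore. The sketch below repeats the loading, splitting, and training steps so that it runs on its own, then prints a few of those attributes.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

print(model.classes_)          # class labels seen during training
print(model.coef_.shape)       # one row of weights per class, one column per feature
print(model.intercept_.shape)  # one intercept per class
print(model.n_iter_)           # iterations the solver actually used
```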

Making Predictions

Once the model is trained, you can make predictions on the test data:

y_pred = model.predict(X_test)
print(y_pred)  # Predicted labels
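The predicted labels are integers, so it is often more readable to map them back to species names. Logistic regression can also report how confident it is in each prediction through predict_proba. The sketch below shows both for the first five test samples; it repeats the earlier steps so that it runs on its own.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)
model = LogisticRegression(max_iter=200).fit(X_train, y_train)

sample = X_test[:5]
predicted = model.predict(sample)    # integer class labels
proba = model.predict_proba(sample)  # one probability per class, per sample

for label, probs in zip(predicted, proba):
    # Translate the integer label into a species name.
    print(data.target_names[label], np.round(probs, 3))
```

Each row of probabilities sums to 1, and the predicted label is the class with the highest probability.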

Evaluating the Model

Now that we have predictions, it’s time to evaluate how well our model performed.

For classification problems, we can use accuracy (the fraction of correct predictions, shown here as a percentage):

from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy * 100:.2f}%")

We can also use confusion matrices and classification reports to better understand how the model performs:

from sklearn.metrics import confusion_matrix, classification_report
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
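To read a confusion matrix, remember that in Scikit-learn each row is a true class and each column is a predicted class, so correct predictions sit on the diagonal. The sketch below recomputes accuracy from the matrix to show how the two metrics relate. The labels argument is passed explicitly here (an optional choice) so the matrix always has one row and column per species.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = LogisticRegression(max_iter=200).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred, labels=[0, 1, 2])
print(cm)

# Correct predictions are on the diagonal, so accuracy is diagonal sum / total.
accuracy_from_cm = np.trace(cm) / cm.sum()
print(accuracy_from_cm, accuracy_score(y_test, y_pred))
```

Any non-zero entry off the diagonal tells you which true class was mistaken for which predicted class, something a single accuracy number cannot show.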

Final Thoughts on Machine Learning

Machine learning isn’t a one-size-fits-all process. Different problems require different algorithms, and tuning these models to perfection can take time. However, Scikit-learn makes it easy to experiment with different algorithms and evaluate their performance, giving you a solid foundation for learning and building ML systems.
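As an illustration of how little code it takes to try different algorithms, the sketch below fits three classifiers on the same split and compares their test accuracy. The choice of models and their settings is an assumption made for this example, not a recommendation, and a single small test set is only a rough basis for comparison.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Illustrative candidates; every Scikit-learn classifier shares fit/predict.
candidates = {
    "logistic regression": LogisticRegression(max_iter=200),
    "k-nearest neighbors": KNeighborsClassifier(n_neighbors=5),
    "decision tree": DecisionTreeClassifier(random_state=42),
}

scores = {}
for name, candidate in candidates.items():
    candidate.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, candidate.predict(X_test))
    print(f"{name}: {scores[name] * 100:.2f}%")
```

Because every estimator exposes the same fit and predict methods, swapping one algorithm for another only changes the line that constructs it.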


Next Steps

In future articles, we’ll dive deeper into more advanced machine learning topics, like model evaluation techniques, hyperparameter tuning, and advanced algorithms (e.g., decision trees, random forests, and neural networks).


Next Up: Hyperparameter Tuning and Model Optimization