Classification with XGBoost

Photo by Martin Adams on Unsplash

Welcome to the second article of the “A Journey through XGBoost” series. Today, we will build our first XGBoost model on the “heart disease” dataset and make a small (but useful) web app to communicate our results to end-users. Here are the topics we discuss today.

  • Formulate a classification problem
  • Identify the feature matrix and the target vector
  • Build the XGBoost model (Scikit-learn compatible API)
  • Describe ‘accuracy’ and ‘area under the ROC curve’ metrics
  • Explain XGBoost classifier hyperparameters
  • Build the XGBoost model (non-Scikit-learn compatible API; see the sketch after this list)
  • XGBoost’s DMatrix
  • Create a small web app for our XGBoost model with the Shapash Python library
  • Make…
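To preview where we are headed, here is a minimal sketch covering both APIs, assuming the data sits in a CSV file named heart.csv with a binary column named target (both names are placeholders for your own copy of the dataset):

```python
import pandas as pd
import xgboost as xgb
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score

# Hypothetical file and column names; substitute your own copy of the dataset
df = pd.read_csv("heart.csv")
X = df.drop(columns=["target"])   # feature matrix
y = df["target"]                  # target vector (1 = disease, 0 = no disease)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# --- Scikit-learn compatible API ---
clf = XGBClassifier(n_estimators=100, max_depth=3, learning_rate=0.1)
clf.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, clf.predict(X_test)))
print("ROC AUC :", roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))

# --- Native (non-Scikit-learn) API with DMatrix ---
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)
params = {"objective": "binary:logistic", "max_depth": 3, "eta": 0.1}
booster = xgb.train(params, dtrain, num_boost_round=100)
preds = (booster.predict(dtest) > 0.5).astype(int)  # probabilities -> classes
print("Accuracy (native API):", accuracy_score(y_test, preds))
```

The DMatrix is XGBoost’s own optimized data structure; we will discuss it in detail below.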


Setting up the background

Photo by Martin Adams on Unsplash

Welcome to another article series! This time, we are discussing XGBoost (Extreme Gradient Boosting), one of the leading and most preferred machine learning algorithms among data scientists in the 21st century. Many people call XGBoost a money-making algorithm because it often outperforms other algorithms, produces excellent scores, and has helped its users win cash prizes in data science competitions.

The topic we are discussing is broad and important, so we will cover it through a series of articles. It is like a journey, maybe a long journey for newcomers. We discuss the entire topic step…


Hands-on Tutorials

Not just dimensionality reduction, but rather finding latent variables

Photo by Nicolas Hoizey on Unsplash

Factor Analysis (FA) and Principal Component Analysis (PCA) are both described as dimensionality reduction techniques, but the main objective of Factor Analysis is not to reduce the dimensionality of the data. Factor Analysis is a useful approach for finding latent variables, which are not directly measured by a single variable but are instead inferred from other variables in the dataset. These latent variables are called factors. So, factor analysis is a model for the measurement of latent variables. For example, if we find two latent variables in our model, it is called a two-factor model. …
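As a rough illustration (not the article’s own code), here is how a two-factor model might be fitted with Scikit-learn’s FactorAnalysis class, using the Iris data purely as a stand-in:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import FactorAnalysis
from sklearn.preprocessing import StandardScaler

X = load_iris().data

# Standardize: FA assumes the variables are on comparable scales
X_std = StandardScaler().fit_transform(X)

# A two-factor model: infer 2 latent variables from the 4 measured ones
fa = FactorAnalysis(n_components=2, random_state=0)
scores = fa.fit_transform(X_std)  # factor scores for each sample
print(fa.components_.shape)       # (2, 4): loading of each factor on each variable
```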


Unsupervised Machine Learning Algorithm for Dimensionality Reduction

Photo by Renee Fisher on Unsplash

Hi again! Today, we discuss one of the most popular machine learning algorithms among data scientists: Principal Component Analysis (PCA). Previously, I have written some content on this topic. If you haven’t read it yet, you can find it at:

In this article, more emphasis will be given to the two programming languages (R and Python) which we use to perform PCA. At the end of the article, you will see the difference between R and Python in terms of performing PCA.
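As a quick preview of the Python side (in R, the built-in prcomp() function plays a similar role), a minimal PCA with Scikit-learn might look like this, using the wine dataset purely as a stand-in:

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = load_wine().data                       # stand-in dataset, 13 features
X_std = StandardScaler().fit_transform(X)  # PCA is scale-sensitive

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_std)
print(X_pca.shape)                    # (178, 2): data projected onto 2 components
print(pca.explained_variance_ratio_)  # variance captured by each component
```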

The dataset that…


For evaluating a model’s performance and hyperparameter tuning

Photo by Scott Webb on Unsplash

k-fold cross-validation is one of the most popular strategies used by data scientists. It is a data partitioning strategy that lets you use your dataset effectively to build a more generalized model. The main intention of doing any kind of machine learning is to develop a model that generalizes, i.e. performs well on unseen data. One can build a perfect model on the training data with 100% accuracy or 0 error, but it may fail to generalize to unseen data. Such a model is not a good model: it overfits the training data. Machine Learning is all about…
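As a rough sketch of the idea, here is 5-fold cross-validation with Scikit-learn; the breast cancer dataset and the random forest model are stand-ins, not the article’s own choices:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

X, y = load_breast_cancer(return_X_y=True)  # stand-in dataset

# 5 folds: each fold serves once as validation data, 4 times as training data
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(RandomForestClassifier(random_state=42), X, y, cv=kf)

print("Fold accuracies:", scores.round(3))
print("Mean accuracy  :", round(scores.mean(), 3), "+/-", round(scores.std(), 3))
```

Averaging over the folds gives a more honest estimate of performance on unseen data than a single train/test split.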


This is how decision trees are combined to make a random forest

Photo by Filip Zrnzević on Unsplash

The Random Forest is one of the most powerful machine learning algorithms available today. It is a supervised machine learning algorithm that can be used for both classification (predicting a discrete-valued output, i.e. a class) and regression (predicting a continuous-valued output) tasks. In this article, I describe how it can be used for a classification task with the popular Iris dataset.
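As a minimal sketch of that classification task (not necessarily the exact code used later in the article):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# An ensemble of 100 decision trees, each trained on a bootstrap sample
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
print("Test accuracy:", accuracy_score(y_test, rf.predict(X_test)))
```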

The motivation for random forests

First, we discuss some of the drawbacks of the Decision Tree algorithm. This will motivate you to use Random Forests.

  • Small changes to training data can result in a significantly different tree structure.
  • It may have the problem of…


For complex nonlinear data

Image by author

The Decision Tree is a non-parametric supervised learning method, capable of finding complex nonlinear relationships in the data. It can perform both classification and regression tasks. In this article, however, we only focus on decision trees for a regression task. For this, the equivalent Scikit-learn class is DecisionTreeRegressor.

We will start by discussing how to train, visualize and make predictions with Decision Trees for a regression task. We will also discuss how to regularize decision trees through their hyperparameters, which helps avoid the problem of overfitting. Finally, we will discuss some of the advantages and disadvantages of Decision Trees.
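As a rough preview, a regularized DecisionTreeRegressor fitted on synthetic nonlinear data might look like this (the data and the hyperparameter values are illustrative assumptions, not the article’s own):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Synthetic noisy nonlinear data for illustration
rng = np.random.RandomState(42)
X = np.sort(rng.uniform(0, 5, 200)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(0, 0.1, 200)

# max_depth and min_samples_leaf regularize the tree against overfitting
tree = DecisionTreeRegressor(max_depth=3, min_samples_leaf=10, random_state=42)
tree.fit(X, y)

print(tree.predict([[2.5]]))  # predict a continuous value for a new point
print(export_text(tree))      # text visualization of the learned splits
```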

We use the…


Sequentially apply multiple transformers and a final regressor to build your model

Photo by Joshua Sortino on Unsplash

Welcome back! It’s very exciting to apply the knowledge that we already have to build machine learning models with some real data. Polynomial Regression, the topic we discuss today, is one such model that may require a somewhat complicated workflow depending on the problem statement and the dataset.

Today, we discuss how to build a Polynomial Regression model and how to preprocess the data before building it. In practice, we apply a series of steps in a particular order to build the complete model. All the necessary tools are available in the Python Scikit-learn machine learning library.
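As a rough sketch of such a workflow, Scikit-learn’s Pipeline chains the transformers and the final regressor so they run in order (the synthetic data and the polynomial degree are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic quadratic data with noise, for illustration only
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, (100, 1))
y = 0.5 * X.ravel() ** 2 + X.ravel() + 2 + rng.normal(0, 0.5, 100)

# Transformers are applied sequentially; the final step is the regressor
model = Pipeline([
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
    ("scaler", StandardScaler()),
    ("reg", LinearRegression()),
])
model.fit(X, y)
print("R^2:", round(model.score(X, y), 3))
```

Because the whole chain is a single estimator, one fit() call preprocesses the data and trains the model in the right order.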

If you’re not familiar…


Understand how Principal Component Analysis (PCA) really works behind the scenes


As I promised in the previous article, Principal Component Analysis (PCA) with Scikit-learn, today I’ll discuss the mathematics behind principal component analysis by manually executing the algorithm using the powerful numpy and pandas libraries. This will help you to understand how PCA really works behind the scenes.

Before proceeding with this one, I highly recommend reading the following article:

This is because this article continues from the one above.

In this article, I first review some statistical and mathematical concepts which are required to execute the PCA calculations.
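To make the plan concrete, here is a minimal manual PCA with numpy, using the Iris data purely as a stand-in; the exact steps the article walks through may differ:

```python
import numpy as np
from sklearn.datasets import load_iris

X = load_iris().data

# 1. Standardize each variable (zero mean, unit variance)
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Covariance matrix of the standardized data
cov = np.cov(Z, rowvar=False)

# 3. Eigendecomposition: the eigenvectors are the principal components
eigvals, eigvecs = np.linalg.eigh(cov)

# 4. Sort by descending eigenvalue and project onto the top 2 components
order = np.argsort(eigvals)[::-1]
components = eigvecs[:, order[:2]]
X_pca = Z @ components

print(eigvals[order] / eigvals.sum())  # explained variance ratio per component
print(X_pca.shape)                     # (150, 2)
```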

Statistical concepts behind PCA

The…


Unsupervised Machine Learning Algorithm for Dimensionality Reduction

Image by author

Hi everyone! This is the second unsupervised machine learning algorithm that I’m discussing here. This time, the topic is Principal Component Analysis (PCA). At the very beginning of the tutorial, I’ll explain the dimensionality of a dataset, what dimensionality reduction means, the main approaches to dimensionality reduction, the reasons for it, and what PCA means. Then, I will go deeper into PCA by implementing the algorithm with the Scikit-learn machine learning library. This will help you to easily apply PCA to a real-world dataset and get results very fast.
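As a rough preview of that Scikit-learn workflow (the breast cancer dataset and the 95% variance threshold are illustrative assumptions, not the article’s own choices):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = load_breast_cancer().data             # stand-in 30-dimensional dataset
X_std = StandardScaler().fit_transform(X)

# Fit PCA with all components, then keep enough for ~95% of the variance
pca = PCA().fit(X_std)
cumulative = np.cumsum(pca.explained_variance_ratio_)
n_components = np.argmax(cumulative >= 0.95) + 1
print(f"{n_components} components explain 95% of the variance")

X_reduced = PCA(n_components=n_components).fit_transform(X_std)
print(X_reduced.shape)  # the dimensionality-reduced data
```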

In a separate article (not in this one), I will…

Rukshan Pramoditha

Data Analyst with Python || Author of Data Science 365 Blog || Sri Lanka
