This is how decision trees are combined to make a random forest

Photo by Filip Zrnzević on Unsplash

Random Forest is a powerful and widely used supervised machine learning algorithm. It can be used for both classification (predicting a discrete-valued output, i.e. a class) and regression (predicting a continuous-valued output) tasks. In this article, I describe how it can be used for a classification task with the popular Iris dataset.

The motivation for random forests

First, we discuss some of the drawbacks of the Decision Tree algorithm. This will motivate you to use Random Forests.

  • Small changes to training data can result in a significantly different tree structure.
  • A single tree is prone to overfitting (the model fits the training data very well but fails to generalize to new input data) unless you constrain it, for example by tuning the max_depth hyperparameter. …
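To make the idea concrete, here is a minimal sketch of a random forest classifier trained on the Iris dataset with Scikit-learn. The train/test split, the number of trees (n_estimators=100) and the random_state values are assumptions chosen for illustration. They are not necessarily the settings used in the article.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Load the Iris dataset: 150 flowers, 4 features, 3 classes.
X, y = load_iris(return_X_y=True)

# Hold out 30% of the data for testing (illustrative choice).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# A random forest averages many decision trees, each trained on a
# bootstrap sample of the training data with random feature subsets.
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)

accuracy = forest.score(X_test, y_test)
print(f"Number of trees: {len(forest.estimators_)}")
print(f"Test accuracy: {accuracy:.3f}")
```

Because the forest combines the votes of many trees, its predictions are usually less sensitive to small changes in the training data than those of a single tree.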


For complex nonlinear data

Image by author

Decision Trees are a non-parametric supervised learning method capable of finding complex nonlinear relationships in the data. They can perform both classification and regression tasks. In this article, however, we focus only on decision trees for regression. The corresponding Scikit-learn class is DecisionTreeRegressor.

We will start by discussing how to train, visualize and make predictions with Decision Trees for a regression task. We will also discuss how to regularize decision trees through their hyperparameters, which helps to avoid overfitting. Finally, we will discuss some of the advantages and disadvantages of Decision Trees.
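As a rough illustration of regularization, the sketch below fits two DecisionTreeRegressor models to synthetic noisy data: one unrestricted and one limited with max_depth=3. The data, the depth limit and the variable names are assumptions made for this example only.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic nonlinear data: a noisy sine curve (illustrative only).
rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(80, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.1, size=80)

# An unrestricted tree keeps splitting until it memorizes the training data.
deep_tree = DecisionTreeRegressor(random_state=0).fit(X, y)

# Limiting max_depth regularizes the tree and gives a smoother fit.
shallow_tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)

print("Unrestricted depth:", deep_tree.get_depth())
print("Regularized depth:", shallow_tree.get_depth())
print("Training R^2 (unrestricted):", deep_tree.score(X, y))
print("Training R^2 (max_depth=3):", shallow_tree.score(X, y))
```

A near-perfect training score for the unrestricted tree is a typical sign of overfitting. Performance on held-out data is the fairer comparison.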

Code convention

We use the following code convention to import the necessary libraries and set the plot style. …


Sequentially apply multiple transformers and a final regressor to build your model

Photo by Joshua Sortino on Unsplash

Welcome back! It’s very exciting to apply the knowledge that we already have to build machine learning models with some real data. Polynomial Regression, the topic that we discuss today, is a model that may require a fairly involved workflow, depending on the problem statement and the dataset.

Today, we discuss how to build a Polynomial Regression model and how to preprocess the data before building it. In practice, we apply a series of steps in a particular order to build the complete model. All the necessary tools are available in the Scikit-learn machine learning library for Python.
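One way to chain such steps is Scikit-learn's Pipeline class. The following is a minimal sketch on synthetic data. The step names, the choice of StandardScaler and the polynomial degree are assumptions for illustration, not necessarily the workflow used in the article.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic quadratic data with noise (illustrative only).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = 0.5 * X[:, 0] ** 2 + X[:, 0] + 2 + rng.normal(0, 0.2, size=100)

# Transformers are applied in order; the final step is the regressor.
model = Pipeline([
    ("scaler", StandardScaler()),
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
    ("regressor", LinearRegression()),
])

model.fit(X, y)
r2 = model.score(X, y)
print(f"Training R^2: {r2:.3f}")
```

Calling fit() on the pipeline fits each transformer in turn and passes the transformed data to the next step, so the whole workflow behaves like a single estimator.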

Prerequisites

If you’re not familiar with Python, numpy, pandas, machine learning and Scikit-learn, please read my previous articles that are prerequisites for this article. …


Understand how Principal Component Analysis (PCA) really works behind the scenes

As I promised in the previous article, Principal Component Analysis (PCA) with Scikit-learn, today, I’ll discuss the mathematics behind the principal component analysis by manually executing the algorithm using the powerful numpy and pandas libraries. This will help you to understand how PCA really works behind the scenes.

Before reading this one, I highly recommend that you read that article first, because this article continues from it.

In this article, I first review some statistical and mathematical concepts which are required to execute the PCA calculations.

Statistical concepts behind PCA

Mean

The mean (also called the average) is calculated by simply adding all the values and dividing by the number of values. …
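As a quick check of that definition, the sketch below computes the mean by hand and compares it with numpy's mean() function. The sample values are made up for illustration.

```python
import numpy as np

# Made-up sample values for illustration.
values = [4, 8, 15, 16, 23, 42]

# Add all the values and divide by the number of values.
manual_mean = sum(values) / len(values)

# numpy provides the same calculation as a built-in function.
numpy_mean = np.mean(values)

print("Manual mean:", manual_mean)
print("numpy mean:", numpy_mean)
```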


Unsupervised Machine Learning Algorithm for Dimensionality Reduction

Image by author

Hi everyone! This is the second unsupervised machine learning algorithm that I’m discussing here. This time, the topic is Principal Component Analysis (PCA). At the beginning of the tutorial, I’ll explain the dimensionality of a dataset, what dimensionality reduction means, the main approaches to dimensionality reduction, the reasons for it and what PCA means. Then, I will go deeper into PCA by implementing the algorithm with the Scikit-learn machine learning library. This will help you to apply PCA to a real-world dataset easily and get results quickly.

In a separate article (not in this one), I will discuss the mathematics behind the principal component analysis by manually executing the algorithm using the powerful numpy and pandas libraries. This will help you to understand how PCA really works behind the scenes. …
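For orientation, here is a minimal sketch of PCA with Scikit-learn. It assumes the Iris dataset, standardized features and two components. These choices are mine for illustration and may differ from the dataset and settings used in the tutorial.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Iris has 4 features; we reduce it to 2 dimensions (illustrative choice).
X, _ = load_iris(return_X_y=True)

# PCA is sensitive to feature scale, so standardize first.
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)

explained = pca.explained_variance_ratio_.sum()
print("Reduced shape:", X_reduced.shape)
print(f"Variance explained by 2 components: {explained:.3f}")
```

The explained_variance_ratio_ attribute shows how much of the original variance each principal component retains, which is a common way to decide how many components to keep.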


Unsupervised Machine Learning Algorithm for Clustering

Image by Author

Welcome, everyone, to another exciting ML topic: K-Means Clustering. To apply the algorithm to a real-world dataset, I’ll use the Scikit-learn machine learning library in Python.

What is K-Means Clustering?

Clustering is the task of partitioning a dataset into groups, called clusters. The objective of clustering is to identify distinct groups in the dataset such that the observations within a group are similar to each other but different from observations in other groups. Clustering is often used to find patterns in unlabeled data.

The K-Means algorithm is one of the simplest and most commonly used clustering algorithms. In k-means clustering, the algorithm attempts to group observations into k groups, with each group having roughly equal variance. …
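A minimal sketch of k-means with Scikit-learn is shown below. It uses synthetic blobs generated with make_blobs rather than a real-world dataset, and k=3 is an assumption chosen to match the generated data.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Generate 300 unlabeled points around 3 centers (illustrative only).
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=42)

# Ask k-means for k=3 groups; n_init=10 reruns the algorithm from
# 10 different starting centroids and keeps the best result.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

print("Cluster centers:")
print(kmeans.cluster_centers_)
print("First 10 labels:", labels[:10])
```

Each observation is assigned to the cluster whose center is nearest, so the labels are arbitrary integers that identify groups and carry no meaning beyond that.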


Supervised Machine Learning Algorithm for Classification

Hello friends! This is the 14th article on the Data Science 365 blog. So far, we’ve come a long way in Data Science and Machine Learning by discussing theory and applying it to real problems. If you haven’t read my previous articles published on Data Science 365, please read them to learn something new about Data Science and Machine Learning.

Dependencies


Understand how linear regression really works behind the scenes

Welcome, everyone, to another exciting tutorial at Data Science 365! So far, I’ve discussed the fundamentals of Data Science, Machine Learning and various Python libraries (modules/packages) such as numpy, pandas, matplotlib and seaborn, which can be used for your data analysis tasks.

It’s time to put into practice everything that I’ve discussed so far at Data Science 365. I highly recommend reading my previous articles published there before reading this one. Today, in this tutorial, I will discuss the most fundamental Machine Learning algorithm, Linear Regression, by following the steps of the Predictive Analytics process. …


To reveal the hidden patterns behind data

“A picture is worth a thousand words.” This is often true in the world of Data Science. Data visualization plays a key role as it allows you to reveal the patterns behind the data. There are three primary uses for data visualization:

  • To explore data
  • To validate a model
  • To communicate data

Exploratory Data Analysis (EDA), which uses visualization techniques, allows you to understand different characteristics of data, its variables and the potential relationships between them. We also use statistical visualization techniques to validate a model and its assumptions. …


Exploring a Dataset

Hello! Welcome to the second pandas tutorial: Exploring a Dataset. In this tutorial, I discuss the following topics with examples.

Topics discussed

  • Reasons for exploring data
  • Reading the data: pandas read_csv() function
  • Viewing the first few rows of the dataset: head() method of pandas DataFrame
  • Viewing the last few rows of the dataset: tail() method of pandas DataFrame
  • Viewing the dimensionality of the dataset: shape attribute of the DataFrame class
  • Getting a concise summary of the dataset: info() method of pandas DataFrame
  • Getting descriptive statistics of the data: describe() method of pandas DataFrame
  • Viewing the levels of a categorical variable
  • Viewing the counts of categorical variable levels: frequency table — pandas crosstab() function, bar chart — plot() method of pandas DataFrame, bar chart — catplot() function in…
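The functions and methods listed above can be sketched as follows. The tiny in-memory CSV and its column names (species, petal_length) are made up for illustration. In practice you would pass a file path to read_csv().

```python
import io

import pandas as pd

# A tiny made-up dataset, read from a string instead of a file.
csv_text = """species,petal_length
setosa,1.4
setosa,1.3
versicolor,4.7
virginica,6.0
virginica,5.1
"""
df = pd.read_csv(io.StringIO(csv_text))

print(df.head(2))      # first rows
print(df.tail(2))      # last rows
print(df.shape)        # (number of rows, number of columns)
df.info()              # concise summary: dtypes and non-null counts
print(df.describe())   # descriptive statistics for numeric columns

# Levels of a categorical variable and their counts.
levels = df["species"].unique()
counts = pd.crosstab(index=df["species"], columns="count")
print(levels)
print(counts)
```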

About

Rukshan Pramoditha

Data Analyst with Python || Author of Data Science 365 Blog
