These two are technically different even though they look the same

Photo by 𝓴𝓘𝓡𝓚 𝕝𝔸𝕀 on Unsplash

NumPy is the foundational Python library for numerical calculations and linear algebra. Two of its commonly used objects are ndarray and matrix: ndarray objects are created from the numpy ndarray class, and matrix objects are created from the numpy matrix class. If you’re new to NumPy, you may confuse the two. They are different things even though they look the same. Today, we’ll discuss 6 differences between them.
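To make the distinction concrete, here is a minimal sketch of one well-known difference: the meaning of the `*` operator. It may or may not be one of the six differences covered in this post, and the variable names are my own.

```python
import numpy as np

# The same values stored in the two different classes
a = np.array([[1, 2], [3, 4]])    # numpy.ndarray
m = np.matrix([[1, 2], [3, 4]])   # numpy.matrix

print(type(a).__name__)  # ndarray
print(type(m).__name__)  # matrix

# For an ndarray, * multiplies element by element
elementwise = a * a      # [[1, 4], [9, 16]]

# For a matrix, * performs matrix multiplication
matprod = m * m          # [[7, 10], [15, 22]]

# The @ operator gives the matrix product for an ndarray
print(elementwise)
print(matprod)
print(a @ a)
```

Printed side by side, `a` and `m` look almost identical, which is why the two objects are easy to mix up.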

I also recommend reading the following articles of mine.


Using sklearn, graphviz and dtreeviz Python packages for fancy visualization of decision trees

Photo by Liam Pozz on Unsplash

Data visualization plays a key role in data analysis and machine learning because it reveals the hidden patterns in the data. Model visualization allows you to interpret the model. The visualization process is easy today thanks to the many Python packages available.

Tree-based models such as Decision Trees, Random Forests and XGBoost are popular for supervised learning (classification and regression) tasks. This is because those models fit non-linear data well, and such data is common in real-world applications.

The baseline model for any tree-based model is the Decision Tree. Random Forests consist of multiple…


Reduce the size of your dataset while keeping as much of the variation as possible

Photo by Nika Benedictova on Unsplash

In both Statistics and Machine Learning, the number of attributes, features or input variables of a dataset is referred to as its dimensionality. For example, let’s take a very simple dataset containing 2 attributes called Height and Weight. This is a 2-dimensional dataset and any observation of this dataset can be plotted in a 2D plot.


To see how much your model benefits from adding more training data

Photo by Colin Carter on Unsplash

The Learning Curve is another great tool to have in any data scientist’s toolbox. It is a visualization technique that can be used to see how much our model benefits from adding more training data. It shows the relationship between the training score and the test score for a machine learning model with a varying number of training samples. Generally, cross-validation is used when plotting the learning curve.

A good ML model fits the training data very well and is generalizable to new input data as well. Sometimes, an ML model may require more training instances in…


Dimensionality Reduction in Action

Photo by JJ Ying on Unsplash

Principal Component Analysis (PCA) is a linear dimensionality reduction technique (algorithm) that transforms a set of p correlated variables into a smaller number k (k<p) of uncorrelated variables called principal components while keeping as much of the variability in the original data as possible.
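The definition above can be sketched with plain numpy. This is only an illustration under my own assumptions: the data is synthetic, the components are obtained from an SVD of the centered data rather than from any particular library's PCA class, and all names are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumption): p = 3 strongly correlated variables
x = rng.normal(size=(200, 1))
X = np.hstack([
    x,
    2 * x + 0.1 * rng.normal(size=(200, 1)),
    -x + 0.1 * rng.normal(size=(200, 1)),
])

# Center each variable, then take the SVD of the centered data
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Keep k < p components; the rows of Vt are the principal directions
k = 2
Z = Xc @ Vt[:k].T  # the data expressed in k principal components

# Share of the total variance captured by each component
explained = S**2 / np.sum(S**2)

print(Z.shape)
print(explained[:k].sum())
print(np.cov(Z, rowvar=False))
```

The off-diagonal entries of the printed covariance matrix are zero up to rounding error, which is what "uncorrelated" means here.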

One of the use cases of PCA is that it can be used for image compression — a technique that minimizes the size in bytes of an image while keeping as much of the quality of the image as possible. In this post, we will discuss that technique by using the MNIST dataset of handwritten digits…


Learn the way that worked for me

Photo by Braden Collum on Unsplash

Undoubtedly, Scikit-learn is one of the best machine learning libraries available today. There are several reasons for that. The consistency among Scikit-learn estimators is one reason. You cannot find such consistency in any other machine learning library. The .fit()/.predict() paradigm best describes the consistency. Another reason is that Scikit-learn has a variety of uses. It can be used for classification, regression, clustering, dimensionality reduction and anomaly detection.

Therefore, Scikit-learn is a must-have Python library in your data science toolkit. But learning to use Scikit-learn is not straightforward. It’s not as simple as you might imagine. You have to set up some background before…


Form groups of similar observations based on distance

Photo by Kelly Sikkema on Unsplash

The main objective of cluster analysis is to form groups (called clusters) of similar observations, usually based on the Euclidean distance. In machine learning terminology, clustering is an unsupervised task. Today, we discuss 4 useful clustering methods which belong to two main categories: hierarchical clustering and non-hierarchical clustering.

Under hierarchical clustering, we will discuss 3 agglomerative hierarchical methods: Single Linkage, Complete Linkage and Average Linkage. Under non-hierarchical clustering, we will discuss K-Means clustering.


Machine learning-based outlier detection

Photo by Paul Carroll on Unsplash

Based on the feedback given by readers after publishing “Two outlier detection techniques you should know in 2021”, I have decided to make this post which includes four different machine learning techniques (algorithms) for outlier detection in Python. Here, I will use the I-I (Intuition-Implementation) approach for each technique. That will help you to understand how each algorithm works behind the scenes without going deeper into the algorithm mathematics (the Intuition part) and implement each algorithm with the Scikit-learn machine learning library (the Implementation part). I will also use some graphical techniques to describe each algorithm and its output. At…


Perform Linear Algebra with Python

Photo by Isaiah Bekkers on Unsplash

About 30–40% of the mathematical knowledge required for Data Science and Machine Learning comes from linear algebra. Matrix operations play a significant role in linear algebra. Today, we discuss 10 such matrix operations with the help of the powerful numpy library. Numpy is generally used to perform numerical calculations in Python. It also has special classes and sub-packages for matrix operations. Vectorization allows numpy to perform matrix operations more efficiently by avoiding many explicit for loops.

I will include the meaning, background description and code examples for each matrix operation discussed in this article. The “Key Takeaways”…
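As a small illustration of what avoiding for loops looks like, the sketch below computes the same matrix product twice: once with three nested Python loops and once with numpy's `@` operator. The helper `matmul_loops` is a hypothetical function written only for this comparison.

```python
import numpy as np

def matmul_loops(A, B):
    """Matrix product written with explicit Python loops (for comparison only)."""
    n, m = A.shape
    m2, p = B.shape
    if m != m2:
        raise ValueError("inner dimensions must match")
    C = np.zeros((n, p))
    for i in range(n):
        for j in range(p):
            for k in range(m):
                C[i, j] += A[i, k] * B[k, j]
    return C

A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[5.0, 6.0], [7.0, 8.0]])

C_loop = matmul_loops(A, B)  # three nested loops in Python
C_vec = A @ B                # one vectorized call

print(C_vec)                      # [[19. 22.] [43. 50.]]
print(np.allclose(C_loop, C_vec)) # True
```

Both versions give the same result, but the vectorized call hands the looping to numpy's compiled code, which is where the efficiency gain comes from on larger matrices.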


Elliptic Envelope and IQR-based detection

Photo by Alexander Andrews on Unsplash

An outlier is an unusual data point that differs significantly from other data points. Outlier detection is tricky and should be done carefully. Elliptic Envelope and IQR are commonly used outlier detection techniques. Elliptic Envelope is a machine learning-based approach while IQR-based detection is a statistical approach. Each has its own advantages and disadvantages, so we cannot say which one is best. The best strategy is to combine the two techniques and look at the results as a whole.
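For the statistical side, here is a minimal sketch of IQR-based detection in numpy. It assumes the conventional 1.5 × IQR fences, which may differ from the exact rule used in the article; the function name `iqr_outliers` and the sample values are made up for illustration. Elliptic Envelope is not shown here.

```python
import numpy as np

def iqr_outliers(values, k=1.5):
    """Return a boolean mask that is True where a value lies outside the IQR fences."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower = q1 - k * iqr  # lower fence
    upper = q3 + k * iqr  # upper fence
    return (values < lower) | (values > upper)

# Made-up sample: seven similar values and one unusual point
data = [10, 12, 11, 13, 12, 11, 10, 95]
mask = iqr_outliers(data)

print(mask)
print(np.asarray(data)[mask])  # [95]
```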

In this article, we will discuss the intuition behind Elliptic Envelope and IQR techniques, and combine them together to…

Rukshan Pramoditha

Data Analyst with Python || Bring data into actionable insights
