**Numpy** is the foundational Python library, widely used for numerical calculations and linear algebra. Its **ndarray** and…
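
As a minimal sketch of what the paragraph describes, the `ndarray` supports element-wise arithmetic and linear algebra out of the box (the array values here are illustrative):

```python
import numpy as np

# Create a 2x3 ndarray and inspect its basic attributes.
a = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
print(a.shape)  # (2, 3)
print(a.dtype)  # float64

# Element-wise arithmetic and linear algebra come built in.
b = a * 2    # element-wise scaling
c = a @ a.T  # 2x2 matrix product
print(c)
```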

I recommend reading the following content, which I have written.

Data visualization plays a key role in data analysis and machine learning, as it reveals the hidden patterns behind the data. Model visualization lets you interpret a model. Visualization is now easy thanks to the many Python packages available today.

Tree-based models such as Decision Trees, Random Forests and XGBoost are popular for supervised learning (classification and regression) tasks. This is because those models fit well on the non-linear data frequently encountered in real-world applications.

The baseline model for any tree-based model is the **Decision Tree**. Random Forests consist of multiple…

In both Statistics and Machine Learning, the number of attributes, features or input variables of a dataset is referred to as its **dimensionality**. For example, let’s take a very simple dataset containing 2 attributes called *Height* and *Weight*. This is a 2-dimensional dataset and any observation of this dataset can be plotted in a 2D plot.

The **Learning Curve** is another great tool to have in any data scientist’s toolbox. It is a visualization technique that can be used to see how much our model benefits from adding more training data. It shows the relationship between the training score and the test score for a machine learning model with a varying number of training samples. Generally, a cross-validation procedure is used when plotting the learning curve.
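
As a sketch, the scores behind such a plot can be computed with Scikit-learn’s `learning_curve` helper; the synthetic dataset and the `DecisionTreeClassifier` settings here are illustrative assumptions, not from the post:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

# Synthetic data stands in for a real dataset.
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# Cross-validated training and test scores at increasing training-set sizes.
train_sizes, train_scores, test_scores = learning_curve(
    DecisionTreeClassifier(max_depth=3, random_state=42),
    X, y, cv=5, train_sizes=np.linspace(0.1, 1.0, 5),
)

for n, tr, te in zip(train_sizes, train_scores.mean(axis=1), test_scores.mean(axis=1)):
    print(f"{n:>4} samples: train={tr:.2f}, test={te:.2f}")
```

Plotting the two mean-score curves against `train_sizes` gives the learning curve itself.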

A good ML model fits the training data very well and is generalizable to new input data as well. Sometimes, an ML model may require more training instances in…

**Principal Component Analysis (PCA)** is a linear dimensionality reduction technique (algorithm) that transforms a set of p correlated variables into a smaller number k (k < p) of uncorrelated variables called **principal components**, while keeping as much of the variability in the original data as possible.
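
A minimal sketch of that definition with Scikit-learn’s `PCA`, using synthetic correlated data (p = 5 variables generated from 2 latent factors, an assumption for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 200 observations of p=5 correlated variables built from 2 latent factors.
base = rng.normal(size=(200, 2))
X = base @ rng.normal(size=(2, 5)) + 0.05 * rng.normal(size=(200, 5))

# Reduce p=5 correlated variables to k=2 uncorrelated principal components.
pca = PCA(n_components=2)
Z = pca.fit_transform(X)
print(Z.shape)  # (200, 2)
print(pca.explained_variance_ratio_.sum())  # variability retained by k=2 components
```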

One of the use cases of PCA is **image compression**: a technique that minimizes the size in bytes of an image while keeping as much of the quality of the image as possible. In this post, we will discuss that technique by using the MNIST dataset of handwritten digits…
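
As a rough sketch of the idea (the post uses MNIST; Scikit-learn’s bundled 8×8 digits dataset is used here as a lighter stand-in, and keeping 16 components is an illustrative choice):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X = load_digits().data  # (1797, 64): each row is a flattened 8x8 digit image

# Keep 16 of 64 components: a 4x reduction in stored coefficients per image.
pca = PCA(n_components=16).fit(X)
codes = pca.transform(X)              # compressed representation
X_rec = pca.inverse_transform(codes)  # approximate reconstruction

err = np.mean((X - X_rec) ** 2) / np.mean(X ** 2)
print(f"variance kept: {pca.explained_variance_ratio_.sum():.2%}")
print(f"relative reconstruction error: {err:.2%}")
```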

Undoubtedly, **Scikit-learn** is one of the best machine learning libraries available today. There are several reasons for that. The consistency among Scikit-learn estimators is one: few other machine learning libraries offer that level of consistency, and the **.fit()/.predict()** paradigm describes it best. Another reason is Scikit-learn’s versatility. It can be used for classification, regression, clustering, dimensionality reduction and anomaly detection.
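
The **.fit()/.predict()** consistency can be seen in a few lines; the three estimators and the synthetic dataset below are illustrative choices:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=0)

# Three very different estimators, one identical interface.
for model in (LogisticRegression(max_iter=1000),
              RandomForestClassifier(random_state=0),
              SVC()):
    model.fit(X, y)           # same call everywhere
    preds = model.predict(X)  # same call everywhere
    print(type(model).__name__, "training accuracy:", (preds == y).mean())
```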

Therefore, Scikit-learn is a must-have Python library in your data science toolkit. But learning to use Scikit-learn is not straightforward. It’s not as simple as you might imagine. You have to set up some background before…

The main objective of cluster analysis is to form groups (called **clusters**) of similar observations, usually based on the…

Under hierarchical clustering, we will discuss 3 *agglomerative* hierarchical methods — **Single Linkage**, **Complete Linkage** and **Average Linkage**. Under non-hierarchical clustering methods, we will discuss the **K-Means Clustering**.

Based on the feedback given by readers after publishing “Two outlier detection techniques you should know in 2021”, I have decided to write this post, which includes four different machine learning techniques (algorithms) for outlier detection in Python. Here, I will use the I-I (Intuition-Implementation) approach for each technique. That will help you understand how each algorithm works behind the scenes without going deep into its mathematics (the Intuition part) and implement each algorithm with the Scikit-learn machine learning library (the Implementation part). I will also use some graphical techniques to describe each algorithm and its output. At…

About 30–40% of the mathematical knowledge required for Data Science and Machine Learning comes from linear algebra. Matrix operations play a significant role in linear algebra. Today, we discuss 10 such matrix operations with the help of the powerful numpy library. Numpy is generally used to perform numerical calculations in Python. It also has special classes and sub-packages for matrix operations. The use of vectorization allows numpy to perform matrix operations efficiently by avoiding many for loops.
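
A few of those operations can be previewed in a short sketch (the 2×2 matrices are illustrative):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
B = np.array([[5.0, 6.0],
              [7.0, 8.0]])

print(A @ B)             # matrix multiplication (vectorized, no Python loops)
print(A.T)               # transpose
print(np.linalg.inv(A))  # inverse
print(np.linalg.det(A))  # determinant: 1*4 - 2*3 = -2
print(np.trace(A))       # trace: 1 + 4 = 5
```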

I will include the meaning, background description and code examples for each matrix operation discussed in this article. The “Key Takeaways”…

An outlier is an unusual data point that differs significantly from other data points. Outlier detection is a tricky task that should be done carefully. **Elliptic Envelope** and **IQR** are commonly used outlier detection techniques. Elliptic Envelope is a machine learning-based approach, while IQR-based detection is a statistical approach. Each has its own advantages and disadvantages, so we cannot say which one is the best. The best strategy is to combine the two techniques and examine their combined results.
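
A minimal sketch of combining the two, assuming 1-D synthetic data with two injected outliers and an illustrative `contamination` setting:

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope

rng = np.random.default_rng(1)
# 200 normal points around 10, plus two injected outliers at the end.
x = np.concatenate([rng.normal(10, 1, 200), [30.0, -20.0]])

# IQR (statistical): flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
iqr_flags = (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)

# Elliptic Envelope (model-based): robust Gaussian fit; -1 marks outliers.
ee_flags = EllipticEnvelope(contamination=0.02, random_state=1).fit_predict(
    x.reshape(-1, 1)) == -1

# Combine: a point flagged by both methods is a strong outlier candidate.
print("IQR outliers:", np.where(iqr_flags)[0])
print("Elliptic Envelope outliers:", np.where(ee_flags)[0])
print("flagged by both:", np.where(iqr_flags & ee_flags)[0])
```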

In this article, we will discuss the intuition behind Elliptic Envelope and IQR techniques, and combine them together to…

Data Analyst with Python || Turning data into actionable insights