Naive Bayes in its easiest form

Naïve Bayes is a probabilistic supervised machine learning algorithm based on Bayes’ theorem. It is used in various classification tasks, but mainly in text classification, which involves high-dimensional training data. Since it is a probabilistic classifier, it predicts on the basis of the probability that an object belongs to each class.

Bayes’ theorem, also known as Bayes’ rule or Bayes’ law, is used to determine the probability of a hypothesis given prior knowledge. It depends on conditional probability.

Conditional probability is a measure of the probability of an event occurring given that another event has (by assumption, presumption, assertion, or evidence) occurred.
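As a small illustration of this definition, the sketch below computes P(A|B) = P(A and B) / P(B) for a fair six-sided die. The events A ("the roll is even") and B ("the roll is greater than 3") are assumptions chosen only for this example.

```python
from fractions import Fraction

# Sample space of a fair six-sided die (illustrative assumption).
outcomes = set(range(1, 7))

A = {n for n in outcomes if n % 2 == 0}  # event A: the roll is even
B = {n for n in outcomes if n > 3}       # event B: the roll is greater than 3

# P(B) and P(A and B) as exact fractions.
p_b = Fraction(len(B), len(outcomes))
p_a_and_b = Fraction(len(A & B), len(outcomes))

# Conditional probability: P(A | B) = P(A and B) / P(B).
p_a_given_b = p_a_and_b / p_b
print(p_a_given_b)
```

Knowing that the roll is greater than 3 changes the probability of an even roll from 1/2 to 2/3, because two of the three remaining outcomes (4 and 6) are even.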

The formula for Bayes’ theorem is:

$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$$

P(A|B) is Posterior Probability: Probability of hypothesis A given the observed event B.

P(B|A) is Likelihood: Probability of the evidence B given that hypothesis A is true.

P(A) is Prior Probability: Probability of hypothesis before observing the evidence.

P(B) is Marginal Probability: Probability of Evidence.

In simpler terms, Bayes’ Theorem is a way of finding a probability when we know certain other probabilities.
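A minimal sketch of the theorem in Python follows. The function name `posterior` and the spam-filter numbers are hypothetical and chosen only to show how the four quantities fit together.

```python
def posterior(likelihood, prior, marginal):
    """Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)."""
    if marginal <= 0:
        raise ValueError("P(B) must be positive")
    return likelihood * prior / marginal


# Hypothetical numbers: 20% of emails are spam; the word "offer" appears
# in 50% of spam emails and in 5% of non-spam emails.
p_spam = 0.2
p_offer_given_spam = 0.5
p_offer_given_not_spam = 0.05

# Marginal P(offer), obtained with the law of total probability.
p_offer = p_offer_given_spam * p_spam + p_offer_given_not_spam * (1 - p_spam)

# Posterior P(spam | offer).
p_spam_given_offer = posterior(p_offer_given_spam, p_spam, p_offer)
print(round(p_spam_given_offer, 3))
```

With these assumed numbers, seeing the word raises the probability of spam from the prior of 0.2 to roughly 0.71.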

To start with, let us consider a fictional data set.

Consider the car theft problem with the attributes Color, Type, and Origin, and the target Stolen, which can be either Yes or No.

The fundamental Naive Bayes assumption is that each feature makes an:

1. independent

2. equal

contribution to the outcome. This means:

1. No pair of features is dependent. For example, the Color being ‘Red’ has nothing to do with the Type or the Origin of the car. Hence, the features are assumed to be independent of one another given the outcome.

2. Each feature is given the same influence. For example, knowing only the Type and the Origin can’t predict the outcome perfectly. So none of the attributes is irrelevant, and all are assumed to contribute equally to the outcome.

Note: these assumptions are generally not correct in real-world situations. The independence assumption in particular rarely holds exactly, yet the classifier often works well in practice. That is why the method is called Naïve.

Now, given the features of a car, our task is to classify whether the car is stolen or not.

The columns represent these features and the rows represent individual entries. In the first row of the data set, the car is stolen, its Color is Red, its Type is Sports, and its Origin is Domestic. We want to classify whether a Red Domestic SUV gets stolen or not. Note that there is no example of a Red Domestic SUV in our data set.

According to this example, Bayes’ theorem can be rewritten as:

$$P(y \mid X) = \frac{P(X \mid y)\, P(y)}{P(X)}$$

The variable y is the class variable and X is a feature vector of size n, where

$$X = (x_1, x_2, \dots, x_n)$$

Here x1, x2, …, xn represent the features.

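By substituting for X, expanding using the chain rule, and applying the independence assumption, we get:

$$P(y \mid x_1, \dots, x_n) = \frac{P(x_1 \mid y)\, P(x_2 \mid y) \cdots P(x_n \mid y)\, P(y)}{P(x_1, \dots, x_n)}$$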

Since the denominator remains constant for a given input, it can be written as:

$$P(y \mid x_1, \dots, x_n) \propto P(y) \prod_{i=1}^{n} P(x_i \mid y)$$

In this case our class variable has only two outcomes. With more than two possible outcomes, we have to find the class with the maximum probability:

$$\hat{y} = \arg\max_{y} P(y) \prod_{i=1}^{n} P(x_i \mid y)$$
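The decision rule can be sketched for any number of classes as follows. The function `predict`, the three class names, and all probabilities are hypothetical. The sketch sums logarithms instead of multiplying probabilities, a common choice to avoid numerical underflow, and it assumes every probability it looks up is greater than zero.

```python
import math


def predict(priors, likelihoods, features):
    """Return the class maximizing log P(y) + sum_i log P(x_i | y)."""
    scores = {}
    for label, prior in priors.items():
        score = math.log(prior)
        for name, value in features.items():
            # Assumes P(value | label) > 0; log(0) would raise an error.
            score += math.log(likelihoods[label][name][value])
        scores[label] = score
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical three-class problem with two categorical features.
priors = {"low": 0.5, "medium": 0.3, "high": 0.2}
likelihoods = {
    "low": {"Color": {"Red": 0.2, "Yellow": 0.8},
            "Type": {"Sports": 0.4, "SUV": 0.6}},
    "medium": {"Color": {"Red": 0.5, "Yellow": 0.5},
               "Type": {"Sports": 0.5, "SUV": 0.5}},
    "high": {"Color": {"Red": 0.9, "Yellow": 0.1},
             "Type": {"Sports": 0.8, "SUV": 0.2}},
}

label, scores = predict(priors, likelihoods, {"Color": "Red", "Type": "Sports"})
print(label)
```

Because the logarithm is monotonic, the class with the largest log score is also the class with the largest product of probabilities.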

The posterior probability P(y|X) can be calculated in three steps: first create a frequency table for each feature against the target, then turn each frequency table into a likelihood table by calculating the probabilities, and finally use the Naive Bayes equation to calculate the posterior probability for each class. The class with the highest posterior probability is the outcome of our prediction.
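These three steps can be sketched in plain Python. The ten rows below are an assumed stand-in for the car theft data set, not rows taken from this article. They were chosen because they agree with the first row described above and give the scores 0.024 and 0.072 for a Red Domestic SUV. All variable and function names are illustrative.

```python
from collections import Counter, defaultdict

# Assumed stand-in rows (Color, Type, Origin, Stolen); not from the article.
ROWS = [
    ("Red", "Sports", "Domestic", "Yes"),
    ("Red", "Sports", "Domestic", "No"),
    ("Red", "Sports", "Domestic", "Yes"),
    ("Yellow", "Sports", "Domestic", "No"),
    ("Yellow", "Sports", "Imported", "Yes"),
    ("Yellow", "SUV", "Imported", "No"),
    ("Yellow", "SUV", "Imported", "Yes"),
    ("Yellow", "SUV", "Domestic", "No"),
    ("Red", "SUV", "Imported", "No"),
    ("Red", "Sports", "Imported", "Yes"),
]
FEATURES = ("Color", "Type", "Origin")

# Step 1: frequency tables, freq[feature][(value, label)] = count.
class_counts = Counter(row[-1] for row in ROWS)
freq = defaultdict(Counter)
for *values, label in ROWS:
    for feature, value in zip(FEATURES, values):
        freq[feature][(value, label)] += 1

# Step 2: likelihood tables, P(value | label) = count / class count.
likelihood = {
    feature: {key: count / class_counts[key[1]] for key, count in table.items()}
    for feature, table in freq.items()
}


# Step 3: unnormalized posterior score P(y) * prod_i P(x_i | y).
def score(query, label):
    result = class_counts[label] / len(ROWS)  # prior P(y)
    for feature, value in query.items():
        # Unseen (value, label) pairs get probability 0 (no smoothing here).
        result *= likelihood[feature].get((value, label), 0.0)
    return result


query = {"Color": "Red", "Type": "SUV", "Origin": "Domestic"}
scores = {label: score(query, label) for label in class_counts}
prediction = max(scores, key=scores.get)

print(dict(freq["Color"]))
print(scores, prediction)
```

With these assumed rows the scores come out at about 0.024 for Yes and 0.072 for No, so the sketch predicts No.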

Frequency table for color:

Now, in our example we have three predictors in X: Color = Red, Type = SUV, and Origin = Domestic.

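From the equation discussed above, we can calculate the posterior probability of Yes (up to the constant denominator):

$$P(\text{Yes} \mid X) \propto P(\text{Red} \mid \text{Yes})\, P(\text{SUV} \mid \text{Yes})\, P(\text{Domestic} \mid \text{Yes})\, P(\text{Yes}) = 0.024$$

And P(No|X):

$$P(\text{No} \mid X) \propto P(\text{Red} \mid \text{No})\, P(\text{SUV} \mid \text{No})\, P(\text{Domestic} \mid \text{No})\, P(\text{No}) = 0.072$$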

Since 0.072 > 0.024, our example is classified as ‘No’: the car is not stolen.
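The two numbers compared here are scores with the constant denominator left out, so they do not sum to 1. As a small sketch, they can be turned into posterior probabilities by dividing each by their sum, which under the model plays the role of P(X). The variable names are illustrative.

```python
# Unnormalized scores from the worked example above.
scores = {"Yes": 0.024, "No": 0.072}

# The omitted denominator P(X) equals the sum of the scores over all classes.
evidence = sum(scores.values())
posteriors = {label: value / evidence for label, value in scores.items()}

prediction = max(posteriors, key=posteriors.get)
print(posteriors, prediction)
```

Normalizing gives roughly 0.25 for Yes and 0.75 for No. The ranking of the classes is unchanged, which is why the denominator can be ignored when only the predicted class is needed.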

Advantages:

  • It is one of the fastest and easiest ML algorithms for predicting the class of a data set.
  • It performs well in multi-class prediction.
  • When the assumption of independence holds, a Naive Bayes classifier can perform better than other models such as logistic regression, and it needs less training data.
  • It performs well with categorical input variables compared to numerical variables.

Disadvantages:

  • If a categorical variable has a category in the test data set that was not observed in the training data set, the model will assign it a probability of 0 (zero) and will be unable to make a prediction. This is often known as “Zero Frequency”. To solve this, we can use a smoothing technique. One of the simplest smoothing techniques is called Laplace estimation.
  • Another disadvantage of Naive Bayes is the assumption of independent predictors. In real life it is very hard to find independent predictors.
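The Zero Frequency fix can be sketched as additive smoothing, where `alpha = 1` corresponds to Laplace estimation. The function name `smoothed_likelihood` and the counts are hypothetical.

```python
def smoothed_likelihood(count, class_total, n_categories, alpha=1.0):
    """P(value | class) with additive smoothing; alpha=1 is Laplace estimation."""
    return (count + alpha) / (class_total + alpha * n_categories)


# Hypothetical counts among 5 stolen cars, with 3 possible colors.
# "Green" was never observed, so its unsmoothed likelihood would be 0.
color_counts = {"Red": 3, "Yellow": 2, "Green": 0}
class_total = sum(color_counts.values())

smoothed = {
    color: smoothed_likelihood(count, class_total, len(color_counts))
    for color, count in color_counts.items()
}
print(smoothed)
```

Every category now has a non-zero likelihood, and the smoothed values still sum to 1 across the categories of the feature.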

There are three common types of Naive Bayes model, which are given below:

Gaussian: The Gaussian model assumes that features follow a normal distribution. This means that if predictors take continuous values instead of discrete ones, the model assumes that these values are sampled from a Gaussian distribution within each class.
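A minimal sketch of the Gaussian likelihood for one continuous feature is shown below. The feature (engine power in kW), its values, and the helper names `fit` and `gaussian_pdf` are assumptions made only for illustration.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and std at x."""
    exponent = -((x - mean) ** 2) / (2 * std ** 2)
    return math.exp(exponent) / (std * math.sqrt(2 * math.pi))


def fit(values):
    """Estimate the mean and population standard deviation of one class."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(variance)


# Hypothetical continuous feature (engine power in kW) for each class.
power = {
    "Yes": [150.0, 170.0, 160.0, 180.0],
    "No": [90.0, 110.0, 100.0, 120.0],
}
params = {label: fit(values) for label, values in power.items()}

# Likelihood P(x | class) of a new observation under each class.
x = 140.0
likelihoods = {label: gaussian_pdf(x, *params[label]) for label in params}
print(likelihoods)
```

The density replaces the likelihood-table lookup used for categorical features. It is then multiplied by the prior and by the other feature likelihoods as before.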

Multinomial: The Multinomial Naïve Bayes classifier is used when the data is multinomially distributed. It is primarily used for document classification problems, that is, deciding which category (such as Sports, Politics, or Education) a particular document belongs to. The classifier uses the frequency of words as the predictors.
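The word-frequency idea can be sketched with a tiny hand-made corpus. The documents, categories, and the function `log_score` are hypothetical. The sketch applies Laplace smoothing to the word counts and ignores query words that never appeared in training, which is one possible design choice among several.

```python
import math
from collections import Counter

# Hypothetical tiny training corpus: (document text, category).
DOCS = [
    ("goal match team win", "Sports"),
    ("team score goal", "Sports"),
    ("vote election party", "Politics"),
    ("party win election vote", "Politics"),
]

word_counts = {}        # category -> Counter of word frequencies
doc_counts = Counter()  # category -> number of training documents
for text, category in DOCS:
    word_counts.setdefault(category, Counter()).update(text.split())
    doc_counts[category] += 1

vocabulary = {word for counts in word_counts.values() for word in counts}


def log_score(text, category, alpha=1.0):
    """log P(category) + sum over words of log P(word | category)."""
    counts = word_counts[category]
    total = sum(counts.values())
    score = math.log(doc_counts[category] / len(DOCS))
    for word in text.split():
        if word not in vocabulary:
            continue  # skip words never seen in training
        smoothed = (counts[word] + alpha) / (total + alpha * len(vocabulary))
        score += math.log(smoothed)
    return score


query = "team win goal"
prediction = max(word_counts, key=lambda c: log_score(query, c))
print(prediction)
```

Each occurrence of a word contributes one factor to the score, so words that are repeated in a document count more than once. That is the difference from a model based only on word presence.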

Bernoulli: The Bernoulli classifier works similarly to the Multinomial classifier, but the predictor variables are independent Boolean variables, such as whether a particular word is present in a document or not. This model is also popular for document classification tasks.
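A minimal sketch of the Bernoulli variant follows. The probabilities in `P_PRESENT`, the priors, and the function `log_score` are hypothetical. Unlike in the word-frequency model, a word that is absent from the document also contributes to the score, through the factor 1 - p.

```python
import math

# Hypothetical P(word is present | category) for a three-word vocabulary.
P_PRESENT = {
    "Sports": {"goal": 0.8, "team": 0.7, "vote": 0.1},
    "Politics": {"goal": 0.1, "team": 0.3, "vote": 0.9},
}
PRIORS = {"Sports": 0.5, "Politics": 0.5}


def log_score(words_present, category):
    """Bernoulli model: present words add log(p), absent words add log(1 - p)."""
    score = math.log(PRIORS[category])
    for word, p in P_PRESENT[category].items():
        score += math.log(p if word in words_present else 1.0 - p)
    return score


# A document is reduced to the set of vocabulary words it contains.
document = {"team", "vote"}
scores = {category: log_score(document, category) for category in PRIORS}
prediction = max(scores, key=scores.get)
print(prediction)
```

Here the absence of "goal" counts against Sports, so the document is assigned to Politics even though "team" is more typical of Sports.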

If there is something wrong or you have a suggestion for me, please reach out to me on GitHub, Twitter, or LinkedIn.

Thank you.


