Welcome to WhatsApp Chat Data Analysis 2.0

For the first version

Overview

Introduction

Data Retrieval & Preprocessing

Steps to get data:

Opening this .txt file up, you get messages in a format that looks like this:

Exploratory Data Analysis

Importing Necessary Libraries

We will be using:

  1. Regex (re) to extract and manipulate strings based on specific patterns.
  2. pandas for analysis.
  3. matplotlib and seaborn for visualization.
  4. emoji to deal with emojis.
  5. wordcloud for the most used words.
  6. datetime for datetime manipulation.

The complete data cleaning and preprocessing process is described here.
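As a rough idea of what the parsing step involves, here is a minimal sketch that splits one exported line into a timestamp, a user, and a message. The line format in the comment is an assumption (the export format differs by phone locale and app version), and `LINE_RE` and `parse_line` are hypothetical names, not the ones used in the actual notebook.

```python
import re
from datetime import datetime

# ASSUMED line format (varies by phone locale and app version):
#   "31/12/20, 21:15 - Alice: Happy new year!"
LINE_RE = re.compile(r"^(\d{1,2}/\d{1,2}/\d{2}), (\d{1,2}:\d{2}) - ([^:]+): (.*)$")


def parse_line(line):
    """Return (date_time, user, message), or None if the line does not match."""
    match = LINE_RE.match(line)
    if match is None:
        # For example a continuation of a multi-line message or a system notice.
        return None
    date_part, time_part, user, message = match.groups()
    date_time = datetime.strptime(f"{date_part} {time_part}", "%d/%m/%y %H:%M")
    return date_time, user, message


sample = "31/12/20, 21:15 - Alice: Happy new year!"  # hypothetical message
parsed = parse_line(sample)
```

Lines that return `None` still need a decision (append to the previous message, or drop), which this sketch leaves open.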

Here is what the final dataset looks like:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26492 entries, 0 to 26491
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   date_time  26492 non-null  datetime64[ns]
 1   user       26492 non-null  object
 2   message    26492 non-null  object
 3   day        26492 non-null  object
 4   month      26492 non-null  object
 5   year       26492 non-null  int64
 6   date       26492 non-null  object
dtypes: datetime64[ns](1), int64(1), object(5)
memory usage: 1.4+ MB

The dataset now contains 7 columns and 26492 rows.
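For reference, the helper columns can all be derived from `date_time`. This is only a sketch on two made-up rows; I'm assuming here that `day` and `month` hold English names, which the summary above does not confirm.

```python
import pandas as pd

# Hypothetical parsed rows; the real values come from the exported chat.
df = pd.DataFrame({
    "date_time": pd.to_datetime(["2020-12-31 21:15", "2021-01-01 00:02"]),
    "user": ["Alice", "Bob"],
    "message": ["Happy new year!", "Same to you"],
})

# Derive the remaining columns from the timestamp.
df["day"] = df["date_time"].dt.day_name()      # weekday name (assumed format)
df["month"] = df["date_time"].dt.month_name()  # month name (assumed format)
df["year"] = df["date_time"].dt.year
df["date"] = df["date_time"].dt.date
```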

Now that we have a clean DataFrame to work with, it’s time to analyze it. Let’s start visualizing!

Exploratory Data Analysis

At this point, I think I’m ready to start my analysis so I will plot a simple line graph to see the frequency of messages over the months.
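The counting behind such a line graph could look like the sketch below, run on a few made-up timestamps. Grouping by calendar month via `to_period("M")` is my assumption about the granularity.

```python
import pandas as pd

# Hypothetical timestamps standing in for the real chat.
df = pd.DataFrame({"date_time": pd.to_datetime([
    "2021-01-05", "2021-01-20", "2021-02-03", "2021-04-11",
])})

# Count messages per calendar month.
monthly = df.groupby(df["date_time"].dt.to_period("M")).size()
```

Months without any message (March in this toy data) simply do not appear in `monthly`, so a plot drawn from it connects February straight to April unless the gaps are filled with zeros first.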

The overall frequency of total messages on the group

This is interesting: thanks to WhatsApp's privacy policy update, there was a steep decline for a few months straight, since we moved to Telegram.

Top 10 Most Active Days

Grouping the data set by date and sorting values according to the number of messages per day.
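A minimal sketch of that grouping, on made-up dates (the `message_count` name is hypothetical):

```python
import pandas as pd

# Hypothetical dates, one row per message.
df = pd.DataFrame({"date": [
    "2021-01-05", "2021-01-05", "2021-01-06",
    "2021-01-05", "2021-01-07", "2021-01-07",
]})

# Messages per day, busiest days first, limited to the top 10.
top_days = (
    df.groupby("date").size()
      .sort_values(ascending=False)
      .head(10)
      .rename("message_count")
)
```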

Top 10 active users on the group

Before analyzing the top users, let’s find out how many ghosts there are in the group!

Total number of people who have sent at least one message on the group are 188
Number of people who haven't sent even a single message on the group are 69

So 188 people have sent at least one message, and 69 have not sent any.

But that alone isn’t enough to draw solid conclusions: not sending messages doesn’t necessarily mean they are not reading the chats.

As some wise people said:

A lot of people want to know what’s being talked about but don’t have something to say about what’s being talked about

Some people just don’t have that much knowledge and are learning from others

There is something called passive participation, and that was the very reason I suggested moving back to WhatsApp from Telegram, so that a lot more people can be a part of the family.

Now, looking at the top 10 active users.

Replacing names with initials for better visualization
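One simple way to do this is to take the first letter of each word in a name. The helper below is a hypothetical sketch, not necessarily how the notebook does it, and the names are made up.

```python
def initials(name):
    """Return the uppercase first letter of each word in a name."""
    return "".join(word[0].upper() for word in name.split())


names = ["Alice Kumar", "bob", "Carol D Souza"]  # hypothetical names
short = [initials(n) for n in names]
```

Two people can end up with the same initials, so it is worth checking for collisions before plotting.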

My first plot will be the total number of messages sent per person. For this, a simple seaborn countplot will suffice.
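The numbers a countplot would draw are just per-user message counts. Here is a sketch with made-up rows (the initials are placeholders, not real data):

```python
import pandas as pd

# Hypothetical rows, one per message.
df = pd.DataFrame({"user": ["TK", "DR", "TK", "AB", "TK", "DR"]})

# Messages per user, most active first, limited to the top 10.
top_users = df["user"].value_counts().head(10)
```

Passing `order=top_users.index` to seaborn's `countplot` should keep the bars sorted from most to least active.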

TK is still beating everyone by a mile, with 5000+ messages, followed by DR with around 4000 messages.

But here comes the twist!

Now, I will plot the Average Message Length of the messages sent by the Top 10 most active users. Let’s see the results now!
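A sketch of that calculation, assuming "message length" means the number of characters (it could just as well be words). The data and the `message_length` column name are made up for illustration.

```python
import pandas as pd

# Hypothetical messages.
df = pd.DataFrame({
    "user": ["TK", "TK", "DR", "DR"],
    "message": ["ok", "yes", "this is a much longer message", "short one"],
})

# Length of each message in characters (assumed definition of "length").
df["message_length"] = df["message"].str.len()

# Restrict to the 10 most active users, then average per user.
top10 = df["user"].value_counts().head(10).index
avg_length = (
    df[df["user"].isin(top10)]
    .groupby("user")["message_length"].mean()
    .sort_values(ascending=False)
)
```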

Comparing the top 10 users!

Now, first things first, since almost all the plots will be comparing one person with another, I’ll assign a specific color to each person so that it becomes easy to identify each person among multiple plots.

I’m defining a function to maintain consistent colors for each person across all plots. Since the order will vary depending on the plot, this is passed to the function which will reorder colors in a particular order so that the color of a certain person remains the same no matter the plot. This will help maintain consistency and readability amongst the many graphs I will be plotting.
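The idea can be sketched with a fixed user-to-color mapping that is looked up in whatever order a plot needs. The function names and hex colors here are hypothetical, not the ones from the notebook.

```python
# Hypothetical palette; any list of color strings works.
PALETTE = ["#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#9467bd"]


def make_color_map(users, palette=PALETTE):
    """Assign each user one fixed color (cycling if users outnumber colors)."""
    return {user: palette[i % len(palette)] for i, user in enumerate(users)}


def colors_in_order(color_map, order):
    """Return the colors rearranged to match the user order of one plot."""
    return [color_map[user] for user in order]


color_map = make_color_map(["TK", "DR", "AB"])

# Two plots with different orderings still give each user the same color.
plot_a = colors_in_order(color_map, ["TK", "DR", "AB"])
plot_b = colors_in_order(color_map, ["AB", "TK", "DR"])
```

The list returned by `colors_in_order` can then be handed to a plot as its palette.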

Let’s see the plots, simultaneously for some interesting results!

It’s really interesting to see plots like this side by side, because here comes the twist:

Things aren’t always the way they seem.

Not bragging, just presenting the facts 👀️

The Top 10 users who send the most media

TK and DL are still beating everyone by a huge margin.
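When a chat is exported without media, each attachment shows up as a placeholder message, so media can be counted by filtering on it. The placeholder text below is an assumption (it depends on the app language), and the rows are made up.

```python
import pandas as pd

# ASSUMPTION: placeholder text used by exports made without media.
MEDIA_PLACEHOLDER = "<Media omitted>"

# Hypothetical messages.
df = pd.DataFrame({
    "user": ["TK", "DL", "TK", "AB", "TK"],
    "message": ["<Media omitted>", "<Media omitted>", "hello",
                "nice", "<Media omitted>"],
})

# Keep only media messages, then count them per user.
media = df[df["message"].str.strip() == MEDIA_PLACEHOLDER]
top_media = media["user"].value_counts().head(10)
```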

Top 10 most used Emojis

We will be using the emoji module that was imported earlier.

Since the emojis will not be rendered in the plots, here is what the top10emojis dataset looks like!
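To show the counting idea without depending on a particular version of the emoji module, the sketch below swaps in a rough Unicode-range regex. It is only an approximation: it misses some emojis and splits multi-codepoint ones (flags, skin tones, ZWJ sequences). The messages are made up.

```python
import re
from collections import Counter

# Rough approximation of "is an emoji": two common Unicode blocks.
# This is a stand-in for the emoji module, not its API.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")

messages = ["😂😂 ok", "nice 👍", "😂"]  # hypothetical messages

# Count every matched character across all messages.
emoji_counts = Counter(ch for msg in messages for ch in EMOJI_RE.findall(msg))
top10emojis = emoji_counts.most_common(10)
```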

Which Emoji is the most used in the chat?

Most active days, most active hours, most active months.

Which hour of the day are most messages exchanged?
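This boils down to counting messages by the hour part of the timestamp. A sketch on made-up timestamps:

```python
import pandas as pd

# Hypothetical timestamps.
df = pd.DataFrame({"date_time": pd.to_datetime([
    "2021-01-05 09:10", "2021-01-05 21:30",
    "2021-01-06 21:45", "2021-01-07 21:05",
])})

# Messages per hour of day (0-23), in clock order.
per_hour = df["date_time"].dt.hour.value_counts().sort_index()
busiest_hour = per_hour.idxmax()
```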

Visualization

Now, we will plot the messages grouped by day and grouped by month side by side, to see some interesting results.
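The two groupings could be computed as in this sketch, again on made-up dates. I'm assuming `day` and `month` hold English names; `DAY_ORDER` is a hypothetical helper that keeps the weekdays in calendar order instead of sorted by count.

```python
import pandas as pd

# Hypothetical timestamps: three Mondays and one Tuesday.
df = pd.DataFrame({"date_time": pd.to_datetime([
    "2021-01-04", "2021-01-05", "2021-02-01", "2021-02-08",
])})
df["day"] = df["date_time"].dt.day_name()
df["month"] = df["date_time"].dt.month_name()

DAY_ORDER = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

# Messages per weekday (calendar order, zero for silent days) and per month.
by_day = df["day"].value_counts().reindex(DAY_ORDER, fill_value=0)
by_month = df["month"].value_counts()
```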

Inferences:

To get a clearer understanding, we will plot a heatmap that combines the two bar plots above.
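A heatmap needs a month-by-day table of message counts. One way to build it is `pd.crosstab`, sketched here on made-up dates; the notebook may build the table differently.

```python
import pandas as pd

# Hypothetical timestamps.
df = pd.DataFrame({"date_time": pd.to_datetime([
    "2021-01-04", "2021-01-05", "2021-02-01", "2021-02-08",
])})
df["day"] = df["date_time"].dt.day_name()
df["month"] = df["date_time"].dt.month_name()

# Rows are months, columns are weekdays, cells are message counts.
grid = pd.crosstab(df["month"], df["day"])
```

A table like `grid` can be passed to seaborn's `heatmap` function. Note that `crosstab` sorts labels alphabetically, so the rows and columns would need reindexing to appear in calendar order.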

Heatmap of Month sent and Day sent

Most Used Words in the whole chat.

I will be using the wordcloud module to create a WordCloud of the most used words!
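A word cloud sizes each word by how often it occurs, so the underlying numbers are plain word counts. This sketch computes them with the standard library only; the stopword set and messages are made up, and the tokenizing regex is a simplification that only handles lowercase Latin letters.

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "is", "to", "and"}  # hypothetical, far from complete

messages = ["The build is broken", "Fix the build", "build passed"]

# Lowercase, split into words, and drop stopwords.
words = [
    word
    for msg in messages
    for word in re.findall(r"[a-z']+", msg.lower())
    if word not in STOPWORDS
]
word_counts = Counter(words)
```

A dictionary like `word_counts` can be fed to the `generate_from_frequencies` method of `WordCloud` instead of raw text.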

Conclusion

That’s it from my end! Did this in the middle of my sem exams, but it was worth it!

If you find something interesting throughout the analysis that I missed, feel free to make a PR! 🎉️

It’s really interesting to see the texting habits of people and incidents of daily life reflected in the text. I suggest you take a look at my code and apply it to your own group chats. However, some modifications will be needed in the DataFrame creation part.

If you’re interested, shoot me a message and I’ll help you out.

Author

Tushar Nankani