Data Analysis and Visualization with Python

Last Updated : 15 Apr, 2025

Python is widely used for data analysis because of its robust libraries and tools for managing data. One of these libraries is Pandas, which makes data exploration, manipulation, and analysis easier. In this article, we will use Pandas to analyse a dataset called Country-data.csv from Kaggle. While working with this data, we also introduce some important concepts in Pandas.

1. Installation

The easiest way to install Pandas is to use pip:

Python

pip install pandas

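Alternatively, Pandas can be installed with another package manager such as conda; the official Pandas installation guide lists the options.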
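To confirm that the installation worked, a quick check is to import the library and print its version. This is a minimal sketch; the variable name installed_version is only illustrative.

```python
import pandas as pd

# pd.__version__ holds the installed pandas version as a string
installed_version = pd.__version__
print("pandas version:", installed_version)
```

If the import fails with ModuleNotFoundError, pip most likely installed Pandas into a different Python environment than the one running the script.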

2. Creating A DataFrame in Pandas

A DataFrame is a table-like data structure in Pandas that stores data in rows and columns. One way to create a DataFrame is to build several Series objects with pd.Series() and pass them as a list to the DataFrame constructor, pd.DataFrame(). In this example, two Series objects are used: s1 becomes the first row and s2 becomes the second row.

Example 1: Creating DataFrame from Series:

Python

import pandas as pd

# Creating two Series: s1 (numbers) and s2 (names)
s1 = pd.Series([1, 2])
s2 = pd.Series(["Ashish", "Sid"])

# Creating DataFrame by combining Series as rows
dataframe = pd.DataFrame([s1, s2])

# Displaying the DataFrame
print(dataframe)

Output:

        0    1
0       1    2
1  Ashish  Sid

Example 2: DataFrame from a List with Custom Index and Column Names:

Python

dataframe1 = pd.DataFrame([[1, 2], ["Ashish", "Sid"]], index=["r1", "r2"], columns=["c1", "c2"])
print(dataframe1)

Output:

        c1   c2
r1       1    2
r2  Ashish  Sid

Example 3: DataFrame from a Dictionary:

Python

dataframe2 = pd.DataFrame({
    "c1": [1, "Ashish"],
    "c2": [2, "Sid"]
})
print(dataframe2)

Output:

       c1   c2
0       1    2
1  Ashish  Sid
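The examples in this section pass Series objects as rows. For contrast, the sketch below builds one frame from a list of Series (each Series becomes a row) and another from a dictionary of Series (each Series becomes a column). The names by_rows, by_cols and same_values are illustrative and not part of the examples above.

```python
import pandas as pd

s1 = pd.Series([1, 2])
s2 = pd.Series(["Ashish", "Sid"])

# A list of Series: each Series becomes one row
by_rows = pd.DataFrame([s1, s2])

# A dictionary of Series: each Series becomes one column
by_cols = pd.DataFrame({"c1": s1, "c2": s2})

print(by_rows)
print(by_cols)

# Transposing the row-wise frame gives the same values as the column-wise one
same_values = by_rows.T.values.tolist() == by_cols.values.tolist()
print(same_values)
```

Keeping one kind of data per column, as by_cols does, lets Pandas store the numbers in c1 with a numeric dtype instead of a generic object dtype.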

3. Importing Data with Pandas

The first step is to read the data. In our case, the data is stored as a CSV (Comma-Separated Values) file, where each row is separated by a new line and each column by a comma. To work with the data in Python, we need to read the CSV file into a Pandas DataFrame.

Python

import pandas as pd

# Read Country-data.csv into a DataFrame
df = pd.read_csv("Country-data.csv")

# Shows the first 5 rows of the DataFrame (the default for head())
df.head()

# Shows the number of rows and columns as a (rows, columns) tuple
df.shape

Output:

(167, 10)
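For readers without the Kaggle file at hand, the same steps can be tried on an in-memory CSV. The sketch below assumes a small made-up table (the countries A, B, C and all values are invented for illustration) and passes it to pd.read_csv through io.StringIO, which behaves like an open file.

```python
import io
import pandas as pd

# Made-up stand-in for a CSV file; the values are illustrative only
csv_text = """country,child_mort,income
A,10.5,1000
B,20.0,2500
C,5.2,40000
"""

# read_csv accepts a file path or any file-like object
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.head(2))   # first two rows
print(sample.shape)     # (rows, columns)
print(sample.dtypes)    # inferred type of each column
```

Checking dtypes right after loading is a cheap way to spot numeric columns that were read as text because of stray characters in the file.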

4. Indexing DataFrames with Pandas

Pandas provides powerful indexing capabilities. You can index DataFrames using both position-based and label-based methods.

Position-Based Indexing (Using iloc):

Python

# Selects the first 5 rows and every column, equivalent to df.head()
df.iloc[0:5,:]

# Selects every row and every column
df.iloc[:,:]

# Selects from the 6th row (position 5) onwards and the first 5 columns
df.iloc[5:,:5]

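Position slices in iloc follow ordinary Python slicing: the start position is included and the stop position is excluded. A small sketch on a made-up 4x3 frame (the name demo and its values are assumptions for illustration):

```python
import pandas as pd

# Made-up 4x3 frame for illustration
demo = pd.DataFrame({
    "a": [10, 20, 30, 40],
    "b": [1.5, 2.5, 3.5, 4.5],
    "c": ["w", "x", "y", "z"],
})

first_two = demo.iloc[0:2, :]    # positions 0 and 1; stop position 2 is excluded
tail_block = demo.iloc[2:, :2]   # from position 2 to the end, first two columns
single = demo.iloc[1, 0]         # one cell: second row, first column

print(first_two.shape)
print(tail_block.shape)
print(single)
```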

Label-Based Indexing (Using loc):

Label-based indexing uses the pandas.DataFrame.loc accessor, which selects rows and columns by label instead of by position.

Examples:

Python

# Selects the rows labelled 0 through 5 (6 rows, since the end label is included) and every column
df.loc[0:5,:]

# Selects from the row labelled 5 onwards and every column
df.loc[5:,:]


The above looks much like df.iloc[0:5,:] because our row labels match the positions exactly, although row labels can in general take any values. The one difference is that loc includes the end label, so df.loc[0:5,:] returns six rows while df.iloc[0:5,:] returns five. Column labels can make things much easier when working with data.
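The difference between labels and positions is easier to see when the row labels are not integers. The sketch below assumes a made-up frame with string labels r1 to r4; the values are invented for illustration.

```python
import pandas as pd

# Made-up frame with string row labels, so labels and positions differ
demo = pd.DataFrame(
    {"child_mort": [90.2, 16.6, 27.3, 119.0]},
    index=["r1", "r2", "r3", "r4"],
)

by_label = demo.loc["r1":"r3", :]    # label slice: the end label "r3" is included
by_position = demo.iloc[0:2, :]      # position slice: stop position 2 is excluded
one_value = demo.loc["r2", "child_mort"]  # a single cell by row and column label

print(len(by_label), len(by_position))
print(one_value)
```

Here "r3" sits at position 2, yet the label slice keeps it while the position slice with stop 2 drops it.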

Example:

Python

# Selects the child_mort values for the rows labelled 0 through 5
df.loc[:5,"child_mort"]


5. DataFrame Math with Pandas

Pandas makes it easy to perform mathematical operations on the data stored in DataFrames. Arithmetic on columns is vectorized: an operation applies to every element at once, without an explicit Python loop, and is typically much faster than one.

Example - Column-wise Math:

Python

# Adding 5 to every element in the child_mort column
df["child_mort"] = df["child_mort"] + 5

# Multiplying every value in the exports column by 10
df["exports"] = df["exports"] * 10
df

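To make "vectorized" concrete, the sketch below compares a vectorized addition with the equivalent explicit loop on a made-up frame. It also shows DataFrame.assign as one way to compute a scaled column without overwriting the original data; the names demo, vectorized, looped and scaled are illustrative.

```python
import pandas as pd

# Made-up values for illustration
demo = pd.DataFrame({
    "child_mort": [10.0, 20.0, 30.0],
    "exports": [1.0, 2.0, 3.0],
})

# Vectorized: one expression applies to every element of the column
vectorized = demo["child_mort"] + 5

# Equivalent explicit loop, shown only for comparison
looped = [value + 5 for value in demo["child_mort"]]

print(vectorized.tolist() == looped)

# assign() returns a new DataFrame and leaves demo unchanged
scaled = demo.assign(exports=demo["exports"] * 10)
print(scaled["exports"].tolist())
print(demo["exports"].tolist())
```

Working on a copy in this way keeps the loaded data intact, which matters when later steps expect the original values.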

Statistical Functions in Pandas:

Pandas also provides statistical functions that summarise the columns of a DataFrame. Commonly used ones include:

df.sum() → sum of values
df.mean() → average
df.max() / df.min() → max and min values
df.describe() → quick statistics summary

Python

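# Computes summary statistics (count, mean, std, min, quartiles, max) for numeric columns, excluding NaN values
df.describe()

# Sums the values in each numeric column; text columns such as country are skipped
df.sum(numeric_only=True)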

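The functions listed above can be tried on a small frame where the results are easy to check by hand. The sketch below assumes made-up data and uses numeric_only=True so that the text column is left out of the sum and the mean.

```python
import pandas as pd

# Made-up data: one text column and two numeric columns
demo = pd.DataFrame({
    "country": ["A", "B", "C"],
    "income": [1000, 2500, 4000],
    "health": [4.0, 6.0, 8.0],
})

totals = demo.sum(numeric_only=True)    # column-wise sums
means = demo.mean(numeric_only=True)    # column-wise averages
summary = demo.describe()               # numeric columns only by default

print(totals)
print(means)
print(summary.loc[["min", "max"]])      # pick rows of the summary by label
```

The result of describe() is itself a DataFrame, so individual statistics can be read with loc, for example summary.loc["max", "income"].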

6. Data Visualization with Pandas and Matplotlib

Pandas is very easy to use with Matplotlib, a powerful library used for creating basic plots and charts. With only a few lines of code, we can visualize our data and understand it better. Below are some simple examples to help you get started with plotting using Pandas and Matplotlib:

Python

# Import the library first
import matplotlib.pyplot as plt

Histogram

A histogram shows the distribution of values in a column.

Python

df['income'].hist(bins=10)
plt.title('Histogram of Income')
plt.xlabel('Income Value')
plt.ylabel('Frequency')
plt.show()

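A histogram is a bar chart of bin counts: the value range is split into equal-width bins and each bar shows how many values fall into one bin. The sketch below computes those counts directly with numpy.histogram on made-up values, using 3 bins instead of 10 so the result can be checked by hand.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration
income = pd.Series([1, 2, 2, 3, 3, 3, 8, 9, 10, 10])

# counts[i] is the number of values between edges[i] and edges[i + 1]
counts, edges = np.histogram(income, bins=3)

print(edges)
print(counts)
```

With these values the middle bin is empty, which would show up in the plot as a gap between two groups of bars.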

Box Plot

A box plot is useful to detect outliers and understand data spread.

Python

# Use a separate variable for the first 10 rows so that df keeps the full dataset
top10 = df.head(10)

# boxplot() creates its own figure, so the size is passed through figsize
top10.boxplot(column='imports', by='country', figsize=(20, 6))
plt.title('Boxplot by Country')
plt.suptitle('')  # Removes default title
plt.xlabel('Country')
plt.ylabel('Imports')
plt.xticks(rotation=45)  # Optional: Rotate x-axis labels for better visibility
plt.tight_layout()       # Adjust layout to avoid clipping
plt.show()

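The outliers in a box plot come from a simple rule. With Matplotlib's default settings, points more than 1.5 times the interquartile range (IQR) beyond the quartiles are drawn as individual markers. The sketch below applies that rule by hand to made-up values; the names q1, q3, lower and upper are illustrative.

```python
import pandas as pd

# Made-up values with one obviously large entry
imports = pd.Series([10, 12, 14, 16, 18, 20, 100])

q1 = imports.quantile(0.25)   # first quartile
q3 = imports.quantile(0.75)   # third quartile
iqr = q3 - q1                 # interquartile range

# Limits of the 1.5 * IQR rule
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = imports[(imports < lower) | (imports > upper)]
print(q1, q3, lower, upper)
print(outliers.tolist())
```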

Scatter Plot

A scatter plot shows the relationship between two variables.

Python

x = df["health"]
y = df["life_expec"]

plt.scatter(x, y, label="Data Points", color="m", marker="*", s=30)
plt.xlabel('Health')
plt.ylabel('Life Expectancy')
plt.title('Scatter Plot of Health vs Life Expectancy')
plt.legend()
plt.show()
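A scatter plot shows a relationship visually; a correlation coefficient puts a number on it. As a hedged companion to the plot, the sketch below computes the Pearson correlation with Series.corr on made-up values. The data is constructed to rise together, so the strong correlation here says nothing about the real dataset.

```python
import pandas as pd

# Made-up values for illustration only
demo = pd.DataFrame({
    "health": [2.0, 4.0, 6.0, 8.0, 10.0],
    "life_expec": [55.0, 62.0, 66.0, 74.0, 79.0],
})

# Pearson correlation: +1 is a perfect increasing line, 0 is no linear relationship
r = demo["health"].corr(demo["life_expec"])
print(round(r, 3))
```

A value near 0 does not rule out a non-linear relationship, so the number is best read alongside the plot.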