Data Analysis and Visualization with Python
Python is widely used for data analysis because of its robust libraries and tools for managing data. Among these libraries is Pandas, which makes data exploration, manipulation, and analysis easier. We will use Pandas to analyze a dataset called Country-data.csv from Kaggle. While working with this data, we also introduce some important concepts in Pandas.
1. Installation
The easiest way to install pandas is with pip:
pip install pandas
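To confirm that the installation worked, a quick check is to import the library and print its version. This is a minimal sketch; the version string it prints depends on your environment.

```python
import pandas as pd

# Prints the installed pandas version string
print(pd.__version__)
```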
2. Creating A DataFrame in Pandas
A DataFrame is a table-like data structure in Pandas that stores data in rows and columns. One way to create a DataFrame is to pass multiple Series objects (created with pd.Series()) to the DataFrame constructor, pd.DataFrame(). In this example, two Series objects are used: s1 as the first row and s2 as the second row.
Example 1: Creating DataFrame from Series:
import pandas as pd
# Creating two Series: s1 (numbers) and s2 (names)
s1 = pd.Series([1, 2])
s2 = pd.Series(["Ashish", "Sid"])
# Creating DataFrame by combining Series as rows
dataframe = pd.DataFrame([s1, s2])
# Displaying the DataFrame
print(dataframe)
Example 2: DataFrame from a List with Custom Index and Column Names:
dataframe1 = pd.DataFrame([[1, 2], ["Ashish", "Sid"]], index=["r1", "r2"], columns=["c1", "c2"])
print(dataframe1)
Example 3: DataFrame from a Dictionary:
dataframe2 = pd.DataFrame({
"c1": [1, "Ashish"],
"c2": [2, "Sid"]
})
print(dataframe2)
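Example 1 places each Series in its own row, which mixes numbers and names within each column. A common alternative is to treat each Series as a column instead. The following is a minimal sketch, assuming the same two Series as in Example 1; the column names "id" and "name" are illustrative and not part of the original example.

```python
import pandas as pd

s1 = pd.Series([1, 2])
s2 = pd.Series(["Ashish", "Sid"])

# Passing a dict of Series makes each Series a column,
# so every column holds values of a single kind.
by_column = pd.DataFrame({"id": s1, "name": s2})
print(by_column)
print(by_column.shape)

# For comparison: stacking the Series as rows mixes types within each column
by_row = pd.DataFrame([s1, s2])
print(by_row.dtypes)
```

Keeping one kind of value per column usually makes later arithmetic and statistics on that column simpler.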
3. Importing Data with Pandas
The first step is to read the data. In our case, the data is stored as a CSV (Comma-Separated Values) file, where each row is on its own line and columns are separated by commas. To work with the data in Python, we need to read the CSV file into a Pandas DataFrame.
import pandas as pd
# Read Country-data.csv into a DataFrame
df = pd.read_csv("Country-data.csv")
# Print the first 5 rows (the default for head())
print(df.head())
# Print the number of rows and columns as a (rows, columns) tuple
print(df.shape)
Output:

(167, 10)
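To try read_csv without downloading the dataset, the sketch below parses a small in-memory CSV. The three rows and their values are made up for illustration; only the column names country, child_mort, and exports are borrowed from the examples in this article.

```python
import io
import pandas as pd

# Hypothetical sample data; the values are made up for illustration
csv_text = """country,child_mort,exports
A,10.5,20.0
B,3.2,45.1
C,55.0,12.3
"""

# read_csv accepts a file path or any file-like object
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.head(2))   # first two rows
print(sample.shape)     # (rows, columns)
print(sample.dtypes)    # inferred type of each column
```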
4. Indexing DataFrames with Pandas
Pandas provides powerful indexing capabilities. You can index DataFrames using both position-based and label-based methods.
Position-Based Indexing (Using iloc):
# Select the first 5 rows and every column (the same rows as df.head())
df.iloc[0:5,:]
# Select all rows and all columns
df.iloc[:,:]
# Select rows from position 5 (the 6th row) onward and the first 5 columns
df.iloc[5:,:5]
Label-Based Indexing (Using loc):
Labels can be used for indexing through the pandas.DataFrame.loc indexer, which selects rows and columns by label instead of by position.
Examples:
# Select rows labeled 0 through 5 (loc includes the end label, so 6 rows) and every column
df.loc[0:5,:]
# Select rows from label 5 onward and every column
df.loc[5:,:]
The result doesn't look much different from df.iloc[0:5,:], apart from loc including the end label 5 and so returning one extra row. This is because, while row labels can take on any values, our row labels match the positions exactly. Column labels, however, can make things much easier when working with data.
Example:
# Select the child_mort column for rows labeled 0 through 5
df.loc[:5,"child_mort"]
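The difference between loc and iloc is easier to see when row labels do not match positions. The sketch below uses a small made-up DataFrame (the names scores, math, art, and r1 to r4 are hypothetical) to show that iloc slices exclude the end position while loc slices include the end label.

```python
import pandas as pd

# Hypothetical frame whose row labels are strings, not positions
scores = pd.DataFrame(
    {"math": [90, 75, 60, 85], "art": [70, 95, 80, 65]},
    index=["r1", "r2", "r3", "r4"],
)

# iloc slices by position and excludes the end position
first_two = scores.iloc[0:2, :]
print(first_two)

# loc slices by label and includes the end label
up_to_r3 = scores.loc["r1":"r3", :]
print(up_to_r3)

# loc can combine a row slice with a column label to return a Series
art_column = scores.loc[:"r2", "art"]
print(art_column)
```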
5. DataFrame Math with Pandas
Pandas makes it easy to perform mathematical operations on the data stored in DataFrames. These operations are vectorized, meaning they apply to all elements at once without explicit loops, which also makes them fast.
Example - Column-wise Math:
# Add 5 to every value in the child_mort column
df["child_mort"] = df["child_mort"] + 5
# Multiply every value in the exports column by 10
df["exports"] = df["exports"] * 10
df
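To illustrate what "vectorized" means, the sketch below compares a single column expression with an equivalent explicit loop. The values are made up, and the column name exports_x10 is hypothetical; writing the result to a new column is one way to avoid overwriting the original data.

```python
import pandas as pd

# Hypothetical values for illustration
data = pd.DataFrame({"child_mort": [10.0, 20.0, 30.0], "exports": [1.5, 2.5, 3.5]})

# Vectorized: one expression applies to every row
vectorized = data["child_mort"] + 5

# Equivalent explicit loop, shown only for comparison
looped = pd.Series([value + 5 for value in data["child_mort"]])

# Both approaches produce the same values
print(vectorized.tolist() == looped.tolist())

# Storing the result in a new column keeps the original values intact
data["exports_x10"] = data["exports"] * 10
print(data)
```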
Statistical Functions in Pandas:
Pandas also provides statistical functions for summarizing the data in a DataFrame. Commonly used ones include:
- df.sum(): sum of values
- df.mean(): average
- df.max() / df.min(): max and min values
- df.describe(): quick statistics summary
# computes various summary statistics, excluding NaN values
df.describe()
# Sum of the values in each numeric column (text columns such as country are skipped)
df.sum(numeric_only=True)
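The functions listed above can be tried on a small DataFrame that mixes text and numeric columns. This is a sketch with made-up values; the name stats_df is hypothetical, and numeric_only=True is used so that the text column is left out of the aggregation.

```python
import pandas as pd

# Hypothetical values for illustration
stats_df = pd.DataFrame(
    {
        "country": ["A", "B", "C", "D"],
        "income": [1000, 2000, 3000, 4000],
        "health": [5.0, 7.0, 6.0, 8.0],
    }
)

# numeric_only=True skips the text column instead of trying to aggregate it
totals = stats_df.sum(numeric_only=True)
averages = stats_df.mean(numeric_only=True)

print(totals)
print(averages)

# describe() summarizes numeric columns by default:
# count, mean, std, min, quartiles, and max
summary = stats_df.describe()
print(summary)
```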
6. Data Visualization with Pandas and Matplotlib
Pandas is very easy to use with Matplotlib, a powerful library used for creating basic plots and charts. With only a few lines of code, we can visualize our data and understand it better. Below are some simple examples to help you get started with plotting using Pandas and Matplotlib:
# Import the library first
import matplotlib.pyplot as plt
Histogram
A histogram shows the distribution of values in a column.
df['income'].hist(bins=10)
plt.title('Histogram of Income')
plt.xlabel('Income Value')
plt.ylabel('Frequency')
plt.show()
Box Plot
A box plot is useful to detect outliers and understand data spread.
# Use a 10-row subset so the original df is not overwritten
subset = df.head(10)
# Pass figsize to boxplot itself; with by=, pandas draws on a new figure
subset.boxplot(column='imports', by='country', figsize=(20, 6))
plt.title('Boxplot by Country')
plt.suptitle('') # Removes default title
plt.xlabel('Country')
plt.ylabel('Imports')
plt.xticks(rotation=45) # Optional: Rotate x-axis labels for better visibility
plt.tight_layout() # Adjust layout to avoid clipping
plt.show()
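By default, a box plot marks points that lie more than 1.5 times the interquartile range (IQR) beyond the quartiles as outliers. The same rule can be applied numerically. The sketch below uses made-up values, with one deliberately extreme entry, to show which points would be flagged.

```python
import pandas as pd

# Hypothetical import values; 120.0 is deliberately extreme
imports = pd.Series([30.0, 35.0, 40.0, 42.0, 45.0, 50.0, 120.0], name="imports")

q1 = imports.quantile(0.25)
q3 = imports.quantile(0.75)
iqr = q3 - q1

# Values beyond 1.5 * IQR from the quartiles are flagged as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = imports[(imports < lower_fence) | (imports > upper_fence)]

print(outliers)
```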
Scatter Plot
A scatter plot shows the relationship between two variables.
x = df["health"]
y = df["life_expec"]
plt.scatter(x, y, label="Data Points", color="m", marker="*", s=30)
plt.xlabel('Health')
plt.ylabel('Life Expectancy')
plt.title('Scatter Plot of Health vs Life Expectancy')
plt.legend()
plt.show()
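Pandas also exposes plotting methods directly on Series and DataFrame objects, which can draw onto Matplotlib axes that you create yourself. The sketch below places a histogram and a scatter plot side by side. The values, the name plot_df, and the output file name are made up for illustration, and the Agg backend is selected only so the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is required
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical values for illustration
plot_df = pd.DataFrame(
    {"health": [4.0, 5.5, 6.0, 7.5, 9.0], "life_expec": [60.0, 65.0, 70.0, 74.0, 80.0]}
)

fig, (left, right) = plt.subplots(1, 2, figsize=(10, 4))

# Histogram of one column on the left axes
plot_df["life_expec"].plot.hist(bins=5, ax=left, title="Life Expectancy")

# Scatter plot of two columns on the right axes
plot_df.plot.scatter(x="health", y="life_expec", ax=right, title="Health vs Life Expectancy")

fig.tight_layout()
fig.savefig("pandas_plots.png")  # hypothetical output file name
```

In an interactive session, plt.show() can be used in place of savefig to display the figure.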