When performing data analysis in Python, it’s crucial to understand the overall structure of your data. Use the info() and describe() methods from the pandas library to inspect a DataFrame’s schema and summary statistics. In this article, we’ll walk through beginner-friendly examples showing how to use these methods.
Dataset Used
We’ll convert the following dictionary into a pandas DataFrame:
import pandas as pd

data = {
    "Name": ["Taro", "Hanako", "Jiro", "Mika", "Kenichi", "Keiko", "Sho", "Akane", "Takashi", "Aoi"],
    "Age": [23, 29, 35, 42, 18, 33, 27, 24, 31, 30],
    "Occupation": ["Engineer", "Designer", "Teacher", "Doctor", "Student", "Nurse", "Programmer", "Sales", "Lawyer", "Researcher"],
    "Annual Income (¥)": [4500000, 5500000, 4900000, 7300000, 0, 4000000, 6000000, 3200000, 8000000, 5800000],
    "Location": ["Tokyo", "Osaka", "Nagoya", "Sapporo", "Fukuoka", "Tokyo", "Kobe", "Sendai", "Yokohama", "Chiba"],
    "Years Employed": [2, 4, 10, 15, 1, 5, 3, 1, 12, 8],
}

df = pd.DataFrame(data)
DataFrame output:
df
 | Name | Age | Occupation | Annual Income (¥) | Location | Years Employed |
---|---|---|---|---|---|---|
0 | Taro | 23 | Engineer | 4500000 | Tokyo | 2 |
1 | Hanako | 29 | Designer | 5500000 | Osaka | 4 |
2 | Jiro | 35 | Teacher | 4900000 | Nagoya | 10 |
3 | Mika | 42 | Doctor | 7300000 | Sapporo | 15 |
4 | Kenichi | 18 | Student | 0 | Fukuoka | 1 |
5 | Keiko | 33 | Nurse | 4000000 | Tokyo | 5 |
6 | Sho | 27 | Programmer | 6000000 | Kobe | 3 |
7 | Akane | 24 | Sales | 3200000 | Sendai | 1 |
8 | Takashi | 31 | Lawyer | 8000000 | Yokohama | 12 |
9 | Aoi | 30 | Researcher | 5800000 | Chiba | 8 |
Inspecting Structure with info()
df.info()

From this output, you can see:
- Entries: 10 rows indexed 0–9, and 6 columns
- Data types:
- Name (object): strings
- Age (int64): integers
- Occupation (object): strings
- Annual Income (¥) (int64): integers
- Location (object): strings
- Years Employed (int64): integers
- Non-null counts: All columns have 10 non-null entries (no missing values)
- Memory usage: 612 bytes (small dataset)
Note: int64 is a 64-bit integer; object covers strings or mixed types; non-null means the value is not missing.
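The facts that info() prints can also be read programmatically, which can be convenient in scripts. The following is a minimal sketch, assuming a three-column subset of the dataset above (rebuilt here so the snippet runs on its own); the variable names are illustrative and not part of the original article.

```python
import io

import pandas as pd

# Subset of the article's dataset, rebuilt so the snippet is self-contained
df = pd.DataFrame({
    "Name": ["Taro", "Hanako", "Jiro"],
    "Age": [23, 29, 35],
    "Location": ["Tokyo", "Osaka", "Nagoya"],
})

n_rows, n_cols = df.shape        # dimensions as a (rows, columns) tuple
dtypes = df.dtypes               # one dtype per column
missing = df.isna().sum()        # number of missing values per column

# info() prints to stdout by default; pass buf= to capture the text instead
buffer = io.StringIO()
df.info(buf=buffer)
info_text = buffer.getvalue()

# deep=True also counts the memory held by the string values themselves
shallow_bytes = df.memory_usage().sum()
deep_bytes = df.memory_usage(deep=True).sum()

print(n_rows, n_cols)
print(dtypes)
print(missing)
print(shallow_bytes, deep_bytes)
```

The exact dtype names and byte counts can differ between pandas versions, so treat the printed values as indicative rather than fixed.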
▶️ For reference, see the official info documentation:
pandas DataFrame info Documentation
Examining Summary Statistics with describe()
df.describe()
 | Age | Annual Income (¥) | Years Employed |
---|---|---|---|
count | 10 | 10 | 10 |
mean | 29.2 | 4920000 | 6.1 |
std | 6.76 | 2251321 | 4.91 |
min | 18 | 0 | 1 |
25% | 24.75 | 4125000 | 2.25 |
50% | 29.5 | 5200000 | 4.5 |
75% | 32.5 | 5950000 | 9.5 |
max | 42 | 8000000 | 15 |
The key statistics are:
- count: number of non-missing values
- mean: average
- std: standard deviation
- min / max: minimum and maximum
- 25% / 50% / 75%: quartiles
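To make these statistics less of a black box, the sketch below recomputes a few of them by hand for the Age column. It relies on the fact that describe() reports the sample standard deviation (ddof=1) and, by default, linearly interpolated quartiles; the variable names are illustrative.

```python
import pandas as pd

# Age column from the dataset above
ages = pd.Series([23, 29, 35, 42, 18, 33, 27, 24, 31, 30], name="Age")

summary = ages.describe()

# describe() reports the sample standard deviation (ddof=1),
# which is slightly larger than the population version (ddof=0)
sample_std = ages.std(ddof=1)
population_std = ages.std(ddof=0)

# Quartiles are interpolated linearly between neighbouring sorted values
q1, q2, q3 = ages.quantile([0.25, 0.5, 0.75])

print(summary)
print(round(sample_std, 2), round(population_std, 2))
print(q1, q2, q3)
```

Because the quartiles are interpolated, values such as 24.75 can appear even though no one in the data has that age.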
Including All Columns
By default, describe() shows only numeric columns. Use include='all' to include categorical data:
df.describe(include='all')
 | Name | Age | Occupation | Annual Income (¥) | Location | Years Employed |
---|---|---|---|---|---|---|
count | 10 | 10.0 | 10 | 10.0 | 10 | 10.0 |
unique | 10 | NaN | 10 | NaN | 9 | NaN |
top | Taro | NaN | Engineer | NaN | Tokyo | NaN |
freq | 1 | NaN | 1 | NaN | 2 | NaN |
mean | NaN | 29.2 | NaN | 4920000 | NaN | 6.1 |
std | NaN | 6.76 | NaN | 2251321 | NaN | 4.91 |
min | NaN | 18 | NaN | 0 | NaN | 1 |
25% | NaN | 24.75 | NaN | 4125000 | NaN | 2.25 |
50% | NaN | 29.5 | NaN | 5200000 | NaN | 4.5 |
75% | NaN | 32.5 | NaN | 5950000 | NaN | 9.5 |
max | NaN | 42 | NaN | 8000000 | NaN | 15 |
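If the NaN-filled mixed table is hard to read, one option is to summarize numeric and non-numeric columns separately using the include and exclude parameters of describe(). This is a small sketch on a four-row subset of the dataset, with illustrative variable names.

```python
import numpy as np
import pandas as pd

# Small subset of the article's dataset, rebuilt so the snippet runs alone
df = pd.DataFrame({
    "Name": ["Taro", "Hanako", "Jiro", "Keiko"],
    "Age": [23, 29, 35, 33],
    "Location": ["Tokyo", "Osaka", "Nagoya", "Tokyo"],
})

# Exclude numeric columns: only count / unique / top / freq rows remain
categorical_summary = df.describe(exclude=[np.number])

# Restrict to numeric columns explicitly (same result as the default here)
numeric_summary = df.describe(include=[np.number])

print(categorical_summary)
print(numeric_summary)
```

Splitting the summary this way avoids the NaN padding, at the cost of producing two tables instead of one.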
Categorical Statistics Explained
- unique: number of distinct values
- top: most frequent value
- freq: frequency of the top value
- NaN: indicates that a statistic (e.g. mean) is not applicable
Note: For example, “Tokyo” appears twice in Location, so freq=2. When several values tie for most frequent, describe() shows only one of them.
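The categorical statistics can be cross-checked with value_counts() and nunique(). The sketch below does this for the Location column; mode() is used as one way to list every tied value, and the variable names are illustrative.

```python
import pandas as pd

# Location column from the dataset above
locations = pd.Series(
    ["Tokyo", "Osaka", "Nagoya", "Sapporo", "Fukuoka",
     "Tokyo", "Kobe", "Sendai", "Yokohama", "Chiba"],
    name="Location",
)

counts = locations.value_counts()   # sorted by frequency, highest first

n_unique = locations.nunique()      # corresponds to "unique"
top_value = counts.index[0]         # corresponds to "top"
top_freq = counts.iloc[0]           # corresponds to "freq"

# mode() returns every value tied for most frequent, unlike describe()
all_modes = locations.mode().tolist()

print(n_unique, top_value, top_freq)
print(all_modes)
```

For a column such as Name, where every value occurs once, mode() would list all ten names, while describe() reports just one of them as top.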
Author’s Takeaway
The first time I ran df.describe(), I assumed there were no issues, but I later found missing values in categorical columns that broke my model training. Now I always start with df.info() and follow with df.describe(include='all').
Lesson Learned: Don’t rely on numeric summaries alone—check structure and stats together!
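The pitfall described here can be reproduced on a tiny example. The data below is hypothetical (it is not the article's dataset) and contains one missing value in a categorical column, which the default describe() does not reveal.

```python
import pandas as pd

# Hypothetical data: one missing value in a categorical column
people = pd.DataFrame({
    "Age": [23, 29, 35, 42],
    "Location": ["Tokyo", None, "Nagoya", "Tokyo"],
})

# Default describe(): the Location column is not shown at all
numeric_only = people.describe()

# include='all': the count for Location is lower than the row count
full = people.describe(include="all")

# Explicit missing-value count per column
missing_per_column = people.isna().sum()

print(list(numeric_only.columns))
print(full.loc["count"])
print(missing_per_column)
```

A count lower than the number of rows is the signal to look for: it means the column has missing values that may need handling before model training.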
▶️ For reference, see the official describe documentation:
pandas DataFrame describe Documentation
Summary
- Use info() to inspect types, non-null counts, and dimensions
- Use describe() to review numeric statistics
- Use include='all' to include categorical columns
- NaN indicates that a statistic is not applicable
Key Statistic Terms
Statistic | Numeric | Categorical |
---|---|---|
count | non-missing entries | non-missing entries |
mean | average | – |
std | standard deviation | – |
min/max | minimum/maximum | – |
25%/50%/75% | quartiles | – |
unique | – | distinct values |
top | – | most frequent value |
freq | – | frequency of top value |
Next time, we’ll cover loc for label-based row and column selection!