【Vol.5】 Pandas info and describe for data structure & stats

When performing data analysis in Python, it’s crucial to understand the overall structure of your data. Use the info() and describe() methods from the pandas library to inspect a DataFrame’s schema and summary statistics. In this article, we’ll walk through beginner-friendly examples showing how to use these methods.

Dataset Used

We’ll convert the following dictionary into a pandas DataFrame:

import pandas as pd

data = {
    "Name": ["Taro", "Hanako", "Jiro", "Mika", "Kenichi", "Keiko", "Sho", "Akane", "Takashi", "Aoi"],
    "Age": [23, 29, 35, 42, 18, 33, 27, 24, 31, 30],
    "Occupation": ["Engineer", "Designer", "Teacher", "Doctor", "Student", "Nurse", "Programmer", "Sales", "Lawyer", "Researcher"],
    "Annual Income (¥)": [4500000, 5500000, 4900000, 7300000, 0, 4000000, 6000000, 3200000, 8000000, 5800000],
    "Location": ["Tokyo", "Osaka", "Nagoya", "Sapporo", "Fukuoka", "Tokyo", "Kobe", "Sendai", "Yokohama", "Chiba"],
    "Years Employed": [2, 4, 10, 15, 1, 5, 3, 1, 12, 8]
}

df = pd.DataFrame(data)

DataFrame output:

df
      Name  Age  Occupation  Annual Income (¥)  Location  Years Employed
0     Taro   23    Engineer            4500000     Tokyo               2
1   Hanako   29    Designer            5500000     Osaka               4
2     Jiro   35     Teacher            4900000    Nagoya              10
3     Mika   42      Doctor            7300000   Sapporo              15
4  Kenichi   18     Student                  0   Fukuoka               1
5    Keiko   33       Nurse            4000000     Tokyo               5
6      Sho   27  Programmer            6000000      Kobe               3
7    Akane   24       Sales            3200000    Sendai               1
8  Takashi   31      Lawyer            8000000  Yokohama              12
9      Aoi   30  Researcher            5800000     Chiba               8

Inspecting Structure with info()

df.info()
[Figure: output of df.info()]

From this output, you can see:

  1. Entries: 10 rows indexed 0–9, and 6 columns
  2. Data types:
    • Name (object): strings
    • Age (int64): integers
    • Occupation (object): strings
    • Annual Income (¥) (int64): integers
    • Location (object): strings
    • Years Employed (int64): integers
  3. Non-null counts: All columns have 10 non-null entries (no missing values)
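  4. Memory usage: 612 bytes as reported (a small dataset). For object columns, info() counts only the references by default, so the true footprint is somewhat larger.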

Note: int64 is a 64-bit integer type; object is the dtype pandas uses for strings or mixed types; a non-null count equal to the number of rows means the column has no missing values.
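The points in the list above can also be checked one by one in code. Here is a minimal sketch, assuming the same data dictionary as earlier (rebuilt here so the block runs on its own). The variable names are only for illustration.

```python
import pandas as pd

data = {
    "Name": ["Taro", "Hanako", "Jiro", "Mika", "Kenichi", "Keiko", "Sho", "Akane", "Takashi", "Aoi"],
    "Age": [23, 29, 35, 42, 18, 33, 27, 24, 31, 30],
    "Occupation": ["Engineer", "Designer", "Teacher", "Doctor", "Student", "Nurse", "Programmer", "Sales", "Lawyer", "Researcher"],
    "Annual Income (¥)": [4500000, 5500000, 4900000, 7300000, 0, 4000000, 6000000, 3200000, 8000000, 5800000],
    "Location": ["Tokyo", "Osaka", "Nagoya", "Sapporo", "Fukuoka", "Tokyo", "Kobe", "Sendai", "Yokohama", "Chiba"],
    "Years Employed": [2, 4, 10, 15, 1, 5, 3, 1, 12, 8],
}
df = pd.DataFrame(data)

# 1. Dimensions as a (rows, columns) tuple
n_rows, n_cols = df.shape

# 2. Data type of each column (a Series indexed by column name)
column_types = df.dtypes

# 3. Non-null count per column; equals n_rows when nothing is missing
non_null_counts = df.notna().sum()

# 4. Memory usage in bytes; deep=True also inspects the contents of object columns
shallow_bytes = df.memory_usage().sum()
deep_bytes = df.memory_usage(deep=True).sum()

print(n_rows, n_cols)
print(column_types)
print(non_null_counts)
print(shallow_bytes, deep_bytes)
```

The exact byte counts depend on your pandas version and platform, so treat them as a rough guide rather than a fixed number.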

▶️ For reference, see the official info documentation:
pandas DataFrame info Documentation

Examining Summary Statistics with describe()

df.describe()
         Age  Annual Income (¥)  Years Employed
count     10                 10              10
mean    29.2            4920000             6.1
std     6.76            2251321            4.91
min       18                  0               1
25%    24.75            4125000            2.25
50%     29.5            5200000             4.5
75%     32.5            5950000             9.5
max       42            8000000              15

The key statistics are:

  • count: number of non-missing values
  • mean: average
  • std: standard deviation
  • min / max: minimum and maximum
  • 25% / 50% / 75%: quartiles
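To see where these numbers come from, each statistic can be computed on its own. This is a small sketch using only the Age values from the dataset above; the variable names are just for illustration. Note that pandas computes std as the sample standard deviation (ddof=1) and, by default, interpolates linearly between neighboring values for the quartiles.

```python
import pandas as pd

age = pd.Series([23, 29, 35, 42, 18, 33, 27, 24, 31, 30], name="Age")

count = age.count()   # number of non-missing values
mean = age.mean()     # arithmetic mean: 292 / 10 = 29.2
std = age.std()       # sample standard deviation (ddof=1), about 6.76
minimum, maximum = age.min(), age.max()

# Quartiles; linear interpolation is the default
q25, q50, q75 = age.quantile([0.25, 0.5, 0.75])

# describe() bundles the same values into one Series
summary = age.describe()

print(count, mean, round(std, 2), minimum, maximum)
print(q25, q50, q75)
print(summary)
```

Comparing the individual values with summary is a quick way to convince yourself that describe() is just a shortcut for these calls.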

Including All Columns

By default, describe() shows only numeric columns. Use include='all' to include categorical data:

df.describe(include='all')
        Name    Age  Occupation  Annual Income (¥)  Location  Years Employed
count     10   10.0          10               10.0        10            10.0
unique    10    NaN          10                NaN         9             NaN
top     Taro    NaN    Engineer                NaN     Tokyo             NaN
freq       1    NaN           1                NaN         2             NaN
mean     NaN   29.2         NaN            4920000       NaN             6.1
std      NaN   6.76         NaN            2251321       NaN            4.91
min      NaN     18         NaN                  0       NaN               1
25%      NaN  24.75         NaN            4125000       NaN            2.25
50%      NaN   29.5         NaN            5200000       NaN             4.5
75%      NaN   32.5         NaN            5950000       NaN             9.5
max      NaN     42         NaN            8000000       NaN              15
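If you only want the non-numeric columns, describe() also takes an exclude list. The sketch below compares which columns each call summarizes. It assumes a three-row subset of the dataset to keep things short, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Small subset of the article's dataset, for illustration only
small_df = pd.DataFrame({
    "Name": ["Taro", "Hanako", "Jiro"],
    "Age": [23, 29, 35],
    "Location": ["Tokyo", "Osaka", "Nagoya"],
})

default_cols = list(small_df.describe().columns)                  # numeric columns only
all_cols = list(small_df.describe(include="all").columns)         # every column
text_cols = list(small_df.describe(exclude=[np.number]).columns)  # non-numeric columns only

print(default_cols)
print(all_cols)
print(text_cols)
```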

Categorical Statistics Explained

  • unique: number of distinct values
  • top: most frequent value
  • freq: frequency of the top value
  • NaN: indicates that a statistic (e.g. mean) is not applicable

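Note: For example, “Tokyo” appears twice in Location, so freq=2. When several values tie for the highest count (as in Name and Occupation, where every value appears once), only one of them is shown as top.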
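The categorical statistics can be reproduced by hand as well. The following sketch uses only the Location values from the dataset and derives unique, top, and freq with nunique() and value_counts(). The variable names are illustrative.

```python
import pandas as pd

location = pd.Series(
    ["Tokyo", "Osaka", "Nagoya", "Sapporo", "Fukuoka",
     "Tokyo", "Kobe", "Sendai", "Yokohama", "Chiba"],
    name="Location",
)

# Occurrences of each value, sorted from most to least frequent
counts = location.value_counts()

unique = location.nunique()   # number of distinct values
top = counts.index[0]         # most frequent value
freq = int(counts.iloc[0])    # how many times it appears

print(unique, top, freq)      # 9 Tokyo 2
```

Looking at the full counts Series is often more informative than top and freq alone, because it shows every value and not just the winner.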

Author’s Takeaway

The first time I ran df.describe(), I saw only the numeric summary and assumed the data had no issues. Later I found missing values in the categorical columns, which broke my model training. Now I always start with df.info() and follow with df.describe(include='all').

Lesson Learned: Don’t rely on numeric summaries alone. Check structure and stats together!
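Here is a small sketch of how that kind of problem can slip through. The sample DataFrame is hypothetical (it is not the dataset from this article) and has one missing Occupation value. The default describe() never shows that column, while the count row of describe(include='all') and isna().sum() both reveal the gap.

```python
import pandas as pd

# Hypothetical sample: one Occupation value is missing
sample = pd.DataFrame({
    "Age": [23, 29, 35, 42],
    "Occupation": ["Engineer", None, "Teacher", "Doctor"],
})

numeric_summary = sample.describe()             # Age only; Occupation is not shown
full_summary = sample.describe(include="all")   # the count row covers every column

# Explicit count of missing values per column
missing_per_column = sample.isna().sum()

print(list(numeric_summary.columns))
print(full_summary.loc["count"])
print(missing_per_column)
```

A count lower than the number of rows is the signal to look for: here Occupation has a count of 3 out of 4 rows.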

▶️ For reference, see the official describe documentation:
pandas DataFrame describe Documentation

Summary

  • Use info() to inspect types, non-null counts, and dimensions
  • Use describe() to review numeric statistics
  • Use include='all' to include categorical columns
  • NaN indicates that a statistic is not applicable

Key Statistic Terms

Statistic    Numeric              Categorical
count        non-missing entries  non-missing entries
mean         average              -
std          standard deviation   -
min/max      minimum/maximum      -
25%/50%/75%  quartiles            -
unique       -                    distinct values
top          -                    most frequent value
freq         -                    frequency of top value

Next time, we’ll cover loc for label-based row and column selection!
