【vol.3】Pandas DataFrame: Basics, Structure & Creation Guide


When you begin data analysis, the Pandas DataFrame is an indispensable tool. It is the foundation for handling tabular data efficiently: like an Excel sheet or a database table, it organizes data into rows and columns and lets you perform a wide variety of operations on them. How well you know the DataFrame directly affects the speed and quality of your subsequent analyses, so understanding its basic structure and how to create one is essential.

You should also understand the Pandas Series, which is closely related to the Pandas DataFrame. A Series is a data structure that represents a single “column” of a DataFrame. By learning Series, you can gain a deeper grasp of how a DataFrame works.

This article provides a thorough, beginner-friendly explanation of the basic structure of the Pandas DataFrame and how to create one. It also outlines the differences between the DataFrame and the Pandas Series to deepen your understanding, and introduces the fundamental creation patterns using the pd.DataFrame() constructor with easy-to-follow code samples.

▶️ Be sure to consult the official pandas documentation for DataFrame and Series as well.

【What You Will Learn in This Article】

  • The basic structure and role of the Pandas DataFrame
  • How to create a DataFrame from various data formats (how to use pd.DataFrame)
  • The fundamental differences between Pandas Series and DataFrame

【Related Articles】 We have separate detailed articles on how to inspect data after creating a Pandas DataFrame and how to retrieve or select specific data. Once you understand the basics of the DataFrame in this article, be sure to read those posts to learn practical operations.

【Personal Experience】When I first started, I didn’t understand the difference between a Series and a DataFrame and passed a 2-D list to Series, which caused an error. I realized that being aware that a Series is 1-D and a DataFrame is 2-D is the key to avoiding this initial stumbling block.

Installing Pandas

If Pandas is not installed, install it with the following command.


pip install pandas

Importing Pandas

To use Pandas in your Python code, first import it. It is commonly imported with the alias pd.


import pandas as pd

Pandas DataFrame Basics: Structure and Creation

What Is a Pandas DataFrame?—Your Powerful Partner in Data Analysis

The Pandas DataFrame is the most fundamental data structure for working with tabular data in Python and is the central pillar of data analysis. Like a spreadsheet or database table, it organizes data and lets you perform a wide variety of operations efficiently.

In other words, the DataFrame plays a role in every step of data analysis. By mastering the DataFrame, you can automate data processing that would take a long time by hand and move on to more advanced analysis. In most data-analysis projects, the DataFrame is the star of the show.

The Basic Structure of a Pandas DataFrame

A DataFrame consists of the following elements.

  • Data: The values in the table. They can hold various data types (numbers, strings, etc.).
  • Column names (Columns): The labels for each column. Each column can be thought of as a one-dimensional data structure called a “Series.”
  • Index: The labels for each row. By default it is a sequence starting at 0, but you can set any values you like.

The following illustration shows the structure of a Pandas DataFrame.

Figure: Structure of a Pandas DataFrame, illustrating the relationships among the Index, column names (Columns), and data.
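
As a minimal sketch of how the three elements above can be inspected in code, the snippet below builds a small table and reads its column names, index, and data separately. The variable name df_parts and the row labels row_a and row_b are illustrative assumptions, not part of the original examples.

```python
import pandas as pd

# Hypothetical two-row table used only to illustrate the three elements
df_parts = pd.DataFrame(
    {"Name": ["Alice", "Bob"], "Age": [24, 27]},
    index=["row_a", "row_b"],
)

column_labels = list(df_parts.columns)        # column names (Columns)
row_labels = list(df_parts.index)             # row labels (Index)
table_values = df_parts.to_numpy().tolist()   # data as nested Python lists

print(column_labels)
print(row_labels)
print(table_values)
print(df_parts.shape)  # (number of rows, number of columns)
```

Reading the three parts separately like this is a quick way to check that a DataFrame has the labels you expect before analyzing it.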

Creating a Pandas DataFrame

You can create a Pandas DataFrame from various data formats using the pd.DataFrame() constructor. Below are the main creation patterns commonly used in data analysis.

Create from a List of Lists

A common approach is to use a “list of lists” in Python (a list that contains other lists). In this structure, each element of the outer list becomes a row, and the elements of the inner lists become the values of each column. It is typical to specify the column names with the columns argument.


data_list_of_list = [[1, 'Alice', 24],
                     [2, 'Bob', 27],
                     [3, 'Charlie', 22],
                     [4, 'David', 32],
                     [5, 'Eve', 29]]

# Create DataFrame by specifying column names with the columns argument
df_from_list_of_list = pd.DataFrame(data_list_of_list, columns=['ID', 'Name', 'Age'])

print("DataFrame created from list of lists:\n", df_from_list_of_list)

DataFrame created from list of lists:
    ID     Name  Age
0   1    Alice   24
1   2      Bob   27
2   3  Charlie   22
3   4    David   32
4   5      Eve   29
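
To show why the columns argument is usually specified, here is a small sketch of what happens when it is omitted. The names rows, df_default, and df_named are illustrative assumptions.

```python
import pandas as pd

rows = [[1, "Alice", 24], [2, "Bob", 27]]

# Without columns=, pandas falls back to integer column labels 0, 1, 2
df_default = pd.DataFrame(rows)
print(df_default.columns.tolist())

# Labels can still be assigned afterwards by setting the columns attribute
df_named = df_default.copy()
df_named.columns = ["ID", "Name", "Age"]
print(df_named)
```

Naming the columns at creation time is usually clearer, but assigning them later can help when the labels are only known after the data has been loaded.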

Create from a Dictionary

You can also pass a Python dictionary to pd.DataFrame(). In this case, the dictionary keys become the column names of the DataFrame and the values (lists, NumPy arrays, etc.) become the column data. In many cases, this is the most intuitive way to create a DataFrame.


data_dict = {'ID': [1, 2, 3, 4, 5],
             'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
             'Age': [24, 27, 22, 32, 29]}

# Create DataFrame from dictionary and keep it in df for use in later examples
df = pd.DataFrame(data_dict)

print("\nDataFrame created from dictionary:\n", df)

DataFrame created from dictionary:
    ID     Name  Age
0   1    Alice   24
1   2      Bob   27
2   3  Charlie   22
3   4    David   32
4   5      Eve   29
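
The following sketch, which assumes a hypothetical dictionary named people, illustrates two details of the dictionary approach: the values may be NumPy arrays as well as lists, and the column order follows the key order unless you pass columns to choose a different one.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one value is a NumPy array, the other a plain list
people = {
    "ID": np.array([1, 2, 3]),
    "Name": ["Alice", "Bob", "Charlie"],
}

df_key_order = pd.DataFrame(people)                          # columns follow key order
df_reordered = pd.DataFrame(people, columns=["Name", "ID"])  # explicit column order

print(df_key_order.columns.tolist())
print(df_reordered.columns.tolist())
```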

Create from a NumPy Array

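If your data already exists as a NumPy array, you can pass it directly to pd.DataFrame() to generate a DataFrame. As with the list-of-lists approach, you can specify the columns and index arguments. Keep in mind that a NumPy array stores a single data type, so an array that mixes numbers and strings, like the one in this example, holds every value as a string; the ID and Age columns of the resulting DataFrame therefore contain strings, not integers.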


# Import NumPy
# To use NumPy in your Python code, import it first. It is commonly imported with the alias np.
import numpy as np

# NumPy is a library for fast numerical computation,
# but beginners don’t need to worry about its detailed usage at this point.
# For now, it is enough to know that it is sometimes used when creating a Pandas DataFrame.
# It helps you efficiently handle multidimensional arrays of the same data type.

data_np = np.array([[1, 'Alice', 24],
                    [2, 'Bob', 27],
                    [3, 'Charlie', 22]])
df_from_np = pd.DataFrame(data_np, columns=['ID', 'Name', 'Age'])
print("\nDataFrame created from NumPy array:\n", df_from_np)

DataFrame created from NumPy array:
   ID     Name Age
0  1    Alice  24
1  2      Bob  27
2  3  Charlie  22
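
Because a NumPy array holds a single data type, an array that mixes numbers and strings stores everything as strings. The sketch below checks this and shows one possible way to restore integer columns with astype(). The names mixed, df_mixed, and df_fixed are illustrative assumptions.

```python
import numpy as np
import pandas as pd

mixed = np.array([[1, "Alice", 24],
                  [2, "Bob", 27]])
print(mixed.dtype.kind)  # "U" means NumPy stored every element as a Unicode string

df_mixed = pd.DataFrame(mixed, columns=["ID", "Name", "Age"])
print(repr(df_mixed.loc[0, "ID"]))  # a string, not an integer

# One way to recover numeric columns: convert them explicitly
df_fixed = df_mixed.astype({"ID": "int64", "Age": "int64"})
print(df_fixed.dtypes)
```

If the data is mixed from the start, building the DataFrame from a list of lists or a dictionary avoids the conversion step, because each column then keeps its own type.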

Creating and Modifying with a Specified Index

The Pandas DataFrame allows you to freely set or change the row labels, called the index.

The index is the set of labels that identify each row in a DataFrame. By default, an integer sequence starting at 0 is assigned automatically, but you can set meaningful values, such as dates or IDs, depending on your data. Unique labels are the most useful, although pandas does not require them to be unique. The index plays an important role when efficiently selecting specific rows or when joining multiple DataFrames.

You can set a particular column as the index by using the set_index() method.


# Build a separate DataFrame with the same data as the earlier dictionary example,
# so that the original df is not overwritten
data_dict_for_index = {'ID': [1, 2, 3, 4, 5],
                       'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
                       'Age': [24, 27, 22, 32, 29]}
df_index_example = pd.DataFrame(data_dict_for_index)

print("DataFrame before setting the index:\n", df_index_example)

# Set the 'ID' column as the index (set_index returns a new DataFrame)
df_with_index = df_index_example.set_index('ID')
print("\nDataFrame after setting the 'ID' column as the index:\n", df_with_index)

# Update the variable df for later examples if needed
# df = df_with_index  # Uncomment if this df will be used later

DataFrame before setting the index:
    ID     Name  Age
0   1    Alice   24
1   2      Bob   27
2   3  Charlie   22
3   4    David   32
4   5      Eve   29

DataFrame after setting the 'ID' column as the index:
        Name  Age
ID              
1     Alice   24
2       Bob   27
3   Charlie   22
4     David   32
5       Eve   29
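
Besides set_index(), the index can also be given at creation time through the index argument. The sketch below shows that pattern together with a label-based lookup and reset_index(), which turns the index back into an ordinary column. The ID values 101 to 103 and the variable names are illustrative assumptions.

```python
import pandas as pd

# Hypothetical IDs passed as row labels when the DataFrame is created
df_labeled = pd.DataFrame(
    {"Name": ["Alice", "Bob", "Charlie"], "Age": [24, 27, 22]},
    index=[101, 102, 103],
)
df_labeled.index.name = "ID"

bob_age = df_labeled.loc[102, "Age"]    # look up a row by its index label
df_restored = df_labeled.reset_index()  # move the index back into a column

print(bob_age)
print(df_restored)
```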

Other Creation Methods (Reading from Files, etc.)

Besides the approaches above, you can create a Pandas DataFrame in the following ways.

  • Read external files such as CSV or Excel (pd.read_csv(), pd.read_excel(), etc.)
  • Create from a dictionary of Pandas Series

For details on reading CSV files, see the article 【Part 2】Google Colab and Drive: Easy CSV Loading and Saving | Beginner Drive Integration Guide.
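
As a rough, self-contained sketch of these two approaches, the snippet below reads CSV text from an in-memory buffer (standing in for a real file, which is an assumption made so the example runs anywhere) and then builds a DataFrame from a dictionary of Series. All variable names and values are illustrative.

```python
import io

import pandas as pd

# In-memory CSV text used in place of a file on disk
csv_text = "ID,Name,Age\n1,Alice,24\n2,Bob,27\n"
df_from_csv = pd.read_csv(io.StringIO(csv_text))
print(df_from_csv)

# Dictionary of Series: keys become column names, the shared index labels the rows
names = pd.Series(["Alice", "Bob"], index=[1, 2])
ages = pd.Series([24, 27], index=[1, 2])
df_from_series = pd.DataFrame({"Name": names, "Age": ages})
print(df_from_series)
```

With a real file you would pass its path to pd.read_csv() instead of the StringIO object.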

【Cautions When Creating a DataFrame】

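When creating a DataFrame, pay attention to the shape of your data. With a list of lists, inner lists of different lengths do not necessarily raise an error: pandas typically pads the shorter rows with missing values (NaN), which can hide a data problem, and it raises a ValueError when the number of names passed to columns does not match the number of columns in the data. With a dictionary, lists of different lengths raise a ValueError.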
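
Which shape mismatches actually raise an error depends on the input. The sketch below reflects the behavior I would expect from recent pandas versions and is worth verifying against the version you have installed; the variable names are illustrative assumptions.

```python
import pandas as pd

ragged = [[1, "Alice", 24], [2, "Bob"]]  # second row is one value short

# Expected: no exception, the missing cell is filled with a missing value
df_ragged = pd.DataFrame(ragged)
cell_is_missing = bool(df_ragged.isna().iloc[1, 2])
print(df_ragged)

# Two column names for data that is three columns wide
try:
    pd.DataFrame(ragged, columns=["ID", "Name"])
    columns_error = None
except ValueError as exc:
    columns_error = str(exc)
print(columns_error)

# Dictionary whose lists have different lengths
try:
    pd.DataFrame({"ID": [1, 2, 3], "Name": ["Alice", "Bob"]})
    length_error = None
except ValueError as exc:
    length_error = str(exc)
print(length_error)
```

Checking df.shape and df.isna().sum() right after creation is a simple way to catch silently padded rows.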

Fully Understanding the Differences Between Pandas Series and DataFrame

What Is a Pandas Series?—Its Role and Relation to DataFrame

A Pandas Series is a one-dimensional, labeled data structure. It resembles a list or a NumPy array, but differs in that it carries an index.

It is easier to grasp its role if you think of a Pandas Series as a single column that makes up a Pandas DataFrame.

【When Should You Use Series?】 A Series is useful, for example, when you want to pull out just one column from a DataFrame for analysis. Of course, you can also use it on its own when you need to handle one-dimensional data.

Extracting a Series from a DataFrame

Selecting a specific column from the DataFrame created in the previous section yields a Pandas Series. Let’s confirm this relationship with code.


# Use the df created in the previous section (for example, the df created from the dictionary)
# df = pd.DataFrame({'ID': [1, 2, 3, 4, 5], 'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'], 'Age': [24, 27, 22, 32, 29]}) # Redefine df here if necessary

print("Original DataFrame:\n", df)

# Extract the 'Name' column
names_series = df['Name']
print("\n'Name' column (retrieved as Series):\n", names_series)
print("Type of retrieved data:", type(names_series))

Original DataFrame:
    ID     Name  Age
0   1    Alice   24
1   2      Bob   27
2   3  Charlie   22
3   4    David   32
4   5      Eve   29

'Name' column (retrieved as Series):
 0      Alice
1        Bob
2    Charlie
3      David
4        Eve
Name: Name, dtype: object
Type of retrieved data: <class 'pandas.core.series.Series'>
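
A related detail is that the kind of brackets decides what you get back. The sketch below uses a hypothetical df_demo to compare a single column label, which yields a Series, with a list of labels, which yields a DataFrame even when the list holds only one name.

```python
import pandas as pd

df_demo = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [24, 27]})

one_col_series = df_demo["Name"]    # single label  -> Series (1-D)
one_col_frame = df_demo[["Name"]]   # list of labels -> DataFrame (2-D)

print(type(one_col_series).__name__, one_col_series.shape)
print(type(one_col_frame).__name__, one_col_frame.shape)
```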

Basic Ways to Create a Series

You can also create a Pandas Series from various data formats using the pd.Series() constructor. To understand that a DataFrame column is in fact a Series, let’s briefly look at the basic creation patterns. We will cover the details in another article.


# Example: create from a list
data_s_list = [10, 20, 30, 40]
series_from_list = pd.Series(data_s_list)
print("\nSeries created from a list:\n", series_from_list)

# Example: create from a dictionary (convenient when you want to specify the index)
data_s_dict = {'a': 10, 'b': 20, 'c': 30, 'd': 40}
series_from_dict = pd.Series(data_s_dict)
print("\nSeries created from a dictionary:\n", series_from_dict)

Series created from a list:
0    10
1    20
2    30
3    40
dtype: int64

Series created from a dictionary:
a    10
b    20
c    30
d    40
dtype: int64

Key Differences Between Series and DataFrame (Structure, Dimensionality, Usability)

When learning Pandas, it is crucial to clearly understand the differences between Series and DataFrame. The main distinctions are as follows.

  1. Dimensionality:

    • Series: One-dimensional data structure
    • DataFrame: Two-dimensional data structure (rows and columns)
  2. Structure:

    • Series: Data are arranged in a single column, and each item has an index (label).
    • DataFrame: A table format with rows and columns; each column has a name and each row has an index.
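
    The following figure visually demonstrates this structural difference.

    Figure: Pandas Series vs DataFrame: Structural Differences

  3. Usability:

    • Series: Suited to working with a single column or other one-dimensional data.
    • DataFrame: Suited to working with tabular data that has multiple columns.

Understanding these differences makes it clear when you should work with a Series versus a DataFrame and lets you write code more efficiently.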
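
The difference in dimensionality can be checked directly with the ndim and shape attributes. The sketch below also passes a 2-D NumPy array to pd.Series(), which I would expect to be rejected because a Series is one-dimensional; the variable names are illustrative assumptions.

```python
import numpy as np
import pandas as pd

grid = np.array([[1, 2], [3, 4]])

frame = pd.DataFrame(grid, columns=["x", "y"])  # 2-D: rows and columns
column = frame["x"]                             # 1-D: a single column

print(frame.ndim, frame.shape)
print(column.ndim, column.shape)

# A Series is one-dimensional, so a 2-D array is not accepted
try:
    pd.Series(grid)
    series_error = None
except ValueError as exc:
    series_error = str(exc)
print(series_error)
```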

✅ Summary

This article explained the basic structure of the Pandas DataFrame, the cornerstone of Pandas-based data analysis, and the fundamental ways to create one from lists, dictionaries, and NumPy arrays using the pd.DataFrame() constructor.

We also deepened our understanding of the Pandas Series—an element that makes up a DataFrame—and the fundamental differences between Series and DataFrame in terms of dimensionality, structure, and usability.

By solidifying these Pandas fundamentals, you are now ready to progress through the various stages of data analysis.

To truly master the Pandas DataFrame, knowing how to create it is only the first step—the subsequent data manipulation is equally important. Be sure to check the upcoming articles to further level up your data-analysis skills with the Pandas DataFrame.

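Loading CSV Files with pd.read_csv | Google Colab × Drive Complete Guide 【Part 2】
A beginner-friendly explanation of how to connect Google Colab with Google Drive and load and save CSV files. With a Google account you can start for free, and no tedious environment setup is required. The steps and reading and writing with pandas are also covered in concrete...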
