Pandas DataFrame
Init signature:
pd.DataFrame(
data=None,
index: 'Optional[Axes]' = None,
columns: 'Optional[Axes]' = None,
dtype: 'Optional[Dtype]' = None,
copy: 'bool' = False,
)
Docstring:
Two-dimensional, size-mutable, potentially heterogeneous tabular data.
Data structure also contains labeled axes (rows and columns).
Arithmetic operations align on both row and column labels. Can be
thought of as a dict-like container for Series objects. The primary
pandas data structure.
Parameters
----------
data : ndarray (structured or homogeneous), Iterable, dict, or DataFrame
Dict can contain Series, arrays, constants, dataclass or list-like objects. If
data is a dict, column order follows insertion-order.
.. versionchanged:: 0.25.0
If data is a list of dicts, column order follows insertion-order.
index : Index or array-like
Index to use for resulting frame. Will default to RangeIndex if
no indexing information part of input data and no index provided.
columns : Index or array-like
Column labels to use for resulting frame. Will default to
RangeIndex (0, 1, 2, ..., n) if no column labels are provided.
dtype : dtype, default None
Data type to force. Only a single dtype is allowed. If None, infer.
copy : bool, default False
Copy data from inputs. Only affects DataFrame / 2d ndarray input.
See Also
--------
DataFrame.from_records : Constructor from tuples, also record arrays.
DataFrame.from_dict : From dicts of Series, arrays, or dicts.
read_csv : Read a comma-separated values (csv) file into DataFrame.
read_table : Read general delimited file into DataFrame.
read_clipboard : Read text from clipboard into DataFrame.
Examples
--------
Constructing DataFrame from a dictionary.
>>> d = {'col1': [1, 2], 'col2': [3, 4]}
>>> df = pd.DataFrame(data=d)
>>> df
col1 col2
0 1 3
1 2 4
Notice that the inferred dtype is int64.
>>> df.dtypes
col1 int64
col2 int64
dtype: object
To enforce a single dtype:
>>> df = pd.DataFrame(data=d, dtype=np.int8)
>>> df.dtypes
col1 int8
col2 int8
dtype: object
Constructing DataFrame from numpy ndarray:
>>> df2 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
... columns=['a', 'b', 'c'])
>>> df2
a b c
0 1 2 3
1 4 5 6
2 7 8 9
Constructing DataFrame from dataclass:
>>> from dataclasses import make_dataclass
>>> Point = make_dataclass("Point", [("x", int), ("y", int)])
>>> pd.DataFrame([Point(0, 0), Point(0, 3), Point(2, 3)])
x y
0 0 0
1 0 3
2 2 3
File: d:\python38\lib\site-packages\pandas\core\frame.py
Type: type
Subclasses: SubclassedDataFrame
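Code | Description |
DataFrame() | Create a DataFrame object |
df.values | Return the data as an ndarray |
df.iloc[row_pos, col_pos] | Select elements by integer position |
df.loc[row_label, col_label] | Select elements by label |
df.index | Get the row index |
df.columns | Get the column index |
df.axes | Get both the row and column indexes |
df.T | Swap rows and columns (transpose) |
df.info() | Print summary information about the DataFrame |
df.head(i) | Show the first i rows |
df.tail(i) | Show the last i rows |
df.describe() | Show per-column summary statistics |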
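As a quick tour of the attributes and methods in the table, here is a minimal sketch. The sample frame and its column names `x` and `y` are made up for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical sample frame used only for illustration.
df = pd.DataFrame(
    {"x": [1, 2, 3], "y": [4.0, 5.0, 6.0]},
    index=["a", "b", "c"],
)

arr = df.values                 # 2-D ndarray; int and float columns are upcast to float64
by_pos = df.iloc[0, 1]          # row position 0, column position 1
by_label = df.loc["a", "y"]     # row label "a", column label "y"
row_index, col_index = df.axes  # list of [row Index, column Index]
transposed = df.T               # rows and columns swapped

df.info()                       # prints a summary and returns None
print(df.head(2))               # first 2 rows
print(df.tail(2))               # last 2 rows
print(df.describe())            # count, mean, std, min, quartiles, max per column
```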
1. Creating a DataFrame
In most cases, the data is loaded from a data file such as a CSV or Excel file.
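A minimal sketch of the file-based route, using `read_csv` from the See Also list above. To keep it self-contained, the CSV text is an in-memory string with made-up contents; in practice a file path would be passed instead.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file on disk.
csv_text = "name,score\nalice,90\nbob,85\n"

# read_csv accepts a path or any file-like object.
df_csv = pd.read_csv(io.StringIO(csv_text))
print(df_csv)

# pd.read_excel("some_file.xlsx") follows the same pattern, but it needs an
# Excel engine such as openpyxl installed, so it is not run here.
```

The remaining snippets in this section build frames from in-memory objects instead.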
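# dict of Series: rows are aligned on the union of the Series indexes
d = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']), 'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
# keep rows 'd', 'b', 'a' and columns 'two', 'three'; 'three' is not a key of d, so it is all NaN
pd.DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'three'])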
d = np.zeros((2, ), dtype=[('A', 'i4'), ('B', 'f4'), ('C', 'S10')]) # zero-filled structured array: 2 records with 3 fields each ('S10' is a 10-byte string)
# fill the array with actual values
d[:] = [(1, 2., 'Hello'), (2, 3., "World")]
pd.DataFrame(d, index=['first', 'second'], columns=['C', 'A', 'B'])
d = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
pd.DataFrame(d)
# build from a dict
pd.DataFrame.from_dict(dict([('A', [1, 2, 3]), ('B', [4, 5, 6])]))
# build from a list, tuples, or an ndarray
pd.DataFrame.from_records([(1, 2., b'Hello'), (2, 3., b'World')])
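Both alternative constructors take a few extra arguments that are easy to miss. The sketch below shows `orient` and `columns` on `from_dict`, and `columns` on `from_records`; the column names `x`, `y`, `z` are made up for illustration.

```python
import pandas as pd

# orient="index" treats each dict key as a row label;
# columns= then names the resulting columns.
by_rows = pd.DataFrame.from_dict(
    {"A": [1, 2, 3], "B": [4, 5, 6]},
    orient="index",
    columns=["x", "y", "z"],  # hypothetical column names
)
print(by_rows)

# from_records accepts columns= to label the fields of each tuple.
records = pd.DataFrame.from_records(
    [(1, 2.0, b"Hello"), (2, 3.0, b"World")],
    columns=["A", "B", "C"],
)
print(records)
```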
# a column (here named col) whose values are dicts
pd.json_normalize(df.col)
df.col.apply(pd.Series)
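The two calls above assume a frame that has a column named `col` holding one dict per row. A minimal sketch with made-up data shows what each approach returns. Here the column is converted to a plain list before calling `json_normalize`, which is a conservative choice on my part rather than a requirement stated in the notes.

```python
import pandas as pd

# Hypothetical frame whose "col" column holds one dict per row.
df_nested = pd.DataFrame({
    "id": [1, 2],
    "col": [{"a": 1, "b": 2}, {"a": 3, "b": 4}],
})

# json_normalize turns a list of dicts into one column per key.
flat = pd.json_normalize(df_nested["col"].tolist())

# apply(pd.Series) converts each dict into a row of a new DataFrame.
expanded = df_nested["col"].apply(pd.Series)

# Attach the expanded columns back to the remaining columns.
combined = df_nested.drop(columns="col").join(flat)
print(combined)
```

On large frames `json_normalize` is usually the faster of the two, since `apply(pd.Series)` builds one Series object per row.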
2. Accessing the index and column names
d = {'one': [1., 2., 3., 4.], 'two': [4., 3., 2., 1.]}
df = pd.DataFrame(d, index=['a', 'b', 'c', 'd'])
df.index
# Index(['a', 'b', 'c', 'd'], dtype='object')
df.columns
# Index(['one', 'two'], dtype='object')
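3. Selecting, adding, and modifying columns
df['one'] # select a column; the result is a Series
df['foo'] = 'bar' # add a column filled with a constant value
df['one_trunc'] = df['one'][:2] # new column taken from part of an existing column; unmatched rows become NaN
df['three'] = df['one'] * df['two'] # new column computed as the product of two existing columns
df['flag'] = df['one'] > 2 # new column holding the boolean result of a comparison
df.insert(1, 'bar', df['one']) # insert a column named bar at column position 1, with values taken from df.one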
Signature: df.insert(loc, column, value, allow_duplicates=False) -> 'None'
Docstring:
Insert column into DataFrame at specified location.
Raises a ValueError if `column` is already contained in the DataFrame,
unless `allow_duplicates` is set to True.
Parameters
----------
loc : int
Insertion index. Must verify 0 <= loc <= len(columns).
column : str, number, or hashable object
Label of the inserted column.
value : int, Series, or array-like
allow_duplicates : bool, optional
File: d:\python38\lib\site-packages\pandas\core\frame.py
Type: method
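Two behaviors in this section are worth seeing in action: index alignment when a shorter Series is assigned as a column, and the duplicate-label check described in the `insert` docstring. The following is a minimal sketch that rebuilds the frame from section 2 so it runs on its own; the helper names `returned`, `column_order`, `duplicate_rejected`, and `dup` are mine.

```python
import pandas as pd

df = pd.DataFrame({"one": [1., 2., 3., 4.], "two": [4., 3., 2., 1.]},
                  index=["a", "b", "c", "d"])

# A sliced Series is aligned on the index when assigned, so rows
# 'c' and 'd' (missing from the slice) are filled with NaN.
df["one_trunc"] = df["one"][:2]

# insert() changes df in place and returns None.
returned = df.insert(1, "bar", df["one"])
column_order = list(df.columns)
print(column_order)

# Re-inserting an existing label raises ValueError by default.
try:
    df.insert(0, "bar", 0)
    duplicate_rejected = False
except ValueError:
    duplicate_rejected = True

# With allow_duplicates=True the duplicate label is accepted
# (done on a copy so df keeps unique column names).
dup = df.copy()
dup.insert(0, "bar", 0, allow_duplicates=True)
print(list(dup.columns))
```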
4. Deleting a column
del df['two']
three = df.pop('three')
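The difference between the two statements is what they give back. A minimal sketch on a small made-up frame:

```python
import pandas as pd

df = pd.DataFrame({"one": [1., 2.], "two": [4., 3.], "three": [4., 6.]})

del df["two"]             # removes the column in place; nothing is returned
three = df.pop("three")   # removes the column in place and returns it as a Series

print(list(df.columns))
print(three)
```

`pop` is handy when the removed column is still needed, for example as a target variable.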
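5. Creating new columns with method chaining
Method chaining computes derived results without modifying the data held by the original variable: assign returns a new DataFrame and leaves df unchanged. It is also convenient, because intermediate results do not have to be assigned to variables again and again.
# define a new column named rate from an expression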
df.assign(rate=df.one/df.bar)
# a lambda can be used instead; its argument x is the whole df
df.assign(rate=lambda x:x['one']/x['bar'])
# several columns can be defined at once
df.assign(rate=lambda x:x['one']/x['bar'], rate2=lambda x:x['one']+x['bar'])
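To make the "original data is not changed" point concrete, here is a minimal sketch that chains two `assign` calls and then checks the original frame. The frame is rebuilt with made-up values, and the column name `rate_pct` is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"one": [1., 2., 3., 4.], "bar": [1., 2., 3., 4.]})

result = (
    df.assign(rate=lambda x: x["one"] / x["bar"])
      # the second call sees the column created by the first one
      .assign(rate_pct=lambda x: x["rate"] * 100)
)

print(list(df.columns))      # the original frame has no new columns
print(list(result.columns))  # the new frame carries both derived columns
```

Using a lambda rather than `df.one` matters inside a chain: the lambda receives the intermediate frame, while `df.one` always refers to the original variable.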
6. Selecting data
Operation | Syntax | Result |
Select a column | df[col] | Series |
Select a row by label | df.loc[label] | Series |
Select a row by integer position | df.iloc[loc] | Series |
Slice rows | df[5:10] | DataFrame |
Select rows with a boolean vector | df[bool_vec] | DataFrame |
df['one']
df.loc['a']
df[1:3]
df.iloc[3]
df[df.one > 1]
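The result types in the table can be checked directly. A minimal sketch that rebuilds the frame from section 2 so it runs on its own; the variable names are mine.

```python
import pandas as pd

df = pd.DataFrame({"one": [1., 2., 3., 4.], "two": [4., 3., 2., 1.]},
                  index=["a", "b", "c", "d"])

col = df["one"]               # one column -> Series indexed by row labels
row_by_label = df.loc["a"]    # one row -> Series indexed by column names
row_by_pos = df.iloc[3]       # fourth row (label 'd') -> Series
row_slice = df[1:3]           # positional slice of rows -> DataFrame
filtered = df[df["one"] > 1]  # boolean mask -> DataFrame

for name, obj in [("col", col), ("row_by_label", row_by_label),
                  ("row_by_pos", row_by_pos), ("row_slice", row_slice),
                  ("filtered", filtered)]:
    print(name, type(obj).__name__)
```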
7. Transposing data
The data can be transposed across its diagonal, so that rows become columns and columns become rows, with df.T
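A minimal sketch of the transpose on a small made-up frame, showing that the row labels and column names trade places:

```python
import pandas as pd

df = pd.DataFrame({"one": [1., 2.], "two": [4., 3.]}, index=["a", "b"])

t = df.T
print(t)

# The old column names are now the row index and vice versa;
# transposing twice gives back the original frame.
round_trip = t.T
```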