Pandas 03 DataFrame

Tags: 03 index df dtype DataFrame pd data Pandas

Pandas DataFrame

Init signature:
pd.DataFrame(
    data=None,
    index: 'Optional[Axes]' = None,
    columns: 'Optional[Axes]' = None,
    dtype: 'Optional[Dtype]' = None,
    copy: 'bool' = False,
)
Docstring:     
Two-dimensional, size-mutable, potentially heterogeneous tabular data.

Data structure also contains labeled axes (rows and columns).
Arithmetic operations align on both row and column labels. Can be
thought of as a dict-like container for Series objects. The primary
pandas data structure.

Parameters
----------
data : ndarray (structured or homogeneous), Iterable, dict, or DataFrame
    Dict can contain Series, arrays, constants, dataclass or list-like objects. If
    data is a dict, column order follows insertion-order.

    .. versionchanged:: 0.25.0
       If data is a list of dicts, column order follows insertion-order.

index : Index or array-like
    Index to use for resulting frame. Will default to RangeIndex if
    no indexing information part of input data and no index provided.
columns : Index or array-like
    Column labels to use for resulting frame. Will default to
    RangeIndex (0, 1, 2, ..., n) if no column labels are provided.
dtype : dtype, default None
    Data type to force. Only a single dtype is allowed. If None, infer.
copy : bool, default False
    Copy data from inputs. Only affects DataFrame / 2d ndarray input.

See Also
--------
DataFrame.from_records : Constructor from tuples, also record arrays.
DataFrame.from_dict : From dicts of Series, arrays, or dicts.
read_csv : Read a comma-separated values (csv) file into DataFrame.
read_table : Read general delimited file into DataFrame.
read_clipboard : Read text from clipboard into DataFrame.

Examples
--------
Constructing DataFrame from a dictionary.

>>> d = {'col1': [1, 2], 'col2': [3, 4]}
>>> df = pd.DataFrame(data=d)
>>> df
   col1  col2
0     1     3
1     2     4

Notice that the inferred dtype is int64.

>>> df.dtypes
col1    int64
col2    int64
dtype: object

To enforce a single dtype:

>>> df = pd.DataFrame(data=d, dtype=np.int8)
>>> df.dtypes
col1    int8
col2    int8
dtype: object

Constructing DataFrame from numpy ndarray:

>>> df2 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
...                    columns=['a', 'b', 'c'])
>>> df2
   a  b  c
0  1  2  3
1  4  5  6
2  7  8  9

Constructing DataFrame from dataclass:

>>> from dataclasses import make_dataclass
>>> Point = make_dataclass("Point", [("x", int), ("y", int)])
>>> pd.DataFrame([Point(0, 0), Point(0, 3), Point(2, 3)])
    x  y
0  0  0
1  0  3
2  2  3
File:           d:\python38\lib\site-packages\pandas\core\frame.py
Type:           type
Subclasses:     SubclassedDataFrame
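
The docstring above says that arithmetic operations align on both row and column labels. The following is a minimal sketch of that behavior; the frames `left` and `right` and their values are invented for illustration and are not part of the original notes.

```python
import pandas as pd

# Two small frames whose row and column labels only partly overlap
# (both frames are made up for this example).
left = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=["x", "y"])
right = pd.DataFrame({"b": [10, 20], "c": [30, 40]}, index=["y", "z"])

# Addition matches cells by label, not by position. Labels that exist
# in only one operand produce NaN in the result.
total = left + right
print(total)
```

Only the cell at row "y", column "b" exists in both frames, so it is the only cell in the result that is not NaN.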

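Code	Purpose
DataFrame()	Create a DataFrame object
df.values	Return the data as an ndarray
df.iloc[row_pos, col_pos]	Select elements by integer position
df.loc[row_label, col_label]	Select elements by label
df.index	Get the row index
df.columns	Get the column index
df.axes	Get both the row and column indexes
df.T	Swap rows and columns (transpose)
df.info()	Print a summary of the DataFrame
df.head(i)	Show the first i rows
df.tail(i)	Show the last i rows
df.describe()	Show per-column summary statistics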
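
The attributes and methods in the table can be tried on a small frame. This is a rough sketch only; the frame and the variable names are assumptions chosen for illustration.

```python
import pandas as pd

# A small frame invented for this example.
df = pd.DataFrame(
    {"one": [1.0, 2.0, 3.0], "two": [4.0, 5.0, 6.0]},
    index=["a", "b", "c"],
)

arr = df.values                 # 2-D NumPy ndarray of the cell values
by_pos = df.iloc[0, 1]          # row position 0, column position 1
by_label = df.loc["a", "two"]   # the same cell, addressed by labels
axes = df.axes                  # list holding the row index and the column index
flipped = df.T                  # rows and columns swapped
first_two = df.head(2)          # first 2 rows
last_one = df.tail(1)           # last row
stats = df.describe()           # count, mean, std, min, quartiles, max per column
df.info()                       # prints a summary and returns None
```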

1. Creating a DataFrame

In most cases the data is loaded from a data file such as CSV or Excel.
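
As a minimal sketch of loading from a file, the example below reads CSV text with `pd.read_csv`. To stay self-contained it uses an in-memory `io.StringIO` buffer in place of a real file; the CSV contents and the name `scores` are made up.

```python
import io

import pandas as pd

# In-memory stand-in for a CSV file on disk; the contents are invented.
csv_text = "name,score\nalice,90\nbob,85\n"

# read_csv accepts a file path or any file-like object.
scores = pd.read_csv(io.StringIO(csv_text))
print(scores)
```

With a real file, the path would be passed instead of the buffer. `pd.read_excel` works along the same lines, but it needs an optional Excel engine such as openpyxl to be installed.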

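import numpy as np
import pandas as pd

# From a dict of Series; the index and columns arguments select and reorder labels
d = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']), 'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
pd.DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'three'])  # 'three' is not in d, so that column is all NaN

# From a structured array: 2 records with 3 fields each, zero-initialized
d = np.zeros((2, ), dtype=[('A', 'i4'), ('B', 'f4'), ('C', 'S10')])  # 'S10' is the portable spelling of the older 'a10'
# Fill in the actual values
d[:] = [(1, 2., 'Hello'), (2, 3., "World")]
pd.DataFrame(d, index=['first', 'second'], columns=['C', 'A', 'B'])

# From a list of dicts; keys missing from a dict become NaN
d = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
pd.DataFrame(d)

# From a dict
pd.DataFrame.from_dict(dict([('A', [1, 2, 3]), ('B', [4, 5, 6])]))

# From a list of tuples or a record ndarray
pd.DataFrame.from_records([(1, 2., b'Hello'), (2, 3., b'World')])

# When a column holds dicts (here a column named col in an existing df), expand it into columns
pd.json_normalize(df.col)
df.col.apply(pd.Series)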
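
The last two lines refer to a `df` with a column `col` that is not defined in this section. The sketch below builds such a frame so both approaches can be run; the name `df_nested` and its contents are hypothetical, and the column is converted with `.tolist()` because `json_normalize` is documented as taking a dict or a list of dicts.

```python
import pandas as pd

# Hypothetical frame with a column named "col" holding one dict per row.
df_nested = pd.DataFrame({"col": [{"a": 1, "b": 2}, {"a": 3, "b": 4}]})

# Option 1: json_normalize flattens a list of dicts into a frame
# with a fresh RangeIndex.
flat = pd.json_normalize(df_nested["col"].tolist())

# Option 2: apply(pd.Series) turns each dict into a row and keeps
# the original index.
expanded = df_nested["col"].apply(pd.Series)

print(flat)
print(expanded)
```

On large frames `apply(pd.Series)` tends to be the slower of the two, since it builds one Series per row.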

2. Accessing the index and column names

d = {'one': [1., 2., 3., 4.], 'two': [4., 3., 2., 1.]}
df = pd.DataFrame(d, index=['a', 'b', 'c', 'd'])
df.index
# Index(['a', 'b', 'c', 'd'], dtype='object')

df.columns
# Index(['one', 'two'], dtype='object')

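3. Selecting, adding, and modifying columns

df['one'] # select a column; the result is a Series
df['foo'] = 'bar' # add a column with a constant value
df['one_trunc'] = df['one'][:2] # new column from part of an existing column; the remaining rows become NaN
df['three'] = df['one'] * df['two'] # new column as the product of two existing columns
df['flag'] = df['one'] > 2 # new column holding the boolean result of a comparison
df.insert(1, 'bar', df['one']) # insert a column named bar at column position 1, with values from df.one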
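
Assigning a shorter Series to a new column works because pandas aligns on the index rather than on position. A minimal sketch of the `one_trunc` case, using a reduced frame and the explicit `.iloc[:2]` spelling of the positional slice:

```python
import pandas as pd

# Reduced version of the frame used in these notes.
df = pd.DataFrame({"one": [1.0, 2.0, 3.0, 4.0]}, index=["a", "b", "c", "d"])

# Positional slice of a Series: keeps only the rows labeled "a" and "b".
df["one_trunc"] = df["one"].iloc[:2]

# The assignment aligns on the index, so rows "c" and "d" receive NaN.
print(df)
```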

Signature: df.insert(loc, column, value, allow_duplicates=False) -> 'None'
Docstring:
Insert column into DataFrame at specified location.

Raises a ValueError if `column` is already contained in the DataFrame,
unless `allow_duplicates` is set to True.

Parameters
----------
loc : int
    Insertion index. Must verify 0 <= loc <= len(columns).
column : str, number, or hashable object
    Label of the inserted column.
value : int, Series, or array-like
allow_duplicates : bool, optional
File:      d:\python38\lib\site-packages\pandas\core\frame.py
Type:      method
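
The docstring states that `insert` raises a ValueError for an existing column label unless `allow_duplicates` is True. A small sketch of both cases follows; the frame and the flag `duplicate_rejected` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"one": [1.0, 2.0], "two": [3.0, 4.0]})

# insert modifies the frame in place and returns None.
df.insert(1, "bar", df["one"])

# Inserting the same label again raises ValueError by default.
try:
    df.insert(0, "bar", df["two"])
    duplicate_rejected = False
except ValueError:
    duplicate_rejected = True

# With allow_duplicates=True a second "bar" column is accepted.
df.insert(0, "bar", df["two"], allow_duplicates=True)
print(df.columns.tolist())
```

Duplicate column labels make `df['bar']` return a DataFrame rather than a Series, so this option is probably best kept for cases where it is really needed.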

4. Deleting a column

del df['two']
three = df.pop('three')
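
The two statements above both change `df` in place, and `pop` additionally returns the removed column. The sketch below shows that, and for comparison adds `DataFrame.drop`, which the original notes do not mention and which returns a new frame instead. The frame is invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({"one": [1, 2], "two": [3, 4], "three": [5, 6]})

del df["two"]              # removes the column in place
three = df.pop("three")    # removes the column in place and returns it as a Series

# drop returns a new frame and leaves df itself untouched.
without_one = df.drop(columns=["one"])

print(df.columns.tolist())
print(three.tolist())
```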

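5. Creating new columns with method chaining

Method chaining computes results on the fly: assign returns a new DataFrame and leaves the data in the original variable unchanged. It is also convenient to use, because intermediate results do not have to be assigned to variables over and over.

# define a new column named rate and give its formula
df.assign(rate=df.one/df.bar)
# a lambda also works; x is the whole df (this line assumes column 'two' has not been deleted)
df.assign(rate=lambda x:x['one']/x['two'])
# several columns can be defined at once
df.assign(rate=lambda x:x['one']/x['bar'], rate2=lambda x:x['one']+x['bar'])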
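
A short sketch of why lambdas matter in a chain: inside a chain, each lambda receives the frame produced by the previous step, so it can use columns that do not exist on the original `df`. The frame, the name `result`, and the column `rate_pct` are assumptions made for illustration.

```python
import pandas as pd

df = pd.DataFrame({"one": [1.0, 2.0, 4.0], "bar": [2.0, 4.0, 8.0]})

result = (
    df
    .assign(rate=lambda x: x["one"] / x["bar"])
    # x is now the frame returned by the previous assign, so "rate" exists
    .assign(rate_pct=lambda x: x["rate"] * 100)
)

print(df.columns.tolist())      # the original frame is unchanged
print(result.columns.tolist())
```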

6. Selecting data

Operation	Syntax	Result
Select a column	df[col]	Series
Select a row by label	df.loc[label]	Series
Select a row by integer position	df.iloc[loc]	Series
Slice rows	df[5:10]	DataFrame
Filter rows with a boolean vector	df[bool_vec]	DataFrame

df['one']
df.loc['a']
df[1:3]
df.iloc[3]
df[df.one > 1]
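
The selections above can be checked against the result types in the table. This sketch rebuilds the frame from section 2 so that it runs on its own; the variable names are chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {"one": [1.0, 2.0, 3.0, 4.0], "two": [4.0, 3.0, 2.0, 1.0]},
    index=["a", "b", "c", "d"],
)

col = df["one"]                # Series: one column
row_by_label = df.loc["a"]     # Series: one row, indexed by the column names
row_by_pos = df.iloc[3]        # Series: the row at position 3
rows = df[1:3]                 # DataFrame: rows at positions 1 and 2
filtered = df[df["one"] > 1]   # DataFrame: rows where the mask is True

for name, obj in [("col", col), ("row_by_label", row_by_label),
                  ("row_by_pos", row_by_pos), ("rows", rows),
                  ("filtered", filtered)]:
    print(name, type(obj).__name__)
```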

7. Transposing data

Data can be transposed along the diagonal, so that rows become columns and columns become rows, with df.T
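
A minimal sketch of the transpose, using a small frame invented for illustration:

```python
import pandas as pd

df = pd.DataFrame(
    {"one": [1.0, 2.0], "two": [4.0, 3.0]},
    index=["a", "b"],
)

t = df.T  # equivalent to df.transpose()

print(t)
```

The old column labels become the row index and the old row labels become the columns. If the columns have mixed dtypes, the transposed frame generally ends up with object dtype, which is worth keeping in mind before doing arithmetic on it.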

From: https://blog.51cto.com/u_1439909/6321599