map, apply, applymap, and agg
This page shows how to apply functions to the contents of a Series or DataFrame.
In [6]:
import numpy as np
import pandas as pd
.map()
In [2]:
names = ['John', 'Tom', 'Matt']
In [3]:
s = pd.Series(names)
In [4]:
# replace only substitutes the values it finds and leaves the others unchanged.
# map turns every value that is not found in the mapping into NaN.
s.map({'John':'Josh', 'Tom':'Tim'})
Out[4]:
0    Josh
1     Tim
2     NaN
dtype: object
In [5]:
s.map(lambda x: f"He is {x}")
Out[5]:
0    He is John
1     He is Tom
2    He is Matt
dtype: object
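The comments in the cell above contrast map with replace. The following is a minimal sketch of that difference on the same three names. The variable names (`mapping`, `replaced`, `mapped`, `filled`) are illustrative and not part of the original notebook, and the `fillna` step is just one possible way to keep unmatched values when using map.

```python
import pandas as pd

s = pd.Series(['John', 'Tom', 'Matt'])
mapping = {'John': 'Josh', 'Tom': 'Tim'}  # 'Matt' has no entry on purpose

# replace: values missing from the mapping are left as they are
replaced = s.replace(mapping)

# map: values missing from the mapping become NaN
mapped = s.map(mapping)

# one way to keep the original value where map found no match
filled = s.map(mapping).fillna(s)

print(replaced.tolist())
print(mapped.tolist())
print(filled.tolist())
```

If unmatched values should survive untouched, replace (or map followed by fillna) is probably the more convenient choice; map is handy when a missing key should be visible as NaN.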
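.apply()
Applies the given function to each individual column or row.
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html
As an analogy: NumPy has ufuncs that operate on a whole ndarray. In a DataFrame, each row and each column is a Series, and apply runs the given function on one entire Series at a time.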
In [40]:
np.random.seed(987)
data = np.random.randint(1, 6, (4, 4))
In [41]:
df = pd.DataFrame(data)
In [42]:
df.columns = ['col1', 'col2', 'col3', 'col4']
df.index = ['row1', 'row2', 'row3', 'row4']
In [16]:
df
Out[16]:
| | col1 | col2 | col3 | col4 |
|---|---|---|---|---|
| row1 | 4 | 2 | 4 | 3 |
| row2 | 3 | 4 | 3 | 5 |
| row3 | 3 | 5 | 4 | 2 |
| row4 | 4 | 3 | 1 | 3 |
In [17]:
df.apply(lambda x:x**2)
Out[17]:
| | col1 | col2 | col3 | col4 |
|---|---|---|---|---|
| row1 | 16 | 4 | 16 | 9 |
| row2 | 9 | 16 | 9 | 25 |
| row3 | 9 | 25 | 16 | 4 |
| row4 | 16 | 9 | 1 | 9 |
In [18]:
df.apply(sum)
Out[18]:
col1    14
col2    14
col3    12
col4    13
dtype: int64
In [19]:
df.apply(sum, axis=1)
Out[19]:
row1    13
row2    15
row3    14
row4    11
dtype: int64
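To make the point that apply hands a whole Series to the function more concrete, here is a small sketch. It rebuilds the 4x4 table shown above by hand rather than from the random seed, and `value_range` and `received` are hypothetical names introduced only for this illustration.

```python
import pandas as pd

# Same values as the table printed earlier, typed in directly
df = pd.DataFrame(
    [[4, 2, 4, 3], [3, 4, 3, 5], [3, 5, 4, 2], [4, 3, 1, 3]],
    index=['row1', 'row2', 'row3', 'row4'],
    columns=['col1', 'col2', 'col3', 'col4'],
)

received = []  # records the type of object the function is called with

def value_range(series):
    """Return max minus min of one column or one row."""
    received.append(type(series).__name__)
    return series.max() - series.min()

col_range = df.apply(value_range)          # default axis=0: one call per column
row_range = df.apply(value_range, axis=1)  # axis=1: one call per row

print(col_range)
print(row_range)
print(set(received))
```

Because the function receives a Series, it can use Series methods such as max and min; a function that only accepts a single scalar would not work here.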
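.applymap()
Applies the given function to each individual element. In pandas 2.1.0 and later, DataFrame.applymap is deprecated in favor of DataFrame.map.
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.applymap.html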
In [20]:
import math
In [21]:
df.applymap(lambda x: math.sin(x))
Out[21]:
| | col1 | col2 | col3 | col4 |
|---|---|---|---|---|
| row1 | -0.756802 | 0.909297 | -0.756802 | 0.141120 |
| row2 | 0.141120 | -0.756802 | 0.141120 | -0.958924 |
| row3 | 0.141120 | -0.958924 | -0.756802 | 0.909297 |
| row4 | -0.756802 | 0.141120 | 0.841471 | 0.141120 |
In [22]:
df.apply(lambda x: math.sin(x))
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-22-df9131ccdfc8> in <cell line: 1>()
----> 1 df.apply(lambda x: math.sin(x))

/usr/local/lib/python3.10/dist-packages/pandas/core/frame.py in apply(self, func, axis, raw, result_type, args, **kwargs)
   9566             kwargs=kwargs,
   9567         )
-> 9568         return op.apply().__finalize__(self, method="apply")
   9569
   9570     def applymap(

/usr/local/lib/python3.10/dist-packages/pandas/core/apply.py in apply(self)
    762             return self.apply_raw()
    763
--> 764         return self.apply_standard()
    765
    766     def agg(self):

/usr/local/lib/python3.10/dist-packages/pandas/core/apply.py in apply_standard(self)
    889
    890     def apply_standard(self):
--> 891         results, res_index = self.apply_series_generator()
    892
    893         # wrap results

/usr/local/lib/python3.10/dist-packages/pandas/core/apply.py in apply_series_generator(self)
    905             for i, v in enumerate(series_gen):
    906                 # ignore SettingWithCopy here in case the user mutates
--> 907                 results[i] = self.f(v)
    908                 if isinstance(results[i], ABCSeries):
    909                     # If we have a view on v, we need to make a copy because

<ipython-input-22-df9131ccdfc8> in <lambda>(x)
----> 1 df.apply(lambda x: math.sin(x))

/usr/local/lib/python3.10/dist-packages/pandas/core/series.py in wrapper(self)
    204         if len(self) == 1:
    205             return converter(self.iloc[0])
--> 206         raise TypeError(f"cannot convert the series to {converter}")
    207
    208     wrapper.__name__ = f"__{converter.__name__}__"

TypeError: cannot convert the series to <class 'float'>
In [24]:
df.apply(lambda x: np.sin(x))
Out[24]:
| | col1 | col2 | col3 | col4 |
|---|---|---|---|---|
| row1 | -0.756802 | 0.909297 | -0.756802 | 0.141120 |
| row2 | 0.141120 | -0.756802 | 0.141120 | -0.958924 |
| row3 | 0.141120 | -0.958924 | -0.756802 | 0.909297 |
| row4 | -0.756802 | 0.141120 | 0.841471 | 0.141120 |
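The last two cells differ only in the sine function used. math.sin expects a single number, while apply passes a whole Series, which is why the first call raises a TypeError; np.sin is a ufunc and accepts the Series directly. The sketch below reproduces this on a small made-up frame. The `hasattr` check is an assumption added here so the element-wise call works both on pandas versions that provide DataFrame.map and on older ones that only have applymap.

```python
import math

import numpy as np
import pandas as pd

# Small illustrative frame, not the notebook's random data
df = pd.DataFrame({'col1': [4, 3], 'col2': [2, 4]})

# apply passes each column as a Series; math.sin cannot convert it to a float
raised = False
try:
    df.apply(lambda x: math.sin(x))
except TypeError:
    raised = True

# np.sin is vectorized, so it works on a whole Series at once
vectorized = df.apply(np.sin)

# Element-wise application: the function sees one scalar at a time,
# so math.sin is fine here. Use DataFrame.map when available, else applymap.
elementwise_method = df.map if hasattr(df, 'map') else df.applymap
elementwise = elementwise_method(math.sin)

print(raised)
print(np.allclose(vectorized.to_numpy(), elementwise.to_numpy()))
```

Both routes give the same numbers; the vectorized call is usually the faster one on large frames, since it avoids a Python-level call per element.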
In [43]:
# A list for col1 applies both functions to that column;
# every column then gets a second header level holding the function name.
df.apply({'col1':[np.sin, np.cos], 'col2':np.cos, 'col3':lambda x:x**2, 'col4':lambda x:np.sqrt(x)})
Out[43]:
| | col1 | col1 | col2 | col3 | col4 |
|---|---|---|---|---|---|
| | sin | cos | cos | <lambda> | <lambda> |
| row1 | -0.756802 | -0.653644 | -0.416147 | 16 | 1.732051 |
| row2 | 0.141120 | -0.989992 | -0.653644 | 9 | 2.236068 |
| row3 | 0.141120 | -0.989992 | 0.283662 | 16 | 1.414214 |
| row4 | -0.756802 | -0.653644 | -0.989992 | 1 | 1.732051 |
In [27]:
df.apply([np.sum, np.mean])
Out[27]:
| | col1 | col2 | col3 | col4 |
|---|---|---|---|---|
| sum | 14.0 | 14.0 | 12.0 | 13.00 |
| mean | 3.5 | 3.5 | 3.0 | 3.25 |
.agg()
Computes several aggregations in one call, and can also aggregate different columns with different methods.
.agg() is an alias for .aggregate().
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.agg.html
In [28]:
df.agg([np.sum, np.mean])
Out[28]:
| | col1 | col2 | col3 | col4 |
|---|---|---|---|---|
| sum | 14.0 | 14.0 | 12.0 | 13.00 |
| mean | 3.5 | 3.5 | 3.0 | 3.25 |
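The description of .agg() also mentions using different methods for different columns, which the list form above does not show. A minimal sketch of the dict form follows, using the same 4x4 values typed in by hand. The string names 'sum', 'mean', 'min' and 'max' are used instead of NumPy functions as a stylistic choice for this example; `single` and `per_column` are illustrative names.

```python
import pandas as pd

# Same values as the 4x4 table used in the .apply() section
df = pd.DataFrame(
    [[4, 2, 4, 3], [3, 4, 3, 5], [3, 5, 4, 2], [4, 3, 1, 3]],
    index=['row1', 'row2', 'row3', 'row4'],
    columns=['col1', 'col2', 'col3', 'col4'],
)

# One aggregation per column: the result is a Series indexed by column name
single = df.agg({'col1': 'sum', 'col2': 'mean'})

# Several aggregations per column: the result is a DataFrame,
# with NaN where a method was not requested for that column
per_column = df.agg({'col1': ['sum', 'mean'], 'col2': ['min', 'max']})

print(single)
print(per_column)
```

Columns that are not listed in the dict (col3 and col4 here) are simply left out of the result.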
In [45]:
np.random.seed(987)
names = np.random.choice(['A','B','C'], 20)
score1 = np.random.randint(1, 11, 20)
score2 = np.random.randint(1, 11, 20)
score3 = np.random.randint(1, 11, 20)
In [46]:
df = pd.DataFrame({'names':names, 'score1':score1, 'score2':score2, 'score3':score3})
In [47]:
df.groupby('names').apply({'score1':np.sum, 'score2':np.mean, 'score3':np.std})
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-47-1e433f903117> in <cell line: 1>()
----> 1 df.groupby('names').apply({'score1':np.sum, 'score2':np.mean, 'score3':np.std})

/usr/local/lib/python3.10/dist-packages/pandas/core/groupby/groupby.py in apply(self, func, *args, **kwargs)
   1516     def apply(self, func, *args, **kwargs) -> NDFrameT:
   1517         # GH#50538
-> 1518         is_np_func = func in com._cython_table and func not in com._builtin_table
   1519         orig_func = func
   1520         func = com.is_builtin_func(func)

TypeError: unhashable type: 'dict'
In [48]:
df.groupby('names').agg({'score1':np.sum, 'score2':np.mean, 'score3':np.std})
Out[48]:
| names | score1 | score2 | score3 |
|---|---|---|---|
| A | 15 | 3.666667 | 2.645751 |
| B | 33 | 7.333333 | 3.311596 |
| C | 66 | 5.454545 | 2.467977 |
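As the two cells above show, a grouped apply does not accept a dict of column-to-function pairs, while agg does. The sketch below repeats the pattern on a tiny hand-made table (the data and the output column names `total` and `average` are invented for this illustration) and also shows named aggregation, which lets the result columns be renamed in the same call.

```python
import pandas as pd

# Tiny made-up table, chosen so the results are easy to check by hand
df = pd.DataFrame({
    'names': ['A', 'A', 'B', 'B'],
    'score1': [1, 2, 3, 4],
    'score2': [2, 4, 6, 8],
})

# Dict form: one aggregation per column, given as string names
by_dict = df.groupby('names').agg({'score1': 'sum', 'score2': 'mean'})

# Named aggregation: new_column=(source_column, aggregation)
named = df.groupby('names').agg(
    total=('score1', 'sum'),
    average=('score2', 'mean'),
)

print(by_dict)
print(named)
```

Named aggregation is worth considering when the same source column needs more than one summary, since each result gets its own flat column name instead of a two-level header.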