30  merge

30.1 merge方法

横向连接,包括内连接和外连接,结果都是列增加了,行根据要求可能减少,不变或者增加)

import pandas as pd

data=pd.DataFrame([{"id":0,"name":'lxh',"age":20,"cp":'lm'},
                   {"id":1,"name":'xiao',"age":40,"cp":'ly'},
                   {"id":2,"name":'hua',"age":4,"cp":'yry'},
                   {"id":3,"name":'be',"age":70,"cp":'old'}])
data1=pd.DataFrame([{"id":100,"name":'lxh','cs':10},
                    {"id":101,"name":'xiao','cs':40},
                    {"id":102,"name":'hua2','cs':50}])
data2=pd.DataFrame([{"id":0,"name":'lxh','cs':10},
                    {"id":101,"name":'xiao','cs':40},
                    {"id":102,"name":'hua2','cs':50}])
data
id name age cp
0 0 lxh 20 lm
1 1 xiao 40 ly
2 2 hua 4 yry
3 3 be 70 old
data1
id name cs
0 100 lxh 10
1 101 xiao 40
2 102 hua2 50

30.1.1 内连接

30.1.1.1 单列作为连接键。

不设置how的值,则how默认值为”inner”。
结果:求交集。

pd.merge(data,data1,
         on="name",
         suffixes=('_left','_right')
        )    
id_left name age cp id_right cs
0 0 lxh 20 lm 100 10
1 1 xiao 40 ly 101 40

30.1.1.2 多列作为连接键。

不设置how的值,则how默认值为”inner”。
结果:要求两列值同时符合,求交集。

data
id name age cp
0 0 lxh 20 lm
1 1 xiao 40 ly
2 2 hua 4 yry
3 3 be 70 old
data2
id name cs
0 0 lxh 10
1 101 xiao 40
2 102 hua2 50
pd.merge(data,data2,
         on=("name","id")
        )
id name age cp cs
0 0 lxh 20 lm 10

30.1.1.3 指定不同名连接键

当左右两个DataFrame的列名不同,当又想做为连接键时可以使用left_on与right_on来指定连接键

data3=pd.DataFrame([
    {"mid":0,"mname":'lxh','cs':10},
    {"mid":101,"mname":'xiao','cs':40},
    {"mid":102,"mname":'hua2','cs':50}]
)
data
id name age cp
0 0 lxh 20 lm
1 1 xiao 40 ly
2 2 hua 4 yry
3 3 be 70 old
data3
mid mname cs
0 0 lxh 10
1 101 xiao 40
2 102 hua2 50
pd.merge(data,data3,
         left_on=["name"],
         right_on=["mname"])
id name age cp mid mname cs
0 0 lxh 20 lm 0 lxh 10
1 1 xiao 40 ly 101 xiao 40
pd.merge(data,data3,
         left_on=["name","id"],
         right_on=["mname","mid"]
)
id name age cp mid mname cs
0 0 lxh 20 lm 0 lxh 10
pd.merge(data,data3,
         left_on=["name"],
         right_on=["mname"], 
         how = 'left')
id name age cp mid mname cs
0 0 lxh 20 lm 0.0 lxh 10.0
1 1 xiao 40 ly 101.0 xiao 40.0
2 2 hua 4 yry NaN NaN NaN
3 3 be 70 old NaN NaN NaN

30.1.1.4 不指定键

不指定on则以两个DataFrame的列名交集做为连接键,如下面使用了id与name

pd.merge(data,data2)
id name age cp cs
0 0 lxh 20 lm 10

30.1.1.5 设置行索引

这里将name设为索引

indexed_data1=data1.set_index("name")
data
id name age cp
0 0 lxh 20 lm
1 1 xiao 40 ly
2 2 hua 4 yry
3 3 be 70 old
indexed_data1
id cs
name
lxh 100 10
xiao 101 40
hua2 102 50

使用右边的DataFrame的行索引做为连接键,通过right_index=True设置

pd.merge(data, indexed_data1, left_on='name', right_index=True)
id_x name age cp id_y cs
0 0 lxh 20 lm 100 10
1 1 xiao 40 ly 101 40

30.1.2 外连接

30.1.2.1 左外连接

以左侧表的name列为连接键,向左外连接

data
id name age cp
0 0 lxh 20 lm
1 1 xiao 40 ly
2 2 hua 4 yry
3 3 be 70 old
data1
id name cs
0 100 lxh 10
1 101 xiao 40
2 102 hua2 50
pd.merge(data, data1,
         on="name",
         how="left",
         suffixes=('_left','_right'))
id_left name age cp id_right cs
0 0 lxh 20 lm 100.0 10.0
1 1 xiao 40 ly 101.0 40.0
2 2 hua 4 yry NaN NaN
3 3 be 70 old NaN NaN
# 交换表的位置

pd.merge(data1,data,on="name",how="left")
id_x name cs id_y age cp
0 100 lxh 10 0.0 20.0 lm
1 101 xiao 40 1.0 40.0 ly
2 102 hua2 50 NaN NaN NaN

30.1.2.2 右外连接

以右侧表为基准进行外连接,得到的整体数据内容与上面左外连接是一样的。

pd.merge(data,data1,
         on="name",
         how="right")
id_x name age cp id_y cs
0 0.0 lxh 20.0 lm 100 10
1 1.0 xiao 40.0 ly 101 40
2 NaN hua2 NaN NaN 102 50

30.1.2.3 全外连接

保留两个表的所有行,无匹配则用NAN填充

pd.merge(data,data1,
         on="name",
         how="outer")
id_x name age cp id_y cs
0 0.0 lxh 20.0 lm 100.0 10.0
1 1.0 xiao 40.0 ly 101.0 40.0
2 2.0 hua 4.0 yry NaN NaN
3 3.0 be 70.0 old NaN NaN
4 NaN hua2 NaN NaN 102.0 50.0