I have a DataFrame like the one below:
|short_1|short_2|full_nm|
|"Ginder"|"Ginder v. Ginder"|"Carrie Diane GINDER v. Carl L. GINDER"|
|"KENNEY"|"KENNEY v. BARNHART"|""|
|""|"United States v. $933,075.00 IN UNITED STATES CURRENCY"|""|
|""|""|""|
|"Rieser"|""|""|
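I want to pick the full_nm column first (use its value if present), then short_2 (if full_nm is missing), then short_1 (if both full_nm and short_2 are missing). If all 3 values are missing, the result should stay empty. I want to create a new column in PySpark based on this priority, as in the pseudocode below:
if full_nm:
    df['new'] = full_nm
elif short_2:
    df['new'] = short_2
elif short_1:
    df['new'] = short_1
else:
    df['new'] = ''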
Expected output:
|short_1|short_2|full_nm|new|
|"Ginder"|"Ginder v. Ginder"|"Carrie Diane GINDER v. Carl L. GINDER"|"Carrie Diane GINDER v. Carl L. GINDER"|
|"KENNEY"|"KENNEY v. BARNHART"|""|"KENNEY v. BARNHART"|
|""|"United States v. $933,075.00 IN UNITED STATES CURRENCY"|""|"United States v. $933,075.00 IN UNITED STATES CURRENCY"|
|""|""|""|""|
|"Rieser"|""|""|"Rieser"|
Any help would be appreciated.
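You can use coalesce. Mask the columns in the required order, turning empty strings into nulls, so that coalesce returns the first non-empty value: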
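from pyspark.sql import functions as F
# Columns in priority order: the first non-empty one wins
cols = ['full_nm', 'short_2', 'short_1']
masked_cols = [F.when(F.col(c) != '', F.col(c)) for c in cols]
# coalesce picks the first non-null value; fillna restores '' when all are empty
df = df.withColumn('new', F.coalesce(*masked_cols)).fillna({'new': ''})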
+-------+--------------------+--------------------+--------------------+
|short_1|             short_2|             full_nm|                 new|
+-------+--------------------+--------------------+--------------------+
| Ginder|    Ginder v. Ginder|Carrie Diane GIND...|Carrie Diane GIND...|
| KENNEY|  KENNEY v. BARNHART|                    |  KENNEY v. BARNHART|
|       |United States v. ...|                    |United States v. ...|
|       |                    |                    |                    |
| Rieser|                    |                    |              Rieser|
+-------+--------------------+--------------------+--------------------+
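If it helps to see why masking comes before coalesce, here is a rough plain-Python model of what the expression does for each row. It is only a sketch of the semantics, not PySpark code: the helpers mask, coalesce and first_non_empty are hypothetical names made up for illustration, and None stands in for a Spark null.

```python
from typing import Optional

# Plain-Python model of the row-level logic. These helpers are hypothetical
# illustrations, not PySpark APIs; None plays the role of a Spark null.


def mask(value: Optional[str]) -> Optional[str]:
    """Mimic F.when(col != '', col): keep non-empty strings, otherwise null."""
    return value if value is not None and value != "" else None


def coalesce(*values: Optional[str]) -> Optional[str]:
    """Mimic F.coalesce: return the first non-null argument, or null."""
    for v in values:
        if v is not None:
            return v
    return None


def first_non_empty(row: dict, cols=("full_nm", "short_2", "short_1")) -> str:
    """Mask each column in priority order, coalesce, then fill null with ''."""
    result = coalesce(*(mask(row.get(c)) for c in cols))
    return "" if result is None else result


rows = [
    {"short_1": "Ginder", "short_2": "Ginder v. Ginder",
     "full_nm": "Carrie Diane GINDER v. Carl L. GINDER"},
    {"short_1": "KENNEY", "short_2": "KENNEY v. BARNHART", "full_nm": ""},
    {"short_1": "", "short_2": "", "full_nm": ""},
    {"short_1": "Rieser", "short_2": "", "full_nm": ""},
]
new_values = [first_non_empty(r) for r in rows]
```

Without the masking step, coalesce would stop at the first column, because an empty string is not null. The final fill step is what turns the all-empty row back into '' instead of null.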