Create a new column based on three columns in a PySpark DataFrame

Problem description

I have a DataFrame that looks like this:

|short_1|short_2|full_nm|
|"Ginder"|"Ginder v. Ginder"|"Carrie Diane GINDER v. Carl L. GINDER"|
|"KENNEY"|"KENNEY v. BARNHART"|""|
|""|"United States v. $933,075.00 IN UNITED STATES CURRENCY"|""|
|""|""|""|
|"Rieser"|""|""|

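I want to build a new column from these three columns by priority: take full_nm if it is present; otherwise take short_2; otherwise take short_1; and if all three values are missing, leave the new column empty. In PySpark terms, the logic is the following pseudocode: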
if full_nm:
    df['new'] = full_nm
elif short_2:
    df['new'] = short_2
elif short_1:
    df['new'] = short_1
else:
    df['new'] = ''
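
Read one row at a time, the pseudocode amounts to "return the first non-empty value in priority order". The plain-Python sketch below only illustrates that rule; it is not PySpark code, and the function name `pick_name` is a hypothetical one chosen for this example. It assumes that both empty strings and None count as "not present".

```python
def pick_name(full_nm, short_2, short_1):
    """Return the first non-empty value in priority order, or '' if none."""
    for value in (full_nm, short_2, short_1):
        # Empty strings and None are both falsy, so both are skipped.
        if value:
            return value
    # All three values were missing.
    return ""
```

For example, `pick_name("", "KENNEY v. BARNHART", "KENNEY")` falls through the empty full_nm and picks the short_2 value.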

Expected output:

|short_1|short_2|full_nm|new|
|"Ginder"|"Ginder v. Ginder"|"Carrie Diane GINDER v. Carl L. GINDER"|"Carrie Diane GINDER v. Carl L. GINDER"|
|"KENNEY"|"KENNEY v. BARNHART"|""|"KENNEY v. BARNHART"|
|""|"United States v. $933,075.00 IN UNITED STATES CURRENCY"|""|"United States v. $933,075.00 IN UNITED STATES CURRENCY"|
|""|""|""|""|
|"Rieser"|""|""|"Rieser"|

Any help would be appreciated.

python pyspark conditional-statements case
1 Answer

Mask the empty values with null, then coalesce the masked columns in the required order:

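from pyspark.sql import functions as F

# Priority order: the first non-empty column wins.
cols = ['full_nm', 'short_2', 'short_1']
# when() without otherwise() turns '' (and null) into null.
masked_cols = [F.when(F.col(c) != '', F.col(c)) for c in cols]
# coalesce() takes the first non-null value; fillna restores '' if all are empty.
df = df.withColumn('new', F.coalesce(*masked_cols)).fillna({'new': ''})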

+-------+--------------------+--------------------+--------------------+
|short_1|             short_2|             full_nm|                 new|
+-------+--------------------+--------------------+--------------------+
| Ginder|    Ginder v. Ginder|Carrie Diane GIND...|Carrie Diane GIND...|
| KENNEY|  KENNEY v. BARNHART|                    |  KENNEY v. BARNHART|
|       |United States v. ...|                    |United States v. ...|
|       |                    |                    |                    |
| Rieser|                    |                    |              Rieser|
+-------+--------------------+--------------------+--------------------+