How do I create multiple columns in PySpark, as SAS does inside an if/then do block?


I want to convert the SAS code below to PySpark. Can someone help me?

data ABC_New;
set ABC(where=(A=1234));
format C z14.;
 if A ge 0 then do;
    C=A*50000;
    X1 = input(substr(put(C,z14.),1,2),2.);
    X2 = input(substr(put(C,z14.),3,2),2.);
    X3  =input(substr(put(C,z14.),1,1)||substr(put(C,z14.),3,1),2.);
    X4  =input(substr(put(C,z14.),2,1)||substr(put(C,z14.),4,1),2.);
  end;
run;

Any help is appreciated!

pyspark sas
1 Answer

SAS differs from Python and PySpark in many ways. Here are some highlights:

We transform the data by defining the steps of an execution pipeline.

ABC_New = (ABC
  # Keep only the rows where A is 1234.
  .filter(ABC.A == 1234)
  # Each withColumn call adds a single column, but that column can be a
  # struct with nested fields. The transformation that produces it (my_udf)
  # has to be defined separately, before this pipeline runs.
  .withColumn('X', my_udf(ABC.A)))

Here we define the function that performs the transformation inside the pipeline:

def my_func(A):
  # Exit early if A is missing or is not greater than or equal to zero.
  if A is None or A < 0:
    return (None, None, None, None) # NOTE: we must return 4 values, one per field in the schema.

  C = A * 50000 # This is the same in almost every language ;)
  padded = str(int(C)).zfill(14) # Convert the number to a string first, then zero-pad it ("z-fill") to 14 characters for reuse
  X1 = int(padded[0:2]) # Slice the string from index 0 (inclusive) up to index 2 (exclusive)
  X2 = int(padded[2:4])
  X3 = int(padded[0] + padded[2])
  X4 = int(padded[1] + padded[3])
  return (X1, X2, X3, X4) # We return all four values packed in a tuple. They'll be nested below our parent column in the new dataset.
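
As a quick sanity check of the slicing logic, the same arithmetic can be run in plain Python, with no Spark session involved. This is only a sketch: `split_digits` is a hypothetical helper name that does not appear in the original answer, and it assumes `A` holds whole numbers. It also shows that for the filtered value `A = 1234` the padded string starts with zeros, so all four results are 0.

```python
def split_digits(a):
    """Plain-Python mirror of the slicing logic (hypothetical helper, not from the answer)."""
    # SAS treats a missing value as smaller than any number, so `A ge 0` is false for it.
    if a is None or a < 0:
        return (None, None, None, None)
    padded = str(int(a * 50000)).zfill(14)  # plays the role of SAS put(C, z14.)
    return (
        int(padded[0:2]),              # X1: characters 1-2
        int(padded[2:4]),              # X2: characters 3-4
        int(padded[0] + padded[2]),    # X3: characters 1 and 3
        int(padded[1] + padded[3]),    # X4: characters 2 and 4
    )


print(split_digits(1234))        # padded is "00000061700000" -> (0, 0, 0, 0)
print(split_digits(246913578))   # padded is "12345678900000" -> (12, 34, 13, 24)
print(split_digits(-1))          # negative input -> (None, None, None, None)
```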

Here we define the types and columns that the transformation function returns.

from pyspark.sql.functions import udf
from pyspark.sql.types import StructType, StructField, IntegerType

schema = StructType([ # A struct is a data structure that holds other data structures
    StructField("X1", IntegerType()),
    StructField("X2", IntegerType()),
    StructField("X3", IntegerType()),
    StructField("X4", IntegerType())
])

my_udf = udf(my_func, schema) # Combine the Python function with the PySpark schema to get a function usable in the pipeline. This must run before the pipeline that calls my_udf.

This is what the resulting schema looks like.

root
 |-- X: struct    (nullable = true)
 |   |-- X1: int  (nullable = true)
 |   |-- X2: int  (nullable = true)
 |   |-- X3: int  (nullable = true)
 |   |-- X4: int  (nullable = true)
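
The SAS step produces X1 to X4 as ordinary top-level columns, whereas the pipeline above nests them under the struct column X. A minimal sketch of two ways to close that gap follows. The function names `flatten_struct` and `add_digit_columns` are hypothetical, and the second function is an alternative to the answer's UDF, not part of it: it assumes a numeric column named `A` and uses only built-in column functions (`lpad`, `substring`, `concat`, `when`), which usually avoids the overhead of a Python UDF.

```python
def flatten_struct(df, struct_col="X"):
    """Promote every field of a struct column to a top-level column."""
    return df.select("*", f"{struct_col}.*").drop(struct_col)


def add_digit_columns(df):
    """Build C and X1..X4 with built-in column functions instead of a Python UDF."""
    from pyspark.sql import functions as F  # imported lazily; requires pyspark

    non_negative = F.col("A") >= 0
    c = (F.col("A") * 50000).cast("long")
    # Left-pad with zeros to 14 characters, like SAS put(C, z14.).
    # Note: lpad truncates values that are longer than 14 characters.
    padded = F.lpad(c.cast("string"), 14, "0")

    def digits(*positions):
        # positions are 1-based (start, length) pairs, as in SAS substr
        parts = [F.substring(padded, start, length) for start, length in positions]
        # when() without otherwise() yields NULL where A < 0 or A is NULL
        return F.when(non_negative, F.concat(*parts).cast("int"))

    return (
        df.withColumn("C", F.when(non_negative, c))
        .withColumn("X1", digits((1, 2)))
        .withColumn("X2", digits((3, 2)))
        .withColumn("X3", digits((1, 1), (3, 1)))
        .withColumn("X4", digits((2, 1), (4, 1)))
    )
```

With the `ABC` DataFrame from the question, `flatten_struct` would be applied to the result of the UDF pipeline, while `add_digit_columns(ABC.filter(ABC.A == 1234))` would replace that pipeline entirely.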