I have a schema in string format:
schema_str = """StructType([StructField("firstname",StringType(),True),StructField("middlename",StringType(),True),StructField("lastname",StringType(),True), StructField("id", StringType(), True),StructField("gender", StringType(), True),StructField("salary", IntegerType(), True)])"""
The data is:
data2 = [
("James","","Smith","36636","M",3000),
("Michael","Rose","","40288","M",4000),
("Robert","","Williams","42114","M",4000),
("Maria","Anne","Jones","39192","F",4000),
("Jen","Mary","Brown","","F",-1)
]
I want to create a DataFrame from it.
df = sqlContext.createDataFrame(data = data2, schema = schema_str)
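But this raises an error. How can I create the DataFrame? P.S.:
PySpark has a handy built-in helper, _parse_datatype_string. However, it takes a simpler string schema as input, for example: "firstname string, middlename string, lastname string, id string, gender string, salary int".
Converting your source string into this form should not be difficult.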
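One way that conversion might look is sketched below. It is only a sketch: it assumes every field in the source string has the flat form StructField("name", SomeType(), True) with no nested types. The names TYPE_NAMES, FIELD_RE, to_ddl and make_df are hypothetical helpers introduced here, the type mapping is partial, and the nullable flag is ignored because every field in the question is nullable.

```python
import re

schema_str = """StructType([StructField("firstname",StringType(),True),StructField("middlename",StringType(),True),StructField("lastname",StringType(),True), StructField("id", StringType(), True),StructField("gender", StringType(), True),StructField("salary", IntegerType(), True)])"""

# Hypothetical, partial mapping from PySpark type class names to the
# short type names used in the simple string form; extend as needed.
TYPE_NAMES = {
    "StringType": "string",
    "IntegerType": "int",
    "LongType": "bigint",
    "DoubleType": "double",
    "BooleanType": "boolean",
}

# Matches one flat field: StructField("name", SomeType(), True|False)
FIELD_RE = re.compile(
    r'StructField\(\s*"([^"]+)"\s*,\s*(\w+)\(\)\s*,\s*(True|False)\s*\)'
)


def to_ddl(schema_repr):
    """Turn the StructType(...) text into a "name type, name type" string.

    The nullable flag is parsed but not used in this sketch.
    """
    columns = []
    for name, type_cls, _nullable in FIELD_RE.findall(schema_repr):
        columns.append(f"{name} {TYPE_NAMES[type_cls]}")
    return ", ".join(columns)


ddl = to_ddl(schema_str)


def make_df(spark, rows):
    """Build the DataFrame; `spark` is assumed to be an active SparkSession."""
    # createDataFrame accepts a data type string as its schema argument.
    return spark.createDataFrame(rows, schema=ddl)
```

With the data from the question, make_df(spark, data2) would then build the DataFrame; sqlContext.createDataFrame(data2, schema=ddl) should work the same way, since the schema argument there also accepts this string form.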