I am trying to read a small file as a Dataset, but it gives the error
"Cannot up cast
ordId
from string to int as it may truncate".
Here is the code:
import org.apache.spark.sql.{Encoders, SparkSession}

object Main {
  case class Orders(ordId: Int, custId: Int, amount: Float, date: String)

  def main(args: Array[String]): Unit = {
    val schema = Encoders.product[Orders].schema
    val spark = SparkSession.builder
      .master("local[*]")
      .appName("")
      .getOrCreate()
    val df = spark.read.option("header", true).csv("/mnt/data/orders.txt")
    import spark.implicits._
    val ds = df.as[Orders]
  }
}
orders.txt
ordId,custId,amount,date
1234,123,400,20190112
2345,456,600,20190122
1345,123,500,20190123
3456,345,800,20190202
5678,123,600,20190203
6578,455,900,20190301
How do I fix this error? I would also like to know: do I first need to read the file as a DataFrame and then convert it to a Dataset?
Try passing the schema while reading (using .schema) instead of reading as a plain DataFrame first:
import org.apache.spark.sql.Encoders

val schema = Encoders.product[Orders].schema
val ds = spark.read.option("header", true).schema(schema).csv("/mnt/data/orders.txt").as[Orders]
ds.show()
Result:
+-----+------+------+--------+
|ordId|custId|amount|    date|
+-----+------+------+--------+
| 1234|   123| 400.0|20190112|
| 2345|   456| 600.0|20190122|
| 1345|   123| 500.0|20190123|
| 3456|   345| 800.0|20190202|
| 5678|   123| 600.0|20190203|
| 6578|   455| 900.0|20190301|
+-----+------+------+--------+
Schema:
ds.printSchema()
root
|-- ordId: integer (nullable = true)
|-- custId: integer (nullable = true)
|-- amount: float (nullable = true)
|-- date: string (nullable = true)
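To answer the second part of the question: you do not have to go through a DataFrame first, but if you already have a string-typed DataFrame you can also cast each column to the case-class types before calling .as[Orders]. Below is a minimal self-contained sketch of that approach; the in-memory Seq and the object name CastExample are illustrative stand-ins for reading the CSV file.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object CastExample {
  case class Orders(ordId: Int, custId: Int, amount: Float, date: String)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local[*]")
      .appName("cast-example")
      .getOrCreate()
    import spark.implicits._

    // CSV columns arrive as strings; cast each one to the type declared
    // in the case class before converting, which is what .schema(...)
    // achieves at read time.
    val raw = Seq(("1234", "123", "400", "20190112"))
      .toDF("ordId", "custId", "amount", "date")

    val ds = raw.select(
      col("ordId").cast("int"),
      col("custId").cast("int"),
      col("amount").cast("float"),
      col("date")
    ).as[Orders]

    assert(ds.first() == Orders(1234, 123, 400.0f, "20190112"))
    spark.stop()
  }
}
```

Passing the schema at read time (as in the accepted answer) is generally preferable, since it lets Spark parse the values directly into the right types instead of reading everything as strings and casting afterwards.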