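Hi, I am trying to compute a maximum value for the following problem using strings read from input. Problem description: Given two months x and y, where y > x, find the hashtag name whose tweet count increased the most from month x to month y. We have already written code in your code template that reads the x and y values from the keyboard. Ignore the tweets in the months between x and y, so just compare the tweet counts in month x and month y. Report the hashtag name and the number of tweets in months x and y. Ignore any hashtag name that had no tweets in either month x or month y. You can assume that the combination of hashtag and month is unique. Print the result to the terminal output using println. For the small example dataset, the output should be the following: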
Input x = 200910, y = 200912
Output hashtagName: mycoolwife, countX: 1, countY: 500
Data format (tab-separated):
Token type   Month    Count   Hashtag name
hashtag      200910   2       Babylove
hashtag      200911   2       babylove
hashtag      200912   90      babylove
My attempt:
// Load the input data and split each line into an array of strings
val twitterLines = sc.textFile("hdfs:///user/ashhall1616/bdc_data/twitter-small.tsv")
val twitterdata = twitterLines.map(_.split("\t"))
// Each month is a string formatted as YYYYMM
val x = scala.io.StdIn.readLine("x month: ")
val y = scala.io.StdIn.readLine("y month: ")
val matchmonth= twitterdata.map(r => (r(0)== x ,r(0)==y, r(2), r(3))).sortBy(_._3, false)
if(matchmonth.(r => (r(0))) < matchmonth.(r => (r(1)))
{
val ht1 = matchmonth.map(r => (r(2), r(3))).take(1)
val ht2 = matchmonth.map(r => (r(2), r(3))).take(1,2)
println("[" + ht1 + "," + ht2 + "]")
}
The errors I get:
val matchmonth= twitterdata.map(r => (r(0)== x ,r(0)==y, r(2), r(3))).sortBy(_._3, false)
matchmonth: org.apache.spark.rdd.RDD[(Boolean, Boolean, String, String)] = MapPartitionsRDD[20] at sortBy at <console>:32
scala> if(matchmonth.(r => (r(0))) < matchmonth.(r => (r(1)))
<console>:1: error: identifier expected but '(' found.
if(matchmonth.(r => (r(0))) < matchmonth.(r => (r(1)))
^
<console>:1: error: identifier expected but '(' found.
if(matchmonth.(r => (r(0))) < matchmonth.(r => (r(1)))
^
scala> {
| val ht1 = matchmonth.map(r => (r(2), r(3))).take(1)
| val ht2 = matchmonth.map(r => (r(2), r(3))).take(1,2)
| println("[" + ht1 + "," + ht2 + "]")
| }
<console>:36: error: (Boolean, Boolean, String, String) does not take parameters
val ht1 = matchmonth.map(r => (r(2), r(3))).take(1)
^
<console>:36: error: (Boolean, Boolean, String, String) does not take parameters
val ht1 = matchmonth.map(r => (r(2), r(3))).take(1)
^
<console>:37: error: (Boolean, Boolean, String, String) does not take parameters
val ht2 = matchmonth.map(r => (r(2), r(3))).take(1,2)
^
<console>:37: error: (Boolean, Boolean, String, String) does not take parameters
val ht2 = matchmonth.map(r => (r(2), r(3))).take(1,2)
Can someone take a look at what is wrong here?
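I suggest you read a few articles on Spark SQL and the DataFrame API.
matchmonth is a distributed collection (an RDD, as the REPL output shows; it may help to think of it as a SQL table), so using it directly in an if statement makes no sense. Converting the data to a DataFrame with named columns is also very useful. For example: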
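// Map each Array[String] to a tuple first: toDF needs one tuple field per column name (requires import spark.implicits._).
val twitterdata = twitterLines.map(_.split("\t")).map(r => (r(0), r(1), r(2).toInt, r(3))).toDF("tokenType", "month", "count", "hash")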
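By applying map to twitterdata, you have transformed each row from four strings (token type, month, count, hashtag name) into the shape (Boolean, Boolean, String, String). Is that what you wanted? Note also that r(0) is the token type, not the month, so with the data format shown both Booleans are always false.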
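To make the shape question concrete, here is a minimal sketch in plain Scala (no Spark needed) that parses one line into named fields and compares the month field with x. The names `ParseLine`, `Tweet` and `parse` are hypothetical and only for illustration; the field order is assumed from the data format in the question.

```scala
import scala.util.Try

object ParseLine {
  // Field order assumed from the question: token type, month, count, hashtag name.
  final case class Tweet(tokenType: String, month: String, count: Int, hash: String)

  // Returns None for lines that do not have exactly four tab-separated
  // fields or whose count is not an integer.
  def parse(line: String): Option[Tweet] = line.split("\t") match {
    case Array(t, m, c, h) => Try(c.toInt).toOption.map(n => Tweet(t, m, n, h))
    case _                 => None
  }

  def main(args: Array[String]): Unit = {
    val x = "200910"
    val tweet = parse("hashtag\t200910\t2\tbabylove")
    println(tweet)                        // prints Some(Tweet(hashtag,200910,2,babylove))
    println(tweet.exists(_.month == x))   // the month is the second field, not r(0)
  }
}
```

In the Spark job, something like twitterLines.flatMap(l => ParseLine.parse(l)) would presumably give an RDD of Tweet values with readable field names instead of positional indexes.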
if(matchmonth.(r => (r(0))) < matchmonth.(r => (r(1)))
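is not valid syntax: a method name (such as map) is missing between the dot and the opening parenthesis, which is why the compiler reports "identifier expected but '(' found".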
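The later "(Boolean, Boolean, String, String) does not take parameters" errors have a related cause: r(2) is array indexing, but after the map each element is a tuple, and tuple fields are read with the 1-based accessors _1, _2, ... or by pattern matching. Also, take on an RDD accepts a single Int, so take(1,2) will not compile either. A small sketch in plain Scala, with hypothetical names (`TupleAccess`, `countAndTag`) chosen only for illustration:

```scala
object TupleAccess {
  // Mirrors the element type of matchmonth from the question.
  type Row = (Boolean, Boolean, String, String)

  // Positional accessors are 1-based: _3 is the third field, _4 the fourth.
  def countAndTag(row: Row): (String, String) = (row._3, row._4)

  // Pattern matching gives the fields readable names instead.
  def countAndTagByPattern(row: Row): (String, String) = row match {
    case (_, _, count, tag) => (count, tag)
  }

  def main(args: Array[String]): Unit = {
    val row: Row = (true, false, "2", "babylove")
    println(countAndTag(row))           // prints (2,babylove)
    println(countAndTagByPattern(row))  // prints (2,babylove)
  }
}
```

So inside the question's map calls, r => (r._3, r._4) would be the tuple equivalent of r => (r(2), r(3)).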
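First, select the rows that match your two months. The task says to ignore the months in between, so match x and y exactly rather than filtering on a range. This returns another DataFrame. From there, I will leave you to read up on Spark SQL; groupBy and sum may help you.
import spark.implicits._
val filteredDf = twitterdata.where($"month".isin(x, y))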
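For reference, one possible way to finish the task with the DataFrame API is sketched below. It is not the only approach: it joins the month-x rows with the month-y rows instead of using groupBy and sum. The column names ("month", "count", "hash") are assumed to match the toDF call above, `HashtagIncrease` and `largestIncrease` are hypothetical names, and the mycoolwife rows are not in the question's sample; they are inferred from its expected output.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object HashtagIncrease {
  // Returns (hashtagName, countX, countY) for the hashtag with the largest
  // increase from month x to month y, or None if no hashtag has rows in both.
  def largestIncrease(tweets: DataFrame, x: String, y: String): Option[(String, Int, Int)] = {
    // Hashtag and month are unique together, so each side has one row per hashtag.
    val inX = tweets.filter(col("month") === x).select(col("hash"), col("count").as("countX"))
    val inY = tweets.filter(col("month") === y).select(col("hash"), col("count").as("countY"))

    // The inner join drops hashtags that are missing from either month.
    inX.join(inY, "hash")
      .orderBy((col("countY") - col("countX")).desc)
      .head(1)
      .headOption
      .map(r => (r.getString(0), r.getInt(1), r.getInt(2)))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("hashtag-increase").master("local[*]").getOrCreate()
    import spark.implicits._

    // Small in-memory stand-in for the TSV file (illustrative rows only).
    val tweets = Seq(
      ("hashtag", "200910", 2, "babylove"),
      ("hashtag", "200911", 2, "babylove"),
      ("hashtag", "200912", 90, "babylove"),
      ("hashtag", "200910", 1, "mycoolwife"),
      ("hashtag", "200912", 500, "mycoolwife")
    ).toDF("tokenType", "month", "count", "hash")

    largestIncrease(tweets, "200910", "200912").foreach { case (name, cx, cy) =>
      println(s"hashtagName: $name, countX: $cx, countY: $cy")
    }
    spark.stop()
  }
}
```

With the rows above, babylove increases by 88 and mycoolwife by 499, so the sketch should report mycoolwife with countX 1 and countY 500. In spark-shell you would only need the largestIncrease function, passing it the twitterdata DataFrame and the x and y strings read from the keyboard.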