How do I use input strings to do a count?

Question (votes: -1, answers: 1)

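Hi, I am trying to use input strings to compute the maximum value in the following problem. Problem description: given two months x and y, where y > x, find the hashtag name whose number of tweets increased the most from month x to month y. We have already written code in your code template that reads the values of x and y from the keyboard. Ignore the tweets in the months between x and y, so just compare the number of tweets in month x with the number in month y. Report the hashtag name and the number of tweets in months x and y. Ignore any hashtag name that has no tweets in either month x or month y. You can assume that the combination of hashtag and month is unique. Print the result to the terminal output using println. For the small example dataset above, the output should be the following: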

Input       x = 200910, y = 200912

Output  hashtagName: mycoolwife, countX: 1, countY: 500
Data format:

Token type  Month   Count  Hash tag name
hashtag     200910  2      Babylove
hashtag     200911  2      babylove
hashtag     200912  90     babylove

My attempt:

// Load the input data and split each line into an array of strings
val twitterLines = sc.textFile("hdfs:///user/ashhall1616/bdc_data/twitter-small.tsv")
val twitterdata = twitterLines.map(_.split("\t"))

// Each month is a string formatted as YYYYMM
val x = scala.io.StdIn.readLine("x month: ")
val y = scala.io.StdIn.readLine("y month: ")

val matchmonth= twitterdata.map(r => (r(0)== x ,r(0)==y, r(2), r(3))).sortBy(_._3, false)
if(matchmonth.(r => (r(0))) < matchmonth.(r => (r(1)))
   {
       val ht1 = matchmonth.map(r => (r(2), r(3))).take(1)
       val ht2 = matchmonth.map(r => (r(2), r(3))).take(1,2)
           println("[" + ht1 + "," + ht2 + "]")
   }

The errors I get:

 val matchmonth= twitterdata.map(r => (r(0)== x ,r(0)==y, r(2), r(3))).sortBy(_._3, false)
matchmonth: org.apache.spark.rdd.RDD[(Boolean, Boolean, String, String)] = MapPartitionsRDD[20] at sortBy at <console>:32
scala> if(matchmonth.(r => (r(0))) < matchmonth.(r => (r(1)))
<console>:1: error: identifier expected but '(' found.
if(matchmonth.(r => (r(0))) < matchmonth.(r => (r(1)))
              ^
<console>:1: error: identifier expected but '(' found.
if(matchmonth.(r => (r(0))) < matchmonth.(r => (r(1)))
                                         ^
scala>    {
     |        val ht1 = matchmonth.map(r => (r(2), r(3))).take(1)
     |        val ht2 = matchmonth.map(r => (r(2), r(3))).take(1,2)
     |            println("[" + ht1 + "," + ht2 + "]")
     |    }
<console>:36: error: (Boolean, Boolean, String, String) does not take parameters
              val ht1 = matchmonth.map(r => (r(2), r(3))).take(1)
                                              ^
<console>:36: error: (Boolean, Boolean, String, String) does not take parameters
              val ht1 = matchmonth.map(r => (r(2), r(3))).take(1)
                                                    ^
<console>:37: error: (Boolean, Boolean, String, String) does not take parameters
              val ht2 = matchmonth.map(r => (r(2), r(3))).take(1,2)
                                              ^
<console>:37: error: (Boolean, Boolean, String, String) does not take parameters
              val ht2 = matchmonth.map(r => (r(2), r(3))).take(1,2)

Can someone take a look at what is wrong here?

scala apache-spark rdd
1 Answer
0 votes

I suggest reading a few articles on Spark SQL and the DataFrame API.

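matchmonth is an RDD, as the REPL output in the question shows, not a DataFrame (either way, it may help to think of it as a SQL table). It is a whole distributed dataset rather than a single value, so using it as the condition of an if statement makes no sense. Converting the data to a DataFrame with named columns is also very useful. For example (toDF needs import spark.implicits._, which appears further down):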

val twitterdata = twitterLines.map(_.split("\t"))
  .map(r => (r(0), r(1), r(2).toInt, r(3))).toDF("tokenType", "month", "count", "hash") // tuple fields become columns

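By applying map to twitterdata, you turned each row from four strings (token type, month, count, hashtag) into a tuple of shape (Boolean, Boolean, String, String). Is that what you wanted? Note also that r(0) is the token type; the month is r(1).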
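The "does not take parameters" errors in the question follow from this change of shape: after the map each row is a tuple, and tuple fields are read with _1 to _4 or with a pattern match, not with r(0). A minimal sketch on a local Seq rather than an RDD; the sample rows and the names rows, flagged, viaAccessors and viaPattern are made up for illustration:

```scala
// Illustrative rows as produced by split("\t"): tokenType, month, count, hashtag.
val rows: Seq[Array[String]] = Seq(
  Array("hashtag", "200910", "2", "babylove"),
  Array("hashtag", "200912", "90", "babylove")
)
val x = "200910"
val y = "200912"

// Same shape as matchmonth in the question, but comparing the month at index 1.
val flagged: Seq[(Boolean, Boolean, String, String)] =
  rows.map(r => (r(1) == x, r(1) == y, r(2), r(3)))

// Tuple fields are read with _1 .. _4; r(2) does not compile on a tuple.
val viaAccessors = flagged.map(r => (r._3, r._4))

// A pattern match gives the fields names instead.
val viaPattern = flagged.map { case (_, _, count, tag) => (count, tag) }

println(viaAccessors) // List((2,babylove), (90,babylove))
```

The same accessors work inside RDD.map, since the function passed to map receives one tuple at a time.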

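if(matchmonth.(r => (r(0))) < matchmonth.(r => (r(1))) is invalid syntax: a method name such as map has to follow the dot, which is why the compiler reports "identifier expected".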

First, select the rows that match your month range. This returns another DataFrame. From there, I will leave it to you to read up on Spark SQL; groupBy and sum may help you.

import spark.implicits._
import org.apache.spark.sql.functions.lit

val filteredDf = twitterdata.where($"month" >= lit(x) && $"month" <= lit(y))
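Note that the range filter above also keeps the months between x and y, while the problem statement asks to compare only month x with month y. One possible way to finish the task, offered as a sketch rather than the intended solution, is to filter each month separately and join the two results on the hashtag. The in-memory sample rows below are hypothetical (the mycoolwife and newtag rows are invented to match the expected output), and the names tweets, inX, inY and best are illustrative. Because each hashtag and month pair is unique, no groupBy or sum is needed here.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, desc}

// In spark-shell a session already exists and getOrCreate simply reuses it.
val spark = SparkSession.builder().appName("hashtag-increase").master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical sample data standing in for the TSV file.
val tweets = Seq(
  ("hashtag", "200910", 2, "babylove"),
  ("hashtag", "200911", 2, "babylove"),
  ("hashtag", "200912", 90, "babylove"),
  ("hashtag", "200910", 1, "mycoolwife"),
  ("hashtag", "200912", 500, "mycoolwife"),
  ("hashtag", "200912", 7, "newtag") // no tweets in month x, so it must be ignored
).toDF("tokenType", "month", "count", "hash")

val x = "200910"
val y = "200912"

// One DataFrame per month, keeping only the hashtag and its count.
val inX = tweets.where(col("month") === x).select(col("hash"), col("count").as("countX"))
val inY = tweets.where(col("month") === y).select(col("hash"), col("count").as("countY"))

// The inner join drops hashtags that are missing from either month.
val best = inX.join(inY, "hash")
  .withColumn("increase", col("countY") - col("countX"))
  .orderBy(desc("increase"))
  .limit(1)

best.collect().foreach { row =>
  val name = row.getAs[String]("hash")
  val countX = row.getAs[Int]("countX")
  val countY = row.getAs[Int]("countY")
  println(s"hashtagName: $name, countX: $countX, countY: $countY")
}
```

With the sample rows above this prints hashtagName: mycoolwife, countX: 1, countY: 500. For the real data, tweets would be replaced by the DataFrame built from the TSV file, and x and y by the values read from the keyboard.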