Spark SQL: how to cache a SQL query result without using rdd.cache()

Question

Is there any way to cache a SQL query result without using rdd.cache()? For example:

output = sqlContext.sql("SELECT * From people")

We can cache the result with output.cache(), but then we cannot run SQL queries against it.

So I want to ask: is there anything like sqlContext.cacheTable() to cache the result?

Answer 1

You should use sqlContext.cacheTable("table_name") to cache it, or alternatively the CACHE TABLE table_name SQL statement.

Here is an example. I have this file on HDFS:

1|Alex|[email protected]
2|Paul|[email protected]
3|John|[email protected]

Then the code in PySpark:

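from pyspark.sql import Row

# Read the pipe-delimited file and turn each line into a Row
people = sc.textFile('hdfs://sparkdemo:8020/people.txt')
people_t = people.map(lambda x: x.split('|')).map(lambda x: Row(id=x[0], name=x[1], email=x[2]))
# inferSchema is the Spark 1.x call; from Spark 1.3 on, sqlContext.createDataFrame(people_t) replaces it
tbl = sqlContext.inferSchema(people_t)
tbl.registerTempTable('people')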

Now we have a table and can query it:

sqlContext.sql('select * from people').collect()

To persist it, we have 3 options:

# 1st - using SQL
sqlContext.sql('CACHE TABLE people').collect()
# 2nd - using SQLContext
sqlContext.cacheTable('people')
sqlContext.sql('select count(*) from people').collect()     
# 3rd - using Spark cache underlying RDD
tbl.cache()
sqlContext.sql('select count(*) from people').collect()

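The 1st and 2nd options are preferred, because they cache the data in an optimized in-memory columnar format, while the 3rd caches it row by row, just like any other RDD.

Going back to your question, here is one possible solution: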

output = sqlContext.sql("SELECT * From people")
output.registerTempTable('people2')
sqlContext.cacheTable('people2')
sqlContext.sql("SELECT count(*) From people2").collect()

Answer 2

The following is the closest equivalent to calling .cache on an RDD, and is useful in Zeppelin or similar SQL-heavy environments:

CACHE TABLE CACHED_TABLE AS
SELECT $interesting_query

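You then get cached reads both for subsequent uses of interesting_query and for all queries on CACHED_TABLE.

This answer builds on the accepted answer, but the AS clause is what makes the call useful in more restricted SQL-only environments, where you cannot call .collect() or perform RDD/DataFrame operations in any way.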
