clojure - string-concat with group-by in maps of maps


Given input data from a JDBC source, e.g.:

  (def input-data
    [{:doc_id 1 :doc_seq 1  :doc_content "this is a very long "}
    {:doc_id 1 :doc_seq 2  :doc_content "sentence from a mainframe "}
    {:doc_id 1 :doc_seq 3  :doc_content "system that was built before i was "}
    {:doc_id 1 :doc_seq 4  :doc_content "born."}
    {:doc_id 2 :doc_seq 1  :doc_content "this is a another very long "}
    {:doc_id 2 :doc_seq 2  :doc_content "sentence from the same mainframe "}
    {:doc_id 3 :doc_seq 1  :doc_content "Ok here we are again. "}
    {:doc_id 3 :doc_seq 2  :doc_content "The mainframe only had 40 char per field so"}
    {:doc_id 3 :doc_seq 3  :doc_content "they broke it into multiple rows "}
    {:doc_id 3 :doc_seq 4  :doc_content "which seems to be common"}
    {:doc_id 3 :doc_seq 5  :doc_content " for the time. "}
    {:doc_id 3 :doc_seq 6  :doc_content "thanks for your help."}])

I want to group by doc id and string-concatenate the doc_content values, so my output looks like this:

  [{:doc_id 1 :doc_content "this is a very long sentence from a mainframe system that was built before i was born."}
   {:doc_id 2 :doc_content "this is a another very long sentence ... clip..."}
   {:doc_id 3 :doc_content "... clip..."}]

I was considering using group-by, but that outputs a map, and I need to output something lazy because the input data set could be very large. Maybe I could run some combination of group-by and reduce-kv to get what I'm looking for... or possibly something with frequencies, if I could force it to be lazy.
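For reference, group-by is eager: it realizes the entire input and returns a fully built map keyed by the grouping function, which is why it cannot stream a large result set:

```clojure
;; group-by consumes the whole collection before returning anything,
;; producing a map from group key to a vector of matching rows:
(group-by :doc_id [{:doc_id 1 :doc_content "a"}
                   {:doc_id 1 :doc_content "b"}])
;; => {1 [{:doc_id 1, :doc_content "a"} {:doc_id 1, :doc_content "b"}]}
```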

I can guarantee the input will be sorted; I'll put the ordering (via SQL) on doc_id, doc_seq, so the only thing this program is responsible for is the aggregate/string-concat part. I may have large input data over the whole sequence, but any particular doc_id within that sequence should only have a few dozen doc_seq rows.

Any hints appreciated,

clojure

1 Answer

4 votes

partition-by is lazy, so as long as each doc's run of rows fits in memory, this should work:

(defn collapse-docs [docs]
  ;; merge all maps for one doc_id: concatenate string values,
  ;; let the later row win for everything else
  (apply merge-with
         (fn [l r]
           (if (string? r)
             (str l r)
             r))
         docs))

(sequence ;; you may want to use eduction here, depending on use case
  (comp
    (partition-by :doc_id)
    (map collapse-docs))
  input-data)
=>
({:doc_id 1,
  :doc_seq 4,
  :doc_content "this is a very long sentence from a mainframe system that was built before i was born."}
  {:doc_id 2, :doc_seq 2, :doc_content "this is a another very long sentence from the same mainframe "}
  {:doc_id 3,
   :doc_seq 6,
   :doc_content "Ok here we are again. The mainframe only had 40 char per field sothey broke it into multiple rows which seems to be common for the time. thanks for your help."})
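The output above still carries the last :doc_seq of each group. If you want results in exactly the shape asked for in the question, one sketch (reusing the same merge function, with a hypothetical name collapse-docs' and a small inline sample) is to dissoc the key after merging:

```clojure
(defn collapse-docs' [docs]
  ;; same merge as above, then drop the now-meaningless :doc_seq
  (-> (apply merge-with
             (fn [l r] (if (string? r) (str l r) r))
             docs)
      (dissoc :doc_seq)))

(sequence
  (comp (partition-by :doc_id)
        (map collapse-docs'))
  [{:doc_id 1 :doc_seq 1 :doc_content "foo "}
   {:doc_id 1 :doc_seq 2 :doc_content "bar"}
   {:doc_id 2 :doc_seq 1 :doc_content "baz"}])
;; => ({:doc_id 1, :doc_content "foo bar"} {:doc_id 2, :doc_content "baz"})
```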