I know that clojure.string has a split function, which returns a sequence of the parts of the string that do not include the given pattern.
(require '[clojure.string :as str-utils])
(str-utils/split "Yes, hello, this is dog yes hello it is me" #"hello")
;; -> ["Yes, " ", this is dog yes " " it is me"]
However, I am trying to find a function that leaves the token as an element in the returned vector, so it would work like this:
(split-around "Yes, hello, this is dog yes hello it is me" #"hello")
;; -> ["Yes, " "hello" ", this is dog yes " "hello" " it is me"]
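Is there a function like this in any of the included libraries? In any external library? I have been trying to write it myself but have not been able to figure it out.
You can also do this with regex lookahead and lookbehind: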
user> (clojure.string/split "Yes, hello, this is dog yes hello it is me" #"(?<=hello)|(?=hello)")
;;=> ["Yes, " "hello" ", this is dog yes " "hello" " it is me"]
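You can read it as "split on the zero-length string at any position that is immediately preceded or followed by the word 'hello'".
Note that it also drops the dangling empty strings between adjacent matches, as well as the leading and trailing ones: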
user> (clojure.string/split "helloYes, hello, this is dog yes hellohello it is mehello" #"(?<=hello)|(?=hello)")
;;=> ["hello"
;; "Yes, "
;; "hello"
;; ", this is dog yes "
;; "hello"
;; "hello"
;; " it is me"
;; "hello"]
You can wrap it in a function like this, for example:
(defn split-around [source word]
  (let [word (java.util.regex.Pattern/quote word)]
    (->> (format "(?<=%s)|(?=%s)" word word)
         re-pattern
         (clojure.string/split source))))
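To see what the wrapper does in practice, here is a small sketch that repeats the definition so it runs on its own. The inputs are made up for illustration. The second call is my own observation rather than something the answer states: because the split happens at every position where the token starts or ends, a token that can overlap itself is not treated as a single unit.

```clojure
(require '[clojure.string :as str])

(defn split-around [source word]
  (let [word (java.util.regex.Pattern/quote word)]
    (->> (format "(?<=%s)|(?=%s)" word word)
         re-pattern
         (str/split source))))

;; "+" is a regex metacharacter; Pattern/quote makes it match literally.
(split-around "1+2+3" "+")
;; => ["1" "+" "2" "+" "3"]

;; "aa" occurs at offsets 0 and 1 of "aaa", so the string is cut at every
;; one of those boundaries instead of coming back as ["aa" "a"].
(split-around "aaa" "aa")
;; => ["a" "a" "a"]
```

If self-overlapping tokens matter for your data, a scan-based approach that consumes each match once is probably a safer choice.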
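(require '[clojure.string :as str])
;; Wrap each match in a marker character, then split on the marker.
;; This assumes "~" never occurs in the input string.
(-> "Yes, hello, this is dog yes hello it is me"
    (str/replace #"hello" "~hello~")
    (str/split #"~"))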
An example using @Shlomi's solution:
(ns tst.demo.core
  (:use tupelo.core tupelo.test)
  (:require [clojure.string :as str]))
(dotest
  (let [input-str "Yes, hello, this is dog yes hello it is me"
        segments  (mapv str/trim
                    (str/split input-str #"hello"))
        result    (interpose "hello" segments)]
    (is= segments ["Yes," ", this is dog yes" "it is me"])
    (is= result ["Yes," "hello" ", this is dog yes" "hello" "it is me"])))
It may be best to write a custom loop for this use case. For example:
(ns tst.demo.core
  (:use tupelo.core tupelo.test)
  (:require
    [clojure.string :as str]))
(defn strseg
  "Will segment a string like '<a><tgt><b><tgt><c>' at each occurrence of `tgt`, producing
  an output vector like [ <a> <tgt> <b> <tgt> <c> ]."
  [tgt source]
  (let [tgt-len  (count tgt)
        segments (loop [result []
                        src    source]
                   (if (empty? src)
                     result
                     (let [i (str/index-of src tgt)]
                       (if (nil? i)
                         (let [result-next (into result [src])
                               src-next    nil]
                           (recur result-next src-next))
                         (let [pre-tgt     (subs src 0 i)
                               result-next (into result [pre-tgt tgt])
                               src-next    (subs src (+ tgt-len i))]
                           (recur result-next src-next))))))
        result   (vec
                   (remove (fn [s] (or (nil? s)
                                       (empty? s)))
                     segments))]
    result))
With unit tests:
(dotest
  (is= (strseg "hello" "Yes, hello, this is dog yes hello it is me")
    ["Yes, " "hello" ", this is dog yes " "hello" " it is me"])
  (is= (strseg "hello" "hello")
    ["hello"])
  (is= (strseg "hello" "") [])
  (is= (strseg "hello" nil) [])
  (is= (strseg "hello" "hellohello") ["hello" "hello"])
  (is= (strseg "hello" "abchellodefhelloxyz") ["abc" "hello" "def" "hello" "xyz"]))
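Here is another solution. It avoids the repeated-pattern and double-recognition problems of leetwinski's answer (see my comment), and it also computes the parts as lazily as possible.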
(defn partition-str [s sep]
  (->> s
       (re-seq
         (->> sep
              java.util.regex.Pattern/quote ; remove this to treat sep as a regex
              (format "((?s).*?)(?:(%s)|\\z)")
              re-pattern))
       (mapcat rest)
       (take-while some?)
       (remove empty?))) ; remove this to keep empty parts
However, this does not behave intuitively when the separator matches the empty string.
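For reference, a quick check of partition-str on the question's input and on adjacent separators. The definition is repeated only so the snippet runs standalone, and the second input is made up for illustration.

```clojure
(defn partition-str [s sep]
  (->> s
       (re-seq
         (->> sep
              java.util.regex.Pattern/quote
              (format "((?s).*?)(?:(%s)|\\z)")
              re-pattern))
       (mapcat rest)        ; keep only the two capture groups of each match
       (take-while some?)   ; the separator group is nil once \z has matched
       (remove empty?)))

(partition-str "Yes, hello, this is dog yes hello it is me" "hello")
;; => ("Yes, " "hello" ", this is dog yes " "hello" " it is me")

(partition-str "hellohello" "hello")
;; => ("hello" "hello")
```

Note that the result is a lazy sequence rather than a vector; wrap the call in vec if a vector is needed.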
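Another approach is to use both re-seq and split with the same pattern and interleave the resulting sequences, as shown in this related question. Unfortunately, this way every occurrence of the separator is recognized twice.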
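A minimal sketch of that interleaving idea, not taken from the linked question: the name split-interleaved is hypothetical, and the code assumes the pattern has no capture groups and never matches the empty string. It scans the input twice, once in split and once in re-seq, which is exactly the double recognition mentioned above.

```clojure
(require '[clojure.string :as str])

(defn split-interleaved
  "Hypothetical helper: returns the parts of s with the separators kept,
  by interleaving the results of split and re-seq."
  [s re]
  (let [parts (str/split s re -1)   ; limit -1 keeps trailing empty strings
        seps  (re-seq re s)]        ; one separator fewer than there are parts
    (->> (interleave parts (concat seps [""])) ; pad so the last part survives
         (remove empty?)
         vec)))

(split-interleaved "Yes, hello, this is dog yes hello it is me" #"hello")
;; => ["Yes, " "hello" ", this is dog yes " "hello" " it is me"]

(split-interleaved "hellohello" #"hello")
;; => ["hello" "hello"]
```

The negative limit matters here: without it, split drops trailing empty strings and the parts and separators can fall out of step.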
Perhaps a better approach is to build on the more primitive re-matcher and re-find.
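Finally, to answer the original question more directly: as far as I know, there is no such function in Clojure's standard library or in any external library. I also do not know of any simple, completely problem-free way to solve this, especially with a regex separator.
This is the best solution I can come up with for now. It works at a lower level, lazily, and with a regex separator.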
(defn re-partition [re s]
  (let [mr (re-matcher re s)]
    ((fn rec [i]
       (lazy-seq
         (if-let [m (re-find mr)]
           (list* (subs s i (.start mr)) m (rec (.end mr)))
           (list (subs s i)))))
     0)))
(def re-partition+ (comp (partial remove empty?) re-partition))
Note that we can then (re)define:
(def re-split (comp (partial take-nth 2) re-partition))
(def re-seq (comp (partial take-nth 2) rest re-partition))
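A short usage sketch may make the relationship clearer. The definition is repeated so the snippet runs on its own, with a type hint added (my addition) to avoid a reflection warning; the input var and the expected values in the comments are for illustration. Parts and separators alternate in the result, starting and ending with a part, which is why take-nth recovers either half.

```clojure
(defn re-partition [re s]
  (let [^java.util.regex.Matcher mr (re-matcher re s)]
    ((fn rec [i]
       (lazy-seq
         (if-let [m (re-find mr)]
           (list* (subs s i (.start mr)) m (rec (.end mr)))
           (list (subs s i)))))
     0)))

(def input "Yes, hello, this is dog yes hello it is me")

(re-partition #"hello" input)
;; => ("Yes, " "hello" ", this is dog yes " "hello" " it is me")

(take-nth 2 (re-partition #"hello" input))        ; the parts, like split
;; => ("Yes, " ", this is dog yes " " it is me")

(take-nth 2 (rest (re-partition #"hello" input))) ; the separators, like re-seq
;; => ("hello" "hello")

;; Empty parts are kept unless they are filtered out, as re-partition+ does.
(re-partition #"hello" "hellohello")
;; => ("" "hello" "" "hello" "")
```

If the pattern contains capture groups, re-find returns a vector instead of a string, so the separator elements would be vectors in that case.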