Using str.findall to get the indices in a pandas Series

Question

I need to find the rows that contain a specific string, and the dataset has nearly 1 million rows. Here is a simple example.

import pandas as pd

text = ['abc [email protected] 123 any@www foo @ bar 78@ppp @5555 aa@111www', 'anontalk.com']
text = pd.Series(text)
srhc = text.str.findall('www')
srhc

The output is:

0    [www, www]
1    []        
dtype: object

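Is it possible to efficiently (i.e., programmatically) get only the list of indices of the rows whose text contains www? Thanks for the help.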

Answer 1

To search for a specific substring, use .str.contains():

text = ['abc [email protected]', 'helowww', '123 any@www', 'foo www', '@5555 aa@111www', 'anontalk.com']

text = pd.Series(text)

text[text.str.contains('www')]

Output:

1            helowww
2        123 any@www
3            foo www
4    @5555 aa@111www
dtype: object

To get the indices of these rows:

text[text.str.contains('www')].index.to_list()

# or, without building the filtered Series first
text.index[text.str.contains('www')].to_list()

Output:

[1, 2, 3, 4]
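
One detail worth keeping in mind: str.contains interprets its pattern as a regular expression by default, so a pattern such as 'www.' would match 'www' followed by any character. The sketch below is only an illustration on made-up sample data (the strings and the None entry are assumptions, not part of the question). It passes regex=False for a literal match and na=False so that missing values count as non-matches.

```python
import pandas as pd

# Hypothetical sample data; the None entry stands in for a missing value.
text = pd.Series(['helowww', 'wwwXcom', None, 'www.example.com'])

# regex=False treats 'www.' as a literal substring rather than a regex
# (as a regex, '.' would match any character, so 'wwwXcom' would match too).
# na=False counts missing values as non-matches, keeping the mask boolean.
mask = text.str.contains('www.', regex=False, na=False)

# Index labels of the rows that contain the literal substring.
matching_labels = text.index[mask].tolist()
print(matching_labels)  # [3]
```

For a plain substring such as 'www' the result is the same either way, but regex=False skips the regex engine.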

Answer 2

We can do this with str.contains and nonzero:

srhc=text.str.contains('www').to_numpy().nonzero()[0]
srhc
Out[66]: array([0], dtype=int64)
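
Note that nonzero returns integer positions, not index labels. With the default 0..n-1 index the two coincide, but they differ as soon as the Series has any other index. A minimal sketch, assuming a hypothetical Series with string labels (np.flatnonzero is used here as a shorthand for .nonzero()[0]):

```python
import numpy as np
import pandas as pd

# Hypothetical Series with string labels instead of the default 0..n-1 index.
text = pd.Series(['any@www', 'anontalk.com', 'foo www'], index=['a', 'b', 'c'])

# Boolean NumPy array marking the rows that contain 'www'.
mask = text.str.contains('www', regex=False).to_numpy()

positions = np.flatnonzero(mask)      # integer positions, usable with .iloc
labels = text.index[mask].tolist()    # index labels, usable with .loc

print(positions.tolist())  # [0, 2]
print(labels)              # ['a', 'c']
```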

Answer 3

You can filter text.index with str.contains():

srhc = text.index[text.str.contains('www')]
print(srhc)

Prints:

Int64Index([0], dtype='int64')

Answer 4

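I think a list comprehension is a more efficient way to get the indices, especially since there is nothing unique or special about this Series' index: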

text=['abc [email protected] 123 any@www foo @ bar 78@ppp @5555 aa@111www','anontalk.com']

#I use this to stay true to your question
text=pd.Series(text)

#this gets you the index/indices
#which is what you want, based on your question
[index for index, entry in enumerate(text) if 'www' in entry]

[0]
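
Whether the list comprehension actually beats str.contains depends on the data and the pandas version, so it is worth measuring on your own rows. The following is a rough timing sketch on made-up data (the 100,000-row Series and the function names are assumptions for illustration only); no timings are quoted because they vary by machine.

```python
import timeit

import pandas as pd

# Hypothetical data: 100,000 short strings, half of which contain 'www'.
text = pd.Series(['any@www', 'anontalk.com'] * 50_000)

def with_contains():
    # Vectorized literal match, then the positions of the True entries.
    return text.str.contains('www', regex=False).to_numpy().nonzero()[0].tolist()

def with_comprehension():
    # Plain Python loop over the values; enumerate yields positions.
    return [i for i, entry in enumerate(text) if 'www' in entry]

# Both approaches should agree before their timings are compared.
assert with_contains() == with_comprehension()

print('str.contains :', timeit.timeit(with_contains, number=5))
print('comprehension:', timeit.timeit(with_comprehension, number=5))
```

Both functions return positions, which equal the index labels only because the Series uses the default index.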
