Create a JSON object by reading all .txt files from a directory


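I am using scikit-learn's 20 Newsgroups dataset. There are 20 .txt files, some of which have the structure shown below, with the newsgroup name, docID, From, and Subject. I want to read all 20 files from a directory and convert them to a JSON object or CSV, so that I can feed them into Elasticsearch for indexing.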

Each new article begins with "Newsgroup", document_id, and so on. Below is an example.

Newsgroup: sci.space
document_id: 59497
From: [email protected] (Eric H. Taylor)
Subject: Re: Gravity waves, was: Predicting gravity wave quantization & Cosmic Noise

In article <[email protected]> [email protected] (Tom Van Flandern) writes:
>[email protected] (Cameron Randale Bass) writes:
>> [email protected] (Bruce Scott) writes:
>>> "Existence" is undefined unless it is synonymous with "observable" in
>>> physics.
>> [crb] Dong ....  Dong ....  Dong ....  Do I hear the death-knell of
>> string theory?
>
>     I agree.  You can add "dark matter" and quarks and a lot of other
>unobservable, purely theoretical constructs in physics to that list,
>including the omni-present "black holes."
>
>     Will Bruce argue that their existence can be inferred from theory
>alone?  Then what about my original criticism, when I said "Curvature
>can only exist relative to something non-curved"?  Bruce replied:
>"'Existence' is undefined unless it is synonymous with 'observable' in
>physics.  We cannot observe more than the four dimensions we know about."
>At the moment I don't see a way to defend that statement and the
>existence of these unobservable phenomena simultaneously.  -|Tom|-

"I hold that space cannot be curved, for the simple reason that it can have
no properties."
"Of properties we can only speak when dealing with matter filling the
space. To say that in the presence of large bodies space becomes curved,
is equivalent to stating that something can act upon nothing. I,
for one, refuse to subscribe to such a view." - Nikola Tesla

----
 ET  "Tesla was 100 years ahead of his time. Perhaps now his time comes."
----

Newsgroup: comp.os.ms-windows.misc
document_id: 10002
Subject: Re: Win31 & doublespace
From: [email protected]

In article <[email protected]>, [email protected] ( Chris Almy) writes:
> 
>   Doublespace, although I do not trust it for my hard disks, sounds
>   great for floppies. The thouoght of having to mount the disk
>   is anoying but something I can deal with. The problem arises 
>   when under windows. Is there a way to mount and unmount while
>   under windows or is this part of the upgrades soon to be 
>   available from other vendors?

Each .txt file contains nearly 1,000 documents, each with Newsgroup, document_id, From, and Subject. So the second article again begins with "Newsgroup: ...".

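I am doing the following to read the files from the directory, but I am not sure how to capture the four fields above and convert them to JSON / CSV.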

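import glob
import os

# path: the directory that holds the .txt files (defined elsewhere)
files = glob.glob(os.path.join(path, '*.txt'))
# iterate over the list, getting each file
for fle in files:
    # open the file and then call .read() to get the text
    with open(fle) as f:
        text = f.read()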


python-3.x pandas elasticsearch text-files data-analysis
1 Answer

The following approach works even when each file contains multiple articles:
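A minimal sketch of one way to do this, assuming every article starts with a line beginning `Newsgroup: ` and that a blank line separates the header lines from the body, as in the sample in the question. The names `parse_articles`, `load_directory`, and `save_json`, the `body` key, the `latin-1` encoding, and the placeholder addresses in `sample` are all assumptions made for illustration. A body line that itself starts with `Newsgroup: ` would be treated as a new article, so check the output against your data.

```python
import glob
import json
import os
import re

# Header fields named in the question.
HEADER_FIELDS = ("Newsgroup", "document_id", "From", "Subject")
# Zero-width match at the start of every line that begins an article
# (splitting on an empty match needs Python 3.7 or later).
ARTICLE_START = re.compile(r"^(?=Newsgroup: )", re.MULTILINE)


def parse_articles(text):
    """Split the text of one file into articles; return a list of dicts."""
    records = []
    for chunk in ARTICLE_START.split(text):
        if not chunk.strip():
            continue  # skip the empty piece before the first article
        # Header lines come first; the first blank line starts the body.
        header, _, body = chunk.partition("\n\n")
        record = {}
        for line in header.splitlines():
            key, sep, value = line.partition(": ")
            if sep and key in HEADER_FIELDS:
                record[key] = value.strip()
        record["body"] = body.strip()
        records.append(record)
    return records


def load_directory(path):
    """Parse every .txt file in `path` into one list of records."""
    records = []
    for filename in sorted(glob.glob(os.path.join(path, "*.txt"))):
        # Assumption: latin-1 decodes any byte sequence without raising.
        with open(filename, encoding="latin-1") as f:
            records.extend(parse_articles(f.read()))
    return records


def save_json(records, out_path):
    """Write the records to `out_path` as a single JSON array."""
    with open(out_path, "w", encoding="utf-8") as out:
        json.dump(records, out, indent=2)


# Small made-up sample in the same layout as the question's example.
sample = (
    "Newsgroup: sci.space\n"
    "document_id: 59497\n"
    "From: someone@example.com (Eric H. Taylor)\n"
    "Subject: Re: Gravity waves\n"
    "\n"
    "First body line.\n"
    "\n"
    "Second paragraph.\n"
    "\n"
    "Newsgroup: comp.os.ms-windows.misc\n"
    "document_id: 10002\n"
    "Subject: Re: Win31 & doublespace\n"
    "From: someone@example.com\n"
    "\n"
    "Doublespace sounds great for floppies.\n"
)

records = parse_articles(sample)
print(json.dumps(records, indent=2))
```

For the real data you would call `load_directory(path)` with your own directory and then `save_json(...)` on the result. Each dict in the list is already shaped like a single document, so the same list could also be passed to `pandas.DataFrame(records).to_csv(...)` for CSV, or indexed into Elasticsearch one record at a time.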
