I'm a complete beginner! How can I convert a .txt file (movie script) into a table (characters and lines) in R or Python?

Question

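I'm a complete beginner, and for a university project I need to analyze movie scripts. I want to create a table that matches each character with their lines. My files are all in .txt format, and I want to convert them to CSV files. I have many scripts to process, so I'm looking for code that can easily be adapted to different files.

This is what I have: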

                            THREEPIO
      Did you hear that?  They've shut 
      down the main reactor.  We'll be 
      destroyed for sure.  This is 
      madness!


                THREEPIO
      We're doomed!


                THREEPIO
      There'll be no escape for the 
      Princess this time.

                THREEPIO
      What's that?

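This is what I need:

"character" "dialogue"

"1" "THREEPIO" "Did you hear that? They've shut down the main reactor. We'll be destroyed for sure. This is madness!"

"2" "THREEPIO" "We're doomed!"

"3" "THREEPIO" "There'll be no escape for the Princess this time."

"4" "THREEPIO" "What's that?"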

This is what I tried:

# the first 70 lines don't contain dialogues
# so we can start reading at line 70 (for instance)
i = 70

# while loop to extract character and dialogues
# (probably there's a better way to parse the file instead of
# using my crazy nested if-then-elses, but this works for me)
while (i <= nlines)
{
  # if empty line
  if (sw[i] == "") i = i + 1  # next line
  # if text line
  if (sw[i] != "")
  {
    # if uninteresting stuff
    if (substr(sw[i], 1, 1) != " ") {
      i = i + 1   # next line
    } else {
      if (nchar(sw[i]) < 10) {
        i = i + 1  # next line
      } else {
        if (substr(sw[i], 1, 5) != " " && substr(sw[i], 6, 6) != " ") {
          i = i + 1  # next line
        } else {
          # if character name
          if (substr(sw[i], 1, 30) == b30) 
          {
            if (substr(sw[i], 31, 31) != " ")
            {
              tmp_name = substr(sw[i], 31, nchar(sw[i], "bytes"))
              cat("\n", file="EpisodeVI_dialogues.txt", append=TRUE)
              cat(tmp_name, "", file="EpisodeVI_dialogues.txt", sep="\t", append=TRUE)
              i = i + 1        
            } else {
              i = i + 1
            }
          } else {
            # if dialogue
            if (substr(sw[i], 1, 15) == b15)
            {
              if (substr(sw[i], 16, 16) != " ")
              {
                tmp_diag = substr(sw[i], 16, nchar(sw[i], "bytes"))
                cat("", tmp_diag, file="EpisodeVI_dialogues.txt", append=TRUE)
                i = i + 1
              } else {
                i = i + 1
              }
            }
          }
        }
      }
    }    
  }
}

Any help would be much appreciated! Thank you!!

Answer 1

You can do something like this:

text = """
 THREEPIO
      Did you hear that?  They've shut 
      down the main reactor.  We'll be 
      destroyed for sure.  This is 
      madness!


                THREEPIO
      We're doomed!


                THREEPIO
      There'll be no escape for the 
      Princess this time.

                THREEPIO
      What's that?
"""

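clean = text.split()

n = 1
tmp = []
results = []
for element in clean:
    # An all-uppercase token is treated as a character name and starts a new row.
    # Caveat: an all-uppercase word inside dialogue (e.g. "I") would match too.
    if element.isupper():
        if tmp:
            results.append(tmp)
        tmp = [n, element]
        n += 1
        continue
    # Any other token is appended to the current row's dialogue
    try:
        tmp[2] = " ".join((tmp[2], element))
    except IndexError:
        tmp.append(element)

# Store the last row, which the loop never appends on its own
if tmp:
    results.append(tmp)

print(results)

Result:

[[1, 'THREEPIO', "Did you hear that? They've shut down the main reactor. We'll be destroyed for sure. This is madness!"], [2, 'THREEPIO', "We're doomed!"], [3, 'THREEPIO', "There'll be no escape for the Princess this time."], [4, 'THREEPIO', "What's that?"]]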
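
Since the question asks for a CSV file, a list shaped like `results` could be written out with Python's standard `csv` module. The following is only a minimal sketch: the function name `rows_to_csv` and the header names are my own choices, not part of the answer above, and it assumes each row looks like `[n, character, dialogue]`.

```python
import csv
import io


def rows_to_csv(rows, header=("id", "character", "dialogue")):
    """Return CSV text for rows shaped like [n, character, dialogue].

    The function name and header are illustrative assumptions.
    """
    buffer = io.StringIO()
    # QUOTE_ALL wraps every field in double quotes, similar to the
    # quoted table shown in the question.
    writer = csv.writer(buffer, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


# Two rows taken from the sample script in the question
sample_rows = [
    [1, "THREEPIO", "We're doomed!"],
    [2, "THREEPIO", "What's that?"],
]
csv_text = rows_to_csv(sample_rows)
print(csv_text)
```

To write straight to a file instead, the same `csv.writer` calls can be pointed at a file opened with `open(path, "w", newline="", encoding="utf-8")`.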

Answer 2

If you know the list of character names (and aren't worried about typos), an approach like this will work:

script = """
 THREEPIO
      Did you hear that?  They've shut 
      down the main reactor.  We'll be 
      destroyed for sure.  This is 
      madness!


                THREEPIO
      We're doomed!


                THREEPIO
      There'll be no escape for the 
      Princess this time.

                THREEPIO
      What's that?
"""

characters = ['THREEPIO', 'ANAKIN']
lines = [x for x in list(map(str.strip, script.split('\n'))) if x]
results = []
for (i, item) in enumerate(lines):
    if item in characters:
        dialogue = []
        for index in range(i + 1, len(lines)):
            if lines[index] in characters:
                break
            dialogue.append(lines[index])
        results.append([item, ' '.join(dialogue)])

print(list(enumerate(results, start=1)))

This prints:

[(1, ['THREEPIO', "Did you hear that?  They've shut down the main reactor.  We'll be destroyed for sure.  This is madness!"]), (2, ['THREEPIO', "We're doomed!"]), (3, ['THREEPIO', "There'll be no escape for the Princess this time."]), (4, ['THREEPIO', "What's that?"])]
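
If the character names are not known in advance, one possible variation is to rely on the layout instead. The sketch below is a heuristic, not a general screenplay parser: it assumes a character cue is an all-uppercase line, the dialogue follows directly beneath it, and a blank line ends the speech. The names `parse_script` and `convert_file` are hypothetical helpers introduced here for illustration.

```python
import csv
from pathlib import Path


def parse_script(text):
    """Return (character, dialogue) pairs from screenplay-style text.

    Assumed layout: an all-uppercase cue line, dialogue lines right
    below it, and a blank line closing the speech.
    """
    rows = []
    speaker = None
    spoken = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            # A blank line closes the current speech, if there is one.
            if speaker and spoken:
                rows.append((speaker, " ".join(spoken)))
            speaker, spoken = None, []
        elif speaker is None and line.isupper():
            # First uppercase line after a blank line: treat it as a cue.
            speaker = line
        elif speaker is not None:
            # Collapse runs of spaces inside the dialogue line.
            spoken.append(" ".join(line.split()))
    # Flush the last speech when the text does not end with a blank line.
    if speaker and spoken:
        rows.append((speaker, " ".join(spoken)))
    return rows


def convert_file(txt_path, csv_path):
    """Read one script file and write a character/dialogue CSV."""
    text = Path(txt_path).read_text(encoding="utf-8")
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["character", "dialogue"])
        writer.writerows(parse_script(text))


sample = """
                THREEPIO
      We're doomed!


                THREEPIO
      There'll be no escape for the
      Princess this time.

                THREEPIO
      What's that?
"""
rows = parse_script(sample)
print(rows)
```

An uppercase line that is followed by a blank line (a scene heading, for example) produces no row, because no dialogue was collected for it. To process many scripts, `convert_file` could be called in a loop over something like `Path("scripts").glob("*.txt")`, where the folder name is a placeholder.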
