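I am a complete beginner, and for a university project I need to analyze movie scripts. I want to create a table in which I can match characters with their lines. My files are all in .txt format and I want to convert them to csv files. I have many scripts to process, so I would like to find code that can easily be adapted to different files.
This is what I have: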
THREEPIO
Did you hear that? They've shut
down the main reactor. We'll be
destroyed for sure. This is
madness!
THREEPIO
We're doomed!
THREEPIO
There'll be no escape for the
Princess this time.
THREEPIO
What's that?
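This is what I need:
    "character" "dialogue"
"1" "THREEPIO" "Did you hear that? They've shut down the main reactor. We'll be destroyed for sure. This is madness!"
"2" "THREEPIO" "We're doomed!"
"3" "THREEPIO" "There'll be no escape for the Princess this time."
"4" "THREEPIO" "What's that?"
This is what I tried: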
# the first 70 lines don't contain dialogues
# so we can start reading at line 70 (for instance)
i = 70
# while loop to extract character and dialogues
# (probably there's a better way to parse the file instead of
# using my crazy nested if-then-elses, but this works for me)
while (i <= nlines)
{
  # if empty line
  if (sw[i] == "") i = i + 1  # next line
  # if text line
  if (sw[i] != "")
  {
    # if uninteresting stuff
    if (substr(sw[i], 1, 1) != " ") {
      i = i + 1  # next line
    } else {
      if (nchar(sw[i]) < 10) {
        i = i + 1  # next line
      } else {
        if (substr(sw[i], 1, 5) != "     " && substr(sw[i], 6, 6) != " ") {
          i = i + 1  # next line
        } else {
          # if character name
          if (substr(sw[i], 1, 30) == b30)
          {
            if (substr(sw[i], 31, 31) != " ")
            {
              tmp_name = substr(sw[i], 31, nchar(sw[i], "bytes"))
              cat("\n", file="EpisodeVI_dialogues.txt", append=TRUE)
              cat(tmp_name, "", file="EpisodeVI_dialogues.txt", sep="\t", append=TRUE)
              i = i + 1
            } else {
              i = i + 1
            }
          } else {
            # if dialogue
            if (substr(sw[i], 1, 15) == b15)
            {
              if (substr(sw[i], 16, 16) != " ")
              {
                tmp_diag = substr(sw[i], 16, nchar(sw[i], "bytes"))
                cat("", tmp_diag, file="EpisodeVI_dialogues.txt", append=TRUE)
                i = i + 1
              } else {
                i = i + 1
              }
            }
          }
        }
      }
    }
  }
}
Any help would be much appreciated! Thank you!!
You can do the following:
text = """
THREEPIO
Did you hear that? They've shut
down the main reactor. We'll be
destroyed for sure. This is
madness!
THREEPIO
We're doomed!
THREEPIO
There'll be no escape for the
Princess this time.
THREEPIO
What's that?
"""
clean = text.split()
n = 1
tmp = []
results = []
for element in clean:
    # An all-uppercase word is treated as a character name.
    if element.isupper():
        if tmp:
            results.append(tmp)
        tmp = [n, element]
        n += 1
        continue
    try:
        tmp[2] = " ".join((tmp[2], element))
    except IndexError:
        tmp.append(element)
# Append the final entry, which the loop never flushes on its own.
if tmp:
    results.append(tmp)
print(results)
Result:
[[1, 'THREEPIO', "Did you hear that? They've shut down the main reactor. We'll be destroyed for sure. This is madness!"], [2, 'THREEPIO', "We're doomed!"], [3, 'THREEPIO', "There'll be no escape for the Princess this time."], [4, 'THREEPIO', "What's that?"]]
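Since the question asks for a csv file, rows of this shape can be handed to the standard library's csv module. This is only a minimal sketch: the helper name rows_to_csv, the header names, and the short sample rows are assumptions made for illustration, not part of the code above.

```python
import csv
import io

# Sample rows in the [number, character, dialogue] shape built above.
results = [
    [1, "THREEPIO", "We're doomed!"],
    [2, "THREEPIO", "What's that?"],
]

def rows_to_csv(rows, handle):
    """Write [number, character, dialogue] rows to an open text handle."""
    # QUOTE_ALL wraps every field in double quotes, as in the desired output.
    writer = csv.writer(handle, quoting=csv.QUOTE_ALL)
    writer.writerow(["number", "character", "dialogue"])
    writer.writerows(rows)

# Write into an in-memory buffer so the result can be inspected directly.
buffer = io.StringIO()
rows_to_csv(results, buffer)
print(buffer.getvalue())

# To write to disk instead (the filename is a placeholder):
# with open("dialogues.csv", "w", newline="", encoding="utf-8") as f:
#     rows_to_csv(results, f)
```

Passing a file handle rather than a path keeps the helper easy to test and lets the same function write to a file or to memory.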
If you know the list of character names (and aren't worried about typos), something like this will work:
script = """
THREEPIO
Did you hear that? They've shut
down the main reactor. We'll be
destroyed for sure. This is
madness!
THREEPIO
We're doomed!
THREEPIO
There'll be no escape for the
Princess this time.
THREEPIO
What's that?
"""
characters = ['THREEPIO', 'ANAKIN']
lines = [x for x in list(map(str.strip, script.split('\n'))) if x]
results = []
for (i, item) in enumerate(lines):
    if item in characters:
        dialogue = []
        for index in range(i + 1, len(lines)):
            if lines[index] in characters:
                break
            dialogue.append(lines[index])
        results.append([item, ' '.join(dialogue)])
print([x for x in enumerate(results, start=1)])
This prints:
[(1, ['THREEPIO', "Did you hear that? They've shut down the main reactor. We'll be destroyed for sure. This is madness!"]), (2, ['THREEPIO', "We're doomed!"]), (3, ['THREEPIO', "There'll be no escape for the Princess this time."]), (4, ['THREEPIO', "What's that?"])]
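Because there are many scripts to process, it may help to wrap the parsing in functions that take a file path. The sketch below is one possible way to do that, not a definitive implementation. It assumes that any non-empty line written entirely in uppercase is a character cue, so scene headings or shouted one-line dialogue such as "NO!" would be misread as names. The function names parse_script and convert_file are made up for this example.

```python
import csv
from pathlib import Path

def parse_script(text):
    """Return (character, dialogue) pairs from screenplay-style text.

    Assumption: a non-empty, all-uppercase line is a character cue, and
    the lines that follow (up to the next cue) are that character's words.
    """
    pairs = []
    name, spoken = None, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.isupper():
            # A new cue starts: store the previous speaker's dialogue first.
            if name is not None and spoken:
                pairs.append((name, " ".join(spoken)))
            name, spoken = line, []
        elif name is not None:
            spoken.append(line)
    # Store the last speaker, who has no following cue to trigger the save.
    if name is not None and spoken:
        pairs.append((name, " ".join(spoken)))
    return pairs

def convert_file(txt_path, csv_path):
    """Read one script file and write its dialogue table as CSV."""
    text = Path(txt_path).read_text(encoding="utf-8")
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["character", "dialogue"])
        writer.writerows(parse_script(text))

sample = """
THREEPIO
We're doomed!
THREEPIO
There'll be no escape for the
Princess this time.
"""
print(parse_script(sample))
```

A whole folder could then be converted with a loop such as `for path in Path("scripts").glob("*.txt"): convert_file(path, path.with_suffix(".csv"))`, where `scripts` is a placeholder folder name.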