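I want to extract actor names from the page_title field of this JSON file and then match them against my database. I tried NLTK and spaCy, but both require training data. Do I have to annotate every sentence? I have more than 100K sentences, and annotating them by hand would take a month or longer. Is there a way to dump my K_actor database and use it to train spaCy or NLTK, or is there any other method?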
{"page_title": "Sonakshi Sinha To Auction Sketch Of Buddha To Help Migrant Labourers", "description": "Sonakshi Sinha took to Instagram to share a timelapse video of a sketch of Buddha that she made to auction to raise funds for migrant workers affected by Covid-19 crisis. ", "image_url": "https://images.news18.com/ibnlive/uploads/2020/05/1589815261_1589815196489_copy_875x583.jpg", "post_url": "https://www.news18.com/news/movies/sonakshi-sinha-to-auction-sketch-of-buddha-to-help-migrant-labourers-2626123.html"}
{"page_title": "Anushka Sharma Calls Virat Kohli 'A Liar' on IG Live, Nushrat Bharucha Gets Propositioned on Twitter", "description": "In an Instagram live interaction with Sunil Chhetri, Virat Kohli was left embarrassed after Anushka Sharma called him a 'jhootha' from behind the camera. This and more in today's wrap.", "image_url": "https://images.news18.com/ibnlive/uploads/2020/05/1589813980_1589813933996_copy_875x583.jpg", "post_url": "https://www.news18.com/news/movies/anushka-sharma-calls-virat-kohli-a-liar-on-ig-live-nushrat-bharucha-gets-propositioned-on-twitter-2626093.html"}
{"page_title": "Ranveer Singh Shares a Throwback to the Days When WWF was His Life", "description": "Ranveer Singh shared a throwback picture from his childhood where he could be seen posing in front of a poster of WWE legend Hulk Hogan.", "image_url": "https://images.news18.com/ibnlive/uploads/2020/05/1589812401_screenshot_20200518-195906_chrome_copy_875x583.jpg", "post_url": "https://www.news18.com/news/movies/ranveer-singh-shares-a-throwback-to-the-days-when-wwf-was-his-life-2626067.html"}
{"page_title": "Salman Khan's Love Song 'Tere Bina' Gets 26 Million Views", "description": "Salman Khan's song Tere Bina, which was launched a few days ago, had garnered 12 million views within 24 hours. As it continues to trend, it has garnered 26 million views in less than a week.", "image_url": "https://images.news18.com/ibnlive/uploads/2020/05/1589099778_screenshot_20200510-135934_chrome_copy_875x583.jpg", "post_url": "https://www.news18.com/news/movies/salman-khans-love-song-tere-bina-gets-26-million-views-2626077.html"}
{"page_title": "Yash And Radhika Pandit Pose With Their Kids For a Perfect Family Picture", "description": "Kannada actor Yash tied the knot with actress Radhika Pandit in 2016. The couple shares two kids together.", "image_url": "https://images.news18.com/ibnlive/uploads/2020/05/1589812187_yash.jpg", "post_url": "https://www.news18.com/news/movies/yash-and-radhika-pandit-pose-with-their-kids-for-a-perfect-family-picture-2626055.html"}
{"page_title": "Malaika Arora Shares Beach Vacay Boomerang With Hopeful Note", "description": "Malaika Arora shared a throwback boomerang from a beach vacation where she could be seen playfully spinning. She also shared a hopeful message along with it.", "image_url": "https://images.news18.com/ibnlive/uploads/2020/05/1589810291_screenshot_20200518-192603_chrome_copy_875x583.jpg", "post_url": "https://www.news18.com/news/movies/malaika-arora-shares-beach-vacay-boomerang-with-hopeful-note-2626019.html"}
{"page_title": "Actor Nawazuddin Siddiqui's Wife Aaliya Sends Legal Notice To Him Demanding Divorce, Maintenance", "description": "The notice was sent to the ", "image_url": "https://images.news18.com/ibnlive/uploads/2019/10/Nawazuddin-Siddiqui.jpg", "post_url": "https://www.news18.com/news/movies/actor-nawazuddin-siddiquis-wife-aaliya-sends-legal-notice-to-him-demanding-divorce-maintenance-2626035.html"}
{"page_title": "Lisa Haydon Celebrates Son Zack\u2019s 3rd Birthday With Homemade Cake And 'Spiderman' Surprise", "description": "Lisa Haydon took to Instagram to share some glimpses from the special day. In the pictures, we can spot a man wearing a Spiderman costume.", "image_url": "https://images.news18.com/ibnlive/uploads/2020/05/1589807960_lisa-rey.jpg", "post_url": "https://www.news18.com/news/movies/lisa-haydon-celebrates-son-zacks-3rd-birthday-with-homemade-cake-and-spiderman-surprise-2625953.html"}
{"page_title": "Chiranjeevi Recreates Old Picture with Wife, Says 'Time Has Changed'", "description": "Chiranjeevi was last seen in historical-drama Sye Raa Narasimha Reddy. He was shooting for his next film, Acharya, before the coronavirus lockdown.", "image_url": "https://images.news18.com/ibnlive/uploads/2020/05/1589808242_pjimage.jpg", "post_url": "https://www.news18.com/news/movies/chiranjeevi-recreates-old-picture-with-wife-says-time-has-changed-2625973.html"}
{"page_title": "Amitabh Bachchan, Rishi Kapoor\u2019s Pout Selfie Recreated By Abhishek, Ranbir is Priceless", "description": "A throwback picture that has gone viral on the internet shows Ranbir Kapoor and Abhishek Bachchan recreating a selfie of their fathers Rishi Kapoor and Amitabh Bachchan.", "image_url": "https://images.news18.com/ibnlive/uploads/2020/05/1589807772_screenshot_20200518-184521_chrome_copy_875x583.jpg", "post_url": "https://www.news18.com/news/movies/amitabh-bachchan-rishi-kapoors-pout-selfie-recreated-by-abhishek-ranbir-is-priceless-2625867.html"}
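What you can do is write an annotation script in which the actor names in your sentences are replaced with '@@@' or some other placeholder string, which is later replaced by each actor name (entity) to generate training examples.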
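As a rough sketch of that first step, the snippet below turns titles into '@@@' templates using a list of names you already know. The function name `make_templates`, the `known_actors` list and the two sample lines are hypothetical, introduced only for illustration; it assumes one JSON object per line with a `page_title` key, as in the question.

```python
import json
import re

def make_templates(json_lines, actor_names, placeholder='@@@'):
    """Return page titles in which a known actor name is replaced by the placeholder."""
    # Longest names first, so a full name wins over a shorter overlapping one.
    names = sorted(set(actor_names), key=len, reverse=True)
    if not names:
        return []
    pattern = re.compile(
        '|'.join(r'\b' + re.escape(name) + r'\b' for name in names),
        re.IGNORECASE,
    )
    templates = []
    for line in json_lines:
        line = line.strip()
        if not line:
            continue
        title = json.loads(line)['page_title']
        # Replace only the first match so each template has a single slot.
        new_title, count = pattern.subn(placeholder, title, count=1)
        if count:
            templates.append(new_title)
    return templates

# Hypothetical sample input: two lines in the same shape as the question's data.
sample = [
    '{"page_title": "Ranveer Singh Shares a Throwback to the Days When WWF was His Life"}',
    '{"page_title": "Malaika Arora Shares Beach Vacay Boomerang With Hopeful Note"}',
]
known_actors = ['Ranveer Singh', 'Salman Khan']  # stand-in for the K_actor database
templates = make_templates(sample, known_actors)
print(templates)
```

Titles that contain none of the known names are skipped here; writing the returned templates to a text file, one per line, gives a sentence file of the kind the answer's script reads.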
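I trained on 68K sentences in 9 hours on an i3 laptop. You can dump the data with a script like the following, and the output file can be used to train the model.
This saves time and gives you training data that is already in spaCy's format.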
from nltk import word_tokenize
from pandas import read_csv
import re
import os.path
def annot(Label, entity, textlist):
    # Returns spaCy-style examples: (text, {'entities': [(start, end, label)]}),
    # where start and end are character offsets into text.
    # word_tokenize needs NLTK's tokenizer data to be downloaded beforehand.
    finaldict = []
    for text_token in textlist:
        for value in entity:
            # Fill the '@@@' placeholder with the entity, then normalise the text
            text = str(text_token).replace('@@@', value)
            text = text.lower()
            text = re.sub(r'[^a-zA-Z0-9\n.]', ' ', text)
            target = value.lower()
            traindata = []
            if len(word_tokenize(value)) < 2:
                # Single-token entity: locate each token in the text and record
                # the character span of every token equal to the entity
                pos = 0
                for token in word_tokenize(text):
                    start = text.find(token, pos)
                    if start == -1:
                        continue
                    pos = start + len(token)
                    if token == target:
                        traindata.append((start, pos, Label))
            else:
                # Multi-token entity: search for the whole string
                begin = text.find(target)
                if begin != -1:
                    # spaCy expects an end offset, so add the length to the start
                    traindata.append((begin, begin + len(target), Label))
            mydict = {'entities': traindata}
            finaldict.append((text, mydict))
    return finaldict
def getEntities(csv_file, column):
    df = read_csv(csv_file)
    return df[column].to_list()

def getSentences(file_name):
    with open(file_name) as file1:
        sentences = [line1.rstrip('\n') for line1 in file1]
    return sentences

def saveData(data, filename, path):
    filename = os.path.join(path, filename)
    with open(filename, 'a') as file:
        for sent in data:
            file.write("{}\n".format(sent))

# csv_file, column_name, filepathandname and path are placeholders: set them for your data
ents = getEntities(csv_file, column_name)  # Actor names in your case
entities = [ent for ent in ents if str(ent) != 'nan']
sentences = getSentences(filepathandname)  # Assumes the sentences are in a text file, one per line
label = 'ACTOR_NAMES'
data = annot(label, entities, sentences)
saveData(data, 'train_data.txt', path)
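
Since saveData writes one Python tuple per line, the file can probably be read back with `ast.literal_eval`. The sketch below is only an illustration: `load_train_data`, `check_spans` and the sample line are hypothetical names and data, and it assumes each span is stored as (start, end, label) character offsets, which is the form spaCy expects. Slicing the text with each span is a quick way to confirm the offsets point at the entity.

```python
import ast

def load_train_data(lines):
    """Parse lines of the form "('text', {'entities': [(start, end, label)]})"."""
    data = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        text, annotations = ast.literal_eval(line)
        data.append((text, annotations))
    return data

def check_spans(data):
    """Return the substring covered by every span, paired with its label."""
    return [
        (text[start:end], label)
        for text, annotations in data
        for start, end, label in annotations['entities']
    ]

# Hypothetical sample line in the same shape that saveData writes.
sample_lines = ["('salman khan s love song', {'entities': [(0, 11, 'ACTOR_NAMES')]})"]
loaded = load_train_data(sample_lines)
print(check_spans(loaded))
```

With a real file, `load_train_data(open(...))` would work the same way, since a file object iterates over its lines. If a slice does not show the actor name, the offsets for that example are wrong and it should be dropped before training.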
I hope this answers your question.