Need to get actor names from a JSON file


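I want to get the actor names from the page_title field of this JSON file and then match them against my database. I tried nltk and spaCy, but both need me to train on my own data. Do I have to annotate every sentence for training? I have more than 100,000 sentences, and annotating them by hand would take a month or more. Is there any way to dump the K_actor database to train spaCy, nltk, or any other method?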

{"page_title": "Sonakshi Sinha To Auction Sketch Of Buddha To Help Migrant Labourers", "description": "Sonakshi Sinha took to Instagram to share a timelapse video of a sketch of Buddha that she made to auction to raise funds for migrant workers affected by Covid-19 crisis. ", "image_url": "https://images.news18.com/ibnlive/uploads/2020/05/1589815261_1589815196489_copy_875x583.jpg", "post_url": "https://www.news18.com/news/movies/sonakshi-sinha-to-auction-sketch-of-buddha-to-help-migrant-labourers-2626123.html"}
{"page_title": "Anushka Sharma Calls Virat Kohli 'A Liar' on IG Live, Nushrat Bharucha Gets Propositioned on Twitter", "description": "In an Instagram live interaction with Sunil Chhetri, Virat Kohli was left embarrassed after Anushka Sharma called him a 'jhootha' from behind the camera. This and more in today's wrap.", "image_url": "https://images.news18.com/ibnlive/uploads/2020/05/1589813980_1589813933996_copy_875x583.jpg", "post_url": "https://www.news18.com/news/movies/anushka-sharma-calls-virat-kohli-a-liar-on-ig-live-nushrat-bharucha-gets-propositioned-on-twitter-2626093.html"}
{"page_title": "Ranveer Singh Shares a Throwback to the Days When WWF was His Life", "description": "Ranveer Singh shared a throwback picture from his childhood where he could be seen posing in front of a poster of WWE legend Hulk Hogan.", "image_url": "https://images.news18.com/ibnlive/uploads/2020/05/1589812401_screenshot_20200518-195906_chrome_copy_875x583.jpg", "post_url": "https://www.news18.com/news/movies/ranveer-singh-shares-a-throwback-to-the-days-when-wwf-was-his-life-2626067.html"}
{"page_title": "Salman Khan's Love Song 'Tere Bina' Gets 26 Million Views", "description": "Salman Khan's song Tere Bina, which was launched a few days ago, had garnered 12 million views within 24 hours. As it continues to trend, it has garnered 26 million views in less than a week.", "image_url": "https://images.news18.com/ibnlive/uploads/2020/05/1589099778_screenshot_20200510-135934_chrome_copy_875x583.jpg", "post_url": "https://www.news18.com/news/movies/salman-khans-love-song-tere-bina-gets-26-million-views-2626077.html"}
{"page_title": "Yash And Radhika Pandit Pose With Their Kids For a Perfect Family Picture", "description": "Kannada actor Yash tied the knot with actress Radhika Pandit in 2016. The couple shares two kids together.", "image_url": "https://images.news18.com/ibnlive/uploads/2020/05/1589812187_yash.jpg", "post_url": "https://www.news18.com/news/movies/yash-and-radhika-pandit-pose-with-their-kids-for-a-perfect-family-picture-2626055.html"}
{"page_title": "Malaika Arora Shares Beach Vacay Boomerang With Hopeful Note", "description": "Malaika Arora shared a throwback boomerang from a beach vacation where she could be seen playfully spinning. She also shared a hopeful message along with it.", "image_url": "https://images.news18.com/ibnlive/uploads/2020/05/1589810291_screenshot_20200518-192603_chrome_copy_875x583.jpg", "post_url": "https://www.news18.com/news/movies/malaika-arora-shares-beach-vacay-boomerang-with-hopeful-note-2626019.html"}
{"page_title": "Actor Nawazuddin Siddiqui's Wife Aaliya Sends Legal Notice To Him Demanding Divorce, Maintenance", "description": "The notice was sent to the ", "image_url": "https://images.news18.com/ibnlive/uploads/2019/10/Nawazuddin-Siddiqui.jpg", "post_url": "https://www.news18.com/news/movies/actor-nawazuddin-siddiquis-wife-aaliya-sends-legal-notice-to-him-demanding-divorce-maintenance-2626035.html"}
{"page_title": "Lisa Haydon Celebrates Son Zack\u2019s 3rd Birthday With Homemade Cake And 'Spiderman' Surprise", "description": "Lisa Haydon took to Instagram to share some glimpses from the special day. In the pictures, we can spot a man wearing a Spiderman costume.", "image_url": "https://images.news18.com/ibnlive/uploads/2020/05/1589807960_lisa-rey.jpg", "post_url": "https://www.news18.com/news/movies/lisa-haydon-celebrates-son-zacks-3rd-birthday-with-homemade-cake-and-spiderman-surprise-2625953.html"}
{"page_title": "Chiranjeevi Recreates Old Picture with Wife, Says 'Time Has Changed'", "description": "Chiranjeevi was last seen in historical-drama Sye Raa Narasimha Reddy. He was shooting for his next film, Acharya, before the coronavirus lockdown.", "image_url": "https://images.news18.com/ibnlive/uploads/2020/05/1589808242_pjimage.jpg", "post_url": "https://www.news18.com/news/movies/chiranjeevi-recreates-old-picture-with-wife-says-time-has-changed-2625973.html"}
{"page_title": "Amitabh Bachchan, Rishi Kapoor\u2019s Pout Selfie Recreated By Abhishek, Ranbir is Priceless", "description": "A throwback picture that has gone viral on the internet shows Ranbir Kapoor and Abhishek Bachchan recreating a selfie of their fathers Rishi Kapoor and Amitabh Bachchan.", "image_url": "https://images.news18.com/ibnlive/uploads/2020/05/1589807772_screenshot_20200518-184521_chrome_copy_875x583.jpg", "post_url": "https://www.news18.com/news/movies/amitabh-bachchan-rishi-kapoors-pout-selfie-recreated-by-abhishek-ranbir-is-priceless-2625867.html"}
python-3.x scrapy nlp nltk spacy
1 Answer

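What you can do is create an annotation script in which the actor names are replaced with '@@@' or some other placeholder string (the placeholder is later replaced by each actor name (entity) to generate training examples).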
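
As a rough sketch of how such templates could be produced from the file in the question, which has one JSON object per line: the function below reads each page_title and swaps any known actor name for the placeholder. The names make_templates and known_names are hypothetical, and the two sample lines are trimmed copies of records from the question.

```python
import json
import re


def make_templates(json_lines, known_names, placeholder="@@@"):
    """Replace each known name found in a page_title with the placeholder."""
    templates = []
    for line in json_lines:
        line = line.strip()
        if not line:
            continue
        title = json.loads(line)["page_title"]
        for name in known_names:
            # Whole-word, case-insensitive match of the name inside the title.
            pattern = r"\b" + re.escape(name) + r"\b"
            if re.search(pattern, title, flags=re.IGNORECASE):
                templates.append(
                    re.sub(pattern, placeholder, title, flags=re.IGNORECASE)
                )
    return templates


# Two records from the question, reduced to the page_title field.
sample = [
    '{"page_title": "Ranveer Singh Shares a Throwback to the Days When WWF was His Life"}',
    '{"page_title": "Malaika Arora Shares Beach Vacay Boomerang With Hopeful Note"}',
]
names = ["Ranveer Singh", "Malaika Arora"]  # stand-in for the actor database

for template in make_templates(sample, names):
    print(template)
```

Writing the resulting templates to a text file, one per line, gives the kind of sentence file the script further down reads.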

I trained on 68K sentences in 9 hours on an i3 laptop. You can dump the data like this, and the output file can then be used to train the model.

This saves time and gives you training data in the ready-made format for spaCy.
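
To make that format concrete, here is a minimal illustration of a single record, assuming the spaCy v2-style layout of (text, {'entities': [(start, end, label)]}) with an exclusive end offset. The template and name are made up for the example.

```python
# Hypothetical template and name, used only to illustrate the record layout.
template = "@@@ shares a throwback picture"
name = "ranveer singh"

text = template.replace("@@@", name)
start = text.find(name)
end = start + len(name)  # the end offset is exclusive

record = (text, {"entities": [(start, end, "ACTOR_NAMES")]})
print(record)
```

Slicing text[start:end] should give back exactly the entity string, which is a quick sanity check for any generated record.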

import os.path
import re

from pandas import read_csv


def normalize(text):
    # Lowercase, then replace every character except letters, digits,
    # newlines and dots with a space.
    return re.sub(r'[^a-z0-9\n.]', ' ', str(text).lower())


def annot(label, entities, textlist):
    # Returns a list of (text, {'entities': [(start, end, label), ...]}) tuples.
    finaldata = []
    for template in textlist:
        for value in entities:
            # Put the entity where the '@@@' placeholder is, then clean the text.
            text = normalize(str(template).replace('@@@', str(value)))
            target = normalize(value).strip()
            traindata = []
            if target:
                # Match the entity only as a whole word or phrase. Offsets are
                # taken from the final string, so they stay valid even when the
                # cleanup leaves repeated spaces or splits off punctuation.
                pattern = r'(?<![a-z0-9])' + re.escape(target) + r'(?![a-z0-9])'
                for match in re.finditer(pattern, text):
                    # Each entity is (start_char, end_char, label), end exclusive.
                    traindata.append((match.start(), match.end(), label))
            finaldata.append((text, {'entities': traindata}))
    return finaldata

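def getEntities(csv_file, column):
    # Read one column of a CSV file (the actor names) into a list.
    df = read_csv(csv_file)
    return df[column].to_list()


def getSentences(file_name):
    # Read the template sentences, one per line.
    with open(file_name, encoding='utf-8') as file1:
        sentences = [line1.rstrip('\n') for line1 in file1]
    return sentences


def saveData(data, filename, path):
    # Append one training tuple per line; delete the file first to start fresh.
    filename = os.path.join(path, filename)
    with open(filename, 'a', encoding='utf-8') as file:
        for sent in data:
            file.write("{}\n".format(sent))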

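# csv_file, column_name, filepathandname and path are placeholders: set them to
# your CSV of actor names, its column, the sentence file and an output folder.
ents = getEntities(csv_file, column_name)  # Actor names in your case
entities = [ent for ent in ents if str(ent) != 'nan']  # Drop empty CSV cells


sentences = getSentences(filepathandname)  # Assuming the '@@@' sentences are in a text file
label = 'ACTOR_NAMES'
data = annot(label, entities, sentences)
saveData(data, 'train_data.txt', path)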
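
Since saveData writes each tuple as its Python literal, the file can be read back with ast.literal_eval before training. The following is only a sketch of that step: load_train_data is a hypothetical helper, and the record written to the temporary file is made up to keep the example self-contained.

```python
import ast
import os
import tempfile


def load_train_data(filename):
    """Parse one Python-literal tuple per line back into (text, annotations)."""
    data = []
    with open(filename, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                data.append(ast.literal_eval(line))
    return data


# Round-trip demo with a made-up record, written the same way saveData does.
record = ("ranveer singh shares a throwback", {"entities": [(0, 13, "ACTOR_NAMES")]})
demo_path = os.path.join(tempfile.mkdtemp(), "train_data.txt")
with open(demo_path, "a", encoding="utf-8") as handle:
    handle.write("{}\n".format(record))

train_data = load_train_data(demo_path)
print(train_data[0])
```

The resulting list of (text, annotations) tuples is what spaCy v2-style training loops iterate over. Newer spaCy versions expect the data to be converted further, which is not shown here.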

I hope this answers your question.
