Convert a JSON/dict into a flat string with indicator tokens


Given the following input:

{'example_id': 0,
 'query': ' revent 80 cfm',
 'query_id': 0,
 'product_id': 'B000MOO21W',
 'product_locale': 'us',
 'esci_label': 'I',
 'small_version': 0,
 'large_version': 1,
 'split': 'train',
 'product_title': 'Panasonic FV-20VQ3 WhisperCeiling 190 CFM Ceiling Mounted Fan',
 'product_description': None,
 'product_bullet_point': 'WhisperCeiling fans feature a totally enclosed condenser motor and a double-tapered, dolphin-shaped bladed blower wheel to quietly move air\nDesigned to give you continuous, trouble-free operation for many years thanks in part to its high-quality components and permanently lubricated motors which wear at a slower pace\nDetachable adaptors, firmly secured duct ends, adjustable mounting brackets (up to 26-in), fan/motor units that detach easily from the housing and uncomplicated wiring all lend themselves to user-friendly installation\nThis Panasonic fan has a built-in damper to prevent backdraft, which helps to prevent outside air from coming through the fan\n0.35 amp',
 'product_brand': 'Panasonic',
 'product_color': 'White'}

The goal is to produce the following output:

Panasonic FV-20VQ3 WhisperCeiling 190 CFM Ceiling Mounted Fan [TITLE] Panasonic [BRAND] White [COLOR] WhisperCeiling fans feature a totally enclosed condenser motor and a double-tapered, dolphin-shaped bladed blower wheel to quietly move air [SEP] Designed to give you continuous, trouble-free operation for many years thanks in part to its high-quality components and permanently lubricated motors which wear at a slower pace [SEP] Detachable adaptors, firmly secured duct ends, adjustable mounting brackets (up to 26-in), fan/motor units that detach easily from the housing and uncomplicated wiring all lend themselves to user-friendly installation [SEP] This Panasonic fan has a built-in damper to prevent backdraft, which helps to prevent outside air from coming through the fan [SEP] 0.35 amp [BULLETPOINT]

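The output is generated by applying these rules:

  • If a value in the dictionary is None, do not add anything for it to the output string
  • If a value contains newlines (\n), replace them with the [SEP] token
  • Concatenate the strings in a user-specified order; the output above uses the order
    ["product_title", "product_brand", "product_color", "product_bullet_point", "product_description"]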
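The three rules above can be sketched roughly as follows. This is only an illustration under assumptions: the helper name `flatten_fields` and the list of `(key, token)` pairs are hypothetical and not part of any library, and the tiny `row` is made-up sample data rather than the real record.

```python
def flatten_fields(row, fields):
    """Join the values of `row` in the order given by `fields`.

    `fields` is a list of (key, token) pairs; the name and shape of this
    argument are assumptions made for illustration only.
    """
    parts = []
    for key, token in fields:
        value = row.get(key)
        if value is None:
            # Rule 1: a None value contributes nothing to the output.
            continue
        # Rule 2: newlines inside a value become the [SEP] token.
        parts.append(value.replace("\n", " [SEP] ") + " " + token)
    # Rule 3: the caller-specified order of `fields` is preserved.
    return " ".join(parts)


row = {"product_title": "Fan", "product_color": None,
       "product_bullet_point": "quiet\n0.35 amp"}
fields = [("product_title", "[TITLE]"),
          ("product_color", "[COLOR]"),
          ("product_bullet_point", "[BULLETPOINT]")]
print(flatten_fields(row, fields))
# Fan [TITLE] quiet [SEP] 0.35 amp [BULLETPOINT]
```

Passing the order and the tokens together as one list keeps the two from drifting apart, at the cost of a slightly more verbose call.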

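I tried the following, which sort of works, but the function I wrote feels hard-coded in the way it looks up the wanted keys and then concatenates and manipulates the strings.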


item1 = {'example_id': 0,
 'query': ' revent 80 cfm',
 'query_id': 0,
 'product_id': 'B000MOO21W',
 'product_locale': 'us',
 'esci_label': 'I',
 'small_version': 0,
 'large_version': 1,
 'split': 'train',
 'product_title': 'Panasonic FV-20VQ3 WhisperCeiling 190 CFM Ceiling Mounted Fan',
 'product_description': None,
 'product_bullet_point': 'WhisperCeiling fans feature a totally enclosed condenser motor and a double-tapered, dolphin-shaped bladed blower wheel to quietly move air\nDesigned to give you continuous, trouble-free operation for many years thanks in part to its high-quality components and permanently lubricated motors which wear at a slower pace\nDetachable adaptors, firmly secured duct ends, adjustable mounting brackets (up to 26-in), fan/motor units that detach easily from the housing and uncomplicated wiring all lend themselves to user-friendly installation\nThis Panasonic fan has a built-in damper to prevent backdraft, which helps to prevent outside air from coming through the fan\n0.35 amp',
 'product_brand': 'Panasonic',
 'product_color': 'White'}

item2 = {'example_id': 198,
 'query': '# 2 pencils not sharpened',
 'query_id': 6,
 'product_id': 'B08KXRY4DG',
 'product_locale': 'us',
 'esci_label': 'S',
 'small_version': 1,
 'large_version': 1,
 'split': 'train',
 'product_title': 'AHXML#2 HB Wood Cased Graphite Pencils, Pre-Sharpened with Free Erasers, Smooth write for Exams, School, Office, Drawing and Sketching, Pack of 48',
 'product_description': "<b>AHXML#2 HB Wood Cased Graphite Pencils, Pack of 48</b><br><br>Perfect for Beginners experienced graphic designers and professionals, kids Ideal for art supplies, drawing supplies, sketchbook, sketch pad, shading pencil, artist pencil, school supplies. <br><br><b>Package Includes</b><br>- 48 x Sketching Pencil<br> - 1 x Paper Boxed packaging<br><br>Our high quality, hexagonal shape is super lightweight and textured, producing smooth marks that erase well, and do not break off when you're drawing.<br><br><b>If you have any question or suggestion during using, please feel free to contact us.</b>",
 'product_bullet_point': '#2 HB yellow, wood-cased pencils:Box of 48 count. Made from high quality real poplar wood and 100% genuine graphite pencil core. These No 2 pencils come with 100% Non-Toxic latex free pink top erasers.\nPRE-SHARPENED & EASY SHARPENING: All the 48 count pencils are pre-sharpened, ready to use when get it, saving your time of preparing.\nThese writing instruments are hexagonal in shape to ensure a comfortable grip when writing, scribbling, or doodling.\nThey are widely used in daily writhing, sketching, examination, marking, and more, especially for kids and teen writing in classroom and home.#2 HB wood-cased yellow pencils in bulk are ideal choice for school, office and home to maintain daily pencil consumption.\nCustomer service:If you are not satisfied with our product or have any questions, please feel free to contact us.',
 'product_brand': 'AHXML',
 'product_color': None}


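def product2str(row, keys):
    """Flatten the values of `keys` in `row` into one token-tagged string."""
    # Indicator token appended after the value of each supported key.
    key2token = {'product_title': '[TITLE]',
     'product_brand': '[BRAND]',
     'product_color': '[COLOR]',
     'product_bullet_point': '[BULLETPOINT]',
     'product_description': '[DESCRIPTION]'}

    output = ""
    for k in keys:
        content = row[k]
        # Skip None (and empty) values so they add nothing to the output.
        if content:
            # Newlines become [SEP]; the indicator token follows the value.
            output += content.replace('\n', ' [SEP] ') + f" {key2token[k]} "

    return output.strip()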

product2str(item2, keys=['product_title', 'product_brand', 'product_color',
                        'product_bullet_point', 'product_description'])
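
One way the hard-coded mapping might be loosened is to derive each token from its key name and let the caller override it where needed. The sketch below is an illustration under assumptions, not an existing recipe: `default_token`, `product2str_generic`, and the `tokens` and `sep` parameters are hypothetical names, and the `product_` prefix convention is inferred from the keys in the example records.

```python
def default_token(key, prefix="product_"):
    """Derive a token from a key, e.g. 'product_bullet_point' -> '[BULLETPOINT]'."""
    name = key[len(prefix):] if key.startswith(prefix) else key
    return "[" + name.replace("_", "").upper() + "]"


def product2str_generic(row, keys, tokens=None, sep=" [SEP] "):
    """Flatten row[k] for each k in keys; `tokens` may override derived tokens."""
    tokens = tokens or {}
    parts = []
    for k in keys:
        content = row.get(k)
        if not content:
            # Skips None and other falsy values, like `if content:` in the original.
            continue
        token = tokens.get(k, default_token(k))
        text = str(content).replace("\n", sep)
        parts.append(f"{text} {token}")
    return " ".join(parts)


# Made-up sample record, shortened from the examples above.
sample = {"product_title": "HB Pencils", "product_brand": "AHXML",
          "product_color": None,
          "product_bullet_point": "Box of 48\nPre-sharpened"}
print(product2str_generic(sample, ["product_title", "product_brand",
                                   "product_color", "product_bullet_point"]))
# HB Pencils [TITLE] AHXML [BRAND] Box of 48 [SEP] Pre-sharpened [BULLETPOINT]
```

Deriving tokens from key names removes the fixed dictionary, but it only works while the keys follow a predictable naming scheme; the `tokens` override covers keys that do not.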

Q: Is there some kind of native CPython JSON-to-str flatten function or recipe that can achieve results similar to the product2str function?

Q: Or is there already some function or pipeline in tokenizers (https://pypi.org/project/tokenizers/) that can flatten a JSON/dict into tokens?

python json tokenize huggingface-tokenizers json-flattener