How to convert a Markdown string to DocX in Python?

Question

I get Markdown text from my API, like this:

{
    name:'Onur',
    surname:'Gule',
    biography:'## Computers
    I like **computers** so much.
    I wanna *be* a computer.',
    membership:1
}

The biography column contains Markdown strings like the one above.

## Computers
I like **computers** so much.
I wanna *be* a computer.

I want to convert this Markdown text into docx content for use in my report.

In my docx template:

{{markdownText|mark2html}}

{{simpleText}}

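I am using the python3 docxtpl package to create the docx, and it works for simple text.

I tried using BeautifulSoup to convert the Markdown to docx text, but it does not work for styles (bold, italic, etc.).
I tried pandoc and it works, but it only creates a separate docx file; I want to add the rendered Markdown text to an existing docx (while it is being created).

My current code: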

import docx
from docxtpl import DocxTemplate, RichText
import markdown
import jinja2
import markupsafe
from bs4 import BeautifulSoup
import pypandoc

def safe_markdown(text):
    return markupsafe.Markup(markdown.markdown(text))

def mark2html(value):
    html = markdown.markdown(value)
    soup = BeautifulSoup(html, features='html.parser')
    output = pypandoc.convert_text(value,'rtf',format='md')
    return RichText(value) #tried soup and pandoc..

def from_template(template):
    template = DocxTemplate(template)
    context = {
        'simpleText':'Simple text test.',
        'markdownText':'Markdown **text** test.'
    } 
    jenv = jinja2.Environment()
    jenv.filters['markdown'] = safe_markdown
    jenv.filters["mark2html"] = mark2html
    template.render(context,jenv)
    template.save('new_report.docx')

So, how can I add rendered Markdown to an existing docx, or add it at creation time (perhaps with a jinja2 filter)?

Answer 1

I solved it without any shortcut. I convert the Markdown to HTML, parse it with BeautifulSoup, and then process each paragraph by checking its tag name.

In my Word template:

{% if markdownText != None %}
    {% for mt in markdownText|mark2html %} 
        {{mt}}
    {% endfor %}
{% endif %}

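My template filter:

import markdown
from bs4 import BeautifulSoup

def mark2html(value):
    # Jinja2 filter: turn a Markdown string into a list of docxtpl objects.
    if value is None:
        return '-'
    html = markdown.markdown(value)
    soup = BeautifulSoup(html, features='html.parser')
    paragraphs = []
    for tag in soup.find_all(True):
        # Only block-level tags are handed over; their children are parsed below.
        if tag.name in ('p', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6'):
            paragraphs.extend(parseHtmlToDoc(tag))
    return paragraphs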

My code for inserting into the docx:

from bs4.element import Tag
from docxtpl import InlineImage, RichText

def parseHtmlToDoc(org_tag):
    # Convert the children of one block-level tag into RichText / InlineImage items.
    pars = []
    for con in org_tag.contents:
        last_is_rich = bool(pars) and isinstance(pars[-1], RichText)
        if isinstance(con, Tag):
            tag = con
            if tag.name in ('strong', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6'):
                # Bold text: extend the previous RichText if there is one.
                source = pars[-1] if last_is_rich else RichText("")
                source.add(con.contents[0], bold=True)
                if not last_is_rich:
                    pars.append(source)
            elif tag.name == 'img':
                # `doc` must be a module-level DocxTemplate and `settings` a settings
                # object with MEDIA_ROOT; the answer does not show where either is defined.
                source = tag['src']
                imagen = InlineImage(doc, settings.MEDIA_ROOT + source)
                pars.append(imagen)
            elif tag.name == 'em':
                source = RichText("")
                source.add(con.contents[0], italic=True)
                pars.append(source)
        else:
            # Plain text node.
            if last_is_rich:
                # Fixed: add the text to the RichText, not to the list (pars.add fails).
                pars[-1].add(con)
            else:
                source = RichText("")
                if org_tag.name == 'h2':
                    source.add(con, bold=True, size=40)
                else:
                    source.add(con)
                pars.append(source)  # always append?
    return pars

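It handles HTML tags such as bold (strong), italic (em), img, and headings. You can add more tags to handle. I solved it this way, and it does not need any extra file conversion such as html2docx.

I use this in my code as follows: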
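
The tag-by-tag approach above can also be sketched with only the standard library. This is not the answer author's code but a minimal illustration, assuming it is enough to flatten the HTML into (text, bold, italic) runs; the names RunCollector and html_to_runs are hypothetical. Unlike the code above, it tracks nesting depth, so text inside both strong and em comes out as bold and italic.

```python
from html.parser import HTMLParser


class RunCollector(HTMLParser):
    """Flatten an HTML fragment into (text, bold, italic) runs."""

    BOLD_TAGS = {'strong', 'b', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6'}
    ITALIC_TAGS = {'em', 'i'}

    def __init__(self):
        super().__init__()
        self.runs = []
        self.bold_depth = 0    # how many bold-like tags are currently open
        self.italic_depth = 0  # how many italic-like tags are currently open

    def handle_starttag(self, tag, attrs):
        if tag in self.BOLD_TAGS:
            self.bold_depth += 1
        elif tag in self.ITALIC_TAGS:
            self.italic_depth += 1

    def handle_endtag(self, tag):
        if tag in self.BOLD_TAGS:
            self.bold_depth = max(0, self.bold_depth - 1)
        elif tag in self.ITALIC_TAGS:
            self.italic_depth = max(0, self.italic_depth - 1)

    def handle_data(self, data):
        # Record each piece of text with the styles active at that point.
        if data:
            self.runs.append((data, self.bold_depth > 0, self.italic_depth > 0))


def html_to_runs(html):
    collector = RunCollector()
    collector.feed(html)
    collector.close()
    return collector.runs


if __name__ == '__main__':
    for run in html_to_runs('<p>I like <strong>computers</strong> so much.</p>'):
        print(run)
```

Each run could then be passed to docxtpl as RichText.add(text, bold=bold, italic=italic), which keeps the HTML walking separate from the docx-specific part.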

report_context = {'reportVariables': report_variables}
template = DocxTemplate('report_format.docx')
jenv = jinja2.Environment()
jenv.filters["mark2html"] = mark2html
template.render(report_context,jenv)
template.save('exported_1.docx')

Answer 2

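I followed a lazy strategy that is not the most efficient but is useful. Since working with docx is less flexible than working with html, I first convert the Markdown (md) to html, and then go from html to docx, as follows: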

from jinja2 import FileSystemLoader, Environment
from pypandoc import convert_file, convert_text

def md2html(md):
  return convert_text(md, 'html', format='md')

def html2docx(file):
  return convert_file(f'{file}.html', 'docx', format='html', outputfile=f'{file}.docx')

def from_template(template_file, f_out):
  context = {
      'simpleText': 'Simple text test.',
      'markdownText': 'Markdown **text** test.'
  }
  ldr = FileSystemLoader(searchpath='./')
  jenv = Environment(loader=ldr)
  jenv.filters["md2html"] = md2html
  template = jenv.get_template(template_file)
  html = template.render(context)
  print(html)
  with open(f'{f_out}.html', 'w') as fout:
    fout.write(html)  # the with block closes the file; no explicit close() needed
  html2docx(f_out)

if __name__ == '__main__':
  from_template('template.html.jinja', 'new_report')

As for the template, it should be an html-based template like this:

<!DOCTYPE html>
<html xml:lang="en-US" xmlns="http://www.w3.org/1999/xhtml" lang="en-US">
  <head></head>
  <body>
    {{markdownText|md2html}}
    {{simpleText}}
  </body>
</html>

I saved it as template.html.jinja.
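
A possible variation on the script above, as a sketch only: pypandoc's convert_text accepts an outputfile argument, so the rendered HTML string can be handed to pandoc directly instead of being written to an intermediate .html file first. The function name html_to_docx and its convert parameter are hypothetical; the parameter exists only so the call can be exercised without pandoc installed.

```python
def html_to_docx(html, out_path, convert=None):
    """Write an HTML string to a .docx file through pandoc.

    `convert` defaults to pypandoc.convert_text, which needs the pypandoc
    package and a pandoc binary. Passing another callable with the same
    signature is an assumption made here for testing.
    """
    if convert is None:
        from pypandoc import convert_text  # imported lazily on purpose
        convert = convert_text
    # docx is a binary format, so pandoc has to write it to a file.
    convert(html, 'docx', format='html', outputfile=out_path)
    return out_path


if __name__ == '__main__':
    html_to_docx('<p>Markdown <strong>text</strong> test.</p>', 'new_report.docx')
```

In from_template, this would replace the open/write block and the html2docx call with html_to_docx(html, f'{f_out}.docx').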

I would love to study @Mahrkeenerh's contribution; the API mentioned there seems to have quite a few things to learn and understand.
