pdfbox获取begintext部分(BT ET)坐标

问题描述 投票:0回答:1

有人可以帮我获取pdf begintext部分的真实像素坐标吗?我正在使用pdfbox从pdf文件中检索文本,但现在我需要获取文本部分/段落。

$contents = $page->getContents();
$contentsStream = $page->getContents()->getStream();
$resources=$page->getResources();
$fonts = $resources->getFonts();
$xobjects = $resources->getImages();
$tokens=$contentsStream->getStreamTokens();
  • [PDF操作员{q},COSFloat {690.48},COSInt {0},COSInt {0},COSFloat {633.6},COSInt {0},COSInt {0},PDF操作员{cm},COSName {im1},PDF操作员{Do} ,PDFOperator {Q},
  • PDFOperator {BT}缝制{1},缝纫{0},缝纫{0},缝纫{1} {COSFloat 25.92} COSFloat {588.48} PDFOperator {TM},{缝纫99} {PDFOperator} Tz的COSName {F30} {缝纫56} {PDFOperator TF}缝纫{3} {PDFOperator Tr的} COSFloat {0334} PDFOperator {锝} COSString {Pospremanj} PDFOperator {TJ},缝纫{0} PDFOperator {锝} COSString {E} PDFOperator {TJ} COSFloat {9533} PDFOperator {Tw的} COSString {I} PDFOperator {TJ} COSFloat {6062} PDFOperator {Tw的} COSFloat {0.95} PDFOperator {锝} COSString {ciscenj} PDFOperator {TJ},缝纫{0} {PDFOperator锝} COSString {E} PDFOperator {TJ},缝纫{1},缝纫{0},缝纫{0}缝制{1} {COSFloat} 55.68 {COSFloat} 539.76 {PDFOperator的Tm},缝纫{0} {PDFOperator Tw的} COSFloat {0262} PDFOperator {锝} COSString {UOE} PDFOperator {TJ}缝纫{0},{}锝PDFOperator,COSString {I},{PDFOperator TJ} {5443} COSFloat,PDFOperator Tw的{} {COSFloat -2145} {锝} PDFOperator,COSString zimslco {} {TJ} PDFOperator ,缝纫{0},{}锝PDFOperator,COSString {G} {TJ} PDFOperator,COSFlo在{7202} {PDFOperator Tw的} {COSFloat -0148} {锝} PDFOperator,COSString odmor {} {TJ} PDFOperator,缝纫{0},{}锝PDFOperator,COSString {A},{TJ} PDFOperator ,PDFOperator {ET},
  • PDF操作员{BT},COSInt {1},COSInt {0},COSInt {0},COSInt {1},COSFloat {6.72},COSFloat {513.12},PDF操作员{Tm},COSInt {0},PDF操作员{Tw} COSName {F30},COSInt {14},PDFOperator {Tf},COSString {},PDF操作员{Tj},COSFloat {2,751},PDF操作员{Tw},...

我希望得到类似PrintTextLocations函数的输出对每个单词/字符。我可以获得底部和左侧坐标,但是如何获得宽度和顶部坐标?

PrintTextLocations:

  • string [25.92,45.119995 fs = 56.0 xscale = 55.440002 height = 40.208004 space = 15.412322 width = 36.978485] p string [63.22914,45.119995 fs = 56.0 xscale = 55.440002 height = 40.208004 space = 15.412322 width = 33.87384] o string [97.43364,45.119995 fs = 56.0 xscale = 55.440002 height = 40.208004 space = 15.412322 width = 30.824646] s string [128.58894,45.119995 fs = 56.0 xscale = 55.440002 height = 42.168 space = 15.412322 width = 33.87384] p string [162.79344,45.119995 fs = 56.0 xscale = 55.440002 height = 42.168 space = 15.412322 width = 21.566162] r string [184.69026,45.119995 fs = 56.0 xscale = 55.440002 height = 42.168 space = 15.412322 width = 30.824646] e string [215.84557,45.119995 fs = 56.0 xscale = 55.440002 height = 42.168 space = 15.412322 width = 49.286148] m ...
pdf text coordinates pdfbox sections
1个回答
1
投票

...因为BT部分给你左下角坐标,你需要解析当前BT块中包含的所有单词/字母以获得所有其他坐标。第一个字高度+ BT底部=顶部,最大(左边坐标+宽度)=右边,最后一个字底部=底部坐标。

我希望这可以帮助别人...

单个字母的示例字符串:

string[32.94,35.099976 fs=8.0 xscale=1.0 height=4.4240003 space=2.2240002 width=3.959999]p

提取,解析和准备的行:

32.94,35.099976 fs=8.0 xscale=1.0 height=4.4240003 space=2.2240002 width=3.959999

功能:

/**
 * Parse single word / letter element
 *
 * @param string $str_raw  Extracted word string line.
 * @param string $str_elem Element of interest, word, char.
 * @param int    $pdf_w    Pdf page width.
 * @param int    $pdf_h    Pdf page height.
 * @param int    $pdf_d    Pdf page dpi.
 * @param int    $pdf_r    Pdf page relative dpi.
 *
 * @return array
 */
function createRealCoordinates($str_raw, $str_elem, $pdf_w, $pdf_h, $pdf_d = 400, $pdf_r = 72)
{
    $stringstrip = array('fs=', 'xscale=', 'height=', 'space=', 'width=');
    $string_info = str_replace($stringstrip, '', $str_raw);

    $coord_info = explode(' ', $string_info);
    $coord_xy   = explode(',', $coord_info[0]);

    $coord = array(
        'pdfWidth'  => $pdf_w,
        'pdfHeight' => $pdf_h,
        'pdfDpi'    => $pdf_d,
        'pdfRel'    => $pdf_r,
        'word'      => $str_elem,

        'x1' => null,
        'y1' => null,
        'x2' => null,
        'y2' => null,

        'fontSize'     => null,
        'xScale'       => null,
        'HeightDir'    => null,
        'WidthDir'     => null,
        'WidthOfSpace' => null,
    );

    // Left, Bottom coordinate.
    $coord['x1'] = ($coord_xy[0] / $pdf_r) * $pdf_d;
    $coord['y2'] = ($coord_xy[1] / $pdf_r) * $pdf_d;

    $coord['fontSize']     = $coord_info[1]; // font size.
    $coord['xScale']       = $coord_info[2]; // x size scale.
    $coord['HeightDir']    = $coord_info[3]; // height.
    $coord['WidthDir']     = $coord_info[5]; // word width.
    $coord['WidthOfSpace'] = ($coord_info[4] / $pdf_r) * $pdf_d; // width of space.

    // Right, Top coordinate.
    $coord['x2'] = $coord['x1'] + (($coord['WidthDir'] / $pdf_r) * $pdf_d);
    $coord['y1'] = $coord['y2'] - (($coord['HeightDir'] / $pdf_r) * $pdf_d);

    return $coord;
}

- 加拿大人

© www.soinside.com 2019 - 2024. All rights reserved.