Opetion February 2016

Extracting and Printing text positions

I've been experimenting with PDFBox and I'm currently stuck on an issue that I suspect has to do with the coordinate system.
I'm extending PDFTextStripper to get the X and Y of each character on a PDF page.
Originally I created an image with ImageIO, printed the text at the positions I received, and put a small mark (rectangles in different colors) at the bottom of each reference I wanted, and everything seemed fine. Now, to avoid losing the PDF's styling, I want to draw those marks as an overlay on the PDF itself, but the coordinates I get don't match up in PDPageContentStream.
How can I map the coordinates I get from PDFTextStripper -> processTextPosition to the visual coordinates on the page?

Using version 1.8.11

Answers


Tilman Hausherr February 2016

As discussed in the comments, this is the 1.8 version of the DrawPrintTextLocations tool, which is part of the examples collection of the 2.0 version and is based on the better-known PrintTextLocations example. Unlike the 2.0 version, this one does not show the font bounding boxes, only the text extraction sizes, which are about the height of a small glyph (a, e, etc.). That size is used as a heuristic for text extraction, and it is the cause of the "the textpositions i'm getting are halfed" effect here. If you need bounding boxes, better use 2.0 (though those boxes may be too big). To get exact sizes, you would have to calculate the path of each glyph and take its bounds; again, you'd need the 2.0 version for that.
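Separately from the tool itself, the mismatch described in the question is commonly an origin problem: as far as I know, the X/Y values reported by TextPosition are measured from the top-left corner of the page with Y growing downward, while PDPageContentStream draws in PDF user space, where the origin is at the bottom-left and Y grows upward. Here is a minimal, PDFBox-free sketch of that flip. The class and method names (CoordinateFlip, toPdfY, markerLowerLeftY) are made up for illustration, and it assumes an unrotated page whose visible box starts at (0, 0); pageHeight would come from the page's media box.

```java
// Hypothetical helper, not part of PDFBox: converts text-extraction
// coordinates (origin top-left, y down) to PDF user space
// (origin bottom-left, y up). Assumes no page rotation and no box offset.
final class CoordinateFlip
{
    private CoordinateFlip()
    {
    }

    // Flip a single y value measured from the top of the page.
    static float toPdfY(float textY, float pageHeight)
    {
        return pageHeight - textY;
    }

    // A mark drawn just below the baseline covers textY..textY+markHeight
    // in top-down coordinates. PDF rectangles are specified by their
    // lower-left corner, so the flipped y is that of the mark's bottom edge.
    static float markerLowerLeftY(float textY, float markHeight, float pageHeight)
    {
        return pageHeight - (textY + markHeight);
    }

    public static void main(String[] args)
    {
        float pageHeight = 792f; // US Letter height in points, as an example
        float textY = 100f;      // example baseline, 100 pt below the top edge
        System.out.println(toPdfY(textY, pageHeight));
        System.out.println(markerLowerLeftY(textY, 3f, pageHeight));
    }
}
```

X needs no change under these assumptions. If the page is rotated or its crop box does not start at the origin, the mapping needs extra terms that this sketch leaves out.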

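// Imports for the part of the listing shown here (PDFBox 1.8 package names).
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;

public class DrawPrintTextLocations extends PDFTextStripper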
{
    private BufferedImage image;
    private final String filename;
    static final int SCALE = 4;
    private Graphics2D g2d;
    private final PDDocument document;

    /**
     * Instantiate a new PDFTextStripper object.
     *
     * @param document
     * @param filename
     * @throws IOException If there is an error loading the properties.
     */
    public DrawPrintTextLocations(PDDocument document, String filename) throws IOException
    {
        this.document = document;
        this.filename = filename;
    }

    /**
     * This will print the document's data.
     *
     * @param args The command line arguments.
     *
     * @throws IOException If there is an error parsing the document.
     */
    public static void main(String[] args) throws IOException
    {
        if (args.length != 1)
        {
            usage();
        }
        else
        {
            PDDocument document = null;
            try
            {
                document = PDDocument.load(new File(args[0]));

                DrawPrintTextLocations stripper = new DrawPrintTextLocations(document, args[0]);
                stripper.setSortByPosition(true);

                for (int p 

