

GetDocumentText
# Posted: 20 Oct 2010 03:38

Hi Pilot Team,
I am trying to generate text output from a PDF document using the IPDFDocument4 interface and GetDocumentText().
I write the result of GetDocumentText() to a text file with an ofstream in Visual C++ 2010,
e.g. out << PDF->GetDocumentText() << endl;
Whenever I do this, the text loses its formatting when opened in Notepad and looks strange. Carriage returns and blanks end up in different places, and instead of, for example:
1 2 3 4
something like

3 4
appears when the output text file is opened in Notepad.

How can I handle this in C++ with the IPDFDocument4 interface so that the formatting is preserved in the Notepad .txt file?

Thank you in advance.

Best regards


# Posted: 20 Oct 2010 04:20

Additional information:

I used:
the source code of PDF2Text Pilot as a "text from PDF" extractor example
PDF2Text Pilot 2.1
compiler: Visual Studio 2010

# Posted: 20 Oct 2010 10:03

Hello Max,

The problem is that the plain text format (.txt) does not contain any positioning information, whereas the text inside a PDF file does.
The GetDocumentText and GetPageText methods work like this: they scan a page for text and look at the position of each word. If two words have the same Y coordinate but different X coordinates, we put them together on one line ("Hello World"); if the words' Y coordinates differ as well, we insert a new-line character.
Text extraction is mainly intended for indexing and search. The plain text will definitely lose its original formatting...
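
To make the line-grouping rule described above more concrete, here is a minimal C++ sketch of the idea. It is not the actual PDF2Text Pilot implementation: the Word struct, the layoutWords function and the yTolerance parameter are hypothetical names introduced for illustration only, and the sketch assumes that Y grows downward on the page and that word positions are already known.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <string>
#include <vector>

// Hypothetical record: a word plus its position on the page.
struct Word {
    std::string text;
    double x;  // horizontal position
    double y;  // vertical position (assumed to grow downward)
};

// Joins words into lines of plain text. Words whose Y coordinate is within
// yTolerance of the first word of the current line stay on that line;
// a larger difference in Y starts a new line.
std::string layoutWords(std::vector<Word> words, double yTolerance) {
    assert(yTolerance >= 0.0);

    // Order the words from the top of the page to the bottom.
    std::stable_sort(words.begin(), words.end(),
                     [](const Word& a, const Word& b) { return a.y < b.y; });

    // Split the sorted words into lines based on their Y coordinate.
    std::vector<std::vector<Word>> lines;
    for (const Word& w : words) {
        if (lines.empty() ||
            std::fabs(w.y - lines.back().front().y) > yTolerance) {
            lines.emplace_back();
        }
        lines.back().push_back(w);
    }

    // Within each line, order the words from left to right and join them
    // with a single space. The original horizontal spacing is not kept.
    std::string out;
    for (std::vector<Word>& line : lines) {
        std::stable_sort(line.begin(), line.end(),
                         [](const Word& a, const Word& b) { return a.x < b.x; });
        for (std::size_t i = 0; i < line.size(); ++i) {
            if (i > 0) out += ' ';
            out += line[i].text;
        }
        out += '\n';
    }
    return out;
}
```

The sketch also shows why the layout is lost: only the order of the words survives, while the distances between them are reduced to a single space or a single new-line character. Keeping columns aligned would require using the X coordinates to insert padding, which this sketch does not attempt.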

Max Filimonov,


