


GetDocumentText (Two Pilots forum, For software developers)
Author Message
Max
# Posted: 20 Oct 2010 03:38


Hi Pilot Team,
I am trying to generate text output from a PDF document using the IPDFDocument4 interface and GetDocumentText().
I write the result of GetDocumentText() to a text file with an ofstream in Visual C++ 2010:
..
out << PDF->GetDocumentText() << endl;
out.close();
..
Whenever I do this, the text loses its formatting in Notepad and looks strange. Carriage returns and blanks end up in different places, so that instead of, for example:
"
1 2 3 4
"
something like
"
1
2

3 4
"
appears when the output text is opened in Notepad.

How can I control this in C++ through the IPDFDocument4 interface so that the format is preserved in the Notepad .txt file?

Thank you in advance.

Best regards

Max

Max
# Posted: 20 Oct 2010 04:20


Additional information:

I used:
- the source code of PDF2Text Pilot as a "text from PDF" extractor example
- PDF2Text Pilot 2.1
- the Visual Studio 2010 compiler
Thanks,
Max

max.f
# Posted: 20 Oct 2010 10:03


Hello Max,

The problem is that the plain-text format (txt) carries no positioning information, whereas the text inside a PDF file does.
The methods GetDocumentText and GetPageText work like this: they scan a page for text and look at each word's position. If two words have the same Y coordinate but different X coordinates, we put them together on one line ("Hello World"), but if the words' Y coordinates differ as well, we insert a newline character
("Hello
World").
The main goal of text extraction is indexing and search. Plain text will definitely lose its original formatting...
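
To make the line-grouping rule described above concrete, here is a minimal C++ sketch of the idea. It is not the library's actual code: the `Word` struct, the `groupWordsIntoLines` function and the `yTolerance` parameter are all hypothetical names introduced for illustration, and the sketch assumes that Y grows from the top of the page downward. The tolerance is an added assumption as well; with a tolerance of 0, any difference in Y starts a new line, which would produce output like the "1 / 2 / 3 4" example when digits sit on slightly different baselines.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical word record; NOT part of the IPDFDocument4 interface.
struct Word {
    double x;          // horizontal position on the page
    double y;          // vertical position (assumed to grow downward)
    std::string text;  // the word itself
};

// Groups words into text lines by their Y coordinate.
// Words whose Y lies within yTolerance of the first word of the current
// line stay on that line; any other word starts a new line.
// yTolerance is an illustrative assumption, not a library setting.
std::string groupWordsIntoLines(std::vector<Word> words, double yTolerance) {
    assert(yTolerance >= 0.0);

    // Order the words from the top of the page to the bottom.
    std::stable_sort(words.begin(), words.end(),
                     [](const Word& a, const Word& b) { return a.y < b.y; });

    // Split the sorted words into lines wherever Y jumps by more
    // than the tolerance.
    std::vector<std::vector<Word>> lines;
    for (const Word& w : words) {
        if (lines.empty() ||
            std::fabs(w.y - lines.back().front().y) > yTolerance) {
            lines.emplace_back();
        }
        lines.back().push_back(w);
    }

    // Within each line, order the words left to right and join them
    // with single spaces; the original X spacing is discarded.
    std::string out;
    for (auto& line : lines) {
        std::stable_sort(line.begin(), line.end(),
                         [](const Word& a, const Word& b) { return a.x < b.x; });
        for (std::size_t i = 0; i < line.size(); ++i) {
            if (i > 0) out += ' ';
            out += line[i].text;
        }
        out += '\n';
    }
    return out;
}
```

Called with "Hello" at (0, 10) and "World" at (40, 10), the function returns one line; move "World" to Y = 12 with a tolerance of 0.5 and it returns two. Either way the X distances are thrown away, which is why column layout cannot survive in a .txt file.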

--
Max Filimonov,
max.f@colorpilot.org

 

 
