Re: OT: Copying text from a PDF



On Wed, 01 Jun 2005 07:00:21 +0100, the renowned Terry Pinnell
<terrypinDELETE@xxxxxxxxxxxxxxxxxxx> wrote:

>Quite often I have trouble extracting text from a PDF. I use the Text
>tool, copy, but on then pasting into my text editor I get garbage.
>Each individual character gets a return inserted. Typical example is
>at http://www.fairchildsemi.com/ds/BU/BUZ11.pdf, where I just wanted
>to extract the details under 'Absolute Maximum Ratings'.
>
>What's the deal here please? If the document is proprietorially
>protected, wouldn't the Text tool be inaccessible?

One thing I notice that's amiss is that there is a carriage return
before and after subscripted text. So:

V 50 V
DS

Comes out as V<CR>DS<CR> 50 V
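
That particular breakage is regular enough that a short script could probably patch it up. Here is a minimal Python sketch, assuming the pasted text really does arrive as a one-letter symbol on its own line followed by a short upper-case subscript on the next. The helper name and the SYMBOL_SUBSCRIPT output form are my own choices for illustration, and the heuristic will misfire on any unrelated pair of lines that happens to look the same.

```python
import re

# Hypothetical helper, not part of any tool mentioned in this thread.
# Matches: a one-letter line, then a 1-4 letter upper-case line, then any
# leading blanks on the line that follows.
SUBSCRIPT_BREAK = re.compile(r"^([A-Za-z])\n([A-Z]{1,4})\n[ \t]*", re.MULTILINE)

def rejoin_subscripts(pasted: str) -> str:
    # Normalise CR and CRLF line endings to LF first.
    text = pasted.replace("\r\n", "\n").replace("\r", "\n")
    # Assumption: such a pair of lines is a symbol plus its subscript.
    # Write it back on one line as SYMBOL_SUBSCRIPT.
    return SUBSCRIPT_BREAK.sub(r"\1_\2 ", text)

print(rejoin_subscripts("V\rDS\r 50 V"))  # V_DS 50 V
```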

The symbol characters (degrees and ohms) also tend to get
translated/screwed up, depending on where you're pasting to. Some
lines are also scrambled, so that the ends of some lines end up
together on later lines.

Problems in extracting text are mostly a function of the application
that created the PDF (FrameMaker 5.5 for the Power PC set to
LaserWriter 8 8.7 and Acrobat Distiller 4.0 for Macintosh in this
case). If you open this document in Illustrator you can see many
individual blocks of text, some of which the copy operation strings
together, and others which it misses.

This stuff is fairly easily fixed by a bit of editing, although those
dot leaders are irritating to fix. I tried pasting into a text-only
application (UltraEdit), Excel, the OpenOffice text editor and into
MS Word, and all came out pretty much the same except for the symbols.
Cleaning up the pasted text might even be faster than re-typing
everything.
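
The dot leaders are the sort of thing a regular expression can take off your hands. A rough Python sketch follows, assuming a leader is a run of four or more dots with optional spaces between them. The threshold of four and the sample line are made up for illustration, not taken from the datasheet. Each leader is replaced by a tab, so the result should split into columns when pasted into a spreadsheet.

```python
import re

# Four or more dots, with optional blanks between them, count as a leader.
# Shorter runs (a decimal point, a three-dot ellipsis) are left alone.
DOT_LEADER = re.compile(r"[ \t]*(?:\.[ \t]*){4,}")

def strip_dot_leaders(line: str) -> str:
    # Replace each leader, plus the blanks around it, with a single tab.
    return DOT_LEADER.sub("\t", line)

sample = "Drain-Source Voltage . . . . . . 50 V"  # made-up sample line
print(repr(strip_dot_leaders(sample)))  # 'Drain-Source Voltage\t50 V'
```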

Extracting text using GSView in "normal" mode is only slightly better.


Best regards,
Spehro Pefhany
--
"it's the network..." "The Journey is the reward"
speff@xxxxxxxxxxxx Info for manufacturers: http://www.trexon.com
Embedded software/hardware/analog Info for designers: http://www.speff.com