Re: OT: Copying text from a PDF
- From: Spehro Pefhany <speffSNIP@xxxxxxxxxxxxxxxxxxxxxxx>
- Date: Wed, 01 Jun 2005 03:17:53 -0400
On Wed, 01 Jun 2005 07:00:21 +0100, the renowned Terry Pinnell
<terrypinDELETE@xxxxxxxxxxxxxxxxxxx> wrote:
>Quite often I have trouble extracting text from a PDF. I use the Text
>tool, copy, but on then pasting into my text editor I get garbage.
>Each individual character gets a return inserted. Typical example is
>at http://www.fairchildsemi.com/ds/BU/BUZ11.pdf, where I just wanted
>to extract the details under 'Absolute Maximum Ratings'.
>
>What's the deal here please? If the document is proprietorially
>protected, wouldn't the Text tool be inaccessible?
One thing I notice that's amiss is that there is a carriage return
before and after subscripted text. So:
V 50 V
DS
Comes out as V<CR>DS<CR> 50 V
The symbol characters (degrees and ohms) also tend to get
translated/screwed up, depending on where you're pasting to. There are
also some lines screwed up, so the ends of some lines end up together
on later lines.
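If you'd rather not fix the subscript breaks by hand, something like the following Python sketch might do it. It assumes the pasted text really does follow the V<CR>DS<CR> 50 V pattern above (a one- or two-letter symbol on its own line, then a short all-caps subscript on its own line). The function name and the "_" joiner are my own choices for illustration, and the heuristic will misfire on text that happens to look the same.

```python
import re

# Assumed pattern (a heuristic, not a general PDF fix): a line holding a
# 1-2 letter symbol, then a line holding a short all-caps subscript,
# then the rest of the table row. Handles LF or CRLF line endings.
_SUBSCRIPT = re.compile(
    r"^([A-Za-z]{1,2})\r?\n([A-Z]{1,4})\r?\n[ \t]*",
    re.MULTILINE,
)

def rejoin_subscripts(pasted: str) -> str:
    """Glue 'V<CR>DS<CR> 50 V' back together as 'V_DS 50 V'."""
    return _SUBSCRIPT.sub(r"\1_\2 ", pasted)

if __name__ == "__main__":
    print(rejoin_subscripts("V\nDS\n 50 V"))
```

Symbols like degrees and ohms are a separate problem and are not touched here.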
Problems in extracting text are mostly a function of the application
that created the PDF (FrameMaker 5.5 for the Power PC set to
LaserWriter 8 8.7 and Acrobat Distiller 4.0 for Macintosh in this
case). In this case, if you open the document in Illustrator you can
see many individual blocks of text, some of which the copy operation
strings together, and others which it misses.
This stuff is fairly easily fixed by a bit of editing, although those
dot leaders are irritating to remove. I tried pasting into a text-only
application (UltraEdit), Excel, the OpenOffice text editor and
MS Word, and all came out pretty much the same except for the symbols.
Cleaning up the pasted text might even be faster than re-typing everything.
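For the dot leaders, a regular expression can take out most of the tedium. This is only a sketch: the function name and the threshold of four dots are my assumptions, and the sample row in it is made up for illustration rather than copied from the datasheet.

```python
import re

# A run of four or more dots, with or without spaces between them, is
# treated as a leader. Single dots (decimal points, full stops) are kept.
_LEADER = re.compile(r"\s*(?:\.\s*){4,}")

def strip_dot_leaders(line: str) -> str:
    """Collapse '. . . . .' or '.....' leaders into a single space."""
    return _LEADER.sub(" ", line)

if __name__ == "__main__":
    # Hypothetical row, not taken from the PDF.
    print(strip_dot_leaders("Drain current . . . . . . 2.5 A"))
```

Running it line by line over the pasted text leaves a label and a value separated by one space, which is easy to split into columns afterwards.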
Extracting text using GSView in "normal" mode is only slightly better.
Best regards,
Spehro Pefhany
--
"it's the network..." "The Journey is the reward"
speff@xxxxxxxxxxxx Info for manufacturers: http://www.trexon.com
Embedded software/hardware/analog Info for designers: http://www.speff.com