How to compare two word documents that consists of same content from html and pdf?
Apr. 03, 2008 in Other
nethagi007 asked:
I copied the content from the HTML page into one Word document and the same content from the PDF into another Word document. When I compare the two Word documents, Word shows some “Insertion” and “Deletion” marks, but when I check manually there are no insertions or deletions between the two documents. Can anyone help me solve this?
April 5th, 2008 at 12:53 am
You are comparing these documents using the Compare & Merge feature under the Tools menu in Word?
You will have difficulty comparing these two documents (seeing as they are not Word documents to start with).
The HTML content would have carried over to Word fairly cleanly (at least, better than the PDF), because Word understands HTML markup and can keep the paragraph structure when you paste it.
The PDF document, however, even though it is able to recognise text and you can highlight it, will not carry across all the formatting that was in the original document. This results in breaks and paragraph marks in strange places.
The extra breaks and paragraph marks are probably the cause of all of these ‘insertions’ and ‘deletions’ when you are comparing the documents.
Hm… how do I demonstrate this with words…
Open both versions in Word and click on the Show/Hide formatting button (¶) on the Standard toolbar; it looks like a backwards P with an extra line on the right side. This shows you where all of the “enters” (paragraph marks, shown as that same backwards P) and “shift + enters” (line breaks, shown as an arrow coming down and pointing left) are within the document. They are invisible when the Show/Hide button is off, so you may not have noticed that there was anything different between the two documents.
Why does the PDF do this? Because when a document is turned into a PDF it ‘forgets’ that it was originally a Word document and becomes more like a printout. It keeps the text and where each piece sits on the page, but it ‘forgets’ the paragraph structure and most of the formatting behind it, so you end up with something that looks like your original document but cannot easily be altered. This is also part of why PDFs are often smaller than the documents they were made from: they don’t have to remember all of the extra stuff that comes with them.
So in your PDF, for example, you may have a sentence like this: “How to compare two word documents that consists of same content from html and pdf?”. On the PDF the sentence runs over two lines. When you copy it into the Word document, the line break comes along with it and forcibly cuts the sentence in half. You will notice a paragraph mark after the word that was at the end of the line on the PDF, even though in the Word document there is ample room for the sentence to stay on one line!
etc etc etc etc etc [previous sentence]. How to compare two
word documents that consists of same content from html and pdf?
I know this doesn’t help you much… and I am afraid (coming from a background of working with documents like this every day) that there is little you can do to compare these documents automatically in Word… you may just have to do it the old-fashioned way.
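To illustrate the forced line break described above, here is a minimal Python sketch. It is not something Word does, and it assumes you can get the plain text of each copy (for example by pasting it into a text file). The function names `normalize` and `differing_words` are made up for this example. The idea is to collapse every line break and run of spaces into a single space before comparing, so that differences caused only by stray paragraph marks disappear and any real wording differences remain.

```python
import difflib
import re


def normalize(text: str) -> str:
    """Collapse every run of whitespace (spaces, tabs, line breaks) into one space."""
    return re.sub(r"\s+", " ", text).strip()


def differing_words(a: str, b: str) -> list:
    """Return the words that differ between a and b once whitespace is normalized.

    Entries starting with "- " appear only in a; entries starting with "+ "
    appear only in b. An empty list means the wording is identical.
    """
    words_a = normalize(a).split(" ")
    words_b = normalize(b).split(" ")
    # ndiff also emits unchanged ("  ") and hint ("? ") lines; keep only real changes.
    return [d for d in difflib.ndiff(words_a, words_b) if d[0] in "+-"]


# The same sentence as it might arrive from the HTML page and from the PDF,
# where the PDF copy carries a hard break after "two".
from_html = "How to compare two word documents that consists of same content from html and pdf?"
from_pdf = "How to compare two\nword documents that consists of same content from html and pdf?"

print(from_html == from_pdf)                        # the raw copies are not equal
print(normalize(from_html) == normalize(from_pdf))  # equal once breaks are collapsed
print(differing_words(from_html, from_pdf))         # no differing words
```

This only compares the words, so any difference in fonts, styles, or layout is ignored. If it reports no differing words, the “insertions” and “deletions” Word shows are most likely just the extra breaks.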