Converting between DOCX and HTML considered harmful

So, you are building a web-based application, and the requirements include that users can view or edit documents.

Let’s assume these are Word documents.  Treating the Word document as the canonical version (the “single source”, or “one version of the truth”) makes sense, because this is what users expect, and they will complain if what they see looks different from what they’d see in Word.

So how do you ensure that what they see looks the same as what they’d see in Word?  And if you can meet this requirement, how does editing work?

It’s fair to say that the position has typically been: fidelity or editing, pick one.

If you start by tackling the editing problem, you might think: convert docx to HTML, edit with CKEditor or TinyMCE, then convert the output back to docx.  Job done!

The problem with that, though, is a never-ending stream of annoying fidelity problems, especially around numbering.  Either it doesn’t look right in the web editor, or it looks OK there, but wrong back in Word.
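
To make the numbering problem concrete, here is a minimal sketch in Python of a docx to HTML to docx round trip. Everything in it is hypothetical: the Paragraph class is a toy stand-in rather than the real Office Open XML model, and to_html and from_html are deliberately naive converters invented for illustration, not the behaviour of any particular product. The point is only that a plain <ol><li> has nowhere to carry Word's numbering definition, level, or number format, so the importer has to guess.

```python
from dataclasses import dataclass
from html import escape
from html.parser import HTMLParser
from typing import List, Optional


@dataclass(frozen=True)
class Paragraph:
    """Toy stand-in for a Word paragraph (hypothetical, not real OOXML)."""
    text: str
    num_id: Optional[int] = None    # which numbering definition applies, if any
    level: int = 0                  # level within that definition
    number_format: str = "decimal"  # e.g. "1." or "1.1"


def to_html(paragraphs: List[Paragraph]) -> str:
    """Naive export: numbered paragraphs become <li>, the rest <p>."""
    parts, in_list = [], False
    for p in paragraphs:
        if p.num_id is not None:
            if not in_list:
                parts.append("<ol>")
                in_list = True
            parts.append(f"<li>{escape(p.text)}</li>")
        else:
            if in_list:
                parts.append("</ol>")
                in_list = False
            parts.append(f"<p>{escape(p.text)}</p>")
    if in_list:
        parts.append("</ol>")
    return "".join(parts)


class _Importer(HTMLParser):
    """Naive import: every <li> gets a default numbering definition."""

    def __init__(self) -> None:
        super().__init__()
        self.paragraphs: List[Paragraph] = []
        self._tag: Optional[str] = None
        self._buf: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("p", "li"):
            self._tag, self._buf = tag, []

    def handle_data(self, data):
        if self._tag is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == self._tag:
            text = "".join(self._buf)
            # The HTML says "list item" and nothing more, so we must guess.
            num_id = 1 if tag == "li" else None
            self.paragraphs.append(Paragraph(text, num_id=num_id))
            self._tag = None


def from_html(html: str) -> List[Paragraph]:
    parser = _Importer()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


original = [
    Paragraph("Definitions", num_id=3, level=0, number_format="1."),
    Paragraph("Interpretation", num_id=3, level=1, number_format="1.1"),
    Paragraph("Unnumbered note"),
]
round_tripped = from_html(to_html(original))
```

In this sketch the text of every paragraph survives the round trip, but the second paragraph comes back at level 0 with a default format, so a clause that was a sub-clause in Word would now be numbered as a top-level one. A real converter can smuggle more through with extra attributes or CSS, but each such workaround is another place for the two representations to drift apart.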

A golden rule is to avoid conversion between document formats wherever possible!

Don’t get me wrong, CKEditor and TinyMCE are both great products, terrific for editing HTML across Chrome, Firefox, Safari and Microsoft’s browsers.  This blog uses WordPress, so I’m actually writing this post in TinyMCE.  It’s just that they were never designed for editing Word documents, so it’s no surprise that this use case doesn’t work so well.  Word and the web do not play nicely together.

Or you might start with a PDF, knowing you can convert docx to PDF and retain the visual fidelity.  Then there’s Mozilla’s pdf.js (or maybe Google’s PDFium) for viewing it in the browser.

But you can’t edit it!  Well, some PDF tools enable you to do a little bit, but good luck inserting a new sentence or paragraph.

So HTML and PDF are dead ends.

What other options are there?  The following spring to mind:

  • rely on Word being installed
  • Word Online
  • Google Docs

But your best option is Native Documents, which allows you as a developer to work directly with the docx in a web browser, so you never need to worry about converting docx to HTML, or XHTML back to docx.  Our editor never converts to HTML: it uses Office Open XML natively, and renders using React.

To show why this is your best option for a “native” docx architecture, our next posts will explore the Word desktop and Word Online options.

Google Docs we’ll put to one side.  It’s not designed to be integrated into a webapp, it involves sending potentially sensitive information to Google, and it still has a bunch of issues with Word documents which will never be solved (10 years+ so far…).