Saturday, September 19, 2015

A long long time ago ... (XML from Word)

A very long time ago (more than 15 years), I worked on a product that allowed you to take inputs from various formats and restructure them as XML (or SGML).  It was a very useful tool, and made it very easy to convert Word documents to XML, especially when those documents didn't have a great deal of nested structure.

This is fairly common: Word, HTML and many other file formats don't really nest content under heading levels the way you would structure information in XML.  When you wind up with a document that carries a lot of "structural" information in its styles, getting that structural information represented in your XML can be very handy.  But it can be a royal PITA to get that structure back from the Word document.

I used to do this with a Word macro, but these days I find it easier to extract the styled information into an HTML file.  Use "Save As..." and choose Filtered HTML as the output format, and what you get is pretty decent HTML without a lot of Word-specific gunge.  Your next step is to remove all the stupid content in between the <o:p> and </o:p> tags that Word inserts to support empty-paragraph and whitespace handling in various versions of the IE browser (from about 5.X on, each version changed things that needed special HTML handling).
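If you would rather script the <o:p> cleanup than do it by hand, here is a minimal Python sketch.  It assumes the tags themselves should go along with their content, and that a regular expression is good enough for this markup; the function name strip_office_paragraphs is my own invention, not part of any tool.

```python
import re

# Matches a self-closing <o:p/> or an <o:p>...</o:p> pair with its content.
# Assumption: <o:p> elements are never nested inside each other.
O_P = re.compile(r"<o:p\s*/>|<o:p\b[^>]*>.*?</o:p>", re.IGNORECASE | re.DOTALL)


def strip_office_paragraphs(html):
    """Return html with every <o:p> element (tags and content) removed."""
    return O_P.sub("", html)


sample = "<p class=MsoNormal>Text<o:p>&nbsp;</o:p></p>"
cleaned = strip_office_paragraphs(sample)
print(cleaned)  # prints <p class=MsoNormal>Text</p>
```

Run it over the saved HTML before handing the file to the tidy step, since the o: prefix is not something an XHTML cleaner knows about.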

After you've done that, you need to tidy up the HTML so that it is proper XHTML before you begin the final phase of restructuring.  To do this, I use JTidy, the Java implementation of Dave Raggett's Tidy program.  The command line is fairly simple:

java -jar jtidy.jar -m -asxml filename

This command will read filename, clean up the HTML and turn it into XHTML (-asxml), and then modify (-m) the original file so that it contains the cleaned-up output.

So what was
<p class=foo><span class=bar>Stuff<br></span></p> 
becomes:
<p class='foo'><span class='bar'>Stuff<br/></span></p> 
This will make your life a lot easier in the next two steps.

The next step simply uses the class attribute as the element name in the output.  So all tags are now rewritten using the class names (which were originally your style names in Word).  Here's the stylesheet to start XML-ifying the XHTML.

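<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns=""
  xmlns:html="http://www.w3.org/1999/xhtml"
  version="1.0">
  <xsl:output method="xml" indent="yes"/>
  <xsl:strip-space elements="*"/>
  <!-- Drop the HTML head entirely -->
  <xsl:template match="html:head"/>
  <!-- The body becomes a <content> wrapper -->
  <xsl:template match="html:body">
    <content>
      <xsl:apply-templates/>
    </content>
  </xsl:template>
  <!-- Every other XHTML element is renamed after its class attribute -->
  <xsl:template match="html:*">
    <xsl:choose>
      <!-- XML names cannot start with a digit, so prefix those with "_".
           An empty or missing class also lands here (contains() is true
           for an empty string), so unclassed elements come out as <_>. -->
      <xsl:when test="contains('1234567890',substring(@class,1,1))">
        <xsl:element name='_{@class}'>
          <xsl:apply-templates/>
        </xsl:element>
      </xsl:when>
      <xsl:otherwise>
        <xsl:element name='{@class}'>
          <xsl:apply-templates/>
        </xsl:element>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>
  <!-- Anchors are reduced to an id attribute on the element that contains
       them; this only works when the anchor precedes any other content
       of that element. -->
  <xsl:template match="html:a">
    <xsl:attribute name="id">
      <xsl:value-of select="@id"/>
    </xsl:attribute>
  </xsl:template>
</xsl:stylesheet>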
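
To see what the renaming rule does to a small input, here is a rough Python equivalent.  It is only an illustration of the idea, not the stylesheet itself: it starts at the body, ignores the anchor handling, and uses xml.etree.ElementTree instead of XSLT.  The function names and the sample class names (Heading1, 1stPara, bold) are made up for the example.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "{http://www.w3.org/1999/xhtml}"


def element_name(class_value):
    """Turn a class value into an element name, as the stylesheet does."""
    # XML names cannot start with a digit, so those get a "_" prefix.
    # An empty class also ends up as "_", matching the XPath contains() test.
    if class_value[:1] in "1234567890":
        return "_" + class_value
    return class_value


def rename(node):
    """Copy node and its descendants, naming each element after its class."""
    new = ET.Element(element_name(node.get("class", "")))
    new.text, new.tail = node.text, node.tail
    new.extend([rename(child) for child in node])
    return new


def restructure(xhtml):
    """Skip the head and wrap the renamed body children in <content>."""
    body = ET.fromstring(xhtml).find(XHTML_NS + "body")
    content = ET.Element("content")
    content.text = body.text
    content.extend([rename(child) for child in body])
    return content


sample = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><head><title>t</title></head>'
    '<body><p class="Heading1">Intro</p>'
    '<p class="1stPara">Some <span class="bold">text</span></p></body></html>'
)
result = ET.tostring(restructure(sample), encoding="unicode")
print(result)
# prints <content><Heading1>Intro</Heading1><_1stPara>Some <bold>text</bold></_1stPara></content>
```

Notice that the output is still flat: Heading1 and the paragraph after it are siblings, which is exactly the problem left to solve.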

Now, you still have this flattened XML.  What you need to do is "unflatten" it, and I'll explain how to do that in my next post.
