Eric Hartwell's InfoDabble

 

HowTo: Create a MediaWiki Extension for XML/XSL Data

By Eric Hartwell - last updated March 26, 2006

As part of the research for my Apollo 17 project, I needed to prepare a definitive timeline/transcript of the mission. For my first pass, I built a simple HTML table with time, speaker, and text from a single transcript. Then, as I started combining multiple sources, I switched to HTML defined term / definition syntax, with inline comments and later separate footnotes. As I added more and more sources, the HTML got larger, more complex, and harder to manage.

A proper reference needs sources identified, interpretations explained, and a revision history with audit trail. I started to write a .NET web application with its own database, when I realized I already had a web application that does all that - MediaWiki.

Related Articles

I've always believed it's as important to keep a record of what doesn't work as what does. See MediaWiki Programming Side Trips: Potholes, Detours, and Dead Ends, which documents some of the obstacles (or rather, opportunities) I encountered along the way.

MediaWiki extensions

MediaWiki, the software that runs the various Wikimedia projects, allows developers to write their own extensions to the wiki markup. An extension defines an XML-style tag which can be used in the wiki editor like this:

<tagname attribute="some attribute"> some text </tagname>

The attributes and the text between the tags get passed on to a PHP function you implement. This function can then return an HTML string that gets inserted into the output in place of the tags and text. Note that the return string should be HTML, not wiki markup.
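To make the callback contract concrete, here is a minimal sketch of such a function. The tag name (greeting), the attribute (name), and the function name renderGreeting are made up for illustration; only the shape of the callback (text in, attribute array in, HTML string out) comes from the description above.

```php
<?php
# Hypothetical callback for a tag such as <greeting name="World"> hello </greeting>.
# $input is the text between the tags; $argv holds the tag's attributes.
function renderGreeting( $input, $argv ) {
    $name = isset( $argv["name"] ) ? $argv["name"] : "anonymous";

    # The return value is inserted as HTML, so escape anything user-supplied.
    return "<p><b>" . htmlspecialchars( $name ) . "</b>: "
        . htmlspecialchars( trim( $input ) ) . "</p>";
}
```

Calling renderGreeting( " hello ", array( "name" => "World" ) ) directly, the way the parser would for the tag above, returns a small HTML paragraph. Because the result is not parsed as wiki markup, any escaping is the callback's responsibility.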

At first I started writing a custom MediaWiki extension for Apollo transcripts. The custom tag approach requires an event callback function for every tag we want to show. Adding new tags means adding extra code, and each different transcript would require its own custom code and extension.

XML Transform MediaWiki Extension

A better approach is to write a generic XML/XSL extension. Since XML and XSL are totally generic, different content can be handled simply by changing the XSL. This would only require a single tag, reducing the possibility of namespace collisions. It could also be added to the MediaWiki distribution as a standard extension, requiring no coding on the user's part. MediaWiki is CSS friendly, and adding in-line CSS to the contents part of the page works fine.

For this extension to be of any use, the XML data should come from the MediaWiki page and not from an external file. Storing the XML data as the page text means all of the MediaWiki editing and revision tools can be used with it.

Caution: Breaking the style sheet will break all pages that use it - and not necessarily with a useful error message.

Got XSL?

The XSL file needs to be external to the page, but shouldn't be a hard-coded absolute file path. It seems reasonable to store the XSL as an uploaded file, but MediaWiki restricts the types of files that can be uploaded:

If you want to upload other file types than .jpg or .ogg (like for example .pdf), you have to modify the file, LocalSettings.php, by copying the variable, and its contents, from /includes/DefaultSettings.php. You may also need to remove this extension from the filetype blacklist (in /includes/DefaultSettings.php).

To enable uploads of XML, XSD, and XSL files, I added the following to LocalSettings.php:

# Allow upload of additional file types
$wgFileExtensions = array( 'png', 'gif', 'jpg', 'jpeg', 'ogg', 'xsd', 'xsl', 'xml', 'pdf' );

Once you upload the style sheet, it has its own page, Image:AS17FlightTranscript.xsl. Note that MediaWiki uses the term "image" for any uploaded file, regardless of type. Anyway, the XSL file now has the usual permissions and revision history managed by the framework.

After another unreasonable amount of effort (see: Loading a MediaWiki "image" from a function), I managed to locate and load the style sheet "image" and get its absolute path on the server's file system:

$image = Image::newFromName( "AS17FlightTranscript.xsl" );
$XslDoc->load($image->getImagePath());

I also tried loading a MediaWiki page from a function, but I got the "image" upload approach working first.

Got XML DOM?

MediaWiki doesn't have a built-in XML/XSL processor, and neither does PHP 4, which is what my hosted server has installed. [Note: according to PHP.net, "PHP 5 includes the XSL extension by default".]
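For readers who do have PHP 5 or later, the transform itself could be sketched with PHP's own DOMDocument and XSLTProcessor classes instead of an external component. This is not what I used, and it assumes the DOM and XSL extensions are enabled on the server; transformXml() and the sample XML and XSL are invented for illustration.

```php
<?php
# Sketch only: assumes PHP 5 or later with the DOM and XSL extensions enabled.
# transformXml() is a hypothetical helper name, not part of MediaWiki or PHP.
function transformXml( $xmlText, $xslText ) {
    $xmlDoc = new DOMDocument();
    $xslDoc = new DOMDocument();

    # loadXML() returns false when the text is not well-formed.
    if ( !$xmlDoc->loadXML( $xmlText ) || !$xslDoc->loadXML( $xslText ) ) {
        return false;
    }

    $processor = new XSLTProcessor();
    $processor->importStylesheet( $xslDoc );

    # Returns the transformed text, or false on failure.
    return $processor->transformToXml( $xmlDoc );
}

# Made-up sample data: one log entry rendered as a definition list.
$xml = '<?xml version="1.0" encoding="utf-8" ?>'
     . '<log><entry speaker="A">Hello</entry></log>';
$xsl = '<?xml version="1.0" encoding="utf-8" ?>'
     . '<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">'
     . '<xsl:output method="html" />'
     . '<xsl:template match="/log"><dl><xsl:apply-templates select="entry" /></dl></xsl:template>'
     . '<xsl:template match="entry"><dt><xsl:value-of select="@speaker" /></dt>'
     . '<dd><xsl:value-of select="." /></dd></xsl:template>'
     . '</xsl:stylesheet>';

# Only attempt the transform when the XSL extension is actually available.
if ( class_exists( "XSLTProcessor" ) ) {
    echo transformXml( $xml, $xsl );
}
```

The class_exists() check matters on hosts like mine, where the extension may simply not be installed.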

How about running an external process? PHP theoretically supports COM, .NET, and shell-based Program Execution functions. Of course, all these require the function to be uploaded and installed on the server with appropriate rights.

Got XML DOM COM?

My server already has the basic Microsoft COM components installed since the site is hosted on a Windows server. So, this snippet should work:

    $XmlDoc = new COM("MSXML2.DOMDocument");

Eventually I did get it to work: see XML DOM COM in MediaWiki.

Some of the things I wish I knew before I started:

  • The DOMDocument component needs a server-relative absolute path for the file names, not a URL or a relative path.
  • XmlDoc->text and XmlDoc->xml sometimes appear to be empty, even when they're not.
  • XSLT drops spaces between tags, and it does not support &nbsp; because that entity is not defined in XML. Many sources recommend using &#160; as a replacement, but character 160 lies outside the 7-bit ASCII range, and on my setup MediaWiki outputs the &#160; characters as ?, which can be pretty misleading until you puzzle it out.
  • The MSXML parser is stricter than the one in Internet Explorer; IE will display an XSL transform even when the XML and/or XSL aren't perfectly well-formed.
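The first point in the list can be handled before the path ever reaches the component. The helper below is a small sketch using PHP's realpath(); the name toServerPath() is hypothetical, and the URL check is my own assumption about what should be rejected.

```php
<?php
# Hypothetical helper: turn a path into the absolute file system path that
# the DOMDocument component expects. Returns false for URLs and missing files.
function toServerPath( $path ) {
    # Anything of the form scheme://... is a URL, not a file system path.
    if ( preg_match( '#^[a-z][a-z0-9+.-]*://#i', $path ) ) {
        return false;
    }

    # realpath() resolves relative paths and returns false if nothing is there.
    return realpath( $path );
}
```

Checking the result for false before calling load() gives a chance to report a missing file explicitly, rather than ending up with a document that looks empty.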

XmlTransform Extension and Syntax

MediaWiki extensions define an XML-style tag which can be used in the wiki editor. The attributes and the text between the tags get passed on to the PHP function.

<XmlTransform xmlfile="filepath.xml" xslname="imagename.xsl" xslfile="filepath.xsl">
    <xml between the tags />
</XmlTransform>

The XML data is obtained from the text between the <XmlTransform> and </XmlTransform> tags, unless the xmlfile attribute is specified, in which case the XML is loaded from an absolute path on the server's file system.

The XSL data is obtained from an uploaded "image" file if the xslname attribute is specified, or from an absolute path on the server's file system if the xslfile attribute is specified.
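The selection rules can be summarized as a small decision function. This is only a sketch: describeSources() is a hypothetical name, it loads nothing, and the order in which xslfile and xslname are checked follows the extension code later in this article.

```php
<?php
# Hypothetical helper: decide where the XML and XSL come from, based on the
# tag attributes in $argv. It only reports the decision; it loads nothing.
function describeSources( $argv ) {
    $sources = array();

    # XML: the xmlfile attribute wins; otherwise use the text between the tags.
    $sources["xml"] = !empty( $argv["xmlfile"] )
        ? array( "file", $argv["xmlfile"] )
        : array( "inline", null );

    # XSL: a server path (xslfile) first, then an uploaded "image" (xslname).
    if ( !empty( $argv["xslfile"] ) ) {
        $sources["xsl"] = array( "file", $argv["xslfile"] );
    } elseif ( !empty( $argv["xslname"] ) ) {
        $sources["xsl"] = array( "image", $argv["xslname"] );
    } else {
        $sources["xsl"] = array( "missing", null );
    }

    return $sources;
}
```

For a tag with only xslname="AS17FlightTranscript.xsl", this reports inline XML and an "image" style sheet, which is the combination used for the transcript pages.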

The transform output starts with a standard XML declaration:

<!-- start content -->
<?xml version="1.0" encoding="UTF-16"?>

Even though the input is UTF-8, the output is declared as UTF-16 because COM components pass strings as BSTRs, which use 16-bit characters.

The MediaWiki parser seems to have problems with the opening <?xml ?> tag. It usually inserts a paragraph start (<p>), causing it some confusion with the first tag in the content. This can be a real problem, since the next tag is usually the section title which is supposed to be treated as a header. I finally added some code to delete the <?xml ?> tag:

# Delete the ?xml tag
$text = $XmlDoc->transformNode($XslDoc);
$pos = strpos($text, '>');
$output .= substr($text, $pos + 1);
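The snippet above cuts at the first '>' whatever the output starts with, so it would also eat the first real tag if the declaration were ever missing. A slightly more defensive variant might look like the sketch below; stripXmlDeclaration() is a hypothetical name and is not part of the extension as published.

```php
<?php
# Hypothetical helper: remove a leading XML declaration if there is one,
# and leave the text untouched otherwise.
function stripXmlDeclaration( $text ) {
    # Matches optional whitespace, the declaration, and trailing whitespace,
    # but only at the very start of the text.
    return preg_replace( '/^\s*<\?xml\b[^>]*\?>\s*/', '', $text, 1 );
}
```

The pattern is anchored to the start of the string, so text that begins with an ordinary tag such as a section title passes through unchanged.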

XmlTransform = MediaWiki + PHP + XSL "Image" + XML DOM COM

Here's the XmlTransform extension that finally works to produce the Apollo 17 Flight Journal. Note that this code only works on Windows servers, since it relies on the MSXML COM component:

<?php

# XmlTransform MediaWiki extension
#
# With MediaWiki's extension mechanism it is possible to define new tags of the form
# <XmlTransform> some text </XmlTransform>
# The function registered by the extension gets the text between the tags as input and can transform it into arbitrary HTML code.
# Note: The output is not interpreted as WikiText but directly included in the HTML output. So Wiki markup is not supported.
# To activate the extension, include it from your LocalSettings.php with: include("extensions/XmlTransform.php");

$wgExtensionFunctions[] = "wfXmlTransform";

function wfXmlTransform() {
    global $wgParser;

    # Register the extension with the WikiText parser.
    # The first parameter is the name of the new tag. In this case it defines the tag <XmlTransform> ... </XmlTransform>
    # The second parameter is the callback function for processing the text between the tags
    $wgParser->setHook( "XmlTransform", "renderXmlTransform" );
}

# The callback function for converting the input text to HTML output
function renderXmlTransform( $input, $argv ) {
    # $argv is an array containing any arguments passed to the extension like <example argument="foo" bar>..
    # The DOMDocument component needs a server-relative path for the file names !!!
    #   <XmlTransform xmlfile="filepath.xml" xslfile="filepath.xsl"><xml between the tags></XmlTransform>

    $output = "";

    $XmlDoc = new COM("MSXML2.DOMDocument");
    $XslDoc = new COM("MSXML2.DOMDocument");
    $XmlDoc->async = false;
    $XslDoc->async = false;

    # Load the XML from a file if xmlfile is given, otherwise from the text between the tags
    if ( !empty( $argv["xmlfile"] ) )
    {
        $XmlDoc->load($argv["xmlfile"]);
    }
    else
    {
        $xml = '<?xml version="1.0" encoding="utf-8" ?><XmlTransformInput>' . $input . "</XmlTransformInput>";
        $XmlDoc->loadXML($xml);
    }

    # Load the XSL from a file path (xslfile) or from an uploaded "image" (xslname)
    if ( !empty( $argv["xslfile"] ) )
    {
        $XslDoc->load($argv["xslfile"]);
    }
    elseif ( !empty( $argv["xslname"] ) )
    {
        $image = Image::newFromName( $argv["xslname"] );
        $XslDoc->load($image->getImagePath());
        $image = NULL;
    }
    else
    {
        # No style sheet given: report it and skip the transform
        return "<h3><font color=red>(Oops) Trying to load XSL from the twilight zone.</font></h3><br />";
    }

    # Transform XML with XSL
    $text = $XmlDoc->transformNode($XslDoc);

    # Release the COM objects
    $XmlDoc = NULL;
    $XslDoc = NULL;

    # Delete the ?xml tag (everything up to and including the first '>')
    $pos = strpos($text, '>');
    $output .= substr($text, $pos + 1);

    return $output;
}

?>

The renderXmlTransform() function wraps the input text with the mandatory xml header <?xml version="1.0" encoding="utf-8" ?>, the opening tag <XmlTransformInput>, and the closing tag </XmlTransformInput> so they don't need to be manually added to every page.


Enhancements Needed:

  • The function should preserve the user's <?xml tag if there is one, such as when a special namespace is needed.
  • Since the XSL transform gives complete control over the output HTML, the function should (optionally) escape any MediaWiki tags.
  • The function should (optionally) escape any HTML tags in the output.
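As a rough sketch of how the first and last of these enhancements might look (they are not implemented in the extension above), the helpers below keep a user-supplied declaration when wrapping the input and optionally escape the finished output. The names wrapXmlInput() and finishOutput() are hypothetical, and escaping of MediaWiki tags is left out because I have not worked out how to do it.

```php
<?php
# Hypothetical helper: wrap page text in the <XmlTransformInput> root element,
# keeping the user's own XML declaration when the text starts with one.
function wrapXmlInput( $input ) {
    $declaration = '<?xml version="1.0" encoding="utf-8" ?>';
    $body = ltrim( $input );

    # If the page text begins with its own declaration, reuse it as-is.
    if ( preg_match( '/^<\?xml\b[^>]*\?>/', $body, $match ) ) {
        $declaration = $match[0];
        $body = substr( $body, strlen( $match[0] ) );
    }

    return $declaration . "<XmlTransformInput>" . $body . "</XmlTransformInput>";
}

# Hypothetical helper: optionally escape HTML tags in the transform output.
function finishOutput( $html, $escapeHtml = false ) {
    return $escapeHtml ? htmlspecialchars( $html ) : $html;
}
```

Moving the declaration in front of the wrapper element keeps the document well-formed, since an XML declaration is only allowed at the very start of the text.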

Revision History

  • March 26, 2006 - initial version

Unless otherwise noted, this work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 License

 
