HowTo: Create a MediaWiki Extension for XML/XSL Data
By Eric Hartwell - last updated March 26, 2006
As part of the research for my Apollo 17 project, I
needed to prepare a definitive timeline/transcript of the mission. For my first
pass, I built a simple HTML table with time, speaker, and text from a single
transcript. Then, as I started combining multiple sources, I switched to HTML
defined term / definition syntax, with inline comments and later separate
footnotes. As I added more and more sources, the HTML got larger, more complex,
and harder to manage.
A proper reference needs sources identified, interpretations explained, and a
revision history with audit trail. I started to write a .NET web application with
its own database, when I realized I already had a web application that does all
that - MediaWiki.
Related Articles
I've always believed it's as important to keep a record of what doesn't
work as what does. See MediaWiki Programming Side Trips: Potholes, Detours,
and Dead Ends, which documents some of the obstacles (or opportunities) I
encountered along the way.
MediaWiki extensions
MediaWiki, the software that runs the various Wikimedia projects, allows
developers to write their own extensions to the wiki markup. An extension
defines an XML-style tag which can be used in the wiki editor like this:
<tagname attribute="some attribute"> some text </tagname>
The attributes and the text between the tags get passed on to a PHP function
you implement. This function can then return an HTML string that gets inserted
into the output in place of the tags and text. Note that the return string
should be HTML, not wiki markup.
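As a minimal sketch of such a callback: the tag name "shout" and the function name renderShout are made up for illustration, and only the calling convention (text plus an array of attributes in, HTML string out) comes from the description above.

```php
<?php
// Hypothetical callback for a made-up tag: <shout color="red">some text</shout>
// $input is the text between the tags; $argv is an array of the tag's attributes.
function renderShout($input, $argv) {
    // Fall back to a default when the attribute is missing.
    $color = isset($argv['color']) ? $argv['color'] : 'black';
    // The return value is inserted as HTML, so escape anything user-supplied.
    return '<span style="color: ' . htmlspecialchars($color) . '">'
         . htmlspecialchars(strtoupper(trim($input)))
         . '</span>';
}

echo renderShout(' some text ', array('color' => 'red')), "\n";
```

In a real extension the function would be registered with $wgParser->setHook(), as the full listing later in this article does.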
At first I started writing a custom MediaWiki extension for Apollo transcripts.
The custom tag approach requires an event callback function for every tag we
want to show. Adding new tags means adding extra code, and each different transcript
would require its own custom code and extension.
XML Transform MediaWiki Extension
A better approach is to write a generic XML/XSL extension. Since XML and XSL
are totally generic, different content can be handled simply by changing the XSL.
This would only require a single tag, reducing the possibility of namespace
collisions. It could also be added to the MediaWiki distribution as a standard
extension, requiring no coding on the user's part. MediaWiki is CSS friendly, and adding in-line CSS to the contents part of the
page works fine.
For this extension to be of any use, the XML data should come from the
MediaWiki page and not from an external file. Storing the XML data as the page text means all of the MediaWiki editing and
revision tools can be used with it.
Caution: Breaking the style sheet will break all pages that use it - and not
necessarily with a useful error message.
Got XSL?
The XSL file needs to be external to the page, but
shouldn't be a hard-coded absolute file path. It seems reasonable to store the XSL as an uploaded file, but MediaWiki restricts the types of files that can be
uploaded:
If you want to upload other file types than .jpg or .ogg (like for
example .pdf), you have to modify the file,
LocalSettings.php, by copying the variable, and its contents, from
/includes/DefaultSettings.php.
You may also need to remove this extension from the filetype blacklist (in
/includes/DefaultSettings.php).
To enable uploads of XML, XSD, and XSL files, I added the following to
LocalSettings.php:
# Allow upload of additional file types
$wgFileExtensions = array( 'png', 'gif', 'jpg', 'jpeg', 'ogg', 'xsd', 'xsl', 'xml', 'pdf' );
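An alternative sketch, assuming $wgFileExtensions has already been given its default value by the time LocalSettings.php reaches this line: append the extra types instead of retyping the whole list. The default list shown here is only a stand-in so the snippet runs on its own; the real defaults depend on the MediaWiki version.

```php
<?php
// Stand-in for the value DefaultSettings.php would already have set;
// the exact default list depends on the MediaWiki version (assumption).
$wgFileExtensions = array( 'png', 'gif', 'jpg', 'jpeg', 'ogg' );

// Append the extra types without retyping the defaults. array_unique() drops
// duplicates and array_values() renumbers the keys.
$wgFileExtensions = array_values( array_unique( array_merge(
    $wgFileExtensions,
    array( 'xsd', 'xsl', 'xml', 'pdf' )
) ) );

echo implode( ', ', $wgFileExtensions ), "\n";
```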
Once you upload the style sheet, it has its own page,
Image:AS17FlightTranscript.xsl. Note that MediaWiki uses the term "image"
for any uploaded file, regardless of type. Anyway, the XSL file now has the
usual permissions and revision history managed by the framework.
After another unreasonable amount of effort (see: Loading a MediaWiki "image"
from a function), I managed to locate and load the style sheet "image" and get
its absolute path on the server's file system:
# The tag carries xslname="AS17FlightTranscript.xsl"
$image = Image::newFromName( $argv["xslname"] );
$XslDoc->load($image->getImagePath());
I also tried loading a MediaWiki page from a function, but I got the "image"
upload approach working first.
Got XML DOM?
MediaWiki doesn't have a built-in XML/XSL processor, and neither does PHP 4,
which is installed on my hosted server. [Note: according to PHP.net, "PHP 5
includes the XSL extension by default."]
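For readers on PHP 5 or later, the following is a rough sketch of what the same transform could look like with PHP's DOM and XSL extensions. It is not what this article's extension uses (my server runs PHP 4), the function name transformXmlString is made up, and the XSL extension may still need to be enabled on a given server, so the function returns false when it is missing.

```php
<?php
// Sketch only: transform an XML string with an XSL string using PHP 5's
// DOM and XSL extensions. Returns the result, or false if the XSL extension
// is unavailable or either document fails to parse.
function transformXmlString($xml, $xsl) {
    if (!class_exists('DOMDocument') || !class_exists('XSLTProcessor')) {
        return false; // XSL extension not enabled on this server
    }
    $xmlDoc = new DOMDocument();
    $xslDoc = new DOMDocument();
    if (!@$xmlDoc->loadXML($xml) || !@$xslDoc->loadXML($xsl)) {
        return false;
    }
    $processor = new XSLTProcessor();
    $processor->importStylesheet($xslDoc);
    return $processor->transformToXml($xmlDoc);
}

$xsl = '<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">'
     . '<xsl:output method="html"/>'
     . '<xsl:template match="/line"><b><xsl:value-of select="."/></b></xsl:template>'
     . '</xsl:stylesheet>';
$result = transformXmlString('<line>Hello</line>', $xsl);
echo $result === false ? "XSL not available\n" : $result;
```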
How about running an external process? PHP theoretically supports COM, .NET,
and shell-based Program Execution functions. Of course, all these require the
function to be uploaded and installed on the server with appropriate rights.
Got XML DOM COM?
My server already has the basic Microsoft COM components installed since the
site is hosted on a Windows server. So, this snippet should work:
$XmlDoc = new COM("MSXML2.DOMDocument");
Eventually I did get it to work: see
XML DOM COM in MediaWiki.
Some of the things I wish I knew before I started:
- The DOMDocument component needs a server-relative absolute path for the
file names, not a URL or relative path.
- XmlDoc->text and XmlDoc->xml sometimes appear to be empty, even when
they're not.
- XSLT drops spaces between tags, but does not support &nbsp;. Many sources
recommend using a numeric character reference as a replacement, but it turns
out that character codes higher than #FF are not supported in the output.
MediaWiki displays the replacement character as ?, which can be pretty
misleading until you puzzle it out.
- The MSXML parser is stricter than the one in Internet Explorer; IE
will display an XSL transform even when the XML and/or XSL aren't perfectly
well-formed.
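The first point can be handled before the file name ever reaches the COM object. This is a small sketch with a hypothetical helper, absoluteServerPath, which is not part of the extension: it resolves a path against a base directory and returns the absolute file system path, or false when the file does not exist.

```php
<?php
// Hypothetical helper: resolve $path against $baseDir and return the absolute
// file system path, or false if the file does not exist.
function absoluteServerPath($path, $baseDir) {
    // Treat paths starting with a slash, backslash, or drive letter as absolute.
    $isAbsolute = preg_match('#^([A-Za-z]:[\\\\/]|[\\\\/])#', $path) === 1;
    $candidate = $isAbsolute ? $path : $baseDir . DIRECTORY_SEPARATOR . $path;
    // realpath() also collapses "." and ".." segments and returns false on failure.
    return realpath($candidate);
}

// Usage: create a scratch file and look it up by its bare name.
$tmp = tempnam(sys_get_temp_dir(), 'xsl');
$resolved = absoluteServerPath(basename($tmp), dirname($tmp));
echo $resolved === false ? "not found\n" : "resolved to an absolute path\n";
$missing = absoluteServerPath('no-such-file.xsl', dirname($tmp));
unlink($tmp);
```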
XmlTransform Extension and Syntax
MediaWiki extensions define an XML-style tag which can be used in the wiki editor. The attributes and the text between the tags get passed on to
the PHP function.
<XmlTransform xmlfile="filepath.xml" xslname="imagename.xsl" xslfile="filepath.xsl">
<xml between the tags />
</XmlTransform>
The XML data is obtained from the text between the <XmlTransform> and
</XmlTransform> tags, unless the xmlfile attribute is specified,
in which case the XML is loaded from an absolute path on the server's file
system.
The XSL data is obtained from an uploaded "image" file if the xslname
attribute is specified, or from an absolute path on the server's file system if
the xslfile attribute is specified.
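The rules above can be sketched as a small decision function. The name resolveTransformSources and the 'file' / 'inline' / 'image' / 'none' labels are made up for illustration; the precedence of xslfile over xslname when both are present follows the order of the checks in the full listing further down.

```php
<?php
// Hypothetical helper: decide where the XML and the XSL come from.
// Returns array($xmlSource, $xslSource); each source is array(kind, value).
function resolveTransformSources($input, $argv) {
    // XML: the xmlfile attribute wins, otherwise use the text between the tags.
    if (!empty($argv['xmlfile'])) {
        $xml = array('file', $argv['xmlfile']);
    } else {
        $xml = array('inline', $input);
    }
    // XSL: an absolute path if xslfile is given, else an uploaded "image".
    if (!empty($argv['xslfile'])) {
        $xsl = array('file', $argv['xslfile']);
    } elseif (!empty($argv['xslname'])) {
        $xsl = array('image', $argv['xslname']);
    } else {
        $xsl = array('none', null); // nothing to transform with
    }
    return array($xml, $xsl);
}

list($xml, $xsl) = resolveTransformSources('<row/>', array('xslname' => 'imagename.xsl'));
echo $xml[0], ' / ', $xsl[0], "\n";
```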
The transform output starts with a standard XML declaration:
<!-- start content -->
<?xml version="1.0" encoding="UTF-16"?>
Even though the input is UTF-8, the output is UTF-16 because COM components use
BSTRs, which store text as 16-bit characters.
The MediaWiki parser seems to have problems with the opening <?xml ?> tag. It
usually inserts a paragraph start (<p>), causing it some confusion with the
first tag in the content. This can be a real problem, since the next tag is
usually the section title which is supposed to be treated as a header. I finally
added some code to delete the <?xml ?> tag:
# Delete the ?xml tag
$text = $XmlDoc->transformNode($XslDoc);
$pos = strpos($text, '>');
$output .= substr($text, $pos + 1);
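Cutting at the first '>' assumes the output always begins with the ?xml tag; if a style sheet ever omits it, the first real tag would be removed instead. A slightly more defensive sketch (stripXmlDeclaration is a hypothetical name, not part of the extension) only strips when the text really starts with a declaration:

```php
<?php
// Remove a leading XML declaration if there is one; leave other text untouched.
function stripXmlDeclaration($text) {
    // Match optional whitespace, then "<?xml", its attributes, and the closing marker.
    if (preg_match('/^\s*<\?xml\s[^>]*\?>/', $text, $match)) {
        return substr($text, strlen($match[0]));
    }
    return $text;
}

echo stripXmlDeclaration('<?xml version="1.0" encoding="UTF-16"?><h2>Title</h2>'), "\n";
```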
XmlTransform = MediaWiki + PHP + XSL "Image" + XML DOM COM
Here's the XmlTransform extension that ***finally*** works to produce the
Apollo 17 Flight Journal:
This code only works on Windows servers.
<?php
# XmlTransform MediaWiki extension
#
# With MediaWiki's extension mechanism it is possible to define new tags of the form
#   <XmlTransform> some text </XmlTransform>
# The function registered by the extension gets the text between the tags as
# input and can transform it into arbitrary HTML code.
# Note: The output is not interpreted as WikiText but directly included in the
# HTML output. So Wiki markup is not supported.
# To activate the extension, include it from your LocalSettings.php with:
#   include("extensions/XmlTransform.php");

$wgExtensionFunctions[] = "wfXmlTransform";

function wfXmlTransform() {
    global $wgParser;
    # Register the extension with the WikiText parser.
    # The first parameter is the name of the new tag. In this case it defines
    # the tag <XmlTransform> ... </XmlTransform>
    # The second parameter is the callback function for processing the text
    # between the tags
    $wgParser->setHook( "XmlTransform", "renderXmlTransform" );
}

# The callback function for converting the input text to HTML output
function renderXmlTransform( $input, $argv ) {
    # $argv is an array containing any arguments passed to the extension,
    # like <example argument="foo" bar>..
    # The DOMDocument component needs a server-relative path for the file names !!!
    # <XmlTransform xmlfile="filepath.xml" xslfile="filepath.xsl"><xml between the tags></XmlTransform>
    $output = "";

    $XmlDoc = new COM("MSXML2.DOMDocument");
    $XslDoc = new COM("MSXML2.DOMDocument");
    $XmlDoc->async = false;
    $XslDoc->async = false;

    # Load the XML from a file if xmlfile is given, else from the page text
    if (!empty($argv["xmlfile"])) {
        $XmlDoc->load($argv["xmlfile"]);
    } else {
        $xml = '<?xml version="1.0" encoding="utf-8" ?><XmlTransformInput>'
             . $input
             . "</XmlTransformInput>";
        $XmlDoc->loadXML($xml);
    }

    # Load the XSL from a file path or from an uploaded "image"
    if (!empty($argv["xslfile"])) {
        $XslDoc->load($argv["xslfile"]);
    } elseif (!empty($argv["xslname"])) {
        $image = Image::newFromName( $argv["xslname"] );
        $XslDoc->load($image->getImagePath());
        $image = NULL;
    } else {
        # No style sheet given: release the COM objects and report the problem
        # instead of attempting a transform that cannot work
        $XmlDoc = NULL;
        $XslDoc = NULL;
        return "<h3><font color=red>(Oops) Trying to load XSL from the twilight zone.</font></h3><br />";
    }

    # Transform XML with XSL
    $text = $XmlDoc->transformNode($XslDoc);
    $XmlDoc = NULL;
    $XslDoc = NULL;

    # Delete the ?xml tag
    $pos = strpos($text, '>');
    $output .= substr($text, $pos + 1);

    return $output;
}
?>
The renderXmlTransform() function wraps the input text with the mandatory XML
declaration <?xml version="1.0" encoding="utf-8" ?>, the opening tag
<XmlTransformInput>, and the closing tag </XmlTransformInput>, so they don't
need to be manually added to every page.
Enhancements Needed:
- The function should preserve the user's <?xml tag if there is
one, such as when a special namespace is needed.
- Since the XSL transform gives complete control over the output HTML, the
function should (optionally) escape any MediaWiki tags.
- The function should (optionally) escape any HTML tags in the output.
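The first enhancement could be sketched roughly as follows. This is not implemented in the extension above, and wrapXmlInput is a hypothetical name: the function keeps the page's own <?xml tag when the text starts with one, and falls back to the default otherwise.

```php
<?php
// Sketch: wrap page text in <XmlTransformInput>, keeping the user's own XML
// declaration when the text starts with one and adding a default otherwise.
function wrapXmlInput($input) {
    $declaration = '<?xml version="1.0" encoding="utf-8" ?>';
    $body = ltrim($input);
    if (preg_match('/^<\?xml\s[^>]*\?>/', $body, $match)) {
        // Move the user's declaration to the front, ahead of the wrapper tag.
        $declaration = $match[0];
        $body = substr($body, strlen($match[0]));
    }
    return $declaration . '<XmlTransformInput>' . $body . '</XmlTransformInput>';
}

echo wrapXmlInput('<row/>'), "\n";
```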
Revision History
- March 26, 2006 - initial version