R-bloggers

Using xml schema and xslt in R

Posted on January 10, 2017 by Jeroen Ooms in R bloggers

[This article was first published on rOpenSci Blog - R, and kindly contributed to R-bloggers].

This week an update for xml2 and a new xslt package have appeared on CRAN. A full announcement for xml2 version 1.1 will appear on the RStudio blog. This post explains XML validation (via XSD schemas) and XML transformation (via XSLT stylesheets), both of which have been added in this release.

XML schemas and stylesheets are not exactly new; both XSLT 1.0 (1999) and XSD 1.0 (2001, second edition 2004) have been available in browsers for over a decade. Revised specifications for XSD and XSLT are still being developed, but they are not widely implemented due to the declining popularity of XML itself. Our R implementation builds on libxslt, which supports XSLT 1.0 features plus most of the EXSLT set of processor-portable extension functions.

XML Validation with XSD

XML schema, also referred to as XSD (XML Schema Definition), is a standard for defining the fields and formats that are supposed to appear within an XML document. This provides a formal method for validating XML messages. The schema itself is also written in XML (there is even an XSD schema for validating XML schemas).

This example from MSDN illustrates the idea using a schema for a hypothetical purchase order. Imagine a vendor has an XML API for retailers to automatically order products. The order can be quite complex, but the schema formally describes what constitutes a valid XML order message: which fields, such as the quantity of an ordered item, must appear and what values each may hold.

Both the client and the server can easily validate an XML order against this schema to ensure that all required fields are present and correctly formatted. A copy of this example is included with the xml2 package:

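# Example order
library(xml2)
doc <- read_xml(system.file("extdata/order-doc.xml", package = "xml2"))
schema <- read_xml(system.file("extdata/order-schema.xml", package = "xml2"))
xml_validate(doc, schema)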

The xml_validate function returns TRUE or FALSE. If FALSE, the result also carries an attribute listing the invalid elements in the XML document. Let's replace some text in the XML document to make it invalid:

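# Create invalid order to test
str <- readLines(system.file("extdata/order-doc.xml", package = "xml2"))
str <- sub("<quantity>1", "<quantity>", str)
doc <- read_xml(paste(str, collapse = "\n"))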

This new document will fail validation. The object returned by xml_validate carries an errors attribute that describes the validation errors.

# Fails validation
out <- xml_validate(doc, schema)
print(out)
# Show the errors
attr(out, "errors")
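The same round trip can be tried without the bundled files. The following is only a minimal sketch: the one-field schema and the two `<order>` documents are made up for illustration and are not part of the MSDN example. It assumes the schema can be read from a string with read_xml, just like any other XML document.

```r
library(xml2)

# Hypothetical schema: an <order> holds one <quantity>,
# which must be a positive integer below 100
schema <- read_xml('<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="order">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="quantity">
          <xs:simpleType>
            <xs:restriction base="xs:positiveInteger">
              <xs:maxExclusive value="100"/>
            </xs:restriction>
          </xs:simpleType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>')

# One document that respects the limit and one that does not
good <- read_xml("<order><quantity>5</quantity></order>")
bad <- read_xml("<order><quantity>500</quantity></order>")

ok_good <- xml_validate(good, schema)
ok_bad <- xml_validate(bad, schema)

# Print both results; the second one carries the validation messages
print(ok_good)
print(ok_bad)
```

Keeping the schema inline like this is mostly useful for experiments and unit tests; a real client would normally load the schema file published by the API provider.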

When implementing an R client for a system with an XML API which also provides a schema, it is good practice to validate your messages before submitting them to the server. Thereby you catch problems with your XML document locally.
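One way to build this check into a client is sketched below. The helper name `validate_or_stop` is hypothetical and not part of xml2; it simply wraps xml_validate and turns a failed validation into an R error, so that nothing is sent unless the document passes. The upload step itself is left out because it depends on the API in question.

```r
library(xml2)

# Hypothetical helper: raise an error with the validation messages
# instead of letting an invalid document reach the server
validate_or_stop <- function(doc, schema) {
  res <- xml_validate(doc, schema)
  if (!isTRUE(as.vector(res))) {
    # Print whatever the errors attribute holds, regardless of its type
    msgs <- paste(capture.output(print(attr(res, "errors"))), collapse = "\n")
    stop("XML failed schema validation:\n", msgs, call. = FALSE)
  }
  invisible(doc)
}

# The purchase order and schema bundled with xml2
schema <- read_xml(system.file("extdata/order-schema.xml", package = "xml2"))
order <- read_xml(system.file("extdata/order-doc.xml", package = "xml2"))

# Passes silently; only after this check would the order be submitted
validate_or_stop(order, schema)
```

Failing early with stop() keeps the error close to the code that built the message, which tends to be easier to debug than a rejection returned by the remote system.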

XML Transformation with XSL

Extensible Stylesheet Language (XSL) Transformation provides a standardized language for converting a certain XML structure into another XML or HTML structure. Usually the original XML document provides the raw data, and the stylesheet contains a template for an HTML page that presents this content. Again, the XSLT document itself is also written in XML.

We have decided to implement this in a separate package called xslt because it requires another C library. Try the example from the xml_xslt manual page:

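library(xslt)
doc <- read_xml(system.file("examples/cd_catalog.xml", package = "xslt"))
style <- read_xml(system.file("examples/cd_catalog.xsl", package = "xslt"))
html <- xml_xslt(doc, style)
cat(as.character(html))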

This example is explained in more detail on w3schools.
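For a variant that does not depend on any bundled files, the document and stylesheet can also be given as strings. This is a rough sketch rather than an example from the package: the `<packages>` input and the stylesheet are invented for illustration, and the stylesheet uses only basic XSLT 1.0 constructs to turn each `<pkg>` element into an HTML list item.

```r
library(xml2)
library(xslt)

# Hypothetical input: a flat list of package names
doc <- read_xml("<packages><pkg>xml2</pkg><pkg>xslt</pkg></packages>")

# Hypothetical stylesheet: render every <pkg> as an <li> inside a <ul>
style <- read_xml('<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html"/>
  <xsl:template match="/packages">
    <ul>
      <xsl:for-each select="pkg">
        <li><xsl:value-of select="."/></li>
      </xsl:for-each>
    </ul>
  </xsl:template>
</xsl:stylesheet>')

# Apply the transformation and pull the list items back out
html <- xml_xslt(doc, style)
items <- xml_text(xml_find_all(html, "//li"))
print(items)
```

Because the result is again an xml2 document, the usual xml2 functions such as xml_find_all can be used to inspect or post-process it.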

Why Use XSLT?

As the name implies, XSLT is designed to apply styling, so that we can separate the data of a document from its presentation markup. Take this example of an XSLT document from the MSDN homepage:

      

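<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:output method="html"/>
  <xsl:template match="/hello-world">
    <HTML>
      <HEAD>
        <TITLE></TITLE>
      </HEAD>
      <BODY>
        <H1><xsl:value-of select="greeting"/></H1>
        <xsl:apply-templates select="greeter"/>
      </BODY>
    </HTML>
  </xsl:template>
  <xsl:template match="greeter">
    <DIV>from <I><xsl:value-of select="."/></I></DIV>
  </xsl:template>
</xsl:stylesheet>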

Now if we apply this to a document like this:

<?xml version="1.0"?>
<hello-world>
  <greeter>An XSLT Programmer</greeter>
  <greeting>Hello, World!</greeting>
</hello-world>

We get the following output:

    

<HTML>
<HEAD>
<TITLE></TITLE>
</HEAD>
<BODY>
<H1>Hello, World!</H1>
<DIV>from <I>An XSLT Programmer</I></DIV>
</BODY>
</HTML>

When XSLT was introduced in 1999, it was expected that XML would replace HTML. Computer scientists envisioned that the dynamic content of websites would be served via semantically structured XML feeds (such as RSS), and that presentation markup (i.e. a nice HTML page) could be added on the client by applying a fixed transformation.

Unfortunately that's not how it went. It turned out that XSLT was overly complex and never really found wide adoption. Instead people started writing dynamic HTML pages using PHP, which was slow and insecure, but considerably easier to learn. And that brings us back to R 🙂
