Tuesday, November 01, 2011

Small, simple, cross-platform, free and fast C++ XML Parser


This project started from the author's frustration that he could not find any simple, portable XML parser to use inside all his projects (for example, inside the award-winning TIMi software suite created by the Business-Insight company). Let's look at the well-known Xerces C++ library: the complete Xerces project is 53 MB (11 MB compressed in a zip file). In 2003, he was developing many small tools and using XML as the standard for all his input/output configuration and data files. The source code of these small tools was usually around 600 KB. In these conditions, don't you think that 53 MB just to be able to read an XML file is a little bit "too much"? So he created his own XML parser. His XML parser "library" is composed of only 2 files: a .cpp file and a .h file. The total size is 104 KB.

Here is how it works: the XML parser loads a full XML file into memory, parses it and generates a tree structure representing the XML file. Of course, you can also parse XML data that you have already stored yourself in a memory buffer. Thereafter, you can easily "explore" the tree to get your data. You can also modify the tree using "add" and "delete" functions and regenerate a formatted XML string from a subtree. Memory management is totally transparent through the use of smart pointers; in other words, you will never have to call new, delete, malloc or free on the tree yourself ("smart pointers" are a primitive version of the garbage collector in Java).
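
To give a feel for what "transparent memory management" can mean, here is a minimal sketch of a reference-counted handle. The names NodeData and NodeHandle are hypothetical and this is not the library's actual implementation; it only illustrates the general idea that copies of a handle share one underlying object, which is freed when the last handle goes away.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical types for illustration only; they are not part of xmlParser.h.
struct NodeData {
    std::string name;
    int refCount = 0;            // number of handles sharing this data
    static int liveObjects;      // instrumentation so the effect can be observed
    explicit NodeData(std::string n) : name(std::move(n)) { ++liveObjects; }
    ~NodeData() { --liveObjects; }
};
int NodeData::liveObjects = 0;

class NodeHandle {
public:
    NodeHandle() = default;                              // empty handle
    explicit NodeHandle(const std::string& name) : d_(new NodeData(name)) {
        d_->refCount = 1;
    }
    // Copying a handle shares the data instead of duplicating it.
    NodeHandle(const NodeHandle& o) : d_(o.d_) { if (d_) ++d_->refCount; }
    // Copy-and-swap: the by-value parameter releases the old data on exit.
    NodeHandle& operator=(NodeHandle o) { std::swap(d_, o.d_); return *this; }
    ~NodeHandle() { release(); }

    bool isEmpty() const { return d_ == nullptr; }
    const char* getName() const { return d_ ? d_->name.c_str() : nullptr; }
    int useCount() const { return d_ ? d_->refCount : 0; }

private:
    void release() {
        if (d_) {
            assert(d_->refCount > 0);
            if (--d_->refCount == 0) delete d_;          // last handle frees the data
        }
        d_ = nullptr;
    }
    NodeData* d_ = nullptr;
};
```

With a design like this, user code simply copies and drops handles; there is no explicit delete anywhere outside the handle class.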

To the best of his knowledge, no other "non-validating C++ XML parser" is as simple and as powerful.

Well, TinyXML is pretty powerful too!

Here are the characteristics of the XMLparser library:

Non-validating XML parser written in standard C++ (DTD and XSD information is ignored).

Cross-platform: the library is currently used every day on Solaris, Linux (32-bit and 64-bit) and Windows to manipulate "small" PMML documents (10 MB).
The library has been tested and works flawlessly with the following compilers: gcc (under Linux, Mac OS X Tiger and many Unix flavours), Visual Studio 6.0, Visual Studio .NET (under Windows 9x, NT, 2000, XP, Vista, CE, Mobile), Intel C/C++ compiler, SUN CC compiler, Borland C++ compiler. The library is also used under Apple OS, iPhone/iPad OS, Amiga OS, QNX and the Netburner platform. To the best of his knowledge, all platforms are now supported.
The parser builds a tree structure that you can "explore" easily (DOM-type parser).
The parser can be used to generate XML strings from subtrees (it's called rendering). You can also save subtrees directly to files (automatic "Byte Order Mark"-BOM support).
Modification or "from scratch creation" of large XML tree structures in memory using functions like addChild, addAttribute, updateAttribute, deleteAttribute,...
It's SIMPLE: no need to learn how to use dozens of classes: there is only one simple class: the 'XMLNode' class (that represents one node of the XML tree).

Very efficient (efficiency is required to be able to handle BIG files):
The string parser is very efficient: it makes only one pass over the XML string to create the tree and performs a minimal number of memory allocations. For example, it does NOT use the slow STL string class but plain, simple and fast C malloc calls. It also allocates large chunks of memory instead of many small chunks. Inside Visual C++, the "debug versions" of the memory allocation functions are very slow: do not forget to compile in "release mode" to get maximum speed.
The "tree exploration" is very efficient because all operations on the 'XMLNode' class are handled through references: there is never any memory copy or memory allocation.

The XML string rendering is very efficient: it makes one pass to compute the total memory size of the XML string and a second pass to actually create the string. There is thus only one memory allocation and no extra memory copy. Other libraries are slower because they use the string concatenation operator, which requires many memory (re-)allocations and memory copies.
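
The two-pass idea can be sketched as follows. This is only an illustration under simplifying assumptions: MiniNode, renderedLength, renderInto and renderTwoPass are hypothetical names (not the library's code), and the sketch handles neither attributes, nor escaping, nor indentation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <vector>

// Hypothetical miniature tree for illustration only (not the XMLNode class).
struct MiniNode {
    const char* name;
    const char* text;                  // may be nullptr
    std::vector<MiniNode> children;
};

// Pass 1: count the characters needed for "<name>text...children...</name>".
static std::size_t renderedLength(const MiniNode& n) {
    std::size_t len = 2 * std::strlen(n.name) + 5;   // '<' '>' '<' '/' '>' = 5
    if (n.text) len += std::strlen(n.text);
    for (const MiniNode& c : n.children) len += renderedLength(c);
    return len;
}

// Pass 2: write into the pre-sized buffer; returns the position just past
// the last character written.
static char* renderInto(const MiniNode& n, char* out) {
    const std::size_t nl = std::strlen(n.name);
    *out++ = '<'; std::memcpy(out, n.name, nl); out += nl; *out++ = '>';
    if (n.text) {
        const std::size_t tl = std::strlen(n.text);
        std::memcpy(out, n.text, tl);
        out += tl;
    }
    for (const MiniNode& c : n.children) out = renderInto(c, out);
    *out++ = '<'; *out++ = '/'; std::memcpy(out, n.name, nl); out += nl; *out++ = '>';
    return out;
}

// One malloc for the whole string; the caller releases it with free().
char* renderTwoPass(const MiniNode& root) {
    const std::size_t total = renderedLength(root);
    char* buf = static_cast<char*>(std::malloc(total + 1));
    if (!buf) return nullptr;
    char* end = renderInto(root, buf);
    assert(static_cast<std::size_t>(end - buf) == total);  // both passes must agree
    *end = '\0';
    return buf;
}
```

For a root named "a" with text "hi" and two children ("b" with text "x", and an empty "c"), renderTwoPass returns the string `<a>hi<b>x</b><c></c></a>` after a single allocation of 25 bytes.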

In-memory parsing
Supports XML namespaces
Very small and totally stand-alone (not built on top of something else). Uses only the standard library (and only for the 'fopen' and 'fread' functions, to load the XML file).

Easy to integrate into your own projects: it's only 2 files! The .h file does not contain any implementation code. Compilation is thus very fast.
Robust.
Optionally, if you define the C++ preprocessor directives STRICT_PARSING and/or APPROXIMATE_PARSING, the library can be "forgiving" in case of errors inside the XML.

He has tried to respect the XML-specs given at: http://www.w3.org/TR/REC-xml/
Fully integrated error handling:
The string parser gives you the precise position and type of the error inside the XML string (if an error is detected).
The library allows you to "explore" a part of the tree that is missing. However, data extracted from "missing subtrees" will be NULL. This way, it's really easy to code "error handling" procedures.
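
The "explore a missing subtree and get NULL" behaviour can be illustrated with a small null-object sketch. Elem and View are hypothetical names chosen for this example and do not come from xmlParser.h; the point is only that looking up a child of a missing node yields another missing node rather than a crash, so a whole chain of lookups needs just one NULL check at the end.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical storage type for illustration only.
struct Elem {
    std::string name;
    std::vector<std::pair<std::string, std::string>> attrs;
    std::vector<Elem> children;
};

// Hypothetical read-only view; a null pointer means "this node is missing".
class View {
public:
    explicit View(const Elem* e = nullptr) : e_(e) {}

    bool isEmpty() const { return e_ == nullptr; }

    // Returns the first child with the given name, or an empty view.
    View child(const char* name) const {
        assert(name != nullptr);
        if (e_)
            for (const Elem& c : e_->children)
                if (c.name == name) return View(&c);
        return View();          // missing child: empty view, safe to keep exploring
    }

    // Returns the attribute value, or nullptr if the node or attribute is missing.
    const char* attribute(const char* key) const {
        assert(key != nullptr);
        if (e_)
            for (const auto& a : e_->attrs)
                if (a.first == key) return a.second.c_str();
        return nullptr;
    }

private:
    const Elem* e_;
};
```

A caller can then write something like `root.child("Header").child("Application").attribute("name")` and test only the final result against nullptr, whichever link of the chain was absent.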

Thread-safe (however, the global parameters "guessUnicodeChar" and "strictUTF8Parsing" must be unique because they are shared by all threads).
Full native support for a wide range of character sets and encodings: ANSI (legacy) / UTF-8 / Shift-JIS / GB2312 / Big5 / GBK.
Under Windows, Linux, Linux 64-bit and Solaris, the library additionally offers Unicode 16-bit / Unicode 32-bit wide-character support, which includes:
For the Unicode version of the library: automatic conversion to Unicode before parsing (if the input XML file uses standard ANSI 8-bit characters).
For the ASCII version of the library: automatic conversion to legacy or UTF-8 before parsing (if the input XML file uses Unicode 16-bit or 32-bit wide characters).

The XMLParser library is able to successfully handle Chinese, Japanese, Cyrillic and other extended characters thanks to extended UTF-8 encoding support, Shift-JIS support (Japanese) and GB2312/Big5/GBK encoding support (Chinese) (see this UTF-8-demo that shows the characters available). If you are still experiencing character encoding problems, he suggests converting your XML files to UTF-8 using a tool like iconv (precompiled win32 binary).

Transparent memory management through the use of smart pointers.

Support for a wide range of clearTags that contain unformatted text:
<![CDATA[ ... ]]>, <!-- ... -->, <PRE> ... </PRE>, <!DOCTYPE ... >
Unformatted text is not parsed by the library and can contain items that are usually 'forbidden' in XML (for example: HTML code).
Support for inclusion of pure binary data (images, sounds,...) into the XML document using the four provided ultrafast Base64 conversion functions.
The library is under the Aladdin Free Public License (AFPL).
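
For readers unfamiliar with Base64, here is a minimal sketch of the encoding step that makes binary data safe to embed in an XML document. The function name base64Encode is hypothetical; the library provides its own Base64 functions, whose names and signatures are not shown in this post, so treat this as an illustration of the standard algorithm rather than of the library's code.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper illustrating standard Base64 (alphabet A-Z a-z 0-9 + /).
std::string base64Encode(const unsigned char* data, std::size_t len) {
    static const char table[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    std::string out;
    out.reserve(((len + 2) / 3) * 4);           // exact output size, reserved once

    std::size_t i = 0;
    for (; i + 2 < len; i += 3) {               // each full 3-byte group gives 4 characters
        const unsigned v = (data[i] << 16) | (data[i + 1] << 8) | data[i + 2];
        out += table[(v >> 18) & 63];
        out += table[(v >> 12) & 63];
        out += table[(v >> 6) & 63];
        out += table[v & 63];
    }
    if (i < len) {                              // 1 or 2 leftover bytes, padded with '='
        unsigned v = data[i] << 16;
        if (i + 1 < len) v |= data[i + 1] << 8;
        out += table[(v >> 18) & 63];
        out += table[(v >> 12) & 63];
        out += (i + 1 < len) ? table[(v >> 6) & 63] : '=';
        out += '=';
    }
    assert(out.size() == ((len + 2) / 3) * 4);
    return out;
}
```

The resulting string contains only letters, digits, '+', '/' and '=', so it can be stored as ordinary element text or as an attribute value without any escaping.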


A small tutorial
Let's assume that you want to parse the XML file "PMMLModel.xml" that contains:

<?xml version="1.0" encoding="ISO-8859-1"?>
<PMML version="3.0" xmlns="http://www.dmg.org/PMML-3-0" xmlns:xsi="http://www.w3.org/2001/XMLSchema_instance">
  <Header copyright="Frank Vanden Berghen">
    Hello World!
    <Application name="&lt;Condor&gt;" version="1.99beta" />
  </Header>
  <Extension name="keys">
    <Key name="urn"></Key>
  </Extension>
  <DataDictionary>
    <DataField name="persfam" optype="continuous" dataType="double">
      <Value value="9.900000e+001" property="missing" />
    </DataField>
    <DataField name="prov" optype="continuous" dataType="double" />
    <DataField name="urb" optype="continuous" dataType="double" />
    <DataField name="ses" optype="continuous" dataType="double" />
  </DataDictionary>
  <RegressionModel functionName="regression" modelType="linearRegression">
    <RegressionTable intercept="0.00796037">
      <NumericPredictor name="persfam" coefficient="-0.00275951" />
      <NumericPredictor name="prov" coefficient="0.000319433" />
      <NumericPredictor name="ses" coefficient="-0.000454307" />
      <NONNumericPredictor name="testXmlExample" />
    </RegressionTable>
  </RegressionModel>
</PMML>

Let's analyse line by line the following small example program:

#include <stdio.h>    // to get the "printf" function
#include <stdlib.h>   // to get the "free" and "atof" functions
#include "xmlParser.h"

int main(int argc, char **argv)
{
    // this opens and parses the XML file:
    XMLNode xMainNode=XMLNode::openFileHelper("PMMLModel.xml","PMML");

    // this prints "<Condor>":
    XMLNode xNode=xMainNode.getChildNode("Header");
    printf("Application Name is: '%s'\n", xNode.getChildNode("Application").getAttribute("name"));

    // this prints "Hello World!":
    printf("Text inside Header tag is :'%s'\n", xNode.getText());

    // this gets the number of "NumericPredictor" tags:
    xNode=xMainNode.getChildNode("RegressionModel").getChildNode("RegressionTable");
    int n=xNode.nChildNode("NumericPredictor");

    // this prints the "coefficient" value for all the "NumericPredictor" tags:
    for (int i=0; i<n; i++)
        printf("coeff %i=%f\n",i+1,atof(xNode.getChildNode("NumericPredictor",i).getAttribute("coefficient")));

    // this prints a formatted output based on the content of the first "Extension" tag of the XML file:
    char *t=xMainNode.getChildNode("Extension").createXMLString(true);
    printf("%s\n",t);
    free(t);   // the string returned by createXMLString must be released by the caller
    return 0;
}
