
Hi DevNewz Readers,
The author of this article, Peter A. Bromberg, Ph.D.,
of EggheadCafe.com has done an outstanding job. I hope
that you enjoy reading and learning from it as I have.
Best Wishes, Pete


I regard myself as a particularly fortunate "XML Dude":
About a year ago, I determined that, regardless of the
amount of time I had in the day, and regardless of the
fact that the company I worked for at the time had virtually
no vision as to what XML could do to help solve their
problems, I was going to spend some time -- for ME --
each evening, studying this new technology and learning
how to use it. I don't know about you, but I come home
each evening tired from being paid to think all day
long. Very tired, I might add. But, I kept my promise
to myself and made the time to study XML and XSLT. I
remember when I bought my first book on XML, a Wrox
title that I still refer to today (although I now possess
many more such books). My first reaction was "Oh, crap!
This is not going to be fun at all."
To their credit, this same company which had little
XML vision saw fit to pay to send me to both Tech-Ed
and the Microsoft Professional Developer's Conference,
both of which were conveniently held in sunny Orlando
(where I live), last year. All I can say is, when I
arrived home each evening from the conferences, I was
so excited I couldn't sleep. I saw the vision! I felt
the excitement of the Microsoft gurus, who were genuinely
enthusiastic about what they were doing and what the
possibilities were. And it was all about XML and DOT
NET.
Now, I am even more fortunate, because my vision of
learning XML (along with DOT NET and related technologies)
has begun to pay off. I now work for a company whose
XML vision is well-formed (to make a pun) and the project
I work on is almost 100% XML / XSLT based. Every ASP
page in our application serves only as the "glue" that
receives and processes querystring or other parameters
from an XML-based menu choice and tells the page which
Javascript and XSL include files to load. All the loading
and transformations are dynamic, all plumbing is handled
by global Javascript functions, and every single browser
page in the application is the result of a dynamically
generated XSL transform. Client-side XML data islands hold
important dynamically updated information that is accessed
through global Javascript functions. 100% of all data access
is handled through XMLHTTP request / response documents
via COM+ middleware components that we have authored;
each request is sent, received back and processed from
the CLIENT SIDE, over the wire, to and from the databases.
And it's all 100% financial industry standards compliant.
I feel proud to have been a member of the architectural
team that went through a lot of pain to flesh out all
this stuff and make the "proof of concept" become a
reality that will provide real value to the customer
(and make the company that I work for a ton of money).
One of our biggest concerns continues to be performance
tuning (see my "Performance Tuning Checklist" article
for more on this).
Recently a friend I used to work with started using
XML, and he sent some emails asking for comments and
advice. Some of the assumptions he made in doing his
"Beginning XML" exercise made me think: "Well, if he
is making these mistakes (the same ones that I made,
a lot of them) then I wonder how many other developers
struggling with XML / XSLT are also doing this?" So I
decided to write this article to try and summarize some
of the most important things I've learned. If this helps
you because you're just starting out with XML, that's
great. And if you're already well under way as an XML
Developer and some of the things that I touch on here
make you think, well, that's even better. I am by no
means an XML "guru". But you know what? I intend to
become one. When I read in magazines like "Smart Partner"
that XML gurus are currently being billed out at $300
an hour, I feel gratified that my decision almost a
year ago was the right one for me. This technology is
not going to go away, folks. It's big time. Study XML!
It's the best job security you can get since winning
the lottery became popular.
XML PERFORMANCE VARIABLES
In working with XML data and documents, there are four
major variables that can affect the performance of MSXML:
- The kind of XML data
- The ratio of tags to text
- The ratio of attributes to elements
- The amount of discarded white space
There are also four key performance "metrics" involved
on the Win32 platform:
Working set:
The peak amount of memory used by MSXML to process
requests. Once the working set exceeds available RAM,
performance usually declines sharply as the operating
system starts paging memory out to disk.
Megabytes per second:
The raw speed for a given operation, such as the document
load method.
Requests per second:
How many requests the XML parser can handle per second.
An XML parser might have a high megabytes-per-second
rate, but if it is expensive to set up and tear down
that parser, it will still have a low throughput in
requests per second. For example, if each client hits
the server at a peak rate of one request per second,
and the server can do 150 requests per second, the
server can probably handle up to 150 clients.
Scaling:
How well your server can process requests in parallel.
If your server is processing 150 client requests in
parallel, then it is doing a lot of multi-threading.
Processing 150 threads in parallel is a lot for one
processor; it will spend a lot of time switching between
threads. You could add more processors to the computer
to share the load.
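The requests-per-second arithmetic above can be sketched in a few lines of Javascript. The function name and its parameters are hypothetical (they are not part of MSXML), and the result is only a rough upper bound: it assumes the load is spread evenly and ignores the thread-switching cost described under Scaling.

```javascript
// Rough capacity estimate from the requests-per-second metric.
// estimateMaxClients is a hypothetical helper, not an MSXML function.
function estimateMaxClients(serverRequestsPerSecond, peakRequestsPerClientPerSecond) {
  if (serverRequestsPerSecond < 0 || peakRequestsPerClientPerSecond <= 0) {
    throw new RangeError("rates must be positive");
  }
  // Each client consumes a share of the server's throughput;
  // round down because a fraction of a client cannot be served.
  return Math.floor(serverRequestsPerSecond / peakRequestsPerClientPerSecond);
}

// The numbers from the text: a server that can do 150 requests per second,
// with each client peaking at one request per second.
console.log(estimateMaxClients(150, 1)); // 150
```

If each client peaked at two requests per second instead, the same server would top out at about 75 clients.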
The fastest way to load an XML Document
The fastest way to load an XML document is to use
the default "rental" threading model (which means
the DOM document can be used by only one thread at
a time) with validateOnParse, resolveExternals, and
preserveWhiteSpace all disabled, like this in Javascript:
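var doc = new ActiveXObject("MSXML2.DOMDocument");
doc.validateOnParse = false;    // skip DTD / schema validation
doc.resolveExternals = false;   // do not fetch external DTDs or entities
doc.preserveWhiteSpace = false; // discard insignificant white space
doc.load("mystuff.xml");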
If you have an element-heavy XML document that contains
a lot of white space between elements and is stored in
Unicode, it can actually be smaller in memory than
on disk. Files that have a more balanced ratio of
elements to text content end up at about 1.25 to 1.5
times the UCS-2 disk file size when in memory. Files that
are very data-dense, such as an attribute-heavy,
XML-persisted ADO recordset, can end up more than
twice the disk-file size when loaded into memory.
Attributes vs. Elements
You could conclude that attribute-heavy formats (such
as an XML-persisted ADO recordset) deliver more
data per second than element-heavy formats. But this
should not be the only reason for you to switch everything
to attributes. There are many other factors to consider
in the decision to use attributes versus elements.
Unique elements
My friend, in his honest but less than informed effort
to create a useful XML document, made the mistake
of attempting to use the XML elements as if they were
unique "database fields". For example if you have
an XML Document that consists of survey questions,
you could conceive that in order to make each element
"unique" you would give the tag a unique name. So
your survey questions document might end up looking
kind of like this:
<POLLQUESTIONS>
<Q10000002>Who is central Florida's best Internet
Service Provider?</Q10000002>
<A10000009>MPINet</A10000009>
<A10000010>EarthLink</A10000010>
<A10000011>MindSpring</A10000011>
<A10000012>Access Orlando</A10000012>
<Q10000003>What is your favorite search engine?</Q10000003>
<A10000018>Yahoo</A10000018>
<A10000019>Altavista</A10000019>
<A10000020>Lycos</A10000020>
</POLLQUESTIONS>
Now what is wrong with the above document fragment?
Two things, actually. First, it does not lend itself
easily to XPath statements that allow you to walk
the DOM and find isolated nodes and / or subnodes
of elements. True, each element has a "unique" tag
name, but that's not the point. XML is a tree-like
hierarchical structure. If you need to be able to
find an element or a node by number, or to sort, search
or group, it's better to use either an attribute
(<Question QNum="1">) to identify the unique "ID" of
elements or nodes, or to include a sibling element
(for example, <QNum>1</QNum>) inside each Question tag.
You can also use the position() function. The second
thing that's "wrong" is that the answer tags simply
follow closed Question tags here -- there is no enclosing
"Question" element that encompasses both the question
and its answers. A more productive version of the above
might look like this:
<POLLQUESTIONS>
<Question QNum="1">
<QText> Who is central Florida's best Internet
Service Provider?</QText>
<Answer>MPINet</Answer>
<Answer>EarthLink</Answer>
<Answer>MindSpring</Answer>
</Question>
<Question QNum="2">
<QText> What is your favorite search engine?</QText>
<Answer>Yahoo</Answer>
<Answer>Altavista</Answer>
<Answer>Lycos</Answer>
</Question>
</POLLQUESTIONS>
Separate Memory Structure for unique elements
With the second example, we can find any question
by its number using XPath like: //Question[@QNum="2"].
We can sort, search, grab a question node along with
its answers, and so on. And there is another very
important but often overlooked reason to try and arrange
your XML documents so that all the major tags have
the same names: when the XML parser loads and processes
your document, it creates a separate memory structure
for each unique element name. So conceivably the first
example above, if it had 1000 questions, could occupy
orders of magnitude more memory than the second example
with 1000 questions, taking a lot longer to parse,
and possibly a lot longer to search or sort as well.
Walking the DOM tree for the first time also has an
impact on the working set metric because some nodes
in the tree are created "on demand"; they are not
automatically "there" after loading the document.
Creating a DOM tree from scratch results in a higher
peak working set than loading the same document from
disk. Loading a document is roughly five times faster
than creating the same document from scratch in memory.
The reason is that the process of creating a document
requires a lot of DOM calls, which slows things down.
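If your questions live in a database or some other data source, a document in the second, uniform layout can be generated as text with a short routine and then loaded. This is only a sketch: escapeXml and buildPollXml are hypothetical helper names, and the input shape (an array of objects with text and answers) is my own assumption for illustration.

```javascript
// Build the "same tag names + QNum attribute" layout from plain data.
// escapeXml and buildPollXml are hypothetical helpers for illustration.
function escapeXml(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

function buildPollXml(questions) {
  const lines = ["<POLLQUESTIONS>"];
  questions.forEach((q, index) => {
    // Every question uses the same element name; the number lives in QNum.
    lines.push('<Question QNum="' + (index + 1) + '">');
    lines.push("<QText>" + escapeXml(q.text) + "</QText>");
    for (const answer of q.answers) {
      lines.push("<Answer>" + escapeXml(answer) + "</Answer>");
    }
    lines.push("</Question>");
  });
  lines.push("</POLLQUESTIONS>");
  return lines.join("\n");
}

const pollXml = buildPollXml([
  { text: "What is your favorite search engine?", answers: ["Yahoo", "Altavista", "Lycos"] }
]);
console.log(pollXml);
```

Because every question and answer shares one element name, adding a thousand more questions adds no new tag names to the document.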
Walk Fast
The fastest way to walk the tree is to avoid the children
collection and any kind of array access. Instead,
use firstChild and nextSibling:
function WalkNodes(node) {
  var child = node.firstChild;
  while (child != null) {
    WalkNodes(child);
    child = child.nextSibling;
  }
}
However, if you are looking for something in the tree,
the fastest way to find it is to use XPath via the
selectSingleNode or selectNodes methods.
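To try the firstChild / nextSibling pattern away from MSXML, here is a small sketch that runs in any Javascript engine. The node objects are hand-built stand-ins that expose only the two properties the walk relies on; makeNode and countNodes are hypothetical names, not DOM methods.

```javascript
// Stand-in node objects exposing only firstChild and nextSibling,
// the two DOM properties the walk relies on. makeNode is hypothetical.
function makeNode(name, children) {
  const node = { nodeName: name, firstChild: null, nextSibling: null };
  let previous = null;
  for (const child of children || []) {
    if (previous === null) {
      node.firstChild = child;
    } else {
      previous.nextSibling = child;
    }
    previous = child;
  }
  return node;
}

// Same traversal as WalkNodes, extended to count the nodes it visits.
function countNodes(node) {
  let count = 1; // the node itself
  let child = node.firstChild;
  while (child !== null) {
    count += countNodes(child);
    child = child.nextSibling;
  }
  return count;
}

const tree = makeNode("POLLQUESTIONS", [
  makeNode("Question", [makeNode("QText"), makeNode("Answer"), makeNode("Answer")]),
  makeNode("Question", [makeNode("QText"), makeNode("Answer")])
]);
console.log(countNodes(tree)); // 8
```

The walk never builds a child collection or indexes into an array; it only follows two pointers per node.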
Free-Threaded Documents
The "free-threaded" DOM document exposes the same
interface as the "rental" threaded document. This
object can be safely shared across any thread in the
same process. It can be safely stored in ASP Application
state on IIS.
Free-threaded documents are generally slower than
rental documents because of the extra thread safety
work they do. You use them when you want to share
a document among multiple threads at the same time,
avoiding the need for each of those threads to load
its own copy. In some cases, this can result in a
big performance gain.
For example, suppose you have a 12K XML file on your
Web server, and you have a simple ASP page that loads
that file, increments an attribute inside the file,
and saves the file again. Such ASP code is likely
to be completely tied up with disk I/O. However, you
could put the file into shared-application state using
a free-threaded DOM document:
<%@ LANGUAGE=JSCRIPT %>
<%
Response.Expires = -1;
// Reuse the shared document if an earlier request already loaded it.
var doc = Application("Stuff");
if (doc == null)
{
  doc = Server.CreateObject("Msxml2.FreeThreadedDOMDocument");
  doc.async = false;
  doc.load(Server.MapPath("stuff.xml"));
  Application("Stuff") = doc;
}
// Serialize the read-modify-write of the counter across requests.
Application.Lock();
var c = parseInt(doc.documentElement.getAttribute("count"), 10) + 1;
doc.documentElement.setAttribute("count", c);
Application.UnLock();
%>
<%=c%>
This second approach using the free-threaded DOM document
can easily be seven times faster than loading and saving
the file on every request.
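The gain comes from loading the document once and reusing it. Stripped of ASP and MSXML, the shape of that pattern looks roughly like the sketch below. createSharedCache and loadStuff are hypothetical names, and a plain Javascript Map stands in for Application state. Single-threaded Javascript needs no equivalent of Application.Lock, which the real ASP page still requires.

```javascript
// Load-once cache: the expensive loader runs only on the first request
// for a key; later requests reuse the stored object.
function createSharedCache() {
  const store = new Map();
  return {
    get(key, loader) {
      if (!store.has(key)) {
        store.set(key, loader(key)); // expensive step, e.g. disk I/O and parsing
      }
      return store.get(key);
    }
  };
}

let loads = 0;
const cache = createSharedCache();
const loadStuff = () => { loads += 1; return { count: 0 }; };

// Three simulated requests share one loaded object.
for (let request = 0; request < 3; request++) {
  const shared = cache.get("Stuff", loadStuff);
  shared.count += 1; // mirrors incrementing the "count" attribute
}
console.log(loads, cache.get("Stuff", loadStuff).count); // 1 3
```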
IDispatch
Late-bound scripting languages such as JScript and
VBScript add a lot of overhead to each method call
and property access in the DOM interface. The script
engines invoke the methods and properties indirectly
through the IDispatch interface. They first call GetIDsOfNames
or GetDispID, passing in the string name of
the method or property and getting back a DISPID. Then
the engines package all the arguments into an array
and call Invoke with the DISPID.
This is slower than calling a virtual function in
C++ or compiled Visual Basic. For this reason, you
may want to consider calling all your DOM functions
from a "wrapper" compiled component that has the generic
methods you need to do what you want, when using a
script based application environment such as ASP.
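The two-step lookup described above can be pictured with a toy model. This is not the COM API: makeLateBound, getIdOfName and invoke are hypothetical names that only mimic the shape of the sequence (resolve a string name to a numeric id, package the arguments into an array, then make the indirect call).

```javascript
// Toy model of late binding: a name is first resolved to a numeric id,
// then the call goes through a generic invoke with an argument array.
// This mimics the shape of the IDispatch sequence; it is not the COM API.
function makeLateBound(target) {
  const names = Object.keys(target);
  return {
    getIdOfName(name) {
      const id = names.indexOf(name);
      if (id < 0) throw new Error("unknown member: " + name);
      return id;
    },
    invoke(id, args) {
      return target[names[id]].apply(target, args);
    }
  };
}

const calculator = { add(a, b) { return a + b; } };

// Early bound: one direct call.
const direct = calculator.add(2, 3);

// Late bound: string lookup, argument packaging, then the indirect call.
const dispatcher = makeLateBound(calculator);
const memberId = dispatcher.getIdOfName("add");
const indirect = dispatcher.invoke(memberId, [2, 3]);

console.log(direct, indirect); // 5 5
```

Both calls return the same answer; the late-bound one simply does more work to get there, and it does so on every call.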
With VB, you also want to avoid late-bound DOM object
invocation calls like the following:
Dim doc As Object
Set doc = CreateObject("Microsoft.XMLDOM")
This will be as slow as VBScript or JScript. To speed
this up, from the Project menu, select References
and add a reference to the latest version of the "Microsoft
XML" library. Then you can write the following early-bound
code:
Dim doc As New MSXML.DOMDocument
Use XSL to get the data you need for speed
XSL can be a big performance win over using DOM code
for generating "transformed" reports from an XML document.
For example, suppose you wanted to show all the questions
and answers matching a certain key word category element.
You might use selectNodes to find all the questions
matching the category, then use another selectNodes
call to iterate through the answer elements of each
of those questions.
But you could also write an XSL stylesheet:
<xsl:template xmlns:xsl="http://www.w3.org/TR/WD-xsl">
  <xsl:for-each select="/POLLQUESTIONS/Question[CATEGORY='ASP']">
    <xsl:for-each select="Answer">
      <xsl:value-of/>
    </xsl:for-each>
    <hr/>
  </xsl:for-each>
</xsl:template>
You could then create your output with a function like:
function report(doc) {
  var xsl = new ActiveXObject("Microsoft.XMLDOM");
  xsl.async = false;
  xsl.load("pollquestions.xsl");
  return doc.transformNode(xsl);
}
This XSL transformation could be from 5 to 10 times
faster than iterating through the DOM looking for
your data!
The "//" Operator
The "//" operator walks the entire subtree looking
for matches. If you are like me, you use it more
than you should because you are too lazy to look up
and type in the full path. If you can, use the full
path to get your data; it will typically give you
up to a 15% performance boost. In editors such as
XML Spy, there is a clipboard copy function that will
return the XPath statement for any element.
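Why the full path is cheaper can be seen in a simplified model. This is not how MSXML is implemented internally (that is an assumption-free zone for me); it is only a sketch in which plain objects stand in for elements and hypothetical helpers (el, findAnywhere, findByPath) count how many nodes each strategy has to touch.

```javascript
// Plain objects stand in for XML elements. el, findAnywhere and findByPath
// are hypothetical helpers that count how many nodes each strategy touches.
function el(name, children) {
  return { name: name, children: children || [] };
}

// Like "//name": every node in the subtree is visited.
function findAnywhere(root, name, stats) {
  const matches = [];
  function visit(node) {
    stats.visited += 1;
    if (node.name === name) matches.push(node);
    node.children.forEach((child) => visit(child));
  }
  visit(root);
  return matches;
}

// Like "/POLLQUESTIONS/Question": each step looks only at the children
// of the nodes matched by the previous step.
function findByPath(root, path, stats) {
  stats.visited += 1;
  let current = root.name === path[0] ? [root] : [];
  for (const step of path.slice(1)) {
    const next = [];
    for (const node of current) {
      for (const child of node.children) {
        stats.visited += 1;
        if (child.name === step) next.push(child);
      }
    }
    current = next;
  }
  return current;
}

const question = () =>
  el("Question", [el("QText"), el("Answer"), el("Answer"), el("Answer")]);
const poll = el("POLLQUESTIONS", [question(), question()]);

const anywhereStats = { visited: 0 };
const pathStats = { visited: 0 };
const foundAnywhere = findAnywhere(poll, "Question", anywhereStats);
const foundByPath = findByPath(poll, ["POLLQUESTIONS", "Question"], pathStats);

console.log(foundAnywhere.length, foundByPath.length);  // 2 2
console.log(anywhereStats.visited, pathStats.visited);  // 11 3
```

Both searches find the same two Question nodes, but the descendant search has to look at every QText and Answer along the way. The gap widens as the document grows deeper.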
Many of the items I've chosen to cover in this article,
and others, along with real-world test results and
detail, can be found in some excellent work by Chris
Lovett of Microsoft. Also, author Kurt Cagle has some
great work on using the IXSLTemplate interface to
cache template processors, which will speed up
repetitive transformations big time. You can find
articles by both authors at MSDN online by simply
searching on their names.
ABOUT THE AUTHOR: By Peter
A. Bromberg, Ph.D.
Peter Bromberg is a Senior Programmer / Analyst at
Fiserv, Inc. in Orlando and a co-developer of the
EggheadCafe.com developer website. He can be reached
at pbromberg@yahoo.com
I hope that you enjoyed this issue of DevNewz.
Best Wishes,
Peter Thiruselvam
The DevNewz Team