- 651 views
- 423 downloads
Comparing XML Documents as Reference-aware Labeled Ordered Trees
-
- Author / Creator
- Mikhaiel, Rimon A. E.
-
XML, the Extensible Markup Language, is the standard exchange format for
modern Information Systems, Service Oriented Architecture (SOA) and the
Semantic Web. Hence, comparing XML documents has become a necessary task
for tracking and merging changes between versions of the same document, or for
translating between documents referring to the same information but complying
with different schemata or originating from different parties. In this scenario,
given two documents, XML differencing is the process of finding an edit
sequence, namely a sequence of exact and approximate matching, deletion, and
insertion operations, which, if applied to the first document will result in the
second. In practice, domain-specific differencing solutions are expensive to
develop, and hard to reuse. Therefore, a generic differencing approach, able to
serve various domains, would be both useful and cost-effective. This thesis
presents VTracker, a generic XML differencing approach, which is capable of
capturing domain knowledge and semantics through a configurable domainspecific
cost function. VTracker views an XML document as an ordered labeled
tree. Given two XML-document trees and a cost function VTracker calculates the
tree-edit distance needed to transform one tree to the other. The first contribution
of VTracker is an automatic method used to synthesize such a cost function based
on the domain’s XML Schema Definition (XSD). Second, VTracker considers the
XML reference structure in addition to the natural XML containment structure.
Third, VTracker implements an affine-cost policy that prefers edit operations
applied to neighbors over dispersed elements. Finally, VTracker uses a set of
simplicity heuristics to nominate the best edit script in case of multiple ones found
with the same minimum cost. VTracker was applied to a variety of domains,
namely OWL/RDF, WSDL, BPEL, UML/XMI, XHTML, and RNA secondary
structure, where it performed competitively with, or even better than, state-of-theart
methods in each of these domains. -
- Subjects / Keywords
-
- Graduation date
- Fall 2011
-
- Type of Item
- Thesis
-
- Degree
- Doctor of Philosophy
-
- License
- This thesis is made available by the University of Alberta Libraries with permission of the copyright owner solely for non-commercial purposes. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.