Finding Syntactic Similarities Between XML Documents

  • Technical report TR05-16. We present a concise and accurate structural summary of XML documents and show that this summary can be used to effectively cluster documents that belong to the same structural class. We present efficient formulations of similarity between structural summaries that lead to better detection of documents that conform to the same DTD. Our formulations are based on the intuition that two documents are likely to have been generated by the same DTD if a large fraction of the paths in the two documents are the same or similar. Our experimental evaluation shows that this method does an excellent job of grouping documents generated by the same DTD, outperforming some previously proposed solutions based on tree comparison.
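The path-overlap intuition in the abstract can be illustrated with a small sketch. The abstract does not give the report's actual summary structure or similarity formulas, so everything here is an assumption made for illustration: the summary is taken to be the set of distinct root-to-element tag paths, the score is a plain Jaccard ratio over exact path matches (the "similar paths" case is not modeled), and the names `path_summary` and `path_similarity` are hypothetical.

```python
import xml.etree.ElementTree as ET


def path_summary(xml_text):
    """Return the set of distinct root-to-element tag paths in a document.

    Assumption: this set stands in for the report's structural summary,
    which the abstract does not define.
    """
    root = ET.fromstring(xml_text)
    paths = set()
    stack = [(root, "/" + root.tag)]
    while stack:
        node, path = stack.pop()
        paths.add(path)  # repeated siblings collapse to a single path
        for child in node:
            stack.append((child, path + "/" + child.tag))
    return paths


def path_similarity(xml_a, xml_b):
    """Jaccard ratio of shared paths; an illustrative stand-in, not the report's measure."""
    paths_a = path_summary(xml_a)
    paths_b = path_summary(xml_b)
    # Every parsed document has at least a root path, so the union is never empty.
    return len(paths_a & paths_b) / len(paths_a | paths_b)


doc_a = "<book><title/><author><name/></author></book>"
doc_b = "<book><title/><author><name/><email/></author></book>"
doc_c = "<order><item><sku/></item></order>"

print(path_similarity(doc_a, doc_b))  # 4 shared paths out of 5 distinct: 0.8
print(path_similarity(doc_a, doc_c))  # no shared paths: 0.0
```

Under these assumptions, documents that plausibly come from the same DTD (`doc_a` and `doc_b`) score high, while an unrelated document (`doc_c`) scores zero. A clustering step would then group documents whose pairwise scores exceed some chosen threshold.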

