Detecting Visually Similar Web Pages: Application to Phishing Detection

  • Author / Creator
    Teh-Chung, Chen
  • We propose a novel approach for detecting visual similarity between two web pages. The approach applies Gestalt theory and treats a web page as a single indivisible entity. The concept of supersignals, as a realization of Gestalt principles, supports our contention that web pages must be treated as indivisible entities. We objectify, and directly compare, these indivisible supersignals using algorithmic complexity theory. We apply our new approach to the domain of anti-Phishing technologies, which gives us both a reasonable ground truth for the concept of “visually similar” and a high-value application of our proposed approach. Phishing attacks involve sophisticated, fraudulent websites that are realistic enough to fool a significant number of victims into providing their account credentials. There is a constant tug-of-war between anti-Phishing researchers, who create new schemes to detect Phishing scams, and Phishers, who create countermeasures. Our approach to Phishing detection is based on one major signature of Phishing web pages that cannot easily be changed by these con artists: visual similarity. The only way to defeat this characteristic appears to be to make a visually dissimilar Phishing web page, which also dramatically reduces the success rate of the Phishing scams and their criminal profits. For this reason, our application appears to be quite robust against a variety of common countermeasures Phishers have employed. To verify the practicality of our proposed method, we perform a large-scale, real-world case study based on “live” Phish captured from the Internet. Compression algorithms (as a practical operational realization of algorithmic complexity theory) are a critical component of our approach. Out of the vast number of compression techniques in the literature, we must determine which is best suited to our visual similarity problem.
We therefore compare nine compressors (including both 1-dimensional string compressors and 2-dimensional image compressors) and determine that the LZMA algorithm performs best for our problem. With this determination made, we test the LZMA-based similarity technique in a realistic anti-Phishing scenario. We construct a whitelist of protected sites and compare the performance of our similarity technique when presented with a) some of the most popular legitimate sites, and b) live Phishing sites targeting the protected sites. The accuracy of our technique is extremely high in this test: the true positive and false positive rates reached 100% and 0.8%, respectively. Finally, we undertake a more detailed investigation of the LZMA compression technique. Other authors have argued that compression techniques map objects to an implicit feature space consisting of the dictionary elements generated by the compressor. In testing this possibility on live Phishing data, we found that derived variables computed directly from the dictionary elements were indeed excellent predictors. In fact, by taking advantage of this specific characteristic of dictionary compression algorithms, we slightly improve the accuracy of our already near-perfect NCD (Normalized Compression Distance) classification application by using a modified, refined LZMA algorithm.

  • Subjects / Keywords
  • Graduation date
    Spring 2011
  • Type of Item
  • Degree
    Doctor of Philosophy
  • DOI
  • License
    This thesis is made available by the University of Alberta Libraries with permission of the copyright owner solely for non-commercial purposes. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.
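
The compression-based similarity measure named in the abstract can be illustrated with a minimal Python sketch of the Normalized Compression Distance, using the standard-library `lzma` module as the compressor. Several details here are assumptions, not taken from the thesis: the abstract does not say how a web page is encoded before compression, so the functions simply accept raw bytes; the helper names `compressed_size`, `ncd`, and `closest_protected_site` are hypothetical; and the thesis's modified LZMA variant is not reproduced.

```python
import lzma


def compressed_size(data: bytes) -> int:
    """Length of the LZMA-compressed form of `data` (approximates its complexity)."""
    return len(lzma.compress(data, preset=9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance between two byte strings.

    Values near 0 suggest the inputs share most of their content;
    values near 1 suggest they share very little.
    """
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def closest_protected_site(candidate: bytes, whitelist: dict[str, bytes]) -> tuple[str, float]:
    """Return the whitelist entry with the smallest NCD to `candidate`.

    `whitelist` maps a site name to the bytes of its protected page
    (the byte representation is an assumption of this sketch).
    """
    scored = ((name, ncd(candidate, page)) for name, page in whitelist.items())
    return min(scored, key=lambda pair: pair[1])
```

A caller would compare the returned distance against a decision threshold to flag a suspected Phishing page. The abstract does not state a threshold, so any value used with this sketch would have to be chosen empirically.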