Detecting Visually Similar Web Pages: Application to Phishing Detection

Teh-Chung, Chen

doi:doi:10.7939/R3NT4J

This decommissioned ERA site remains active temporarily to support our final migration steps to https://ualberta.scholaris.ca, ERA's new home. All new collections and items, including Spring 2025 theses, are at that site. For assistance, please contact erahelp@ualberta.ca.

View

Download

Communities and Collections

Graduate and Postdoctoral Studies (GPS), Faculty of / Theses and Dissertations

Usage

324 views
1643 downloads

Detecting Visually Similar Web Pages: Application to Phishing Detection

Author / Creator

Teh-Chung, Chen
We propose a novel approach for detecting visual similarity between two web pages. The proposed approach applies Gestalt theory and considers a webpage as a single indivisible entity. The concept of supersignals, as a realization of Gestalt principles, supports our contention that web pages must be treated as indivisible entities. We objectify, and directly compare, these indivisible supersignals using algorithmic complexity theory. We apply our new approach to the domain of anti-Phishing technologies, which at once gives us both a reasonable ground truth for the concept of “visually similar,” and a high-value application of our proposed approach. Phishing attacks involve sophisticated, fraudulent websites that are realistic enough to fool a significant number of victims into providing their account credentials. There is a constant tug-of-war between anti-Phishing researchers who create new schemes to detect Phishing scams, and Phishers who create countermeasures. Our approach to Phishing detection is based on one major signature of Phishing webpage which can not be easily changed by those con artists –Visual Similarity. The only way to fool this significant characteristic appears to be to make a visually dissimilar Phishing webpage, which also reduces the successful rate of the Phishing scams or their criminal profits dramatically. For this reason, our application appears to be quite robust against a variety of common countermeasures Phishers have employed. To verify the practicality of our proposed method, we perform a large-scale, real-world case study, based on “live” Phish captured from the Internet. Compression algorithms (as a practical operational realization of algorithmic complexity theory) are a critical component of our approach. Out of the vast number of compression techniques in the literature, we must determine which compression technique is best suited for our visual similarity problem. We therefore perform a comparison of nine compressors (including both 1-dimensional string compressors and 2-dimensional image compressors). We finally determine that the LZMA algorithm performs best for our problem. With this determination made, we test the LZMA-based similarity technique in a realistic anti-Phishing scenario. We construct a whitelist of protected sites, and compare the performance of our similarity technique when presented with a) some of the most popular legitimate sites, and b) live Phishing sites targeting the protected sites. We found that the accuracy of our technique is extremely high in this test; the true positive and false positive rates reached 100% and 0.8%, respectively. We finally undertake a more detailed investigation of the LZMA compression technique. Other authors have argued that compression techniques map objects to an implicit feature space consisting of the dictionary elements generated by the compressor. In testing this possibility on live Phishing data, we found that derived variables computed directly from the dictionary elements were indeed excellent predictors. In fact, by taking advantage of the specific characteristic of dictionary compression algorithm, we slightly improve on our accuracy when using a modified/refined LZMA algorithm for our already perfect NCD classification application.
Subjects / Keywords
Graduation date

Spring 2011
Type of Item

Thesis
Degree

Doctor of Philosophy
DOI

https://doi.org/10.7939/R3NT4J
License

This thesis is made available by the University of Alberta Libraries with permission of the copyright owner solely for non-commercial purposes. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.

Language

English
Institution

University of Alberta
Degree level

Doctoral
Department
- Department of Electrical and Computer Engineering
Supervisor / co-supervisor and their department(s)
- Scott Dick (Electrical and Computer Engineering)
- James Miller (Electrical and Computer Engineering)
Examining committee members and their departments
- Jens Weber (Software Engineering, University of Victoria)
- Osmar Zaiane (Computing Science)
- Vicky Zhao (Electrical and Computer Engineering)