Classifying Websites into Non-topical Categories

  • Author / Creator
    Thapa, Chaman
  • With organizations from many sectors of the economy present on the web, detecting which sector a given website belongs to is both important and challenging. We study the problem of classifying websites into four non-topical categories: public, private, non-profit, and commercial franchise. We examine textual features based on word unigrams and bigrams, syntactic features based on part-of-speech tags and named-entity distributions, and structural features based on website depth, link structure, and URL patterns. Our experiments with different feature sets show that syntactic and structural features improve performance when combined with word unigrams and bigrams, and that the improvement is larger when few words are available. Using websites related to obesity control, we compare classifiers built on words extracted from various depths of a website. Our experiments in a multi-label classification setting show that crawling words from deeper levels of a site may not help. When unlabeled websites greatly outnumber labeled ones, which is usually the case, classifiers benefit from using both labeled and unlabeled data. Based on this observation, we combine multiple feature sets using the co-training algorithm in a semi-supervised setting. Our experiments show that co-training does improve classification accuracy when multiple feature sets and few labeled samples are available for training.

  • Subjects / Keywords
  • Graduation date
  • Type of Item
  • Degree
    Master of Science
  • DOI
  • License
    This thesis is made available by the University of Alberta Libraries with permission of the copyright owner solely for non-commercial purposes. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.
  • Language
  • Institution
    University of Alberta
  • Degree level
  • Department
    • Department of Computing Science
  • Supervisor / co-supervisor and their department(s)
    • Zaiane, Osmar (Computing Science)
    • Rafiei, Davood (Computing Science)
  • Examining committee members and their departments
    • Sander, Jörg (Computing Science)
    • Kurgan, Lukasz (Electrical and Computer Engineering)
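
The co-training idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: the classifier (a tiny Naive Bayes), the per-class selection rule, the names `TinyNB`, `fit_views`, `co_train`, and `predict`, and the toy tokens are all assumptions made for illustration. Each website is represented by two feature views (here, page words and URL tokens); a classifier trained on each view labels the unlabeled examples it is most confident about, and those examples are added to the shared training set.

```python
import math
from collections import Counter, defaultdict

class TinyNB:
    """Minimal multinomial Naive Bayes over token lists (illustrative stand-in)."""
    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.prior = Counter(labels)
        self.counts = defaultdict(Counter)
        for tokens, y in zip(docs, labels):
            self.counts[y].update(tokens)
        self.vocab = {t for c in self.counts.values() for t in c}
        return self

    def predict_proba(self, tokens):
        n, v = sum(self.prior.values()), len(self.vocab) + 1
        logp = {}
        for c in self.classes:
            total = sum(self.counts[c].values())
            lp = math.log(self.prior[c] / n)
            for t in tokens:  # add-one smoothing for unseen tokens
                lp += math.log((self.counts[c][t] + 1) / (total + v))
            logp[c] = lp
        m = max(logp.values())
        exp = {c: math.exp(lp - m) for c, lp in logp.items()}
        z = sum(exp.values())
        return {c: e / z for c, e in exp.items()}

def fit_views(labeled):
    """Train one classifier per view on the current labeled set."""
    ys = [y for _, y in labeled]
    return [TinyNB().fit([x[v] for x, _ in labeled], ys) for v in (0, 1)]

def co_train(labeled, unlabeled, rounds=5):
    """labeled: [((view0, view1), label)]; unlabeled: [(view0, view1)]."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        if not pool:
            break
        for view, clf in enumerate(fit_views(labeled)):
            for c in clf.classes:
                # Move the example this view predicts as class c with the
                # highest confidence from the pool into the labeled set.
                best = None
                for i, x in enumerate(pool):
                    probs = clf.predict_proba(x[view])
                    if max(probs, key=probs.get) == c and (
                        best is None or probs[c] > best[0]
                    ):
                        best = (probs[c], i)
                if best is not None:
                    labeled.append((pool.pop(best[1]), c))
    return fit_views(labeled)

def predict(clf_a, clf_b, x):
    """Combine both views by multiplying their class probabilities."""
    pa, pb = clf_a.predict_proba(x[0]), clf_b.predict_proba(x[1])
    return max(pa, key=lambda c: pa[c] * pb[c])

# Toy data (hypothetical): view 0 = page words, view 1 = URL tokens.
labeled = [
    ((["government", "ministry", "public"], ["gov", "ca"]), "public"),
    ((["charity", "donate", "community"], ["org"]), "non-profit"),
]
unlabeled = [
    (["ministry", "health", "services"], ["gov", "ab"]),
    (["donate", "volunteer", "charity"], ["org", "foundation"]),
    (["health", "services", "government"], ["gov"]),
    (["volunteer", "community", "donate"], ["org"]),
]
clf_a, clf_b = co_train(labeled, unlabeled)
print(predict(clf_a, clf_b, (["ministry", "services"], ["gov"])))
```

Selecting one example per class per view keeps the growing training set roughly balanced, which matters when only a handful of labeled websites are available at the start.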