Classifying Websites into Non-topical Categories

  • Author / Creator
    Thapa, Chaman
  • Abstract
    With organizations from every sector of the economy now present on the web, detecting which sector a given website belongs to is both important and challenging. We study the problem of classifying websites into four non-topical categories: public, private, non-profit and commercial franchise. We examine textual features based on word unigrams and bigrams, syntactic features based on part-of-speech tags and named-entity distributions, and structural features based on website depth, link structure and URL patterns. Our experiments with different feature sets show that syntactic and structural features improve performance when combined with word unigrams and bigrams, and that the improvement is larger when few words are available. Using websites related to obesity control, we also compare classifiers built on words extracted from various depths of a website. Our experiments in a multi-label classification setting show that crawling words from deeper levels may not be helpful.

    When the number of unlabeled websites is significantly larger than the number of labeled ones, which is usually the case, it is beneficial for classifiers to use both labeled and unlabeled data. Based on this observation, we combine multiple sets of features using the co-training algorithm in a semi-supervised setting. Our experiments show that co-training does improve classification accuracy when multiple feature sets and few labeled samples are available for training.
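
    The co-training idea mentioned above can be illustrated with a minimal sketch. It is not the thesis's implementation: the two views (page words and URL tokens), the tiny naive Bayes classifier, the function names and the toy data are all assumptions chosen only to keep the example self-contained.

```python
import math
from collections import Counter


def train_nb(docs, labels):
    """Train a tiny multinomial naive Bayes model on lists of tokens."""
    classes = sorted(set(labels))
    counts = {c: Counter() for c in classes}
    for tokens, y in zip(docs, labels):
        counts[y].update(tokens)
    vocab = {t for c in classes for t in counts[c]}
    return classes, counts, Counter(labels), vocab


def predict_nb(model, tokens):
    """Return (best_label, posterior probability of best_label)."""
    classes, counts, priors, vocab = model
    total = sum(priors.values())
    log_scores = {}
    for c in classes:
        # Add-one smoothing; the extra +1 reserves mass for unseen tokens.
        denom = sum(counts[c].values()) + len(vocab) + 1
        score = math.log(priors[c] / total)
        for t in tokens:
            score += math.log((counts[c][t] + 1) / denom)
        log_scores[c] = score
    top = max(log_scores.values())
    exp = {c: math.exp(s - top) for c, s in log_scores.items()}
    best = max(classes, key=lambda c: exp[c])
    return best, exp[best] / sum(exp.values())


def co_train(labeled, unlabeled, rounds=5, per_round=1):
    """Co-training over two views.

    labeled:   list of ((view_a_tokens, view_b_tokens), label)
    unlabeled: list of (view_a_tokens, view_b_tokens)
    In each round, the classifier for each view labels its most confident
    unlabeled examples, which then join the shared labeled set.
    """
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        for view in (0, 1):
            if not pool:
                break
            model = train_nb([x[view] for x, _ in labeled],
                             [y for _, y in labeled])
            scored = sorted(((predict_nb(model, x[view]), x) for x in pool),
                            key=lambda p: p[0][1], reverse=True)
            for (label, _conf), x in scored[:per_round]:
                labeled.append((x, label))
                pool.remove(x)
    models = tuple(train_nb([x[v] for x, _ in labeled],
                            [y for _, y in labeled]) for v in (0, 1))
    return models, labeled


# Hypothetical toy data: view A = page words, view B = URL tokens.
seed = [
    ((["government", "policy"], ["gov"]), "public"),
    ((["shop", "buy"], ["com", "store"]), "commercial"),
]
pool = [
    (["government", "services"], ["gov", "ca"]),
    (["buy", "deals"], ["com", "shop"]),
]
(word_model, url_model), grown = co_train(seed, pool)
```

    In this sketch each view's classifier contributes pseudo-labels that the other view also trains on, which is the core of co-training. A real setup would add a confidence threshold and hold out labeled data to check that the pseudo-labels are not degrading accuracy.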

  • Subjects / Keywords
  • Graduation date
    Fall 2012
  • Type of Item
    Thesis
  • Degree
    Master of Science
  • DOI
    https://doi.org/10.7939/R3S602
  • License
    This thesis is made available by the University of Alberta Libraries with permission of the copyright owner solely for non-commercial purposes. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.