Learning Representations for Anonymizing Sensor Data in IoT Applications

  • Author / Creator
    Hajihassani, Omid
  • Recent years have witnessed rapid growth in the deployment of IoT devices in homes and workplaces, with the number of devices projected to surpass tens of billions in the near future. This growth can be attributed to the useful insights and convenience offered by IoT services and applications. A typical IoT device is equipped with one or more sensors capable of collecting high-fidelity, high-sample-rate data from the environment, often without notifying the user. This ubiquitous and inconspicuous data collection threatens user privacy, as the collected data may contain private or sensitive information that malicious applications can extract through unsolicited inferences. This thesis investigates solutions based on generative machine learning models that limit the accuracy of privacy-intrusive inferences
    with an imperceptible impact on the accuracy of useful and desired inferences.

    We begin this thesis by surveying different approaches to privacy-preserving data collection and processing. As the first contribution of this thesis, we investigate the ability of variational autoencoder (VAE) models to learn representations that enable hiding the private information embedded in sensor data. Specifically, we modify the loss function of standard and conditional VAE models to obtain two different anonymization techniques. These techniques perform deterministic and probabilistic manipulations in the learned latent space of autoencoders. These manipulations effectively support data anonymization when the corresponding latent variable is used to reconstruct the original data.
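    The loss modification described above can be sketched as follows. This is a minimal, hypothetical illustration — the symbols, the mean-squared-error reconstruction term, the auxiliary private-attribute classifier, and the weights `alpha`, `beta`, `gamma` are assumptions for exposition, not the thesis's actual formulation:

```python
import numpy as np

def modified_vae_loss(x, x_hat, mu, log_var, priv_logits, priv_label,
                      alpha=1.0, beta=1.0, gamma=1.0):
    """Sketch of an anonymizing VAE objective: standard reconstruction and
    KL terms, plus a penalty that discourages the latent code from
    predicting a private attribute (all weights are illustrative)."""
    # Reconstruction error between input and decoder output.
    recon = np.mean((x - x_hat) ** 2)
    # KL divergence of the approximate posterior from the N(0, I) prior.
    kl = -0.5 * np.mean(1.0 + log_var - mu ** 2 - np.exp(log_var))
    # Cross-entropy of a hypothetical private-attribute classifier run on
    # the latent code; SUBTRACTING it pushes the code away from the label.
    probs = np.exp(priv_logits) / np.sum(np.exp(priv_logits),
                                         axis=1, keepdims=True)
    priv_ce = -np.mean(np.log(
        probs[np.arange(len(priv_label)), priv_label] + 1e-12))
    return alpha * recon + beta * kl - gamma * priv_ce
```

    Increasing `gamma` trades reconstruction fidelity for stronger concealment of the private attribute, which is the knob the utility-privacy discussion later in the abstract refers to.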

    To evaluate our methods, we use two publicly available Human Activity Recognition (HAR) datasets, namely the MobiAct and MotionSense datasets. These datasets contain both public and private information about users, which can be detected using inference models (through desired and sensitive inferences, respectively). Our goal is to conceal private information using the proposed techniques while preserving the accuracy of the desired inference as much as possible.

    We evaluate the efficacy of each technique in concealing private information through ablation studies and comparison with multiple baseline methods, including recent techniques proposed in the literature. We evaluate our techniques by treating the activity attribute in both datasets as public information, and the gender and weight of subjects as private information. We show that state-of-the-art anonymization techniques are vulnerable to a user re-identification attack,
    while our techniques are less susceptible to this attack thanks to the proposed non-deterministic manipulations. Compared with the best autoencoder-based baseline method, our techniques achieve 13.48% lower privacy loss on average across the two HAR datasets while maintaining comparable activity inference accuracy, indicating a better utility-privacy trade-off. Moreover, we discuss how users can navigate this trade-off (according to their own needs and values)
    by tweaking the weights in the modified loss functions of the generative models. We show that one of the proposed anonymization techniques can simultaneously conceal multiple private attributes with only a small decrease in the anonymization performance.
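    The non-determinism credited above with resisting re-identification can be sketched as a probabilistic latent-space manipulation. The function below is a hypothetical illustration, not the thesis's method: it assumes a known index set `private_dims` of latent dimensions that encode the private attribute, and resamples those dimensions from the N(0, 1) prior so that every pass produces a different latent code:

```python
import numpy as np

def probabilistic_anonymize(z, private_dims, rng=None):
    """Sketch of a probabilistic latent manipulation: the latent
    dimensions assumed to carry the private attribute (`private_dims`,
    a hypothetical index set) are resampled from the standard-normal
    prior, so repeated anonymizations of the same input differ."""
    rng = np.random.default_rng() if rng is None else rng
    z_anon = z.copy()
    # Overwrite only the private dimensions; utility-bearing
    # dimensions are left untouched.
    z_anon[:, private_dims] = rng.standard_normal(
        (z.shape[0], len(private_dims)))
    return z_anon
```

    Because the replacement values are drawn fresh each time, an attacker cannot link two anonymized records of the same user through a stable latent fingerprint, which is one plausible reading of why deterministic baselines are more vulnerable to re-identification.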

  • Subjects / Keywords
  • Graduation date
    Spring 2021
  • Type of Item
  • Degree
    Master of Science
  • DOI
  • License
    This thesis is made available by the University of Alberta Libraries with permission of the copyright owner solely for non-commercial purposes. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.