Multivariate Exploratory Data Analysis of Spatial Data to Support Geostatistical Modeling

  • Author / Creator
    Zhang, Haoze
  • Abstract
    Geostatistical modeling takes geological data as input and builds statistical models for resource
    prediction. A geostatistical workflow consists of several stages, including preprocessing, modeling, and
    postprocessing. Exploratory data analysis (EDA) is an early step in preprocessing: it characterizes
    the data and helps identify erroneous or inconsistent values. In geostatistics, missing data and
    below detection limit (BDL) data are important anomalies to understand during EDA. Missing data
    are problematic for EDA techniques such as principal component analysis (PCA), and BDL data
    cause problems in cluster analysis and other analyses. Geostatistical modeling must be carried out
    within stationary domains, so multivariate and spatial cluster analysis is another important aspect
    of EDA: it separates the data into smaller groups whose members share similar features.
    This thesis covers multiple aspects of geostatistical EDA. A data map examines missing data by
    showing the number of missing values for each variable and location. A combined permutation and
    Kolmogorov–Smirnov (KS) test identifies whether the missingness in a variable is systematic. BDL
    data are investigated with univariate and bivariate methods. A BDL statistics table complements
    histograms, and three methods measure the spikiness of the data. Bivariate analysis compares the
    observed distributions of BDL occurrence with the distributions expected under full independence.
    A test based on the Kullback–Leibler (KL) divergence quantifies the difference between the two
    distributions, identifying combinations of variables in which BDL occurrence may be dependent.
    This helps explain why BDL data occur.
    The handling of BDL data in cluster analysis is addressed through a workflow that finds the
    optimal number of clusters. Tests on synthetic data examine the compatibility of the workflow with
    different data transformations and clustering methods. K‑means is a suitable clustering method for
    dealing with BDL spikes. Four transformations compatible with the workflow are combined with
    k‑means to examine clusters in real data. The trade‑off between spatial continuity and multivariate
    continuity in cluster analysis is also addressed. A novel classification method is proposed to find the
    optimal clustering and domain labels. The classification algorithm takes multiple sets of ensemble
    clustering labels as inputs and assigns domains based on those labels and two hyperparameters:
    spatial weight and number of domains. The matrix of classification results shows that a higher
    spatial weight produces more continuous domains. Flow simulation results show that the domain
    label assignment affects the performance of the final geostatistical models, because flow
    responses are highly sensitive to spatial and multivariate continuity.
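
As an illustration of the missingness test mentioned in the abstract, the sketch below shows one way a permutation test can be combined with the KS statistic. It is not the thesis's implementation. The grouping rule (compare the values of a second variable at samples where the first variable is missing against samples where it is observed), the function names `ks_statistic` and `permutation_ks_test`, and the synthetic data are all assumptions made for illustration.

```python
import random


def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both samples past x so that ties are handled together.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


def permutation_ks_test(values, is_missing, n_perm=2000, seed=0):
    """Permutation p-value for the KS statistic between two groups.

    values     : observed values of a second variable (hypothetical input)
    is_missing : flags saying whether the first variable is missing
                 at the same sample
    """
    rng = random.Random(seed)

    def split_and_score(flags):
        g1 = [v for v, m in zip(values, flags) if m]
        g0 = [v for v, m in zip(values, flags) if not m]
        return ks_statistic(g1, g0)

    observed = split_and_score(is_missing)
    flags = list(is_missing)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(flags)  # break any link between missingness and values
        if split_and_score(flags) >= observed:
            exceed += 1
    # The +1 terms count the observed labelling as one of the permutations.
    return observed, (exceed + 1) / (n_perm + 1)


# Synthetic data: 40 samples with values 1..40.
values = [float(v) for v in range(1, 41)]
systematic = [v > 30 for v in values]         # missing only where values are high
scattered = [i % 4 == 3 for i in range(40)]   # missing at every fourth sample

d_sys, p_sys = permutation_ks_test(values, systematic)
d_sca, p_sca = permutation_ks_test(values, scattered)
print(f"systematic: D={d_sys:.2f}, p={p_sys:.4f}")
print(f"scattered:  D={d_sca:.2f}, p={p_sca:.4f}")
```

Under these assumptions, a small p-value suggests that missingness in the first variable is related to the values of the second variable, while a large p-value gives no evidence of a systematic pattern.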

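The bivariate BDL comparison described in the abstract can be illustrated with a small worked example. The sketch below assumes that BDL occurrence is stored as one boolean flag per sample for each of two variables, builds the observed joint distribution of the flags, and measures its KL divergence (natural logarithm) from the distribution expected if the two flags were independent. The function name `bdl_kl_divergence` and the data are hypothetical, and the thesis may define its comparison differently, for example over more than two variables.

```python
import math
from collections import Counter


def bdl_kl_divergence(bdl_a, bdl_b):
    """KL divergence of the observed joint BDL distribution from independence.

    bdl_a, bdl_b : equal-length sequences of booleans, True where the
                   sample is below the detection limit for that variable.
    """
    n = len(bdl_a)
    joint = Counter(zip(bdl_a, bdl_b))
    p_a = sum(bdl_a) / n  # marginal BDL frequency of variable A
    p_b = sum(bdl_b) / n  # marginal BDL frequency of variable B

    kl = 0.0
    for a in (False, True):
        for b in (False, True):
            observed = joint[(a, b)] / n
            expected = (p_a if a else 1 - p_a) * (p_b if b else 1 - p_b)
            if observed > 0:  # empty cells contribute nothing
                kl += observed * math.log(observed / expected)
    return kl


# Flags that occur independently: every combination is equally frequent.
kl_independent = bdl_kl_divergence([True, True, False, False] * 5,
                                   [True, False, True, False] * 5)

# Flags that always occur together: fully dependent BDL occurrence.
kl_dependent = bdl_kl_divergence([True, False] * 10,
                                 [True, False] * 10)

print(f"independent: {kl_independent:.3f}")
print(f"dependent:   {kl_dependent:.3f}")
```

A divergence near zero is consistent with independent BDL occurrence, and larger values flag variable pairs whose BDL occurrence may be dependent.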
  • Subjects / Keywords
  • Graduation date
    Spring 2022
  • Type of Item
    Thesis
  • Degree
    Master of Science
  • DOI
    https://doi.org/10.7939/r3-a59t-6s56
  • License
    This thesis is made available by the University of Alberta Libraries with permission of the copyright owner solely for non-commercial purposes. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.