Multivariate Exploratory Data Analysis of Spatial Data to Support Geostatistical Modeling
-
- Author / Creator
- Zhang, Haoze
-
Geostatistical modeling takes geological data as inputs and builds statistical models for resource
prediction. Geostatistics consists of several components, including preprocessing, modeling, and
postprocessing. Exploratory data analysis (EDA) is an early step in preprocessing. It provides the
characteristics of data and helps identify erroneous or inconsistent data. In the context of geostatistics, missing data and below detection limit (BDL) data are important anomalies to be understood
in EDA. Missing data are problematic for EDA techniques such as principal component analysis
(PCA). BDL data also cause problems in cluster analysis and other analyses. Geostatistical models need to be built within stationary domains, so multivariate and spatial cluster
analysis is another important aspect of EDA. It separates data into smaller groups in which the data
share similar features.
This thesis covers multiple aspects of geostatistical EDA. A data map examines missing data by
showing the number of missing values for each variable and location. A combined permutation and
Kolmogorov–Smirnov (KS) test identifies whether the missingness in variables is systematic. BDL data are
investigated with univariate and bivariate methods. A BDL statistics table complements histograms.
Three methods measure the spikiness of data. Bivariate analysis compares observed distributions
with the distributions expected under full independence of BDL occurrence. A Kullback–Leibler
(KL) test quantifies the difference between the distributions, identifying combinations of variables in
which the BDL occurrence may be dependent. This helps in understanding the reasons for BDL
data.
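The bivariate comparison described above can be illustrated with a minimal sketch. It is not the thesis implementation: it assumes the BDL status of each variable is available as a boolean indicator per sample, that the expected distribution is the product of the two marginal BDL frequencies, and that the difference is measured with the standard KL divergence. The function name bdl_kl_divergence is hypothetical.

```python
import numpy as np


def bdl_kl_divergence(bdl_a, bdl_b):
    """KL divergence D(observed || expected) for the joint BDL
    occurrence of two variables (hypothetical helper, illustration only).

    bdl_a, bdl_b: boolean sequences, True where the sample is BDL.
    """
    a = np.asarray(bdl_a, dtype=bool)
    b = np.asarray(bdl_b, dtype=bool)

    # Observed 2x2 joint distribution of (a is BDL, b is BDL).
    observed = np.zeros((2, 2))
    for i in (0, 1):
        for j in (0, 1):
            observed[i, j] = np.mean((a == bool(i)) & (b == bool(j)))

    # Expected joint distribution under full independence:
    # outer product of the two marginal distributions.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0))

    # Cells with zero observed probability contribute nothing to the sum.
    # Wherever observed > 0 the expected value is also > 0.
    mask = observed > 0
    return float(np.sum(observed[mask] * np.log(observed[mask] / expected[mask])))


# Independent BDL occurrence: divergence is zero.
kl_independent = bdl_kl_divergence([True, True, False, False],
                                   [True, False, True, False])

# Perfectly dependent BDL occurrence: divergence is log(2).
kl_dependent = bdl_kl_divergence([True, True, False, False],
                                 [True, True, False, False])
```

A value near zero suggests the BDL occurrences of the two variables are close to independent, while larger values flag variable pairs worth a closer look.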
The handling of BDL data in cluster analysis is addressed, including a workflow that finds the
optimal number of clusters. Tests on synthetic data examine the compatibility of the workflow with
different data transformations and clustering methods. K‑means is a suitable clustering method for
dealing with BDL spikes. Four transformations compatible with the workflow are combined with
k‑means to examine clusters in real data. The trade‑off between spatial continuity and multivariate
continuity in cluster analysis is addressed. A novel classification method is proposed to find the
optimal clustering and domain labels. The classification algorithm takes ensemble clustering labels, that is, multiple sets of clustering labels, as inputs. The domains
are assigned based on the clustering labels and two hyperparameters: spatial weight and number of
domains. The matrix of classification results shows that a higher spatial weight results in more continuous domains. Flow simulation results show that the domain label assignment has an impact on
the performance of the final geostatistical models, because flow responses are highly sensitive to
spatial and multivariate continuity.
- Graduation date
- Spring 2022
-
- Type of Item
- Thesis
-
- Degree
- Master of Science
-
- License
- This thesis is made available by the University of Alberta Libraries with permission of the copyright owner solely for non-commercial purposes. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.