The second step involves selecting the area for which a map is to be developed and dividing it into pixels (squares). The pixel size depends on the resolution of the available data (1 km² in GAP). The average of the measured concentrations within each pixel is used for model calibration. For each measured data point (dependent variable), an iterative statistical analysis is made with the predictor variables to find the degree of positive or negative correlation between the two and to determine the model coefficients.
The measured data should encompass comparable proportions of high and low values. This spread is essential, since the model can predict only across the value range on which it was calibrated. In the case of logistic regression, where the dependent variable is taken to be either high or low (1 or 0), the cut-off between the two is commonly chosen to be the contaminant concentration limit set by the authorities (e.g. WHO) as acceptable for human consumption.
The independent variables likewise need to span a range of values. For example, if an independent variable has the same value at all of the data points being modelled, it cannot explain any of the variance in the data, since it does not itself vary. To establish a correlation, the measured data points must therefore cover a broad range of independent variable values. This could mean targeting a groundwater sampling campaign at regions that differ in, for example, geology or soil type, and not necessarily at those where high arsenic levels are expected.
A rule of thumb for the minimum size of the dataset (of the dependent variable) to be modelled is a ratio of at least 10 cases to every independent variable. Here, “cases” refers to the smaller of the number of high and low data values (1 or 0) in the dataset.
For example, with three independent variables and a dataset of 60% high values and 40% low values, the limiting class is the 40% of low values, so the dataset should contain at least 10 × 3 / 0.4 = 75 samples (REF).
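The rule of thumb and the high/low cut-off can be sketched in a few lines of Python. This is an illustrative sketch only, not part of the GAP workflow: the function names, the choice to treat values strictly above the cut-off as "high", and the sample concentrations are all assumptions made for the example.

```python
import math


def binarize(concentrations, cutoff):
    """Label each value 1 (high) or 0 (low).

    Assumption: values strictly above the cut-off count as high.
    """
    return [1 if c > cutoff else 0 for c in concentrations]


def limiting_cases(labels):
    """Return the number of 'cases': the smaller of the high and low counts."""
    high = sum(labels)
    low = len(labels) - high
    return min(high, low)


def satisfies_rule(labels, n_predictors, cases_per_predictor=10):
    """Check whether there are at least 10 cases per independent variable."""
    return limiting_cases(labels) >= cases_per_predictor * n_predictors


def min_dataset_size(n_predictors, minority_fraction, cases_per_predictor=10):
    """Smallest dataset size that meets the rule of thumb.

    minority_fraction is the share of the less frequent class (0 < f <= 0.5).
    The value is rounded before taking the ceiling so that floating-point
    noise cannot push an exact result up by one.
    """
    required = cases_per_predictor * n_predictors / minority_fraction
    return math.ceil(round(required, 9))


# Worked example from the text: three predictors, 60% high / 40% low.
n_required = min_dataset_size(n_predictors=3, minority_fraction=0.4)

# Hypothetical pixel-averaged concentrations with an illustrative cut-off of 10.
labels = binarize([5.0, 12.0, 50.0, 8.0, 10.0], cutoff=10.0)
enough_data = satisfies_rule(labels, n_predictors=3)
```

Here `n_required` reproduces the 75 samples of the example above, while the five hypothetical values give only two cases in the smaller class, far short of the 30 that three independent variables would require.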
Note that under both low- and high-pH conditions, dissolved fluoride concentrations in groundwater can be elevated, because dissolved calcium concentrations are too limited to control dissolved fluoride through the precipitation of fluorite (CaF₂(s)).