Proposal, availability and interface
The tool developed in this study was called GEOTrat - Points, and its most up-to-date version is available in the GitHub repository, accessible online via the link https://github.com/LauraMouraXavier/geotrat (accessed in February 2024). The repository contains the GEOTrat_Points files with the model and .py extensions, compatible with QGIS Desktop version 3.22.8. To use the developed resource, it is necessary to import one of these files into the QGIS software’s processing toolbox and to install the SAGA and SAGA Next Gen add-ons.
The purpose of the GEOTrat - Points tool is to allow to evaluate the effectiveness of a treatment quantitatively and spatially in relation to others using maps generated through geostatistical interpolation. The agricultural experiment to be conducted in the field must be structured in a randomized block design, with a minimum of two treatments and up to five different treatments called T1, T2, T3, T4, and T5. The user, in turn, must collect georeferenced samples of the agricultural variable of interest to perform a comparative analysis. Figure 1 illustrates examples of field experiment design suitable for applying the resource developed in this study.
In addition to the examples of designs shown in Fig. 1, both the study area and the treatment blocks are not restricted to specific dimensions and can take on any configuration. In addition, the number of samples is determined by the user, and it is suggested to follow the norms defined by agricultural experimentation manuals. More detailed information on the principles of experimental planning, design, and data analysis in agricultural contexts can be found in [11] and [12]. It is also suggested that the samples cover the entire study area, distributed homogeneously in equal quantities between the two treatments.
The interface for running GEOTrat - Points follows the pattern of the tools developed in QGIS and is shown in Fig. 2.
The fields displayed on the tool’s interface are for entering inputs and specifying the paths for storing the results. The inputs consist of information provided by the user, covering data related to the experiment conducted in the field as well as the parameters needed to conduct geostatistical interpolation. In the right-hand corner of the tool’s interface is a help text box with detailed information about each of the tool’s input and output parameters as well as details about the developers and versions available. The tool also has buttons for starting execution, closing the interface, accessing the help function, and running a set of batch processes.
The GEOTrat - Points tool consists of inputs, algorithms, and outputs. The logic flow diagram is shown in Fig. 3, which consists of the interconnections and dependencies between the inputs, algorithms used, and outputs. The following topics detail the specifications of the inputs, algorithms, and outputs generated by the tool.
Inputs
The tool’s inputs must be supplied by the user and consist of vector layers, vector layer fields, coordinate reference system, numerical values, and enumerated lists of options. The specifications of the inputs and their formats are shown in Table 1.
Table 1
Input specifications and their formats
Input Name | Format |
Variable measured in the field | Vector Layer - Geometry type Point |
Variable field | Vector Field - Data type Number |
Treatment field | Vector Field - Data type String |
Reference treatment | Enumerated List (T1 or T2) |
Study area polygon | Vector Layer - Geometry type Polygon |
Projected Coordinate System | Coordinate Reference System |
Pixel size (m) | Number |
Semivariogram model | Enumerated List (Linear, Exponential, Gaussian and Spherical) |
Number of treatments | Number |
To use the tool, the first piece of information to be entered by the user is the variable measured in the field, which must be a vector layer of point geometry in shapefile format. This vector layer should contain the georeferenced points of the samples obtained in the field experiment. The file’s attribute table must contain two mandatory fields: a numeric field with the values of the agricultural variable measured in the field, to be defined in the Variable field entry; and a textual field identifying the treatment applied (e.g., T1, T2, T3, T4, T5), to be defined in the Treatment field entry.
In the Reference treatment entry, which is an enumerated list with the options T1 (default), T2, T3, T4 and T5, the user must indicate which treatment will be considered the reference for comparison with the other treatments. In addition, in the Number of treatments entry, the user must indicate the number of treatments used; in this option, the minimum value is equal to 2 (default), and the maximum value is equal to 5.
In addition to the vector layer of points, the user must have a vector layer of polygon geometry in shapefile format, which will represent the delimitation of the field experiment’s study area. This vector layer must be inserted in the Study area polygon input.
The comparison of treatments conducted by the tool uses the technique of spatial interpolation using geostatistics, specifically the Kriging method (more details in section 2.3). This method is used to generate maps by estimating matrix surfaces. Therefore, the user must provide additional information, such as the spatial resolution in meters of the generated surfaces, which must be specified in the Pixel size (m) input. It is recommended that the pixel size chosen is compatible with the average distance at which samples are collected in the field. In addition, in the Semivariogram model entry, the user must enter the mathematical model of the semivariogram to be used in kriging interpolation. The list provides the following options: Linear (default), Gaussian, Exponential, and Spherical. Once the inputs have been defined, the user can run the tool.
ALGORITHMS
The algorithms used to develop the tool belong to the QGIS geoprocessing package and the algorithm providers SAGA and GDAL. Table 2 shows the sequence of procedures conducted by the tool, detailing the order of execution, the description given to each process in the development of the tool, the algorithm used, and its respective provider.
Table 2
Procedures executed by the tool, order of execution, description, algorithm, and provider
Order of execution | Process | Description in the tool | Algorithm | Provider |
1 | Reprojecting sample points | Reproject points | Reproject layer | QGIS |
Redesign area | Reproject area |
2 | Rename the field of the agricultural variable of interest in the points layer | Rename variable field | Rename field | QGIS |
Rename the field specifying the point layer treatments | Rename treatment field |
3 | Separate the points belonging to the different treatments | Separate T1 | Extract by attribute | QGIS |
Separate T2 |
Separate T3 |
Separate T4 |
Separate T5 |
4 | Select 80% of the points belonging to each treatment | Select T1-80% | Rondom selection | QGIS |
Select T2-80% |
Select T3-80% |
Select T4-80% |
Select T5-80% |
5 | Extract the selected points to the vector layer | Extract T1-80% | Extract selected features | QGIS |
Extract T2-80% |
Extract T3-80% |
Extract T4-80% |
Extract T5-80% |
6 | Extract the remaining 20% of points to the vector layer | Extract T1-20% | Extract by location | QGIS |
Extract T2-20% |
Extract T3-20% |
Extract T4-20% |
Extract T5-20% |
7 | Estimate the variable of interest for the study area from 80% of the points -Interpolation by ordinary kriging | KrigO - T1-80% | Ordinary kriging | SAGA |
KrigO - T2-80% |
KrigO - T3-80% |
KrigO - T4-80% |
KrigO - T5-80% |
8 | Collect samples of the surface estimated with points using the 20% of the points | Sample T1 | Sample raster values | QGIS |
Sample T2 |
Sample T3 |
Sample T4 |
Sample T5 |
9 | Calculate the estimation error of the generated surface | Error T1 | Fied calculator | QGIS |
Error T2 |
Error T3 |
Error T4 |
Error T5 |
10 | Generate a surface of the calculated error - Interpolation by Ordinary Kriging | SupError T1 | Ordinary kriging | SAGA |
SupError T2 |
SupError T3 |
SupError T4 |
SupError T5 |
11 | Add up the estimated and calculated error surfaces, generating a final estimated surface | T1 | Raster calculator | GDAL |
T2 |
T3 |
T4 |
T5 |
12 | Cut out the final surface in the study area | T1_rec | Clip raster by mask layer | GDAL |
T2_rec |
T3_rec |
T4_rec |
T5_rec |
13 | Subtraction between final surfaces (Reference Treatment - Other Treatments) | Gain (T1 and T2) | Raster calculator | GDAL |
Gain (T1 and T3) |
Gain (T1 and T4) |
Gain (T1 and T5) |
Gain (T2 and T3) |
Gain (T2 and T4) |
Gain (T2 and T5) |
Gain (T3 and T4) |
Gain (T3 and T5) |
Gain (T4 and T5) |
14 | Calculate basic surface statistics generated by subtraction | Stats (T1 and T2) | Raster layer statistics | QGIS |
Stats (T1 and T3) |
Stats (T1 and T4) |
Stats (T1 and T5) |
Stats (T2 and T3) |
Stats (T2 and T4) |
Stats (T2 and T5) |
Stats (T3 and T4) |
Stats (T3 and T5) |
Stats (T4 and T5) |
The procedure shown in Table 2 describes the 14 steps conducted by the GEOTrat - Points tool when the user starts running it. The first step consists of reprojecting the point layers into a projected coordinate system, since ordinary kriging will be performed in subsequent steps. This technique requires the data to be in a metric coordinate system so that the semivariograms, calculated considering the distance between the samples, can be calculated, and reprojecting the input data guarantees suitability for the processes conducted.
Then, in step 2, the relevant fields in the points layer, which represent the agricultural variable of interest and treatment specification, are standardized with a specific name. This makes it easier to identify and manipulate the data during the analysis process. In step 3, the points are separated based on the treatment attribute, resulting in separate point layers for each treatment.
In steps 4, 5, and 6, the samples from each treatment are randomly divided. This division is necessary to generate the surface interpolation model using the kriging method and, subsequently, to estimate the error of this model. This method of dividing samples for modeling is known as holdout, a common technique used in machine learning and statistics to evaluate the effectiveness of a mathematical model, and it was chosen due to its ease of implementation. In this method, the dataset is divided into two mutually exclusive subsets: a training set used to generate the estimation model (usually 80% of the data), and a test set used to evaluate the model’s performance (usually 20% of the data). More information on the holdout technique can be found in [13], and more recently in the works by [14] and [15].
In step 7, interpolation by ordinary kriging is conducted using 80% of the points with samples from each treatment, estimating the variable of interest for the study area. This results in the creation of estimated surfaces of the variable of interest based only on the points of each treatment (T1, T2, T3, T4, T5). Kriging is a technique widely used in geostatistics and spatial analysis to estimate unknown values at unsampled locations based on known point observations [16]. This technique is used in agricultural applications to map the spatial variability of soil attributes, crop yields, and other agricultural parameters [17][10][18].
Ordinary kriging stands out as one of the most widely used techniques in geostatistics due to its ability to consider the spatial dependence of data. This approach assumes that the correlation between values at different locations can be modeled using a semivariogram, which describes how the variability of the data varies with the distance between the sampled points. It then estimates unsampled values by weighting the closest observations and the spatial correlation structure [19].
In the GEOTrat - Points tool, the parameters used to run the ordinary kriging algorithm, provided by the SAGA tester, are the sample points from the treatments and the parameters defined in the input, which are the variable of interest, contour of the study area, size of the interpolation pixel, and semivariogram model. To facilitate comparisons of the experiments, the other parameters were kept at the algorithm’s default values (lag equal to 100, skip equal to 1, global search range, maximum search distance equal to 1,000, minimum of 16 neighbors, and maximum of 20 neighbors).
In steps 8 and 9, the estimated surfaces are sampled using the 20% of the points separated in step 6. After this sampling, the estimation error is calculated by subtracting the estimated value of the variable of interest from the corresponding measured value. This calculated error, in step 10, is used to generate a surface that represents the error calculated for the surfaces. This step is conducted again using the ordinary kriging technique. Then, in step 11, the agricultural variable estimation surface for each treatment and its respective error surface are added together, producing final estimated surfaces for the treatments. In step 12, these final surfaces are cut out, thus limiting the analysis to the space of the study area.
Finally, in step 13, the final surfaces of the treatments are subtracted, allowing a comparison between the treatment defined as the reference and the other treatments. In addition, in step 14, basic statistics are calculated for the surfaces resulting from the subtractions, providing important information on the differences between the treatments, such as sum, mean, standard deviation, maximum, and minimum.
Outputs
The outputs generated by the resource developed in this study consist of estimated surfaces in matrix format for the study area of the agricultural variable of interest, taking into account the treatments. The T1 surface represents an estimate of the agricultural variable for the study area if it were treated exclusively with the T1 treatment, and the same applies to the T2, T3, T4, and T5 outputs.
In addition, the tool generates an output called Gain, which is a surface in matrix format representing the gain associated with the treatment defined as a reference in relation to the other treatments. An output file with an HTML extension, entitled Statistics of Gain, is generated to present the basic statistics of these surfaces. It should be noted that the results generated are presented in the same unit of measurement as the agricultural variable of interest, and the execution time is variable, depending on the settings of the equipment used and the number of points to be analyzed.
PERFORMANCE EVALUATION AND DISCUSSIONS
This topic presents the performance evaluation of the GEOTrat - Points tool. The data used for the case study were provided by the company Lallemand Plant Care Ltda (in the study, no experiments were carried out with live plants). They belong to experiments that followed a block design, consisting of two different treatments with products of biological origin. It is important to note that information on the specifications of the products used is confidential and therefore not included in the scope of this research.
The simulated surfaces for the experimental area, under treatments T1 and T2, were evaluated by calculating the Root Mean Square Error (RMSE), using the samples taken from the database before the start of the case study using the tool. RMSE is a metric widely used to assess the accuracy of forecasting and estimation models. The lower the RMSE value, the smaller the relative discrepancy between the estimates and the values measured in the field. Eq. 1 shows the formula used to calculate the RMSE as a percentage:
$$\:RMSE\:\left(\%\right)=\:\frac{\sqrt{\frac{{\sum\:\left({y}_{i}-{\widehat{y}}_{i}\right)}^{2}}{n}}}{\frac{\sum\:{y}_{i}}{n}}*\:100$$
1
where \(\:{y}_{i}\) represents the yield value measured in the field; \(\:{\widehat{y}}_{i}\) represents the estimated yield value; and \(\:n\) the number of evaluation samples.
Another measure used to evaluate the surfaces calculated in this study was Pearson’s Correlation Coefficient (r), which can take values in the range − 1 to + 1. This coefficient was used to measure the intensity and direction of the linear relationship between the yield values measured in the field and the values estimated for the T1 and T2 surfaces. Eq. 2 shows the formula used to calculate r:
$$\:r=\:\frac{\sum\:\left({y}_{i}-\:\stackrel{-}{y}\right)\left({\widehat{y}}_{i}-\stackrel{-}{\widehat{y}}\right)}{\sqrt{\left(\sum\:{\left({y}_{i}-\:\stackrel{-}{y}\right)}^{2}\right)\left(\sum\:{\left({\widehat{y}}_{i}-\stackrel{-}{\widehat{y}}\right)}^{2}\right)}}$$
2
where \(\:{y}_{i}\) represents the yield measured in the field; \(\:\stackrel{-}{y}\) represents the average yield measured in the field; \(\:{\widehat{y}}_{i}\) represents the estimated yield value; and \(\:\stackrel{-}{\widehat{y}}\) represents the average of the estimated productivity.
Data
The samples were collected from six experimental areas, five of them located in France and one in Brazil. In these areas, different grain crops were grown, including barley, sunflower, corn, soybeans, wheat, and triticale. Figure 4 shows the geographical location of these six areas as well as the configuration adopted for the treatments’ design.
The map in Fig. 4 shows the spatial distribution of the samples from treatments T1 and T2. The experiments aimed to assess the impact of the T2 treatment on increasing the yield of the crops under study. To this end, the samples were collected using a combine harvester, to cover the entire experimental area homogeneously. It should be added that the samples are already collected in a georeferenced manner, regardless of the direction of harvest or layout of the treatments. The presence of empty areas in some treatment strips is due to faults in the planting line or errors in the machine’s yield measuring equipment.
Two vector files in shapefile format were made available for each of the areas: a vector file with point-type geometry, with a numeric field containing the yield value in kg/ha and another binary textual field specifying the treatment used, identified as T1 or T2; and another vector file of polygon geometry representing the delimitation of the experimental area. Table 3 shows relevant information about each area, including the crop, size of the area, and basic statistics relating to the total yield samples and for each treatment.
Table 3
Crop, area size, and basic statistics of the yield samples
Cultivation | Area size (ha) | Treatment | Number of samples | Average (kg/ha) | Sum (kg/ha) | Minimum (kg/ha) | Maximum (kg/ha) | Standard Deviation (kg/ha) | Coefficient of variation (%) |
Barley | 11.48 | Total | 4,800 | 5600.00 | 26869700.00 | 1000.00 | 10100.00 | 1530.00 | 27.32 |
T1 | 2,400 | 5680.00 | 13635900.00 | 1200.00 | 10100.00 | 1570.00 | 27.64 |
T2 | 2,400 | 5510.00 | 13233800.00 | 1000.00 | 9900.00 | 1480.00 | 26.86 |
Sunflower | 12.62 | Total | 9,800 | 3591.52 | 35196923.10 | 2132.10 | 4829.00 | 494.26 | 13.76 |
T1 | 4,900 | 3468.91 | 16997649.60 | 2132.10 | 4827.20 | 503.90 | 14.53 |
T2 | 4,900 | 3714.14 | 18199273.50 | 2138.80 | 4829.00 | 452.32 | 12.18 |
Corn | 3.92 | Total | 4,400 | 10362.17 | 45593563.00 | 5364.20 | 15072.20 | 1362.58 | 13.15 |
T1 | 2,200 | 10436.65 | 22960630.90 | 5364.20 | 15072.20 | 1458.12 | 13.97 |
T2 | 2,200 | 10287.69 | 22632932.10 | 5670.70 | 15000.20 | 1255.40 | 12.20 |
Soy | 3.2 | Total | 1,600 | 4470.00 | 7156520.00 | 3840.00 | 5090.00 | 250.00 | 5.59 |
T1 | 800 | 4450.00 | 3557810.00 | 3840.00 | 5070.00 | 250.00 | 5.62 |
T2 | 800 | 4500.00 | 3598710.00 | 3880.00 | 5090.00 | 240.00 | 5.33 |
Wheat | 5.01 | Total | 3,400 | 8060.00 | 27420830.00 | 4780.00 | 10900.00 | 980.00 | 12.16 |
T1 | 1,700 | 7760.00 | 13191740.00 | 4950.00 | 10610.00 | 920.00 | 11.86 |
T2 | 1,700 | 8370.00 | 14229080.00 | 4780.00 | 10900.00 | 950.00 | 11.35 |
Triticale | 3.54 | Total | 1,940 | 6163.93 | 11958025.90 | 2520.80 | 9607.00 | 1181.11 | 19.16 |
T1 | 970 | 6063.13 | 5881237.70 | 2520.80 | 9607.00 | 1215.33 | 20.04 |
T2 | 970 | 6264.73 | 6076788.20 | 2682.20 | 9326.70 | 1136.97 | 18.15 |
Table 3 provides an analysis of the yield of the different agricultural crops and in relation to the treatments. The case study included various grain crops with different area sizes, ranging from 3.20 ha to 12.62 ha. In addition, the number of samples differed considerably, with intervals ranging from 1,600 points for the soybean crop to 9,800 points for the sunflower crop, with the number of samples being proportional to the size of the area and considering the planting lines used to conduct the experiment.
The number of samples per hectare recommended for the use of kriging applied to agriculture varies according to spatial variability, area size, desired precision, and available resources [20]. In the experiments in this research, the ratio of the number of samples per area is significantly high, with the barley crop having the lowest number of samples per area, approximately 418.12 samples/ha, and the maize crop having the highest number of samples per area, approximately 1,122.45 samples/ha. The equipment used to collect yields for the experimental areas provides a high density of samples.
The average yield of the experimental areas ranges from 3,591.52 kg/ha for the sunflower crop to 10,362.17 kg/ha for the corn area. The analysis of the sum of yields shows that maize was the crop with the highest yield, at 45593563.00 kg/ha, while soya recorded the lowest yield, at 7156520.00 kg/ha. In terms of the percentage of variation in yield values, barley and triticale have the greatest variability, with coefficients of variation of 27.32% and 19.16%, respectively. The sunflower, corn, and wheat crops show similar variability, with values of 13.76%, 13.15% and 12.16%, respectively, while the soybean crop has the lowest variability, with 5.59%.
Table 3 includes the yield averages for each treatment in each experiment, showing significant differences between the yield averages. The discrepancies between the average yields of the treatments range from 50 kg/ha for the soybean crop to 610.10 kg/ha for the wheat crop. It is important to note that, in all the experiments, the reference treatments were those of treatment T2; however, higher yield averages were noted in treatment T1 for the barley, corn, and soybean crops.
Variability in grain crop yields is a phenomenon that can be linked to various factors. Fluctuations in rainfall, temperature, and weather patterns affect the growth and development of these crops. In addition, the physical and chemical characteristics of the soil, such as texture, fertility, and acidity, directly interfere with the absorption of nutrients by plants. Another important factor is the choice of agricultural management, such as the use of seed varieties, fertilizers, irrigation, and pest and disease control. The interaction between these factors can lead to variations in crop yield [21].
Parameters
The GEOTrat - Points case study was preceded by a sample partitioning stage, aimed at external evaluation of the yield estimates generated by the models. To do this, 20% of the total samples were randomly selected for each treatment. This process was conducted using the QGIS software, making use of the random selection functionality.
The input parameters for execution were defined as the remaining 80% of the samples after the initial partitioning, selection of the field containing the yield, selection of the field specifying the treatment, and definition of the reference treatment set as T2. In addition, the boundaries of the experimental areas were included, in addition to the projected coordinate system referring to the location of each area, a standard pixel size of 1.50 meters, and the selection of the semivariogram model with linear equation as standard.