Currently, there are 7 functions associated with the sample
verb in the sgsR
package:
sample_srs()
- simple random samplingsample_systematic()
- systematic sampling in a grid or hexagon tessellationsample_strat()
- stratified sampling within asraster
sample_nc()
- Nearest centroid sampling. See Melville & Stone (2016)sample_clhs()
- Latin hypercube samplingsample_ahels()
- adapted hypercube evaluation of a legacy sample (ahels)
Access
One key feature of using some sample_*
functions is its ability to define access
corridors. Users can supply a road access
network (must be sf
line objects) and define buffers around access
where samples should be excluded and included.
Relevant and applicable parameters when access
is defined are:
buff_inner
- Can be left asNULL
(default). Inner buffer parameter that defines the distance fromaccess
where samples cannot be taken (i.e. if you don’t want samples within 50 m of youraccess
layer setbuff_inner = 50
).buff_outer
- Outer buffer parameter that defines the maximum distance that the samples can be located fromaccess
(i.e. if you don’t want samples more than 200 meters from youraccess
layer setbuff_inner = 200
).
sample_srs
We have demonstrated a simple example of using the sample_srs()
function in vignette("sgsR")
. We will demonstrate additional examples below.
The input required for sample_srs()
is a raster
. This means that sraster
and mraster
are supported for this function.
#--- perform simple random sampling ---#
sample_srs(raster = sraster, # input sraster
nSamp = 200, # number of desired samples
plot = TRUE) # plot
#> Simple feature collection with 200 features and 0 fields
#> Geometry type: POINT
#> Dimension: XY
#> Bounding box: xmin: 431110 ymin: 5337710 xmax: 438510 ymax: 5343190
#> CRS: +proj=utm +zone=17 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs
#> First 10 features:
#> geometry
#> 1 POINT (433630 5341170)
#> 2 POINT (433630 5341170)
#> 3 POINT (435790 5342410)
#> 4 POINT (433890 5341450)
#> 5 POINT (434470 5340850)
#> 6 POINT (435610 5341890)
#> 7 POINT (432330 5341550)
#> 8 POINT (434250 5342610)
#> 9 POINT (432610 5343190)
#> 10 POINT (432350 5338490)
sample_srs(raster = mraster, # input mraster
nSamp = 200, # number of desired samples
access = access, # define access road network
mindist = 200, # minimum distance samples must be apart from one another
buff_inner = 50, # inner buffer - no samples within this distance from road
buff_outer = 200, # outer buffer - no samples further than this distance from road
plot = TRUE) # plot
#> Simple feature collection with 200 features and 0 fields
#> Geometry type: POINT
#> Dimension: XY
#> Bounding box: xmin: 431150 ymin: 5337710 xmax: 438530 ymax: 5343230
#> CRS: +proj=utm +zone=17 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs
#> First 10 features:
#> geometry
#> 1 POINT (433570 5341270)
#> 2 POINT (432830 5341730)
#> 3 POINT (434430 5338770)
#> 4 POINT (434050 5338630)
#> 5 POINT (434970 5339930)
#> 6 POINT (434530 5339430)
#> 7 POINT (436110 5342390)
#> 8 POINT (435070 5337970)
#> 9 POINT (433010 5339650)
#> 10 POINT (435330 5342270)
sample_systematic
The sample_systematic()
function applies systematic sampling across an area with the cellsize
parameter defining the resolution of the tessellation. The tessellation shape can be modified using the square
parameter. Assigning TRUE
(default) to the square
parameter results in a regular grid and assigning FALSE
results in a hexagonal grid. The location of samples can also be adjusted using the locations
parameter, where centers
takes the center, corners
takes all corners, and random
takes a random location within each tessellation.
#--- perform grid sampling ---#
sample_systematic(raster = sraster, # input sraster
cellsize = 1000, # grid distance
plot = TRUE) # plot
#> Simple feature collection with 32 features and 0 fields
#> Geometry type: POINT
#> Dimension: XY
#> Bounding box: xmin: 431900 ymin: 5338400 xmax: 437900 ymax: 5342400
#> CRS: +proj=utm +zone=17 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs
#> First 10 features:
#> geometry
#> 1 POINT (431900 5338400)
#> 2 POINT (432900 5338400)
#> 3 POINT (433900 5338400)
#> 4 POINT (434900 5338400)
#> 5 POINT (435900 5338400)
#> 6 POINT (436900 5338400)
#> 7 POINT (437900 5338400)
#> 8 POINT (432900 5339400)
#> 9 POINT (433900 5339400)
#> 10 POINT (434900 5339400)
#--- perform grid sampling ---#
sample_systematic(raster = sraster, # input sraster
cellsize = 500, # grid distance
square = FALSE, # hexagonal tessellation
location = "random", # random sample within tessellation
plot = TRUE) # plot
#> Simple feature collection with 164 features and 0 fields
#> Geometry type: POINT
#> Dimension: XY
#> Bounding box: xmin: 431123.1 ymin: 5337705 xmax: 438468.2 ymax: 5343149
#> CRS: +proj=utm +zone=17 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs
#> First 10 features:
#> geometry
#> 1 POINT (431242.9 5339872)
#> 2 POINT (431123.1 5341606)
#> 3 POINT (431200.1 5337728)
#> 4 POINT (431144.1 5338523)
#> 5 POINT (431155 5340477)
#> 6 POINT (431326.4 5341354)
#> 7 POINT (431398.6 5343071)
#> 8 POINT (431650.5 5338067)
#> 9 POINT (431419.7 5339931)
#> 10 POINT (431569.9 5340644)
sample_systematic(raster = sraster, # input sraster
cellsize = 500, # grid distance
access = access, # define access road network
buff_outer = 200, # outer buffer - no samples further than this distance from road
square = FALSE, # hexagonal tessellation
location = "corners", # take corners instead of centers
plot = TRUE)
#> Simple feature collection with 597 features and 0 fields
#> Geometry type: POINT
#> Dimension: XY
#> Bounding box: xmin: 431100 ymin: 5337903 xmax: 438350 ymax: 5343100
#> CRS: +proj=utm +zone=17 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs
#> First 10 features:
#> geometry
#> 1 POINT (431100 5340501)
#> 2 POINT (431100 5340501)
#> 3 POINT (431350 5340934)
#> 4 POINT (431350 5340646)
#> 5 POINT (431100 5340501)
#> 6 POINT (431350 5342667)
#> 7 POINT (431600 5338481)
#> 8 POINT (431100 5340501)
#> 9 POINT (431350 5340646)
#> 10 POINT (431600 5340501)
sample_strat
The sample_strat()
contains two method
s to perform sampling:
"Queinnec"
- Hierarchical sampling using a focal window to isolate contiguous groups of stratum pixels, which was originally developed by Martin Queinnec."random"
- Traditional stratified random sampling. Thismethod
ignores much of the functionality of the algorithm to allow users the capability to use standard stratified random sampling approaches without the use of a focal window to locate contiguous stratum cells.
method = "Queinnec"
Queinnec, M., White, J. C., & Coops, N. C. (2021). Comparing airborne and spaceborne photon-counting LiDAR canopy structural estimates across different boreal forest types. Remote Sensing of Environment, 262(August 2020), 112510.
This algorithm uses moving window (wrow
and wcol
parameters) to filter the input sraster
to prioritize sample locations where stratum pixels are spatially grouped, rather than dispersed individuals across the landscape.
Sampling is performed using 2 rules:
Rule 1 - Sample within spatially grouped stratum pixels. Moving window defined by
wrow
andwcol
.Rule 2 - If no more samples exist to satisfy desired sampling count, individual stratum pixels are sampled.
The rule applied to a select a particular sample is defined in the rule
attribute of output samples. We give a few examples below:
#--- perform stratified sampling random sampling ---#
sample_strat(sraster = sraster, # input sraster
nSamp = 200, # desired sample number
plot = TRUE) # plot
#> Simple feature collection with 200 features and 3 fields
#> Geometry type: POINT
#> Dimension: XY
#> Bounding box: xmin: 431190 ymin: 5337730 xmax: 438530 ymax: 5343110
#> CRS: NA
#> First 10 features:
#> strata type rule geometry
#> x 1 new rule1 POINT (434390 5341090)
#> x1 1 new rule2 POINT (437130 5337990)
#> x2 1 new rule2 POINT (434770 5341850)
#> x3 1 new rule2 POINT (434790 5341510)
#> x4 1 new rule2 POINT (437870 5338770)
#> x5 1 new rule2 POINT (431570 5340830)
#> x6 1 new rule2 POINT (436730 5338150)
#> x7 1 new rule2 POINT (437990 5338910)
#> x8 1 new rule2 POINT (438110 5338810)
#> x9 1 new rule2 POINT (436250 5342950)
In some cases, users might want to include existing
samples within the algorithm. In order to adjust the total number of samples needed per stratum to reflect those already present in existing
, we can use the intermediate function extract_strata()
.
This function uses the sraster
and existing
samples and extracts the stratum for each. These samples can be included within sample_strat()
, which adjusts total samples required per class based on representation in existing
.
#--- extract strata values to existing samples ---#
e.sr <- extract_strata(sraster = sraster, # input sraster
existing = existing) # existing samples to add strata value to
Notice that e.sr
now has an attribute named strata. If that parameter is not there, sample_strat()
will give an error.
sample_strat(sraster = sraster, # input sraster
nSamp = 200, # desired sample number
access = access, # define access road network
existing = e.sr, # existing samples with strata values
mindist = 200, # minimum distance samples must be apart from one another
buff_inner = 50, # inner buffer - no samples within this distance from road
buff_outer = 200, # outer buffer - no samples further than this distance from road
plot = TRUE) # plot
#> Simple feature collection with 400 features and 3 fields
#> Geometry type: POINT
#> Dimension: XY
#> Bounding box: xmin: 431110 ymin: 5337710 xmax: 438550 ymax: 5343230
#> CRS: NA
#> First 10 features:
#> strata type rule geometry
#> 1 1 existing existing POINT (435490 5342130)
#> 2 1 existing existing POINT (433730 5340590)
#> 3 1 existing existing POINT (433890 5343010)
#> 4 1 existing existing POINT (434390 5341350)
#> 5 1 existing existing POINT (433190 5343210)
#> 6 1 existing existing POINT (433070 5341890)
#> 7 1 existing existing POINT (434890 5342250)
#> 8 1 existing existing POINT (434330 5341990)
#> 9 1 existing existing POINT (434730 5337750)
#> 10 1 existing existing POINT (435070 5337910)
As seen on the code in the example above, the defined mindist
parameter specifies the minimum euclidean distance that samples must be apart from one another.
Notice that the sample outputs have type
and rule
attributes which outline whether the samples are existing
or new
and whether rule1
or rule2
were used to select the individual samples. If type
is existing (a user provided existing
sample), rule
will be existing as well as seen above.
sample_strat(sraster = sraster, # input
nSamp = 200, # desired sample number
access = access, # define access road network
existing = e.sr, # existing samples with strata values
include = TRUE, # include existing plots in nSamp total
buff_outer = 200, # outer buffer - no samples further than this distance from road
plot = TRUE) # plot
#> Simple feature collection with 200 features and 3 fields
#> Geometry type: POINT
#> Dimension: XY
#> Bounding box: xmin: 431110 ymin: 5337730 xmax: 438530 ymax: 5343230
#> CRS: NA
#> First 10 features:
#> strata type rule geometry
#> 1 1 existing existing POINT (435490 5342130)
#> 2 1 existing existing POINT (433730 5340590)
#> 3 1 existing existing POINT (433890 5343010)
#> 4 1 existing existing POINT (434390 5341350)
#> 5 1 existing existing POINT (433190 5343210)
#> 6 1 existing existing POINT (433070 5341890)
#> 7 1 existing existing POINT (434890 5342250)
#> 8 1 existing existing POINT (434330 5341990)
#> 9 1 existing existing POINT (434730 5337750)
#> 10 1 existing existing POINT (435070 5337910)
The include
parameter determines whether existing
samples should be included in the total count of samples defined by nSamp
. By default, the include
parameter is set as FALSE
.
method = "random
Stratified random sampling with equal probability for all cells (using default algorithm values for mindist
and no use of access
functionality). In essence this method perform the sample_srs
algorithm for each stratum separately to meet the specified sample allocation.
#--- perform stratified sampling random sampling ---#
sample_strat(sraster = sraster, # input sraster
method = "random", #stratified random sampling
nSamp = 200, # desired sample number
plot = TRUE) # plot
#> Simple feature collection with 200 features and 1 field
#> Geometry type: POINT
#> Dimension: XY
#> Bounding box: xmin: 431110 ymin: 5337770 xmax: 438550 ymax: 5343230
#> Projected CRS: UTM Zone 17, Northern Hemisphere
#> First 10 features:
#> strata geometry
#> 1 1 POINT (436490 5338490)
#> 2 1 POINT (436490 5338490)
#> 3 1 POINT (434410 5341850)
#> 4 1 POINT (435590 5338530)
#> 5 1 POINT (435510 5342510)
#> 6 1 POINT (431710 5342590)
#> 7 1 POINT (438090 5339750)
#> 8 1 POINT (435370 5342230)
#> 9 1 POINT (437510 5339070)
#> 10 1 POINT (436970 5338050)
sample_nc
sample_nc()
function implements the Nearest Centroid sampling algorithm described in Melville & Stone (2016). The algorithm uses kmeans clustering where the number of clusters (centroids) is equal to the desired number of samples (nSamp
). Cluster centers are located, which then prompts the nearest neighbour mraster
pixel for each cluster to be located (assuming default k
parameter). These nearest neighbours are the output samples. Basic usage is as follows:
#--- perform simple random sampling ---#
sample_nc(mraster = mraster, # input
nSamp = 25, # desired sample number
plot = TRUE)
#> K-means being performed on 3 layers with 25 centers.
#> Simple feature collection with 25 features and 4 fields
#> Geometry type: POINT
#> Dimension: XY
#> Bounding box: xmin: 431170 ymin: 5338170 xmax: 438510 ymax: 5343210
#> CRS: +proj=utm +zone=17 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs
#> First 10 features:
#> zq90 pzabove2 zsd kcenter geometry
#> 744 23.50 90.5 6.75 1 POINT (438510 5343210)
#> 1294 20.20 76.4 6.10 2 POINT (434590 5343170)
#> 36188 13.00 73.5 3.33 3 POINT (431230 5341290)
#> 32048 26.50 87.4 8.24 4 POINT (437950 5341530)
#> 10960 6.77 14.5 1.73 5 POINT (433950 5342650)
#> 94656 6.27 61.8 1.45 6 POINT (436830 5338170)
#> 60884 21.20 91.4 5.53 7 POINT (432790 5339970)
#> 16935 15.50 94.8 2.71 8 POINT (434090 5342330)
#> 15918 17.20 59.2 5.02 9 POINT (436130 5342390)
#> 89846 8.71 41.9 2.37 10 POINT (437610 5338430)
Altering the k
parameter leads to a multiplicative increase in output samples where total output samples = \(`nSamp` * `k`\).
#--- perform simple random sampling ---#
samples <- sample_nc(mraster = mraster, # input
k = 2, # number of nearest neighbours to take for each kmeans center
nSamp = 25, # desired sample number
plot = TRUE)
#> K-means being performed on 3 layers with 25 centers.
#--- total samples = nSamp * k (25 * 2) = 50 ---#
nrow(samples)
#> [1] 50
Visualizing what the kmeans centers and samples nearest neighbours looks like is possible when using details = TRUE
. The $kplot
output provides a quick visualization of where the centers are based on a scatter plot of the first 2 layers in mraster
. Notice that the centers are well distributed in covariate space and chosen samples are the closest pixels to each center (nearest neighbours).
#--- perform simple random sampling with details ---#
details <- sample_nc(mraster = mraster, # input
nSamp = 25, # desired sample number
details = TRUE)
#> K-means being performed on 3 layers with 25 centers.
#--- plot ggplot output ---#
details$kplot
sample_clhs
sample_clhs()
function implements conditioned Latin hypercube (clhs) sampling methodology from the clhs
package. A number of other functions in the sgsR
package help to provide guidance on clhs sampling including calculate_pop()
and calculate_lhsOpt()
. Check out these functions to better understand how sample numbers could be optimized.
The syntax for this function is similar to others shown above, although parameters like iter
, which define the number of iterations within the Metropolis-Hastings process are important to consider. In these examples we use a low iter
value because it takes less time to run. Default values for iter
within the clhs
package are 10,000.
sample_clhs(mraster = mraster, # input
nSamp = 200, # desired sample number
plot = TRUE, # plot
iter = 100) # number of iterations
sample_clhs(mraster = mraster, # input
nSamp = 300, # desired sample number
iter = 100, # number of iterations
existing = existing, # existing samples
access = access, # define access road network
buff_inner = 100, # inner buffer - no samples within this distance from road
buff_outer = 300, # outer buffer - no samples further than this distance from road
plot = TRUE) # plot
The cost
parameter defines the mraster
covariate, which is used to constrain the clhs sampling. This could be any number of variables. An example could be the distance a pixel is from road access
(e.g. from calculate_distance()
see example below), terrain slope, the output from calculate_coobs()
, or many others.
#--- cost constrained examples ---#
#--- calculate distance to access layer for each pixel in mr ---#
mr.c <- calculate_distance(raster = mraster, # input
access = access,
plot = TRUE) # define access road network
sample_clhs(mraster = mr.c, # input
nSamp = 250, # desired sample number
iter = 100, # number of iterations
cost = "dist2access", # cost parameter - name defined in calculate_distance()
plot = TRUE) # plot
sample_balanced
The sample_balanced()
algorithm performs a balanced sampling methodology from the stratifyR / SamplingBigData
packages.
sample_balanced(mraster = mraster, # input
nSamp = 200, # desired sample number
plot = TRUE) # plot
#> Simple feature collection with 200 features and 0 fields
#> Geometry type: POINT
#> Dimension: XY
#> Bounding box: xmin: 431190 ymin: 5337730 xmax: 438530 ymax: 5343210
#> CRS: +proj=utm +zone=17 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs
#> First 10 features:
#> geometry
#> 1 POINT (433870 5343210)
#> 2 POINT (433930 5343210)
#> 3 POINT (431490 5343190)
#> 4 POINT (434330 5343170)
#> 5 POINT (436370 5343090)
#> 6 POINT (431950 5343050)
#> 7 POINT (437890 5342990)
#> 8 POINT (437590 5342950)
#> 9 POINT (431810 5342910)
#> 10 POINT (432810 5342890)
sample_balanced(mraster = mraster, # input
nSamp = 100, # desired sample number
algorithm = "lcube", # algorithm type
access = access, # define access road network
buff_inner = 50, # inner buffer - no samples within this distance from road
buff_outer = 200) # outer buffer - no samples further than this distance from road
#> Simple feature collection with 100 features and 0 fields
#> Geometry type: POINT
#> Dimension: XY
#> Bounding box: xmin: 431590 ymin: 5337750 xmax: 438550 ymax: 5343210
#> CRS: +proj=utm +zone=17 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs
#> First 10 features:
#> geometry
#> 1 POINT (435170 5339710)
#> 2 POINT (437910 5339410)
#> 3 POINT (434990 5338690)
#> 4 POINT (436310 5341470)
#> 5 POINT (431970 5337750)
#> 6 POINT (434410 5342370)
#> 7 POINT (431930 5337850)
#> 8 POINT (435210 5343110)
#> 9 POINT (437970 5341450)
#> 10 POINT (435930 5342950)
sample_ahels
The sample_ahels()
function performs the adapted Hypercube Evaluation of a Legacy Sample (ahels) algorithm usingexisting
sample data and an mraster
. New samples are allocated based on quantile ratios between the existing
sample and mraster
covariate dataset.
This algorithm was adapted from that presented in the paper below, which we highly recommend.
Malone BP, Minansy B, Brungard C. 2019. Some methods to improve the utility of conditioned Latin hypercube sampling. PeerJ 7:e6451 DOI 10.7717/peerj.6451
This algorithm:
Determines the quantile distributions of
existing
samples andmraster
covariates.Determines quantiles where there is a disparity between samples and covariates.
Prioritizes sampling within those quantile to improve representation.
To use this function, user must first specify the number of quantiles (nQuant
) followed by either the nSamp
(total number of desired samples to be added) or the threshold
(sampling ratio vs. covariate coverage ratio for quantiles - default is 0.9) parameters. We recommended you setting the threshold
values at or below 0.9.
sample_ahels(mraster = mraster,
existing = existing, # existing samples
plot = TRUE) # plot
#> Simple feature collection with 248 features and 7 fields
#> Geometry type: POINT
#> Dimension: XY
#> Bounding box: xmin: 431110 ymin: 5337730 xmax: 438530 ymax: 5343230
#> CRS: +proj=utm +zone=17 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs
#> First 10 features:
#> type.x zq90 pzabove2 zsd strata type.y rule
#> 1 existing 3.79 31.5 0.75 1 new rule1
#> 2 existing 8.44 30.0 2.22 1 new rule2
#> 3 existing 7.79 82.4 1.65 1 new rule2
#> 4 existing 10.70 69.4 2.79 1 new rule2
#> 5 existing 7.30 85.7 1.65 1 new rule2
#> 6 existing 7.34 24.4 1.95 1 new rule2
#> 7 existing 5.85 54.2 1.29 1 new rule2
#> 8 existing 4.61 17.6 1.61 1 new rule2
#> 9 existing 10.60 74.4 2.37 1 new rule2
#> 10 existing 10.70 86.0 2.33 1 new rule2
#> geometry
#> 1 POINT (435490 5342130)
#> 2 POINT (433730 5340590)
#> 3 POINT (433890 5343010)
#> 4 POINT (434390 5341350)
#> 5 POINT (433190 5343210)
#> 6 POINT (433070 5341890)
#> 7 POINT (434890 5342250)
#> 8 POINT (434330 5341990)
#> 9 POINT (434730 5337750)
#> 10 POINT (435070 5337910)
Notice that no threshold
, nSamp
, or nQuant
were defined. That is because the default setting for threshold = 0.9
and nQuant = 10
.
The first matrix output shows the quantile ratios between the sample and the covariates. A value of 1.0 indicates that samples are represented relative to the quantile coverage. Values > 1.0 indicate over representation of samples, while < 1.0 indicate under representation of samples.
sample_ahels(mraster = mraster,
existing = existing, # existing samples
nQuant = 20, # define 20 quantiles
nSamp = 300) # total samples desired
#> Simple feature collection with 500 features and 7 fields
#> Geometry type: POINT
#> Dimension: XY
#> Bounding box: xmin: 431110 ymin: 5337730 xmax: 438530 ymax: 5343230
#> CRS: +proj=utm +zone=17 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs
#> First 10 features:
#> type.x zq90 pzabove2 zsd strata type.y rule
#> 1 existing 3.79 31.5 0.75 1 new rule1
#> 2 existing 8.44 30.0 2.22 1 new rule2
#> 3 existing 7.79 82.4 1.65 1 new rule2
#> 4 existing 10.70 69.4 2.79 1 new rule2
#> 5 existing 7.30 85.7 1.65 1 new rule2
#> 6 existing 7.34 24.4 1.95 1 new rule2
#> 7 existing 5.85 54.2 1.29 1 new rule2
#> 8 existing 4.61 17.6 1.61 1 new rule2
#> 9 existing 10.60 74.4 2.37 1 new rule2
#> 10 existing 10.70 86.0 2.33 1 new rule2
#> geometry
#> 1 POINT (435490 5342130)
#> 2 POINT (433730 5340590)
#> 3 POINT (433890 5343010)
#> 4 POINT (434390 5341350)
#> 5 POINT (433190 5343210)
#> 6 POINT (433070 5341890)
#> 7 POINT (434890 5342250)
#> 8 POINT (434330 5341990)
#> 9 POINT (434730 5337750)
#> 10 POINT (435070 5337910)
Notice that the total number of samples is 500. This value is the sum of existing samples (200) and number of samples defined by nSamp = 300
.