OefenTT (Practice Exam) Flashcards

1
Q

Explain the basic principle of kriging, including the role of the semivariogram in this interpolation technique and the term ‘regionalized variable’.

A

Answer should include:
* The contribution (or ‘weight’) of observed values used to derive an estimate of the value at the unvisited location is based on the spatial correlation properties of the data set itself, instead of being presumed by some arbitrary function.
* A regionalized variable is a variable that exhibits spatial correlation over ‘small’ distances, while values become uncorrelated over ‘large’ distances. Kriging requires spatial correlation to be present, so kriging requires the variable of interest to be a regionalized variable.
* The semivariogram describes the spatial correlation structure of the regionalized variable: it expresses semivariance as a function of spatial lag, where semivariance is a measure of the average dissimilarity between data values at sample points that are a certain distance (the spatial lag) apart.
* Use of the semivariogram in kriging: it is needed to obtain the interpolation weights. The distance between a data point and the unvisited grid point is used to extract the matching semivariance value from the semivariogram (e.g. use the sketch from the lecture on kriging to illustrate) for calculating the interpolation weight for that data point, as are the distances between the data points themselves, which serve to reduce the interpolation weights of spatially clustered data points.
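To make the ‘average dissimilarity per lag’ idea concrete, below is a minimal numpy sketch of an empirical semivariogram, assuming regularly binned lags; the function and variable names are illustrative, not from the course material.

```python
import numpy as np

def empirical_semivariogram(coords, values, lag_width, n_lags):
    """Semivariance as a function of spatial lag for sample points at
    `coords` (N x 2 array) with observed `values` (length-N array)."""
    # All pairwise distances and squared value differences
    d = np.sqrt(((coords[:, None, :] - coords[None, :, :]) ** 2).sum(axis=-1))
    sqdiff = (values[:, None] - values[None, :]) ** 2
    iu = np.triu_indices(len(values), k=1)          # count each pair once
    d, sqdiff = d[iu], sqdiff[iu]

    lags, gamma = [], []
    for k in range(n_lags):
        in_bin = (d >= k * lag_width) & (d < (k + 1) * lag_width)
        if in_bin.any():
            lags.append(d[in_bin].mean())
            # semivariance = half the mean squared difference within the bin
            gamma.append(0.5 * sqdiff[in_bin].mean())
    return np.array(lags), np.array(gamma)
```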

2
Q

Kriging and ‘inverse distance weighting’ are two techniques that can be used for interpolating between spatially distributed data. To estimate the value at an unvisited location these techniques provide weights to surrounding data points in different ways. Which are two essential differences between these approaches?

A

Name 2 of the 3 points below:
1. The way in which weights are assigned to observed values to obtain a value at an unvisited location: kriging uses interpolation weights that are based on the spatial correlation in the data set, as expressed in the semivariogram, while IDW uses inverse distance to determine the interpolation weights (1/D^p, with a user-picked value for the power p). Distance (D) refers to the distance between the location of an observed value and the unvisited location where a value has to be derived through interpolation.
2. Kriging automatically corrects for spatial clustering in the observation points by reducing the weights of observed values that are spatially clustered, whereas IDW has no such correction.
3. In kriging an error map is created (the kriging variance map); in IDW only an error statistic for the interpolated map as a whole can be created.
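To make the contrast in point 1 concrete, here is a minimal sketch of the IDW side, where the only ‘model’ is the user-picked power p; in kriging the weights would instead be solved from a system built on semivariance values read from the semivariogram. Names are illustrative.

```python
import numpy as np

def idw_estimate(coords, values, x0, p=2.0):
    """Inverse-distance-weighted estimate at unvisited location x0,
    from observations `values` at locations `coords` (N x 2 array)."""
    d = np.sqrt(((coords - x0) ** 2).sum(axis=1))
    if np.any(d == 0):              # x0 coincides with a data point
        return values[np.argmin(d)]
    w = 1.0 / d ** p                # weight = 1 / D^p, p chosen by the user
    return np.sum(w * values) / np.sum(w)
```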

3
Q

See Figure 2.1.
What is, according to the two empirical semivariograms, a characteristic property of this data set? What terminology is generally used to refer to this property?

A
  • Property: the maximum distance over which spatial correlation exists (the range) depends on direction.
  • Terminology: this is referred to as geometric anisotropy.
4
Q

When fitting the theoretical models a fundamental error has been made. Which one? Explain your answer.

A

Fundamental error: the nugget value is not the same for the two directional semivariograms.
* Explanation: this is an error because the nugget has no directional component, as the nugget is the semivariance at spatial lag = 0. It is a property reflecting measurement error (such as instrument accuracy, or small-scale variability of the variable that cannot be captured by the measurement method), so the nugget should have the same value for both directional semivariograms.

5
Q

What can be learned from the ‘kriging variance’ map?

A
  • Kriging variance: a measure of the uncertainty in the estimate of an interpolated value.
  • Kriging variance map: shows the spatial variation in this interpolation uncertainty.
6
Q

Does the ‘kriging variance’ map also apply to the final copper concentration map or to the interpolated residuals only? Explain your answer

A

Answer should include:
* The kriging variance only applies to the interpolated residuals.
* The trend surface itself also has an uncertainty associated with it. An interpolation error map for the final copper concentration map (trend + residual) should include both uncertainties.

7
Q

Explain the basic principle of spectral analysis. In your answer include the type of data set that can be analysed by this technique and the typical properties of the data set that are extracted by this technique

A

Answer should include:
* Type of data set: interval or ratio scale of measurement; a time series (or spatial series) with evenly spaced measurements (a constant time interval between measurements); no trend over time (or over space, in the case of a spatial series).
* Basic principle (explained below for a time series, but it can also be a spatial series):
o Transform the time series from the time domain to the frequency domain: translate the time series into a summation of multiple sinusoidal functions with different frequency, amplitude and phase. This is achieved by a Fourier transform of the time series (see the sketch below).
o The result is presented in a power spectrum (spectral density spectrum), which shows the matching variance or energy density for each frequency.
* Typical property: reveals the presence of periodicity in the time series, at which frequencies (or wave periods) this occurs, and how dominantly present these periodicities are (the height of the spectral peak).
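A minimal sketch of the transform step described above, assuming an evenly spaced, detrended series and numpy's FFT. Normalisation conventions differ between texts; the one used here (the spectrum integrates to the series variance) is an assumption.

```python
import numpy as np

def power_spectrum(x, dt):
    """One-sided power spectrum of an evenly spaced time series x with
    sampling interval dt (series assumed detrended, as required above)."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()                   # remove the mean
    n = len(x)
    X = np.fft.rfft(x)                 # Fourier transform (time -> frequency)
    freqs = np.fft.rfftfreq(n, d=dt)   # 0 ... Nyquist = 1/(2*dt)
    # One common normalisation: the spectrum integrates to the series variance
    psd = (np.abs(X) ** 2) * 2.0 * dt / n
    psd[0] /= 2.0                      # DC component is not doubled
    if n % 2 == 0:
        psd[-1] /= 2.0                 # nor is the Nyquist component
    return freqs, psd
```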

8
Q

Explain the meaning of the following two terms:
· ‘Aliasing’ (clarify your explanation using an example sketch)
· ‘Spectral leakage’

A

‘Aliasing’ (clarify your explanation using an example sketch):
Answer should include:
- Aliasing is an artefact in the spectrum: a spectral peak shows up at a frequency where it is not present in reality.
- In reality the peak is present at a frequency beyond the Nyquist frequency.
- The artefact occurs due to a sampling interval that is too coarse compared to the highest frequency occurring in reality.
- Sketch: see the lecture slides. An apparent long wave emerges from the measurements (longer than the true wave period) when sampling with a time step that is longer than the wave period of the oscillation (see also the numerical illustration below).

‘Spectral leakage’:
Answer should include:
- Spectral leakage is an artefact resulting from the estimation procedure of the spectrum.
- It results from Fourier transforming a finite-length time series:
o The Fourier transform of a finite series also involves the Fourier transform of a rectangular window, resulting in a spectrum that is a combination of the true spectrum and the spectrum of the rectangular window (a convolution of the two spectra), such that energy appears at frequencies neighbouring the true frequency (so-called ‘side lobes’; see the lecture slides).
o The variance of the time series can only be mapped to frequencies that are multiples of 1/T, where T is the time series length, so variance present at a frequency that is not a multiple of 1/T will appear at neighbouring frequencies.
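The aliasing sketch can also be made numerically. Below, a hypothetical 0.8 Hz wave is sampled at 1 Hz (Nyquist = 0.5 Hz), so it folds back to an apparent |1.0 − 0.8| = 0.2 Hz oscillation; all numbers are illustrative.

```python
import numpy as np

f_true, f_s = 0.8, 1.0                 # true wave frequency, sampling frequency
t_coarse = np.arange(0, 20, 1 / f_s)   # too-coarse sampling (dt = 1 s)
x_coarse = np.sin(2 * np.pi * f_true * t_coarse)

t_fine = np.arange(0, 20, 0.01)        # dense sampling of the same signal
x_fine = np.sin(2 * np.pi * f_true * t_fine)
# Plotting x_coarse against t_coarse shows an apparent slow 0.2 Hz wave,
# while x_fine shows the true 0.8 Hz oscillation.
```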

9
Q

When estimating a power spectrum from a time series one may apply ‘bin averaging’ as part of the estimation procedure.
· What is ‘bin averaging’ ?
· What is the reason for applying ‘bin averaging’?
· What is a disadvantage of applying ‘bin averaging’?

A

What is ‘bin averaging’?
Answer should include:
- Bin averaging is part of the calculation procedure of a power spectrum: take the average of the spectral estimates in a frequency bin containing several neighbouring frequencies of the raw spectrum (the bin size is chosen by the researcher).
- Use this average value as the best estimate of the spectral value for that frequency bin (see the sketch below).
What is the reason for applying ‘bin averaging’?
- To decrease the uncertainty in the value of the spectral estimate.
What is a disadvantage of applying ‘bin averaging’?
- Loss of spectral resolution.
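A minimal sketch of the bin-averaging step itself, assuming a raw spectrum has already been computed (e.g. with the `power_spectrum` sketch above); names are illustrative.

```python
import numpy as np

def bin_average(freqs, raw_psd, bin_size):
    """Average the raw spectral estimates over bins of `bin_size`
    neighbouring frequencies; one (frequency, estimate) pair per bin.
    Fewer, less uncertain estimates -- at the cost of spectral resolution."""
    n_bins = len(freqs) // bin_size
    f = freqs[:n_bins * bin_size].reshape(n_bins, bin_size).mean(axis=1)
    p = raw_psd[:n_bins * bin_size].reshape(n_bins, bin_size).mean(axis=1)
    return f, p
```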

10
Q

Scientists have measured nearshore wave height along the west coast of France by measuring water level fluctuations (due to the passing waves) at a fixed location. Their measurement device measures the water level elevation at a frequency of 2 Hz (so two measurements per
second). They made their measurements during high tide over a total time span of 100 minutes. The energy density spectrum of this time series can be used to calculate the mean wave height during this time span, because the wave height is related to the amplitude of the
water level fluctuations relative to the mean water level.
What is the Nyquist frequency of the spectrum of the above described time series?
Explain how you derived your answer

A

f_Nyquist = 1 Hz (Note: use an appropriate unit!)
because:
* f_Nyquist = 1/(2Δt) = 1/(2 × 0.5 s) = 1 s⁻¹ = 1 Hz (alternatively: 1/(2Δt) = ½ × 1/Δt = 0.5 × f_sampling = 0.5 × 2 = 1 Hz)
* sampling frequency f_sampling = 2 Hz = 2 measurements/s, so Δt = 0.5 s
(Note on the appropriate unit: Δt = 0.5 s = (0.5/60) min = (0.5/3600) hr etc.; f_Nyquist then has units of s⁻¹ (= Hz), min⁻¹, hr⁻¹, etc.)
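The same arithmetic as a runnable snippet (values from the question):

```python
f_sampling = 2.0               # Hz: two measurements per second
dt = 1.0 / f_sampling          # sampling interval = 0.5 s
f_nyquist = 1.0 / (2.0 * dt)   # = 0.5 * f_sampling = 1.0 Hz
```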

11
Q

What is the fundamental frequency of the spectrum of the above described time series?
Explain how you derived your answer.

A

f_fund = 1/100 = 0.01 min⁻¹ (use an appropriate unit!) (or, with T in seconds: f_fund = 1/6000 = 1.67 × 10⁻⁴ Hz)
* Because f_fund = 1/T, where T = total time series length = 100 minutes
(or T = 100/60 hours = 100 × 60 s)
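Again as runnable arithmetic (values from the question):

```python
T_minutes = 100.0
f_fund_per_min = 1.0 / T_minutes      # 0.01 min^-1
f_fund_hz = 1.0 / (T_minutes * 60.0)  # 1/6000 ≈ 1.67e-4 Hz
```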

12
Q

On which date are the waves the highest at this location? Explain your answer

A

Answer should include:
* Highest waves on May 28,
because:
* the maximum value of the energy density is highest on May 28;
* the energy density value is a measure of the amplitude of the water level fluctuations, and hence of the wave height, at a given frequency.

13
Q

Which other conclusion regarding differences in wave conditions between these two dates can be derived from these two spectra?

A

Answer should include:
* May 26 shows a clearly double-peaked spectrum, while May 28 shows a single broad peak, which means a difference in the distribution of wave height over wave periods.
* On May 26 there are roughly two types of waves/wave fields: one type with wave periods around 10 seconds (roughly 8–11 s) and another type with wave periods of less than about 6 s. On May 28 there is one wave field with a broad range of wave periods of less than 10 seconds.

14
Q

Explain the basic principle of principal component analysis. Use a data set consisting of only two variables to clarify your explanation, including a sketch that shows how these variables relate to the principal components.

A

Answer should include:
* An analysis method for a multivariate data set, which is a data set in which N ‘objects’ are described/characterized by multiple variables (you may also give an example of a multivariate data set: e.g. 50 groundwater wells, where the water in each well is characterized by several variables such as its pH, temperature, phosphate concentration and groundwater depth).
* Map the set of measured variables to a new set of variables that are linear combinations of the original variables; the new variables are the principal components (PCs).
* The mapping of variables to PCs is based on the correlation (or covariance) between the original variables.
* At least the first PC (but generally several PCs) describes more variance than any of the original variables.
* Based on the strength of the contribution of the original variables to a PC (expressed as PC loadings), a PC is generally given a physical meaning (e.g. water pollution of urban origin).
* The projection of the individual samples/objects on the PCs (indicated as PC scores) characterizes the sample/object in terms of this physical meaning (e.g. how polluted a given water sample is in terms of urban pollution).
* Make a sketch like the one used in the lecture slides on PCA: a scatterplot of the 2 original variables, with the PCs (‘new axes’) plotted in the direction of maximum variance of the point cloud (PC1) and in the orthogonal direction of the remaining variance (PC2). Refer to this sketch in a meaningful way from the above points (a computational sketch follows below).
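The two-variable case can also be reproduced numerically. Below is a minimal PCA via the eigendecomposition of the covariance matrix, on synthetic correlated data; all names and numbers are illustrative.

```python
import numpy as np

# Illustrative two-variable data set with correlated variables
rng = np.random.default_rng(0)
var1 = rng.normal(size=200)
var2 = 0.8 * var1 + 0.3 * rng.normal(size=200)
X = np.column_stack([var1, var2])

# PCA = eigendecomposition of the covariance matrix of the centred data
Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(eigvals)[::-1]           # PC1 = direction of max variance
loadings = eigvecs[:, order]                # columns: PC1, PC2
scores = Xc @ loadings                      # projection of samples on the PCs
explained = eigvals[order] / eigvals.sum()  # variance fraction per PC
```

Drawing the columns of `loadings` as arrows on the scatterplot of the two variables reproduces the lecture-style sketch: PC1 along the long axis of the point cloud, PC2 orthogonal to it.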

15
Q

Give two reasons for applying principal component analysis on a data set.

A
  • Explore structure in a large data set
  • Data reduction
16
Q

Figure 4.2 shows the loadings of each of the 18 variables on the first and second principal component respectively. The physical interpretation of these two principal components is the
following:
· Principal component 1 (PC1) is an indicator for the level of pollution of the lakes by urban waste water. A positive score implies an above average pollution by urban waste water, while a negative score implies a below average amount of pollution.
· Principal component 2 (PC2) is an indicator of the prevailing type of lake vegetation, as well as the abundance of it. Indirectly it is a measure for eutrophication of the lake. A positive score means colonisation by phytoplankton; the more positive the score, the more abundantly present. A negative score means colonisation by macrophytes (e.g.
floating water plants like water lilies); the more negative the score, the more abundantly present.
Figure 4.3 shows the scores on the first 2 principal components of all water samples taken from Lake 1, 5 and 10 during a 19 month period.

What can be said about the water quality in the 3 lakes according to Figure 4.3? Think of the differences in water quality between the lakes and the variability occurring within each lake. Clearly indicate how you derive your conclusions from Figure 4.3.

A

* Based on PC1, Lake 1 is the cleanest with respect to urban waste water pollution and Lake 5 the most polluted, because the PC1 scores for Lake 5 are all positive and for Lake 1 all negative. Lake 10 on average has an average level of pollution by urban waste water (its PC1 scores centre around zero), but individual samples can be as polluted as Lake 5 or as clean as Lake 1. The mentioned average level of pollution refers to the average over the water samples from all lakes together.
* Based on PC2, Lake 10 always has abundant macrophyte vegetation (negative scores on PC2), while Lake 1 and Lake 5 have phytoplankton (= algae), where Lake 5 on average has the largest abundance of phytoplankton (largest positive scores on PC2), although Lake 5 also experiences almost algae-free conditions (some samples have a near-zero score on PC2), as also occurs more frequently in Lake 1.
* The overall water quality can be characterized by combining the level of urban waste water pollution (PC1 scores) with the eutrophication level (PC2 scores), as indicated by the prevailing vegetation type (macrophytes or algae, where algae abundance relates to high eutrophication levels).
* Lake 5 has the worst water quality: the highest urban waste water pollution plus the highest eutrophication level, as expressed by its high algae abundance. The lake with the best water quality is undecided: Lake 10 scores best in terms of eutrophication but most of the time shows a higher level of urban waste water pollution than Lake 1.

17
Q

The above presented principal component analysis is based on standardized data (i.e. for each variable first the mean value has been subtracted and next the residual values are divided by the standard deviation of that variable).
d) What could have been the reason for using standardized data in this analysis?

A

Answer should include:
* By standardizing, there is no longer an effect of the unit of measurement on the resulting PCs: all variables now have the same contribution to the total (normalized) variance, and thus only the mutual correlations determine the contributions to the different PCs (see the sketch below).
* It is undesirable to let the unit of measurement have an effect on the PCs when the variables represent very different properties, such as concentrations of various chemicals, but also acidity (pH) and water temperature, whose units are incomparable and whose numeric values can cover very different ranges.
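The standardization itself is a one-line operation per step; a minimal sketch (note that PCA on data standardized this way is equivalent to PCA on the correlation matrix):

```python
import numpy as np

def standardize(X):
    """Give every column (variable) zero mean and unit variance,
    removing the effect of the unit of measurement before PCA."""
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
```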

18
Q

Explain the basic principle of autocorrelation. In your answer include the type of data set that can be analysed by this technique and the typical properties of the data set that are extracted by this technique.

A

Answer should include:
* Type of data set:
o Ratio or interval measurement scale
o Time series, or spatial series (NB: below, spatial series/space can also be read for time series/time)
o Equally spaced measurements/constant sampling interval
* Basic principle:
o Shift the time series relative to itself over increasingly larger time spans (the step size is the sampling interval), and calculate the correlation between the data points in the overlapping part (see the sketch below).
o Plotting these correlation values as a function of the time shift (time lag) gives the autocorrelogram.
* Typical properties the autocorrelogram reveals:
o Self-similarity (certain patterns in the time series repeat themselves over time)
o Persistence (values change ‘slowly’ over time, i.e. subsequent measurements exhibit, on average, similarity in value)
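A minimal sketch of the shift-and-correlate procedure, using the Pearson correlation of the overlapping parts (one common definition; other normalisations exist):

```python
import numpy as np

def autocorrelogram(x, max_lag):
    """Correlation of a series with itself, shifted over 0..max_lag steps
    (max_lag assumed much smaller than the series length)."""
    x = np.asarray(x, dtype=float)
    r = [1.0]                              # r = 1 at lag 0 by definition
    for lag in range(1, max_lag + 1):
        # correlate the overlapping parts of the original and shifted series
        r.append(np.corrcoef(x[:-lag], x[lag:])[0, 1])
    return np.array(r)
```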

19
Q

Draw the matching correlograms of the following time series and explain for each of them why they look that way:
· time series consisting of random noise
· time series exhibiting ‘short-term’ correlation
· time series consisting of a periodic signal with random noise

A

Drawings: see your results from the practical on autocorrelation; do not forget to add axis labels (r and time lag) and to put units on the time-lag axis! Note: in all cases r = 1 at lag = 0. (A synthetic illustration of the three cases follows below.)
1. Explanation: r values drop to close to 0 at lag > 0, because random means no correlation between the values (r = 0); the values remain non-zero (but close to zero) because of sampling error due to the finite length of the time series.
2. Explanation: r values drop off more slowly (compared to the random case) to close to zero at lag > 0; only after several lags does r stay close to 0.
o The slow drop-off occurs because short-term correlation implies that subsequent measurements have more similar values than measurements that have a long time interval between them.
o At larger lags r is near zero (fluctuating around 0) because correlation between the values no longer exists (r = 0); the non-zero values are the result of sampling error (see 1).
3. Explanation:
o Periodic fluctuation in the autocorrelation (a sinusoidal fluctuation when the time series is sinusoidal), with maxima in correlation occurring at multiples of the time lag that equals the periodicity T of the time series, because after a shift of the time series over n times the period T (n = 1, 2, 3, etc.) the time series is quite similar to itself; only the noise component differs (see next point).
o The maxima in the autocorrelation at time lags that are multiples of the periodicity are smaller than 1, because the presence of random noise reduces the correlation between the two time-shifted signals (only at lag = 0 does the time series (auto)correlate perfectly, so r = 1).
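The three cases can be generated synthetically and fed to the `autocorrelogram` sketch above; all parameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
t = np.arange(n)

noise = rng.normal(size=n)                 # case 1: random noise
ar1 = np.zeros(n)                          # case 2: short-term correlation
for i in range(1, n):                      #   (an AR(1) process)
    ar1[i] = 0.8 * ar1[i - 1] + rng.normal()
periodic = np.sin(2 * np.pi * t / 50) + 0.5 * rng.normal(size=n)  # case 3

# Expected correlograms (cf. the explanations above):
#   noise:    r ≈ 0 for all lags > 0
#   ar1:      r decays gradually to ≈ 0 over several lags
#   periodic: r oscillates, with maxima < 1 at lags 50, 100, ...
```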

20
Q

Which equation should be applied for testing for randomness (see description above) for the correlogram shown in Figure 5.1? Explain your answer.

A

Equation 2 should be used,
because:
* Fig. 5.1 shows that large autocorrelation exists at small time lags. Because of this persistence, randomly occurring extreme values in the time series will persist for a while, leading to apparent periodic fluctuations in a finite-length time series. This produces an increase in correlation at large lags that is not real, so it should fall inside the confidence interval.
* The extra term in brackets in eq. 2 (compared to eq. 1) corrects for this effect of increased autocorrelation at large lags in an empirical way (the ‘large-lag standard error’); see also the sketch below.
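The exam's eq. 1 and eq. 2 are not reproduced in these cards, so the sketch below assumes the common textbook forms: eq. 1 as the white-noise bound ±1.96/√N, and eq. 2 as the large-lag standard error ±1.96·√((1 + 2·Σ rᵢ²)/N), summing the squared autocorrelations at lags below the one being tested. Treat it as an illustration of the correction, not as the exam's exact formulas.

```python
import numpy as np

def conf_intervals(r, n, z=1.96):
    """95% bounds for an autocorrelogram r (lags 0..K) of an n-point series.
    ci1: white-noise assumption (assumed form of eq. 1).
    ci2: large-lag standard error (assumed form of eq. 2), which widens
    with the autocorrelation already accumulated at smaller lags."""
    r = np.asarray(r, dtype=float)
    K = len(r) - 1
    ci1 = np.full(K + 1, z / np.sqrt(n))
    ci2 = np.empty(K + 1)
    for k in range(K + 1):
        s = np.sum(r[1:k] ** 2)           # squared r at lags 1 .. k-1
        ci2[k] = z * np.sqrt((1.0 + 2.0 * s) / n)
    return ci1, ci2
```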

21
Q

What can be concluded on the importance of an annual cycle for variable X, based on the correlogram shown in Figure 5.1, in case:
i) the confidence interval according to eq.1 is correct. Explain your answer.
ii) the confidence interval according to eq.2 is correct. Explain your answer.

A

Answer should include:
* An annual cycle will be visible through an increase in autocorrelation around lag = 1 year (= 365 days).
* Assuming the presence of an annual cycle was an a priori hypothesis: if the autocorrelation value at lag = 1 year is OUTSIDE the confidence interval, then a statistically significant (at alpha = 0.05) annual cycle is present.
* Case i: the autocorrelation at lag = 1 year is outside the confidence interval, so it should be concluded that a statistically significant annual cycle is present; however, the importance of the annual cycle compared to other fluctuations in the time series is small, because r ≈ 0.1 at lag = 1 year.
* Case ii: the autocorrelation at lag = 1 year is inside the confidence interval, so there is no statistically significant evidence of an annual cycle.
NOTE: In case you assumed the annual cycle to be an a posteriori hypothesis, the Case ii conclusion does not change (no statistical evidence for an annual cycle). For Case i: given that approximately 390 days × 8 (observations/day) ≈ 3100 time lags are shown in the correlogram, the drawn 95% confidence interval for a priori hypothesis testing effectively reduces to a 0.95^3100 ≈ 0% confidence interval for a posteriori hypothesis testing (i.e. the probability that the autocorrelation values fall inside the interval for all lags is virtually zero, so it is almost 100% certain that at least one value of the correlogram (lag > 0) will be outside this interval). Hence it is meaningless when an autocorrelation value falls outside that confidence interval, and no conclusion can be drawn regarding the statistically significant presence of an annual cycle.
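The NOTE's probability argument as runnable arithmetic (numbers from the text above):

```python
p_all_inside = 0.95 ** 3100  # probability that all ~3100 lags stay inside the 95% c.i.
print(p_all_inside)          # ≈ 9e-70, effectively zero
```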