The literature on the correct and spurious estimation of the correlation
dimension is huge and this is certainly not the place to repeat all the
arguments. The relevant caveats and misconceptions are reviewed for example in
Refs. [75, 11, 76, 2]. The most prominent precaution
is to exclude temporally correlated points from the pair counting by the so
called Theiler window w [75]. For the correlation sum to be a consistent
estimator of the correlation integral (from which the dimension is
derived), it should cover a random sample of points drawn
independently according to the invariant measure on the attractor. Successive
elements of a time series are usually not independent. In particular, for highly
sampled flow data, subsequent delay vectors are highly correlated. Theiler
suggested removing this spurious effect by simply ignoring all pairs of points
in Eq. () whose time indices differ by less than w, where w
should be chosen generously. With O(N²) pairs available, the loss of O(wN)
pairs is not dramatic as long as w ≪ N. At the very least, pairs with j = k
have to be excluded [77], since otherwise the strong bias toward
D₂ = 0, the mathematically correct value for a finite set of points, reduces
the scaling range drastically. Choosing w as the first zero of the
auto-correlation function, or even as its decay time, is generally not
sufficient, since these times reflect only overall linear
correlations [75, 76]. The space-time separation plot
(Sec. ) provides a good means of determining a sufficient value
for w, as discussed for example in [41, 2]. In some cases,
notably processes with inverse power law spectra, inspection requires w to be
of the order of the length of the time series. This indicates that the data
does not sample an invariant attractor sufficiently and the estimation of
invariants like D₂ or Lyapunov exponents should be abandoned.
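The pair counting with a Theiler window can be sketched as follows. This is a deliberately naive, brute-force version in the spirit of c2naive; the function name, the use of the maximum norm, and the NumPy formulation are illustrative assumptions, not the actual TISEAN code:

```python
import numpy as np

def correlation_sum(x, m, tau, w, eps):
    """Naive O(N^2) correlation sum with Theiler window w.

    Pairs of delay vectors whose time indices differ by less than w
    are excluded from the count, so temporally correlated points do
    not bias the estimate. Illustrative sketch only.
    """
    N = len(x) - (m - 1) * tau
    # delay embedding: row j is (x[j], x[j+tau], ..., x[j+(m-1)tau])
    emb = np.array([x[j : j + (m - 1) * tau + 1 : tau] for j in range(N)])
    close = 0
    total = 0
    for j in range(N):
        for k in range(j + w, N):  # enforce k - j >= w (w = 1 excludes only j = k)
            total += 1
            if np.max(np.abs(emb[j] - emb[k])) < eps:  # maximum norm
                close += 1
    return close / total
```

With w = 1 this reduces to excluding only the self-pairs j = k; a generous w discards the temporally correlated pairs discussed above at a cost of only O(wN) out of O(N²) pairs.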
Parameters in the routines d2 and
c2naive are, as usual, the
embedding parameters m and τ, the embedding dimension and the time
delay, as well as the Theiler window w.
Fast implementations of the correlation sum have been proposed by several
authors. At small length scales, the computation of pairs can be done in
O(NlogN) or even O(N) time rather than
O(N²) without losing any of
the precious pairs, see Ref. [20]. However, for intermediate size data
sets we also need the correlation sum at intermediate length scales where
neighbor searching becomes expensive. Many authors have tried to limit the use
of computational resources by restricting one of the sums in
Eq. () to a fraction of the available points. By this practice,
however, one loses valuable statistics at the small length scales, where points
are so scarce anyway that all pairs are needed for stable
results. In [62], both approaches were combined for the first time by
using fast neighbor search at the small length scales and by restricting
the sum at the larger length scales. The TISEAN implementation
d2
goes one step further and selects the range for the sums individually for each
length scale to be processed. This turns out to give a major improvement in
speed. The user can specify a desired number of pairs that seems large enough
for a stable estimation of the correlation sum; typically 1000 pairs will
suffice. The sums are then extended to a range which guarantees that number of
pairs, or, if this cannot be achieved, to the whole time series. At the largest
length scales, this range may be rather small and the user may choose to give a
minimal number of reference points to ensure a representative average.
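The per-length-scale strategy can be sketched roughly as follows. All names, the stopping rule, and the data layout are assumptions for illustration; the actual d2 program is organized quite differently and far more efficiently:

```python
import numpy as np

def c2_adaptive(emb, w, epsilons, min_pairs=1000, min_refs=100):
    """Sketch of length-scale-adaptive pair counting.

    For each length scale eps, reference points are added until
    `min_pairs` pairs have been counted (or the whole series is
    exhausted); at large length scales, where few reference points
    already yield enough pairs, `min_refs` reference points are
    still required so the average stays representative.
    """
    N = len(emb)
    result = {}
    for eps in sorted(epsilons, reverse=True):
        found = 0
        total = 0
        for j in range(N):
            # Theiler window: only partners at least w steps away in time
            dists = np.max(np.abs(emb[j + w:] - emb[j]), axis=1)
            found += int(np.sum(dists < eps))
            total += len(dists)
            if found >= min_pairs and j + 1 >= min_refs:
                break
        result[eps] = found / total if total else 0.0
    return result
```

At the largest scales the loop stops after `min_refs` reference points; at the smallest scales it typically runs over the whole series, so no pairs are wasted where they are scarce.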
In the program
, rather than restricting the range of the
sums, only a randomly selected subset of pairs is used. The randomization,
however, requires a sophisticated program structure to avoid
overhead in computation time.
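A minimal sketch of such a randomized estimate is given below; it simply draws random index pairs, which incurs exactly the kind of per-pair overhead the actual program is structured to avoid. Function name and parameters are assumptions for illustration:

```python
import numpy as np

def c2_random_pairs(emb, w, eps, n_pairs, seed=0):
    """Monte Carlo estimate of the correlation sum from a random
    subset of pairs. Illustrative only: drawing pairs one at a time
    is simple but wasteful compared to a carefully structured scheme.
    """
    rng = np.random.default_rng(seed)
    N = len(emb)
    close = 0
    drawn = 0
    while drawn < n_pairs:
        j, k = rng.integers(0, N, size=2)
        if abs(j - k) < w:  # respect the Theiler window
            continue
        drawn += 1
        if np.max(np.abs(emb[j] - emb[k])) < eps:  # maximum norm
            close += 1
    return close / drawn
```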