General | Statistics: Model | Stimulus | Arc | Reliabilities

The JUMBL's test analysis engine produces several analysis results; this page explains what those results signify and how they may be used.  Note that which statistics are important and useful to you depends very much on the problem you are trying to solve.  This page is only an overview.

The JUMBL may have more than one analytical engine.  Different analytical engines will produce different results.  Some analytical engines may use approximations to obtain their results, and these will be slightly different from the ``analytical'' solutions of other engines.  At the time of writing, the JUMBL contains only one analytical engine.

There are four kinds of statistics presented in the analysis report: model statistics, stimulus statistics, arc statistics, and reliabilities.

Of these, the statistics which are potentially of the greatest use are those which relate to the overall reliabilities and the stimulus reliabilities.


General Notes

Expectation

The mathematical expectation E(X) (also called the mean) is the average value of a random variable X across many trials.

Variance

Every expectation E(X) has an associated variance Var(X).  The variance can be interpreted using Chebyshev's Inequality.  The probability that the observed value of random variable X differs from the expected value E(X) by at least k (for any positive k) is bounded above by Var(X)/k^2.  In other words:

P[|X - E(X)| >= k] <= Var(X) / k^2

Alternately, you may think of the variance in this way.  At least 95% of the time, the observed value will be within SQRT[Var(X) / 0.05] of the expectation.  (SQRT[x] is the square root of x.)

Often the square root of the variance is reported; this is called the standard deviation.  The Chebyshev Inequality does not assume any distribution for X.  If the distribution of a random variable is known, one can often do much better than the Chebyshev Inequality.  For example, for a normally-distributed random variable, approximately 68% of the time the observed value is within one standard deviation of the mean, and 95% of the time the observed value is within two standard deviations of the mean.
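The Chebyshev bound and the 95% interval above can be sketched as follows.  This is an illustrative example only; the function names are not part of the JUMBL.

```python
import math

def chebyshev_bound(variance, k):
    """Upper bound on P[|X - E(X)| >= k], valid for any distribution."""
    return variance / (k * k)

def chebyshev_interval(variance, tail=0.05):
    """Half-width k such that X falls within E(X) +/- k at least
    (1 - tail) of the time, by Chebyshev's Inequality."""
    return math.sqrt(variance / tail)

var = 0.01
print(chebyshev_bound(var, 0.5))      # bound on deviating by >= 0.5
print(chebyshev_interval(var, 0.05))  # half-width of a 95% interval
```

Note that these bounds are usually quite loose; as the text observes, knowing the actual distribution (e.g., normality) gives much tighter intervals.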

Miller Reliability

The reliability and its variance are computed according to the Miller reliability model described in K.W. Miller, et al., ``Estimating the Probability of Failure When Testing Reveals No Failures,'' IEEE Transactions on Software Engineering, v. 18, pp. 33-42, January 1992.  Despite the name, the model may be used to determine the probability of failure when testing does, in fact, reveal failures.

The Miller model is based on Bayesian statistics and allows exploitation of prior information about system reliability (in this case, ``prior'' refers to information, either conjectural, analytical, or empirical, about the reliability of the system).  This prior information is given in the form of two priors: a prior number of failures and a prior number of successes.  For the case of no prior information about software reliability, both these priors are set to one.

Letting the prior failures and successes be denoted a and b, respectively, and letting the observed number of successful and unsuccessful tests in the current experiment be denoted s and f, respectively, one may compute the expected reliability using the following equation:

E[R] = (s+b) / (f+s+a+b)

Since this is an expectation, there is an associated variance.  The variance may be computed as:

Var[R] = E[R] * (f+a) / (f+s+a+b) / (f+s+a+b+1)

Using the variance, one may compute a confidence interval around the expectation.  This may be done, for instance, with the Chebyshev Inequality.
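The two equations above can be sketched directly in code.  This is a minimal illustration of the Miller formulas as stated here, not the JUMBL's implementation; the function name is hypothetical.

```python
def miller_reliability(s, f, a=1.0, b=1.0):
    """Expected reliability E[R] and variance Var[R] under the Miller model.

    s, f -- observed successful and unsuccessful tests
    a, b -- prior failures and prior successes (both 1 with no prior info)
    """
    n = f + s + a + b
    e_r = (s + b) / n                      # E[R] = (s+b) / (f+s+a+b)
    var_r = e_r * (f + a) / n / (n + 1)    # Var[R] per the equation above
    return e_r, var_r

# Example: 50 successful tests, no failures, uninformative priors.
e_r, var_r = miller_reliability(s=50, f=0)
```

With no prior information (a = b = 1) and no failures, E[R] reduces to (s+1)/(s+2), which approaches 1 as the number of successful tests grows.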

Generation and Execution Counts

Generation and execution counts may differ due to failures. For example, a failure may prevent running some portion of a test case. The remaining items in the test case were generated (and appear in generation counts) but not executed. Both counts are computed and presented in the report.

``Optimum'' Results

Certain results, such as reliabilities and the Kullback, can be computed in two ways: using the testing record of execution and failure counts, or using the generation counts.

Prior to running any tests, one may wish to compute these results using the generation counts only, to get a sense of the best one could do (the ``optimum'' values for reliability and Kullback). Once testing has been completed, one can then look at the testing record and compute the experiential values.

Both statistics are reported, with the optimum values clearly labelled. These represent what would be observed if no failure occurred in testing.


Model Statistics

Model Name

This is the name of the model to which the results correspond.

Distribution

Every model may have one or more distributions.  Distributions are distinguished by keys, and designate a collection of constraints which are to be taken together.  For an example of a model with multiple distributions, see the NAS TML model.

This collection of constraints may be viewed as probabilities on arcs, or as costs associated with arcs, etc.  The special default distribution has no key.  The following example shows two TML constraints; the first is part of the default distribution, while the second is part of the ``special'' distribution.

($ 1/2 $)
special:($ 2/3 $)

Every analysis is for one and only one distribution; the distribution used is indicated in the analysis results page.

Node Count

The number of nodes n in the model is reported.  Many computations are O(n^2) or O(n^3), so this number is significant.

Arc Count

The number of arcs in the model is reported.  Note that there may be many arcs between two nodes, so this number is in some sense independent of the number of nodes.

Stimulus Count

Each arc must be labeled with a stimulus, and a single stimulus may label more than one arc.  This is the number of distinct stimuli in the model.  Each arc from a given node must have a distinct stimulus, so the number of arcs in the model is bounded above by the number of stimuli multiplied by the number of nodes.

Test Cases Recorded

The number of test cases applied to the model.  This number may include multiple applications of the same test case.

Nodes Generated
Nodes Executed

This is the unreduced fraction of the nodes generated / executed in applied test cases, and the percentage of all nodes generated / executed.

Arcs Generated
Arcs Executed

This is the unreduced fraction of the arcs generated / executed in applied test cases, and the percentage of all arcs generated / executed.

Stimuli Generated
Stimuli Executed

This is the unreduced fraction of the stimuli generated / executed in applied test cases, and the percentage of all stimuli generated / executed.


Stimulus Statistics

Generated / Executed / Failed

The total number of times this stimulus has been generated / has been executed / has failed in applied test cases.

Reliability and Variance of Specified Stimulus
Optimum Reliability and Variance of Specified Stimulus

The predicted operational reliability and its variance, computed according to the Miller model.  The reliability is the probability that a randomly-selected instance of the specified stimulus will succeed.

Prior Successes and Failures

The Bayesian prior information for the Miller model.  This is the number of prior successes and failures of the specified stimulus.  This information is computed from the priors for all arcs.


Arc Statistics

Probability

The transition probability for the specified arc.

Generated / Executed / Failed

The number of times this arc has been generated / has been executed / has failed in the applied test cases.

Reliability and Variance of Specified Arc
Optimum Reliability and Variance of Specified Arc

The predicted operational reliability and its variance, computed according to the Miller model.  The reliability is the probability that a randomly-selected instance of the specified arc will succeed.

Prior Successes and Failures

The Bayesian prior information for the Miller model.  This is the number of prior successes and failures of the specified arc.


Reliabilities

Single Event Reliability and Variance
Single Event Optimum Reliability and Variance

The predicted probability that a randomly-selected stimulus will execute without failure in the long run, and its variance. This is computed using the Miller model.

Single Use Reliability and Variance
Single Use Optimum Reliability and Variance

The predicted probability that a randomly-selected use of the software will execute without failure, and its variance, are computed and presented.  This is the ``usual'' reliability, and will typically be less than the single event reliability (it may be close to the single event reliability to the power of the expected test case length). This is computed using the Miller model.

Arc Source Entropy

The average, per-step uncertainty in bits.  This can be interpreted as the minimum average number of yes / no questions which must be asked to determine which path is taken.  For example, if there are eight choices then one can determine which choice is made by asking three yes / no questions (splitting the number of choices in half each time).  For n equally likely choices, one must ask CEIL(log2(n)) questions to determine which choice is made.  (CEIL(x) is the smallest integer greater than or equal to x.)

When choices are not equally likely, the uncertainty is less than for the equally likely case (since some choices must be more likely than others).  The uncertainty of a distribution P = {p1, p2, p3, ..., pn} is given by the following famous equation derived by C. Shannon:

H{P} = -SUM(pi log2(pi))

(SUM(xi) is the sum of all the xi values; SUM(xi) = x1+x2+...+xn.)
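Shannon's equation is short enough to sketch directly.  This is an illustrative helper, not JUMBL code.

```python
import math

def entropy(probs):
    """Shannon entropy H{P} = -sum(p_i * log2(p_i)), in bits.
    Terms with p_i = 0 contribute nothing by convention."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.125] * 8))        # 8 equally likely choices -> 3.0 bits
print(entropy([0.5, 0.25, 0.25]))  # unequal choices -> 1.5 bits
```

The second call illustrates the point above: three unequal choices carry less uncertainty (1.5 bits) than three equally likely ones (about 1.58 bits).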

There are many different probability distributions on a Markov chain and thus many entropies.  One can compute the entropy for each node of the model (using the probability distribution on the outgoing arcs), then take the average of these values, weighted by the long run occupancy for the node.  This weighted average is called the source entropy of the model, and it is useful as a measure of complexity of the model.

There are two choices for the outgoing arc probability distribution.  Traditionally the next node probability distribution (sometimes called the transition probability distribution) has been used--this gives the node source entropy, or the average per-step uncertainty in choosing the next node.  More recently, attention has been focused on the stimuli which label the outgoing arcs.  The probability distribution on the outgoing stimuli can be used instead--this gives the arc source entropy, or the average per-step uncertainty in choosing the next node and stimulus pair.

The arc source entropy is bounded below by the node source entropy.  The two are equal if and only if each stimulus labels an arc to a different node. Both entropies are reported in the model analysis report.
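The occupancy-weighted average described above can be sketched for the next-node (transition) distribution, giving the node source entropy.  The two-node chain here is hypothetical, and the stationary distribution is found by simple power iteration rather than whatever method the JUMBL uses.

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def stationary(P, iters=1000):
    """Long-run occupancy of an ergodic chain, by power iteration."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi

def source_entropy(P):
    """Average per-node entropy, weighted by long-run occupancy."""
    pi = stationary(P)
    return sum(pi[i] * entropy(P[i]) for i in range(len(P)))

# Hypothetical two-node usage model (rows are outgoing distributions):
P = [[0.5, 0.5],
     [0.9, 0.1]]
print(source_entropy(P))
```

The arc source entropy would be computed the same way, but with each row replaced by the probability distribution on (next node, stimulus) pairs rather than next nodes alone.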

Kullback
Optimum Kullback

The Kullback-Leibler (KL) number (sometimes called the ``discriminant'') is an information-theoretic measure of the similarity between a statistical model and a true distribution.  In this case, the statistical model is given by the executed tests, while the ``true'' distribution is assumed to be the generating chain.  Thus the KL measures the similarity between the statistics of the test cases which have been generated, and the statistics of all test cases.

The KL number is nonnegative, and zero only when the two distributions coincide.  Thus as the statistics of the generated sample of tests approach the statistics of all tests, the KL number approaches zero.  The number may therefore be used as a stopping criterion; that is, a criterion to help one determine when testing is sufficient.

The KL is not a metric; it does not satisfy the triangle inequality and is not symmetric.  However, it is a convex function of the ``true'' distribution.

The KL may also be thought of as the relative entropy of one distribution p with respect to another distribution q.  One can think of the KL (using log base two) as the average number of bits that are wasted by encoding events from distribution p with a code based on the distribution q.

The Kullback-Leibler (KL) number is not defined if a model's arcs have not been covered.  It is possible, however, to introduce small perturbations in the visitation counts to allow computation of a number similar to the KL.  This ``perturbed'' Kullback number is asymptotically close to the true KL when the latter is defined.  One may think of this perturbed Kullback as an approximation to the KL, though this is not strictly true, since the perturbed Kullback is defined over a larger domain than the KL.  For this reason, only the perturbed KL is reported.

Relative Kullback
Optimum Relative Kullback

The Kullback-Leibler (KL) number (see Kullback) is defined as the difference between an approximating entropy and the source entropy (see Arc Source Entropy), with the latter value always smaller. As testing experience approaches the distribution given by the usage model, the KL decreases toward zero.

If the source entropy is 1 bit, then a KL of 0.1 bit is small.  If the source entropy is 0.1 bits, then a KL of 0.1 bit is large.  The KL by itself may therefore be misleading unless it is combined with the source entropy, so the relative Kullback is computed and presented.  This is the percentage difference between the testing experience entropy and the true source entropy, and is given by:

100(KL / H)

where H is the arc source entropy. Again, as testing experience approaches the distribution given by the usage model, this number approaches zero.
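The discrete KL number and the relative Kullback formula above can be sketched as follows.  The distributions and the function names here are illustrative only; in the JUMBL these quantities are derived from the testing record and the usage model, not passed in directly.

```python
import math

def kullback(p, q):
    """KL number D(p || q) = sum(p_i * log2(p_i / q_i)), in bits.
    Undefined (infinite) if q_i = 0 where p_i > 0, which is why
    uncovered arcs require the perturbed Kullback."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def relative_kullback(p, q, source_entropy):
    """Relative Kullback: 100 * (KL / H), where H is the source entropy."""
    return 100.0 * kullback(p, q) / source_entropy

testing = [0.5, 0.5]     # statistics of the executed tests
model = [0.75, 0.25]     # ``true'' distribution from the usage model
kl = kullback(testing, model)
```

As the testing distribution approaches the model distribution, both kullback and relative_kullback approach zero, matching the stopping-criterion interpretation given earlier.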