This section provides readers with the background information necessary to understand the research problem: minimal guidance is available on how to reduce the impact of expert bias on information technology risk assessment, decision making, and operations. The foundational judgment and decision-making research and terms established by Kahneman et al. are summarized, along with other uncommon terms used in the remainder of this capstone. The section then describes the options available to address the issue of expert bias. Studies have found that special measures are required when eliciting knowledge from experts, because experts may provide incorrect information without being aware of it. Specifically, the section discusses the options available for optimizing the use of experts and the peer-reviewed criticisms of each. The literature reviewed includes methods of expert knowledge elicitation from other fields of work that may be applicable to information security risk management, decision making, and daily operations such as Security Operation Centers (SOC) and Critical Incident Response Teams (CIRT).

Heuristics and biases. Kahneman et al. published research into the challenges of accurately eliciting knowledge from humans and called the cognitive cause of these challenges heuristics. Heuristics are simplifications of our environment that the human mind presumably makes in order to solve problems more quickly, at the expense of considering all of the particular details of a decision (Kahneman et al., 1974). The heuristics discussed include what they called Representativeness, Availability, and Anchoring and Adjustment. Biases that resulted from these heuristics were divided into Miscalibration, Conjunction Fallacy, and Base Rate Neglect. Representativeness was the heuristic seen when a person assumes that two things belong to the same group because they resemble each other. Availability was the heuristic seen when an expert assumes that an event occurs frequently because specific examples can be recalled from memory so readily. Anchoring and Adjustment was the two-part heuristic seen when an expert provided estimates that adhered closely to an example provided by the person eliciting a response. These heuristics help humans rapidly solve problems in simple situations, but mental shortcuts like these resulted in bias when applied to more complex problems (Kahneman et al., 1974). The terms and conditions coined by Kahneman et al. are used in many research articles, but their conclusion on human estimation capacity has been challenged.

Criticisms of Kahneman et al. In “The Rhetoric of Irrationality”, Lopes criticizes one of the conclusions made by Kahneman et al. in response to their findings. Her interpretation of their research was that people use heuristics instead of probability theory when making decisions most of the time, not simply that humans are poor estimators. Lopes pointed out that data was available which showed that people correctly estimated value and risk in gambling situations (Anderson & Shanteau, 1970; Shanteau, 1974; Tversky, 1967) and also in assessing the likelihood of fairly complex joint events occurring (Beach & Peterson, 1966; Lopes, 1976; Shuford, 1959). In that research, subjects produced the same probabilities as normative calculation using expected utility and compound probability multiplication. Neither of these mathematical methods was common knowledge among the subjects, but they produced equally correct answers.

In The ‘heuristics and biases’ bias in expert elicitation, Kynn found that there was a disproportionate bias toward the citation of Kahneman et al.’s research in the literature. Studies that showed poor performance by human estimators were cited in the literature six times as often as research showing good performance (Kynn, 2008). Kynn pointed out that, regardless of the strong arguments made by authors like Lopes, most citations do not acknowledge these criticisms of Kahneman et al.’s work. Research by Gigerenzer et al. found that humans are more Bayesian thinkers and intuitive statisticians than the works of Kahneman et al. expressed, so long as information is communicated to experts in a frequency format (1995). Frequency formats communicate information to experts in a form that more closely resembles the natural sampling observed in animal foraging and neural networks. What is 1% in standard format would be 10-out-of-1,000 in frequency format. The use of visuals was also a means to present data in a frequency format.

Bayesian inference. Bayesian Inference uses subjective estimates of knowledgeable persons and improves upon their estimates using statistical methods (Vose, 2008). The process of Bayesian Inference may involve an expert communicating their prior knowledge by multiplying together a likelihood estimate of an event occurring and the impact of that event based on historical data. These estimates and values take the form of functions or distributions instead of point estimates. Scientists and statisticians may instead use the “classical” approach, which is considered more objective than Bayesian Inference because it leaves less room for human error. The classical approach involves experimentation and identical independent trials instead of subjective estimates provided by experts.

Distribution. In this capstone, the term distribution refers to a Cumulative Distribution Function (CDF), distribution function, cumulative frequency function, cumulative probability function, or frequency distribution shape. A distribution function describes the probability that a random variable is less than or equal to some value. In the case of frequency distributions, this may take the form of a normal distribution. The normal distribution would show that most of the random values generated are closer to the mean than to either end of the range provided, forming the bell curve shape.
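
The definition of a distribution function above can be illustrated with a short sketch. The mean of 100 and standard deviation of 15 are hypothetical values chosen only for illustration; the sketch uses the NormalDist class from Python's standard statistics module.

```python
from statistics import NormalDist

# Hypothetical normal distribution: mean 100, standard deviation 15.
dist = NormalDist(mu=100, sigma=15)

# CDF: probability that a random draw is less than or equal to a value.
p_at_or_below_mean = dist.cdf(100)

# Probability of a draw landing within one standard deviation of the mean.
p_within_one_sd = dist.cdf(115) - dist.cdf(85)

print(f"P(X <= 100)       = {p_at_or_below_mean:.3f}")
print(f"P(85 <= X <= 115) = {p_within_one_sd:.3f}")
```

Roughly two thirds of the draws fall within one standard deviation of the mean, which is the clustering around the mean that produces the bell curve shape.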

Variability and uncertainty. A common criticism of probability is that there is simply too much luck or chance involved with certain events to measure probability. The academic community addresses this criticism with the concepts of variability and uncertainty. Vose posits that the human inability to predict future events is due to a factor of variability and uncertainty (2008). If the uncertainty factor is excluded from a risk assessment, the results may be overconfident, which equates to the ranges being unrealistically narrow, e.g. imagine the weatherman saying that there is between a 60 and 61% chance of rain tomorrow. The assessor may be either unaware of the overconfidence in their estimate or deliberately overconfident. Conversely, the exclusion of variability will widen ranges so much that the results may be useless, e.g. saying that there is between a 1 and 100% chance it will rain tomorrow. Variability is sometimes called chance; specifically, chance that is inherent and irreducible in what is being observed, e.g. the chance of a fair coin being heads or tails after being flipped cannot be controlled or reduced from 50%. Similarly, if the coin is tossed twice there is a 25% chance of getting each of heads-heads, heads-tails, tails-heads, or tails-tails (Vose, 2008). Since we cannot further reduce our uncertainty of what the coin will show after a flip, the results contain an irreducible factor of randomness. A more complex example that Vose states is the stock market (2008). Stock prices are affected by a potentially infinite number of factors and so cannot easily be predicted. The EPA describes variability as “true heterogeneity or diversity in a population or exposure parameter” and, unlike Vose, specifies that such randomness is irreducible only most of the time as opposed to absolutely (1997, p.9). The EPA’s response to high variability is to better characterize the diversity in a population or exposure parameter (1997, p.9).
By doing so, inferences can still be made from the sample being observed. Uncertainty, unlike variability, can be reduced and is inherent in all estimates. It may be reduced by collecting more information or by finding more knowledgeable subject matter experts. Vose posits that experts can be made to provide subjective estimates that are as objective as possible (Vose, 2008). They do so by following a logical path of reasoning which excludes prior, non-quantitative information about what they are assessing. This resembles the calibration technique described in the next section of this capstone.



The methods that will be evaluated include single-point scoring as proposed in many industry standards; calibration of estimators as proposed by Lichtenstein et al. (1982); aggregating data with expert opinion using Bayes’ theorem by Yang et al. (1997); Bayesian reasoning with frequency formats as proposed by Gigerenzer et al. (1995); aggregating estimates by multiple experts by Yaniv (2004), Harvey et al. (1997), Lim et al. (1995), Johnson et al. (2001), and Lele et al. (2006); and Monte Carlo simulation modeling as described by the EPA (1997) and Vose (2008). Each method is followed by criticisms where criticisms were available.

Single-point subjective scoring. There are a variety of expert knowledge elicitation methods that involve communicating estimates in the form of single-point estimates, or scores. Methods for assessing risk that elicit estimates in the form of single-point figures, as opposed to ranges, are called “single-point subjective scoring” throughout this paper. Official standards organizations that publish guidance for eliciting expert estimates in the form of single points include: the National Institute of Standards and Technology (NIST SP 800-30), the International Organization for Standardization (ISO 31000); organizations that use ISO 31000 like the British Standards Institution (BSI), the Australian Standard/New Zealand Standard (AS/NZS); the Information Security Forum (ISF), SANS Institute, Global Information Assurance Certification (GIAC), the Federal Financial Institutions Examination Council (FFIEC), the Information Systems Audit and Control Association (ISACA), the Project Management Institute (PMI), and the Computing Technology Industry Association (CompTIA), among many others. These organizations provide guidance to organizations on measuring organizational risk. In practice, a subject matter expert rates the probability and impact of events. Experts are asked to provide their estimates in the form of scores like 1, 2, or 3; the words high, medium, or low; or the colors green, yellow, and red. In each case the answers are points on a spectrum. The first round of questions asks for the probability of an event. The second round asks for the impact. On one end of each of these spectrums may be the words “very low” and on the other end “very high.” In some cases, these descriptors are used explicitly instead of colors or numbers. Some methods use percentages or decimals between zero and one in place of the aforementioned spectrum. Once ratings are established, they are sometimes fit into a grid or a risk matrix.
Some methods add the step of multiplying the probability numbers and impact numbers together. Multiplying these two values together creates a single value believed to represent both probability and impact of the event or condition being assessed. No peer-reviewed publications are available that test the mathematical validity or success of these methodologies.
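
The multiplication step described above can be sketched as follows. The event names, the 1 to 5 scale, and the function name risk_score are hypothetical and are not taken from any of the standards listed; the sketch only shows the arithmetic.

```python
# Hypothetical 1-5 ordinal ratings assigned by a subject matter expert.
ratings = {
    "frequent minor outage": {"probability": 5, "impact": 1},
    "rare major breach": {"probability": 1, "impact": 5},
}


def risk_score(probability: int, impact: int) -> int:
    """Multiply the probability rating by the impact rating."""
    return probability * impact


scores = {name: risk_score(**r) for name, r in ratings.items()}

for name, score in scores.items():
    print(f"{name}: {score}")
```

In this made-up case both events receive a score of 5 even though their ratings are mirror images of each other.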

Criticisms of single-point subjective scoring. Hubbard and Evans identified four problems with scoring methods. The conclusion of their study was in favor of risk assessment methods that measure risk in terms of mathematical probability using methods like Monte Carlo simulation (Hubbard & Evans, 2010). Organizations measuring their risk of experiencing rare events, like flu pandemics or data breaches, would have to systematically track the occurrence of the rare events in their organization over a longer period of time to determine the accuracy of their forecasts. The extended period of time would require the organization to trust its score-based forecasts until a sufficient number of rare and harmful events occurred, or did not occur, to evaluate their effectiveness (Hubbard & Evans, 2010). In studies involving single-point scoring methods that use labels in place of scores, such as high probability, medium probability, and low probability, interpretations varied considerably between assessors (Hubbard & Evans, 2010). This disparity impacted assessment output in most cases. Similar studies have shown assessors defining terms differently, even when provided explicit definitions (Budescu, Broomell, & Por, 2009; Heuer, 2005). When multiple risks are considered in an assessment, dependencies between them that may change risk ratings are not factored into any of the observed single-point methods (Hubbard & Evans, 2010). Hubbard and Evans proposed the use of percentages to communicate probability, dollar amounts to communicate impact, bias-reducing techniques like calibration, and Monte Carlo simulation to integrate probability and impact into a form that mathematical methods can be used on (Hubbard & Evans, 2010).

Risk matrices. Risk matrices fail to communicate the disparity between different risk events. Two highly disparate risks may sit visually side-by-side on a risk matrix, which gives the illusion that addressing either will mitigate a similar amount of risk. Cox used risk matrices developed by the Federal Highway Administration, the Federal Aviation Administration, and the California Department of Transportation, along with one from a General Accounting Office report on “Combatting Terrorism”, to explore the mathematical properties of risk matrices (2008). Cox provides a list of conditions required to use a risk matrix to produce accurate results. When data is unavailable, subjective ratings are produced by experts and then mapped onto a risk matrix. Cox’s findings were that high-quality data, as opposed to subjective estimates, were a minimum requirement for effective matrix use (Cox, 2008). In most cases, a risk matrix will not provide what decision makers are looking for, even under ideal conditions. In all cases, something will be lost, and errors in matrix construction are often invisible without time-consuming evaluation of the individual matrices by mathematics experts. Cox observed that risk matrices provided poor detail resolution, errors, suboptimal resource allocation, and ambiguous inputs and outputs (Cox, 2008). When hazards given single-point risk ratings were mapped on risk matrices, they appeared to have identical risk when their risk values were highly disparate.
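
The loss of resolution described above can be shown with a minimal sketch. The cutoffs, the two risks, and all names below are hypothetical values chosen for illustration and are not taken from Cox (2008) or from any of the matrices he examined.

```python
def rating(value, cutoffs):
    """Map a number onto a three-level label using two assumed cutoffs."""
    low_cut, high_cut = cutoffs
    if value < low_cut:
        return "low"
    if value < high_cut:
        return "medium"
    return "high"


# Assumed bands for a 3x3 matrix (hypothetical, for illustration only).
PROBABILITY_CUTOFFS = (0.10, 0.50)      # annual probability
IMPACT_CUTOFFS = (10_000, 1_000_000)    # loss in dollars


def matrix_cell(probability, impact):
    """Return the (probability label, impact label) cell for one risk."""
    return (rating(probability, PROBABILITY_CUTOFFS),
            rating(impact, IMPACT_CUTOFFS))


# Two made-up risks as (annual probability, dollar impact).
risk_a = (0.11, 20_000)
risk_b = (0.49, 900_000)

cell_a = matrix_cell(*risk_a)
cell_b = matrix_cell(*risk_b)
expected_loss_a = risk_a[0] * risk_a[1]
expected_loss_b = risk_b[0] * risk_b[1]

print(cell_a, round(expected_loss_a))
print(cell_b, round(expected_loss_b))
```

Both risks land in the same medium/medium cell, although the expected loss of the second is about two hundred times that of the first.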

Calibration. Analysts may provide decision makers with estimates of the likelihood or impact of future events. They may also provide estimates for past events when sufficient historical data is unavailable. Estimates may be produced by processing data from a sufficiently reliable source. For the purposes of this research, a sufficiently reliable source includes any instrument that provides verifiably accurate measurements. Verifiably accurate and precise measurements have been provided by trained experts (Lichtenstein and Fischhoff, 1980). An example of a non-human instrument is a correctly calibrated pH probe. Such an instrument provides verifiably accurate data and so is trusted as providing factual data, but only after it has been calibrated. In the same way, a calibrated human provides trustworthy information; some researchers go so far as to call this elicited information data. Calibration training requires intervals instead of single-point estimates. Interval estimates are estimates that take the form of ranges instead of single-point answers. For example, a questionnaire may ask: In what year was Benjamin Franklin born? The instructions may request a range of years instead of a single number. An expert taking such a questionnaire may be asked to provide a range that they are 90% confident contains the correct answer, e.g. 1650-1750. That range is referred to as their confidence interval (CI) for that particular estimate. Experts are calibrated if, over time, the events to which they assign a given probability turn out to be true about that proportion of the time (Lichtenstein and Fischhoff, 1980). This ability may be improved through what they call calibration training. Training involves a trainer asking a trainee many questions to which the answers are known to the trainer. The trainee is asked to provide answers in the form of interval estimates. The objective of calibration is to get experts into the habit of recognizing their uncertainty when providing estimates.
This skill is independent of subject matter and may be learned by most people who take the time (Kynn, 2008). The ranges provided should be sufficiently large to contain the answer but not so large that the expert is not providing useful information. That being said, if the expert provides only wide ranges, they may not have sufficient expertise in the subject matter.
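
A trainer could check a trainee's calibration along the following lines. This is a minimal sketch under stated assumptions: apart from the Benjamin Franklin item, every interval and true value below is made up, and the function name interval_hit_rate is hypothetical.

```python
# Each entry is (lower bound, upper bound, true answer). Every interval is
# meant to be a 90% confidence interval. Only the first item comes from the
# text; the remaining values are made up for illustration.
responses = [
    (1650, 1750, 1706),   # birth year of Benjamin Franklin
    (10, 20, 25),
    (100, 300, 250),
    (5, 8, 9),
    (0, 50, 30),
    (40, 60, 41),
    (1000, 2000, 2500),
    (2, 4, 3),
    (70, 90, 85),
    (15, 25, 20),
]


def interval_hit_rate(answers):
    """Share of intervals that contain the true answer."""
    hits = sum(low <= truth <= high for low, high, truth in answers)
    return hits / len(answers)


stated_confidence = 0.90
observed = interval_hit_rate(responses)

print(f"stated confidence: {stated_confidence:.0%}")
print(f"observed hit rate: {observed:.0%}")
if observed < stated_confidence:
    print("Intervals are too narrow: the trainee appears overconfident.")
```

With these made-up answers only 7 of the 10 intervals contain the truth, so a trainee claiming 90% confidence would be asked to widen their ranges.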

Criticisms of calibration. One of the findings in Calibration of Probabilities: The State of the Art to 1980 was that training can only improve calibration to a limited extent (Lichtenstein et al., 1981). Lichtenstein et al. recommend the continued development of technology and bias-reducing methods (Lichtenstein, Fischhoff et al., 1981).

Integrating data with expert opinion using Bayes’ Theorem. Methods that use Bayes’ Theorem to integrate the subjective opinions of experts with real data are referred to as Subjective Bayesian methods in this capstone. Real data refers to data obtained using instruments that maintain reliable accuracy. In most cases examined by Yang & Berger, opinions of experts were elicited in the form of their prior beliefs (1997). These elicited priors were then used to help form a statistical model. Once quantified into a probability distribution, the priors made up the parameters of the statistical model, typically the minimum and maximum possible outcomes. The resulting distributions are sometimes referred to as “posterior distributions” (Yang & Berger, 1997). Posterior distributions can then be used to make probabilistic predictions.
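
The general idea of combining an elicited prior with observed data can be sketched with a simple conjugate update. This is an illustrative assumption, not the procedure of Yang & Berger: the Beta prior, the counts, and the function names are all hypothetical.

```python
def update_beta(prior_alpha, prior_beta, events, non_events):
    """Combine a Beta prior with observed counts (conjugate update)."""
    return prior_alpha + events, prior_beta + non_events


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


# Assumed expert prior: the event happens in roughly 10% of periods,
# encoded as Beta(2, 18). These numbers are made up for illustration.
prior = (2.0, 18.0)

# Assumed observed data: 1 event in 30 periods.
posterior = update_beta(*prior, events=1, non_events=29)

print(f"prior mean:     {beta_mean(*prior):.3f}")
print(f"posterior mean: {beta_mean(*posterior):.3f}")
```

The posterior mean falls between the expert's prior belief and the observed frequency, and it moves closer to the data as more observations accumulate.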

Criticisms of subjective Bayesian methods. Yang and Berger experienced difficulty quantifying expert opinion in the form of a prior distribution. Challenges they described included quantifying the informativeness of an expert, justifying the costs and training of experts to provide subjective Bayesian estimates, eliciting minimally biased prior distributions, and prioritizing which factors to count and discount based on their perceived importance (1997). Methods for addressing these challenges are proposed by Lele and Allen later in this capstone.

Bayesian reasoning with frequency formats. Communicating conditions in the form of frequency formats is a solution proposed by Gigerenzer et al. in How to Improve Bayesian Reasoning Without Instruction: Frequency Formats (1995). People were more likely to understand what was being asked of them and to communicate their expert estimates more effectively when the information was presented in a frequency format. Estimation with Bayesian reasoning using frequency formats was a method proposed in response to the difficulty people appeared to have when assessing risk using other formats, such as tables of purely numerical information. The difficulty is usually due to varying levels of expertise and comprehension of the mathematical concepts behind the questions. Gigerenzer et al. observed the assessments animals appeared to make when faced with risk. Animals seemed to comprehend risks that presented themselves in terms of a frequency format. In their research, the subjects provided with frequency formats produced significantly more problem-solving methods that followed the equivalent Bayesian algorithms than those who received standard formats, regardless of education in Bayes’ theorem. So long as the format presented was a frequency format, the students tested intuitively used Bayes’ theorem. The standard probability format produced the opposite result. The analogy used was feeding binary formatted numbers (combinations of 0 and 1) into a modern calculator that only understands the numbers 0 to 9. The calculator is exceptional at calculating values, but if the format of the information is unusual, then it cannot calculate accurately, if at all. This new way of formatting problems had additional benefits, such as reducing symptoms of the conjunction fallacy (Tversky & Kahneman, 1983), overconfidence bias, and representativeness (base rate neglect).
Some of the formats evaluated included Standard probability format, Standard frequency format, Short probability format, and Short frequency format. Standard probability format consisted of presenting probabilities in the form of percentages. A series of known probabilities was presented to the subjects. Using those percentages, the subject was then asked for the probability of an event or condition based on the probabilities provided. Their answer was also requested in percentage form. Standard frequency format consisted of presenting the same probabilities but in fraction statements like “10 out of every 1,000” instead of 1%. The answer was requested in the same fractional statement format. Probabilities presented in the frequency format caused subjects to use implicit mental calculations that resembled Bayesian formulas. The result of making estimates in this way was more accurate estimates more often, even for users with minimal statistics background (Gigerenzer et al., 1995).
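
The equivalence between the two formats can be checked with a short sketch. The rates and counts below are illustrative assumptions rather than figures quoted from the study; the point is only that the frequency version reaches the same answer with less arithmetic.

```python
from fractions import Fraction

# Probability format (illustrative numbers, not taken from the study).
base_rate = Fraction(1, 100)             # 1% of cases have the condition
true_positive_rate = Fraction(80, 100)   # 80% of those are flagged
false_positive_rate = Fraction(10, 100)  # 10% of the rest are also flagged

# Bayes' theorem applied to the percentages.
posterior_from_probabilities = (base_rate * true_positive_rate) / (
    base_rate * true_positive_rate
    + (1 - base_rate) * false_positive_rate
)

# Frequency format: the same information as counts out of 1,000 cases.
# 10 of 1,000 have the condition and 8 of those 10 are flagged;
# 99 of the remaining 990 are flagged as well.
flagged_with_condition = 8
flagged_without_condition = 99
posterior_from_frequencies = Fraction(
    flagged_with_condition,
    flagged_with_condition + flagged_without_condition,
)

print(posterior_from_probabilities, posterior_from_frequencies)
```

In the frequency version the estimate reduces to 8 out of 107 flagged cases, which requires a single division rather than the full formula.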

Criticisms of the frequency format methodology. Although not a direct criticism, Paul Meehl et al. found that most risks are not single-factor problems and so are not consistently calculable by the human mind (Meehl et al., 1986). Meehl et al. specified that computers should be used whenever possible to combine single-factor probabilities into a risk model that reflects reality more accurately (1986).

Aggregating estimates from multiple experts. The opinions of experts are used to provide decision makers with guidance when hard data, the skills required to process the data, or the funds for either are unavailable. Common methods used by decision makers for aggregating the opinions of multiple experts include simple averages, weighting, and trimming. Yaniv cited Kahneman, Slovic, and Tversky’s (1982) research but proposed further research into when heuristics could be an effective tool for decision-making (Yaniv, 1997). Yaniv confirmed that decision-makers tend to use heuristics, like weighting and trimming, when performing complex tasks, such as aggregating highly disparate opinions. Combining estimates provided by multiple experts improved accuracy in many studies (Ashton & Ashton, 1985; Sniezek & Buckley, 1995; Sorkin et al., 2001; Winkler & Poses, 1993; Yaniv, 1997; Yaniv & Hogarth, 1993; Zarnowitz, 1984). Yaniv performed two studies, one in which computers were tasked with aggregating expert opinions, and the other in which humans were tasked to do the same using weighting, trimming, and a combination of the two. Each of these methods was used on a variety of sample sizes to identify whether the heuristics provided accurate aggregates with varying sample size. The conclusion of the study was that the heuristics of weighting and trimming, when performed by humans, are justifiable measures that increased accuracy of estimation depending on the properties of the estimates being made.

Weighting and trimming for aggregating judgments. Yaniv published research in 2004 that examined how people weight advice received from others. Yaniv’s research evaluated the influence of advice from advisors on advisees and the extent to which advice can improve judgment accuracy. It was found that advisees tended to adjust their initial estimates toward estimates provided by advisors but typically weighted their own opinion higher. Advisees who were knowledgeable in the subject matter were found to adjust their estimates less. The more disparate the advisor’s estimate was from the advisee’s initial estimate, the less adjustment tended to occur. The accuracy of estimates yielded from advisor and advisee collaboration tended to increase even with this observed conditional discounting of advisor opinion. Overall, seeking advice tended to improve accuracy, but Yaniv demonstrated that integrating advice can be optimized to provide more precision. Yaniv found that the weight of advice provided by advisors decreased as deviation, or distance, from the advisee’s initial estimate increased. Advice improved accuracy significantly though suboptimally. Advisees tended to put too much confidence in their own knowledge and estimates, discounting advisors’ advice to the point of decreasing accuracy of the resulting estimate. Advisees who recognized their lack of knowledge in the subject matter were found to discount advice from advisors less. Advisees tended to adjust their initial estimates toward those provided by advisors, though only slightly, having more confidence in their initial ideas. When advisors provided an estimate that varied greatly from the advisee’s, the advisee would adjust their initial estimate even less. In Yaniv’s 2004 study, subjects received bonuses for making accurate judgments. Their initial estimates could also be adjusted after viewing estimates made by other subjects.
The estimates of other subjects were provided along with computer-generated estimates that were known to be inaccurate. This was to test whether wrong estimates were just as influential on subjects. Estimates were provided in the form of ranges instead of single points. The Confidence Interval (CI) requested was .95. Weighting by decision makers was inferred by measuring the variation in estimates before and after advice was given. For example, the two ends of the weight spectrum are 0% discounting and 100% discounting. If the decision maker discounts the advisor’s advice completely, that is 100% discounting and is given a weight of 0. In contrast, 0% discounting of advice is given a weight of 1.0. Harvey and Fischer found that estimates by decision makers were observed shifting 20-30% toward advisor estimates. This occurred even when they were aware that their advisor had less training than them in the subject matter (Harvey & Fischer, 1997). Research by Lim and O’Connor showed that advice in the form of calculated reports, such as those created using statistical measures, was discounted by advisees most of the time. Subjects tended to give their initial forecasts twice the weight that they gave to the statistical model meant to inform their decision (Lim & O’Connor, 1995). Yaniv evaluated the methods for aggregating expert judgments under uncertainty using weighting and trimming. These methods were meant to offset challenges that arose from conflicting subjective estimates and varying levels of uncertainty between experts.
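
The weight scale described above, where 0 means the advice was fully discounted and 1.0 means it was fully adopted, can be computed from estimates made before and after advice. The formula below is one common way to express that shift and is an assumption here, not necessarily the exact measure used in the studies cited; the function name and numbers are hypothetical.

```python
def weight_of_advice(initial, advice, final):
    """Fraction of the distance toward the advice that the advisee moved.

    0.0 means the advice was ignored (100% discounting);
    1.0 means the advice was adopted outright (0% discounting).
    """
    if advice == initial:
        raise ValueError("advice equals the initial estimate; weight is undefined")
    return (final - initial) / (advice - initial)


# Made-up example: the advisee first says 100, the advisor says 200,
# and the advisee revises to 125.
weight = weight_of_advice(initial=100, advice=200, final=125)
discounting = 1 - weight

print(f"weight of advice: {weight:.2f}")
print(f"discounting:      {discounting:.0%}")
```

A weight of 0.25 corresponds to a shift of 25% toward the advisor, which sits inside the 20-30% range reported by Harvey and Fischer.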

Judgments provided by some experts may be given more weight than those of others when aggregating multiple estimates. Judgments that are relative outliers may also be trimmed from the aggregate product. A judgment may be an outlier because the expert expresses an unusually high or low level of confidence in it, or because it lies far from the average judgment among the experts. The benefit of trimming is that it abstracts the central tendency of the judgments, and it prevents one expert from influencing the set of judgments excessively. The drawback of trimming arises when most of the experts are wrong and the outlying judgments are right. Yaniv found that methods which involved selecting judgments that overlap were ineffective as sample size increased, due to disagreement among judges. Yaniv’s research found that estimates with confidence intervals stated by experts to be at 95% contained the true answer only 43% of the time. To remediate this, Yaniv found that weighting judgments by inverse width improved accuracy. Trimming and weighting together provided more accurate judgments than trimming or weighting alone. The use of weighting and trimming also proved to be more effective at yielding accurate answers than simple averaging. Yaniv’s 2004 publication found that simply aggregating another person’s opinion into an estimate improved accuracy by 20%. This effect does not require experts smarter or more knowledgeable than the decision makers requesting estimates. The estimates must come from advisors who work independently from each other, although some dependence has still produced useful data (Johnson, Budescu, & Wallsten, 2001).
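
Trimming and inverse-width weighting can be sketched as follows. This is a minimal illustration under stated assumptions, not Yaniv's exact procedure: the intervals are made up, trimming is done on interval midpoints, and all function names are hypothetical.

```python
def midpoint(interval):
    low, high = interval
    return (low + high) / 2


def width(interval):
    low, high = interval
    return high - low


def trim(intervals, k=1):
    """Drop the k lowest and k highest judgments, ranked by midpoint."""
    ordered = sorted(intervals, key=midpoint)
    return ordered[k:len(ordered) - k] if len(ordered) > 2 * k else ordered


def inverse_width_average(intervals):
    """Average the midpoints, weighting narrow intervals more heavily.

    Assumes every interval has a positive width.
    """
    weights = [1 / width(i) for i in intervals]
    total = sum(w * midpoint(i) for w, i in zip(weights, intervals))
    return total / sum(weights)


# Made-up interval estimates from five experts.
estimates = [(90, 110), (80, 120), (95, 105), (100, 300), (0, 40)]

simple_mean = sum(midpoint(i) for i in estimates) / len(estimates)
trimmed_weighted = inverse_width_average(trim(estimates, k=1))

print(f"simple average of midpoints: {simple_mean:.1f}")
print(f"trimmed, inverse-width mean: {trimmed_weighted:.1f}")
```

Here the two extreme judgments are removed before weighting, so a single very wide or very distant estimate cannot pull the aggregate away from the central group.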

Frequentist statistical analysis with logistic regression for aggregating judgments. Lele and Allen found that subjective Bayesian methods provided excessively wide confidence intervals. They proposed eliciting data instead of priors to combine subjective expert estimates with hard data in order to increase the precision of statistical analyses (Lele & Allen, 2006). Their method also involved distinguishing between useful and less useful experts and weighting their estimates accordingly. Useful experts are those whose estimates provide information beyond what hard data is available, as opposed to providing estimates that stray from the truth. This differs from rating expert informativeness based on traditional factors like the expert’s experience, fame, or other qualitative characteristics.

Lele and Allen attempted to remedy some of the challenges that came with the logistic regression approach to probability (2006). This was in response to Fleishman et al.’s observations on the difficulty of estimating parameters in logistic regression when the number of important covariates was large compared to the number of observations (Fleishman et al., 2001). Where the standard error was excessively large, as was the case with rare events, logistic regression provided little value (Lele & Allen, 2006). Lele and Allen’s study attempted to use expert knowledge to increase the accuracy of logistic regression, which in turn would provide more accurate predictions.

In Lele and Allen’s study, an expert was considered useful if they could improve accuracy, provide information that decreased standard error, and narrow the width of the confidence intervals established previously by analysts on data alone. The example problem used to test their hypothesis was to estimate a logistic regression model that related the probability of species habitation to habitat covariates at select locations (Lele & Allen, 2006). They were specifically attempting to predict the presence of the masked shrew, which was believed to be a good indicator of certain environmental factors of interest in their research. What frequentist subjective information they did collect was used to supplement a logistic-regression-derived prediction.

Lele and Allen found that elicitation of priors was difficult because priors are not in a statistical format intuitive to most scientists (2006). Although communicating the statistical format to scientists was possible, it was not often practical. Their research concluded that it was easier to elicit expert estimates in the form of the probability of events occurring or not, instead of a prior distribution on the parameters of the statistical model.

Lele and Allen also evaluated the effectiveness of measuring the differences in informativeness between multiple experts when multiple experts are employed for the elicitation. The statistical approach and formulas used are common hierarchical models for measuring error in models. Lele and Allen provide a mathematically precise formula and instruction on how to quantify the informativeness of experts (2006). They also measure the value of combining multiple expert opinions.

After these methods and functions had been established, Lele and Allen tested them on species occurrence data observed and elicited from experts (2006). What they found was that experts were not equally useful, that experts varied in their usefulness based on the species under question, and that a well-calibrated but non-useful expert will not negatively affect the overall analysis. Eliciting information on the observable scale and using their regression model to relate the elicited data to the observed data allowed calibration of the expert’s opinion in the form of elicited data.

Criticisms of aggregating estimates from multiple experts. Another conclusion drawn from Yaniv’s studies with normative computer simulations was that the number of judgments required for accurate answers was fewer than suspected. It was found that increasing the number of experts providing estimates was minimally beneficial when using more than 2, up to 8, experts. This finding, that few opinions are required for accurate estimates and that more subjective data is not necessarily valuable, was also observed by several other researchers (Ashton & Ashton, 1985; Hogarth, 1978).
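One way to see why additional experts can add little is a small simulation. The sketch below is not Yaniv's simulation; it assumes a hypothetical error model in which every expert's error is the sum of a component shared by all experts and an independent personal component. The function name and the standard deviations are illustrative assumptions.

```python
import math
import random

def rmse_of_average(num_experts, trials=20_000, shared_sd=1.0, own_sd=1.0, seed=0):
    """Root-mean-square error of the mean estimate from `num_experts` experts.

    Hypothetical error model: each expert's error is a component shared by
    all experts plus an independent personal component.
    """
    rng = random.Random(seed)
    squared_errors = []
    for _ in range(trials):
        shared = rng.gauss(0.0, shared_sd)
        personal = sum(rng.gauss(0.0, own_sd) for _ in range(num_experts)) / num_experts
        squared_errors.append((shared + personal) ** 2)
    return math.sqrt(sum(squared_errors) / trials)

# Error of the averaged estimate as the panel grows
for k in (1, 2, 4, 8, 16):
    print(k, round(rmse_of_average(k), 3))
```

Under this assumed model the error of the average is sqrt(shared_sd² + own_sd²/k), so averaging removes only the personal component and the gain from each extra expert shrinks quickly.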

Monte Carlo simulation modeling. Monte Carlo simulation modeling is a collection of tools for simulation using computational algorithms and repetitive random sampling (Metropolis & Ulam, 1949). In a Monte Carlo simulation, a set of sample data is created using random numbers generated within the constraints of user-specified parameters. Analysts generate random values that fall within the range (interval estimate) provided by the expert. The random numbers generated within that range are constrained further by having the random number generator produce only values that follow a specified frequency distribution. Monte Carlo simulation is widely regarded as a valid technique, and the mathematics required to create one can be very simple (Vose, 2008). Monte Carlo simulations are considered an upgrade from methods like what-if analyses. What-if analyses produce wider ranges that are limited exclusively to probabilities or impacts, whereas Monte Carlo simulations take into account data from both variables. Many probability distributions may be used in Monte Carlo simulations. Common distributions that may be familiar to readers include the Triangle, Uniform, PERT, Relative, Cumulative Ascending, and Discrete distributions. One of two common methods for generating samples used in Monte Carlo analysis is Latin Hypercube Sampling. The other common method is simple random number generation. The former is employed to reduce model complexity when the resources required for simulation become excessive (EPA 1997, p. 7).
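The mechanics described above can be sketched in a few lines of Python. This is a minimal illustration, not a method taken from Vose or the EPA: the loss model (an event either occurs or not, and has an impact if it does), the Triangle distribution, the three-point estimates, and all function names are assumptions chosen for the example. Setting `latin_hypercube=True` draws exactly one value from each equal-probability stratum of each input instead of using simple random numbers.

```python
import math
import random
import statistics

def triangular_ppf(u, low, mode, high):
    """Inverse CDF of a Triangle distribution: maps u in [0, 1] to a value."""
    cut = (mode - low) / (high - low)
    if u < cut:
        return low + math.sqrt(u * (high - low) * (mode - low))
    return high - math.sqrt((1.0 - u) * (high - low) * (high - mode))

def uniform_draws(n, rng, latin_hypercube):
    """n values in [0, 1]: simple random, or one per equal-width stratum."""
    if not latin_hypercube:
        return [rng.random() for _ in range(n)]
    draws = [(i + rng.random()) / n for i in range(n)]
    rng.shuffle(draws)  # shuffle so the inputs are paired at random
    return draws

def simulate_losses(n, prob_est, impact_est, latin_hypercube=False, seed=0):
    """Each trial: loss equals the sampled impact if the event occurs, else 0.

    prob_est and impact_est are (low, most likely, high) expert estimates.
    """
    rng = random.Random(seed)
    u_prob = uniform_draws(n, rng, latin_hypercube)
    u_impact = uniform_draws(n, rng, latin_hypercube)
    losses = []
    for up, ui in zip(u_prob, u_impact):
        probability = triangular_ppf(up, *prob_est)
        impact = triangular_ppf(ui, *impact_est)
        # The occurrence draw itself is always a simple random number.
        losses.append(impact if rng.random() < probability else 0.0)
    return losses

prob_est = (0.05, 0.10, 0.30)           # hypothetical annual probability
impact_est = (10_000, 50_000, 250_000)  # hypothetical loss range in dollars
losses = simulate_losses(20_000, prob_est, impact_est, latin_hypercube=True)
mean_loss = statistics.fmean(losses)
p95_loss = sorted(losses)[int(0.95 * len(losses))]
print(f"mean loss: {mean_loss:,.0f}  95th percentile: {p95_loss:,.0f}")
```

Because each trial combines a sampled probability with a sampled impact, the output distribution reflects both variables at once, which is the advantage over a what-if analysis noted above.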

Fitting distributions to data. Selecting a distribution to use for a Monte Carlo simulation requires elementary statistical methods or the use of distribution fitting software. Distribution fitting software provides the name of and formula for the distribution that best fits the sample data entered into it. The data entered into the software may be sample data derived from events that sufficiently resemble the kind of events being estimated by the expert. The quality of the data used will depend on factors including collection methods, bias of the organization providing the data, and scope of the data available. Vose prescribes the use of maximum likelihood estimators and optimizing goodness-of-fit in addition to analyzing the properties of the data being observed. Vose divides the techniques available for determining a sample’s distribution into first-order parametric, second-order parametric, and nonparametric. There are also systematic and non-systematic errors, sample size, and sample dispersion to consider.
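As a rough illustration of maximum likelihood fitting followed by a goodness-of-fit comparison, the sketch below fits two candidate distributions to the same sample and compares their Kolmogorov-Smirnov statistics. It is a minimal example under stated assumptions, not Vose's procedure: the two candidate families (lognormal and exponential), the synthetic sample standing in for observed data, and the function names are all chosen for illustration.

```python
import math
import random

def fit_lognormal_mle(sample):
    """Maximum likelihood estimates (mu, sigma) of a lognormal distribution."""
    logs = [math.log(x) for x in sample]
    mu = sum(logs) / len(logs)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in logs) / len(logs))
    return mu, sigma

def fit_exponential_mle(sample):
    """Maximum likelihood estimate of the exponential rate parameter."""
    return len(sample) / sum(sample)

def lognormal_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((math.log(x) - mu) / (sigma * math.sqrt(2.0))))

def exponential_cdf(x, rate):
    return 1.0 - math.exp(-rate * x)

def ks_statistic(sample, cdf):
    """Largest gap between the empirical CDF and a fitted CDF (smaller is better)."""
    xs = sorted(sample)
    n = len(xs)
    gap = 0.0
    for i, x in enumerate(xs):
        fitted = cdf(x)
        gap = max(gap, (i + 1) / n - fitted, fitted - i / n)
    return gap

rng = random.Random(1)
sample = [rng.lognormvariate(2.0, 0.5) for _ in range(2_000)]  # stand-in for observed data

mu, sigma = fit_lognormal_mle(sample)
rate = fit_exponential_mle(sample)
ks_lognormal = ks_statistic(sample, lambda x: lognormal_cdf(x, mu, sigma))
ks_exponential = ks_statistic(sample, lambda x: exponential_cdf(x, rate))
print(f"lognormal KS: {ks_lognormal:.3f}  exponential KS: {ks_exponential:.3f}")
```

The candidate with the smaller statistic tracks the sample more closely. Distribution fitting software automates the same comparison across many more families.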

Fortunately, fitting distributions to data is not a new science and can be learned and practiced with elementary statistical education or the use of best-fit tools like those listed previously (Vose, 2008).

U.S. Environmental Protection Agency’s Monte Carlo Analysis guidance. The Risk Assessment Forum of the U.S. Environmental Protection Agency (EPA) published the Guiding Principles for Monte Carlo Analysis in March 1997. The document discussed the objectives, challenges, and potential value of Monte Carlo Analysis as applied to EPA efforts. Other EPA documents have emphasized the importance of using probabilistic techniques in risk assessments for adequately characterizing variability and uncertainty. These included the 1986 Risk Assessment Guidelines, 1992 Risk Assessment Council (RAC) Guidance (the Habicht memorandum), 1992 Exposure Assessment Guidelines, and 1995 Policy for Risk Characterization (the Browner memorandum) (EPA 1997, p. 1). The document states that the guidance provided serves only as a minimum set of principles and that innovative methods are encouraged where scientifically defensible (EPA 1997, p. 3).

Introduction. The basic goal of Monte Carlo analysis in the guidance is to quantitatively characterize the uncertainty and variability in estimates of exposure or risk, to identify key sources of variability and uncertainty, and to quantify their impact on risk model output. The EPA explicitly states in this section that the guidance is not intended to provide technical guidance on conducting or evaluating variability and uncertainty analysis.

Limit of review for EPA guidance. Each of these points is discussed in more detail in the 1997 report. The report also describes the benefits of Monte Carlo simulation and outlines methods for adequately communicating the often unfamiliar concepts of variability and uncertainty in terms of risk. Rather than technical instruction, the document is meant to provide a discussion of principles of good practice for Monte Carlo simulation as applied to environmental assessments (EPA 1997, p. 11). The guidance included a set of standard conditions for satisfactory probabilistic risk assessment methods. The conditions were based on principles relating to “good scientific practices of clarity, consistency, transparency, reproducibility, and the use of sound methods” (EPA 1997, p. 1). The relevant principles may be summarized as:

  1. Purpose and scope clearly articulated, including a full discussion of highly exposed or susceptible subpopulations.
  2. Methods of analysis clearly documented, data representativeness clearly defined, all models and software documented, and information on the whole analysis sufficient for independent parties to reproduce the research.
  3. Sensitivity analysis discussed, and probabilistic techniques applied to compounds, pathways, and factors of importance.
  4. Correlations or dependencies between input variables incorporated into analysis and clear descriptions mapping their effects on the output distribution.
  5. Input and output distribution information documented including tabular and graphical representations, locations of any point estimates of interest, rationale behind distribution selection, and variability and uncertainty differentiated.
  6. Numerical stability of the central tendency and the higher end of the output distribution is discussed.
  7. Exposures and risks using deterministic methods are documented, allowing comparison to probabilistic analysis and other historical assessments. Similarities and differences between probabilistic and other methods in terms of data, assumptions, and models documented.
  8. Fixed assumption metrics are documented and output distributions are aligned to them.
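Principle 4 in the list above can be illustrated with a short sketch showing how a dependency between two inputs changes the spread of the output. The model (a total that sums two normally distributed inputs), the means, standard deviations, correlation value, and function names are all hypothetical and are not drawn from the EPA guidance.

```python
import math
import random
import statistics

def correlated_standard_normals(rng, rho):
    """Two standard normal draws with correlation rho."""
    z1 = rng.gauss(0.0, 1.0)
    z2 = rho * z1 + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
    return z1, z2

def simulate_total(n, rho, seed=0):
    """Total of two hypothetical inputs whose values move together when rho > 0."""
    rng = random.Random(seed)
    totals = []
    for _ in range(n):
        z1, z2 = correlated_standard_normals(rng, rho)
        input_a = 100.0 + 20.0 * z1  # hypothetical input: mean 100, sd 20
        input_b = 50.0 + 10.0 * z2   # hypothetical input: mean 50, sd 10
        totals.append(input_a + input_b)
    return totals

independent = simulate_total(20_000, rho=0.0)
dependent = simulate_total(20_000, rho=0.8)
print("sd with independent inputs:", round(statistics.stdev(independent), 2))
print("sd with correlated inputs: ", round(statistics.stdev(dependent), 2))
```

For a sum of two inputs the output variance is sd_a² + sd_b² + 2·rho·sd_a·sd_b, so ignoring a positive correlation understates the width of the output distribution.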

Determining the value of a quantitative variability and uncertainty analysis. Risk assessors, managers, and other stakeholders may establish whether a Monte Carlo simulation is necessary by considering:

  • Whether a quantitative analysis of uncertainty and variability will improve the risk assessment.
  • What the major sources of variability and uncertainty are.
  • Whether variability and uncertainty will be kept separate in the analysis.
  • Whether there are sufficient resources to complete the analysis.
  • Whether the project warrants the level of effort required.
  • Whether a quantitative estimate of uncertainty will improve decision making.
  • How the regulatory decision may be affected by the variability and uncertainty output.
  • What skills and experience are necessary to perform the analysis.
  • The strengths and weaknesses of the analysis methods considered.
  • How the variability and uncertainty analysis itself will be communicated to stakeholders and other interested parties.

What the EPA calls preliminary “screening calculations” may show that a quantitative characterization of variability and uncertainty is unnecessary. Such a calculation may show risk exposure to be clearly below the concern of decision makers. Similarly, the cost to remediate the potential risk may be sufficiently low that spending additional resources on analysis is deemed unnecessary. In contrast, such preliminary screenings may give sufficient reason to perform a quantitative characterization of variability and uncertainty. The screening may yield point estimates above decision makers’ risk appetite, there may be indications of bias in the expert estimates or data, the cost of remediation may be very high while exposure is seen as marginal, or the potential impact of risk events may be too high not to plan for.

Defining the assessment questions. Begin an exposure assessment by first defining purpose and scope of the assessment clearly. Simplicity of the assessment should be balanced with including all important exposures of risk. Sophistication of the analysis should increase only if doing so will increase value.

Selection and Development of the Conceptual and Mathematical Models. Selection criteria should be established for each assessment question. Criteria may consider the varying exposure of populations examined, significant assumptions, uncertainties, and the degree of variation in output if alternative models were used.

Selection and Evaluation of Available Data. Evaluate data quality and representativeness of the data to the population being examined.

Selecting Input Data and Distributions for Use in Monte Carlo Analysis. Preliminary analysis should be performed to determine model structure, exposure pathways, model input assumptions, and the parameters most influential to the assessment’s output, variability, and uncertainty. This information helps prevent spending resources on collecting data or performing analysis on unimportant parameters, and helps identify dependencies and correlations between models. Identifying where correlations and dependencies exist informs the analyst that whatever models they choose must be compatible with each other. Preliminary analysis methods include what-if scenarios, numerical experiments, and systematic sensitivity studies.
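A simple form of systematic sensitivity study varies one input at a time across its plausible range while holding the others at baseline, then ranks inputs by how far the output moves. The sketch below assumes a hypothetical expected-loss model and hypothetical ranges; none of the names or numbers come from the EPA guidance.

```python
def one_at_a_time_sensitivity(model, baseline, ranges):
    """Output swing for each input moved across its range, others at baseline.

    Returns a dict ordered from most to least influential input.
    """
    swings = {}
    for name, (low, high) in ranges.items():
        output_low = model(**{**baseline, name: low})
        output_high = model(**{**baseline, name: high})
        swings[name] = abs(output_high - output_low)
    return dict(sorted(swings.items(), key=lambda item: item[1], reverse=True))

def expected_loss(event_rate, loss_per_event, control_effectiveness):
    """Hypothetical model: events per year x loss per event x residual share."""
    return event_rate * loss_per_event * (1.0 - control_effectiveness)

baseline = {"event_rate": 4.0, "loss_per_event": 25_000.0, "control_effectiveness": 0.5}
ranges = {
    "event_rate": (2.0, 6.0),
    "loss_per_event": (20_000.0, 30_000.0),
    "control_effectiveness": (0.45, 0.55),
}

for name, swing in one_at_a_time_sensitivity(expected_loss, baseline, ranges).items():
    print(f"{name}: {swing:,.0f}")
```

Inputs at the bottom of the ranking are candidates for a fixed point value, leaving the effort of building a distribution for the inputs that actually drive the output.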

Correlations and dependencies must be documented along with any parameters or pathways excluded from the analysis and the reasons why they were excluded. Distribution width should be commensurate with the available knowledge and certainty. The EPA emphasizes the importance of not employing probabilistic assessment on insignificant pathways or parameters, since the process may be a significant undertaking and costly to perform. If distribution shapes change over time, the reason for the change and the rationale for the new shape should be documented. Selecting input distributions should draw on both the qualitative and quantitative information available, and both should undergo the same scrutiny. Considerations when evaluating the quality of information include:

  • Availability of a mechanistic basis for choosing the distribution family.
  • What mechanisms dictate the shape of the distribution.
  • Whether the variable is discrete or continuous.
  • Bounds of the variable.
  • Skew or symmetry of the distribution.
  • Qualitative estimate of distribution’s skew and in which direction.
  • Any other known factors affecting the shape of the distribution.
  • When data that represent the true distribution of the population being assessed are unavailable, surrogate data that justifiably resemble the expected distribution shape may be used, but the rationale must be defensible and documented.

The EPA goes on to describe examples specific to environmental factors, covering scenario analysis and the quality of information at the tails of distributions. The guidance emphasizes the importance of highlighting and differentiating the distributions and data provided by expert judgments as opposed to real sample data. The appreciable effect of each on the outcome of the analysis should also be clearly communicated; in other words, the report should state the degree to which the assessment is based on data versus expert estimates.

Evaluating Variability and Uncertainty. The EPA guidance recommends establishing formal approaches for distinguishing between and evaluating variability and uncertainty in the end report. The issues that should be considered include:

  • That variability depends on averaging time, averaging space, and other dimensions in which the data are aggregated.
  • That standard data analysis typically understates uncertainty in the form of human error while overstating variability in the form of measurement error.
  • That model error may represent a significant source of uncertainty.
  • That accuracy of variability is significantly dependent on the representativeness of the data.
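One common way to keep variability and uncertainty separate is a nested (two-dimensional) Monte Carlo simulation, in which an outer loop samples what is not known and an inner loop samples how individuals differ. The sketch below is an illustration of that general idea only and is not presented as the procedure the EPA prescribes; the distributions, parameter values, and function names are hypothetical.

```python
import random
import statistics

def two_dimensional_mc(n_outer, n_inner, seed=0):
    """95th percentile of the varying quantity, once per uncertainty draw."""
    rng = random.Random(seed)
    p95_values = []
    for _ in range(n_outer):
        # Uncertainty: the true population mean is not known exactly.
        population_mean = rng.uniform(8.0, 12.0)
        # Variability: individuals differ around whatever the true mean is.
        individuals = sorted(rng.gauss(population_mean, 2.0) for _ in range(n_inner))
        p95_values.append(individuals[int(0.95 * n_inner)])
    return p95_values

def summarize(p95_values):
    """Lower bound, median, and upper bound of the uncertain 95th percentile."""
    ordered = sorted(p95_values)
    n = len(ordered)
    return ordered[int(0.05 * n)], statistics.median(ordered), ordered[int(0.95 * n)]

low, middle, high = summarize(two_dimensional_mc(200, 500))
print(f"95th percentile: {middle:.1f} (90% uncertainty band {low:.1f} to {high:.1f})")
```

The inner loop describes variability (the 95th percentile across individuals), while the spread of that percentile across outer draws describes uncertainty about it, so the two can be reported separately.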

Numerical stability of the moments and tails of distributions is important in evaluating variability and uncertainty. Numerical stability can be seen in changes of mean, variance, or percentiles observed in the output of a Monte Carlo simulation as the number of iterations increases. Some models may require more iterations than others to stabilize or become consistent. Iterations should therefore be numerous enough to ensure stabilization has been reached.
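A stability check of this kind can be sketched by recomputing the summary statistics at increasing run lengths and comparing the last two results. The output distribution, the checkpoints, and the tolerance below are hypothetical choices for illustration; the guidance does not specify them.

```python
import random
import statistics

def running_summary(draws, checkpoints):
    """(run length, mean, 95th percentile) using only the first n draws."""
    rows = []
    for n in checkpoints:
        prefix = sorted(draws[:n])
        rows.append((n, statistics.fmean(prefix), prefix[int(0.95 * n)]))
    return rows

def is_stable(rows, rel_tol=0.01):
    """True if mean and 95th percentile changed by at most rel_tol
    between the last two checkpoints."""
    (_, mean_prev, p95_prev), (_, mean_last, p95_last) = rows[-2], rows[-1]
    return (abs(mean_last - mean_prev) <= rel_tol * abs(mean_prev)
            and abs(p95_last - p95_prev) <= rel_tol * abs(p95_prev))

rng = random.Random(3)
draws = [rng.lognormvariate(0.0, 0.5) for _ in range(40_000)]  # stand-in model output
rows = running_summary(draws, [1_000, 5_000, 10_000, 20_000, 40_000])
for n, mean, p95 in rows:
    print(f"{n:>6}  mean={mean:.4f}  p95={p95:.4f}")
print("stable within 5%:", is_stable(rows, rel_tol=0.05))
```

Tail percentiles usually settle more slowly than the mean, which is why the check looks at both.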

Areas of uncertainty should be included in the analysis either quantitatively or qualitatively. Some level of uncertainty is unavoidable. The EPA recommends either a relative ranking of sources of uncertainty, particularly when quantitative measures are unavailable, or the use of Bayesian methods, which correct for subjective estimates, so long as they distinguish between variability and uncertainty.

Presenting the Results of a Monte Carlo Analysis. Throughout the guidance there is an emphasis on clearly defining the limitations of the methods and results of Monte Carlo analysis. This includes previously described efforts like detailing the reasons why a particular input distribution was selected, any significant variability or uncertainty, and goodness-of-fit statistics on the model shape. In addition to thorough documentation, the EPA recommends visuals such as graphs and charts. Visuals that typically provide value are graphs of the probability density function (PDF) and the cumulative distribution function (CDF). The PDF graph communicates the relative probability of values, most likely values, distribution shape, and any small changes in probability density. The CDF graph communicates fractiles (like the median), probability intervals (like confidence intervals), stochastic dominance, and mixed, continuous, and discrete distributions.
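The data behind both graph types can be computed directly from simulation output before any plotting library is involved. The sketch below is a minimal illustration with hypothetical function names: a histogram scaled to unit area approximates the density, and the sorted output gives the cumulative curve from which fractiles are read.

```python
def histogram_density(values, bins):
    """(bin midpoint, density) pairs; the bar areas sum to 1."""
    low, high = min(values), max(values)
    width = (high - low) / bins
    counts = [0] * bins
    for value in values:
        index = min(int((value - low) / width), bins - 1)  # top edge joins last bin
        counts[index] += 1
    n = len(values)
    return [(low + (i + 0.5) * width, count / (n * width)) for i, count in enumerate(counts)]

def empirical_cdf(values):
    """(value, cumulative probability) pairs in ascending order."""
    ordered = sorted(values)
    n = len(ordered)
    return [(x, (i + 1) / n) for i, x in enumerate(ordered)]

def fractile(cdf_points, p):
    """Smallest value whose cumulative probability reaches p."""
    for x, cumulative in cdf_points:
        if cumulative >= p:
            return x
    return cdf_points[-1][0]

output = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]  # stand-in for simulation output
cdf_points = empirical_cdf(output)
print("median:", fractile(cdf_points, 0.5))
print("density bars:", histogram_density(output, bins=3))
```

Reading a fractile or a probability interval is a lookup on the cumulative curve, whereas the most likely values and the overall shape are easier to see in the density bars.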

The EPA describes the importance of evaluating the hypothesis that a set of sampled observations is an independent sample from the distribution chosen. Methods of evaluation include goodness-of-fit tests such as the chi-square, Kolmogorov-Smirnov, and Anderson-Darling tests and, for normality and lognormality, the Lilliefors, Shapiro-Wilk, or D’Agostino tests. Alternative methods of testing should also be employed due to problems with the effectiveness of these tests at certain sample sizes. Assessing fit by graphical comparison of experimental data and the fitted distribution with probability-probability (P-P) or quantile-quantile (Q-Q) plots are two such methods. The guidance specifies that these methods are effective at ruling out poor fits but cannot provide confirmation of perfect fit.
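The coordinates for P-P and Q-Q plots are straightforward to compute. The sketch below uses a normal distribution as the candidate fit and two synthetic samples as stand-ins for experimental data; the plotting position (i + 0.5)/n, the sample parameters, and the function names are assumptions made for the example.

```python
import random
from statistics import NormalDist

def pp_points(sample, dist):
    """(empirical probability, fitted probability) pairs; a good fit hugs y = x."""
    ordered = sorted(sample)
    n = len(ordered)
    return [((i + 0.5) / n, dist.cdf(x)) for i, x in enumerate(ordered)]

def qq_points(sample, dist):
    """(fitted quantile, observed value) pairs; a good fit hugs y = x."""
    ordered = sorted(sample)
    n = len(ordered)
    return [(dist.inv_cdf((i + 0.5) / n), x) for i, x in enumerate(ordered)]

def max_pp_deviation(sample, dist):
    """Largest vertical distance of the P-P points from the diagonal."""
    return max(abs(empirical - fitted) for empirical, fitted in pp_points(sample, dist))

rng = random.Random(7)
normal_like = [rng.gauss(50.0, 5.0) for _ in range(1_000)]
skewed = [rng.expovariate(1.0) for _ in range(1_000)]

for label, sample in (("normal-like", normal_like), ("skewed", skewed)):
    fitted = NormalDist.from_samples(sample)
    print(label, round(max_pp_deviation(sample, fitted), 3))
```

A large departure from the diagonal rules the candidate out. A small one shows only that the fit has not been ruled out, which matches the caution in the guidance.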

Scales should be identical wherever possible in the report to prevent inaccurate or deceptive communication of data. Each graph should be accompanied by a summary table of the relevant data. The guidance goes on to describe optimal formatting and style of graphs as well as limitations on how much data to communicate in a single graph.

Criticisms of Monte Carlo simulation models. Monte Carlo simulation is criticized as being too approximate. This can be remedied by increasing the number of iterations that the model simulates (Vose, 2008). Another criticism of Monte Carlo simulation targets the common use of spreadsheets to perform it. Spreadsheet software works for simple Monte Carlo simulations but can become problematic for more complex assessments (Vose, 2008). In the proceedings of the 30th Hawaii International Conference on System Sciences in 1997, T.S.H. Teo et al. presented data showing that spreadsheet errors were more common than most organizations expected (1997). In Hitting the Wall: Errors in Developing and Code Inspecting a ‘Simple’ Spreadsheet Model, R.R. Panko et al. found that software code errors occurred in spreadsheets. The errors were not immediately evident to the user of the spreadsheet software. Errors occurred most often when handling large amounts of data (Decision Support Systems 22(4), April 1998, 337-353).
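The remedy for approximation error has a known cost. For simple random sampling, the standard error of a simulated mean shrinks with the square root of the number of simulated trials. In the expression below, s is the sample standard deviation of the simulated outputs and n is the number of trials; the notation is an assumption made here and is not taken from Vose.

```latex
\mathrm{SE}(\bar{x}_n) = \frac{s}{\sqrt{n}}
\qquad\Longrightarrow\qquad
\frac{\mathrm{SE}(\bar{x}_{4n})}{\mathrm{SE}(\bar{x}_{n})} \approx \frac{1}{2}
```

Halving the approximation error therefore takes roughly four times as many trials, which is one reason large simulations strain spreadsheet-based models.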
