Research by Kahneman and Tversky suggested that humans perform poorly when providing expert estimates because of heuristics and the biases they cause. Corrective methods have been successful in fields including ecology, finance, sociology, and psychology, but there is minimal guidance or research available on reducing the impact of expert bias on information technology risk assessments.

Reason Sources Were Chosen


Researchers sometimes count the number of peer-reviewed articles available on eliciting expert knowledge in order to identify which methods deserve focus. Lopes demonstrated in The Rhetoric of Irrationality that a disproportionate number of articles assume the conclusions of Kahneman and Tversky’s 1972 article. Though their works have merit and are widely cited, this capstone does not assume their claim that “humans are poor estimators.” It instead takes Lopes’ conclusion that humans can estimate poorly under certain conditions. That framing accommodates research such as Gigerenzer et al.’s, which found that humans could be observably Bayesian if questions were presented to them in a particular way.

The U.S. Environmental Protection Agency’s Monte Carlo Analysis Guidance was included because it serves as a good example of the kind of guidance missing for other expert knowledge elicitation methods. The format of the guidance is directed at environmental work but can easily be adapted to inform other fields, such as information security.

I searched for as many different methods for eliciting expert knowledge as I could find in peer-reviewed literature. I did not find information security research that described, in its methodology, the expert knowledge elicitation method used. I therefore searched other fields, such as ecology, where expert knowledge elicitation methods were explicitly stated. Within the field of ecology alone, multiple methods were in use, along with critical studies of the methods themselves. Where a method was mentioned, I followed the citations to whoever originally developed it, e.g., the calibration work of Lichtenstein and Fischhoff. Once I understood a method from its primary source, I searched for articles citing that source which either evaluated the method’s effectiveness in a particular field, e.g., calibration of experts for sub-species population predictions, or critically evaluated the logic of the method itself. Since the intended audience of this work is information security decision makers, leaders, and analysts, I included the method they most often use: subjective single-point scoring. That they use this method is based on the survey by X and the assumption that many of these organizations comply with the major regulators and standards discussed, e.g., NIST, COBIT, FFIEC.

The list of expert knowledge elicitation methods reviewed is incomplete. Further surveys may be necessary to discover all methods being used. Additional research may be necessary to determine the effectiveness of those methods, or whether they are incapable of being verified in a reasonable period of time, e.g., the industry-standard single-point subjective scoring. Determining whether these methods could be effective in information security may require surveying real organizations that have implemented them. No such surveys could be found, but the methods discussed appear to apply to a variety of fields that rate risk similarly.

The terms and definitions used across articles, especially across fields of work, varied. To maintain readability, a common set of terms was used. The Vose, Kouns and Minoli, and Segal citations in this work are not from peer-reviewed research journals; the terms, definitions, and methods described by these textbook authors were used to establish the common language of this paper. Their terms and definitions for risk management, risk assessment, and elementary statistics were sufficiently general to cover the differing language of the peer-reviewed articles included. In cases where no single term was satisfactory, each term was listed.

Theme 1: Format of Questions


Kahneman et al. observed that humans simplify their environment in order to solve problems more quickly. The same simplification, when applied to complex situations, causes humans to provide inaccurate estimates. Lopes interpreted their research differently, finding that humans provided accurate estimates in complex situations. Gigerenzer et al. (1995) found this to be the case most often when humans were given information in a frequency format that resembles animal foraging and neural networks; in other words, using fraction statements such as “1 out of 10” instead of percentages.

Theme 2: Format of Answers


Single-point subjective scoring methods appear to be the industry standard. Hubbard et al. found that verifying the effectiveness of these methods is most often impractical or impossible. How verbal subjective scores like high, medium, and low were defined varied widely, with little consistency. Variability in definitions was observed between experts as well as within individual experts over time, even within the same 24-hour period. Dependencies between risks cannot be accommodated by this method, which causes such models to exclude risks that change the probability or impact of other risks.

Single-point subjective scores presented in the format of risk matrices and heatmaps were found to be grossly deceptive. Cox’s mathematical evaluation of example matrices, for uses ranging from highway maintenance to combatting terrorism, found multiple flaws. Accurately communicating risk in matrix form requires rigorous mathematical evaluation, high-quality non-subjective data, and the acceptance of lost precision.

Yaniv found that decision makers intuitively weighted advice from experts and improved accuracy, though not optimally. The methods proposed for optimizing the process may not be applicable since, as Yaniv explicitly states in the study’s limitations, qualitative advice and opinions were not evaluated; only quantitative factual advice, such as the dates of events, was. Answers in the form of ranges with confidence intervals follow the format used in calibration training and ensure that uncertainty is communicated along with the estimate value. Since the format is a range, the expert is obligated only to provide a range wide enough to contain the true value at the stated confidence. Ranges can also be used as the parameters of Monte Carlo simulation models created with the elicited information, allowing further processing and correction.
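As a minimal sketch of how an elicited range can parameterize a simulation, the snippet below treats a hypothetical expert’s 90% confidence range for annual loss as the 5th and 95th percentiles of a normal distribution and samples from it. The dollar figures and the choice of a normal distribution are illustrative assumptions, not drawn from the cited sources.

```python
import random

def normal_from_90ci(low, high):
    """Convert an elicited 90% confidence range into normal parameters.
    The 5th-to-95th percentile span of a normal covers ~3.29 standard deviations."""
    mean = (low + high) / 2
    stdev = (high - low) / 3.29
    return mean, stdev

# Hypothetical elicited range: annual loss between $50,000 and $400,000 (90% CI).
mu, sigma = normal_from_90ci(50_000, 400_000)

random.seed(1)
samples = sorted(random.gauss(mu, sigma) for _ in range(100_000))
p95 = samples[int(0.95 * len(samples))]  # 95th percentile of simulated loss
```

In practice a distribution bounded below at zero (e.g., lognormal) is often a better fit for losses; the normal is used here only to keep the example short.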

Gigerenzer et al.’s (1995) literature suggested not only that questions be posed in a frequency format, such as fractional statements instead of percentages, but also that subjects’ responses be provided in a frequency format.

Theme 3: Calibration of Experts


Lichtenstein et al. found that overconfidence was a common bias in most experiments, and that the degree of overconfidence increased with the relative difficulty of the task. They found that calibration training could increase a person’s ability to provide accurate estimates, and the reduction in bias was measurable and repeatable. Kynn found that most people could be calibrated given sufficient training time. A person’s calibration, or performance in providing accurate estimates, carries over to estimates on content outside the calibration training, such as the person’s field of work. Lichtenstein et al. later found that calibration could improve accuracy only to an extent, suggesting the use of corrective technologies in addition to calibration.
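Calibration is typically scored as the fraction of an expert’s stated confidence intervals that contain the true value. The sketch below illustrates that scoring with hypothetical quiz data; the ranges and answers are invented for the example.

```python
def calibration_score(intervals, truths):
    """Fraction of stated 90% confidence intervals that contain the true value.
    A well-calibrated expert scores near 0.90; overconfident experts score lower."""
    hits = sum(low <= t <= high for (low, high), t in zip(intervals, truths))
    return hits / len(truths)

# Hypothetical calibration quiz: an expert's 90% ranges vs. the true values.
ranges  = [(10, 20), (100, 300), (1, 5), (40, 60), (0, 2)]
actuals = [15, 350, 4, 55, 1]
score = calibration_score(ranges, actuals)  # 4 of 5 ranges contain the truth -> 0.8
```

A score of 0.8 on 90% intervals indicates overconfidence: the expert’s ranges are too narrow.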

Logistic regression may be used to integrate data with expert opinion, as described by Yaniv. Regression modeling may be used to determine how much an expert’s opinion needs to be adjusted based on their measurable bias. Measures of bias may be obtained with methods like calibration training or by reviewing the accuracy of an expert’s predictions over time.
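One simple way to realize this idea is to recalibrate an expert’s stated probabilities against the expert’s track record by fitting a logistic model on the log-odds of the expert’s estimates. The sketch below is a minimal stdlib illustration of that idea, not the specific models used in the cited studies; all track-record numbers are hypothetical.

```python
import math

def fit_recalibration(expert_probs, outcomes, lr=0.5, epochs=5_000):
    """Fit p_adjusted = sigmoid(a * logit(p_expert) + b) by batch gradient
    descent on log-loss. a < 1 shrinks an overconfident expert toward 50%;
    b corrects a constant directional bias."""
    xs = [math.log(p / (1 - p)) for p in expert_probs]
    a, b = 1.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_a = grad_b = 0.0
        for x, y in zip(xs, outcomes):
            p = 1 / (1 + math.exp(-(a * x + b)))
            grad_a += (p - y) * x
            grad_b += (p - y)
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b

# Hypothetical track record: the expert says 90% or 10%, but events resolve
# closer to 60% and 40%, i.e., the expert is overconfident.
probs    = [0.9] * 5 + [0.1] * 5
happened = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
a, b = fit_recalibration(probs, happened)  # a well below 1 -> shrink confidence
```

The fitted slope a quantifies how much the expert’s confidence should be discounted before the estimate is used in a model.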

Vose does not use the term calibration in this way, but he does state that subjective assessors can provide more objective estimates if trained to think about risk in a particular way.

Theme 4: Aggregation of Experts


Aggregating the opinions of experts improved the accuracy of estimates in most trials across multiple studies. Yaniv found that when humans used weighting and trimming to aggregate expert opinions, accuracy improved in most cases. Computation can aggregate expert opinions with more precision, for example by using regression methods, such as those used by Lele and Allen, to adjust for known expert error.
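The trimming strategy Yaniv describes can be sketched as a trimmed mean: extreme opinions are dropped before averaging. The estimates below are hypothetical, invented only to show the mechanism.

```python
def trimmed_mean(estimates, trim=1):
    """Aggregate expert point estimates by dropping the `trim` lowest and
    `trim` highest opinions before averaging, limiting the pull of outliers."""
    kept = sorted(estimates)[trim:len(estimates) - trim]
    return sum(kept) / len(kept)

# Hypothetical estimates of annual incident frequency from five experts.
opinions = [4, 5, 5, 6, 30]                     # one extreme outlier
plain_average = sum(opinions) / len(opinions)   # pulled up to 10.0 by the outlier
robust_average = trimmed_mean(opinions)         # 16 / 3 = 5.33..., nearer consensus
```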

The benefit of additional expert input had a point of diminishing returns. Yaniv observed that the greatest increase in estimate accuracy came from the first few opinions aggregated. Mathematical formulas can be used to measure the value of combining multiple expert opinions and to find the optimal number of experts required for the most value.

Lele et al. found elicitation of priors difficult and impractical. The effective alternative they observed was asking experts for the probabilities of events occurring instead of prior distributions over the event occurring or not occurring.


Theme 5: Integration of Data


Like Lele et al., Yang et al. experienced difficulty quantifying expert opinion in the form of prior distributions. They also found it difficult to measure the informativeness of an expert and to justify the cost of training experts to provide Bayesian estimates.

Lele and Allen observed that, using logistic regression, real data could be integrated with expert opinion. The usefulness of each expert was also effectively measured, and weighting their contributions accordingly improved estimate accuracy in most trials.

Theme 6: Simulation


Humans seem to require some amount of behavioral and technical correction when providing estimates. Data should be used wherever possible before eliciting subjective estimates. Computer processing can reduce the bias of human estimates by factoring in more information simultaneously than the unaided human mind can. Monte Carlo simulation factors in variability and uncertainty together, where methods like what-if analysis and unassisted human reasoning cannot. Certain risk assessments, even when empirical data are used, exclude these factors and may produce inaccurate or grossly imprecise results.
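The distinction between variability and uncertainty can be sketched with a two-layer simulation: an outer draw for the uncertain parameter and an inner draw for random variation, where a single what-if value would capture neither. All figures below are hypothetical and chosen only for illustration.

```python
import random

random.seed(7)

COST_PER_INCIDENT = 10_000  # illustrative figure

def simulate_annual_losses(runs=5_000):
    """Two-dimensional Monte Carlo: the true annual incident rate is known
    only as a range (uncertainty), and the realized incident count varies
    at random around that rate (variability)."""
    losses = []
    for _ in range(runs):
        rate = random.uniform(2, 8)  # uncertainty: true rate somewhere in 2-8/yr
        incidents = sum(random.random() < rate / 365 for _ in range(365))  # daily variability
        losses.append(incidents * COST_PER_INCIDENT)
    return losses

losses = simulate_annual_losses()
mean_loss = sum(losses) / len(losses)  # near $50,000 (5 incidents/yr on average)
```

The full distribution of losses, not just the mean, is what a single-point what-if analysis discards.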

Comparison of the Findings


Kuhnert et al. described how to elicit information from experts and incorporate that information into Bayesian models. Their guidance included using data wherever available, establishing whether eliciting estimates would provide sufficient value to justify the resources spent, identifying how to measure the expert’s uncertainty, presenting questions clearly, using graphical aids and discussion with the expert instead of one-way questionnaires, and assessing the impact of Bayesian priors. Kuhnert et al. do not discuss considering variability alongside uncertainty as Vose proposes.

Martin et al. presented methods for ensuring that uncertainty was adequately communicated by experts. The elicitation involved single-point scoring and the use of computers where possible. They emphasized the importance of treating subjective elicitations as snapshots of the truth and encouraged comparing elicitations to available data wherever possible. They also recommended combining multiple expert judgments mathematically with linear opinion pooling methods, using weighted averages as data and addressing strong variations with more complex mathematical methods.
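A linear opinion pool is simply a weighted average of the experts’ probability distributions. The sketch below shows the mechanism on hypothetical impact ratings; the experts, categories, and weights are invented for the example.

```python
def linear_opinion_pool(distributions, weights):
    """Combine experts' discrete probability distributions as a weighted
    average (a linear opinion pool). Weights should sum to 1."""
    pooled = {}
    for dist, weight in zip(distributions, weights):
        for outcome, p in dist.items():
            pooled[outcome] = pooled.get(outcome, 0.0) + weight * p
    return pooled

# Hypothetical: two experts rate the chance an incident is low/medium/high
# impact; expert A is weighted higher, e.g., due to a better calibration record.
expert_a = {"low": 0.6, "medium": 0.3, "high": 0.1}
expert_b = {"low": 0.2, "medium": 0.5, "high": 0.3}
pooled = linear_opinion_pool([expert_a, expert_b], [0.7, 0.3])
# pooled ≈ {"low": 0.48, "medium": 0.36, "high": 0.16}
```

Because the pool is a convex combination, the result is itself a valid probability distribution.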

Tara et al. cite both Kynn’s and McBride’s works and add research by O’Hagan et al. (2006), which confirms the calibration literature’s finding that experts systematically underestimate their uncertainty. They also acknowledged findings made 40 years after Kahneman et al.’s (1972) work that some studies found people exhibiting minimal bias even without calibration training. It may be that these studies satisfied Gigerenzer et al.’s (1995) frequency format, or that some professionals are well calibrated without training. Using the calibration exercises described by Kynn, a professional’s natural calibration can be tested.

McBride et al. experimented with a structured approach to preventing bias in expert elicitation conducted over email. They found that emailed questionnaires were effective even for contentious issues (2012). The downside of the method was that collecting all of the experts’ responses took months, as opposed to days for traditional methods.

Limitations of the Study

Peer-reviewed publications specifically discussing expert elicitation methods for information security risk assessment were not available for this capstone. Sources of research were limited to those accessible to Utica College students through the college’s research resources, and to the students’ ability to operate research tools such as catalogs, databases, and indexes.
