RECOMMENDATIONS AND CONCLUSIONS

 

Recommendations

The judgment and decision-making psychology literature has shown the fallibility of human experts when providing estimates. Information Security decision makers use expert estimates to drive security efforts, yet minimal guidance is available for eliciting knowledge from experts in information security in a way that corrects for this human error. Eliciting knowledge from experts with optimal accuracy and precision has been researched in other fields of work, and these methods appear independent of subject matter, so they may be applied to other fields, including Information Security and Cybersecurity. How might we use these methods in the office? Example applications of the expert knowledge elicitation methods discussed are outlined below. These methods can inform decision makers for long-term, high-level management decisions, but also for low-level, day-to-day decisions such as which events and alerts SOC and CIRT analysts should investigate first. Such prioritization is key to optimal cybersecurity decision-making and operations, where there is both uncertainty and limited resources.

To briefly review the findings of this capstone: questions and available information can be formatted in a way that ensures clarity and comprehension by experts. Responses elicited from experts, if requested in the form of dollars and percentages, can effectively avoid human heuristic tendencies and their resulting biases. Experts who undergo calibration training can provide estimates minimally influenced by their overconfidence or underconfidence. Combining the opinions of multiple experts can improve both the accuracy and precision of estimates. Integrating available data with expert estimates can improve accuracy and precision further. Simulation models can decrease bias while taking into account the uncertainty expressed by experts, the irreducible uncertainty of the threat environment (variability), highly complex scenarios, available data, and data as it becomes available or is learned from observations over time in an environment. Although these methods can be used in a variety of ways, a few example application scenarios are listed below.

Frequency Formats. The success of frequency formats observed by Gigerenzer et al. (1995) suggests that visual aids such as scatter plots and time-series graphs, which present data in a frequency format, should improve expert comprehension. Lopes’ (1976) studies showed that presenting questions and data in the form of dollars and percentages should also increase comprehension of the information experts are asked to process. Presenting data in the frequency format takes advantage of a human’s natural ability to estimate the likelihood and impact of events, further increasing accuracy and precision. It may be beneficial to present information in a frequency format whenever possible and practical in order to help reduce bias, along with the other measures discussed in this capstone.

Cybersecurity Decision Making. Converting available data to frequency formats may help improve the accuracy of estimates provided by experts. This can be done by replacing percentages with frequencies; that is, 1% becomes 10 out of every 1,000. Visuals that present probability in a frequency format, such as a pie chart showing the fraction equivalent of the percentage, may also help experts understand the relative probability of the data they are viewing.
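As a minimal sketch of such a conversion (the function name and the fixed reference population are assumptions for illustration, not part of any cited method), a probability can be translated into a frequency statement before being shown to experts:

```python
def to_frequency_format(probability: float, population: int = 1000) -> str:
    """Express a probability as a 'k out of every N' frequency statement.

    probability: a value between 0 and 1 (e.g., 0.01 for 1%).
    population: the reference group size used in the statement.
    """
    count = round(probability * population)
    return f"{count} out of every {population}"

# 1% becomes "10 out of every 1000", matching the example above.
print(to_frequency_format(0.01))           # 10 out of every 1000
print(to_frequency_format(0.003, 10_000))  # 30 out of every 10000
```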

Cybersecurity Operations. The presentation of events with probability ratings may be best performed with a frequency format rather than a standard percentage. How SOC analysts can use probability values is explained in the Monte Carlo section below. Visuals that present probability in a frequency format, such as a pie chart showing the proportional equivalent of the percentage, may also help analysts understand the relative probability of the events they are viewing. Such a visual could be included in a SOC dashboard.

Interval Estimates. Expert responses in the form of interval estimates avoid the many observed flaws of single-point scoring. Experts asked for the probability of a particular event can provide a range instead of a single percentage, e.g., less than 50% or greater than 80%. Impact values could be given in ranges such as $1-2 million.

By providing ranges, experts are not pressured to give unrealistically precise responses, which may create bias. Ranges communicate the estimated value but also the expert’s uncertainty, a factor wholly absent from the subjective scoring methodologies that make up most industry-standard methods. Monte Carlo simulation models use high and low bounds, just like the interval estimates experts would provide. Finally, ranges allow decision makers to gauge the value of their experts by keeping a record of each expert’s precision, observing how narrow or wide the ranges they provide are. Since Gigerenzer et al.’s (1995) research showed the benefits of frequency formats, experts may provide their estimates in the frequency format instead of standard percentage form.
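One simple way to keep such a record is sketched below; the precision metric (average relative width of the intervals) and the sample ranges are illustrative assumptions, not values from this capstone:

```python
# Hypothetical tracking of expert precision: average relative width of provided ranges.
# Narrower ranges (relative to their midpoints) indicate a more precise expert,
# though not necessarily a better calibrated one. All values are illustrative.

expert_ranges = {
    "Expert A": [(1_000_000, 2_000_000), (50_000, 150_000)],
    "Expert B": [(500_000, 4_000_000), (10_000, 400_000)],
}

def mean_relative_width(ranges):
    """Average of (high - low) / midpoint across an expert's recorded intervals."""
    widths = [(high - low) / ((high + low) / 2) for low, high in ranges]
    return sum(widths) / len(widths)

for expert, ranges in expert_ranges.items():
    print(f"{expert}: mean relative range width = {mean_relative_width(ranges):.2f}")
```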

Cybersecurity Decision Making. Risk management analysts or managers looking to their experts for advice may present those experts with a written or verbal questionnaire. The manager may be asking internal or contracted experts about a recent cyber threat that has appeared in a threat intelligence feed; the risk analyst may be performing a routine assessment of the risk posed to company equipment. In either case, the experts could be asked to provide their answers in the form of ranges. The estimated probability of an event can be given as a percentage range, and the estimated impact as a low and a high dollar amount. The expert could also be asked for frequency distribution information or sources of data. An example risk event could be a DDoS attack on the company website. The conversation that follows should involve specifying exactly what the questioner is looking for. After the manager or analyst and the expert(s) put all assumptions on the table and define the terms and intended use of the estimate requested, they may all agree that what they are wondering is:

  • How much money do we lose per hour that the site is down?
  • How much money would we lose if the fact that a DDoS attack against us succeeded became public?
  • How much money would we lose if the DDoS attack weakened the web server and allowed data to be exfiltrated from it?
  • How much money would we lose if the DDoS attack made our website vulnerable and allowed attackers to take control of it, post content, and cause reputational damage?

The manager or analyst is likely asking these questions because they want to know how much risk they are taking. They may also be wondering whether the cost of protective controls would yield a net benefit, and if so, which controls would be optimal without costing more than the negative event itself. The expert may provide probability and impact estimates, in the form of ranges, as follows.

[Table: example probability and impact range estimates provided by the expert]

“But we cannot think of everything! So what’s the point?”

The number of potential scenarios that may result from any event is rarely finite. With or without risk assessment, and regardless of assessment style, there will always be possible outcomes not considered or assessed. By performing these assessments regularly, and by pulling key risks from assessments performed by other organizations, a risk assessment model grows and, if done correctly, reduces uncertainty. The alternative is creating deceptive models that give the illusion of visibility (see subjective single-point scoring and heat maps) or not measuring risk at all. By not measuring risk at all, you lose visibility into whether or not your decisions were good ones. You would not know that there were near misses during the year, or that your successes were due more to chance than to skill or effort.

Cybersecurity Operations. Cybersecurity analysts, such as those who work in a CIRT or SOC, can provide the conclusions of their investigations in the form of ranges. The exception would be any work that requires analysts to suspend their judgment and simply record observations, as may be the case with a forensic analyst or a low-level SOC analyst; in that case their observations would be passed on to someone else to draw conclusions from. The structure of this varies depending on how an organization handles cybersecurity, but using ranges instead of single-point estimates allows the organization to benefit from the literature on estimation. Since resources are finite and network events are often overwhelming, prioritizing those events for CIRT analysts by their probability and impact can help keep higher-risk events at the top of the queue for investigation.

“But what if the alerts are wrong? Then we ignore a crucial event!”

Consider how this approach compares to the method an organization is using currently. If you discover that the alerting mechanism’s risk ratings are not appropriate, improve the mechanism. The alternative would be a human rating the probability and impact of every event, based on, say, metadata, before performing a full investigation. The only difference is that the automated process would be less error prone, especially over time and given the numerous events assessed. Cursory prioritization like this is best performed by a computer, and the approach is meant to be updated on a regular basis in response to environmental changes. The Cybersecurity Decision Making level risk assessment can inform this automated mechanism, leaving prioritization of risks up to management. The organization will still get through all of the risks in the queue; it is just increasing the likelihood that the events with the highest risk are investigated first.

The ranges used by the manual SOC analyst or automated system may look something like this:

[Table: hypothetical probability ranges for two example events, adjusted for observed factors]

The table above shows two different events and how the SOC analyst or automated system may adjust probability based on certain factors. The impact would, of course, remain the same throughout. The impact range would be generated by decision makers a priori, in the same way shown in the previous section of this capstone.
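As a sketch of how such prioritization might be automated, the snippet below ranks queued events by a crude expected-loss score built from the midpoints of their probability and impact ranges; the event names, ranges, and scoring rule are hypothetical assumptions rather than values from the table above:

```python
# Hypothetical queue prioritization: rank events by midpoint probability x midpoint impact.
# Event names and ranges are illustrative only.

events = [
    {"name": "Possible DDoS indicators on web server", "prob": (0.10, 0.30), "impact": (1_000_000, 2_000_000)},
    {"name": "Suspicious login from new country",      "prob": (0.02, 0.10), "impact": (250_000, 750_000)},
    {"name": "Malware alert on lab workstation",       "prob": (0.40, 0.60), "impact": (10_000, 50_000)},
]

def risk_score(event):
    """Midpoint probability times midpoint impact: a rough expected-loss proxy."""
    p_mid = sum(event["prob"]) / 2
    i_mid = sum(event["impact"]) / 2
    return p_mid * i_mid

# Highest-scoring events land at the top of the investigation queue.
for event in sorted(events, key=risk_score, reverse=True):
    print(f"{risk_score(event):>12,.0f}  {event['name']}")
```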

Calibration. Calibrating experts allows decision makers to quantify the bias of their experts and monitor their progress in reducing it through calibration training. The ranges provided by calibrated experts are more likely to contain the correct probability and impact values, increasing estimate accuracy.

Trusting the opinion of experts can be risky for decision makers, especially when those experts are from outside of the organization, such as would be the case with consultants.

Calibration training provides a kind of grading system that shows an expert’s performance in providing estimates of all kinds, regardless of the content. This ensures that decision makers and mathematical models weight the expert’s advice correctly and in a repeatable way.

Multiple experts can be used to increase accuracy in a repeatable and verifiable fashion. Using regression models, the calibration grades of experts can be used to weight their advice in proportion to their observed bias. Additionally, regression models can be used to determine when additional expert opinions will not provide value commensurate with the cost of hiring them or taking them away from their existing cycles.

To take advantage of a person’s natural Bayesian tendencies, calibration questions and responses could take on the frequency format discussed previously. For calculating the expert’s performance via standard grading (percent correct vs. incorrect), frequency formats could be converted back to standard percentages.

Cybersecurity Decision Making. Experts who will be providing estimates for risk assessment could undergo calibration training. Questions that resemble the kinds the expert will be answering, but for which the answers are known, could be drafted by the person interviewing the expert. The questions used for calibration purposes should request numerical values, such as the probability of certain events based on environmental factors, or impact values based on environmental factors. The answers requested from the expert should be in the form of ranges. As the literature shows, most experts are not well calibrated; for this reason, methods for improving their estimation abilities should be employed as the training progresses. If the training is effective, the percentage of ranges provided by the expert that contain the correct answer should increase. Regardless of the content they are being asked about, the expert’s ability to assess their own uncertainty should be reflected in how they adjust, typically by widening, the ranges of their responses. The number of questions asked may depend on the expert’s progress in becoming calibrated. More questions, and more feedback on how to adjust their responses, may be necessary for even the most brilliant of experts. Calibration questions may be outside the expert’s domain, such as asking the person, “In what year was Benjamin Franklin born?”, to which they may respond with a range like 1650-1750.
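A minimal sketch of how calibration performance could be scored follows; the questions, intervals, and the 90% target are assumptions for illustration only:

```python
# Hypothetical calibration scoring: what fraction of an expert's 90% confidence ranges
# actually contain the true answer? A well-calibrated expert should land near 90%.

responses = [
    # (low, high, true_value) for illustrative calibration questions
    (1650, 1750, 1706),     # year Benjamin Franklin was born
    (5_000, 9_000, 6_650),  # approximate length of the Nile in kilometers
    (1900, 1950, 1969),     # an interval that misses its true value (first Moon landing year)
]

hits = sum(low <= truth <= high for low, high, truth in responses)
hit_rate = hits / len(responses)

print(f"Ranges containing the true value: {hits}/{len(responses)} ({hit_rate:.0%})")
print("Likely overconfident" if hit_rate < 0.9 else "Roughly calibrated or underconfident")
```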

Cybersecurity Operations. Calibration training for analysts would be no different from that just described. It may help to include questions for which the analysts can provide narrow ranges, to see how they adjust their answers for topics they are more confident about.

Aggregating Opinions. Aggregation of multiple expert opinions is yet another way to increase estimate accuracy. Readily available expertise can be used optimally, and unnecessary spending can be avoided, with regression modeling. The same modeling can be used to integrate real data with the opinions of experts, further increasing estimate accuracy and precision.

Simulation with methods like Monte Carlo allows decision makers to take advantage of all of the different information made available to them and combine it into single values that are more understandable and on which further mathematical analysis can be performed. For scenarios that have too many factors for an expert to consider simultaneously, or for which there are many interacting and dependent variables, simulation can tie together all of the separate probabilities and impacts that the expert can provide.

Cybersecurity Decision Making. Decision makers may choose to ask multiple experts for their probability and impact estimates of key risk events occurring. The literature suggests that doing so may increase the accuracy of the final estimate, so long as each estimate is considered. If there is strong disparity between the experts’ estimates, the decision maker may benefit further by engaging the experts in conversation to see if a consensus can be achieved, or at the very least to hear their arguments for their ranges. The literature also suggests that the value of involving more experts diminishes quickly after only a few have provided their estimates. For this reason, decision makers can feel confident in their estimates without having to spend additional resources on more experts. More technical methods like regression can also be employed to aggregate the estimates of experts if the expertise is available.
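The sketch below shows one simple aggregation rule, a calibration-weighted average of the experts' interval bounds; the weights, intervals, and the choice of a weighted average (rather than a formal regression model) are assumptions for illustration:

```python
# Hypothetical aggregation of interval estimates from multiple experts,
# weighted by a calibration score such as a historical hit rate. Values are illustrative.

experts = [
    {"name": "Expert A", "interval": (0.10, 0.30), "calibration": 0.85},
    {"name": "Expert B", "interval": (0.20, 0.50), "calibration": 0.60},
    {"name": "Expert C", "interval": (0.05, 0.25), "calibration": 0.90},
]

total_weight = sum(e["calibration"] for e in experts)
agg_low = sum(e["interval"][0] * e["calibration"] for e in experts) / total_weight
agg_high = sum(e["interval"][1] * e["calibration"] for e in experts) / total_weight

print(f"Aggregated probability range: {agg_low:.0%} to {agg_high:.0%}")
```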

Cybersecurity Operations. Analysts may investigate anomalous activity that appears in their queue and conclude that the activity was normal or should be escalated to CIRT or another department. Aggregating the estimates of multiple analysts could be performed by having some level of redundancy in tickets or by having analysts review each other’s work before making escalations. If there is a disparity between their conclusions and a consensus cannot be reached, that event may be suitable for CIRT analyst investigation simply because there is much uncertainty.

Monte Carlo Simulation. With the above methods, experts provide probability (as a percentage), impact (as dollars), and certainty (as range width). Monte Carlo simulation integrates each of these and also allows the frequency distribution to be taken into account. The expert can not only provide a range but also indicate whether they believe the high or low end is more likely. Alternatively, sample data may be sufficient to identify the best-fitting distribution. By using sample data in this way, Monte Carlo simulation can also assist in integrating data with expert opinion.
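Where sample data exists, the best-fitting distribution can be estimated with standard tools. The sketch below assumes SciPy is available and uses made-up loss figures; the choice of a lognormal family and the numbers themselves are illustrative assumptions:

```python
import numpy as np
from scipy import stats

# Hypothetical historical per-incident losses (dollars) for a closely related event type.
observed_losses = np.array([120_000, 340_000, 95_000, 510_000, 230_000, 180_000, 760_000])

# Fit a candidate distribution to the sample data (lognormal is a common choice for losses).
shape, loc, scale = stats.lognorm.fit(observed_losses, floc=0)
fitted = stats.lognorm(shape, loc=loc, scale=scale)

# The fitted distribution can then replace an expert-supplied range in a simulation.
simulated_impacts = fitted.rvs(size=10_000, random_state=1)
print(f"Median simulated impact: ${np.median(simulated_impacts):,.0f}")
```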

Cybersecurity Decision Making. The information provided by experts using the above methods can be used in simulation models built with Monte Carlo techniques. Probability and impact ranges can supply the minimum and maximum values for the simulated probabilities and impacts in your model. If the expert has additional information about the probability or impact of an event, such as which end of their range is more likely, that too can inform the simulation model via a distribution selection. If data is available on an event that closely resembles the one you are attempting to assess, distribution-fitting methods may be used, and the best-fit distribution can be used instead of one provided by the expert, so long as the data set had largely the same factors influencing the distribution. Estimates could be provided in the frequency format proposed by Gigerenzer et al. (1995) but will have to be mapped to their standard-format equivalents for the calculations to occur; the results could then be converted back to frequency format to ensure optimal comprehension. An example of how the previously elicited estimates can be fit into a Monte Carlo simulation is shown below.

[Table: expert probability and impact estimates for the “DDoS attack succeeds” event]

Estimates by experts for the “DDoS attack succeeds” event have been isolated into the above table for this example.

The following table shows a Monte Carlo simulation that uses these values.

[Table: Monte Carlo simulation of the event using the elicited estimates, with no controls applied]

Each white row shows a permutation, that is, a simulated year in which the event occurs or not. In this case the expert and/or available data indicated a normal distribution, which in Excel can be represented with the NORMINV function. The second column shows binary outputs, 1 or 0, to represent the event occurring or not, which can be generated in Excel with IF(RAND()<0.75,1,0) if the probability of occurrence is 75%. This is just a snapshot of the many factors that may be involved in a simulation. Any number of factors could be represented with rows and columns in the same way, using discrete mathematical functions to connect them.

In this case, analysts were interested in generating example costs of the risk event occurring based on the probabilities and impacts provided. This would likely be used as part of a larger simulation with more factors. The far-right column, “Successful attack with outage,” contains values that may be treated as sample data. You could, for example, highlight all numerical values in that column and generate averages or a graph. You could also generate the minimum and maximum values to double-check your work by comparing them to the minimum and maximum values provided by the expert. This column could also be modified by including events that affect the probability or impact of the risk event occurring. In the end, all factors considered, you could calculate probabilities based on the sample data. This example is just meant to show how estimates would be included in a Monte Carlo simulation.
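For readers who prefer code to spreadsheets, the same logic can be sketched in a few lines of Python; the 75% probability of occurrence, the normal impact distribution, and its parameters are assumptions carried over from the Excel description above, not elicited values:

```python
import random
import statistics

# Hypothetical parameters mirroring the Excel example: a 75% annual probability
# of occurrence and a normally distributed impact.
PROB_OCCURS = 0.75
IMPACT_MEAN = 1_500_000   # assumed midpoint of a $1-2M expert range
IMPACT_SD = 250_000       # assumed spread
TRIALS = 10_000

losses = []
for _ in range(TRIALS):
    occurs = 1 if random.random() < PROB_OCCURS else 0     # equivalent of IF(RAND()<0.75,1,0)
    impact = max(0, random.gauss(IMPACT_MEAN, IMPACT_SD))  # equivalent of NORMINV(RAND(), mean, sd)
    losses.append(occurs * impact)

print(f"Mean simulated annual loss:    ${statistics.mean(losses):,.0f}")
print(f"Maximum simulated annual loss: ${max(losses):,.0f}")
print(f"Simulated years with no loss:  {losses.count(0) / TRIALS:.0%}")
```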

Cybersecurity Operations. Decision makers may identify which scenarios are the highest risk using all of the methods presented in this recommendations section. One of the controls that may reduce the probability or impact of such an event could be monitoring security logs for indicators. For example, for the risk event of a DDoS attack, one of the recommended controls may have been to configure logging and alerting to notify the SOC of any indications that a DDoS attack is beginning to occur. The experts may have provided a list of indicators to include in logging and alerting as a result of this assessment. One such indicator may be an uncharacteristically fast increase in users loading resources from the company’s website, such as the home page. The experts may include a few real-world caveats: sudden popularity of the website may be due to a successful marketing campaign, to some fluke event like a popular YouTube video having the company logo in the background causing users to research the organization out of curiosity, or to company computers having the company website as their homepage and a connection issue causing everyone to reload their browsers at the same time. For this reason, it may be valuable to take more factors into consideration before creating such an alert or bringing it to the top of the analyst’s events-to-investigate queue. In the same way that ranges were provided for a Monte Carlo simulation, ranges can be provided to weigh the risk of particular events occurring in real time. When asked what factors they would look for to confirm a DDoS attack, an expert may recommend monitoring the web server’s connection logs for a variety of conditions, such as any single IP address initiating an above-average number of connections to the web server, or connections made by any user agent that is not that of a common web browser.

In either case, there is always the possibility that such traffic is normal; otherwise, automated controls could simply prevent the connection attempts. What you instead want to know is the probability that such a traffic pattern is malicious. The impact of such an event has already been established by the risk assessment performed by decision makers in the previous section. As with the risk assessments performed by decision makers at a higher level, both probability and impact factors could be included in the model.
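A rough sketch of how those expert-supplied conditions could adjust an alert's probability range in real time is shown below; the base range, the indicator names, and the multiplicative weighting scheme are all hypothetical assumptions rather than an established scoring method:

```python
# Hypothetical real-time adjustment of a probability range based on which
# DDoS indicators are observed. Base range and multipliers are illustrative only.

BASE_RANGE = (0.05, 0.15)   # assumed baseline range that unusual web traffic is malicious

# Multipliers the experts might assign to each confirming indicator (hypothetical).
INDICATOR_WEIGHTS = {
    "single_ip_many_connections": 3.0,
    "non_browser_user_agent": 2.0,
    "sudden_homepage_spike": 1.5,
}

def adjusted_range(observed_indicators):
    """Scale the base range by each observed indicator's weight, capped at 100%."""
    low, high = BASE_RANGE
    for name in observed_indicators:
        factor = INDICATOR_WEIGHTS.get(name, 1.0)
        low, high = min(low * factor, 1.0), min(high * factor, 1.0)
    return low, high

low, high = adjusted_range(["single_ip_many_connections", "non_browser_user_agent"])
print(f"Adjusted probability that the traffic is malicious: {low:.0%} to {high:.0%}")
```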

The same methods can be applied to measure the probability and impact of other threat scenarios that analysts are tasked with investigating. With probability and impact values established, analysts can sort the various events in their queue by both the probability and the impact of the risk event each alert is meant to help identify. By doing this, analysts investigate the events that are not only the most probable but also the most impactful first.

Excel was used in these examples because both business-level executives and technical-level analysts are familiar with its capabilities and format. Since the methods discussed consist of discrete mathematics, the same operations can be recreated in other software or languages. In the criticisms of Monte Carlo section of the Literature Review, some known drawbacks of spreadsheet software were presented. Research into other software, especially software built specifically for Monte Carlo simulation, may be valuable, particularly for exceptionally large data sets and simulations requiring exceptionally large numbers of permutations.

UPDATE: Once the concept of Monte Carlo simulation is understood, equivalent (and significantly more efficient) Monte Carlo simulations can be built with Python and the SciPy libraries. My preferred method involves MEAN.io and Python. Estimates and data are entered into MongoDB through a clean Angular web interface, and a Python script pulls the estimates and data from MongoDB, performs some basic arithmetic, and then loops through the simulations. I will be documenting this approach shortly.
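Leaving aside the MongoDB and Angular front end described above, the core of such a simulation might look like the vectorized sketch below; the distributions and parameters are illustrative assumptions and not the documented approach:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=42)
TRIALS = 100_000

# Hypothetical inputs that would otherwise be pulled from a database of elicited estimates.
prob_occurs = 0.75
impact_dist = stats.lognorm(s=0.4, scale=1_500_000)   # assumed impact distribution

# Vectorized Monte Carlo: array operations replace the explicit per-year loop.
occurs = rng.random(TRIALS) < prob_occurs
impacts = impact_dist.rvs(size=TRIALS, random_state=123)
annual_losses = np.where(occurs, impacts, 0.0)

print(f"Mean annual loss:            ${annual_losses.mean():,.0f}")
print(f"95th percentile annual loss: ${np.percentile(annual_losses, 95):,.0f}")
```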

Each of the methods appears to apply readily both to high-level cybersecurity risk assessment and decision-making processes and to low-level technical SOC and CIRT operations. The methods discussed have the added benefit of creating a link between cybersecurity risk, decision-making, and daily operations that is transparent and can be rapidly and easily updated.

Conclusions

Research by Daniel Kahneman, Amos Tversky, and other judgment and decision-making (JDM) psychologists found that humans are poor estimators of uncertainty. This research also found this to be true regardless of expertise or experience, and studies since then have confirmed these findings (Onkal et al., 2003; Kahneman et al., 1972; Soll & Klayman, 2004; Speirs-Bridge et al., 2010). Researchers found that experience and level of training only weakly relate to performance (Camerer & Johnson, 1991; Burgman et al., 2011), and reliance on experts for decision making in the presence of uncertainty is common in a number of fields (Ashton, 1974; Christensen-Szalanski et al., 1982; Jorgensen et al., 2004; McBride et al., 2012; Murphy & Winkler, 1984; Onkal et al., 2003; Oskamp, 1965). Decision makers have increasingly relied on the input of experts in the field of Information Security (Kouns & Minoli, 2010), but research found that managers do not know whether their successes and failures are a result of their subject matter experts’ guidance (Hilborn & Ludwig, 1993; Sutherland, 2006; Roura-Pascual et al., 2009). The observations made in these studies suggest that the use of experts in risk assessment may provide false metrics of risk, and that determining whether that is the case is not a straightforward endeavor.

Standards organizations provide guidance on information security risk management, which involves assessment, but none provides guidance on how to address human bias, or even explains why they believe their methods work, such as by presenting supporting research. With no research verifying the effectiveness of the methods applied to Information Security, it is not surprising that standards organizations, decision makers, and analysts continue to use intuitive, non-verifiable methods.

Methods of reducing bias and enabling management to measure the accuracy and precision of their experts have been developed and tested in multiple disciplines (McBride et al., 2012; Bolger & Onkal-Atay, 2004; Lichtenstein et al., 1982; Clemen & Winkler, 1999), but not in information security. The purpose of this study was to identify methods for extracting knowledge from experts with minimal bias in this setting. Drawing on previous literature, I reviewed such methods, grouping them into the categories of formatting of questions and answers, calibration of experts, aggregation of expert opinions, integration of data with those opinions, and simulation modeling using all available information, and I argue these categories of methods should be evaluated for application in the field of information security.

Additional research into methods for reducing expert bias could also be performed. This capstone is limited to only a small fraction of the methods that have been evaluated in other fields and that may be applicable to information security. Methods requiring advanced mathematics were not well represented in this capstone because the author lacks the education and training to evaluate them. Further research could evaluate the effectiveness of methods like interval estimates, calibration, multiple-expert integration, integration of data and opinion with regression models, frequency formats, and Monte Carlo simulation in the information security data collection and decision-making lifecycle. The evaluative methods from the literature could be repeated with their content replaced by content found in information security.

Additionally, much of the available research assumes that the conclusions of Kahneman et al. were correct when they may not have been. People have been observed correctly estimating value and risk in gambling situations (Anderson & Shanteau, 1970; Shanteau, 1974; Tversky, 1967) and also in assessing the likelihood of fairly complex joint events occurring (Beach & Peterson, 1966; Lopes, 1976; Shuford, 1959). Because of findings like these, Lopes recommended more scrutiny of the works of Kahneman et al.; her reading of their research was that it claimed people use heuristics instead of probability theory most of the time when making decisions. Kynn found a disproportionate bias toward the citation of Kahneman et al.’s research throughout the literature: studies showing poor performance by human estimators were cited 6:1 compared to research showing good performance (Kynn, 2008). Researchers who took this into consideration, like Gigerenzer et al. (1995), found humans to be more Bayesian thinkers and intuitive statisticians than the works of Kahneman et al. suggested, so long as information is communicated to them in a frequency format. The prevalence of research showing poor performance by humans may adversely influence which methods are chosen for evaluation and what conclusions are drawn from them. Evaluating methods from both schools of thought, as was attempted in this capstone, may provide a more comprehensive review of what literature is available and what areas are deficient.

Appendix

References