
CHAPTER 3
Evaluation Methods
Learning Objectives
1. Recognize differences between evaluation methods and how they support the human factors design cycle
2. Design formative and summative human factors studies
3. Understand representative sampling and the implications for study design and generalization
4. Design an experiment considering variables that are measured, manipulated, controlled, and cannot be controlled
5. Interpret results and recognize the limitations of a study
6. Identify the ethical issues associated with collecting data with human subjects

PURPOSE OF EVALUATION
TIMING AND TYPES OF EVALUATION
LITERATURE REVIEW, HEURISTIC EVALUATION, AND COGNITIVE WALKTHROUGHS
USABILITY TESTING
COMPREHENSIVE EVALUATION AND CONTROLLED EXPERIMENTS
IN-SERVICE EVALUATION
STEPS IN CONDUCTING A STUDY
STUDY DESIGN
ONE FACTOR WITH TWO LEVELS
ONE FACTOR WITH MORE THAN TWO LEVELS
MULTIPLE FACTORS
BETWEEN-SUBJECTS DESIGN
WITHIN-SUBJECT DESIGNS
MIXED DESIGNS
SAMPLING PEOPLE, TASKS, AND SITUATIONS
MEASUREMENT
DATA ANALYSIS
ANALYSIS OF CONTROLLED EXPERIMENTS
ANALYSIS OF DESCRIPTIVE STUDIES
DRAWING CONCLUSIONS AND COMMUNICATING RESULTS
STATISTICAL SIGNIFICANCE AND TYPE I AND TYPE II ERRORS
STATISTICAL AND PRACTICAL SIGNIFICANCE
GENERALIZING AND PREDICTING
DRIVER DISTRACTION: EXAMPLE OF A SIMPLE FACTORIAL DESIGN
ETHICAL ISSUES
CONCLUSION
ADDITIONAL RESOURCES
QUESTIONS

A government official was involved in a car crash when another driver ran a stop sign while texting on a mobile phone. The crash led the official to introduce legislation that banned all mobile phone use while driving. However, the public challenged whether one person’s experience could justify a ban on all mobile phone use while driving. A consulting firm was hired to provide evidence regarding whether or not the use of mobile devices compromises driver safety. At the firm, Erika and her team must develop a plan to gather evidence to guide the design of effective legislation regarding whether or not mobile devices should be banned.
Where and how should that evidence be obtained? Erika might consult crash statistics and police reports, which could reveal that mobile phone use appears in relatively few crash reports even though the prevalence of mobile phone use (talking, texting, calling) while driving seems high when measured with a self-reported survey. But how reliable and accurate is this evidence? Not every crash report may have a place for the officer to note whether a mobile phone was or was not in use, and drivers filling out the survey may not have been entirely truthful about how often they use their phone while driving. Erika’s firm might also perform its own research in a costly driving simulator study, comparing the driving performance of people while a cellular phone was and was not in use. But do the conditions in the simulator match those on the highway? On the highway, people choose when they want to talk on the phone. In the simulator, people are asked to talk at specific times. Erika might also review previously conducted research, such as basic laboratory studies that are designed to provide insights on driving. For example, a laboratory study may characterize interference between talking while sitting in front of a computer and performing a “tracking task”, as a way to represent steering a car, and performing a “choice reaction task”, as a way to represent responding to red lights
(Strayer & Johnston, 2001). But are these tracking and choice reaction tasks really like driving?
These approaches to evaluation represent a sample of methods that human factors engineers can employ to discover “the truth” (or something close to it) about the behavior of people interacting with systems. Human factors engineers use standard methods that have been developed over the years in traditional physical and social sciences. These methods range from the
“true scientific experiment” conducted in highly controlled laboratory environments to less controlled but more representative descriptive studies in the world. These methods are relevant to both the consulting firm trying to assemble evidence regarding a ban on mobile devices and to designers evaluating whether a system will meet the needs of its intended users. In Chapter 2 we saw that the human factors specialist performs a great deal of informal evaluation during the system design phases. This chapter describes more formal evaluations to assess the match of the system to human capabilities.
Given this diversity of methods, a human factors specialist must be familiar with the range of research methods that are available and know which methods are best for specific types of design questions. It is equally important for researchers to understand how practitioners ultimately use their findings. Ideally, this enables a researcher to direct his or her work in ways that are more likely to be useful to design, thus making the results applicable. Selecting an evaluation method that will provide useful information requires that the method be matched to its intended purpose.

PURPOSE OF EVALUATION
In Chapter 2 we saw how human factors design occurs in the cycle of understanding, creating, and evaluating. Chapter 2 focused on understanding people’s needs and characteristics and using that understanding to create prototypes that are refined into the final system through iteration. Central to this iterative process is evaluation. Evaluation identifies opportunities to improve a design so that it serves the needs of people more effectively. In the understand–create–evaluate cycle, the evaluate step is both the final step in assessing a design and the first step of the next iteration of the design, which provides a deeper understanding of what people need and want. Evaluation methods serve three main purposes in the understand–create–evaluate cycle:
Diagnose: how can it be improved? why did it fail? why isn’t it good enough?
Verify and Predict: is it good enough? which is better? how good is it?
Understand: does it address the real needs of people? is it used as expected?
Each of these questions might be asked in terms of safety, performance, and satisfaction. For
Erika’s analysis, predicting the effect of mobile phones on driving safety is most important: how dangerous is talking on a phone and driving?
Table 1 shows example evaluation techniques for each purpose. The first row of this table shows methods associated with diagnosing problems with qualitative data. Qualitative data are not numerical and include responses to open-ended questions, such as “what features on the device would you like to see?” or “what were the main problems in operating the device?”
Qualitative data also include observations of behavior and interpretation of interviews. They are particularly useful for diagnosing problems and identifying opportunities for improvement.
These opportunities for improvement make qualitative data particularly important in the iterative design process, where the results of a usability test might guide the next iteration of the design. The second row of the table shows methods associated with verifying and predicting the performance of the system with quantitative data. Quantitative data include any data that can be represented numerically, such as measures of response time, frequency of use, and subjective ratings of workload. The table shows that quantitative data are essential for assessing whether a system has met its objectives and if it is ready to be deployed.
Quantitative data offer a prediction of whether a system will succeed. In evaluating whether mobile phones should be banned, quantitative data might include a prediction of the number of lives saved if a ban were to be adopted.
The last two rows show how both qualitative and quantitative data can support understanding people’s needs and characteristics relative to the design. Chapter 2 was primarily concerned with understanding people’s needs and using that understanding to guide design, which is highlighted in the last two rows of Table 1. Although methods for understanding and methods for evaluation are presented in separate chapters there is substantial overlap between them.
In this chapter, we focus on diagnosing design problems and verifying its performance, but these evaluations often require data that can inform understanding and can guide the next generations of the design.
Table 1. Methods for evaluation.

Purpose               Data used      Example
Diagnose              Qualitative    Usability test
Verify and Predict    Quantitative   Field test
Understand            Qualitative    Open-ended survey items
Understand            Quantitative   Task analysis

TIMING AND TYPES OF EVALUATION
In Erika’s evaluation of the effect of mobile phones on driving safety, a critical consideration is time. If she had two years to find an answer, she might conduct a comprehensive field test, but if she has to provide an answer in weeks, then collecting field data might be difficult. More generally, the time available to provide an answer and the point in the design process are critical considerations in selecting an evaluation method.
Methods used early in the design process must diagnose problems and guide iterative design, and they must do so in a very rapid manner. Methods used later in the design process, just before the product is released, often take more time and must be more thorough. As discussed in Chapter 2, there are many system design processes, and the emphasis (safety, performance, satisfaction) can greatly affect what type of evaluation method and data collection tools the human factors engineer can use. The inner cycles of Figure 1 require very rapid evaluation methods that diagnose problems in a matter of days. Similarly, some design processes such as the scrum approach require responses in days or weeks, but the Vee process might require a precise answer that is only possible with evaluation studies taking months. A general challenge alluded to in Chapter 2 is matching the rapid response required in a scrum design context with the time needed to conduct a user study, particularly when such user studies are critical, as in high-risk systems. More generally, it is critical that human factors practitioners work to identify the evaluation approach that fits the timeline and needs of the design process.

Figure 1. Time limits and position in the design process.

Literature review, heuristic evaluation, and cognitive walkthrough
Literature reviews can serve as a useful starting point for evaluation. A literature review involves reading papers from previously completed studies that describe how people behave in similar situations. A good literature search can often substitute for a study itself if other researchers have already answered the question. In Erika’s case, hundreds of studies have addressed various aspects of driver distraction. One particular form of literature review, known as a meta-analysis, integrates the statistical findings of many experiments that have examined a common independent variable in order to draw a very reliable conclusion regarding the effect of that variable (Rosenthal & Reynard, 1991).
Like literature reviews, heuristic evaluations build on previous research and do not require additional data collection. A heuristic evaluation simply means applying human factors
heuristics—rules of thumb, principles, and guidelines—to identify ways to improve a design. It is important to point out that many guidelines are just that: guides rather than hard-and-fast rules. Guidelines require careful consideration rather than blind application. For a computer application, heuristic evaluation might mean examining every aspect of the interface to make sure it meets usability standards (Nielsen, 1993; Nielsen & Molich, 1990). However, there are important aspects of a system that are not directly related to usability, such as safety and satisfaction. Thus, the first step of a heuristic evaluation would be to select human factors principles that are particularly applicable to the design, such as those listed at the end of Chapters 4-18.
The second step of a heuristic evaluation is to carefully inspect the design and identify where it violates the heuristics. While an individual analyst can perform the heuristic evaluation, the odds are great that this person will miss most of the usability or other human factors problems. Nielsen (1993) reports that, averaged over six projects, only 35 percent of the interface usability problems were found by single evaluators. Because different evaluators find different problems, the difficulty can be overcome by having multiple evaluators perform the heuristic evaluation. Nielsen recommends using at least three evaluators, preferably five. Each evaluator should inspect the design in isolation from the others. After each has finished the evaluation, they should communicate and aggregate their findings.
Once the heuristic evaluations have been completed, the results should be conveyed to the design team. Often, this can be done in a group meeting, where the evaluators and design team members discuss the problems identified and brainstorm to generate possible design solutions (Nielsen, 1994a). Heuristic evaluation has been shown to be very cost effective. For example, Nielsen (1994b) reports a case study where the cost was $10,500 for the heuristic evaluation, and the expected benefits were estimated at $500,000 (a 48:1 benefit-cost ratio).
A heuristic evaluation provides a relatively complete or holistic assessment of the design, compared to the more focused approach of a cognitive walkthrough. With a cognitive walkthrough, analysts consider each task associated with a system interaction. As they consider each task, they pose a series of questions to highlight potential problems that might confront someone trying to actually perform the sequence of tasks. Questions that guide the cognitive walkthrough include (Blackmon, Polson, Kitajima, 2003):
• Is it likely that the person will perform the right action?
• Does the person understand what task needs to be performed?
• Will the person notice that the next task can be performed?
• Will the person understand how to perform the task?
• Does the person get feedback after performing the task indicating successful completion?
Walking through each task with these questions in mind will identify places where people are likely to make mistakes or get confused, which can be noted for discussion with the design team in a manner similar to the results of a heuristic evaluation.
Literature reviews, heuristic evaluation, and cognitive walkthroughs do not involve collecting data from people interacting with the system and are therefore particularly fast to apply to a system, which makes them especially useful early in the design. One important limitation of these approaches is that the analysts might suffer from learned intuition and the curse of knowledge about how the system works. In these situations, even with the help of the heuristics and the walkthrough questions, they might not notice problems that might frustrate a less familiar person. For this reason, testing with people who are similar to those who will eventually use the system is essential. For example, a team of engineers in their 30s might not understand how an 85-year-old woman could be confused by what to do with a computer mouse.

Usability testing
Usability testing is a formative evaluation technique—it helps diagnose problems and identify opportunities for improvement as part of an iterative development process. Usability testing involves users interacting with the system and measuring their performance as ways to improve the design. Usability is primarily the degree to which the system is easy to use, or “user friendly.” This translates into a cluster of factors, including the following five variables (from Nielsen, 1993):
■ Learnability: The system should be easy to learn so that the user can rapidly start getting some work done.
■ Efficiency: The system should be efficient to use so that once the user has learned the system, a high level of productivity is possible.
■ Memorability: The system should be easy to remember so that the casual user is able to return to the system after some period of not having used it, without having to learn everything all over again.
■ Errors: The system should have a low error rate so that users make few errors during the use of the system and so that if they do make errors, they can easily recover from them.
Further, catastrophic errors must not occur.
■ Satisfaction: The system should be pleasant to use so that users are subjectively satisfied when using it; they like it.
Usability testing identifies how to improve a design on each of these usability dimensions, which differs substantially from typical experiments that can have anywhere from 20 to
100 participants. Usability testing typically includes just five participants as part of a sequence.
After each usability test, the results are shared with the design team, the design is refined, and another usability test is conducted with a new set of users (https://www.nngroup.com/articles/how-many-test-users/). A single usability test is not enough—a minimum of two and ideally five or more test and refinement iterations are needed.
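The five-user recommendation in the article cited above is often explained with a simple problem-discovery model attributed to Nielsen and Landauer: the share of problems found after n test users is 1 − (1 − L)^n, where L is the probability that a single user reveals a given problem. The short Python sketch below illustrates the idea; the value L = 0.31 is an often-quoted average and an assumption here, since the true discovery rate varies widely across projects.

    # Expected share of usability problems found as a function of the number of
    # test users, using the problem-discovery model found(n) = 1 - (1 - L)^n.
    # L = 0.31 is an assumed average discovery rate; it differs across projects.

    def proportion_found(n_users, discovery_rate=0.31):
        """Expected share of usability problems uncovered by n_users."""
        return 1 - (1 - discovery_rate) ** n_users

    for n in [1, 3, 5, 10, 15]:
        print(f"{n:2d} users -> about {proportion_found(n):.0%} of problems found")

Under these assumptions, five users already reveal roughly 84 percent of the problems, which is why several small tests with redesign between them tend to uncover more than one large test.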
Figure 2 shows a powerful way to enhance usability testing. Rather than focusing on a single design, three are developed in parallel and tested. The best elements of the three designs are merged and the resulting design is assessed and refined in a series of two to five tests and refinement iterations.

Figure 2. Parallel design process with iteration
(https://www.nngroup.com/articles/parallel-and-iterative-design/).

Usability testing is limited when considering complex systems and organization design because it has evolved primarily in the field of human–computer interaction (Chapter 15). However, usability methods generalize to any situation where it is possible to expose people to a typical interaction where the design can be refined repeatedly. Usability testing is also limited in its purpose: it serves as a formative evaluation guiding iterative design. For a design to be released to market, particularly for high-risk systems, a summative evaluation is needed. A summative evaluation aims to ensure that the design will operate as promised.
Comprehensive evaluation and controlled experiments
Comprehensive system evaluation provides a more inclusive, summative assessment of the system than does a usability evaluation. The data source for a comprehensive system evaluation often involves controlled experiments. Similarly, user studies aimed at understanding more general factors affecting human behavior, such as how voice control compares to manual operation of a mobile device while driving, also require controlled experiments. The experimental method consists of deliberately producing a change in one or more causal or independent variables and measuring the effect of that change on one or more dependent variables. The key to good experiments is control. That is, only the independent variable should be changed, and all other variables should be held constant or controlled. However, control is more difficult in a comprehensive system evaluation, where participants perform their tasks in the context of the environment where the results are to generalize, compared to a more focused user study.
As control is loosened, out of necessity, the researcher depends progressively more on descriptive methods: describing relations that exist, even though they could not actually be manipulated or controlled by the researcher. For example, the researcher might describe the greater frequency of mobile phone crashes in city than in freeway driving to help draw a conclusion that mobile phones are more likely to disrupt drivers that are busy. A researcher might also simply observe drivers while traveling in the real world and use automatic data recorders to objectively examine their behavior.
In human factors, as in any kind of research, collecting data, whether experimental or descriptive, is only half of the process. The other half is inferring the meaning or message conveyed by the data, and this usually involves generalizing or predicting from the particular data sampled to the broader population. Do mobile phones compromise (or not) driving safety in the broad section of automobile drivers, and not just in the sample of drivers from a driving simulator experiment, from crash data in one geographical area, or from self-reported survey data?
The ability to generalize involves care in both the design of experiments and in the statistical analysis. In a controlled experiment, all independent variables are controlled. While experimentation in a well controlled environment is valuable for uncovering basic laws and principles, there are often cases where research is better conducted in the real world. In many respects, the use of complex tasks in a real-world environment results in more generalizable data that capture more of the characteristics of a complex, real-world environment. Unfortunately, conducting research in real-world settings often means that we must give up the “true” experimental design because we cannot directly manipulate and control variables. These situations often require quasi-experiments, where not all independent variables are controlled. Quasi-experiments share similar characteristics to controlled laboratory studies, but the trials are not randomly assigned or even controlled as rigorously as they are in a controlled laboratory experiment.
As control becomes less structured (or more relaxed), the researcher depends progressively more on descriptive methods: describing relations even though they are not manipulated or controlled by the researcher. For example, crash data can be used to identify the frequency of cell phone crashes in city driving when compared to freeway driving, which may help the researcher draw a conclusion that cell phones are more likely to disrupt the busier driver. A researcher might also simply observe drivers while driving in the real world, objectively recording and later analyzing their behavior. One example is descriptive research, where researchers simply measure a number of variables and evaluate how they are related to one another. Examples of this type of research include evaluating the driving behavior of local residents at various intersections, measuring how people use a particular design of ATM (automatic teller machine), and observing workers in a manufacturing plant to identify the types and frequencies of unsafe behavior.
In-service evaluation
In-service evaluation refers to evaluations conducted after a design has been released, such as after a car has been on the market, after a modified manufacturing line has been placed in service, or after a new mobile phone operating system has been released. Descriptive studies are critical for in-service evaluation because experimental control is often impossible. In the vignette presented at the beginning of this chapter, an in-service evaluation of existing mobile phone use might start by examining crash records or moving violations. This will give us some insight into road safety issues, but there is a great deal of variation, missing data, and underreporting in such databases. Like most descriptive studies, such a comparison of crashes is a challenge because each crash involves many different conditions, and important driver-related activities (e.g., eating, cell-phone use, looked but did not see) might go unreported.
In situations where data are easily collected, it may be more sensible to sample only a part of the behavioral data available or to sample behavior during different sessions rather than all at once. For example, a safety officer is better off sampling the prevalence of improper procedures or risk-taking behavior on the shop floor during several different sessions over time rather than during one day. The goal is to get representative samples of behavior, and this is more easily accomplished by sampling over different days and during different conditions.
Surveys and questionnaires are important tools for in-service evaluation. The design of questionnaires and surveys is a challenging task if it is to be done in a way that yields reliable and valid data. Questionnaires and surveys sometimes gather qualitative data from open-ended questions (e.g., “what features on the device would you like to see?” or “what were the main problems in operating the device?”). However, more rigorous treatment of the survey results can typically be obtained from quantitative data, often obtained from a numerical rating scale with endpoints ranging between, say, 1–7 or 1–10. Such quantitative data have the advantage of being amenable to statistical analysis.
A major concern with questionnaires is their validity. Aside from assuring that questions are designed to appropriately assess the desired content area, under most circumstances, respondents should be told that their answers will be both confidential and anonymous. It is common practice for researchers to place identifying numbers rather than names on the questionnaires. Employees are more likely to be honest if their names will never be directly associated with their answers.
A problem is that many people do not fill out questionnaires if they are voluntary. If the sample of those who do and who do not return questionnaires is different along some important dimension related to the topic surveyed, the survey results will obviously be biased. For example, in interpreting the results of an anonymous survey of unsafe acts in a factory, those people who are time-stressed in their job are more likely to commit unsafe acts, but also do not have time to complete the survey. Hence, their acts will be underrepresented in the survey.


Figure 3. Types of evaluations involving data collection defined by degree of experimental control and type of measurement.
Human factors specialists have a wide range of evaluation techniques to choose from. They range from those that can be performed on early design concepts (heuristic evaluation and cognitive walkthroughs) and do not require human subjects data collection, to usability tests that involve groups of five participants. Comprehensive system evaluation and user studies require more time and more participants and typically use experiments or quasi-experiments to assess how people respond to design choices. Even after a design is complete and people are using it, evaluation continues with in-service evaluation, which typically cannot use experiments but must rely on descriptive methods. Figure 3 highlights the evaluations that require data collection and shows how these methods compare in terms of degree of experimental control and type of measurement. The balance of this chapter describes how to conduct such studies with an emphasis on controlled experiments.

STEPS IN CONDUCTING A STUDY
Although the details depend on the type of study, the general steps are similar. A descriptive study, a usability evaluation, and a controlled experiment might differ substantially in the amount and type of data collected and how those data are analyzed, but the general steps would be similar. Here we outline the five general steps.
Step 1. Define problem, research questions, and hypotheses. In this step, the evaluator identifies the specific concern and then formalizes a prediction about the relationship between two or more variables in the population of interest. This formal hypothesis is then used to guide the experimental design, where data are collected to support or refute the research statement and to determine whether a cause-and-effect relationship does in fact exist. Examples of testable problem statements include:
• Does alternating operators’ work shift between day and night produce more performance errors than having people on a constant shift?
• Does attending to a mobile device while driving create more driving errors than attending to the roadway only?
Step 2. Specify the experimental plan. This step consists of identifying how to address the design problem or answer the research question by defining what is manipulated (the independent variables), what is measured (the dependent variables), and in what situations. What task will our participants be asked to perform, and what aspects of those tasks do we measure? Here we must specify exactly what is meant by the dependent variable. What do we mean by performance? For example, we could define performance as the number of keystroke errors in data entry, or the time or speed to complete a task. We must also define each independent variable in terms of how it will be manipulated. For example, we would specify exactly what we mean by alternating between day and night shifts. Is this a daily change or a weekly change? Defining the independent variables is an important part of creating the experimental design. Which independent variables do we manipulate? How many levels of each? For example, we might decide to examine the performance of three groups of workers: those on a day shift, those on a night shift, and those alternating between shifts.
Step 3. Conduct the study. After designing the study and identifying a sample of participants, the evaluator is ready to conduct the experiment and collect data. The researcher recruits participants, develops materials, and prepares to conduct the study. For all but the simplest experiments, conducting a small pretest, or pilot study, is essential before conducting the entire “real” study. The pilot study should check that manipulation levels are set properly, that participants understand the instructions, and that the experiment will generally go smoothly. After everything is checked through a pilot study, the data are collected. During data collection, the experimenter should take care that the methods remain constant. For example, an evaluator should not become more lenient over time, and the measuring instruments should remain calibrated. Finally, all participants should be treated ethically, as described later.
Step 4. Analyze the data. In an experiment, the dependent variable is measured and quantified for each subject (there may be more than one dependent variable). For our example, you would have a set of numbers representing the keystroke errors for the people on changing work shifts, a set for the people on day shift, and a set for the people on night shift. Data are analyzed using both descriptive and inferential statistics to see whether there are meaningful differences among the three groups.
Step 5. Draw conclusions and communicate results. Based on the results of the statistical analysis, the researchers draw conclusions about the cause-and-effect relationships in the experiment. At the simplest level, this means determining whether hypotheses were supported. In applied research, it is often important to go beyond the obvious. For example, our study might conclude that shiftwork schedules affect older workers more than younger workers or that they influence the performance of certain tasks and not others. Clearly, the conclusions that we draw depend on the experimental design. It is also important for the researcher to go beyond concluding what was found, to ask “why”. For example, are older people more disrupted by shiftwork changes because they need more sleep? Or because their natural circadian (day–night) rhythms are more rigid? Identifying underlying reasons, whether psychological or physiological, allows for the development of useful and generalizable principles and guidelines.
The following sections expand on steps associated with conducting a study that are particularly complicated: study design, measurement, data analysis, and drawing conclusions.

STUDY DESIGN
An experiment involves examining the relationship between independent variables and the resulting changes in one or more dependent variables, which are typically measures of performance, workload, preference, or other subjective evaluations. The goal is to show that manipulations of the independent variable, and no other variable, cause changes in behavior and attitudes measured by the dependent variables.
An experimental design consists of a specific research statement that clearly identifies the independent variables that will be controlled in the study, and the dependent variables or outcomes of interest to be measured and observed. The key to good experiments is control. That is, only the independent variable should be manipulated, and all other variables should be held constant or controlled. However, control becomes increasingly difficult as the study becomes more complex, and as the tasks being examined need to be considered in the context of the environment to which the research results are to generalize.
In designing a study, it is important to consider all variables that might affect the dependent variables. Extraneous variables have the potential to interfere in the causal relationship and must be controlled so that they do not interfere. If these extraneous variables do influence the dependent variable, we say that they are confounding variables. One group of extraneous variables is the wide range of ways participants differ from one another. These variables must be controlled, so it is important that the different groups of people in a between-subjects experiment differ only with respect to the treatment condition and not on any other variable or category. For example, in the cellular phone study, you would not want elderly drivers using the car phone and young drivers using no phone. Then age would be a confounding variable. One way to make sure all groups are equivalent is to take the entire set of subjects and randomly put them in one of the experimental conditions. That way, on the average, if the sample is large enough, characteristics of the subjects will even out across the groups. This procedure is termed random assignment. Another way to avoid having different characteristics of subjects in each group is to use a within-subjects design. However, this design creates a different set of challenges for experimental control.
Other variables in addition to subject variables must be controlled. For example, it would be a poor experimental design to have one condition where cellular phones are used in a Jaguar and another condition where no phone is used in an Oldsmobile. There may be driving characteristics or automobile size differences that cause variations in driving behavior. The phone versus no-phone comparison should be carried out in the same vehicle (or same type of vehicle). We need to remember, however, that in more applied research, it is sometimes impossible to exert perfect control.
For within-subjects designs, there is another variable that must be controlled: the order in which the subject receives his or her experimental conditions, which creates what are called order effects. When people participate in several treatment conditions, the dependent measure may show differences from one condition to the next simply because the treatments, or levels of the independent variable, are experienced in a particular order. For example, if participants use five different cursor-control devices in an experiment, they might be fatigued by the time they are tested on the fifth device and therefore exhibit more errors or slower times. This would be due to the order of devices used rather than the device per se. In contrast, if the cursor-control task is new to the participant, he or she might show learning and actually do best on the fifth device tested, not because it was better, but because the cursor-control skill was more practiced. These order effects of fatigue and practice in within-subjects designs are both potential confounding variables; while they work in opposite directions, to penalize or reward the late-tested conditions, they do not necessarily balance each other out.
As a safeguard to keep order from confounding the independent variables, we use a variety of methods. For example, extensive practice can reduce learning effects. Time between conditions can reduce fatigue. Finally, researchers often use a technique termed counterbalancing.
This simply means that different subjects receive the treatment conditions in different orders.
For example, half of the participants in a study would use a trackball and then a mouse. The other half would use a mouse and then a trackball. There are specific techniques for counterbalancing order effects; the most common is a Latin-square design. Research methods books (e.g., Keppel, 1992) provide instruction on using these designs.
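As an illustration, the Python sketch below generates a cyclic Latin square of presentation orders for the four cursor-control devices mentioned earlier. Each device appears once per participant and equally often in each ordinal position across a group of four participants; fully balancing first-order carryover effects would require a balanced Latin square of the kind described in research methods texts such as Keppel (1992). The device names are simply taken from the example above.

    # Cyclic Latin square for counterbalancing the order of four conditions.
    # Each condition appears once per row (participant) and once in each
    # ordinal position across the four rows.
    devices = ["trackball", "thumbwheel", "mouse", "key-mouse"]
    n = len(devices)

    orders = [[devices[(row + pos) % n] for pos in range(n)] for row in range(n)]

    for participant, order in enumerate(orders, start=1):
        print(f"Participant {participant}: {' -> '.join(order)}")
    # Additional participants cycle through the same four orders.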
In summary, the researcher must control extraneous variables by making sure they do not covary with the independent variable. If they do covary, they become confounds and make interpretation of the data impossible. This is because the researcher does not know which variable caused the differences in the dependent variable.
One factor with two levels
In a two-group design, one independent variable or factor is tested with only two conditions or levels of the independent variable. In the classic two-group design, a control group gets no treatment (e.g., driving with no cellular phone), and the experimental group gets some
“amount” of the independent variable (e.g., driving while using a cellular phone). The dependent variable (driving performance) is compared for the two groups. However, in human factors we often compare two different experimental treatment conditions, such as performance using a trackball versus using a mouse. In these cases, a control group is unnecessary: A control group to compare with mouse and trackball users would have no cursor control at all, which does not make sense. A common example of such a study is an A/B study, which is often used to assess a change in a website. A random sample of people using the site sees design A and another random sample sees design B.
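As a sketch of how such an A/B study might be analyzed, the Python example below compares task-completion counts for two versions of a website using a chi-square test of proportions (introduced in the Data Analysis section later in this chapter). The counts are invented for illustration only, and scipy is assumed to be available.

    # Hypothetical A/B study: visitors are randomly shown design A or design B,
    # and we record whether each visitor completes a task. A chi-square test
    # compares the two completion proportions.
    from scipy.stats import chi2_contingency

    #           completed  not completed
    counts = [[340, 660],    # design A (1,000 visitors)
              [395, 605]]    # design B (1,000 visitors)

    chi2, p, dof, expected = chi2_contingency(counts)
    print(f"Design A: {340/1000:.1%} completed, Design B: {395/1000:.1%} completed")
    print(f"chi-square = {chi2:.2f}, p = {p:.3f}")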
One factor with more than two levels
Sometimes the two-group design does not adequately test our hypothesis of interest. For example, if we want to assess the effects of VDT brightness on display perception, we might want to evaluate several different levels of brightness. We would be studying one independent variable
(brightness) but would want to evaluate many levels of the variable. If we used five different brightness levels and therefore five groups, we would still be studying one independent variable but would gain more information than if we used only two levels/groups. With this design, we could develop a quantitative model or equation that predicts performance as a function of brightness. In a different multilevel design, we might want to test four different input devices for cursor control, such as trackball, thumbwheel, traditional mouse, and key-mouse. We would have four different experimental conditions but still only one independent variable (type of input device).
Multiple Factors
In addition to increasing the number of levels used for manipulating a single independent variable, we can expand the two-group design by evaluating more than one independent variable or factor in a single experiment. In human factors, we are often interested in complex systems and therefore in simultaneous relationships between many variables rather than just two. As noted above, we may wish to determine if shiftwork schedules (Factor A) have the same or different effects on older versus younger workers (Factor B).
A multifactor design that evaluates two or more independent variables by combining the different levels of each independent variable is called a factorial design. The term factorial indicates that all possible combinations of the independent variable levels are combined and evaluated. Factorial designs allow the researcher to assess the effect of each independent variable by itself and also to assess how the independent variables interact with one another. Because much of human performance is complex and human–machine interaction is often complex, factorial designs are the most common research designs used in both basic and applied human factors research.
13

April 3, 2016
Factorial designs can be more complex than a 2 × 2 design in a number of ways. First, there can be more than two levels of each independent variable. For example, we could compare driving performance with two different cellular phone designs (e.g., hand-dialed and voice-dialed), and also with a “no phone” control condition. Then we might combine that first three-level variable with a second variable consisting of two different driving conditions: city and freeway driving. This would result in a 3 × 2 factorial design. Another way that factorial designs can become more complex is by increasing the number of factors or independent variables. Suppose we repeated the above 3 × 2 design with both older and younger drivers. This would create a 3 × 2 × 2 design. A design with three independent variables is called a three-way factorial design.
Adding independent variables has three advantages: (1) It allows designers to vary more system features in a single experiment: It is efficient. (2) It captures a greater part of the complexity found in the real world, making experimental results more likely to generalize. (3) It allows the experimenter to see if there is an interaction between independent variables, in which the effect of one independent variable on performance depends on the level of the other independent variable, as we describe in the box.
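One way to see what a factorial design implies for data collection is to enumerate its conditions. The brief Python sketch below crosses the levels of the 3 × 2 × 2 example described above (phone design, driving condition, and driver age); the level names are illustrative.

    # Enumerate the conditions of a 3 x 2 x 2 factorial design by crossing all
    # levels of each factor. Each participant (between-subjects) or each
    # session (within-subjects) maps onto one of these combinations.
    from itertools import product

    phone = ["no phone", "hand-dialed", "voice-dialed"]
    traffic = ["city", "freeway"]
    age = ["younger", "older"]

    conditions = list(product(phone, traffic, age))
    print(f"{len(conditions)} conditions")    # 3 * 2 * 2 = 12
    for condition in conditions:
        print(condition)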
Between-subjects design
In most of the previous examples, the different levels of the independent variable were assessed using separate groups of subjects. For example, we might have one group of subjects use a cellular car phone in heavy traffic, another group use a cellular phone in light traffic, and so on.
We compare the driving performance between groups of subjects and hence use the term between-subjects. A between-subjects variable is an independent variable whereby different groups of subjects are used for each level or experimental condition.
A between-subjects design is a design in which all of the independent variables are between-subjects, and therefore each combination of independent variables is administered to a different group of subjects. Between-subjects designs are most commonly used when having subjects perform in more than one of the conditions would be problematic. For example, if you have subjects receive one type of training (e.g., on a simulator), they could not begin over again for another type of training because they would already know the material. Between-subjects designs also eliminate certain confounds related to order effects, which we discuss shortly.
Within-subject designs
This is a design where the same participant is used in multiple treatment conditions. In many experiments, it is feasible to have the same subjects participate in all of the experimental conditions. For example, in the driving study, we could have the same subjects drive for periods of time in each of the four conditions shown in Table 2.1. In this way, we could compare the performance of each person with him- or herself across the different conditions. This within-subject performance comparison illustrates where the method gets its name. When the same subject experiences all levels of an independent variable, it is termed a within-subjects variable.
An experiment where all independent variables are within-subject variables is termed a within-subjects design. Using a within-subjects design is advantageous in a number of respects, including that it is more sensitive and makes it easier to find statistically significant differences between experimental conditions. It is also advantageous when the number of people available to participate in the experiment is limited.
Mixed designs
In factorial designs, each independent variable can be either between-subjects or within-subjects. If both types are used, the design is termed a mixed design. If one group of subjects drove in heavy traffic with and without a cellular phone, and a second group did so in light traffic, this is a mixed design.
Sampling people, tasks, and situations
Once the experimental design has been specified with respect to independent variables, the researcher must decide what people will be recruited, what tasks the people will be asked to perform, and in what situations. The concept of representative sampling guides researchers to select people, tasks, and situations that they are designing for.
Participants should represent the population or group the researcher is interested in studying. For example, children under 14 may not be the appropriate sample for studying driver behavior. If we are studying systems that will be used by the elderly, the target population should be those aged 65 and older. If the study were to examine a system for English speakers in the US, the elderly people of interest would be those living in the United States who are healthy, speak English, and read at a certain grade level. The sample should be representative of typical users, not of the entire population of people, and it is often not the population of a university or of the engineers designing the product.
Just as you would not conduct a study with a single participant, it is also important to include a sample of tasks people might encounter with the design. In the example of the mobile phone evaluation, these tasks might include placing a call, answering a call, reading a text, and sending a text. Just as the sample of people needs to be representative of the population using the design, the sample of tasks should be representative of the tasks people are likely to perform.
For applied research, we try to identify tasks and environments that will give us the most generalizable results. This often means conducting the experiments in situations that are most representative of those actually encountered with the design.

MEASUREMENT
Because an experiment involves examining the relationship between independent variables and changes in one or more dependent variables, defining what is measured—the dependent variables—is crucial. The dependent variables are what can be measured that relate to the outcomes described in the research questions. The research questions are often stated in terms of abstract constructs, where constructs describe theoretical entities that cannot be measured directly. Common constructs in human factors studies include: workload, situation awareness, fatigue, safety, acceptance, trust, and comfort. These constructs cannot be measured directly and the human factors researcher must select variables that can be measured, such as subjective ratings and response times that are strongly related to the underlying constructs of interest.
For an assessment of how mobile phones might affect driving, the underlying construct might be safety, and the measure that relates to safety might be errors in lane keeping, where the car’s tire crosses a lane boundary. Safety might also be measured by ratings from the drivers indicating how safe they felt.
Subjective ratings are often contrasted with objective performance data, such as error rates or response times. The difference between these two classes of measures is important, given that subjective measures are often easier and less expensive to obtain, with a high sample size. Both objective and subjective measures are useful. For example, in a study of factors that lead to stress disorders in soldiers, objective and subjective indicators of event stressfulness and social support were predictive of combat stress reaction and later posttraumatic stress disorder and that “subjective parameters were the stronger predictors of the two” (Solomon, Mikulincer, and Hobfoll, 1987 p. 581). In considering subjective measures, however, it is important to realize that what people subjectively rate as “preferred” is not always the system feature that supports best performance (Andre & Wickens, 1995). For example, people almost always prefer a
colored display to a monochrome one, even when the color is used in such a way that it can be detrimental to performance.
Subjective and objective dependent variables provide important and complementary information. We often want to measure how causal variables affect several dependent variables at once. For example, we might want to measure how use of a cellular phone affects a number of driving variables, including deviations from the lane, reaction time to brake for cars or other objects in front of the vehicle, time to recognize objects in the driver’s peripheral vision, speed, acceleration, and so forth. Using several dependent variables helps triangulate on the truth—if all the variables indicate the same outcome then one can have much greater confidence in that outcome.

DATA ANALYSIS
Collecting data, whether experimental or observational, is only half of the process. The other half is inferring the meaning or message conveyed by the data, and this usually involves generalizing or predicting from the particular data sampled to the broader population of people and context of use. Do mobile phones compromise driving safety in the broad section of automobile drivers, and not just in the sample of drivers used in the simulator experiment or the sample involved in accident statistics? Do all mobile phones compromise safety in a similar way? The ability to generalize involves care in both the design of experiments and in the statistical analysis.
Analysis of controlled experiments
Once the experimental data have been collected, the researcher must determine whether the dependent variable(s) actually did change as a function of experimental condition. For example, was driving performance really “worse” while using a cellular phone? To evaluate the research questions and hypotheses, the experimenter calculates two types of statistics: descriptive and inferential statistics. Descriptive statistics are a way to summarize the dependent variable for the different treatment conditions, while inferential statistics tell us the likelihood that any differences between our experimental groups are “real” and not just random fluctuations due to chance. Differences between experimental groups are usually described in terms of averages.
Thus, the most common descriptive statistic is the mean. Research reports typically describe the mean scores on the dependent variable for each group of subjects (e.g., see the data shown in Table 2.1 and Figure 2.2). This is a simple way of conveying the effects of the independent variable(s) on the dependent variable. Standard deviations are also sometimes given to convey the spread of scores.
While experimental groups may show different means for the various conditions, it is possible that such differences occurred solely on the basis of chance. Humans almost always show random variation in performance, even without manipulating any variables. It is not uncommon to get two groups of subjects who have different means on a variable, without the difference being due to any experimental manipulation, in the same way that you are likely to get a different number of “heads” if you do two series of 10 coin tosses. In fact, it is unusual to obtain means that are exactly the same. So, the question becomes: are the differences big enough that we can rule out chance and assume the independent variable had an effect? Inferential statistics give us, effectively, the probability that the difference between the groups is due to chance. If we can rule out the “chance” explanation, then we infer that the difference was due to the experimental manipulation.
A comparison of two conditions is usually conducted using a t-test. Comparison of proportions is done using a chi-square test. For more than two groups, we use an analysis of variance (ANOVA). Both tests yield a score; for a t-test, we get a value for a statistical term called t, and for ANOVA, we get a value for F. Most important, we also identify the probability, p, that the t or F value would be found by chance for that particular set of data if there was no effect or difference. The smaller that p is, the more significant our result becomes and the more confident we are that our independent variable really did cause the difference. This p value will be smaller as the difference between means is greater, the variability between our observations within a condition (standard deviation) is less, and, importantly, as the sample size increases
(more participants, or more measurements per participant). A larger sample gives an experiment greater statistical power to find true differences between conditions.
Although p values are a common inferential statistic, a more useful approach is to report confidence intervals. Confidence intervals show the range of the mean value that might be expected if the study were to be repeated. The confidence interval is more informative than the p value because it can show the difference between a situation where the mean values of two conditions were very large but the variability was also large and a situation where the difference in the mean values was small but the variability was very small. These conditions might produce the same p values, but considering the means and confidence intervals would suggest a very different interpretation of the results.
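To make these ideas concrete, the Python sketch below computes descriptive statistics, an independent-samples t-test, and a 95% confidence interval for a hypothetical two-group comparison of lane deviations with and without a phone. The data values are invented for illustration, and numpy and scipy are assumed to be available.

    # Hypothetical two-group comparison: lane deviations per drive with and
    # without a phone. Descriptive statistics, an independent-samples t-test,
    # and a 95% confidence interval for the difference between means.
    import numpy as np
    from scipy import stats

    phone = np.array([4.1, 5.3, 4.8, 6.0, 5.5, 4.9, 5.8, 5.1])
    no_phone = np.array([3.2, 4.0, 3.6, 4.4, 3.9, 3.5, 4.2, 3.8])

    print(f"means: phone {phone.mean():.2f}, no phone {no_phone.mean():.2f}")

    t_stat, p_value = stats.ttest_ind(phone, no_phone)
    print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

    # 95% confidence interval for the mean difference (pooled-variance formula)
    n1, n2 = len(phone), len(no_phone)
    diff = phone.mean() - no_phone.mean()
    pooled_sd = np.sqrt(((n1 - 1) * phone.var(ddof=1) +
                         (n2 - 1) * no_phone.var(ddof=1)) / (n1 + n2 - 2))
    se = pooled_sd * np.sqrt(1 / n1 + 1 / n2)
    t_crit = stats.t.ppf(0.975, df=n1 + n2 - 2)
    print(f"95% CI for the difference: {diff - t_crit * se:.2f} to {diff + t_crit * se:.2f}")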
Analysis of descriptive studies
Most descriptive studies are conducted to evaluate the relationships between a number of variables. Whether the research data has been collected through observation or questionnaires, the goal is to see whether relationships exist and to measure their strength. Relationships between variables can be measured in a number of ways.
If we were interested in determining if there is a relationship between job experience and safety attitudes within an organization, this could be done by performing a correlational analysis. The correlational analysis measures the extent to which two variables covary such that the value of one can be somewhat predicted by knowing the value of the other. For example, in a positive correlation, one variable increases as the value of another variable increases; for example, the amount of illumination needed to read text will be positively correlated with age. In a negative correlation, the value of one variable decreases as the other variable increases; for example, the intensity of a soft tone that can be just heard is negatively correlated with age. By calculating the correlation coefficient, r, we get a measure of the strength of the relationship.
Statistical tests can be performed that determine the probability that the relationship is due to chance fluctuation in the variables. Thus, we get information concerning whether a relationship exists (p) and a measure of the strength of the relationship (r). As with other statistical measures, the likelihood of finding a significant correlation increases as the sample size—the number of items measured on both variables—increases.
Correlational analysis often goes beyond reporting a correlation coefficient and typically describes the relationship with a regression equation. This equation uses the observed data to show how much a change in one variable will change another variable. This statistical model can even be used to predict future outcomes and might suggest optimal values for a design.
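A brief Python sketch of this kind of analysis is shown below, relating years on the job to a risk-taking score. The data are invented for illustration, and scipy is assumed to be available.

    # Hypothetical descriptive study: correlation and simple regression relating
    # years on the job to a risk-taking score (higher = more risk-taking).
    from scipy import stats

    years = [1, 2, 3, 5, 7, 10, 12, 15, 20, 25]
    risk_score = [8.1, 7.6, 7.9, 6.8, 6.5, 5.9, 6.1, 5.2, 4.8, 4.1]

    r, p = stats.pearsonr(years, risk_score)     # strength and significance
    print(f"r = {r:.2f}, p = {p:.4f}")

    fit = stats.linregress(years, risk_score)    # regression equation
    print(f"slope = {fit.slope:.2f} per year, intercept = {fit.intercept:.2f}")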
One caution should be noted. When we find a statistically significant correlation, it is tempting to assume that one of the variables caused the changes seen in the other variable. This causal inference is unfounded for two reasons. First, the direction of causation could actually be in the opposite direction. For example, we might find that years on the job is negatively correlated with risk-taking. While it is possible that staying on the job makes an employee more cautious, it is also possible that being more cautious results in a lower likelihood of injury or death. This may therefore cause people to stay on the job. Second, a third variable might cause changes in both variables. For example, people who try hard to do a good job may be encouraged to stay on and may also behave more cautiously as part of trying hard.

DRAWING CONCLUSIONS AND COMMUNICATING RESULTS
Statistical analysis provides an essential method for differentiating between systematic effects of the independent variables and random variation between people and conditions. This analysis often seems to provide a clear decision: if the p value is less than .05, there is an important difference between conditions. This clarity is an illusion, and drawing conclusions from statistical results requires careful judgment and communication.
Statistical significance and Type I and Type II errors
Researchers usually assume that if p is less than .05, they can conclude that the results are not due to chance and therefore that there was an effect of the independent variable. Mistakenly concluding that independent or causal variables had an effect when it was really just chance is referred to as making a Type I error. If scientists use a .05 cutoff, they will make a Type I error only one time in 20. In traditional sciences, a Type I error is considered a “bad thing” (Wickens, 1998). This makes sense if a researcher is trying to develop a cause-and-effect model of the world. A Type I error would lead to the development of false theories and misplaced expectations about the benefits of design changes.
Researchers in human factors have also accepted this implicit assumption that making a
Type I error is bad. Research where the data result in inferential statistics with p > .05 is generally not accepted for publication in most journals. Experimenters studying the effects of system design alternatives often conclude that the alternatives made no difference. Program evaluations where introduction of a new program resulted in p > .05 often conclude that the new program did not work, all because there is greater than a 1-in-20 chance that spurious factors could have caused the results.
Focusing only on the cost of Type I errors and choosing a p value of .05 or .01 ignores the cost of Type II errors: concluding that the experimental manipulation did not have an effect when in fact it did. This means, for example, that a safety officer might conclude that a new piece of equipment is no easier to use under adverse environmental conditions, when in fact it is easier. The likelihoods of making Type I and Type II errors are inversely related. Thus, if the experimenter found that the new equipment was not statistically significantly better than the old at the .05 level, the new equipment might be rejected even though it might actually be better; had the criterion been set at .10 instead of .05, it would have been concluded to be better.
Focusing on the p = .05 criterion is especially problematic in human factors because we frequently must conduct experiments and evaluations with relatively low numbers of subjects because of expense or the limited availability of certain highly trained professionals (Wickens,
1998). As we saw, using a small number of subjects makes the statistical test less powerful and more likely to show no significance, or p > .05, even when there is a difference. In addition, the variability in performance between different subjects or for the same subject but over time and conditions is also likely to be great when we try to do our research in more applied environments, where all confounding extraneous variables are harder to control. Again, these factors make it more likely that the results will show no significance, or p > .05. The result is that human factors specialists frequently conclude that there is no difference in experimental conditions simply because there is more than a 1-in-20 chance that it could be caused by random variation in the data.
In human factors, researchers should consider the probability of a Type II error when their difference is not significant at the conventional .05 level and consider the consequences if others use their research to conclude that there is no difference (Wickens, 1998). For example, will a safety-enhancing device fail to be adopted? In the cellular phone study, suppose that performance really was worse with cell phones than without, but the difference was not big enough to reach .05 significance. Might the legislature conclude, in error, that cell phone use was “safe”? There is no easy answer to the question of how to balance Type I and Type II statistical errors (Keppel, 1992; Nickerson, 2001). The best advice is to realize that the larger the sample size, the less likely either type of error is to occur, and to consider the consequences of both types of errors when, out of necessity, the sample size and power of a human factors experiment must be low.
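The sample-size argument can be illustrated with a power calculation. The sketch below uses statsmodels; the assumed effect size (Cohen's d = 0.5, a hypothetical medium-sized effect) is an assumption, but the pattern it shows (small samples leave a high probability of a Type II error) holds generally.

```python
# Hedged sketch: how sample size changes the chance of missing a real effect
# (a Type II error), for an assumed medium effect size of d = 0.5.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for n_per_group in (10, 20, 50, 100):
    power = analysis.power(effect_size=0.5, nobs1=n_per_group, alpha=0.05)
    print(f"n = {n_per_group:>3} per group: power = {power:.2f}, "
          f"Type II error rate = {1 - power:.2f}")
```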
Statistical and practical significance
Once chance is ruled out, meaning p < .05, researchers discuss the differences between groups as though they are fact. However, it is important to remember that two groups of numbers can be statistically different from one another without the differences being very large. Suppose we compare two groups of Army trainees. One group is trained in tank gunnery with a low-fidelity personal computer. Another group is trained with an expensive, high-fidelity simulator. We might find that when we measure performance, the mean percent correct for the personal computer group is 80, while the mean percent correct for the simulator group is 83. If we used a large number of subjects in a very powerful design, there might be a statistically significant difference between the two groups, and we might therefore conclude that the simulator is a better training system. However, especially for applied research, we must look at the difference between the two groups in terms of practical significance. Is it worth spending millions to place simulators on every military base to get an increase from 80 percent to 83 percent? This illustrates the tendency for some researchers to place too much emphasis on statistical significance and not enough emphasis on practical significance. Focusing on mean values and confidence intervals (uncertainty about the mean value) can help avoid misinterpreting statistical significance as practical significance.
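A small simulation can make the distinction concrete. The sketch below generates invented score distributions loosely matching the 80 percent versus 83 percent example and shows how a very large sample can produce a tiny p value even though the effect size, and thus the practical benefit, remains modest.

```python
# Hedged sketch: statistical significance without much practical significance.
# The score distributions are invented for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 2000  # a very large, very powerful design
pc_group  = rng.normal(80, 10, n)   # low-fidelity personal-computer trainees
sim_group = rng.normal(83, 10, n)   # high-fidelity simulator trainees

t, p = stats.ttest_ind(sim_group, pc_group)
cohens_d = (sim_group.mean() - pc_group.mean()) / np.sqrt(
    (pc_group.var(ddof=1) + sim_group.var(ddof=1)) / 2)

print(f"p = {p:.2e} (statistically significant), Cohen's d = {cohens_d:.2f} "
      f"(about a 3-percentage-point gain in practice)")
```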
Generalizing and predicting
No single study proves anything. As we will see in the following chapters, despite substantial regularity in human behavior, individual differences are substantial, as are the effects of expectations and context. A different sample of people, different instructions, and different tasks might have produced a different outcome. Communicating the results of a study to the design team should reflect this uncertainty. When Erika interprets the results of a study that shows a statistically significant effect of mobile phone use on lane keeping error, she must consider the degree to which that effect depends on the specific people, tasks, and situations that she included in her study. Would her prediction of safer driving without mobile phones materialize if the government enacted a ban? This uncertainty makes it important to consider any study as part of the cycle where evaluation feeds back to an understanding of human behavior that improves in an iterative manner over many studies.

DRIVER DISTRACTION: EXAMPLE OF A SIMPLE FACTORIAL DESIGN
To illustrate the logic behind controlled experiments, we consider an example of a simple factorial design, in which two levels of one independent variable are combined with two levels of a second independent variable. Such a design is called a 2 × 2 factorial design. Imagine that a researcher wants to evaluate the effects of using a cellular phone on driving performance (and hence on safety). The researcher manipulates the first independent variable by comparing driving with and without use of a cellular phone. However, the researcher suspects that the driving impairment may only occur if the driving is taking place in heavy traffic. Thus, he or she may add a second independent variable consisting of light versus heavy traffic driving conditions.
The experimental design would look like that illustrated in Figure 4: four groups of subjects derived from combining the two independent variables.
Imagine that we conducted the study, and for each of the subjects in the four groups shown in Figure 4, we counted the number of times the driver strayed outside of the driving lane as the dependent variable. We can look at the general pattern of data by evaluating the cell means; that is, we combine the scores of all subjects within each of the four groups. Thus, we might obtain data such as that shown in Table 1.
If we look only at the effect of cellular phone use (combining the light and heavy traffic conditions), we might be led to believe that use of cell phones impairs driving performance.
But looking at the entire picture, as shown in Figure 5, we see that the use of a cell phone impairs driving only in heavy traffic conditions (as defined in this particular study). When the lines connecting the cell means in a factorial study are not parallel, as in Figure 5, we know that there is some type of interaction between the independent variables: The effect of phone use depends on driving conditions. Factorial designs are popular for both basic research and applied questions because they allow researchers to evaluate interactions between variables.
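A hedged sketch of how such a design might be analyzed appears below. It generates hypothetical per-subject data around the cell means shown in Table 1 and runs a two-way ANOVA; a significant phone-by-traffic term corresponds to the interaction plotted in Figure 5. The subject counts and variability are assumptions made for illustration.

```python
# Hedged sketch: two-way ANOVA for a 2 x 2 factorial driving study.
# Per-subject data are simulated around the hypothetical cell means in Table 1.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(2)
cell_means = {("no_phone", "light"): 2.1, ("phone", "light"): 2.2,
              ("no_phone", "heavy"): 2.1, ("phone", "heavy"): 5.8}

rows = []
for (phone, traffic), mean in cell_means.items():
    for _ in range(20):  # 20 hypothetical subjects per cell
        rows.append({"phone": phone, "traffic": traffic,
                     "deviations": rng.normal(mean, 0.8)})
data = pd.DataFrame(rows)

# Main effects of phone and traffic plus their interaction
model = smf.ols("deviations ~ phone * traffic", data=data).fit()
print(anova_lm(model, typ=2))
```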

Figure 4. The four experimental conditions for a 2 × 2 factorial design.
Table 1. Hypothetical Data for a Driving Study: Average Number of Lane Deviations

Cell Phone Use    Light Traffic    Heavy Traffic
No cell phone          2.1              2.1
Cell phone             2.2              5.8

Figure 5. Interaction between cellular phone use and driving conditions.


ETHICAL ISSUES
The majority of human factors studies involve people as participants. Many professional affiliations and government agencies have written specific guidelines for the proper way to involve participants in research. Federal agencies rely strongly on the guidelines found in the Code of
Federal Regulations HHS, Title 45, Part 46; Protections of Human Subjects (Department of
Health and Human Services, 1991). The National Institutes of Health has a Web site where students can be certified in human subjects testing (http://ohsr.od.nih.gov/cbt/). Anyone who conducts research using human participants should become familiar with the federal guidelines as well as the APA's published guidelines for the ethical treatment of human subjects (American Psychological Association, 1992). These guidelines fundamentally advocate the following principles:
■ Protection of participants from mental or physical harm
■ The right of participants to privacy with respect to their behavior
■ The assurance that participation in research is completely voluntary
■ The right of participants to be informed beforehand about the nature of the experimental procedures

When people participate in an experiment, or provide data for research by other methods, they are told the general nature of the study. Often, they cannot be told the exact nature of the hypotheses because this would bias their behavior. Participants should be informed that all results will be kept anonymous and confidential. This is especially important in human factors because participants are often employees who fear that their performance will be evaluated by management. Finally, participants are generally asked to sign an informed consent form stating that they understand the nature and risks of the experiment or data gathering project, that their participation is voluntary, and that they understand they may withdraw at any time. In human factors field research, the risk is considered reasonable if it is no greater than that faced in the actual job environment. Research boards in the university or organization where the research is to be conducted certify the adequacy of the consent form and that any potential risks to the participant are outweighed by the overall benefits of the research to society.
As one last note, experimenters should always treat participants with respect. Participants are usually self-conscious because they feel their performance is being evaluated (which it is, in some sense) and they fear that they are not doing well enough. It is the responsibility of the investigator to put participants at ease, assuring them that the system components are being evaluated and not the people themselves. This is one reason that the term user testing has been changed to usability testing to indicate the system, not the person, is the focus of the evaluation.

CONCLUSION
Evaluation completes the understand-create-evaluate cycle, providing an indication of how well the design meets the users’ needs. Evaluation also provides the basis for understanding how the design can be improved, and so also serves as the beginning of the cycle.

ADDITIONAL RESOURCES
[to be added*]


QUESTIONS
1. How is the process of evaluation related to that of understanding in the human factors design cycle?
2. What are the three general purposes of evaluation?
3. Would qualitative or quantitative data be more useful in diagnosing why a design is not performing as expected?
4. Would qualitative or quantitative data be more useful in assessing whether a design meets safety and performance requirements?
5. What is the role of quantitative data and what is the role of qualitative data in system design?
6. Is qualitative data an important part of usability testing given the role of usability testing in the design process?
7. Give an example of qualitative data in evaluating a vehicle entertainment system.
8. Give an example of quantitative data in evaluating a vehicle entertainment system.
9. Describe the role of formative and summative evaluations in design.
10. Identify a method suited to formative evaluation and a method suited to summative evaluation.
11. Identify the evaluation method best suited to early design concepts, to prototypes, to pre-production designs, and to designs that are in service.
12. Why are evaluation methods that do not require human subjects data collection useful in design?
13. Describe two evaluation methods that do not require data collection with human subjects.
14. What is an important limit of both cognitive walkthroughs and heuristic evaluation?
15. Describe the steps of heuristic evaluation.
16. What might differ when applying a heuristic evaluation to the design of a manufacturing cell versus a website?
17. How many analysts should be used to assess a system with heuristic evaluation?
18. What is the main difference between a cognitive walkthrough and a heuristic evaluation?
19. What evaluation techniques would be particularly useful in a scrum development environment?
20. Before a large system developed using a Vee development cycle is deployed, what evaluation technique would you be expected to use?
21. When using scrum in a high-risk domain, what evaluation technique might be difficult to complete even though it might be the right thing to do?
22. How many participants do you need for a usability study?
23. How many usability tests would you recommend as part of an iterative design process?
24. What is the difference between a controlled experiment and a descriptive study?
25. How does a quasi experiment relate to a descriptive study and to an experiment?
26. When would you use a between subjects experimental design and when would you use a within subjects design?
27. How many participants do you need for a controlled experiment?
28. In the evaluation of an entertainment system for a car, what would be dependent variables of interest?
29. What is the benefit of subjective measures?
30. What is a limitation of subjective measures?
31. What is the relationship between a construct and a measure?
32. Describe how the driving performance data in Figure 5 represents a two-way interaction, and what the graph would look like without the interaction.
33. What is the role of descriptive statistics and how does it differ from inferential statistics?
34. Describe a descriptive statistic in assessing the distraction potential of a vehicle entertainment system.
35. Describe an inferential statistic in assessing the distraction potential of a vehicle entertainment system.
36. What inferential statistical approach is most commonly used for multi-factor experiments?
37. What inferential statistical approach is most commonly used for descriptive studies?
38. Describe what is meant by experimental control and its role in designing an experiment, quasi-experiment, and a descriptive study.
39. Describe an example of confounding in a field test of a vehicle entertainment system.
40. What is meant by representative sampling in selecting people, tasks, and situations in designing a study?
41. What is the purpose of representative sampling in selecting people, tasks, and situations in designing a study?
42. Describe the meaning of Type I and Type II errors and their implications for system evaluation.
43. What is the difference between practical and statistical significance?
44. Show why it is useful to consider confidence intervals and not rely on p values.
45. Describe four essential aspects of protecting participants in research.
