All right. I talked fast again; I'm caffeinated. So, let's do a quick recap here. We're going to determine what concept we want to measure and why. We're going to collect qualitative data from people to make sure that we really have defined this thing. Once we've got it defined, we can write items for it. We're going to take the items to the actual types of people who are going to be taking this measure, find out if they understand them, revise the items, and make sure our revisions work with new people. Then we're going to give the items to a whole bunch of people so that we can do psychometric analyses. Those psychometric analyses are going to allow us to test how well our original model fits the data, or whether there's a better model we've got to use. And they're going to allow us to derive a score, right, or maybe a set of scores, but some score that we can use. And then finally, once we're in possession of this, we can take it out and test it to see what kind of reliability and validity it exhibits. Okay?
All right. So, that was the development and evaluation process in a very small nutshell.
So, I'd like to now talk about some very interesting developments in how we develop patient-reported outcome measures, and some of the cool things we can do with them now. It's especially relevant because NIH has actually invested a good deal in developing measures using some of these newer approaches. And remember, when I say new, I mean new for health, but not new for the measurement world, okay? What we're going to be talking about is something called item response theory. Okay? What I want to focus on here is what's different between a regular old measure and a measure developed using item response theory. And then I want to tell you about one very cool little thing you can do called differential item functioning. Okay?
So, let's start by contrasting the different types of measures. Here's your basic off-the-shelf measure. This is the FACIT-Fatigue Scale, developed by Dave Cella and colleagues to measure fatigue, especially in cancer populations. It's a great example of your traditional off-the-shelf PRO measure. I say off-the-shelf because you just go and grab it and you're ready to go.
Everyone's going to complete the same items on the FACIT-Fatigue, all right? All of the items are necessary to obtain a score. You've got to complete all of the items, because you average or sum up all the items to get the score, all right?
And the score on the FACIT-Fatigue might not be on the same metric as other measures of the same thing. For example, the SF-36 has an energy domain that's supposed to be measuring about the same thing. But those scores are on a different metric: the absolute value of a score means something different for the FACIT measure than for the SF-36, right? It's kind of like having two different thermometers that are both supposed to be measuring temperature, but one gives it in one set of numbers and the other gives it in another set of numbers, and they aren't directly comparable. Okay? This is what we get with traditional off-the-shelf measures.
In contrast, when we use item response theory, we're able to create something called an item bank. An item bank is a large collection of items measuring a single domain, like fatigue or pain intensity. Okay? Any and all of the items from the bank can be used to provide a score for that domain. So, you can pluck out one item and estimate a score on fatigue.
Or you can pluck out 40 items, combine them all, and estimate a score on fatigue. Okay?
That means that item banks can be dynamic, not fixed. We can add items to them over time if we end up getting some better items. We can swap in some new items and maybe get rid of some items that don't seem to be quite as useful. So, it's a dynamic resource. All right.
Let's look at an item bank for the domain of physical functioning, just as an example. So, we know there's this range of physical functioning out there in the world, and we would like to systematically measure that range, right? To do that, we will create an item bank for physical functioning. In this case, the item bank is a collection of items that all measure physical function. Okay? And so, we've got examples here: "Are you able to get in and out of bed?" "Are you able to stand without losing your balance for 1 minute?" "Are you able to run 5 miles?" Any item or set of items from this bank could be used to generate a score for a person for physical functioning. Any of them. And why is that?
It's because we've put these items through the wringer to make sure that they fit an item response theory model. It's a statistical model that allows all of these properties to hold, okay?
And note here, too, that these items, as I've laid them out, are arrayed in order of severity, all right? "Are you able to get in and out of bed?" That's the kind of question you'd ask of a person who looked like they were incredibly weak, right? That's an item that's good for discriminating among people who are at the lower end of functioning. "Are you able to run 5 miles?" is something that would be informative to ask of a person who's at the higher end.
And this tells us another interesting property of item banks: with item response theory, not all items are the same. Some items are better used for lower severity, others for higher severity. When we're thinking about which items we might want, we have to think about the population we're using them with. And we have the option to tailor our measures. More about that in a second. Okay?
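To make "some items are better for lower or higher severity" concrete: under a two-parameter logistic (2PL) item response theory model, each item has a location b (where on the scale it sits) and a discrimination a, and the item's Fisher information peaks near b. The minimal sketch below assumes a hypothetical physical-function bank; only the bed, balance, and 5-mile items come from the talk, and every parameter value is invented. It ranks items by information at a target level, which is one simple way to tailor a fixed-length form.

```python
import math

# Hypothetical physical-function bank: (item, discrimination a, location b).
# Higher theta = better functioning. Only the bed, balance, and 5-mile items
# come from the talk; the other items and all parameter values are invented.
BANK = [
    ("get in and out of bed", 2.0, -2.5),
    ("stand for 1 minute without losing balance", 2.0, -1.5),
    ("walk up a flight of stairs", 2.0, -0.5),
    ("carry a bag of groceries", 2.0, 0.0),
    ("do yard work", 2.0, 0.5),
    ("run 5 miles", 2.0, 2.5),
]


def p_yes(theta, a, b):
    """2PL probability of answering 'yes, I am able to' at level theta."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def information(theta, a, b):
    """Fisher information of a 2PL item at theta: a^2 * P * (1 - P)."""
    p = p_yes(theta, a, b)
    return a * a * p * (1.0 - p)


def most_informative(bank, target_theta, k):
    """Return the k items that measure most precisely near target_theta."""
    ranked = sorted(bank,
                    key=lambda item: information(target_theta, item[1], item[2]),
                    reverse=True)
    return [label for label, _, _ in ranked[:k]]


if __name__ == "__main__":
    # A severely limited population (around theta = -2) versus a fit one (+2).
    print(most_informative(BANK, -2.0, 2))
    print(most_informative(BANK, 2.0, 2))
```

For a population centered near theta = -2, the bed and balance items rank highest, while "run 5 miles" carries almost no information there, which is the intuition behind not asking very weak patients about running.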
So, back to off-the-shelf. If I just use my traditional off-the-shelf measure, I take it off the shelf and give it to the patient, or send a web link, or however it is, and they all fill out the same measure and that's it. Okay? So, it's really easy to figure out what to do: you take the off-the-shelf measure and administer it.
It's not as straightforward with an item bank, because you've got some decisions to make, all right? Let's go through those. Basically, I've got one of two main strategies: fixed-length forms or computerized adaptive tests.
With fixed-length measures, everyone is going to get the same set of items. It's a fixed length for everyone, all right? And I've got a couple of different options there. There are some very handy ready-made fixed-length measures. So, if you want to use measures from the NIH PROMIS system, you can go to the Assessment Center website and find ready-made short forms. These are forms with about six to 10 items per domain that are good for general populations. They're generally sensitive across the range of functioning that the PROMIS folks have picked.
They're sort of a good off-the-shelf version of the PROMIS measures. So, there are some ready-made fixed-length measures, all right? I could also, if I've achieved the requisite level of geekiness, go into the system and pick out the items that I think are going to be most sensitive for my population, all right? So, if I've got a trial with severely ill heart failure patients, I don't want to be asking people how much trouble they have running 5 miles. It's a waste of their time, right? I'm going to go in and select the items that are going to be most sensitive for my population, and I'm going to give everyone in my trial those items, all right? Okay. So, I can tailor them; I can make my own. Those are both versions of fixed-length measures.
The other main way of using items from an item bank is through computerized adaptive tests. A lot of you might be familiar with computerized adaptive testing, because it's used a lot in education, like the SAT and GRE, in a lot of the certification tests in medicine and nursing, and in government safety tests.
What these tests do is draw on big banks of items: you sit down at a computer and it shows you one item at a time, all right? So, same deal here. The idea is that the next item you receive depends on your answers to previous items, all right? So, if I've got a range of items across the spectrum of severity, a computerized adaptive test, or CAT, might start with one right in the middle. If I seem to have a lot of difficulty with that one, it's going to move toward the more severe side and say, "This guy sounds more severe. Let me ask him a question on the more severe end and see how he answers." Depending on that answer, it might say, "Oh, it seems like he's around here. I'm going to ask a question here." And so, in some ways it's very similar to the way a clinician might ask questions of a patient: start with a general one, and then, as you get a sense of how severe or mild they are, ask more questions to get a more precise picture of what that person looks like.
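The adaptive loop described above can be sketched in a few lines. This is a minimal illustration, not PROMIS's actual CAT engine: it assumes a hypothetical bank of yes/no items under a 2PL model with invented parameters, a standard normal prior, expected a posteriori (EAP) scoring on a grid, and maximum-information item selection, which are common textbook choices. It starts in the middle of the scale, re-estimates after every answer, and stops after four items or once the posterior standard deviation is small enough.

```python
import math

# Hypothetical bank: (discrimination a, location b) for yes/no items.
BANK = [(2.0, b) for b in (-3.0, -2.0, -1.5, -1.0, -0.5, 0.0,
                           0.5, 1.0, 1.5, 2.0, 3.0)]
GRID = [i / 10.0 for i in range(-40, 41)]  # theta grid from -4 to 4


def p_yes(theta, a, b):
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def information(theta, a, b):
    p = p_yes(theta, a, b)
    return a * a * p * (1.0 - p)


def eap(responses):
    """EAP estimate and posterior SD under a N(0, 1) prior.

    responses: list of (item_index, answered_yes) pairs.
    """
    weights = []
    for theta in GRID:
        w = math.exp(-0.5 * theta * theta)  # unnormalized normal prior
        for idx, yes in responses:
            p = p_yes(theta, *BANK[idx])
            w *= p if yes else (1.0 - p)
        weights.append(w)
    total = sum(weights)
    mean = sum(t * w for t, w in zip(GRID, weights)) / total
    var = sum((t - mean) ** 2 * w for t, w in zip(GRID, weights)) / total
    return mean, math.sqrt(var)


def run_cat(answer, max_items=4, target_se=0.3):
    """Give the unused item with the most information at the current estimate,
    re-score, and repeat until max_items or the SD falls below target_se."""
    responses = []
    theta, se = 0.0, 1.0  # start in the middle of the scale
    while len(responses) < max_items and se > target_se:
        used = {idx for idx, _ in responses}
        nxt = max((i for i in range(len(BANK)) if i not in used),
                  key=lambda i: information(theta, *BANK[i]))
        responses.append((nxt, answer(nxt)))
        theta, se = eap(responses)
    return theta, se, [idx for idx, _ in responses]


if __name__ == "__main__":
    # Deterministic simulated respondents: say "yes" whenever true theta > b.
    for true_theta in (-2.0, 2.0):
        est, se, items = run_cat(lambda i: true_theta > BANK[i][1])
        print(true_theta, round(est, 2), round(se, 2), items)
```

Both simulated respondents get the same first item (the one located at the middle of the scale), then diverge toward opposite ends of the bank, so different people end up with different item sets.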
But the idea with a CAT is that, in theory, every person could get a different set of items, right? That's because the items that get pulled out of that big item bank are the ones that get us to the most precise estimate of your score as quickly as possible. All right? And we often find that you can get a very precise estimate of people's scores by asking them only four questions. But they have to be the right four questions. And how do you figure out which are the right four questions? Well, you have an algorithm, a computerized adaptive testing algorithm, that draws items from the item bank. All right?
Having an item bank with these different options allows a few things to happen. Imagine we've got this fatigue item bank here, and it has, say, 40 items in it. Well, I could use items from that item bank in all different settings, right? And I could use them in different ways, too. So, the chemotherapy trial might use items one through 10. For the osteoarthritis trial, I might be using a CAT, a computerized adaptive test, right? And what's great is that because all of the items being pulled for all these different studies came from the same item bank, they all have the same metric, the same meaning.
193 00:12:15,968 --> 00:12:20,840 The scores are comparable across all of those different studies, all right? 194 00:12:20,840 --> 00:12:25,344 You can imagine how that might help facilitating things like meta-analysis 195 00:12:25,344 --> 00:12:28,581 and the like. Here's another great application. 196 00:12:28,581 --> 00:12:33,853 Right now, there's a lot of interest in pragmatic clinical trials or comparative 197 00:12:33,853 --> 00:12:38,758 effectiveness research where we're trying to do a research within healthcare systems. 198 00:12:39,625 --> 00:12:45,297 I might have different healthcare systems each of whom are using different measures, 199 00:12:45,297 --> 00:12:46,165 all right? 200 00:12:46,165 --> 00:12:51,403 Well, right now, each of those measures results in a different metric, 201 00:12:51,403 --> 00:12:54,774 and it's really difficult to pull results then. 202 00:12:55,040 --> 00:12:59,278 If, however, all of those items from the different measures 203 00:12:59,278 --> 00:13:05,818 could be placed within an item bank, which means we do the kinds of item 204 00:13:05,818 --> 00:13:10,623 response theory analyses that test whether these guys can all sit 205 00:13:10,623 --> 00:13:15,561 in the same bank, and maybe we've got some other items in there, 206 00:13:15,561 --> 00:13:20,399 now we're in a great position because we can get one metric out of that. 207 00:13:20,399 --> 00:13:20,733 Right? 208 00:13:20,733 --> 00:13:21,367 So, now, 209 00:13:21,367 --> 00:13:25,271 even though different healthcare systems might be using different sets of items, 210 00:13:25,271 --> 00:13:30,442 if they're all on the same item bank, we can -- we've got a cross-walk going 211 00:13:30,442 --> 00:13:35,181 and we can get the same metric out of all of them, right? 212 00:13:35,181 --> 00:13:35,881 Okay. 
So, a quick review here comparing traditional PRO measures, the off-the-shelf measures, with PRO item banks. With traditional measures, all items are required to compute a score. With item banks, any subset of the items can generate a score. With traditional measures, everybody has to take the same items. With an item bank, different people can get different items. With a traditional measure, you can just pull it off the shelf. But when we've got an item bank, you've got to use items in the bank to create measures for a specific use. And we went through some of the different options there, right: fixed-length or CATs. If it's fixed-length, you can get ready-made ones that someone might have created for you, or you can go and create your own custom one. And finally, with traditional PRO measures, the scores are not easily comparable to scores from other measures of the same thing. With an item bank, there's much greater opportunity for a crosswalk between scores from different measures, as long as the items from the different measures can be put in the same bank. Okay? All right.
I mentioned that NIH has invested in item banks quite heavily, and to great benefit, I think. A few examples here.
The biggest one is probably PROMIS, the Patient-Reported Outcomes Measurement Information System. Over about 10 years of development, PROMIS has amassed quite a few excellent measures of the main types of health concepts that are relevant across chronic diseases, for adults and for children and adolescents. They're freely available. You can go to nihpromis.org for more information. NIH also supported the NIH Toolbox, which is for assessing neurological and behavioral function in ages 3 to 85. Again, it's freely available; you can go to nihtoolbox.org. And finally, Neuro-QoL is for use in chronic neurological conditions. It's free, and you can go to neuroqol.org. So, you might want to check those out, and you can get more information from those sites about how item banks work, all right?
I said that there was a very cool thing you can do. There are a lot of cool things you can do with item response theory, but I wanted to tell you about a specific one: differential item functioning, okay? So, check out this very simple question, all right? "In the past seven days, did you cry? Yes or no?" I've just made a yes-no version of a kind of item you often see in depression measures, all right?
256 00:16:28,654 --> 00:16:33,826 And I think that we would all agree that, in general, responses to this item 257 00:16:33,826 --> 00:16:36,929 might inform us to some degree about the likelihood 258 00:16:36,929 --> 00:16:40,366 that a person might be a bit depressed, all right? 259 00:16:40,366 --> 00:16:44,136 The more likely you already have cried in the last week, 260 00:16:44,136 --> 00:16:47,239 the more likely you are to have been depressed. 261 00:16:47,239 --> 00:16:50,509 This has been shown to be the case, okay? 262 00:16:50,509 --> 00:16:54,446 Imagine that we wanted to compare men and women's depression though, 263 00:16:54,446 --> 00:16:58,917 and we had an item like this and a few other similar items. 264 00:16:58,951 --> 00:17:03,088 We found that men and women differ in their depression levels. 265 00:17:03,088 --> 00:17:07,626 What we'd like to know is that men and women truly do 266 00:17:07,626 --> 00:17:11,030 differ in their underlying depression levels and not that 267 00:17:11,030 --> 00:17:15,534 they actually differ in the way the items work for them, okay? 268 00:17:15,534 --> 00:17:17,803 So, hang on for a second. 269 00:17:17,803 --> 00:17:20,072 We're talking about differential item functioning. 270 00:17:20,072 --> 00:17:24,243 It means the item behaves differently for two or more groups. 271 00:17:25,110 --> 00:17:26,845 So, check this out. 272 00:17:26,845 --> 00:17:30,749 There's a very simple item response theory curve here. 273 00:17:30,749 --> 00:17:32,618 And you can see 274 00:17:32,618 --> 00:17:38,123 we're modeling the probability of answering yes to that question, on the y-axis, 275 00:17:38,123 --> 00:17:42,895 as a function of the person's underlying depression, on the x-axis. 276 00:17:42,895 --> 00:17:43,328 Okay? 277 00:17:43,328 --> 00:17:48,100 This, by the way, is why it's called item response theory. 
278 00:17:48,100 --> 00:17:51,136 They're models of the likelihood of responding 279 00:17:51,136 --> 00:17:55,908 with a given answer based on your underlying level of whatever. 280 00:17:55,908 --> 00:17:58,310 In this case, depression, okay? 281 00:17:58,310 --> 00:18:04,716 So -- and you can see here as the person's depression -- as depression increases, so too 282 00:18:04,716 --> 00:18:10,923 does the probability they're going to answer yes to "Did you cry in the past week?" Okay? 283 00:18:10,923 --> 00:18:15,694 And so, what we're saying when we say that this is informative about 284 00:18:15,694 --> 00:18:20,432 depression, is we're saying, well, we know that of people who answer yes, 285 00:18:20,432 --> 00:18:25,170 those folks are most likely to be on the high end of depression. 286 00:18:25,704 --> 00:18:31,143 So, we can make an inference that they are on the higher end of depression. 287 00:18:31,143 --> 00:18:33,312 That's how we're using that item. 288 00:18:33,312 --> 00:18:39,118 This is a map that maps the responses to that item to their underlying depression, okay? 289 00:18:39,118 --> 00:18:44,923 And so, we take someone who is, say, one standard deviation above the mean of 0 290 00:18:44,923 --> 00:18:49,895 on depression, they've got a 0.9 probability of answering yes on that, right? 291 00:18:49,895 --> 00:18:50,496 Okay. 292 00:18:50,496 --> 00:18:54,566 So, we've said that differential item functioning is what happens 293 00:18:54,566 --> 00:18:58,637 when an item behaves differently for two or more groups. 294 00:18:58,637 --> 00:19:03,108 Another way of saying that is that the map between depression 295 00:19:03,108 --> 00:19:07,179 and the item is different for two or more groups. 296 00:19:07,179 --> 00:19:11,850 I was talking about men and women with this crying item. 297 00:19:11,850 --> 00:19:14,119 Okay. Let's look at this. 298 00:19:14,186 --> 00:19:18,924 This is the map computed separately for females and males. 
So, what is the two-group map showing here? Suppose we've got a male who we know, because we've suddenly become omniscient for just a second, is one standard deviation above the mean on depression. His probability of answering yes is 0.3, right? That's a 30 percent chance that he's going to say, "Yes, I've cried in the past week." A woman who is one standard deviation above the mean, though, has a 0.9 probability of answering yes, right? Now, why? On average, women report that they cried in the past week more often than men, independent of their underlying level of depression, right? But you can see how messed up things would get if you used just one version of the map and applied it the same way to males and females.
So, this is an example of differential item functioning. When we're developing measures using item response theory, one of the things that's really important for us to do is to ask, "Are there groups for whom these items might behave differently? Would they behave differently for younger versus older people? Would they behave differently for different cultural groups?" Right? And we can explicitly test for that using item response theory by conducting tests of differential item functioning.
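One simple way to flag such items is an area-between-curves index: calibrate the item separately in each group and measure how far apart the two curves are (Raju's area measure is the classic version of this idea). The sketch below assumes hypothetical per-group 2PL estimates (invented, with the crying item roughly matching the curves described above) and a hypothetical cutoff of 0.5; it computes an effect-size style index only, whereas formal DIF procedures (IRT likelihood-ratio, Mantel-Haenszel, logistic regression) add significance testing.

```python
import math


def p_yes(theta, a, b):
    """2PL probability of a 'yes' at depression level theta."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def unsigned_area(ref, focal, lo=-8.0, hi=8.0, steps=16000):
    """Area between two groups' item curves, by the midpoint rule on [lo, hi]."""
    width = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        theta = lo + (k + 0.5) * width
        total += abs(p_yes(theta, *ref) - p_yes(theta, *focal))
    return total * width


# Hypothetical (a, b) estimates calibrated separately for each group.
ITEMS = {
    "cried in past 7 days": {"female": (2.0, -0.10), "male": (2.0, 1.42)},
    "felt sad": {"female": (1.8, 0.20), "male": (1.8, 0.20)},
    "lost interest": {"female": (1.6, 0.50), "male": (1.6, 0.60)},
}


def flag_dif(items, threshold=0.5):
    """Return items whose group curves differ by more than a hypothetical cutoff."""
    return [name for name, groups in items.items()
            if unsigned_area(groups["female"], groups["male"]) > threshold]


if __name__ == "__main__":
    for name, groups in ITEMS.items():
        print(name, round(unsigned_area(groups["female"], groups["male"]), 3))
    print("flagged:", flag_dif(ITEMS))
```

When both groups share the same discrimination, the area equals the difference in locations, so the crying item's index is about 1.52 while the other two items stay near zero. A flagged item can then be dropped or kept with group-specific parameters.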
Those tests will identify for us the items that are exhibiting differential item functioning. And then we've got a choice. If we've got other items that are just as good but don't exhibit it, we can chuck those items and just get rid of them. Or we can say, "You know what? We're not going to get rid of it, because it's a really informative item. But now we know how we need to interpret it: we need to score it differently for men and women." Still useful, but we've got to score it differently, right?
By the way, when you see tests for potential racial bias in achievement tests and IQ tests, what they're doing is looking for differential item functioning. They're looking to see whether two people who are at the exact same underlying level of analytic reasoning, for example, have different probabilities of getting an item correct, maybe because of different cultural backgrounds, right? So, differential item functioning is very important. Isn't it really a terrific technique we've got?
Okay. So, that was our quick little foray into some of the things that item response theory has brought us, namely measures based on item banks, right? And how they differ from your traditional off-the-shelf measures.
And some other really cool tools, like being able to assess differential item functioning. Okay?