>> FULL PROFESSOR AT GALLAUDET UNIVERSITY, JOURNALIST, PUBLISHED IN NATURE, NEW SCIENTIST AND OTHERS. HER ARTICLE ABOUT THE 2014 EXCELLENCE IN STATISTICAL FACILITATED ASA WORKING GROUP CONFERENCE THAT PRODUCED THE P-VALUE CONSENSUS STATEMENT. REGINA IS AN AMAZING SCIENTIST AND SCIENCE COMMUNICATOR. WE'RE LUCKY TO HAVE HER TALKING ABOUT PITFALLS OF INTERPRETING P-VALUES. REGINA? >> THANK YOU. CAN YOU HEAR ME OKAY? OKAY. SO BEFORE WE GET STARTED, I ACTUALLY THANK YOU FOR THE INVITATION TO TALK ABOUT ONE OF MY FAVORITE SUBJECTS, P-VALUES. AND MY SECOND FAVORITE SUBJECT, FOOLING YOURSELF. BEFORE WE GET STARTED, A PRIOR -- ON WHO THE AUDIENCE IS, WHO SELF IDENTIFIES AS A STATISTICIAN HERE? MMM, SMALL NUMBER. WHAT ABOUT SELF IDENTIFIES AS A DATA SCIENTIST? NICE. NEUROSCIENTIST? ANY OTHER CATEGORY IN THERE? CLINICIAN OR -- ANY OTHER CATEGORY? ALL RIGHT. WELL, LET'S GET INTO IT. I HAVE A DISCLOSURE. I SELF IDENTIFY AS A CYBORG. [LAUGHTER] I WAS BORN MOSTLY DEAF, AND THEN OVER TIME MY HEARING GOT WORSE AND WORSE, AND TWO YEARS AGO I GOT A COCHLEAR IMPLANT, SO I'M A 2-YEAR-OLD TODDLER CYBORG. AND IT'S BEEN PRETTY AMAZING. AND BRAIN PEOPLE UNDERSTAND WHAT IT WOULD BE LIKE TO START TO LEARN TO HEAR NEW SOUNDS IN YOUR MID-40s AFTER NEVER HAVING HEARD THEM BEFORE. THAT'S A LITTLE BIT OF EXPLANATION WHY YOU HEAR A DEAF ACCENT, AND IF I CAN'T -- I CAN'T SEE SO THERE'S NO WAY I'M GOING TO LOOK RIGHT NOW, IF I CAN'T HEAR QUESTIONS FROM THE AUDIENCE I'M JUST GOING TO ASK ADAM TO REPEAT. I MIGHT BE ABLE TO GET HIM. YES, I'M STILL WAITING FOR MY FANCY FACE PLATE. [LAUGHTER] TODAY I'M GOING TO BE TALKING ABOUT SOME NEW MATERIAL I HAVEN'T PRESENTED BEFORE. SO YOU GUYS ARE MY GUINEA PIGS. AND THAT MEANS IT'S POSSIBLE A SLIDE IS WRONG, OR JUST CONFUSING, SO IF YOU NEED TO, GO AHEAD AND INTERRUPT ME THERE. BUT IF IT'S QUESTIONS ABOUT EXTENSIONS OR OTHER THINGS, THEN WE'LL SAVE THAT FOR THE END AND SEE IF WE CAN HAVE SOME DISCUSSION. ALL RIGHT.
IN CASE YOU HAVE BEEN LIVING UNDER A ROCK FOR THE PAST DECADE OR SO, APPARENTLY WE HAVE A REPRODUCIBILITY CRISIS IN SCIENCE. IN 2005 JOHN IOANNIDIS SAID EVERYTHING -- MOST THINGS WE PUBLISH ARE FALSE, AND SO THE WORLD AS WE KNOW IT IS GOING TO END. [LAUGHTER] THAT JUST SET OFF THIS, YOU KNOW, MEDIA STORM OF PROBLEMS. BUT THAT'S SCIENCE. WHAT ABOUT REAL LIFE? WHEN IN DOUBT, TURN TO XKCD, RIGHT? I'M PSYCHIC, YOU KNOW. ADAM SAID, NO, NO, NO SUCH THING. ALL RIGHT. THINK OF A NUMBER. 1 TO 100. GO AHEAD. PICK A NUMBER 1 TO 100. 28. >> NO. >> SO DARN. OKAY. IN ANOTHER, ALTERNATIVE UNIVERSE, I WAS RIGHT, AND ADAM SAYS HOLY SHIT, THAT'S AMAZING. I'M LIKE YEAH, YEAH, YEAH, YOU KNOW, I'M GOOD BUT I TRY NOT TO LET IT BOTHER ME. THIS TRICK ONLY WORKS 1% OF THE TIME BUT WHEN IT DOES IT'S TOTALLY WORTH IT. RANDALL MUNROE IS A GENIUS IN ADDITION TO BEING HILARIOUS. WE'LL COME BACK, IT'S FUNNY. REPRODUCIBILITY CRISIS, LOTS OF THINGS GOING ON IN THERE. THE REWARD SYSTEM, YOU KNOW, OUR MODERN SCIENTIFIC ENTERPRISE, LOTS OF THINGS. ONE PROBLEM IS P-VALUES. NOT JUST P-HACKING BUT THE FACT THAT OUR STATISTICAL TOOLS REALLY WERE NEVER DESIGNED TO BE USED THE WAY THAT WE'RE USING THEM TODAY. SO WE'RE TAKING A PERFECTLY GOOD SCREWDRIVER, WHACKING AT A NAIL, AND THEN COMPLAINING THAT IT'S NOT MAKING A VERY GOOD HAMMER. SO OF COURSE IT'S NOT. ONE OF MY FAVORITE QUOTES ABOUT P-VALUES, FROM A PSYCHOLOGIST: THE P-VALUE DOES NOT TELL US WHAT WE WANT TO KNOW, AND WE SO MUCH REALLY WANT TO KNOW WHAT WE WANT TO KNOW THAT OUT OF DESPERATION WE NEVERTHELESS BELIEVE THAT IT DOES. MAYBE WE NEED TO WRITE A POSITION PAPER. STATISTICIANS HAVE BEEN COMPLAINING ABOUT THE WAY PEOPLE ARE USING P-VALUES FOR A LONG TIME. AND SO THIS IS PUBLISHED 2016, I ENCOURAGE YOU TO READ IT. IT ARTICULATES IN NON-TECHNICAL TERMS A FEW SELECT PRINCIPLES.
FAST FORWARD, JULY, A HOST OF STATISTICIANS AND SCIENTISTS POSTED A PRE-PRINT SAYING WE WANT TO CHANGE THE .05 THRESHOLD TO .005, AND THEY HAVE VERY GOOD STATISTICAL REASONS FOR DOING SO THAT I THINK -- I DON'T KNOW IF YOU SAW THIS, MOST OF THE ARTICLES IN THE MEDIA COMPLETELY MISSED THE REAL UNDERLYING POINT. I'M GOING TO SEE IF I CAN CONVINCE YOU .005 IS QUALITATIVELY DIFFERENT AND NOT JUST CHANGING THE GOAL POST. I'M NOT A PRACTICING STATISTICIAN, I'M MORE OF A STATISTICAL COMMUNICATOR. I THINK IT'S QUITE IMPORTANT. THE WORDS WE USE TO TALK ABOUT P-VALUES AND STATISTICAL RESULTS MAKE A DIFFERENCE, HUMAN-CENTERED ANALOGIES DO TOO. WHAT DO I MEAN? FIRST OF ALL, I WAS VERY SADDENED BECAUSE I HELPED TO DRAFT THIS P-VALUE STATEMENT THE ASA CAME UP WITH, AND THIS WAS THE BEST WE COULD DO TO SAY WHAT A P-VALUE WAS. P-VALUES CAN INDICATE HOW INCOMPATIBLE THE DATA ARE WITH A SPECIFIED STATISTICAL MODEL. SCIENTISTS WERE EXCITED STATISTICIANS WERE DEMYSTIFYING P-VALUES, THEY OPENED IT UP AND GOT THIS. YOU HAVE NO IDEA HOW HARD IT WAS TO GET 25 STATISTICIANS TO AGREE ON EVEN THIS. WHAT DO THEY SAY, A PLURALITY OF STATISTICIANS IS NO -- IT'S MISERABLE. KEVIN AT MOTHER JONES READ THIS, JUSTIFIABLY UNHAPPY, AND SAID -- THIS IS A SUMMARY. HOW HARD CAN IT BE? THE PROBABILITY THE SMALL DATASET VALIDATED YOUR HYPOTHESIS, NICE AND SIMPLE, RIGHT? I SEE SOMEONE SHAKING THEIR HEAD. IT'S COMPLETELY WRONG. SIMPLE, WRONG, COMPLICATED, RIGHT? LET'S SEE. SO THAT GROUP THAT HAD THE PRE-PRINT, THEY ARE A PRECISE BUNCH OF PEOPLE, DEFINED IT AS THE PROBABILITY, UNDER THE NULL HYPOTHESIS, THAT A TEST STATISTIC IS AS EXTREME OR MORE EXTREME THAN THE ONE OBSERVED. COMPLETELY RIGHT AND UTTERLY UNHELPFUL IN A REALISTIC WAY, RIGHT? SO I SAY WE SHOULD GO BACK TO RANDALL MUNROE, HOLY SHITNESS. THAT'S NOT GOING TO PASS! SO I HAVE AN INDEX OF SURPRISE. INSIDE YOU SHOULD BE THINKING HOLY SHITNESS. A BELLWETHER AND A NUDGE FOR PLAUSIBILITY, BUT NOT STRENGTH OF EVIDENCE FOR YOUR HYPOTHESIS, WE'LL SEE WHAT THAT LOOKS LIKE.
LET'S TALK ABOUT THE INDEX OF SURPRISE, HOW SURPRISED ARE YOU BY EVIDENCE OF MY PSYCHIC MIND-READING ABILITY? LET'S TRY IT. DO YOU HAVE PSYCHIC MIND-READING ABILITIES? I'M THINKING OF A NUMBER RIGHT NOW. MAKE A GUESS. HAVE YOU GUESSED? OKAY. 1 TO 100. I GOT A NUMBER IN MIND. OKAY, 99. DID ANYONE GET IT RIGHT? ALL RIGHT! HOW MANY PEOPLE ARE IN HERE? 100? SOMETHING LIKE THAT? OKAY. SO WHAT IS -- ALL RIGHT. WELL, LET'S SAY WE TRY ANOTHER ONE. PICK A NUMBER 1-10. ALL RIGHT. I'VE GOT MY NUMBER. DO YOU HAVE YOUR GUESS? READY? ONE. ANYONE GET IT RIGHT? ONLY ONE? [ LAUGHTER ] REALLY? NO ONE ELSE PICKED ONE? ALL RIGHT. THAT'S KIND OF AMAZING. ALL RIGHT. THAT ONE IS NOT AS CONVINCING, RIGHT? LET'S SAY I PICK THREE NUMBERS 1-10. AND GET THEM ALL RIGHT. THIS IS MUCH MORE EVIDENCE, RIGHT? MUCH MORE CONVINCING. OR WHAT ABOUT IF I ONLY GET TWO OUT OF THREE RIGHT? SO GOING BACK TO THE RESULTS OF THE FIRST EXPERIMENT, DOES THAT SEEM, IF THIS WERE AN EXPERIMENT, A SCIENTIFIC EXPERIMENT, DID THAT FEEL LIKE, YEAH, YOU KNOW, HOLY SHIT LEVEL, IMPLAUSIBLE, PUBLISHABLE KIND OF THING? WHO FEELS THAT WOULD BE GOOD STRONG EVIDENCE? YOU GUYS ARE NOT INTERACTIVE TODAY. [LAUGHTER] ALL RIGHT. SKIP ON AHEAD. FOR THE FIRST ONE, THE P-VALUE WAS 1 OUT OF 100. ALL RIGHT, I'M GOING TO BE SUSPICIOUS THAT HOLY SHIT IF YOU'RE JUST GUESSING AND GET IT RIGHT, IT'S 1 OUT OF 100, .01, HIGHLY SIGNIFICANT UNDER P-VALUE RULES. THE SECOND ONE. 1 OUT OF 10, 1 OUT OF 10, NOT SO MUCH. .1. WE'RE NOT GOING TO COUNT THAT. ALL RIGHT. NEXT ONE. THREE NUMBERS. I GET THEM ALL RIGHT. .001. BUT IF I ONLY GET TWO OUT OF THREE RIGHT, .028. SO THAT'S KIND OF MIDDLING AS FAR AS EVIDENCE. THIS IS WHAT A P-VALUE IS REALLY DOING. IT'S SAYING, OKAY, LET'S ASSUME NOTHING IS HAPPENING, I'M GOING TO BE SKEPTICAL. HOW SURPRISED SHOULD I BE BY THESE RESULTS? SO I KIND OF LIKE THINKING ABOUT IT AS THIS INDEX OF SURPRISE. THIS IS A REALLY UGLY TABLE, SO YOU'LL HAVE TO BEAR WITH ME ON THIS.
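The four guessing-game p-values quoted above (.01, .1, .001, .028) can be checked with a binomial tail probability. This is only a sketch: the function name `tail_prob` is hypothetical, and it assumes "two out of three right" means at least two correct guesses when every guess is pure chance.

```python
from math import comb

def tail_prob(n_guesses, n_correct, p_hit):
    """P(at least n_correct hits in n_guesses) if every guess is pure chance."""
    return sum(
        comb(n_guesses, k) * p_hit**k * (1 - p_hit) ** (n_guesses - k)
        for k in range(n_correct, n_guesses + 1)
    )

# Assumed reading of the four scenarios in the talk.
scenarios = {
    "one guess, 1 to 100": tail_prob(1, 1, 1 / 100),
    "one guess, 1 to 10": tail_prob(1, 1, 1 / 10),
    "three guesses, all right": tail_prob(3, 3, 1 / 10),
    "three guesses, at least two right": tail_prob(3, 2, 1 / 10),
}
for label, p in scenarios.items():
    print(f"{label}: p = {p:.3f}")
```

The last case adds the chance of exactly two hits (.027) to the chance of three hits (.001), which is where the .028 comes from.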
WE HAVE THE P-VALUE, WHY DON'T WE TRANSLATE IT SO WE HAVE THIS GUT FEELING ABOUT WHAT THAT LEVEL OF SURPRISE REALLY MEANS AND PUT IT INTO OTHER SORT OF UNITS? SO IF YOU HAVE A P-VALUE OF .5, THAT WOULD BE THE EQUIVALENT OF ONE COIN FLIP, RIGHT? DOES THAT MAKE SENSE WHY THAT IS? OKAY. SO .17 WOULD BE THE SAME AS IF YOU GUESS CORRECTLY ON ONE OUT OF SIX. AND .1, THAT WOULD BE THE EASY, THE ONE OUT OF TEN. IT'S NOT UNTIL WE GET TO .01 THAT WE GET THE XKCD PSYCHIC GUESS. WHEN WE TALK ABOUT THIS EVIDENCE FOR .05, IT'S REALLY -- YOU SHOULD BE JUST AS SURPRISED BY THOSE RESULTS AS IF YOU WERE TO PREDICT IN ADVANCE COIN FLIPS AND GET BETWEEN FOUR AND FIVE RIGHT. IT'S A PRETTY INTERESTING EXPERIMENT TO GO TO AN ONLINE SIMULATOR. YOU CAN TRY IT, YOU KNOW, AND SAY, OKAY, HOW OFTEN AM I REALLY GOING TO -- I TRIED DOING THIS, IT'S A FUN EXPERIMENT TO SAY, OKAY, FLIP FOUR, DID I GET IT RIGHT? FLIP FOUR AGAIN AND KEEP GOING UNTIL YOU ACTUALLY GET IT RIGHT. YOU'RE LIKE, OH, THAT'S THAT GUT FEELING OF SURPRISE THAT SHOULD BE ACCOMPANYING MY P-VALUE OF .05, A NIFTY WAY OF THINKING ABOUT THINGS. .005 WOULD BE LIKE GETTING THREE DIE ROLLS CORRECT. .08 IS IF YOU HAD A DECK OF CARDS AND YOU WANTED TO GUESS THE VALUE ON THAT. SO WHAT IS REALLY NEAT IS YOU CAN START MIXING AND MATCHING. WE'RE TALKING ABOUT DIFFERENT UNITS. THE PROBLEM IS PROBABILITY ISN'T REALLY GREAT ON A HUMAN SCALE. WE DON'T HAVE A GOOD ANCHOR FOR IT. IT'S MUCH BETTER. WE HAVE A MORE INTUITIVE UNDERSTANDING OF GAMBLING, GAMES OF CHANCE. .5 IS LIKE A COIN FLIP. .01 IS A DIE ROLL AND FOUR COIN FLIPS. TO ME, THIS MAKES -- IT HAS A LOT MORE INTUITIVE UNDERSTANDING. .005, WHAT THE SCIENTISTS ARE NOW RECOMMENDING, IT WOULD BE GUESSING ONE CARD RIGHT AND FOUR COIN FLIPS, OR ONE DIE ROLL AND FIVE COIN FLIPS. WE'LL TALK ABOUT WHY THAT'S ADDITIVE LATER. IF WE STARTED TO THINK ABOUT P-VALUES IN TERMS OF THIS, THESE HUMAN-ANCHORED, HUMAN-CENTERED UNITS, WE MIGHT DO BETTER WITH UNDERSTANDING THAT SURPRISE OF A P-VALUE.
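The translation from a p-value into "coin flips" can be sketched as a base-2 logarithm: one correctly called coin flip has probability 1/2, so a p-value is worth -log2(p) flips. The helper name `coin_flips` is hypothetical, and the mixed units (die roll, card value) assume fair six-sided dice and 13 card values.

```python
from math import log2

def coin_flips(p):
    """Number of correctly called fair coin flips that is as surprising as probability p."""
    return -log2(p)

for p in (0.5, 0.17, 0.1, 0.05, 0.01, 0.005):
    print(f"p = {p}: about {coin_flips(p):.1f} coin flips")

# Mixing units: independent lucky guesses multiply as probabilities
# and therefore add as coin flips.
die_roll = 1 / 6      # calling one die roll
card_value = 1 / 13   # calling the value of one card
print(f"die roll + 4 flips: p = {die_roll / 2**4:.4f}")
print(f"card + 4 flips:     p = {card_value / 2**4:.4f}")
print(f"die roll + 5 flips: p = {die_roll / 2**5:.4f}")
```

A p-value of .05 lands between four and five flips, and the last two combinations both land close to .005.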
BELLWETHER FOR FUTURE STUDIES, ALL RIGHT. A P-VALUE GIVES YOU SOME INDICATION OF WHAT'S COMING UP. I'M GOING TO USE AN EXAMPLE. HAVE PEOPLE HEARD OF THIS? YOU SHOULD WEAR RED, YOU SHOULD WEAR RED BECAUSE IT MAKES YOU SEXY AND DESIRABLE AND STRONG AND ALL THOSE GOOD THINGS. TONS OF STUDIES. AND I HAVE TO CONFESS, I WAS ONE OF THESE PEOPLE THAT WAS TOUTING THIS. SO BEFORE I STARTED WRITING A LOT ABOUT STATISTICS FOR FREELANCE, I WAS WRITING FOR THE L.A. TIMES ABOUT THE SCIENCE BEHIND MATING, DATING AND SEX. I WROTE A TONGUE IN CHEEK COLUMN ABOUT EVOLUTIONARY PSYCHOLOGY, WHY DON'T YOU USE IT TO GET YOUR SOUL MATE AT A HOLIDAY PARTY? YOU SHOULD WEAR RED. AND IT TURNS OUT THAT MEN WHO ARE HUNGRY LIKE FATTER WOMEN SO YOU SHOULD GO TO THE CHEAP PARTIES WHERE THEY DON'T SERVE FOOD AND YOU'LL BE MORE ATTRACTIVE, ALL THESE THINGS. "THE TODAY SHOW" IS LIKE, THIS IS GREAT INFORMATION. WHY DON'T YOU COME ON AND TELL US. THEY DIDN'T QUITE GET THAT I PREPPED. ONE OF THE STUDIES WAS THE RED, I REALLY WANTED TO TALK ABOUT MONKEY BUTTS, I'M EXCITED, I CAN DIE HAPPY TALKING ABOUT THAT. MONKEY BUTTS ON LIVE NATIONAL TELEVISION, NOTHING BETTER. I'M KIND OF EMBARRASSED ABOUT THAT NOW BECAUSE, WELL, IT'S NOT REALLY -- A REALLY GOOD STUDY. I WAS EAGER TO DO MY JOURNALISM, NOT THINKING WITH A CRITICAL STATISTICIAN'S EYE. SO PEOPLE HAVE TRIED TO REPLICATE THESE THINGS. AND THIS IS ONE OF THE LATEST ONES IN 2017, IS IT REALLY ROMANTIC? AND IT SEEMS TO KEEP FAILING. LET'S TALK ABOUT ONE ASPECT IN PARTICULAR, I THINK THAT WILL ILLUSTRATE SOME OF THE PROBLEMS WITH P-VALUES. SO THEY SHOWED MEN ONE OF THESE TWO PICTURES, SAME WOMAN, IN RED OR BLUE. AND SAID, OKAY, IMAGINE YOU GOT $100 IN YOUR POCKET AND YOU'RE GOING OUT ON A DATE WITH HER. HOW MUCH MONEY WOULD YOU BE LIKELY TO SPEND? THEN THEY WANTED TO LOOK AT THE DIFFERENCE. SIGNIFICANT DIFFERENCE, .002, THE DIFFERENCE WAS $26. DOES THAT SEEM PLAUSIBLE TO YOU? THINK ABOUT IT.
THIS IS A FEW YEARS AGO SO MAYBE THINGS WERE CHEAPER, BUT STILL $26 ON A DINNER, THAT'S $13 EXTRA A PERSON? I MEAN, ARE YOU BUYING -- THIS IS THE DIFFERENCE BETWEEN A STEAK AND SANDWICH, JUST ON THE BASIS OF WHETHER YOUR DATE IS WEARING RED OR NOT? YEAH, I'M NOT SO SURE ABOUT THAT. LET'S PEEK UNDER THE HOOD A LITTLE BIT, WHAT'S GOING ON. SO WE WERE ABLE TO SEE THAT THE STANDARD ERROR FROM THE STUDY WAS $8.15. SO ESSENTIALLY WHAT THAT MEANS FOR US, WE CAN KIND OF RECONSTRUCT, AND SAY THAT IF I'M GOING TO GET A SIGNIFICANT RESULT, NO MATTER WHAT, IT HAD TO HAVE BEEN AT LEAST $17 IN EITHER DIRECTION. NO MATTER WHAT YOUR THRESHOLD, RIGHT HERE, HAD TO BE PRETTY BIG TO BEGIN WITH. AND THAT'S -- BECAUSE WE HAVE THIS LARGE STANDARD ERROR, A LOT OF VARIABILITY IN THE POPULATION, WHETHER MEN LIKE RED OR NOT, AND SMALL SAMPLE SIZE. SO THAT'S ONE THING THAT WE START WITH. SO WE CAN THEN TAKE THAT AND SAY, OKAY, IF OUR STUDY WERE PERFECT, IF $26 WAS REALLY THE TRUE AVERAGE DIFFERENCE THAT MEN ARE WILLING TO SPEND ON RED-CLAD WOMEN VERSUS BLUE-CLAD WOMEN, AND I WAS LOOKING FOR A SIGNIFICANT RESULT, BASED ON THE SAME SAMPLE SIZE, THE SAME EVERYTHING, THEN I WOULD EXPECT A REPLICATION OF THE SAME SIZE, A STUDY OF THE SAME SIZE, TO BE GREATER THAN -- GET A RESULT GREATER THAN $17, 87% OF THE TIME. SO THIS IS THE POWER. THIS IS THE POWER OF A REPLICATION STUDY. IF YOUR ORIGINAL STUDY WERE PERFECT. NOTICE THAT YOU STILL HAVE 13% CHANCE OF REPLICATION FAILING, EVEN IF YOUR FIRST STUDY WAS PERFECT, 13% CHANCE OF FAILING. MOST PEOPLE DON'T THINK ABOUT THIS PART. AGAIN, IF YOUR FIRST STUDY WERE PERFECT, $26 REALLY WAS THE TRUE MEAN DIFFERENCE BETWEEN WHAT MEN ARE WILLING TO SPEND FOR RED AND BLUE WOMEN, THE PROBABILITY OF YOUR REPLICATION GETTING THE EXACT SAME P-VALUE IS ONLY 50%. AGAIN, LIKE A COIN FLIP. THAT'S IT. SO YOUR REPLICATION, SO YOUR P-VALUE, ALL OF THIS, IS JUST STRAIGHT FROM THE P-VALUE.
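The replication-power figures above can be approximated with a normal model. This is a rough sketch under stated assumptions: the replication estimate is treated as normal with the talk's standard error of $8.15, the $17 significance cutoff is taken as given, and the function name `replication_power` is hypothetical. The speaker's slide may have used a slightly different calculation, so the last digit can differ.

```python
from statistics import NormalDist

def replication_power(true_diff, std_err, threshold):
    """Chance a same-size replication lands beyond +/- threshold (either direction)."""
    estimate = NormalDist(mu=true_diff, sigma=std_err)
    upper = 1 - estimate.cdf(threshold)   # significant in the original direction
    lower = estimate.cdf(-threshold)      # significant in the opposite direction
    return upper + lower

STD_ERR = 8.15     # dollars, from the talk
THRESHOLD = 17.0   # dollars, from the talk

power = replication_power(26.0, STD_ERR, THRESHOLD)
print(f"power if the true difference is $26: {power:.0%}")

# The "50%" figure: if $26 is the truth, a replication estimate is equally
# likely to fall above or below $26, so it matches or beats the original
# p-value half the time.
beats_original = 1 - NormalDist(mu=26.0, sigma=STD_ERR).cdf(26.0)
print(f"chance of a p-value at least as small as the original: {beats_original:.0%}")
```

Under these assumptions the power comes out near the 87% quoted in the talk, leaving roughly a 13% chance that the replication fails.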
P-VALUE IS TELLING YOU WHAT SORT OF BEHAVIOR YOU SHOULD EXPECT FROM A REPLICATION. SO I CAN SAY, IF I'M LOOKING FOR ANY KIND OF SIGNIFICANCE, IT WOULD BE THIS. IF I'M LOOKING FOR THE SAME P-VALUE, IT WOULD BE ONE-HALF. I CAN EVEN GO ONE STEP FURTHER AND SAY, YOU KNOW WHAT? I REALLY DON'T BELIEVE THE $26 DIFFERENCE AND I CAN GO BACK TO THE LITERATURE, TALK TO EXPERTS. AS A PERSON WHO WEARS A LOT OF BLUE, I WAS OFFENDED AND CHEAPED OUT. WHAT IF THE TRUE DIFFERENCE WERE ONLY $5? AGAIN, YOU CAN DO THIS, A KIND OF SUPPOSITION, TIME TRAVELING, WHAT IF IT WERE ONLY THIS? WHAT WOULD HAPPEN? AND THIS IS WHERE IT STARTS TO GET INTERESTING. ANDREW GELMAN, A STATISTICIAN AT COLUMBIA UNIVERSITY AND A POLITICAL SCIENTIST, I THINK, ACTUALLY BY TRAINING, A GREAT STATISTICIAN, SAYS YOU KNOW WHAT, IT'S NOT REALLY ABOUT TYPE 1 ERROR AND TYPE 2 ERROR. THAT'S NOT WHAT WE'RE TALKING ABOUT. THAT'S NOT WHAT'S REALLY IMPORTANT. LET'S EXPLORE DIFFERENT ALTERNATIVES, LIKELY ALTERNATIVES, BASED ON WHATEVER OTHER INFORMATION WE HAVE, AND THEN SEE WHAT THAT MEANS FOR THE FUTURE. OKAY. SO JUST LIKE WE DID BEFORE, IF THE TRUE DIFFERENCE WERE ONLY $5 NOW, THIS IS THE DISTRIBUTION OF RESPONSES THAT WE WOULD EXPECT, IF WE WERE TO DO A REPLICATION OF THE SAME SIZE. SO, AGAIN, IT WOULD HAVE TO PASS ONE OF THESE THRESHOLDS. NOTICE THAT NOW WE'RE OVER HERE IN THE NEGATIVE, IT CAN ACTUALLY BE IN THE OPPOSITE DIRECTION. WE CAN RANDOMLY GET A RESULT IN THE WRONG DIRECTION. SO THERE'S A 7.9% RIGHT TAIL THERE, A .7% PROBABILITY THAT IT'S GOING TO BE IN THE OTHER DIRECTION. SO IF WE PUT THEM TOGETHER, THAT'S A POWER, THAT'S A POWER OF A REPLICATION. ONLY 8.6%. THAT MEANS IF I WERE TO DO A REPLICATION BASED ON THIS CRAZY STUDY I WAS TOUTING ON TV, THERE'S ONLY AN 8% CHANCE THAT IT WOULD REPLICATE. SO IS IT REALLY SURPRISING WHEN IT DOESN'T REPLICATE? NO. THE PROBLEM IS THAT IT WAS AN INFLATED EFFECT TO BEGIN WITH. IT WAS AN INFLATED STUDY. ANDREW TALKS ABOUT TYPE S ERROR RATE. HE SAYS IT'S NOT ABOUT TYPE I AND TYPE II.
IT'S ABOUT WHAT IS THE PROBABILITY THAT IF THE TRUE DIFFERENCE WERE ONLY $5, AND I DID A REPLICATION, AND IT WAS SIGNIFICANT, WHAT ARE THE CHANCES THAT MY STUDY GAVE ME A SIGNIFICANT RESULT, BUT IN THE OTHER DIRECTION, IN THE WRONG DIRECTION, TELLING ME THAT BLUE IS BETTER THAN RED? OR EXTRAPOLATE IT TO MY CURRENT STUDY, WHAT ARE THE CHANCES THAT THERE REALLY IS A DIFFERENCE ON HOW MUCH MEN ARE WILLING TO SPEND BETWEEN RED AND BLUE-CLAD WOMEN, BUT THAT BLUE IS BETTER, AND NOT RED? WHAT ARE THE CHANCES THAT IT'S FLIPPED AROUND? THAT IT'S WRONG? HERE IT WOULD BE THIS .7%, OVER THE .7 PLUS 7.9, GIVING YOU THE 8.1%. IT'S LIKE A 1 IN 12 CHANCE THAT THIS STUDY THAT WE'RE TALKING ABOUT RIGHT HERE IS FLAT OUT WRONG, THE WRONG DIRECTION. THAT'S BAD. IT'S NOT HOW MUCH OFF IS THE EFFECT SIZE, IT'S SIGNIFICANT, IT'S TRUE, THERE'S SOMETHING REALLY THERE. BUT YOUR STUDY IS POINTING YOU IN THE WRONG DIRECTION. SO THIS ISN'T SO BAD IF WE'RE TALKING ABOUT RED AND BLUE, BUT WHAT IF WE'RE TALKING ABOUT -- YOU KNOW, DRUGS AND IMPORTANT THINGS LIKE THAT? THERE'S A BIG CHANCE THAT YOUR RESULTS ARE TELLING YOU THE OPPOSITE OF WHAT IT SHOULD BE. AND ANDREW ALSO TALKS ABOUT WHAT HE CALLS THE TYPE M ERROR. MAGNITUDE. THE TYPE S ERROR, SIGN. AND TYPE M IS MAGNITUDE. HE SAYS WHAT'S THE TYPICAL EXAGGERATION FACTOR? SO MY RESULTS WERE $26. HOW MUCH MUST I EXPECT THAT'S INFLATED, THAT'S EXAGGERATED? THE PROBLEM IS, IF YOU HAVE AN UNDERPOWERED STUDY AND YOU HAVE A HIGH THRESHOLD FOR SIGNIFICANCE, AND YOU'RE ONLY USING THAT SIGNIFICANCE AS A BAR, ANYTHING THAT PASSES THAT THRESHOLD PROBABLY GOT THERE BY RANDOM CHANCE, LIKE REGRESSION TO THE MEAN IDEA, WE'RE ONLY SEEING IT BECAUSE IT'S SIGNIFICANT BUT IT PROBABLY ONLY GOT THERE BY RANDOM CHANCE OF GETTING 21 MEN LIKE THAT. SO WE CAN LOOK AT WHAT THE TYPICAL EXAGGERATION WOULD BE, SO WE CAN LOOK AT THE EXPECTED VALUE OF EACH OF THESE TAILS, WHICH IS ABOUT $21. AND THEN COMPARE THAT TO, IN THIS HYPOTHESIZED EXAMPLE, $5.
SO ANDREW IS SAYING THIS 4.2, YOU SHOULD EXPECT THAT IF YOU DESIGN THE STUDY UNDER THESE CIRCUMSTANCES WITH THIS MUCH VARIABILITY IN THE POPULATION AND ONLY 22 MEN, THAT NO MATTER -- AND YOU GET A SIGNIFICANT RESULT, IT'S PROBABLY TOO HIGH BY A FACTOR OF 4, QUADRUPLE HIGH. THAT'S KIND OF CRAZY. SO AGAIN THIS IS ALL PREDICATED ON WHAT IF IT WERE $5. SO YOU'RE BRINGING IN THE EXTRA INFORMATION. SO YOU CAN EXPLORE FOR DIFFERENT SORTS OF POSSIBILITIES, BUT $5 WE WOULD SAY IS A MORE PLAUSIBLE IDEA FOR A TRUE MEAN THAN $26. YOU CAN START LOOKING AT DIFFERENT THINGS AND BRINGING IN OTHER INFORMATION. SO IT'S AN INTERESTING WAY OF NOT GOING FULL-ON, WHERE YOU -- FULL ON BAYESIAN WITH A COMPLICATED ANALYSIS BUT YOU CAN BRING INFORMATION, USE IT FOR INFORMATION. SO IN CASE YOU'RE WONDERING, WHEN YOU DESIGN YOUR STUDY, IS POWER IMPORTANT? FOR THIS TYPE S ERROR, GETTING IT WRONG, YOU JUST NEED TO, YOU KNOW, MAKE SURE THAT YOU'RE ABOVE 20%, SOMETHING LIKE THAT. YOU'D BE SURPRISED WHEN THEY ACTUALLY INVESTIGATE STUDIES IN MUCH OF SCIENCE, THEY ARE FINDING IT'S LESS THAN 20%, THEY ARE FINDING IT'S 14%. SO LOW POWER MEANS YOU DON'T HAVE ENOUGH SAMPLE SIZE, VARIABILITY IS TOO HIGH. SO A LOT OF SCIENCE IS PROBABLY HANGING OUT IN THIS AREA. WE'RE GETTING RESULTS THAT ARE SIGNIFICANT AND TRULY SIGNIFICANT. THE PROBLEM IS IT'S TAKING US IN THE WRONG DIRECTION. SO THIS IS A HUGE PROBLEM. SO THAT EXAGGERATION RATIO, NOT SO BAD. BUT STILL, YOU KNOW, IF YOU WANT TO GET UNDER DOUBLE, SOMETHING LIKE THAT, THEN AGAIN YOU NEED TO BE ABOVE 20%. SO IT'S AN EXPONENTIAL THING. SO WHEN YOU INCREASE YOUR POWER, IT'S NOT JUST THAT YOU'RE IMPROVING YOUR RESULTS IN A LINEAR FASHION. IT'S WORKING EXPONENTIALLY. SO EVERY LITTLE BIT OF, YOU KNOW, EXTRA SAMPLE SIZE THAT YOU'RE GETTING, OR ATTEMPTS TO REALLY REDUCE VARIABILITY IN THE POPULATION THAT YOU'RE STUDYING, IT PAYS OFF WHEN YOU'RE TALKING ABOUT EXAGGERATION AND ERRORS SENDING YOU IN THE WRONG DIRECTION.
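The power, Type S (wrong sign) and Type M (exaggeration) figures for a hypothesized true difference of $5 can be sketched with a normal approximation. The assumptions here are mine: a normal sampling distribution with the talk's $8.15 standard error and $17 cutoff, a positive true effect, and a hypothetical function name `design_analysis`. The slide numbers (8.6%, 8.1%, 4.2) look like they come from a heavier-tailed calculation, so this version gives a similar exaggeration ratio but a somewhat smaller power and Type S rate.

```python
from statistics import NormalDist

def design_analysis(true_diff, std_err, threshold):
    """Return (power, type_s_rate, exaggeration_ratio) for a positive true_diff."""
    std = NormalDist()
    z_hi = (threshold - true_diff) / std_err
    z_lo = (-threshold - true_diff) / std_err
    p_hi = 1 - std.cdf(z_hi)   # significant, correct sign
    p_lo = std.cdf(z_lo)       # significant, wrong sign
    power = p_hi + p_lo
    type_s = p_lo / power      # share of significant results with the wrong sign
    # Means of the two truncated tails of the sampling distribution.
    mean_hi = true_diff + std_err * std.pdf(z_hi) / p_hi
    mean_lo = true_diff - std_err * std.pdf(z_lo) / p_lo
    # Average size of a significant estimate, relative to the truth.
    expected_abs = (p_hi * mean_hi + p_lo * abs(mean_lo)) / power
    return power, type_s, expected_abs / true_diff

power, type_s, type_m = design_analysis(5.0, 8.15, 17.0)
print(f"power: {power:.1%}")
print(f"Type S (wrong sign, given significance): {type_s:.1%}")
print(f"Type M (exaggeration ratio): {type_m:.1f}")
```

The point of the exercise survives the approximation: with a plausible $5 effect, a significant result from this design is rare, sometimes points the wrong way, and overstates the effect about fourfold.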
SO LET'S BRING THIS BACK TO WHAT WE SHOULD HAVE BEEN THINKING ABOUT FOR THESE POOR WOMEN WEARING RED OR BLUE. SO IF THIS WERE A PERFECT STUDY, THE CHANCE OF REPLICATION GETTING THE EXACT SAME P-VALUE IS ONLY A HALF, AND THE CHANCE OF A SAME-SIZE REPLICATION FAILING WOULD BE 13%. IF MEN ONLY SPEND $5 EXTRA, THE CHANCE OF A SAME-SIZE REPLICATION FAILING IS 92%. AND IF YOU GET A SIGNIFICANT RESULT LIKE WE GOT HERE, THE CHANCE OF IT BEING IN THE WRONG DIRECTION IS 8%, AND IT'S LIKELY EXAGGERATED BY A FACTOR OF 4. YOU DIDN'T REALLY THINK OF THAT. ALL OF THIS IS STRAIGHT FROM THE P-VALUE. YOU CAN DO ALL OF THAT JUST STRAIGHT FROM THE P-VALUE. SO WHEN THEY DID THE REPLICATION SLIGHTLY DIFFERENT BUT POINT SCALES WERE HARDER, I CHOSE THE DINING ROOM, SHOWING WOMEN PICTURES OF MEN, GRAY BACKGROUND OR RED BACKGROUND. IN THE ORIGINAL STUDY, RED WAS WAY SEXIER. AND IN AN ONLINE REPLICATION, AND AN IN-PERSON REPLICATION, THEY FOUND THE OPPOSITE, GRAY WAS SEXIER. SO THIS JUST FITS INTO THE PROBLEM, AND THE AUTHORS TALKED ABOUT THIS, THAT THE EFFECT SIZE WAS EXAGGERATED, AND YOU HAVE THIS STATISTICAL SIGNIFICANCE FILTER, SO ANYTHING THAT GETS THROUGH IT IS PROBABLY WRONG. AND THEN YOU'RE KIND OF REPLICATING BASED ON THAT, EVEN WITH A BIGGER SAMPLE SIZE, AND YOU'RE ENDING UP WITH NOTHING. SO WHEN WE TALK ABOUT REPRODUCIBILITY, I THINK IT'S GOOD TO KEEP THESE IDEAS IN MIND, THE P-VALUE IS NOT -- NOTICE WE'RE NOT TALKING ABOUT STRENGTH OF EVIDENCE, YOU KNOW, FOR YOUR HYPOTHESIS. IT'S TALKING ABOUT WHAT TO EXPECT COMING UP FOR YOUR REPLICATION. AND WHAT KIND OF EXAGGERATION OR ERRORS WE CAN TALK ABOUT FOR THIS ONE. SO THE FIRST ONE, IF NOTHING IS GOING ON, HOW SURPRISED SHOULD YOU BE AT YOUR RESULTS, THAT'S WHAT A P-VALUE IS TELLING US. IF SOMETHING IS INDEED GOING ON, HOW OFTEN SHOULD YOU EXPECT IT TO BE REPLICATED, AND HOW OFTEN WILL IT BE IN THE WRONG DIRECTION AND HOW EXAGGERATED WILL YOUR EFFECT SIZE BE? I DON'T THINK PEOPLE TALK ABOUT THESE THINGS OFTEN ENOUGH. ALL RIGHT. NUDGE FOR PLAUSIBILITY. ANOTHER ROMANCE STUDY.
I TENDED TO WRITE A LOT ABOUT THESE. IT WAS AN ONLINE SURVEY WHERE THEY JUST RANDOMLY ASKED A BUNCH OF PEOPLE WHO WERE MARRIED, HOW DID YOU MEET? HOW DID YOU MEET YOUR PARTNER? ONLINE OR IN PERSON? AND GUESS WHO SPONSORED IT? E-HARMONY. SO THIS IS BEFORE TINDER, I THINK. I WAS MUCH MORE CRITICAL AND SKEPTICAL AND FAIR-MINDED ABOUT THIS ONE. I TOOK THEM TO TASK FOR BAD STATISTICS, I DON'T THINK IT MADE THE E-HARMONY PEOPLE HAPPY. THEY LOOKED AT RATE OF BREAKUP, PEOPLE WHO MET ONLINE OR IN PERSON. AND THE DIFFERENCE WAS 1.7 PERCENTAGE POINTS, REALLY SMALL. AFTER YOU ADJUSTED FOR EVERYTHING, YOU KNOW, EMPLOYMENT, THIS AND THAT, THE P-VALUE IS ABOUT .05. SO WHEN I ASKED THE AUTHORS ABOUT THIS, WHEN I INTERVIEWED THEM AND WROTE ABOUT THIS FOR "NATURE," I SAID THAT'S REALLY, YOU KNOW, A SMALL EFFECT GOING ON, LIKE 1.7 PERCENTAGE POINTS, THEY DON'T EVEN CARE. HE SAID, OH, BUT, YOU KNOW, WE FOUND IT AND IT'S STRONG, AND WE KNOW IT. IT MAY BE SMALL, BUT WE'RE VERY SURE OF IT. SO I WOULD SAY, ARE YOU REALLY? SO LET'S AGAIN PEER UNDER THE HOOD AND SEE WHAT'S GOING ON. THE MOST IMPORTANT THING TO REMEMBER WHEN WE GO OVER THE NEXT FEW SLIDES IS THAT P-VALUES BOUNCE AROUND. P-VALUES THEMSELVES ARE RANDOM VARIABLES. MOST PEOPLE DON'T THINK ABOUT THAT. YOU DO YOUR STUDY, AND YOU GET YOUR P-VALUE, AND YOU SAY, OKAY, THAT'S IT. YOU KNOW, THIS IS A FIXED THING. NO, BECAUSE THE P-VALUE DEPENDS ON THE SAMPLE, AND EVERY TIME YOU REPEAT THE STUDY YOU GET A NEW SAMPLE. SO P-VALUES BOUNCE AROUND. SO THIS IS WHAT IT IS, SIMULATED: 100,000 P-VALUES, UNDER THE NULL HYPOTHESIS THAT THERE'S REALLY NO DIFFERENCE BETWEEN MEETING ONLINE AND IN PERSON. AND YOU CAN SEE THAT IT'S UNIFORM. SO BETWEEN -- IT DOESN'T MATTER; UNDER THE NULL HYPOTHESIS, IT JUST COMPLETELY IS UNIFORM, WE WOULD EXPECT THIS. IT'S ONLY UNDER THE ALTERNATIVES THAT YOU START TO SEE P-VALUES STACKED SMALLER. SO THIS IS THE SAME THING.
100,000 SIMULATED P-VALUES, BUT NOW .7 PERCENTAGE POINTS INSTEAD OF THE TWO GROUPS BEING EQUAL, NOW IT'S A SMALL DIFFERENCE. YOU CAN SEE THAT IT'S STARTING TO GET A DISTRIBUTION LOOKING AT THIS. A LITTLE BIT MORE. A LITTLE BIT MORE. ONE, AND YOU TAKE OFF. SO THE STANDARDIZED EFFECT OF 1, 1.4 PERCENTAGE POINTS. 2.8 PERCENTAGE POINTS, EFFECT IS GETTING BIGGER. STANDARDIZED EFFECT OF 2. THIS IS CUT OFF, UP HERE, BECAUSE IT GETS SO SKEWED. THIS IS -- IF THE DIFFERENCE WERE 5.6 PERCENTAGE POINTS, SO I'M STRUCK BY WHAT'S GOING ON RIGHT HERE. AND THIS IS IF IT WERE 11.2 PERCENTAGE POINTS, SO THIS IS HUGE, HUGE EFFECT. EVERYTHING IS CONCENTRATED HERE. THERE'S NOTHING OUTSIDE. SO LET'S ZOOM IN SO YOU CAN REALLY SEE. THIS GETS ZOOMED IN, BETWEEN 0 AND .1, AROUND THE P OF .05. THIS IS WHAT I'M GOING TO USE TO TRY TO CONVINCE YOU .005 IS ACTUALLY QUALITATIVELY BETTER THAN .05, AND .05 IS BAD FOR STRENGTH OF EVIDENCE. OKAY. SO THIS IS THE NULL. IT'S UNIFORM. NOW WE HAVE THOSE FOUR DIFFERENT SCENARIOS THAT I WAS JUST TALKING ABOUT. WE ZOOMED IN, WE CAN SEE THE BEHAVIOR. WE'RE STILL SEEING THE SAME THING HERE. THIS IS WHAT'S GOING ON AROUND .05. SO WHEN THE EFFECT IS SMALL, THEN IT'S UNIFORM, BUT THEN AS THE EFFECT IS GETTING BIGGER AND BIGGER, IT'S SHIFTING MORE OVER HERE. UNTIL WE HAVE A NICE HEALTHY EFFECT OF 4, AND MOST OF IT IS NOW VERY TINY, NOT EVEN GETTING MUCH OVER HERE. AND OF COURSE IF WE HAVE A HUGE EFFECT, EVERYTHING, EVERYTHING IS OVER HERE. NOTHING, YOU NEVER SEE A P-VALUE BIGGER THAN THAT. SO LET'S PUT THEM TOGETHER, BECAUSE YOU DON'T GET TO SEE THEM SEPARATED OUT. THIS IS JUST A SIMULATION. THIS IS IF YOU WERE TO DO 200,000 STUDIES, 100,000 OF THEM UNDER THE NULL HYPOTHESIS BEING TRUE, THAT NOTHING'S REALLY GOING ON, AND 100,000 OF THEM WITH A MODERATE -- A SMALL EFFECT. YOU CAN SEE THAT YOU WOULD GET A LITTLE BUMP OVER HERE, AND THEN IT STARTS TO GET BIGGER AND BIGGER, STARTS TO SHIFT OVER. SO AGAIN LET'S ZOOM IN.
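The "p-values bounce around" simulation can be sketched with a simple two-sided z-test. This is an illustration under my own assumptions, not the speaker's code: each study's test statistic is drawn as a normal with unit variance centered on a standardized effect (true difference divided by its standard error), and the function names are hypothetical.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()

def simulate_p_values(std_effect, n_studies, seed=0):
    """Two-sided z-test p-values from n_studies repeats of the same study."""
    rng = random.Random(seed)
    p_values = []
    for _ in range(n_studies):
        z = rng.gauss(std_effect, 1.0)  # observed test statistic for one study
        p_values.append(2 * (1 - STD_NORMAL.cdf(abs(z))))
    return p_values

def share_below(p_values, cutoff):
    """Fraction of simulated p-values under the cutoff."""
    return sum(p < cutoff for p in p_values) / len(p_values)

null_p = simulate_p_values(0.0, 100_000, seed=0)  # nothing going on
alt_p = simulate_p_values(2.0, 100_000, seed=1)   # standardized effect of 2

# Under the null the p-values are uniform: about 5% fall below .05,
# about 50% below .5. Under the alternative they pile up near zero.
for label, ps in (("null", null_p), ("effect = 2", alt_p)):
    print(label, [round(share_below(ps, c), 3) for c in (0.05, 0.5)])
```

Plotting a histogram of `null_p` next to `alt_p` would reproduce the flat versus skewed shapes described in the talk.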
IF YOU COUNT THE NUMBER OF P-VALUES THAT YOU GOT BETWEEN .04 AND .05, IT'S ABOUT A 1:1. WHAT WE'RE DOING NOW IS TURNING AROUND AND SAYING, OKAY, GIVEN THAT I HAVE A .05 HOW MANY OF THEM CAME FROM A STUDY WHERE THERE WAS A TRUE EFFECT AND HOW MANY OF THEM CAME FROM A STUDY WHERE THERE WAS NO EFFECT? SO GIVEN I HAVE .05, IT'S LIKE WE'RE KIND OF LOOKING BACKWARDS NOW, THEN WHAT PROPORTION CAME FROM EACH? SO THE STRENGTH OF EVIDENCE FOR THIS IS KIND OF 1:1, IF YOU HAVE A SMALL EFFECT. IF YOU HAVE MODERATE EFFECT IT GETS BIGGER. SO MORE OF THE TIME, OVER HERE. IF YOU GET A .05, MORE OF THE TIME IT CAME FROM A TRUE STUDY THAN A STUDY WITHOUT AN EFFECT. THEN IT'S GETTING BIGGER. BUT WAIT A MINUTE. LOOK WHAT HAPPENS HERE. WHEN WE HAVE A BIG EFFECT, ALL THOSE P-VALUES ARE SHIFTED OVER HERE, AND NOW IF WE'RE FOCUSING IN AT THAT .05, WE KNOW WE HAVE A P-VALUE OF .05, WHAT IS THE CHANCE IT CAME FROM GREEN VERSUS RED? THERE'S A HIGHER CHANCE IT CAME FROM RED. SO THAT MEANS IF I HAVE A .05, AND I HAD THIS KIND OF EFFECT SIZE, CHANCES ARE IT CAME FROM A STUDY WHERE THERE WAS NO REAL EFFECT. IF I HAVE A REALLY HUGE EFFECT, AND I SEE A P OF .05, CHANCES ARE THAT'S JUST A FLUKE. SO ALL OF THESE RED ONES ARE JUST FLUKES. SO THIS RATIO IS THE STRENGTH OF EVIDENCE. OKAY. SO PEOPLE HAVE DONE SOME INTERESTING MATHEMATICS TO SHOW THAT THE MAXIMUM YOU COULD GET OF THAT GREEN TO RED IS 3.4 TO 1. THAT'S THE MAXIMUM STRENGTH OF EVIDENCE YOU CAN GET. AND THE PEOPLE WHO MADE THAT SUGGESTION, CHANGE .05 TO .005, HAVE THIS NICE GRAPH DOWN BELOW THAT SAYS .05, WE'RE GOING TO TRANSLATE THAT TO THAT RATIO, AND THEY HAVE SOME DIFFERENT BOUNDS FOR THE RATIO, SO THE DIFFERENT COLORED LINES ARE DIFFERENT ASSUMPTIONS, COOL MATHEMATICAL TECHNIQUES THEY ARE USING. YOU CAN SEE THE GENERAL TREND IS IF IT'S A SMALLER P-VALUE, THEN THERE'S MORE EVIDENCE. IF IT WERE AT .005 THAT JUMPS TO 26. SO WAY MORE IN FAVOR AT .005, YOU CAN SEE THAT WHEN YOU'RE DOING THAT.
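The 3.4 and 26 figures can be reproduced with one particular upper bound on the Bayes factor. The talk does not say which bound the slide used, so this is my assumption: a two-sided z-test where the alternative puts half its weight on each of two symmetric effect sizes, which gives approximately 0.5 * exp(z^2 / 2) at the observed z. The function name `max_bayes_factor` is hypothetical.

```python
from math import exp
from statistics import NormalDist

def max_bayes_factor(p_value):
    """Approximate likelihood-ratio bound on evidence for H1 at a two-sided p-value.

    Assumes a z-test and an alternative split evenly between +z and -z.
    """
    z = NormalDist().inv_cdf(1 - p_value / 2)  # z-score matching the p-value
    return 0.5 * exp(z * z / 2)

for p in (0.05, 0.01, 0.005):
    print(f"p = {p}: Bayes factor at most about {max_bayes_factor(p):.1f} to 1")
```

Other assumptions about the alternative give smaller bounds, which matches the speaker's remark that the different colored lines on the graph correspond to different assumptions.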
SO THE INTERESTING THING IS THAT'S HOW MUCH EVIDENCE WAS IN YOUR STUDY, BUT HOW MUCH -- SO WHAT ARE MY ODDS THAT MY CONCLUSIONS ARE REAL, THAT MY EFFECT IS REAL? SO I OBSERVE A P OF .05, BUT YOU'RE TELLING ME, OKAY, IT'S, YOU KNOW, THREE TIMES AS MUCH EVIDENCE, HOW DO I TRANSLATE THAT? AND SOME OF YOU MIGHT RECOGNIZE BAYES' RULE DOWN HERE, WRITTEN IN THE ODDS FORM, SO YOU HAVE THE POSTERIOR ODDS EQUAL TO BAYES FACTOR TIMES PRIOR ODDS. I GOT PRIOR ODDS, THE BAYES FACTOR WHICH IS TELLING ME HOW MUCH MORE EVIDENCE THERE IS IN FAVOR OF MY EFFECT, AND THEN YOU END UP WITH POSTERIOR ODDS. SO LET'S TRY TO BRING THIS BACK INTO THIS IDEA OF PLAUSIBILITY IN HUMAN-CENTERED THINGS AGAIN. IF WE CONVERT OUR PROBABILITY TO THOSE SURPRISALS, THAT YOU KNOW, HOLY SHIT-NESS, WE CAN SAY BITS OF SURPRISE IS THE NEGATIVE LOG TO THE BASE 2 OF THE PROBABILITY. THIS IS GETTING BACK TO THE IDEAS THAT I TALKED ABOUT BEFORE. AND THIS IS STUFF FROM INFORMATION THEORY. SO IF I HAVE A PROBABILITY OF A QUARTER, 1/4 FOR AN EVENT, THEN THAT EVENT HAS TWO BITS OF SURPRISE, WHEN IT HAPPENS. WHICH IS EQUIVALENT TO TWO COIN GUESSES. LET'S DO THE SAME THING, USING THAT -- DOESN'T THAT MAKE SENSE WHY IT'S LOGS LIKE THAT? LOG TO THE BASE 2. LET'S CONVERT ODDS TO THE SAME THING, BITS OF PLAUSIBILITY AND BAYES FACTOR, BITS OF EVIDENCE. IF I HAVE PRIOR ODDS ON MY EFFECT BEING REAL, WITH THE ALTERNATIVE HYPOTHESIS, H1, OF 2 TO 1, THEN I HAVE ONE BIT OF PLAUSIBILITY IN FAVOR OF H1. IF IT'S THE OTHER WAY, IF I HAVE PRIOR ODDS OF 1:4, MEANING FOUR TIMES AS LIKELY THAT IT'S NOT REAL AS IT IS, THEN NOW I'M AT NEGATIVE 2 BITS OF PLAUSIBILITY. THE NICE THING ABOUT THIS IS WE CAN START TO ADD, IT'S ON THE HUMAN-CENTERED MUCH MORE MANAGEABLE SCALE WHEN YOU ADD THINGS. WE CAN DO THE SAME THING WITH BAYES FACTORS. THE BAYES FACTOR, THAT THING WE WERE TALKING ABOUT HERE, WHICH IS THIS RATIO THAT WE WERE TALKING ABOUT, SO BAYES FACTOR OF 3.4 IS 1.8 BITS OF EVIDENCE. 25.7, 4.7 BITS OF EVIDENCE.
IF WE START TRANSLATING THINGS TO A LOG SCALE, WE CAN THINK MORE EASILY HOW MUCH EVIDENCE IS IN OUR EXPERIMENT, AND HOW MUCH IT BOOSTS THINGS. SO SAY I HAD PRIOR ODDS OF -- WHAT DID I SAY BEFORE? LET'S JUST SAY 2:1 IN FAVOR OF H1, SO THAT MEANS I'M STARTING WITH 1 BIT OF PLAUSIBILITY, SO I'M STARTING HERE. THEN SAY I GOT A P OF .05. THAT'S 1.8 BITS OF EVIDENCE SHIFTING ME OVER HERE. NOW I HAVE THREE BITS OF EVIDENCE IN FAVOR OF MY EFFECT BEING REAL. THE PROBLEM COMES WHEN YOU HAVE ODDS OF 1:4, SAY, AGAINST. SO THAT'S NEGATIVE 2 BITS, SO IT'S LIKE WITH THAT ONE, THEN I'M STARTING OVER HERE. AND THAT BAYES FACTOR IS NUDGING ME AND I GOT A P OF .05, I'M NOT EVEN BREAKING EVEN. THAT'S NOT ENOUGH EVIDENCE TO SHIFT ME OVER, SO ZERO IS THAT DIVIDING LINE BETWEEN H1 AND H0. IF I'M STARTING OVER HERE WITH ODDS WEIGHTED AGAINST MY EFFECT BEING REAL, HOW MUCH EVIDENCE IS CONTAINED IN THAT .05? NOT ENOUGH TO GET ME PAST. IT'S STILL MORE LIKELY NOW IF I STARTED WITH THESE ODDS, AND GOT THIS P-VALUE, IT'S STILL MORE LIKELY THAT EVEN THOUGH NOW I'M CLAIMING THIS IS A SIGNIFICANT RESULT, AND I'M PUBLISHING AND I'M CLAIMING THIS IS REAL, IT'S MORE LIKELY THAT THIS IS WRONG, THIS IS A FLUKE. I HAVE MORE EVIDENCE IN FAVOR OF THE NULL HYPOTHESIS. SO PEOPLE HAVE DONE SOME VERY INTERESTING WORK TO SAY THAT MOST OF SCIENCE IS -- OOPS. MOST OF SCIENCE IS STARTING SOMEWHERE AROUND HERE, WITH THEIR PRIOR ODDS, AND THEY HAVE DONE THIS IN VERY INTERESTING WAYS. THEY HAVE LOOKED AT PREDICTION MARKETS WHERE THEY ARE ASKING RESEARCHERS TO MAKE BETS ON DIFFERENT THINGS. AND IT COMES OUT ENTIRELY CONSISTENTLY. THEY HAVE ASKED EXPERTS. THEY HAVE LOOKED AT REPLICATION. THEY HAVE DONE ALL KINDS OF DIFFERENT THINGS AND IT'S SAYING THAT MOST OF THE HYPOTHESES THAT WE'RE TESTING, WE'RE STACKED AGAINST THEM. THAT'S BECAUSE WE'RE THROWING SO MUCH INTO THE POT. WE'RE DOING THESE GIANT STUDIES. WHERE NOT EVERY HYPOTHESIS CAN BE REAL. NOT ALL THE EFFECTS CAN BE TRUE.
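The "adding bits" arithmetic is Bayes' rule in odds form (posterior odds = Bayes factor x prior odds) after taking base-2 logarithms of both sides. A small sketch, with hypothetical function names, using the prior odds and Bayes factors mentioned in the talk:

```python
from math import log2

def posterior_bits(prior_odds, bayes_factor):
    """Bits of plausibility for H1 after the data: prior bits plus evidence bits."""
    return log2(prior_odds) + log2(bayes_factor)

def bits_to_probability(bits):
    """Convert bits of plausibility (log2 odds) back to a probability for H1."""
    odds = 2.0 ** bits
    return odds / (1.0 + odds)

BF_AT_P05 = 3.4    # bound quoted in the talk for p = .05
BF_AT_P005 = 25.7  # bound quoted in the talk for p = .005

cases = [
    ("prior 2:1 for H1, p = .05", 2 / 1, BF_AT_P05),
    ("prior 1:4 against, p = .05", 1 / 4, BF_AT_P05),
    ("prior 1:4 against, p = .005", 1 / 4, BF_AT_P005),
]
for label, prior, bf in cases:
    bits = posterior_bits(prior, bf)
    print(f"{label}: {bits:+.1f} bits, P(H1) about {bits_to_probability(bits):.0%}")
```

Zero bits is the break-even line between H1 and H0. Starting at 1:4 against, a p of .05 leaves the total just below zero, while a p of .005 carries enough evidence to cross it.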
SO IF WE'RE STARTING OVER HERE AND WE ONLY GET .05, 1.8 BITS OF EVIDENCE, IT'S NOT TAKING US -- WE'RE STILL IN THIS LAND WHERE IT'S PROBABLY WRONG. SO EVEN WITH .05, YOUR RESULT IS PROBABLY WRONG. AND SO THAT'S WHY THE RESEARCHERS ARE SAYING .05 IS BAD. IT'S BAD FOR THIS REASON, BECAUSE THERE'S ONLY 1.8 BITS OF EVIDENCE. .005, WHAT THEY ARE SUGGESTING, 4.8 BITS OF EVIDENCE. SO EVEN IF YOU'RE STARTING OVER HERE, IT'S ENOUGH TO GET YOU TO THE PLACE WHERE FINALLY YOU CAN BE CONFIDENT THAT YOUR EFFECT IS REAL. THERE'S ENOUGH EVIDENCE IN .005, NOT IN .05, NOT IN .01, THAT'S WHY THEY ARE SUGGESTING THIS. SO AGAIN WHAT SHOULD A P-VALUE BE TELLING US? WHAT SHOULD WE BE LOOKING AT? IF YOU WANT STRENGTH OF EVIDENCE FOR YOUR EFFECT, THINK OF THE P-VALUE AS HOW MUCH IT'S NUDGING YOUR PLAUSIBILITY. NOT HOW MUCH STRENGTH OF EVIDENCE IS THERE. BUT IF YOU HAVE A P OF .05, IT CAN STILL BE MORE LIKELY THAT THIS IS A FLUKE THAN NOT. EVEN THOUGH YOU'RE ALLOWED TO PUBLISH, EVEN THOUGH IT LOOKS LIKE IT'S SIGNIFICANT. THERE'S STILL PROBABLY MORE EVIDENCE AGAINST YOUR EFFECT BEING REAL. IF YOU START PUTTING THINGS ON HUMAN SCALES, WITH HUMAN ANCHORS, AND START COMMUNICATING THAT WAY, THEN WE CAN TALK ABOUT THIS IDEA OF SURPRISE AND PLAUSIBILITY AND EVIDENCE ON A DIFFERENT SCALE, A LOG SCALE, ONE THAT MAKES SENSE, RATHER THAN MISUSE THE P-VALUES IN WAYS WE CONTINUE TO DO. SO THOSE ARE MY IDEAS WHAT A P-VALUE SHOULD BE TELLING YOU AND HOW WE SHOULD COMMUNICATE IT. AND I'M OPEN TO QUESTIONS. THANK YOU. [APPLAUSE] >> REGINA, A FEW QUESTIONS. PLEASE COME TO THE MIC. >> GREAT TALK. SO WHERE DOES MULTIPLE TESTING COME IN? WHERE YOU HAVE -- YOU KNOW, ESPECIALLY A LOT OF HIGH THROUGHPUT SEQUENCING, YOU HAVE THOUSANDS AND THOUSANDS OF P-VALUES. I FEEL LIKE THIS IS THE PRIME PLACE TO FIND WHAT IS THE RIGHT CUTOFF IN P-VALUES. SO IF YOU CAN GIVE A LITTLE BIT ON WHAT YOU THINK OF BONFERRONI AND HOCHBERG AND WHERE THEY LIE IN THE P-VALUE NUDGING. >> RIGHT. GOOD QUESTION.
HE'S TALKING ABOUT THE MULTIPLE TESTING AND THEN THERE'S THE CORRECTION FACTORS. >> SURE. >> RIGHT. LIKE A BONFERRONI, IT'S KNOWN TO BE OVERLY CONSERVATIVE, BUT IT'S NOT CHANGING THE P-VALUE, HOW MUCH IT SHOULD BE NUDGING YOUR PLAUSIBILITY, THAT SORT OF THING. IT'S JUST CHANGING YOUR BEHAVIORS, CHANGING THE STANDARD. I THINK IT'S IMPORTANT TO SEPARATE OUT THE P-VALUE AS HOW MUCH EVIDENCE IT'S CONTAINING FROM WHERE YOU'RE MOVING THE GOAL POST TO. SO WHEN YOU'RE TALKING ABOUT MULTIPLE COMPARISONS, YOU REALIZE THAT YOU ARE -- YOU'RE DIPPING OFTEN, AND -- LET'S SEE IF I CAN -- OKAY. SO ARE YOU FAMILIAR WITH THE WORK JOHN STOREY HAS ON Q-VALUES, AS APPLIED TO GENOMICS? >> YEAH. >> HE'S BASICALLY TALKING ABOUT THIS. HE SAYS, ALL RIGHT, SUPPOSE THIS IS WHAT YOU OBSERVED, YOU GET YOUR EMPIRICAL DISTRIBUTION OF THE P-VALUES, SO THESE ARE HYPOTHESIZED, THIS IS SIMULATED, WHERE I KNEW THE ANSWERS. BUT YOU'RE GETTING SOMETHING LIKE THIS WHEN YOU'RE DOING GENOMIC TESTING. SO HE'S SAYING FIND THAT BIT WHERE THIS STOPS BEING UNIFORM, BECAUSE IT'S AT THAT PLACE WHEN YOU'RE GOING TO START TO ACTUALLY HAVE MORE INFORMATION FOR YOUR EFFECT THAN AGAINST. AND SO HE'S TALKING ABOUT MOVING THE GOAL POST THAT WAY. AND SO I THINK IT'S VERY INTERESTING HOW HE STILL IS BRINGING IN BAYESIAN IDEAS AND LETTING YOU BRING IN PRIOR KNOWLEDGE ABOUT WHAT TO EXPECT ON THAT SORT OF THING. SO DID YOU HAVE A PARTICULAR QUESTION OR DID THAT -- >> NO, I THINK OVERALL. OVERALL IS FINE. THANK YOU. >> OTHER QUESTIONS? JOHN, PLEASE. >> YOU ARE PROPOSING TO LOWER THE P-VALUE FROM THE POPULAR .05 TO .005. >> THE THRESHOLD? >> YEAH. SO -- BUT THAT SAYS TO ME IT'S STILL UNDER THE CONVENTIONAL FRAMEWORK OF NULL HYPOTHESIS SIGNIFICANCE TESTING, STILL FORCING EVERYBODY TO MAKE A BINARY OR DICHOTOMOUS DECISION. THAT WOULD STILL HAVE THE PROBLEM THAT YOU MAKE A VERY ARTIFICIAL THRESHOLD, AND ALSO IT WILL BRING ALL KINDS OF PROBLEMS.
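The q-value idea from the earlier answer, looking at where the distribution of p-values is flat, can be sketched as follows. Under the null, p-values are uniform, so the flat right-hand part of the histogram gives an estimate of the fraction of true nulls. This is a minimal simulation, not the speaker's slide: the cutoff of 0.5, the 80% null fraction, the effect size, and all function names are assumptions chosen for illustration.

```python
import random
from statistics import NormalDist


def simulate_p_values(m, true_null_fraction, effect_z, seed=0):
    """Simulate m two-sided z-test p-values; nulls have mean 0, effects have mean effect_z."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    p_values = []
    for _ in range(m):
        is_null = rng.random() < true_null_fraction
        z = rng.gauss(0.0 if is_null else effect_z, 1.0)
        p_values.append(2.0 * (1.0 - std_normal.cdf(abs(z))))
    return p_values


def estimate_null_fraction(p_values, lam=0.5):
    """Estimate the share of true nulls from the flat part of the histogram.

    Null p-values are uniform, so about (1 - lam) of them land above lam,
    while p-values from real effects rarely do. lam = 0.5 is an assumed cutoff.
    """
    tail_count = sum(1 for p in p_values if p > lam)
    return min(1.0, tail_count / (len(p_values) * (1.0 - lam)))


p_values = simulate_p_values(m=10_000, true_null_fraction=0.8, effect_z=3.0)
pi0_hat = estimate_null_fraction(p_values)
print(f"estimated fraction of true nulls: {pi0_hat:.2f} (simulated truth: 0.80)")
```

The estimated null fraction plays the role of prior knowledge about how many of the tested hypotheses are expected to be real, which is what then moves the goal post.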
FOR EXAMPLE ONE PROBLEM IS THAT THE SO-CALLED -- THE DIFFERENCE BETWEEN A STATISTICALLY SIGNIFICANT RESULT AND A STATISTICALLY INSIGNIFICANT ONE, THAT DIFFERENCE MAY STILL -- (INDISCERNIBLE) -- STILL PROBLEMATIC. SO UNDER THIS NULL HYPOTHESIS SIGNIFICANCE TESTING FRAMEWORK, SO THAT'S WHY SOME STUDIES ARGUE WE SHOULD TOTALLY ABANDON THE P-VALUE, THE WHOLE THING. >> RIGHT, RIGHT. GOOD QUESTION. HE'S SAYING YOU'RE STILL DICHOTOMIZING. YOU HAVE A THRESHOLD, .05, .005, STILL DOING THE SAME BLACK AND WHITE THINKING. .049 IS OKAY -- NOT EVERYONE IS READY OR WILLING TO GIVE UP P-VALUES. P-VALUES ARE EASY. WE DON'T HAVE WIDELY ACCEPTED ALTERNATIVES SO UNTIL THEN AS A BRIDGE LET'S AT LEAST MOVE THE THRESHOLD TO WHERE WE ACTUALLY HAVE -- IT'S A GOOD STRENGTH OF EVIDENCE GIVEN OUR CURRENT SCIENTIFIC ENVIRONMENT. SO THEY REALIZE THAT. THE LEAD AUTHOR OF THE PAPER, VAL JOHNSON, IS HIMSELF A BAYESIAN, SO HE'S, YOU KNOW, HE'S NO -- HE'S NOT INSISTING THAT P-VALUES ARE THE WAY TO GO. THE OTHER IDEA THAT HAS BEEN FLOATED AS A BRIDGE OR A PRAGMATIC THING IS USING BAYES FACTORS, SO THEY ARE PROPOSING THAT YOU JUST REPORT WHATEVER BAYES FACTORS YOU'RE GETTING, GETTING YOU AWAY FROM DICHOTOMOUS THINKING. 3.4, AND THAT GOES CONTINUOUSLY ON TO, YOU KNOW, 4, 4 TIMES AS MUCH EVIDENCE OR 3 TIMES AS MUCH EVIDENCE. SO I THINK THAT'S NICE BUT THE PROBLEM IS THAT THIS IS MULTIPLICATIVE. DO WE HAVE AN INTUITIVE UNDERSTANDING OF HOW MUCH IS IN A BAYES FACTOR OF 3 VERSUS 25? I DON'T REALLY. CONVERTING TO THIS IDEA OF SURPRISALS WILL AT LEAST GET YOU AWAY FROM THAT. EVERYONE CAN PICK THEIR OWN PRIOR AND MOVE IT, AND THINK ABOUT STRENGTH OF EVIDENCE AS HOW MUCH YOU'RE SHIFTING THINGS. I AGREE, UNTIL JOURNALS COME UP WITH A BETTER WAY FOR DECIDING WHETHER TO ACCEPT OR REJECT WE'RE STILL GOING TO BE STUCK IN THIS BLACK AND WHITE ZERO AND 1 KIND OF THINKING AND THE SAME PROBLEMS. >> IS IT A QUICK QUESTION? IT'S 12:05. >> YES, I WANT TO ASK ABOUT REPLICATION.
ONE OF THE MAJOR CRITICISMS OF CHANGING THE THRESHOLD I HEARD FROM THE FIELD WAS, WELL, I WOULD RATHER HAVE TWO INDEPENDENT STUDIES WITH P LESS THAN .05, THAN ONE ISOLATED STUDY WITH P LESS THAN .005. HOW DOES THAT ISSUE FIT IN? >> THEY SAID THEY WOULD RATHER HAVE TWO STUDIES WITH .05 THAN ONE, WHY WOULD THEY RATHER HAVE -- THEY SAY THERE'S MORE EVIDENCE IN THAT OR -- >> THEY ARE SAYING EVIDENCE OF INDEPENDENT REPLICATION IS STRONGER THAN HAVING JUST A SINGLE HIGHLY SIGNIFICANT RESULT. >> BUT WHAT IF IT'S NOT, THEN -- YOU KNOW, AS WE SAW THAT P OF .05, REPLICATING, EVEN IF THAT WERE PERFECT, EVEN IF IT WAS EXACTLY THAT EFFECT SIZE, AND WE'RE LOOKING FOR ANOTHER SIGNIFICANT RESULT, WE ONLY HAVE A 50% CHANCE OF GETTING IT. SO THAT SEEMS LIKE A LOT OF RESPONSIBILITY TO PLACE ON SOMETHING, SO BASICALLY EVEN IF EVERYTHING'S PERFECT, THERE'S ONLY 50% CHANCE THAT YOU'LL BE ABLE TO PUBLISH UNDER THIS NEW CRITERIA, WHICH IS I THINK NOT NECESSARILY THE WAY THAT WE WANT TO BE GOING. >> ALL RIGHT. LET'S THANK REGINA ONE MORE TIME. >> THANK YOU. [APPLAUSE]
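The 50% figure in the last answer can be checked with a short calculation. This is a sketch under assumptions the talk does not spell out: a two-sided z-test, a replication of the same size as the original study, and a true effect exactly equal to the originally observed one. The function name is hypothetical.

```python
from statistics import NormalDist


def replication_power(original_p, alpha=0.05):
    """Chance that an identical replication reaches p < alpha.

    Assumes a two-sided z-test and that the true effect equals the effect
    observed in the original study (the "everything's perfect" case).
    """
    std_normal = NormalDist()
    z_observed = std_normal.inv_cdf(1.0 - original_p / 2.0)  # z that gave original_p
    z_critical = std_normal.inv_cdf(1.0 - alpha / 2.0)       # cutoff for significance
    replication = NormalDist(mu=z_observed, sigma=1.0)       # replication z-statistic
    upper_tail = 1.0 - replication.cdf(z_critical)
    lower_tail = replication.cdf(-z_critical)
    return upper_tail + lower_tail


for original_p in (0.05, 0.005):
    print(f"original p = {original_p}: "
          f"chance replication has p < .05 is {replication_power(original_p):.2f}")
```

When the original result sits exactly at P of .05, the replication statistic is centered on the cutoff itself, so it falls on the significant side about half the time. An original result at .005 does better under the same assumptions.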