>> GOOD AFTERNOON AND WELCOME TO THE LECTURE THIS AFTERNOON. IT IS MY PLEASURE TO WELCOME YOU TO THE EXTRAMURAL PROGRAMS LECTURE SERIES. OUR PRESENTER THIS AFTERNOON IS DR. JASON MOORE, AND THIS IS PART OF OUR NLM 175th ANNIVERSARY EVENT. NOW AS YOU CAN SEE WE HAVE A SIGN LANGUAGE INTERPRETER, SO DURING THE PRESENTATION, IF YOU NEED SIGN LANGUAGE, PLEASE INDICATE SO. IF NOT, WE WILL NOT CONTINUE THAT. DR. MOORE RECEIVED HIS Ph.D. OF BIOLOGICAL SCIENCES AT [INDISCERNIBLE] AND HE [INDISCERNIBLE] FROM THE UNIVERSITY OF MICHIGAN. AFTER RECEIVING HIS Ph.D. IN 1999, DR. MOORE JOINED THE FACULTY OF [INDISCERNIBLE] MEDICAL SCHOOL. DR. MOORE WAS PROMOTED TO ASSOCIATE PROFESSOR WITH [INDISCERNIBLE] AND AWARDED A PROFESSORSHIP IN CANCER RESEARCH. IN 2004 DR. MOORE MOVED TO DARTMOUTH MEDICAL SCHOOL AS THE RESEARCH SCHOLAR IN COMPUTATIONAL GENETICS AND PROFESSOR OF GENETICS AND PROFESSOR OF COMMUNITY AND FAMILY MEDICINE. IN 2008, HE WAS PROMOTED TO PROFESSOR OF GENETICS AND COMMUNITY AND FAMILY MEDICINE. IN 2010 HE WAS AWARDED A THIRD CENTURY PROFESSORSHIP AND APPOINTED FOUNDING DIRECTOR [INDISCERNIBLE] FOR BIOMEDICAL SCIENCES. DR. MOORE'S NLM RESEARCH PROGRAM FOCUSES ON THE APPLICATION OF COMPUTATIONAL METHODS FOR IDENTIFYING AND CHARACTERIZING GENE INTERACTIONS IN GENETIC STUDIES OF COMMON HUMAN DISEASES. HE CURRENTLY SERVES ON THE BIOMEDICAL LIBRARY AND INFORMATICS REVIEW COMMITTEE AND HE IS ALSO EDITOR IN CHIEF OF THE JOURNAL BIODATA MINING.
>> WELL, THANK YOU VERY MUCH FOR THAT KIND INTRODUCTION. IT'S WONDERFUL TO BE HERE AT THE NIH TODAY, AND THANK YOU TO THE NLM FOR THE INVITATION. IT'S A GREAT HONOR TO BE HERE WITH ALL OF YOU. I WOULD LIKE TO THANK JIM MALLY, WHO I SPENT TIME WITH TODAY AND WHO TOOK ME ON A TOUR OF THE NIH FROM HIS PERSPECTIVE. HE KNOWS REALLY INTERESTING PLACES AROUND CAMPUS, AND THANKS FOR LUNCH. I REALLY APPRECIATE IT.
SO I WANT TO TAKE SOME TIME TO TELL YOU ABOUT SOME OF OUR NLM FUNDED RESEARCH TODAY ON THE GENETIC ANALYSIS OF COMPLEX TRAITS AND SOME OF THE COMPUTATIONAL METHODS THAT WE'RE DEVELOPING. SO, THIS IS A SLIDE I'LL START AND FINISH WITH TO GIVE YOU A SENSE FOR THE ORGANIZATION OF THE PRESENTATION AND WHERE OUR THINKING IS ON HOW TO APPROACH COMPLEX GENETIC TRAITS. SOME OF THESE IDEAS WERE IN A PAPER PUBLISHED IN TRENDS IN PHARMACOLOGICAL SCIENCES, AND THIS IS MY PERSPECTIVE ON WHERE WE'VE BEEN AND WHERE I THINK WE'RE GOING. I THINK IT HAS BEEN AND IS CURRENTLY AN EXCITING TIME TO BE A GENETICIST BECAUSE OF ALL THE TECHNOLOGICAL ADVANCES IN MEASURING THE HUMAN GENOME AND MEASURING GENOMIC PROCESSES. WHEN I WAS A GRADUATE STUDENT WE DREAMED OF HAVING ALL THIS DATA, AND NOW WE HAVE A LOT OF IT AND IT'S TIME TO DO SOMETHING WITH IT. SO THIS IS WHERE WE'VE BEEN: WE'VE SEQUENCED THE HUMAN GENOME AND CHARACTERIZED VARIATION IN THE HUMAN GENOME, AND WE HARNESSED THAT VARIATION TO DO GENOME WIDE ASSOCIATION STUDIES FOR A VARIETY OF DIFFERENT HUMAN TRAITS. WE'RE NOW SEQUENCING WHOLE GENOMES. I WAS JUST AT THE AMERICAN SOCIETY OF HUMAN GENETICS MEETING LAST MONTH, AND THE PROJECTION IS THAT WE WILL HAVE OVER 30,000 GENOMES SEQUENCED BY THE END OF NEXT YEAR, MORE THAN 5000 BY THE END OF THIS YEAR. SO WE'RE VERY SQUARELY IN THE ERA OF WHOLE GENOME SEQUENCING, AND THAT'S WHERE WE ARE RIGHT NOW. AND ON THE FUNCTIONAL GENOMICS SIDE, WE CAN MEASURE GENE EXPRESSION AND PROTEIN EXPRESSION, WE HAVE NEW GENOMIC MECHANISMS SUCH AS MICRORNA WHICH ADD COMPLEXITY TO THE GENOMICS STORY, METHYLATION ADDS COMPLEXITY, AND THE ENCODE PROJECT IS AN IMPORTANT PROJECT, ANNOTATING WHAT ALL OF THE PIECES OF DNA DO. SO THIS IS WHERE WE'VE BEEN, AND THIS IS WHERE WE ARE TODAY.
AND YOU KNOW MANY BIOINFORMATICIANS LIKE TO THINK WE HAVE BEEN IN THE GOLDEN ERA OF BIOINFORMATICS, BUT I THINK WE'RE JUST NOW ENTERING THE GOLDEN ERA OF BIOINFORMATICS, BECAUSE ONCE WE'VE SEQUENCED EVERYBODY'S GENOMES, IT REALLY IS VERY MUCH A BIOINFORMATICS EXERCISE TO FIGURE OUT WHAT ALL OF THAT DATA MEANS. SO I THINK THE NEXT FIVE, PROBABLY 10 YEARS ARE GOING TO BE VERY MUCH FOCUSED ON BIOINFORMATICS. EVERYBODY WILL HAVE WHOLE GENOME SEQUENCE DATA AND WILL BE LOOKING FOR BIOINFORMATICISTS WHO HAVE THE TIME TO HELP THEM MAKE SENSE OF ALL THAT INFORMATION. BUT I THINK ONCE WE'VE DONE ALL THE SEQUENCING, ONCE WE FIGURE OUT HOW TO MANAGE AND ANALYZE SOME OF THAT DATA, THERE'S GOING TO HAVE TO BE A RETURN TO THE LABORATORY. THERE HAS GOT TO BE AN ITERATIVE CYCLE BETWEEN ANALYSIS AND THE ACTUAL EXPERIMENTS THAT VALIDATE AND DETERMINE BIOLOGICAL FUNCTION. THAT IS WHAT I WOULD REFER TO AS SYSTEMS GENETICS: THINKING ABOUT HOW ALL THESE PIECES AND PARTS FIT TOGETHER AND HOW WE CAN GO BACK AND FORTH WITH THE LABORATORY SCIENTIST TO REALLY UNDERSTAND THE ROLE OF GENETIC VARIATION AND GENOMIC VARIATION IN HUMAN HEALTH. SO I'M GOING TO BREAK MY TALK DOWN INTO THESE THREE BASIC PARTS: TECHNOLOGY, BIOINFORMATICS AND SYSTEMS GENETICS. SO, WE'RE COMING TO THE CLOSE OF THE GENOME WIDE ASSOCIATION STUDY ERA. IT DIDN'T LAST VERY LONG, ABOUT FOUR YEARS. SCIENCE IS MOVING VERY, VERY RAPIDLY. THIS IS A SNAPSHOT FROM ABOUT A YEAR AGO OF THE NHGRI CATALOG OF PUBLISHED GENOME WIDE ASSOCIATION STUDIES. AT THAT TIME THERE WERE OVER A THOUSAND GENOME WIDE ASSOCIATION HITS FOR OVER 200 HUMAN TRAITS, AND THAT NUMBER IS MUCH LARGER NOW. YOU CAN SEE HERE A MAP OF THE HUMAN GENOME, AND EACH CIRCLE REPRESENTS A POINT IN THE GENOME THAT'S BEEN ASSOCIATED WITH A PARTICULAR DISEASE, AND THE COLOR OF COURSE REPRESENTS THE DISEASE FOR THAT PARTICULAR GENETIC FINDING.
SO THE POINT OF THIS SLIDE IS LOTS AND LOTS OF INTERESTING THINGS HAVE BEEN FOUND, BUT WHAT I WOULD LIKE TO DO IS COMMUNICATE SOME OF THE CAVEATS. THIS WAS THE FIRST BIG STUDY, PUBLISHED IN 2007 IN NATURE BY THE WELLCOME TRUST CASE CONTROL CONSORTIUM. AT THE TIME THIS WAS QUITE A TOUR DE FORCE. THEY INVESTED A LOT OF MONEY AND EFFORT, MANY RESEARCH GROUPS, A VERY LARGE SAMPLE SIZE FOR THE TIME, 500,000 SNPs MEASURED ON 17,000 PEOPLE, AND THEY LOOKED AT SEVEN DIFFERENT COMPLEX DISEASES. THIS IS A FIGURE TAKEN FROM THEIR PAPER SHOWING THE HITS. SO THESE ARE THE CHROMOSOMES HERE, EACH CIRCLE REPRESENTS ONE OF THE 500,000 SNPs, AND THIS IS THE NEGATIVE LOG BASE 10 OF THE P-VALUE, SO HIGHER VALUES ARE MORE IMPORTANT. THE GREEN REPRESENTS THE HITS THAT WERE STATISTICALLY SIGNIFICANT, THE THINGS THAT WERE IMPORTANT. NOW AT THE TIME WHEN I SAW THIS FIGURE, MY FIRST IMPRESSION WAS, WELL, THERE'S NOT A LOT OF GREEN ON THIS SLIDE. I DIDN'T THINK THEY FOUND VERY MUCH, AND A LOT OF WHAT THEY FOUND WAS KNOWN ABOUT BEFORE. BUT THERE WERE SOME NEW THINGS FOUND, AND IN SUBSEQUENT STUDIES, THERE HAVE NOW BEEN HUNDREDS AND MAYBE EVEN THOUSANDS OF THESE KINDS OF STUDIES, THERE HAVE BEEN A LOT OF NEW THINGS FOUND, JUST AS SHOWN ON THIS SLIDE. BUT I THINK THE IMPORTANT THING TO THINK ABOUT IS WHAT IS THE CLINICAL USEFULNESS OF ALL THIS INFORMATION. SO THIS IS A PAPER THAT WAS PUBLISHED LAST YEAR IN THE BRITISH MEDICAL JOURNAL, A LARGE PROSPECTIVE COHORT STUDY IN TYPE TWO DIABETES, AND THEY ASKED A QUESTION: IF YOU TAKE THE GENETIC LOCI THAT HAVE BEEN IMPLICATED IN TYPE TWO DIABETES AND PUT THEM IN A PREDICTIVE MODEL, WHAT DO YOU GET? THIS IS AN ROC
CURVE SHOWING ONE MINUS SPECIFICITY AND SENSITIVITY, SO CURVES THAT ARE UP HERE ARE BETTER PREDICTORS, AND THE GRAY LINE HERE DOWN THROUGH THE MIDDLE INDICATES WHAT HAPPENS IF YOU JUST FLIP A COIN AND THE MODEL HAS NO PREDICTIVE ABILITY. THE BLUE LINE IS WHAT HAPPENS IF YOU PUT THE KNOWN GENETIC RISK FACTORS FOR DIABETES INTO A PREDICTIVE MODEL. YOU CAN SEE IT DOESN'T DO MUCH BETTER THAN FLIPPING A COIN. THE RED LINE IS WHAT'S CALLED THE FRAMINGHAM OFFSPRING SCORE, SO IF YOU PUT THE TRADITIONAL KNOWN RISK FACTORS IN THE DIABETES MODEL THAT'S WHAT YOU GET, AND THE GREEN LINE IS IF YOU ADD IN THE GENETIC RISK FACTORS. THERE ARE SEVERAL OTHER PAPERS PUBLISHED JUST LIKE THIS SHOWING THE SAME KINDS OF RESULTS, INDICATING A LACK OF PREDICTIVE ABILITY, AT LEAST FOR TYPE TWO DIABETES, AND I THINK THIS STORY IS PRETTY SIMILAR FOR A LOT OF OTHER COMMON HUMAN DISEASES. AND JUST TO GIVE YOU AN INDICATION OF THE EFFECT SIZES, THESE ARE MY OWN GENETIC TESTING RESULTS FOR TYPE TWO DIABETES. I DID THE 23ANDME SERVICE AND HAD THE SNPs ON MY OWN GENOME MEASURED, SO THESE ARE THE KNOWN FACTORS IDENTIFIED FROM GENOME WIDE ASSOCIATION STUDIES. YOU CAN SEE THE SCALE HERE: THIS IS A TWO FOLD INCREASED RISK OF DIABETES AND THIS IS A TWO FOLD DECREASED RISK. THE AVERAGE MALE OF EUROPEAN ETHNICITY HAS A 23.7% CHANCE OF DEVELOPING TYPE TWO DIABETES, AND BASED ON MY GENETIC TESTING RESULTS I HAVE A 22% CHANCE OF DEVELOPING DIABETES. YOU CAN SEE HERE THAT FOR EACH INDIVIDUAL MARKER WE'RE TALKING ABOUT VERY SMALL GENETIC EFFECT SIZES, AND SO, BASED ON MY OWN GENETIC TESTING INFORMATION, I DON'T THINK THIS TELLS ME A WHOLE LOT ABOUT MY RISK OF DIABETES.
AND THESE ARE VERY SMALL GENETIC EFFECT SIZES. THIS IS TYPE TWO DIABETES, PERHAPS AN EXTREME CASE BECAUSE WE ALL KNOW DIABETES HAS A VERY LARGE ENVIRONMENTAL COMPONENT, BUT THESE RESULTS ARE PRETTY TYPICAL OF A LOT OF GENETIC RESULTS FOR COMMON COMPLEX HUMAN DISEASES. AND I THINK THIS TIMES NEWS ARTICLE FROM 2009, SO TWO YEARS AGO NOW, ANNOUNCING THE FAILURE OF DECODE GENETICS TO BUILD A BUSINESS MODEL AROUND GENOME WIDE ASSOCIATION STUDIES AND GENETIC RESULTS, REALLY KIND OF SAYS IT ALL. I LOVE THE TAG LINE OF THIS ARTICLE: GENETICS COMPANY FAILS, ITS RESEARCH TOO COMPLEX. DECODE SET UP A BUSINESS TO GENOTYPE PEOPLE FROM ICELAND. THE ICELANDIC POPULATION IS A RELATIVELY HOMOGENEOUS POPULATION, FOUNDED BY SEVERAL HUNDRED PEOPLE WHO FIRST WENT TO ICELAND, SO THEY HAVE RELATIVELY HOMOGENEOUS ICELANDIC BACKGROUNDS. THE IDEA WAS TO START A COMPANY, USE THIS WONDERFUL RESOURCE TO IDENTIFY GENETIC RISK FACTORS, AND THEN BUILD A BUSINESS MODEL AROUND IT. THEY WEREN'T ABLE TO DO THAT. THE COMPANY WENT BANKRUPT, AND THEY'RE NOW LIMPING ALONG TRYING TO MAKE A GO OF IT AGAIN. BUT THIS IS ENTIRELY CONSISTENT WITH THE RESULTS I JUST SHOWED YOU, THAT WHAT'S BEEN FOUND ARE TYPICALLY THINGS WITH SMALL EFFECTS THAT ARE NOT REALLY CLINICALLY RELEVANT AT THIS POINT IN TIME. SO, I THINK THE LESSON FROM GENOME WIDE ASSOCIATION STUDIES IS THAT MORE DATA DOES NOT NECESSARILY MEAN MORE KNOWLEDGE, AND I THINK WHAT WE REALLY NEED TO CONSIDER MOVING FORWARD IS THAT GENERATING MASSIVE VOLUMES OF INFORMATION IS NOT ENOUGH. WE NEED TO UNDERSTAND THE DATA AND WE NEED TO THINK CAREFULLY ABOUT HOW TO ANALYZE THE DATA. IT'S NOT ENOUGH TO GENERATE THE DATA.
AND SO, MANY HAVE ASKED IN THE LAST COUPLE OF YEARS, WELL, IF WE HAVEN'T FOUND ALL THE GENETIC VARIATION ASSOCIATED WITH TYPE TWO DIABETES AND OTHER DISEASES THAT WE KNOW HAVE A GENETIC COMPONENT, WHERE IS THE MISSING HERITABILITY? THIS WAS A PAPER I WAS ASKED TO PARTICIPATE IN THAT WAS PUBLISHED IN JUNE OF 2010, AND WE WERE ALL ASKED THE BASIC QUESTION: HOW SHOULD WE SOLVE THE PROBLEM OF MISSING HERITABILITY IN COMPLEX DISEASES? I SAID BASICALLY THAT IT SHOULDN'T BE A MYSTERY TO ANYBODY, GIVEN THE INHERENT COMPLEXITY OF THE RELATIONSHIP BETWEEN GENOTYPE AND PHENOTYPE. WE KNOW THE GENOME IS COMPLEX, AND WE KNOW GENETIC EFFECTS TRANSLATE THROUGH A HIERARCHY OF PHYSIOLOGICAL SYSTEMS BEFORE ARRIVING AT A CLINICAL END POINT SUCH AS TYPE TWO DIABETES. SO I BASICALLY SAID THAT WE NEED TO PHILOSOPHICALLY AND ANALYTICALLY RETOOL FOR A COMPLEX GENETIC ARCHITECTURE OR WE WILL CONTINUE TO UNDERDELIVER ON THE PROMISES OF HUMAN GENETICS. LIFE IS COMPLICATED, AND SOME WILL ASK WHETHER WE'RE TRYING TO PREDICT THE UNPREDICTABLE, AND IF SO, GENETICISTS WILL BE OUT OF A JOB IN A FEW YEARS. SO THIS IS HOW I THINK CONCEPTUALLY ABOUT THE GENETICS OF COMMON HUMAN DISEASES. BLADDER CANCER IS A DISEASE I'LL MENTION A FEW TIMES IN THE TALK. WE STUDY IT IN NEW HAMPSHIRE, WHERE THEY HAVE A HIGHER RISK OF BLADDER CANCER THAN IN OTHER PARTS OF THE COUNTRY. SO DOWN HERE WE HAVE THE GENOME, AND THERE ARE HUNDREDS IF NOT THOUSANDS OF GENES THAT ARE INVOLVED IN BLADDER CANCER DIRECTLY OR INDIRECTLY, PROTEINS AND ENZYMES THAT ARE INVOLVED IN ALL SORTS OF DIFFERENT BIOCHEMICAL SYSTEMS.
ALL OF THIS BIOLOGY HAPPENS IN THE CONTEXT OF OUR ECOLOGY, AND IT'S REALLY ALL OF THESE THINGS WORKING TOGETHER THAT DETERMINE OUR INDIVIDUAL TRAJECTORY TOWARD A COMPLEX END POINT SUCH AS BLADDER CANCER. WHAT'S BEEN HAPPENING IN THE GENOME WIDE ASSOCIATION STUDY COMMUNITY LARGELY IS TAKING A SINGLE POINT IN THE DNA OUT OF THIS CONTEXT AND LOOKING FOR ITS RELATIONSHIP WITH BLADDER CANCER, AND FROM MY PERSPECTIVE IT DOESN'T MAKE SENSE TO DO THAT. WE HAVE TO UNDERSTAND GENETIC VARIATION IN THE CONTEXT OF ALL OF THIS BIOLOGY IF WE REALLY WANT TO UNDERSTAND WHY AND HOW GENETIC VARIATION IS A RISK FACTOR FOR COMMON DISEASE. SO THIS IS WHERE WE'VE BEEN: ONE SNP AT A TIME BIOSTATISTICAL ANALYSIS OF THIS DATA. WE'VE HAD LIMITED SUCCESS WITH THIS, BUT CERTAINLY IT HAS NOT LIVED UP TO THE HYPE WHEN ALL OF THIS STARTED FIVE OR SIX YEARS AGO. I THINK FOR MOST OF THE COMMON HUMAN DISEASES, AT BEST, WE'RE GOING TO EXPLAIN PERHAPS 20% OF THE GENETIC VARIANCE, AND THAT'S TRUE FOR CROHN'S DISEASE, WHICH IS ONE OF THE BEST SUCCESS STORIES. WITH MORE THAN 100 GENETIC MARKERS WE HAVE ONLY EXPLAINED 20% OF THE GENETIC VARIANCE, THE TOTAL VARIATION THAT'S DUE TO GENETICS, WITH COMMON GENETIC VARIATION THROUGH THE HUMAN GENOME. SO I THINK THAT MOST OF THE MISSING HERITABILITY IS TIED UP IN THESE COMPLEX EFFECTS: EPISTASIS, EPIGENETICS. AND DIABETES IS A DISEASE FOR WHICH THE ENVIRONMENT HAS A HUGE ROLE TO PLAY, DIET AND OTHER FACTORS, AND WE KNOW THAT THAT'S GOING TO INTERACT WITH OUR GENETIC BACKGROUND. SO GENE-ENVIRONMENT INTERACTION PLAYS AN IMPORTANT ROLE. IF YOU LOOK AT OTHER DISEASES, DISEASES LIKE AUTISM ARE DISEASES OF MASSIVE HETEROGENEITY. AUTISM SPECTRUM DISORDER IS A SPECTRUM.
WE'RE DOING GENETIC STUDIES ON WHAT WE CALL AUTISM, BUT WHEN IT COMES RIGHT DOWN TO IT, OUR GENOME WIDE ASSOCIATION STUDIES OF AUTISM ARE PROBABLY COLLECTIONS OF MANY SUBSETS OF PATIENTS THAT HAVE REALLY GENETICALLY DISTINCT DISEASES. SO WHAT WE'RE REALLY LOOKING FOR IN DISEASES LIKE AUTISM ARE PROBABLY MANY, MANY DIFFERENT GENETIC MODELS EXPLAINING DIFFERENT PARTS OF THE SPECTRUM. THAT'S PROBABLY MY CELL PHONE MAKING THAT NOISE.
>> SO THAT'S WHERE MOST OF THE MISSING HERITABILITY IS, AND WE'RE NEVER GOING TO UNCOVER THIS LOOKING ONE SNP AT A TIME. I THINK THE POINT HERE IS THAT THE LINEAR BIOSTATISTICAL MODEL HAS ITS PLACE IN THE ANALYSIS OF THIS DATA, BUT TO UNCOVER THESE MORE INTERESTING EFFECTS WE NEED DIFFERENT METHODOLOGY. WE NEED METHODOLOGIES THAT ASSUME COMPLEXITY RATHER THAN IGNORE COMPLEXITY, AND THAT'S WHERE A BIOINFORMATICS APPROACH HAS AN IMPORTANT ROLE TO PLAY. SO I'VE SORT OF ILLUSTRATED HERE THE LAYERS OF HOW TO THINK ABOUT METHODOLOGIC APPROACHES TO THIS PROBLEM. AT THE HEART WE HAVE THE LINEAR STATISTICAL MODEL, WHICH HAS SERVED US WELL, AND THERE ARE A LOT OF GOOD REASONS TO HAVE THE LINEAR MODEL AS A STARTING POINT FOR ANALYSIS. BUT I THINK WE NEED TO BE THINKING BEYOND THE LINEAR MODEL TO MACHINE LEARNING APPROACHES SUCH AS NEURAL NETWORKS, AND I WILL TELL YOU ABOUT OUR OWN NOVEL MACHINE LEARNING APPROACH, MDR. WE ALSO KEEP A CLOSE EYE ON THE FRINGES OF COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE: COMPUTATIONAL ASSISTANCE, GENETIC PROGRAMMING, DIGITAL BIOLOGY, A WHOLE FRINGE OF SCIENTISTS ON THE EDGES OF COMPUTER SCIENCE THAT ARE THINKING OUTSIDE THE BOX, THAT ARE REALLY PUSHING THE ENVELOPE AND DEVELOPING ALGORITHMS THAT LOOK AT DATA IN VERY, VERY INNOVATIVE AND NOVEL WAYS. WE'VE DONE A LOT OF WORK IN THESE FRINGE AREAS, AND I THINK THERE ARE A LOT OF GOOD IDEAS THERE THAT NEED TO BE EXPLORED.
OKAY, SO LET ME TELL YOU A LITTLE BIT ABOUT A BIOINFORMATIC APPROACH TO GENOME WIDE ASSOCIATION STUDIES, AND AGAIN I THINK WE'RE ENTERING THE GOLDEN ERA OF BIOINFORMATICS. THIS IS A REVIEW PAPER I PUBLISHED LAST YEAR, SORT OF CONCEPTUALIZING A COMPUTATIONAL APPROACH TO THE ANALYSIS OF THIS KIND OF DATA AND REALLY THINKING ABOUT IT FROM A COMPUTATIONAL POINT OF VIEW, COMPUTATIONAL MODELS RATHER THAN STATISTICAL MODELS, AND HOW YOU WRAP COMPUTATION AROUND THIS TYPE OF MODELING TO DISCOVER THESE MORE COMPLEX GENETIC EFFECTS. SO A LOT OF THE IDEAS I'M TALKING ABOUT TODAY ARE IN THIS PAPER IF YOU'RE INTERESTED. SO LET ME TELL YOU ABOUT A METHOD WE'VE BEEN WORKING WITH FOR MORE THAN 10 YEARS NOW. AT THE TIME, WE WANTED A MACHINE LEARNING APPROACH THAT WAS CAPABLE OF DETECTING SOME OF THESE MORE COMPLEX GENETIC EFFECTS, BUT MACHINE LEARNING HAS ALWAYS HAD A BAD REPUTATION BECAUSE IT'S CONSIDERED A BLACK BOX APPROACH, RIGHT? BIOLOGISTS HAVE A HARD TIME LOOKING AT A NEURAL NETWORK AND UNDERSTANDING WHAT IT'S DOING AND WHAT THE MODELS ARE TELLING YOU, AND THAT'S VERY UNDERSTANDABLE. SO WHEN WE DEVELOPED THE MULTIFACTOR DIMENSIONALITY REDUCTION APPROACH, WE WANTED TO DEVELOP SOMETHING THAT WAS MORE INTUITIVE TO EPIDEMIOLOGISTS AND GENETICISTS. SO THIS IS THE IDEA. THIS IS REAL DATA HERE FROM AN ATRIAL FIBRILLATION STUDY, LOOKING AT THREE FACTORS TOGETHER, SO THERE ARE 27 GENOTYPE COMBINATIONS, BECAUSE THESE ARE GENETIC VARIANTS WITH THREE GENOTYPES EACH.
SO EACH PERSON FALLS INTO ONE OF THESE 27 THREE-LOCUS GENOTYPE COMBINATIONS, AND WHAT WE'RE SHOWING IS THE NUMBER OF PEOPLE WITH DISEASE AND THE NUMBER WITHOUT DISEASE. WE SIMPLY CODE THE GENOTYPE COMBINATIONS THAT HAVE MORE SICK THAN HEALTHY PEOPLE AS HIGH RISK, IN DARK GRAY, AND WHERE WE HAD MORE HEALTHY THAN SICK, WE CODE THOSE AS LOW RISK, OR LIGHT GRAY. WHAT MDR DOES IS COLLAPSE ALL THE HIGH RISK GENOTYPES INTO ONE GROUP AND ALL THE LOW RISK GENOTYPES INTO ANOTHER GROUP, AND IT RECODES THE DATA HERE AS A SINGLE FACTOR. NOW IN COMPUTER SCIENCE THIS IS A GENERAL APPROACH CALLED CONSTRUCTIVE INDUCTION, THE IDEA OF TAKING TWO OR MORE VARIABLES AND SOMEHOW MAPPING THEM DOWN TO A LOWER DIMENSION. SO THIS IS A NOVEL METHOD THAT FALLS UNDER THE BROAD CATEGORY OF CONSTRUCTIVE INDUCTION METHODS: WE'RE TAKING THIS THREE DIMENSIONAL SPACE AND MAPPING IT DOWN TO ONE DIMENSION. LET ME BACK UP AND JUST SAY THIS WAS A STUDY WITH 500 SUBJECTS IN IT, AND NOTICE THERE ARE GENOTYPES HERE WITH NO DATA, AND MANY OF THEM HAD ONLY A FEW DATA POINTS. THIS IS WHERE THE PARAMETRIC STATISTICAL METHODS START TO HAVE PROBLEMS AND OFTEN WON'T CONVERGE ON ACCURATE PARAMETER ESTIMATES. AND SO MDR WAS DESIGNED TO DEAL WITH THIS DIMENSIONALITY ISSUE. ONCE YOU SQUEEZE THIS DOWN, YOU CAN PLUG IT INTO YOUR FAVORITE METHOD. YOU CAN PUT THAT BACK IN LOGISTIC REGRESSION AND ESTIMATE AN ODDS RATIO, OR USE A DECISION TREE OR NEURAL NETWORK. WE USED A PROBABILISTIC CLASSIFIER TO ASSESS THE RELATIONSHIP BETWEEN THE SINGLE VARIABLE AND CASE CONTROL STATUS, THE DISEASE END POINT. THIS METHOD WAS ORIGINALLY PUBLISHED IN THE AMERICAN JOURNAL OF HUMAN GENETICS, AND WE PUBLISHED A SERIES OF PAPERS WITH SIMULATION STUDIES AND REAL DATA ANALYSIS SHOWING THAT THE METHOD HAS BETTER POWER THAN LOGISTIC REGRESSION FOR DETECTING INTERACTIONS, SPECIFICALLY IN THE CASE OF INTERACTIONS IN THE ABSENCE OF MAIN EFFECTS.
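The pooling step described above can be sketched in a few lines. This is only a minimal illustration under stated assumptions, not the distributed Java MDR package: the function names (`mdr_risk_map`, `mdr_recode`, `balanced_accuracy`), the 0/1/2 genotype coding, the default ratio threshold of 1.0, and the toy data are all hypothetical. Balanced accuracy stands in for the classifier assessment step, and cross validation is left out.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple

Genotype = Tuple[int, ...]  # one genotype code per SNP, e.g. (0, 1, 2)


def mdr_risk_map(genotypes: Sequence[Genotype], is_case: Sequence[bool],
                 threshold: float = 1.0) -> Dict[Genotype, int]:
    """Label each observed genotype combination 1 (high risk) or 0 (low risk)."""
    cases = Counter(g for g, c in zip(genotypes, is_case) if c)
    controls = Counter(g for g, c in zip(genotypes, is_case) if not c)
    risk = {}
    for combo in set(cases) | set(controls):
        # Assumption: "more sick than healthy" means cases / controls > threshold.
        risk[combo] = int(cases[combo] > threshold * controls[combo])
    return risk


def mdr_recode(genotypes: Sequence[Genotype], risk_map: Dict[Genotype, int],
               default: int = 0) -> List[int]:
    """Collapse the multi-SNP genotypes into a single binary factor."""
    return [risk_map.get(g, default) for g in genotypes]


def balanced_accuracy(predicted: Sequence[int], is_case: Sequence[bool]) -> float:
    """Mean of sensitivity and specificity; needs at least one case and one control."""
    tp = sum(1 for p, c in zip(predicted, is_case) if p == 1 and c)
    tn = sum(1 for p, c in zip(predicted, is_case) if p == 0 and not c)
    n_case = sum(1 for c in is_case if c)
    n_ctrl = len(is_case) - n_case
    return 0.5 * (tp / n_case + tn / n_ctrl)


# Toy data (hypothetical): two SNPs whose effect shows up only in combination.
toy_genotypes = [(0, 0), (0, 0), (0, 1), (0, 1), (1, 0), (1, 0), (1, 1), (1, 1)]
toy_status = [False, False, True, True, True, True, False, False]

toy_risk = mdr_risk_map(toy_genotypes, toy_status)
toy_factor = mdr_recode(toy_genotypes, toy_risk)
toy_accuracy = balanced_accuracy(toy_factor, toy_status)
```

In the toy data neither SNP separates cases from controls on its own, while the recoded single factor does. That is the interaction-without-main-effects situation the method is aimed at.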
SO I'M INTERESTED IN, IF IT'S THREE OR FOUR OR FIVE SNPs WORKING TOGETHER, HOW DO YOU DETECT AND MODEL IT? WE PRODUCED IN 2005 AN OPEN SOURCE MDR. IT'S IN JAVA, SO IT WILL RUN ON ANY PLATFORM WITH JAVA INSTALLED, AND IT HAS A NICE INTERFACE. IT'S FREELY AVAILABLE, YOU CAN GET IT THROUGH MY WEB SITE, AND IT PROVIDES PUBLICATION QUALITY GRAPHICAL OUTPUT AND DOES THINGS LIKE CROSS VALIDATION AND PERMUTATION TESTING FOR THE SIGNIFICANCE OF A MODEL. BASICALLY THE WAY THIS WORKS IS YOU LOAD YOUR DATA IN AND IT LOOKS AT ALL THE TWO WAY, THREE WAY, FOUR WAY COMBINATIONS AND, USING CROSS VALIDATION, PICKS THE MOST PREDICTIVE MODEL. THERE ARE A LOT OF EXTRA TOOLS BUILT IN HERE TO ALLOW YOU TO FILTER YOUR DATA, TO HELP WITH THE INTERPRETATION OF THE DATA, ET CETERA. THE SOFTWARE HAS BEEN DOWNLOADED OVER 30,000 TIMES SINCE 2005, SO IT'S BEEN POPULAR AND WIDELY USED IN THE GENETICS AND EPIDEMIOLOGY COMMUNITIES. THERE ARE OVER 300 STUDIES NOW IN A WIDE RANGE OF DIFFERENT DISEASES, AND WHAT'S REALLY EXCITING FOR ME IS THERE'S A COMMUNITY OF SCIENTISTS OUT THERE DEVELOPING EXTENSIONS AND MODIFICATIONS TO MDR. THIS IS A VERY SMALL LIST, AND THESE ARE SOME OF THE OTHER GROUPS THAT HAVE PUBLISHED THINGS LIKE ODDS RATIO MDR: WHEN PICKING THE HIGH RISK AND LOW RISK GROUPS FOR THE CONSTRUCTIVE INDUCTION APPROACH, THESE PEOPLE DEVELOPED A WAY TO USE ODDS RATIOS AND TESTING WITHIN THE GENOTYPE COMBINATIONS TO DETERMINE THAT MAPPING. THERE IS MODEL BASED MDR, INTEGRATING REGRESSION METHODS INTO THE MDR METHOD, SO LOTS OF INTERESTING OTHER APPROACHES. SO LET ME GIVE YOU A QUICK REAL DATA EXAMPLE JUST TO HIGHLIGHT THE APPROACH. THIS IS A PAPER WE PUBLISHED A FEW YEARS AGO IN COLLABORATION WITH ANGELLA [INDISCERNIBLE] AND MARGARET, WHO HEAD THE NEW HAMPSHIRE BLADDER CANCER STUDY.
THIS WAS [INDISCERNIBLE] CONTROLS, FROM THE STATE OF NEW HAMPSHIRE, AND THIS WAS A CANDIDATE GENE STUDY LOOKING AT THE DNA REPAIR ENZYME GENES, WHICH ARE THOUGHT TO PLAY AN IMPORTANT ROLE IN CANCER. THEY ALSO LOOKED AT SMOKING, AGE, AND GENDER. SO WE DID AN MDR ANALYSIS AND WE LOOKED AT TWO WAY, THREE WAY, FOUR WAY COMBINATIONS OF FACTORS, AND THIS WAS OUR BEST MODEL. IT HAD A PREDICTIVE ACCURACY OF .66, SO IN CROSS VALIDATION IT PREDICTED BLADDER CANCER 66% OF THE TIME. THE ODDS RATIO WAS ABOUT THREE, WITH A HIGHLY SIGNIFICANT 95% CONFIDENCE INTERVAL. WE NEVER SAW A MODEL THIS GOOD IN THE PERMUTED DATA; WE RAN THE ENTIRE ANALYSIS TO SEE WHAT WE FIND IN RANDOM DATA. AND THEN I'M GOING TO TELL YOU IN A SECOND ABOUT A SPECIAL PERMUTATION TEST WE DEVELOPED TO SPECIFICALLY TEST THE HYPOTHESIS OF INTERACTION, WHICH IS WHAT I'M REALLY INTERESTED IN. SO HERE'S THE MODEL. IT CONSISTED OF TWO POLYMORPHISMS IN THE XERODERMA PIGMENTOSUM GENES AND SMOKING, AND YOU CAN SEE THE IMPACT OF SMOKING. YOU CAN SEE THE DARKER GRAY CELLS HERE, SO THERE WAS MORE CANCER IN THE PEOPLE THAT SMOKED MORE, AS WE WOULD EXPECT. SMOKING IS A KNOWN RISK FACTOR FOR BLADDER CANCER, SO WE EXPECT OUR METHOD TO PICK THAT UP. BUT YOU CAN SEE THERE'S NONLINEARITY THERE IN THE DISTRIBUTION OF HIGH RISK AND LOW RISK CELLS. IN THE PEOPLE THAT SMOKE LESS, IT DOES NOT FOLLOW A LINEAR PATTERN, AND SO THE QUESTION IS, WHAT'S REALLY GOING ON HERE? IS THIS REALLY INTERACTION, OR IS SMOKING DRIVING THIS EFFECT? SO WE DEVELOPED A SPECIAL PERMUTATION TEST TO GET AT THAT QUESTION. IN A NORMAL PERMUTATION TEST, THIS IS, SAY, THREE SNPs, EACH ROW IS A SUBJECT, AND THESE MIGHT BE THE CASES AND THE CONTROLS. WITH THE STANDARD PERMUTATION TEST, YOU RANDOMIZE THE CASE CONTROL LABELS WITH RESPECT TO THE GENOTYPES, SO THAT ANY ASSOCIATION THAT YOU SEE IS DUE TO CHANCE ALONE.
WHAT WE DID WAS DEVELOP A DIFFERENT PERMUTATION TEST. IT'S THE EXPLICIT TEST FOR INTERACTION, AND INSTEAD OF RANDOMIZING THE CASE CONTROL LABELS WE RANDOMIZE THE GENOTYPES WITHIN EACH COLUMN AND WITHIN EACH CLASS. WHAT THIS DOES IS PRESERVE THE INDEPENDENT MAIN EFFECTS OF THE VARIABLES, BECAUSE THE GENOTYPE FREQUENCIES STAY THE SAME WITHIN EACH OF THE TWO CLASSES, SO YOU'RE FIXING THE MAIN EFFECTS, HOLDING THOSE CONSTANT. BUT BECAUSE YOU'RE RANDOMIZING THE GENOTYPES WITH RESPECT TO EACH OTHER, YOU'RE REMOVING INTERACTION, SO ANY INTERACTION YOU SEE IS DUE TO CHANCE ALONE. SO WE'RE FIXING THE MAIN EFFECT, THE MAIN EFFECT STAYS THE SAME, AND WE'RE SCRAMBLING THE INTERACTION. SO IF WE GET A SIGNIFICANT P-VALUE WE KNOW IT'S ONLY DUE TO INTERACTION AND IT'S NOT CONFOUNDED WITH THE MAIN EFFECT. SO THIS IS WHAT THE RESULTS LOOK LIKE. THIS IS THE DISTRIBUTION OF ACCURACIES OF THE MDR CLASSIFIER FOR THE STANDARD PERMUTATION TEST, AND IT'S CENTERED AT ABOUT .5, WHICH IS WHAT YOU GET IF YOU FLIP A COIN. SO THAT'S WHAT YOU EXPECT. YOU EXPECT THAT TO BE CENTERED AT .5. AND THIS IS THE DISTRIBUTION WE GET FROM THE EXPLICIT TEST OF INTERACTION, AND THIS WAS THE ACTUAL ACCURACY OF THE PARTICULAR MODEL THAT WE WERE TESTING. YOU CAN SEE IT'S VERY DIFFERENT FROM THE STANDARD PERMUTATION DISTRIBUTION, AND THAT RESULTED IN A P-VALUE OF .001, AND IT WAS EXTREME WITH RESPECT TO THE DISTRIBUTION OF THE EXPLICIT TEST, WITH A P-VALUE OF 0.005. NOW THE CENTER OF THIS DISTRIBUTION IS ACTUALLY SHIFTED WITH RESPECT TO THE MAIN EFFECT. THIS IS ACTUALLY THE SMOKING EFFECT HERE THAT WE'RE SEEING, SO THIS DISTRIBUTION IS CENTERED AROUND THE MAIN EFFECT, SO THAT'S DIFFERENT THAN THAT. OKAY, SO WE CAN MODEL INTERACTIONS IN THE ABSENCE OF MAIN EFFECTS AND WE CAN TEST THEM. MDR DOES A REALLY GOOD JOB AT DOING THAT, BUT HOW DO YOU SCALE THIS TO A GENOME WIDE ASSOCIATION STUDY?
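The two permutation schemes described above differ only in what gets shuffled. The sketch below is a hedged illustration, not the lab's implementation: the function names, the list-of-rows data layout, and the add-one form of the empirical p-value are assumptions made for the example.

```python
import random
from collections import Counter


def permute_labels(genotypes, is_case, rng):
    """Standard test: shuffle case-control labels, leave genotypes untouched."""
    labels = list(is_case)
    rng.shuffle(labels)
    return [list(row) for row in genotypes], labels


def permute_within_class(genotypes, is_case, rng):
    """Explicit test of interaction: shuffle each SNP column independently,
    separately within cases and within controls."""
    permuted = [list(row) for row in genotypes]
    n_snps = len(permuted[0])
    for status in (True, False):
        rows = [i for i, c in enumerate(is_case) if c == status]
        for j in range(n_snps):
            column = [permuted[i][j] for i in rows]
            rng.shuffle(column)
            for i, value in zip(rows, column):
                permuted[i][j] = value
    return permuted, list(is_case)


def class_column_counts(genotypes, is_case):
    """Genotype counts per (class, SNP); these carry the single-SNP main effects."""
    counts = {}
    for row, status in zip(genotypes, is_case):
        for j, g in enumerate(row):
            counts.setdefault((status, j), Counter())[g] += 1
    return counts


def empirical_p_value(observed, null_scores):
    """Assumed convention: add one to numerator and denominator."""
    hits = sum(1 for s in null_scores if s >= observed)
    return (1 + hits) / (1 + len(null_scores))


# Toy data (hypothetical): 6 subjects, 3 SNPs coded 0/1/2.
rng = random.Random(0)
toy_genotypes = [[0, 1, 2], [1, 1, 0], [2, 0, 1], [0, 2, 2], [1, 0, 0], [2, 2, 1]]
toy_status = [True, True, True, False, False, False]
shuffled, kept_status = permute_within_class(toy_genotypes, toy_status, rng)
```

Because each column is shuffled inside a class, the per-class genotype counts of every SNP are unchanged, and only the pairing of genotypes across SNPs is broken. The model score would then be recomputed on each permuted data set to build the null distribution passed to `empirical_p_value`.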
IT TURNS OUT IT'S A VERY COMPUTATIONALLY INTENSIVE PROBLEM, AND TO ENUMERATE THE THREE WAY AND FOUR WAY INTERACTIONS IN A GENOME WIDE ASSOCIATION STUDY, THERE AREN'T ENOUGH COMPUTERS ON THE PLANET TO DO THAT ENUMERATION. SO WE'VE BEEN WORKING ON MORE INTELLIGENT WAYS TO DO THAT, BUT WHAT I THOUGHT I WOULD DO IS GIVE YOU A QUICK OVERVIEW OF SOME OF THE HIGH PERFORMANCE COMPUTING APPROACHES THAT ENABLE US TO DO A TWO WAY INTERACTION ANALYSIS. THIS WAS A SHORT PAPER A FEW YEARS AGO, ADAPTING MDR ALGORITHMS TO GPUs. MOST OF YOU PROBABLY KNOW THAT YOUR VIDEO CARDS ARE LITTLE SUPER COMPUTERS. THEY HAVE HUNDREDS OF PROCESSORS AND YOU CAN HARNESS THE POWER OF THOSE FOR COMPUTATIONS. THIS IS A COMPUTE CLUSTER [INDISCERNIBLE] WITH 150 CPUs, COMPARED TO ONE GRAPHICS CARD, OR TWO OR THREE GRAPHICS CARDS, AND YOU CAN SEE THAT ONE, TWO, AND THREE GRAPHICS CARDS ARE DOING THE WORK OF ABOUT AN 80 TO 150 CPU CLUSTER. SO THE SHORT STORY IS WE'RE GETTING A 100 FOLD SPEED UP WHEN YOU DO THE PRICE TO PRICE COMPARISON OF GPU VERSUS CPU, WE GET THE GPU ON THE COMPUTER, SO THAT'S PRETTY IMPORTANT FOR THESE KINDS OF COMPUTATIONALLY INTENSIVE PROBLEMS. AND I'LL MAKE A SIDE NOTE: THIS IS A PICTURE OF THE GUY THAT DID THIS WORK. HE WAS A HIGH SCHOOL STUDENT, A HANOVER HIGH SCHOOL STUDENT IN MY LAB, WHEN HE DID THIS WORK. HE IS ONE OF THE SMARTEST PEOPLE I'VE EVER MET IN MY ENTIRE LIFE, AND HE ENTERED THIS WORK INTO A GPU PROGRAMMING COMPETITION AND WON FIRST PLACE. THE OTHER NINE FINALISTS WERE PROFESSORS OF COMPUTER SCIENCE, SO THE HIGH SCHOOL STUDENT BEAT THEM ALL. IT WAS IMPRESSIVE, AND HE WON THAT COMPETITION TWO YEARS IN A ROW. HE WON IT LAST YEAR AS WELL. HE JUST STARTED HIS JUNIOR YEAR AT BROWN UNIVERSITY.
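The enumeration problem mentioned above is easy to see by counting. The short sketch below only counts SNP combinations for a 500,000 SNP panel, the panel size quoted elsewhere in the talk. It makes no claim about run time, since each combination must also be evaluated across all subjects and repeated for permutation testing. The helper name `n_combinations` is made up for the example.

```python
from math import comb

N_SNPS = 500_000  # panel size quoted in the talk


def n_combinations(n_snps: int, order: int) -> int:
    """Number of distinct SNP sets of the given size (n choose k)."""
    return comb(n_snps, order)


for order in (2, 3, 4):
    # Each step up in order multiplies the count by roughly n_snps / order.
    print(f"{order}-way combinations: {n_combinations(N_SNPS, order):.3e}")
```

Two way combinations number on the order of 10^11, while three way and four way combinations are on the order of 10^16 and 10^21. That is why the exhaustive analysis described here stops at pairs.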
SO TO GIVE YOU AN EXAMPLE, THIS IS A GENOME WIDE ASSOCIATION STUDY WHERE WE DID A GENE-GENE INTERACTION ANALYSIS. THE GENOME WIDE DATA SET WAS FOR SPORADIC ALS, 500,000 SNPs, AND WE DID ALL THE TWO WAY ANALYSIS EXHAUSTIVELY. WE REPEATED THE ENTIRE ANALYSIS A THOUSAND TIMES, AND WE IDENTIFIED A STATISTICALLY SIGNIFICANT MDR MODEL THAT REPLICATED IN A SECOND DATA SET, AND WE PUBLISHED THAT FINDING. NOTICE THE DIFFERENCE HERE: IT TOOK A THREE GPU COMPUTER, WHICH WE BUILT FOR $5000, A HUNDRED HOURS TO DO THIS WHOLE TWO WAY INTERACTION ANALYSIS, SO NOT TOO BAD, A COUPLE DAYS COMPUTING TIME. ON A CPU CLUSTER IT TOOK 40 DAYS TO DO THE SAME ANALYSIS. SO THIS WAS A $5000 SYSTEM AND THIS IS PROBABLY MORE LIKE A $50,000 SYSTEM. OKAY, ANY QUESTIONS AT THIS POINT ABOUT MDR? OKAY, WELL, I'M GOING TO SHIFT. I'VE BEEN TALKING ABOUT INTERACTIONS FROM THE POINT OF VIEW OF TWO, THREE, OR FOUR WAY INTERACTIONS, A COMBINATION OF THREE OR FOUR SNPs THAT IS JOINTLY PREDICTIVE OF DISEASE, ESPECIALLY IN THE CASE WHERE INDIVIDUALLY THEY MIGHT NOT BE. WHAT WE'RE DOING NOW IS MOVING INTO THINKING MORE AT A SYSTEMS LEVEL ABOUT NETWORKS OF SNPs, GETTING AWAY FROM MODELS OF TWO OR THREE SNPs AT A TIME AND MOVING TO HUNDREDS OF SNPs. SO I'LL GIVE YOU BRAND NEW RESULTS FROM THIS YEAR, BEGINNING TO THINK ABOUT A SYSTEMS GENETICS APPROACH TO GENETIC ARCHITECTURE. FIRST LET ME TELL YOU ABOUT A DIFFERENT MEASURE OF INTERACTION THAT WE'VE BEEN WORKING WITH, CALLED INTERACTION INFORMATION. THIS IS BASED ON ENTROPY AND INFORMATION THEORY. THIS IS NOT A NEW IDEA. IT GOES BACK TO THE PSYCHOMETRICS LITERATURE: IN THE 1950S McGILL PUBLISHED A PAPER ABOUT INTERACTION INFORMATION. AND ALEKS JAKULIN DID A Ph.D. DISSERTATION IN THE COMPUTER SCIENCE DEPARTMENT [INDISCERNIBLE], A NICE DISSERTATION SORT OF REDISCOVERING INTERACTION INFORMATION. IT'S A GREAT AUTHORITATIVE REFERENCE ON THE TOPIC, AND THAT WORK WAS DONE IN THE EARLY 2000S.
AND THEN WE PICKED UP ALEKS'S WORK AND ADAPTED IT TO HUMAN GENETICS, SHOWING HOW WE CAN USE THESE INFORMATION THEORY MEASURES AS A MEASURE OF SYNERGISTIC INTERACTION BETWEEN SNPs. THE IDEA IS PRETTY SIMPLE: DOES COMBINING TWO SNPs PROVIDE MORE INFORMATION ABOUT WHO'S SICK AND WHO'S HEALTHY THAN THE SNPs INDIVIDUALLY OR COMBINED ADDITIVELY? SO WE'RE REALLY MEASURING THIS JOINT INFORMATION HERE, AND ONE OF THE APPEALING THINGS ABOUT THIS MEASURE OF INTERACTION IS IT'S COMPUTATIONALLY VERY FAST. SO THAT'S ONE OF THE REASONS WE'RE ATTRACTED TO IT. SO I'LL RETURN TO OUR BLADDER CANCER EXAMPLE. IN THIS PARTICULAR STUDY WE HAD 1422 SNPs ACROSS 500 CANCER SUSCEPTIBILITY GENES, SO THIS IS A BIGGER DATA SET THAN WHAT I SHOWED YOU BEFORE, AND WE'RE BUILDING AN EPISTASIS NETWORK. SO INSTEAD OF BUILDING A MODEL WE'RE NOW THINKING ABOUT WHOLE NETWORKS OF SNPs, NETWORKS OF SNP INTERACTIONS. HERE THE VERTICES OR THE NODES IN THE NETWORK ARE SNPs, AND WE DRAW AN EDGE BETWEEN THEM IF THEY HAVE A SYNERGISTIC INTERACTION. THIS IS WORK BY A POST DOC IN MY LAB, AND IT HAS BEEN PUBLISHED IN BIOINFORMATICS. SO WHAT WE DID HERE IS JUST LOOK AT THE DISTRIBUTION OF THE NUMBER OF EDGES AND THE NUMBER OF VERTICES, SO THE NUMBER OF SNPs AND THE NUMBER OF INTERACTIONS, ACROSS DIFFERENT THRESHOLDS. AT A ZERO THRESHOLD THE NETWORK IS FULLY CONNECTED, SO WE HAVE ALL THE SNPs. YOU CAN SEE HERE THAT WHEN THE THRESHOLD IS ZERO, WE GET ALL 1422 SNPs IN THE NETWORK AND THE NETWORK IS FULLY CONNECTED, AND YOU CAN SEE THERE ARE MORE THAN 10 TO THE FIFTH CONNECTIONS WHEN THE THRESHOLD IS AT ZERO.
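The synergy measure described above can be written as the information the SNP pair carries about disease status jointly, minus what each SNP carries alone. The sketch below uses plug-in entropy estimates from observed frequencies, in bits. The function names and the toy data are hypothetical, and the published analysis may normalize or estimate these quantities differently.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Plug-in Shannon entropy (bits) of a sequence of hashable values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def mutual_information(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))


def interaction_information(snp_a, snp_b, status):
    """I(A,B;C) - I(A;C) - I(B;C): positive values suggest synergy."""
    joint = list(zip(snp_a, snp_b))
    return (mutual_information(joint, status)
            - mutual_information(snp_a, status)
            - mutual_information(snp_b, status))


# Toy data (hypothetical): status depends only on the combination of A and B.
toy_a = [0, 0, 1, 1, 0, 0, 1, 1]
toy_b = [0, 1, 0, 1, 0, 1, 0, 1]
toy_status = [a ^ b for a, b in zip(toy_a, toy_b)]

toy_synergy = interaction_information(toy_a, toy_b, toy_status)
```

In the toy data each SNP alone carries no information about status, while the pair determines it completely, so the whole bit of information is synergistic. Every term is a frequency count over subjects, which is consistent with the remark that the measure is fast to compute.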
THE RED LINE IS THE REAL NETWORK SIZE, THE NUMBER OF EDGES AND THE NUMBER OF NODES THAT WE OBSERVED IN THE REAL DATA. THE GRAY LINES ARE THE SAME LINES FOR A THOUSAND PERMUTATIONS: WE RANDOMIZE THE CASE CONTROL LABELS, REBUILD THE NETWORK, AND COUNT THE NUMBER OF NODES AND NUMBER OF EDGES, AND THAT'S WHAT WE'RE SEEING HERE. IN BOTH CASES, THE RED LINE IS BIGGER THAN THE THOUSAND GRAY LINES, SUGGESTING THAT THE NETWORKS ARE BIGGER THAN WE WOULD EXPECT BY CHANCE IN RANDOM DATA. WE CHOSE IN THIS FIRST STUDY TO FOCUS ON THE LARGEST CONNECTED COMPONENT. IF YOU LOOK AT MANY POINTS IN THIS CURVE AND YOU PULL OUT THE NETWORK AND LOOK AT IT, IT WOULD EXIST AS LOTS OF NETWORKS INDEPENDENT OF ONE ANOTHER. SO WHAT WE DID WAS DECIDE TO LOOK AT THE LARGEST COMPONENT, THE BIGGEST SINGLE NETWORK IN WHICH ALL THE SNPs WERE CONNECTED. AGAIN HERE, IF WE MEASURE THE SIZE OF THE LARGEST CONNECTED COMPONENT AND DO THE SAME THRESHOLDING, WE CAN SEE A HIGHLY SIGNIFICANT RESULT AGAIN. THE LARGEST CONNECTED COMPONENT IS MUCH BIGGER IN THE REAL DATA THAN YOU SEE IN RANDOM DATA. WE PICKED THIS INFLECTION POINT AS THE NETWORK TO STUDY, AND THERE'S THEORY IN NETWORK SCIENCE THAT SUGGESTS THAT WHERE YOU SEE THIS INFLECTION IS THE NETWORK YOU SHOULD STUDY. SO THIS IS WHAT OUR NETWORK LOOKED LIKE. THIS IS THE NETWORK FROM THE REAL DATA. IT CONSISTED OF 319 SNPs WITH 255 INTERACTIONS, AND THERE WERE 17 [INDISCERNIBLE] OBSERVED, AND I'LL SHOW YOU THIS IN MORE DETAIL IN A SECOND. THIS IS WHAT IT LOOKS LIKE IF YOU RANDOMIZE THE CASE CONTROL LABELS: YOU GET SMALL, UNINTERESTING NETWORKS. THIS IS TYPICAL OF WHAT YOU SEE IN RANDOM DATA, SO THERE'S MORE STRUCTURE IN THE REAL DATA THAN WHAT YOU SEE IN RANDOM DATA. SO THIS WAS OUR LARGEST CONNECTED COMPONENT.
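The thresholding and largest connected component steps described above can be sketched with a plain adjacency map and a breadth-first search. The SNP names, scores, and function names below are hypothetical placeholders. An inclusive comparison against the threshold is assumed, and the permutation comparison and the choice of the inflection point are not shown.

```python
from collections import defaultdict, deque


def build_network(pair_scores, threshold):
    """Keep an edge for each SNP pair whose interaction score meets the threshold."""
    graph = defaultdict(set)
    for (a, b), score in pair_scores.items():
        if score >= threshold:  # assumption: inclusive threshold
            graph[a].add(b)
            graph[b].add(a)
    return dict(graph)


def network_size(graph):
    """Return (number of nodes, number of edges) of an undirected graph."""
    n_edges = sum(len(neighbors) for neighbors in graph.values()) // 2
    return len(graph), n_edges


def largest_connected_component(graph):
    """Breadth-first search over every unseen node; return the biggest node set."""
    seen, best = set(), set()
    for start in graph:
        if start in seen:
            continue
        component, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in graph[node]:
                if neighbor not in component:
                    component.add(neighbor)
                    queue.append(neighbor)
        seen |= component
        if len(component) > len(best):
            best = component
    return best


# Toy pairwise scores (hypothetical SNP names and values).
toy_scores = {
    ("snp1", "snp2"): 0.030,
    ("snp2", "snp3"): 0.020,
    ("snp4", "snp5"): 0.025,
    ("snp1", "snp5"): 0.001,
}
toy_network = build_network(toy_scores, threshold=0.01)
toy_component = largest_connected_component(toy_network)
```

Raising the threshold drops weak edges, so the network breaks into separate pieces. Repeating the same steps on label-permuted data would give the null sizes that the observed network is compared against.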
AGAIN, THIS CONSISTED OF 39 SNPs, AND YOU CAN SEE THE DISTRIBUTION OF THE INTERACTIONS ACROSS THOSE SNPs. THESE ARE THE NAMES OF THE SNPs, WHICH YOU CAN'T SEE, AND THERE ARE A FEW INTERESTING THINGS WE FOUND ABOUT THIS NETWORK. FIRST OF ALL, LET ME SAY THAT THE SIZE OF THE CIRCLE REPRESENTS THE SIZE OF THE MAIN EFFECT OF THAT SNP, AND THE THICKNESS OF THE LINE IS THE STRENGTH OF THE INTERACTION. AND I DON'T THINK ANY OF THESE SNPs HAD A MARGINALLY SIGNIFICANT EFFECT ON RISK. SO, TWO INTERESTING OBSERVATIONS. FIRST, ALL OF THE GENES IN THIS NETWORK ARE TARGETED BY THE ARYL HYDROCARBON RECEPTOR, WHICH IS AN ENVIRONMENTAL RESPONSE GENE THAT TARGETS A LOT OF OTHER GENES. SO YOU GET TOXICANTS IN THE CELL, THEY BIND TO THE AHR, IT GOES INTO THE NUCLEUS AND TURNS ON A WHOLE BUNCH OF GENES TO RESPOND. THAT'S REALLY INTERESTING TO US BECAUSE WE'RE STUDYING BLADDER CANCER, AND WE THINK ARSENIC AND MERCURY AND OTHER METALS IN THE ENVIRONMENT ARE RISK FACTORS FOR BLADDER CANCER, SO IT WOULD MAKE SENSE THAT GENES THAT ARE TARGETED BY AHR ARE IMPORTANT IN BLADDER CANCER. SO THAT WAS ENCOURAGING. THE OTHER THING IS THAT SMOKING IS A KNOWN RISK FACTOR FOR BLADDER CANCER, AND BENZOPYRENE TARGETS THE RECEPTOR. SO THAT WAS INTERESTING. THE OTHER INTERESTING THING ABOUT THIS IS WE DID A GENE ONTOLOGY ANALYSIS OF THIS NETWORK, AND THE ONLY SIGNIFICANT FINDING WAS THAT THERE IS A NONRANDOM DISTRIBUTION OF TRANSCRIPTION FACTOR SNPs IN THIS NETWORK. THESE THREE ARE IN GENES THAT ENCODE TRANSCRIPTION FACTORS, AND NOTE THAT THESE ARE THE MOST HIGHLY CONNECTED NODES IN THE NETWORK, WHICH WOULD MAKE SENSE FOR TRANSCRIPTION FACTORS. SO THIS WAS ALL INDIRECT, SUGGESTIVE EVIDENCE THAT WE MIGHT HAVE SOMETHING INTERESTING HERE, BUT WE'VE ACTUALLY STARTED THE EXPERIMENTS.
WE'RE TAKING NORMAL BLADDER CELLS AND BLADDER CANCER CELLS, WE'RE DOING THE EXPERIMENTS RIGHT NOW, TAKING BLADDER CELLS AND TREATING THEM WITH BENZO[A]PYRENE, WHICH IS A MAIN COMPONENT, ONE OF THE CHEMICALS IN TOBACCO SMOKE THAT INCREASES RISK OF CANCER. WE'RE TREATING CELLS WITH BENZOPYRENE TO SEE WHAT GOES UP AND DOWN IN RESPONSE TO IT. AND THEN WE WILL DO A FUNCTIONAL ANNOTATION OF THIS NETWORK TO SEE IF THERE'S A NONRANDOM OVERABUNDANCE OF GENES REGULATED BY BENZO[A]PYRENE, SO WE'RE ATTEMPTING TO DO FUNCTIONAL VALIDATION. I'LL JUST MENTION THAT THIS FOLLOWS A SCALE-FREE DISTRIBUTION, AND THIS RESEMBLES A RANDOM NETWORK, WHICH IS ALSO INTERESTING. THIS IS SOME THINKING WITH MY COLLABORATORS ABOUT HOW NETWORKS OF INTERACTING GENES MIGHT INFLUENCE DISEASE RISK, SPECIFICALLY REGULATORY NETWORKS, AND IF YOU'RE INTERESTED IN HOW THESE MIGHT BE IMPORTANT FOR DISEASE, WE GO THROUGH SOME OF THAT IN THIS PAPER. I'LL JUST MENTION MATTHEW, ONE OF MY COLLABORATORS AT DARTMOUTH, A CANCER GENOMICIST DOING WONDERFUL GENOMICS WORK, TAKING SNPs FROM GENOME-WIDE ASSOCIATION STUDIES THAT ARE IN REGULATORY REGIONS AND GOING INTO THE LAB. WE DO THE COMPUTATIONAL PREDICTION TO SHOW THAT THAT SNP IS LIKELY TO CHANGE A BINDING SITE, AND THE LAB VALIDATES THAT, AND WE HAVE REALLY INTERESTING FUNCTIONAL EVIDENCE FOR GENOME-WIDE ASSOCIATION HITS IN BLADDER--IN BREAST CANCER AND PROSTATE CANCER. SO COMING FULL CIRCLE, THIS IS WHERE I THINK THE FUTURE IS: THINKING ABOUT THESE NETWORKS OF INTERACTING SNPs, NETWORKS OF OTHER INTERACTING GENOMIC MOLECULES, MESSENGER RNA, MICRORNA, PROTEIN NETWORKS. THE CHALLENGE IS TO PUT THIS TOGETHER AND PUT IT TOGETHER WITH THE EXPERIMENTAL SIDE, AND THAT'S WHERE WE NEED TO BE HEADED. JUST MEASURING SNPs AND DOING ONE-SNP-AT-A-TIME ANALYSIS WILL NOT GET US THERE. AND THIS IS MY FAVORITE QUOTE FROM WINSTON CHURCHILL. 
YOU KNOW, AND IT'S HARD FOR ME TO SAY THIS AT THE NIH, BUT AT SOME POINT THE GENETICS FUNDING SPIGOT IS GOING TO TURN OFF. WE'RE EITHER GOING TO COMPLETE ALL THE SEQUENCING AND THE MONEY WILL STOP, OR PEOPLE ARE GOING TO GROW TIRED. ONCE THE PATIENT ADVOCACY GROUPS GET TIRED OF HEARING THE SAME STORY, THAT GENETICS WILL CURE THEIR DISEASES, AND WE'VE BEEN PROMISING THEM THAT FOR 20 YEARS NOW SINCE THE BEGINNING OF THE HUMAN GENOME PROJECT, IT'S JUST A MATTER OF TIME BEFORE THE PATIENTS START SAYING, WHAT HAS GENETICS DONE FOR ME LATELY? AND ONCE THEY TURN ON GENETICS, THEN IT'S JUST A MATTER OF TIME BEFORE CONGRESS TURNS ON GENETICS, AND THEN THE FUNDING FLOW IS GOING TO TURN OFF. I THINK WE NEED TO BE CAREFUL NOT TO OVERHYPE THE PROMISE OF WHAT GENETICS WILL DELIVER. AT SOME POINT, THE MONEY WILL RUN OUT AND THEN WE'RE GOING TO BE LEFT WITH MOUNTAINS OF DATA TO SIFT THROUGH, AND WE'RE GOING TO HAVE TO THINK DEEPLY ABOUT GENETICS AND PHYSIOLOGY AND HOW ALL THESE THINGS WORK TOGETHER AND WHAT KIND OF ANALYSIS TOOLS WE NEED AND HOW TO INTERPRET THEM AND HOW TO TALK TO BIOLOGISTS AGAIN, AND SO WE HAVE A LOT OF WORK TO DO. SO LET ME ACKNOWLEDGE AN ARMY OF GRADUATE STUDENTS AND POSTDOCS THAT DID A LOT OF THIS WORK. TING DID WORK IN THE ANALYSIS I PRESENTED, AND CASEY HELPED WITH THE GPU COMPUTING I MENTIONED. PETER IS THE SOFTWARE ENGINEER THAT BUILT THE MDR SOFTWARE PACKAGE WE DISTRIBUTE, AND HE MAINTAINS IT AND DOES PROGRAMMING ON OTHER PROJECTS, AND HE'S DONE A LOT OF WORK ON A LOT OF THE METHODS THAT I'VE MENTIONED. JEFF CORALIS IS A MATHEMATICIAN [INDISCERNIBLE] WORKING FOR ME PART-TIME, AND HE KEEPS US HONEST ON THE MATH SIDE OF THINGS. THIS IS MY NLM GRANT THAT FUNDS A LOT OF OUR MDR DEVELOPMENT WORK, AND THE NEW VERSION OF THIS GRANT IS FUNDING A LOT OF THE NETWORK ANALYSIS THAT WE'RE DOING. THIS IS A GRANT I DIDN'T HAVE TIME TO GET INTO. 
THIS IS A NEW NLM GRANT THAT I HAVE THAT IS FOCUSED ON THE BASIC QUESTION OF WHETHER A PATHWAY-BASED APPROACH TO SNP ANALYSIS WILL BE MORE SUCCESSFUL THAN A STATISTICAL APPROACH. IN OTHER WORDS, DOES USING KNOWLEDGE--DOES KNOWING SOMETHING ABOUT INSULIN METABOLISM HELP YOU IN THE GENETIC ANALYSIS MORE SO THAN IF YOU IGNORED IT? HERE'S MY E-MAIL, MY WEB PAGE, MY BLOG. I'M ON TWITTER AND HAVE A STREAM OF CONSCIOUSNESS ABOUT THESE KINDS OF THINGS. I'VE MET WITH PEOPLE THIS MORNING WHERE THERE'S OPPORTUNITY FOR COLLABORATION, AND IF YOU SEE A WAY WE CAN WORK TOGETHER, PLEASE LET ME KNOW. I'LL STOP THERE, AND THERE'S PLENTY OF TIME FOR QUESTIONS. [ APPLAUSE ] >> [INAUDIBLE QUESTION FROM AUDIENCE ] >> WELL, YOU HAVE TO REMEMBER, THIS IS NOT--THIS IS NOT ONE SNP AFFECTING ANOTHER SNP. WE'RE NOT MAKING THIS INFERENCE. WE'RE SAYING THIS SNP AND THIS SNP TOGETHER HAVE JOINT INFORMATION ABOUT DISEASE STATUS, AND THERE'S NO DIRECTIONALITY HERE. WE'RE SAYING THIS IS A UNIT HERE THAT IS PREDICTIVE OF BLADDER CANCER. NOW IF YOU WENT TO THE FUNCTIONAL LEVEL, RIGHT, AND SAID THIS TRANSCRIPTION FACTOR BINDS TO THE GENE FOR THIS TRANSCRIPTION FACTOR AND TURNS IT ON, THEN YOU COULD--BUT THAT WOULD BE A BIOLOGICAL NETWORK. THIS IS A STATISTICALLY INFERRED NETWORK. [INDISCERNIBLE] >> YEAH. >> WE HAVE BEEN DOING THAT. PART OF THE PROBLEM IS, YOU HAVE TO REMEMBER THIS DATA IS FROM A BOUTIQUE SNP CHIP FOCUSED EXCLUSIVELY ON 500 CANCER GENES, SO YOU HAVE TO BE CAREFUL BECAUSE IT'S EASY TO SPIN AN INTERESTING STORY IN THESE DATA NO MATTER HOW YOU LOOK AT IT, BECAUSE THESE ARE ALL CANCER GENES AND WE'RE STUDYING CANCER. SO THIS IS NOT A TRUE GENOME-WIDE ASSOCIATION STUDY, THIS IS A CANDIDATE GENE APPROACH, SO WE HAVE TO BE CAREFUL WHEN WE DO THAT. BUT THAT'S REALLY THE FUTURE RIGHT? 
IS LAYERING THIS KIND OF STATISTICALLY DERIVED NETWORK WITH BIOLOGICALLY INFERRED NETWORKS. I HAVE A POSTDOC LOOKING AT PROTEIN-PROTEIN INTERACTION NETWORKS AND A GRADUATE STUDENT LOOKING AT REGULATORY NETWORKS, AND WE'RE THINKING ABOUT HOW YOU PUT ALL THOSE TOGETHER. WHEN I SAY A SYSTEMS GENETICS APPROACH, WHAT I'M TALKING ABOUT IS PUTTING THAT TOGETHER BUT ALSO COUPLING THAT WITH THE LABORATORY SCIENTIST, BEING ABLE TO GO INTO THE LAB AND DO THE FUNCTIONAL GENOMIC EXPERIMENTS THAT I TALKED ABOUT WITH THE BENZO[A]PYRENE, YOU KNOW, ACTUALLY GIVING CELLS THE STUFF THAT'S IN TOBACCO SMOKE AND REGULATING THOSE GENES, BECAUSE THEN WE CAN DO WHAT YOU'RE TALKING ABOUT, ACTUALLY BUILDING A GENE REGULATORY NETWORK FROM FUNCTIONAL DATA AND LAYERING IT ON TOP OF THIS. THAT'S INTERESTING AND THAT REALLY IS THE FUTURE. >> [INDISCERNIBLE]. >> IF YOU ASK PEOPLE IN THE GWAS COMMUNITY, AND I GIVE THIS KIND OF TALK A LOT, THEY SAY, WE DID A GENOME-WIDE, ONE-SNP-AT-A-TIME ANALYSIS AND WE LOOKED AT INTERACTIONS AND WE DIDN'T FIND ANYTHING, SO THEY DON'T EXIST. I HEARD SOMEBODY SAY THAT RECENTLY WITH 500 PEOPLE IN THE ROOM. YOU KNOW, THE PROBLEM IS THAT WHEN STATISTICIANS APPROACH INTERACTION ANALYSIS, FIRST OF ALL THEY CONDITION ON MAIN EFFECTS, SO THEY ONLY TEST FOR INTERACTIONS AMONG SNPs THAT HAVE MAIN EFFECTS. THAT'S WHAT THEY TEACH YOU: YOU FIT THE INTERACTION TERM AFTER YOU FIT THE MAIN EFFECTS. THE OTHER THING IS THAT LINEAR MODELS HAVE A LOT MORE POWER TO DETECT THE MAIN EFFECTS THAN THE INTERACTION EFFECTS, SO THE MODEL ITSELF IS LIMITED. SO THEY LOOKED AT THAT, THEY TOOK THEIR TOP HITS, CONDITIONED ON THE TOP HITS, LOOKED FOR INTERACTIONS AND DIDN'T FIND ANYTHING. THAT'S AN UNDERPOWERED ANALYSIS. AND WE SEE THE SAME THING, THAT THE INTERACTIONS DON'T TEND TO OCCUR AMONG THINGS WITH THOSE MAIN EFFECTS; RATHER THEY OCCUR BETWEEN THINGS WITH SMALLER OR NONSIGNIFICANT EFFECTS. 
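[ILLUSTRATIVE SKETCH, NOT PART OF THE TALK] THE POINT ABOVE, THAT AN INTERACTION CAN EXIST BETWEEN SNPs WITH NO MAIN EFFECT AT ALL, CAN BE CHECKED WITH A SMALL WORKED EXAMPLE. THE PENETRANCE TABLE AND GENOTYPE FREQUENCIES BELOW ARE HYPOTHETICAL NUMBERS CHOSEN FOR ILLUSTRATION, NOT VALUES FROM THE BLADDER CANCER DATA. EACH ENTRY IS THE ASSUMED PROBABILITY OF DISEASE FOR ONE TWO-SNP GENOTYPE COMBINATION.

```python
import math

# Hypothetical genotype frequencies for each SNP (0, 1 or 2 copies of an
# allele with frequency 0.5 under Hardy-Weinberg proportions).
GENOTYPE_FREQ = [0.25, 0.5, 0.25]

# Hypothetical penetrance table: rows are genotypes at SNP A, columns are
# genotypes at SNP B. Disease risk depends only on the combination.
PENETRANCE = [
    [0.0, 0.1, 0.0],
    [0.1, 0.0, 0.1],
    [0.0, 0.1, 0.0],
]


def marginal_penetrance_a(penetrance, freq_b):
    """Disease risk for each SNP A genotype, averaged over SNP B."""
    return [sum(p * f for p, f in zip(row, freq_b)) for row in penetrance]


def marginal_penetrance_b(penetrance, freq_a):
    """Disease risk for each SNP B genotype, averaged over SNP A."""
    return [sum(p * f for p, f in zip(col, freq_a)) for col in zip(*penetrance)]


marginal_a = marginal_penetrance_a(PENETRANCE, GENOTYPE_FREQ)
marginal_b = marginal_penetrance_b(PENETRANCE, GENOTYPE_FREQ)

# Every single-SNP genotype carries the same risk, so a one-SNP-at-a-time
# test has nothing to detect even though the joint table is far from flat.
no_main_effect = all(
    math.isclose(risk, marginal_a[0]) for risk in marginal_a + marginal_b
)
```

UNDER THESE ASSUMED NUMBERS, THE RISK FOR EVERY SINGLE-SNP GENOTYPE COMES OUT THE SAME, WHILE THE RISK ACROSS THE NINE TWO-SNP COMBINATIONS RANGES FROM 0.0 TO 0.1.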
AND THAT SUPPORTS THE HYPOTHESIS THAT INTERACTIONS ARE A COMPONENT OF THE MISSING HERITABILITY, AND THAT'S WHY WE HAVEN'T SEEN THEM, BECAUSE PEOPLE HAVEN'T LOOKED FOR THEM USING POWERFUL METHODS. >> YEAH. >> THERE'S NOTHING WRONG WITH THE DATA. THE DATA IS FABULOUS, IT'S GREAT. THAT'S NOT AN INHERENT PROBLEM IN THE TECHNOLOGY. WE'RE STUDYING COMPLEX DISEASES AND THERE'S A LOT OF STOCHASTICITY IN WHAT HAPPENS IN BIOLOGY, AND THAT'S WHAT WE WOULD CALL THE NOISE. I DON'T THINK THE DATA IS THE PROBLEM, IT'S MEASURED BEAUTIFULLY. THE PROBLEM IS WHEN THEY ANALYZE THE DATA THEY ASSUME IT'S A SIMPLE MODEL THEY'RE LOOKING FOR, AND I DON'T THINK THAT'S A VALID ASSUMPTION. WE ALL KNOW THAT OUR STATISTICAL RESULTS ARE ONLY AS GOOD AS THE ASSUMPTIONS WE MAKE. IF YOU ASSUME THERE'S A BIG MAIN EFFECT AND THAT'S WHAT YOU'RE LOOKING FOR, AND THAT DOESN'T EXIST, THEN YOU'VE MADE A BAD DECISION. I ALWAYS ASK MY STUDENTS: WE KNOW A TYPE ONE ERROR IS A FALSE POSITIVE AND A TYPE TWO ERROR IS A FALSE NEGATIVE, SO WHAT IS A TYPE THREE ERROR? I ASK THEM. YOU'VE ASKED THE WRONG QUESTION. AND I THINK THAT'S THE PROBLEM IN GENETICS, WE'VE BEEN ASKING THE WRONG QUESTION OF THE DATA. >> IN GENOME-WIDE ASSOCIATION DATA, YEAH, I MEAN, YOU KNOW, IT REALLY IS--IF YOU'RE LOOKING FOR FIVE SNPs THAT HAVE A COMPLEX INTERACTION AND THERE ARE NO LOWER-ORDER ONE- OR TWO-WAY EFFECTS, IT'S LIKE LOOKING FOR A NEEDLE IN A GIANT HAYSTACK, RIGHT? YOU HAVE NOTHING TO GO ON TO FIND THAT FIVE-WAY INTERACTION, AND THERE AREN'T ENOUGH COMPUTERS TO ENUMERATE ALL THE FIVE-WAY COMBINATIONS IF YOU HAVE A MILLION SNPs, SO THIS IS A--THIS IS A BIG PROBLEM. SO I DIDN'T HAVE TIME TO TALK ABOUT IT, BUT ONE OF THE MAJOR RESEARCH AREAS FOR MY LAB IS HOW DO WE USE EXPERT KNOWLEDGE. I THINK THE ONLY THING THAT'S GOING TO SOLVE THAT IS EXPERT KNOWLEDGE. 
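[ILLUSTRATIVE SKETCH, NOT PART OF THE TALK] THE SCALE OF THE FIVE-WAY ENUMERATION PROBLEM MENTIONED ABOVE CAN BE WORKED OUT DIRECTLY. THE MILLION-SNP FIGURE COMES FROM THE TALK; THE RATE OF ONE BILLION MODEL EVALUATIONS PER SECOND IN THE USAGE LINES IS AN ASSUMPTION PICKED ONLY TO GIVE A SENSE OF MAGNITUDE, AND THE FUNCTION NAMES ARE HYPOTHETICAL.

```python
import math

SECONDS_PER_YEAR = 365.25 * 24 * 3600


def n_combinations(n_snps, order):
    """Number of distinct SNP subsets of the given size."""
    return math.comb(n_snps, order)


def years_to_enumerate(n_snps, order, evals_per_second):
    """Wall-clock years to test every subset at an assumed constant rate."""
    return n_combinations(n_snps, order) / evals_per_second / SECONDS_PER_YEAR


pairs = n_combinations(10**6, 2)       # 499999500000, about 5 * 10**11
five_way = n_combinations(10**6, 5)    # on the order of 10**27
years = years_to_enumerate(10**6, 5, evals_per_second=1e9)  # assumed rate
```

EXHAUSTIVE PAIRWISE SEARCH IS LARGE BUT COUNTABLE, WHILE THE FIVE-WAY COUNT IS ROUGHLY SIXTEEN ORDERS OF MAGNITUDE BIGGER, WHICH IS WHY BRUTE FORCE ALONE DOES NOT REACH HIGHER-ORDER INTERACTIONS.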
IF YOU KNOW INSULIN METABOLISM GENES ARE IMPORTANT, CAN YOU GIVE YOUR ALGORITHMS A PUSH, A NUDGE IN THAT DIRECTION, TO PROBABILISTICALLY CONSIDER INSULIN METABOLISM GENES MORE SO THAN THEY WOULD OTHER GENES? NOW OF COURSE THAT IS DEPENDENT ON HOW GOOD YOUR KNOWLEDGE OF THE DISEASE IS, AND THAT'S A WHOLE OTHER SET OF ASSUMPTIONS THAT YOU MAKE, THAT YOU KNOW SOMETHING. BUT IN MY MIND IT MAKES NO SENSE TO APPROACH THESE PROBLEMS WITHOUT CONSIDERING THAT KNOWLEDGE. WHY WOULD YOU--WHEN YOU'RE--I WENT TO A TALK AT THE AMERICAN SOCIETY OF HUMAN GENETICS MEETING IN MONTREAL. THIS WAS SOMEBODY FROM THE MAGIC CONSORTIUM, WHICH IS THE BIG DIABETES CONSORTIUM: 30,000 PEOPLE FROM MANY OF THESE LARGE GENOME-WIDE ASSOCIATION STUDIES. THEY MEASURED SEVERAL MILLION SNPs, A LOT OF VARIANTS, PUT IT TOGETHER, DID A BIG META-ANALYSIS AND FOUND NOTHING. THIS WAS A PLATFORM PRESENTATION AT A NATIONAL MEETING. YOU KNOW, GREAT DATA, GREAT SAMPLE SIZES, BUT THE ANALYSIS WAS SO BRUTALLY SIMPLE. IT WAS ONE SNP AT A TIME AND THEY DIDN'T ADJUST FOR--OR CONSIDER BODY MASS AS A COVARIATE. DIDN'T LOOK AT GENDER, SMOKING, DIDN'T DO ANYTHING, ANYTHING THAT EVERY EPIDEMIOLOGIST THAT STUDIES DIABETES WOULD TELL YOU TO DO. IT'S ABSOLUTELY INSANE IN MY OPINION. IF YOU'RE STUDYING DIABETES, WHY WOULD YOU IGNORE EVERYTHING YOU KNOW ABOUT DIABETES? AND THIS IS WHAT'S HAPPENING IN THE GENETICS COMMUNITY. >> [INDISCERNIBLE] >> I DIDN'T. I JUST--YOU KNOW--I COME ACROSS AS A CURMUDGEON AT THESE MEETINGS AND I SIT AND MAKE MY REMARKS ON TWITTER, SO YOU CAN FOLLOW ALONG ON TWITTER AND SEE WHAT I THINK. YEAH? [INDISCERNIBLE]. >> YEAH, YEAH, THERE ARE A LOT OF ISSUES WITH LOOKING AT INTERACTIONS, A LOT OF TOUGH PROBLEMS. BACK TO THIS MODEL, THIS HAD AN ODDS RATIO OF THREE, AND THIS HAS BEEN REPLICATED AND THIS IS A BELIEVABLE RESULT. THIS IS AN ODDS RATIO OF THREE. THERE HAVE BEEN SEVERAL GENOME-WIDE STUDIES THAT HAVE BEEN PUBLISHED FOR BLADDER CANCER, WITH ODDS RATIOS OF 1.1 AND 1.2. 
AND THE INTERACTION, AT LEAST IN THIS CASE, IS SIGNIFICANTLY BIGGER THAN ANY OF THE MAIN EFFECTS THAT HAVE BEEN SEEN. AND WE KNOW THAT THESE TWO SNPs THAT HAVE A VERY STRONG INTERACTION HAVE ALMOST ZERO MARGINAL EFFECTS. THESE SNPs WOULD NEVER HAVE BEEN FOUND IN A GENOME-WIDE ASSOCIATION STUDY IF YOU WEREN'T TESTING FOR INTERACTIONS. THEY WOULD NEVER HAVE BEEN FOUND [INDISCERNIBLE]. SO THAT'S ANOTHER ASSUMPTION PEOPLE MAKE GOING INTO THE ANALYSIS. EFFECT SIZE IS A VERY IMPORTANT PART OF POWER, AND THEY ASSUME THE INTERACTION EFFECT SIZES ARE THE SAME AS OR LESS THAN THE MAIN EFFECTS, WHEN ACTUALLY WHAT WE'VE SEEN IN OUR EXPERIENCE IS THAT THE EFFECT SIZES FOR INTERACTIONS ARE ALMOST ALWAYS LARGER THAN THE MAIN EFFECTS. THEY HAVE TO BE. I AGREE. >> [INDISCERNIBLE]. >> SO HERE'S WHAT PEOPLE DON'T REALIZE ABOUT INTERACTION ANALYSIS. ALL THE BUZZ RIGHT NOW IN GENETICS IS ABOUT RARE VARIANTS: WE NEED TO SEQUENCE EVERYBODY TO FIND THE RARE VARIANTS. AND THAT'S GOING TO PEEL OFF ANOTHER PART OF THE ONION, BUT IT WON'T EXPLAIN THE WHOLE THING. IT'LL ADD SOMETHING. GWAS EXPLAINED SOMETHING, RARE VARIANTS EXPLAIN SOMETHING. BUT ONE OF THE THINGS THAT PEOPLE FAIL TO REALIZE IS THAT WHEN DOING AN INTERACTION ANALYSIS, THESE INDIVIDUAL GENOTYPE COMBINATIONS ARE THEMSELVES RARE, SO WHAT WE'RE REALLY TALKING ABOUT IS RARE COMBINATIONS OF COMMON VARIATION. AND SO THE RARE VARIANT PEOPLE WHO ARE STILL IGNORING INTERACTIONS DON'T REALIZE THAT, WELL, THIS IS A RARE EFFECT--A RARE EVENT IN THE POPULATION. THERE'S A NICE PAPER PUBLISHED IN PNAS VALIDATING WHAT WE'VE KNOWN FOR YEARS, THAT THERE IS AN INVERSE RELATIONSHIP BETWEEN THE SIZE OF THE EFFECT AND HOW RARE IT IS. THAT'S WHY THERE ARE SOME RARE VARIANTS WITH BIGGER EFFECTS ON DISEASE RISK. THAT'S THE BEAUTY OF--THESE ARE RARE EVENTS. WELL, THAT'S WHAT I WAS SAYING EARLIER. IT HAS TO BE EXPERT KNOWLEDGE. IT HAS TO BE. COMPUTING POWER IS NOT ENOUGH. 
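[ILLUSTRATIVE SKETCH, NOT PART OF THE TALK] ONE SIMPLE WAY TO GIVE A SEARCH ALGORITHM THE KIND OF PROBABILISTIC NUDGE TOWARD EXPERT KNOWLEDGE DESCRIBED ABOVE IS TO SAMPLE CANDIDATE SNP PAIRS WITH WEIGHTS INSTEAD OF UNIFORMLY. THIS IS NOT THE SPEAKER'S METHOD. THE SNP NAMES, THE WEIGHTS, AND THE FUNCTION NAME BELOW ARE ALL HYPOTHETICAL, AND THE WEIGHTS STAND IN FOR ANY PRIOR SCORE, FOR EXAMPLE MEMBERSHIP IN AN INSULIN METABOLISM PATHWAY.

```python
import random


def sample_snp_pairs(snps, knowledge_weight, n_pairs, seed=0):
    """Draw candidate SNP pairs, favoring SNPs with higher prior weight.

    snps: list of SNP identifiers (at least two distinct entries).
    knowledge_weight: dict mapping a SNP to a positive weight; SNPs that
        are missing from the dict get a neutral weight of 1.0.
    """
    if len(set(snps)) < 2:
        raise ValueError("need at least two distinct SNPs to form a pair")
    rng = random.Random(seed)
    weights = [knowledge_weight.get(s, 1.0) for s in snps]
    pairs = []
    while len(pairs) < n_pairs:
        a, b = rng.choices(snps, weights=weights, k=2)
        if a != b:  # a SNP cannot interact with itself
            pairs.append((a, b))
    return pairs


# Hypothetical example: two SNPs in pathway genes get a 50x weight.
snps = ["INS_snp1", "IRS1_snp1"] + [f"other_snp{i}" for i in range(8)]
prior = {"INS_snp1": 50.0, "IRS1_snp1": 50.0}
candidates = sample_snp_pairs(snps, prior, n_pairs=1000)
```

UNWEIGHTED SNPs CAN STILL BE DRAWN, SO WRONG OR INCOMPLETE KNOWLEDGE BIASES THE SEARCH WITHOUT RULING ANYTHING OUT, WHICH IS THE SENSE IN WHICH THE NUDGE IS PROBABILISTIC.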
THERE HAVE TO BE SOME BIOLOGICALLY DRIVEN HYPOTHESES THAT ARE INVESTIGATED IN THESE DATA. THAT'S THE ONLY WAY TO DO IT. IF THE PATIENT IN THE CLINIC IS AS COMPLEX AS I'M PAINTING IT, THE ONLY THING THAT'S GOING TO SAVE US IS OUR KNOWLEDGE. THE CLINICAL KNOWLEDGE, THE BASIC SCIENCE KNOWLEDGE OF THE DISEASE, I THINK IS THE ONLY THING THAT'S GOING TO SAVE US. WE HAVE TO--I DIDN'T HAVE TIME TO TALK ABOUT IT TODAY, I THOUGHT ABOUT PUTTING THESE SLIDES IN THERE, BUT WE'VE BEEN DEVELOPING A COMPUTATIONAL INTELLIGENCE SYSTEM. THIS IS AGAIN ON THE FRINGES OF COMPUTER SCIENCE KIND OF STUFF, BUT WHAT I'M INTERESTED IN IS DEVELOPING A COMPUTATIONAL SYSTEM THAT CAN SOLVE A PROBLEM AS I WOULD. IF I COULD SIT DOWN WITH A SET OF DATA AND A PHYSICIAN AND A BIOCHEMIST, AND WE COULD SIT AT THE COMPUTER AND WE HAD INFINITE TIME, WE WOULD SAY, LET'S BUILD THIS MODEL, LET'S LOOK AT THIS PATHWAY, LET'S TRY THESE ALGORITHMS, AND WE WOULD DO THAT UNTIL WE FOUND THE ANSWER. IN PRACTICE WE LET THE COMPUTER DO THAT. WHAT I'M INTERESTED IN IS HOW WE CAN TEACH THE COMPUTER TO SOLVE A PROBLEM THE WAY I WOULD IF I TINKERED WITH THE DATA. SO WE GIVE IT SOURCES OF KNOWLEDGE ABOUT PATHWAYS, ABOUT PRIOR STATISTICAL RESULTS, AND THE ALGORITHM CAN USE THAT DATA PROBABILISTICALLY TO BUILD MODELS, OR IT CAN NOT USE IT. AND WE FOUND THAT IF THE DATA IS USEFUL FOR SOLVING A PROBLEM, IF YOU GIVE IT A PROTEIN-PROTEIN INTERACTION NETWORK THAT POINTS IT AT AN INTERACTION, THE COMPUTER CAN LEARN TO USE THAT INFORMATION OVER TIME. THAT'S WHERE MY RESEARCH IS HEADED, USING THE DIFFERENT SOURCES OF EXPERT KNOWLEDGE TO GUIDE THE ANALYSIS, BECAUSE BRUTE FORCE COMPUTING DOESN'T WORK. BUT THAT'S PROBABLY 50 YEARS OFF. >> [INDISCERNIBLE]. >> [INDISCERNIBLE] DID HIS Ph.D. 
ON INTEGRATING PROTEOMIC DATA WITH SNP DATA AND MODELING THE DATA JOINTLY, AND THAT WAS SUCCESSFUL. HE PUBLISHED A COUPLE OF PAPERS SHOWING THE BEST MODELS INCLUDE BOTH SNPs AND PROTEINS. THE REASON I DIDN'T SHOW MORE OF THAT IS THAT, YOU KNOW, THOSE DATA SETS LARGELY DON'T EXIST. YOU HAVE MICROARRAY DATA, PROTEOMICS, MICRORNA, METHYLATION. EVERYBODY HAS THEIR OWN PARTICULAR FOCUS ON A LAYER OF THE HIERARCHY, AND THAT'S THE DATA THAT THEY COLLECT, AND IT'S VERY--VERY RARE TO HAVE DATA SETS WITH MULTIPLE LAYERS OF THE HIERARCHY. >> SO I HAVEN'T WORKED ON MICROARRAY DATA RECENTLY, I DID EARLIER IN MY CAREER. IT'S SOMETHING I ALWAYS THOUGHT ABOUT: HOW WOULD YOU ANALYZE MICROARRAY DATA THE SAME WAY WE'RE ANALYZING SNP DATA? AND I BET IF YOU--IF YOU DID, YOU WOULD FIND OUT THAT THERE ARE STRONG INTERACTIONS IN MICROARRAY DATA. IT JUST--AND IT MAKES SENSE. GENE REGULATORY NETWORKS ARE ALL ABOUT INTERACTIONS, SO WHY WOULD IT NOT BE THAT WAY? >> [INDISCERNIBLE] >> YEAH, YEAH. YEAH, YEAH, I AGREE. YEAH, YEAH. I AGREE COMPLETELY, AND I HAVE A GRADUATE STUDENT WORKING ON THAT VERY PROBLEM. HOW DO YOU MODEL HETEROGENEITY AND INTERACTION AT THE SAME TIME? IT'S A TOUGH PROBLEM, BUT HE'S DEVELOPED--HE'S DEFENDING HIS DISSERTATION IN FEBRUARY AND DEVELOPED A METHOD CALLED A LEARNING CLASSIFIER SYSTEM. INSTEAD OF DEVELOPING THE ONE BEST MODEL THAT FITS THE DATA, WHAT THAT METHOD DOES IS DEVELOP LOTS OF MODELS THAT EACH DESCRIBE DIFFERENT SUBSETS OF THE DATA AND COLLECTIVELY FORM THE MODEL. SO THE MODEL IS ACTUALLY A COLLECTION OF MODELS THAT DESCRIBE SUBSETS OF THE DATA, AND I THINK FOR AUTISM AND SCHIZOPHRENIA THAT'S PROBABLY WHAT WE'RE LOOKING FOR. >> YEAH. >> YEAH. ABSOLUTELY. YEAH. >> YEAH. >> I ACTUALLY SAW, THERE'S A BIOMEDICAL INFORMATICIST AT STANFORD NAMED OMAR DOS THAT DOES THAT. HE DOES PHYLOGENETIC MODELS OF DIAGNOSES AND SYMPTOMS TO DO EXACTLY WHAT YOU'RE TALKING ABOUT, SO PEOPLE ARE WORKING ON THAT AND I THINK IT'S A VERY IMPORTANT PROBLEM. 
OKAY, THANK YOU VERY MUCH. ALL GREAT QUESTIONS. [ APPLAUSE ]