IT'S A VERY GREAT PLEASURE TO INTRODUCE MIKE NALLS AS OUR SPEAKER. HE SPENT TIME AT NIH IN THE LABORATORY OF EPIDEMIOLOGY, DEMOGRAPHY AND BIOMETRY. HE STAYED WITH NIH AS A POSTDOC IN THE LAB OF NEUROGENETICS UNDER ANDREW SINGLETON. HE LED INTERNATIONAL EFFORTS TO FIND MORE THAN 90% OF THE GENETIC RISK FACTORS FOR PARKINSON'S DISEASE. HE HAS PUBLISHED OVER 300 PEER-REVIEWED PAPERS AND OVER 100 OF THEM HAVE CITED BIOWULF. HE TAUGHT A CLASS CALLED IMPUTING BIG DATA, GENOME-WIDE STUDIES, WHICH WAS SO POPULAR HE HAD TO TEACH IT THREE TIMES TO CLEAR OUT THE WAIT LIST. IN 2017, HE FOUNDED DATATECNICA, WHICH PROVIDES DATA SCIENCE CONSULTING FOR HEALTHCARE AND BUSINESS. MIKE'S TALK TODAY IS TITLED NEUROGENETICS AND BIOWULF, FROM GENOME-WIDE ASSOCIATION STUDIES TO MACHINE LEARNING. MIKE? [APPLAUSE] >> THANKS FOR HAVING ME OUT HERE. IT'S A REAL PLEASURE TO GIVE A TALK FOR PEOPLE THAT MADE PRETTY MUCH EVERYTHING IMPORTANT THAT I'VE EVER DONE POSSIBLE. SO WHAT WE'RE GOING TO TALK ABOUT TODAY, IT'S GOING TO BE HOPEFULLY 45 MINUTES OR LESS, NOBODY LIKES THE FULL HOUR, RIGHT, AND THEN WE GET TIME FOR DISCUSSION. WE'RE GOING TO TALK ABOUT PARKINSON'S DISEASE GENETICS AND GENOME-WIDE ASSOCIATION STUDIES WITH KIND OF A GENERAL OVERVIEW OF THE FIELD. WE'RE GOING TO TALK ABOUT HOW THE GROWTH OF THE FIELD OF GWAS KIND OF PARALLELS BIOWULF'S GROWTH AND IS DEFINITELY FACILITATED BY IT. EVERYTHING WE HAVE DONE FOR THE PAST 10 YEARS THAT'S REALLY MADE A DIFFERENCE HAS BEEN WAY TOO BIG FOR JUST ONE LAB. BIOWULF HAS REALLY FACILITATED A LOT OF THAT. A LOT OF WHAT WE DEAL WITH IS MASSIVE DATASETS AND CURRENT GENERATION GWAS TECHNIQUES, SO STANDARDIZED PIPELINES, THINGS LIKE THAT.
WE'RE LOOKING TO GROW THE SCOPE OF GENETICS BEYOND THE BASIC CASE-CONTROL GENETICS, WITH DEEP DATASETS AND CONCEPTS LIKE DISEASE PROGRESSION THAT REALLY MOVE TO A MORE GRANULAR LEVEL THAN PREVIOUS STUDIES OF JUST RECRUITING AS MANY PEOPLE AS POSSIBLE AND ASKING THEM, HEY, ARE YOU A CASE OR CONTROL, WITH VERY LITTLE OTHER INSIGHT. WE'RE ALSO WORKING TO TAKE ANOTHER LEVEL OF ANALYTICS AND REALLY APPLY METHODS FROM THE BUSINESS SIDE OF ANALYTICS AND BUILD A MACHINE LEARNING ECOSYSTEM HERE AT NIH THAT'S EXTENSIBLE GLOBALLY AND IN AN OPEN SOURCE ENVIRONMENT. THIS INCLUDES NOT ONLY ANALYTICS TOOLS BUT ANALYSIS PLATFORMS WITH A LOT OF BIOWULF INTEGRATION FOR COLLABORATION AND HYBRID CLOUD SCENARIOS. SO THIS IS JUST A QUICK NOTE ON META-GWAS, USUALLY USING FIXED EFFECT MODELS. ESSENTIALLY JUST A COMPARISON OF REGRESSION COEFFICIENTS AND STANDARD ERRORS ACROSS DIFFERENT SITES THAT CAN'T SHARE PARTICIPANT DATA. THIS IS AN AWESOME FIGURE THAT I THINK I SEE 15 TIMES A YEAR IN DIFFERENT PEOPLE'S TALKS SO I WANTED TO JOIN THE COOL KIDS GROUP AND PUT IT IN ONE OF MINE. THIS IS FROM THE 2009 NATURE PAPER THAT'S TRULY STILL REALLY, REALLY ON POINT. WE'RE LOOKING AT BASICALLY THE BLUE DOTS ON THIS CHART. THAT'S WHAT WE HAVE POWERED FOR THE MOST PART. THERE'S A LOT OF HERITABILITY TO BE EXPLAINED IN PARTICULARLY PARKINSON'S DISEASE IN THAT REGION. SO THIS IS -- YOU CAN FIND A LOT OF REALLY GREAT IMAGES ABOUT THE GROWTH OF BIOWULF JUST USING GOOGLE IMAGE SEARCH SO I STOLE THIS ONE. THIS IS KIND OF A COMPARISON OF PD GENETICS, PARKINSON'S DISEASE GENETICS, FROM 2005 TO TWO YEARS FROM NOW HOPEFULLY, WHEN WE DO THE NEXT GWAS, AND PARALLEL TO THAT IS THE BIOWULF CPU USAGE PER MONTH. YOU CAN SEE THAT THERE'S NO DENYING THAT COMPUTATIONAL SCIENCE AT NIH IS GROWING, AND WITH THAT, THESE LARGE KIND OF -- LIKE GWAS AND META-GWAS HAVE BECOME MUCH MORE SUCCESSFUL. WHEN I FIRST STARTED WORKING ON PARKINSON'S DISEASE IN AROUND 2010, THERE WERE THREE KNOWN COMMON LOCI FOR PARKINSON'S DISEASE.
THERE WERE ABOUT 2,000 CASES THAT HAD BEEN GENOTYPED AND ABOUT 4,000 CONTROLS. NOW WE'RE LOOKING AT ALMOST 40,000 CASES AND WE HAVE ABOUT 90 COMMON INDEPENDENT RISK FACTORS FOR PARKINSON'S DISEASE. YOU LOOK AT BIOWULF RIGHT NOW, THEY WENT FROM UNDER 10 MILLION CPU HOURS BACK THEN TO 90 MILLION. THE NUMBER 90 KEEPS SHOWING UP. SO MY PERSONAL BEGINNING WITH BIOWULF WAS AS A FRESH PH.D. THAT TRANSFERRED FROM A BIOSTATS/EPIDEMIOLOGY LAB TO A NEUROGENETICS LAB. THERE WERE JUST A FEW SMALL GWAS OF PD, LIKE WE WERE SAYING, JUST A FEW THOUSAND CASES. WE REALLY WANTED TO STEP THINGS UP SO WE STARTED DOING LARGE SCALE IMPUTATION ACROSS DIFFERENT DATA SILOS TO HARMONIZE DATASETS, SO ESSENTIALLY USING REFERENCE HAPLOTYPES TO GUESS WHAT GENOTYPES HAD NOT BEEN GENERATED IN EACH STUDY AND SMOOTH OUT THE DATA ACROSS SITES SO THAT WE COULD COMPARE DISPARATE DATA SOURCES THAT WERE NOT ASSAYED UNIFORMLY ACROSS THESE DATASETS. THAT'S NOT HAPPENING ON YOUR LAPTOP. ONE OF THE PEOPLE I WORKED WITH SUGGESTED, WELL, YOU SHOULD GET A BIOWULF ACCOUNT, AND THEN I USED THAT ACCOUNT TOO MUCH AND HAD TO SEND A LOT OF REQUESTS FOR ADDITIONAL STORAGE, AND PROBABLY ANNOYED A LOT OF PEOPLE. BUT WE GOT SOME GOOD PAPERS OUT OF IT, INCLUDING THIS ONE PAPER WHICH I THINK MIGHT BE THE FIRST IMPUTATION FROM GENOME SEQUENCING IN A META-ANALYSIS. SO EIGHT MORE YEARS OF GROWTH, A LOT MORE SAMPLES, A LOT MORE RISK LOCI AND A LOT MORE COLLABORATORS, A LOT MORE DATA ON BIOWULF, A LOT MORE STORAGE REQUESTS, A LOT MORE CPU HOURS, AND THEN NOW I'M KIND OF JUST GOING TO FOCUS ON THIS RIGHT HERE THAT WE HAVE ON BIORXIV. IT'S NOT TECHNICALLY OUT YET BUT BIORXIV IS KIND OF WHAT MATTERS TO ME. SO THIS IS A META-GWAS OF 42,000 CASES AND 1.4 MILLION CONTROLS OVER 7 MILLION VARIANTS. I KNOW THAT'S AN EXCESSIVE NUMBER OF CONTROLS BUT, YOU KNOW, YOU TAKE WHAT YOU CAN GET.
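AS A MINIMAL SKETCH OF THE FIXED-EFFECT META-GWAS IDEA MENTIONED EARLIER (COMBINING REGRESSION COEFFICIENTS AND STANDARD ERRORS FROM SITES THAT CAN'T SHARE PARTICIPANT DATA), THE BLOCK BELOW POOLS ONE VARIANT ACROSS SITES WITH INVERSE-VARIANCE WEIGHTS. THIS IS NOT CODE FROM THE TALK OR FROM THE LAB'S PIPELINES. THE FUNCTION NAME AND THE EXAMPLE NUMBERS ARE ASSUMPTIONS MADE UP FOR ILLUSTRATION.

```python
import math


def fixed_effect_meta(betas, ses):
    """Inverse-variance-weighted fixed-effect meta-analysis for ONE variant.

    betas: per-site regression coefficients (log odds ratios)
    ses:   per-site standard errors of those coefficients
    Returns (pooled_beta, pooled_se, z, two_sided_p).
    """
    if not betas or len(betas) != len(ses):
        raise ValueError("betas and ses must be non-empty and the same length")
    # Each site is weighted by 1 / SE^2, so more precise sites count for more.
    weights = [1.0 / (se * se) for se in ses]
    total_weight = sum(weights)
    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    z = pooled_beta / pooled_se
    # Two-sided p-value under a standard normal null.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return pooled_beta, pooled_se, z, p


# Hypothetical summary statistics for one variant at two sites.
beta, se, z, p = fixed_effect_meta([0.10, 0.20], [0.10, 0.10])
print(f"pooled beta={beta:.3f} se={se:.4f} z={z:.2f} p={p:.4f}")
```

ONLY SUMMARY STATISTICS CROSS SITE BOUNDARIES HERE, WHICH IS WHY THIS STYLE OF ANALYSIS WORKS WHEN INDIVIDUAL-LEVEL DATA CAN'T BE SHARED.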
WE'RE LUCKY ENOUGH TO HAVE ALL THESE COLLABORATORS AT DOZENS OF SITES ACROSS THE GLOBE, AND ALSO, YOU KNOW, A HUGE CONSORTIUM JUST SOLELY FOCUSED ON PARKINSON'S DISEASE GENOMICS. ALSO SPECIAL THANKS TO THE GREAT PEOPLE AT BIOWULF DOWN HERE THAT PUT UP WITH MY REQUESTS FOR DIFFERENT MODULES, AND MORE STORAGE. SO THIS IS KIND OF THE IDEA OF WHY YOU NEED SOMETHING LIKE BIOWULF TO DO THIS KIND OF RESEARCH, BECAUSE THIS IS THE WORKFLOW FOR ONE PAPER THAT WE DID. IT'S REALLY A HUGE AMOUNT OF DATA TO THINK ABOUT, YOU KNOW, BIOBANK SCALE STUDIES OF GENETICS. DISCOVERY PHASE, ANALYSES OF SUMMARY STATISTICS FROM ALL OF THESE DIFFERENT SOURCES, THEN CALCULATING HERITABILITY AND POLYGENIC RISK SHARED ACROSS DIFFERENT DISEASES, SO WE'RE NOT ONLY GENERATING OUR OWN GWAS SUMMARY STATISTICS BUT ALSO BANKING THEM FROM PUBLICLY AVAILABLE DATA ALL IN THE SAME PLACE. WE'RE DOING QTLs, SO QUANTITATIVE TRAIT LOCUS -- INFERRING FUNCTIONAL CHANGES IN METHYLATION AND EXPRESSION AT THESE SITES TO HELP GUIDE FOLLOW-UP STUDIES FOR THE FUNCTIONAL BIOLOGY PEOPLE THAT UNLIKE ME ACTUALLY UNDERSTAND BIOLOGY. A LOT OF GENE SET ANALYSES, A LOT OF HERITABILITY ESTIMATES. IT'S QUITE A LARGE AMOUNT OF DATA JUST TO DO THESE PAST THE LOCUS DISCOVERY IN THE BOX ON THE LEFT SIDE. SO WHAT IT TAKES TO ACCOMPLISH THAT WORKFLOW IN THE PREVIOUS SLIDE IS A LOT OF DATA. SO A HUGE AMOUNT OF IMPUTED GENOTYPES. A LOT OF REFERENCE DATASETS. VARIOUS WEB SOURCES FOR PATHWAYS, GWAS CATALOGS, IMPUTATION SERVERS THAT WE USE. A LOT OF STANDARDIZED WORKFLOWS. THIS INCLUDES A LOT OF THINGS YOU SEE ON THE NEUROGENETICS GITHUB FOR COMPUTATION ANALYSES, META-ANALYSES. WE TRY TO GET AS MUCH OF THIS PUBLIC IN REAL TIME BECAUSE AT THE NIH, IT'S OUR JOB. OVERALL IN BIOWULF, JUST TAKING A QUICK INVENTORY, THIS PAPER ALONE INCLUDES 2.4 TERABYTES OF GENOTYPE DATA FROM OUR SITE HERE, NOT INCLUDING THE SITES THAT WE DON'T HAVE ACCESS TO IN EUROPE.
AS WELL AS ANOTHER 4.2 TERABYTES OF ADDITIONAL REFERENCE INFORMATION FOR GENE EXPRESSION, METHYLATION, DIFFERENT PATHWAYS. MY MUSICAL TASTES WILL TELL YOU THAT I'M PROBABLY FROM THE 90s AND THE 2000s SO I'M THINKING ABOUT THIS IN TERMS OF CD-ROMs AND THIS WOULD BE A 300-FOOT TALL STACK OF CD-ROMs WITHOUT THE CASES. THAT'S HOW MUCH DATA WE HAVE STORED ON BIOWULF FOR THIS ONE PAPER. SO HERE'S THE RESULTS. WE'LL STOP TALKING ABOUT JUST HOW PAINFUL IT IS TO WORK ON THIS STUFF SOMETIMES. WE'VE DONE A PRETTY GOOD JOB OF FINDING INDEPENDENT RISK FACTORS FOR PARKINSON'S. I THINK PARKINSON'S -- THIS IS A MANHATTAN PLOT, THE P VALUES ARE THE VERTICAL DISTANCES ACROSS ALL THE CHROMOSOMES ON THE BOTTOM. THEN YOU CAN SEE THE NEW GENES ARE IN RED AND THE OLD GENES ARE IN PURPLE. THIS IS REALLY KIND OF THE SUMMARY OF THE WORK OF THOUSANDS OF PEOPLE IN TERMS OF COLLECTING SAMPLES, OF DOING ANALYSES. I'M PRETTY LUCKY TO BE ABLE TO SHOW THIS TO YOU ALL. WE ALSO GOT LUCKY WITH PARKINSON'S DISEASE BECAUSE THERE'S SO MANY MODERATE EFFECTS THAT WE'RE PULLING OUT OF HERE. IT'S DIFFERENT WITH DISEASES LIKE ALZHEIMER'S, WHERE ALZHEIMER'S HAS A FEW GENES OF MUCH LARGER EFFECT, WHERE IN PARKINSON'S WE HAVE -- WE DON'T HAVE AN APOE, WE HAVE A BUNCH OF SMALL KIND OF SOMEWHAT DIFFICULT TO FIND IN LESS THAN 100,000 SAMPLES KIND OF SITUATIONS. SO THIS IS THE OVERWHELMING TABLE OF 38 NOVEL RISK FACTORS. IF YOU'RE CURIOUS ABOUT THESE NEW HITS, YOU CAN DOWNLOAD THIS FROM OUR GITHUB, ALL THE SLIDES ARE UP THERE. THESE ARE GENE-LEVEL ANALYSES SO THIS IS JUST ANOTHER TABLE TO SHOW YOU HOW OVERWHELMING THIS DATA CAN GET AND THESE ARE JUST HITS, AND THE BEST HIT, USING THIS TO MAKE FUNCTIONAL INFERENCES, REALLY JUST A CONCEPT WHERE YOU'RE COMPARING WHAT IS BASICALLY TWO AGGREGATE EFFECTS ACROSS A NUMBER OF VARIANTS WITH TWO DISPARATE OUTCOMES AND LOOKING AT THE OVERLAP BETWEEN THEM AND TREATING IT ALMOST AS A RANDOMIZED CONTROL TRIAL AND MAKING FUNCTIONAL INFERENCES.
SO THIS GIVES US AN IDEA OF WHAT GENE IS BEST TO FOLLOW UP WITH UNDER THESE GWAS PEAKS, SO ALL OF THOSE TOWERS ESSENTIALLY FROM THE MANHATTAN PLOT HAVE MULTIPLE GENES UNDER THEM AND WE'RE PRIORITIZING DIFFERENT GENES FOR THE SMART BIOLOGISTS ACTUALLY TO GO AFTER, AND DO REAL SCIENCEY BIOLOGY-TYPE STUFF THAT I DON'T UNDERSTAND. SO THIS IS ONE THING TO KIND OF SUMMARIZE ALL OF THOSE THREE OVERWHELMING PLOTS AND TABLES IN ONE PLACE, IS LOOKING AT JUST ONE LOCUS, SO ONE OF THOSE TOWERS IN THE MANHATTAN PLOT. AND THIS IS AN INTERESTING RESULT WHERE WE COMBINE THE GWAS DATA ITSELF AND THESE QTL MENDELIAN RANDOMIZATION ANALYSES IN ONE FIGURE, AND WHAT WE SEE IS THAT UNDER THIS PEAK, IF YOU LOOK AT THE TOP VARIANTS THAT HAVE FUNCTIONAL CONSEQUENCES IN THE BRAIN, SO THAT'S IN THE PART OF THE BRAIN MOST AFFECTED BY PARKINSON'S DISEASE, GRN IS ACTUALLY HUGELY ASSOCIATED WITH THE RISK AT THIS REGION, CHANGES IN EXPRESSION IN THAT GENE. WHAT THAT SUGGESTS IS THAT THERE MAY BE A POSSIBLE FUNCTIONAL LINK BETWEEN PARKINSON'S DISEASE AND FRONTOTEMPORAL DEMENTIA THAT IS FACILITATED BY GENETICS INFLUENCING CHANGES IN EXPRESSION. PEOPLE WHO KNOW MUCH MORE ABOUT THESE CONDITIONS APPARENTLY WERE PRETTY EXCITED ABOUT THAT FINDING. I JUST THOUGHT IT WAS A CORRELATION. SO WE'RE GOING TO TALK ABOUT SOME POPULATION SCALE ANALYSES NOW. I DON'T KNOW IF EVERYBODY IN THE ROOM KNOWS WHAT PRS IS. BUT IT'S A POLYGENIC RISK SCORE, SO IT'S VERY SIMILAR TO WHAT UNDERLIES THOSE MENDELIAN -- INDEPENDENT GENETIC RISK PER DISEASE PER INDIVIDUAL. YOU WEIGHT THESE BY EXTERNAL GWAS EFFECT ESTIMATES, SO IF YOU HAVE A NEW SET OF PARKINSON'S DISEASE SAMPLES AND SUMMARY STATISTICS FROM PARKINSON'S DISEASE GWAS, YOU WEIGHT INDEPENDENT VARIANT SETS IN YOUR DATASET -- NOT ALL SNPs HAVE THE SAME EFFECT ON THE DISEASE. SO ONE COULD HAVE AN ODDS RATIO OF 1.5 AND ANOTHER COULD HAVE AN ODDS RATIO OF 2.5. YOU WANT TO WEIGHT THAT RISKIER SNP HIGHER.
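TO MAKE THE WEIGHTING DESCRIBED ABOVE CONCRETE, HERE IS A MINIMAL SKETCH OF A POLYGENIC RISK SCORE FOR ONE INDIVIDUAL: THE SUM OVER INDEPENDENT VARIANTS OF RISK ALLELE COUNT TIMES THE LOG OF THE EXTERNAL GWAS ODDS RATIO. THIS IS AN ILLUSTRATION ONLY, NOT THE LAB'S R OR PYTHON CODE. THE FUNCTION NAME IS AN ASSUMPTION, AND THE TWO ODDS RATIOS ARE THE 1.5 AND 2.5 USED AS EXAMPLES IN THE TALK.

```python
import math


def polygenic_risk_score(allele_counts, odds_ratios):
    """Weighted allele-count score for one individual.

    allele_counts: number of risk alleles (0, 1 or 2, or an imputed dosage)
                   carried at each independent variant
    odds_ratios:   per-variant odds ratios from an EXTERNAL GWAS
    """
    if len(allele_counts) != len(odds_ratios):
        raise ValueError("need one odds ratio per variant")
    # The weight is the log odds ratio, so a riskier SNP contributes more.
    return sum(count * math.log(odds_ratio)
               for count, odds_ratio in zip(allele_counts, odds_ratios))


external_odds_ratios = [1.5, 2.5]

# One copy of each risk allele.
print(polygenic_risk_score([1, 1], external_odds_ratios))
# Two copies of the weaker variant, none of the stronger one.
print(polygenic_risk_score([2, 0], external_odds_ratios))
```

IN THIS TOY CASE, ONE COPY OF THE 2.5 VARIANT ADDS MORE TO THE SCORE THAN TWO COPIES OF THE 1.5 VARIANT, WHICH IS THE POINT OF WEIGHTING RATHER THAN JUST COUNTING ALLELES.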
THIS IS THE ONLY FORMULA I'LL SHOW YOU IN THE ENTIRE TALK, I HOPE. YES? >> WHAT WAS THE -- [INAUDIBLE] >> R. >> [INAUDIBLE] >> FOR MOST OF IT, YEAH. WE'RE SWITCHING EVERYTHING TO PYTHON, SO IF YOU GO TO THE GITHUB, ALL OF THAT STUFF WILL BE DEPRECATED SOON. SO THESE METHODS REALLY ALLOW YOU TO CONSIDER OVERALL DISEASE HERITABILITY IN SAMPLE SIZE. IT'S MUCH STRONGER TO PREDICT THE DISEASE BASED ON A PRS THAN IT IS TO PREDICT ON ONE SNP, SO SOME COMMERCIAL APPLICATIONS THAT TRY TO PREDICT YOUR DISEASE USING ONE OR TWO INDEPENDENT SNPs ARE PROBABLY NOT THE BEST OPTION WHEN YOU CAN USE WEIGHTED AGGREGATE SNPs. THIS MAY NOT EVEN BE OPTIMAL COMPARED TO MORE FINANCE-DERIVED MACHINE LEARNING METHODS. YOU CAN COMBINE THIS, AND WE'VE HAD ACTUALLY QUITE A BIT OF SUCCESS USING THE PRS IN AN ENSEMBLE PREDICTION, OFTENTIMES IN A REGRESSION FRAMEWORK COMBINED WITH CLINICAL AND DEMOGRAPHIC DATA, TO GET YOU MUCH BETTER PREDICTIONS. SO IF YOU LOOK AT THE FIGURE ON THE RIGHT, YOU GET AN ACCURACY AT THE TIME THIS WAS PUBLISHED OF JUST ABOUT 64% IN GENETIC DATA, SO JUST THE PRS ALONE. WHEN YOU COMBINE IT IN A STEPWISE REGRESSION, SO ADDING IN ALL THE FEATURES WE THOUGHT MIGHT BE IMPORTANT IN PREDICTING PARKINSON'S DISEASE, AND WORKING BACKWARDS TO KEEP THE ONES THAT ACTUALLY MATTER, A FEW THINGS THAT SOME PEOPLE THINK ARE PRETTY SIMPLE STAND OUT, SO THE GENETIC RISK SCORE, A SMELL TEST, SO JUST YOUR ABILITY TO SMELL, FAMILY HISTORY OF PARKINSON'S DISEASE AND YOUR GENDER AND YOUR AGE ALLOW US TO PREDICT IN THE CASE-CONTROL SETTING IF YOU'RE A PD CASE WITH AN ACCURACY AROUND 90%. THESE BLOBS ON THE PLOT ARE YOUR CASE -- FOR AT-RISK CONTROLS AND PDs AND LASTLY THIS CATEGORY. THE RELEVANCE OF THIS MODEL AND MODELS LIKE THIS GOING FORWARD IN THE FUTURE IS THAT CLINICAL TRIALS CAN BE AFFECTED BY THINGS LIKE THESE MODELS. ROUGHLY 10 TO 20%, DEPENDING ON WHAT YOU SAMPLE, ARE THESE PARKINSON'S MIMIC SYNDROMES.
WE CAN FIND AND FLAG THOSE CASES THAT HAVE PARKINSON MIMIC SYNDROMES THAT ARE DEFINED BASED ON IMAGING DATA, THEY DON'T HAVE ANY -- FUNCTION, BUT AS YOU CAN SEE HERE, THERE'S A -- MORE SIGNIFICANT ENRICHMENT OF THE RIGHT SIDE OF THE GRAPH, WHERE THEY'RE PREDICTED TO BE CASES, THEY ACTUALLY AT FOLLOW-UP BECOME TRUE PARKINSON'S DISEASE CASES, WHEREAS THE ONES THAT ARE CONTROLS ON THE LEFT SIDE ACTUALLY STAY CONTROLS. YOU CAN IMAGINE IF YOU APPLY THIS TO A CLINICAL TRIAL BEFORE RECRUITMENT YOU'VE JUST UPPED YOUR EFFICACY QUITE A BIT. IF YOU HAVE ALL THESE MIMICS UP THERE, YOU'RE ALREADY HEDGING YOUR BETS ON FAILURE. IF YOU KNOW ABOUT ANYBODY USING IT OR PLAN TO USE IT, PLEASE LET ME KNOW. SO ANOTHER THING WE'RE REALLY INTERESTED IN FROM A PRS STANDPOINT, NOT DIRECTLY PRS BUT VERY SIMILAR WORK, IS SHARED HERITABILITY ACROSS PHENOTYPES BECAUSE THIS ALLOWS YOU TO KIND OF GET PAST OBSERVATIONAL -- UNDERPOWERED OBSERVATIONAL STUDIES. SO USING A WEB SERVICE LIKE LD HUB TO PULL DOWN TONS AND TONS OF SUMMARY STATS FROM A VARIETY OF DISEASES. WHAT WE SEE IS ESSENTIALLY A CORRELATION OF POLYGENIC RISK FOR THESE FOUR TRAITS. SO INTRACRANIAL VOLUME, CURRENT TOBACCO, -- THIS IS FROM UK BIOBANK, AN ENGLISH METRIC OF BASICALLY GOING TO COLLEGE OR NOT. SO WE SAW THESE ASSOCIATIONS FOR THE SHARED POLYGENIC RISK BEING CORRELATED ACROSS THESE AND WANTED TO DO TESTS OF FUNCTIONALITY WITHIN A MENDELIAN RANDOMIZATION FRAMEWORK. WHAT WE SEE IS SOME VERY CONTRADICTORY RESULTS TO A LOT OF THE OBSERVATIONAL LITERATURE FROM A LOT OF UNDERPOWERED OR -- STUDIES OF SMOKING STATUS AND THINGS LIKE CAFFEINE BEING FUNCTIONALLY ASSOCIATED WITH PARKINSON'S DISEASE. SO NONE OF THOSE FACTORS WERE IN ANY OF OUR PREDICTIVE MODELS THAT USED BACKWARDS STEPWISE REGRESSION FOR PARKINSON'S DISEASE. THOSE FACTORS WERE NEVER IN THERE, WE USED THE PRS AND SMELL TEST AND STUFF LIKE THAT.
SO THAT MADE ME -- NOT SHOWING THEM GENETICALLY CORRELATED AS MUCH, PARTICULARLY CAFFEINE, BECAUSE PEOPLE THINK CAFFEINE HAS SOMETHING TO DO WITH PARKINSON'S DISEASE. WE ALSO LOOK AT THE FUNCTIONAL ASSOCIATION WITH SMOKING STATUS, AND THERE'S NO CLEAR FUNCTIONAL ASSOCIATION IN TERMS OF CAUSALITY FROM MENDELIAN RANDOMIZATION BETWEEN SMOKING STATUS AND PARKINSON'S DISEASE. THERE'S SOME REVERSE CAUSALITY. THE ONLY EXTREMELY CLEAR CAUSATIVE ASSOCIATION THAT WE CAN MAKE AN INFERENCE ABOUT IS COGNITIVE PERFORMANCE AND EDUCATIONAL ATTAINMENT. SO ALL THE SMART PEOPLE IN THIS ROOM ARE PROBABLY AT HIGHER RISK FOR PARKINSON'S DISEASE THAN A LOT OF OTHER PEOPLE. SO JUST TO TALK A LITTLE FURTHER ABOUT HERITABILITY AND POLYGENIC RISK ESTIMATES, THIS IS SOMETHING THAT WE ALWAYS TALK ABOUT, HOW GOOD CAN WE GET WITH JUST GENETICS. AND YOUR ABILITY TO MAKE A GENETIC PREDICTOR IS BASED LARGELY ON THE POSSIBLE HERITABILITY OF A DISEASE BASED ON COMMON VARIANTS. SO WE'VE BEEN EXTREMELY LUCKY, WITH PRS BETWEEN 65 AND 70% AREA UNDER THE CURVE AT VALIDATION. LOOKING AT THESE PRS GOING WAY DOWN IN SIGNIFICANCE, AROUND 18,000 VARIANTS, YOU CAN SEE THESE TRENDS HERE AND HERE: AS THE QUARTILES OF PRS INCREASE, YOUR ODDS RATIOS REALLY INCREASE AT TRAINING AND VALIDATION. SOME OF THE RESULTS FOR THIS IN THE PINK LINE, WHICH IS VALIDATION, ARE BECAUSE THEY HAVE SOME INHERENT POPULATION STRUCTURE THAT HAS ELEVATED FREQUENCY OF LARGER RISK VARIANTS FOR PARKINSON'S DISEASE. YOU WOULD EXPECT THE MODEL TO PERFORM A LITTLE BIT BETTER AND THAT'S WHY WE SEE UNDERFIT IN THE TRAINING SET AND SLIGHTLY INCREASED AUC IN THE VALIDATION. WE'RE FINDING THINGS THAT HAVE AN ODDS RATIO OF 1.1, SO THAT'S NOT HUGE FOR RISK PREDICTION OR CLINICAL APPLICATIONS BUT WE'RE BUILDING NETWORKS, BUILDING THESE CASCADES OF WHAT HAPPENS IN PARKINSON'S DISEASE. WE FOUND POWER CALCULATIONS BASED ON -- THAT KIND OF BUILD THE OPTIMAL PRS.
THESE HAVE A P VALUE RIGHT NOW OF FIVE TIMES E TO THE NEGATIVE THREE -- AS YOU GET TO THE MORE MARGINAL P VALUES, THE ALLELE FREQUENCIES GET LOWER AND SO DO THE EFFECT ESTIMATES, SO WE NEED MORE SAMPLES TO KIND OF REACH THAT GENOME-WIDE SIGNIFICANCE. TO DO THAT, IT'S GOING TO TAKE AROUND 99,000 CASES, SO WE'RE AT LIKE ALMOST 40,000 CASES NOW SO WE NEED TO DOUBLE THAT. ALSO WHAT'S GOING TO HELP WITH THIS IS ACTIVELY RECRUITING MORE DIVERSE POPULATIONS OF PD CASES AND CONTROLS. THIS IS PROBABLY THE MOST IMPORTANT NEXT STEP BECAUSE THIS WILL REALLY CHANGE HOW YOU VIEW HAPLOTYPE STRUCTURE BUT ALSO BRING IN DIFFERENT ALLELE FREQUENCIES FOR LOCI YOU MIGHT FIND IN THE FUTURE. IT ALSO RAISES THE QUESTION OF WOULD A GENETIC PREDICTOR WORK IN DIVERSE POPULATIONS AS WELL AS IT DOES IN THESE EUROPEAN POPULATIONS THAT WE'VE BEEN WORKING IN SO FAR. SO THE NEXT STEPS IN TERMS OF TOPICS ARE PREDICTORS OF PROGRESSION, SO SINGLE OUTCOMES LIKE COGNITIVE SCORES BUT ALSO GENERAL PROGRESSION TRAJECTORIES, WHICH WE'LL TALK ABOUT IN A COUPLE MINUTES. MORE DATA FROM DIVERSE SOURCES, MULTIPLE MODALITIES. WE'RE IN AN IPS REVOLUTION RIGHT NOW. BUT THAT RAISES THE QUESTION OF A HUGE NUMBER OF COLLABORATORS, OF SITES AND SILOS, MASSIVE DATASETS HARBORED AT DIFFERENT UNIVERSITIES. WE NEED TO REALLY START WORKING ON IMPROVED DISEASE PREDICTORS LIKE -- THERE NEEDS TO BE MORE DIVERSITY IN OUR RESEARCH PROGRAMS IN TERMS OF ANCESTRY AND ALSO LOOKING AT LOWER PREVALENCE DISEASES, AND THESE RAISE GENERAL CONCERNS OF APPLICABILITY, REPRODUCIBILITY, REAL-TIME SHARING AND SCALABILITY THAT WILL REALLY ONLY BE REMEDIED IF WE START INVESTING IN COMPUTE INFRASTRUCTURES IN STUDIES LIKE THIS FOR A LARGE SCALE. I THINK A LOT OF THE TOOLS WE'RE WORKING TOWARDS ARE FOR DEEPLY PHENOTYPED STUDIES, WHICH IS SOMETHING THAT WE'VE BEEN PAYING A LOT OF ATTENTION TO LATELY.
UNSUPERVISED LEARNING FOR SUBTYPING, SUPERVISED FOR PREDICTION, FEDERATED ACROSS SILOS, SHARING DATA TO A CENTRALIZED SERVER KIND OF LIKE HOW EVERYONE IN THIS ROOM'S CELL PHONE HAS ESSENTIALLY TAUGHT ANDROID OR IOS HOW BAD WE ALL ARE AT TYPING, WITHOUT GIVING AWAY ANY OF OUR DATA. WE THINK IN TERMS OF BIOWULF WITHIN NIH, AND ALSO A GOOGLE CLOUD ASPECT FOR ALL OF OUR COLLABORATORS AS A MIRROR SITE OF WHAT WE HAVE ON BIOWULF SO THAT THEY CAN RUN ANALYSES IN THE CLOUD, WHICH IS REALLY IMPORTANT. WE'VE BEEN USING TERRA TO FACILITATE THIS FOR THE PD PROJECT, WHICH IS A LOT OF WHAT I WORK ON. AND ALSO ANTHOS, ANOTHER TOOL FOR -- THAT WE'VE BEEN REALLY LOOKING INTO. WE GET TO DO EVERYTHING FOR THE PUBLIC GOOD AND HELP OTHER RESEARCHERS DO THINGS THAT ARE VALUABLE, SO WE REALLY ARE THINKING THAT IF YOU'RE NOT WORKING IN JUPYTER OR GITHUB AND NOT PUTTING YOUR PAPERS ON BIORXIV THAT IT MIGHT BE GOOD TO START DOING THAT, BECAUSE WE'RE DOING THIS FOR THE SICK PEOPLE, RIGHT? SO THERE'S ALL THESE TOOLS FOR OPEN SCIENCE NOW THAT ARE FACILITATING THAT. SO IF YOU WANT TO SEE MORE OF THE INFRASTRUCTURE, A PERFECT EXAMPLE IS WHAT'S COMING OUT IN THIS LINK ON THE SLIDE. SO THEN WE'RE GOING TO TALK ABOUT SOME QUICK MACHINE LEARNING BASED ANALYSES AND PLATFORM, TOOL ECOSYSTEM WORK WE'VE BEEN DOING, FOR ABOUT THE NEXT 10 MINUTES. JUST A QUICK INTRO TO KEY CONCEPTS IN MACHINE LEARNING. SO THIS IS A REALLY GOOD REVIEW THAT I WAS NOT INVOLVED IN, BUT PEOPLE WHO SEEM TO REALLY BE RESPONSIVE TO THIS ARTICLE WERE THINGS LIKE JOURNAL CLUBS. THE COMMON THING YOU ALWAYS HEAR IS WHY DO YOU USE MACHINE LEARNING WHEN YOU CAN JUST DO REGRESSION? THAT'S BECAUSE PROBLEMS ARE SOMETIMES NON-LINEAR IN THEIR SOURCE AND THIS IS A PERFECT EXAMPLE OF THAT. SOME OF THE STUFF THAT WE DEVELOP, WE ACTUALLY USE REGRESSION. REALLY THESE REGRESSION MODELS ARE PARTS OF THE TOOLKIT OF MACHINE LEARNING. IT'S JUST AN ALGORITHM, RIGHT? EVERYTHING THAT YOU HEAR, DEEP LEARNING, IT'S KIND OF LIKE SAYING SCREWDRIVER, HAMMER, SAW.
IT WORKS WHEN THE DATA WANTS TO WORK, RIGHT? WE HAVE SUPERVISED AND UNSUPERVISED LEARNING IN THESE TWO FIGURES. BREAKING DOWN THESE FANCY FIGURES, ONE-SENTENCE EXPLANATIONS WOULD BE THAT SUPERVISED LEARNING IS HOW YOU PREDICT SOMETHING THAT ALREADY HAS A LABEL ON IT, THAT ALREADY HAS AN OUTCOME. UNSUPERVISED LEARNING IS HOW YOU GENERATE A LABEL BASED ON DATA, SO JUST A DATA-DRIVEN LABELING. THE MAIN IMPORTANT THING HERE IS CROSS VALIDATION PLUS TUNING, SO THAT'S HOW THESE SPECIFIC MODELS CAN REALLY MAKE BETTER PREDICTIONS BUT ALSO MORE GENERALIZABLE PREDICTIONS, BECAUSE IN CROSS VALIDATION, YOU'RE WITHHOLDING A PIECE OF YOUR DATA, FITTING THE MODEL TO THE REST, AND SEEING HOW IT PERFORMS ON THAT WITHHELD DATA OVER AND OVER AGAIN SO YOU KNOW YOU'RE NOT OVER OR UNDERFITTING AND MAKING UP STORIES. DID YOU HAVE A QUESTION? >> [INAUDIBLE] >> THAT'S NOT MINE. THAT'S A PAPER FROM NATURE GENETICS. THESE IMAGES ARE FROM A GREAT REVIEW I HAD NO PART OF, BUT IT'S REALLY GOOD. IT'S A GREAT REVIEW. SO JUST A CASE STUDY IN WHAT WE'VE BEEN DOING IN SUPERVISED MACHINE LEARNING APPLICATIONS. THIS IS JUST PREDICTIONS ACROSS A LARGE VARIETY OF PARKINSON'S DISEASE COHORTS. AREA UNDER THE CURVE, THAT'S A GOOD INDEX OF HOW GENERALIZABLE AND STRONG YOUR MODEL IS. SO WITH THE POLYGENIC RISK SCORES, WE SEE AN AUC ACROSS ALL OF 65% FOR THE GENETIC PREDICTOR. MACHINE LEARNING, IN THIS CASE, THIS IS AN -- ALGORITHM ACROSS I THINK 17 COHORTS, HAD AN AUC AT CROSS VALIDATION OF ALMOST 7%. WHEN YOU TAKE THE PRS PLUS THE CLINICAL DATA WE DISCUSSED EARLIER LIKE SMELL TEST AND AGE, FAMILY HISTORY, AND SEX, WE SEE AN AUC IN A LINEAR MODEL FROM ROUGHLY 92% TO -- IN A BOOSTED MODEL OF 94.6% FOR A CROSS-SECTIONAL PREDICTION OF PARKINSON'S DISEASE. IN GENERAL, YOU CAN EXPECT FROM OUR SIMULATIONS AROUND A 1% INCREASE PRETTY EASILY OVER A STANDARD REGRESSION MODEL. AND THAT'S BECAUSE, LIKE, LINEAR MODELS ARE LIKE A SWISS ARMY KNIFE, RIGHT?
IT DOES EVERYTHING PRETTY GOOD, BUT SOME OF THESE MODELS, BOOSTED MODELS IN PARTICULAR, ARE FANTASTIC IN GENETICS, AND IF YOU LOOK AT WHAT WINS ON KAGGLE, WHICH IS A WEBSITE FOR FINANCE-BASED MACHINE LEARNING ALGORITHM COMPETITIONS, YOU'LL SEE THAT ANYBODY, PARTICULARLY I THINK IT WAS A GERMAN PAPER THAT SAID MACHINE LEARNING IS NOT AS GOOD AS LINEAR MODELS, YOU'LL FIND THAT THAT'S NOT THE CASE IN A DATA-DRIVEN WAY, WHERE THE WINNERS ARE TAILORED ALGORITHMS THAT REALLY HAVE BEEN COMPETED AGAINST EACH OTHER TO PICK THE BEST ALGORITHM AND FIT THE DATA. IT'S MORE LABOR INTENSIVE BUT YOU GET A BETTER RESULT GENERALLY. AND THERE'S A NICE PAPER COMING OUT BY SOME OF THE PEOPLE IN THIS ROOM THAT'S A GWAS IN A SPANISH COHORT, WHICH ACTUALLY HAS A DIFFERENT FREQUENCY FOR A NUMBER OF MAJOR RISK VARIANTS IN PARKINSON'S DISEASE THAN IN OTHER EUROPEAN POPULATIONS, AND THEY'VE ACTUALLY USED SOME OF THE TOOLS THAT WE'RE GOING TO TALK ABOUT IN A COUPLE SLIDES TO REALLY MAKE A NICE ASSESSMENT OF RISK AND CUMULATIVE RISK IN THAT POPULATION. SO RON AND SARAH DID A GREAT JOB WITH THAT. SO ONE THING THAT WE'LL TALK ABOUT ON THE UNSUPERVISED LEARNING SIDE IS A REALLY STELLAR EFFORT IN BUILDING TRAJECTORIES OF PARKINSON'S DISEASE, SO AGGREGATING HUGE AMOUNTS OF LONGITUDINAL CLINICAL DATA AND REALLY BUILDING THESE VECTORS OF HOW PARKINSON'S DISEASE IS IMPACTING VARIOUS FACETS OF YOUR DAILY LIFE FROM, I THINK, WHAT WAS IT, IT ENDED UP BEING ABOUT OVER 100 FACTORS THAT WE BOILED DOWN TO 60? 154, YEAH. SO 154 DIFFERENT CLINICAL MEASURES ON DIFFERENT SCALES, TO BUILD THREE CLUSTERS OF PARKINSON'S DISEASE THAT ALL HAVE DIFFERENT KIND OF TRENDS AND PROGRESSION. DOWN HERE ARE THE NON-PDs, WHICH ARE JUST A REFERENCE FOR NORMAL AGING. SO YOU CAN SEE HOW THEY SPLIT OFF INTO PEOPLE THAT ARE MORE COGNITIVELY IMPACTED VERSUS PEOPLE THAT ARE MORE MOTOR IMPACTED. THE HALLMARK, YOU CAN SEE THE BANDWIDTH IS MUCH LARGER BASED ON THE -- ON KIND OF HOW EVERYONE CHANGES WITH TIME.
IN THEIR SEVERITY. THIS IS COMPLETELY DATA-DRIVEN, SO TO ACTUALLY PICK OUT THESE CLUSTERS, WE DIDN'T ASK ANY CLINICIANS, THIS IS ALL JUST THE DATA. AND WHAT YOU CAN SEE IS WHEN YOU ADD THESE LABELS TO THE DATASET, YOU CAN START PREDICTING THE LABELS QUITE WELL ACROSS ALL OF THE -- USING SUPERVISED LEARNING METHODS TO PREDICT WHO'S GOING TO BE IN WHAT CATEGORY OF PROGRESSION. CATEGORY 2 IS KIND OF THE MODERATE CATEGORY AND, AS YOU WOULD EXPECT, AT VALIDATION IT'S HARD TO PREDICT CATEGORY 2, BUT THE VERY MILD PROGRESSING PARKINSONISMS AND VERY SEVERE PROGRESSING PARKINSONISMS ARE EASY TO PREDICT. SO THIS IS HOW YOU CAN SEE HOW IT BREAKS DOWN. THERE'S NOTICEABLY DIFFERENT TRENDS IN ALL OF THESE SINGLE PARAMETERS, SUCH AS SCORES FOR 1 AND 2, WHICH ARE KIND OF INDEXES OF THE DISEASE SEVERITY ACROSS TIME, SO THE X AXIS IS THREE YEARS OF FOLLOW-UP FROM DIAGNOSIS. SO YOU CAN SEE HOW THIS COULD INFORM A CLINICAL TRIAL READOUT, PARTICULARLY HOW THE FDA IS CHANGING HOW THEY'RE DOING DRUG EFFICACY. ALSO IN PATIENT RECRUITMENT, YOU CAN USE THESE MACHINE LEARNING ALGORITHMS TO RECRUIT FAST PROGRESSORS OR PEOPLE WITH THE SAME KIND OF FLAVOR OF DISEASE, THE SAME KIND OF COGNITIVE OR SLEEP-BASED IMPACTS. ALSO ONE THING THAT WE'VE BEEN WORKING ON IS WAYS WE COULD POSSIBLY RESCUE FAILED TRIALS IN MORE HOMOGENEOUS SAMPLE SERIES. SO ANOTHER QUICK CASE STUDY WE'RE TALKING ABOUT IS WORK THAT'S UNDERWAY THAT DOESN'T HAVE RESULTS QUITE YET IN TERMS OF MACHINE LEARNING BUT HAS BEEN A STELLAR PROJECT. WE'RE BUILDING THESE TRAJECTORIES OF MACHINE LEARNING, THESE KIND OF CLUMPS OF SAMPLES THAT LOOK SIMILAR, AND LOOKING AT WHERE THEY'RE GOING.
BUT THE IDEA WOULD BE HOW DO WE PREDICT THESE INDIVIDUAL MARKERS ACROSS MANY DIFFERENT DATA SILOS, SO THIS IS SOME REALLY GREAT WORK, THEY'VE DONE A FANTASTIC JOB ON THIS ALONG WITH HAMPTON LEONARD. THE AMOUNT OF DATA CLEANING THAT WENT INTO THIS, IF YOU THINK ABOUT CLEANING ESSENTIALLY CLINICAL DATA FOR 4,000 PEOPLE ACROSS 12 DIFFERENT STUDIES THAT HAVE LOOSE STANDARDS ACROSS ALL HOSPITAL CENTERS AND THINGS LIKE THAT AND NATIONS, FOR 25,000 OBSERVATIONS, AND THEN DOING A GWAS OF, YOU KNOW, 22 MILLION SNPs ON EACH STUDY AND COMBINING THE DATA FOR 30 DIFFERENT MEASURES, IT'S A STELLAR AMOUNT OF WORK. I THINK THEY TOOK AS LONG AS A FOLLOW-UP IS PROBABLY HOW MUCH TIME I'VE SPENT WITH THEM CLEANING THIS DATA. AND JEFF'S BUILT A GREAT WEB BROWSER IN CASE YOU'RE INTERESTED IN LOOKING AT ANY OF THE RESULTS FOR THESE OUTCOMES FOR ANY GENE OF INTEREST OR ANY VARIANT OF INTEREST. THE IDEA IS THAT THIS IS AN EXTREMELY DIFFICULT THING TO PREDICT BECAUSE OF HETEROGENEITY, BUT USING FEDERATED LEARNING METHODS, WHAT WE'RE WORKING ON RIGHT NOW IS A FRAMEWORK WHERE WE CAN APPLY A CENTRALIZED SERVER THAT READS IN THE PREDICTIONS ACROSS ALL OF THESE DATA SILOS TO MAKE PREDICTIONS THAT WILL BE ON A LARGE SCALE FOR ALL OF THESE OTHER INDEXES SUCH AS UPDRS SCORES AND THINGS LIKE THAT, AND MIGHT BE USEFUL TO CLINICIANS FOR PLANNING TREATMENTS AND THINGS LIKE THAT IN THE FUTURE. SO IT'S A SUBSTANTIAL INFRASTRUCTURE BUILD AND AN ECOSYSTEM ACTUALLY THAT WE'RE PUTTING TOGETHER FOR THAT. IT'S ALSO HARD BECAUSE THESE EFFECTS ARE VERY SMALL, SO AGGREGATE EFFECTS ARE IMPORTANT, AND WE ONLY HAVE ACTUALLY TWO GWAS HITS FOR ALL DOZEN TRAITS OR SO THAT WE ANALYZED. ALSO THE SIZE OF THE DATA, WORKING WITHIN THE ML VALIDATION FRAMEWORK MAKES A LITTLE MORE SENSE TO ME THAN A PRS BECAUSE YOU HAVE -- YOU DON'T HAVE AN EXTERNAL SET OF BETAS TO WEIGHT VARIANTS BY, YOU REALLY JUST HAVE THE DATA ITSELF, SO WHY NOT BUILD DE NOVO CLASSIFIERS.
THIS WILL BE IMPORTANT IN THE FUTURE, PARTICULARLY IF WE CAN GET SIMILAR DATA FROM LARGE EMR COLLECTIONS OR BIOBANKS WHERE YOU CAN SHARE DATA OUTSIDE OF THE FEW VERY CUTTING EDGE EUROPEAN BIOBANKS LIKE THE UKB AND STUFF LIKE THAT. SO THESE ARE THE TOOLS WE'RE WORKING ON TO FACILITATE A LOT OF THE RESEARCH. SOME OF IT'S ON BIOWULF ALREADY IN BETA BUT WE'RE ACTUALLY REBUILDING THE CORE THIS WEEK IN PYTHON TO SPEED IT UP A LITTLE BIT. THIS IS GENOML, SOMETHING THAT ALL THESE AWESOME PEOPLE HAVE REALLY CONTRIBUTED TO IN SOME WAY OR ANOTHER. IT'S AN AUTOMATED MACHINE LEARNING SYSTEM, FOCUSED ON SUPERVISED LEARNING. IT WORKS EXTREMELY WELL ON BIOWULF THANKS TO SOME HELP FROM ALL THE ORGANIZERS OF THIS TALK, AND WE'RE HAPPY IF ANYBODY WANTS TO GIVE IT A SHOT. A LOT OF THE SUPERVISED LEARNING PREDICTIONS THAT WE'VE TALKED ABOUT IN THIS PAPER ARE DONE USING THIS SOFTWARE. WE'RE HOPING TO EXPAND THIS IN THE FUTURE. YOU CAN SEE THE SMALL TIMELINE DOWN HERE INTO A LARGER ECOSYSTEM WHERE WE'RE LOOKING TO DO -- HAVE A PORTAL TO FACILITATE THIS AUTOMATED MACHINE LEARNING PIPELINE TO MAKE IT EASY FOR PEOPLE WHO AREN'T BIOSTATISTICIANS OR AREN'T COMPUTER SCIENTISTS TO APPLY THIS, AND I THINK THERE'S A LARGE DEMAND FOR THAT AT NIH. A LOT OF THIS WILL BE MANAGED ACROSS DIFFERENT RESOURCE SETS. WE'RE GOING TO TRY AND MAKE THIS AS FLUID AS POSSIBLE TO DEPLOY ON SYSTEMS LIKE TERRA OR BIOWULF, ALL THE COMMERCIAL CLOUDS OR JUST IN-HOUSE ON A LAPTOP IN BUILDING 35. THE IDEA IS HAVING THIS MACHINE LEARNING FRAMEWORK AND ECOSYSTEM TO FACILITATE COLLABORATION ACROSS THE SILOS. IT'S JUST A FACT THAT YOUR EXTERNAL COLLABORATOR CAN'T GET ON BIOWULF, BUT WE'RE TRYING TO MAXIMIZE THE RESOURCES WE HAVE HERE AND EVERYWHERE SO THEY FUNCTION SEAMLESSLY TOGETHER.
THAT BRINGS US TO THE IDEA OF SOME WORKFLOWS THAT WERE INSPIRED BY THE GA4GH AND SIMILAR INITIATIVES, HOW WE'RE TRYING TO BUILD OUR OWN OPEN SOURCE HYBRID CLOUD TO FACILITATE THESE COLLABORATIONS, AND THIS INCLUDES BIOWULF AND GOOGLE CLOUD PRIMARILY AS THE UNDERPINNINGS FOR THIS, WITH A LAYER OF APPS THAT WE REALLY FIND USEFUL THAT ARE ALL JUST FANTASTIC OPEN SOURCE TOOLS. NOBODY SHOULD REALLY HAVE TO PAY FOR ANYTHING ANYMORE, THE OPEN SOURCE IS GOOD AND JUST KEEPS GROWING. THIS GOES FROM DATA MANAGEMENT TO SHARING RESULTS IN REAL TIME. SO THAT'S WHAT WE'RE WORKING ON, AND I THINK, YOU KNOW, THIS OPENS THE WAY WE HOPE TO FACILITATE SOME OPEN SCIENCE. SO THANKS A LOT, AND I KNOW THIS WAS A LOT TO COVER, I BUDGETED 45 MINUTES, IT'S 43 SO I'M PRETTY HAPPY ABOUT THAT. THE SLIDES ARE AVAILABLE ON GITHUB FOR THE NEUROGENETICS LAB. IF YOU'RE INTERESTED IN COLLABORATING ON ANY OF THESE TOPICS, PLEASE GET IN TOUCH. THE MORE PEOPLE THAT WE GET TO WORK WITH, THE MORE IDEAS WE HAVE, THE BETTER THINGS ARE, THE MORE THE MERRIER. DO YOU HAVE ANY ANALYSIS PROJECTS YOU WANT TO WORK ON WITH US? BECAUSE WE'VE SHOWN A LOT OF DATA THAT YOU CAN PROBABLY ANALYZE BETTER THAN I CAN. ARE YOU INTERESTED IN CONTRIBUTING -- THERE'S SO MANY PEOPLE WITH GREAT IDEAS IN TERMS OF THE CONTEXT OF MACHINE LEARNING. DO YOU HAVE NEUROGENIC DISEASE SAMPLES THAT NEED GENOTYPING OR SEQUENCING? WE HAVE A NUMBER OF INITIATIVES THAT WE'RE DOING RIGHT NOW TO GROW THESE DATASETS. ALSO WE'RE RECRUITING EIGHT COMPUTER SCIENTISTS FOR DIFFERENT DD INITIATIVES AT NIH. IF YOU'RE INTERESTED IN THAT OR KNOW ANYBODY LOOKING FOR A JOB THAT HAS SOME COMPUTATIONAL SKILLS, PLEASE GET IN TOUCH. THANK YOU. [APPLAUSE] >> [INAUDIBLE] >> IT'S TRICKY. WE USED THE MENDELIAN RANDOMIZATION APPROACH TO KIND OF ADDRESS THAT TO SOME DEGREE. THE ENVIRONMENTAL DATA THAT I'VE ENCOUNTERED IS VERY DIFFICULT TO WORK WITH AND IN SMALL BITES.
[INAUDIBLE QUESTION] >> SO THERE'S A RANKING IN THAT IN WHICH -- I'M TRYING TO REMEMBER THE EXACT WEIGHTINGS FOR AGE AND SEX BUT I'M NOT 100% ON THOSE. I KNOW IN GENERAL, MOST OF THE TIME AT CROSS VALIDATION IT SEEMS LIKE SMELL TEST IS THE MOST IMPORTANT FEATURE. SO LOSING YOUR SENSE OF SMELL, BECAUSE THAT'S A PROXY FOR FRONTAL DEGENERATION, AND GRS IS THE SECOND MOST IMPORTANT. SO YOU'RE GETTING A HUGE WEIGHTING, ALMOST THREE TIMES AS MUCH AS THE NEXT COMPONENT IN THE MODEL IN THE SIX COMPONENTS WE'VE USED, FOR THE SMELL TEST. SO THE PERSON WITH JUST THE GENETIC RISK SHOULD BE AROUND 65 TO 70% ACCURACY IN TERMS OF AUC, BUT UNTIL YOU GET CLOSER TO THE TIME OF DIAGNOSIS, WE'VE DONE SIMULATIONS AND IT WORKS DOWN TO ABOUT 80% ACCURACY THREE YEARS OUT, BUT WE DON'T HAVE ANY LONGITUDINAL DATA, LIKE ANY PREDIAGNOSTIC LONGITUDINAL DATA ON IT, IT JUST DOESN'T EXIST. BUT IN SIMULATION IT WORKS AT -- IT'S NOT LIKE YOU JUST LOSE YOUR SENSE OF SMELL IMMEDIATELY. THE PROBLEM WITH A LOT OF THESE THINGS IS, YOU HAVE A LOT OF TIME-DEPENDENT FACTORS THAT ARE REALLY GOOD PERIDIAGNOSTICALLY, BUT TIME-INDEPENDENT FACTORS ARE WHAT YOU NEED FOR A GOOD PREDIAGNOSTIC MODEL, AND THAT'S WHY STRONG GENETIC ASSOCIATIONS ARE SO VALUABLE, BECAUSE THEY'RE NOT CHANGING. >> [INAUDIBLE] >> YEAH, SO -- MODELS WORK REALLY WELL WITH LOTS OF SMALL, SLIGHTLY CORRELATED PREDICTORS IN PARTICULAR. YOU'LL SEE THAT IN A LOT OF THE FINANCE LITERATURE AS WELL. A LOT OF THE THINGS WE DEAL WITH ARE AT AN ODDS RATIO OF 1.1, YOU KNOW, TO 1.3. THOSE ARE PRETTY SMALL EFFECTS. AND WE'RE DEALING WITH HUNDREDS OF THOUSANDS, SO IT'S WELL SUITED FOR THAT ALGORITHM. IN GENOML, WE COMPETE A DOZEN ALGORITHMS FOR DISCRETE TRAITS AND A SEPARATE SET OF A DOZEN ALGORITHMS CURATED FOR CONTINUOUS TRAITS WHEN WE RUN THAT PIPELINE TO MAKE THESE PREDICTIONS, AND THESE ARE BASED OFF OF ACTUALLY TOP HITS AT KAGGLE FOR ALGORITHMS THAT WIN IN CONTESTS.
SO WE'RE COMPETING LIKE ESSENTIALLY THE HOT ALGORITHMS OF THE WEEK, YOU KNOW? >> [INAUDIBLE] >> ALL RIGHT, THANK YOU. [APPLAUSE]
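AS A ROUGH ILLUSTRATION OF THE TWO IDEAS FROM THE TALK THAT FIT TOGETHER HERE, HOLDING OUT DATA IN CROSS VALIDATION AND COMPETING ALGORITHMS ON AREA UNDER THE CURVE, THE SKETCH BELOW RUNS A K-FOLD COMPARISON OF TWO TOY LEARNERS ON SIMULATED DATA. IT IS NOT THE PIPELINE DESCRIBED IN THE TALK. THE TWO LEARNERS, ALL FUNCTION NAMES, AND THE SIMULATED DATASET ARE HYPOTHETICAL STAND-INS CHOSEN SO THE EXAMPLE RUNS ON THE PYTHON STANDARD LIBRARY ALONE.

```python
import random


def auc(scores, labels):
    """Area under the ROC curve: the chance that a random case scores higher
    than a random control (ties count as half)."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = sum(1.0 if c > k else 0.5 if c == k else 0.0
               for c in cases for k in controls)
    return wins / (len(cases) * len(controls))


def fit_mean_difference(X, y):
    """Hypothetical toy learner: weight each feature by the difference
    between its mean in cases and its mean in controls."""
    cases = [row for row, lab in zip(X, y) if lab == 1]
    controls = [row for row, lab in zip(X, y) if lab == 0]
    weights = [sum(r[j] for r in cases) / len(cases)
               - sum(r[j] for r in controls) / len(controls)
               for j in range(len(X[0]))]
    return lambda row: sum(w * x for w, x in zip(weights, row))


def fit_first_feature_only(X, y):
    """Hypothetical baseline: score by the first feature, no training."""
    return lambda row: row[0]


def cross_validated_auc(fit, X, y, k=5, seed=0):
    """Mean AUC over k folds; each fold is held out once and never trained on."""
    order = list(range(len(X)))
    random.Random(seed).shuffle(order)
    fold_aucs = []
    for fold in (order[i::k] for i in range(k)):
        held_out = set(fold)
        train = [i for i in order if i not in held_out]
        model = fit([X[i] for i in train], [y[i] for i in train])
        fold_aucs.append(auc([model(X[i]) for i in fold], [y[i] for i in fold]))
    return sum(fold_aucs) / k


# Simulated data: 200 samples, 3 weakly informative features (made up).
rng = random.Random(42)
y = [i % 2 for i in range(200)]
X = [[rng.gauss(0.5 * label, 1.0) for _ in range(3)] for label in y]

candidates = {"mean_difference": fit_mean_difference,
              "first_feature_only": fit_first_feature_only}
results = {name: cross_validated_auc(fit, X, y) for name, fit in candidates.items()}
best = max(results, key=results.get)
print(results, "-> best:", best)
```

THE SAME LOOP EXTENDS TO A LONGER LIST OF CANDIDATE ALGORITHMS. THE KEY DESIGN POINT IS THAT EVERY CANDIDATE IS SCORED ONLY ON SAMPLES IT NEVER SAW DURING FITTING.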