WELCOME TO THE DATA SCIENCE BASICS COURSE. I'M DIANNE BABSKI, AN ASSOCIATE DIRECTOR AT NLM AND DIRECTOR OF THE DATA SCIENCE TRAINING PROGRAM HERE. LAST MONTH WE CONDUCTED THE DATA SCIENCE READINESS SURVEY, AND MANY OF YOU COMPLETED IT. LAST WEEK OR THIS WEEK YOU ARE RECEIVING YOUR INDIVIDUAL TRAINING PLANS, AND YOU MIGHT HAVE NOTICED THAT YOUR TRAINING PLAN AND THE COURSE CATALOG ARE BROKEN UP BY DATA SCIENCE COMPETENCY. WE WANT TO SHARE WITH YOU WHAT THE COMPETENCIES ARE, GIVE YOU BACKGROUND ON THEM, AND GIVE YOU EXAMPLES YOU CAN TAKE AWAY. BEFORE --
>> GOOD AFTERNOON, EVERYONE. -- WHO IS OUR TRAINER TODAY FROM BOOZ ALLEN HAMILTON. WE ARE LIVE STREAMING THESE SESSIONS. PLEASE SILENCE YOUR CELL PHONES, AND IF YOU HAVE TO LEAVE, DON'T LET THE DOOR SLAM, SO WE KEEP OUT AS MUCH BACKGROUND NOISE AS POSSIBLE. WITHOUT FURTHER WAITING, HERE IS DR. PATTY BRENNAN.
>> AUDIENCE: [APPLAUSE].
>> DR. PATRICIA FLATLEY BRENNAN: IT IS NICE OF YOU TO APPLAUD WHEN YOU ARE HERE TO WORK. THAT IS GREAT. BEFORE YOU GET TOO COMFORTABLE, I WANT EVERYONE TO PLEASE STAND UP. ALL RIGHT. THIS IS THE FIRST DAY OF THE REST OF YOUR LIFE.
>> AUDIENCE: [LAUGHING].
>> DR. PATRICIA FLATLEY BRENNAN: WE ARE GETTING READY. JOIN ME IN A CHEER. DATA, DATA, DATA, DATA. THANK YOU. ALL RIGHT. YOU READY? WE HAVE A LOT GOING ON. I AM SO EXCITED THAT WE ARE HERE. WE MADE IT TO THIS POINT! THIS HAS TAKEN A LOT OF PLANNING. THANK YOU TO THE PLANNING TEAM AND TO THE PEOPLE WHO WERE THINKING ABOUT THIS OVER A YEAR AGO. LOOK HOW FAR WE HAVE COME. WE ARE ON A JOURNEY TOGETHER. WE WILL BE TRANSFORMING TOGETHER. THIS, TO ME, IS ABOUT THE MOST IMPORTANT TRANS-NLM ACTIVITY THAT COULD BE OCCURRING. IT HAS INVOLVED EVERYONE: OVER 750 PEOPLE TOOK THE FIRST TEST, AND SEVERAL OF US TOOK THE SECOND AND LEARNED A LOT MORE ABOUT OURSELVES THAN WE WANTED TO KNOW. THE STRATEGIC PLAN FOR NIH CALLED DATA SCIENCE AN INTERDISCIPLINARY FIELD. EACH OF US BRINGS DIFFERENT SKILLS AND IS READY IN A DIFFERENT WAY. I HOPE YOU FOUND IN YOUR PROFILE STRENGTHS THAT YOU ALREADY HAVE AND SOME OPPORTUNITIES TO GROW.
MAYBE YOU ARE EXCELLENT AT STATISTICAL MODELING BUT NOT STRONG AT PROGRAMMING, OR YOU DIDN'T RECOGNIZE THE PYTHON STREAM AT ALL, LIKE I DIDN'T, OR YOU HAVE GOOD DATABASE MANAGEMENT SKILLS BUT WANT TO EXPAND INTO VISUALIZATION. WE ARE HERE TO LEARN AND TO SUPPORT YOUR LEARNING. MOST IMPORTANTLY, RECOGNIZE THAT DATA SCIENCE IS NOT A SKILL OF THE SPECIALIST. IT IS A TEAM SPORT THAT REQUIRES ALL OF US TO PARTICIPATE. THIS REQUIRES US TO TAKE THE SKILLS WE ALREADY HAVE AND MOVE THEM INTO NEW APPLICATIONS, TO LOOK AROUND FOR NEW OPPORTUNITIES FOR THOSE SKILLS, AND MAYBE, FOR THE SUPERVISORS IN THE ROOM, TO THINK ABOUT HOW WE REDESIGN JOBS AND OPEN NEW OPPORTUNITIES TO PEOPLE WHO WANT TO CONTINUE TO GROW. IN YOUR NEW POSITION AN HOUR AND A HALF FROM NOW, THERE WILL BE NEW ACTIVITIES YOU WILL BE TAKING ON. ONE IS A NEW AWARENESS OF HOW DATA DRIVES OUR LIVES AND OF WAYS YOU MAY BEGIN, EVEN TODAY, TO EMPLOY THOSE SKILLS. THE SECOND IS TO RECOGNIZE THAT WE WILL HAVE NEW OPPORTUNITIES TO KNOW ABOUT THE LIBRARY IN NEW WAYS.
I WANT TO CAUTION YOU A BIT TO REMEMBER THAT ISSUES OF UNDERSTANDING HAVE TO BE COUPLED WITH ISSUES OF CONFIDENTIALITY. YOU MAY LEARN ABOUT YOUR NEIGHBOR'S SKILL SET OR YOUR OWN, OR LEARN A PATTERN IN HOW THE LIBRARY IS SPENDING MONEY OR HOW WE ARE ACCOUNTING FOR ACTIVITIES HERE, THAT SUDDENLY DISCLOSES TO YOU SOMETHING THAT HADN'T BEEN KNOWN BEFORE. BE JUDICIOUS IN SHARING INFORMATION, AND REMEMBER THAT FOR US AS PROFESSIONALS, DATA GIVES INSIGHT INTO THINGS THAT ARE IMPORTANT AND INSIGHT INTO THINGS THAT MIGHT ACTUALLY BE BETTER SHARED WITH ONE'S SUPERVISOR OR DISCLOSED ONLY UNDER CERTAIN CIRCUMSTANCES. THAT APPLIES TO INDIVIDUALS' LEARNING GOALS AND ALSO TO INFORMATION ABOUT THE LIBRARY. RECOGNIZE THAT WE WANT YOU TO GROW, AND THAT MEANS YOU WILL LEARN THINGS YOU HAVEN'T EXPERIENCED IN YOUR WORK HERE AT THE LIBRARY BEFORE. AS YOU ENCOUNTER NEW IDEAS, THE EXPECTATION OF US AS PROFESSIONALS IS TO KEEP LIBRARY BUSINESS IN THE LIBRARY.
THE THIRD GOAL OF OUR STRATEGIC PLAN IS TO BUILD A WORKFORCE FOR DATA-DRIVEN RESEARCH AND HEALTH. THE SKILLS THAT YOU GET IN THE TRAINING IN THE NEXT HOUR AND A HALF HERE, THROUGH YOUR INDIVIDUAL TRAINING PLANS, AND, FOR THOSE WHO WILL LAUNCH INTO OUR INTENSIVE COURSE THIS SUMMER, THROUGH THAT COURSE, WILL HELP YOU IN YOUR EVERYDAY LIVES AND IN YOUR PERSONAL LIFE, MAKING THE NEWSPAPER MORE INTERESTING, HELPING YOU QUESTION MORE OF WHAT YOU HEAR ON THE NEWS PERHAPS, AND GIVING YOU NEW WAYS TO THINK ABOUT AND FRAME THE WORLD. THE INDIVIDUAL DEVELOPMENT PLAN IS SOMETHING YOU WILL REVIEW WITH YOUR SUPERVISOR. IT IS NOT AN ASSIGNMENT TO DO EVERYTHING ON THE LIST BUT A LIST OF OPPORTUNITIES. SPEND THE TIME YOU NEED TO STRETCH AND GROW IN A NEW AREA, AND BY THE END OF THE SUMMER EACH PERSON WILL HAVE A NEW PROFILE TO BRING FORWARD. YOU ARE NOT ALONE IN THIS. I HAVE MY OWN TRAINING PLAN. I LOOKED AT MY SCORES AFTER THE FIRST TWO TESTS. I WAS EXCITED THAT I GOT A 10 ON THE FIRST TEST. I WON'T ASK WHAT YOU GOT, BUT I SUSPECT SOME DID BETTER THAN I DID. I GOT A 1 ON THE SECOND TEST.
>> AUDIENCE: [LAUGHING].
>> DR. PATRICIA FLATLEY BRENNAN: REALLY? HOW MANY PEOPLE GOT MORE THAN A 1 ON THE SECOND TEST? OKAY. YOU WIN. CONGRATULATIONS. IT IS A GOOD THING I'M THE DIRECTOR, NOT A PYTHON PROGRAMMER. YOU HAVE TO REMEMBER THAT. I'M GLAD YOU ARE OUT THERE. YOU NEED TO KNOW THINGS I WON'T EVER KNOW. I NEED TO LEARN TABLEAU. I WON'T LEARN PYTHON; CODE AND SCRIPTING WILL BE BEYOND MY SKILL SET FOREVER. BUT I KNOW THESE PLANS WILL FACILITATE YOUR GROWTH, OUR GROWTH, AND THE GROWTH OF THE LIBRARY. WE RECEIVED WONDERFUL NEWS YESTERDAY, FOR THOSE WATCHING THE PAPER CAREFULLY. THE HOUSE PROPOSED A $2 BILLION INCREASE TO THE NIH. FOR THE NATIONAL LIBRARY OF MEDICINE THIS TRANSLATES TO A TARGETED $64.3 MILLION BUDGET, THE HIGHEST WE HAVE HAD. WE WILL HAVE A WORKFORCE TRAINED AND READY FOR DATA SCIENCE. I'M DELIGHTED TO HAVE COLLEAGUES HERE FROM BOOZ ALLEN HAMILTON. I WILL TAKE MY NOTES HERE, AND WE WILL ALL BE MORE EDUCATED BY THE END OF THIS.
>> AUDIENCE: [APPLAUSE].
>> THANK YOU.
MANY OF YOU MAY HAVE SEEN ME AROUND OR KNOW ME BY NOW. I'M KATE HELFET, AND I HAVE BEEN WITH BOOZ ALLEN HAMILTON FOR A WHILE. -- SO JUST BUILDING OFF THE CONVERSATION, HOW MANY OF YOU HAVE RECEIVED YOUR INDIVIDUAL TRAINING PLANS AND HAD THOSE CONVERSATIONS? AMAZING! THAT IS GREAT. YOU ARE WITH US ON THIS JOURNEY. YOU MAY HAVE ALSO ATTENDED THE KICKOFF SESSION, AND IF YOU HAVE YOUR TRAINING PLANS, YOU HAVE TAKEN THE DATA SCIENCE READINESS SURVEY. WE ARE CONTINUING ON THE JOURNEY TODAY WITH THE BASICS COURSE AND WILL CONTINUE THROUGH THE SUMMER WITH THE DATA SCIENCE OPEN HOUSE THAT WE ARE ALL LOOKING FORWARD TO. DON'T FORGET TO COLLECT YOUR BADGES ALONG THE WAY. WE HAVE THEM HERE, AND I HAVE SEEN THEM ON OFFICE DOORS AND OTHER PLACES. PHILIP HAS THEM DOWN HERE; HE IS RAISING HIS HAND. MAKE SURE TO GET ONE IF YOU HAVEN'T.
IN THINKING ABOUT THE DATA SCIENCE BASICS COURSE, WE WANTED TO GIVE EVERYONE A COMPREHENSIVE UNDERSTANDING OF THE 10 COMPETENCIES OF DATA SCIENCE. WE WENT BACK TO THE DATA SCIENCE COMPETENCY FRAMEWORK. HOW DID WE ARRIVE AT THE 10 DATA SCIENCE COMPETENCIES? THERE HAVE BEEN SOME QUESTIONS ABOUT THIS. FOUR YEARS AGO MY LEADERSHIP TEAM WAS ASKING HOW WE QUANTIFY WHAT MAKES A GOOD DATA SCIENTIST, AND TALKED ABOUT THINGS LIKE: IS THERE A CREDIT SCORE, WITH YOUR EXPERIENCE, EDUCATION, AND ALL YOUR APPLIED SKILLS FOR DATA SCIENCE GOING INTO IT? HOW DO WE GET THERE? LUCKILY SOMEONE ASKED A HUMAN CAPITAL EXPERT, WHO SAID YOU CAN CONDUCT A JOB TASK ANALYSIS AND UNDERSTAND WHAT SKILLS ARE REQUIRED FOR SUCCESSFUL DATA SCIENCE. WE REACHED OUT TO OUR 500-PERSON DATA SCIENCE TEAM, CONDUCTED A SURVEY, AND CLUSTERED THE TASKS INTO 10 TECHNICAL COMPETENCY AREAS, PLUS ADDITIONAL PERSONALITY TRAITS WE ARE NOT TALKING ABOUT TODAY. PART OF THE ANALYSIS WENT THROUGH SYNTHETIC VALIDITY -- JUST TO BE SURE THERE WAS REPRODUCIBILITY, AND IT VALIDATED THE CONTENT. I'M EXCITED TO TURN IT OVER TO AARON SANT-MILLER, WHOSE WORK ACROSS SO MANY DOMAINS HAS GIVEN HIM TECHNICAL EXPERTISE IN THE COMPETENCY AREAS. I'M EXCITED HE IS HERE TODAY. AARON, PLEASE JOIN US FOR THIS SESSION.
>> AUDIENCE: [APPLAUSE].
>> AARON SANT-MILLER: HEY, HOW ARE YOU GUYS DOING? YOU CAN HEAR ME ALL RIGHT? OKAY. SO YOU GUYS ARE ALL HERE, AND LOTS OF YOU WERE PROBABLY HERE IN JANUARY WHEN MY COLLEAGUE CATHERINE ORDAN INTRODUCED THE COURSE WITH A KICKOFF. THIS IS THE SECOND SESSION. TODAY WE WILL TALK ABOUT COMPETENCIES. AS KATE ALLUDED TO, WE HAVE 10 OF THEM. WE WILL WALK THROUGH THE 10, SHOW HOW THEY FIT TOGETHER, AND GIVE A DEFINITION WITH EXAMPLES FOR EACH. HOPEFULLY IT WILL BE RELATABLE. I RECOGNIZE EVERYONE COMES IN WITH A DIFFERENT BACKGROUND: SOME FOLKS ARE REALLY TECHNICAL, WRITING PYTHON CODE, BUILDING MODELS, AND WORKING WITH DATABASES, AND OTHER FOLKS ARE JUST STARTING TO SCRATCH THE SURFACE OF THIS STUFF. WE WILL WALK THE LINE BETWEEN THE TWO, SO YOU CAN FIND SOMETHING RELATABLE IN THE EXAMPLE OR THE DEFINITION, AND IT WILL STIMULATE THOUGHT AROUND DIFFERENT THINGS.
THIS IS ME. I HAVE BEEN AT BOOZ ALLEN FOR ALMOST FOUR YEARS NOW, MOSTLY DOING WORK IN MACHINE LEARNING AND DATA SCIENCE, AND I HAVE WORKED IN LOTS OF DOMAINS. FOR A WHILE I DID INFORMATION RETRIEVAL WITH NATURAL LANGUAGE PROCESSING, AND I WORKED AT THE IRS DOING FRAUD DETECTION. RIGHT NOW I'M FOCUSED ON THREE AREAS: CYBER SECURITY WITH MACHINE LEARNING USING GPUS, IN PARTNERSHIP WITH NVIDIA; LARGE-SCALE AI COMPUTING; AND LOTS OF WORK IN GEOPHYSICS, CLIMATE SCIENCE, AND SEISMIC ANALYSIS. ON A PERSONAL NOTE, MY PARENTS ARE BOTH IMMUNOLOGISTS AND MET AT NIH 30 YEARS AGO. THANKS TO NIH, I'M HERE. THANKS, GUYS.
>> AUDIENCE: [LAUGHING].
>> AARON SANT-MILLER: THERE IS A SENTIMENTAL VALUE TO BEING HERE. THANK YOU. I WAS DOING A LOT OF RESEARCH IN SOCIOLOGY. I LOVED THE FIELD AND STILL LOVE THE FIELD, AND I THINK IT IS FASCINATING HOW SOCIAL GROUPS WORK TOGETHER AND HOW PEOPLE INTERACT, AND HOW SOCIAL PSYCHOLOGY EXPLAINS WHY PEOPLE DO THINGS. I GOT MORE INTO QUANTIFYING THINGS, INTO SOCIAL NETWORK ANALYSIS AND SOCIAL INTEGRATE THEORY. THE QUANTITATIVE SIDE DROVE ME TO APPLIED MATH AND STATISTICS.
IF YOU HAD TOLD ME WHEN I WAS STARTING MY FORMAL EDUCATION THAT I WOULD BE AS TECHNICAL AS I AM, OR EVEN FOUR YEARS AGO WHEN I WAS STARTING AT BOOZ ALLEN THAT I WOULD BE THIS TECHNICAL, I WOULD NOT HAVE BELIEVED IT. IT IS A BY-CHANCE JOURNEY: YOU START AT ONE PLACE AND ACCUMULATE KNOWLEDGE OVER TIME. IT IS A HUGE FIELD, AND YOU WILL FIND A NICHE THAT FASCINATES YOU.
TALKING ABOUT WHAT DATA SCIENCE IS: IT IS THE INTERDISCIPLINARY FIELD OF -- DEVELOPED AND USED TO EXTRACT KNOWLEDGE AND INSIGHTS FROM COMPLEX DATA SETS. THAT IS FROM THE NIH STRATEGIC PLAN. A TIDBIT FOR YOU: NIH IS ONE OF THE FEW ORGANIZATIONS WITH A PLAN FOR DATA SCIENCE. THAT SAYS A LOT ABOUT WHERE THE ORGANIZATION IS GOING AND HOW FORWARD-LEANING IT IS IN SETTING GOOD POLICIES AND TRENDS FOR OTHERS TO FOLLOW. IF I WERE TO SAY ONE THING, DATA SCIENCE IS A REALLY BROAD THING THAT NEEDS BREAKING DOWN INTO SUBCOMPONENTS. LOTS OF PEOPLE HAVE DIFFERENT DEFINITIONS FOR WHAT IT IS AND HOW IT LOOKS. WE WILL SEE SOME OF THE NUANCE AS WE TALK THROUGH IT, AND THERE WILL BE AREAS FOR FURTHER DISCUSSION.
PEOPLE OFTEN BREAK DATA SCIENCE DOWN INTO THREE PILLARS, WHICH I BELIEVE WERE HIGHLIGHTED IN JANUARY WHEN CATHERINE PRESENTED: MATH AND STATISTICS, COMPUTER SCIENCE, AND SUBJECT MATTER EXPERTISE. THE AREA COLORED BLACK THERE IS WHERE DATA SCIENCE ALL COMES TOGETHER. YOUR SUBJECT MATTER EXPERTISE DEFINES THE PROBLEM, WHAT IS NEEDED, AND HOW THE SOLUTION WILL COME TOGETHER. MATH AND STATISTICS DEFINES THE METHOD, HOW YOU WILL PUT ANALYTICS AROUND THE DATA YOU HAVE TO DEVELOP AN IMPACTFUL SOLUTION. COMPUTER SCIENCE HELPS YOU TURN IT INTO SOMETHING REAL. IT IS WHERE YOU BUILD CODE AND INTERACT WITH DATA, AND HOW YOU DEPLOY SOFTWARE THAT WILL DO THIS AT A LARGER SCALE. AS I THINK DR. BRENNAN ALLUDED TO, DATA SCIENCE IS A TEAM SPORT. ON EVERY PROJECT I WORK ON, I'M NOT THE SUBJECT MATTER EXPERT. I HOPE TO CONTRIBUTE IN THE MATH AND STATISTICS REALM, AND I AM DANGEROUS IN -- NOT WHEN WORKING IN SEISMIC SIGNAL ANALYSIS. PEOPLE HAVE PHDS IN SEISMOLOGY AND I DON'T. THEY ARE THE PEOPLE WHO STEP IN AND ARE THE EXPERTS.
WHEN YOU THINK OF COVERING THESE THREE PILLARS, AND COVERING THE COMPETENCIES WE WILL WALK THROUGH TODAY, I WILL EMPHASIZE THAT THE LARGER TEAM OR ORGANIZATION WILL COVER ALL OF THEM, AND ONE INDIVIDUAL RARELY IS THE EXPERT IN ALL OF THEM. SO A LOT OF DATA SCIENCE SUCCESS COMES FROM THE COLLABORATIVE ELEMENT OF IT. I ENJOY THAT, AND I HAVE ALWAYS ENJOYED TEAM-BASED WORK, WHICH I THINK IS REALLY CORE TO DATA SCIENCE. I WILL ALSO SAY THAT DATA SCIENCE AS A FIELD IS EVOLVING. FIVE OR 10 YEARS AGO DATA SCIENCE WAS PREDOMINANTLY A FIELD OWNED AND RUN BY PHDS; YOU HAD TO BE A SCIENTIST TO BE A DATA SCIENTIST. NOW SOFTWARE AND TECHNOLOGY MAKE DOING THIS WORK EASIER, AND MORE RESEARCH IS GOING INTO THIS THAT CAN BE A FOUNDATION FOR BUILDING NEW METHODS AND APPLIED RESEARCH. YOU ARE SEEING IT BECOME MORE ACCESSIBLE. WHAT DATA SCIENCE IS, IS CHANGING EVERY DAY AND CONTINUING TO EVOLVE. WE ARE CATCHING IT AT AN INFLECTION POINT BETWEEN WHERE IT IS AND WHERE IT IS GOING, AND YOU WILL SEE IT CONTINUE TO EVOLVE IN THE NEXT FEW YEARS. EVERYONE BRINGS A DIFFERENT FLAVOR TO IT, WHETHER THEY ARE SCIENTISTS OR PEOPLE WHO MOSTLY DO ANALYTICS OR EXCEL. PEOPLE ARE BRINGING DIFFERENT PERSPECTIVES AND BACKGROUNDS, WHICH IS WHY IT IS A REALLY INTERESTING FIELD.
OKAY. SO THESE ARE OUR COMPETENCIES. WE WILL WALK THROUGH THEM ALL INDIVIDUALLY, BUT HERE WE PUT THEM ALL ON ONE SLIDE. DON'T KILL YOURSELF TRYING TO ANALYZE IT ON THE ONE GRAPHIC. I THINK THE HANDOUT SHOULD HELP. YOU WILL SEE HOW THEY FIT TOGETHER. THERE IS A PIECE HERE WE WILL CALL METHODOLOGICAL DESIGN THAT GOES INTO ALL THESE APPLIED TECHNIQUES, OF WHICH THERE ARE FIVE, AND THEN THE FOUNDATIONAL SKILLS. IT GOES THROUGH THE HORIZONTAL. YOU HAVE PIECES GOING UP THROUGH THE HORIZONTAL, AND THE OVERARCHING CIRCLES SHOW HOW THEY INTEGRATE TOGETHER. I WILL GO BACK AND WALK THROUGH IT. TO TAKE A STEP BACK AGAIN, YOU HAVE THE THREE PILLARS, WHICH WE BROKE DOWN INTO THE COMPETENCIES THAT ARE LISTED OUT HERE AND ARE EASIER TO FOLLOW. I WILL SAY MAYBE WE SHOULD DEFINE WHAT A COMPETENCY IS.
WE WILL CALL THOSE, I GUESS, BROAD LABELS THAT CAN BE DECOUPLED AND DECOMPOSED INTO SKILLS AND TECHNIQUES, ANY OF WHICH MAY BE SHARED BY OTHER COMPETENCIES. I WILL EMPHASIZE THAT THE COMPETENCIES ARE A CONTINUUM. THERE ARE SKILLS AND TECHNIQUES IN STATISTICAL MODELING THAT ARE ALSO PART OF RESEARCH. THESE ARE LAYERS OF SKILLS GROUPED INTO FAMILIES, BUT THEY ARE NOT MUTUALLY EXCLUSIVE CATEGORIES. WE HAVE GROUPED THE 10 COMPETENCIES INTO THREE HIGHER ONTOLOGICAL GROUPS. YOU HAVE THE METHODOLOGICAL PRACTICE, WHICH PEOPLE ARE FAMILIAR WITH AS RESEARCH DESIGN. FROM THERE YOU HAVE A CORE OF FOUR FOUNDATIONAL SKILLS: PROGRAMMING AND SCRIPTING, COMPUTER SCIENCE, ADVANCED MATH, AND DATABASE SCIENCE. THOSE ARE THE CORE FOUNDATIONS YOU NEED TO BUILD AN ANALYTIC STUDY OR DATA SCIENCE METHOD. THEN THERE ARE FIVE APPLIED TECHNIQUES: DATA MINING AND INTEGRATION, STATISTICAL MODELING, MACHINE LEARNING, OPERATIONS RESEARCH, AND DATA VISUALIZATION. THOSE ARE MORE APPLIED TECHNICAL SKILLS THAT LEAN ON THE FOUNDATIONAL SKILLS TO BUILD OUT THE ACTUAL SOLUTION. THIS IS HOW WE ARE BREAKING IT DOWN, AND WE WILL WALK THROUGH EACH CATEGORY AND EACH COMPETENCY ONE AT A TIME.
TO START, WE WILL GO THROUGH THE METHODOLOGICAL PRACTICE, WHICH IS RESEARCH DESIGN. EVERYONE IS FAMILIAR WITH IT. IT REALLY GETS ITS FOUNDATION FROM THE SCIENTIFIC METHOD, WHICH I'M SURE LOTS OF FOLKS ARE FAMILIAR WITH, AND SOME IN THIS ROOM OR ON THE PHONE RIGHT NOW MAY BE ABSOLUTE EXPERTS IN IT. WHAT IS IT? WHEN APPLIED TO DATA SCIENCE, IT IS A SYSTEMATIC PLAN FOR A SCIENTIFIC STUDY WITH DATA. YOU PROBABLY LEARNED ABOUT THIS A LITTLE BIT IN THE KICKOFF MEETING, AND WE WILL GO TO A GRAPHIC IN A LITTLE BIT THAT WILL BE FAMILIAR. HOW RESEARCH DESIGN IS DEFINED AND APPLIED IN DATA SCIENCE IS VERY BROAD. EVERY FIELD DEFINES AND APPLIES RESEARCH DESIGN DIFFERENTLY, AND DATA SCIENCE DRAWS ON ALL OF THOSE TOGETHER. IT IS AN IMPORTANT ELEMENT OF ANYTHING YOU DO IN DATA SCIENCE.
I WILL EMPHASIZE THE SCIENTIFIC STUDY PIECE. THERE IS A PHRASE FOLKS THROW AROUND SOMETIMES THAT I LIKE: DATA, IF TORTURED ENOUGH, WILL CONFESS TO ANYTHING. IF YOU GO IN LOOKING FOR A SPECIFIC ANSWER, YOU CAN FIND A WAY TO GET IT REPRESENTED IN THE DATA MOST TIMES. YOU HAVE TO HAVE A GOOD SCIENTIFIC BACKBONE IN HOW YOU ANALYZE DATA, AND LET THE DATA GIVE YOU ANSWERS RATHER THAN TRY TO FIND YOUR OWN ANSWER IN THE DATA. IN THAT, RESEARCH DESIGN SHOULD ACCOUNT FOR ANY BIAS THAT MAY COME IN. IT LOOPS BACK TO THE SAME POINT: THE DESIGN SHOULD BE WELL FOUNDED, TO MAKE SURE THE STUDY AND FINDINGS MAINTAIN INTEGRITY AS YOU GO THROUGH THE PROCESS AND DELIVER CONCLUSIONS. DATA SCIENCE IS FLEXIBLE. YOU LET THE DATA GUIDE YOU, BUT SOMETIMES YOU GET SOME TYPE OF FINDING AND JUMP BACK AND REANALYZE, AND THEN JUMP BACK AND REANALYZE AGAIN, SO THE OVERALL APPROACH NEEDS TO STAY FLEXIBLE.
THIS IS THE GRAPHIC YOU MAY RECOGNIZE FROM JANUARY. WALKING THROUGH IT LINEARLY, YOU START WITH A QUESTION, GO GET DATA TO ANSWER THE QUESTION, PREPARE IT, WHICH IS AN ARDUOUS TASK TO GET IT CLEANED AND STRUCTURED THE RIGHT WAY, ANALYZE IT TO GET THE ANALYTICS YOU WANT, AND GET AN ANSWER FROM THE DATA. SOMETIMES THE ANSWER IS WHAT YOU EXPECT, AND YOU CAN TAKE THE ANSWER FORWARD, BUILD A METHODOLOGY AROUND IT, AND EXECUTE OFF THAT. OTHER TIMES YOU SAY, THAT IS NOT WHAT I EXPECTED. THE DATA SAYS THE RELATIONSHIP YOU THOUGHT WOULD BE THERE IS NOT THERE, AND YOU HAVE TO GO BACK AND LOOK FOR A NEW FOUNDATIONAL -- IT IS AN ITERATIVE PROCESS THAT ENDS UP IN AN EXECUTION PHASE, WHERE YOU BUILD OUT A LARGER APPLIED METHODOLOGY TO COMMUNICATE RESULTS FROM.
HERE IS ONE HIGH-LEVEL EXAMPLE. LET'S SAY YOU ARE HAVING AN ARGUMENT WITH A FRIEND ABOUT WHO IS THE MORE PROLIFIC MUSICIAN, ELVIS OR JOHNNY CASH. WHO SAYS ELVIS? WHO SAYS JOHNNY CASH? MORE SAY JOHNNY. INTERESTING. HOW DO YOU GO ABOUT SETTLING THIS ARGUMENT? YOU PICK METRICS TO DECIDE WHO IS BETTER AND MORE PROLIFIC: HOW MANY ALBUMS DID THEY RELEASE, HOW MUCH MONEY DID THEY MAKE IN THEIR CAREERS, WHO IS MORE LISTENED TO TODAY AND AT THEIR PEAK. YOU WILL GET THESE, AND THAT IS HOW YOU WILL ANALYZE THIS DISCUSSION. YOU MIGHT GO OUT TO GOOGLE, RIGHT? YOU WILL GO TO GOOGLE, GET DATA, AND FIND NUMBERS. SOME SOURCES WILL BE BETTER THAN OTHERS. SOME WILL BE RIGOROUS AND FOUNDATIONAL SOURCES YOU CAN TRUST, AND OTHERS WILL BE SOMEBODY'S FACEBOOK POST THAT YOU PROBABLY CAN'T BASE LOTS OF CONCLUSIONS ON. YOU ARE DETERMINING MEASUREMENTS, COLLECTING DATA, ASKING QUESTIONS OF THE DATA, AND COMING TO CONCLUSIONS BASED ON THE FRAMEWORK YOU PUT IN PLACE. THIS IS HOW YOU PUT RESEARCH DESIGN INTO PRACTICE.
RESEARCH DESIGN, AS I TALKED ABOUT, IS THE METHODOLOGICAL PIECE THAT DEFINES EVERYTHING WE WILL DO IN DATA SCIENCE ITSELF. THE NEXT PIECE IS THESE FOUR FOUNDATIONAL SKILLS THAT YOU HAVE THERE. DOWN AT THE BOTTOM LEFT IS PROGRAMMING AND SCRIPTING, THEN COMPUTER SCIENCE, ADVANCED MATH, AND DATABASE SCIENCE. YOU CAN SEE THEY ARE ALL AT THE PERIPHERY. THESE ARE NOT SO MUCH THE SPECIFICS OF HOW YOU APPLY A METHODOLOGY AS THE BROADER THEORETICAL FIELDS THAT ARE FOUNDATIONAL FOR THE APPLIED TECHNIQUES WE WILL GO TO IN A SECOND. WE CAN'T CAPTURE ALL OF WHAT IS IN THESE FIELDS. THEY ARE HUGE FIELDS. ADVANCED MATH IS A MASSIVE FIELD OF STUDY, AND WE WON'T GIVE YOU A CRASH COURSE ON ADVANCED MATH HERE. I AM HOPING WE CAN SHOW YOU HOW THESE LARGE FIELDS ARE APPLIED IN DATA SCIENCE AND GET YOU INTERESTED IN THEM, OR IN SOME OF THEM THAT YOU WANT TO RESEARCH A LITTLE BIT MORE.
WE WILL START WITH PROGRAMMING AND SCRIPTING. WHAT IS PROGRAMMING AND SCRIPTING? VERY BROADLY, IT IS THE PROCESS OF PUTTING COMPUTER CODE TOGETHER INTO AN OPERATIONAL PROCESS. SO, CODING SOMETHING OUT. YOU CAN BREAK PROGRAMMING AND SCRIPTING INTO THREE DIFFERENT PIECES: CODE, SCRIPT, AND PROGRAMMING. CODE IS A LOGICAL DIRECTION YOU GIVE A COMPUTER TO DO SOMETHING, IN THIS CASE WITH DATA. RUN THIS TEST. ANALYZE THIS ELEMENT.
MERGE THESE TWO DATA SOURCES, AND WHAT HAVE YOU. A SCRIPT PUTS LOTS OF LOGICAL DIRECTIONS TOGETHER INTO AN OVERALL PROCESS. MERGE THE TWO DATA SOURCES, RUN AN ANALYTIC, RETURN THE RESULTS, AND VISUALIZE THIS PLOT: ALL OF THAT TOGETHER MAY BE ONE SCRIPT. PROGRAMMING IS THE PRACTICE OF DOING IT WELL. THERE ARE GOOD AND BAD SCRIPTS. GOOD SCRIPTS ARE EFFICIENT AND DO WHAT YOU NEED. BAD SCRIPTS BREAK, DON'T DO WHAT YOU WANT, AND ARE INEFFICIENT AND SLOW AND ALL THAT. THIS IS ALL THE DIGITAL LANGUAGE YOU USE TO TALK TO THE COMPUTER TO GET THE COMPUTER TO DO WHAT YOU WANT.
THIS IS A NICE LITTLE EXAMPLE WE HAVE. LET'S SAY THAT YOU WANT TO OPEN MICROSOFT WORD AND START TYPING OUT A PAPER, OR YOU WANT TO READ A PAPER SOMEBODY SENT YOU. WHEN YOU CLICK THAT MICROSOFT WORD ICON, THAT ACTUALLY RELATES TO A SCRIPT. THAT SCRIPT WILL OPEN MICROSOFT WORD. IF THE SCRIPT IS WRITTEN WELL, WITH GOOD PROGRAMMING, IT WILL DO WHAT YOU WANT: IT OPENS, YOU HAVE MICROSOFT WORD, AND YOU DO WHAT YOU WERE PLANNING TO DO. IF THE SCRIPT THAT RUNS MICROSOFT WORD IS BAD AND YOU CLICK MICROSOFT WORD, IT MIGHT OPEN GOOGLE CHROME, NAVIGATE TO YOUTUBE, AND BLAST TOM PETTY. NOT A BAD THING, BUT NOT WHAT YOU WANTED, AND IT DISTRACTED FROM THE ACTUAL EFFORT. A GOOD SCRIPT DOES WHAT YOU NEED IT TO, AND A BAD SCRIPT DOES EXTRA STUFF. THE PRACTICE OF WRITING SCRIPTS WELL IS THE PRACTICE OF PROGRAMMING, OR THE FIELD OF PROGRAMMING.
PROGRAMMING AND SCRIPTING IS PROBABLY THE MOST DIRECT OF THE FOUR. THE NEXT THREE ARE A LITTLE MORE ABSTRACT. WE WILL START WITH COMPUTER SCIENCE. YOU CAN REALLY BREAK COMPUTER SCIENCE DOWN INTO THE TWO WORDS THAT MAKE IT UP. IT IS THE SCIENCE OF COMPUTING. YOUR COMPUTER IS REALLY A MACHINE DESIGNED TO PROCESS DATA AND INFORMATION. SO WHEN WE THINK ABOUT WHAT COMPUTER SCIENCE IS, IT IS REALLY UNDERSTANDING HOW THE COMPUTER INTERPRETS THE COMMANDS YOU GIVE IT AND WHAT YOU ASK IT TO DO. IT IS THE STUDY OF HOW COMPUTERS STORE AND PROCESS INFORMATION.
IT IS LESS ABOUT HOW YOU TALK TO THE COMPUTER AND MORE ABOUT HOW THE COMPUTER INTERPRETS WHAT YOU ARE TELLING IT. THAT DEPENDS ON THE HARDWARE OF THE SYSTEM. THERE ARE LOTS OF DIFFERENT OPERATING SYSTEMS, ALL OF WHICH OPERATE DIFFERENTLY, FUNCTION DIFFERENTLY, INTERPRET THINGS DIFFERENTLY, AND HAVE DIFFERENT FOUNDATIONAL ELEMENTS. IT IS UNDERSTANDING HOW THE OPERATING SYSTEMS WORK. PART OF WHAT COMPUTER SCIENCE IS, IS UNDERSTANDING THAT YOU CAN'T ASK A COMPUTER TO DO SOMETHING IT CAN'T DO.
I THINK THIS IS A GOOD EXAMPLE THAT WE HAVE. WHEN WE THINK ABOUT DATA, WE ARE ALSO THINKING ABOUT AN ACTUAL SET OF NUMBERS AND COMPUTATIONS. COMPUTERS HAVE TWO DIFFERENT TYPES OF MEMORY. ONE IS STORAGE. THAT IS ALL OF THE INFORMATION THAT YOUR COMPUTER CAN HOLD ON IT AT ONE TIME; IT MAY NOT NEED TO ACCESS IT AT ALL TIMES. THE OTHER IS RANDOM ACCESS MEMORY, OR RAM. THIS IS YOUR COMPUTER'S THINKING SPACE. YOUR COMPUTER OR PHONE OR WHATEVER HAS A TON OF STORAGE AND A MUCH SMALLER AMOUNT OF RAM, WHICH IS ESSENTIALLY THE MEMORY IT CAN HOLD AT ONE TIME TO PROCESS INFORMATION. IF YOU WRITE A SCRIPT THAT ASKS YOUR COMPUTER TO HOLD MORE DATA IN RAM THAN IT CAN POSSIBLY HOLD, IT WILL CRASH. GOOD COMPUTER SCIENTISTS WILL DESIGN THEIR PROGRAMS AND SYSTEMS TO REDUCE THE AMOUNT OF MEMORY THE COMPUTER NEEDS TO USE AT ANY GIVEN POINT IN TIME, SO THAT IT IS EFFICIENT AND THE COMPUTER DOESN'T CRASH. THE BEST DATA SCIENTISTS -- DISTRIBUTE IT TO MULTIPLE MACHINES AT ONCE TO PROCESS THE SAME LARGE AMOUNT OF DATA, BUT USING MULTIPLE BRAINS TOGETHER. IT IS LITERALLY COLLABORATION. A COMPUTER SCIENTIST CAN OPERATIONALIZE LOTS OF COMPUTERS TO THINK TOGETHER ON SOMETHING. THAT IS WHERE YOU SEE MORE ADVANCED COMPUTER SCIENCE GOING, BUT IT COMES BACK TO THE CONCEPT OF UNDERSTANDING HOW MUCH MEMORY A COMPUTER HAS, WHAT MEMORY FOR A COMPUTER IS, AND WHAT MEMORY FOR A COMPUTER CAN AND CAN'T DO.
NEXT IS ADVANCED MATH. I'M SURE EVERYONE KNOWS WHAT MATH IS. I WON'T EXPLAIN AND DEFINE MATH.
I WILL SAY THAT AS YOU GET TO MORE ADVANCED METHODS, YOU REQUIRE MORE ADVANCED MATH. A REALLY SIMPLE THING THAT PROBABLY EVERYONE DOES SEMI-FREQUENTLY IS CALCULATE THE TIP WHEN YOU GO OUT TO DINNER. RIGHT? YOU ARE USING MATH TO DO THAT. SOMETHING THAT PEOPLE IN THIS ROOM ARE PROBABLY DOING LESS, AND I'M DEFINITELY NOT DOING, IS DESIGNING AN ALGORITHM TO IMAGE A BLACK HOLE, WHICH WAS ALL OVER THE NEWS RECENTLY WHEN KATIE BOUMAN DID THIS. IT IS AWESOME WORK, AND THE FOUNDATIONAL THEORY YOU USE TO CALCULATE THAT TIP IS SIMILAR TO ELEMENTS OF THE FOUNDATIONAL THEORY SHE USED TO IMAGE THAT BLACK HOLE. THERE ARE CORE FOUNDATIONAL ELEMENTS AND THEORIES IN MATH THAT GO ACROSS ALL THINGS, AND ADVANCED ELEMENTS OF MATH USED FOR MORE ADVANCED METHODS.
HOW ADVANCED MATH, AND MATH IN GENERAL, IS USED MOST OFTEN IN DATA SCIENCE IS IN VALIDATING THE METHODS YOU ARE PLANNING ON USING. I RARELY GO TO A PIECE OF PAPER AND WRITE OUT A PROOF FOR A METHOD I WANT TO USE. WHAT I WILL DO, WHEN I HAVE THIS TYPE OF DATA AND WANT TO SOLVE THIS TYPE OF PROBLEM, IS MAKE SURE THAT THE ALGORITHM I WANT TO USE IS APPROPRIATE FOR THAT DATA, AND THAT THE DATA I HAVE MEETS THE FOUNDATIONAL ASSUMPTIONS IT NEEDS TO FOR THAT ALGORITHM TO BE APPLICABLE. LOTS OF THIS IS BASED ON PREVIOUS RESEARCH. YOU DON'T REALLY HAVE TO PROVE THAT THE ALGORITHM WORKS ON THE DATA. YOU CAN FIND WORK OUT THERE WHERE SOMEBODY SAYS THAT IF YOU USE THIS ALGORITHM, YOU NEED TO TEST THESE ASSUMPTIONS; IT IS ALREADY OUT THERE AND BUILT ON RESEARCH. IT IS A THEORETICAL, APPLIED VALIDATION OF THE DIFFERENT METHODS YOU WANT TO USE.
I WILL ALSO SAY THAT SOMETIMES THE WAY YOU APPLY MATH IS TO USE A REALLY SIMPLE CONCEPT BUT DO IT REALLY WELL AND CREATIVELY, AND THAT CAN ADD A LOT OF VALUE. HERE IS A REALLY STRAIGHTFORWARD APPLICATION OF A MATHEMATICAL CONCEPT, AN AVERAGE OR MEAN. INSTEAD OF GETTING THE AVERAGE FOR ALL OF THE DATA, WE DO IT AS A MOVING AVERAGE.
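
A MOVING AVERAGE CAN BE SKETCHED IN A FEW LINES OF PYTHON. THIS IS ONLY AN ILLUSTRATION, NOT CODE FROM THE TALK: THE FUNCTION NAME moving_average, THE WINDOW SIZE OF 3, AND THE LIST OF DAILY PRICES ARE ALL MADE UP FOR THE EXAMPLE.

```python
def moving_average(values, window):
    """Average each run of `window` consecutive values (a sliding window)."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    averages = []
    # Slide the window one position at a time across the raw data.
    for start in range(len(values) - window + 1):
        chunk = values[start:start + window]
        averages.append(sum(chunk) / window)
    return averages


# Hypothetical daily closing prices; real data would come from a file or API.
daily_prices = [10.0, 12.0, 11.0, 15.0, 14.0, 18.0, 17.0]
smoothed = moving_average(daily_prices, window=3)
print(smoothed)  # one smoothed value per window position
```

A LARGER WINDOW SMOOTHS OUT MORE OF THE DAY-TO-DAY NOISE BUT REACTS MORE SLOWLY WHEN THE UNDERLYING TREND REALLY DOES CHANGE.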
THINK OF IT AS A WINDOW MOVING ACROSS THE RAW DATA, WHICH IS IN GRAY. AT EVERY POSITION THE WINDOW TAKES THE AVERAGE THERE AND CREATES A BLUE POINT, THEN MOVES OVER, AVERAGES AGAIN, AND CREATES ANOTHER BLUE POINT, AGAIN AND AGAIN AND AGAIN. IT SLIDES THIS MOVING WINDOW ALONG, TAKING THE AVERAGE AT EVERY POINT IN THE DATA, TO CREATE A SMOOTHED-OUT CURVE. THE CONCEPT OF AN AVERAGE IS NOT TOTALLY CRAZY. THE CONCEPT OF A MOVING AVERAGE IS NOT TOTALLY CRAZY. IT IS REALLY VALUABLE HERE BECAUSE IN THIS CASE WE ARE LOOKING AT STOCK PERFORMANCE, WHICH IS VOLATILE AND VARIABLE DAY TO DAY AND MINUTE TO MINUTE. YOU ARE LESS INTERESTED IN UNDERSTANDING HOW THE STOCK IS JUMPING AROUND ON ANY GIVEN DAY AND MORE INTERESTED IN THE TREND, HOW IT IS MOVING OVER TIME. YOU SMOOTH OUT THE DAY-TO-DAY VOLATILITY TO GET THE LARGER TREND, WHICH IS VALUABLE WHEN DECIDING WHETHER YOU WANT TO BUY A STOCK OR MAKE A FINANCIAL DECISION. I USE MOVING AVERAGES A LOT. WE ARE USING THEM RIGHT NOW IN MY SEISMIC WORK. A CORE ELEMENT OF WHAT WE ARE DOING IS SMOOTHING OUT AND REDUCING THE DATA WE HAVE. AGAIN, IT IS NOT AN OVERLY COMPLICATED CONCEPT, BUT IT WORKS WELL IN PRACTICE AND IS VALUABLE WHEN YOU TRY TO APPLY NEW TECHNIQUES TO NEW STUDIES.
THE LAST FOUNDATIONAL SKILL, I WILL SAY, IS KNOWN AS MANY DIFFERENT THINGS. WE CALL IT DATABASE SCIENCE HERE. YOU MAY KNOW IT AS INFORMATION RETRIEVAL OR DATA STORAGE OR DATA MANAGEMENT. THIS IS THE BACKBONE OF WHAT NLM DOES. I'M SURE THERE ARE PEOPLE IN THE ROOM WHO ARE BETTER EXPERTS IN THIS THAN I AM. IT IS THE INTERESTING PROCESS OF OPTIMIZING DATA STORAGE FOR EFFICIENT MEMORY USAGE, INTUITIVE USAGE, AND SPEED. YOU TAKE DATA AND STORE IT IN A PARTICULAR WAY FOR EASIER ACCESS. THIS IS SOMETHING THAT HAS BEEN STUDIED FOR A REALLY LONG TIME, AND STUDIED AT NLM FOR A LONG TIME. NOW, AS EVERYTHING IS GETTING MORE DIGITAL, YOU WILL FIND THERE ARE NEW TOOLS AND TECHNOLOGIES THAT HELP YOU DO THIS PROCESS A LITTLE BIT MORE EASILY. IT IS ALL MEANT TO INCREASE EFFICIENCY. YOU WANT SOMEBODY TO FIND THE DATA THEY NEED AS FAST AS POSSIBLE, ACCESS IT AS FAST AS POSSIBLE, AND BEGIN THEIR ANALYSIS.
THIS IS EVERYTHING YOU GUYS ARE FAMILIAR WITH. ONE WAY TO THINK ABOUT HOW LOTS OF DATABASE SCIENCE IS DONE IS AS A BREAKDOWN BETWEEN COLD AND WARM STORAGE. COLD STORAGE IS STUFF YOU DON'T NEED ALL THE TIME. YOU KNOW YOU MIGHT NEED IT AT SOME POINT, BUT YOU WON'T NEED IT TODAY OR TOMORROW. THINK OF IT AS AN ATTIC. YOU SHOVE STUFF IN THE ATTIC THAT YOU KNOW YOU DON'T NEED NOW BUT WILL AT SOME POINT. IN CYBER SECURITY WE KEEP ALL OF THE DATA ALL THE TIME, BUT WE MIGHT NOT NEED ALL OF IT AT EVERY MINUTE. WARM STORAGE IS EVERYTHING IN YOUR HOUSE THAT YOU WANT TO ACCESS MORE FREQUENTLY, BECAUSE IT IS CRITICAL TO THE ANALYSIS YOU WANT TO DO DOWN THE LINE.
AN EXAMPLE OF A DATABASE AS WE THINK ABOUT IT IN DATA SCIENCE IS SOMETHING CALLED A RELATIONAL DATABASE. MOST PEOPLE ARE FAMILIAR WITH AN EXCEL SPREADSHEET, AND A RELATIONAL DATABASE IS LIKE LOTS OF EXCEL SPREADSHEETS ON LOTS OF DIFFERENT TYPES OF DATA. IT IS OFTEN ACCESSED USING SQL, WITH WHICH YOU CAN QUERY THOSE TABLES OR SPREADSHEETS TO GET THE DATA YOU ARE INTERESTED IN. WHERE DATABASE SCIENCE COMES IN IS LESS IN STORING DATA AND MORE IN DOING IT INTELLIGENTLY. MAYBE YOU HAVE WEATHER DATA ACROSS THE US AND YOU ARE INTERESTED IN STORING ALL OF IT IN ONE BIG DATABASE. YOU WILL PROBABLY CREATE AN INDIVIDUAL TABLE OR SPREADSHEET FOR EACH CITY YOU ARE INTERESTED IN. YOU MAY HAVE ONE FOR BOSTON, ONE FOR SAN DIEGO, AND ONE FOR SAVANNAH, GEORGIA. WITHIN EACH OF THE TABLES YOU HAVE MEASURES OF THE WEATHER IN THAT CITY: THE PRECIPITATION, TEMPERATURE, HUMIDITY, AND ALL THAT. YOU REALIZE THAT WHEN SOMEBODY WANTS TO ANALYZE WEATHER DATA IN THE US, THEY WON'T GO THROUGH EACH INDIVIDUAL CITY AND ANALYZE IT BUT WILL LOOK AT A HIGHER LEVEL OF ANALYSIS. THEY MAY BE INTERESTED IN ANALYZING WEATHER ON THE EAST COAST. SO YOU MAY HAVE AN INDEXING SCHEME OR A SEPARATE TABLE THAT MAPS THE INDIVIDUAL TABLES TO THE HIGHER CONCEPT, IN THIS CASE THE EAST COAST. THAT MAP FOR THE EAST COAST WOULD MAP TO BOSTON AND SAVANNAH.
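
THE CITY-TO-REGION MAPPING IDEA CAN BE SKETCHED WITH PYTHON'S BUILT-IN sqlite3 MODULE. THIS IS A SIMPLIFIED ILLUSTRATION UNDER STATED ASSUMPTIONS: INSTEAD OF ONE TABLE PER CITY, IT KEEPS ALL READINGS IN ONE TABLE WITH A CITY COLUMN, AND THE TABLE NAMES (readings, city_region), THE COLUMN NAMES, THE HELPER regional_average, AND ALL THE TEMPERATURE VALUES ARE MADE UP FOR THE EXAMPLE.

```python
import sqlite3

# In-memory database so the sketch leaves nothing on disk.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE readings (city TEXT, year INTEGER, avg_temp_f REAL);
CREATE TABLE city_region (city TEXT PRIMARY KEY, region TEXT);
""")

# Hypothetical yearly average temperatures (made-up numbers).
conn.executemany("INSERT INTO readings VALUES (?, ?, ?)", [
    ("Boston", 2018, 51.0), ("Boston", 2019, 52.0),
    ("Savannah", 2018, 67.0), ("Savannah", 2019, 68.0),
    ("San Diego", 2018, 64.0), ("San Diego", 2019, 65.0),
])

# The mapping table ties each city to the higher-level concept.
conn.executemany("INSERT INTO city_region VALUES (?, ?)", [
    ("Boston", "East Coast"),
    ("Savannah", "East Coast"),
    ("San Diego", "West Coast"),
])


def regional_average(connection, region):
    """Average temperature over every city mapped to `region`."""
    row = connection.execute(
        """
        SELECT AVG(r.avg_temp_f)
        FROM readings AS r
        JOIN city_region AS c ON c.city = r.city
        WHERE c.region = ?
        """,
        (region,),
    ).fetchone()
    return row[0]  # None when no city is mapped to the region


east = regional_average(conn, "East Coast")
print(east)
```

THE PERSON ASKING THE QUESTION ONLY NAMES THE REGION; THE JOIN AGAINST THE MAPPING TABLE DOES THE WORK OF COLLECTING THE RIGHT CITIES.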
WHEN A USER SAYS, I WANT TO ANALYZE TEMPERATURE ON THE EAST COAST OVER THE LAST FIVE YEARS, THEY WON'T GO AND COLLECT ALL THE CITIES ON THE EAST COAST AND ANALYZE IT THAT WAY. THEY WILL ASK THE DATABASE, CAN I GET THE DATA FOR ALL CITIES ON THE EAST COAST? IT IS A SMARTER AND MORE EFFICIENT WAY TO ACCESS DATA, DESIGNED AROUND THE QUESTIONS THE DATA WILL BE USED TO ANSWER, AND THAT IS WHERE IT GETS PRETTY NUANCED.
THOSE ARE THE FOUR FOUNDATIONAL SKILLS WE HAVE: AGAIN, PROGRAMMING AND SCRIPTING, COMPUTER SCIENCE, ADVANCED MATH, AND DATABASE SCIENCE. YOU WILL SEE THOSE CORE ELEMENTS PULLED THROUGH THE NEXT FIVE SECTIONS, WHICH ARE MUCH MORE APPLIED. AS WE WALK THROUGH THE APPLIED TECHNIQUES THERE WILL BE MORE EXAMPLES, AND THE EXAMPLES WILL BE LESS THEORETICAL. YOU WILL PROBABLY SEE THE APPLIED TECHNIQUES MORE OFTEN IN YOUR DAY-TO-DAY WORK. I WOULD EMPHASIZE THAT ALL THESE APPLIED TECHNIQUES, THOUGH THEY ARE HUGE FIELDS OF STUDY, DEPEND ON THE FOUNDATIONAL SKILLS. AGAIN, THESE ARE THE FIVE APPLIED TECHNIQUES: DATA MINING AND INTEGRATION, STATISTICAL MODELING, MACHINE LEARNING, OPERATIONS RESEARCH, AND DATA VISUALIZATION. WE WILL WALK THROUGH EACH OF THOSE.
TO START, DATA MINING AND INTEGRATION. I THINK DATA MINING AND INTEGRATION IS MY FAVORITE ONE, EVEN THOUGH MY AREA OF EXPERTISE IS MORE IN MACHINE LEARNING. LOTS OF WHAT YOU DO IN DATA MINING IS THE ACTUAL DATA DISCOVERY. YOU HAVE A LARGE DATA SET WITH TONS OF INFORMATION, AND YOU WANT TO ASK THE DATA TO TELL YOU WHAT THE IMPORTANT RELATIONSHIPS ARE, OR TEST WHETHER A RELATIONSHIP IS IN THAT DATA SET. IT IS AN ITERATIVE PROCESS OF FINDING WHAT IS IN THE DATA. THE DATA TELLS ME WHAT IS IMPORTANT AND WHAT IS THERE, AND LETS ME BE AGNOSTIC OF MY OWN ASSUMPTIONS, NAIVETE, AND BIAS. IT IS EXCITING. YOU NEVER KNOW WHAT YOU WILL FIND. -- TO LARGE DATA SETS TO EXTRACT INSIGHTS. AS I WAS SAYING, YOU CAN DISCOVER HIDDEN PATTERNS IN THE DATA THAT YOU WOULDN'T HAVE SEEN PREVIOUSLY. LOTS OF THIS IS VALIDATING YOUR EXPECTATIONS AS WELL, YOUR ASSUMPTIONS AROUND THE DATA YOU HAVE.
WE WERE TALKING WAY BACK ABOUT RESEARCH DESIGN, WHICH NEEDS TO BE THOROUGH AND HOLD ITS INTEGRITY SO YOU DON'T DRAW OR HOLD CONCLUSIONS THAT ARE BIASED OR UNFAIR. YOU DO THAT IN THE DATA MINING PHASE. LET'S SAY YOU HAVE A TON OF DATA. WE WILL GO BACK TO THE WEATHER EXAMPLE. YOU HAVE TONS OF WEATHER DATA THAT YOU THINK COVERS THE ENTIRE WORLD. RIGHT? SOMEONE TELLS YOU, HERE IS A DATA SET ACROSS ALL OF THE WEATHER [INDISCERNIBLE]. YOU WANT TO ANALYZE THE DATA TO DRAW CONCLUSIONS ABOUT GLOBAL WEATHER PATTERNS. MAKES SENSE. TO DO SO, YOU HAVE TO MAKE SURE THE DATA SET YOU HAVE IS ACTUALLY REPRESENTATIVE OF GLOBAL WEATHER. SO WHAT YOU MIGHT DO IS GROUP ALL OF THE DATA THAT YOU HAVE INTO COUNTRIES OR GEOGRAPHIC AREAS AND ANALYZE WHAT PERCENT OF YOUR DATA ALIGNS TO EACH COUNTRY OR GEOGRAPHIC AREA, TO GET A SENSE OF HOW THE DATA YOU HAVE IS BROKEN DOWN. MAYBE WHEN YOU DO THIS, YOU FIND OUT THAT EVEN THOUGH SOMEBODY GAVE YOU A DATA SET AND SAID HERE IS YOUR GLOBAL DATA, 75% OF THE DATA YOU HAVE IS FROM THE US. ANY CONCLUSION YOU DRAW FROM THAT DATA YOU CAN'T GENERALIZE AND CALL A GLOBAL PATTERN; YOU CAN ONLY SAY IT IS A PREDOMINANTLY US PATTERN. USE THESE TECHNIQUES TO VALIDATE ANY ASSUMPTION YOU HAVE, NOT JUST ABOUT THE PROBLEM ITSELF BUT ABOUT THE DATA YOU HAVE TO ANALYZE.
I WILL ALSO SAY THERE ARE SOME REALLY COOL WAYS THAT, BEYOND VALIDATING AN ASSUMPTION, YOU CAN FIND NEW PATTERNS. ONE COMMON TECHNIQUE YOU SEE IN DATA MINING IS CLUSTERING, WHERE YOU THROW YOUR DATA INTO A VECTOR SPACE, OR ONTO A COUPLE OF AXES LIKE THIS, AND ASK THE ALGORITHM TO FIND GROUPS IN IT. MAYBE I LOOK OUT IN THE ROOM AND THERE IS A CLUSTER OF PEOPLE BACK THERE, A CLUSTER OF PEOPLE UP HERE, AND A VERY LARGE CLUSTER OF PEOPLE IN THE BACK OF THE ROOM, FARTHEST AWAY FROM ME, WHERE THEY CAN GET AWAY WITH NOT PAYING ATTENTION. SORRY FOR CALLING YOU OUT. YOU HAVE CLUSTERS YOU CAN MAP OUT, AND THE ALGORITHM TELLS YOU THERE ARE PEOPLE GROUPED BACK THERE.
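
THE IDEA OF GROUPING POINTS BY HOW CLOSE TOGETHER THEY SIT CAN BE SKETCHED IN PYTHON. THE TALK DOES NOT NAME A SPECIFIC CLUSTERING ALGORITHM; K-MEANS IS USED HERE ONLY AS ONE COMMON CHOICE. THE FUNCTION NAME kmeans, THE SEAT COORDINATES, AND THE STARTING CENTERS ARE ALL MADE UP FOR THE EXAMPLE, AND A REAL PROJECT WOULD MORE LIKELY USE AN ESTABLISHED LIBRARY.

```python
import math


def kmeans(points, centers, iterations=10):
    """Group points around the nearest center, then move each center
    to the mean of its group; repeat a fixed number of times."""
    groups = [[] for _ in centers]
    for _ in range(iterations):
        groups = [[] for _ in centers]
        # Assignment step: each point joins its closest center.
        for point in points:
            nearest = min(
                range(len(centers)),
                key=lambda i: math.dist(point, centers[i]),
            )
            groups[nearest].append(point)
        # Update step: move each center to the average of its group.
        centers = [
            tuple(sum(axis) / len(group) for axis in zip(*group))
            if group else centers[i]
            for i, group in enumerate(groups)
        ]
    return centers, groups


# Hypothetical (row, seat) positions of people in a room.
seats = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
starting_centers = [(0, 0), (10, 10)]  # arbitrary initial guesses

centers, groups = kmeans(seats, starting_centers)
for center, group in zip(centers, groups):
    print(center, group)
```

THE RESULT DEPENDS ON THE NUMBER OF CENTERS AND WHERE THEY START, WHICH IS ONE REASON THE GROUPS AN ALGORITHM FINDS STILL NEED A PERSON TO INTERPRET THEM.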
YOU THROW DATA INTO THE ALGORITHM, WHICH USES, IN THIS CASE, THE SPACE BETWEEN POINTS TO CREATE GROUPINGS. IT IS VERY, VERY VALUABLE. YOU CAN USE IT TO FIND PATTERNS AND VALIDATE CORE ASSUMPTIONS YOU MIGHT NEED LATER IN YOUR ANALYSIS. FOR A WHILE AT BOOZ ALLEN I GOT TO DO ANALYSIS AND WORK WITH PROFESSIONAL SPORTS TEAMS. THAT IS REALLY COOL; I PLAYED SPORTS ALL THROUGH MY LIFE AND OVERSEAS. WE ACTUALLY DID A PROJECT WITH A MAJOR LEAGUE BASEBALL TEAM LOOKING AT PITCHER INJURY DATA. WE WANTED TO UNDERSTAND DIFFERENT RELATIONSHIPS BETWEEN DIFFERENT INJURY PATTERNS: IF SOMEONE STARTS TO GET ONE TYPE OF INJURY, WHETHER WE CAN PREDICT THEM GETTING ANOTHER TYPE OF INJURY, AND STUFF LIKE THAT. WE RAN A CLUSTERING ALGORITHM ON ALL PITCHERS WITH INJURIES, THE NUMBER OF INJURIES THEY HAD, AND THE TIMING OF THE INJURIES THEY HAD. THE MODEL CAME BACK AS A STATISTICALLY SIGNIFICANT MODEL THAT BROKE THE DATA INTO CLUSTERS AND GROUPS. WE LOOKED AT THOSE. I TALKED ABOUT DATA SCIENCE AS A TEAM SPORT; WE HAD A TEAM OF EXPERTS LOOK AT THE DATA AND SAY, HEY, YOU SEEM TO HAVE A CLUSTER OF LOTS OF ELBOW INJURIES HERE. YOU SHOULD LOOK AT WHETHER OR NOT THOSE PITCHERS ARE THROWING MORE OFF-SPEED PITCHES. SPECIFICALLY WHAT WE LOOKED AT WERE CURVE BALLS AND SLIDERS, WHICH POSITION YOUR ARM IN A TORQUE MOTION THAT CAN PUT LOTS OF PRESSURE ON YOUR ELBOW. WE LOOKED AT IT AND FOUND THAT THE CLUSTER OF PITCHERS THAT WERE HAVING LOTS OF ELBOW INJURIES WERE PREDOMINANTLY PITCHERS THROWING OFF-SPEED PITCHES, KNOWN AS OFF-SPEED PITCHERS, WHO CAME IN IN SITUATIONS WHERE THE TEAM WANTED TO THROW OFF-SPEED PITCHES. IT WAS AN INTERESTING FINDING, AND AS A RESULT, IN THE OVERALL STUDY WE REALIZED WE HAD TO STUDY THOSE TYPES OF PITCHERS DIFFERENTLY. IT IS SOMETHING AN EXPERT ON BASEBALL INJURIES MIGHT TELL YOU IF YOU ASKED: YOU HAVE TO STUDY THESE PITCHERS DIFFERENTLY. HERE IT WAS BORNE OUT OF THE DATA. WE USED THE DATA ANALYSIS ITSELF TO GET THAT WORKFLOW, AND IT IS A CORE ASSUMPTION WE MADE IN THE PROCESS, TO SUBSET A GROUP OF PITCHERS SEPARATELY FOR THE ANALYSIS. THAT IS DATA MINING AND INTEGRATION. THE NEXT ONE TO GO INTO IS STATISTICAL MODELING.
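TO MAKE THE IDEA OF GROUPING BY THE SPACE BETWEEN POINTS A LITTLE MORE CONCRETE, HERE IS A ROUGH PYTHON SKETCH USING K-MEANS. K-MEANS IS AN ASSUMPTION ON MY PART: IT IS ONE COMMON DISTANCE-BASED CLUSTERING METHOD, AND THE TALK DOES NOT SAY WHICH ALGORITHM WAS USED ON THE PITCHER DATA. THE SEAT COORDINATES, THE STARTING CENTROIDS, AND THE FUNCTION NAME `kmeans` ARE INVENTED FOR ILLUSTRATION.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Tiny k-means: assign each point to its nearest centroid, then move
    each centroid to the mean of the points assigned to it."""
    groups = [[] for _ in centroids]
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: math.dist(p, centroids[i]),
            )
            groups[nearest].append(p)
        # Keep the old centroid if a group ends up empty.
        centroids = [
            tuple(sum(c) / len(g) for c in zip(*g)) if g else old
            for g, old in zip(groups, centroids)
        ]
    return centroids, groups


# Hypothetical seat positions (row, column): a few people up front,
# a larger group at the back of the room.
seats = [(1, 1), (1, 2), (2, 1), (9, 9), (9, 10), (10, 9), (10, 10)]

centroids, groups = kmeans(seats, centroids=[(0, 0), (10, 10)])
for center, members in zip(centroids, groups):
    print(center, members)
```

THE ALGORITHM IS NEVER TOLD WHICH SEATS ARE "FRONT" OR "BACK"; THE GROUPS COME ONLY FROM THE DISTANCES BETWEEN POINTS. IN PRACTICE THE NUMBER OF CLUSTERS AND THE STARTING CENTROIDS ARE CHOICES THAT NEED CHECKING.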
YOU HAVE HEARD OF THIS. IT IS A REALLY BROAD FIELD AND PEOPLE CALL IT BY OTHER NAMES; I WAS ALLUDING TO THAT BEFORE WITH DATABASE SCIENCE. STATISTICAL MODELING MAY BE KNOWN AS EXPLORATORY MODELING OR DATA ANALYSIS TO SOME. AS WE LOOK AT IT HERE, STATISTICAL MODELING IS THE PROCESS OF COLLECTING AND ORGANIZING DATA ALL TOGETHER TO INTERPRET RELATIONSHIPS IN THE DATA AND EXTRACT USABLE INFORMATION FROM A LARGE DATA SET. YOU WILL SEE IT IS ABOUT RELATIONSHIPS: IS THERE A RELATIONSHIP BETWEEN THE TWO DATA ELEMENTS, AND IF THERE IS, WHAT IS THE STRENGTH OF THAT RELATIONSHIP? YOU CAN FOLLOW THAT ANY TIME YOU HEAR SOMEBODY TALKING ABOUT A RELATIONSHIP IN DATA OR THEY REFER TO A STATISTICALLY SIGNIFICANT PATTERN. THEY ARE OFTEN USING STATISTICAL MODELING. HOW MANY OF YOU HAVE READ AN ARTICLE RECENTLY THAT SAID, HEY, THERE IS A TREND IN X, Y, OR Z? RIGHT? EVERYBODY TALKS ABOUT A TREND IN X, Y, OR Z, AND THESE TWO THINGS ARE CORRELATED. FOR ME, ON SUNDAYS, THERE IS A LARGE CORRELATION BETWEEN WHEN I BUY PIZZA AND WHEN I WATCH NETFLIX. YOU WOULD SEE THAT CORRELATION IN MY SPENDING AND BEHAVIOR PATTERNS. WHERE YOU SEE THOSE TYPES OF THINGS, CORRELATION, TREND, RELATIONSHIP, THAT USES STATISTICAL MODELING, AND IT LETS YOU REDUCE DATA TO A SPECIFIC RELATIONSHIP OR MEASURE. SO THIS IS KIND OF HOW YOU WOULD LOOK AT A STATISTICAL RELATIONSHIP. WE ARE JUST SIMPLIFYING THIS DOWN INTO AN X AND Y AXIS ON TWO PLOTS. IN THE DATA ON THE LEFT THERE IS NO CLEAR RELATIONSHIP. YOU GO UP ON ONE AXIS AND THERE IS NOT A CLEAR PATTERN IN HOW THE DATA BEHAVES ON THE OTHER. THE GRAPH ON THE RIGHT SHOWS A POSITIVE LINEAR RELATIONSHIP: THE FURTHER ALONG THE X AXIS, FROM LEFT TO RIGHT, THE HIGHER THE DATA TENDS TO BE ON THE Y AXIS OF THE PLOT, WHICH CREATES A FIT LINE THERE. IN THIS CASE, WE WOULD SAY THERE IS A POSITIVE LINEAR RELATIONSHIP. AGAIN, LOTS OF WHAT WE ARE DOING HERE IS ANALYZING WHETHER OR NOT THERE IS A RELATIONSHIP AND WHETHER OR NOT IT IS STRONG. A COUPLE OF EXAMPLES: MAYBE YOU ARE LOOKING AT THE INCREASE IN TEMPERATURE AND HOW MUCH PEOPLE ARE BUYING PAPER TOWELS.
YOU MAY ASSUME THAT IF THERE IS WARMER WEATHER OUTSIDE, PEOPLE ARE HAVING MORE BARBECUES AND PEOPLE WILL BUY MORE PAPER TOWELS; PEOPLE WON'T HAVE NAPKINS OUTSIDE, THEY WANT PAPER TOWELS FOR THE BARBECUE. WHEN YOU ANALYZE IT, IT WILL LOOK LIKE THE ONE ON THE LEFT: YOU HAVE LOTS MORE NOISE IN THE DATA AND NOT REALLY A CLEAR RELATIONSHIP. ONE THAT PEOPLE TALK ABOUT IS THAT AS TEMPERATURE INCREASES YOU WILL SEE ICE CREAM SALES INCREASE. THAT COMMON RELATIONSHIP LOOKS LIKE THE ONE ON THE RIGHT: TEMPERATURE GOES UP ON ONE AXIS AND THE VOLUME OF ICE CREAM BOUGHT GOES UP ON THE OTHER. STATISTICAL MODELING ASKS WHETHER A RELATIONSHIP IS PRESENT AND WHETHER YOU CAN QUANTIFY ITS STRENGTH. HOW MUCH MORE ICE CREAM DO WE EXPECT PEOPLE TO BUY? YOU CAN MEASURE AND QUANTIFY IT OUT THAT WAY. THAT IS COOL. RIGHT NOW WE ARE TALKING ABOUT TWO VARIABLES AND LINEAR RELATIONSHIPS, WHICH ARE THE SIMPLEST ONES. LOTS OF RELATIONSHIPS EXIST ACROSS MULTIPLE DIMENSIONS AND HAVE MORE COMPLEX PATTERNS AND NEED MORE COMPLEX METHODS FOR ANALYZING THE DATA. THE STRAIGHTFORWARD ONE IS A LINEAR RELATIONSHIP BETWEEN TWO VARIABLES. THE NEXT ONE IS MACHINE LEARNING. I SAID THAT THIS IS MY AREA OF EXPERTISE AND THE AREA THAT I WORK THE MOST IN. IT IS A HUGE ANCHOR AT NLM. WHEN WE GO THROUGH EXAMPLES WE WILL TALK ABOUT NATURAL LANGUAGE PROCESSING, WHICH IS A BIG PART OF THE WORK THAT NLM DOES FOR INDEXING AND MAPPING DOCUMENTS TOGETHER. WHAT IS MACHINE LEARNING? I WOULD JUST SAY VERY BROADLY IT IS BUILDING A LEARNING MACHINE TO MAKE DECISIONS AND PREDICTIONS OFF HISTORICAL DATA USING STATISTICAL MODELS. I AM DRAWING THAT LINE AGAIN FROM HERE BACK TO STATISTICAL MODELING. YOU WILL USE LOTS OF THE SAME TECHNIQUES AND METHODS IN ANALYZING A RELATIONSHIP IN DATA THAT YOU WOULD USE IN MACHINE LEARNING.
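THE TWO QUESTIONS ASKED ABOVE, IS THERE A RELATIONSHIP AND HOW STRONG IS IT, CAN BE SKETCHED FOR THE TWO-VARIABLE LINEAR CASE WITH A CORRELATION COEFFICIENT AND A LEAST-SQUARES FIT LINE. THIS IS ONLY AN ILLUSTRATION: THE TEMPERATURE AND ICE CREAM NUMBERS BELOW ARE MADE UP, AND THE FUNCTION NAMES `pearson_r` AND `fit_line` ARE MY OWN.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: +1 is a perfect positive linear relationship,
    0 is no linear relationship, -1 is a perfect negative one."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def fit_line(xs, ys):
    """Least-squares fit line y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return slope, mean_y - slope * mean_x


# Hypothetical daily high temperature (F) and ice cream units sold.
temperature = [60, 65, 70, 75, 80, 85, 90]
ice_cream = [20, 24, 31, 33, 41, 44, 52]

r = pearson_r(temperature, ice_cream)
slope, intercept = fit_line(temperature, ice_cream)
print(f"correlation: {r:.2f}")
print(f"each extra degree goes with about {slope:.2f} more units sold")
```

THE CORRELATION SPEAKS TO WHETHER A LINEAR RELATIONSHIP IS PRESENT AND HOW STRONG IT IS; THE SLOPE IS ONE WAY TO QUANTIFY HOW MUCH MORE ICE CREAM GOES WITH EACH DEGREE. NEITHER NUMBER SAYS ANYTHING ABOUT CAUSE.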
A LOGISTIC REGRESSION MODEL, WHICH TAKES LOTS OF DATA AND GIVES A BINARY OUTCOME, YES OR NO, POSITIVE OR NEGATIVE, IS A GOOD EXAMPLE. LOOKING AT THAT TYPE OF OUTCOME IS REALLY FOUNDATIONAL TO STATISTICAL MODELING, FOR ANALYZING RELATIONSHIPS WITH THAT OUTCOME, AND YOU USE IT IN MACHINE LEARNING TO PREDICT WHETHER SOMETHING WILL BE A YES OR NO, OR POSITIVE OR NEGATIVE. YOU SEE THAT A LOT IN SCIENTIFIC STUDIES. MACHINE LEARNING REQUIRES A LOT OF HISTORICAL DATA. WHAT YOU NEED IS FOR THIS LEARNING ENGINE TO UNDERSTAND ALL OF THE POSSIBLE VARIATION IN THE HISTORICAL DATA AND HOW THAT VARIATION IN THE HISTORICAL DATA CAN MAP TO VARIATIONS IN YOUR OUTCOME. I THINK AS YOU SEE MORE COMPLEX DATA, YOU NEED MORE DATA. THERE IS A LOT OF RESEARCH GOING ON NOW ABOUT IMAGE PROCESSING, ANALYZING PICTURES AND GRAPHICS AND IMAGERY. THAT GENERALLY NEEDS A LOT OF DATA. THERE IS LOTS OF VOLATILITY IN WHAT AN IMAGE LOOKS LIKE. IF YOU ARE BUILDING AN ALGORITHM FOR WHAT A CUP IS, THERE ARE TONS OF THINGS CUPS ARE AND PLACES THEY CAN BE. THE ALGORITHM NEEDS TO SEE ALL THE DIFFERENT TYPES OF CUPS AND PLACES THEY CAN BE TO UNDERSTAND WHAT A CUP IS AND WHETHER THERE IS ONE IN AN IMAGE IT SEES. I WOULD ALSO SAY, HOW MANY PEOPLE WATCH NETFLIX? NETFLIX? AMAZON PRIME? I USE AMAZON PRIME RELENTLESSLY, TO THE POINT MY GIRLFRIEND GIVES ME A PROBLEM ABOUT IT. I AVOID STORES AS MUCH AS POSSIBLE. AMAZON, NETFLIX, THE WAY THAT THEY MAKE RECOMMENDATIONS IS ALL USING MACHINE LEARNING, AND THEY ARE USING WHAT IS CALLED A RECOMMENDER SYSTEM. IT RECOMMENDS -- IT IS IN THE NAME, FIRST OFF. THEY RECOMMEND DIFFERENT SHOWS OR MOVIES OR PURCHASES BASED OFF TWO THINGS. ONE, ARE YOU BUYING LOTS OF THINGS THIS NEW ITEM IS SIMILAR TO? MAYBE NETFLIX SEES THAT YOU ARE WATCHING, YOU KNOW, FIVE OR SIX DIFFERENT ROM-COMS IN A ROW: HE LIKES THEM, LET'S GIVE HIM ANOTHER. OR THEY WILL SAY MAYBE THIS PERSON WATCHED TWO AVENGERS MOVIES AND STAR WARS AND A ROM-COM AND SOMETHING ELSE. THIS GUY IS ALL OVER THE MAP. WHO IS HE SIMILAR TO?
WHO IS THE PERSON HE IS SIMILAR TO, WHAT DID THEY WATCH, AND MAKE A RECOMMENDATION OFF THAT. IT IS WHO YOU ARE SIMILAR TO AND THE PATTERNS IN WHAT YOU ARE WATCHING, SO THEY ARE GOOD RECOMMENDATIONS, SO GOOD THAT IT SOMETIMES FEELS LIKE IT IS READING MY MIND. FACEBOOK ANALYZES LOTS OF UNSTRUCTURED DATA. IN FACEBOOK, HOW MANY HAVE SEEN A POST GET FLAGGED, IS THIS MALICIOUS OR MEAN OR EXPLICIT CONTENT, OR SEEN THAT FACEBOOK OR GOOGLE AUTOMATICALLY TRANSLATES SOMETHING? YOU HAVE SEEN THAT SOMEBODY MAKES A POST AND IT WAS AUTOMATICALLY TRANSLATED BY SOMETHING ELSE AND REVERTED BACK. THEY ARE USING TWO DIFFERENT THINGS THERE. WHEN LOOKING AT POSTS USING EXPLICIT LANGUAGE OR VERBAL ABUSE, THEY DO SENTIMENT ANALYSIS, WHERE EVERY WORD HAS A DIFFERENT EMOTIONAL SCORE ASSOCIATED WITH IT. IF IT SEES LOTS OF NEGATIVE WORDS USED TOGETHER, WE MIGHT WANT TO FILTER OUT THE POST. OR IT USES NEURAL MACHINE TRANSLATION, WHICH UNDERSTANDS HOW DIFFERENT WORDS ARE USED TOGETHER IN DIFFERENT LANGUAGES. IF IT SEES A LOT OF WORDS FROM ONE LANGUAGE, IT HAS THE ABILITY TO TRANSLATE IT ON THE SPOT TO ANOTHER LANGUAGE. THAT IS HOW GOOGLE WORKS. SENTIMENT ANALYSIS, HOW WORDS ARE USED, THAT IS USED A LOT BY AMAZON. RIGHT? GOING BACK TO THE AMAZON EXAMPLE, PEOPLE PROVIDE AMAZON WITH LOTS OF INFORMATION ABOUT THEIR PRODUCTS WITHIN THE COMMENTS. LOTS OF PEOPLE MIGHT WRITE ANGRY COMMENTS, AND PART OF AMAZON WILL GO ANALYZE THEM: THESE PEOPLE GAVE IT THREE STARS AND THEIR SENTIMENT IS LOWER THAN OTHER PEOPLE THAT GAVE IT THREE STARS; MAYBE THOSE ARE MORE POSITIVE CUSTOMERS. THAT IS SEPARATE INFORMATION WE CAN GET ON WHY PEOPLE ARE GETTING THIS PRODUCT. THAT IS NATURAL LANGUAGE PROCESSING. WHAT IS IT? THE PROCESS OF APPLYING -- UNSTRUCTURED DATA, MAINLY TEXT, TO IDENTIFY THEMES AND PATTERNS IN IT. AGAIN, AS I TALKED ABOUT, THIS IS AN AREA WHERE NLM IS DOING A LOT OF WORK AND HAS DONE A LOT OF WORK. YOU CAN FIND REALLY COOL RESEARCH ON IT. IT FINDS LOTS OF APPLICATION IN, I THINK, THE LARGER AREA OF INFORMATION RETRIEVAL AND INFORMATION ACCESS.
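THE SENTIMENT IDEA ABOVE, WHERE EVERY WORD CARRIES AN EMOTIONAL SCORE AND THE SCORES ARE ADDED UP, CAN BE SKETCHED IN A FEW LINES OF PYTHON. THIS IS A DELIBERATELY NAIVE VERSION AND NOT HOW ANY PARTICULAR COMPANY DOES IT: THE WORD LIST `WORD_SCORES`, ITS SCORES, AND THE TWO REVIEWS ARE INVENTED FOR ILLUSTRATION.

```python
# Hypothetical lexicon: positive words score above zero, negative below.
WORD_SCORES = {
    "great": 2, "love": 2, "good": 1,
    "slow": -1, "bad": -1, "broken": -2, "terrible": -2,
}


def sentiment(text):
    """Sum the scores of known words; unknown words count as neutral (0)."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    return sum(WORD_SCORES.get(w, 0) for w in words)


# Two made-up reviews that could both carry the same star rating.
review_a = "Good value, but shipping was slow."
review_b = "Terrible. Arrived broken and slow."

for review in (review_a, review_b):
    print(sentiment(review), review)
```

TWO REVIEWS WITH THE SAME STAR RATING CAN STILL SCORE VERY DIFFERENTLY ON THEIR TEXT, WHICH IS THE EXTRA SIGNAL DESCRIBED IN THE AMAZON EXAMPLE. A WORD-COUNTING APPROACH LIKE THIS MISSES NEGATION AND SARCASM, SO TREAT IT AS A STARTING POINT ONLY.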
SO ONE OF THE MOST COMMON METHODS OF NATURAL LANGUAGE PROCESSING IS CALLED TOPIC MODELING. IT LOOKS AT PATTERNS IN HOW DIFFERENT WORDS ARE USED AND DERIVES AN ABSTRACT THEME OR TOPIC DESCRIBED BY WORDS USED FREQUENTLY TOGETHER. IT IS COOL; IN MANY CASES YOU CAN SEE HOW DIFFERENT TOPICS OR THEMES TREND OVER TIME. FOR EXAMPLE, IF YOU ARE LIKE MY PARENTS AND RESEARCHING THE FLU AND LOOKING FOR PATTERNS IN INFLUENZA RESEARCH, PEOPLE ARE TURNING OUT A LOT OF WORK THERE. THAT THEME REDUCTION CONCEPT IS A LOT OF HOW MESH ACTUALLY WORKS UNDER THE HOOD. IT IS A REALLY BIG FIELD, AND AN INTERESTING AREA WHERE I'M USING IT NOW IS WITH ORGANIZATIONS. WE ARE TRYING TO HELP THEM ADAPT TO DIFFERENT REQUIREMENTS AND FIND WAYS TO INCREASE COLLABORATION WITHIN THE GOVERNMENT. I'M WORKING WITH THE ARMY CORPS OF ENGINEERS ON THIS ONE, MANAGING INFRASTRUCTURE, SO FOLKS CAN UPLOAD A DESCRIPTION OF THEIR REQUIREMENTS, WHAT THE INDIVIDUAL PROJECTS ARE AND WHAT THEY HAVE TO ACCOMPLISH. THEY UPLOAD IT TO THE TOOL AND THE ALGORITHM ANALYZES IT. IT HAS LEARNED THE STRATEGIC PLANS OF ORGANIZATIONS ACROSS THE FEDERAL GOVERNMENT, AND IT SAYS, OKAY, YOU HAVE TO DO A, B, C, AND D, AND YOU SHOULD COLLABORATE WITH THIS INSTITUTION BECAUSE THIS IS PART OF THEIR PLAN. IT HELPS INCREASE EFFICIENCY AND WORKS ON UNSTRUCTURED DATA. PEOPLE WITH DIFFERENT REQUIREMENTS DON'T HAVE TO READ ALL OF THE STRATEGIC PLANS AND GO TO FEDERAL WEBSITES TO LEARN WHO THEY SHOULD COLLABORATE WITH. THE TOOL CAN MAKE A RECOMMENDATION ON WHICH REQUIREMENTS THEY ARE MOST SIMILAR TO OR WHO IS MOST LIKELY ABLE TO MEET THEIR NEEDS. IT IS A COOL AREA. MACHINE LEARNING IS A MASSIVELY GROWING FIELD. IT IS A HOT-BUTTON TOPIC RIGHT NOW. I WILL ALSO SAY, I HAVE A PULPIT HERE TO PONTIFICATE A LITTLE BIT. PEOPLE ARE TALKING ABOUT AI. THERE IS NOT AN AI MACHINE AROUND THE CORNER. IT IS BRANDING; AI AND MACHINE LEARNING ARE REALLY THE SAME THING, AND WHAT PEOPLE ARE CALLING AI NOW PEOPLE CALLED MACHINE LEARNING FIVE YEARS AGO. SAME TECHNIQUES, AND A BIT OF A BRANDING EXERCISE.
I HEARD A TALK FROM SOMEBODY WHO WAS A RESEARCHER DOWN AT UNC WHO HAD DONE A LOT OF WORK IN PREDICTIVE MODELING. SHE PRODUCED 10 OR 15 PAPERS IN PREDICTIVE MODELING. SHE TALKED ABOUT, IF SHE CHANGED THE PAPERS AND HOW SHE REFERRED TO PREDICTIVE MODELING, HOW MUCH SHE WOULD HAVE GOTTEN. IT IS FUNNY. I WILL GIVE YOU THAT ANECDOTE. AGAIN, AI AND MACHINE LEARNING ARE VERY SIMILAR AND THE SAME THING. YOU CAN DRAW THAT CONNECTION. WHEN PEOPLE ARE TALKING ABOUT AI, THAT IS THE SAME THING WE TALK ABOUT HERE. NEXT IN THE FAMILY OF APPLIED TECHNIQUES, OPERATIONS RESEARCH IS A REALLY, REALLY BROAD FIELD AND PROBABLY ONE YOU HAVE EXPERIENCED IN LOTS OF DIFFERENT WAYS. WHO HAS BEEN TO A NATIONALS GAME IN THE LAST TWO SEASONS, THIS SEASON OR THE SEASON BEFORE? HAVE YOU NOTICED NOW YOU HAVE TO GO TO A SEPARATE LINE IF YOU HAVE CASH, A SEPARATE ONE WITH A CARD, AND A SEPARATE ONE WITH NATIONALS BUCKS OR WHATEVER IT IS? THAT IS PART OF AN OPERATIONS RESEARCH STUDY. THEY ARE FINDING SO MUCH OF THE SLOWDOWN IN GOING THROUGH THE LINES AT A NATIONALS GAME IS WHAT TYPE OF PAYMENT YOU ARE USING AND WORKING THROUGH THAT. THEY BROKE IT DOWN INTO DIFFERENT SEGMENTS. I TALKED TO KATE ABOUT OPERATIONS RESEARCH. SHE SAID WHEN SHE WAS GOING TO BUSINESS SCHOOL THERE WAS A HUGE STUDY ON STARBUCKS. THEY HAD JUST CHANGED THEIR PROCESS, AND THEY HAD DONE AN OPERATIONS RESEARCH STUDY ON HOW TO BREAK DOWN THE DIFFERENT ROLES OF ALL THE DIFFERENT PEOPLE WORKING AT STARBUCKS. AT FIRST IT WAS A DISASTER. EVERYONE WAS DOING THINGS DIFFERENTLY. OVER TIME IT GOT BETTER. A FUNNY EXAMPLE IS TSA. SOME PLACES ARE SLOW, SOME PLACES ARE BETTER THAN OTHERS. I LIKE REAGAN. THEY ARE FAST AND HAVE FIGURED IT OUT: THEY HAVE BROKEN IT DOWN INTO DIFFERENT SUBSETS AND GATES AND DISTRIBUTE PEOPLE, AND THE PEOPLE THAT WORK AT TSA ARE BETTER AT REAGAN. DO YOU KNOW YOU HAVE TO TAKE OFF YOUR BELT? YES, FOR THE FIFTH TIME. PERSONAL RANT. OPERATIONS RESEARCH IS ABOUT TAKING AN INEFFICIENT PROCESS AND MAKING IT MORE EFFICIENT, AND IT AIDS IN DECISION MAKING. YOU WILL OFTEN SEE IT IS BROKEN INTO TWO BIG FAMILIES.
ONE IS SIMULATIONS AND TWO IS OPTIMIZATIONS, BOTH OF WHICH APPLY MATH AND STATISTICAL METHODS TO OPTIMIZE OUTCOMES OR FIND NEW WAYS OF DOING SOMETHING MORE EFFICIENTLY AND MORE EFFECTIVELY. YOU SEE IT IN BUSINESS. OR IS A BIG AREA OF BUSINESS. THERE ARE TWO EXAMPLES HERE THAT I THINK ARE INTERESTING ONES. ONE, SIMULATING A PHYSICAL SYSTEM IS WHERE YOU SEE LOTS OF OR COME IN. MAYBE I HAVE A BOUNCING BALL. I WANT TO GUESS WHERE THAT BALL WILL END UP. THERE IS ONE WAY TO DO THIS THAT CAN DO IT REALLY FAST. ESSENTIALLY, I KNOW THE FORCE OF GRAVITY AND KNOW THE FORCE OF A BALL HITTING THE WALL, AND I CAN MEASURE FRICTION AND SIMULATE FRICTION AND SIMULATE THE BOUNCINESS OF THE BALL. I DON'T KNOW THE SCIENCE TERM. THE BOUNCINESS OF THE BALL. THROW IT OUT AND MEASURE THESE THINGS. WHAT YOU CAN DO IS SIMULATE A BILLION RUNS: I THROW THE BALL OFF THE FLOOR AND THE WALL, OR OFF THE CEILING FIRST, AND THROW IT THIS HARD OR THAT HARD AND SPIN IT, AND YOU CAN SIMULATE ALL OF THE POSSIBLE OUTCOMES. WHEN SOMEBODY IS LIKE, I NEED TO KNOW HOW I CAN BOUNCE THE BALL AND GET IT THERE, YOU CAN TELL THEM, HERE ARE ALL THE DIFFERENT SIMULATED SITUATIONS WE TRIED THAT GOT THE BALL THERE; YOU CAN TRY ANY OF THEM. YOU SEE IT A LOT WITH PHYSICAL SYSTEMS, TO UNDERSTAND ALL OF THE OUTCOMES BASED OFF DIFFERENT PHYSICAL RULES FOR HOW SOMETHING WILL ACT. WE SEE IT IN SEISMIC RESEARCH; RIGHT? YOU CAN SIMULATE HOW AN EARTHQUAKE WILL PRODUCE DIFFERENT SEISMIC FIGURES. IT WORKS WELL IN PHYSICS SYSTEMS AND YOU SEE IT IN -- WE USE IT FOR AGENT-BASED MODELING. IN A DISASTER SITUATION, LET'S SAY THERE IS A HURRICANE. HOW WILL PEOPLE MOVE? WE CAN SIMULATE HOW PEOPLE WILL MOVE THROUGH A CITY OR EVACUATE USING A TECHNIQUE CALLED AGENT-BASED MODELING. THAT IS USEFUL. WHEN PLANNING AN EVACUATION ROUTE WE CAN GO IN AND SAY, WE EXPECT PEOPLE TO EVACUATE THROUGH HERE, SO PROVIDE THE MOST RESOURCES OR INSTITUTIONAL OR STRUCTURAL SUPPORT THERE. ANOTHER AREA THAT WE USE IS OPTIMIZATION.
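THE BOUNCING BALL SIMULATION DESCRIBED ABOVE CAN BE SKETCHED AS A SMALL MONTE CARLO RUN: TRY MANY RANDOM THROWS, APPLY THE PHYSICAL RULES, AND KEEP THE THROWS THAT LAND WHERE YOU WANT. THE PHYSICS HERE IS HEAVILY SIMPLIFIED AND IS AN ASSUMPTION OF THIS SKETCH: A FLAT FLOOR, ONE BOUNCE, NO AIR DRAG, NO FRICTION, NO SPIN, AND NO WALLS. THE "BOUNCINESS" IS MODELED AS A COEFFICIENT OF RESTITUTION. THE FUNCTION NAMES, PARAMETER RANGES, AND TARGET ARE INVENTED FOR ILLUSTRATION.

```python
import math
import random

G = 9.81  # gravitational acceleration, m/s^2


def landing_distance(speed, angle_deg, bounciness):
    """Distance at which the ball lands for the second time.

    First hop is the ideal projectile range; after one bounce the vertical
    speed is scaled by `bounciness`, so the second hop is that fraction of
    the first hop.
    """
    first_hop = speed ** 2 * math.sin(2 * math.radians(angle_deg)) / G
    return first_hop * (1 + bounciness)


def simulate(target, tolerance, runs=10_000, seed=0):
    """Try many random throws and return those that land near the target."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = []
    for _ in range(runs):
        throw = (
            rng.uniform(1, 10),     # throw speed, m/s
            rng.uniform(10, 80),    # launch angle, degrees
            rng.uniform(0.3, 0.9),  # bounciness of the ball
        )
        if abs(landing_distance(*throw) - target) <= tolerance:
            hits.append(throw)
    return hits


hits = simulate(target=5.0, tolerance=0.1)
print(f"{len(hits)} of 10000 simulated throws landed within 0.1 m of 5 m")
```

EACH ENTRY IN `hits` IS ONE SIMULATED SITUATION THAT GOT THE BALL THERE, WHICH IS THE KIND OF ANSWER DESCRIBED IN THE TALK. A REAL STUDY WOULD ADD THE RULES LEFT OUT HERE.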
ANY TIME THAT YOU HEAR RESOURCE ALLOCATION OR ANYTHING LIKE THAT, IT IS OPTIMIZATION. THIS IS A NICE LITTLE EXAMPLE WE HAVE OF DISTRIBUTING BEDS THROUGHOUT A HOSPITAL; RIGHT? IN MANY CASES THERE IS AN OPTIMAL WAY TO PUT THE RIGHT NUMBER OF BEDS IN ROOMS TO ALLOW FOR OPTIMAL SPACE FOR EVERY PATIENT. YOU WANT TO HAVE THE RIGHT PATIENTS NEAR EACH OTHER AND DON'T WANT SOMEBODY HIGHLY CONTAGIOUS NEAR A BAD PLACE IF THEY GOT [INDISCERNIBLE]. YOU WANT TO OPTIMALLY DISTRIBUTE NURSES AND WARDS TO THE RIGHT PLACES. THIS SEEMS SUPER INTUITIVE; YOU KNOW THERE ARE DIFFERENT WARDS IN DIFFERENT AREAS OF THE HOSPITAL AND DIFFERENT SECTIONS, BUT IT FALLS BACK ON THE OR PIECE: AN OPTIMAL SYSTEM FOR ALLOCATING THIS NUMBER OF NURSES AND DOCTORS ACROSS THIS BROAD RANGE OF PLACES WITH A GIVEN AMOUNT OF RESOURCES, WHICH MIGHT BE BEDS AND FACILITIES AND ALL THAT. YOU CAN SEE OR IN LOTS OF DIFFERENT PLACES AND YOU OFTEN SEE IT WHERE YOU WOULDN'T HAVE EXPECTED IT. LAST ONE, THANKS FOR BEARING WITH ME, GUYS. THE LAST COMPETENCY IS DATA VISUALIZATION, WHICH IS REALLY HOW YOU, I GUESS, ACTUALIZE AND TEACH OR SHOW PEOPLE THE FINDINGS THAT YOU BUILT OUT THROUGH THE METHODOLOGY. IT IS A BIG PART OF COMMUNICATING WHAT YOU DID, BECAUSE LOTS OF THE STUDIES ARE VERY, VERY COMPLICATED AND LOTS OF METHODS ARE VERY, VERY COMPLICATED. YOU PROBABLY GOT A LITTLE BIT OF THAT TODAY. YOU NEED TO FIND A WAY TO REALLY GENERALIZE THE FINDINGS TO SOMETHING THAT IS INTUITIVE, MAINTAINS THE INTEGRITY OF THE STUDY YOU DID, AND MAPS ASSUMPTIONS TO ACTUAL FINDINGS. I WALKED DOWN THE HALLWAY COMING HERE, WITH VISUALIZATIONS OF NLM. IT IS PAGE VIEWS; YOU CAN SEE IT IS ALL IN THAT HEAT MAP, AND THERE ARE COOL WAYS TO VISUALIZE THINGS GEOGRAPHICALLY. THEY ARE DOING A BUNCH OF THINGS IN THE HALLWAY THERE. I WILL LET THEM DIRECT PEOPLE IF THEY WANT TO CHECK IT OUT. IT IS A GRAPHICAL REPRESENTATION OF ANALYTICS TO BETTER COMMUNICATE INSIGHTS.
THERE IS A FINE LINE YOU NEED TO WALK FOR DATA VISUALIZATION BETWEEN IT BEING SIMPLE AND STRAIGHTFORWARD AND INTUITIVE, BUT COMPLEX ENOUGH THAT YOU DON'T OVERSIMPLIFY IT AND PEOPLE MISS THE POINT OF WHAT YOU DID OR ANY CORE ASSUMPTIONS YOU MADE IN THE PROCESS. IT IS KIND OF THAT BALANCING ACT BETWEEN THE TWO. WHAT YOU WILL SEE A LOT NOW, AND I THINK WHAT I HAVE SEEN, IS THERE ARE A LOT OF REALLY COOL VISUALIZATIONS OUT ON SOCIAL MEDIA. NEWSPAPERS AND ONLINE NEWSPAPERS DO THIS REALLY WELL NOW, AND THEY ADD REACTIVE COMPONENTS SO YOU CAN CLICK AND DRAG OVER THINGS OR HOVER OVER THINGS TO GET MORE INFORMATION. THAT IS A WAY VISUALIZATIONS CAN PROVIDE MORE INFORMATION WITHOUT GETTING OVERLY COMPLEX. YOU CAN SEE A LOT OF LAYERS THAT ARE ADDED INTO ALL THAT. WE HAVE TWO EXAMPLES. I WILL WALK YOU THROUGH TWO EXAMPLES THAT WILL BE FUN. WITH DATA VISUALIZATIONS YOU CAN QUICKLY GOOGLE AND SAY, WHAT ARE THE WORST DATA VISUALIZATIONS OUT THERE, AND GET COMICALLY BAD ONES. YOU CAN SAY, WHAT ARE COOL DATA VISUALIZATIONS OUT THERE, AND SPEND AN HOUR PLAYING WITH THESE THINGS. THEY ARE REALLY, REALLY COOL. THIS IS ONE EXAMPLE YOU WILL PROBABLY SEE A LOT MORE OF. FIRST ONE, ALL RIGHT. THIS IS NOT A GOOD ONE. POOR VISUALIZATION. A COUPLE OF THINGS I WILL CALL OUT. FIRST OF ALL, CAN ANYONE IN ONE SENTENCE TELL ME EXACTLY WHAT THE FINDING HERE IS? OKAY. DIDN'T THINK SO. I SPENT A LOT OF TIME THINKING ABOUT THIS ONE AND STILL CAN'T COME UP WITH IT. IF YOU COULD, ACTUALLY THERE IS AN ANSWER ON THE NEXT SLIDE. ANYWAYS, ONE, YOU HAVE TWO Y AXES. RIGHT? ON THE LEFT 0 TO 700 AND ON THE RIGHT 0 TO 8. YOU DON'T REALLY KNOW WHICH AXIS ALIGNS TO WHICH LINE. SO, LIKE, OKAY. WHICH ONE IS WHICH? IS THE BLUE LINE ENDING AT 7 OR IS THE BLUE LINE ENDING AT 600? YOU CAN KIND OF GET THERE A LITTLE BIT BY LOOKING AT THE KEY: ENPLANED PASSENGERS IS IN MILLIONS, SO MAYBE THAT IS THE 700 MILLION, WHICH WOULD MAKE THE OTHER LINE THE ONE ON THE RIGHT. NOT INTUITIVE AT ALL. NO CLEAR FINDING. I DON'T LIKE THE X AXIS ALL THE WAY OVER THIS WAY.
IT IS JUST NOT VERY CLEAR. NOW, IF WE HAVE THE SAME DATA BUT VISUALIZE IT BETTER, WE HAVE THIS ONE. CAN ANYONE IN ONE SENTENCE TELL ME WHAT THE FINDING IS FROM THIS? HINT, IT IS THE TITLE. >> AUDIENCE: [LAUGHING]. >> AARON SANT-MILLER: OKAY. NO ONE. COOL. ALL RIGHT. BAGGAGE FEES PROBABLY REDUCE THE CHANCE OF AN AIRLINE LOSING YOUR BAGS. HOW DO WE GET THERE? YOU ACTUALLY HAVE THE SAME TWO CURVES, BUT ONE IS A BIG GRAY CURVE AND THE OTHER IS A MAP CURVE. YOU HAVE THEM DEFINED TO THE LEFT: THE LINE IS MISHANDLED BAGS PER 1,000 PASSENGERS AND GRAY IS US DOMESTIC PASSENGERS. YOU KNOW WHICH IS WHICH. THEY DO A GOOD JOB OF CALLING OUT CONTEXT FOR HOW THE FINDINGS WERE GENERATED. IF YOU JUST LOOK AT THIS CURVE, YOU SEE KIND OF THIS PEAK HERE AND YOU DON'T REALLY IMMEDIATELY UNDERSTAND WHY THERE WAS A SPIKE AND WHY THERE WAS A DECLINE. NOW THAT YOU HAVE CLUES, YOU MAYBE GET THERE A LITTLE BIT BETTER. THERE WAS 9/11; RIGHT? YOU HAD LOTS MORE BAGGAGE SCREENING IMMEDIATELY AFTER THAT, LOTS MORE MISHANDLED BAGS AND SECURITY. THAT IS WHERE YOU SEE THAT SPIKE. THEN THERE IS A HUGE CHANGE IN 2008 WITH AN IMMEDIATE DROP, AND THEY TELL YOU WHAT THAT IS: AIRLINES BEGIN CHARGING FOR CHECKED BAGGAGE AND OTHER AIRLINES START FOLLOWING. YOU UNDERSTAND WHY THERE IS A DECLINE, AND THEY EXPLAIN HERE THAT EVEN THOUGH THE VOLUME, THE BIG GRAY BOX, MORE OR LESS STAYED HIGH AND PRETTY CONSISTENT, NOW THAT THEY ARE CHARGING FOR CHECKED BAGS, MORE PEOPLE ARE CARRYING THEM ON AND PACKING LESS, AND FEWER PEOPLE ARE LOSING BAGS; RIGHT? YOU GET THAT HERE: BAGGAGE FEES REDUCE THE CHANCES OF LOSING A BAG. THIS IS INTUITIVE, THE FINDING IS RIGHT THERE, AND ALL THE CONTEXT IS IN THERE AS WELL, SO YOU DON'T FEEL YOU ARE MISSING ANY ASSUMPTIONS THAT THE DATA SCIENTISTS MADE WHEN THEY BUILT THIS VISUALIZATION. OKAY. THOSE ARE THE 10. RIGHT? THIS WHOLE ECOSYSTEM HERE. AGAIN, YOU CAN SEE HOW THEY FIT TOGETHER RIGHT NOW. METHODOLOGICAL DESIGN COMES INTO THIS AREA, WITH ADVANCED MATH AND COMPUTER SCIENCE AND PROGRAMMING AND SCRIPTING INTEGRATED ACROSS ALL OF THEM.
DATA SCIENCE IS THE FOUNDATION THAT LEADS TO THE METHODOLOGICAL PIECE HERE: DATA MINING TO FIND DIFFERENT PATTERNS, DIFFERENT WAYS OF GENERATING INSIGHTS, WHETHER MODELING OR MACHINE LEARNING OR OR, AND HOW YOU COMMUNICATE IT OUT IN DATA VISUALIZATION. WE PUT IT ALL TOGETHER AND YOU CAN HAVE IT ON YOUR PIECE OF PAPER AS WELL. FOUR HIGH-LEVEL TAKEAWAYS. I HAVE THE PULPIT TO RAMBLE ON A LITTLE BIT. NO. 1, DR. BRENNAN SAID THIS AT THE START: DATA SCIENCE IS A TEAM SPORT. WITH ALL THIS STUFF HERE, YOU RARELY HAVE AN EXPERT THAT DOES ALL OF IT ON ONE PROJECT. LOTS OF PEOPLE ARE WORKING TOGETHER, PEOPLE FOCUSED ON STATISTICAL MODELING AND ON DATA VISUALIZATION AND ON MACHINE LEARNING, ALL COLLABORATING TOGETHER. ON MY SEISMIC PROJECT I HAVE A SEISMOLOGIST THAT DOES THAT INTERPRETATION; THAT IS NOT MY JOB. WE HAVE SOMEONE FOCUSED ON DATA VISUALIZATION AND DASHBOARDS, AND TOGETHER WE BUILD A COMPREHENSIVE SOLUTION, BUT NO ONE IS TRYING TO DO IT ALL. AGAIN, DATA SCIENCE AS A DISCIPLINE CONTINUES TO EVOLVE. IT IS AN INTERDISCIPLINARY FIELD. ALL THESE FIELDS OF STUDY ARE EVOLVING AND CHANGING, MEANING DATA SCIENCE IS EVOLVING AND CHANGING. I ALLUDED TO THIS BEFORE; RIGHT? YOU USED TO HAVE TO HAVE A PHD TO PASS GO TO DO DATA SCIENCE. IN DIFFERENT WAYS THAT IS A LOT EASIER NOW. IT IS A CHANGING FIELD, AND THE DEFINITION OF WHAT IT IS IS CHANGING. SOME PEOPLE CONSIDER DATA SCIENCE A REALLY DEEPLY SCIENTIFIC DOMAIN, AND OTHER PEOPLE TAKE AN ALMOST PHILOSOPHICAL VIEW: UNDERSTANDING WHAT IS DATA AND WHAT DOES DATA MEAN. IN THEORY, EVERYTHING THAT EXISTS OUT THERE IN THE WORLD, ALL OF THE INFORMATION, IS REPRESENTED IN DATA; ANYTHING WE LEARN OFF THE DATA IS IMPLICIT, AND WE ARE NOT LEARNING OFF THE REAL SITUATION. IT GETS REALLY PHILOSOPHICAL AND EXISTENTIAL. THAT IS HOW IT BREAKS DOWN, AND THERE ARE LOTS OF WAYS FOR PEOPLE TO CONSIDER THE SCIENCE OF DATA. ALL THESE ARE A CONTINUUM.
I SAID BEFORE THAT THE SKILLS THAT ARE EMBODIED IN STATISTICAL MODELING, THE SAME SKILLS AND METHODS, ARE USED IN MACHINE LEARNING. THERE IS LOTS OF OVERLAP IN THE COMPETENCIES, AND THE SCALE IS A CONTINUUM, NOT MUTUALLY EXCLUSIVE. IT IS ONE FLOWING PIECE WE BROKE DOWN INTO 10 CORE COMPONENTS. FINALLY, WE WENT AN INCH DEEP INTO ALL THESE. I SPENT THREE MINUTES ON COMPUTER SCIENCE AND ADVANCED MATH, WHICH ARE HUGE, HUGE FIELDS. WE CAN'T BEGIN TO WRAP OUR MINDS AROUND EVEN ONE OF THOSE; PEOPLE DEVOTE THEIR ENTIRE LIVES TO STUDYING THOSE. WE HAVE GONE A LITTLE BIT INTO EACH OF THEM. I ENCOURAGE FOLKS, IF INTERESTED, TO EXPLORE THEM A LITTLE MORE. DATA SCIENCE IS GROWING, AND AI AND MACHINE LEARNING ARE GROWING, AND THERE ARE LOTS OF REALLY ACCESSIBLE BLOGS THAT ARE INTUITIVE TO READ ABOUT ALL THESE THINGS. I ENCOURAGE FOLKS TO EXPLORE THESE FIELDS A LITTLE BIT MORE. THAT IS ALL I GOT. WE WILL OPEN IT UP FOR QUESTIONS. >> AUDIENCE: [APPLAUSE]. >> AARON SANT-MILLER: YES. >> [INDISCERNIBLE]. >> AARON SANT-MILLER: I DID NOT DO RESEARCH TO OFFEND INFORMATION. THEY ASKED ABOUT ELEPHANTS AND JOHNNY CASH. I WILL USE THE REPRESENTATIVENESS OF THIS ROOM TO SAY THAT JOHNNY CASH WINS OUT. >> AUDIENCE: [LAUGHING]. >> WERE THERE ANY QUESTIONS ONLINE? >> YEAH. I WILL GET US STARTED WITH A SUPER INTERESTING QUESTION THAT CAME IN ONLINE. THIS PERSON IS INTERESTED IN IDENTIFYING PATTERNS IN MEDICAL TERMINOLOGY -- THESE ARE ANALOG FORMATS PRODUCED BETWEEN THE 1920S AND 1980S. HOW WOULD YOU SUGGEST THIS TYPE OF MATERIAL BE PREPARED FOR DATA SCIENCE APPLICATIONS? >> AARON SANT-MILLER: BOY. THANKS, LISA. >> AUDIENCE: [LAUGHING].
>> AARON SANT-MILLER: THE FIRST PIECE IS YOU HAVE TO FIND A WAY TO EXTRACT THE VIDEO ITSELF, AND YOU CAN DO SOME SORT OF SPEECH RECOGNITION TO CONVERT ALL OF THE ACTUAL LANGUAGE AND TERMINOLOGY USED INTO DIGITIZED TEXT. THAT IS THE PRE-PROCESSING PIECE: RUN THE VIDEO AND EXTRACT THE TEXT. ONCE THE TEXT IS IN DIGITAL FORM, USE THE NLP TECHNIQUES WE TALKED ABOUT TO LOOK FOR SPECIFIC TERMS AND HOW THEY EVOLVED AND HAVE BEEN USED DIFFERENTLY OVER TIME, AND THE SPECIFIC CONTEXT THE TERMS ARE USED WITHIN, AND IDENTIFY HOW DIFFERENT TERMS HAVE DIFFERENT SEMANTIC MEANINGS AT DIFFERENT POINTS IN TIME. YOU CAN UNDERSTAND HOW DIFFERENT WORDS ARE USED AT DIFFERENT POINTS IN A REALLY ABSTRACT VECTOR SPACE. DOG IS OFTEN USED WITH CAT; THOSE TWO ARE CLOSE IN VECTOR SPACE BECAUSE OF THE WORDS THEY ARE USED AROUND. YOU CAN UNDERSTAND HOW TERMINOLOGY IS USED AT DIFFERENT POINTS BASED OFF THE WORDS IT IS USED WITH. I HOPE THAT ANSWERS THE QUESTION. YOU CAN E-MAIL THAT E-MAIL AND I WILL GIVE YOU A MORE DETAILED ANSWER ONLINE. FRIEND, WHAT IS UP? >> IN MY JOB HISTORY OF DOING STUFF LIKE THIS, YOU KNOW, [INDISCERNIBLE] LATENT SEMANTIC INDEXING, AS YOU MENTIONED, WAS REALLY, REALLY COOL. HERE I DISCOVERED SYSTEMS LIKE WORDNET AND UMLS, ET CETERA. HOW DO YOU CONTRAST THOSE KINDS OF APPROACHES? >> AARON SANT-MILLER: REALLY GOOD QUESTION. WE ARE TALKING NATURAL LANGUAGE PROCESSING AGAIN. UMLS IS A SEMANTIC LATENT INDEX, AND YOU WILL SEE EXTENSIONS OF THOSE IN TOPIC MODELING. LDA IS AN EXAMPLE THAT BUILDS ON THAT OVERLAPPING SEMANTIC UNDERSTANDING OF WORDS USED DIFFERENTLY. THE WAY I SEE THE TRADEOFF WITH BIGGER NEURAL NETWORKS: WHEN I DO STUDIES, I FIND THE NATURAL LANGUAGE MODEL I WANT TO BUILD IS SUPER TAILORED TO THE SPECIFIC DOMAIN I WANT TO RESEARCH OR APPLY IT TO. BIG NEURAL NETS ARE TRAINED ON LARGE AMOUNTS OF DATA. ONE IS TRAINED OFF ALL OF WIKIPEDIA. IT IS TRAINED AT A GENERALIZED LEVEL AND HAS A HARDER TIME UNDERSTANDING SPECIFIC USES OF LANGUAGE.
YOU CAN DO A COUPLE OF DIFFERENT THINGS TO TAKE REALLY BIG MAPPINGS AND MAKE THEM MORE SPECIFIC. YOU CAN DO TRANSFER LEARNING, WHERE YOU TAKE A BIG MODEL AND TRANSFER IT TO LEARN YOUR SPECIFIC DOMAIN, OR RUN IT THROUGH ESSENTIALLY A SEMANTIC FILTER THAT WILL HELP IT REDEFINE HOW IT UNDERSTANDS SPECIFIC WORDS. THERE ARE LOTS OF EXAMPLES WHERE A WORD MIGHT MEAN ONE THING IN ONE CONTEXT AND ANOTHER THING IN ANOTHER CONTEXT. THERE IS A CLASSIC SITUATION, A VERY OBVIOUS THING. TOTAL BRAIN FART RIGHT NOW; I CAN'T PULL ONE OFF THE TOP OF MY HEAD. YOU CAN RUN IT THROUGH A CODEX OF HOW SPECIFIC WORDS ARE USED IN A SPECIFIC DOMAIN TO MAKE THE NEURAL NETWORK APPLICABLE TO YOUR STUDY. IN THE WORK I'M DOING MAPPING STRATEGIC PLANS TO PEOPLE'S REQUIREMENTS, WE USE LDA, WHICH IS REALLY INTERPRETABLE. PEOPLE ARE LIKE, I SEE THIS TOPIC, YOU KNOW, COMING UP AGAIN AND AGAIN IN YOUR REQUIREMENTS, AND IT IS ALSO A TOPIC THAT IS USED VERY, VERY FREQUENTLY IN FEMA'S STRATEGIC PLAN. WHAT IS THIS TOPIC, WHAT WORDS MAKE UP THE TOPIC, AND WHERE ELSE ARE THE WORDS USED? YOU HAVE THE CLASSIC TRADEOFF THAT BIG METHODS DONE AT SCALE ARE LESS INTERPRETABLE. YOU HAVE A LITTLE BIT OF A TRADEOFF: SOME CLASSIC TECHNIQUES ARE MORE INTERPRETABLE. THEY HAD TO BE, BECAUSE THEY ARE BUILT OFF LESS DATA. >> CAN YOU DEFINE LDA? >> LATENT DIRICHLET ALLOCATION. OR DEER [INDISCERNIBLE] OR SOMETHING LIKE THAT. YES? >> CAN YOU TALK ABOUT INDUSTRIES YOU WORKED IN, AND HOW DO YOU BEGIN TO RECOGNIZE JOB OPPORTUNITIES OR WORKFLOW OPPORTUNITIES WHERE ADDITIONAL APPLICATIONS USING SOME OF THESE MIGHT BE USEFUL? I WILL ASK DIANNE TO TALK ABOUT HOW WE [INDISCERNIBLE] WITH DATA SCIENTISTS. HOW DO YOU GET PEOPLE SORT OF DEFINING TO SEE OPTIMUM. >> AARON SANT-MILLER: YEAH. THAT IS A GOOD QUESTION. THE QUESTION IS, HOW DO YOU START TO SEE OPPORTUNITIES DEVELOP FOR DATA SCIENCE AND MACHINE LEARNING, HOW DO YOU START TO RECOGNIZE THOSE THINGS, AND MAYBE WHAT WERE SOME INDICATORS?
THERE IS A LIST OF INDICATORS I CAN'T GO THROUGH NOW THAT WE CAN SHARE IF FOLKS ARE INTERESTED IN TRIGGERS AND INDICATORS. WE SEE FOLKS WHO HAVE LOTS OF DATA THEY ARE NOT DOING ANYTHING WITH, OR DATA THAT PROBABLY HAS INFORMATION IN IT THAT PEOPLE ARE HAVING A HARD TIME ACCESSING. ONE SIDE IS A LOT OF DATA THAT IS EITHER NOT BEING USED OR ANALYZED, OR IS DIGITIZED AND COULD BE ANALYZED REALLY QUICKLY TO DEFINE ITS HIGH-LEVEL CHARACTERISTICS. THE OTHER SIDE IS THAT YOU ARE LOOKING FOR EFFICIENCY AND INSIGHT AND KNOW YOU HAVE DATA THAT PERTAINS TO IT. A PROCESS MOVES REALLY SLOWLY, OR YOU WANT TO PREDICT AN OUTCOME OR ESTIMATE WHETHER THINGS ARE RELATED. THAT IS A SPECIFIC REQUIREMENT, AND THE NEXT QUESTION IS WHAT DATA SUPPORTS IT. SO, I THINK, AGAIN, IF YOU WANT SPECIFICS, WE HAVE A SPECIFIC CHECKLIST OF TRIGGERS AND INDICATORS WE CAN SEND YOU IF INTERESTED. I MOST OFTEN SEE IT WHEN I SEE AN INEFFICIENT PROCESS I THINK CAN BE SPED UP, OR WHERE INSIGHT CAN COME, OR LOTS OF DATA NOT BEING USED TO ITS FULLEST POTENTIAL. THE EXAMPLE WE GOT FROM E-MAIL IS A GREAT ONE. THERE IS LOTS OF LANGUAGE IN ANALOG VIDEO FORMATS AND IT DOESN'T SOUND LIKE ANYONE IS REALLY ANALYZING IT. I AM MAKING AN ASSUMPTION, WITH THAT CAVEAT, BUT THAT IS DATA AT OUR FINGERTIPS WE ARE NOT USING. IT IS OUT THERE, AND THAT IS THE OTHER SIDE OF THE COIN. >> I WILL ALSO ADD TO THAT. I HAVE SEEN ORGANIZATIONS WITH 2 OR 3 RESOURCES ENTIRELY FOCUSED ON EXPLORATORY IDEAS, DIGGING INTO DATA SETS NOT USED THAT OFTEN AND EXPLORING DIFFERENT WAYS TO MAKE USE OF THAT INFORMATION. I HAVE ALSO SEEN OTHER ORGANIZATIONS WITH SOME KIND OF CROWDSOURCING MECHANISM OR SUGGESTION BOX. IT COULD BE SEND AN E-MAIL, OR SUBMIT ON SHAREPOINT, OR DROP A NOTE IN A PHYSICAL BOX: HERE ARE MY IDEAS FOR DATA SCIENCE. A TEAM PRIORITIZES AND INVESTS IN THOSE IDEAS BASED ON THE INFORMATION IT BRINGS TO THE ORGANIZATION. THERE IS VALUE IN BOTH OF THOSE. SOME OF THE BEST IDEAS COME FROM THE MOST UNLIKELY PLACES. I DIDN'T MEAN TO JUMP IN THERE. >> AARON SANT-MILLER: YEAH. >> [INDISCERNIBLE]. SUPERVISORS LIKE YOU AND [INDISCERNIBLE]. >> YEAH.
SO THE QUESTION IS, IF YOU HAVE AN IDEA OR YOU SEE AN EFFICIENCY OR A NEW PROCESS, OR SOME WAY TO DO YOUR WORK HERE AT NLM, HOW CAN YOU GO ABOUT GETTING THAT DONE? THE FIRST THING TO DO IS TO TALK TO YOUR SUPERVISOR. MANY OF OUR PROJECTS HAVE TEAM WORKING GROUPS AND THINGS LIKE THAT. I WOULD SAY TAKE YOUR IDEAS TO YOUR SUPERVISORS AND TO THE WORKING GROUPS. THERE IS GOING TO BE WORK LOOKING AT COMMUNITIES OF PRACTICE. THAT WOULD BE A GOOD PLACE TO BRING IDEAS AND SHARE DIFFERENT THINGS GOING ON IN THE DIFFERENT DIVISIONS. THERE IS LOTS OF OPPORTUNITY. IF YOU DON'T SPEAK UP OR BRING IT UP, IT WON'T COME TO FRUITION. FEEL EMPOWERED TO BRING IDEAS FORWARD. YOU CAN ALWAYS SEND THEM TO OUR DATA SCIENCE OR NLM'S DATA SCIENCE OR NIH.COM. WE LOOK FORWARD TO IT. WE POST THINGS TO THE WIKI AND ARE HAPPY TO SHARE YOUR IDEAS. >> AARON SANT-MILLER: TALKING ABOUT THE SUBJECT MATTER EXPERT AS ONE OF THE THREE PILLARS, IT IS NOT ALWAYS A PHD IN A COMPLEX FIELD. OPTIMIZING ROOM BOOKING, RIGHT? ANYONE THAT HAS TO DEAL WITH THE CHALLENGE OF WORKING WITH A RESERVATION SYSTEM FOR MEETING ROOMS IS A SUBJECT MATTER EXPERT. IDEAS FOR DATA SCIENCE OPTIMIZATION COME FROM PEOPLE USING A SYSTEM OUT THERE AND SAYING THIS CAN BE MORE EFFECTIVE AND EFFICIENT. TO FURTHER EMPOWER FOLKS: IF YOU RUN INTO A BOTTLENECK AND THINK, THIS IS SOMETHING WE CAN FIX, THAT IS GENERALLY HOW LOTS OF THESE THINGS START. YEAH? >> FOLLOWING ON A LOT OF THE QUESTIONS, THIS IS AN EXCELLENT OVERVIEW, THANK YOU. IN YOUR TRAVELS SPEAKING ABOUT DATA SCIENCE AND LEARNING ABOUT IT AND UNDERSTANDING THE DEPTH AND BREADTH OF IT, HAVE YOU SEEN ANY RESOURCES THAT SURFACE HOW DATA SCIENCE IS A CONCEPT THAT IS BEING EMPLOYED ACROSS DISCIPLINES? IN OTHER WORDS, ANY TYPE OF REPORT, I'M MAKING THIS UP. >> AARON SANT-MILLER: YEAH. >> A REPORT SAYING IN SOCIOLOGY IT IS HAPPENING THIS WAY, IN HISTORY IT IS HAPPENING THIS WAY, IN COMPUTER SCIENCE IT IS HAPPENING THIS WAY, IN BIOINFORMATICS THIS WAY, IN LAW THIS WAY, AND IN JOURNALISM THIS WAY. DOES ANYTHING LIKE THAT COME TO MIND?
I'M ASKING, ONE, BECAUSE I'M CURIOUS AND TWO, WE HAVE A DIVERSE WORKFORCE AND HAVE COME A LONG WAY EMBRACING THE OPPORTUNITY WE HAVE AT HAND. THERE IS NO QUESTION ABOUT THAT. SEEING A REPORT LIKE THAT, IF IT EXISTS, A RAND OR BROOKINGS STUDY OR SOMETHING LIKE THAT THAT COULD BE PROVIDED OR KNOWN, PEOPLE WHO MAY HAVE DONE FOR EXAMPLE AN UNDERGRADUATE DEGREE IN SOCIOLOGY MIGHT BE LIKE OH, THAT IS HOW THEY ARE DOING IT, OR IN LAW, THAT IS HOW THEY ARE DOING IT. JUST ASKING THE QUESTION. OVER ALL THIS HAS BEEN EXTREMELY HELPFUL. I THANK YOU FOR WHAT YOU DO. >> AARON SANT-MILLER: YEAH. GOOD QUESTION, HARD QUESTION. DATA SCIENCE AS A CONCEPT AND FIELD IS STILL EVOLVING. IF WE TAKE THE DEFINITION OF DATA SCIENCE FROM 10 YEARS AGO, HOW WE DREW THAT MAP THEN, HOW WE ARE DRAWING IT NOW, AND HOW WE WILL 10 YEARS FROM NOW WILL BE DIFFERENT. GARTNER REVIEWED SOMETHING FROM A BUSINESS INDUSTRY PERSPECTIVE: HOW IS DATA SCIENCE BEING USED, AND TERMINOLOGY AND MACHINE LEARNING BEING USED, WHAT ARE THE TRENDS AND HOW IT IS BEING DEFINED. >> CAN YOU SPELL THAT NAME? >> AARON SANT-MILLER: GARTNER, LIKE THE COMPANY. THERE WILL BE PLACES LIKE THAT WITH MORE OF AN INDUSTRY FOCUS. YOU WILL HAVE A TOTALLY DIFFERENT PERSPECTIVE IF YOU GO INTO ACADEMIC RESEARCH IN THE FIELDS ON HOW DATA SCIENCE IS APPLIED. THAT IS BOTH WHAT MAKES UNDERSTANDING AND DEFINING DATA SCIENCE SO HARD, THAT EVERYONE HAS A DIFFERENT PERSPECTIVE AND WAY OF DEFINING WHAT IT IS, AND ALSO WHAT MAKES IT SO INTERESTING BECAUSE AS A FIELD YOU HAVE -- I WILL GO INTO A MEETING WHERE YOU PLAY WORD BINGO WITH THE BUZZ WORDS PEOPLE ARE USING AND HOW THEY DESCRIBE DATA SCIENCE, AND IN ANOTHER MEETING PEOPLE SPEND 100% OF THEIR TIME DOING ACADEMIC RESEARCH AND HOW THEY DEFINE DATA SCIENCE IS DIFFERENT. THERE ARE GENERAL SOURCES LIKE GARTNER THAT HELP YOU GET STRUCTURE TO IT. I FOUND THIS IS LIKE I WAS HAVING A REALLY HARD TIME KEEPING UP WITH HOW FAST THE FIELD WAS EVOLVING. THIS IS WHEN I TRIED TO STAY ON TOP OF EVERY POSSIBLE THING, WHICH IS IMPOSSIBLE. 
WARNING, IF YOU FIGURE OUT HOW TO DO IT GIVE ME YOUR TIME TURNER. I STARTED TO LISTEN TO PODCASTS ACTUALLY. LOTS OF PODCASTS GENERALIZE AND THINK AND TALK ABOUT WHAT DATA SCIENCE IS AND HOW IT IS USED DIFFERENTLY. THERE ARE DIFFERENT ONES, LIKE THE O'REILLY DATA SHOW, WHICH I LISTEN TO A LOT. LOTS OF SOURCES OUT THERE ARE GROWING WHERE PEOPLE TALK ABOUT THESE THINGS. I DON'T KNOW IF I CAN GO TO A 1-STOP SHOP SOURCE FOR IT. IF YOU GO TO OR BIG GENERALIZABLE ONE PRESENT INDUSTRY FOCUS HELP YOU GET SEARCH FOR OTHER MATERIALS IT WILL GIVE YOU A BETTER PERSPECTIVE. YOU'RE WELCOME. YEAH? >> SO I JUST KIND OF HAD A COMMENT AND THEN A QUESTION MAYBE. YOU WERE TALKING ABOUT HOW DATA SCIENCE IS DIFFERENT DEPENDING ON DIFFERENT GROUPS. I WAS THINKING ABOUT THE FACT THAT -- SO IN A HACKATHON YOU ACTUALLY HAVE A LOT OF DIFFERENT PEOPLE FROM DIFFERENT GROUPS, BIOLOGICAL RESEARCHERS AND COMPUTER SCIENTISTS AND TRADITIONAL DATA SCIENTISTS, WHO GET IN A ROOM TOGETHER AND CREATE THINGS TOGETHER AND WORK ON DATA SETS THAT ARE BOTH INTERNAL AND EXTERNAL. I'M CURIOUS TO KNOW IF YOU HAVE ANY INPUT ON WHETHER YOU CONSIDER THAT TO BE A USEFUL TOOL, AND IF THERE ARE ANY OTHER SUGGESTIONS ON CONTINUING TO MAKE THAT USEFUL FOR HELPING US TO DEFINE DATA SCIENCE, TO ELUCIDATE INTERESTING IDEAS ON HOW WE USE DATA SCIENCE AND THINGS LIKE THAT. >> AARON SANT-MILLER: USING HACKATHONS TO LEARN MORE ABOUT THE FIELD MORE BROADLY. HONEST KUDOS TO YOU, FIRST TIME I HEARD ANYONE LINK THAT AS A WAY TO GET DATA SCIENCE. AWESOME. GOOD WAY TO DO IT. IF YOU HAD A GROUP THAT RAN HACKATHONS IN A WIDE VARIETY OF FIELDS YOU WOULD SEE HOW DIFFERENT PEOPLE BRING PRACTICES TO DATA SCIENCE DIFFERENTLY. PEOPLE THINK ABOUT AND ANALYZE DATA SCIENCE DIFFERENTLY. YOU WILL SEE IN THOSE SITUATIONS THAT ARE COMBUSTOR ECOSYSTEMS: WE HAVE 48 HOURS TO FIND SOMETHING VALUABLE IN THIS DATA SET. 
WHAT YOU SEE THERE IS A VERY RAPID, YOU KNOW, DUMP OF PREVIOUS EXPERTISE AND HOW PEOPLE THINK ABOUT DATA SCIENCE. ASKING PEOPLE WHAT DO YOU THINK DATA SCIENCE IS AT THE START AND END OF ONE OF THOSE HACKATHONS WOULD BE A REALLY INTERESTING WAY TO SEE HOW PEOPLE'S DEFINITION OF DATA SCIENCE CHANGES OVER TIME. DOES THAT ANSWER YOUR QUESTION? >> GREAT QUESTION. I WILL ASK THEM ABOUT IT AND SEE WHAT THEY SAY AT THE END. THANK YOU. >> AARON SANT-MILLER: I LOVE HACKATHONS. I DO LOTS IN CYBER SECURITY. WE ARE TRYING TO GET GOING AN AI VS. HACKER THING. HACKERS ARE TRYING TO GET INTO A SYSTEM AND AI DEFENDS IT. AI GETS SMARTER, LEARNS WHAT HACKERS ARE DOING, AND THE FLIPSIDE OF THAT. I'M A HUGE ADVOCATE FOR HACKATHONS. FOR LOTS OF FOLKS IT IS AN EASY WAY TO GET INVOLVED IN DATA SCIENCE AND BRING WHATEVER EXPERTISE YOU HAVE. LOTS ARE DESIGNED FOR PEOPLE LIKE THAT: YOU MAY NOT BE THE WIZARD PROGRAMMER BUT YOU ARE INTERESTED AND WANT TO SIT WITH SOMEBODY TACKLING A PROBLEM FOR A DAY. THINGS LIKE THAT ARE A REALLY GOOD WAY TO GET STARTED. THANK YOU FOR BRINGING THAT UP. >> ANOTHER QUESTION ONLINE IF WE HAVE TIME. >> AARON SANT-MILLER: OKAY. >> THIS PERSON IS ASKING ABOUT ACCOUNTING FOR CHANGES IN LANGUAGE OVER TIME IN TEXT MINING. SO CERTAIN TERMS CHANGE OVER TIME AND SOME GO OUT OF USAGE AND SOME APPEAR. HOW IS IT DEALT WITH IN TEXT MINING? >> AARON SANT-MILLER: GOOD QUESTION. HUGE AREA OF RESEARCH IN MACHINE LEARNING AND DATA MINING AND DATA SCIENCE MORE BROADLY. AS A BROAD FIELD IT IS LIFE-LONG LEARNING. LET'S SAY YOU BUILD AN ENGINE YOU WANT TO LEARN ABOUT TEXT OVER TIME AND CONTINUE TO LEARN. HOW DO YOU BUILD THAT ENGINE IN A WAY THAT IT WON'T CARRY BIAS OR WRONG DEFINITIONS TO A NEW CONTEXT? IN TEXT MINING, LOTS OF TIMES YOU WILL SEE THAT ENCODED IN WINDOWS OF TIME. YOU WILL HAVE A SEPARATE MODEL LEARN HOW A SPECIFIC WORD IS USED AT ONE POINT IN TIME VERSUS A DIFFERENT POINT IN TIME. 
DIFFERENT CONTEXT, AND LOTS OF WAYS TO DO THAT: SEPARATE MODELS OR SMOOTHING MODELS. TRAIN 10 DIFFERENT MODELS OVER A 100-YEAR PERIOD, ONE FOR EVERY 10 YEARS, AND SMOOTH OUT THE MODELS, GRADING AND UNDERSTANDING HOW EACH MODEL UNDERSTANDS USAGE. COMPLICATED PROBLEM. YOU SEE LOTS OF REALLY SCALED DEPLOYMENTS OF MACHINE LEARNING IN FIELDS THAT ARE NOT EVOLVING AND CHANGING AND ARE MORE STATIC, AND THE FIELDS WE WANT TO APPLY MACHINE LEARNING TO CONTINUE TO EVOLVE AND CHANGE. A MODEL OR METHOD THAT DOES WELL AT ONE POINT IN TIME MAY NOT PERFORM JUST AS WELL AT A LATER POINT IN TIME, AND APPLICATION OF THE SAME METHOD MAY NOT BE APPROPRIATE ANYMORE. LOTS OF RESEARCH IS GOING ON THERE RIGHT NOW. OKAY. >> ALL RIGHT. I WANT TO GIVE A HAND TO AARON FOR HIS TIME. >> AUDIENCE: [APPLAUSE]. >> AND A THANK YOU TO EVERYONE FOR SPENDING YOUR AFTERNOON WITH US. I HOPE YOU HAVE GOOD TAKEAWAYS FROM THIS. AS ALWAYS, IF YOU HAVE ANY QUESTIONS, WE ARE HERE FOR YOU. IF YOU HAVE QUESTIONS ABOUT THE COURSES AND ITPS, PLEASE CONTACT US. WE ARE HERE AND READY TO HELP YOU AND WANT YOU TO SUCCEED IN THIS AND DEDICATE YOUR TIME TO THE DATA SCIENCE SUMMER OF TRAINING. SO THANK YOU AGAIN.
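[EDITOR'S NOTE: THE ANSWER ON TEXT MINING DESCRIBES LEARNING HOW A WORD IS USED SEPARATELY FOR EACH WINDOW OF TIME. THE PYTHON BELOW IS A MINIMAL SKETCH OF THAT IDEA, NOT THE SPEAKER'S METHOD. IT ASSUMES A VERY SIMPLE "MODEL" (COUNTS OF THE WORDS THAT APPEAR NEAR A TARGET WORD) AND COMPARES WINDOWS WITH COSINE SIMILARITY. THE FUNCTION NAMES, THE 10-YEAR SPAN, AND THE TOY CORPUS ARE ALL HYPOTHETICAL, AND THE SMOOTHING STEP MENTIONED IN THE ANSWER IS NOT SHOWN.]

```python
from collections import Counter
from math import sqrt


def context_profile(docs, target, window=2):
    """Count the words that appear within `window` tokens of `target`."""
    profile = Counter()
    for doc in docs:
        tokens = doc.lower().split()
        for i, tok in enumerate(tokens):
            if tok != target:
                continue
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    profile[tokens[j]] += 1
    return profile


def cosine(a, b):
    """Cosine similarity between two count profiles (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def profiles_by_window(dated_docs, target, span=10):
    """Build one separate profile per time window of `span` years."""
    buckets = {}
    for year, text in dated_docs:
        start = (year // span) * span  # e.g. 1988 falls in the 1980 window
        buckets.setdefault(start, []).append(text)
    return {start: context_profile(texts, target)
            for start, texts in sorted(buckets.items())}


# Hypothetical toy corpus: (year, text) pairs, invented for illustration.
corpus = [
    (1985, "the mouse ran across the barn floor"),
    (1988, "a mouse ate the cheese in the barn"),
    (2015, "click the mouse button to open the file"),
    (2018, "the mouse cursor moved across the screen"),
]

profiles = profiles_by_window(corpus, "mouse")
# A drift near 0 means similar usage in both windows; near 1 means different.
drift = 1.0 - cosine(profiles[1980], profiles[2010])
print(sorted(profiles), round(drift, 3))
```

[IN PRACTICE THE PER-WINDOW MODEL WOULD USUALLY BE SOMETHING RICHER THAN RAW COUNTS, BUT THE SHAPE IS THE SAME: ONE MODEL PER WINDOW, THEN A COMPARISON ACROSS WINDOWS.]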