>> GOOD MORNING, LADIES AND GENTLEMEN. WELCOME TO THE NIH CONFERENCE ON NATURAL LANGUAGE PROCESSING. WE'LL TELL YOU A LITTLE BIT MORE ABOUT IT LATER, ALTHOUGH YOUR PROGRAM HAS MOST OF IT. IT'S A PLEASURE FOR ME TO INTRODUCE DR. DONALD LINDBERG, WITH THE NATIONAL LIBRARY OF MEDICINE. >> I'M NOT SURE HOW TO CARRY THAT OUT, BUT I ALWAYS AT LEAST TRY TO DO WHAT DR. CORN SAYS, AND I DO WANT TO THANK HIM FOR ORGANIZING THIS SESSION. I ALSO THANK THE CO-SPONSOR. THE ORGANIZING BODY REPRESENTS A LARGE NUMBER OF THE NIH INSTITUTES, AND WE APPRECIATE THEIR HELP, SUPPORT AND ADVICE. THE TIME ALLOCATED WAY EXCEEDS WHAT IT WILL TAKE TO TELL YOU ALL WHAT I UNDERSTAND ABOUT NATURAL LANGUAGE UNDERSTANDING, BUT IT FALLS SHORT OF THE TIME REQUIRED TO TELL YOU HOW MUCH I ADMIRE THE FIELD. SO ANYWAY, I'M GLAD TO BE HERE, AND I THINK IT IS REALLY PROBABLY A GOOD TIME TO HAVE THIS MEETING. I HAVE TO CONFESS, I NEVER HAVE UNDERSTOOD WHY PEOPLE SAY NATURAL LANGUAGE PROCESSING. THE GROUP HERE WAS REALLY STARTED BY ALEXA MCCRAY, WHO I THINK DID A GOOD JOB WITH IT, AND I WAS ABLE TO SUPPORT THAT, BUT YOU SHOULD TALK ABOUT NATURAL LANGUAGE UNDERSTANDING, AND I'LL TELL YOU WHY SHORTLY. THE FIELD IN ANY EVENT HAS DEVELOPED A LOT OF GOOD TOOLS AND HAS IMPROVED ITS STRATEGY. 
I THINK WE ALL REMEMBER THE GOOD OLD DAYS WHEN PEOPLE RAN AROUND OFFERING A LARGE SUM OF MONEY TO TRANSLATE ON THE FLY CHINESE AND ENGLISH INTO RUSSIAN, AND RUSSIAN INTO ANYTHING YOU WANTED. IT WAS A BIG BUST, BUT THE STRATEGY FOR SUBSEQUENT SUCCESS IN TRANSLATION BETWEEN ACTUAL LANGUAGES WAS CLOSE COUPLING WITH THE APPLICATION AREAS, SO THE PROGRAMS ARE ACTUALLY A BIG HELP TO COMMERCIAL PEOPLE WHO DO TRANSLATIONS. AND EVEN AT NLM, I DON'T THINK YOU HAVE ANY PAPERS ON OUR AUTOMATED INDEXING PROJECT, BUT THAT WAS SERIOUSLY UNDER WAY WHEN I CAME HERE IN 84, SO IT'S WAY OVER 25 YEARS IN DURATION, AND ONLY NOW, AFTER 3 OR 4 YEARS OF CLOSE COUPLING WITH THE LIBRARY OPERATIONS PEOPLE, IS IT POSSIBLE TO SAY THAT THE PROGRAMS ARE ACTUALLY INDEXING THE LITERATURE. IT'S SECOND GUESSED BY THE STAFF, BUT IT'S ACTUALLY DOING THE INDEXING. IT TOOK A LONG WHILE TO GET THERE, BUT THE STRATEGY WAS CLOSE COUPLING RATHER THAN THEORIZING. SO WHY DO I THINK THAT YOU SHOULD TALK ABOUT NATURAL LANGUAGE UNDERSTANDING? I THINK REALLY IT COMES DOWN TO UNDERSTANDING THE WORK WE DO. YOU REMEMBER, ALEXANDER POPE SAID THE BEST STUDY OF MANKIND IS MAN, BY WHICH HE MEANT, TO TRY TO UNDERSTAND US, AND I SHOULD THINK THAT THAT'S THE PURPOSE OF DOING THIS WORK. IT CERTAINLY WAS BEHIND ARTIFICIAL INTELLIGENCE, WHICH WAS A VERY HAPPY PARTNER WITH NATURAL LANGUAGE UNDERSTANDING FOR MANY DECADES. THE PURPOSE WASN'T TO THROW A SWITCH, THE PURPOSE WAS TO MODEL HUMAN MENTATION, AN OBJECTIVE OF WHICH WE FALL SHORT NOW, BUT WE DON'T HAVE TO GIVE UP. I THINK WE STILL SHOULD BE DOING THAT. OF COURSE, I IMAGINE WE'RE ALL THINKING ABOUT 2 THINGS: 1 IS THE POSSIBLE APPLICATION OF YOUR TECHNIQUES TO THE COMPUTERIZED RECORD, AND SECONDLY THE UNSPOKEN ELEPHANT IN THE ROOM, IBM WITH WATSON. 
BUT WITH RESPECT TO THE FIRST, WE ALL KNOW A HUGE TREASURE OF MEDICAL KNOWLEDGE IS HIDDEN IN THE PATIENT RECORD, BE IT HANDWRITTEN OR TYPED OR SCANNED OR WHATEVER, AND MUCH OF THAT IS IN THE NATURAL LANGUAGE USED BY THE PEOPLE WHO WRITE THE NOTES, SO THAT'S CERTAINLY A WORTHY GOAL. I MEAN, THERE'S GOLD IN THEM THERE HILLS FOR CERTAIN, AND I GUESS YOU'LL GET IT. NOW IF YOU STEP BACK A LITTLE BIT AND REMEMBER THE EARLY PERIOD, IN THE 60S, FIRST WE GOT MACHINE-READABLE LAB DATA AND MACHINE-READABLE X-RAYS, AND THAT'S A GOOD THING. I REMEMBER IN THOSE EARLY YEARS, RADIOLOGISTS WOULD SAY PROUDLY, IF YOU GUYS EVER GET A COMPUTERIZED PATIENT RECORD, 96% OF IT WILL BE X-RAYS, AND BIT BY BIT THAT WAS PROBABLY TRUE. YOU COULD SAY 96% OF IT WILL BE THE GENOME, BUT THAT IS YET TO BE FULLY UNDERSTOOD ALSO, A LITTLE PROJECT FOR YOU ALL IF YOU LIVE LONG ENOUGH. THEN OF COURSE WE TRIED TO CON DOCTORS INTO TYPING THE NOTES OF A RECORD, AND THEY'RE NOT VERY GOOD TYPISTS, BUT THEY FINALLY SORT OF YIELDED AND ARE DOING QUITE A BIT OF THAT WITH POINT AND CLICK, BUT GETTING IT IN MACHINE-READABLE FORM. NOW OF COURSE, YOU CAN DICTATE A GOOD BIT OF WHAT YOU WANT TO SAY. SO WE ARE SORT OF ZEROING IN ON THE PROBLEM AND DELIMITING IT, AND IT MAKES OUR FIELD EVER MORE IMPORTANT. BUT THE TOTAL UNDERSTANDING, SO TO SAY, OF MAN THROUGH THIS RECORD WILL RECEDE, I THINK, TOWARD THE HORIZON FOR QUITE A LONG WHILE. ONE OF THE LITTLE PROJECTS WE'RE DOING HERE AT THE LIBRARY, AND I'LL FINISH WITH THIS, IS IN THE HISTORY OF MEDICINE, WITH THE PAPERS OF MICHAEL DEBAKEY, A GIANT OF A MAN AND GREAT SURGEON, AN INTERESTING PERSON IN MANY RESPECTS. WHEN SURGEONS FINISH SURGERY, THEY SIGN A PAPER AND SAY WHAT THEY DID AND SO FORTH, BUT MIKE WOULD APPEND A MASTERFUL DRAWING OF HOW HE ANASTOMOSED THE VESSELS. I MEAN, THEY'RE BEAUTIFULLY, BEAUTIFULLY DONE. 
SO FOR HIM, THE WORDS WERE NOT ENOUGH, THE X-RAYS WERE NOT ENOUGH. HIS DRAWINGS DESCRIBED WHAT HE ACTUALLY DID SURGICALLY. AND OF COURSE, SURGEONS ALWAYS SAID THAT HIS DISTAL ANASTOMOSIS NEVER BROKE, AND I'VE HAD PEOPLE 45 YEARS AFTER AN AORTIC ANEURYSM STILL BEING HIS PATIENT. JUST ANOTHER WORD ABOUT THE ELEPHANT IN THE ROOM, NAMELY IBM. I SEE DR. GUNDECK, AND ALUMNI DR. FUSHI, WILL BE IN THE ROOM, AND I'M DELIGHTED THAT'S THE CASE. WE ARE INITIATING A PROGRAM WITH THEM, AND MY GOAL THERE IS TO BE ABLE TO DEAL WITH THE QUESTIONS THAT PATIENTS, FAMILIES AND THE PUBLIC DIRECT TO NLM, WHICH ARE IN THE TENS OF THOUSANDS, NOT TRIVIAL. AS I LOOK AT THEM MUCH MORE CAREFULLY, THEY'RE HARD TO HANDLE FOR A LOT OF REASONS THAT YOU CAN GUESS PROBABLY AS WELL AS I, BUT I'M HOPING THAT IN A YEAR OR SO WE WOULD BE ABLE TO ANNOUNCE TO YOU AT LEAST A LITTLE BIT OF PROGRESS IN THAT VENTURE ALONG WITH IBM. SO IN THE MEANTIME I WISH YOU A HAPPY DAY. I'LL BE IN AND OUT AS MUCH AS I CAN, I WOULD LIKE TO SPEND THE WHOLE DAY, AND I KNOW YOU ARE IN GOOD HANDS WITH DR. CORN AND DR. FRIEDMAN. HAVE A GOOD TIME. [ APPLAUSE ] >> THANKS, DON. I'M MILTON CORN, DEPUTY DIRECTOR OF RESEARCH AND PUBLICATION FOR THE NATIONAL LIBRARY. I HAVE A FEW COMMENTS ABOUT THIS MEETING AND HOW IT CAME ABOUT, AND ABOUT SOME OTHERWISE MAYBE PUZZLING THINGS ON THE AGENDA. AS I THINK IS CLEAR TO EVERYONE WHO LOOKED AT THIS, THIS IS A JOINT ENTERPRISE OF THE NATIONAL LIBRARY OF MEDICINE AND THE NATIONAL INSTITUTE OF BIOMEDICAL IMAGING AND BIOENGINEERING HERE, AND THERE ARE 2 DAYS OF THESE MEETINGS. 
THIS FIRST DAY HERE HAS BEEN PRIMARILY THE WORK OF THE NATIONAL LIBRARY OF MEDICINE, AND THE FOCUS FOR US HAS BEEN ESPECIALLY ON NATURAL LANGUAGE PROCESSING, NATURAL LANGUAGE UNDERSTANDING, ITSELF. WE'RE INTERESTED IN HOW WE'RE DOING IT NOW, WHAT ARE THE METHODS WE'RE USING, CAN WE IMPROVE THEM, AND WE WOULD HOPE, BECAUSE WE'RE ETERNALLY OPTIMISTIC THAT FEDERAL FUNDS FOR SCIENCE WILL IMPROVE IN THE FUTURE, THAT WE MAY BE ABLE TO SUPPORT A RESEARCH AGENDA IN IMPROVING THE METHODS THAT WE HAVE AVAILABLE FOR DOING NLP. NLP, AS IS CLEAR TO EVERYBODY HERE, IS NOT JUST ANOTHER LITTLE TOOL. IT IS SO CENTRAL TO ACTUALLY MAKING COMPUTERS TRULY VALUABLE TO THE EFFORTS OF HUMANS IN BIOLOGY, MEDICINE AND A HUGE NUMBER OF OTHER FIELDS THAT IT HAS A CENTRAL PLACE AND DESERVES AN ENORMOUS AMOUNT OF ATTENTION. THE SECOND DAY, TOMORROW, IS ALSO HERE IN THIS AUDITORIUM. IT'S BEEN PRIMARILY ORGANIZED BY NIBIB, AND IT FOCUSES ON THE APPLICATION OF NLP. THIS IS AN ENORMOUS FIELD IN ITS OWN RIGHT, AND I THINK THAT MOST OF IT WILL BE USED AT LEAST FOR ILLUSTRATIVE PURPOSES FOR THE USE OF NLP FOR CLINICAL DECISION SUPPORT, CERTAINLY A VERY HOT FIELD, PARTICULARLY BECAUSE OF THE ELECTRONIC MEDICAL RECORD WORK BEING DONE IN THE UNITED STATES. YOU WILL NOTICE THERE IS A SECOND AGENDA FOR TUESDAY. THE PRINCIPAL AGENDA IS THE 1 ON CLINICAL DECISION SUPPORT, AND IT WILL BE HERE AT THE NATIONAL LIBRARY. WE'RE ALSO INTERESTED, AS I'VE SAID, IN TRYING TO AT LEAST SEE WHAT THE COMMUNITY OF SPECIALISTS IN THE FIELD THINK MIGHT BE DIRECTIONS THAT WE SHOULD GO FROM THE RESEARCH POINT OF VIEW IN IMPROVING NLP, AND FOR THIS REASON WE HAVE A PROBABLY MUCH SMALLER GROUP MEETING IN NATCHER, WHICH IS A BUILDING JUST A FEW MINUTES' WALK AWAY FROM HERE, AND THAT'S MEETING TOMORROW MORNING AS WELL. 
THAT MEETING WILL HAVE NO PRESENTATIONS. IT'S PRIMARILY FOR PEOPLE WHO THINK OF THEMSELVES AS MORE OR LESS PROFESSIONAL NLP PEOPLE, AND IT WILL CONSIST OF BREAKOUT GROUPS, AND IT WILL BE A SERIES OF DISCUSSIONS TO GET SOME IDEA OF FUTURE DIRECTIONS FOR THE FIELD. ANY OF YOU WHO REALLY HAVE YOUR HANDS WET WITH NLP AND HAVE SOME IDEAS ABOUT WHERE THE FIELD SHOULD BE GOING ARE WELCOME TO COME TO THIS. IT'S NOT BY INVITATION, AND WE'D BE DELIGHTED TO HAVE YOU, BUT AS I SAY, THIS IS NOT GOING TO BE A SERIES OF PANELS OR PRESENTATIONS TOMORROW OVER AT NATCHER. IT'LL BE JUST FOR PEOPLE REALLY INTERESTED IN NLP TO TALK TO EACH OTHER AND MAYBE PROVIDE SOME SUPPORT AND HELP TO OUR OWN EFFORTS HERE IN FURTHERING THE FIELD. SO YOU MAKE YOUR CHOICE. I ANTICIPATE, AND NIBIB ANTICIPATES, THAT THE MAJORITY OF THE PEOPLE HERE WILL BE MORE INTERESTED IN THE APPLICATION AND THE DECISION SUPPORT PORTION OF IT AND WILL BE HERE, BUT AS I SAY, ANY PEOPLE WHO REALLY WALK THE WALK AND WOULD LIKE TO MAKE A CONTRIBUTION TO THE RESEARCH ASPECTS OF NLP, PLEASE DO TAKE A LOOK AT THAT OTHER AGENDA AND COME ON OVER AND JOIN US AT NATCHER. I DO WANT TO SAY THAT ALL OF THE PRESENTATIONS WILL BE AVAILABLE, PROBABLY ON THE WEB SITE, BUT WE HAVE A MAILING ADDRESS FOR ALL OF YOU WHO REGISTERED AND WILL LET YOU KNOW EXACTLY WHERE THESE WILL BE. THESE PROCEEDINGS HERE TODAY, AND FOR NIBIB TOMORROW, ARE BEING WEBCAST, AND THEY'RE AVAILABLE REALLY TO ANYBODY ON THE NIH CAMPUS, AND IN FACT ANYBODY ANYWHERE WHO OWNS A COMPUTER AND HAS ACCESS TO THE INTERNET CAN CHECK IN ON THE NIH VIDEOCAST URL. IT'LL BE ARCHIVED, AND FOR THOSE OF YOU WHO WILL HAVE SUCH A FANTASTIC, WONDERFUL EXPERIENCE THAT YOU WANT TO DO IT OVER AND OVER AGAIN, YOU CAN ALWAYS CHECK IN AND SEE IT AS MANY TIMES AS YOU LIKE. IT MAY COME OUT DIFFERENTLY AT OTHER TIMES, 1 NEVER KNOWS. WE WERE VERY FORTUNATE, I THINK, TO HAVE AS CO-CHAIRS FOR TODAY'S MEETING CAROL FRIEDMAN AND TOM RINDFLESCH, AND TOM IS AN NLM MEMBER. 
I'M DELIGHTED TO SAY HE'S BEEN HERE SINCE 1991. HE'S A COMPUTATIONAL LINGUIST, GOT HIS DEGREE FROM MINNESOTA, I THINK THAT'S TRUE? AND I DID NOT KNOW THAT HE HAS A B.A. IN ARABIC, AND SOME OF HIS CONVERSATIONS DURING THIS MEETING MIGHT BE IN ARABIC, BUT WE HAVE TRANSLATORS AVAILABLE IF TOM WANTS TO DO THAT. HE HAS HAD A DISTINGUISHED CAREER IN NATURAL LANGUAGE PROCESSING, HE PRESENTS OFTEN AND WRITES WIDELY, AND HIS NAME IS KNOWN TO MOST OF YOU. THE SAME IS TRUE OF CAROL FRIEDMAN. SHE'S 1 OF THE DISTINGUISHED INDIVIDUALS IN NATURAL LANGUAGE PROCESSING. SHE'S A COMPUTER SCIENTIST, A PROFESSOR AT COLUMBIA UNIVERSITY AND A MEMBER OF THE DEPARTMENT OF BIOINFORMATICS THERE. I THINK A NUMBER OF THE THINGS SHE'S PRODUCED ARE USED ALL OVER THE WORLD, AND SHE'S BEEN PARTICULARLY CREATIVE AND USEFUL WITH MEDLEE, FOR WHICH SHE IS KNOWN, AND HER WORK WITH MEDICAL TEXTS, BUT SHE'S BEEN A HUGE CONTRIBUTOR IN A NUMBER OF OTHER FIELDS. SHE WILL BE DELIVERING THE KEYNOTE HERE, AND IT'S A GREAT PLEASURE FOR ME TO INTRODUCE HER AND ASK HER TO COME TO THE MICROPHONE NOW. [ APPLAUSE ] >> THANK YOU VERY MUCH, DR. CORN, FOR THAT NICE INTRODUCTION AND ALSO FOR HELPING SUPPORT THIS WORKSHOP AND FOR ORGANIZING IT, AND I'D ALSO LIKE TO THANK TOM, WHO DID MOST OF THE WORK FOR THIS DAY, SO HE SHOULD GET THE CREDIT FOR THE ORGANIZATION. I WOULD ALSO LIKE TO THANK DR. LINDBERG FOR ALL OF US NATURAL LANGUAGE PROCESSING RESEARCHERS. I REMEMBER THAT SOME YEARS AGO MOST PEOPLE IN THE FIELD WERE SAYING THAT NATURAL LANGUAGE WAS NOT FEASIBLE, NOT FEASIBLE FOR CLINICAL PURPOSES, AND DR. LINDBERG HAD THE VISION TO SEE THAT IT WAS AND SUPPORTED THE FIELD IN MANY WAYS. I SEE NOW HE STILL HAS THE VISION, BECAUSE WE'RE TALKING ABOUT NATURAL LANGUAGE PROCESSING AND HE'S TALKING ABOUT THE NEXT STEP, WHICH IS UNDERSTANDING. SO I THINK IF WE GET THE PROCESSING RIGHT, THEN MAYBE THE NEXT STEP WOULD BE THE UNDERSTANDING, AND HE'S ALWAYS LOOKING AHEAD. THE FIELD REALLY TOOK OFF. 
IT'S BEEN A VERY EXCITING TIME IN NATURAL LANGUAGE PROCESSING, AND IT'S A GREAT POSITION TO BE IN NOW IN THIS RESEARCH AREA. THIS CHART IS ROUGH, BUT THE MAGNITUDES ARE DEFINITELY RIGHT: YOU SEE THAT IN THE EARLY 1970S IT JUST STARTED IN THE CLINICAL DOMAIN, IN THE 1980S THERE STARTED TO BE A FEW MORE PAPERS, AND THEN SOMEWHERE IN THE MID-1990S THE AREA SHOT UP, AND I THINK NOW IT'S GOING TO JUST KEEP RISING, BECAUSE PEOPLE REALIZE THAT FOR MEANINGFUL USE, FOR THE HEALTH I.T. STIMULUS, THE ONLY WAY WE GET THERE IS WITH NATURAL LANGUAGE. AND OF COURSE THE IBM CHALLENGE CERTAINLY PROPELLED THE FIELD AND SHOWED WHAT COULD BE POSSIBLE. SO THE GOAL OF THE WORKSHOP, AS DR. CORN SAID, IS TO GO OVER PAST ACHIEVEMENTS, SEE WHERE WE ARE NOW AND WHAT SOME OF THE REMAINING CHALLENGES ARE, AND RECOMMEND FUTURE DIRECTIONS. THIS IS A WONDERFUL OPPORTUNITY FOR US IN THE FIELD OF NATURAL LANGUAGE TO BE ABLE TO DO THAT, TO SORT OF STEP BACK AND BE INTROSPECTIVE ABOUT THE FIELD. SO HERE I'M SHOWING ASPECTS OF NATURAL LANGUAGE PROCESSING, AND YOU SEE THAT IT'S NOT UNDERSTANDING YET. THESE ARE ALL THE ELEMENTS THAT I BELIEVE ARE NECESSARY FOR THE FIELD. ON THE LEFT ARE SORT OF THE PROGRAMMING ENGINES THAT ARE NEEDED, AND THEY'RE VERY IMPORTANT, AND ON THE RIGHT ARE THE KNOWLEDGE AND MATERIALS THAT GO INTO IT, AND I BELIEVE THAT THE RESEARCH HAS TO HIT ALL THESE ASPECTS. I'M GOING TO TALK ABOUT SOME OF THESE. I'M NOT GOING TO TALK ABOUT THE METHODS SO MUCH, BECAUSE WE HAVE 2 SPEAKERS FOLLOWING WHO ARE GOING TO TALK ABOUT METHODS, SO I'M GOING TO FOCUS ON ALL THE OTHER ASPECTS. 
SO IN THE CLINICAL AREA, FOR PATIENT CARE, THERE IS DECISION SUPPORT. MANY OF US WON'T BE AT THE DECISION SUPPORT MEETING TOMORROW, BUT NATURAL LANGUAGE PROCESSING COULD CERTAINLY HELP CLINICAL DECISION SUPPORT, AND I THINK THAT'S A VERY IMPORTANT AREA. THERE ARE MANY OTHER AREAS. HEALTH INFORMATION EXCHANGE AND MEANINGFUL USE WILL REALLY BE THE IMPETUS FOR NATURAL LANGUAGE. IT CAN BE USED TO REDUCE ERRORS, AND NEW EXAMPLES SHOW THAT IT CAN IMPROVE DOCUMENTATION, WHICH IS VERY IMPORTANT NOT ONLY FOR PATIENT CARE. IT'S BECOMING MORE AND MORE IMPORTANT FOR SECONDARY USE. SO WE SEE THAT IT'S IMPORTANT TO BE ABLE TO IDENTIFY PATIENTS FOR CLINICAL TRIAL RECRUITMENT, AND THE LATEST IS A BIG EFFORT TO USE IT FOR GWAS STUDIES. AND SO A VERY IMPORTANT AREA IS KNOWLEDGE ACQUISITION AND DISCOVERY, AND THAT'S TAKING THOSE NOTES, GETTING THE DATA OUT OF THE NOTES AND COMBINING IT TO DISCOVER NEW THINGS. FOR INSTANCE, MY RESEARCH IS TRYING TO DISCOVER NOVEL ADVERSE DRUG EVENTS FROM THE CLINICAL NOTES. SUMMARIZATION IS VERY IMPORTANT, TRANSLATION IS VERY IMPORTANT, AND TAILORING INFORMATION FOR CONSUMERS IS VERY CRITICAL. SO WITH THE PERSONAL HEALTH RECORD NOW, IT WOULD BE IMPORTANT FOR CONSUMERS TO BE ABLE TO FIND OUT INFORMATION ABOUT THEIR TREATMENTS, ABOUT THEIR HEALTHCARE, AND TO HAVE SOME AUTOMATED TOOL THAT LINKS THEM TO MEDLINEPLUS. AND HERE I SHOULD HAVE HAD QUESTION ANSWERING, WHICH IS MORE LIKE THE JEOPARDY CHALLENGE. THAT WOULD BE A NEXT STEP. SO IN THE BIOMEDICAL APPLICATIONS AREA, OF COURSE WE NEED TO IMPROVE THE ACCESS TO THE INFORMATION AND TEXT ON THE WEB. SO BASICALLY, WHAT WE'RE TRYING FOR NOW IS TO IMPROVE THE MANAGEMENT OF INFORMATION, TO CAPTURE IT AND TO REPRESENT IT IN A GOOD WAY. 
IT'S BEEN USED IN THE BIOMEDICAL AREA TO FACILITATE CURATION, ALSO FOR KNOWLEDGE DISCOVERY, AND IT IS VERY, VERY IMPORTANT TO INTEGRATE INFORMATION FROM MULTIPLE SOURCES AND DISCIPLINES. SINCE THE FUTURE OF MEDICINE IS REALLY MULTIDISCIPLINARY, NATURAL LANGUAGE PROCESSING CAN FACILITATE THAT BY PROCESSING ARTICLES FROM MANY DISCIPLINES AND FROM MANY DIFFERENT SOURCES. AND AGAIN, QUESTION ANSWERING IS CRITICAL, AND SUMMARIZATION, SO PEOPLE CAN QUICKLY READ WHAT THEY RETRIEVE AND HAVE IT SUMMARIZED IN A NICE WAY. I'M GOING TO TALK ABOUT SOME OF THE MILESTONES. I'M NOT GOING TO BE ABLE TO COVER EVERYTHING IN A SHORT TALK, BUT I WANT TO GIVE AN OVERVIEW OF HOW THE FIELD DEVELOPED. SO IN THE 1960S AND 70S, YOU STARTED TO SEE THE DEVELOPMENT OF CLINICAL NLP, AND IN THE 1970S AND 1980S NAOMI SAGER, IN NATURAL LANGUAGE PROCESSING, DEVELOPED THE COMPREHENSIVE LINGUISTIC STRING SYSTEM, AND SHE DEMONSTRATED THE FEASIBILITY OF AUTOMATICALLY STRUCTURING CLINICAL INFORMATION. IN THE EARLY 1990S THERE WERE DEMONSTRATIONS THAT NATURAL LANGUAGE PROCESSING COULD BE USED TO IMPROVE HEALTHCARE FOR REAL AND THAT IT WASN'T JUST A THEORETICAL EXERCISE, AND HAUG WITH SYMTEXT AND FRIEDMAN AND HRIPCSAK DEMONSTRATED THAT. IN THE EARLY 1990S THE WORK WAS ALSO LOOKING INTO TERMINOLOGY AND ONTOLOGY, AND SO CHUTE AND ELKIN DID A LOT OF WORK ON COMPOSITIONALITY OF TERMINOLOGIES AND LINKING IT TO ONTOLOGIES, AND THEN THERE WAS A GROUP IN EUROPE WHO WORKED ON MULTILINGUAL TRANSLATION, THAT WAS BAUD FROM GENEVA, AND GERMANY AND FRANCE, AND ALSO ONTOLOGY-DRIVEN SEMANTICS, WHICH WAS VERY IMPORTANT, OR ONTOLOGY-DRIVEN ANALYSIS. OTHER MILESTONES ARE ON THE RESOURCE SIDE. SO I THINK CÔTÉ, WITH THE DEVELOPMENT OF SNOMED, SHOWED YOU COULD STRUCTURE THE CONTENT OF MEDICAL LANGUAGE, AND THAT WAS A HUGE EFFORT, A VERY BIG STEP, AND THEN IN THE LATE 1980S DR. LINDBERG AND BETSY HUMPHREYS STARTED THE UMLS, AND IT'S BEEN A CRITICAL RESOURCE FOR BIOMEDICAL INFORMATICS. 
IT WASN'T INTENDED FOR NATURAL LANGUAGE PROCESSING, BUT IT'S BEEN A HUGE, HUGE, HUGE RESOURCE FOR NATURAL LANGUAGE PROCESSING AND REALLY HELPED PROPEL THE FIELD. ALSO IN THE 1990S ALEXA MCCRAY DEVELOPED THE SPECIALIST NLP SYSTEM, AND WITH THE WORK OF HER AND DR. BROWNE, THEY DEVELOPED A COMPREHENSIVE MEDICAL LEXICON WITH SYNTACTIC INFORMATION, VERY FINE GRAINED, AND IF ANY OF YOU HAVEN'T LOOKED AT IT, YOU SHOULD, IT'S QUITE AN ACCOMPLISHMENT. AND I SHOULD MENTION THAT HAVING PUBMED AVAILABLE TO NATURAL LANGUAGE PROCESSING RESEARCH HAS BEEN A HUGE BOOST, ALONG WITH THE MESH ANNOTATIONS. NOW ON THE GENOMIC SIDE, THE WORK STARTED LATER. I THINK WE FIRST STARTED TO SEE NATURAL LANGUAGE PROCESSING IN THE GENOMICS DOMAIN IN THE LATE 1990S, WITH DR. TSUJII AS 1 OF THE FOUNDERS, AND HE'S HERE TODAY AND WILL GIVE A TALK, AND AARON SEN ILLEGALSEN AND HUNTER. MANY OF THEM DID EARLY WORK IN EXTRACTING INFORMATION FROM THE LITERATURE, AND SO THEY EXTRACTED GENE AND PROTEIN NAMES AND BIOMOLECULAR RELATIONS. IN THE 2000S, RZHETSKY AND WONG CONNECTED INFORMATION, SHOWING THAT YOU CAN EXTRACT INFORMATION FROM DIFFERENT ARTICLES AND THEN CONNECT THEM TO GENERATE A PATHWAY. ALSO, RAYCHAUDHURI AND LARRY, WITH RUSS ALTMAN, MAPPED THE GENE NAMES TO CODES, SO THAT WAS A START IN NORMALIZING NAMES AND REALIZING THAT YOU HAD TO HAVE A CONCEPT TO MAP TO, SO THAT OTHER PEOPLE COULD REALLY UNDERSTAND WHAT THE MEANING WAS. ALSO, CORPORA AND CHALLENGES STARTED TO BECOME AVAILABLE, AND THOSE WERE IMPORTANT EFFORTS. SO AGAIN, DR. TSUJII DEVELOPED THE GENIA CORPUS AND PROPELLED THE FIELD OF NATURAL LANGUAGE PROCESSING OF THE GENOMIC LITERATURE, AND THE BIOCREATIVE CHALLENGE STARTED A LITTLE BIT AFTER. DR. HIRSCHMAN AND VALENCIA WERE THE CREATORS OF THAT, AND DR. HIRSCHMAN WILL TALK ABOUT THAT. AND DR. HERSH HAD THE GENOMICS TRACK, AND THERE WERE BIONLP WORKSHOPS AND CHALLENGES, SO THE FIELD IS MOVING FORWARD WITH A LOT OF ACTIVITY. 
THEN THERE ARE A NUMBER OF TOOLS THAT ARE CRITICAL FOR NATURAL LANGUAGE, AND I THINK ALMOST EVERYBODY MUST KNOW OF METAMAP, BY DR. ARONSON, WHICH MAPS TERMS TO THE UMLS, SO WE KNOW WHAT THESE TERMS ARE, THEY'RE NOT JUST STRINGS ANYMORE, AND SEMREP TAKES THESE CONCEPTS AND MAPS THEM INTO RELATIONS, AND I'LL TALK A BIT MORE ABOUT THAT LATER. THAT SYSTEM IS BY DR. RINDFLESCH. NOW I'M TALKING ABOUT THE CLINICAL SIDE. NEGEX AND CONTEXT BY DR. CHAPMAN WERE IMPORTANT FOR NEGATION AND THEN OTHER QUALIFIERS OF INFORMATION, AND CATIES WITH REBECCA CROWLEY FOR PATHOLOGY DIAGNOSIS, AND CTAKES WITH CHUTE, A MORE GENERAL NATURAL LANGUAGE EXTRACTION ENGINE. AND THERE'S A WEB SITE HERE, IF YOU'RE INTERESTED, WHICH HAS REGISTERED A LOT OF BIOMEDICAL INFORMATICS TOOLS, AND MANY OF THEM ARE NATURAL LANGUAGE PROCESSING TOOLS, SO I'M PUTTING THIS UP HERE. SO LOOKING AT THE ASPECTS, I KIND OF COVERED THE LEFT-HAND SIDE, WHICH ARE APPLICATION SYSTEMS AND TOOLS, AND NOW I WANT TO GO TO THE OTHER SIDE, THE MATERIALS, AND SEE WHAT WE HAVE AVAILABLE TO US NOW AND WHERE WE NEED TO GO. I'M NOT GOING TO COVER THE GENERAL LANGUAGE RESOURCES, BUT I PUT UP A LOT OF WEB SITES HERE, AND A VERY GOOD 1 IS CHRIS MANNING'S WEB SITE, A GOOD LIST OF RESOURCES, AND HE'S GOING TO BE SPEAKING LATER AND WE'RE VERY HAPPY TO HAVE HIM HERE. SOME OF THE INFORMATION THAT NATURAL LANGUAGE NEEDS IS THE LEXICAL INFORMATION, WHICH IS CRITICAL SO THAT IT RECOGNIZES THE TERMS OF THE DOMAIN. THE UMLS HERE HAS BEEN AN INVALUABLE RESOURCE, BECAUSE WE CAN USE THE METATHESAURUS, ALL THE TERMS MENTIONED HERE, TO RECOGNIZE WHAT THE TERMS ARE IN THE DOMAIN, AND THE SEMANTIC NETWORK TO CATEGORIZE THE TERMS. TO RECOGNIZE WHETHER A TERM IS A PROCEDURE, A MEDICATION OR A DISEASE IS VERY, VERY IMPORTANT, AND THAT'S BEEN AN INVALUABLE RESOURCE. THE UMLS SPECIALIST NLP TOOLS WERE 1 OF THE EARLY SYSTEMS OF TOOLS, AND THAT HELPED THE FIELD A LOT. 
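[ EDITORIAL ILLUSTRATION, NOT PART OF THE TALK ] THE NEGATION TOOLS MENTIONED ABOVE LOOK FOR TRIGGER PHRASES NEAR A CLINICAL TERM. THE PYTHON SKETCH BELOW SHOWS THAT GENERAL IDEA ONLY AND IS NOT THE PUBLISHED ALGORITHM. THE TRIGGER LIST, THE 5-TOKEN WINDOW AND THE NAME `is_negated` ARE ASSUMPTIONS MADE FOR THIS EXAMPLE.

```python
import re

# Hypothetical, deliberately tiny trigger list; real systems use much
# larger curated lists. These entries are assumptions for illustration.
NEGATION_TRIGGERS = ("no", "denies", "without", "negative for", "no evidence of")

# Assumed window size: how many tokens before the concept are inspected.
WINDOW = 5


def is_negated(sentence: str, concept: str) -> bool:
    """Return True if a trigger phrase occurs shortly before the concept.

    Only the first occurrence of the concept is checked, and only
    triggers that precede it are considered.
    """
    text = sentence.lower()
    idx = text.find(concept.lower())
    if idx == -1:
        return False  # concept not mentioned at all
    preceding = re.findall(r"[a-z]+", text[:idx])
    window = " ".join(preceding[-WINDOW:])
    return any(
        re.search(r"\b" + re.escape(trigger) + r"\b", window)
        for trigger in NEGATION_TRIGGERS
    )


if __name__ == "__main__":
    print(is_negated("Patient denies chest pain or fever.", "fever"))
    print(is_negated("Patient has fever but no cough.", "fever"))
```

A SKETCH LIKE THIS IGNORES SCOPE TERMINATORS SUCH AS "BUT" AND TRIGGERS THAT FOLLOW THE TERM, WHICH IS PART OF WHY QUALIFIERS ARE HARDER THAN THEY LOOK.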
SO THERE ARE MANY, MANY, MANY TOOLS AVAILABLE, AND THE NCBI RESOURCES ARE ALSO CRITICAL, NAMING THE GENES AND THE PROTEINS, SO WE KNOW THE NAMES OF THE BIOMOLECULAR ENTITIES AND CHEMICAL SPECIES. THERE'S ANOTHER RESOURCE, WHICH IS THE OPEN BIOLOGICAL AND BIOMEDICAL ONTOLOGIES, AND THAT'S AN IMPORTANT RESOURCE ALSO, SO WE CAN RECOGNIZE THE NAMES OF THESE ENTITIES. NOW HAVING THE NAMES ISN'T ENOUGH, SO WE HAVE TO MAP THEM TO UNIQUE CONCEPTS, AND THAT'S CRITICAL NOW. I THINK AT FIRST PEOPLE THOUGHT JUST EXTRACTING THE NAMES WAS ENOUGH, BUT NOW WE REALIZE THAT WE HAVE TO KNOW WHAT THEY MEAN, AND FOR SHARING AND INTEROPERABILITY IT'S CRITICAL THAT WHAT NLP EXTRACTS IS MAPPED TO A CONCEPT IN AN ONTOLOGY, SO WE KNOW CLEARLY WHAT IT MEANS. THERE ARE A LOT OF MODELS, DOMAIN MODELS, NOW FOR CONCEPTS, AND I'LL SHOW YOU THEM LATER, AND MODELS FOR RELATIONS, WHICH ARE MORE COMPLICATED. THE CONCEPT MODELS ARE BASICALLY THE ONTOLOGIES AND TERMINOLOGIES, AND THE UMLS NOW HAS OVER 160 SOURCE VOCABULARIES, TERMINOLOGIES AND ONTOLOGIES, AND I MENTIONED A FEW OF THEM HERE, AND OBO HAS OTHERS ON THE BIOLOGICAL SIDE, SUCH AS THE CELL ONTOLOGY, PHENOTYPE ONTOLOGY AND GENE ONTOLOGY. THESE ALL GIVE US DEFINITIONS OF THE ENTITIES THAT ARE IMPORTANT TO THE DOMAIN, AND IT IS VERY IMPORTANT FOR NLP TO MASTER THESE. NOW THE DOMAIN OF RELATIONS IS ALSO IMPORTANT. IN THE CLINICAL DOMAIN, BASICALLY ONCE WE HAVE THE CLINICAL CONCEPTS, WHEN YOU PROCESS NATURAL LANGUAGE A LOT OF OTHER TERMS CAN CHANGE THE MEANING OR QUALIFY THE MEANING, AND SO ALL THE QUALIFIERS OF THESE CONCEPTS ARE IMPORTANT TO CAPTURE ALSO. THERE WAS ACTUALLY A BIG EFFORT A NUMBER OF YEARS AGO, WHICH WAS THE CANON EFFORT, TO DEVELOP A MODEL. LOTS OF INDIVIDUAL GROUPS WERE DEVELOPING MODELS, SO THE BIG EFFORT WAS TO MERGE ALL THE MODELS, AND THAT WAS DONE AFTER QUITE A WHILE, BUT THEN NOBODY ADOPTED IT. 
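[ EDITORIAL ILLUSTRATION, NOT PART OF THE TALK ] THE POINT ABOVE THAT NAMES MUST BE MAPPED TO UNIQUE CONCEPTS CAN BE SKETCHED IN PYTHON AS A LOOKUP FROM SURFACE STRINGS TO ONE SHARED CONCEPT RECORD. EVERYTHING HERE IS HYPOTHETICAL: THE IDENTIFIERS `X0001` AND `X0002` ARE PLACEHOLDERS AND NOT REAL UMLS CODES, AND THE LEXICON, THE `Concept` CLASS AND THE `normalize` FUNCTION ARE ASSUMPTIONS MADE FOR THIS EXAMPLE.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Concept:
    concept_id: str       # placeholder identifier, NOT a real UMLS code
    preferred_name: str
    semantic_type: str    # coarse category, e.g. disease or medication


# Two toy concepts; the identifiers are invented for this sketch.
_MI = Concept("X0001", "myocardial infarction", "disease")
_ASPIRIN = Concept("X0002", "aspirin", "medication")

# Several surface strings point at the same concept record.
LEXICON: Dict[str, Concept] = {
    "myocardial infarction": _MI,
    "heart attack": _MI,
    "mi": _MI,
    "aspirin": _ASPIRIN,
    "asa": _ASPIRIN,
}


def normalize(term: str) -> Optional[Concept]:
    """Map a surface string to its concept, or None if it is unknown."""
    key = " ".join(term.lower().split())  # fold case and extra whitespace
    return LEXICON.get(key)


if __name__ == "__main__":
    print(normalize("Heart  Attack"))
    print(normalize("MI") == normalize("myocardial infarction"))
```

ONCE 2 SYSTEMS EMIT THE SAME CONCEPT IDENTIFIER FOR "HEART ATTACK" AND "MI", THEIR OUTPUT CAN BE SHARED WITHOUT COMPARING STRINGS. A PLAIN LOOKUP LIKE THIS DOES NOT HANDLE AMBIGUOUS STRINGS, WHICH THE TALK RETURNS TO LATER.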
I THINK AT THAT TIME, MAYBE THE TIME WASN'T RIGHT TO USE IT. YOU KNOW, HEALTH INFORMATION WASN'T IN EVERYBODY'S EYES, AND NOW IT IS. THERE WAS ANOTHER EFFORT IN EUROPE, A VERY DEEP MODEL OF CLINICAL INFORMATION, VERY SOPHISTICATED. THIS LEADS TO NATURAL LANGUAGE PROCESSING INFORMATION THAT OTHER MAPPING SYSTEMS CAN MAP INTO. THE RECENT EFFORT THAT I THINK IS CATCHING ON, THAT PEOPLE MAY ADOPT, IS THE CLINICAL ELEMENT MODEL, WHICH WAS DEVELOPED AT INTERMOUNTAIN BY STAN HUFF, AND SOME GROUPS ARE ADOPTING IT FOR NATURAL LANGUAGE PROCESSING. THAT'S A DEEP MODEL. IF YOU TALK ABOUT BLOOD PRESSURE, IT TRIES TO CAPTURE WHAT IS REASONABLE TO SAY ABOUT BLOOD PRESSURE. THAT MAY BE TOO DEEP NOW FOR NATURAL LANGUAGE PROCESSING, BUT MAYBE FOR UNDERSTANDING THAT'S WHAT WE NEED. AND I PUT UP A WEB SITE WHERE YOU CAN SEE ALL THE EFFORTS IN MODELING THESE CLINICAL CONCEPTS. SO I THINK WE'RE GOING ALONG WELL WITH THE DOMAIN MODELS. IN THE LITERATURE DOMAIN WE HAVE THE PREDICATE ARGUMENT STRUCTURE THAT A LOT OF PEOPLE ARE WORKING ON, AND THAT'S TO CAPTURE ACTIONS. SO FOR INSTANCE, IF YOU HAVE INHIBIT, WHAT IS THE TARGET, WHAT IS THE AGENT OF INHIBITION. IT SPECIFIES THE ROLES OF THESE PREDICATES, AND THAT'S VERY IMPORTANT FOR THE LITERATURE. THERE ARE DIFFERENT MODELS. THERE'S BIOPROP, AND ALSO SEMREP HAS PREDICATIONS, AND THAT AGAIN USES THE UMLS, AND IT'S NOT JUST BASED ON VERBS, IT'S MORE SOPHISTICATED THAN THAT. SO FOR INSTANCE THERE'S A RELATION TREATS, AND THERE MAY BE PENICILLIN FOR FEVER, SAY, AND SO THE "FOR" WOULD GIVE THE PREDICATE TREATS, THE RELATION TREATS, I SHOULD SAY. AND THERE ARE DOMAIN-SPECIFIC MODELS, SO I'M JUST GIVING A FEW OF THEM. 1 IS FOR GUIDELINES AND THE OTHER IS FOR TIME, WHICH IS VERY, VERY CRITICAL IN THE CLINICAL DOMAIN, TIMEML, AND WE HAVE THE AUTHORS HERE TODAY, WHO MAY TALK ABOUT THE TIME ASPECTS OF THAT. AS FOR CORPORA IN THE FIELD, WE HAVE PUBMED AS 1 OF THE BIGGEST AND EARLIEST, WITH MESH ANNOTATIONS, AND THE GENIA CORPUS, AND DR. TSUJII MAY TALK ABOUT IT. 
IT ORIGINALLY HAD THE SEMANTIC ANNOTATIONS, THAT IS, ANNOTATIONS OF THE GENES AND PROTEINS, AND THEN EXPANDED INTO RELATIONS AND THEN EVENT INFORMATION. AND THEN THERE WAS THE BIOCREATIVE SPECIALIZED CORPUS, WHICH IS ANNOTATED FOR REALISTIC TASKS THAT BIOLOGISTS WOULD NEED, AND THOSE WOULD BE FINDING THE NAMES OF GENES AND PROTEINS AND THEN NORMALIZING THEM TO SOME SORT OF CODE, MOLECULAR INTERACTIONS, AND SO ON. THERE ARE OTHER CORPORA AVAILABLE, AND THE 1S I WILL TALK ABOUT ARE 2 OF THE WORD SENSE DISAMBIGUATION COLLECTIONS, BECAUSE THOSE OF US WHO WORK IN NATURAL LANGUAGE KNOW THAT AMBIGUITY IS A DIFFICULT, DIFFICULT ISSUE, AND I'LL TALK ABOUT IT MORE LATER. SO ON THE LITERATURE SIDE IT'S SORT OF, I WOULDN'T SAY EASY, BUT EASIER, BECAUSE PUBMED CENTRAL PROPELLED THE FIELD FORWARD, BECAUSE WE HAVE FULL-TEXT ARTICLES TO EVALUATE. FOR CLINICAL TEXT, SOME HOSPITALS ARE AFRAID YOU MAY SEE SOMETHING IN THE NOTES WHERE TREATMENT WAS LESS THAN IDEAL. SO AN EARLY CORPUS HAD THE DEIDENTIFIED REPORTS OF MULTIPLE HOSPITALS, BUT I LOOKED AT THE WEB SITE AND IT'S UNAVAILABLE NOW, AND THEN THERE IS THE MIMIC DATABASE, WHICH HAS THE LONGITUDINAL DEIDENTIFIED REPORTS OF 26 THOUSAND PATIENTS IN THE ICU SETTING, SO IT'S SPECIALIZED REPORTS, AND A VARIETY OF THEM. THERE'S ANNOTATED TEXT TOO, BUT THERE'S MUCH LESS OF IT, AND IT'S FOR SPECIFIC PURPOSES. SO THERE'S CINCINNATI, AND I THINK THE SHARE PROJECT IS NOW WORKING ON ANNOTATING THE MIMIC DATABASE FOR MUCH MORE GENERAL ANNOTATION. SO THIS IS WHERE WE ARE NOW, AND THERE'S MORE THAN THIS, BUT THIS IS WHAT WE CAN SUMMARIZE IN LITTLE TIME. SO WHAT ARE THE CHALLENGES AND FUTURE DIRECTIONS? WE NEED TO HAVE A LARGER VARIETY OF CLINICAL NOTES, AND I GUESS THAT'S A VERY HIGH LEVEL ISSUE, AND MAYBE A POLITICAL ISSUE, AND SO I'M NOT SURE HOW WE ARE GOING TO GO FORWARD. 
I KNOW MANY HOSPITALS ARE VERY, VERY, WELL, THEY'RE HESITANT OR WILL NOT RELEASE THEIR REPORTS, AND ANONYMIZATION, MAKING SURE THAT NO PATIENT INFORMATION IS IN A NOTE, IS NOT AN EASY TASK. WE WOULDN'T WANT ANY NOTES OUT THAT HAVE ANY KIND OF PATIENT INFORMATION, AND IT COULD BE SUBTLE BUT ENOUGH TO IDENTIFY A PATIENT. WE ALSO SEE A LOT OF METHODS NOW THAT I THINK ARE INCREMENTAL, AND SO WE NEED SOME NEW METHODS, REALLY A NEW FRESH LOOK AT THINGS, AND MORE VARIED APPLICATIONS. NLP COULD BE USED FOR A LOT, AND I THINK WE NEED TO SEE MORE VARIED APPLICATIONS, AND PEOPLE SHOULD PUSH THE ENVELOPE OF WHAT THE APPLICATIONS CAN BE. WE'VE HAD A NUMBER OF EVALUATIONS, AND THEY'RE CRITICAL TO THE FIELD, AND WE LEARNED A LOT FROM THEM. WE LEARNED WHAT THE STATE OF THE ART IS AND WHAT PERFORMANCE WE CAN EXPECT FOR DIFFERENT TASKS. AND WE'VE ALSO FOUND OUT THAT SOME TASKS ARE MORE DIFFICULT THAN OTHERS, BUT I THINK WE HAVE TO LOOK INTO THE WHY OF THESE THINGS VERY DEEPLY, SO THAT IN EVALUATIONS WE LEARN SOMETHING FROM THEM. BASICALLY, I GUESS IF IT'S A SPECIFIC TASK, MAYBE SOME METHODS ARE BETTER FOR SPECIFIC TASKS THAN GENERAL TASKS. MAYBE THE REASONS THAT A SYSTEM DID WELL OR NOT SO WELL WERE NOT NLP REASONS, IT COULD HAVE BEEN FOR OTHER REASONS, AND UNLESS WE ANALYZE THE REASONS WE HAVEN'T LEARNED AS MUCH AS WE SHOULD. ANOTHER REASON COULD BE THAT THERE'S A LOT OF DOMAIN REASONING NEEDED TO DO SOME OF THOSE TASKS, AND THAT'S AN IMPORTANT THING, BECAUSE THE NLP PART MAY DO OKAY BUT THE REASONING MAY NOT BE OKAY, AND THAT'S THE UNDERSTANDING PART THAT DR. LINDBERG WAS TALKING ABOUT, SO WE HAVE TO TEASE OUT THOSE ISSUES. THE LINGUISTICS FIELD HAS BEEN GOING THROUGH SEVERAL TRENDS, AND THIS IS SORT OF A PENDULUM, I GUESS, AND A VERY GOOD JOB WAS DONE AT ANIMATING THIS. 
SO IN THE LATE 1950S AND BEFORE, IT WAS CORPUS-BASED LINGUISTICS THAT DOMINATED THE FIELD, AND PEOPLE WOULD LOOK AT CORPORA AND ANALYZE THE DISTRIBUTIONS OF WORDS TO FIND OUT WHAT KIND OF A VERB, TO SORT OF CLASSIFY VERBS AND CLASSIFY WORDS. AND THAT WAS ALSO IN LINE WITH SKINNER. THEN THAT SORT OF WENT AWAY, DISAPPEARED, AND THE PENDULUM SWUNG THE OTHER WAY, AND PROBABLY, NOT PROBABLY, BUT DUE TO CHOMSKY, FROM THE LATE 1950S, FOR 30 YEARS, MANUAL RULE-BASED LINGUISTICS DOMINATED THE FIELD. THEN THAT WENT AWAY, AND THE PENDULUM SWUNG BACK AGAIN IN THE LATE 1980S, AND STATISTICAL CORPUS-BASED WORK WAS IN VOGUE, AND PRETTY MUCH THAT'S THE TREND RIGHT NOW. AND SO I THINK THERE'S STILL THIS OTHER SIDE, WHICH IS THE LINGUISTIC-BASED EXPERTISE, AND THE FIELD SHOULD SOMEHOW MOVE TOWARDS THE MIDDLE, AND I'LL GO INTO WHY I THINK THAT WAY. SO WE NEED DEVELOPMENT OF HYBRID METHODS THAT TAKE ADVANTAGE OF THE STATISTICAL METHODS, BECAUSE THE STATISTICAL METHODS HAVE MANY, MANY ADVANTAGES. FIRST, THEY'RE ROBUST AND ABLE TO DETECT PATTERNS FROM LARGE CORPORA. AND MANY MACHINE LEARNING TOOLS ARE AVAILABLE NOW, SO THAT MEANS IF YOU HAVE AN ANNOTATED CORPUS, YOU CAN IMPLEMENT A NATURAL LANGUAGE SYSTEM WITH THE TOOLS PRETTY RAPIDLY, AND YOU DON'T HAVE TO HAVE LINGUISTIC EXPERTISE, BUT YOU DO HAVE TO USE IT IN ANNOTATING THE CORPUS. SO IT MEANS THAT YOU COULD ALSO EXPERIMENT A LOT, WITH DIFFERENT FEATURES AND DIFFERENT METHODS. 
AND YOU COULD GET A LOT OF PAPERS, BECAUSE YOU CAN EXPERIMENT BY CHANGING THE METHODS AND CHANGING THE FEATURES, AND THAT'S WHAT WE SEE A LOT OF. I'M HOPING THAT WE MOVE TO SOMETHING IN BETWEEN, BECAUSE SOME DISADVANTAGES OF THE STATISTICAL METHODS ARE THAT ANNOTATION IS COSTLY AND THAT MEDICINE IS VERY SPECIALIZED, SO IF YOU HAVE TO DO A TASK THAT'S DEPENDENT ON A CERTAIN CLINICAL DOMAIN, I THINK YOU NEED CORPORA FOR THAT DOMAIN, AND IF YOU MIX THEM ALL, YOU COULD GET A DETERIORATION IN PERFORMANCE. THE OTHER THING IS THE STATISTICAL PATTERNS ARE NOT INTUITIVE, AND ERROR ANALYSIS IS DIFFICULT TO PERFORM. SO THE MAIN THING IS YOU DON'T KNOW WHAT TO DO WHEN THERE'S AN ERROR. WHEN THERE'S AN ERROR, CAN YOU FIX IT RAPIDLY? THE THING TO DO IS TO GO AND GET MORE ANNOTATED TEXT, OR YOU CHANGE THE METHODS, IN OTHER WORDS YOU EXPERIMENT SOME MORE AND GENERATE MORE PAPERS AND, YOU KNOW, YOU LOOK AT DIFFERENT FEATURES. SO WHAT I THINK WE NEED IS A SYNERGISTIC MODEL, A MODEL THAT INTEGRATES THE EXPERT RULES, LANGUAGE RULES, DOMAIN KNOWLEDGE, WITH WHICH MAYBE WE CAN GET TO SOME UNDERSTANDING, AND MACHINE LEARNING TECHNIQUES, THAT INTEGRATES ALL OF THEM AND ALLOWS EXPERTS TO OVERRULE, AND IS MORE LINGUISTICALLY INTUITIVE. THERE ARE MANY TIMES WHEN I WORK ON WORD SENSE DISAMBIGUATION THAT I COULD LOOK AND WRITE THE RULE WITHIN A FEW MINUTES, AND IT WOULD BE A LOT OF WORK FOR A SYSTEM TO DO THAT. TO SOMEHOW TAKE ADVANTAGE OF BOTH WOULD BE VERY CRITICAL. IN THE CLINICAL DOMAIN, THE ABBREVIATIONS ARE REALLY ROUGH, AND NOW THAT PHYSICIANS ARE ENTERING DATA BY TYPING, AND THEY SEEM TO LIKE IT, THEY ALSO SEEM TO LIKE GENERATING VERY CREATIVE SHORTCUTS. 
SO THERE'S A LINE OF 2 AND 3 LETTER WORDS. WE KNOW CA IS CANCER OR CALCIUM, PD IS PARKINSON'S DISEASE, SO A SYSTEM COULD IDENTIFY OR COULD WORK WITH TERMS IT KNOWS, BUT THERE ARE VERY ATYPICAL ABBREVIATIONS. SO B4 FOR BEFORE: I THINK THOSE OF YOU WHO HAVE CHILDREN KNOW THAT'S TEXTING LANGUAGE, AND THAT'S CREEPING INTO MEDICAL NOTES. THEN HF, AND RA. SO HF: THOSE WHO ARE PHYSICIANS WOULD THINK OF HEART FAILURE, AND WHEN WE WERE PROCESSING NOTES WE GOT A LOT OF HEART FAILURES AND WERE WONDERING WHY. I DON'T KNOW IF ANYBODY COULD GUESS WHAT HF IS? HISPANIC FEMALE. [LAUGHTER] AND RH, WHICH YOU WOULD READ AS RH POSITIVE OR RH NEGATIVE, THE BLOOD GROUP, IS RIGHT HANDED. SO THOSE ARE THE THINGS THAT WE'RE GOING TO HAVE TO DEAL WITH. VERY DIFFICULT. THE DIFFICULTY IS THAT THE SPACE IS LARGE, VERY LARGE, SO IF WE TALK ABOUT SYNTACTIC AMBIGUITY, IT'S LIKE A DROP IN THE BUCKET COMPARED TO THIS AMBIGUITY. AND I THINK IN ALL THE WORK WE'VE DONE, WE HAVE PERFORMANCE MEASURES, WE KNOW THE OVERALL AVERAGE PERFORMANCE, BUT WHEN WE GET 1 WORD, A PARTICULAR WORD, IT'S HARD TO KNOW WHETHER LOCAL FEATURES ARE GOOD FOR THAT WORD, OR GLOBAL FEATURES, OR CONTEXTUAL KNOWLEDGE. WHEN WE WANT TO DO THAT WORD ACCURATELY, WE STILL DON'T KNOW WHAT'S THE BEST THING FOR THAT WORD. MORE WORK IS NEEDED THERE. 
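[ EDITOR'S ILLUSTRATION, NOT PART OF THE TALK ] THE HF AND RH EXAMPLES ABOVE CAN BE SKETCHED AS A TINY CONTEXT-BASED DISAMBIGUATOR. THIS IS ONLY A MINIMAL SKETCH UNDER STATED ASSUMPTIONS: THE SENSE INVENTORY, THE CUE WORDS, AND THE NAMES `SENSES`, `tokenize`, AND `disambiguate` ARE HYPOTHETICAL AND HAND-WRITTEN FOR ILLUSTRATION, NOT TAKEN FROM ANY SYSTEM THE SPEAKER DESCRIBES. IT IS MEANT TO SHOW THE KIND OF QUICK EXPERT-WRITTEN RULE THE SPEAKER MENTIONS, USING ONLY LOCAL CONTEXT WORDS AS EVIDENCE.

```python
import re

# Hypothetical, hand-written sense inventory: abbreviation -> {sense: cue words}.
# A real system would need a far larger, domain-specific inventory.
SENSES = {
    "hf": {
        "heart failure": {"ejection", "edema", "cardiac", "diuretic", "dyspnea"},
        "hispanic female": {"yo", "year", "old", "presents", "age"},
    },
    "rh": {
        "rh blood group": {"positive", "negative", "blood", "type", "antibody"},
        "right handed": {"handed", "dominant", "hand", "writes", "stroke"},
    },
}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def disambiguate(abbrev, sentence):
    """Pick the sense whose cue words overlap most with the sentence.

    Returns (sense, scores). The sense is None when there is no evidence
    or when the top two senses tie, so that a person can decide instead.
    """
    key = abbrev.lower()
    context = set(tokenize(sentence)) - {key}
    scores = {sense: len(cues & context) for sense, cues in SENSES[key].items()}
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    best_sense, best_score = ranked[0]
    tie = len(ranked) > 1 and ranked[1][1] == best_score
    if best_score == 0 or tie:
        return None, scores
    return best_sense, scores


if __name__ == "__main__":
    for note in ("45 year old HF presents with cough",
                 "HF with reduced ejection fraction and edema"):
        print(note, "->", disambiguate("HF", note)[0])
```

THE SKETCH DELIBERATELY RETURNS NO ANSWER ON A TIE OR WHEN NO CUE WORD IS PRESENT, WHICH LEAVES ROOM FOR AN EXPERT TO OVERRULE OR ADD A RULE. WHETHER LOCAL CUES LIKE THESE ARE ENOUGH FOR A GIVEN ABBREVIATION IS EXACTLY THE OPEN QUESTION RAISED ABOVE.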
SO AS FAR AS DOMAIN MODELS, I THINK WE'RE GOING IN THE RIGHT DIRECTION; WE SHOULD CONTINUE THE WORK, EXPANDING THE MODELS AND EVALUATING THEM. NOW, IN SUMMARY, I WOULD LIKE TO SEE A BALANCE, AND TO BROADEN THE NLP PORTFOLIO, TO THINK OF THINGS THAT NLP CAN HELP IN. FOR INSTANCE, DATA ENTRY: IMPROVED DATA ENTRY IS CRITICAL, AND MAYBE THERE IS SOME WAY THAT PHYSICIANS CAN SPEAK, SOMETHING MORE NATURAL, OR MAYBE REDUCE ABBREVIATIONS. SOMEHOW THERE'S ANOTHER WAY OF ADDRESSING IT, NOT HAVING TO DO DISAMBIGUATION BUT CATCHING IT AT THE FRONT END, REDUCING CUT AND PASTE, AND IMPROVING TEMPLATES. SO I THINK THERE'S A LOT OF WORK THAT NATURAL LANGUAGE PROCESSING COULD DO HERE TO IMPROVE THE DATA ENTRY, AND THAT WOULD BE CRITICAL TO AVOIDING A LOT OF THE COMPLICATIONS OF NLP LATER ON. NLP COULD ALSO BE USED TO IMPROVE THE DOCUMENTATION, AND WE'RE SEEING THAT NOW, SO WHEN, FOR INSTANCE, AN INDICATION FOR A MEDICATION IS MISSING OR SOME EXAM IS MISSING, AN NLP SYSTEM COULD PROBABLY SUGGEST SOMETHING. AND I BELIEVE WE SHOULD DEVELOP THE KILLER APPS, LIKE THE WATSON EXPERIMENT, WHERE PEOPLE REALLY SEE THE IMPORTANCE OF THE FIELD, AND THEN BRING IT TO THE FOREFRONT: SUMMARIZATION, QUESTION ANSWERING, AND I MENTIONED THE UNDERSTANDING AND THE WATSON EXPERIMENT. IMPROVING INFORMATION FOR THE CONSUMER IS CRITICAL, GIVING THEM INFORMATION THEY UNDERSTAND, AND WE'RE SEEING THAT NOW. AND KNOWLEDGE ACQUISITION, INTEGRATION, AND DISCOVERY IS ABSOLUTELY CRITICAL, AND THAT I THINK ABSOLUTELY NEEDS [INDISCERNIBLE] TECHNIQUES AND DATA MINING TECHNIQUES TO LOOK AT STATISTICS AND DISCOVER SOMETHING NEW, OR TO SEE HOW PHYSICIANS ARE TREATING PATIENTS WITH CERTAIN DISEASES AND WHAT THE OUTCOMES ARE, AND ALSO INTEGRATING INFORMATION FROM DIFFERENT SPECIALTIES. SO I THINK THE FIELD SHOULD KEEP UP THE MOMENTUM, KEEP ON DOING THE GOOD WORK IT'S DOING, AND ALSO MOVE INTO NEW DIRECTIONS. THANK YOU. [ APPLAUSE ] ANY QUESTIONS? I DON'T KNOW IF I HAVE TIME, ACTUALLY. 
>> LADIES AND GENTLEMEN, EACH OF YOU, OR ALMOST EACH OF YOU, HAS A MICROPHONE WITH A BUTTON WHERE YOU ARE. IF YOU ARE GOING TO ASK A QUESTION, PLEASE PRESS THE BUTTON, THE RED LIGHT WILL COME ON, AND IDENTIFY YOURSELF, AND THAT WAY YOU'LL ACHIEVE IMMORTALITY ON THE WEBCAST. >> SORRY. >> NO, THAT'S OKAY, DO WE HAVE TIME? THE TIME KEEPER? >> ANYONE HAVE QUESTIONS OR COMMENTS FOR CAROL? >> NO, OKAY. I SAID IT ALL. OR I SAID NOTHING, 1 OR THE OTHER. I HOPE IT WAS ALL. SORRY, THERE'S A QUESTION. >> ARE THERE PROMISING USES THAT ARE POINTING THE WAY FOR ACADEMIC RESEARCH? >> THEY'RE STARTING. THERE'S ENCODING; CODING IS A BIG ISSUE AND THERE ARE CODING COMPANIES. THERE ARE SOME COMPANIES I KNOW THAT ARE STARTING TO DEVELOP; THEY ALL ADVERTISE NATURAL LANGUAGE, TO THE EXTENT THEY HAVE IT, BUT THEY ARE STARTING, AND SO IT'S BECOME A COMMERCIAL THING, RIGHT? IBM, NOT IBM BUT THE iPHONE, YOU KNOW, SIRI IS PRETTY NEAT, AND THAT'S NOT IN THE CLINICAL FIELD, BUT THAT'S WHERE THE CLINICAL FIELD CAN GO, USING THAT KIND OF INTERFACE TO HELP PEOPLE IN A LIMITED WAY. SO IT'S NOT THERE YET BUT IT'S COMING, I'M SURE IT'S COMING. >> I WAS GOING TO FOLLOW UP ON THE ISSUE ABOUT PHYSICIANS TYPING IN NOTES. IT'S NOT JUST THE ABBREVIATIONS OR TYPOS; IT'S TELEGRAPHIC, AND THERE ARE ALL KINDS OF ISSUES THAT CAN MAKE IT NOT ONLY TOUGH BUT MAYBE IMPOSSIBLE, AT LEAST ACROSS A SPECTRUM. SO A COUPLE THINGS WE SHOULD PROBABLY TRY TO ASK FOR: ONE IS THAT WE KNOW THE SOURCE OF THE TEXT. THERE MAY BE A NEW FIELD THAT SAYS HAND ENTERED VERSUS TRANSCRIBED, BECAUSE THAT MIGHT HELP THE PROCESSING. THE OTHER THING IS, THE VOICE STUFF IS PRETTY TEMPTING, AND THERE IS A FAIR AMOUNT NOW IN MEDICAL. DO YOU HAVE A SENSE WHETHER IT'S GOOD ENOUGH, OR WHETHER PEOPLE LIKE IT? >> IN CERTAIN FIELDS IT'S GOOD ENOUGH. I KNOW IN RADIOLOGY, AND, I'M NOT SURE, PATHOLOGY, IT'S GOOD ENOUGH AND THEY LIKE IT. 
I WOULD THINK THAT'S THE MOST NATURAL THING, AND I SHOULD HAVE SAID THIS AS A DIRECTION, BUT MARRYING, INTEGRATING VOICE RECOGNITION AND NLP IS PROBABLY VERY POWERFUL, AND WE HAVEN'T DONE IT. I MEAN TOTAL INTEGRATION, BECAUSE THE SPEECH RECOGNITION CERTAINLY KNOWS WHERE THE PERIODS ARE WHEN SOMEBODY PAUSES; IT WOULD BE ABLE TO PREDICT PRETTY WELL, AND I THINK PEOPLE WOULDN'T USE THOSE ABBREVIATIONS IF THEY WERE SPEAKING. >> I DON'T HAVE DETAILS, BUT AS I UNDERSTAND IT, 1 OF THE DICTATION [INDISCERNIBLE] IS TRYING TO COMBINE IN A PRODUCT TRANSCRIBING AND ENCODING, AND TO HELP THE TRANSCRIBER CORRECT THE CODING. >> YEAH, WE DID EXPERIMENTS WITH THAT, BUT THAT'S 1 ON TOP OF THE OTHER, IT'S PIGGYBACK, SO YOU DO THE SPEECH AND THEN YOU DO THE NLP. WHAT YOU REALLY NEED IS FOR THE TECHNOLOGIES TO BE MERGED, AND THEN IT WILL BE MUCH MORE POWERFUL, BECAUSE THE SPEECH SORT OF KNOWS, IT CAN HELP TO PREDICT WHAT THE WORD IS IN THE CONTEXT, AND THAT COULD BE HELPFUL TO THE NATURAL LANGUAGE PROCESSING SYSTEM, BUT THEY'RE NOT INTEGRATED NOW. SO 1 IS BEING USED ON TOP OF THE OTHER. [ APPLAUSE ] NOW I HAVE THE PLEASURE OF INTRODUCING DR. CHRISTOPHER MANNING. WE ALL KNOW DR. MANNING; WE USE HIS BOOK, FOUNDATIONS OF STATISTICAL NATURAL LANGUAGE PROCESSING; IT'S A BIBLE OF NATURAL LANGUAGE PROCESSING. HE RECEIVED HIS DEGREE IN 1995 FROM STANFORD IN LINGUISTICS, AND I THINK HE WENT TO CARNEGIE MELLON AS AN ASSISTANT PROFESSOR IN COMPUTATIONAL LINGUISTICS, AND HE'S BACK AT STANFORD IN COMPUTER SCIENCE. HE RAN THE GAMUT FROM LINGUISTICS TO COMPUTER RESEARCH. 
HE'S A PROLIFIC WRITER WITH MANY PUBLICATIONS, A LOT OF RESEARCH IN LEXICOGRAPHY, RESEARCH IN CLUSTERING, RESEARCH IN STATISTICAL NATURAL LANGUAGE PROCESSING, AND ALSO IN TRANSLATION. HIS BOOK, AS I SAID, IS LIKE A BIBLE, A VERY GOOD BOOK, AND HE JUST RECENTLY PUBLISHED ANOTHER BOOK ON INFORMATION RETRIEVAL, AND IT'S JUST AS GOOD. SO I'M HAPPY THAT CHRIS CAME AND ACCEPTED OUR INVITATION TO SPEAK ABOUT STATISTICAL NATURAL LANGUAGE PROCESSING. [ APPLAUSE ] >> GOOD AFTERNOON, EVERYONE. I WAS ASKED TO GIVE AN OVERVIEW OF STATISTICS-FOCUSED APPROACHES. WHAT I'D LIKE TO ARGUE IS THAT, WELL, STATISTICS AND PROBABILISTIC MODELS COME UP IN THE CONTEXT OF THAT, BUT THAT'S NEVER BEEN MY FOCUS. IT IS THE CASE IN NATURAL LANGUAGE PROCESSING IN RECENT YEARS THAT THERE'S BEEN A TON OF USE OF MACHINE LEARNING METHODS AND PROBABILISTIC MODELS, AND I THINK IT'S EVEN THE CASE THAT YOU CAN LOOK AT THAT LITERATURE AND WORRY THAT THE YOUNG STUDENTS THESE DAYS END UP SO FOCUSED ON THE DETAILS OF MACHINE LEARNING METHODS AND THEIR MATHEMATICAL DERIVATION, RATHER THAN THINKING ABOUT THE PROBLEMS THAT THEY'RE GOING TO SOLVE. AND THAT'S NEVER REALLY BEEN WHERE I'VE COME AT THINGS FROM. SO I STUCK IN THIS PICTURE FROM COMMENCEMENT A COUPLE OF YEARS AGO, AND THE 2 THINGS THAT YOU CAN NOTICE IN THIS PICTURE ARE, FIRST, IT'S SLIGHTLY DANGEROUS TO BE 1 OF MY STUDENTS, BECAUSE I'M NOT GOOD AT PUTTING ON HOODS, SO THERE'S A REAL RISK OF ASPHYXIATION DURING THE HOODING CEREMONY. BUT THE SECOND THING THAT YOU CAN NOTICE, IF I USE A LITTLE BIT OF IMAGE ENHANCEMENT HERE, IS THE FACT THAT CAROL MENTIONED IN HER INTRODUCTION, WHICH IS THAT ACTUALLY I'M A HUMANITIES Ph.D., THAT I HAVE A Ph.D. IN LINGUISTICS, AND THAT WAS MY UNDERGRAD HONORS AS WELL. SO I WANTED TO TAKE A MINUTE TO SAY JUST A LITTLE BIT MORE ABOUT THE HISTORY OF LINGUISTICS AND HOW SOME OF THESE IDEAS DO FIT IN AND ARE WELL MOTIVATED IN THE HISTORY OF LINGUISTICS. 
SO, LINGUISTICS IN THE LAST 50 OR SO YEARS HAS BEEN DOMINATED BY VIEWING LINGUISTICS AS A LOGICAL SYSTEM WITH RULE-BASED DESCRIPTIONS, AND A LOT OF THINGS IN LINGUISTICS GET NAMED ON-- [INDISCERNIBLE] THE CONTINUOUS, E.G. THE INFINITESIMAL CALCULUS, OR THE DISCRETE, E.G. FINITE GROUP THEORY. NOW IT WILL TURN OUT THAT THE MATHEMATICS CALLED LINGUISTICS BELONGS TO THE SECOND CLASS. IT DOES NOT EVEN MAKE ANY COMPROMISE WITH CONTINUITY, AS STATISTICS DOES, OR INFINITE GROUP THEORY. LINGUISTICS IS A QUANTUM MECHANICS IN THE MOST EXTREME SENSE. ALL CONTINUITIES, ALL POSSIBILITIES OF INFINITESIMAL GRADATION, ARE SHOVED OUTSIDE OF LINGUISTICS IN 1 DIRECTION OR ANOTHER. AND YOU CAN REALLY THINK OF CHOMSKY AS COMING ALONG AND REALIZING A VISION OF LINGUISTICS THAT'S LIKE THIS. BUT WHAT I'D LIKE TO ARGUE IS THAT THIS HAS NEVER BEEN THE WHOLE OF LANGUAGE OR A SATISFACTORY DESCRIPTION OF LANGUAGE. THESE CATEGORICAL LINGUISTIC THEORIES END UP CLAIMING FAR TOO MUCH. THEY TRY TO PLACE THESE HARD CATEGORICAL BOUNDARIES AS TO WHAT IS POSSIBLE AND WHAT'S NOT POSSIBLE IN LANGUAGE, BUT LANGUAGES AREN'T LIKE THAT. LANGUAGES ARE VERY FUZZY SYSTEMS THAT ARE DETERMINED BY LOTS OF CONFLICTING CONSTRAINTS AND PREFERENCES, ISSUES OF CONVENTIONALITY THAT KEEP THEM THE SAME AND HUMAN CREATIVITY THAT WANTS TO MAKE THEM DIFFERENT. AND SO THERE HAVE ALWAYS BEEN COUNTERVAILING VOICES. HERE'S EDWARD SAPIR, 1 OF MY FAVORITE LINGUISTS, AND HE WRITES IN HIS POPULAR TEXT, LANGUAGE, FROM THE 20S, A NUMBER OF NICE PITHY APHORISMS, BUT 1 OF THEM IS ALL GRAMMARS LEAK, WHICH IS REFLECTING THE FACT THAT IT'S JUST NOT POSSIBLE TO DESCRIBE ALL THE VARIATION THAT HAPPENS IN A LANGUAGE IN A SYSTEM OF RULES. 
BUT IT'S NOT ONLY AN IMPOSSIBLE TASK; YOU ALSO END UP WITH THE COUNTERVAILING PROBLEM THAT CATEGORICAL LINGUISTIC THEORIES EXPLAIN TOO LITTLE. THEY END UP SAYING NOTHING AT ALL ABOUT THE SOFT CONSTRAINTS OF LANGUAGE, WHICH ARE ABOUT HOW PEOPLE CHOOSE TO EXPRESS THINGS, AND IT IS PRECISELY THOSE KINDS OF SOFT CONSTRAINTS THAT COMPUTATIONAL LINGUISTS, AND ANYONE ELSE WHO'S DEALING WITH LANGUAGE USE, WANT TO KNOW A LOT ABOUT. AND THAT'S BEEN NOTED BEFORE AS WELL. JOHN LYONS IS A WELL-KNOWN BRITISH SEMANTICIST, AND HE WRITES IN THE 60S, STATISTICAL CONSIDERATIONS ARE ESSENTIAL TO AN UNDERSTANDING OF THE OPERATION AND DEVELOPMENT OF LANGUAGE. SO I THOUGHT I WOULD GIVE 1 TEENY EXAMPLE OF HOW THIS PLAYS OUT FROM LINGUISTICS. THIS IS A CASE OF COORDINATION. CHOMSKY NOTICED THE IDEA THAT WHEN YOU HAVE COORDINATIONS YOU CONJOIN THINGS OF THE SAME CATEGORY, SO YOU CAN CONJOIN A NOUN PHRASE AND A NOUN PHRASE, A BOY AND HIS DOG, OR AN ADJECTIVE PHRASE AND AN ADJECTIVE PHRASE, BUT YOU CAN'T MIX THE CATEGORIES OF THINGS THAT ARE COORDINATED, SO YOU CAN'T SAY A BOY AND HAPPY. SO HE PROPOSED THE CONJOIN LIKES CONDITION, WHICH IS SAYING YOU CAN HAVE RULES OF THE FORM X GOES TO X AND X. WELL, THAT SEEMED A GOOD IDEA UNTIL PEOPLE NOTICED THAT EMPIRICALLY THAT'S NOT TRUE. IT TURNS OUT YOU CAN CONJOIN THINGS WHERE THEY AREN'T THE SAME CATEGORY, SO YOU CAN SAY A 57 YEAR-OLD AND 27 YEAR VOICE OF VETERANS; VETERAN IS A NOUN, SO YOU CAN CONJOIN AN ADJECTIVE AND A NOUN. AND NOT CRUDY BUT NOT ADDRESS EITHER, WHICH IS CONJOINING AN ADJECTIVE AND A NOUN TOGETHER. WELL, WHAT DO YOU DO ABOUT THIS FACT? 1 THING YOU CAN DO ABOUT THIS FACT IS SAY, GEE, WELL, WE HAVE TO WEAKEN THE THEORY, BECAUSE IT'S NOT TRUE THAT YOU ONLY CONJOIN LIKES. 
AND PEOPLE HAVE TRIED TO WORK THAT OUT IN FORMAL LOGIC-BASED APPROACHES TO GRAMMAR. I'M NOT GOING TO EXPLAIN THE DETAILS, BUT EFFECTIVELY YOU'RE IMPOSING EXTRINSIC CONDITIONS ON WHAT MUST BE SATISFIED AND UNDER WHAT CIRCUMSTANCES YOU CAN CONJOIN THINGS. YOU CAN DO THAT, BUT IT'S PROBLEMATIC AS WELL, BECAUSE IT TURNS OUT THAT CONJOIN LIKES IS TRUE AS A STATISTICAL CLAIM: THE VAST MAJORITY OF THE TIME PEOPLE DO CONJOIN THINGS OF THE SAME CATEGORY. AND IN FACT, IF YOU THINK OF IT AS A STATISTICAL CLAIM, IT'S TRUE IN A STRONGER, MORE GENERAL WAY THAN CHOMSKY WAS PROPOSING, BECAUSE YOU NOTICE THERE'S A LOT OF PARALLELISM IN COORDINATION. IF YOU'RE CONJOINING 2 NOUN PHRASES AND THE FIRST NOUN PHRASE IS MODIFIED BY AN ADJECTIVE, IT'S MORE LIKELY THAN CHANCE THAT THE SECOND NOUN PHRASE WILL BE MODIFIED BY AN ADJECTIVE AS WELL, AND THAT KIND OF PARALLELISM IS PREVALENT IN LANGUAGES. SO WE CAN INCREASE THE EXPLANATORY POWER OF THE CONJOIN LIKES PRINCIPLE BY INTERPRETING IT STATISTICALLY, AND THAT'S SOMETHING THAT IS VERY POWERFUL TO DO. AND WHILE IT'S NOT YET THE MAINSTREAM, IT'S AN IDEA THAT'S REALLY PICKED UP INSIDE LINGUISTICS TOO THESE DAYS, SO THERE'S NOW STARTING TO BE ACTIVE WORK INSIDE LINGUISTICS IN MAKING USE OF EMPIRICAL PROBABILISTIC MODELS TO EXPLAIN ALL AREAS OF LANGUAGE, WHETHER IT'S AT THE LEVELS OF SEMANTICS, PHONOLOGY, OR PHONETICS. PHONETICS HAS HAD A LOT OF [INDISCERNIBLE] AND ALL THAT. SO HOW I GOT INTO STATISTICAL NLP WAS WRITING A GRAMMAR OF ENGLISH, AND THAT MIGHT SEEM AN UNLIKELY PLACE TO GET INTO STATISTICAL NLP, BUT WRITING A GRAMMAR OF ENGLISH CONVINCED ME OF THE PROBLEMS OF AMBIGUITY OF LANGUAGE AND THE UNFORTUNATE TRADE-OFFS BETWEEN COVERAGE AND DROWNING IN AMBIGUITY. 
SO I STARTED OFF WITH A VERY SMALL GRAMMAR OF ENGLISH, AND IT WAS NICE BECAUSE I COULD TAKE A SIMPLE SENTENCE OF ENGLISH, PUT IT INTO MY PARSER, AND GET THE CORRECT PARSE OF THE SENTENCE BACK. BEING AN AMBITIOUS GRAD STUDENT, I NOTICED THERE WERE LOTS OF SENTENCES MY GRAMMAR COULDN'T PARSE, AND I WAS SPENDING THE SUMMER EXPANDING THE GRAMMAR SO I COULD PARSE MORE SENTENCES THAT GOT CLOSE TO WHAT WAS REALLY SAID IN HUMAN LANGUAGES. BUT THAT'S WHEN THE PROBLEMS STARTED, BECAUSE EVERY TIME I PUT MORE RULES IN THE GRAMMAR SO I COULD PARSE MORE OF THE DIFFERENT SENTENCES THAT I WAS COMING ACROSS IN THE TEXT I WAS TRYING TO PARSE, WHAT I WAS DOING WAS PUTTING IN MORE OF THE POSSIBILITIES OF LANGUAGE, VARIOUS RARE AND MARKED CONSTRUCTIONS THAT YOU NEED IN DIFFERENT PLACES. AND AS I DID THAT, EVERY SIMPLE, OBVIOUS SENTENCE STARTED HAVING MORE AND MORE AMBIGUOUS INTERPRETATIONS. SO THE SENTENCE I HAD THAT WAS A SIMPLE 6 WORD I WANT TO GO TO THE STORE STARTED OFF WITH 1 PARSE AND INTERPRETATION, AND THEN 3 PARSES, AND 5 INTERPRETATIONS, AND 7 PARSES, AND 11 INTERPRETATIONS, AND THE MORE I WENT ON, THE WORSE THIS WAS GETTING. THE THING THAT WAS FRUSTRATING WAS THAT I HAD ABSOLUTELY NO TOOLS AT MY DISPOSAL TO TELL THE SYSTEM, HERE'S HOW YOU SHOULD GO ABOUT CHOOSING BETWEEN THESE DIFFERENT PARSES AND INTERPRETATIONS FOR THESE SIMPLE SENTENCES, WHICH IS OBVIOUSLY WHAT YOU WANT TO DO, OR THE SYSTEM WILL NEVER BE UP TO DOING ANYTHING USEFUL. SO THAT'S WHAT GOT ME INTERESTED IN STATISTICAL NLP. I WANTED TO HAVE A SYSTEM WHERE I COULD GIVE IT SENTENCES AND IT WOULD SAY, A REASONABLE STRUCTURE AND INTERPRETATION FOR THIS SENTENCE THAT YOU'VE JUST GIVEN ME IS THIS ONE. SO LANGUAGE UNDERSTANDING IS A PROCESS OF FLEXIBLE REASONING UNDER UNCERTAINTY. 
SO HUMAN LANGUAGE IS AMBIGUOUS, PEOPLE SPEAK AMBIGUOUSLY, AND THE INTERPRETER HAS INCOMPLETE INFORMATION, SO THE INTERPRETER HAS TO MAKE GUESSES BASED ON THE LANGUAGE USED AND SHARED CONTEXT AND KNOWLEDGE, AND PROBABILITIES ARE JUST A REALLY GOOD WAY OF MAKING GUESSES. AND SO THAT'S WHY THEY COME INTO LANGUAGE, NOT BECAUSE WE'RE STATISTICS FOCUSED BUT BECAUSE WE'RE FOCUSED ON LANGUAGE UNDERSTANDING. AND IN PARTICULAR, USING PROBABILITY DOESN'T MEAN YOU DON'T NEED GOOD LINGUISTIC REPRESENTATIONS. YOU CAN AND YOU SHOULD PUT YOUR STATISTICAL MODELS OVER COMPLEX LINGUISTIC REPRESENTATIONS, AND I'D LIKE TO ARGUE THAT I ACTUALLY WORK ON REPRESENTATIONS, THAT REALLY I AM ORIGINALLY A LINGUIST, AND SOME OF THE REASON I'VE BEEN ABLE TO DO SOME OF THE WORK THAT PEOPLE HAVE LIKED IS BECAUSE I'M GOOD AT COMING UP WITH REPRESENTATIONS TO PUT MODELS OVER. SO FOR THE REST OF THE TIME, SINCE THIS IS THE INTRO, I WANT TO SAY A QUICK LITTLE BIT ABOUT THE CURRENT STATE OF THE ART, SOME OF THE THINGS WE'VE ACHIEVED, AND THEN MOVE ON TO PROSPECTS FOR RESEARCH IN THE FUTURE. OKAY, WHAT HAVE WE LEARNED? IF THERE'S 1 THING WE'VE LEARNED IN THE LAST 20 YEARS OF CORPUS-BASED APPROACHES TO NLP, IT IS THE FACT THAT CRUDE TEXT STATISTICS OVER LARGE AMOUNTS OF TEXT CAN GET YOU A REALLY LONG WAY SOMETIMES. SO WHAT DOES THAT MEAN? SUPPOSE YOU WANT TO DO A TASK LIKE PROTEIN-PROTEIN INTERACTION. 
WHAT YOU CAN DO IS TAKE THE TEXT THAT CAROL WAS TALKING ABOUT AND FIND SENTENCES WITH 2 PROTEINS MENTIONED IN THEM. EVERY TIME YOU FIND 1 OF THOSE, YOU TAKE THE NAMES OF THE 2 PROTEINS AND COUNT 1, AND YOU JUST SUM THOSE OVER A LARGE AMOUNT OF TEXT, USE THE FREQUENCY, AND FILTER A LITTLE WITH SOMETHING LIKE MUTUAL INFORMATION. BUT YOU JUST COUNT THOSE THINGS UP AND YOU SAY, THOSE ARE THE PROTEINS THAT INTERACT. IT TURNS OUT THAT IF YOU HAVE TONS OF DATA, THAT ACTUALLY CAN WORK PRETTY WELL AND GIVE YOU A LOT OF INFORMATION ABOUT THE DOMAIN, LEARNING WHICH PROTEINS INTERACT WITH EACH OTHER. I THINK THAT IN SOME SENSE THAT WAS A SHOCKING DISCOVERY, BECAUSE PEOPLE SPENT A LOT OF TIME TRYING TO DO NATURAL LANGUAGE UNDERSTANDING IN THE 70S AND 80S UNDER THE ASSUMPTION THAT YOU COULDN'T DO ANYTHING UNLESS YOU HAD MODELS OF DEEP SEMANTIC STRUCTURE, AND THEN IT TURNED OUT YOU COULD HAVE NONE OF THAT STUFF AND STILL DO INTERESTING TEXT MINING WORK. WHY SHOULD YOU DO MORE? IN PARTICULAR, WHEN THESE KINDS OF CRUDE METHODS WERE FIRST INTRODUCED, WHAT PEOPLE FOUND OUT WAS THAT A LOT OF THE COMPLEX RULE-BASED METHODS THAT PEOPLE HAD BEEN BUILDING FOR YEARS DIDN'T ACTUALLY WORK AS WELL AS DOING THIS, SO MAYBE THOSE SHOULD BE PUT BACK ON THE SHELF. BUT OF COURSE I DON'T BELIEVE THAT'S THE ANSWER AND THE END OF THE STORY. THAT TAKES YOU A CERTAIN DISTANCE CHEAPLY, BUT IF YOU WANT TO DO MORE THAN THAT, IF YOU WANT TO HAVE BETTER RESULTS, IF YOU NEED TO BE ABLE TO LOOK AT THOSE SENTENCES AND SEE WHETHER THE SENTENCE IS ACTUALLY SAYING THAT THE 2 PROTEINS DO HAVE A RELATIONSHIP BETWEEN THEM, THEN YOU NEED MODELS WITH REAL NLP IN THEM, SO THEN I'M YOUR MAN. 
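[ EDITOR'S ILLUSTRATION, NOT PART OF THE TALK ] THE COUNTING RECIPE JUST DESCRIBED CAN BE SKETCHED IN A FEW LINES. THIS IS A MINIMAL SKETCH UNDER STATED ASSUMPTIONS: THE FUNCTION NAME `cooccurrence_pmi`, THE `min_count` THRESHOLD, WHITESPACE TOKENIZATION, AND THE PLACEHOLDER PROTEIN NAMES ARE ALL HYPOTHETICAL, AND POINTWISE MUTUAL INFORMATION IS USED AS ONE POSSIBLE READING OF "SOMETHING LIKE MUTUAL INFORMATION". IT COUNTS SENTENCE-LEVEL CO-MENTIONS AND SCORES EACH PAIR AS PMI(A, B) = LOG2( P(A, B) / (P(A) P(B)) ), WITH PROBABILITIES ESTIMATED AS FRACTIONS OF SENTENCES.

```python
import math
from collections import Counter
from itertools import combinations


def cooccurrence_pmi(sentences, proteins, min_count=2):
    """Count sentence-level co-mentions of known names and score them with PMI.

    sentences: iterable of strings (whitespace-tokenized here for simplicity).
    proteins:  iterable of names to look for (a hypothetical lexicon).
    Returns {(a, b): (count, pmi)} for pairs seen at least min_count times.
    """
    lexicon = set(proteins)
    single = Counter()   # number of sentences mentioning each name
    pair = Counter()     # number of sentences mentioning both names
    total = 0
    for sentence in sentences:
        total += 1
        found = sorted(set(sentence.split()) & lexicon)
        single.update(found)
        pair.update(combinations(found, 2))

    scored = {}
    for (a, b), count in pair.items():
        if count < min_count:
            continue  # frequency filter: drop rarely co-mentioned pairs
        p_ab = count / total
        p_a = single[a] / total
        p_b = single[b] / total
        scored[(a, b)] = (count, math.log2(p_ab / (p_a * p_b)))
    return scored


if __name__ == "__main__":
    toy_corpus = [
        "P1 binds P2",
        "P1 activates P2",
        "P3 is expressed",
        "P1 and P3 were measured",
        "P2 phosphorylates P1",
        "P4 is unrelated",
    ]
    for names, (count, pmi) in cooccurrence_pmi(toy_corpus, ["P1", "P2", "P3", "P4"]).items():
        print(names, count, round(pmi, 3))
```

NOTE WHAT THE SKETCH DOES NOT DO: IT NEVER CHECKS WHETHER A SENTENCE ACTUALLY ASSERTS AN INTERACTION, SO A SENTENCE SAYING 2 PROTEINS DO NOT INTERACT IS COUNTED THE SAME WAY. THAT GAP IS THE SPEAKER'S ARGUMENT FOR MODELS THAT LOOK INSIDE THE SENTENCE.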
OKAY, SO WHAT ARE SOME OF THE TOOLS THAT WE HAVE? JUST QUICKLY, CAROL ALREADY MENTIONED THE PROBLEM OF NAMES FOR THINGS, SO THIS IS NAMED ENTITY RECOGNITION. THAT'S 1 SPACE IN WHICH A LOT OF WORK HAS BEEN DONE AND A LOT OF PROGRESS HAS BEEN MADE. INTERESTINGLY, NAMED ENTITY RECOGNITION IS A REALLY NEW INVENTION, RIGHT? WE HAD MACHINE TRANSLATION IN THE 1950S AND SYNTACTIC PARSING IN THE 1960S, BUT IT'S REALLY ONLY IN 1996 THAT SOMEONE CAME UP WITH THE IDEA THAT MAYBE WE SHOULD FOCUS ON BEING ABLE TO PICK OUT THE NAMES OF THINGS IN A PIECE OF TEXT. THESE DAYS THAT'S A TASK THAT'S NORMALLY DONE WELL USING PROBABILISTIC MODELS SUCH AS CONDITIONAL RANDOM FIELD MODELS THAT GIVE PRETTY GOOD ACCURACY ON THINGS LIKE NEWSWIRE, ACCURACIES IN THE 90%S, AND NOT QUITE SO GOOD IN THE BIOMEDICAL DOMAIN. I THINK THERE ARE A LOT OF REASONS THE BIOMEDICAL DOMAIN IS HARDER: BECAUSE OF SOME OF THE KIND OF CREATIVE USE OF ABBREVIATIONS, BUT ALSO BECAUSE OFTEN THE SAME NAME IS USED FOR THINGS LIKE GENES AND GENE PRODUCTS, BECAUSE THEY'RE RELATED TO EACH OTHER, AND YOU WANT TO DIFFERENTIATE THEM IN DIFFERENT CONTEXTS. THIS IS ALSO A GOOD PLACE WHERE HAVING LEXICONS TAKES YOU A FAIR DISTANCE BUT DOESN'T TAKE YOU AS FAR AS YOU WANT TO GO, BECAUSE IT DOESN'T SOLVE THE DISAMBIGUATION PROBLEMS AND IT DOESN'T SOLVE THE PROBLEM OF LETTING YOU EXTEND GRACEFULLY TO WORDS NOT IN THE LEXICON THAT YOU SHOULD BE ABLE TO RECOGNIZE BASED ON CONTEXT. STATISTICAL PARSING: THAT'S 1 OF THE BIG SUCCESSES OF 1990S NLP, WHICH TOOK US FROM NOT BEING ABLE TO BUILD PARSERS THAT WORKED TO BEING ABLE TO HAVE PARSERS THAT YOU COULD FEED ANY SENTENCE INTO AND THAT RETURN AN ANALYSIS WHICH IS USUALLY MOSTLY CORRECT. THIS HAS MADE PARSERS AN OFF-THE-SHELF TOOL, WHICH IS NOW BEING PICKED UP TO ALLOW DEEPER NATURAL LANGUAGE APPLICATIONS WHERE YOU REALLY DO UNDERSTAND THE STRUCTURE OF SENTENCES. 
SO THERE'S BEEN A LOT OF WORK ON CONSTITUENCY PARSING IN THE U.S., BUT THERE'S BEEN INCREASING FOCUS ON DEPENDENCY PARSING, BECAUSE DEPENDENCY PARSES ARE MUCH MORE DIRECTLY SHOWING THE RELATIONSHIPS BETWEEN WORDS. IN PARTICULAR, I'VE BEEN RATHER INTERESTED IN THIS APPROACH OF STANFORD DEPENDENCIES, WHICH WE CAME UP WITH AS AN ATTEMPT TO FOCUS THE DEPENDENCY REPRESENTATION ON WHAT WAS NEEDED FOR RELATION TASKS, WITH RELATIONSHIPS BETWEEN HEAD WORDS AND DOING A CERTAIN AMOUNT OF POST-PROCESSING TO BE ABLE TO PICK UP RELATIONSHIPS LIKE DISTRIBUTING CONJUNCTIONS AND THINGS LIKE THAT. SO THAT LEADS ME INTO THIS TASK OF RELATION EXTRACTION, WHICH IS THE NEXT STEP UP FOR STARTING TO DO A BIT OF SEMANTICS, WHICH IS LOOKING AT PIECES OF TEXT AND TAKING OUT PREDICATES OF RELATIONSHIPS AND THE ARGUMENTS OF THE RELATIONS. SO IN THIS TEXT HERE, WE HAVE THESE GUYS THAT ARE INTERACTING WITH EACH OTHER, AND THEN CBFV IS NOT INTERACTING WITH EITHER OF THEM BUT IT'S ASSOCIATING WITH THE COMPLEX. SO IT IS ABOUT GETTING OUT THESE KINDS OF RELATION EXTRACTION FACTS, WHICH ARE OFTEN GOTTEN BY BUILDING MACHINE LEARNING MODELS OVER SUPERVISED DATA. 
AND SO THE INTERESTING THING IS THAT THIS IS 1 OF THE PLACES THAT STATISTICAL PARSING HAS FED INTO. IN THESE VARIOUS BIONLP CHALLENGES IN RECENT YEARS THAT LOOK AT RELATION EXTRACTION, THERE'S BEEN EXTENSIVE USE OF PARSING, AND IN PARTICULAR THE DEPENDENCY REPRESENTATION THAT I JUST MENTIONED. SO HERE'S A LITTLE DATA ANALYSIS GRAPH FROM THESE GUYS FROM FINLAND WHO WON THE 2009 BIONLP COMPETITION. WHAT THIS GRAPH IS SHOWING IS THAT, FOR THE KINDS OF RELATIONS YOU WANT TO PICK OUT, IF YOU'RE JUST LOOKING AT DISTANCE APART IN WORDS, IT'S OFTEN DIFFICULT, BECAUSE LOTS OF THE TIME THE WORDS YOU WANT TO CONNECT IN THE RELATION ARE FAR APART; SOMETHING AROUND 15% OF THE TIME THEY'RE ACTUALLY GREATER THAN 10 WORDS APART. BUT IF INSTEAD YOU LOOK AT THE DISTANCE APART IN DEPENDENCIES, THE GRAMMATICAL RELATIONS, WHICH IS WHAT THEY'RE USING, THAT BRINGS THE THINGS THAT ARE CONNECTED VERY CLOSE TOGETHER, SO OVER 3-QUARTERS OF THE TIME YOU'RE ACTUALLY AT A DISTANCE OF ONLY 1 OR 2 IN DEPENDENCIES. DEPENDENCY REPRESENTATIONS ARE ALSO USEFUL BECAUSE THEY ARE CLOSE TO SEMANTIC NETWORKS, AND SO PEOPLE ARE ABLE TO PICK THEM UP AND BUILD SEMANTIC NETWORKS BY DOING INTERPRETATION. THIS IS WORK FROM RUSS ALTMAN'S LAB WHERE THEY'RE PICKING UP PHARMACOGENOMIC FACTS BY BASING IT OFF DEPENDENCY PARSES. I HAVE A FEW MORE EXAMPLES, BUT I'LL SKIP THEM. LET ME SAY SOMETHING ABOUT DIRECTIONS. THERE ARE CLEAR DIRECTIONS FOR THE FUTURE. ONE IS HOW TO GET BEYOND THE SUPERVISED LEARNING PARADIGM. THE NLP SUCCESSES OF THE PAST 15 YEARS WERE FUELED BY THE APPROACH WHERE PEOPLE SPEND MONTHS OR SOMETIMES YEARS ANNOTATING TEXT WITH CLASSIFICATIONS FOR SOME TASK AND THEN TRAINING A SUPERVISED MODEL FOR THAT TASK. IT'S A VERY SUCCESSFUL PARADIGM; IT GAVE US THESE NAMED ENTITY RECOGNIZERS AND STATISTICAL PARSERS THAT HAVE BEEN SO USEFUL. 
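[ EDITOR'S ILLUSTRATION, NOT PART OF THE TALK ] THE CONTRAST ABOVE BETWEEN DISTANCE IN WORDS AND DISTANCE IN DEPENDENCIES CAN BE MADE CONCRETE WITH A SMALL SKETCH. THE ASSUMPTIONS ARE LOUD ONES: THE SENTENCE, THE PLACEHOLDER NAMES ProteinA AND ProteinB, AND THE `heads` ARRAY ARE HAND-WRITTEN FOR ILLUSTRATION AND ARE NOT THE OUTPUT OF ANY PARSER OR THE DATA BEHIND THE GRAPH ON THE SLIDE; `linear_distance` AND `dependency_distance` ARE HYPOTHETICAL HELPER NAMES. THE PARSE IS TREATED AS AN UNDIRECTED TREE AND THE DEPENDENCY DISTANCE IS THE NUMBER OF EDGES ON THE SHORTEST PATH.

```python
from collections import deque


def linear_distance(i, j):
    """Distance between two token positions counted in words."""
    return abs(i - j)


def dependency_distance(heads, i, j):
    """Shortest path length between tokens i and j in the dependency tree.

    heads[k] is the index of token k's head, or -1 for the root.
    Edges are treated as undirected; returns None if no path exists.
    """
    neighbors = {k: set() for k in range(len(heads))}
    for child, head in enumerate(heads):
        if head >= 0:
            neighbors[child].add(head)
            neighbors[head].add(child)

    queue = deque([(i, 0)])
    seen = {i}
    while queue:  # plain breadth-first search
        node, dist = queue.popleft()
        if node == j:
            return dist
        for nxt in neighbors[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


# Hand-written example sentence and parse (an assumption, not parser output).
tokens = ["ProteinA", ",", "which", "was", "first", "described", "in",
          "yeast", "many", "years", "ago", ",", "binds", "ProteinB"]
heads = [12, 0, 5, 5, 5, 0, 7, 5, 9, 10, 5, 0, -1, 12]

if __name__ == "__main__":
    a, b = tokens.index("ProteinA"), tokens.index("ProteinB")
    print("words apart:", linear_distance(a, b))
    print("dependency edges apart:", dependency_distance(heads, a, b))
```

IN THIS TOY CASE THE LONG RELATIVE CLAUSE PUSHES THE 2 NAMES FAR APART IN THE WORD STRING, WHILE BOTH ATTACH DIRECTLY TO THE VERB IN THE TREE, WHICH IS THE EFFECT THE SPEAKER DESCRIBES.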
BUT THERE ARE 2 POINTS TO MAKE ABOUT IT. 1 IS THAT IT DOESN'T GET RID OF LINGUISTIC KNOWLEDGE, IT JUST EXTERNALIZES THE KNOWLEDGE: THE PEOPLE WHO CONSTRUCT THE CORPUS ARE PUTTING LINGUISTIC KNOWLEDGE IN THIS EXTERNAL FORM. BUT THE OTHER PART OF IT IS THAT THE PARADIGM IS JUST REALLY COSTLY. IT'S VERY COSTLY TO UNDERTAKE EACH NEW TASK OR TO HAVE SYSTEMS THAT WORK WELL IN DIFFERENT DOMAINS, AND IT MEANS THAT WE'RE STUCK TRAINING SYSTEMS OVER SMALL AMOUNTS OF DATA, WHICH IS JUST EVER MORE PROBLEMATIC AS WE HAVE VASTER AND VASTER AMOUNTS OF DATA THAT JUST LIE EVERYWHERE AROUND US. SO IT'S KIND OF EMBARRASSING WHEN WE CAN COLLECT BILLIONS OF WORDS OFF THE WEB OR ACADEMIC ARTICLES AND THEN WE HAVE TO EXPLAIN THAT ACTUALLY OUR NAMED ENTITY RECOGNIZER IS TRAINED ON 300,000 WORDS BECAUSE THAT'S ALL THE DATA THAT WAS ANNOTATED. SO THE TASK IS TO BUILD MODELS WITH LESS SUPERVISION, AND THAT'S 1 OF THE THINGS WE'RE INTERESTED IN WORKING ON. THERE ARE MANY APPROACHES IN THE LESS-SUPERVISION PARADIGM. SOMETHING THAT I'VE BEEN KIND OF INTERESTED IN IS AN APPROACH THAT I'VE TRIED TO CALL DISTANT SUPERVISION, WHICH IS ESSENTIALLY EXPLOITING THE FACT THAT IN THE MODERN WORLD THERE'S LOTS OF FOUND INFORMATION, WHERE THERE'S SOME KIND OF SEMANTICALLY CONTENTFUL STUFF THAT IS ALREADY AVAILABLE. SO IN BIOMEDICINE THERE ARE MANY RESOURCES, DATABASES THAT TELL YOU ABOUT THINGS, THAT ARE ALREADY JUST SITTING THERE, AND THERE'S TEXT JUST SITTING THERE. CAN YOU PUT THE 2 OF THOSE THINGS TOGETHER AND DO A KIND OF IMPLIED ANNOTATION AND LEARN SOMETHING USEFUL? THAT'S WHAT WE'VE BEEN TRYING TO WORK ON, THAT KIND OF APPROACH, AND THE PROBLEM IS THAT SINCE IT'S NOT ANNOTATED DATA, IT'S NOISY IMPLIED ANNOTATION, AND YOU HAVE TO LEARN FROM IT ANYWAY. BUT IT'S SOMETHING THAT YOU CAN DO, GETTING FACTS OUT IN THAT SORT OF WAY. OKAY. FINALLY, A COUPLE OF OTHER ISSUES. ANOTHER ISSUE FOR THE FUTURE IS THAT AT THE MOMENT MOST OF OUR NLP SYSTEMS ARE PIPELINES. 
AND WE FEED INFORMATION FORWARD, AND IT'S BAD BECAUSE WE MAKE ERRORS IN EVERY STAGE AND WE JUST FEED THEM FORWARD, AND WE WOULD LIKE TO GET RID OF THAT. HUMAN BEINGS DON'T DO THAT; THEY USE KNOWLEDGE OF THE WORLD TOP DOWN TO INFLUENCE THEIR INTERPRETATIONS. SO SOMETHING I'VE WORKED ON WITH STUDENTS AS WELL IS HOW WE CAN DO MORE JOINT FORMS OF PROBABILISTIC INFERENCE TO SOLVE SOME OF THOSE PROBLEMS. BUT FINALLY, WHAT I THINK WE WANT TO BE DOING IS EMPHASIZING MORE GETTING IN THE DIRECTION OF SEMANTICS: HOW CAN WE UNDERSTAND MORE OF THE TEXT AND MAKE USE OF THAT IN OUR PROCESSING? AND 1 AREA THAT I'VE WORKED ON A BIT IS TRYING TO UNDERSTAND RELATIONSHIPS BETWEEN DIFFERENT PIECES OF TEXT, AND THE EMPHASIS HERE IS ON DEALING WITH VARIABILITY OF LINGUISTIC EXPRESSION, THAT IDEAS CAN BE ENCODED IN MANY, MANY WAYS AND YOU WOULD LIKE TO SEE THAT THEY'RE CONNECTED DESPITE THAT. THIS IS SOMETIMES REFERRED TO AS TEXTUAL INFERENCE, AND FOR MANY TASKS, WHETHER IT'S SEMANTIC SEARCH OR CLINICAL REPORT INTERPRETATION, YOU WOULD LIKE TO SEE IDEAS EXPRESSED IN VARIOUS WAYS AND BE ABLE TO WORK OUT HOW TO CONNECT THEM TOGETHER EVEN THOUGH THE WORDS AND THE EXPRESSION ARE DIFFERENT. SO THIS IS THE EXAMPLE THAT IF YOU'RE LOOKING FOR LOBBYISTS ATTEMPTING TO BRIBE LEGISLATORS, WELL, IN THIS SENTENCE IT SAYS LOBBYIST BUT IT DOESN'T SAY BRIBE OR U.S. OR LEGISLATORS, BUT WITHIN THE CONTEXT IT'S OBVIOUS WHAT'S TALKED ABOUT, AND WE WOULD LIKE OUR SYSTEMS TO BE ABLE TO DO THAT KIND OF INTERPRETATION AS WELL. AND SO WE'VE WORKED ON VARIOUS WAYS OF LOOKING AT THAT. 1 PARTICULAR APPROACH THAT I'VE BEEN INTERESTED IN IS, CAN WE BUILD WEAK LOGICS THAT WE BUILD OUT OVER OUR NATURAL LANGUAGE TOOLS AND SENTENCES THAT CAN AT LEAST LET US DO A LOW LEVEL OF INFERENCE ACROSS THE TEXT, SO THAT WE CAN SHOW THAT THERE ARE RELATIONSHIPS BETWEEN PIECES OF TEXT. I SHOULD PROBABLY CONCLUDE HERE. 
OKAY, SO WHAT I HOPED TO SHOW YOU A TINY LITTLE BIT ABOUT IS HOW PROBABILISTIC MODELS HAVE GIVEN US GOOD TOOLS FOR ANALYZING HUMAN LANGUAGE SENTENCES. THAT'S WHAT'S GIVEN US THE NAMED ENTITY RECOGNIZERS AND THE STATISTICAL PARSERS THAT ARE USEFUL FOR ALL KINDS OF PURPOSES NOW, BUT WE DON'T WANT TO STOP THERE, AND THERE'S NOW STARTING TO BE EXCITING WORK IN TEXT UNDERSTANDING AND INFERENCE BASED ON THESE FOUNDATIONS. AND THAT'S GOING TO BE THE WORK THAT'S GOING TO BE THE FOUNDATION FOR ALLOWING US TO DO MUCH MORE WITH KNOWLEDGE AND REASONING, WHICH HAS TO BE THE GOAL WHEN YOU'RE LOOKING AT USING NLP IN AREAS LIKE BOTH CLINICAL INFORMATICS AND FOR UNDERSTANDING THE RELATIONSHIPS WHEN DOING BIOMEDICAL RESEARCH. BUT OF COURSE, THERE'S STILL MUCH WORK TO BE DONE TO ACHIEVE THE KIND OF EFFORTLESS NATURAL LANGUAGE UNDERSTANDING THAT WE SEE IN OUR SCIENCE FICTION MOVIES. THANK YOU. [ APPLAUSE ] >> WE DO HAVE TIME FOR A FEW QUESTIONS. >> [INDISCERNIBLE]. --WHAT'S MISSING SO THERE'S A LOT OF KNOWLEDGE--[INDISCERNIBLE]. >> SO THE QUESTION WAS, WHEN YOU GO INTO CLINICIAN REPORTS THERE'S A LOT THAT ISN'T WRITTEN DOWN, AND IT'S ASSUMED KNOWLEDGE THAT'S EXTREMELY IMPORTANT. AND THE ANSWER TO THAT IS, I AGREE, AND I WILL SAY IT'S DIFFICULT, BUT I WILL TRY TO GIVE A LITTLE BIT OF AN ANSWER. 
BUT THIS IS ACTUALLY A PLACE WHERE THERE HASN'T BEEN MUCH WORK AND THERE NEEDS TO BE A LOT MORE WORK. BY AND LARGE NLP HAS OPERATED UNDER THE ASSUMPTION THAT HERE'S A PIECE OF TEXT, LET'S WORK OUT WHAT THE WORDS ARE AND LET'S TRY AND INTERPRET THEM AND WORK FROM THERE, AND THAT'S NOT THE WAY THAT HUMAN COMMUNICATION WORKS, AS HAS BEEN TALKED ABOUT A LOT IN THE LINGUISTIC LITERATURE AND BY OTHERS MORE CONCERNED WITH HUMAN BEINGS AND THEIR CONVERSATIONS WITH EACH OTHER. HUMAN BEINGS WORK BY ASSUMING THAT THE PERSON THAT THEY'RE SPEAKING TO UNDERSTANDS MOST OF WHAT THEY KNOW, AND THEY REALLY JUST NEED TO GIVE A FEW HINTS AND THE NECESSARY INFORMATION WILL BE IN THE OTHER PERSON'S BRAIN. AND SO YOU HAVE TO BE DOING RICH INTERPRETATIONS OF THE ACTUAL WORDS WITHIN THE CONTEXT OF THE DOMAIN KNOWLEDGE THAT'S ASSUMED, AND I THINK THAT'S AN INTERESTING CHALLENGE. IT'S ACTUALLY SOMETHING I'VE BEEN DOING A BIT OF WORK ON IN THE LAST COUPLE OF YEARS WITH CHRIS POTTS IN SEMANTICS AND [INDISCERNIBLE]. I THINK IT'S AN INTERESTING AREA FOR COMPUTATIONAL LINGUISTS TO FOCUS MORE ON, COMPUTATIONAL PRAGMATICS IF YOU WILL, HOW YOU CAN DO THE INTERPRETATION OF LANGUAGE USE, MAKING USE OF A RICH CONTEXT. AND I ABSOLUTELY AGREE THAT THAT REQUIRES KNOWLEDGE OF THE DOMAIN AND HOW TO INTERPRET PIECES OF TEXT RELATIVE TO THAT KNOWLEDGE, WHICH SEEMS LIKE IT WILL REQUIRE MUCH MORE OF THIS TOP-DOWN INTERPRETATION, BECAUSE THAT'S EXACTLY THIS KIND OF PROBLEM. THIS IS THE PROBLEM I WAS HINTING AT WITH THE SORT OF FEED-FORWARD PIPELINE. YOU GAVE EXAMPLES IN YOUR TALK WHERE IT WAS SOMETHING LIKE RH, AND IF YOU'VE WRITTEN THE SYSTEM WHERE RH IS JUST BEING INTERPRETED IN THE RH NEGATIVE, POSITIVE SENSE BY THE BOTTOM-UP INTERPRETER, THEN YOU'RE KIND OF LOST BEFORE YOU'VE BEGUN. YOU NEED TO BE ABLE TO HAVE KNOWLEDGE OF THE DOMAIN COMING DOWN TO HELP DETERMINE YOUR INTERPRETATIONS. 
>> VERY ENTERTAINING, YOUR TALK AND THE HISTORY OF HOW YOU GOT INTO STATISTICAL PARSING BECAUSE YOU NEEDED TO HAVE ALL THE RULES TO HAVE THE COVERAGE. BUT IN THE CASE WHEN YOU'RE WORKING IN THE MEDICAL FIELD, YOU ALREADY HAVE A LOT OF KNOWLEDGE THAT IS AVAILABLE, AND THE END IS NO LONGER TO PARSE THE SENTENCES. FOR ANALYSIS OF A MEDICAL RECORD, IT MAY BE THAT YOU'RE INTERESTED IN FINDING THE ACTIONABLE FINDINGS, NOT NECESSARILY IN PARSING THE SENTENCE, AND THE PROBLEM IS REVERSED, SO I WAS WANTING TO PICK YOUR BRAIN HERE. HOW WOULD YOU INTRODUCE THESE ONTOLOGIES INTO A STATISTICAL PROCESSING OF THE MEDICAL RECORD TO MAKE SENSE OF IT AND TO FIND WHAT'S ACTIONABLE FROM THERE? >> GOOD QUESTION. YES, SO OBVIOUSLY I'M NOT VERY ACTIVELY WORKING IN THE MEDICAL FIELD. YES, SO, I TOTALLY WOULD LIKE TO ARGUE THAT, YEAH, PARSING ISN'T AN END, PARSING IS A TOOL. IT'S A USEFUL TOOL IF THE STRUCTURING OF SENTENCES THAT IT GIVES YOU WILL ALLOW YOU TO DO OTHER TASKS BETTER, AND I WOULD ARGUE THAT IT DOES ONCE YOU'RE STARTING TO DO MORE SEMANTIC TASKS OF GETTING OUT THINGS LIKE RELATIONS AND EVENTS AND THINGS LIKE THAT. SO THE QUESTION THEN IS, WELL, HOW CAN WE TAKE DOMAIN KNOWLEDGE AND FEED IT IN TO DO A BETTER JOB IF WE HAVE VARIOUS KINDS OF ONTOLOGIES OR KNOWLEDGE REPRESENTATIONS. AND I THINK THE HONEST ANSWER IS THAT THAT'S SOMETHING THAT NEEDS A LOT OF WORK AND ISN'T REALLY A SOLVED PROBLEM, SO I CAN'T GIVE THE FULL ANSWER. THERE'S AN EASY ANSWER FOR THE BOTTOM LEVEL OF STUFF. THE BOTTOM LEVEL OF STUFF IS, WELL, IF YOU HAVE THINGS LIKE TAXONOMIC CATEGORIES, YOU CAN PUT TAXONOMIC CATEGORIES OF WORDS OR PHRASES INTO A STATISTICAL PARSING SYSTEM, AND THEY'LL GIVE A BIT OF GENERALIZATION AND GIVE YOU A PERCENT MORE. BUT THAT'S TOO EASY AN ANSWER, AND WHAT ABOUT ALL THE REST OF IT? I THINK IN THE OLD DAYS THERE WERE LOGICAL, SYSTEMS-BASED NLP SYSTEMS WHICH COULD EASILY MAKE USE OF KNOWLEDGE AND REASONING AND FEED IT DOWN. 
THEY EFFECTIVELY GOT BLOWN OUT OF THE WATER IN TERMS OF PERFORMANCE BY PROBABILISTIC NLP SYSTEMS WHERE THEY TRAIN THEIR FEATURES OVER THE ANNOTATED TEXT AND WENT OFF AND DID THEIR STUFF. BUT THE PROBLEM IS THOSE SYSTEMS DON'T HAVE AN EASY WAY TO INCORPORATE HIGH-LEVEL KNOWLEDGE, AND SO GENERALLY THESE DAYS PEOPLE WILL USE A STATISTICAL PARSER BECAUSE IT PERFORMS BETTER THAN ANYTHING ELSE THEY HAVE, BUT THEY HAVE NO WAY TO PUT THE HIGHER-LEVEL KNOWLEDGE INTO IT. WHAT WE'D LIKE TO BE ABLE TO DO IS SOLVE THAT PROBLEM, AND THE DIRECTION FOR SOLVING IT, I THINK, IS DOING MUCH MORE IN THE WAY OF JOINT PROBABILISTIC INFERENCE, SO THAT YOU ARE ABLE TO PERCOLATE INFORMATION ACROSS THE DIFFERENT LEVELS OF ANALYSIS RATHER THAN USING A FEED-FORWARD PARADIGM. BUT THERE ARE DIFFICULT MODEL-BUILDING AND COMPUTATIONAL CHALLENGES IN DOING THAT, SO IT'S NOT YET A SOLVED PROBLEM. >> ONE OF THE THINGS WE RECOGNIZED EARLY ON IS THAT PNEUMONIA MEANS DIFFERENT THINGS TO A RADIOLOGIST, TO A MICROBIOLOGIST, TO A PRIMARY CARE PHYSICIAN, ET CETERA. AND SO THE ONTOLOGY CATEGORIES THEMSELVES ARE VERY SLIPPERY, AND SO I GUESS I'M A LITTLE UNSURE WHICH COMES FIRST, THE CHICKEN AND EGG PROBLEM: TO DEFINE THE DOMAINS ASSOCIATIVELY AND DO STATISTICAL METHODS TO CREATE THOSE, OR TO SORT OF HAVE THEM CATEGORICALLY DEFINED AND THEN USE STATISTICAL METHODS TO TRY TO DETERMINE THE DIFFERENT SETS. I'M NOT SURE IT'S A QUESTION OR A COMMENT, BUT IT'S A PROBLEM I'M NOTICING, AND IT MAY BE A WAY TO RECONCILE A FEW METHODS, THE FACT THAT THERE ARE DIFFERENT ONTOLOGY CATEGORIES. >> A COMMENT IN RESPONSE. I MEAN, I'M A BIG, BIG BELIEVER IN DOMAIN KNOWLEDGE AND YOU WANT TO EXPLOIT DOMAIN KNOWLEDGE, AND THAT, CLEARLY, IS A LOT OF WHAT HUMAN BEINGS DO TO MAKE THEIR LANGUAGE UNDERSTANDING SO EFFORTLESS: THEY'RE USING THEIR KNOWLEDGE.
ON THE OTHER HAND, I THINK 1 OF THE ISSUES HAS BEEN THAT CATEGORICAL KNOWLEDGE OF SENSES HAS ITSELF BEEN A RATHER PROBLEMATIC NOTION, BECAUSE A LOT OF THE TIME, ALTHOUGH THERE ARE CLEARLY DIFFERENT MEANINGS FOR WORDS IN DIFFERENT CONTEXTS AND SOME PROTOTYPICAL SENSES INVOLVED, THAT'S 1 OF THE PLACES WHERE THERE'S A LOT OF FUZZINESS WHEN YOU START LOOKING AT THE DETAILS. IT'S VERY DIFFICULT TO CLEANLY CUT UP THE WORLD INTO A SET OF TAXONOMIC CATEGORIES THAT ACTUALLY WORK VERY WELL, AND I THINK THAT'S ANOTHER OF THE ISSUES THAT NLP IS STRUGGLING WITH. I MEAN, IF I JUST DO THE NONMEDICAL EXAMPLE THAT I'M MORE FAMILIAR WITH, IN GENERAL NLP WE ALL MAKE USE OF WORDNET, WHICH IS BY FAR THE LARGEST FREELY AVAILABLE TAXONOMIC DESCRIPTION OF WORD SENSES, BUT I THINK A LOT OF US ALSO FIND THAT A LOT OF THE TIME IT DOESN'T ACTUALLY WORK AS WELL AS THE KIND OF BOTTOM-UP STATISTICAL CLUSTERING OF WORDS THAT WE CAN BUILD FROM CORPORA. AND THAT'S FOR VARIOUS COMPLEX REASONS, BUT IT'S PARTLY BECAUSE THE STATISTICAL MODELS ARE KIND OF SOFTER AND CAN SEE CONNECTIONS BETWEEN THINGS IN WAYS THAT AREN'T IN THE TAXONOMY. >> GEORGE WITH COLUMBIA. SO THE APPROACH WILL DEPEND ON THE GOAL, AND YOU HAD EXAMPLES. COMING UP WITH PROTEIN INTERACTIONS IS A POPULATION EFFECT, WHEREAS KNOWING DID THE PATIENT TAKE A DRUG ON THIS DAY, YOU PROBABLY ONLY GET 1 SENTENCE, RIGHT OR WRONG, AND YOU MAY HAVE 2 DIFFERENT APPROACHES FOR THAT. THE OTHER EXAMPLE FOR THAT IS A CHESS PROGRAM: DO YOU WANT THE CHESS PROGRAM THAT EXPLAINS THE POSITION TO YOU, OR DO YOU WANT TO KNOW THE NEXT MOVE? WE DO NOT SO WELL ON THE FIRST 1 BUT BETTER ON THE SECOND, AND WATSON IS AN EXAMPLE OF TURNING SOMETHING HARD, UNDERSTANDING A SENTENCE AND MAKING A JOKE, INTO AN EASIER PROBLEM, WHICH IS PREDICTING THE NEXT SENTENCE, WHICH IS THE ANSWER ON JEOPARDY. AND SO I WONDER, IN HEALTHCARE, IF WE KNOW THAT NEXT SENTENCE, WHAT DO WE MEAN BY UNDERSTANDING: IS IT REALLY UNDERSTANDING OR JUST PREDICTING WHAT THE NEXT SENTENCE SHOULD BE?
>> THAT'S A GOOD COMMENT AND YES, I THINK THAT IS A USEFUL OBSERVATION, AND I TOTALLY AGREE THAT THE TRICK OF GETTING OUT THE PROTEIN-PROTEIN INTERACTION FACTS DEPENDS ON THE FACT THAT IF YOU CAN JUST DO A SUMMARY OVER A LARGE AMOUNT OF DATA, THEN OFTEN YOU CAN GET AWAY WITH VERY CHEAP METHODS THAT DO NO REAL ANALYSIS AND IT WILL WORK. IF YOU JUST TAKE A TON OF TEXT AND ADD UP EVERY TIME YOU SEE THINGS CO-OCCURRING, THAT WILL WORK AMAZINGLY WELL WITH NO NLP. BUT IF YOU HAVE 1 PATIENT'S DIAGNOSIS AND YOU WANT TO INTERPRET WHAT THAT IS, THOSE METHODS ARE CLEARLY COMPLETELY INAPPLICABLE AND YOU DO WANT TO DO DETAILED, COMPREHENSIVE AND RIGHT NATURAL LANGUAGE UNDERSTANDING. YEAH, ABSOLUTELY, THAT'S A GOOD POINT. >> CHRIS IS NOT GOING AWAY. WE'RE HAVING A BREAK FAIRLY SHORTLY AND I REALIZE THAT THERE ARE 3 OR 4 OF YOU WHO WOULD LIKE TO GET HIS ATTENTION. I'M SURE HE WILL BE ABLE TO ANSWER OR TALK TO YOU INDIVIDUALLY AT THAT TIME. [ APPLAUSE ] >> WELL, IT GIVES ME GREAT PLEASURE TO INTRODUCE OUR NEXT SPEAKER, PROFESSOR SERGEI NIRENBURG. HE WAS FORMERLY THE DIRECTOR OF THE FAMOUS COMPUTING RESEARCH LABORATORY AT NEW MEXICO STATE UNIVERSITY AND HE'S NOW DIRECTOR OF THE INSTITUTE FOR LANGUAGE AND INFORMATION AT UNIVERSITY OF MARYLAND BALTIMORE COUNTY. HIS BOOK, ONTOLOGICAL SEMANTICS, WRITTEN WITH VICTOR RASKIN, IS COUCHED IN A RIGOROUS FORMAL LINGUISTIC AND ONTOLOGIC FRAMEWORK AND IS FIRMLY UNDERPINNED BY INSIGHT INTO THE FACTS OF NATURAL LANGUAGE, AND I THINK IT SERVES AS A MAJOR RESOURCE FOR GUIDING RESEARCH IN THE FIELD. AND I'M VERY MUCH LOOKING FORWARD, DR. NIRENBURG, TO HEARING SOME OF THOSE INSIGHTS TODAY. [ APPLAUSE ] >> WELL, YOU ARE ABOUT TO HEAR A VARIATION OF THE PREVIOUS TALK. WE AGREE WITH CHRIS ON MUCH THAT IS FACTUAL ABOUT THE HISTORY, AND BY THE WAY, MY Ph.D. WAS ALSO IN LINGUISTICS, SO THERE WE GO.
BUT 1 OF THE DISTINCTIONS WOULD BE THAT YOU HEAR BEFORE YOU HEARD A POSITION OF SOMEBODY WHOM I COULD CALL A SUPPLY SIDER THAT IS LOOKING AT WHAT WE HAVE AND HOW GOOD IT IS, WHAT WE HAVE AND HOW MUCH BETTER IT IS NOW THAN IT USED TO BE AND MY POSITION AND VIEW OF THE WHOLE THING IS THAT OF A DEMAND SIDE. WE NEED CERTAIN THINGS AND WE NEED TO GET THEM BY HOOK OR BY CROOK. SO THAT'S WHAT WILL BE, I HOPE, THE RESULT OF WORK. OKAY, SO I MENTIONED A FEW SYSTEMS IN THIS TALK AND THERE ARE MANY, MANY, MORE IMPORTANT 1S. SO, SO, THE IMPORTANT POINT IS THAT REPLICATION SYSTEMS NLP ARE DIFFERENT ON A VARIETY OF SPECIFIC PARAMETERS, THE CORE AND FUNCTIONALITIES OF MOST BIOMEDICAL AND NLP SYSTEMS ARE IDENTICAL TO THOSE THAT SUPPORT GENERAL NLP. AND ANOTHER POINT, GENERAL POINT TO MAKE IS THAT MOST RECENT APPLICATION SYSTEMS BOTH GENERAL PURPOSE, AND BY BIOMEDICAL, ARE HYBRID WHEN IT COMES TO METHODS. WHENEVER PRACTICAL, THEY USE ANY CRIME TO BE ABLE TO LOGARITHMS, STATISTICAL OR OTHERWISE AND KNOWLEDGE RESOURCES ANY KIND TO FILL IN THE LAKUNA, WITH NEWLY DEVELOPED CAPABILITY. SO THE POINT IS JUST AS CHRIS SAID BEFORE THAT ALL SYSTEMS AND METHODS ULTIMATELY, LINGUISTICS BASED AND THE SIMPLE EXAMPLE IS, WELL, IN THE CONTEXT OF LANGUAGE INCREASE, EIGHT YEARS AGO, DOING FEATURE ENGINEERING IS OTHERWISE KNOWN AS DOING LINGUISTICS, SO WE HAVE ALWAYS BEEN DOING IT, WHETHER WE DECLARED THIS OR NOT. AND ALSO, ANOTHER PART OF WHAT'S KNOWN AS GENERALLY SPEAKING STATISTICS BASED APPROACH IS REQUIRES IN THE--APPROACH THAT WAS DISCUSSED BRIEFLY BY CHRIS, TOO, A LOT OF WORK AND JUST THE DATA SET CREATION THAT IS ANNOTATE AND CORPORATE, SO THIS COULD BE VIEWED AS A PARTICULAR WAY OF CARRYING OUT KNOWLEDGE INQUISITION, SO IN STATISTICS BASED SYSTEMS, THE PREFERRED METHOD IS LEARNING TOTALLY FROM LARGE TEXT COLLECTIONS BUT THIS IS CURRENTLY NOT QUITE FEASIBLE THEREFORE, LEARNING THIS AND CARRIED OUT IN A SEMANTIC WAY AND FIRST A LOT OF ANNOTATION WORK HAS TO BE DONE. 
BUT, THE APPROACH TO WILL ACQUISITION USED IN WHAT'S ROUGHLY KNOWN AS RULE BASED SYSTEM TO USE HUMAN LABOR, NOT FOR FACILITATE AND LOWER ALGORITHMS BUT FOR REQUIRE STATIC KNOWLEDGE AND PROCESSING RULES DIRECTLY. RIGHT. I WOULD LIKE TO CONCENTRATE ON ONE OVERRIDING TYPE OF PROCESSING THAT IS USED IN BIOMEDICAL AND O. P. WHICH IS SOME KIND, SOME VERSION OF INFORMATION EXTRACTION, AND THE VERY BRIEFLY, MENTION TO THIRDS REPLICATIONS CURRENT APPLICATIONS THAT YOU USE THIS IN MOST OF THE SYSTEM THAT IS--WHICH ARE TRAINING [INDISCERNIBLE] AND CLINICAL DECISION SUPPORT. SO THE BRIEFST ILLUSTRATIONS FROM--BRIEFEST ILLUSTRATIONS FROM A FAMOUS NLP APPROACH, I ISN'T SAY IT'S A SYSTEM, IT'S MORE THAN A SYSTEM AND MEDLEY AND WHAT WE SEE IN THIS SYSTEM IS THE CORE OF THIS APPROACH IS THE CORE. TECHNOLOGY IS BEING ABLE TO TAKE FREE STANDING TEXT AND PRODUCE A STRUCTURE OUTPUT FROM IT. WITH THE FEATURES, IN THE VALUE SETS TAKEN FROM A PARTICULAR DOMAIN AND ALLOWING, ALSO FOR ELEMENTS OF FREE TEXT IN SOME OF THE SALES OF SUCH TABLES. SO THERE IS A WHOLE FAMILY OF SYSTEMS BASED ON THIS APPROACH AND WELL, A NAMED A FEW, I'M SURE THAT ARE MORE, AND THEIR PROCESS OF THE PIPELINE THAT IS USED THERE, IS PRETTY TYPICAL FOR MANY SYSTEMS BUT DO STANDARD PROCESSING STAGES INCLUDING RECOGNITION IS SENTENCE BREAK AND OTHER SEGMENTATION ISSUES. THEN A PARTICULAR KIND OF SIN TACTIC ANALYSIS, OR SEMANTIC ANALYSIS IS USED IN THE MED FAMILY THAT IS A LINGUISTICS EXTREME PROJECT IT'S A PARTICULAR KIND OF A SUBLANGUAGE GRAMMAR AT LEAST AT THE BEGINNING IT WAS LIKE THAT. 
AND THEN, ONE IS TO CREATE THE TARGET INFORMATION STRUCTURE WITH FEATURED VALUE SETS SPECIFIC TO PARTICULAR BIOMEDICAL DEMCDONALD'SS AND APPLICATION AND ENCODE INDEED THE VARIETY OF SCHEMEATA, IT CAN BE PRINCIPLE AND IT HAS BEEN IN THIS FAMILY OF SYSTEMS AND THEN, A SET OF DOMAIN AND APPLICATION REENTER DUAL SETS AND OTHER SOURCES LEXICON FOR DEPAIRED FOR DEXTRAN SULFATE TECTING RELEVANT CONTEXTS AND PHRASES PHRASES AND DOCUMENTS AND MAPPING THEM INTO APPROPRIATE VALUES OF THE TARGET FEATURES, WITH THE MAIN TIME TRYING TO MAKE IS THAT THIS IS A SEARCH AND EXTRACTION TYPE OF TASK AND THIS IS NOT THE OLDSMOBILE WAY OF THINKING ABOUT IT WE'LL GET TO THIS ISSUE LATER. OTHER SYSTEMS THAT ARE WELL, DIFFERENT AND THE SAME GENERAL DEMAND, IPT GREATER CLUED MEDIA LAB DEVELOPED HERE AND¨— THE INPUT ANALYSIS STEPS IN THIS PROGRAM AND ALSO BETTER PROCESSING MODULES, SIN TACTIC AND VARIANT GENERATION AND THEN THE MAP INTO METATHESAURUS TRAITS INCLUDES FINDING A CANDIDATE METHA THESAURUS SPRINGS MENTION ASTERISKS LEAST SOME OF THE STRINGS AND PROCESS THE INPUT AND FIND THE LONGEST OR I SHOULD SAY BEST MATCH, RIGHT. AND AN OPTIONAL STEP INCLUDES THE DISAMBIGUATION WHICH IS DONE IN--USING THE CO LOCATION BASED METHOD OR TEXTURAL CONTEXT. , SO SIM RAIS ANOTHER PROGRAM PRODUCED IN THIS--WELL, I DON'T KNOW WHETHER THIS IS THIS BUILDING BUT AT NLM, RIGHT. AND IT'S PRESENTING ITSELF AS A SYSTEM DESIGNED TO RECOVER SEMANTIC PROPOSITIONS FROM BIOMEDICAL TEXT, USE ANOTHER SPECIFIED SIN TACTICAL ANALYSIS AND STRUCTURE DOMAIN KNOWLEDGE FROM THE UMLS, AND SO, IT GENERATES--IT'S BASED ON A DIFFERENT SPECIFIC SET OF MODULES AND ELEMENTS OF KNOWLEDGE, BUT THE GENERAL IDEA IS TO RECOVER IN PROPOSITIONS FROM THE TEXT THAT IS TAKEN SOME PARTS OF THE TEXT OUT AND USING THEM DIRECTLY AND WELL, THE QUESTION IS, WHAT ABOUT THE REST OF THE KNOWLEDGE, HOW DOES IT KNOW IT'S NOT USEFUL--I MEAN IN THE INPUT? 
OKAY, SO, FOR VARIOUS REASONS WHICH WE COULD GUESS, ADDITIONAL WORK IN THIS AREA HAS BEEN DONE TO TRY TO RECOVER SPECIFIC FEATURES OFTEN CALLED CONTEXT FEATURES THAT HELP RAISE THE QUALITY OF THE RESULTS OF INFORMATION EXTRACTION. SO A FEW EXAMPLES IN THE CONTEXT SYSTEM WHAT IS TARGETED IS NEGATION, SUBSET OF TIME RELATED PROPERTY KNOWN AS THE TEMPORALITY FEATURE AND SELECTIVE CASE ROW ASSIGNMENT. IN THIS CASE, EXPERIENCE. SO IN THE MEDICAL SYSTEM NEGATION IS ALSO TARGETED AND ALSO WHAT IS TARGETED IS INDICATORS OF SEVERITY OF MEDICAL EVENTS IN THE TEXT, AND ALSO ELEMENTS OF QUANTIFICATION. WELL, EVENT EXTRACTION IS A POPULAR OBJECTIVE ALSO AS WITNESSED FOR EXAMPLE BY THE POPULARITY OF THE--BI THE TASK EFFORTS--BY THE TASK EFFORTS, A LOT OF WORK HAS BEEN DONE THERE SO THE ULTIMATE GOAL OF SUCH EFFORTS IS TO IMPROVE RESULTS OF EXTRACTION BY DETERMINING SELECT TED CONTEXTURE, THE CO LOCATION REENTERED FOR DISAMBIGUATION. THIS WORKS, TOO. SOPHISTICATED JUST A FEW WORDS ABOUT APPLICATIONS. THERE IS THE--THERE HAS BEEN QUITE A LOT OF WORK IN THE AREA OF TRAINING, OF SYSTEMS, NOT OHM IN THE MEDICAL DOMAIN BUT ALSO IN THE MEDICAL DOMAIN. THE TYPICAL STATE OF THE COGNITIVE SKILL TRAINING SYSTEMS WHICH INTEREST US MORE THAN SYSTEMS THAT TRAIN PEOPLE TO WORK ON MANIKINS. THEYAX TRACT USER INFORMATION FROM USER INPUT TO HEALTH THROUGH A DECISION PATH THROUGH A DECISION WHOSE NOTES COULD RESPOND TO THE TRAINING TASMG. IN OTHER WORDS THE TRAINING SYSTEM CAN BE VIEWED AS A DECISION SUPPORT SYSTEM AND NLP IN THESE SYSTEM SYSTEM TYPICALLY CARRIED OUT IN THE MANNER THAT IS USED IN GENERAL IN INFORMATION EXTRACTION BASED EFFORTS. SO A LOT OF THESE SYSTEMS HAVE BEEN DEVELOPED IN INDUSTRY AND MED CASES THERE ARE SIMULATION CASES AND OTHERS. BUT THERE ARE ALSO VERY WELL KNOWN EFFORTS WITH CIRC COMPUTER AND THERE HAS BEEN A LARGE EUROPEAN APPROACH AT EVIP THAT HAS A NUMBER OF PUBLICATIONS IN THIS DOMAIN, TOO. 
BY THE WAY, THE METAPHOR THAT IS USED IN MANY OF THESE SYSTEMS IS THE METAPHOR OF THE VIRTUE OF PATIENTS. SO, WHAT--THE CONTENT OF THESE VIRTUAL PATIENTS IS--IS A MATTER OF CHOICE ESSENTIALLY BUT METAPHOR IS USED VERY PROUDLY. OKAY, SO, WHEN IT COMES TO NATURAL LANGUAGE PROCESS, THIS IS--THIS IS A VIEW OF A VIRTUAL, ONE PARTICULAR VIRTUAL PATIENT BASED TRAINING SYSTEM, SO,--I'M SORRY, THIS PROBABLY COULDN'T BE SEEN VERY WELL, BUT THE USER ASKS, WHY HAVE YOU COME TODAY? AND SO THE SYSTEM, THIS KEY WORD IS--SORY THIS QUESTION IS KEY WORD MATCH CONONICLE QUESTION THAT COULD BE ANSWERED BY THE SYSTEM THAT IS WE--WE SAY THAT THIS IS THE SAME AS CAN YOU TELL ME WHY YOU HAVE COME TODAY? AND THE SYSTEM KNOWS HOW TO ANSWER THAT QUESTION, SO, IT GIVES A CANDID ANSWER. WELL WEEK AT THE SHOPPING CENTER AND SO ON AND SO FORTH. SO THIS IS MORE OR LESS A STATE-OF-THE-ART APPROACH OF THIS PARTICULAR EXAMPLE, WAS TAKEN FROM DOUGLAS CHESSEER'S SYSTEM, WHICH WAS HIS Ph.D. IN 2004. A FEW WORDS ABOUT DECISION SUPPORT SYSTEMS, THERE HAVE BEEN RECENTLY SEVERAL VERY GOOD DETAILED SURVEYS OF THE STATE-OF-THE-ART IN MEDICAL DECISION SUPPORT SYSTEMS THAT ARE--THESE REFERENCES THAT ARE OTHERS AND WELL, IF ANYBODY NEEDS THE FULL REFERENCES, I HAVE THEM. THE DECISION SUPPORT CHALLENGE ALSO DISCUSS INDEED THE REPORT FROM THE NATIONAL RESEARCH COUNCIL ON COMPUTATIONAL RESEARCH TECHNOLOGY FOR HEALTHCARE. SO THESE MATERIALS PROSIGHTED EXCELLENT--ROUGH ATOM VIED EXCELLENT ANALYSIS OF THESE ISSUES AND COVER A VERY LARGE PERCENTAGE OF SYSTEMS AND PROJECT. AND I WILL JUST MAKE A FEW COMMENTS. SO DECISION SUPPORT SYSTEMS CAN BE USEFUL IN BOTH CLINICAL PRACTICE AND THEN RESEARCH ENVIRONMENTS. CLINICAL DECISIONS SUPPORT SYSTEMS, MAY HAVE A GREAT SOCIETAL IMPACT, BUT FACE GREATER ISSUES RELATED TO USER ACCEPTANCE AND I WILL COMMENT ON THIS A LITTLE BIT LATER. HUMAN COMPUTER INTERACTION IN MEDICAL DECISION SUPPORT SYSTEMS CAN TAKE DIFFERENT FORMS, THOUGH MANY SYSTEMS IF NOT MOST INVOLVE NLP. 
I WOULD LIKE TO POINT OUT THAT NLP CAPABILITIES IS REQUIRED FOR HCI, ARE NOT EXACTLY THE SAME AS THOSE NEEDED TO SUPPORT INFORMATION EXTRACTION. ALSO DECISION SUPPORT SYSTEMS HAVE SEPARATE DECISION MAKING MODULES THAT RELAY ON NLP MODULES FOR DECISION MAKING KNOWLEDGE, AND COMMUNICATE THEM WITH THE USER. SO, NLP CAPABILITIES IS REQUIRED FOR THESE DECISION MAKING MODEL--MODULES, ARE ALSO NOT EXACTLY THE SAME AS THOSE NEEDED TO SUPPORT EITHER I. E. OR H. C. IMPLETE RIGHT. SO NOW TO THE POINT DISCUSSED IN THE QUESTION PERIOD OF OF THE PREVIOUS PRESENTATION. THERE HAS BEEN, IN FACT, WORK ON TRYING TO INCORPORATE KNOWLEDGE ABOUT THE WORLD AND SITUATION AND THE OTHER NONTEXTURAL INFORMATION IN BUILDING SYSTEMS THAT ARE DEVOTED MEDICAL APPLICATIONS, THE TWO EXAMPLES THAT I WOULD LIKE VERY BRIEFLY TO TALK ABOUT, CHESTER WHICH IS WORK FROM JIM SULIN'S GROUP WHICH IS AN APPLICATION OF THERE WELL KNOWN DYING UPS IN MEDICAL DOMAIN AND THERE IS OUR WORK ON SOMETHING RECALL THE REAGENT PROJECT, WHICH HAS RESULTED SO FAR IN THE DEVELOPMENT OF TWO PROOF OF CONCEPT SYSTEMS, ONE OF THEM IN CHILDREN CALLED MARYLAND VIRTUAL PATIENT NLP AND THE OTHER WHICH IS A CLINICIANS ADVISER, BOTH SYSTEMS BASED ON THE SAME TECHNOLOGY. SO, CHESTER REMINDS PATIENTS ABOUT MEDICATION SCHEDULING TO HELP WITH COMPLIANCE, A VERY IMPORTANT NEED, SO THE IMPORTANT THING TO UNDERSTAND IS THAT, THE UNSHADED PART OF THIS ARCHITECTURAL DIAGRAM ARE ACTUALLY TAKEN FROM GENERAL PURPOSE APPLICATIONS OF TRIPS. AND WHAT IS DONE SPECIFICALLY FOR THIS APPLICATION INVOLVES--DO I„i HAVE--OOPS IT'S STOPPED ON THAT--EXACTLY NOT WHAT I WANTED TO DO. AH, YES. INVOLVES SOMETHING THAT IS CALLED THE BEHAVIORIAL AGENT. OBVIOUSLY THIS IS A METAPHOR, RIGHT SO WE NEED TO UNDERSTAND IT THAT WAY BUT THIS IS A MODEL OF A DECISION MAKER THAT TAKES INPUT FROM THE OUTSIDE AND GENERATES RESPONSES, BUT A RESPONSES THAT ARE ACTUALLY PRODUCED ON THE FLY IN REASON, ALL RIGHT. 
SO, SO THAT'S CHESTER I NEED TO PUSH ON BECAUSE THERE IS NOT TOO MUCH TIME AND THIS IS THE--WELL, DIAGRAM--I'M SORRY, IT'S DIFFICULT TO READ, ALL OF THESE ARCHITECTURAL DIAGRAMS AND EVERYONE IN THE ROOM HAS SEEN TOO MANY OF THEM SO FAR BUT THIS IS THE OVERALL ARCHITECTURAL DIAGRAM OF THE OTHER AGENT APPROACH AND PLEASE NOTE, ALSO ONCE AGAIN, THAT THE LANGUAGE SET OF PROCESSORS WHICH BY THE WAY IS A SET OF MAYBE OVERA DOZEN DIFFERENT MODULES. SITS TO THE SIDE AND THE CENTRAL PART IS ACPIED BY A LARGE COLLECTION OF VARIOUS KINDS OF KNOWLEDGE, ABOVE THE WORLD BUT ALSO ABOUT LANGUAGE AND ABOUT THOSE PARTS OF THE WORLD THAT ARE PARTICULARULAR AGENTS IN THE ENVIRONMENT THAT IS THIS AGENT NEEDS TO KNOW ABOUT OTHERS BECAUSE OTHERWISE THIS DIAL UP [INDISCERNIBLE] WILL SOUND TERRIBLE [INDISCERNIBLE] EVENTUALLY WHEN WE ARE ACTUALLY DOING. SO THESE ARE TWO DIFFERENT SOURCES OF PERCEPTION, THIS IS IN LANGUAGE THAT IS THIS AGENT CAN CONVEY INFORMATION AND THAT IS THE INTERCEPTION ENGINE THAT IS WE HAVE BUILT A SIMULATION OF PATIENTS OF PHYSIOLOGY AND PATHOLOGY, NOT COMPLETE BUT QUITE LARGE, BECAUSE, WITH THIS LATER, AND THE PATIENT CAN FOR INSTANCE SIMULATE THE PERCEPTION OF PAIN OR SLEEPINESS AND WHAT HAPPENS THEN IS THAT THE DEMADE CAN MODULE IS ACTIVATED AFTER NEW FACTS COMING INTO THE SYSTEMS AND AS A RESULT OF THAT, VARIOUS KINDS OF ACTIONS BECOME POSSIBLE, MIND YOU, I HAVE SAID ACTUALLY NOTHING ABOUT THE METHODS, THIS SPECIFIC METHODS THAT WILL BE USED IN IMPLEMENTING VARIOUS COMPONENTS OF THIS SYSTEM. WE HAVE, EVEN OUR SALES HAVE DIFFERENT OPTIONS IN THIS REGUARD ASK THIS WILL GO ON AND DONE FOR INSTANCE PROBABLISTIC INFERENCE THAT WAS MENTIONED BY CHRIS WILL BE ONE OF THE THINGS USED, BUT THIS IS RIGHT HERE WHAT WE BELIEVE IS NEEDED OR SOMETHING OF THIS KIND. IT IS NEEDED TO IMPROVE THE OVERALL QUALITY AND POSSIBLE ACCEPTANCE OF SUCH SYSTEM, SO IN THE MODELS OF ARTIFICIAL INTELLIGENCE AND NOT THE PATIENT, THEN IN THE HUMAN. 
NOW IN THE HUMAN WHERE THE HUMAN PLAYS THE ROLE OF THE TRAINEE. SO WE HAVE SO FAR, COVERED IN THE PRF OF CONCEPT SYSTEM, THE SEVEN DISEASES OF THE ESOPHAGUS, THAT'S ALWAYS GOOD, TO DO SO FAR, BUT THE PROCESSING, THAT'S INVOLVED IN THIS SYSTEM IS QUITE INVOLVED. SO WHEN THE TRAINEE AATTEMPT WHAT IS BRINGS THEM HERE FOR INSTANCE, IT GENERATES A MAIN PRESENTATION FOR ALL OF THIS, AND THIS IS THE RESULT OF--WELL, FAIR ENOUGH--ALL OF THIS IS THE RESULT OF THE PROCESSING BY A LARGE NUMBER OF ELEMENTS AND EVEN THIS IS NOT NECESSARILY THE LAST STEP IN PROT SEASES BEFORE THE SYSTEM DECIDES TO RESPOND IN SOME WAY, BUT, JUST TO TELL--THE POINT I'M TRYING TO MAKE IS THIS: WE'RE TRYING TO BUILD SEMANTIC AND DISCOURSE PRAGMATIC REPRESENTATIONS, FOR THE ENTIRE INPUTS, NOT JUST LOOKING FOR INDIVIDUAL'LL MENTORSHIP SKILLS--INDIVIDUAL ELEMENTS. SO FAIR ENOUGH, THAT IS THE MAIN MESSAGE FROM THIS. I HAVE LITTLE TIME REMAINING, AND I WANT TO PUSH ON TO THIS. BECAUSE THIS IS WHAT WE'RE„i UP AGAINST. THE ULTIMATE CRITERION OF ANY SUCCESS OF APPLICATION IS EPPED USER ACCEPTANCE SO WE HAVE LOOKED THROUGH A VARIETY OF PAPERS DISCUSSING ATTITUDES, IN IT CASE TO CLINICAL DECISION SUPPORT SYSTEMS BY USERS AND HERE IS JUST A LIST OF VARIOUS OPINIONS, SO CLINICAL DECISION SUPPORT SYSTEMS ARE FAILING TO OFFER TAYLORED CLINICAL PROPERTY, WHAT USERS WANT DIRECT ANSWERS, SPECIFIC RECOMMENDATIONS, AND EMPHASIS ON TREATMENT AND BOTTOM LINE ADVICE, CUSTOMIZABILITY, AND THE SYSTEMS THEMSELVES MUST BE MODULAR NOT MONOLITHIC, [INDISCERNIBLE] SUGGESTS THE FIVE RIGHTS FOR CLINICAL DECISION SUPPORT SYSTEMS, GIVE THE RIGHT INFORMATION TO THE RIGHT PERSON, RIGHT FORM, RIGHT CHANNEL RIGHT TIME WHICH IS NICE, JUST RHETORICALLY WHEN IT'S NICE. SO ALSO WELL KNOWN ISSUE OF ALERT FATIGUE AND SO ON, NEEDS TO BE RAISED. RIGHT, SO THIS IS AN IMPORTANT POINT, I.T. 
APPLICATIONS IN GENERAL ARE DESIGN INDEED WAYS THAT PROVIDE LITTLE SUPPORT FOR COGNITIVE TASKS OF CLINE IPGZS OR THE WORK FLOW OF THE PEOPLE WHO MUST HAVE ACTUALLY USED THE SYSTEM AND ANOTHER CODE FROM THE SAME SOURCE. COGNITIVE SUPPORT IS NOT WELL SERVED BY THE TASKS SPECIFIC AUTOMATION SYSTEMS SO FAR. SO THAT'S THE STATEMENT FROM THREE YEARS AGO. OKAY, ALSO, WE NEED TO ADDRESS USERS BIASNESS BECAUSE CLINICIANS TEND TO NOT WANT TO HELP. THIS IS A WELL KNOWN PROBLEM. SO WHAT DO USERS WANT? CLINICAL SUPPORT SYSTEMS? TO BE LIKE IT'S CLEAR THAT THEY WANT THEM TO RESEMBLE HELP FROM STANLEY CUBE RICK'S 2001 SPACE ODESY, SO HERE IS IF YOU REMEMBER H. A. L. AND HERE'S A SAMPLE DIALOGUE YOU MIGHT REMEMBER SOME OF YOU. DAVE SAYS OPEN THE POD BAY DOORS H. A. L. AND H. A. L. SAYS, I'M SORRY DAVE, I'M AFRAID I CAN'T DO THAT. WHAT'S THIS PROBLEM? WELL THIS IS NOT MEDICAL APPLICATIONS BUT I THINK YOU KNOW WHAT THE PROBLEM IS, GUFF AS WELL AS I DO. I DON'T KNOW WHAT YOU'RE TALKING ABOUT. I KNOW THAT YOU AND FRAN WERE PLANNING TO DISCONNECT ME AND I'M AFRAID THAT'S SOMETHING I CANNOT ALLOW TO HAPPEN. SO WHAT DOES--WHAT ARE SOME OF THE CAPABILITIES H. A. L. DEMONSTRATES IN IN DIALOGUE. SO UNDERSTANDING THE QUESTION REQUEST AND THE ACTION, NONNATIONAL LIBRARY OF MEDICINE COMPOUND BAY DOORS REFERENCE RESOLUTION, WHAT'S THERE, WHY THAT'S IMPORTANT, TO KNOW ABOUT IT. WHAT'S THAT? REFERRING TO? PLIGHTNESS, WELL THAT IS EMOTIONAL INTELLIGENCE. POLIT NIGHTNESS AND I'M AFRAID I CAN'T DO THAT SO BRIEF DESCRIPTION, IT'S VERY IMPORTANT PART, MODEL AND TECHNOLOGY ACTIONS AND PLANED AND WHAT. SELF-AWARENESS, MODEL ONE'S SELF-INCLUDING GOALS AND PLAN, I'M AFRAID THIS IS SOMETHING I CAN'T ALLOW. AND THEN LOWER LEVEL ISSUES, NOT LESS IMPORTANT, SUCH AS EMBEDDED MODALITY, CANNOT ALLOW TO HAPPEN. SO A LITTLE BIT MORE, DIFFERENT PART OF THE DIALOGUE, I CAN TELL FROM THE TONE OF YOUR VOICE, DAVE THAT, YOU ARE UPSET, WELL, EMOTION RECOGNITION BY A SPEECH RECOGNITION THAT WOULD BE GREAT. 
MODEL OF FEEL ACCIDENT OF OTHER SPECIES [INDISCERNIBLE] PROCESS, BY THE WAY THIS, IS ACTUALLY A DIAGNOSTICS AND TREATMENT QUESTION. IT SAYS TAKE A STRESS PILL AND GET SOME REST. I DON'T KNOW WHETHER H. A. L. WAS LICENSED BUT HERE WE GO, IT WAS ALL IN THE SCRIPT, RIGHT? AND SO ON AND SO FORTH. YOU UNDERSTAND. NOW, WE'RE]I„--WE ALL UNDERSTAND WE'RE NOT THERE YET. THERE WAS A BOOK, THERE WAS A BOOK OF CONTRIBUTIONS CALLED H. A. L.'S LEGACY PUBLISHED IN 97, AND JOE [INDISCERNIBLE] SAYS IS WE'RE IN THE YEAR 2001 DO WE HAVE A COMPUTE THEY'RE SOUNDS LIKE THE H. A. L. PORTRAYED BY ACTOR DOUGLAS REIGN, NO, NOT YET, THE GREATEST OBSTACLE IS THE MACHINE'S INABILITY TO COMPREHENNED WHAT IT IS SAYING OR HEARING. WELL, IT'S CONCENTRATING ON SPEECH BUT IT'S THROWN--EQUALLY THROUGH WITH RESPECT TO LANGUAGE ISSUES. TOO AND REMEMBER, WE ARE TALKING HERE ABOUT THE DEMAND SIDE, NOT THE SUPPLY SIDE. SO ROGER SHAN'T PREDICTABLY SAYS TO UNDERSTAND LANGUAGE AS WELL AS HE DOES, BY THE WAY NOTE THAT THE PRONOUN HE IS USED, NOT IT, TO DESCRIBE H. A. L. WELL THE COMPLETE MODEL OF THE WORLD, WELL MAYBE NOT COMPLETE BUT SUFFICIENTLY COMPLETE MODOF THE WORLD THAT INCLUDES UNDERSTANDING OF HIS OWN GOALS, THE GOALS OF THOSE AROUND HIM AND THE RELATIVE SIGNATURES 95 KAPS OF WHICH IN ADDITION,--SIGNIFICANT OF WHICH WE WILL HAVE TO UNDERSTAND ALM THE WAYS TO REFER TO SUCH GOALS AND SO O. SO WITH H. A. L., THE SIGNIFICANT PROGRESS HAS BEEN MADE OVER THE YEARS, THE ABOVE MENTIONED OBSTACLES ARE STILL VERY MUCH PRESENT IN 2012. SO WE NEED TO ACKNOWLEDGE BASE PROCESSING AND WE MUST FROM A SOCIETAL POINT OF VIEW, I SUPPOSE, WE MUST THANK THE PEOPLE WHO HAVE BEEN PURSUING STATISTICAL APPROACHES BECAUSE THEY FREED US FROM THE NECESSITY TO BE CONCERNED ABOUT CERTAIN APPLICATIONS THAT WE DON'T NEED TO BE CONCERNED ANYMORE ABOUT WHICH WE DON'T NEED TO BE CONCERNED ANYMORE, LIKE FOR INSTANCE, MACHINE TRANSLATION. 
I STARTED BY RESEARCH AS A MISSION TRANSLATION RESEARCHER IN TECHNOLOGY BASED PACKA DIME AND INDEED, THE PROBLEM WAS TAKEN AWAY FROM US. WONDERFUL. WE CAN CONTRAIT ON WHAT IS ACTUALLY REALLY DIFFICULT. SO, IF YOU WANT TO ADDRESS THE TASK OF OVERCOMING THIS OBSTACLE, HEAD ON, WELL THE QUESTION S&P THE CURRENT AREA PREVALENTLY THE MOST SIGNATURESSATIVE CANTILY PROCESSING OR WAS THIS MOTIVATED BY MAYBE EXTRA SCIENTIFIC CONSIDERATIONS. HOW MUCH DO I HAVE? WELL, I WILL MENTION THIS GROUND RESEARCH CHALLENGE FOR I.T. AND MEDICAL IN THE MEDICAL DOMAIN, PUBLISH INDEED IN NATIONAL RESEARCH COUNCIL REPORT, PATIENT CENTERED COGNITIVE SUPPORT, EMERGED AS ANOTHER ACTION, GROUND RESEARCH CHALLENGE DURING THE COMMISSION'S' DISCUSSIONS AND THERE IS DISCUSSION ABOUT WHAT PEOPLE MEAN BY THAT BUT THE IDEA IS THAT YOU SEE THESE PEOPLE DON'T TALK NECESSARILY ABOUT THE METHODS TO BE USED, THEY TALK ABOUT THE NEEDS AND WE'RE PROBABLY--PROBABLY SHOULD WELL, REACT TO THAT. SO, WE'RE FACING THE GRAND CHALLENGE PROGEC, LET'S TRY TO COMPARE ITS NOT EXTREMELY DEEP THOUGHTS BUT LET'S TRY TO COMPARE THE SYSTEM OF IT PROJECT WITH WELL KNOWN GRAND CHALLENGE PROJECT SO THE MANAT AN PROJECT EXPANDITTURES. WELL, IN 2012, THERE WERE ALMOST 900 MILLION DOLLARS. 
IN THIS WORK, MUCH OF THE WORK REALLY IS DEVELOPMENT, TOO, FOR INSTANCE IN KNOWLEDGE ACQUISITION IS SUCH, SO THERE IS THIS NUMBER OF THE HUMAN GENOME PROJECT BETWEEN 98 AND 2003, SPENT ALMOST FIVE BILLION DOLLARS AND THE FUNDING DOES NOT INCLUDE FUNDING OF THIS PROJECT SO, WITH REGARD TO PROJECT TO MACHINE THAT CAN MEANINGFULLY COMMUNICATE WITH PEOPLE, IN MY OPINION, IT IS AT LEAST OF THE SAME COMPLEXITY AS THE TASKS BEFORE MANAT AN AND HUMAN AND GENOME PROJECTS ACTUALLY THINK OF OF THIS PROJECT AS MUCH MORE COMPLEX, THIS MEANS THE QUEST OF THIS PROJECT WILL BE AT LEAST COMMISERATE, SO, THE MAN HAT AN AND HUMAN GENOME PROJECTS ADDRESS SOCIETAL NEEDS, THAT ARE MORE IMMEDIATE THAN THE NEED FOR OUR PROJECT IN MY OPINION AND THIS IS A POINT WHICH IS NOT SCIENTIFICKIC AT ALL SO WATSON WAS MENTIONED AS A PROJECT HERE BEFORE, IT'S PROBABLY THE LARGEST EVER IN THE APROJECT THOUGH, I DON'T KNOW WHAT THE ACTUAL NUMBERS ARE AND IT IS INDEED A SPECTACULAR APPROACH AND THE QUESTION IS DOES IT BRIDGE ENOUGH OF THE GAP? AND I SUPPOSE WITH THIS, THOUGHT I WILL LEAVE YOU. THANK YOU. [ APPLAUSE ] >> WELL, THANK YOU VERY MUCH, WE'LL TAKE ONE QUESTION AND THEN WE'LL GO TO THE BREAK AND ANY OTHERS YOU CAN TALK TO THE DOCTOR DURING THE BREAK? THERE ARE ANY QUESTIONS? WELL THEN LET'S HAVE A BREAK, OF 15 MINUTES NOW AND DURING THIS TIME, DR. NIRENBURG AND I WILL LOCATE THE FIVE MILLION THAT YOU NEED. [LAUGHTER] >> ALL RIGHT, LADIES AND GENTLEMEN, LET'S CONTINUE WITH PANEL ONE COMBINING STATISTICS AND LINGUISTICS STRUCTURE. SORT OF A MATCHUP OF THE PREVIOUS TALKS. IT'S BEING CHAIRED BY SOMEONE I THINK MANY OF YOU KNOW, GUERGANA SAVOVA, SHE'S A PROFESSOR OF PEDEIATE RICKS WORKING WITH ZACH AND THE ITWO BTWO GROUP AND SHE WILL INTRODUCE THE MEMBERS OF HER PANEL. DOCTOR GUERGANA. >> SO WE HAVE THREE OUTSTANDING PANELISTS AND THEY HAVE PRESENTATIONS FROM WITH INVESTIGATORS WITH DEGREES IN LINGUISTICS AND INVESTIGATORS WITH DEGREES IN COMPUTER SCIENCE. 
SO THE FIRST PRESENTER IS DAN MOLDOVAN, WE ALL KNOW DAN, I WILL DO A BRIEF INTRODUCTION. HE IS PROPROFESSOR OF COMPUTER SCIENCE OF UNIVERSITY OF TEXAS AT DALLAS AND ALSO THE CO DIRECTOR OF THE HUMAN LANGUAGE TECHNOLOGY RESEARCH INSTITUTE THERE. PREVIOUSLY HE HAD FACULTY POSITIONS AT THE UNIVERSITY OF SOUTHERN CALIFORNIA AND SOUTHERN METHODIST FORT IN DALLAS. HE WAS ALSO A PROGRAM DIRECTAR AT NSS, WHETHER WHILE ON SEBATTICICLE FROM UNIVERSITY OF CALIFORNIA. DAN IS ALSO FOUNDER OF LIMDA, CORPORATION, A TEXAS-BASED COMPANY SPECIALIZE NOTHING NATURAL LANGUAGE PROCESSING, PRODUCTS AND SOLUTIONS. PROFESSOR MULDOVAN'S RESEARCH INTEREST ARE IN SEMANTICS SIM ANT AND EXPLICITY KNOWLEDGE AND DATA INTO SEMANTIC THROUGH PUT. THIS HE'S CO AUTHORED MORE THAN 300 TECHNICAL PAPERS IN NATURAL LANGUAGE PROCESSING AND ARTICLE FICIAL INTELLIGENCE AND DISTRIBUTED PROSEWSING. HE AS A Ph.D. IN ELECTRICAL ENGINEERING AND COMPUTER SCIENCE FROM COLUMBIA UNIVERSITY NEW YORK CITY. DAN? >> [ APPLAUSE ] HELLO, EVERYBODY, IT'S NICE TO BE HERE TO SHARE WITH YOU MY RECENT RESULTS. THIS TALK IS MORE ABOUT TRANSFORMING STRUCTURE DATA TEXT INTO SEMANTIC THROUGH PUT. SO MY POINT OF VIEW IS HABIT OR SEMANTICS. I'M GOING TO TALK ABOUT SOME TOOLS THAT KEY USE TO USE ADVANCE APPLICATIONS AND TALK ABOUT A WAY OF REPRESENTING KNOWLEDGE WHICH IS HIRE ARCH CALAWAY AND TALK ABOUT SEMANTIC APPLICATIONS AND PUTTING TOGETHER CALCULATIONS AND MORE RELATIONS AND HOW TO BUILD ONTOLOGY AFTER WE TRANSFER TEXT INTO SEMANTIC THROUGH PUT AS POSSIBLE TO DOMAIN ONTOLOGYST AND I'M GOING TO FOCUS ON ONE APPLICATION, THE DOCUMENT SIMILARITY WHICH I THINK IS RELEVANT FOR MANY FIELDS IN THIS AREA. 
AND AT THE END I'LL MAKE INFORMATIONS ABOUT STATISTICAL VERSION VERSES SEMANTIC DRIVEN APREACHES SO HERE YOU HAVE THE PIPELINE OF OF NLP TOOLS THAT EVERYBODY WITH US WORK IN NATURAL LANGUAGE PROCESSING HAVE, SOME PEOPLE HAVE MORE, SOME PEOPLE HAVE LESS, SOME PEOPLE HAVE MORE ADVANCED TOOLS OR STATE-OF-THE-ART, SOME PEOPLE USE OTHER PEOPLE'S TOOLS BUT NEVERTHELESS, YOU HAVE TO START FROM ORGANIZATION AND THEN PART OF SPEECH TAGGING AND YOU HAVE TO FIND THE BOUNDARIES OF WORDS AND SENTENCES AND THEN DO SOME THEORY RECOGNITION, CONCEPT TAGGING, SIN TACTIC PARSING, AND WORSE AMBIGUATION CONTEXT, DETECTION, CONTEXT DETECTION, IT'S ENOUGH SAID ABOUT THAT AND WE TRY TO INJECT CONTEXT INTO OUR NLP PROCESSING BECAUSE I THINK THAT'S VERY IMPORTANT CONTEXT FOR EXAMPLE, AND I'M NOT GOING TO SAY ANYTHING ABOUT CONTEXT, I MENTION IT HERE, IT COULD BE ABOUT PATIENT, ABOUT DOCTOR, ABOUT SOME SPACE CONTEXT, SOME TIME CONTEXT, AND CERTAIN THINGS ARE TRUE ONLY IN SOME CONTEXT AND NOT IN SOME OTHER CONTEXT. SEMANTIC PARSING THIS IS EXTRACTED TO SEMANTIC CORRELATION, REFERENCE WITHIN DOCUMENT, GROSS DOC YOU WANTS,ENT IS AND EVENT EPTS AND CALCULUS, SO AT THE END, IF YOU MANAGED TO GO THROUGH THIS PIPELINE, THEN AT THE END, YOU HAVE SOME SEMANTIC THROUGHOUT PUTS AND IN THE COMMERCIAL NLP IS GOING THIS DIRECTION WITH THE RDF STANDARDS THAT ARE ADAPTED BY A NUMBER OF COMPANIES, AND IT'S A WAY TO EXTRACT THE MEANINGFUL TEXT AND THEN CAN YOU PERFORM SOME MORE ADVANCED APPLICATIONS. SO IF YOU LOOK AT THIS PIPELINE, THE ESPECIALLY THE LOWER LEVEL MODULES USE A LOT OF STATISTICAL AND RULES BASED APPROACHES AND YET, THE MODULES AT THE TOP AND THEN THE APPLICATIONS WILL BE MORE SEMANTICALLY BASED BECAUSE THAT'S HOW I VIEW THIS. WE HAVE TO HAVE RESOURCES IN ADDITION TO THIS MODULES LIKE WORD NET, EXTENDED WORD NET THIS, IS SOMETHING THAT WAS DONE AT UTC, EVENT NET AND LEXICON CHANGES AND ALSO TO BE ABLE TO BUILD DOMAIN ONTOLOGYST. 
SO I'M SAYING THAT AFTER YOU MANAGE TO TRANSFORM THE UNSTRUCTURED TEXT INTO STRUCTURED KNOWLEDGE, THEN YOU CAN DO SOME REASONING AND SOME OTHER MORE ADVANCED APPLICATIONS. HERE'S A HIERARCHICAL WAY OF REPRESENTING KNOWLEDGE. WE START WITH TEXT AT THE LOWER LEVEL, AND THIS IS LIKE A PYRAMID, A PYRAMID BECAUSE WE WANT TO ABSTRACT THE INFORMATION THAT WE EXTRACT SUCH THAT AT THE VERY TOP, THE WHOLE TEXT IS DOMINATED BY TWO EVENTS THAT DESCRIBE THE WHOLE MEANING OF THE TEXT. AND YET, AT THE LOWER LEVEL, YOU HAVE THE LEXICAL INFORMATION, THE WORDS, AND THEN MAYBE SOME LOGIC FORMS, AND THEN YOU HAVE THE SEMANTIC RELATIONS THAT HAVE BEEN EXTRACTED. SO IF YOU WANT TO REASON ABOUT TEXT YOU HAVE TO IDENTIFY THESE KEY EVENTS OR CONCEPTS THAT DOMINATE THE TEXT, AND YET IF YOU NEED MORE DETAILS YOU GO DOWN TO THE LOWER LEVEL AND YOU FIND MORE INFORMATION. HERE'S AN EXAMPLE OF A SENTENCE THAT CAN BE REPRESENTED HIERARCHICALLY: THE PATIENT'S EYE PAIN WAS ASSOCIATED WITH A SURGICAL PROCEDURE IN [INDISCERNIBLE] LACTIC ACID. SO DOWN HERE, WE HAVE THE CONCEPTS AND PERHAPS SOME ENTITY ASSOCIATED WITH THE CONCEPTS, AND THE PART OF SPEECH AND THE SENSES FROM WORDNET. AND THEN ABOVE THAT, ON TOP OF THAT, NOT DISASSOCIATED BUT ON TOP, CONNECTED TO THIS CONCEPT LEVEL, IS THE SET OF SEMANTIC RELATIONS I EXTRACT FROM THIS SENTENCE. FOR EXAMPLE, THE EYE IS PART OF THE PATIENT, THE EYE IS WHERE THE PAIN IS LOCATED, THE PATIENT IS THE EXPERIENCER OF THE PAIN. SURGICAL PROCEDURE IS THE CAUSE OF THE PAIN. SURGICAL PROCEDURE IS A PROCEDURE, AND THERE'S A VALUE OF PROCEDURE THAT'S SURGICAL. AND ON TOP OF THAT, THE WHOLE SENTENCE IS ABOUT PAIN, WHICH IS A STATE, AND THE FACT THAT THERE IS ALSO A PROCEDURE, WHICH IS AN EVENT. AND THEN THE EVENT RELATIONS: PROCEDURE CAUSES THE PAIN, AND PROCEDURE HAPPENED TO BE BEFORE THE PAIN.
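THE THREE LEVELS JUST DESCRIBED FOR THE EYE-PAIN SENTENCE CAN BE WRITTEN DOWN AS A SMALL DATA STRUCTURE. THIS IS ONLY A MINIMAL SKETCH, NOT THE SPEAKER'S SYSTEM: THE RELATION LABELS (`PART-OF`, `LOCATION`, `EXPERIENCER`, `CAUSE`, `IS-A`, `VALUE`, `BEFORE`), THE ARGUMENT ORDER, AND THE HELPER `details_for` ARE ASSUMPTIONS MADE FOR ILLUSTRATION.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    name: str  # relation label (hypothetical naming, not the speaker's inventory)
    x: str     # first argument
    y: str     # second argument


# Lower level: concepts found in the sentence.
CONCEPTS = ["patient", "eye", "pain", "surgical procedure", "procedure", "surgical"]

# Middle level: semantic relations listed in the talk for this sentence.
SEMANTIC_RELATIONS = [
    Relation("PART-OF", "eye", "patient"),
    Relation("LOCATION", "eye", "pain"),
    Relation("EXPERIENCER", "patient", "pain"),
    Relation("CAUSE", "surgical procedure", "pain"),
    Relation("IS-A", "surgical procedure", "procedure"),
    Relation("VALUE", "surgical", "procedure"),
]

# Top level: the state and event that dominate the sentence,
# plus the relations that hold between them.
KEY_CONCEPTS = {"pain": "state", "procedure": "event"}
EVENT_RELATIONS = [
    Relation("CAUSE", "procedure", "pain"),
    Relation("BEFORE", "procedure", "pain"),
]


def details_for(concept):
    """Go down one level: semantic relations that mention the given concept."""
    return [r for r in SEMANTIC_RELATIONS if concept in (r.x, r.y)]
```

REASONING WOULD START FROM `KEY_CONCEPTS` AND `EVENT_RELATIONS` AND CALL `details_for("pain")` ONLY WHEN THE LOWER-LEVEL DETAIL IS NEEDED.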
SO THIS IS ONE WAY TO ABSTRACT THE INFORMATION FROM THE SENTENCE, BECAUSE IT'S EASIER TO REASON ABOUT THESE HIGHER FORMATIONS THAN IT WOULD BE TO GET ENTANGLED IN THESE WORDS THAT ARE AT A LOWER LEVEL. SO ONE KEY MODULE IS THE SEMANTIC PARSER, AND FOR SOME TIME NOW WE HAVE USED A SET OF 26 SEMANTIC RELATIONS. WE WORKED THOSE OUT, AND THERE'S A REASON FOR HAVING THESE RELATIONS AND NOT OTHER RELATIONS. THESE RELATIONS ARE GENERATED BY THE SEMANTIC PARSER, AND THEY ARE RELATIONS BETWEEN A VERB AND ITS ARGUMENTS, LIKE AGENT, THEME, RECIPIENT, MANNER, WHOOPS, WHAT DID I DO WRONG? SORRY, I TRIED TO MOVE THE MOUSE. >> IF YOU PUT THE MOUSE IN THE UPPER RIGHT-HAND CORNER, THAT WILL HAPPEN. >> ALL RIGHT, GOT IT. AND THEN ALSO FROM COMPLEX NOMINALS. YOU HAVE HERE THE PATIENT'S EYE, BUT YOU WANT TO GET THE RELATION THAT THE EYE IS A PART OF THE PATIENT AND ALSO IS THE LOCATION WHERE THE PAIN IS. SO WE WANT TO EXTRACT AS MANY SEMANTIC RELATIONS AS POSSIBLE. I WANT TO MAKE THE DISTINCTION BETWEEN A SEMANTIC PARSER AND A ROLE LABELER, WHICH BASICALLY EXTRACTS ONLY THE VERB-ARGUMENT RELATIONS. THERE'S A PAPER DESCRIBING THIS SEMANTIC PARSER, AND I'LL SAY MORE ABOUT THAT. THE PERFORMANCE MEASURED IS ABOUT 60%. AS FOR THE STEPS THAT OCCUR IN THE SEMANTIC PARSER, FIRST IS THE BRACKETING, AND THAT'S VERY IMPORTANT. IF YOU MAKE A MISTAKE THERE, THEN THE WHOLE THING WILL BE WRONG, ALL THE RELATIONS WILL BE WRONG. SO FOR EXAMPLE, THE SUGAR INDUSTRY ANALYST WOULD BE DIFFERENT FROM A FEMALE INDUSTRY ANALYST. HERE SUGAR INDUSTRY MODIFIES THE ANALYST, AND IN THE SECOND CASE FEMALE MODIFIES THE INDUSTRY ANALYST. THEN YOU HAVE TO DETECT THE ARGUMENTS OF THE RELATION. THE RELATIONS ARE OF THIS FORM: A RELATION HAS TWO ARGUMENTS, X AND Y, AND WE ESTABLISH A SEMANTIC RELATION BETWEEN THOSE TWO ARGUMENTS, AND THERE IS AN INVERSE OF THE RELATION IF YOU REVERSE THE ARGUMENTS, Y AND X. 
THEN WE DO FILTERING OF THE ARGUMENTS, BECAUSE THE ARGUMENTS FOR SOME RELATIONS CAN ONLY TAKE SOME SEMANTIC VALUES. AND THEN WE USE STATISTICAL APPROACHES, SO STATISTICAL APPROACHES ARE REQUIRED TO BUILD THESE MODULES. WE EXTRACT FEATURES AND THEN WE USE A NUMBER OF CLASSIFIERS: SUPPORT VECTOR MACHINES, [INDISCERNIBLE], NAIVE BAYES, AND A METHOD DEVISED FOR SOME RELATIONS, AND THEN FINALLY WE PERFORM CONFLICT RESOLUTION. SO IT'S POSSIBLE TO EXTRACT THE RELATIONS WITH SOME ACCURACY, AND THAT'S A VERY IMPORTANT STEP TOWARDS STRUCTURING KNOWLEDGE FROM TEXT. THIS IS WHAT EXTRACTS MORE RELATIONS FROM TEXT, IN ADDITION TO WHAT THE ROLE LABELER OR SEMANTIC PARSER CAN PROVIDE, ESPECIALLY THE RELATIONS BETWEEN CONCEPTS THAT ARE FAR APART IN THE SENTENCE, LIKE CHRIS WAS TALKING ABOUT. LET'S TAKE THIS EXAMPLE: JOHN WENT TO THE SHOP TO BUY FLOWERS. A TYPICAL LABELER WOULD EXTRACT THE AGENT RELATION, JOHN IS THE AGENT OF GOING, AND THE GOING IS TO THE SHOP, AND THE PURPOSE OF GOING IS TO BUY, AND THE THEME OF BUY IS FLOWERS. SO THESE ARE TYPICAL RELATIONS THAT ARE EXTRACTED BY SEMANTIC PARSING. BUT THEN IF YOU LOOK AT THE SENTENCE, THERE ARE OTHER SEMANTIC RELATIONS THAT CAN BE EXTRACTED. FOR EXAMPLE, THE LOCATION OF JOHN IS AT THE SHOP, AND JOHN IS THE AGENT OF BUYING, NOT ONLY OF GOING, AND JOHN HAS THE INTENT TO BUY, AND ALSO PRESUMABLY JOHN MAY POSSESS FLOWERS AFTER HE BOUGHT FLOWERS. SO THERE IS ANOTHER INTERESTING IDEA HERE, THAT WE MAY ATTACH PROBABILITIES, BECAUSE WITH POSSESSION HERE WE'RE NOT CERTAIN WHETHER HE BOUGHT THE FLOWERS OR NOT. HE WENT TO THE SHOP TO BUY FLOWERS, BUT WE MAY NOT KNOW THAT HE ACTUALLY GOT THE FLOWERS. SO YOU MIGHT ATTACH SOME PROBABILITIES TO THESE SEMANTIC RELATIONS. SO YOU SEE HERE THE PARSE TREE, AND ALL THESE LONG CONNECTIONS BETWEEN FAR-APART WORDS ARE PROVIDED BY THE SEMANTIC CALCULUS. 
SO JUST TO MAKE A POINT ABOUT THIS CALCULUS: USING THOSE RELATIONS WE OBTAIN AXIOMS OF THIS KIND. FOR EXAMPLE, AGENT CAN BE COMBINED WITH PURPOSE, SO THIS WILL BE LIKE: IF X IS THE AGENT OF AN EVENT Y, AND Y HAS A PURPOSE Z, THEN X IS THE AGENT OF Z. IT'S INTUITIVELY CORRECT. SO THESE INFERENCE AXIOMS, WE HAVE EIGHT OF THESE, WERE INSTANTIATED FOR THE FIRST 1000, AND WE GOT 36% MORE RELATIONS THAN WITHOUT USING SEMANTIC CALCULUS, AND THE ACCURACY OF THOSE RELATIONS WAS JUST AS HIGH AS, OR HIGHER THAN, THAT PROVIDED BY THE SEMANTIC PARSER. SO YOU HAVE A WAY TO EXTRACT EVEN MORE RELATIONS FROM TEXT, AND THAT'S WHAT WE WANT, BECAUSE AFTER I'VE EXTRACTED RELATIONS I CAN THROW THEM AWAY IF I PLEASE, BUT THE POINT IS TO EXTRACT AS MUCH SEMANTICS AS POSSIBLE. SEMANTIC CALCULUS CAN ALSO HELP WITH EXTRACTING NEW HIGH-LEVEL RELATIONS FROM TEXT. SO IN A DIFFERENT DOMAIN, COUNTERTERRORISM, YOU ARE INTERESTED IN ASSOCIATIONS BETWEEN PEOPLE, AND ONE APPROACH WOULD BE TO WRITE SOME SPECIALIZED INFORMATION EXTRACTION TOOL TO EXTRACT ALL KINDS OF FORMS OF ASSOCIATION, LIKE COMMUNICATION, GATHERING, EMPLOYMENT, TRADE, SPATIAL. THAT WOULD BE VERY TIME CONSUMING. SO ANOTHER WAY IS TO IDENTIFY THESE SUBTYPES OF ASSOCIATION, AND THEN HAVE SOME [INDISCERNIBLE] OF SEMANTIC RELATIONS IN THE FORM OF AXIOMS, SOMETHING OF THIS FORM: BOB IS THE AGENT OF WRITING, AND THE THEME OF WRITING IS A LETTER, AND THE LETTER WAS RECEIVED BY MARY, AND FROM THAT WE CONCLUDE THAT BOB HAS COMMUNICATED WITH MARY AND BOB IS ASSOCIATED WITH MARY. SO THAT IS AN INSTANTIATION OF THIS AXIOM, WHICH IS A GENERAL AXIOM THAT COMBINES THREE SEMANTIC RELATIONS AND ALSO PUTS RESTRICTIONS ON THE ARGUMENTS OF THE RELATIONS, LIKE [INDISCERNIBLE] AND Z HAS TO BE A WRITING MATTER. SO WITH AXIOMS LIKE THIS YOU CAN QUICKLY, I ARGUE, IMPLEMENT SOME OF THESE HIGH-LEVEL RELATIONS FOR WHICH YOU WOULD OTHERWISE HAVE TO TRAIN A SYSTEM TO DO INFORMATION EXTRACTION. 
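AS A ROUGH ILLUSTRATION OF THE AXIOM COMPOSITION DESCRIBED ABOVE, THE FOLLOWING PYTHON SKETCH APPLIES THE AGENT-PURPOSE AXIOM TO THE RELATIONS FROM THE "JOHN WENT TO THE SHOP TO BUY FLOWERS" EXAMPLE. THE RELATION TUPLE, THE AXIOM TABLE, AND ALL FUNCTION NAMES ARE HYPOTHETICAL AND CHOSEN ONLY FOR ILLUSTRATION. THIS IS NOT THE SPEAKER'S IMPLEMENTATION, AND IT COVERS ONLY TWO-RELATION AXIOMS WITHOUT ARGUMENT RESTRICTIONS OR PROBABILITIES.

```python
from typing import NamedTuple


class Relation(NamedTuple):
    name: str  # relation label, e.g. "AGENT"
    x: str     # first argument
    y: str     # second argument


# Hypothetical axiom table (an assumption for illustration):
# (R1, R2) -> R3 encodes  R1(x, y) and R2(y, z)  =>  R3(x, z).
AXIOMS = {("AGENT", "PURPOSE"): "AGENT"}


def apply_axioms(relations):
    """Return the set of new relations inferred by one pass of composition."""
    known = set(relations)
    inferred = set()
    for r1 in relations:
        for r2 in relations:
            target = AXIOMS.get((r1.name, r2.name))
            # The shared middle argument y must match for the axiom to fire.
            if target is not None and r1.y == r2.x:
                new = Relation(target, r1.x, r2.y)
                if new not in known:
                    inferred.add(new)
    return inferred


# Relations a parser might produce for "John went to the shop to buy flowers".
parsed = [
    Relation("AGENT", "John", "go"),
    Relation("PURPOSE", "go", "buy"),
    Relation("THEME", "buy", "flowers"),
]

inferred = apply_axioms(parsed)
print(inferred)
```

HERE THE ONLY NEW RELATION IS THAT JOHN IS THE AGENT OF BUYING, WHICH IS THE LONG-DISTANCE RELATION THE TALK SAYS A ROLE LABELER ALONE WOULD MISS.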
NOW ANOTHER POINT I WANT TO TOUCH UPON IS: CAN WE BUILD DOMAIN ONTOLOGIES AUTOMATICALLY? AND I ARGUE THAT YES, IT'S POSSIBLE [INDISCERNIBLE], AND AS FOR HAVING THESE ONTOLOGIES AVAILABLE FOR DIFFERENT PURPOSES, YOU ALL KNOW THE ONTOLOGIES ARE VERY USEFUL. HAVING TRANSFORMED THE TEXT INTO THE SEMANTIC TRIPLES, IT'S THEN POSSIBLE TO BUILD THE ONTOLOGIES BY CLASSIFYING THE KNOWLEDGE, ORGANIZING THE KNOWLEDGE INTO HIERARCHIES, FORMING AN ONTOLOGY. SO, REALLY, DERIVING SEMANTICS FROM THE TEXT IS THE KEY TO THIS [INDISCERNIBLE]. SO THESE ARE SOME OF THE STEPS. BASICALLY YOU START WITH SOME SEEDS, THIS MAY BE OPTIONAL, OR SOME EXISTING ONTOLOGIES, AND THESE EXPAND AS YOU HAVE NEW INCOMING DOCUMENTS, AND THEN USING THE SEMANTIC TRIPLES YOU CAN BRING OTHER CONCEPTS INTO THE PICTURE AND THEN YOU CLASSIFY THEM. ONE APPLICATION THAT SHOWS THE POWER OF THE SEMANTIC TRIPLES AND ONTOLOGIES THAT I TALKED ABOUT COULD BE DOCUMENT SIMILARITY, AND THIS COULD BE USED FOR DIAGNOSIS OR RESEARCH PAPERS OR DECISION-MAKING RULES OR OTHER APPLICATIONS. OTHER APPROACHES THAT HAVE BEEN USED IN THE PAST ARE STATISTICAL APPROACHES, LDA AND CONDITIONAL RANDOM FIELDS, AND OTHER APPROACHES, FOR EXAMPLE EVENT-BASED OR ONTOLOGY-BASED. SO LET ME SHOW YOU AN EXAMPLE OF ONTOLOGY-BASED. SUPPOSE WE HAVE THIS, THIS IS THE SAME SENTENCE AS BEFORE: THE PATIENT'S EYE PAIN WAS ASSOCIATED WITH SURGICAL PROCEDURE AND POLY-L-LACTIC ACID. NOW IN THE [INDISCERNIBLE] COLLECTION, THE BEST MATCH IS THIS DOCUMENT, AND THIS DOCUMENT WAS FOUND BECAUSE WE HAD AN ONTOLOGY THAT WE USE. SO NONE OF THESE CONCEPTS HERE, LIKE LATERAL CANCERS, SURGICAL--SO NOW--THANK YOU. SO WITH THE CONCEPTS EXTRACTED FROM MESH, YOU SEE THEM HERE, ONE OF THE WORDS IN THE TEXT IS A KIND OF PAIN, AND THEN THIS CONCEPT IS LINKED TO THE [INDISCERNIBLE], BUT THEN, THAT'S NOT ENOUGH. 
YOU WANT TO HAVE SOME OTHER HIERARCHY OF CONCEPTS EXTRACTED FROM OTHER SOURCES, LIKE FOR EXAMPLE AN ENCYCLOPEDIA: THIS IS PART OF FACE, THIS IS PART OF EYE AND SO ON. SO BY BUILDING A LARGER ONTOLOGY ON TOP OF MESH, WE ARE ABLE TO EXTRACT THAT DOCUMENT. SO FINALLY LET ME MAKE SOME COMMENTS ON STATISTICAL VERSUS SEMANTIC-DRIVEN APPROACHES, BECAUSE THAT'S THE TOPIC OF THE PANEL. THE STATISTICAL APPROACHES ARE VERY GOOD BECAUSE THEY HAVE A LARGE VOLUME OF DATA AND THEY'RE CERTAINLY MUCH MORE ROBUST, AS JOE HAS MENTIONED THIS MORNING. SOME OF THE PROBLEMS WITH STATISTICAL APPROACHES ARE THAT THEY ONLY GIVE YOU APPROXIMATE DECISIONS AND THEY CAN'T TELL YOU WHAT'S GOING ON THERE IF YOU HAVE AN ERROR. THEY MAKE MORE ERRORS, I BELIEVE, AND REQUIRE A LOT OF TRAINING, AND THAT'S CERTAINLY TRUE. THE SEMANTIC-DRIVEN APPROACHES PROVIDE FINER GRAIN AND THEY HAVE HIGH PRECISION IN THESE INSTANCES. THEY LEVERAGE MACHINE LEARNING WITH KNOWLEDGE-DRIVEN FEATURES, ONTOLOGIES, SEMANTIC RELATIONS, AND THEY'RE EASY TO CUSTOMIZE. SOME OF THE PROBLEMS ARE THAT THEY ARE COMPUTATION INTENSIVE, AND TO BUILD KNOWLEDGE SOURCES CAN ALSO BE LABOR INTENSIVE. THE QUESTION IS, CAN WE COMBINE THE TWO APPROACHES FOR THE BETTER? YES, CERTAINLY THESE TWO APPROACHES CAN BE COMBINED. ONE NATURAL WAY IS TO INTERTWINE THEM, AND WE SEE THAT ALL THE TIME IN DESIGNING DIFFERENT MODULES WITHIN THAT PIPELINE, STATISTICAL APPROACHES AND SEMANTIC. ANOTHER WAY IS TO USE STATISTICAL IN THE BEGINNING AND FILTER OUT SOME OF THE RESULTS, AND THEN DO A HIGHER-PRECISION PASS USING SEMANTICS, ESPECIALLY FOR HIGH-LEVEL APPLICATIONS. OR YOU PUT SEMANTICS IN FRONT, AND THEN WITH THE FEATURES THAT YOU GENERATE USING SEMANTICS YOU CAN APPLY THE STATISTICAL METHODS. OR AN EVEN BETTER WAY IS TO [INDISCERNIBLE], AND WE'RE DOING THAT IN ONE APPLICATION IN COREFERENCE. BY DOING THAT, THE OUTPUT SHOULD BE BETTER THAN ANY OF THE OUTPUTS OF THESE MODULES. SO I'M DONE. 
[ APPLAUSE ] >> WE WILL TAKE QUESTIONS AT THE END OF THE PRESENTATIONS, SO WE WILL HAVE 30 MINUTES FOR QUESTIONS. IF YOU WRITE DOWN THE QUESTIONS YOU HAVE, WE WILL APPRECIATE IT. SO THE NEXT PRESENTER IS PROFESSOR PHILIP RESNIK. HE HOLDS JOINT APPOINTMENTS IN THE DEPARTMENT OF LINGUISTICS AND THE INSTITUTE FOR ADVANCED COMPUTER STUDIES. HE RECEIVED HIS BACHELOR'S DEGREE IN COMPUTER SCIENCE AT HARVARD, AND HIS PH.D. IN COMPUTER AND INFORMATION SCIENCE AT THE UNIVERSITY OF PENNSYLVANIA IN 1993, AND HE HAS WORKED IN BOTH INDUSTRY AND ACADEMIA. HE WORKED FOR BBN, IBM WATSON, AND MAYBE HE CONTRIBUTED TO THE JEOPARDY SYSTEM, AND ALSO SUN MICROSYSTEMS. DR. RESNIK'S RESEARCH IS ON COMBINING [INDISCERNIBLE] IN NATURAL LANGUAGE PROCESSING, WITH APPLICATIONS TO MACHINE TRANSLATION, TRANSLATION CROWDSOURCING, AND COMPUTATIONAL SOCIAL SCIENCE. HIS CURRENT WORK IS SUPPORTED BY A NUMBER OF GOVERNMENTAL AGENCIES AS WELL AS INDUSTRY FUNDING. AND HE ALSO WAS A FOUNDER OF CODERYTE, WHICH IS ONE OF THE COMPANIES THAT ACTUALLY APPLIES NLP TO THE CLINICAL DOMAIN FOR THE USE CASE OF GENERATING BILLING CODES. SO ... DR. RESNIK. [ APPLAUSE ] >> THANKS, LET'S SEE IF I CAN GET THIS THING STARTED. GREAT. I WANT TO SAY THANKS TO THE ORGANIZERS TO START WITH. I THINK THIS IS A VERY, VERY TIMELY DISCUSSION TO BE HAVING, ON A LOT OF DIMENSIONS, IN NLP AND LOOKING AT HEALTHCARE GENERALLY. SO I WANT TO START OUT REINFORCING AND REITERATING THE THINGS I HEARD EARLIER. I DON'T NEED TO SPEND A LOT OF TIME ON IT, BUT I WILL TAKE A DIFFERENT PERSPECTIVE, AS YOU'LL HEAR. EVERYBODY IS BY NOW FAMILIAR WITH THE REVOLUTION THAT TOOK PLACE IN NATURAL LANGUAGE PROCESSING IN THE LATE 80S AND EARLY 90S, ILLUSTRATED HERE BY THE PERCENTAGE OF STATISTICAL NLP PAPERS AT THE FIELD'S TOP CONFERENCE. 
SO THAT'S ALL FAMILIAR. THERE WAS A REAL REVOLUTION IN LANGUAGE TECHNOLOGY THAT WAS EVERY BIT AS PROFOUND FOR THAT FIELD AS THE [INDISCERNIBLE] REVOLUTION OR THE CHOMSKY REVOLUTION OF SCIENCE, AND YOU KNOW CHRIS' BOOK IS A PERFECT EXAMPLE OF THE TRANSITION OF THE FIELD. THE OTHER ONE IS THE GREEN BOOK I CO-EDITED CALLED THE BALANCING ACT, WHICH WAS ALL ABOUT THE FACT THAT PEOPLE WERE FIGHTING ABOUT THIS ISSUE IN THE EARLY 1990S. THERE WAS A NEW APPROACH AND PEOPLE NOT USING IT, AND THERE WAS CONFLICT AT THE TIME, AND I DID THE RODNEY KING THING AND SAID, CAN'T WE ALL JUST GET ALONG? AND THE IMPORTANT POINT HERE IS THAT BY THE TIME THE BALANCING ACT BOOK CAME OUT, UNFORTUNATELY FOR THE SALES, THAT DISCUSSION WAS OVER. THE PICTURE THAT EMERGED IS ONE WHERE KNOWLEDGE-BASED METHODS ARE AN ESSENTIAL PART OF THE FOUNDATION OF THE FIELD. BUT YOU ALSO HAVE A MOVEMENT TOWARD KNOWLEDGE-BASED METHODS INFORMED BY LARGE-SCALE DATA ANALYSIS, NOT JUST MINING EXPERTS FOR THE KNOWLEDGE IN THEIR HEADS AND INCORPORATING IT IN SOURCES. AND THEN YOU HAVE THE HEAVY FOCUS ON MACHINE LEARNING, IN OTHER WORDS AUTOMATIC ANALYSIS OF THE DATA IN ORDER TO LEARN FROM THE DATA, SOLVE TASKS, AND ACQUIRE NEW KNOWLEDGE. THIS LARGER CIRCLE IS STATISTICAL NLP, SO WE HAVE TO GET AWAY FROM THE EARLY-90S, MID-90S CONTRAST OF LINGUISTIC NLP VERSUS STATISTICAL NLP. THAT HASN'T BEEN NLP FOR QUITE A LONG TIME, AND I'LL SAY THAT CIRCLE IS NOT STATISTICAL NLP, THIS IS A PICTURE OF NLP TODAY. NOW ONE OF THE OTHER THINGS WAS NOT ABOUT METHODS AND DATA. IT WAS ABOUT A VERY LARGE SHIFT AWAY FROM THE GENERAL GOAL OF KNOWLEDGE REPRESENTATION, REASONING, BEING ABLE TO TALK WITH OUR COMPUTERS, TRYING TO SOLVE THE LANGUAGE UNDERSTANDING PROBLEM IN THE LARGE, TO A TASK-DRIVEN FOCUS. SO LANGUAGE UNDERSTANDING IS INDEED CRUCIAL, BUT IT NEEDS TO BE CONSTRUED DIFFERENTLY FROM THAT TIME. SO THE WAY WE TALKED ABOUT THIS THEN WAS SOMETHING LIKE THIS. 
WHERE YOU IMAGINED INTELLIGENT AGENTS HAVING A CONVERSATION WITH YOU OR THE CLINICIAN AND THE--YEAH, I BORROWED THIS SLIDE FROM ANOTHER TALK I GAVE. I COULDN'T COME UP WITH A GOOD MEDICAL EXAMPLE SO I THOUGHT I WOULD KEEP THIS ONE. SO THAT DREAM HAS FADED OUT AS BEING THE FOCUS OF NATURAL LANGUAGE PROCESSING, AND I WOULD ARGUE THAT WHAT WE NEED TO DO IN CLINICAL NLP IS WHAT THE PROCESSING COMMUNITY IN THE LARGE WAS DOING WITH THAT TRANSITION IN THE EARLY TO MID-90S. AND I WILL STATE THIS STRONGLY BECAUSE IT'S A PANEL AND IT WILL PROVOKE INTERESTING DISCUSSION, BUT CLINICAL NLP IS BEHIND THE CURVE WHEN IT COMES TO THE REST OF NLP. AND THERE'S A REASON FOR THIS. THE REASON IS YOU CAN'T GET YOUR HANDS ON THE DATA, AND IT'S ALL ABOUT THE DATA. SO IN TODAY'S NLP, THE QUESTION IS NOT WHETHER YOU'RE GOING TO INTEGRATE KNOWLEDGE-BASED METHODS WITH STATISTICAL METHODS, BUT HOW. AND ALL OF THE CUTTING-EDGE ADVANCES THAT PEOPLE PAY ATTENTION TO, WHAT WATSON DID, AND I KNOW SOME OF THE PEOPLE ON THE TEAM, I DID NOT HAVE ANY--MY WORK, THE TIME I SPENT AT IBM, DID NOT FEED INTO THAT AT ALL, BUT IF YOU LOOK AT SIRI, IF YOU LOOK AT GOOGLE TRANSLATE, IF YOU LOOK AT SPAM FILTERS, SEARCH ENGINES, SENTIMENT ANALYSIS, ALL THE PLACES WHERE NATURAL LANGUAGE PROCESSING IS STARTING TO HAVE A VERY LARGE INFLUENCE ON THE REAL WORLD, ALL OF THOSE PLACES ARE DRIVEN BY AN APPROACH THAT IS FUNDAMENTALLY BASED ON THE IDEA OF DATA INFORMING THE PROCESS. SO, THERE ARE A COUPLE OF DIFFERENT WAYS, WHICH I'LL REVIEW QUICKLY, IN WHICH YOU CAN INCORPORATE KNOWLEDGE INTO A DATA-DRIVEN ENTERPRISE. THERE ARE REALLY THREE THAT COVER, I THINK, A LARGE PART OF THE TERRITORY. ONE IS WHAT WAS REFERRED TO AS FEATURE ENGINEERING, DEFINING WHAT THE ANALYSIS PAYS ATTENTION TO. SO YOU CAN HAVE UNIGRAM FEATURES, I.E. 
TOKENS OR WORDS IN THE TEXT. YOU CAN TREAT EACH OF THESE AS DISTINCT, AND YOUR SYSTEM IS GOING TO HAVE TO PAY ATTENTION TO THOSE DISTINCTIONS, OR YOU CAN REALIZE THAT PLURALS DON'T ACTUALLY MAKE TOO MUCH OF A DIFFERENCE IN SOME CONTEXTS, SO MAYBE CATHETERS SHOULD BE TREATED THE SAME AS CATHETER, AND WE SHOULDN'T CARE ABOUT DISTINCTIONS BETWEEN BRITISH AND AMERICAN SPELLING, SO THE SYSTEM IS NO LONGER AWARE, IN THE FEATURE SET, OF THE DISTINCTIONS THERE. AND IN FACT, FOR SOME APPLICATIONS IT MIGHT MAKE SENSE TO SAY ALL THESE ARE TELLING YOU THAT WE'RE DEALING WITH SOMETHING TO DO WITH CATHETER, AND THAT WILL BE A DECISION THAT YOUR KNOWLEDGE AS A DOMAIN EXPERT INFORMS. THE TASK INFORMS THE WAY THAT YOU THINK ABOUT YOUR FEATURE ENGINEERING. MODEL STRUCTURE IS ANOTHER PLACE WHERE YOU CAN INCORPORATE THE SIGNIFICANT KIND OF KNOWLEDGE THAT YOU HAVE INTO SYSTEMS. SO FOR EXAMPLE, MANY CLINICAL NLP SYSTEMS GO THROUGH SOMETHING THEY WOULD CALL REGIONING, SO YOU HAVE TO DECIDE, YOU HAVE DIFFERENT SECTIONS OF THE DOCUMENT LIKE HPI, MEDICAL HISTORY AND SO FORTH, AND THEN IDENTIFICATION OF DIAGNOSTIC LANGUAGE. ONE WAY OF STRUCTURING A MODEL OR SYSTEM LIKE THIS IS TO HAVE A PIPELINE. ANOTHER IS TO DO A FULL JOINT INFERENCE PROCESS OF THE KIND THAT CHRIS AND HIS STUDENTS HAVE DONE, AND OF COURSE THERE ARE THINGS IN THE MIDDLE. FOR EXAMPLE, IT MAY BE THE FACT THAT HAVING THE MODEL OF REGIONING TAKE ADVANTAGE OF DIAGNOSTIC LANGUAGE MIGHT DO BETTER, BUT IN NOT AS EXPENSIVE A WAY AS MERGING ALL THESE TOGETHER IN A SINGLE MODEL. THIS IS ANOTHER PLACE WHERE YOUR KNOWLEDGE OF THE DOMAIN MAKES AN ENORMOUS DIFFERENCE. 
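THE FEATURE ENGINEERING CHOICE DESCRIBED ABOVE, WHETHER THE SYSTEM SEES CATHETER AND CATHETERS AS ONE FEATURE OR TWO, CAN BE SKETCHED IN A FEW LINES OF PYTHON. THIS IS A MINIMAL SKETCH UNDER STATED ASSUMPTIONS: THE SPELLING TABLE, THE CRUDE PLURAL RULE, AND THE FUNCTION NAMES ARE ALL HYPOTHETICAL, AND A REAL SYSTEM WOULD USE A PROPER TOKENIZER AND LEMMATIZER.

```python
from collections import Counter

# Hypothetical British -> American spelling table (illustrative only).
SPELLING = {"anaesthesia": "anesthesia", "catheterisation": "catheterization"}


def normalize(token):
    """Collapse case, spelling variants, and simple plurals into one feature."""
    t = token.lower()
    t = SPELLING.get(t, t)
    # Crude plural stripping; an assumption standing in for a real lemmatizer.
    if len(t) > 3 and t.endswith("s") and not t.endswith("ss"):
        t = t[:-1]
    return t


def unigram_features(text, normalizer=None):
    """Count unigram features, optionally after normalization."""
    counts = Counter()
    for tok in text.split():
        tok = tok.strip(".,;:")
        if tok:
            counts[normalizer(tok) if normalizer else tok] += 1
    return counts


text = "Catheter placed. Catheters removed."
raw = unigram_features(text)                # surface forms kept distinct
merged = unigram_features(text, normalize)  # distinctions hidden from the model
print(raw)
print(merged)
```

WHETHER TO MERGE IS THE DOMAIN EXPERT'S DECISION. THE CODE ONLY MAKES THE TWO OPTIONS VISIBLE SIDE BY SIDE.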
FINALLY, THE NOTION OF A PRIOR, OR PRIOR PROBABILITY, IN A STATISTICAL MODEL HAS TO DO WITH KNOWLEDGE THAT YOU BRING IN, EXPECTATIONS ABOUT WHAT THE PROBABILITIES WILL BE. IT COULD BE CHRONIC OBSTRUCTIVE LUNG DISEASE OR THE COMMON COLD, AND IT MIGHT BE THE CASE THAT YOU LET THE SYSTEM KNOW, YOU INFORM THE SYSTEM, THAT ONE OF THESE IS MORE LIKELY THAN THE OTHER, BUT THESE KINDS OF ASSUMPTIONS CAN BE OVERRIDDEN BY ENOUGH EVIDENCE. IF YOU'RE FINDING OTHER EVIDENCE THAT LEADS IN THE DIRECTION OF COPD, THEN YOU CAN GO IN THAT DIRECTION. I WORKED WITH WORDNET IN THE EARLY DAYS, AND I'M NOW CONVINCED THAT THE PROPER USE OF ONTOLOGIES IS AS A PRIOR FOR KNOWLEDGE, NOT AS AN ENCAPSULATION OF BOUNDARIES THAT ARE NECESSARY AND SUFFICIENT AND RESTRICT YOU. SO STATE-OF-THE-ART NLP DEPENDS CRUCIALLY ON LEARNING FROM RELEVANT DATA. THAT'S A PROBLEM. THE PROBLEM IS YOU CAN'T GET YOUR HANDS ON THE DATA. HIPAA PROVIDES CONSTRAINTS, AND THERE ARE OTHER REASONS FOR ORGANIZATIONS, AND AS A RESULT, THE RESEARCH IN CLINICAL NATURAL LANGUAGE PROCESSING HAS NOT HAD THE SAME EXPOSURE TO VERY LARGE QUANTITIES OF REAL-WORLD DATA THAT OTHER AREAS OF NATURAL LANGUAGE PROCESSING HAVE. NOW THERE ARE A COUPLE OF THINGS YOU CAN DO TO SOLVE THAT PROBLEM. AS HAS BEEN MENTIONED, WITH SUPERVISED METHODS THIS IS PROBLEMATIC, RIGHT? NOT ONLY DO YOU NEED THE DATA BUT ALSO THE ANNOTATIONS. SO YOU PUSH IN THE DIRECTION OF SEMI-SUPERVISED METHODS AND BOOTSTRAPPING FROM UNCODED OR UNANNOTATED DATA, AND TAKE ADVANTAGE OF DATA THAT HAS NOT HAD EXTERNALIZED LINGUISTIC KNOWLEDGE, I LIKE THAT PHRASE, ADDED TO IT. SO SEMI-SUPERVISED METHODS ARE ONE WAY OF GOING, AND YOU CAN GO FURTHER AND TAKE ADVANTAGE OF UNSUPERVISED METHODS. LATENT DIRICHLET ALLOCATION IS ONE OF THE POPULAR WAYS OF DOING THIS NOW, IN ORDER TO INFORM THE MODEL, IN AN UNSUPERVISED WAY, ABOUT THE STRUCTURE OF THE DATA IT'S LOOKING AT, WITH PRIORS, SO IT HAS AN IDEA, FORGIVE ME FOR ANTHROPOMORPHIZING THE SYSTEM. 
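THE POINT ABOVE ABOUT A PRIOR BEING OVERRIDDEN BY ENOUGH EVIDENCE CAN BE MADE CONCRETE WITH A TWO-HYPOTHESIS NAIVE BAYES CALCULATION. THIS IS ONLY A SKETCH: EVERY NUMBER BELOW IS INVENTED FOR ILLUSTRATION AND HAS NO CLINICAL VALIDITY, THE FINDINGS ARE ASSUMED CONDITIONALLY INDEPENDENT, AND THE NAMES ARE HYPOTHETICAL.

```python
import math

# Invented prior: the common cold is assumed far more likely up front.
PRIOR = {"common_cold": 0.9, "copd": 0.1}

# Invented likelihoods P(finding | diagnosis); illustrative values only.
LIKELIHOOD = {
    "common_cold": {"cough": 0.6, "long smoking history": 0.1, "chronic dyspnea": 0.05},
    "copd": {"cough": 0.7, "long smoking history": 0.8, "chronic dyspnea": 0.7},
}


def posterior(findings):
    """Return P(diagnosis | findings) under a naive Bayes assumption."""
    log_scores = {
        h: math.log(PRIOR[h]) + sum(math.log(LIKELIHOOD[h][f]) for f in findings)
        for h in PRIOR
    }
    total = sum(math.exp(s) for s in log_scores.values())
    return {h: math.exp(s) / total for h, s in log_scores.items()}


weak = posterior(["cough"])
strong = posterior(["cough", "long smoking history", "chronic dyspnea"])
print(weak)    # with little evidence, the prior still dominates
print(strong)  # accumulated evidence outweighs the prior
```

WITH ONE WEAK FINDING THE PRIOR KEEPS THE COMMON COLD AHEAD, AND WITH THREE FINDINGS THE EVIDENCE MOVES THE POSTERIOR TOWARD COPD, WHICH IS THE SENSE IN WHICH A PRIOR INFORMS WITHOUT RESTRICTING.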
AND THIS IS A PROBLEM THAT WAS THE CHICKEN-AND-EGG QUESTION THAT WAS ASKED EARLIER. THIS IS WHAT'S HAPPENING IN NLP GENERALLY. EVERYBODY REALIZES THAT IF YOU TRAIN A TRANSLATION SYSTEM ON [INDISCERNIBLE] AND TRY TO TRANSLATE BLOGS, YOU WILL NOT DO WELL, AND SO PEOPLE ARE TRYING TO FIGURE OUT WAYS OF ADAPTING SYSTEMS TRAINED ON ONE KIND OF DATA TO LARGER SETS OF UNCODED OR UNANNOTATED DATA, IN ORDER TO TAKE ADVANTAGE OF WHAT YOU LEARN AND THE KNOWLEDGE YOU HAVE WITHOUT REPEATING THE SUPERVISED PROCESS ON EACH NEW SUBDOMAIN, BECAUSE THAT'S NOT GOING TO SCALE. AND SO THIS FIGURE SHOWS YOU A SIMILAR KIND OF HISTORICAL LINE FOR SEMI-SUPERVISED METHODS AS YOU'VE SEEN FOR SUPERVISED METHODS BACK IN THE EARLY 1990S. THERE IS ANOTHER REALLY INTERESTING SOLUTION. ABOUT 13 YEARS AGO, I FOUND MYSELF IN THE TRENCHES DOING NATURAL LANGUAGE PROCESSING, AND TO MY SURPRISE THE FOLKS I WAS WORKING WITH HAD A VERY INTERESTING POSITIONING, AND I MEAN THAT IN MANY DIFFERENT SENSES OF THE WORD. THE PROBLEM THAT WE WORKED ON, AND THE BREAD AND BUTTER OF WHAT IS DONE TODAY, IS THE PROBLEM OF COMPUTER-ASSISTED CODING. THIS IS SOME PLACE WHERE THERE IS A VERY, VERY ACTIVE COMMUNITY MOTIVATED BY NATURAL LANGUAGE PROCESSING. THE AMERICAN HEALTH INFORMATION MANAGEMENT ASSOCIATION JUST HAD A SUMMIT LAST WEEK WHERE YOU HAD A LOT OF PEOPLE IN THAT AREA. ONE OF THE REALLY INTERESTING THINGS ABOUT THIS IS, IF YOU LOOK AT THE REVENUE CYCLE, THIS IS THE PLACE WHERE YOU GET FROM THE DOCUMENTATION TO THE CODES THAT DETERMINE THE PAYMENT. THIS IS THE FUNNEL THROUGH WHICH THE DOCUMENTATION GOES. SO THIS IS AN OPPORTUNITY TO BE WHERE THE DATA IS, AND THAT'S THE APPROACH THAT WE TOOK. 
AND ONE OF THE REASONS FOR MY COMMENT ABOUT CLINICAL NLP BEING IN A DIFFERENT SPACE, STILL, THAN NLP IN THE LARGE COMES FROM EXPERIENCE ON THAT COMMERCIAL SIDE, WHERE THE MAJORITY OF FOLKS OUT THERE ARE NOT TAKING STATISTICAL APPROACHES. AND BY STATISTICAL, LET ME BE CLEAR ABOUT THE WORD, I'M TALKING ABOUT THE APPROACH WHERE THE DATA DRIVES THE PROCESS BUT THE KNOWLEDGE IS AN ABSOLUTELY ESSENTIAL PART OF IT. NOW, THERE'S ANOTHER ASPECT WHERE BEING IN THAT ARENA HAS TURNED OUT TO BE INTERESTING. IT'S A DIFFERENT VIEW OF THE PROBLEM THAN SAYING I'M GOING TO BUILD A MODULE, OR I'M GOING TO BUILD A PIECE OF SOFTWARE TO DO SOME KIND OF ANALYSIS, AND HAND IT OVER. THE HUMAN IN THE LOOP, AS THEY MENTIONED, IS ABSOLUTELY ESSENTIAL, I THINK, IF YOU ARE GOING TO TRANSLATE NATURAL LANGUAGE PROCESSING METHODS INTO SOMETHING THAT IS USEFUL IN ACTUAL HEALTHCARE AND IN MEDICAL RESEARCH. ON THE LEFT-HAND SIDE UP THERE YOU'VE GOT THE ORIGINAL TEXT, AND ON THE RIGHT-HAND SIDE YOU HAVE A USER INTERFACE WHERE A HUMAN CODER CAN REVIEW AND CORRECT THE CODES. AND OF COURSE, THERE'S A MODULE THAT SAYS IF YOU'RE CONFIDENT ENOUGH, YOU DON'T NEED THE HUMAN REVIEW, BUT EVERY TIME A HUMAN INTERACTS, IT'S DATA TO LEARN FROM. AND [INDISCERNIBLE] MANAGEMENT, THE PICTURE OF INTERVENTIONAL CODING AND SO FORTH. THE HUMAN-IN-THE-LOOP ELEMENT IS SOMETHING THAT I THINK WE AS NATURAL LANGUAGE PROCESSING PEOPLE HAVE NOT PAID ENOUGH ATTENTION TO, AND I WANT TO REINFORCE WHAT THEY SAID EARLIER ABOUT HUMAN INTERACTION BEING A DIFFERENT SET OF PROBLEMS. NOW, THERE'S A DIFFERENT PROBLEM. WHAT I SAID A SECOND AGO IS, I THINK THAT CLINICAL NATURAL LANGUAGE PROCESSING HAS NOT YET HAD THE OPPORTUNITIES THAT OTHER AREAS OF NATURAL LANGUAGE PROCESSING HAVE TO WORK WITH LARGE QUANTITIES OF DATA, BECAUSE IT'S HARD TO GET OUR HANDS ON THE DATA. 
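THE HUMAN-IN-THE-LOOP FLOW DESCRIBED ABOVE, WHERE CONFIDENT SUGGESTIONS SKIP REVIEW AND EVERY HUMAN CORRECTION BECOMES DATA TO LEARN FROM, MIGHT BE SKETCHED AS FOLLOWS. THE CLASS, ITS METHODS, THE THRESHOLD VALUE, AND THE PLACEHOLDER CODES ARE ALL HYPOTHETICAL. NOTHING HERE DESCRIBES THE SPEAKER'S ACTUAL SYSTEM.

```python
from dataclasses import dataclass, field


@dataclass
class CodingQueue:
    # Hypothetical confidence cutoff above which no human review is requested.
    threshold: float = 0.95
    # Each reviewed note becomes a (note, final_code) training example.
    training_examples: list = field(default_factory=list)

    def route(self, confidence):
        """Decide whether a suggested code needs a human coder."""
        return "auto_accept" if confidence >= self.threshold else "human_review"

    def record_review(self, note, suggested_code, final_code):
        """Store the coder's decision; return True if the suggestion was kept."""
        self.training_examples.append((note, final_code))
        return suggested_code == final_code


queue = CodingQueue(threshold=0.9)
print(queue.route(0.97))  # confident enough to skip review
print(queue.route(0.55))  # sent to a human coder
kept = queue.record_review("note text", "CODE_A", "CODE_B")
print(kept, len(queue.training_examples))
```

THE DESIGN CHOICE WORTH NOTING IS THAT BOTH CONFIRMATIONS AND CORRECTIONS ARE STORED, SINCE EITHER KIND OF INTERACTION IS A LABELED EXAMPLE.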
NOW MY POSITION ON THIS, AND IT MAY DIFFER FROM SOME OTHER FOLKS IN THE ROOM, IS THAT THIS IS NOT JUST A QUESTION OF THE CLINICAL NLP SIDE NEEDING TO MOVE FROM THE KNOWLEDGE REPRESENTATION AND REASONING ROOTS THAT IT HAS TO THE CENTER, WHILE THE STATISTICAL SIDE, WHICH WENT THAT WAY FOR 15 YEARS, IS NOW ALSO MOVING TOWARD THE CENTER. MY POSITION ON THIS ACTUALLY IS THAT CLINICAL NLP NEEDS TO MOVE PAST THE CENTER. WE NEED TO BE ABLE TO GET THE MINDSET AND THE TECHNIQUES WORKING ON LARGE QUANTITIES OF CLINICAL DATA, SO THE PEOPLE WE TRAIN ARE ACTUALLY HAVING THAT CROSS-POLLINATION OF WORKING WITH THE LARGE-DATA TECHNIQUES THAT HAVE WORKED AND PUSHED THE FIELD FORWARD, AND THE PEOPLE WHO HAVE THE KNOWLEDGE ARE USING THOSE TECHNIQUES ON LARGE QUANTITIES OF DATA. IN ORDER TO DO THAT YOU HAVE TO LIVE THE DATA FOR A WHILE. SO MY ARGUMENT WOULD BE CLINICAL NLP NEEDS TO GO BACK AND SWING BACK WITH THE REST OF THE FIELD. NOW THERE'S A BIGGER PROBLEM, THOUGH. IT'S NOT JUST A QUESTION OF WHERE DO YOU GET THE DATA FOR RESEARCH AND DEVELOPMENT. THE QUESTION I HAVE IS WHETHER THE DATA IS GOING TO CONTINUE TO EXIST AT ALL. WITH THE WIDESPREAD ADOPTION OF ELECTRONIC MEDICAL RECORDS AND THE BIG PUSH TOWARDS THAT, SOMETHING INTERESTING IS HAPPENING. PEOPLE RECOGNIZE THAT YOU NEED TO DO ANALYTICS, AND YOU NEED DISCRETE DATA, IN ORDER TO GET THE VALUE OUT OF THESE ELECTRONIC RECORDS, ALL OF THESE THINGS THAT IT OFFERS. SO THERE IS A PUSH TO MAKE THE INPUT DISCRETE, NORMALIZED AND STANDARDIZED. 
SO THIS PICTURE HERE, WHERE CLINICAL DICTATION OF THE TRADITIONAL VARIETY GOES THROUGH SOME KIND OF CODING PROCESS, HUMAN OR COMPUTER-ASSISTED, IS STARTING TO BE REPLACED BY THIS PICTURE. I'LL DO A BLOW-UP ON THAT THING THERE ON THE LEFT: ELECTRONIC MEDICAL RECORDS THAT INVOLVE PICK LISTS AND CLICKING AND TEMPLATES AND OFTEN TEXT BOXES AS A PART OF THE PROCESS. THESE ARE A VERY, VERY DIFFERENT ANIMAL FROM THE CLINICAL DICTATIONS THAT CLINICIANS HAVE GROWN UP WITH. AND THERE ARE FOLKS, MYSELF INCLUDED, WHO ARE WORRIED ABOUT THAT. SO THIS IS JUST SOMETHING THAT I PULLED FROM A LETTER TO THE EDITOR A WHILE BACK: IN YEARS PAST, A WELL-WRITTEN HISTORY AND PROGRESS NOTE WOULD UNFOLD LIKE A STORY, GIVING A VIVID DESCRIPTION, AND SO FORTH. SO HERE'S AN EXAMPLE. HERE'S A NOTE, AND THIS IS NOT COMPLETE, THIS IS A CREATED EXAMPLE FOR THE PURPOSES OF THE TALK, ALTHOUGH I HAVE GOT A STUDY THAT I CAN SHARE WITH FOLKS WHERE WE DID SOMETHING MUCH MORE SYSTEMATIC WITH MUCH MORE RIGOROUS RESULTS. IN BLUE HERE, YOU HAVE THE KIND OF INFORMATION THAT IS HERE IN THIS DICTATION THAT HAS A PLACE IN A STANDARD, STRUCTURED KIND OF ELECTRONIC MEDICAL RECORD. THERE'S A LOT OF STUFF, I PUT THIS IN RED, WHICH ANYWHERE FROM DEFINITELY TO AT LEAST ARGUABLY DOESN'T HAVE A HOME IN YOUR TYPICAL ELECTRONIC RECORD STANDARD. RIGHT? CONSTIPATED AS A DIAGNOSIS, DOING A LOT OF STRAINING, NOT SO SURE. ANOTHER EXAMPLE, SAME PATIENT WEEKS LATER: PATIENT WAS ADMITTED, BLOOD PRESSURE WAS BROUGHT DOWN, WITH THIS SHE REVERSED HER RESPIRATORY DISTRESS PROMPTLY. I SPOKE TO A DOCTOR WHO HELPED ME WITH THIS EXAMPLE, AND WHEN YOU LOOK AT THIS, WHAT WAS GOING ON HERE IS THAT THE CLINICIANS WERE TRYING TO FIGURE OUT WHETHER THE PROBLEM HERE WAS HEART FAILURE OR PNEUMONIA, AND IF YOU DO DIURESIS AND THE SHORTNESS OF BREATH, THE RESPIRATORY DISTRESS, REVERSES PROMPTLY, IT'S NOT PNEUMONIA. 
NOW, PROMPTNESS OF THE RESPONSE TO A TREATMENT: IS THERE A SLIDER BAR IN THE TYPICAL ELECTRONIC MEDICAL RECORD, A PLACE TO PUT IT IN? MAYBE THERE'S A TEXT BOX, BUT THINK ABOUT THE EXTRA WORK THAT A CLINICIAN HAS TO DO IN ORDER TO GO AND DECIDE TO LOOK AT SOMETHING IN THE TEXT BOX. AND ONCE YOU DO PUT STUFF IN TEXT BOXES, TWO THINGS HAPPEN: YOU FRAGMENT THE NARRATIVE AND, OOPS, YOU JUST CREATED A WHOLE BUNCH OF NON-STANDARDIZED, NON-NORMALIZED TEXT TO DEAL WITH. SO THERE IS A WORRY THAT I HAVE. MY WORRY IS THAT WE AS A NATURAL LANGUAGE PROCESSING COMMUNITY HAVE A LIMITED AMOUNT OF TIME TO DO WHAT WE NEED TO DO, WHICH IS TO DEMONSTRATE THAT YOU DON'T NEED TO CONSTRAIN THINGS SO MUCH UP FRONT IN ORDER TO GET THE VALUE AT THE BACK END. AUTOMATIC STRUCTURING OF RECORDS FROM CLINICAL--THIS IS ANOTHER MOVE TOWARDS THE CENTER. EVERYBODY RECOGNIZES THAT YOU NEED DISCRETE DATA. IT WOULD BE VERY, VERY [INDISCERNIBLE] TO HAVE PEOPLE DICTATE [INDISCERNIBLE]. THERE ARE A THOUSAND DIFFERENT WAYS OF SAYING THE PRESCRIPTION IS ONE TABLET BY MOUTH DAILY. LITERALLY A THOUSAND, SOMEBODY SHOWED ME A LIST OF THEM, ALL THE NATURAL VARIATIONS OF JUST THAT ONE THING. IT WOULD BE CRAZY TO HAVE TO DO NLP ON THAT. BUT THERE IS SO MUCH THERE, VALUE AND NUANCE AND RICHNESS IN THE NARRATIVE, AND IF WE GO WITH THE TREND RIGHT NOW, WHICH IS TRYING TO SQUEEZE THINGS DOWN FOR DOWNSTREAM PURPOSES, WHICH ARE ENORMOUSLY VALID PURPOSES, WE RUN THE RISK THAT FOR US AS NATURAL LANGUAGE PROCESSING PEOPLE, THE LANGUAGE WE'LL PROCESS WILL BE A LOT MORE IMPOVERISHED OR NOT THERE AT ALL. BUT WHAT WE CAN DO IS SAY, HEY LOOK, THERE'S AN ALTERNATIVE. INSTEAD OF MOVING TO THE CENTER BY DOING DISCRETE STUFF AND ADDING TEXT BOXES, IT'S POSSIBLE TO SAY, LOOK, THE CLINICAL DICTATION IS WHAT THE COMMUNITY USES, IT'S THE FOUNDATION OF CLINICAL COMMUNICATION. LET'S START FROM THERE, AND AT LEAST BROADEN THE MINDSET. PEOPLE NEED TO KNOW THAT THE PROCESSING [INDISCERNIBLE] CAN HELP THEM. 
THE TEXT BOXES DON'T SOLVE THE PROBLEM, AND IF YOU LOSE THE LANGUAGE, YOU LOSE THE STORY. SO, TAKEAWAYS. CLINICAL NLP NEEDS MORE STATISTICAL NLP. NOT JUST TO SAY, HERE ARE ALL THESE WONDERFUL KNOWLEDGE SOURCES, HOW DO WE START DOING STATISTICAL STUFF WITH THEM, BUT TO SAY, WE NEED TO FIND A WAY TO TAKE A STATISTICAL NLP MINDSET AND APPLY IT TO CLINICAL LANGUAGE PROCESSING. THOSE ARE TWO VERY DIFFERENT THINGS. SECOND, WE'VE GOT A BIG PROBLEM, WHICH IS THAT IN ORDER TO DO THAT, SOMEHOW YOU NEED TO BRING TOGETHER THE R&D PEOPLE AND THE DATA, AND ONE THING TO THINK ABOUT IS, IF WE CAN'T BRING THE DATA TO THE RESEARCHERS, MAYBE WE NEED TO BRING THE RESEARCHERS TO THE DATA MORE AGGRESSIVELY. AND FINALLY, WE AND EVERYBODY ELSE, I'VE ARGUED, HAVE A FAR BIGGER PROBLEM TO SOLVE, WHICH IS THAT THE NATURE OF LANGUAGE IN CLINICAL COMMUNICATION IS CHANGING. THE PROBLEM IS CHANGING UNDER OUR FEET, AND I DON'T THINK THERE'S BEEN ENOUGH DISCUSSION OF THAT, AND I'M HERE TO ENCOURAGE THAT THERE BE MORE OF IT. THANKS. [ APPLAUSE ] >> I'M SURE THAT WAS A VERY THOUGHT-PROVOKING PRESENTATION AND YOU'RE JOTTING DOWN QUESTIONS FOR PHILIP RESNIK FOR AFTER THE LAST PRESENTATION. IT'S MY HONOR TO INTRODUCE PROFESSOR TSUJII. WE ALL KNOW HIS WORK, AND HE HAS MANY YEARS OF EXPERIENCE AT THE UNIVERSITY OF TOKYO, BUT HE'S NOW MOVED TO THE OTHER SIDE OF THE PENDULUM AND NOW HE'S IN INDUSTRY, HE'S PART OF MICROSOFT RESEARCH, ALTHOUGH HE REMAINS VERY ACTIVE IN THE ACADEMIC WORLD. SO HE STILL HOLDS AN APPOINTMENT AT THE UNIVERSITY OF TOKYO AND ALSO AT THE UNIVERSITY OF MANCHESTER. DR. TSUJII'S RESEARCH ACHIEVEMENTS INCLUDE DEEP SEMANTIC PARSING BASED ON THE FEATURE FOREST MODEL, EFFICIENT SEARCH ALGORITHMS FOR PARSING, IMPROVEMENT OF THE ESTIMATOR FOR MAXIMUM ENTROPY MODELS, AND THE WORK THAT I'M SURE YOU'RE ALL VERY WELL AWARE OF IS THE CONSTRUCTION OF THE GENIA CORPUS AND THE MULTILAYER ANNOTATIONS WITHIN THE GENIA [INDISCERNIBLE]. 
HE'S RECEIVED AWARDS SUCH AS THE IBM SCIENCE AWARD, VISITING PROFESSOR, AND IBM FACULTY AWARD, AND HAS ALSO BEEN THE PRESIDENT OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS AND THE PRESIDENT OF THE INTERNATIONAL ASSOCIATION FOR MACHINE TRANSLATION, AND HE HAS HAD A VERY LONG AND VERY PRODUCTIVE CAREER IN NLP, SO ... [ APPLAUSE ] >> THANK YOU VERY MUCH FOR THE KIND INTRODUCTION. SO I FIRST HAVE TO START WITH A DISCLAIMER. I'M ALWAYS SAYING I'M INVOLVED IN NATURAL LANGUAGE PROCESSING AND TEXT MINING IN THE BIOMEDICAL DOMAIN. HOWEVER, REALLY MY TOPIC IS ON THE BIO DOMAIN, THE BIOLOGICAL DOMAIN, AND ALSO I ONLY TREATED SCIENTIFIC PUBLICATIONS IN MOLECULAR BIOLOGY, SO THAT IS A VERY, VERY NARROW FIELD. FIRST OF ALL, TEXT TYPES ARE RESTRICTED, THAT IS, PUBLISHED SCIENTIFIC PAPERS, AND ALSO [INDISCERNIBLE] OF PAPERS, AND THE DOMAIN IS AGAIN VERY RESTRICTED, MOLECULAR BIOLOGY. AND RECENTLY I STARTED TO SEE OTHER TYPES OF APPLICATIONS, ESPECIALLY IN THE CLINICAL DOMAIN: CLINICAL TRIALS, KPMC, WHICH [INDISCERNIBLE] FOR PAPERS, AND DISCHARGE SUMMARIES, CHALLENGES DEFINED BY I2B2. WHAT I FOUND IS REALLY A SIMPLE FACT, THAT IS, THE FEASIBILITY OF NLP OR TM TECHNOLOGY, A GIVEN ONE, IS HIGHLY DEPENDENT ON TEXT TYPES AND DOMAINS AND TASKS. SO PEOPLE LIKE CAROL OFTEN TALK ABOUT SUBLANGUAGES AND SO ON, AND I REALLY REALIZED THAT LANGUAGE IS NOT A SINGLE THING. THERE ARE MANY LANGUAGES, AND LANGUAGES IN SCIENTIFIC PAPERS, LANGUAGES IN MEDICAL RECORDS, LANGUAGES IN ABSTRACTS ARE ALL VERY, VERY DIFFERENT. AND ALSO THE TASKS WE WANT TO PERFORM ARE VERY DIFFERENT. SO DEPENDING ON THOSE FACTORS, THE FEASIBILITY OF A GIVEN TECHNOLOGY IS QUITE DIFFERENT FROM ONE TO THE OTHER. SO THAT IS KIND OF OUR LESSON LEARNED. SO THIS IS THE PLAN OF MY TALK. IN THE FIRST PART I TALK ABOUT WHAT I'VE DONE BEFORE, AND THE SECOND PART IS WHAT I FOUND, THE CURRENT KIND OF LIMITATIONS WITH THE CURRENT TECHNIQUES, AND SOME KIND OF CONCLUDING REMARKS. 
SO, WHAT WE DID WAS MAPPING REPRESENTATIONS OF LINGUISTIC EXPRESSIONS TO ANOTHER DOMAIN, WHICH IS MAYBE A MODEL OR DOMAIN MODEL. SO WHAT IS VERY DIFFERENT, MAYBE SLIGHTLY DIFFERENT, FROM DAN'S APPROACH IS THAT WE THINK THAT [INDISCERNIBLE] DOMAIN IS NOT SO MUCH RELATED WITH LANGUAGE. WE FIRST THINK THAT THE DOMAIN IS COMPLETELY INDEPENDENT, DEFINED BY THE DOMAIN, OR THE DOMAIN SPECIALIST. SO THE PURPOSE OF LANGUAGE PROCESSING IS TO MAP THE LANGUAGE EXPRESSIONS INTO THE DOMAIN. SO, BETWEEN THESE TWO DOMAINS, THE MAPPINGS ARE NOT SO TRIVIAL AND IT'S NOT SO CONTINUOUS. THE SEMANTIC [INDISCERNIBLE] OR STRUCTURES OF THE LANGUAGE, THE LINGUISTIC STRUCTURES, ARE VERY DIFFERENT FROM THE INFORMATION DOMAIN SPECIALISTS WANT TO EXTRACT. SO WE THINK THAT THERE IS A DISCONTINUITY BETWEEN THE TWO DOMAINS. OKAY. SO, IN ORDER TO SEE WHAT KIND OF MAPPINGS WE HAVE TO DEFINE, WE FIRST STARTED TO ANNOTATE TEXT IN TERMS OF THE DOMAIN ONTOLOGY. SO FOR EXAMPLE, HERE'S A SENTENCE, THAT MEANS EXPRESSIONS IN LANGUAGE, AND THEN WE ANNOTATE WHAT KIND OF INFORMATION IS ACTUALLY ENCODED IN THE LANGUAGE, AND THAT INFORMATION IS DEFINED BY THE DOMAIN ONTOLOGY, NOT THE LANGUAGE ITSELF. SO WE ASK BIOLOGISTS TO ANNOTATE WHAT KIND OF RELATIONSHIPS YOU RECOGNIZE IN THIS TEXT AND WHAT KIND OF EVENTS YOU WANT TO RECOGNIZE IN THIS TEXT, SO THAT IS THE TEXT ANNOTATION WE MAKE. WE HAVE TWO TYPES OF ANNOTATION. ONE IS LINGUISTIC ANNOTATION, WHICH IS BASED ON LINGUISTIC THEORIES: PARTS OF SPEECH, STRUCTURES OR PREDICATE-ARGUMENT STRUCTURES, SO ON AND SO FORTH. AND THE DOMAIN IS DEFINED BY THE DOMAIN ONTOLOGY, COMPLETELY INDEPENDENT OF THE LANGUAGE. THAT IS GIVEN BY BIOLOGISTS, AND THEN WE ANNOTATE THE TEXT, WHICH WE CALL SEMANTIC ANNOTATION, BASED ON THAT ONTOLOGY: DOMAIN ENTITIES LIKE PROTEINS, GENES, DISEASES AND SO ON, AND EVENTS AND RELATIONS AND SO ON. 
SO THE TWO ANNOTATIONS, HERE SOMATIC ANNOTATIONS, HERE YOU WILL SEE THE DIFFERENT COLORS INDICATE DIFFERENT TYPES OF ENTITIES, GENE, DISEASE, PROTEINS AND SO ON. AND THEN YOU HAVE OUR LINGUISTICAL NOTATIONS WHICH ARE REALLY ANNOTATION ABOUT LANGUAGE, PART OF SPEECHES, ET CETERA. OR YOU CAN SEE HOW BRAIN STRUCTURE AND NOTATION WILL SOMETIMES PREDICATE AUGMENT STRUCTURES, BUT ON TOP OF THAT, COMPLETELY SEPARATE, WE HAVE EVENT [INDISCERNIBLE] WHICH MEANS WHAT KIND OF BIOLOGICAL EVENT BIOLOGIST RECOGNIZE IN THIS SENTENCE. SO THE TWO ANNOTATIONS ARE SEPARATE INDEPENDENT, SO THIS IS A KIND OF EVENT ALLOCATION, THE BIOLOGIST RECOGNIZE THREE DIFFERENT TYPES OF EVENTS, BINDING, LOCALIZATION OF PROTEINS, END SOME KIND OF RELATIONSHIPS BETWEEN THESE TWO EVENTS, SOME EVENTS CAUSE ANOTHER EVENT AND SO ON. SO THERE ARE THREE DIFFERENT TYPES OF EVENTS, WHICH ARE--WHICH BIOLOGISTS THINK ARE INTERESTING FOR THEM. OKAY, SO, SINCE THESE ARE NOTATIONS, THEY'RE NOT LINGUISTICAL NOTATION SO WE BORROW ONTOLOGY, DEVELOPED BI BIOLOGYST, IN THIS CASE WE BORROW FROM GENE ONTOLOGY AND WE DEFINE ABOUT 40 DIFFERENT EVENTS BIOLOGICAL EVENTS WHICH BIOLOGISTS ARE INTERESTED IN. AND AS CHRIS MENTIONED, WE ORGANIZE EVENT RECOGNITIONS, THEY'RE BASED ON OUR ANNOTATIONS. WE MAKE THOSE TEXT PUBLICLY AVAILABLE FOR THE PARTICIPANT AND I'LL TRY TO SEE HOW THESE SYSTEM K'S RECOGNIZE THOSE EVENTS. SO, EVENT INSTRUCTION BASED ON PARSING IS QUITE SIMILAR TO WHEN CHRIS SAID OR DAN SAID, BASICALLY WHAT WE DID WAS [INDISCERNIBLE] A SENTENCE AND THEN LOOKING AT CONTEXT IN WHICH CERTAIN ENTITIES ARE IN THIS CASE PROTEINS APPEAR AND CERTAIN WORDS OR TRIGGER WORD APPEAR SO YOU CAN SEE, THE LINGUISTIC CONTEXT WHERE THOSE EXPRESSIONS APPEAR. AND THEN WE CHARACTERIZE THOSE LINGUISTIC ENVIRONMENT, HOW THINGS OF THIS ENVIRONMENT ACTUALLY DENOTE SEVERAL TYPES OF VIRUS EVENTS SO WE CONSTRUCT CLASSIFIERS WHICH USES THOSE RESULTS, SO ON AND SO FORTH. OKAY. 
SO WE FOUND THAT THERE ARE SOME DIFFERENCES AMONG TYPES OF OF EVENTS SOME EVENTS ARE EASY TO RECOGNIZE, AND SOME ARE NOT, FOR EXAMPLE, SOME VERY SPECIFIC EVENTS LIKE LOCALIZATION OR GENE EXPRESSIONS OR POST GENOMIC EVENTS ARE QUITE EASY TO RECOGNIZE BECAUSE THEY ARE USED, THEY ARE EXPRESSED BY USING VARIOUS SPECIFIC VERBS BUT ARE DIFFICULT BECAUSE OF BINDING EXPRESSIONS IN THE PROTEINS AND THE STRUCTURES OF PROTEINS ALSO MENTION IN THE TEXT, SO THE SENTENCE ARE NOT REALLY CENTERED AROUND SINGLE VERB THERE ARE OTHERS INVOLVED WHICH TWO EVENTS, THIS EVENT CAUSES A CERTAIN ANOTHER EVENT, HOW MUCH MORE COMPLEX SO DEPENDING ON THE EVENTS, WE WANT TO ECINIZE THE CLEAR DIFFERENCES OF PERFORMANCE OF THE SYSTEM, A SIMPLE EVENT RECOGNIZE FURTHER, BUT THE OTHER EVENTS WE CALLED EVENTS BUT OUR STRATEGY OF DEFINITE TYPES ARE QUITE DIFFICULT TO RECOGNIZE OKAY, SO, THAT IS A KIND OF WHOLE STORY OF OUR SYSTEM, BUT WE FOUND SEVERAL DIFFERENT PROBLEMS, WE HAVE TO ADDRESS. FIRST OF ALL, WE ALWAYS THINK THERE'S A SINGLE PUZZLE WHICH CAN BE APPLIED TO DIFFERENT LANGUAGES BUT THIS IS NOT THE KEYS, THE PRODUCTS WHICH ARE TUNED TO MOSTLY CHANNEL, FOR A NEWSPAPER ARTICLES CANNOT BE APPLIED TO THE STRUCK IN SCIENTIFIC PAPERS MAY NOT BE ABLE TO BE APPLIED TO MEDICAL RECORD PROCESSING. THE LANGUAGE IS VERY, VERY, DIFFERENT FROM ONE DPOMAIN TO THE OTHER--DOMAIN TO THE OTHER. SO WHAT WE DID IS KIND OF A DOMAIN ADAPTATION FOR HIGH--I COME BACK TO THIS TOPIC IN A LATER STAGE, BUT THE SENTENCES IN ABSTRACT AND SENTENCES IN NEWSPAPERS ARE DIFFERENT BUT SOMEWHAT SIMILAR, THE FORM ALITYS ARE PRETTY HIGH SO WHEN WE SEE THE SENTENCES IN CLINICAL RECORDS THEY ARE VERY, VERY, DIFFERENT FROM THESE TWOAINS, SO--DOMAINS, SO ADAPTATIONS TRAINED BY NEWSPAPER ARTICLES CAN BE ADAPTED TO THE NEW DOMAIN, LIKE CONSTRUCTS IN SCIENTIFIC PAPERS, BECAUSE THEY ARE SOMEWHAT SIMILAR, THEY ARE DIFFERENT, WHAT CERTAINTY, SOMEWHAT SIMILAR. 
SO FOR EXAMPLE, FOR THIS ONE WE USE [INDISCERNIBLE], AND THE PERFORMANCE IS NEAR 90%. BUT IF WE APPLY THAT PARSER TO SENTENCES IN ABSTRACTS, THEN THE PERFORMANCE DROPS. SINCE WE HAVE LINGUISTIC ANNOTATIONS IN OUR CORPUS, WE CAN ADAPT THAT PARSER TO THE NEW DOMAIN, THAT MEANS ABSTRACTS OF SCIENTIFIC PAPERS. SO WE USED SOME KIND OF ADAPTATION TECHNIQUE: WE DO NOT ABANDON EVERYTHING TRAINED BY USING THE PENN TREEBANK [INDISCERNIBLE], BUT WE USE THAT STATISTICAL MODEL, ADD THE NEW [INDISCERNIBLE] FROM THE ANNOTATED ABSTRACTS, AND THEN RETRAIN THE WHOLE PARSER AGAIN. THEN WE GET BETTER PERFORMANCE. SO IN THE END, THE SENTENCES IN ABSTRACTS ARE SOMEWHAT EASIER THAN SENTENCES IN NEWSPAPERS, BUT IN ORDER TO ACHIEVE THAT PERFORMANCE, WE HAVE TO ADAPT OUR [INDISCERNIBLE] BY USING OUR NEWLY ANNOTATED CORPUS ON [INDISCERNIBLE] THINGS. AND THEN, I WILL TALK ABOUT SOME LIMITATIONS OF THE CURRENT TECHNIQUES THAT I HAVE REALIZED. ONE IS STATISTICS-BASED PARSING, THAT [INDISCERNIBLE]. THE PROBLEM IS THAT THOSE PARSERS HAVE TO BE ADAPTED TO THE NEW DOMAIN, SO IF YOU WANT TO APPLY THEM TO MEDICAL OR CLINICAL SENTENCES, YOU HAVE TO RETRAIN YOUR [INDISCERNIBLE]. 
AND ALSO, YEAH, PRESENT STATISTICAL PARSERS PERFORM ALMOST CORRECTLY, THAT IS TRUE, SO THAT MEANS 90% OF THE DEPENDENCY RELATIONS ARE CORRECTLY RECOGNIZED. BUT IF YOU LOOK CAREFULLY, MOST OF THE DEPENDENCIES WITH HIGH FREQUENCY ARE RATHER TRIVIAL ONES, LIKE THE DETERMINER IN THE [INDISCERNIBLE] OR THE VERB OR THE ADJECTIVE, OR THE VERB AND THE SUBJECT. THOSE ARE HIGH FREQUENCY AND YOU HAVE VERY GOOD PERFORMANCE THERE. BUT WHEN YOU LOOK AT RATHER DIFFICULT RELATIONS, PP ATTACHMENT AND ALSO RECOGNITION OF THE ANTECEDENT OF RELATIVE CLAUSES, OR COORDINATED PHRASES AND APPOSITIONS, THOSE CONSTRUCTIONS ARE SEMANTICALLY [INDISCERNIBLE]. SOMETIMES THOSE ARE THE VERY IMPORTANT RELATIONSHIPS FOR SEMANTIC PROCESSING. SO ALTHOUGH THE RELATIONSHIPS ARE RECOGNIZED NOT SO BADLY, WITH A 90% CORRECT RATE, THAT IS MOSTLY BASED ON RATHER TRIVIAL DEPENDENCIES WITH HIGH FREQUENCY, AND IF YOU LOOK AT THE SEMANTICALLY CRUCIAL RELATIONSHIPS, THERE STILL REMAIN LOTS OF ERRORS. SO WHAT WE ARE DOING NOW IS TRYING TO IMPROVE ON THESE CRUCIAL PROBLEMS BY USING SPECIFIC ONTOLOGY, AND THAT I THINK IS ESSENTIAL FOR IMPROVING OUR TECHNOLOGIES. THE OTHER THING I LOOKED AT IS THE I2B2 CHALLENGE. BASICALLY THEY WANT TO EXTRACT SEVERAL CONCEPTS AND CLASSIFY THEM INTO SEVERAL DIFFERENT CLASSES OF RELATIONSHIPS, AND THAT IS A VERY DIFFICULT PROBLEM ACTUALLY. I WILL SKIP SEVERAL EXAMPLES, BUT LOOK AT THE RELATIONSHIPS BETWEEN MEDICAL PROBLEMS, WHICH ARE SHOWN I THINK IN RED IN THIS EXAMPLE, AND MEDICAL TREATMENTS, WHICH ARE SHOWN IN THIS SLIDE IN GREEN. THAT IS VERY DIFFERENT FROM THE BIO EVENTS. THOSE EVENTS ARE MOSTLY VERB CENTERED: SOME PROTEIN DOES SOMETHING WITH ANOTHER PROTEIN, OR SOME PROTEIN CHANGES ITS LOCATION, AND THAT IS CLOSER TO SENTENCE STRUCTURE. BUT IF YOU LOOK AT THOSE RELATIONSHIPS, THEY ARE NOT VERB CENTERED AT ALL. 
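[ILLUSTRATIVE SKETCH, NOT PART OF THE SPOKEN REMARKS: THE POINT ABOVE, THAT AN OVERALL RATE NEAR 90% CAN HIDE MANY ERRORS ON DIFFICULT RELATIONS SUCH AS PP ATTACHMENT, CAN BE CHECKED BY BREAKING ACCURACY DOWN PER RELATION LABEL. THE FUNCTION NAME, THE LABEL NAMES, AND THE TOY DATA BELOW ARE ASSUMPTIONS INVENTED FOR ILLUSTRATION, NOT RESULTS FROM THE TALK.]

```python
from collections import defaultdict


def accuracy_by_label(gold, predicted):
    """Return (overall_accuracy, {label: accuracy}) for dependency arcs.

    gold and predicted are equal-length, non-empty lists of
    (head_index, label) pairs, one per token. An arc counts as correct
    only if both the head and the label match the gold arc.
    """
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and equal length")
    total = defaultdict(int)
    correct = defaultdict(int)
    for gold_arc, predicted_arc in zip(gold, predicted):
        label = gold_arc[1]  # bucket every token by its gold label
        total[label] += 1
        if gold_arc == predicted_arc:
            correct[label] += 1
    overall = sum(correct.values()) / sum(total.values())
    per_label = {label: correct[label] / total[label] for label in total}
    return overall, per_label


# Hypothetical toy data: eight frequent, easy arcs (det, nsubj) are all
# correct; of two PP-attachment arcs ("prep"), one has the wrong head.
GOLD = [(2, "det"), (3, "nsubj")] * 4 + [(3, "prep"), (5, "prep")]
PRED = [(2, "det"), (3, "nsubj")] * 4 + [(3, "prep"), (4, "prep")]

overall, per_label = accuracy_by_label(GOLD, PRED)
print(f"overall: {overall:.2f}")
for label, acc in sorted(per_label.items()):
    print(f"{label}: {acc:.2f}")
```

[IN THIS TOY CASE THE AGGREGATE FIGURE LOOKS HEALTHY WHILE THE RARE, SEMANTICALLY IMPORTANT LABEL IS RIGHT ONLY HALF THE TIME, WHICH IS THE PATTERN THE SPEAKER DESCRIBES.]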
AND LOTS OF INFERENCES, OF VARIOUS TYPES ARE INVOLVED.„i SO SIMPLE PATTERNING MAKE NOT WORK SO WELL, SO MY CODING WITH THE MICROSOFT IS A EXPERIMENT AND PARTICIPATED IN ITWO, BTWO CHALLENGE AND IF YOU SEE THE DEPENDENCEY RELATIONSHIPS BY USING THE SAMPLE PRODUCT CONTRIBUTE VERY SMALL INCREMENT OF THE PERFORMANCE OF OUR DURATION OF THIS TASK OKAY AND WHAT IS IMPORTANT FOR RECOGNIZING THIS RELATIONSHIPS REQUIRE LOTS, LOTS OF ENGINEERING AND THEY SAY THAT IS FEATURED ENGINEERING AND THAT ACTUALLY, THEY ARE CHECKING TRAINING PURPOSE MANUELLY BY USING SOME PROGRAMS, HOW IN THE END THEY ENCODE LOTS OF FEATURES AND FEATURES ON INDIVIDUAL WORDS AND RELATIONSHIPS AND THOSE FEATURES CONTRIBUTE TO THE PERFORMANCE VERY MUCH BUT THAT IS ACTUALLY IT LOOKS LIKE CONSTRUCTION OF ORIENTED ONTOLOGY SO THEY SAY THAT IS THE FEATURE BUT THEY THINK IS ENCODING HOW, THE QUOTING ONTOLOGY. AND SO, WHAT I WANT TO SAY IS HERE IS THE RESULT IS FINE, THE PERFORMANCE IS BETTER THAN THE SHARED TASK BUT THEIR RESULTS MAY BE HIGHLY DEPENDENT ON THIS SPECIFIC TASK AND WE DON'T KNOW WHETHER THIS SPECIFIC SYSTEM TRANSFERABLE TO SUM RYES BUT TO OTHER CLINICAL RECORDS OR EVEN THE SAME DISCOVERED IT, DISCHARGE SUMMARIES OF DIFFERENT TYPES OF DISEASES. SO THE CONCLUDING REMARKS SO BASICALLY WHEY WANT IT SAY IS TED TECHNIQUES HIGHLY DEPENDENTOT TYPES OF YOU YOU WANT TO ACHIEVE. YOU WANT TO ADDRESS THESE PROBLEMS AND WE NEED TO HAVE ONTOLOGY CLOSELY RELATED IN THE TEXT AND THAT HAVE TO BE LEVERAGED TO ACTUAL INFORMATION INSTRUCTION TASKS AND WE NEED MORE EXPLICIT SEMANTIC REPRESENTATIONS AND SO ON AND THE OTHER, CLINICAL RECORDS AND THOSE ARE INTERSPERSED WITH ORDINARY SENTENCES AND WHAT THEY DID IS TO SWITCH ON THE SYSTEM FOR TELEGRAPHIC SENTENCES AND IN THE NORMAL SENTENCES. SO NORMAL SENTENCES, WE CAN PASS, WE CAN DO SOME RECOGNIZE IT FOR THAT, BUT THE TELEGRAPHIC SENTENCES WE HAVE TO VERY DIFFERENT NECKANISM FOR CREATING THEM. 
BUT I THINK THIS IS TWO EXTREMES OF THE PROBLEMS, THERE MIGHT BE MIDDLE, THAT MEANS AND TELEGRAPHIC EXPRESSIONS INSIDE AND WE HAVE TO TREAT TO THEM. SO IN ORDER TO DO THAT, I DON'T THINK STATISTICAL [INDISCERNIBLE] CAN HOLD, SO SOMEWHAT MANUEL RULE RIGHTING AND PATTERN WATCHING TYPE OF TECHNIQUES, HAVE TO BE INCLUDED AND THAT'S MY CURRENT INTUITION, [ APPLAUSE ] >> WE HAVE 30 MINUTE PERCENT QUESTION AND ANSWERS. >> COMMENT FOR DR. REZONING NICK, YOU CAN HEAR MY VOICE BUT YOU CAN'T TELL AND THAT IS REAUR SURANCE MAYBE THAT THE THAT THE TYPE OF NOTE THAT CLINICIANS ARE READING IT SEEMS TO BE SHIFTING BACK TOWARDS NARRATIVE TEXT IN MY EXPERIENCE, WE CAN TALK ABOUT THAT TOMORROW MORNING BUT YOUR OBSERVATION, I THINK IS YOUR RIGHT TO MAKE THAT A CONCERN, BUT MOST OF MY COLLEAGUES DON'T LIKE TO CLICK, THEY LIKE TO TALK MORE THEN A LIKE TO CLICK. >> THAT IS REASSURING, I'M GOING TO BE VERY INTERESTED THOUGH IN HOW IT SHAKES OUT WHEN MEANINGFUL USE AND THE PRESSURES START TO MOUNT ON BOTH SIDES. >> NEXT QUESTION OVER THERE? >> ME? >> YES. >> DR. RESNIK, I WAS INSPIRED BY YOUR ENTIRE TALK BUT THE PART I LIKED BEST WAS WHEN YOU SAID, IF WE CAN'T BRING THE DATA TO THE RESEARCHERS, MAYBE WE NEED TO BRING OUR RESEARCHERS TO THE DATA. I'M ELEN AND I'M FROM APIXELSIO, WHICH IS A SILLY CONE VALEE START UP WHEN IS COLLECT TAG DATA AND LOOKING TO PARTNER WITH RESEARCHERS AND SO I'D LIKE TO HEAR WHAT THOUGHTS HAVE YOU ABOUT HOW THIS COULD UNFOLD, WHAT YOUR VISION IS. >> I WAS REALLY WRESTLING WITH THIS, FOR EXAMPLE, I DON'T BEING THAT AUTOMATIC DEIDENTIFICATION WILL DO THE JOB. IT CREATES HIGH ENOUGH CONFIDENCE LEVELS FOR ORGANIZATIONS TO DO THIS AND IT WAS ACTUALLY AS I WAS LIFICKING TO SOME OF THE TALK THIS MORNING THEY REALIZED THAT IF YOU WANT TO TAKE THAT KNOWLEDGE BASED EXPERTISE, AND YOU ALSO WANT TO INTOPERATE INTO THE TRAINING, WHAT I THINK OF AS A STATISTICAL NLP OR DATA-DRIP MIND SET, YOU HAVE TO PUT THINGS WITH THE DATA. 
THERE ARE A VARIETY OF WAYS WE CAN DO IT JUST BRAINSTORMING AND HERE WE CAN DO MORE STORMING,. AND HAVE THAT KIND OF DATA, INTERNSHIP FOR Ph.D. STUDENTS OR OTHER WAYS OF DOING IT TO BRING PEOPLE IN-HOUSE AS PART OF THEIR TRAINING. AND I THINK THAT ONE OF THE BRIDGES TO THIS, COULD IN FACT BE USING THE BIOMEDICAL LITERATURE OF THE BRIDGE BECAUSE ALTHOUGH IT IS QUITE DIFFERENT, THERE ARE IN FACT A LOT OF SIMILARITIES. AND THEN PROVIDING THINGS IN THE PUBLIC DOMAIN THAT ALLOWS THEM TO DO THIS AND INCENTIVIZING THIS. AND THEN BY CREATING POSITIONS AND HAVING A RESEARCH TRACK THAT ALLOWS THEM TO DO THIS AS A POST DOC OR Ph.D. TRAINING BUT WOULD BE INTERESTED IN OTHER THOUGHTS ABOUT THIS. >> SO I WOULD LIKE TO ADD TO THAT COMMENT, AND ALSO I'D LIKE TO MENTION SEVERAL INITIATIVES THAT ARE CURRENTLY UNDERWAY THAT ARE GENERATED--GENERATING LAYERS ANNOTATIONS THAT WILL BE MADE AVAILABLE TO THE RESEARCH COMMUNITY THROUGH DATA USE AGREEMENTS, SO WE ARE--CLINICALLY FOLLOWING INITIATIVES AND THEY ARE CREATING TILES OF ANNOTATIONS OF CLINICAL DAT AND THEY ALL LAYERS OF ANNOTATIONS ON THE SAME CONTENT. THE LAYERS ARE THE BANK, THE ANNOTATIONS BE [INDISCERNIBLE] ANNOTATION WHICH IS IS THE PREDICATE ARGUMENT STRUCTURE AND CO VERMEN INFECTED FERENCE ANNOTATIONS--CO REFERENCE ANNOTATIONS AND DOMAIN RELATIONS, AND ENTITY ANNOTATIONS BASED ON THE UMLS, THE COMBINED PURPOSE FROM THOSE FOUR WILL BE VERY CLOSE TO TWO MILLION WORDS WHICH IS ALSO GOING TO BE CLOSE OF PANTRY BANK AND THIS IS CLINICAL DATA. SOME OF THEM YOU MAY NOT HAVE HEARD, MYP A PACK WHICH IS A MULTISERVICED CLINICAL PAT NORM THE OTHER IS THE SHARP INITIATIVE. THE OTHER IS THE SHARED PROJECT, SHARED ANNOTATED RESOURCES AND LAST ONE IS TIME. TEMPORAL HISTORIES OF MEDICAL EVENT. SO I FORGOT ONE IMPORTANT LAYER OF ANNOTATIONS AND THOSE ARE THE TEMPORAL RELATIONS FOLLOWING THE TIME ML. NOW WE SHOULD DEKISS HOW TO MAKE THESE AVAILABLE TO THE COMMUNITY. 
DATA USE AGREEMENTS ARE GIVEN BUT WHAT WOULD BE THE CENTRAL MISALIGNED CHROMOSOMES SOME WOULD IT BE THE NATIONAL LIBRARY OF MEDICINE? IT WOULD BE THE LINGUISTIC DATA CONSULTIUM AND THE BRANCH OF THE CONSORTIUM, I'M THROWING SOME IDEAS. THIS IS MY COMMENT AND SECOND IS A FOLLOW UP QUESTION, IMAGINE WE HAVE REMOVED ALL THE HURDLES THAT WE HAVE FILLED OUT ALL THE DATA USE AGREEMENTS WE DON'T HAVE ALL INVESTIGATORS THAT WANT TO WORK AND THEY HAVE ACCESS TO TO--HOW CAN WE--WHAT ARE THE MOTHERS AND HOW CAN WE UTILIZE THE VAST AMOUNT OF UNLABELED DATA IN CONJUNCTION WITH THE TWO MILLION GOAL ANNOTATED DATA THEY MENTIONED? >> SO I MENTIONED BRIEFLY WHEN I WAS SPEAKING, I ALLUDED TO TREND TOWARD SEMISUPERVISE TED METHODS IN NATURAL LANGUAGE PROCESSING, I THEN IS REALLY THE WAVE OF THE FUTURE BECAUSE WE DISCOVERED THAT ANNOTATED DATA JUST DOESN'T COME IN LARGE ENOUGH QUANTITIES, THIS IS HAPPENING IN PARALLEL, IN THE SOCIAL MEDIA WORLD, WHERE THERE'S SO MUCH VARIATION THAT YOU SIMPLY CAN'T ASSUME THAT THE TRADITIONALLY TRAINED MODEL AND YOU WON'T BE ABLE TO TRAIN OR KEEP UP WITH THESE ANNOTATION MODELS. THE THREE CATEGORIES THAT I GAVE OF THOSE, I AM MOST OPTIMISTIC ABOUT USING INFORMATIVE PRIORS IN WHAT WOULD OTHERWISE BE UNSUPERVISED MODEL SAYS BECAUSE THOSE THINGS ACTUALLY HAVE THE POTENTIAL TO BE BOTH DATA DRIVEN --WHEN YOU TALK ABOUT THAT AS A SMALL SEED AND WORK WITH LARGE GOOPS OF DATA, FOR THAT YOU'RE GOING TO THE CLUTTER AND YOU WANT TO RUN OUT ON AMIA SWRON ECTWO AND THERE ONCE AGAIN, WE'RE BACK TO THE PROBLEM OF HIPPA AND PRIVACY AND CONFIDENTIALITY ISSUES AND SO FORTH AND I WANT TO MENTION THAT DR. FRIEDMAN WANTED TO COMMENT SO I WANT TO MAKE SURE SHE HAS THE CHANCE TO. >> SO IN MEDICAL NOTES THE PURPOSE IS TO JOT DOWN THE HISTORY OF THE PARENTS FOR OTHER PHYSICIANS TO LOOK AT SO IT'S SPECIALIST AND WHEN YOU LOOK AT A NOTE THERE'S IMPLIED INFORMATION. 
LOT OF INFORMATION THAT THE READERS NOTE UNDERSTANDS BUT IT'S NOT IN THE NOTE EXPLICITLY SO I'M WONDER FIGURE YOU SAW THAT IN THE BIOLOGICAL NOTES THAT WHEN THE BIOLOGIST WERE ANNOTATING THERE WERE MEANING BEHIND IT THAT THEY COULDN'T ANNOTATE AND HOW THEY DEALT WITH IT AND I'M WONDERING TO THE PANEL HOW WOULD YOU DEAL WITH THAT SITUATION. >> OKAY, SO, WE ALSO ENCOUNTERED THE SINGULAR PROBLEMS WHEN WE ASKED BIOLOGIST, WITHOUT ANY GUIDANCE, THINGS THAT, OKAY, TRY TO FIND OUT VIRUS EVENTS WHICH ARE STATE INDEED IN SENTENCE. THEN DEPENDING ON THE BACKGROUND OF BIOLOGIST, THAT ANNOTATIONS ARE QUITE DIFFERENT FROM ONE TO THE OTHER BECAUSE THE BIOLOGY ARE SORT OF SUBSTANTIAL BACKGROUND KNOWLEDGE. DATING FOR IT. SOME INFORMATION SO THERE IS LOTS OF TYPE OF PROBLEMS ARRIVE AND I SEE THE SAME PROBLEM IN THIS SUMMARY, BETWEEN THE OTHER TREATMENT AND MEDICAL PROBLEMS SO PEOPLE INFER, LOTS OF RELATIONSHIPS, IN THE SENTENCES. SO [INDISCERNIBLE] ACTUALLY SMOOTH OUT THAT KIND OF IMPRESSIVE INFERENCES AS WELL. THAT MEANS THE SYSTEM DON'T NEED TO UNDERSTAND HOW RATIONALLY BECAUSE OF THIS STATEMENT IN THE SENTENCE AND THIS KNOWLEDGE WE HAVE, SO WE CAN INFER THIS STATEMENT, THAT IS A HUMAN BEING DOES, THAT THE [INDISCERNIBLE] ACTUALLY SOMETIMES ROUND AND THAT RELATIONSHIPS WITHOUT KNOWING, THAT'S KIND OF A DANGER AND ALSO ADVANTAGE OF LEARNING THAT'S THE RESULT, SO, MUCH OF THE LEARN SUGGEST ROBUST AND USEFUL, THAT IS BECAUSE THEY DON'T NEED EXPLICIT UNDERSTANDING OF THAT KIND AND SOME KIND OF [INDISCERNIBLE] CAN TREAT THE PROBLEM. OKAY, SO WHAT WE'RE SEEING IS THE PERFORMANCE OF THE SYSTEM BECAUSE THE SYSTEM DOESN'T UNDERSTAND EXPLICITLY SO WHAT I REALLY WANT TO DO IS TO COMBINE THE PROPOSITIONS OR STATEMENTS OR INFORMATION IN THE SENTENCE AND SOME KIND OFKSPLICITY KNOWLEDGE--EXPLICIT KNOWLEDGE TO COMBINE. SEE 99 OF THAT IS,--NONE OF THAT IS BUT STILL VERY IMPRESSIVE WAY. 
>> I WOULD ALSO--I WOULD ALSO SAY THAT THERE IS A CATEGORY OF MODELS THAT'S INCREASINGLY POPULAR, THAT ALLOW YOU TO SPECIFY THAT THEIR EXISTS LATENT STRUCTURE THINGS THAT ARE NOT NECESSARILY REFLECTED IN WHAT'S VISIBLE AND THOSE ACTUALLY ARE AGAIN A NICE WAY TO PROVIDE A SOFTOPPORTUNITY FOR THEIR TO BE IMPLEASESIT CONNECTIONS BETWEEN THING WITH SOME STRUCTURE BUT LETTING THE SYSTEM AND A NICE EXAMPLE OF THAT, THERE'S BEEN A BUNCH OF WORKING, LOOKING AT THE SCIENTIFIC LITERATURE, NSF ABSTRACTS, OVER THE YEARS AND SOME OF THESE MODELS ARE ABLE TO AUTOMATICALLY DISCOVER THAT HAVE YOU A PARTICULAR TOPIC THAT AT AN EARLIER POINT IN TIME IS DOMINATED BY THIS KIND OF TERMINOLOGY, AT A LATER POINT IN TIME IT'S DOMINATE BIDE THIS KIND OF TERMINOLOGY BUT THE TWO THINGS ARE LATENTLY THE SAME SUBJECT AND THAT'S AN APPROACH THAT WOULD ACTUALLY BE--I WAS GOING TO SAY COMPLIMENTARY BUT CONSIST WENT WHAT THEY'RE DISCUSSING. >> THE OTHER PROBLEM IS THAT KIND'VE APPROACH, I THINK THE PAT PATTERNS SOMETIMES PEOPLE ENCODE TO EXTRACT THE INFORMATION ENCODE THAT KIND OF INFERENCES, BECAUSE THE CONTEXT IS FIXED SO CAN YOU DO IT, BY PATTERN EXTRACTION PATTERNS, IT IS NOT REALLY LINGUISTIC BALANCE, IT'S SOMEWHAT INFERENCES, ALSO INVOLVED IN THAT PATTERN RIGHTING, SO--PATTERN WRITING SO MAYBE THE CHALLENGE IS HOW WE CAN REALLY REFERENCE THAT KIND OF KNOWLEDGE WHICH CONTEXT HAD--HIGHLY CONTEXT DEPENDENT BUT A SPECIALIST HAVE, AND ALREADY EMBEDDED IN THE PATTERNS, WE SAW MORE STATISTICAL TYPE OF TECHNIQUES NOW. SO, WE REALIZE THAT PATTERN MATCHING EFFORTS, ACTUALLY, WE'RE IN THE CONTEXT IS FAIRLY RESTRICTED. AND THEN, INFERENCES, EVEN ENCODED IN THE PATTERN ITSELF. >> --BEFORE WE CAN TRANSLATE THE GOOD STUFF NLP CAN DO INTO THE REAL WORLD WHERE IT WILL IMPROVE HEALTHCARE AND ONE I CAN THINK OF IS SKEPTICISM, YOU KNOW SKEPTICISM ABOUT COMPUTATIONOLOGY IN GENERAL. 
AN EXAMPLE MIGHT BE NO COMPUTER WILL TELL ME WHAT TO DO SO WHAT ABNORMALITIES STACKLES YOU THINK YOU MIGHT ONCE WE GET PAST THE DATA AVAILABLE WHICH BY THE WAY I THINK NLP KNOCK K'S HELP WITH THE DEIDENTIFICATION STUFF AND ALL THAT TO BE DONE TO GET PASSED THAT AND THERE ARE DATA SETS MORE AND MORE AVAILABLE AND THEY'RE OUT THERE AND I KNOW OF OF SOME THAT I WORK WITH SO MUCH NLP BUT CLINICAL DATA, AND THE BOTTOM LINE IS WHAT OTHER OBSTACLES YOU SEE LIKE SKEPTICISM AND WHAT DO YOU THINK WE CAN DO ABOUT THEM? >> [INDISCERNIBLE]. >> MENTAL DISORDERS SO WE PROCESS THIS AND AND IN THE BEGINNING WE ARE DIAGNOSING CERTAIN IMPLEMENTATIONS THAT THEY HAVE, AND HAVE THE THERAPISTS COME UP WITH DIAGNOSIS AND CLOSE THE LOOP WITH THE CLOSE [INDISCERNIBLE] SO IN THE BEGINNING, THE SYSTEM WAS PERFORMING, SO BY THEN, IN A COUPLE MONTHS TIME, THE SYSTEM IMPROVED CONSIDERABLY BECAUSE OF MORE DATA AND THEN, THE THERAPY AND APPLICATIONSISTS REALLY LIKE THE SYSTEM VERY MUCH SO ALL THIS SKEPTICISM IS DISAPPEARING IN TIME. SO YOU HAVE TO BE COURAGEOUS TO BE START, AND THEN, IT'S OVER THAT, NOW AS FAR AS THE DIFFICULTIES, I THINK IT'S THE NATURE OF THE DATA AND THE LACK OF TOOLS FOR EXAMPLE, ONTOLOGYSTS I TRY TO MAKE THAT POINT BEFORE IN THE NATURE OF DAT TED ABBREVIATIONS AND SHORT HAND NOTATIONS IN NATURAL LANGUAGE WE HAVE NOT SEEN MUCH OF THAT WE DON'T HAVE TOOLS FOR THAT. >> YEAH, SO I GUESS MY THINKING ON THIS IS INFORM BIDE HAVING BEEN INVOLVED IN THE COMIRB ASSISTED CODING ARENA FOR THE LAST 12 YEARS OR SO, IN THAT DOMAIN MPLET THE BIGGEST COMPETITOR IS THE STATUS QUO. 
YOU ALSO HAVE JOB FEARS SO THERE'S PEOPLE WHO FEEL THREATENED BY THIS, THERE THERE'S A WHOLE COMMUNITY OF PEOPLE WHO ARE WORRY BODY THEIR JOBS GOING AWAY, THE FOCUS NEEDS TO BE ON THE ABILITY TO HAVE COMPUTER DOS WHAT THEY'RE GOOD AT AND HAVE PEOPLE DO WHAT THEY'RE GOOD @ AND SO IN REALITY WHAT YOU PUT SOMETHING INVOLVED IN A PROPER HUMAN IN THE WORK FLOW, PEOPLE ARE HAPPIER WITH WHAT THEY'RE DOING BUT THE ADOPTION CHALLENGES ARE PART OF IT, BUT A BIG PART IS ALSO THAT WE AS A COMMUNITY ARE NOT THAT ACCUSTOMED TO DEALING WITH THE IN THE TRENCHES PROBLEMS WHERE HAVE YOU HUMANS IN THE LOOP. THIS IS NOT A NATURAL LANGUAGE PROCESSING PROBLEM. THIS IS A PROBLEM THAT INVOLVES THE TECHNOLOGY, IT INVOLVES CRUCIALLY THE DATA BUT IT ALSO INVOLVES THE SUBJECT MATTER EXPERTISE AND YOU HAVE TO PUT THOSE TOGETHER IN A PACKAGE AND IF WE CAN MAGICALLY ASSUME WE'VE GOT THE DATA AND WE'VE GOT LOTS OF TECHNIQUES, THERE'S VERY RAPID ADVANCES, THERE ARE VERY RAPID ADVANCES THAT WE CAN MAKE BUT IT SEEMS TO ME THAT THE NEXT SET OF CHALLENGES REALLY HAS TO DO WITH ENGAGING THE REAL WORLD PROBLEM SPACE AND TRANSLATING THIS INTO SOMETHING THAT'S ACTUALLY GOING TO MAKE A DIFFERENCE TO PEOPLE AND FORGIVE THE BUSINESSY TERM BUT SHOW RETURN ON THE INVESTMENT THEY HAVE TO SEE THIS WILL DO SOMETHING FOR THEM. >> AND JUST TO COMMENT, BRIEF COMMENT AND WHAT I'VE BEEN NOTICING IS DEFINITELY A PARADIGM SHIFT IN THINKING ABOUT THE TECHNOLOGY. SO WHEN THEY SEE THE YOUNGER PHYSICIANS, I SEE THEM TRUSTING THE DEVICE AND GOOGLING MUCH MORE THAN TALKING WITH THEIR COLLEAGUES. SO I SEE THIS TECHNOLOGY SHIFT OF UBIQUITOUS COMPUTING, IT'S ALL AROUND US. >> MAYBE ONE OF THE MAJOR CIRCLE AND IS THE NATURE OF--I MEAN THE COMPUTER SYSTEM IS ALWAYS SOMEWHAT [INDISCERNIBLE] THAT'S A PROBLEM. FOR EXAMPLE, OUR GROUP BEING ENGAGED IN PATHWAY CONSTRUCTION FROM SCIENTIFIC PAPERS, BIOLOGICAL PAPERS AND SCIENTIST REALLY WANT TO SEE THE ORIGINAL PAPER ALL THE TIME. 
THEY DON'T TRUST THE OUTPUT OF THE COMPUTER, SO THEY WANT TO TRACE A WHY THE SYSTEM IS PATHWAY OR INTERACT WITH THIS ONE AND SO SOY DON'T KNOW THE CLINICAL--SO I DON'T KNOW THE CLINICAL PART, BUT IT MIGHT BE THE CASE THAT HOW LIMITED IS IT, THEY WANT TO SEE MORE SORT OF WHITE BOOK TYPE OF FACTORS, AND WANT TO TRACE TO THE ORIGINAL. >> THAT WOULD MAKE ME SIMPLY BECAUSE BECAUSE OF DIFFERENT DISCIPLINES IT'S HARD TO BED WHAT'S GOING ON INSIDE SOMEBODY ELSE'S GLASS BOX. THAT SAID, MY EXPERIENCE IS, THAT WHAT'S REALLY IMPORTANT IS NOT KNOWING WHAT'S GOING ON, IN THE INSIDE, IT'S HAVING THE SYSTEM AGAIN FOR FIVE ME NEAR MORPHIZING BUT HAVING THE SYSTEM KNOW WHEN IT SHOULDN'T BE TRUSTED. IT'S HAVING--IT'S HAVING A FRAMEWORK, AGAIN THIS GOES INTO HOLISTIC FRAMEWORK AND IT IS SOMETHING THAT THROUGHOUT NATURAL LANGUAGE PROCESSING IN ALL--ACROSS ALL THE DIFFERENT APPLICATION WEES SPEND WAY TOO MUCH TIME IN CONFERENCE ESTIMATION AND YOU SEE THAT IN SPEECH COMMUNICATE A LOT LESS IN OTHER AREAS BUT IF THE SYSTEM--IF YOU CAN KNOW WHEN YOU SHOULD TRUST AND AND AND KNOWING WHAT'S ANYTHING ON ON THE INSIDE IS NOT--IS NOT CENTRAL PROBLEM. 
>> ON WHAT THEY WERE SAYING ABOUT THE DISCONNECT BETWEEN THE TASK-ORIENTED NATURE OF THE PROBLEMS WE FACE AND THE GENERIC APPROACH THAT WE TAKE IN NATURAL LANGUAGE PROCESSING, THE COMMENT OR QUESTION HAS TO DO WITH PORTABILITY OF ALL THESE TECHNIQUES. HOW ARE WE GOING TO BE ABLE TO AT THE SAME TIME ADVANCE AND GET OVER ALL THESE GREAT CHALLENGES IN THE TECHNIQUES AND METHODS, AND THEN ACTUALLY MAKE THEM AVAILABLE SO THEY CAN BE USED IN A LARGE NUMBER OF PROBLEMS, PLACES, OR HOSPITALS, OR WHEREVER TO APPLY THEM? WHAT IS IT THAT WE CAN DO TO BRIDGE THESE TWO THINGS? IT HAPPENS, FOR EXAMPLE, AT PHARMGKB, WHERE WE HAVE BEEN DISCUSSING THIS PROBLEM OR WORKING ON IT FOR A YEAR, ON HOW TO GET PHARMACOGENOMIC [INDISCERNIBLE] FROM THE LITERATURE AND MAKE THEM AVAILABLE TO CURATORS, BUT THE CURATORS WON'T TRUST THEM. AND IN THE ANNOTATING PROCESS WE DISCOVER THE VERY DEEP UNIQUENESS OF THIS PROBLEM AND OF THE LANGUAGE APPROACHES THAT WE'RE GOING TO TAKE TO SOLVE IT, AND THEN SIT BACK AND WONDER, WELL, HOW IS THAT GOING TO BE APPLICABLE TO OTHER THINGS? SO IT'S LIKE A TASK BY TASK, PROBLEM BY PROBLEM APPROACH, AND IT WILL BE VERY DIFFICULT TO GENERALIZE. SO I WANT TO LISTEN TO YOUR THOUGHTS ON THAT. --THERE ARE TWO ELEMENTS TO THIS, ONE IS THAT IT'S OPEN SOURCE AND ONE IS THAT IT'S HIGHLY ADAPTABLE. ALTHOUGH THIS MAY NOT ANSWER YOUR QUESTION IN ITS ENTIRETY. *CAPTIONS RESUME SHORTLY YOU GO OUT THE FRONT DOOR, WE'LL HAVE SOMEBODY THERE TO POINT THE WAY. PLEASE, AND THIS IS ESPECIALLY FOR THE PEOPLE IN THE SECOND -- IN THE NEXT PANEL, PLEASE BE HERE BY 5 OF 2:00. WE'RE GOING TO START SHARPLY AT 2 O'CLOCK. >> OKAY, FOLKS. WELCOME BACK. AS I EXPLAINED TO A FEW OF YOU, WE FLY THE FOOD IN FROM PARIS EVERY MORNING, SO I HOPE YOU ENJOYED THAT LUNCH. I WANT TO MAKE AN ANNOUNCEMENT ABOUT TOMORROW'S SESSION. THE HAND-OUT THAT WE HAD OUT THERE SHOWS US STARTING AT 8 O'CLOCK. THAT IS IN FACT TRUE. 
NOTE THAT IT IS STARTING AT 8 O'CLOCK. THERE WILL BE, AS THERE WAS THIS MORNING, A LIGHT BREAKFAST, COFFEE AND SO ON, THAT WILL BE AVAILABLE BEGINNING AT 7 O'CLOCK. THAT JUST GOT DROPPED OUT OF THE PROGRAM SOMEHOW, BUT IT WILL BE AVAILABLE IN THE MORNING, SO IF YOU'RE EARLY ENOUGH YOU'LL BE READY TO GO AT 8 WHEN THE NIDA CONFERENCE STARTS. OUR NEXT PANEL IS LINGUISTIC BASED METHODS. THE CHAIR IS HUA XU FROM VANDERBILT UNIVERSITY, DEPARTMENT OF BIOINFORMATICS THERE. HE HAS A Ph.D. IN BIOMEDICAL INFORMATICS FROM COLUMBIA AND HE WILL INTRODUCE THE OTHER MEMBERS OF THE PANEL. ONE OTHER POINT. WE DO ACTUALLY OF COURSE NEED TO GET SOME TIME BACK. I HATE TO TAKE IT OUT OF DISCUSSION OR QUESTIONS AND ANSWERS BECAUSE THAT'S THE OPPORTUNITY YOU HAVE TO INTERACT WITH PEOPLE. SO I'M GOING TO TAKE THE PREROGATIVE HERE AND SAY INSTEAD OF HAVING A FORMAL BREAK, WE WILL HAVE EVERYTHING OUT THERE, AND AS YOU FEEL THE NEED FOR EITHER REFRESHMENT, TO STRETCH YOUR LEGS, RESTROOM OR WHATEVER, JUST LEAVE AS YOU NEED TO, AS MANY TIMES AS YOU NEED TO, JUST DO IT QUIETLY, AND WE WILL NOT HAVE A FORMAL BREAK DURING THIS AFTERNOON. DR. XU. >> THANK YOU. GOOD AFTERNOON, EVERYONE. IN THIS PANEL WE HAVE THREE EXCELLENT SPEAKERS, DR. CHITTA BARAL, DR. HIRSCHMAN FROM THE MITRE CORPORATION AND JAMES PUSTEJOVSKY FROM BRANDEIS UNIVERSITY. I'LL START WITH DR. CHITTA BARAL. HE IS A PROFESSOR OF COMPUTER SCIENCE AND ENGINEERING AT ARIZONA STATE UNIVERSITY AND IS A PAST CHAIR OF THE DEPARTMENT. HE RECEIVED HIS Ph.D. IN COMPUTER SCIENCE IN 1991 FROM THE UNIVERSITY OF MARYLAND, COLLEGE PARK. CHITTA'S RESEARCH INTERESTS ARE IN THE AREAS OF ARTIFICIAL INTELLIGENCE, KNOWLEDGE REPRESENTATION, NATURAL LANGUAGE PROCESSING AND PROCESSING OF MOLECULAR BIOLOGY AND PHARMACOLOGY. HIS BOOK ON KNOWLEDGE REPRESENTATION AND REASONING IS PUBLISHED BY CAMBRIDGE UNIVERSITY PRESS. CHITTA'S RESEARCH IS FUNDED BY VARIOUS FEDERAL FUNDING AGENCIES. 
HE'S ASSOCIATE EDITOR OF THE JOURNAL OF AI RESEARCH AND AREA EDITOR OF THE ACM TRANSACTIONS ON COMPUTATIONAL LOGIC. HIS GROUP DEVELOPED SEVERAL SYSTEMS FOR EXTRACTING BIOLOGICAL KNOWLEDGE FROM TEXT, INTEGRATING IT WITH AVAILABLE DATABASES, AND REASONING WITH IT, LEADING TO APPLICATIONS SUCH AS CONSTRUCTING PATHWAYS, PREDICTING DRUG-DRUG INTERACTIONS AND IDENTIFICATION OF NOVEL DRUG INDICATIONS. THANK YOU. [APPLAUSE] >> THANK YOU. SO I STARTED MAKING A NEW SET OF SLIDES WITH THIS THING, NATURAL LANGUAGE UNDERSTANDING, FOR PEOPLE TO THINK ABOUT. THIS TERM HAS BEEN BANDIED ABOUT QUITE A BIT IN THE MORNING, SO WHAT IS NATURAL LANGUAGE UNDERSTANDING? MY DEFINITION FOR MY TALK IS THAT SOMEONE READS A TEXT AND UNDERSTANDS IT. SO WHAT DO WE MEAN? WE MEAN THAT AFTER THAT, IF WE ASK THE PERSON A QUESTION ABOUT ANYTHING IN THE TEXT, THEY CAN ANSWER IT. SO THAT'S THE WORKING DEFINITION THAT I KEEP IN MY MIND OF NATURAL LANGUAGE UNDERSTANDING: YOU CAN GIVE SOMEONE A FEW PAGES OF TEXT, THEY CAN READ IT, AND YOU ASK QUESTIONS AND THEY CAN ANSWER THEM. SO KEEP THIS IN MIND AND I'LL COME BACK TO THIS TOWARDS THE END OF THE TALK. THIS IS DECISION MAKING: WHAT KIND OF INPUTS DO WE NEED FOR THIS? THERE ARE FACTS, AND VARIOUS INTERACTIONS OF VARIOUS OTHER FACTS THAT WE HAVE. THEN THERE ARE GENERAL AND DOMAIN-SPECIFIC RULES WE NEED FOR THAT PARTICULAR DOMAIN. IF WE DEAL WITH DRUGS, MAYBE WE WANT TO KNOW VARIOUS RULES OF PHARMACOKINETICS, OR HOW DRUGS INTERACT, OR HOW WE CONSTRUCT PATHWAYS. THEN THERE ARE GENERAL REASONING MECHANISMS: WHAT IS EXPLANATION, DIAGNOSIS, HOW DO WE PREDICT, WHAT IS THE PLAN OR DESIGN? SO WE NEED THIS KNOWLEDGE AND THESE FACTS TOGETHER TO MAKE CERTAIN DECISIONS. SO WHERE DO WE GET THE KNOWLEDGE FROM? THE FACTS ARE IN DATABASES, OR WE CAN EXTRACT THEM FROM TEXT. THE DOMAIN-SPECIFIC RULES WE COULD GET FROM BASIC TEXT, AND THE REASONING, WHAT IS THE EXPLANATION, WHAT IS DIAGNOSIS, AT THIS POINT WE WRITE (INDISCERNIBLE). 
SO WHEN I TALK NATURAL LANGUAGE PROCESSING TO NATURAL LANGUAGE UNDERSTANDING THE PATHWAY NATURAL LANGUAGE PROCESSING IS MORE FOCUSED ON FULFILLING THE BIODOMAIN EXTRACTING FACTS FROM TEXT. WE CAN AUTOMATICALLY EXTRACT OR COLLABORATIVELY DEVELOP THEM INTO DATABASES. NATURAL LANGUAGE UNDERSTANDING WE NEED TEN MORE GENERAL KNOWLEDGE FROM THE TEXT THAT CAN BE USED TOGETHER WITH THE EXTRACTOR FACTS FOR UNDERSTANDING SO I'LL GIVE AN EXAMPLE OF WHAT I'M TRYING TO TALK ABOUT THIS FACT AND KNOWLEDGE BECAUSE OFTEN KNOWLEDGE IS USED TOGETHER. I WANT TO THIS TO ME IS A FACT. INHIBITORS OF P-450 AND GLYCO PROTEINS. SO WE TALK VERY SPECIFIC ENTITIES. WHILE HERE IN THE RED PART INDUCES INDIVIDUALS ENZYMES OF DRUG METABOLISM ALSO AFFECT DRUG TRANSPORTER PROTEINS. SO WE'RE NOT TALKING ABOUT ANY PARTICULAR INDUCER OR INHIBITORS OR ANY PARTICULAR DRUG TRANSPORTER PROTEINS, WE'RE TALKENING MORE GENERAL TERMS. AND SUCH KIND OF KNOWLEDGE DO EXIST IN THE TEXTBOOKS THEY EXIST. AND PART OF MY INTEREST IS BE ABLE TO GET THIS KNOWLEDGE ALSO ECT TRATTED OR TAINED FROM THE TEXT SO TOGETHER WITH THIS KNOWLEDGE AND THE FACT WE CAN MAKE MORE INTERESTING CONCLUSIONS. SO THAT WOULD BE MY -- I WILL BE TALKING AB THOSE THINGS THOSE SUITS SO FIRST ABOUT THE EXTRACT AND WANT TO TALK GENERALIZED EXTRACTION METHOD, THEN I WILL TALK ABOUT THE HOW WE ARE USING THIS EXTRACTOR TRACK AND SOME GENERAL KNOWLEDGE TO DO SOME KIND OF DISTANT MAKING AND OBTAINING GENERAL KNOWLEDGE FROM TEXT THE RULES THAT I TALKED ABOUT WHICH IS NEEDED FOR NATURAL LANGUAGE UNDERSTANDING. 
SO THIS IS AN EXAMPLE WHICH IS VERY SIMILAR TO THE IDEA PRESENTED IN THE MORNING. IT'S A SYSTEM THAT WAS DEVELOPED BY (INAUDIBLE) GROUP, AND WHAT HE DID IS USE DATABASES LIKE INTACT AND OBTAIN THE PROTEIN PAIRS THAT WERE IN THE DATABASE, AND HE LOOKED UP PUBMED AND FOUND SENTENCES, BECAUSE SOME OF THESE DATABASES TELL YOU WHERE A PARTICULAR FACT WAS EXTRACTED FROM, AND YOU LOOK AT THE SENTENCES AND FIGURE OUT FROM WHICH ONE IT MIGHT HAVE BEEN EXTRACTED. SO IT WAS A SUPERVISED KIND OF THING, BUT SUPERVISED WITHOUT OBTAINING ANNOTATED TEXT. THEN HE WENT OVER IT TO FIND THE PATTERNS, ANALYZE THE PATTERNS, AND USE THE PATTERNS, AND HIS SYSTEM DID WELL IN THE BIOCREATIVE COMPETITIONS. THIS IS DR. JÖRG HAKENBERG. SO NOW I WANT TO TALK TO YOU ABOUT THIS GENERAL EXTRACTION. ONE THING WE DID IN OUR GROUP, WHILE DOING THIS AND DEALING WITH PUBMED (DR. GONZALES WAS HERE), IS NOTICE THAT A LOT OF THE COMPUTING TIME WAS GOING INTO JUST READING THE FILES AND CLOSING THE FILES. WE HAVE THE FILES OF THIS TEXT, AND WE KEEP READING THEM WHEN WE ARE DOING THIS. OUR IDEA WAS, WHY NOT TAKE THE PROCESSED TEXT, TO SOME EXTENT, PUT IT IN A DATABASE, AND REFORMULATE: SO TAKE THE TEXT, PARSE IT, ANNOTATE IT, AND PUT IT IN A DATABASE, AND NOW LET THE DATABASE HANDLE THE MANY INPUT-OUTPUT ASPECTS, BECAUSE DATABASES ARE GOOD AT IT, AND FIGURE OUT A WAY SO THAT YOUR EXTRACTION IS A DATABASE QUERY TO THIS PARSE DATABASE. THAT IS WHAT WE CALL THE PTQL SYSTEM, AND THAT'S THE MOTIVATION I TALKED ABOUT. SO FOR EXAMPLE, AT THAT TIME WE WERE USING THE (INDISCERNIBLE): AT THE BOTTOM IS A NOUN, AND THERE'S THE RD-53 AND THEN THESE KINDS OF LINKS, BUT ANY OTHER SEMANTIC PARSER WOULD BE FINE. 
YOU STORE IT, AND WE DEVELOPED A QUERY LANGUAGE, BECAUSE WE ARE STORING PARSE TREES, SO YOU NEED A DIFFERENT LANGUAGE TO TRAVERSE THEM IN A NICE WAY. WE DEVELOPED THIS TREE LANGUAGE, WHICH IS BASED ON SEMISTRUCTURED DATABASE LANGUAGES, AND USING THIS YOU ARE ABLE TO WRITE QUERIES SAYING, LOOK, I WANT TO EXTRACT THOSE PAIRS OF NOUN PHRASES WHICH ARE RELATED IN THIS WAY; YOU CAN TRAVERSE THE TREE. SO WE DID THAT, AND WE ARE NOT ONLY ABLE TO EXTRACT VARIOUS FACTS USING THIS METHOD, WE CAN DO NORMALIZATION AND VARIOUS OTHER KINDS OF PROCESSING OF BIOMEDICAL TEXT WITH RESPECT TO THIS MATTER. SO THE IDEA IS PERHAPS A POSSIBILITY THAT IF WE HAVE A LARGE BODY OF TEXT WITH KNOWN SEMANTIC PARSES, WE CAN CREATE AN INTERFACE WHERE PEOPLE CAN ASK HIGH LEVEL QUERIES FOR EXTRACTION. IT AVOIDS THIS: TODAY I'M EXTRACTING PROTEIN-PROTEIN INTERACTIONS, TOMORROW I WANT TO EXTRACT GENE-DRUG INTERACTIONS, SO I HAVE TO DO IT AGAIN FROM SCRATCH. INSTEAD I ASK A DIFFERENT QUERY. SO THAT WAS THE IDEA. SO NOW GOING TO A COUPLE OF EXAMPLES THAT WE DID: ONE IS BUILDING PATHWAYS IN THE CLINICAL DOMAIN, AND BUILDING WHAT YOU MAY CALL REASONING CHAINS. SO HERE THE IMPORTANT PART IS NOT JUST TO EXTRACT THE FACTS BUT CONNECTING THE DOTS. THIS IS A PATHWAY. A PARTICULAR PATHWAY IS DIFFERENT FROM THE NETWORK, AND WE WERE CREATING THIS NETWORK, NOT THE PARTICULAR PATHWAY. THAT'S THE DIFFERENCE BETWEEN THIS AND THIS. KNOWLEDGE GOES INTO DECIDING WHAT THIS STRUCTURE IS, BESIDES THE ARROWS. DRUG TRANSPORTERS DISTRIBUTE THE DRUG, SO THAT IS KIND OF THE INITIAL PROCESS; AFTER THE DRUG IS DISTRIBUTED, THEN THERE IS METABOLISM IN THE LIVER, SO THERE IS A SEQUENCE THAT HAPPENS. KNOWING THIS EXTRA KNOWLEDGE, WE ENCODED THE KNOWLEDGE AS RULES, SO FOR EXAMPLE ELIMINATION OCCURS ONLY IF THE DRUG IS ALREADY METABOLIZED. AND USING THESE RULES WE ARE ABLE TO OBTAIN A PATHWAY MUCH CLOSER TO THIS THAN JUST A NETWORK. 
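THE IDEA OF TREATING EXTRACTION AS A DATABASE QUERY CAN BE ILLUSTRATED WITH A SMALL SKETCH. IT IS NOT THE SPEAKER'S SYSTEM OR ITS QUERY LANGUAGE: IT ASSUMES THE PARSER OUTPUT HAS ALREADY BEEN REDUCED TO SUBJECT-VERB-OBJECT LINKS, AND THE TABLE LAYOUT, THE FUNCTION NAMES `build_store` AND `extract_pairs`, AND THE ENTITY NAMES ARE ALL HYPOTHETICAL.

```python
import sqlite3


def build_store(parsed_sentences):
    """Load pre-parsed sentences into an in-memory database.

    Each sentence is a list of (subject, verb, object) links, standing in
    for the output of whatever parser was run once, ahead of time.
    """
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE link (sent_id INTEGER, subj TEXT, verb TEXT, obj TEXT)"
    )
    for sent_id, links in enumerate(parsed_sentences):
        db.executemany(
            "INSERT INTO link VALUES (?, ?, ?, ?)",
            [(sent_id, s, v, o) for (s, v, o) in links],
        )
    return db


def extract_pairs(db, verbs):
    """Extraction expressed as a query: pairs joined by one of the verbs."""
    placeholders = ",".join("?" for _ in verbs)
    rows = db.execute(
        f"SELECT subj, obj FROM link WHERE verb IN ({placeholders}) "
        "ORDER BY sent_id",
        list(verbs),
    )
    return rows.fetchall()


# Hypothetical parsed text: the parsing cost is paid once, up front.
parsed = [
    [("proteinA", "interacts_with", "proteinB")],
    [("drugX", "inhibits", "enzymeY")],
]
store = build_store(parsed)

# Today's task and tomorrow's task differ only in the query.
protein_pairs = extract_pairs(store, ["interacts_with"])
drug_enzyme_pairs = extract_pairs(store, ["inhibits"])
```

THE POINT OF THE DESIGN IS THAT A NEW EXTRACTION TASK MEANS A NEW QUERY, NOT A NEW PASS OVER THE FILES.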
THERE ARE LIMITATIONS, WE ENCOUNTER CLOSED LOOP INTERACTIONS AND THOSE THINGS, BUT THIS WAS A GOOD START. ALONG THE SAME LINES, ANOTHER PROBLEM IS WHAT IS REFERRED TO AS DRUG-DRUG INTERACTIONS. SO WE CAN EXTRACT VARIOUS FACTS ABOUT PROTEIN INTERACTIONS AND DRUG PHARMACOKINETICS AND THEN PUT THOSE FACTS TOGETHER TO BE ABLE TO MAKE A FINAL PREDICTION ABOUT A PARTICULAR DRUG POSSIBLY INTERACTING WITH ANOTHER DRUG. SO WHAT I'M TRYING TO SAY IS THAT IN EXTRACTION, OR GETTING KNOWLEDGE FROM TEXT, WE NEED TO GO BEYOND WHAT IS EXPLICITLY OUT THERE IN A SENTENCE, THAT THIS PARTICULAR THING INTERACTS WITH THIS, TO WHAT IS IMPLIED BY THAT WITH THE BACKGROUND KNOWLEDGE: WHAT YOU CAN INFER FROM THIS TEXT AND OTHER KNOWLEDGE, AND WHAT BROADER INFERENCES YOU CAN DRAW FROM IT. THIS WAS AN EXAMPLE OF SUCH A TASK. WE OBTAINED THE TEXT FROM DRUGBANK AND WE ADDED SOME FACTS ALSO USING THE (INDISCERNIBLE) SO THAT WE COULD HAVE BIGGER COVERAGE. THESE ARE THE FACTS: DRUG-DRUG, PROTEIN-PROTEIN, ROLE, VARIOUS KINDS OF FACTS EXTRACTED, AND WE HAVE THE KNOWLEDGE ENCODING FOR THESE ENZYME-DRUG INTERACTIONS. SO SAYING DRUG 1 INCREASES, SAY, THE AMOUNT, THE EFFECTIVITY, OF DRUG 2 IF DRUG 1 MAKES THIS PROTEIN'S LEVEL LOW AND THIS IS AN ENZYME WHICH METABOLIZES DRUG 2. SO BECAUSE THE ENZYME METABOLIZES DRUG 2 AND ITS LEVEL IS LOWER, DRUG 1 INCREASES DRUG 2. SO THIS IS A PARTICULAR KIND OF INTERACTION, AND THIS FACT IS DERIVED USING OTHER RULES AS GIVEN HERE. WE NEEDED EXTRA KNOWLEDGE, SO WE WENT AND PICKED UP THE TEXTBOOKS TO GET THE KNOWLEDGE AND ENCODED IT OURSELVES. THE IDEA WAS, WHAT IF WE COULD WRITE A SYSTEM WHICH COULD GO AND READ THESE PARTICULAR CHAPTERS AND OBTAIN THAT KNOWLEDGE FROM THE TEXT. THAT IS WHERE I GO NEXT: OBTAINING MORE GENERAL KNOWLEDGE FROM TEXT, NEEDED FOR NATURAL LANGUAGE UNDERSTANDING. SO THIS IS THE SAME EXAMPLE I SHOWED YOU EARLIER. WHAT DO YOU DO? THIS IS A PARTICULAR DIRECTION OF RESEARCH I WANT TO FOCUS ON, AND IN THE NEXT FEW SLIDES I WANT TO TELL YOU MORE ABOUT IT. 
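THE DRUG-DRUG INTERACTION RULE DESCRIBED ABOVE CAN BE WRITTEN DOWN AS A SHORT SKETCH. THIS IS AN ILLUSTRATION ONLY, NOT THE SPEAKER'S ENCODING: THE FUNCTION NAME `predict_increases`, THE TWO FACT SETS, AND THE DRUG AND ENZYME NAMES ARE HYPOTHETICAL, AND THE RULE IS REDUCED TO ITS SIMPLEST FORM (DRUG 1 LOWERS AN ENZYME, THE ENZYME METABOLIZES DRUG 2, SO DRUG 1 MAY INCREASE DRUG 2).

```python
def predict_increases(lowers_enzyme, metabolizes):
    """Combine two kinds of extracted facts into predicted interactions.

    lowers_enzyme: set of (drug, enzyme) facts, drug lowers the enzyme level
    metabolizes:   set of (enzyme, drug) facts, enzyme metabolizes the drug
    Returns sorted (drug1, drug2) pairs where drug1 may increase drug2.
    """
    predictions = set()
    for drug1, enzyme in lowers_enzyme:
        for other_enzyme, drug2 in metabolizes:
            # The rule fires only when both facts mention the same enzyme.
            if other_enzyme == enzyme and drug1 != drug2:
                predictions.add((drug1, drug2))
    return sorted(predictions)


# Hypothetical extracted facts; neither sentence states the interaction itself.
lowers_enzyme = {("drugA", "enzymeE")}
metabolizes = {("enzymeE", "drugB"), ("enzymeF", "drugC")}

interactions = predict_increases(lowers_enzyme, metabolizes)
```

NEITHER INPUT FACT MENTIONS BOTH DRUGS; THE INTERACTION ONLY APPEARS WHEN THE FACTS ARE PUT TOGETHER WITH THE BACKGROUND RULE.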
NATURAL LANGUAGE UNDERSTANDING TO ME IS THAT, GIVEN A TEXT, IF YOU UNDERSTAND THAT TEXT IT MEANS YOU CAN ANSWER QUESTIONS ABOUT THAT TEXT. AS FAR AS I KNOW, THE BEST POSSIBLE WAY TO DO IT IS TO TAKE THE TEXT AND TRANSLATE IT INTO A PARTICULAR KNOWLEDGE REPRESENTATION LANGUAGE SUITABLE FOR THAT PARTICULAR KIND OF TEXT. SO WHAT I MEAN IS TRANSLATING, NOT EXTRACTING: SENTENCE BY SENTENCE, TRANSLATING ENGLISH TO THE APPROPRIATE KNOWLEDGE REPRESENTATION LANGUAGE. (INAUDIBLE) HAS DONE WORK ON THIS, AND THERE'S OTHER WORK BEFORE, BUT I'M NOT FOCUSED ON A PARTICULAR LOGIC BUT ON SOME TARGET LANGUAGE APPROPRIATE FOR THAT PARTICULAR TASK. HOW DO WE DO IT? THE APPROACH IS BASED ON THE (INAUDIBLE) APPROACH, WHICH SAYS THE MEANING OF A WORD IS A LAMBDA CALCULUS FORMULA, AND IF YOU APPLY THESE LAMBDA CALCULUS FORMULAS THE RIGHT WAY, AT THE END OF THE SENTENCE YOU GET A FORMULA WHICH IS THE TRANSLATION OF THE SENTENCE. THIS IS AN OLD IDEA IN LINGUISTICS, BUT ONE REASON IT DIDN'T PROGRESS MUCH IS: WHO GIVES YOU THE LAMBDA CALCULUS FORMULA FOR THE MEANING? WHERE DO YOU GET IT FOR EACH WORD? HOW DO I GET THIS? AND THAT IS WHERE I THINK WE HAVE MADE SOME KIND OF A BREAKTHROUGH, IN HAVING A LEARNING-BASED APPROACH TO DO THAT, AND THAT IS WHAT I WANT TO DESCRIBE TO YOU. SO WE USE CCG GRAMMAR. THE WAY THIS WORKS: DRUG MAY DECREASE AMOUNT OF ENZYME BY BINDING TO IT. THIS IS A CCG PARSE; IN CCG GRAMMAR EVERY WORD IS GIVEN A CATEGORY. YOU MAY SEE LAMBDAS: THE MEANING OF THE WORD IS GIVEN AS A LAMBDA EXPRESSION. AND WHY WE USE CCG: IF YOU HAVE A BINARY (INDISCERNIBLE), YOU DON'T KNOW IF THIS IS THE FUNCTION AND THIS IS THE INPUT, OR THIS IS THE FUNCTION AND THIS IS THE INPUT. CCG GIVES YOU THE MEANS TO FIGURE OUT WHICH IS THE INPUT. AND USING THIS, IF YOU HAVE THE LAMBDA EXPRESSIONS YOU CAN TRANSLATE THE SENTENCE. HOW DO YOU GET THE EXPRESSIONS? THAT IS THE LEARNING-BASED APPROACH. 
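AS A ROUGH ILLUSTRATION OF WORD MEANINGS AS LAMBDA EXPRESSIONS, HERE IS A MINIMAL SKETCH USING PYTHON FUNCTIONS IN PLACE OF A REAL LAMBDA CALCULUS. IT IS NOT THE SPEAKER'S SYSTEM: THE THREE-WORD LEXICON, THE OUTPUT FORMAT `decreases(...)`, AND THE HELPER NAMES `forward_apply` AND `backward_apply` ARE ASSUMPTIONS. THE CCG CATEGORIES IN THE COMMENTS PLAY THE ROLE DESCRIBED ABOVE: THEY SAY WHICH PIECE IS THE FUNCTION AND WHICH IS THE INPUT.

```python
# Hypothetical lexicon: each word maps to its meaning.
# A transitive verb has CCG category (S\NP)/NP: it takes its object on the
# right first, then its subject on the left.
lexicon = {
    "drugX": "drugX",                                   # NP
    "enzymeY": "enzymeY",                               # NP
    "decreases": lambda obj: lambda subj: f"decreases({subj},{obj})",
}


def forward_apply(function, argument):
    """X/Y followed by Y gives X: the left item is the function."""
    return function(argument)


def backward_apply(argument, function):
    """Y followed by X\\Y gives X: the right item is the function."""
    return function(argument)


# "decreases enzymeY": (S\NP)/NP applied to the NP on its right gives S\NP.
verb_phrase = forward_apply(lexicon["decreases"], lexicon["enzymeY"])

# "drugX [decreases enzymeY]": NP followed by S\NP gives S.
sentence_meaning = backward_apply(lexicon["drugX"], verb_phrase)
```

THE HARD PART RAISED IN THE TALK IS NOT THIS COMPOSITION STEP BUT WHERE THE LEXICON ENTRIES COME FROM.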
SO YOU HAVE A SENTENCE AND YOU GIVE ITS MEANING. SO THE TRAINING SET IS A SET OF SENTENCES AND CORRESPONDING MEANINGS. IT'S EASIER TO GIVE THE TRANSLATION OF A WHOLE SENTENCE THAN TO FIND THE MEANING OF A WORD. THIS IS AGAIN ANOTHER POINT I WANT TO MAKE: WHEN WE START TO COLLECT, TO MAKE THESE ANNOTATIONS AND (INAUDIBLE), IT WILL BE VERY HELPFUL IF WE ALSO THINK OF CREATING CORPORA WHERE WE SAY, THIS IS THE SENTENCE, AND, IF WE THINK OF A LANGUAGE IN WHICH WE WANT TO REASON ABOUT IT, THIS IS THE TRANSLATION. USING SUCH A CORPUS, WHAT OUR SYSTEM DOES IS CREATE SOME OF THE WORD MEANINGS. SOME OF THE WORDS YOU GIVE. NOW AT THIS POINT, IF I KNOW THE MEANING OF THIS WORD AND THIS WORD, THE INVERSE APPROACH, INVERSE LAMBDA, WILL GIVE US THE LAMBDA (INAUDIBLE) FOR THIS. THE SYSTEM TAKES SENTENCES AND TRANSLATIONS AND AN INPUT OF WORDS AND MEANINGS, AND THEN FINDS LAMBDA EXPRESSIONS FOR OTHER WORDS. USING THAT WE GET INTO THE LEARNING STEP. WHAT HAPPENS IS A WORD MAY HAVE MULTIPLE MEANINGS, SO WE ESTIMATE PARAMETERS FOR EACH OF THESE MEANINGS IN SUCH A WAY THAT THE PROBABILITY THAT THE TRAINING CORPUS WOULD BE ACCURATELY TRANSLATED IS MAXIMIZED. SO THAT IS WHERE WE GET THE PARAMETERS, AND USING THESE PARAMETER ESTIMATES AND INVERSE LAMBDA WE CREATE THE FINAL LEXICON AND WE OBTAIN THE TRANSLATION. ANOTHER INTERESTING EXAMPLE I WANT TO TALK ABOUT: THIS KIND OF RESEARCH HAS BEEN DONE SO FAR IN 3 OR 4 GROUPS. THERE'S (INDISCERNIBLE) GROUP IN AUSTIN, THERE'S COLLINS, I THINK AT UPENN, AND THEN ZETTLEMOYER IN WASHINGTON, SEATTLE. AND A COUPLE OF (INAUDIBLE) THAT THEY USE: ONE IS CLANG, A ROBOT (INAUDIBLE), ANOTHER IS A (INAUDIBLE) WHERE YOU ASK QUERIES ABOUT A (INAUDIBLE) DATABASE. WE SAID THE REAL CHALLENGE OF UNDERSTANDING IS TO GO TO THE SUPERMARKET, PICK UP A BOOK OF PUZZLES AND SEE IF ANY SYSTEM CAN READ THOSE AND ANSWER THOSE QUESTIONS. 
THAT IS WHERE REASONING HAPPENS. THERE ARE GREAT ACHIEVEMENTS BEING MADE, BUT REASONING IS NOT PUTTING TOGETHER ONE OR TWO THINGS; REASONING IS ABOUT PUTTING TOGETHER 15 CLUES, 20 CLUES, AND THAT IS THE HARDEST REASONING TASK. WE DID IT HALFWAY: WE OURSELVES READ THE PUZZLES AND CREATED THE ONTOLOGY, SO THESE ARE THE NAMES AND SO ON. THEN WE HAVE THESE CLUES. WHAT OUR SYSTEM DID IS TRANSLATE THE CLUES INTO A PROGRAMMING LANGUAGE FOR KNOWLEDGE REPRESENTATION, AND THEN SOLVE THE PUZZLES. OUR EVALUATION WAS CLUE BY CLUE; CLUES WE CAN DO, BUT WHOLE PUZZLES, NOT SO GOOD. IF FOR EACH CLUE THE CORRECTNESS IS .87, IF YOU IMAGINE THAT TO THE POWER OF 7, IT IS .25. SO THE CHALLENGE IS REAL REASONING OVER MULTIPLE TASKS: IF EACH IS (INAUDIBLE) NINE, TO THE POWER OF TEN IT IS QUITE LOW. IF WE PICK THE PARTICULAR PUZZLES TO TRAIN ON, IF WE PICK THE HARD ONES, THE RESULTS ARE BETTER THAN WITH THE OTHER ONES. SO THAT'S KIND OF THE MAIN THING I WANTED TO TALK TO YOU ABOUT. THREE MAIN POINTS. ONE, GENERALIZED EXTRACTION: MAYBE WE HAVE CORPORA THAT WE PARSE AND RUN A COUPLE OF OTHER PROCESSES ON, WE STORE THEM, AND THINK OF EXTRACTION AS A QUERY TO THIS PARSED, ANNOTATED DATABASE. SECOND, WE NEED TO THINK BEYOND WHAT YOU CAN EXTRACT FROM A PARTICULAR PAGE, TO HOW IT CAN INTERACT WITH OTHER KNOWLEDGE WE HAVE TO MAKE BROADER CONCLUSIONS. THIRD, IT'S TIME: WITH ALL THE PIECES TOGETHER, WE CAN THINK BIG AND THINK ABOUT TRANSLATING SENTENCES TO THE APPROPRIATE KNOWLEDGE REPRESENTATION AND REALLY DOING REASONING WITH IT. THAT'S MY CONCLUSION. MY COLLABORATORS: HAPPY TO SAY (INAUDIBLE) GONZALEZ IS HERE, DR. LUIS TARI, DR. HAKENBERG, (INAUDIBLE) Ph.D. STUDENT, AND WE WERE FUNDED BY VARIOUS AGENCIES OVER THE YEARS. THANK YOU VERY MUCH. [APPLAUSE] >> THANK YOU CHITTA. SO WE'LL HOLD THE QUESTIONS UNTIL THE END. WE'LL HAVE A HALF HOUR TO ANSWER ANY QUESTIONS. SO OUR SECOND SPEAKER IS DR. LYNNETTE HIRSCHMAN, DIRECTOR OF BIOINFORMATICS IN (INDISCERNIBLE), MA. SHE RECEIVED HER Ph.D. 
FROM UNIVERSITY OF PENNSYLVANIA IN 1972 AND HAS WORKED IN THE AREAS OF HUMAN LANGUAGE UNDERSTANDING AND INFORMATICS AT NEW YORK UNIVERSITY, (INAUDIBLE) AND THE MIT SPOKEN LANGUAGE GROUP BEFORE JOINING IN 1993. SHE'S A FOUNDING ORGANIZER OF THE EVALUATION AND SERVES ON THE BOARD OF THE GENOMIC STANDARDS CONSORTIUM. HER CURRENT RESEARCH INTERESTS INCLUDE SECONDARY USE OF CLINICAL DATA, INCLUDING AUTOMATED DEIDENTIFICATION AND STRUCTURAL DATA AND INTEGRATION WITH GENOMICS AND MULTI-OMICS DATA FOR TRANSLATIONAL MEDICINE APPLICATIONS. THANKS. >> I WANT TO THANK THE ORGANIZERS VERY MUCH, IT'S A PLEASURE TO BE HERE. I WANT TO THANK THE SPEAKERS WHO'VE MADE THE POINTS SO I CAN STAY WITHIN MY 20 MINUTE TIME LIMIT. I NEED TO GIVE YOU INSTRUCTIONS ON HOW TO READ MY TITLE. SINCE THIS IS A LINGUISTICS BASED PANEL I WANT TO MAKE SURE I INCLUDE LINGUISTICS, AND THE LINGUISTICS IS THE WORD SENSE AMBIGUITY HERE. SO YOU CAN READ IT ALL IN ONE SENTENCE. THIS IS THE QUICK OUTLINE OF WHAT I'M GOING TO TALK ABOUT: THE NEED FOR DATA AND CLINICAL RESOURCES, THESE ARE POINTS OTHERS HAVE MADE. AND I WANT TO TALK ABOUT THE CLINICAL DATA WALL, AGAIN, OTHER PEOPLE HAVE ALREADY TALKED ABOUT THIS. AND SCALING TO REAL APPLICATIONS. THIS IS ONE SLIDE I CAN GO THROUGH VERY QUICKLY. WE ALL KNOW WHY WE NEED DATA: WE NEED EXAMPLES OF THE SUBLANGUAGE, WE NEED EXAMPLES OF THE INPUTS AND OUTPUTS FOR NATURAL LANGUAGE PROCESSING. AND THEN AS A STANDARD PART OF NATURAL LANGUAGE PROCESSING WE NEED TO EVALUATE. THIS REQUIRES GOLD STANDARD DATA OF CORRECT INPUT-OUTPUT PAIRS TO COMPARE YOUR SYSTEM OUTPUT TO. EVALUATION IN THIS CONTEXT IS A ROUTINE PART OF NLP DEVELOPMENT, BASICALLY PART OF THE DEBUGGING. BUT WHAT I WANT TO SPEND SOME TIME ON IS CHALLENGE EVALUATION. AS CAROL MENTIONED THIS MORNING, I HAVE BEEN INVOLVED IN BIOCREATIVE, WHICH I'M NOT GOING TO TALK ABOUT BECAUSE I'M FOCUSED MORE ON CLINICAL DATA. 
BUT CHALLENGE EVALUATIONS PLAY A VERY IMPORTANT SET OF ROLES THAT CAN REALLY DRIVE RESEARCH PROGRESS IN A FIELD OR SUBFIELD. THEY CREATE COMMUNITIES OF RESEARCHERS, EVENTUALLY IF YOU'RE LUCKY THEY CREATE A MARKET AND THEY CREATE STANDARDS THAT ENABLE THAT MARKET, THEY ENABLE RESEARCHERS TO TRAIN THE NEXT GENERATION OF STUDENTS AND THEY CREATE INFRASTRUCTURE. IT'S PARTICULARLY INFRASTRUCTURE THAT I WANT TO SPEND TIME ON, AND SPECIFICALLY WHAT KINDS OF INFRASTRUCTURE ARE NEEDED TO UNLOCK ALL THE INFORMATION THAT'S CONTAINED IN THE PATIENT'S RECORD. SO THE CHALLENGE WITH CLINICAL DATA, AND I THINK CAROL DID A VERY NICE JOB TALKING ABOUT THIS: IT'S BEEN AN ACTIVE FIELD SINCE THE '60s. WE HAVE LARGE-SCALE TERMINOLOGICAL RESOURCES AVAILABLE, AND THESE HAVE ALL BEEN MENTIONED: SNOMED, ICD-9 IF YOU WANT TO COUNT THAT. THEY ARE WIDELY USED, EVEN IF THERE ARE MULTIPLE COMPETING STANDARDS THAT ARE FOR DIFFERENT PURPOSES. WE HAVE THIS RICH ENVIRONMENT, PUBMED CENTRAL, WHICH GIVES READERS ACCESS TO THE FULL TEXT OF THE LITERATURE. BUT UNTIL RELATIVELY RECENTLY, AND RELATIVELY IS FIVE, SIX YEARS AGO, THERE WERE NO SHAREABLE CLINICAL DATA AND MEDICAL RECORDS, AND THAT MADE IT VERY DIFFICULT TO SHARE OR COMPARE RESULTS. SO YOU GOT GREAT SYSTEMS, BUT THEY WERE REPORTING ON DATA THAT OTHER PEOPLE COULDN'T SEE AND COULDN'T ATTEMPT TO DO BETTER ON. SO THERE WERE NO SHARED EVALUATIONS AND, I WOULD SAY, LIMITED PROGRESS, OR AT LEAST THE FIELD WASN'T ATTRACTING THE ENERGY YOU GET WHEN PEOPLE ARE WORKING ON THE SAME AREA AND EXCHANGING RESULTS AND CREATING EXCITEMENT. SO I'M EXTREMELY HAPPY TO SEE THIS WORKSHOP, BECAUSE IT WILL ENERGIZE A FIELD THAT IS ACTUALLY ALREADY QUITE ENERGETIC. I WOULD ARGUE THAT CLINICAL DATA IN THIS CONTEXT HAS REALLY BEEN A POSTER CHILD FOR CHALLENGE EVALUATIONS. HISTORICALLY WHAT HAPPENED WAS A WHILE AGO, 2005, 2006, AT MIT AND HIS TEAM DEVELOPED -- MADE DEIDENTIFICATION SOFTWARE. 
THAT IN TURN FACILITATED REMOVAL OF PROTECTED HEALTH INFORMATION, WHICH MADE IT POSSIBLE TO SHARE DATA UNDER LIMITED DATA USE AGREEMENTS, AND NOW WE'RE SEEING A LOT OF CORPORA, ENUMERATED HERE -- THIS IS ONLY A SMALL SAMPLE OF THE CORPORA: THE PEDIATRIC RADIOLOGY CORPUS, THE MIMIC II DATABASE, AND I2B2, WHICH PUSHED FORWARD A SERIES OF EVALUATIONS WHICH I'LL TALK ABOUT IN THE NEXT BULLET. SO THESE CORPORA ENABLED CHALLENGE EVALUATIONS, IN PARTICULAR THE I2B2 ONES, THOUGH JOHN PESTIAN'S PEDIATRIC RADIOLOGY CHALLENGE, ACTUALLY, IS THE ONLY HIPAA COMPLIANT PUBLICLY RELEASED CORPUS THAT I KNOW OF. ALL THESE MOVED THE FIELD FORWARD, AND AS SEVERAL SPEAKERS ALSO MENTIONED, WE ARE BEGINNING TO SEE OPEN SOURCE MODULES BECOMING AVAILABLE SO PEOPLE CAN BUILD ON TOP OF EACH OTHER'S SUCCESSES IN THE FIELD. SO WENDY CHAPMAN CAME OUT WITH (INAUDIBLE) QUITE A WHILE AGO, AND CONTEXT, AND GUERGANA AND MAYO CLINIC CAME OUT WITH CTAKES. (INAUDIBLE) STATUS TOOL FOR INTERPRETING FACTS, WHICH IS JUST NOW OUT AS OPEN SOURCE, DOES NEGATION AND UNCERTAINTY. SO THESE ARE ENABLING A COMMUNITY TO START BUILDING A SET OF MODULES THAT THEY CAN REPURPOSE FOR DIFFERENT KINDS OF APPLICATIONS. THAT'S GREAT. THIS IS A LIST OF THE I2B2 CHALLENGE EVALUATIONS, AND YOU WILL SEE BLUE THINGS AND YELLOW THINGS. BLUE THINGS HERE ARE WHAT I WOULD DESCRIBE AS MEDICALLY ORIENTED OR CLINICALLY ORIENTED TASKS: SMOKING HISTORY, OBESITY AND COMORBIDITY, AND MEDICATION EXTRACTION. SO THESE ARE THINGS THAT HAVE IMMEDIATE MEDICAL APPLICABILITY. IN YELLOW ARE REALLY WHAT I CALL BUILDING BLOCK TASKS, AND I WILL TALK ABOUT THEM IN A MOMENT: DEIDENTIFICATION, WHICH KICKED THE WHOLE FIELD OFF, CONCEPT EXTRACTION, RELATION EXTRACTION. SO THERE WAS GOING TO BE A CLINICAL TRIAL ELIGIBILITY CHALLENGE, BUT SEVERAL THINGS HAPPENED IN THE I2B2 CONTEXT; BUT ELLEN VOORHEES, WHO WAS HERE, IS WORKING TO DO THAT IN THE CONTEXT OF TREC AND HAS BEEN WANTING TO RUN A TREC EVALUATION IN THIS AREA. 
SO WE'RE STARTING TO SEE CHALLENGE EVALUATIONS THAT ARE BRINGING OTHER PEOPLE INTO THE FIELD, WHICH IS GREAT. I WANT TO DO A QUICK SEGUE TO LOOKING AT TWO SPECIFIC TASKS THAT WERE DONE IN THE I2B2 CONTEXT, BECAUSE IF YOU'RE DOING CHALLENGE EVALUATIONS, AND I HAVE BEEN, ALTHOUGH MOSTLY ON THE BIO SIDE OF THE FIELD, IT'S ALWAYS IMPORTANT TO ASK: ARE THESE EVALUATIONS DRIVING THINGS IN THE RIGHT DIRECTION? SO IN DEIDENTIFICATION THE I2B2 CHALLENGE EVALUATION WAS GREAT, IT KICKED OFF A BUNCH OF RESEARCH. THIS WAS AN ANIMATED SLIDE TO SHOW YOU WHAT IS INVOLVED. SO THE CHALLENGE FOR I2B2 WAS TO IDENTIFY THE PERSONALLY IDENTIFYING INFORMATION OR PROTECTED HEALTH INFORMATION IN THE RECORD, WHICH INCLUDES THINGS LIKE LOCATIONS, DATES AND PERSON NAMES. SO IN THAT SENSE IT'S A STANDARD ENTITY RECOGNITION TASK AND YOU CAN DO IT. BUT IN A REAL APPLICATION YOU HAVE TO DO MORE THAN THAT: YOU HAVE TO HIDE THE UNDERLYING ENTITIES. THIS IS ONE WAY OF HIDING THEM, BUT YOU CAN ACTUALLY DO SOMETHING MORE INTERESTING, WHICH CREATES SOMETHING MORE HUMAN READABLE, WHICH MAY BE IMPORTANT FOR DOWNSTREAM APPLICATIONS, WHICH IS TO REPLACE THE ACTUAL PHI, THE ACTUAL IDENTIFIERS, WITH SYNTHETIC SURROGATES THAT MAKE IT LOOK AS IF IT'S A REAL RECORD, BUT IN FACT IT'S NOT. THESE ARE ANIMATIONS FROM OUR IDENTIFICATION SCRUBBER TOOLKIT. I2B2 WAS IDENTIFYING PHI IN NARRATIVE. PRACTICAL APPLICATIONS ARE FOR REDACTING OR TRANSFORMING PHI SO IT IS POSSIBLE TO SHARE THE DATA, EVEN INTERNALLY WITHIN A HOSPITAL. IT IS STARTING TO BECOME BEST PRACTICE TO DO THIS. WE ARE SEEING A NUMBER OF TOOLS AVAILABLE. THERE'S THE ORIGINAL MIT DEID SYSTEM, THERE'S THE UNIVERSITY OF PITTSBURGH DERIVED DE-ID SYSTEM, WHICH IS A COMMERCIAL SYSTEM, THERE'S OUR MIST SYSTEM, AND THERE IS EMORY UNIVERSITY'S HIDE SYSTEM, SO WE'RE STARTING TO SEE A BUNCH OF SYSTEMS DEVELOPING, WHICH IS GREAT. SO WHAT HAVE WE LEARNED? IT WAS, I THINK, A VERY SUCCESSFUL APPLICATION. 
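THE SURROGATE REPLACEMENT STEP DESCRIBED ABOVE CAN BE SKETCHED IN A FEW LINES. THIS IS A TOY ILLUSTRATION, NOT ANY OF THE NAMED TOOLS: IT ASSUMES THE PHI SPANS HAVE ALREADY BEEN FOUND BY SOME RECOGNIZER, AND THE FUNCTION NAME `replace_phi`, THE FIXED SURROGATE TABLE, AND THE EXAMPLE NOTE ARE ALL HYPOTHETICAL. A REAL SYSTEM WOULD GENERATE VARIED, CONSISTENT SURROGATES RATHER THAN ONE VALUE PER TYPE.

```python
# Hypothetical fixed surrogates, one per PHI type.
SURROGATES = {
    "NAME": "Jane Doe",
    "DATE": "01/01/2000",
    "LOCATION": "Springfield",
}


def replace_phi(text, spans):
    """Replace detected PHI with synthetic surrogates.

    spans: non-overlapping (start, end, phi_type) character offsets,
    as produced by some upstream entity recognizer.
    """
    # Work from the end of the text backwards so that earlier offsets
    # stay valid after each replacement changes the text length.
    for start, end, phi_type in sorted(spans, reverse=True):
        text = text[:start] + SURROGATES[phi_type] + text[end:]
    return text


note = "Mr. Smith was seen on 03/14/2011."
detected = [(4, 9, "NAME"), (22, 32, "DATE")]
scrubbed = replace_phi(note, detected)
```

ANY PHI THE RECOGNIZER MISSES IS LEFT IN PLACE UNCHANGED, WHICH IS WHY THE RECOGNITION STEP MATTERS SO MUCH FOR EXPOSURE RISK.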
THE SYSTEMS THAT WERE BEING EVALUATED ACHIEVED GOOD PERFORMANCE USING STANDARD NLP MEASURES, THINGS LIKE ACCURACY, PRECISION, RECALL, F MEASURE. IT WAS USABLE FOR DATA SHARING, AND SO WAS OUR SYSTEM. WE HAD LUCK BUILDING COLLABORATION BY IN EFFECT MOVING THE SOFTWARE TO THE DATA. THIS WAS A COMMENT I WAS GOING TO MAKE WHEN PHIL RESNIK WAS MAKING THE POINT ABOUT NEEDING TO MOVE THE CLINICIAN TO THE DATA. ANOTHER SOLUTION WE SHOULD THINK ABOUT IS MOVING THE SOFTWARE TO THE DATA, BUT THAT REQUIRES THAT YOU HAVE PEOPLE WHO CAN TRAIN THE SYSTEM AND DEPLOY IT AT THEIR OWN SITE. SO A PERSON DOESN'T CROSS THE WALL, THE SOFTWARE GOES TO THE DATA. THE PROBLEM HERE IS THAT, THOUGH I SAY THIS IS A SUCCESS, WE HAVE NO EXTRINSIC EVALUATION. WE DON'T KNOW IF THESE SYSTEMS ARE GOOD ENOUGH. I HAVE EVIDENCE THAT IRBs STILL WON'T ACCEPT RESULTS OF AUTOMATED DEIDENTIFICATION. SO WE REALLY NEED TO COUNT RESULTS IN TERMS THAT ARE DIRECTED AT LIKELY PHI EXPOSURE RISKS: HOW LIKELY IS SOMETHING TO LEAK FROM THIS? SO WE SHOULD RETHINK THE METRICS FOR THIS. NOW TAKE ANOTHER QUICK CASE STUDY, CO-REFERENCE, ONE OF THE LATER I2B2 TASKS: IDENTIFICATION OF MARKABLES, SO MENTIONS OF PERSON, PROBLEM, TEST, TREATMENT, AND PRONOUNS REFERRING TO THOSE, AND THEN THE LINKAGE OF CO-REFERRING MENTIONS OF MARKABLES. I'LL GO THROUGH A COUPLE OF EXAMPLES. HERE WE HAVE PERSON, SO DR. SO AND SO REVIEWED THE CASE, SO THOSE THINGS ARE CO-REFERENTIAL. THEN WE HAVE ANOTHER ONE, THE PATIENT AND HIS, AND THOSE ARE ALSO CO-REFERENTIAL, DIFFERENT FROM THE FIRST MENTION, OBVIOUSLY MEDICALLY SOMEWHAT RELEVANT INFORMATION. SECOND IS: THE PATIENT HAD A KIDNEY STONE IN 2000, HE PRESENTS TODAY WITH A LEFT KIDNEY STONE. THOSE ARE MARKED IN DIFFERENT COLORS BECAUSE THEY'RE NOT CO-REFERENTIAL, IT'S A DIFFERENT KIDNEY STONE. SO THERE ARE MEDICALLY IMPORTANT FACTS HERE. WHAT HAVE WE LEARNED FROM THIS EVALUATION? 
ONE THING WE LEARNED WAS THAT THE PERSON CATEGORY IS REALLY THREE THINGS: THERE IS THE PATIENT, THERE ARE NON-PATIENT PEOPLE LIKE FAMILY MEMBERS AND FRIENDS, AND THERE ARE PROVIDERS. SO THIS IS A THREE WAY CLASSIFICATION TASK THAT MAY NOT REQUIRE CO-REFERENCE. THE SECOND HAD TO DO WITH THE METRICS. THE CO-REFERENCE SCORE USED WAS THE ARITHMETIC -- THE OTHER TWO CAME ABOUT BECAUSE PEOPLE DIDN'T LIKE THE FIRST EVALUATION METRIC DEVELOPED AND MADE UP FOR SHORTFALLS IN THE FIRST ONE, SO THE SCORE WAS THE AVERAGE. THE SCORES ON THE END TO END I2B2 TASK WERE 60% F MEASURE, BUT IT'S UNCLEAR WHAT THIS MEANS. THIS WAS NUMERICALLY COMPARABLE TO THE CO-REFERENCE EVALUATION ON NEWSWIRE IN THE MESSAGE UNDERSTANDING CONFERENCE SPHERE IN 2001, BUT HOW DO WE INTERPRET THIS? ONE INTERPRETATION IS THAT WE HAVE MADE NO PROGRESS SINCE 2001, AND ANOTHER INTERPRETATION IS THAT THIS APPLICATION DIDN'T NEED CO-REFERENCE IN THE FIRST PLACE. WE NEED EXTRINSIC, CLINICALLY BASED EVALUATION. WE NEED TO KNOW WHAT INFORMATION IS IMPORTANT HERE. I WOULD ARGUE THAT LONG TERM WE NEED TO ASK WHAT THE COST OF GETTING THAT INFORMATION IS AND WHETHER THIS IS THE MOST EFFICIENT WAY TO GET IT. SO THERE'S A BALANCING ACT IN DEVELOPING EVALUATIONS. ON ONE HAND WE HAVE THE PEOPLE HERE WHO ARE DEVELOPERS, WHO WANT TO DO PAPERS, AND PEOPLE WHO WANT GENERIC APPLICATIONS WITH DOMAIN-INDEPENDENT TOOLS. ON THE OTHER HAND WE HAVE END USERS AND DATA PROVIDERS WHO WANT SOLUTIONS TO THEIR PARTICULAR PROBLEM. SO WE NEED ANNOTATION, AS A NUMBER OF PEOPLE ALREADY TALKED ABOUT, BOTH LINGUISTIC ANNOTATION AND DOMAIN EXPERT ANNOTATION, AND THEN WE HAVE THE ORGANIZERS WHO ARE TRYING TO BALANCE THESE TWO ENDS, SELECTING THE RIGHT TASKS TO EVALUATE. SO I THINK THE I2B2 EVALUATIONS ARE INTERESTING AND HAVE TRIED TO ADDRESS BOTH GENERIC BUILDING BLOCK TASKS AND MEDICAL APPLICATIONS. SO I THINK THAT'S GREAT. 
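FOR REFERENCE, THE STANDARD MEASURES MENTIONED IN THIS PART OF THE TALK (PRECISION, RECALL, F MEASURE) CAN BE COMPUTED AS IN THE SKETCH BELOW. IT SCORES A SET OF PREDICTED ITEMS AGAINST A GOLD STANDARD SET; IT IS NOT THE I2B2 CO-REFERENCE SCORER, WHICH AVERAGES SEVERAL CO-REFERENCE-SPECIFIC METRICS. THE FUNCTION NAME AND THE EXAMPLE ANNOTATIONS ARE HYPOTHETICAL.

```python
def precision_recall_f1(gold, predicted):
    """Score predicted items against gold standard items (exact match)."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Balanced F measure: harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical gold annotations and system output for one note.
gold = {"mention1", "mention2", "mention3", "mention4"}
predicted = {"mention1", "mention2", "mention9"}

precision, recall, f1 = precision_recall_f1(gold, predicted)
```

A SCORE OF THIS KIND IS AN INTRINSIC MEASURE: IT SAYS HOW CLOSELY THE OUTPUT MATCHES THE GOLD STANDARD, NOT WHETHER THE OUTPUT IS GOOD ENOUGH FOR A CLINICAL USE.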
SO NOW I'M GOING TO QUICKLY MAKE A FEW POINTS ABOUT SCALING, OR IN FACT OVERCOMING, THE CLINICAL DATA WALL, AND I DON'T NEED TO BELABOR THIS: WE KNOW IT'S HARD TO SHARE MEDICAL RECORDS BECAUSE OF THE PHI AND PRIVACY CONSTRAINTS. AND ACTUALLY MEDICAL RECORD DEIDENTIFICATION IS A RATE LIMITING STEP IN PROVIDING DATA FOR SECONDARY USE FOR ALL KINDS OF APPLICATIONS. SO THERE ARE SOME WAYS TO SCALE THE DATA WALL. ONE IS LOWERING BARRIERS THROUGH IRB APPROVALS. THE DATA YOU GET UNDER THOSE LIMITED DATA USE AGREEMENTS IS CAREFULLY SCRUBBED. WE CAN TRY TO DO A BETTER JOB OF DEVELOPING METRICS TO ADDRESS THE IRB CONCERNS, BUT WE SHOULD FOCUS ON TRYING TO FIGURE OUT HOW TO REDUCE THE REIDENTIFICATION RISK. INTERACTIVE HUMAN REVIEW USING AUTOMATED DEIDENTIFICATION TOOLS CAN CERTAINLY SPEED THAT UP IN VARIOUS WAYS. I LIKE AN IDEA THAT CAME FROM THIS PAPER BY GEORGE HRIPCSAK ET AL., SELECTIVE EXTRACTION: EXTRACTING CLINICALLY RELEVANT INFORMATION AND LEAVING THE PHI, THE PERSONAL INFORMATION, BEHIND. AND THEN THE OTHER POSSIBILITY IS MOVING THE SOFTWARE TO THE DATA SO THE DATA DOESN'T HAVE TO GO OUTSIDE. THAT RAISES QUESTIONS OF APPLICATION: CAN YOU MAKE THE SYSTEM USABLE ENOUGH SO THAT IT CAN ACTUALLY BE RETRAINED INSIDE THE PROVIDER ORGANIZATION AND APPLIED? WE HAD GOOD SUCCESS WITH THAT. SO THE NEXT THING I WANT TO TALK ABOUT QUICKLY IS SCALING THE DATA. MOST APPLICATIONS ARE ONE-OFFS, BECAUSE THE DATA CAN'T BE SHARED AND THE APPLICATION IS VERY INSTITUTION-SPECIFIC. BUT THERE ARE NEW MULTI-SITE PROJECTS SPRINGING UP. ONE WAS ALREADY MENTIONED, SHARPN, FOR SECONDARY DATA USE, AND I'LL TALK ABOUT EMERGE, ELECTRONIC MEDICAL RECORDS AND GENOMICS, WHICH IS SEVEN GROUPS POOLING THEIR EMRs AND BIOBANK DATA TO IDENTIFY PATIENT PHENOTYPES. SO ON SHARPN, THIS IS FUNDED BY THE OFFICE OF THE NATIONAL COORDINATOR, ONE OF FIVE SHARP PROJECTS; MAYO CLINIC IS THE PI, AND THE USE CASE OF THIS IS PHENOTYPE EXTRACTION. 
SO WHAT IS THE DISEASE STATE OF THE PERSON, OR THEIR CHARACTERISTICS. THERE ARE 350,000 NOTES FROM PATIENTS AT TWO DIFFERENT PROVIDERS, AND THEY'RE GOING TO PRODUCE RICHLY ANNOTATED CORPORA, WHICH ARE 500,000 WORDS, INCLUDING LINGUISTIC ANNOTATION AND MEDICAL LAYERS OF ANNOTATION, SO VERY DETAILED, RICH ANNOTATION. THAT'S ONE POSSIBILITY. HERE IS EMERGE. ONE OF THE INTERESTING THINGS ABOUT EMERGE IS THAT THEY PUBLISHED A LIBRARY OF 13 PHENOTYPE EXTRACTION ALGORITHMS ACROSS THE SIX TO SEVEN GROUPS THAT WORKED ON THIS, EXTRACTING INFORMATION FROM STRUCTURED AND UNSTRUCTURED INFORMATION IN THE ELECTRONIC MEDICAL RECORD. SO THAT'S A VERY INTERESTING PROJECT. IF WE WANT TO SCALE THE DATA FOR A REAL APPLICATION, ONE OF THE THINGS THAT THE EMERGE PEOPLE FOUND IS THAT THEY HAD TO POOL THE DATA ACROSS INSTITUTIONS TO GET ENOUGH STATISTICAL POWER TO DO THEIR GENOME WIDE ASSOCIATION STUDIES, OR GWAS. AND THE TYPICAL NUMBERS THEY WANTED TO GET TO FOR THIS WERE SOMETHING LIKE 3,000 CASES AND 3,000 CONTROLS. SHARPN HAS A POOL OF PATIENTS; COULD WE FIGURE OUT A WAY TO PROVIDE THESE AS CHALLENGE EVALUATION DATA SETS? THE DATA ISSUES WITHIN EMERGE ARE PARTIALLY RESOLVED BY THE EMERGE DATA USE AGREEMENT AND THE CLINICAL ANNOTATION. SO PHENOTYPE: THIS PATIENT IS ASSOCIATED WITH THIS PHENOTYPE, BASED ON SOME KIND OF EVIDENCE CHAIN. SO THE LAST THING IS SCALING THE ANNOTATION. AN OPTIMISTIC ESTIMATE IS A DOLLAR PER PATIENT NOTE PER LAYER, AND IF YOU THINK THAT A PATIENT HAS HUNDREDS OF NOTES ASSOCIATED WITH THEM, TO SCALE TO THE PATIENT LEVEL WE'RE TALKING HUNDREDS OF DOLLARS PER PATIENT. THE REAL FIGURES ARE MUCH MORE DAUNTING. CAN WE STEAL ANNOTATION? WE CAN LEVERAGE NATURALLY OCCURRING ANNOTATIONS OR FOUND DATA, AND, AS SEVERAL SPEAKERS OUTLINED, DEVELOP BETTER ALGORITHMS TO LEARN FROM NOISY ANNOTATION. SO HERE ARE MY CONCLUDING RECOMMENDATIONS. SCALE THE DATA WALL BY ACQUIRING MEDICALLY RELEVANT COLLECTIONS OF CLINICAL DATA AT SCALE AND MINIMIZING THE REIDENTIFICATION RISK. 
WE NEED TO SCALE FOR REAL APPLICATIONS AND EVALUATE NLP SYSTEMS ON UTILITY AND COST EFFECTIVENESS USING EXTRINSIC EVALUATIONS ON CLINICAL APPLICATIONS. AND WE NEED TO BE ABLE TO SCALE THE ANNOTATIONS. THAT MAY INVOLVE DOING LESS: MINIMALIST ANNOTATION STRATEGIES, TAKING ADVANTAGE OF NATURALLY OCCURRING PARTIALLY ANNOTATED CORPORA, AND NEW ALGORITHMS AS THE SPEAKERS MENTIONED. IN CONCLUSION I WANT TO THANK PARTICULARLY MY COLLEAGUES AT MITRE WHO PRODUCED SOME OF THE SOFTWARE I'M TALKING ABOUT AND MANY EXTERNAL COLLABORATORS, WHICH INCLUDE BUT ARE NOT LIMITED TO A NUMBER OF PEOPLE WHO ARE IN THE AUDIENCE TODAY. SO THANK YOU VERY MUCH. [APPLAUSE] >> THANK YOU VERY MUCH, LYNNETTE. SO WE HAVE OUR LAST SPEAKER FOR THE PANEL, DR. JAMES PUSTEJOVSKY. HE HOLDS A TJX CHAIR IN COMPUTER SCIENCE AT BRANDEIS UNIVERSITY, WHERE HE DIRECTS THE LAB FOR LINGUISTICS AND COMPUTATION AND FOCUSES ON THE LANGUAGE MA PROGRAM. HE HAS CONDUCTED RESEARCH IN COMPUTATIONAL LINGUISTICS, AI, TEMPORAL REASONING AND LANGUAGE ANNOTATION. HE'S CURRENTLY HEAD OF A WORKING GROUP OF TC 37 TO DEVELOP A SEMANTIC ANNOTATION FRAMEWORK, AND IS AUTHOR OF THE RECENTLY APPROVED ISO SPECIFICATIONS FOR ANNOTATION AND THE DRAFT SPECIFICATION FOR SPACE ANNOTATION. JAMES WAS PI OF THE LARGE NSF FUNDED EFFORT TOWARDS COMPREHENSIVE LINGUISTIC ANNOTATION OF LANGUAGE THAT INVOLVED MERGING SEVERAL ANNOTATIONS SUCH AS PROPBANK, NOMBANK, TIMEBANK AND THE OPINION CORPUS INTO A UNIFIED REPRESENTATION. CURRENTLY HE'S CO-PI OF A MAJOR PROJECT FUNDED BY NSF TO ADDRESS INTEROPERABILITY FOR NLP DATA AND TOOLS. HE IS ALSO CO-AUTHOR OF NUMEROUS PUBLICATIONS, THE MOST RECENT OF WHICH ARE ON THE SEMANTICS OF MOTION IN LANGUAGE AND THE FORTHCOMING BOOK ON LANGUAGE ANNOTATION FOR MACHINE LEARNING. THANKS. JAMES. [APPLAUSE] >> THANK YOU VERY MUCH FOR THE OPPORTUNITY TO BE HERE IN SUCH GREAT COMPANY AND PRESENT SOME RECENT IDEAS AND WORK ON EVENTS AND TEMPORAL REASONING WITHIN THIS DOMAIN. 
I FIRST GOT INTERESTED IN THE PROBLEM OF (INAUDIBLE) ORDERING OF EVENTS AND TEMPORAL EXPRESSIONS WITHIN THE MEDICAL AND CLINICAL DOMAIN FROM ENCOUNTERING GEORGE HRIPCSAK'S WORK AFTER DEVELOPING QUITE A BIT OF THE TIMEML SPECIFICATION, AND AFTER QUITE A BIT OF INTERACTION WITH (INDISCERNIBLE) WORK AND OTHERS OF HIS COLLEAGUES I BECAME MORE INTERESTED IN THE PROBLEM OF HOW DIFFICULT IT WAS TO TAKE WHAT WE DEVELOPED WITHIN THE WORKING GROUP THAT CREATED THE SPECIFICATION AND TRANSPORT THAT, OR PORT THAT, TO A DIFFERENT DOMAIN FOR WHICH IT WAS NOT AT ALL, WE THOUGHT, WELL SUITED. I WOULD LIKE TO TALK ABOUT WHY TIME IS RELEVANT BUT HARD FOR UNDERSTANDING LANGUAGE, WHY STANDARDS ARE IMPORTANT TO THE FIELD, THE STANDARD KNOWN AS TIMEML, AND SPECIFICS OF TIMEML WHICH WILL HELP YOU UNDERSTAND WHY IT MIGHT BE A REALLY GOOD ADOPTABLE STANDARD, JUST AS WITH SOME OF THESE DATA RESOURCES THAT WERE MENTIONED BY LYNNETTE AND CAROL AND OTHERS. I WANT TO TALK SPECIFICALLY ABOUT HOW TO IMPROVE SOME OF THE EARLIER NOTIONS OF HOW TO ANNOTATE RELATIONS IN TIMEML BY INTRODUCING A NOTION OF INFORMATIVENESS, SOMETHING AKIN TO THE NOTION OF INFORMATIVENESS IN RELEVANCE THEORY, BUT WHICH I THINK MIGHT HAVE MORE TO DO WITH TEMPORAL ANALYSIS, AND THEN CLINICAL TEXT, BRIEFLY. ALL THAT, 51 SLIDES IN 20 MINUTES. SO NATURAL LANGUAGE OBVIOUSLY HAS LOTS OF REFERENCES TO PAST AND FUTURE EVENTS, RESTRICTED IN THE WAY THEY'RE REFERRED TO IN CLINICAL NOTES AND SUMMARIES. THERE ARE PERHAPS FEWER PLANNED ACTIVITIES AND GOALS IN THE DOMAIN OF TEXT WE'RE TALKING ABOUT, BUT THERE WILL BE PLANS AND ACTIVITIES FOR PROJECTED THINGS THAT PATIENTS SHOULD DO AND THINGS LIKE THAT, SO MOST OF THE PURVIEW, THE SCOPE, OF WHAT WE WERE LOOKING AT MIGHT BE FOUND IN THE TEXT THAT WE SEE. WITHOUT A ROBUST ABILITY TO IDENTIFY AND TEMPORALLY SITUATE, TO ANCHOR AND ORDER, EVENTS THAT COME FROM THE TEXT, YOU'RE NOT GOING TO GET THE IMPORTANT INFORMATION. 
SO IN NATURAL LANGUAGE UNDERSTANDING, JUST EXTRACTING IS NOT UNDERSTANDING THE TEXT. YOU HAVE TO PUT THE EVENTS TOGETHER IN A COHERENT SENSE, THERE HAS TO BE A TEMPORAL CLOSURE OF THOSE. AN ANNOTATION STANDARD IS NECESSARY TO (INDISCERNIBLE) INFORMATION FROM TEXT. SO HERE ARE SOME EXAMPLES FROM NON-CLINICAL TEXT. COLLAPSING: ALL THE REDS ARE EVENTS, BLUES ARE TEMPORAL EXPRESSIONS OF SOME SORT. WE NEED TO SITUATE THE EVENT WITHIN SOME PARTICULAR INTERVAL. FOR DOING REASONING -- >> WE'LL HAVE ALL OUR SPEAKERS UP AND WE CAN START WITH QUESTIONS AND ANSWERS. >> HI. I GUESS I HAVE A QUESTION FOR LYNNETTE AND A QUESTION FOR JAMES. I TRIED TO MAKE THESE ONE QUESTION AND COULDN'T FIGURE OUT HOW TO DO IT. SO MY QUESTIONS, I GUESS I TRIED TO GROUP THEM TOGETHER, HAVE TO DO WITH THE REAL WORLD OF SCALABILITY AND APPLICABILITY. WITH THE DEIDENTIFICATION OR REIDENTIFICATION QUESTIONS YOU WERE ASKING, A LOT SEEMS TO BE THINKING ABOUT GOLD STANDARDS AND TEST SETS. WHERE ARE THINGS HEADED -- ARE WE GOING TO REACH THE POINT -- EVEN IF YOU HAVE A 1/100 MISS RATE, ONCE YOU HIT THAT SCENARIO IT'S TOO MUCH, BECAUSE IF YOU HAVE MILLIONS UPON MILLIONS OF RECORDS, WHICH IS WHAT YOU'RE GOING TO WANT IN ORDER TO DO BIG DATA THINGS, ARE WE GOING TO GET TO THE POINT WHERE WE CAN THINK ABOUT THESE SORTS OF THINGS FOR DATA WE LEARN FROM, NOT JUST DATA WE TEST ON? MY QUESTION FOR JAMES: I WAS FASCINATED BY THE STANDARD YOU'RE DEVELOPING AND WONDERING IF YOU CAN SAY MORE ABOUT APPLICATIONS OF IT. >> SO YES, WHEN WE HAVE HUGE AMOUNTS OF DATA, MISSING ONE, REVEALING THE IDENTITY OF ONE PATIENT, CAN BE COSTLY FROM THE POINT OF VIEW OF THE PROVIDER, WHO WILL GET SUED FOR THAT. THAT IS ONE OF THE PROBLEMS THAT THE MEDICAL RESEARCH COMMUNITY RECOGNIZES HAS MADE IT VIRTUALLY IMPOSSIBLE TO SHARE DATA FOR SECONDARY USE; THAT IS A BIG PROBLEM. THERE ARE PRESSURES WAY OUTSIDE THE NATURAL LANGUAGE PROCESSING COMMUNITY, WHICH IS VERY -- THAT'S A TERTIARY USE OF THE DATA -- THAT I THINK ARE GOING TO MAKE AN IMPACT ON THIS. 
I THINK IT'S PARTLY A MATTER OF EDUCATING PEOPLE WHO RUN IRBs ABOUT RISKS AND TRADE-OFFS, AND GETTING THEM TO THINK -- YOU HAVE A VERY SIMILAR PROBLEM IN SECURITY FOR THE INTELLIGENCE AGENCIES. YOU HAVE TO GET AWAY FROM THE NOTION THAT YOU CAN HAVE PERFECT SECURITY; YOU HAVE TO BALANCE RISK VERSUS COST AND BENEFIT. SO THAT'S THE EVASIVE ANSWER TO YOUR QUESTION. >> YOUR QUESTION TO ME >> SNAP OUR FINGERS, SUPPOSE WE HAVE THE WONDERFUL REPRESENTATIONS AND WE CAN CREATE THEM AUTOMATICALLY, CAN YOU SAY MORE ABOUT THE APPLICATIONS YOU'RE LOOKING TO? I GUESS THERE WAS A CONNECTION HERE BECAUSE YOU TALKED ABOUT THE IMPORTANCE OF EXTRINSIC FOCUS IF YOU WERE -- OTHER SORTS OF THINGS. WHAT'S THE EXTRINSIC GOAL OF THE TEMPORAL REPRESENTATIONS IN THE CLINICAL SETTING? >> THE EXTRINSIC GOAL, NOT SURE I UNDERSTAND THE QUESTION THERE. >> WHAT HAPPENS WHEN THEY'RE NOT PERFECT? >> SO ONE MIGHT ASK THE SAME THING OF THE RESULTS OF ANY CHALLENGES. SO I THINK LOTS DO: SO WHAT AM I GOING TO DO WITH A GREAT SCORING RTE SYSTEM, WHO CARES? IT STILL DOESN'T DO ANYTHING. THAT'S A FAIR QUESTION. I THINK WITH TIME, A LOT OF PEOPLE HAVE SPECIFIC THINGS THEY WANT TO BE ABLE TO REASON ABOUT. ONE WOULD BE MANAGING MEDICATION, MANAGING DOSES, MANAGING AND CONTROLLING THE INFORMATION ASSOCIATED WITH THE 25 MEDICATIONS THAT YOU HAVE AND THE INTERACTIONS BETWEEN THEM. THE EFFECTS, IMPOSSIBLE JUST ON A PATIENT BY PATIENT BASIS. AND THEN OF COURSE, OBVIOUSLY, WITH SUCH DATA ACCUMULATED YOU CAN AFFORD TO DO RETROACTIVE LARGE SCALE STUDIES FOR SUCH THINGS. (INAUDIBLE) HELP ME ANSWER THE QUESTION. (OFF MIC) (OFF MIC) >> PATTERN RECOGNITION. >> I HAVE A FEW EXAMPLES ON THAT. WE'RE ALSO DOING CANCER SCREENING TESTS, WHERE WE ACTUALLY WANT TO FIND OUT WHEN THE PATIENT DID A COLON CANCER SCREENING TEST AND WHEN THE NEXT ONE IS SCHEDULED, TRYING TO SEND OUT A REMINDER, THIS KIND OF THING. 
ANOTHER THING IS DRUG EVENTS, LIKE DRUG PHARMACOVIGILANCE, AND WE WANT TO KNOW WHEN THE PATIENT STARTED THE DRUG, FOR HOW LONG, AT WHAT KIND OF DOSE. IT'S NOT CLEAR IN THE EMR SETTING WHEN THE PATIENT STARTED THE DRUG, BECAUSE IN A LOT OF SETTINGS YOU THINK WE HAVE PHARMACY FILL DATA AND THAT CAN TELL US, BUT YOU DON'T HAVE IT FOR OUTPATIENTS, SO PATIENTS GO OUTSIDE TO CVS TO GET THE DRUG -- BUT WE DO HAVE CLINICAL VISITS. WHAT KIND OF DRUG YOU WERE ON, THREE MONTHS AGO I WAS ON THIS DRUG, SO SORT OF ADDITIONAL EVIDENCE. THAT'S WHY WE WANT TO PARSE THOSE DATA WITH THE TIME STANDARD TO HELP THIS KIND OF DISCOVERY AND STUDIES. >> THANK YOU. (OFF MIC) (OFF MIC) >> SO CONSENTING OF PATIENTS IS CERTAINLY AN IMPORTANT POSSIBILITY. PATIENTSLIKEME RELIES ON PATIENTS TO SHARE NOT THE MEDICAL RECORD BUT MEDICAL INFORMATION, SO THAT IS AN ATTRACTIVE SOURCE OF INFORMATION. >> THAT'S AN INTERESTING QUESTION, BECAUSE IN OUR CONSENT FOR THE BIOBANK WE HAVE AN OPT-OUT MODEL, SO BY DEFAULT WE USE PATIENT DATA FOR DNA RESEARCH AND WE HAVE A CHECK BOX THAT SAYS IF YOU DON'T AGREE YOU CHECK THIS. I NEVER HEARD OF ASKING "I AGREE TO USING RECORDS FOR RESEARCH," CLINICAL NOTES, WE'RE USING THEM FOR RESEARCH. I DON'T KNOW IF THERE'S ANOTHER LEVEL OF CONSENT, LIKE USING NOTES FOR NLP PURPOSES. KIND OF INTERESTING. >> (INAUDIBLE) FROM (INDISCERNIBLE). I WOULD LIKE TO ADD SOME DATA ON THE DEIDENTIFICATION QUESTION, DEIDENTIFICATION AND SCALING FOR LARGER DATA SETS. SO WE DID AN ANALYSIS OF DEIDENTIFICATION SYSTEMS AND THE PAPER IS UNDER REVIEW. BUT WE CLEARLY NEEDED THE GOLD STANDARD OF CERTIFY NOTES, AND IN THIS SET OF NOTES WE FOUND ABOUT 30,000 PHI TOKENS. SO (INDISCERNIBLE) METHODS, WHAT TO DO WITH DEIDENTIFIED DATA: REMOVE IT AND REPLACE THE TOKENS, OBFUSCATE THE TOKENS, REPLACE WITH FAKE TOKENS, FAKE NAMES. THEN ORIGINALLY THE IRB'S REPRESENTATIVE WAS ON THE SIDE OF JUST LEAVE WITH STUDIES AND LEAVE WHATEVER WAS NOT IDENTIFIED BY THE SYSTEM. 
I ASKED THEM -- IN OUR INSTITUTION THE PHYSICIANS ARE WRITING ABOUT 5 MILLION NOTES IN A YEAR. SO THEY HAVE ABOUT 50 MILLION PHI TOKENS THERE. IF WE LEAVE 10% BECAUSE THE SYSTEM IS 90% EFFECTIVE AT REMOVING IT, THEN WE ARE -- IF THE SYSTEM IS 95%, THEN AT LEAST 2.5 MILLION TOKENS ARE LEFT THERE THAT ARE PHI TOKENS IN ONE YEAR. NEXT YEAR, BECAUSE OF THE ACCUMULATION OF DATA, IT WILL BE 5 MILLION TOKENS, AND THEN 7.5 MILLION, AND YOU CAN TELL WHAT THAT NUMBER IS WHEN IT'S SCALED UP. THIS EXAMPLE OF SCALE MADE THEM THINK ABOUT MAYBE (INAUDIBLE) SHOULD BE ADDED TO THE PIPELINE. >> I WANT TO LET DAVID TALK. (OFF MIC) >> BEHIND THE FIREWALL IS AN IMPORTANT THING. >> I TOTALLY AGREE WITH THAT. SO HE IS REALLY TRYING TO EMPHASIZE THE IMPORTANCE OF BRINGING THE SOFTWARE TO THE DATA. I TOTALLY AGREE WITH THAT. REMEMBER WE HAVE LIKE 1.8 MILLION PATIENTS COMPLETELY DEIDENTIFIED. BUT STILL OUR INSTITUTE IS NOT RELEASING TO THE PUBLIC. BUT IF IT INVOLVES AN INVESTIGATOR WE CAN SORT OF DO A COLLABORATION, AND WE HAVE A PROCEDURE, A FORMAL PROCEDURE, FOR EXTERNAL COLLABORATION. SO BY DOING THAT, YOU FILL OUT THE DATA AGREEMENT AND ARE ABLE TO ACCESS THAT DATA WITH COLLABORATIONS. I THINK OTHER INSTITUTES HAVE A SIMILAR SITUATION. ANY OTHER QUESTIONS? GO AHEAD. >> I'M (INAUDIBLE) FROM BRANDEIS. MY QUESTION IS ABOUT EVALUATION, AND OBVIOUSLY THIS HAS REALLY MOVED THE STATE OF THE ART FORWARD, FROM SPEECH AND ALL THESE AREAS. IN SPEECH IT SEEMED THAT AS EVALUATIONS WENT ON THEY REALLY STARTED TO REWARD INCREMENTAL PERFORMANCE AS OPPOSED TO SOME OTHER THINGS THAT WE MIGHT WANT, LIKE NEW APPROACHES. WE HEARD A LOT OF GOALS HERE: SYSTEMS THAT PERHAPS AREN'T SO PIPELINED BUT MORE INTEGRATED, BUT IF YOU CHANGE YOUR ARCHITECTURE YOUR PERFORMANCE ISN'T GOING TO IMPROVE NECESSARILY. SYSTEMS THAT ARE MORE EXPLANATORY THAN JUST GIVING THE ANSWER. CAN YOU TALK, LYNNETTE AND OTHERS, ABOUT OTHER WAYS TO DESIGN EVALUATIONS IN THIS AREA THAT MIGHT REWARD DIFFERENT THINGS THAN SIMPLY THE PERFORMANCE IMPROVEMENT? >> RIGHT. 
IF YOU DO THE SAME THING OVER AND OVER AGAIN YOU GET INCREMENTAL PERFORMANCE IMPROVEMENTS. CERTAINLY ONE OF THE LESSONS FROM THE MESSAGE UNDERSTANDING CONFERENCE, WHICH IS QUITE INTERESTING, IS THAT YOU, AT LEAST FOR SOME, HAVE TO HIT A WALL. THEY DIDN'T GET BEYOND IT AFTER A FEW TRIES. ALSO, WHEN YOU GET TO THE POINT OF A 10TH OF A PERCENT IMPROVEMENT IT'S PROBABLY TIME TO STOP AND DO SOMETHING ELSE. ONE OF THE WAYS THAT THIS HAS BEEN ATTACKED IS TO DO DIFFERENT COMPONENT THINGS, WHICH IS WHAT WE SEE IN I2B2, BUT I REALLY -- I WOULD LIKE TO SEE THE COMMUNITY STEP UP TO SOME OF THESE EXTRINSIC EVALUATIONS BECAUSE I THINK THAT THEY RAISE DIFFERENT QUESTIONS. I THINK THE PROBLEM IS YOU DON'T GET THE SAME INSIGHT. IT'S LESS SATISFYING FOR THE PEOPLE DOING THE DEVELOPMENT, THE COMPUTER SCIENTISTS, BECAUSE THEY DON'T NECESSARILY UNDERSTAND EXACTLY WHAT COMPONENTS CONTRIBUTED TO THE SUCCESS OR FAILURE. ON THE OTHER HAND, YOU CAN DO SOME VERY NICE ABLATION STUDIES WHICH ALLOW YOU TO DEMONSTRATE, YES, WHEN I ADDED THE TIME COMPONENT I DID MUCH BETTER ON THIS TASK. I THINK IT'S NICE TO ENCOURAGE THAT EXPERIMENTATION, AT LEAST IN MY VIEW, WITH MORE EXTRINSIC TASK EVALUATIONS. >> MAYBE SOMEWHAT RELATED TO EVALUATION: WHEN YOU START TO PROVIDE GOLD STANDARD DATA WE DON'T KNOW WHETHER THAT IS REALLY (INAUDIBLE) IMPLICIT DATA SET. SO YOU ANNOTATE CLINICAL DATA AND EVALUATE PERFORMANCE ON THAT SPECIFIC GOLD STANDARD DATA. THERE ARE NO GUARANTEES THAT THE ACTUAL GOLD STANDARD REFLECTS CHARACTERISTICS OF THE WHOLE SET OF PROBLEMS WE WANT TO ADDRESS. SO WE NEED SOME KIND OF WAY OF DESIGNING THE CHALLENGE TASK ITSELF. RIGHT NOW IT'S GIVEN AND WE TRY TO DO SOMETHING BUT ARE NOT SURE WHETHER -- I STRONGLY FEEL THAT WAY WHEN WE SEE AN I2B2 CHALLENGE TASK, WHETHER THAT IS REALLY REPRESENTATIVE OR JUST PART OF THE SMALL PROBLEM, WITH LOTS OF CHARACTERISTICS WHICH ARE NOT TRANSFERABLE TO OTHER SIMILAR DOMAINS. >> GREAT COMMENT. WE DO NEED TO MAKE EVALUATION INTO SOME KIND OF SCIENCE. 
I HAVE BEEN WANTING TO DO SOME KIND OF STUDY OF THE COST OF ANNOTATION, THE COST OF PREPARING THE GOLD STANDARD, AND ANY WAY OF QUANTIFYING -- YOU RAISE AN ISSUE: TO WHAT EXTENT RELATIVELY SMALL GOLD STANDARDS ARE REPRESENTATIVE OF THE LARGER WORLD IN WHICH THEY'RE OCCURRING IS ALSO RELATED TO THE PROBLEM OF DOMAIN ADAPTATION. AND I REALLY WISH WE COULD FIGURE OUT A WAY OF EVALUATING THE WAY OF ANNOTATION. SO THAT MIGHT PUSH THINGS FORWARD A BIT. >> SO FURTHER ON THE CHALLENGE TOPICS, I LIKE THE TOPICS OF DEIDENTIFICATION, PICKING -- AS WE'RE SAYING, PICKING THE RIGHT TOPICS. DEIDENTIFICATION, CO-REFERENCE, TIME, I LOVE, BECAUSE THEY BRING -- THEY SAY TO THE FIELD WE NEED TO DO MORE OF THIS, WE HAVE TO MOVE MORE INTO IT. SMOKING IS NOT ONLY BAD FOR YOUR HEALTH, IT'S BAD FOR NLP. IT WAS TOO SMALL, AND IF ANYTHING YOU CAN WRITE A PERL PROGRAM TO DO, YOU PROBABLY SHOULDN'T HAVE AS A CHALLENGE. MAYBE FOR EXTERNAL, BECAUSE THE EVALUATION BY ITS CHARACTER HAS TO BE SMALL, YOU DON'T TELL PEOPLE WHAT THE SMALL TOPICS ARE GOING TO BE. SO THE CHALLENGE IS: THE CORPUS IS IN THIS FORM AND THE OUTPUT IS -- AND THIS CHALLENGE IS, IS IT PRESENT TODAY, IN OTHER WORDS, WHAT HAPPENS THREE DAYS FROM NOW. AND THEN SMOKING OR SOMETHING, YOU CAN TRICK THEM BY DOING SMOKING AND REPEATING IT. I WOULDN'T DO THAT. ACTUALLY THAT'S WHAT WE DID IN '92, WHAT I DID TO CAROL WHEN I SAID WHY DON'T YOU GIVE ME MEDLEE TO EVALUATE, AND SHE SAID WHAT ARE YOU GOING TO DO WITH IT, I'LL GO FIX THAT PART. SO SHE HAD NO IDEA WHAT I WAS GOING TO ASK, WHERE IT WAS GOING TO WORK AND THE OUTPUT TECHNOLOGY, AND I PICKED COMMON PROBLEMS TO ANSWER, AND I DIDN'T KNOW HOW MEDLEE WORKED AND SHE DIDN'T KNOW WHAT I WAS GOING TO EVALUATE. SO THAT'S A HARD EVALUATION AND MAYBE EVERYONE WON'T GET 96% ON IT, THAT'S OKAY. MAYBE THAT WILL BE BETTER AT DISTINGUISHING HOW SYSTEMS DO. >> THAT'S REALLY GOOD. WITHOUT TELLING YOU WHAT THE COST EXACTLY IS. WE TALKED ABOUT EXTRINSIC EVALUATION. 
I COME FROM THE COMMERCIAL SIDE. A LOT OF YOU ARE ON THE ACADEMIC SIDE, WITH A FEW COMMERCIAL PEOPLE. THIS IS SOMETHING WE DEAL WITH ALL THE TIME: WE BUILD AN NLP SYSTEM AND IF IT DOESN'T SOLVE A CUSTOMER'S PROBLEM THEY DON'T BUY IT. WE GET OUR FEEDBACK OFF THE BAT. AND RESNIK ACTUALLY HAS BOTH HATS, THE COMMERCIAL AND THE ACADEMIC HAT, AND HE TALKED ABOUT THE CAC DOMAIN. AND IN THE CAC DOMAIN THE METRICS ARE CLEAR, HOW TO EVALUATE IT. AND WE HAVE PARTICIPATED IN HALF A DOZEN OF THESE CHALLENGE GRANTS, I2B2 CHALLENGES. PART OF THE PROBLEM IS YOU DON'T MAKE THE PROBLEM SOMETHING RELEVANT FOR PRACTICAL USE. THERE ARE SAID ALL KINDS OF THINGS, WHY NOT BUILD A PROBLEM LIST. THAT'S A PRACTICAL PROBLEM WHICH (INAUDIBLE) SOLVES WONDERFULLY WELL, BUT THAT WASN'T THE CHALLENGE. AND WE PARTICIPATED BECAUSE WE THOUGHT, HEY, WE CAN FIGURE THE PROBLEM LIST FROM THIS LIST NEATLY. AND ACTUALLY IT TURNS OUT IT DID HELP FOR THAT, BUT THAT WASN'T THE CHALLENGE. IF YOU MAKE THE PROBLEM PRACTICAL YOU EVALUATE OFF THE BAT. I HOPE WE CAN MOBILIZE THE INPUT TO FEED INTO THESE EVALUATIONS. >> THANKS, EVERYONE. [APPLAUSE] >> ALL RIGHT FOLKS, WHAT YOU HAVE BEEN WAITING FOR, THE ULTIMATE PANEL. LINGUISTICS IN BIOMEDICAL APPLICATIONS, ULTIMATE BECAUSE IT'S THE LAST ONE. AFTER WHICH TOM AND CAROL ARE GOING TO SUMMARIZE IN 15 MINUTES OR LESS EVERYTHING YOU LEARNED TODAY AND THEN WE WILL BE ADJOURNING. I KNOW NOEMIE ELHADAD IS HERE. SHE IS THE CHAIR FOR THIS PARTICULAR PANEL. SHE HAS HER DEGREE IN COMPUTER SCIENCE FROM COLUMBIA. SHE LEFT BRIEFLY FOR CITY COLLEGE, THEN RETURNED TO COLUMBIA AND IS ASSISTANT PROFESSOR IN THE DEPARTMENT OF BIOMEDICAL INFORMATICS. >> THANK YOU. HI, EVERYONE. MY PLEASURE TO INTRODUCE THE SPEAKERS FOR THIS PANEL. THE FIRST SPEAKER IS ALAN ARONSON, PRINCIPAL INVESTIGATOR AT THE -- I'M SORRY, I'M READING FROM MY iPHONE, IT'S HARD. AT THE LISTER HILL NATIONAL CENTER FOR BIOMEDICAL COMMUNICATIONS, NATIONAL LIBRARY OF MEDICINE, SINCE 1988. 
HIS RESEARCH FOCUSES ON APPLYING NATURAL LANGUAGE TECHNIQUES TO BIOMEDICAL TEXT FOR TASKS RANGING FROM INDEXING AND RETRIEVAL OF BIOMEDICAL LITERATURE TO CLINICAL TEXT. AND HIS RESEARCH GROUP IS RESPONSIBLE FOR THE NLM MEDICAL TEXT INDEXER, MTI, WHICH ASSISTS VARIOUS RESEARCHERS AND ALSO INDEXING EFFORTS SUCH AS MEDLINE INDEXING. DR. ARONSON'S CONTRIBUTIONS TO NLP INCLUDE (INDISCERNIBLE), A TOOL TO MAP BIOMEDICAL TEXT TO THE (INAUDIBLE), WHICH IS AN INVALUABLE RESOURCE FOR THE BIOMEDICAL INFORMATICS COMMUNITY. DR. ARONSON, THANK YOU. [APPLAUSE] >> I WILL BE TALKING ABOUT THE TWO PROGRAMS YOU MENTIONED, IN THE CONTEXT OF MY WORK IN THE NLM INDEXING INITIATIVE, WHICH DEVELOPS PROGRAMS TO HELP WITH VARIOUS INDEXING TASKS AT THE LIBRARY. RATHER THAN CONCENTRATE TOO MUCH ON DETAIL I'M GOING TO FOCUS ON LINGUISTIC ROOTS, ESPECIALLY OF METAMAP, AND WORK WE HAVE BEEN DOING WITH MACHINE LEARNING TECHNIQUES TO ANSWER SOME OF THE PROBLEMS WE HAVE ENCOUNTERED ALONG THE WAY, AND FINALLY TALK ABOUT AN ADDITIONAL FUNCTIONALITY OF THE MEDICAL TEXT INDEXER CALLED GENE INDEXING, A VERY EXCITING DEVELOPMENT THAT IS ONGOING. SO LET'S INTRODUCE METAMAP. HERE IS A SNIPPET OF TEXT ON SMOKING AS IT AFFECTS ATHEROSCLEROSIS. TEXT IN BLUE IS TEXT METAMAP FOUND A CONCEPT FOR, AND WHAT MTI DOES IS IT TAKES METAMAP'S RESULTS ALONG WITH OTHER INFORMATION AND CONSTRUCTS A LIST OF MESH HEADINGS THAT ARE DESIGNED TO CHARACTERIZE THE ARTICLE OR PIECE OF TEXT IN QUESTION. SO NOTICE HERE THE WORD ELDERLY IS GETTING MAPPED TO THE MESH HEADING AGED, SIMPLY BECAUSE THERE IS A CONCEPT IN THE RESOURCE, ELDERLY, AND IT HAS A SYNONYM, AGED. A SIMPLE RESULT THERE. THERE'S ANOTHER REASON PATIENTS GETS MAPPED TO HUMANS, BECAUSE IT SAYS IF YOU'RE INDEXING A CITATION ABOUT PATIENTS IT'S PROBABLY A HUMAN STUDY. SO YOU USE THE CHECK TAG HUMANS. SIMPLE. OKAY. FOCUS ON METAMAP TO BEGIN WITH. AS I ILLUSTRATED, IT'S A NAMED ENTITY RECOGNITION PROGRAM THAT FINDS UMLS CONCEPTS, BUT ITS HALLMARKS ARE LINGUISTIC RIGOR, FLEXIBLE PARTIAL MATCHING, 
AND ITS EMPHASIS ON THOROUGHNESS RATHER THAN SPEED. THE CURRENT CARETAKER OF METAMAP, CONTRACTOR FRANCOIS LANG, HAS MADE STRIDES SPEEDING IT UP, IF YOU'RE INTERESTED IN TRYING TO USE IT. THIS IS A VERY OLD SLIDE. IT DESCRIBES THE STEPS IN THE METAMAP ALGORITHM. ON THIS SLIDE THE ONLY THING THAT'S NOT LINGUISTICALLY MOTIVATED IS THE SPEECH TAGGER, A DELIGHTFUL PART OF SPEECH TAGGER, A STATISTICAL TAGGER THAT'S PART OF THE PARSING PROCESS. THE MAIN PARSING IS ACCOMPLISHED BY (INDISCERNIBLE) MINIMAL COMMITMENT PARSER AND A MANUALLY CONSTRUCTED LEXICON, WHICH HAVE DEEP LINGUISTIC MOTIVATION. ONE MORE POINT IN THE ALGORITHM, THE CANDIDATE EVALUATION: IT'S A WEIGHTED AVERAGE OF FOUR COMPONENTS, EACH OF WHICH HAS A LINGUISTIC INTERPRETATION. THE FIRST COMPONENT IS PURE LINGUISTICS, WHETHER OR NOT THE HEAD OF THE PHRASE IS INVOLVED IN THE MATCH. THEN THE AVERAGE OF VARIATION, AGAIN, ALL BASED ON THE LINGUISTIC MANIPULATION OF THE WORDS. THE LAST TWO ELEMENTS COUNT HOW MANY WORDS, AND IN A VAGUE SENSE THERE IS A LINGUISTIC INTERPRETATION OF THAT TOO. SO LET'S GO THROUGH AN EXAMPLE FOR THE TEXT INFERIOR VENA CAVA STENT FILTER. THE INTERMEDIATE RESULTS ARE THE CONCEPTS THAT METAMAP FOUND AFTER PARSING AND VARIANT GENERATION. CANDIDATES ARE LISTED IN ORDER OF METAMAP SCORE. THE SCORE IS NORMALIZED TO A VALUE OF NO MORE THAN 1,000, IN ORDER, SO THE TOP SCORING CANDIDATE IS AT THE TOP OF THE LIST. AFTER THE SCORES, WE GET CONCEPT UNIQUE IDENTIFIERS, THE ACTUAL MATCHING STRINGS FROM THE METATHESAURUS, AND SEMANTIC TYPES. THE TOP SCORING CANDIDATE GIVES YOU ALMOST THE COMPLETE ANSWER. THE ONLY THING IT DOESN'T MENTION IS STENT. BUT DOWN THE LIST YOU SEE SIX EXAMPLES OF FILTER, SIX WAYS AMBIGUOUS. LOOKING AT THE PREFERRED NAMES OF THOSE CONCEPTS, SOME ARE MEDICAL DEVICES, THERE'S AN INFORMATION PROCESS AND A CONCEPTUAL ENTITY, ALL OVER THE MAP. SECOND CASE OF AMBIGUITY: STENT IS TWO WAY AMBIGUOUS. FROM THE PREFERRED NAMES, THEY'RE MEDICAL DEVICES AND THEY MEAN THE SAME THING. SO WHAT DOES METAMAP DO? 
IT STARTS WITH INFERIOR VENA CAVA FILTER, AND THE ONLY THING IT HAS LEFT TO DO IS FIND OUT WHAT'S GOING ON WITH THE STENT, AND IT FORMS MAPPINGS FROM THE FILTER EXAMPLE AND THE TWO STENT ONES, FORMING TWO MAPPINGS WHICH SCORE EQUALLY WELL. THIS ILLUSTRATES METAMAP'S BIGGEST PROBLEM. THE AMBIGUITY PROBLEM IS GETTING WORSE. SO LET ME FOCUS ON WORD SENSE DISAMBIGUATION. HERE IS A SIMPLE EXAMPLE, KIDS WITH COLDS, ET CETERA, ET CETERA. THIS IS OBVIOUSLY ABOUT THE COMMON COLD, NOT ABOUT COLD SENSATION OR TEMPERATURE OR ANY OF THE OTHER COLD CONCEPTS IN THE METATHESAURUS. LUCKILY FOR THE LAST TWO YEARS I HAVE HAD A POST DOC FELLOW, ANTONIO (INDISCERNIBLE), WHO WORKED ON ANY NUMBER OF DISAMBIGUATION METHODS. THEY'RE BOTH KNOWLEDGE BASED, THE FIRST ON KNOWLEDGE IN THE UMLS, AND WHAT ANTONIO DID WAS CONSTRUCT PROFILE VECTORS OF WORDS FROM THE DEFINITIONS, SYNONYMS AND RELATED CONCEPTS FOR A GIVEN CONCEPT. SO WORDS ASSOCIATED WITH THE CONCEPT COMMON COLD ARE INFECT, DISEASE, FEVER, COUGH, ET CETERA. WORDS ASSOCIATED WITH COLD TEMPERATURE ARE DIFFERENT THAN THE COMMON COLD ONES. THE METHOD WORKS BY SIMPLY LOOKING AT THE CONTEXT OF THE AMBIGUITY, FORMING A VECTOR OF THOSE WORDS AND SEEING WHICH OF THE COMPETING PROFILE VECTORS COMES CLOSEST TO IT. SO IN THE EXAMPLE I GAVE, BECAUSE THE WORDS COUGH AND FEVER OCCUR WITH THE WORD COLD, COMMON COLD, THE CORRECT ANSWER, IS CHOSEN AS THE ANSWER BECAUSE ITS PROFILE VECTOR IS CLOSEST TO THAT TEXT. PRETTY SIMPLE. A SIMILAR METHOD, ALSO KNOWLEDGE BASED BUT BASED ON CITATIONS OCCURRING IN PUBMED, IS CONSTRUCTED BY NOTING THE UNAMBIGUOUS SYNONYMS FOR COMPETING CONCEPTS, IN THIS CASE THE TWO COLD EXAMPLES. ANTONIO FORMULATED THE QUERIES, POSED THEM TO MEDLINE AND USED TEXT FROM THE CITATIONS BACK FROM THE QUERY TO FORM THE PROFILE VECTORS, AND DID THE SAME AS BEFORE. ONE LOVELY SIDE EFFECT OF THIS PARTICULAR METHOD: THE CITATIONS FROM THIS QUERY CONSTITUTE A TEST COLLECTION. THEY'RE POSITIVE EXAMPLES FOR THE CONCEPTS THAT WERE USED TO RETRIEVE THEM. 
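[Editorial illustration, not part of the talk.] The profile-vector comparison described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the MetaMap or MTI implementation: the profile words for "cold temperature" are invented for illustration (the talk only lists words for the common cold), cosine similarity is assumed as the closeness measure, and the names `PROFILES`, `cosine`, and `disambiguate` are hypothetical.

```python
import math
from collections import Counter

# Toy profile vectors. The "common cold" words follow the talk's example;
# the "cold temperature" words are invented for illustration only.
PROFILES = {
    "common cold": Counter(["infection", "disease", "fever", "cough"]),
    "cold temperature": Counter(["temperature", "low", "freezing", "exposure"]),
}


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def disambiguate(context: str, profiles=PROFILES) -> str:
    """Return the concept whose profile vector is closest to the context."""
    context_vector = Counter(context.lower().split())
    return max(profiles, key=lambda concept: cosine(context_vector, profiles[concept]))


if __name__ == "__main__":
    print(disambiguate("kids with colds often have cough and fever"))
```

In a real system the profiles would be built from UMLS definitions, synonyms and related concepts, or from retrieved MEDLINE citations, as the speaker describes; the whitespace tokenizer here is only a stand-in.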
SO COMPARING THE TWO METHODS, THE UMLS METHOD AND THE MEDLINE OR CORPUS BASED METHOD, ON TWO TEST COLLECTIONS: THE OLD NLM TEST COLLECTION, CREATED MANUALLY YEARS AGO, WHICH TOOK A LOT OF EFFORT, AND THE NEW TEST COLLECTION CREATED AS A SIDE EFFECT OF THE CORPUS METHOD. IN BOTH CASES THE CORPUS METHOD OUTPERFORMED THE UMLS METHOD. A COUPLE OF THINGS ABOUT THE NEWLY CONSTRUCTED TEST COLLECTION: IT HAS 203 AMBIGUOUS WORD SENSES, FOUR TIMES THE NUMBER IN THE MANUALLY CONSTRUCTED COLLECTION WE DID MANY YEARS AGO. IT HAS BROADER SEMANTIC COVERAGE AND A LARGER DATA SET AND DIDN'T TAKE MUCH EFFORT AT ALL. IT WAS SIMPLY A SIDE EFFECT OF THIS PARTICULAR METHOD. USING THESE TWO METHODS TO MODIFY THE BEHAVIOR OF MTI, IT WAS CONFIRMED THAT THE CORPUS METHOD PERFORMS BETTER AS FAR AS MTI IS CONCERNED TOO. SO METAMAP STARTED WITH LINGUISTIC ROOTS BUT WE HAVE MOVED ON, AND FOR CERTAIN TASKS WE APPLY MACHINE LEARNING TECHNIQUES TO IMPROVE IT. ON TO THE MEDICAL TEXT INDEXER. HERE IS THE SAME EXAMPLE, A MEDLINE CITATION, CIGARETTE SMOKING AS IT AFFECTS ARTERIOSCLEROSIS. THE THING TO FOCUS ON IS ANOTHER PART OF THE RECORD, THE INDEXING THE PEOPLE IN THE INDEX SECTION CREATE. SO THEY DEFINE PUBLICATION TYPES, AND IN THIS CASE THE STUDY IS A COMPARATIVE STUDY AND INVOLVES NON-U.S. GOVERNMENT RESEARCH SUPPORT. THE MOST IMPORTANT PART IS THE MESH TERMS BELOW THOSE. THE INDEXER TRIES TO CAPTURE THE MEANING OF THE ARTICLE. SO THERE ARE TWO MESH HEADINGS INVOLVING AGE, THERE IS A SPECIFIC TERM, ARTERIOSCLEROSIS, AND FURTHER DOWN THE LIST THERE ARE THE CHECK TAGS FEMALE, HUMANS AND MALE. WE'LL RUN INTO THOSE IN A LITTLE BIT. IT IS THE CONSTRUCTION OF THIS LIST OF MESH HEADINGS THAT MTI IS DESIGNED TO ASSIST WITH. HERE IS THE UBIQUITOUS SYSTEM DIAGRAM. I'M GOING TO SAY ALMOST NOTHING. THEY PRODUCE MESH HEADINGS, THEY'RE CLUSTERED AND RANKED. WHAT'S INTERESTING IS, AS JIM (INAUDIBLE) WANTS TO SAY, JIM IS THE MAIN DEVELOPER, THE INTERESTING STUFF HAPPENS AT THE POST PROCESSING LEVEL AT THE END. 
BEFORE THAT IT'S A FAIRLY GENERIC ALGORITHM, BUT POST PROCESSING TAILORS THE RESULTS TO INDEXING AS PERFORMED AT THE LIBRARY. IN PARTICULAR, AMONG OTHER THINGS, WE APPLY INDEXING RULES. I'LL GIVE YOU A COUPLE OF EXAMPLES. I GAVE YOU ONE: RECOMMENDING HUMANS WHEN YOU SEE PATIENTS IS AN EXAMPLE OF AN INDEXING RULE. SO THEY ARE KNOWLEDGE BASED. ANOTHER IS TO TRY TO IMPROVE OUR RECOMMENDATION OF CHECK TAGS, AND I'LL DESCRIBE THAT IN A BIT. ONE IMPORTANT ONE IS SUBHEADINGS. DUE TO THE WORK OF (INDISCERNIBLE), WHO WAS A POST-DOC AND IS NOW AT NCBI, WE RECOMMEND SUBHEADINGS ATTACHED TO HEADINGS. HERE ARE EXAMPLES. WHEN MTI MAKES EGREGIOUS MISTAKES, WE GET FEEDBACK FROM THE INDEXERS. IN THIS PARTICULAR CASE WE HAD STEM CELL HIBERNATION, AND WE RECOMMENDED THE MESH HEADING, THAT'S ABOUT ANIMAL HIBERNATION, SO WE FIXED THAT. THE SAME THING GOES WITH THE SECOND EXAMPLE. AT ONE POINT MTI SAW THE WORD CLEAVE AND MENTIONED (INAUDIBLE) IN THE SPIKES. SO THE INDEXERS ARE PAYING ATTENTION TO WHAT WE ARE DOING AND WE ARE PAYING ATTENTION TO THEIR FEEDBACK. HOW IS MTI USED IN ADDITION TO ITS MAIN PURPOSE, MEDLINE INDEXING? A MODIFIED VERSION THAT PRODUCES MORE GENERAL RESULTS THAN NORMAL PROCESSING IS USED BY THE CATALOGING DIVISION AND THE HISTORY OF MEDICINE DIVISION TO INDEX THEIR COLLECTIONS. IN ADDITION, UNTIL RECENTLY, AUTOMATIC INDEXING, THAT IS, INDEXING WITHOUT HUMAN REVIEW AT ALL, WAS USED TO INDEX THE MEETING ABSTRACTS CONTAINED IN THE NLM GATEWAY. THE MOST INTERESTING THING OF LATE IS THAT MTI IS NOW USED AS A FIRST LINE INDEXER. NORMAL INDEXING AT THE LIBRARY CONSISTS OF THE FOLLOWING STEPS. MTI PROCESSES THE TITLE AND THE ABSTRACT OF A CITATION TO BE INDEXED, WE STORE THE INFORMATION, AND THE INDEXER CAN REFER TO THAT LIST OF RECOMMENDATIONS IF THEY SO DESIRE, THEY DON'T HAVE TO. A REVISER THEN MAKES FINAL ADJUSTMENTS AND RELEASES THE INDEXING FOR DISPLAY IN THE PUBMED SYSTEM. NOW MTI HAS (INAUDIBLE) COLLECTION OF 23 IN PARTICULAR, WHERE IT IS NOW USED AS IF THE INDEXING IT PRODUCED CAME FROM A HUMAN. 
SO IN THIS CASE INDEXING PROCEEDS AS USUAL, DIRECTLY TO A REVISER WHO MAKES ADJUSTMENTS AND RELEASES TO PUBMED AS BEFORE. BECAUSE THIS PARTICULAR FEATURE IS RELATIVELY NEW WE HAVE A FOLLOW ON STEP. THE INDEX SECTION IS TRACKING MTI FIRST LINE INDEXING RESULTS TO MAKE SURE THAT THE HIGH STANDARD IS HOLDING, AND SUSIE ROY JUST COMPLETED A STUDY OF MEDLINE JOURNALS AND SHE CAME UP WITH ANOTHER 20 SOME ODD JOURNALS IT CAN BE APPLIED TO, SO WE RECENTLY DOUBLED THE MTI FIRST LINE INDEXER JOURNALS TO 45 NOW. THIS IS A DROP IN THE BUCKET FOR TOTAL NLM INDEXING THROUGHPUT BUT STILL REPRESENTS A MAJOR STEP INVOLVING TRUSTING WHAT MTI CAN PRODUCE. AT LEAST FOR SOME JOURNALS, NOT ALL. I MENTIONED CHECK TAGS BEFORE. IN THE STUDY THAT ANTONIO DID, HE JUST APPLIED VARIOUS MACHINE LEARNING TECHNIQUES TO THE COMMONLY USED MESH HEADINGS CALLED CHECK TAGS, AND LET ME ILLUSTRATE SOME OF THE RESULTS HERE. MALE, FEMALE AND HUMANS ARE AMONG THE MOST APPLIED CHECK TAGS. THEY HAD REASONABLE F MEASURES BEFORE, BUT AFTER HIS MACHINE LEARNING THEY INCREASED MORE SO, AND HUMANS IS OVER 90% F MEASURE. MIDDLE AGED, WHICH WE ALMOST NEVER GOT RIGHT: WE HAD A 1% F MEASURE BEFORE, AND AFTER HIS RESULTS WE NOW HAVE ALMOST 60%. IT WAS A REALLY BIG WIN. HE JUST APPLIED STANDARD MACHINE LEARNING TECHNIQUES TO A CERTAIN SUBSET OF MESH HEADINGS THAT ARE OFTEN USED, AND THE RESULT HAS BEEN DELIGHTFUL. SO HOW IS MTI DOING IN GENERAL? THE CHART SHOWS RECALL, PRECISION, AND THEN THE F-1 MEASURE OVER A PERIOD OF YEARS. THIS IS THE NORMAL MTI PROCESSING. IF YOU FOCUS ON THE MTI FIRST LINE INDEXER, THE RESULTS ARE MUCH BETTER, AND THAT'S BECAUSE MTI DOES BETTER ON THESE JOURNALS IN THAT REGIME ANYWAY. FOCUSING ON THE LATTER PART OF THE GRAPH, A BIG CHANGE HAPPENED IN LATE 2010. PREVIOUSLY WE FOCUSED ON MAKING SURE OUR RESULTS HAD HIGH RECALL. 
THAT IS, WE DID NOT WANT TO MISS ANY MAIN POINT OF AN ARTICLE, AND WE PROVIDED A LIST THAT HAD A NUMBER OF RECOMMENDATIONS THAT JUST GENERALLY WOULDN'T BE USED IN INDEXING. IN LATE 2010 THE INDEXERS SAID NO, WE WOULD RATHER HAVE A SHORTER LIST AT THE RISK OF LOSING MAIN POINTS IN THE ARTICLE. WE JUST CHANGED A PARAMETER IN THE SYSTEM: THE PRECISION SKYROCKETED, RECALL WENT DOWN AND F MEASURE WENT UP. A SECOND POINT TO NOTE ABOUT THE GRAPH IS THAT IN EARLY 2010 THERE WAS AN INCREASE IN PERFORMANCE. WE THINK THAT'S DUE TO A COMBINATION OF ANTONIO'S WORK ON MACHINE LEARNING TECHNIQUES BUT ALSO THE LARGE NUMBER OF INDEXING RULES THAT WE HAVE INSTITUTED INTO THE SYSTEM BASED ON INDEXER FEEDBACK. ONE FINAL POINT. THIS IS AN ENTREZ GENE RECORD, YOU CAN READ IT OF COURSE. IT'S THE HUMAN GENE FLNA. THE FIRST ONE THERE IS "THESE RESULTS DEMONSTRATE FLNA IS PRONE TO PATHOGENIC REARRANGEMENTS." GENERIFS ARE LINKS FROM THE ENTREZ GENE RECORDS TO PUBMED. IF YOU CLICK ON THIS LINK YOU GET THE PUBMED ARTICLE THAT THE LINK REFERS TO. THE LINK IS INTENDED TO CAPTURE SOMETHING THAT IS FUNDAMENTAL ABOUT THE FUNCTION OR BIOLOGY OF GENES. GENE INDEXING IS PRODUCED BY INDEXERS AS THEY DO NORMAL MEDLINE INDEXING. IT TAKES A LOT OF WORK. AND A LIBRARY ASSOCIATE FELLOW, (INDISCERNIBLE), LAST YEAR DECIDED TO TRY TO ASSIST IN GENE INDEXING BY DEVELOPING THE GENE INDEXING ASSISTANT. WE NOW HAVE A PROTOTYPE, MODULAR, THAT CONSISTS OF LINGUISTICALLY MOTIVATED SYMBOLIC AND STATISTICAL COMPONENTS, MOSTLY OFF THE SHELF, BUT DESIGNED TO SIMPLY EXAMINE AN ARTICLE, DETERMINE IF IT'S APPROPRIATE FOR GENE INDEXING, IDENTIFY THE GENES THAT ARE MENTIONED IN THE ARTICLE, NORMALIZE THEM TO ENTREZ GENE NAMES, AND MAKE THE LINKS TO ENTREZ GENE. THEN IT SUGGESTS THE ANNOTATIONS, WHICH ARE LIKE THE LITTLE SNIPPET ON THE ONE GENE. THEY'RE TAKEN DIRECTLY FROM THE ARTICLE ITSELF AND CAPTURE WHAT THE GENERIF IS ABOUT. 
IT'S HOPED THAT THE USE OF THE GENE INDEXING ASSISTANT WILL INCREASE THE SPEED OF INDEXING, SINCE A LOT OF THE GRUNT WORK THAT THE INDEXERS HAVE TO DO IS DONE BY THE PROGRAM. AND WE'RE ALSO HOPING, BECAUSE THE INFORMATION IS THERE, THAT IT INCREASES THE COMPREHENSIVENESS OF GENE INDEXING. I HEARD AT THE END OF LAST WEEK THAT NLM IS GOING TO GO AHEAD AND INCORPORATE THE GENE INDEXING ASSISTANT INTO THE INDEXING SYSTEM, AND WE HOPE TO GO LIVE WITH REAL RESULTS FOR THE INDEXERS, HOPEFULLY SOMETIME IN JUNE. WITH THAT, THIS IS A LIST OF MY STAFF MEMBERS AND TWO FELLOWS WORKING WITH US RECENTLY. IF YOU HAVE QUESTIONS ABOUT METAMAP YOU CAN GO TO THAT WEBSITE. THANK YOU. [APPLAUSE] >> OUR SECOND SPEAKER IS KEVIN COHEN. HE LEADS A BIOMEDICAL TEXT MINING GROUP IN THE COMPUTATIONAL BIOSCIENCE PROGRAM AT THE UNIVERSITY OF COLORADO HEALTH SCIENCES CENTER AND IS ALSO ADJUNCT ASSISTANT PROFESSOR IN THE DEPARTMENT OF LINGUISTICS AT THE UNIVERSITY OF COLORADO BOULDER. HE'S A CHAIR OF THE ACL SPECIAL INTEREST GROUP IN BIONLP. HE STARTED HIS WORK IN NLP BUILDING SPEECH RECOGNITION FOR PSYCHIATRY NOTES IN 1997, AND FOR THE PAST TEN YEARS AT THE UNIVERSITY OF COLORADO SCHOOL OF MEDICINE HE HAS BEEN WORKING ON ENTITY IDENTIFICATION AND NORMALIZATION, BIOMEDICAL LITERATURE INFORMATION EXTRACTION AND INFORMATION RETRIEVAL AND QUESTION ANSWERING, AS WELL AS COMPUTATIONAL SEMANTICS. [APPLAUSE] >> GOOD AFTERNOON. WHAT I WOULD LIKE TO TALK ABOUT TODAY BRIEFLY IS OPPORTUNITIES FOR TRANSLATIONAL WORK IN BIOMEDICAL NATURAL LANGUAGE PROCESSING, GETTING RESULTS FROM WORK THAT WE HAVE DONE IN THE BIOLOGICAL DOMAIN INTO THE CLINICAL DOMAIN. AND THE SECOND POINT I WOULD LIKE TO MAKE IS THAT BACKGROUND KNOWLEDGE AND ONTOLOGIES ARE AT A RICH ENOUGH POINT NOW TO LEVERAGE IN NATURAL LANGUAGE PROCESSING. THAT MAY NOT COME AS A SURPRISE AFTER -- YOU HAVE TO SAY GESUNDHEIT IF SOMEONE SNEEZES SO WE CAN LIVEN IT UP. THAT MAY NOT BE A SURPRISE, BUT I'LL TALK ABOUT WHY IT IS A SURPRISE AND HOW WE CAN OPERATIONALIZE THAT. 
SO SINCE THE TITLE OF THIS WORKSHOP DEALT WITH CLINICAL DECISION SUPPORT AND I WORK PRIMARILY WITH JOURNAL ARTICLES, I TRIED TO APPROACH THE TOPIC OF MY TALK AS TRYING TO ANSWER THE QUESTION OF WHAT 14 YEARS OF RESEARCH INTO JOURNAL ARTICLES CAN TELL US ABOUT CLINICAL DOCUMENTS. SO AN EARLY OBSERVATION ON CLINICAL TEXT VERSUS JOURNAL ARTICLES COMES FROM CAROL FRIEDMAN AND ANDRE (INDISCERNIBLE), AND THEY POINTED OUT THAT CLINICAL TEXT IS NOUN DOMINATED WHILE BIOLOGICAL JOURNAL ARTICLES ARE VERY VERB DOMINATED, THOUGH THEY HAVE THESE NOMINALIZATIONS OF VERBS. WHAT WE MEAN BY NOMINALIZATION IS A NOUN DERIVED FROM A VERB. THERE ARE A COUPLE OF KINDS OF THESE: ACTIVATION, FROM THE WORD ACTIVATE, AND ACTIVATOR, WHICH IS AN ARGUMENT NOMINALIZATION FROM THE WORD ACTIVATE. IT TURNS OUT, I WROTE THIS ON A MAC AND THE PC MESSED UP THE SPACING, SO I APOLOGIZE FOR THE OBSCURED HEADINGS AND THINGS LIKE THAT. IT TURNS OUT THAT NOMINALIZATIONS ARE DOMINANT IN BIOMEDICAL JOURNAL ARTICLES. IN THE LEFT HAND COLUMN OF THE TABLE ARE THE TEN MOST COMMON VERBS IN A SET OF JOURNAL ARTICLES ABOUT MOLECULAR BIOLOGY. THEN YOU SEE THE COUNTS FOR THE NUMBER OF NOMINALIZATIONS OF THOSE VERBS AND THE COUNTS FOR ALL FORMS OF THE VERBS AS VERBS. WHAT YOU'LL SEE IS THAT THE NUMBERS IN BOLD ARE THE LARGER NUMBER COMPARING THE NOMINALIZATION VERSUS VERB COLUMNS, AND THE NOMINALIZATIONS PREDOMINATE HEAVILY OVER VERB FORMS, AND THEY MAY OUTNUMBER THE VERBS AS MUCH AS 3 TO 1. SO THESE NOMINALIZATIONS ARE THE KEY TO DOING TRANSLATIONAL RESEARCH. SO WHAT DO WE MEAN TALKING ABOUT TRANSLATIONAL RESEARCH? WE MEAN GETTING RESULTS FROM THE BENCH TO THE BEDSIDE. YOU MIGHT THINK OF THE BENCH AS WORK ON JOURNAL ARTICLES AND THE BEDSIDE AS WORK THAT WE DO ON CLINICAL DOCUMENTS. 
SO IT TURNS OUT THAT THESE NOMINALIZATIONS ARE FREQUENT AND SOMETIMES PREVALENT IN JOURNAL ARTICLES, AND THE NOMINALIZATIONS ARE ARGUMENT BEARING, I'LL TALK ABOUT THAT IN A MINUTE, AND A TIPPING POINT FOR INFORMATION EXTRACTION, AS WE SEE IN THE WORK FROM (INAUDIBLE). AND LINGUISTIC STUDY OF THESE THINGS REVEALS THERE ARE VERY COMPLEX AND INTERESTING PATTERNS OF BEHAVIOR OF ARGUMENTS AND NOMINALIZATIONS. SO WHEN I TALK ABOUT AN ARGUMENT, WHAT I MEAN BY THAT IS A PARTICIPANT IN THE ACTION OF THE PREDICATE, OF THE VERB OR THE NOMINALIZATION OF THAT VERB. SO WE USUALLY TALK ABOUT ARGUMENTS WITH NUMBERS LIKE ARG-0, 1, 2. THESE ARE ARGUMENTS FOR THE PREDICATE INCREASE, SO ARG-0 IS THE CAUSER, ARG-1 IS THE THING INCREASING, AND SO ON. SO WE CAN TALK ABOUT THE ARGUMENTS FOR VERBS APPEARING IN SENTENCES, USUALLY AS SYNTACTIC CONSTITUENTS, SO IN THIS CASE FOR THE WORD. WE CAN TALK ABOUT THESE HAPPENING FOR NOUNS AS WELL. SO IN THIS CASE, THIS IS OUR SECOND TIME SEEING THE EXAMPLE, DO 870, AN ANTIFUNGAL AGENT, IS THE ARG 0 OR CAUSER OF THE INCREASE IN THE ARG 1, THE THING INCREASING, WHICH IS TOTAL CYTOCHROME P-450 EPOXIDASE. I'M GOING TO TALK A LITTLE BIT ABOUT SOMETHING CALLED AN ALTERNATION. AN ALTERNATION IS A VARIATION IN THE SURFACE SYNTACTIC FORM OF THE ARGUMENTS OF A PREDICATE. AND WE'RE FAMILIAR WITH THESE: FOR VERBS WE HAVE SEEN THE ACTIVE AND PASSIVE ALTERNATION, AND THE TRANSITIVE AND INTRANSITIVE ALTERNATION. IF YOU'RE TALKING ABOUT NOMINALIZATIONS, WHAT COUNTS AS AN ALTERNATION IS WHERE AN ARGUMENT SHOWS UP RELATIVE TO THE NOMINALIZATION ITSELF. WE HAVE FOUR POSITIONS FOR THESE. SO WE CAN HAVE PRE-NOMINAL ARGUMENTS AS IN PHENOBARBITAL, AND POST-NOMINAL, INCREASES IN OXYGEN. WE HAVE CASES WHERE NO ARGUMENT IS PRESENT, AND WE HAVE CASES WHERE THE ARGUMENT IS EXTERNAL TO THE NOUN PHRASE ITSELF. AND AN INTERESTING CASE OF THESE IS THE PRE-NOMINAL ARGUMENT. WE CAN SEE CASES WHERE THE AGENT, THE THING PERFORMING THE ACTION, OCCURS IN FRONT OF THE NOMINALIZATION. 
WE CAN ALSO SEE CASES WHERE THE PATIENT, THE ARG-1, ROUGHLY EQUIVALENT TO THE LOGICAL OBJECT, OCCURS IN FRONT OF THE NOMINALIZATION. I WOULD LIKE TO POINT YOUR ATTENTION IN PARTICULAR TO THE CASE OF THESE LAST TWO EXAMPLES HERE, WHERE THE NOMINALIZATION IS TREATMENT AND WE SEE THE EXAMPLES PHENOBARBITAL TREATMENT AND CANCER TREATMENT. WE'LL COME BACK TO THESE LATER. SO I WAS INTERESTED IN THE GENERAL PHENOMENON OF THESE ALTERNATIONS IN NOMINALIZATIONS AND DID A CORPUS STUDY. WE INVESTIGATED A COUPLE OF HYPOTHESES HERE. ONE WAS THAT IN THE SUBLANGUAGE OF JOURNAL ARTICLES WE EXPECT A LIMITED NUMBER OF ALTERNATIONS. THE OTHER WAS THAT THE ARGUMENT SEMANTIC TYPE SHOULD BE PREDICTABLE FROM THE RESTRICTED SEMANTICS OF THE DOMAIN. SO WE HAVE LOTS OF HISTORY TO SUGGEST THIS. WE HEARD NOEMIE'S WORK TODAY. SHE SUGGESTED THAT IN THE BIOMEDICAL SUBLANGUAGE THE SCIENCE SPECIFIC VERBS HAVE ONLY ONE OR TWO OBJECT POSSIBILITIES, FEWER THAN IN THE USE IN ENGLISH AS A WHOLE. AND LYNNETTE HIRSCHMAN POINTED OUT EARLY THAT THE SUBJECT POSSIBILITIES DON'T HAVE TO BE DESCRIBED IN TERMS OF PARTICULAR WORDS, BUT RATHER THEY CAN BE DESCRIBED AS WORD CLASSES, AND SHE IDENTIFIED ABOUT 50 OF THESE IN A SET OF CLINICAL DOCUMENTS. SO TO DO THE CORPUS STUDY WE USED A SET OF ABSTRACTS OF JOURNAL ARTICLES ANNOTATED WITH PARTS OF SPEECH, SYNTACTIC STRUCTURES AND ENTITIES, AND WE TOOK THE TEN MOST COMMON DOMAIN SPECIFIC VERBS AND PULLED A TOTAL OF 746 TOKENS OF THEM AND MARKED WHERE THE ARGUMENTS SHOWED UP. WE SAW A NUMBER OF THINGS, BUT ONE OF THE MOST STRIKING ONES WAS THAT THE ATTESTED ALTERNATIONS ARE EXTRAORDINARILY DIVERSE. SO WHAT I'M SHOWING YOU HERE IS THE ALTERNATIONS FOR ARGUMENTS 0 AND 1 OF INHIBITION, WHICH HAS THREE ARGUMENTS. SO IF YOU REMEMBER, THERE ARE FOUR DIFFERENT SPOTS WHERE AN ARGUMENT SHOWS UP. I'M SHOWING YOU ALONG HERE, THERE ARE 16 POSSIBILITIES AND WE SAW 15 OF THE 16 POSSIBILITIES IN JUST 95 TOKENS OF THIS WORD. 
THE INITIAL PREDICTION WAS THAT WITHIN THE SUBLANGUAGE WE WOULD SEE ONLY A RESTRICTED SET, AND ENORMOUS DIVERSITY SHOWED UP. THIS IS THE APPLICATION SECTION, SO I'LL TALK ABOUT AN APPLICATION WE BUILT TO WORK WITH DATA LIKE THIS, CALLED OPENDMAP, A RULE BASED SEMANTIC PARSER. IT'S OPEN SOURCE, YOU CAN DOWNLOAD IT. ONE THING THAT'S INTERESTING ABOUT IT IN TERMS OF THE WAY WE USE IT IS THAT ALL ASPECTS OF ITS USE ARE STRUCTURED BY AN OPEN ACCESS COMMUNITY CONSENSUS ONTOLOGY. SO CERTAINLY USING AN ONTOLOGY TO BUILD A SEMANTIC PARSER IS SOMETHING A NUMBER OF PEOPLE IN THIS ROOM HAVE DONE, BUT WE USE ONLY OPEN ACCESS COMMUNITY CONSENSUS ONTOLOGIES, SUCH AS ONES YOU FIND, TO DO THIS. THE RULES ARE INCORPORATED, THEY HAVE SEMANTIC CORPORATION, THEY'RE HIGHLY FLEXIBLE WITH RESPECT TO HOW IT GETS ORDERED. WE HAVE GOTTEN STRONG PERFORMANCE IN THE BIOCREATIVE SHARED TASK TWICE. ONCE BY US, ONCE BY SOMEONE ELSE. (INAUDIBLE) USED IT AND THEY LEARNED A SET OF RULES AUTOMATICALLY AS OPPOSED TO MANUALLY, AND (INAUDIBLE) 2009 GOT THE HIGHEST PRECISION. SO THIS IS AN EXAMPLE OF HOW WE USE OUR ONTOLOGIES, OPEN ACCESS ONTOLOGIES, TO STRUCTURE TASKS. HERE IS THE CONCEPT IN THE GENE ONTOLOGY, TRANSPORT, AND NOTICE FOUR DIFFERENT SLOTS FOR THIS, DESCRIBING THE DESTINATION OF THE TRANSPORT EVENT, THE ORIGIN OF THE -- AND THERE IS AN EXAMPLE OF WHAT A RULE LOOKS LIKE FOR THIS. THE IMPORTANT POINTS FOR THE MOMENT ARE THE PARTICULAR SLOT FILLERS IN RED, THE TRANSPORTED, AND IN BLUE, TRANSPORT DESTINATION. TRANSPORTED IS CONSTRAINED TO BE MEMBERS OF THE SEQUENCE ONTOLOGY, AND TRANSPORT ORIGIN AND DESTINATION ARE CONSTRAINED TO BE MEMBERS OF THE CELLULAR COMPONENT ONTOLOGY. I TALKED ABOUT THE IDEA THAT THIS ALLOWS FLEXIBLE ORDERING OF TEXT. THE WAY THAT WORKS IS THAT WE HAVE THIS AT SIGN OPERATOR. 
WHAT THAT ALLOWS FOR IS TWO THINGS: IT ALLOWS FOR OPTIONALITY, AND IT ALLOWS FOR FLEXIBLE ORDERING RELATIVE TO OTHER THINGS MARKED WITH AN AT SIGN. SO THE SINGLE RULE HERE, TRANSLOCATION FROM OPTIONAL DETERMINER, TRANSPORT ORIGIN, TO OPTIONAL DETERMINER, TRANSPORT DESTINATION, RECOGNIZES BPO TRANSLOCATION FROM MITOCHONDRIA TO CYTOSOL AND BAX TRANSLOCATION TO THE MITOCHONDRIA. ONE OF THOSE IS REAL AND THE OTHER IS THE ONLY MADE-UP DATA I HAVE IN THIS TALK. ONE OF THE THINGS THAT'S COME UP OVER AND OVER TODAY IS THE IDEA OF USE OF KNOWLEDGE. I SAID I WOULD TALK ABOUT HOW WE CAN USE KNOWLEDGE IN A LANGUAGE PROCESSING SYSTEM, SO I WANT TO TALK ABOUT ONE EXPERIMENT THAT WE DID USING AN EXTERNAL KNOWLEDGE SOURCE. OUR GOAL IN THIS PARTICULAR CASE WAS INFORMATION EXTRACTION ABOUT GENE ACTIVATION EVENTS. THE INITIAL SYSTEM DIDN'T MAKE USE OF ANY EXTERNAL KNOWLEDGE OTHER THAN A STATISTICAL SYSTEM FOR RECOGNIZING WHERE PROTEIN NAMES OCCURRED. BOTH PARTICIPANTS IN A GENE ACTIVATION EVENT CAN BE PROTEINS. IN A LATER EXPERIMENT WE TRIED USING AN EXTERNAL KNOWLEDGE SOURCE, SPECIFICALLY GENE ONTOLOGY ANNOTATIONS. WE RESTRICTED THE SLOT FILLERS SO THAT THE ENZYME IN THE ACTIVATION EVENT HAD TO BE A PROTEIN THAT HAD A GO ANNOTATION OF CATALYTIC ACTIVITY, AND THE SUBSTRATE HAD TO HAVE A GO ANNOTATION WITH RECEPTOR ACTIVITY. SO THESE ARE THE RESULTS OVERALL. IN THIS COLUMN YOU SEE THE ORIGINAL NO-EXTERNAL-KNOWLEDGE RESULTS, IN THIS COLUMN THE KNOWLEDGE RESULTS, AND I WOULD LIKE TO DIRECT YOUR ATTENTION TO THIS PART OF THE SCREEN, WHERE WE SEE THAT USING THE EXTERNAL KNOWLEDGE GAVE US AN INCREASE IN PRECISION OF 20 POINTS AT THE COST OF A DECREASE IN RECALL OF SIX POINTS, FOR A TOTAL INCREASE IN F-MEASURE OF FIVE. SO THE POINT AGAIN IS THAT HAVING A LITTLE BIT OF THIS EXTERNAL KNOWLEDGE CAN BE HELPFUL. LOTS OF PEOPLE TALKED ABOUT USING KNOWLEDGE TODAY, BUT AS I SAID THIS IS ACTUALLY A CONTROVERSIAL QUESTION. 
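TO MAKE THE IDEA OF OPTIONAL, FREELY ORDERED SLOTS CONCRETE, HERE IS A SMALL PYTHON SKETCH. IT IS NOT THE PARSER DESCRIBED IN THE TALK AND IT DOES NOT USE ITS RULE SYNTAX. THE FUNCTION NAME `match_translocation`, THE TINY COMPONENT VOCABULARY AND THE TOKENIZATION ARE ALL ASSUMPTIONS FOR ILLUSTRATION, AND THE TRANSPORTED ENTITY IS SIMPLY TAKEN AS THE PRECEDING WORD RATHER THAN CHECKED AGAINST AN ONTOLOGY.

```python
import re

# Toy stand-in for membership in a cellular component ontology (assumption).
CELLULAR_COMPONENTS = {"mitochondria", "cytosol", "nucleus"}
DETERMINERS = {"the", "a", "an"}
# Each marker introduces one optional slot; the slots may come in either order.
SLOT_MARKERS = {"from": "origin", "to": "destination"}


def match_translocation(text):
    """Fill a transport frame from a phrase headed by 'translocation'."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    if "translocation" not in tokens:
        return None
    head = tokens.index("translocation")
    frame = {
        "transported": tokens[head - 1] if head > 0 else None,
        "origin": None,
        "destination": None,
    }
    i = head + 1
    while i < len(tokens):
        slot = SLOT_MARKERS.get(tokens[i])
        if slot is None or frame[slot] is not None:
            break  # not a slot marker, or this slot is already filled
        j = i + 1
        if j < len(tokens) and tokens[j] in DETERMINERS:
            j += 1  # optional determiner
        if j >= len(tokens) or tokens[j] not in CELLULAR_COMPONENTS:
            break  # filler fails the semantic constraint
        frame[slot] = tokens[j]
        i = j + 1
    return frame


print(match_translocation("Bax translocation from mitochondria to cytosol"))
print(match_translocation("Bax translocation to the mitochondria"))
```

BOTH PHRASES ARE HANDLED BY THE ONE PATTERN BECAUSE EACH SLOT IS OPTIONAL AND THE LOOP ACCEPTS THE MARKERS IN EITHER ORDER.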
PEOPLE HAVE THOUGHT THAT USING KNOWLEDGE WOULD BE HELPFUL SINCE THE 1970s AND HAVE BANGED THEIR HEADS REPEATEDLY AGAINST THE ISSUE OF HOW YOU GET THESE THINGS TO SCALE. ONE OF THE CLAIMS THAT I WOULD LIKE TO MAKE TODAY IS THAT WE'RE IN A REALLY DIFFERENT SITUATION TODAY FROM WHERE WE WERE EVEN TEN YEARS AGO WITH RESPECT TO THE NUMBER AND SIZE OF THE KNOWLEDGE SOURCES THAT WE HAVE AVAILABLE AND CAN USE IN BIOMEDICAL NATURAL LANGUAGE PROCESSING. SO I WOULD LIKE TO BRING YOU BACK AGAIN TO A COUPLE OF INTERESTING CASES WHERE WE SAW PRE-NOMINAL ARGUMENTS IN FRONT OF THE WORD TREATMENT: THE CASE OF PHENOBARBITAL TREATMENT, WHERE IT WAS THE AGENT OR THE ARG-0 BEFORE THE NOMINALIZATION, AND CANCER TREATMENT, WHERE IT WAS THE ARG-1 OR PATIENT THAT WAS IN FRONT OF THE NOMINALIZATION. IT TURNS OUT THAT IF YOU LOOK IN THE UMLS METATHESAURUS, PHENOBARBITAL TREATMENT SHOWS UP AS A CONCEPT AND HAS SEMANTIC TYPE T061, THERAPEUTIC OR PREVENTIVE PROCEDURE. IF WE LOOK AT THERAPEUTIC OR PREVENTIVE PROCEDURE, THIS IS NIDA, THEN WE SEE THAT THE RELATIONS FOR THERAPEUTIC OR PREVENTIVE PROCEDURE INCLUDE USES PHARMACOLOGIC SUBSTANCE. IF WE TAKE THE CASE OF CANCER TREATMENT, THAT SHOWS UP IN THE UMLS METATHESAURUS AS WELL. WHEN WE LOOK AT THERAPEUTIC OR PREVENTIVE PROCEDURE WE NOTICE THAT ONE OF THE RELATIONS WITHIN IT IS THERAPEUTIC OR PREVENTIVE PROCEDURE TREATS PATHOLOGIC FUNCTION. SO WE CAN TURN THESE INTO RULES VERY EASILY. GIVEN A PREDICATE WHERE THE NOMINALIZATION IS TREATMENT, YOU COULD IMAGINE A REPRESENTATION WITH AN ARG-0 PROVIDER, AN ARG-1 CONDITION AND AN ARG-2 TREATMENT. IF WE WRITE A RULE WITH THE STRUCTURE PHARMACOLOGIC SUBSTANCE TREATMENT THAT MAPS THE PHARMACOLOGIC SUBSTANCE TO ARG-2, THAT WILL TAKE CARE OF THE PHENOBARBITAL CASE. FOR THE CANCER TREATMENT CASE, A RULE OF THE FORM DISEASE TREATMENT WILL CORRECTLY MAP CANCER TO THE ARG-1 POSITION. 
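THE TWO RULES JUST DESCRIBED CAN BE SKETCHED IN A FEW LINES OF PYTHON. THIS IS ONLY AN ILLUSTRATION UNDER STATED ASSUMPTIONS: THE DICTIONARY `SEMANTIC_TYPE` IS A HAND-WRITTEN TOY LOOKUP, NOT OUTPUT FROM THE UMLS, AND THE FUNCTION NAME `label_prenominal` IS HYPOTHETICAL. THE ROLE NUMBERS FOLLOW THE REPRESENTATION GIVEN IN THE TALK, WITH ARG-1 FOR THE CONDITION AND ARG-2 FOR THE TREATMENT.

```python
# Toy lookup from a word to a semantic type (assumption, not real UMLS output).
SEMANTIC_TYPE = {
    "phenobarbital": "Pharmacologic Substance",
    "cancer": "Pathologic Function",
}

# One rule per semantic type of the word in front of "treatment".
ROLE_BY_TYPE = {
    "Pharmacologic Substance": "ARG2",  # the treatment that is used
    "Pathologic Function": "ARG1",      # the condition that is treated
}


def label_prenominal(phrase):
    """Assign an argument role to the modifier in '<modifier> treatment'."""
    tokens = phrase.lower().split()
    if len(tokens) != 2 or tokens[1] != "treatment":
        return None
    modifier = tokens[0]
    role = ROLE_BY_TYPE.get(SEMANTIC_TYPE.get(modifier))
    return (modifier, role) if role else None


print(label_prenominal("phenobarbital treatment"))
print(label_prenominal("cancer treatment"))
```

BECAUSE THE RULES ARE KEYED ON SEMANTIC TYPES RATHER THAN ON WORDS, ADDING A NEW DRUG OR DISEASE ONLY MEANS ADDING AN ENTRY TO THE LOOKUP, NOT WRITING A NEW RULE.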
SOMETHING I WOULD LIKE YOU TO KEEP IN MIND HERE IS THAT THE POINT OF THESE EXAMPLES ISN'T THAT PHENOBARBITAL TREATMENT AND CANCER TREATMENT ARE IN THE UMLS METATHESAURUS. THOSE ARE DUMB LUCK. I FEEL TERRIBLE SAYING DUMB LUCK IN THIS ENVIRONMENT, BUT IT WAS DUMB LUCK FOR ME. THE POINT OF THESE EXAMPLES IS THAT THERAPEUTIC OR PREVENTIVE PROCEDURE HAS TREATS AND USES RELATIONS AND THEY ARE MATCHED TO DIFFERENT SEMANTIC CATEGORIES. THAT MAKES IT POSSIBLE FOR THIS TO SCALE IN THIS WAY AND TO WORK FOR THE INTERESTING, CHALLENGING EXAMPLE, THE FACT THAT DIFFERENT ARGUMENT TYPES SHOW UP IN PRE-NOMINAL POSITION. THE MACINTOSH-TO-PC CONVERSION IS MESSING ME UP A BIT HERE, BUT THE POINT I WOULD LIKE TO MAKE IS THAT IT TURNS OUT NOMINALIZATIONS AND RELATIONAL NOUNS ARE IMPORTANT FOR CLINICAL TEXT, AND MAYBE MORE IMPORTANT FOR CLINICAL TEXT. WE HAVE SEEN A FAIR AMOUNT OF IT TODAY, AND ONE THING THAT WE HAVE NOTED IS THAT IT'S SYNTACTICALLY ABERRANT AS COMPARED TO GENERAL ENGLISH TEXT. WE MAY OR MAY NOT SEE VERBS, WE MAY OR MAY NOT SEE PREPOSITIONS, SO WE DON'T HAVE THE NORMAL SYNTACTIC CUES TO THE RELATIONSHIPS. IN THIS CASE THE DOMAIN KNOWLEDGE ALLOWS RECOVERY OF MISSING ARGUMENTS. FOR EXAMPLE, THE ELLIPSIS CASE IN CLINICAL TEXT IS MUCH DIFFERENT THAN IN GENERAL ENGLISH. IT IS THE DOMAIN KNOWLEDGE THAT ALLOWS US TO RECOVER THE FACT THAT IN PREVIOUS HOSPITALIZATION IT IS THE PATIENT THAT'S BEEN (INAUDIBLE), NOT, SAY, THE PHYSICIAN. AND RELATIONAL NOUNS ALLOW US TO INFER ARGUMENTS THAT ARE IMPLIED BY METONYMY OR OTHERWISE, THAT WE NEED TO GET THE FULL REPRESENTATION OF THE MEANING. FOR EXAMPLE, WE SEE ALTERNATIONS LIKE DORSAL SPINE SHOWS, WHICH IMPLIES WE HAVE HAD AN X-RAY OR AN MRI, SO WE MAY ALSO SEE THAT AS X-RAY OF DORSAL SPINE SHOWS. THIS ALTERNATION IS AN ASPECT OF THE SUBLANGUAGE THAT ALLOWS US TO RECOVER INFORMATION ONLY IMPLICIT IN THESE RELATIONAL NOUNS. 
A FINAL POINT TO MAKE, ONE THAT JUMPED OUT AT ME FROM THE EARLY WORK OF CAROL FRIEDMAN AND ANDRE (INDISCERNIBLE): THE MEDLEE PROGRAM WE HEARD ABOUT TODAY AND THE GENIES PROGRAM YOU HEARD LESS ABOUT, WHICH IS AN INFORMATION EXTRACTION SOLUTION FOR A VERY DIFFERENT DOMAIN, BIOMEDICAL JOURNAL ARTICLES, HAVE THE SAME PROCESSING ENGINE IN BOTH CASES. THE SUBLANGUAGE GRAMMAR IN THE TWO CASES IS WHAT DIFFERS BETWEEN THEM. SO IN CONCLUSION, THE LINGUISTIC APPROACHES TURN OUT TO BE REVEALING AND TO GIVE US THE MEANS TO TRANSLATE WHAT WE LEARNED FROM JOURNAL ARTICLES INTO WORK ON CLINICAL DOCUMENTS, AND NOMINALIZATIONS ARE JUST AN EXAMPLE OF HOW WE CAN TAKE SOMETHING WE HAVE LEARNED ABOUT IN THE JOURNAL ARTICLES AND APPLY IT IN THE CASE OF CLINICAL DOCUMENTS, WHERE NOUNS ARE SO IMPORTANT. SECOND, KNOWLEDGE-BASED APPROACHES ARE FEASIBLE IN THE BIOMEDICAL DOMAIN IN A WAY THEY WEREN'T UNTIL RECENTLY, AND BOTH WILL LEVERAGE A CONSIDERABLE PREVIOUS AND CURRENT NIH INVESTMENT, IN PARTICULAR IN ONTOLOGY AND KNOWLEDGE BASE DEVELOPMENT. THANK YOU. [APPLAUSE] >> OUR NEXT SPEAKER IS (INAUDIBLE). AS WE LEARNED THIS MORNING DR. RINDFLESCH HAS Ph.D. IN ARABIC AND Ph.D. IN LINGUISTICS FROM THE UNIVERSITY OF MINNESOTA. HE JOINED LISTER HILL IN 1991 AND HAS WORKED IN NATURAL LANGUAGE GROUPS ON ACCESSING AND MANAGING BIOMEDICAL INFORMATION. HE HAS DEVELOPED THE (INAUDIBLE) PROGRAM, WHICH LEVERAGES MEDICAL DOMAIN KNOWLEDGE IN THE UMLS METATHESAURUS AND SEMANTIC NETWORK FOR EXTRACTING SEMANTIC PREDICATIONS FROM BIOMEDICAL TEXT. A DATABASE OF PREDICATIONS HAS BEEN MADE AVAILABLE TO THE NLP COMMUNITY AND IS EXPLOITED IN THE SEMANTIC MEDLINE SYSTEM, WHICH IS AN ADVANCED INFORMATION MANAGEMENT APPLICATION FOR BIOMEDICINE. [APPLAUSE] >> I HADN'T PLANNED THIS OUT, WE DIDN'T WORK IT OUT, BUT I'M REALLY TAKING WHAT KEVIN SAID AND SORT OF SHOWING AN IMPLEMENTATION, AND MY WORK IS LARGELY BASED ON THE WORK OF ALAN, SO I'M STANDING ON THE SHOULDERS OF GIANTS, I THINK. 
BUT THE CONTEXT I WANT TO PUT THIS IN: THERE'S BEEN A LOT OF DIVERSITY ALONG SEVERAL AXES IN THE APPROACH TO NATURAL LANGUAGE PROCESSING IN THE BIOMEDICAL FIELD, ARTICULATED VERY WELL BY OUR PREVIOUS SPEAKERS. AND I GUESS THE THING I WANT TO SAY, PICKING UP ON KAREN AND KEVIN'S WORK, IS THAT THE CRUCIAL ISSUE HERE IS THE EFFECTIVE NATURAL LANGUAGE PROCESSING THAT WE NEED. THE LANGUAGE COMPLEXITY IS DAUNTING. I'M SAYING THAT THE NECESSARY CONDITION FOR GETTING EFFECTIVE RESULTS, WHICH EVERYONE SAYS, IS A CHARACTERIZATION OF THE STRUCTURE OF LANGUAGE, AND I'M GOING TO SAY AS WELL, WHICH IS CONTROVERSIAL, THAT THAT IS SUFFICIENT AS WELL. SO REALLY, KEVIN AND I HAVE BOTH SAID THIS, THE THING WE NEED TO EXPLOIT IS EXISTING RESOURCES IN THE TWO AREAS, LINGUISTIC KNOWLEDGE AND ONTOLOGY OF THE DOMAIN. ONE THING I'M SAYING, AND THIS IS WHERE SERGEI NIRENBURG COMES IN, IS THAT THERE IS ONE WAY TO DO THIS. ONE PROBLEM WITH RULE-BASED SYSTEMS IS SCALABILITY, HOW MUCH CAN YOU DO. PEOPLE SAID, WELL, THEY'RE GOING TO BE AT THIS FOREVER. SO ABSTRACT AWAY FROM FULL INTERPRETATION IN ORDER TO GET RESULTS MORE QUICKLY. ULTIMATELY YOU WANT TO DO EVERYTHING BUT YOU CAN'T DO IT ALL TO BEGIN WITH. FURTHER, YOU DO NEED SOME SORT OF PRINCIPLED SOLUTION, BECAUSE WITHOUT A PRINCIPLED SOLUTION YOU WON'T HAVE THE INCREMENTAL DEVELOPMENT THAT WILL ALLOW YOU TO START SMALL AND GO FURTHER, AND YOU NEED EXTENSIBILITY. YOU NEED IT QUICKLY, WHICH IS CERTAINLY A BIG ASPECT OF ROBUSTNESS. SO I DON'T NEED TO GO INTO MUCH DETAIL HERE. OTHERS HAVE POINTED OUT THE TWO GENERAL APPROACHES, AND AS CAROL SAID THESE HAVE SWUNG BACK AND FORTH OVER THE AGES. I GUESS YOU COULD CALL MY METHOD BALANCED WITH RESPECT TO THE FORMALISMS POPULAR IN THE '70s AND '80s, BUT IT ISN'T BALANCED WITH RESPECT TO STATISTICAL APPROACHES, SINCE IT DOESN'T USE THEM, AND IT IS ESSENTIALLY EXTREMELY KNOWLEDGE INTENSIVE. THE CASE STUDY I'LL GIVE IS ONE I HAVE BEEN DEVELOPING WITH MY GROUP. 
ESPECIALLY WITH THE INPUT OF (INAUDIBLE), AS KEVIN MENTIONED, AND MARCELO FISZMAN. SO WE EXTRACT PREDICATIONS, WHAT LINGUISTS CALL PROPOSITIONS, FROM THE BIOMEDICAL RESEARCH LITERATURE, THAT IS TO SAY MEDLINE CITATIONS, TITLES AND ABSTRACTS. IT IS BASED ON GENERALIZATIONS ABOUT THE STRUCTURE OF ENGLISH AND THE STRUCTURE OF THE DOMAIN THAT WE GET FROM THE UMLS. WE WANT TO HAVE LINGUISTIC PRINCIPLE, BUT WE BALANCE LINGUISTIC INSIGHT WITH IMPLEMENTATION EXPEDIENCY IN SEVERAL WAYS. THE SYNTAX IS UNDERSPECIFIED, AND IS SOMETIMES CALLED A CHUNKING PARSER. IT ESSENTIALLY IDENTIFIES SIMPLE NOUN PHRASES, THAT IS, NOUN PHRASES WITH NO MODIFICATION TO THE RIGHT OF THE HEAD. IT HAS CORE PREDICATIONS ONLY, BUT NONETHELESS QUITE A FEW, 26 PREDICATES IN A RANGE OF ASPECTS OF THE BIOMEDICAL DOMAIN, WHICH I'LL POINT OUT IN DETAIL LATER. WE LIMIT BY SUBDOMAINS OF BIOMEDICINE, TYING IN WITH THE CAROL FRIEDMAN ISSUE. I'LL GIVE MORE DETAILS LATER. IT HAS A THEORETICAL FRAMEWORK, IT WASN'T JUST (INAUDIBLE) DREAMED UP BY ME. THE SEMANTIC NOTION THAT THE MEANING OF WORDS IS CRUCIAL TO THE FINAL MEANING OF THE SENTENCE, CREWS AND SIGNIFICANT CONTRIBUTIONS BY (INAUDIBLE), WHOM WE HEARD FROM. EVEN THOUGH (INAUDIBLE) WILL NOT AGREE WITH EVERYTHING I SAY, THAT'S FOR SURE, I DO THINK, AS I SAID EARLIER IN INTRODUCING HIM, THAT HIS BOOK ON ONTOLOGICAL SEMANTICS, WHICH HAS AS ITS KEYNOTE THAT ONTOLOGY IS THE MAIN METALANGUAGE OF MEANING, IS EXTREMELY PROFOUND. YOU CAN BE A PERFECT SPEAKER OF ENGLISH AND KNOW YOUR SYNTAX INSIDE AND OUT, BUT IF SOMEONE STARTS TALKING ABOUT DETAILS OF 16TH-CENTURY SAILING TECHNOLOGY, YOU'RE NOT GOING TO HAVE A CLUE WHAT IT SAYS. AND EVEN IF YOU CAN'T GET THE SYNTACTIC STRUCTURES RIGHT, IF YOU UNDERSTAND THE CONCEPTS, THE ONTOLOGY, YOU GET THE ROUGH DRIFT. THE UNDERLYING MEANING IS EXTREMELY IMPORTANT IN NATURAL LANGUAGE UNDERSTANDING. ALSO THE SYNTAX DRAWS ON THE IDEAS OF TEXT THEORY, THE NOTION THAT SYNTACTIC RULES ARE INTERPRETIVE DEVICES, NOT JUST SIMPLY GENERATIVE DEVICES. 
I USE A SORT OF UNDERSPECIFIED DEPENDENCY GRAMMAR, WHICH IS CERTAINLY POPULAR IN GENERAL TODAY. SO, SOME OF THE STRUCTURES THAT IT ADDRESSES. I WANT TO POINT OUT THAT IT IS NOT SIMPLY LOOKING AT SIMPLE NOUN PHRASES. SOME OF THE STRUCTURES ARE SOPHISTICATED AND, AS PROFESSOR (INAUDIBLE) POINTED OUT, SOME ARE RATHER CHALLENGING. THERE IS WHAT I CALL THE CORE STRUCTURE OF ENGLISH PROPOSITIONS, MPV, AND NOMINALIZATIONS. (INAUDIBLE) DID CONSIDERABLE WORK, AS KEVIN POINTED OUT, ON THE DETAILS OF FINDING THE ARGUMENTS OF NOMINALIZATIONS. MARCELO FISZMAN DID WORK ON COMPARATIVES, SOME OF THEM. CAROL HAS MAJOR WORK ON ENGLISH COMPARATIVES AND WE DON'T ADDRESS ALL OF THEM. THEN RELATIVIZATION AND ARGUMENT COORDINATION. COORDINATION IS A HUGE PROBLEM IN NATURAL LANGUAGE PROCESSING, AND WE HAVE SORT OF SCALED BACK, RATCHETED BACK THE COMPLEXITY. IT'S MUCH EASIER TO FIND THAT TWO NOUN PHRASES ARE COORDINATED THAN TO DETERMINE THE ENTIRE COORDINATION, WHAT I CALL THE COORDINATION CONFIGURATION, THE INTERACTION OF COORDINATE VERBS IN A COMMON SENTENCE THAT WE COMMONLY SEE IN THE BIOMEDICAL DOMAIN. THIS IS AN ISSUE OF STARTING SMALL. WE'LL MOVE TO PREDICATE COORDINATION AS THE NEXT STEP. SO, TO GIVE SOME IDEA OF HOW THE SYSTEM WORKS. THE INPUT TEXT IS NOT ACTUALLY A COMPLETE SENTENCE BUT A COMPLICATED ENOUGH PHRASE TO GIVE YOU AN IDEA OF THE DETAILS: AGGRESSIVE COMBINATION CHEMOTHERAPY IN HYPERCALCEMIC RENAL FAILURE. NONE OF THIS COULD BE DONE WITHOUT IT, AND WE USE THE MEDPOST TAGGER BY LARRY SMITH, WHO HAS SINCE MOVED ON. NONE OF THE WORDS IN THIS SENTENCE HAS CATEGORY LABEL, PART-OF-SPEECH, AMBIGUITY. IF THEY DID, THE TAGGER WOULD TAKE CARE OF IT. THEN WE APPLY THE PARSER, WHICH I WROTE MANY YEARS AGO, WHICH IDENTIFIES THESE SIMPLE NOUN PHRASES AND DOESN'T GIVE THE ENTIRE INTERNAL STRUCTURE OF A SIMPLE NOUN PHRASE, BECAUSE THAT, AS NOEMIE POINTED OUT MANY YEARS AGO, CAN BE EXTREMELY COMPLICATED. IT DOES FIND THE HEAD AND EVERYTHING ELSE, WITH THE PREPOSITION, FOR THE PURIST, INCLUDED RIGHT INSIDE THE NOUN PHRASE. 
I COULDN'T SEE A REASON TO ADD ANOTHER NODE AND CALL THAT A PREPOSITIONAL PHRASE. THINK OF IT AS A SURFACE CASE MARKER. THIS IS METAMAP OUT OF THE BOX. CRUCIALLY HERE, WE GET THE SEMANTIC TYPES AND SOURCES ASSOCIATED WITH THIS. AS CAROL SAID EARLIER THIS MORNING, IF YOU HAVE AN ONTOLOGY YOU HAVE HUGE KNOWLEDGE AVAILABLE TO YOU. KEVIN POINTED OUT THE VALUE OF THE SEMANTIC TYPE, SO WE KNOW THIS IS A THERAPEUTIC OR PREVENTIVE PROCEDURE, AND IT IS NORMALIZED WITH THE UMLS PREFERRED NAME TO DRUG THERAPY, COMBINATION. THE SAME THING FOR RENAL FAILURE, NORMALIZED TO KIDNEY FAILURE, DISEASE OR SYNDROME. HYPERCALCEMIC DIDN'T MAP AT THE TIME. IT MAY NOW, IT PROBABLY DOES MAP NOW, SINCE THIS EXAMPLE IS FROM AN OLDER VERSION OF THE UMLS. BUT THE SYSTEM IS UNDERSPECIFIED IN THE SENSE THAT WE DON'T PUT UP OUR HANDS AND STOP. WE SAY, WELL, WE WOULD RATHER AT LEAST KNOW THIS IS KIDNEY FAILURE EVEN IF WE DON'T HAVE THE FULL DETAIL. HALF A LOAF IS BETTER THAN NO LOAF. KEVIN AND I HADN'T PLANNED THIS, BUT WITH THE WORK THAT WE HAVE DONE IN MY GROUP IDENTIFYING ARGUMENTS OF NOMINALIZATIONS, THIS IS AN EXAMPLE OF THE LINGUISTIC PRINCIPLES THAT APPLY. THERE ARE LOTS OF PRINCIPLES, AND KEVIN HAS DONE SEMINAL WORK IN SORTING MANY OF THESE OUT. WE HAVE DONE WHAT I THINK I CAN CALL SEMINAL WORK IN APPLYING THEM. SOME OF THE (INAUDIBLE) APPLY HERE. YOU KNOW MANAGEMENT IS A NOMINALIZATION. ONE OF THE WAYS YOU CAN FIND THE OBJECT IS IF IT'S FOLLOWED BY A PREPOSITIONAL PHRASE. ONE PLACE TO FIND THE SUBJECT IS OUTSIDE THE NOUN PHRASE AND TO THE LEFT: IF THE NOMINALIZATION OCCURS IN A NOUN PHRASE WITH A PREPOSITION, THIS IS OFTEN, NOT IN ALL CASES BUT OFTEN, SAYING THE SUBJECT IS TO THE LEFT. SO THOSE ARE SYNTACTIC ARGUMENTS. NOW MOVING TO THE PARTNER OF THAT, THE SEMANTICS. YOU CAN CALL THEM NORMALIZATIONS OF SEMANTIC RELATIONSHIPS. THEY'RE INDICATOR RULES, SO MANAGEMENT CAN MEAN THE RELATION TREATS FROM THE UMLS SEMANTIC NETWORK. THERE IS A TASK OF FINDING THOSE RULES. 
THAT'S A MATTER OF RUNNING LOTS OF TEXT THROUGH THE SYSTEM AND DISCOVERING THOSE RULES, SO ALL OF THAT MAPS TO, KIND OF ABSTRACTED AWAY TO, THE RELATIONSHIPS THAT ARE IN THE SEMANTIC NETWORK, AND THIS IS HUGE FOR EXTENSIBILITY AND ROBUSTNESS AND SCALABILITY. BECAUSE THE UMLS TELLS US, THROUGH THE THESAURUS, THAT PHARMACOLOGIC SUBSTANCE TREATS DISEASE OR SYNDROME, MEDICAL DEVICE TREATS SIGN OR SYMPTOM AND SO ON. SO THESE ARE ALL THE POSSIBILITIES. IF YOU SAY MANAGEMENT, SYNTACTICALLY YOU MIGHT MEAN ANY ONE OF THESE. SO WHAT DO YOU MEAN IN THIS SPECIFIC CONTEXT? WELL, WE SORT OF MATCH UP THESE SEMANTIC TYPES FROM HERE AND THE OBJECT AND WE GET, OH, THIS IS WHAT WE MEAN. THIS MEANS THERAPEUTIC OR PREVENTIVE PROCEDURE TREATS DISEASE OR SYNDROME. FOR THE FINAL PROPOSITION OR PREDICATION WE TAKE THE CONCEPTS THAT METAMAP GAVE US AND PUT THOSE IN THE STRUCTURE, AND WE HAVE A FORMAL FIRST-ORDER PREDICATE CALCULUS EXPRESSION OR SEMANTIC WEB TRIPLE THAT TELLS US NOT THE WHOLE THING, BUT PART OF THE MEANING OF THIS PHRASE. SO WE HAVE RUN THE SYSTEM ON THE WHOLE OF MEDLINE, MORE THAN 21 MILLION CITATIONS, TITLES AND ABSTRACTS, AND AB TRACTS IN 6 MILLION PREDICATIONS, AND WE'RE UPDATING THIS, AND WE MAKE IT AVAILABLE TO THE RESEARCH COMMUNITY AS A SQL DATABASE AND AS RDF TRIPLES, THANKS TO WORK DONE BY (INAUDIBLE). SO, SOME PRACTICAL MATTERS. HOW MUCH OF THE DOMAIN? NOT EVERYTHING, BUT IT'S REALLY MUCH MORE EXTENSIVE, I THINK, THAN MOST PEOPLE REALIZE. MOST PEOPLE IN THIS AREA OF SEMANTIC RELATIONS HAVE CONCENTRATED ON MICROBIOLOGY, BUT WE STARTED WITH CLINICAL MEDICINE, THE CLINICAL RESEARCH LITERATURE, AND EXTENDED QUITE EARLY TO GENETIC ETIOLOGY OF DISEASE, SUBSTANCE INTERACTIONS, PROTEIN-PROTEIN, THAT SORT OF THING, PHARMACOGENOMICS, PROTEINS, DRUGS AND GENES, AND MORE RECENTLY, WITH THE WORK OF (INAUDIBLE) IN MY GROUP, WE'VE GONE TO INFLUENZA EPIDEMIC PREPAREDNESS AND TO CLIMATE CHANGE AND HEALTH, PARTLY BECAUSE DR. LINDBERG HAS BEEN INVOLVED IN GOVERNMENT EFFORTS IN THAT REGARD. 
HEALTH PROMOTION. PUBLIC HEALTH. HEALTHY LIVING. AND VERY RECENTLY WE DID WORK IN APPLYING THE SYSTEM TO TEXT ON BIOMEDICAL KNOWLEDGE PROCESSES, MAINLY BIONLP. WE SUBMITTED THAT TO AMIA AND THE RESULTS LOOKED QUITE PROMISING. SO, EVALUATIONS. WE HAVEN'T EVALUATED 57 MILLION PREDICATIONS, BUT WE HAVE DONE SEVERAL FOCUSED EVALUATIONS OVER THE YEARS. SOME ARE FOCUSED ON MEDICAL SUBDOMAINS, BECAUSE YOU GET CERTAIN ISSUES THAT CAUSE PROBLEMS THERE, AND OTHERS ARE FOCUSED ON STRUCTURE. ONE WAS THE PREDICATIONS WORK DONE WITH MARCELO, THE COMPARATIVE WORK HE DID, AND THE WORK ON NOMINALIZATIONS BY HALIL. OVERALL, PRECISION SEEMS TO BE 75%, MORE FOR MICROBIOLOGY, WHERE IT IS HARD TO GET THE ENTITIES, AND WE GET 60% OF THE PREDICATIONS WE SHOULD GET, NOT OF EVERYTHING SAID IN THE SENTENCE. AS A BIOMEDICAL APPLICATION, WE HAVE WHAT WE CALL SEMANTIC MEDLINE, WHICH SITS ON TOP OF PUBMED. IT TAKES MEDLINE CITATIONS, EXTRACTS SEMANTIC PREDICATIONS, AUTOMATICALLY SUMMARIZES THEM USING WORK DONE BY MARCELO FISZMAN, AND PRODUCES A GRAPHICAL SUMMARY LIKE THIS, WHERE THE COLORS REPRESENT PREDICATES. SO IT SAYS HERE THAT THE PTGER2 ALLELE AFFECTS SKIN NEOPLASMS, IS ASSOCIATED WITH TUMORIGENESIS AND INTERACTS WITH ONCOGENE PROTEIN PP-6. WE HAVE BEEN APPLYING THIS PREDICATION AND GRAPH-BASED REPRESENTATION, NOT SEMANTIC MEDLINE ITSELF BUT OTHER ABSTRACT WAYS OF USING IT, ESPECIALLY TO DISCOVERY. THERE IS WORK BY A POSTDOC IN MY GROUP WHO IS A PHYSICIAN, WORKING IN (INAUDIBLE) LAB. USING THIS TECHNOLOGY HE DISCOVERED (INDISCERNIBLE) THAT LOW TESTOSTERONE DEGRADES SLEEP IN AGING MEN BECAUSE TESTOSTERONE INHIBITS CORTISOL, WHICH IT IS NOT DOING A VERY GOOD JOB OF ANY MORE. WE RECENTLY HAD THIS PUBLISHED IN THE PREMIER SLEEP BASIC RESEARCH JOURNAL, SLEEP. (INDISCERNIBLE) DISCOVERED MELANIN IS CRUCIAL IN RESTLESS LEG SYNDROME, AND A NEUROTRANSMITTER ETIOLOGY FOR EFFECTIVE DRUGS FOR SLEEP APNEA, AND WE RECENTLY DID WORK ON USING THIS TECHNOLOGY TO ELUCIDATE A POORLY UNDERSTOOD AREA, WHICH WE CALL DISCOVERY BROWSING. 
AND WE WORKED ON THE INTERACTION OF SLEEP INFORMATION, NOREPINEPHRINE, CASPASE 3 AND THE DEVELOPMENT OF INDUCED PLURIPOTENT STEM CELLS. THERE IS WORK ON PORTFOLIO ANALYSIS OF NIH GRANTS. WE ARE WORKING WITH NHLBI, THE NATIONAL HEART, LUNG AND BLOOD INSTITUTE, AND GOING TO PUT THIS INTO PRODUCTION TO SUPPORT CLINICAL PRACTICE GUIDELINES. WE ARE NOW IN DISCUSSION ABOUT INTEGRATING THE TECHNOLOGY INTO DATA.GOV. AS FOR THE APPLICABILITY OF THIS WORK, WE HAVE COLLABORATED WITH SEVERAL PROMINENT BIOMEDICAL RESEARCHERS IN THE UNIVERSITY. MOST RECENTLY, (INAUDIBLE) WORKED FOR ME FOR MANY YEARS, AND HE RECENTLY GOT A Ph.D. IN COMPUTER SCIENCE FROM CONCORDIA UNIVERSITY IN MONTREAL, AND THE SUBJECT OF HIS DISSERTATION WAS GOING BEYOND PROPOSITIONAL MEANING. IT DREW HEAVILY ON THE WORK OF SERGEI NIRENBURG AND PROVIDED AN IMPLEMENTATION FOR PROPOSITION-MODIFYING INFORMATION SUCH AS SPECULATION, OPINION, EVIDENCE, ATTITUDES AND DISCOURSE, EMBEDDING IT IN A COMPOSITIONAL TYPE OF WAY, IMPLEMENTED IN A COMPLETELY RULE-BASED SYSTEM. SO, AN EXAMPLE HERE TO SHOW THE PROPOSITIONS EXTRACTED FROM THIS TEXT. WE SEE THEM HIGHLIGHTED IN THE TEXT, AND THESE ARE JUST SIMPLY THE PROPOSITIONS, WHICH SAY BASICALLY THAT THIS GENE RECEPTOR IS ASSOCIATED WITH BOTH SKIN NEOPLASMS AND TUMORIGENESIS IN GENERAL. WHAT WAS ADDED IS THAT IT'S NOT JUST SIMPLY FINDING THESE PHRASES. THIS PREDICATION AT THE BOTTOM, WE NOW KNOW, IS STATED GENERICALLY, SO THIS IS CONSIDERED CERTAIN. THE SECOND ONE IS POSSIBLE, ON THE BASIS OF WHAT THE WRITER SAYS. THERE IS A DISCOURSE RELATIONSHIP BETWEEN THEM, A CONTRAST, AND FURTHER, THE WRITER NOW SAYS THAT IT IS PROBABLE, OUR DATA SUGGEST, THAT THE PHENOMENON IS ASSOCIATED WITH SKIN NEOPLASMS. SO THIS IS A HUGE ADDITION TO SIMPLY HAVING PROPOSITIONS. IT ALLOWS YOU TO ASSESS THE RELIABILITY OF A SCIENTIFIC CLAIM, AND IT ALLOWS YOU TO WORK WITH SPECULATIONS, WHICH ESTABLISH CURRENT TRENDS AND FUTURE DIRECTIONS, 
WHICH WE CAN NOW PROCESS AND TRACK WITH THE PREDICATION DATABASE, AND IT ALSO ENHANCES THE RICHNESS OF THE NLP STRUCTURE. SO IN CONCLUSION, THIS IS A BALANCED APPROACH BASED ON LINGUISTIC AND DOMAIN KNOWLEDGE. IT IS PRINCIPLED AND EXTENSIBLE, AND IT SUPPORTS INCREMENTAL DEVELOPMENT, OF WHICH THE EXTRA-PROPOSITIONAL MEANING IS AN EXCELLENT EXAMPLE. THE RESULTS SERVE A VARIETY OF BIOMEDICAL RESEARCH, AND I THINK AN IMPORTANT THING IS THAT THIS IS SUCCESSFUL BECAUSE OF THE COOPERATION OF COMPUTER SCIENTISTS, LINGUISTS AND DOMAIN EXPERTS, AND THESE ARE THE CURRENT MEMBERS OF THE TEAM: PHYSICIANS, LINGUISTS AND COMPUTER SCIENTISTS. SO THANK YOU. [APPLAUSE] >> SO WE'LL OPEN THE FLOOR TO QUESTIONS. >> I HAVE A QUESTION FOR ALAN. REGARDING THE LAST PART OF YOUR TALK, YOU MENTIONED THAT YOU HAD THINGS IN THE WORKS, INCLUDING IDENTIFYING THE GENES BETTER. CAN I ASK WHAT YOU USE TO IDENTIFY THE GENES? >> WE HAVE A NUMBER OF TECHNIQUES. WE HAVE EXTRACTED SOME, NOT ALL, OF THE NAMES FROM GENE, JUST AS A DICTIONARY LOOKUP METHOD. GENE IS ONE SOURCE, AND WE'RE IN THE PROCESS OF EXTENDING THAT TO USE LINEAS. IT'S CHANGING ALL THE TIME, BUT IT HAS BOTH OFF-THE-SHELF SOFTWARE AND LOCALLY GENERATED DICTIONARIES. >> OKAY. I RECOMMEND BANNER. >> WE CONSIDERED IT. WE HAVE AT LEAST DOWNLOADED IT AND WE'RE PLAYING WITH IT. WE ARE ATTACKING THE PROBLEM ON ALL FRONTS. >> OKAY. >> THANK YOU. >> ANY OTHER QUESTIONS? >> THIS IS (INAUDIBLE) NIRENBURG. HAVE YOU NOTICED IN THE OUTPUTS OF THIS WHOLE SET OF SYSTEMS THE NEED OR THE DESIRE EVEN TO INCLUDE INFORMATION OR KNOWLEDGE THAT WAS NOT OVERTLY PRESENT IN THE INPUTS? >> YES, EXACTLY. TO DO INFERENCING ON TOP OF THIS, ABSOLUTELY. MY FEELING IS THAT I HAVE BEEN BUILDING A HOUSE AND I'M BUILDING THE FOUNDATION TO HAND OVER TO PEOPLE TO DO IT, BUT WE PROVIDE THE FIRST STEP. IT'S ABSOLUTELY NEEDED. YOU CANNOT UNDERSTAND LANGUAGE WITHOUT INFERENCE. 
>> I HAVE ONE SMALL QUESTION. IN ONE OF THE LAST SLIDES YOU REFERRED TO REFERENCE. I SUPPOSE THEN IT GETS PROCESSED, RIGHT? IN WHAT TERMS DO YOU ENCODE THE RESULTS OF ESTABLISHING REFERENCE, IF IT'S APPROPRIATE TO ASK? I SAID REFERENCE, BUT CO-REFERENCE IS PART OF IT. >> BY REFERENCE, I GUESS IN THE FORMAL SENSE, I WILL TAKE YOU TO MEAN THE MAPPING TO THE CONCEPTS IN THE METATHESAURUS. CO-REFERENCE HAS RECENTLY BEEN IMPLEMENTED, NOT IN WHAT IS AVAILABLE TO THE PUBLIC, BUT AS PART OF HALIL'S DISSERTATION WORK, WHERE HE HAS FIT THIS INTO THE ENTIRE PANOPLY OF EXTRA-PROPOSITIONAL MEANING AS PART OF DISCOURSE ANALYSIS. TO BE HONEST, I CAN'T GIVE YOU DETAILS. >> FAIR ENOUGH. >> IT'S PART OF DISCOURSE PROCESSING. THE PREDICATIONS WE HAVE EXTRACTED NOW DON'T TAKE ADVANTAGE OF CO-REFERENCE ANALYSIS. IT WILL BE THE NEXT STEP. >> THANK YOU. >> CAROL. >> TWO QUESTIONS FOR ALAN. THE INDEXING IS REALLY FASCINATING. I WAS WONDERING WHETHER YOU CAN POINT TO ANY DIFFERENCES BETWEEN THE INDEXING THAT INDEXERS DO VERSUS THE COMPUTER, ANY GENERAL DIFFERENCES, MORE COVERAGE, COMPLETENESS, LESS ABSTRACT, THOSE KINDS OF QUESTIONS. >> THERE ARE DIFFERENCES. IN FACT (INDISCERNIBLE), WHOM I MENTIONED, DID A STUDY BASED ON WORK THAT WAS DONE A COUPLE OR THREE YEARS AGO ON TRYING TO ASSESS CONSISTENCY, AND THERE WERE A NUMBER OF CITATIONS INDEXED BY FOUR OR FIVE PEOPLE. SHE ADDED MTI TO THAT LIST. YOU CAN TELL THERE'S A DIFFERENCE. MTI INDEXING IS JUST NOT AS INTELLECTUALLY COHERENT AS THE HUMAN-GENERATED INDEXING. MORE OFTEN THAN NOT THE INDEXERS MAY APPLY A DIFFERENT PERSPECTIVE TO THE INDEXING, BUT THEY DON'T MAKE BLOOPERS LIKE METAMAP AND MTI DO, A COUPLE OF WHICH WERE IN THE EXAMPLES I SHOWED. >> DO THEY USE THE FULL-TEXT ARTICLE? >> MTI ONLY USES TITLE AND ABSTRACT NOW. WE HAVE WORKED ON FULL-TEXT METHODS AND ARE PREPARED TO MOVE TOWARD FULL TEXT AS THAT BECOMES AVAILABLE. >> DO YOU THINK IT COULD INDEX CLINICAL RECORDS? >> WE HAVE WORKED ON CLINICAL TEXT. THERE'S A CHALLENGE THERE. 
METAMAP WAS DESIGNED FOR MEDLINE RECORDS, WELL-FORMED ENGLISH IN GENERAL. WE KNOW CLINICAL RECORDS POSE ANY NUMBER OF PROBLEMS: TELEGRAPHIC EXPRESSIONS, LOCALLY DEFINED ACRONYMS AND ABBREVIATIONS, ALL SORTS OF ISSUES THAT MAKE CLINICAL TEXT HARDER TO PROCESS. WE'RE WORKING ON IT. WE JUST RECENTLY ADDED AN OPTION TO METAMAP SO YOU CAN TAILOR METAMAP'S BEHAVIOR TO THE ACRONYMS THAT YOU KNOW OCCUR IN YOUR TEXT. THE HARDER ISSUES WE'LL ATTACK OVER TIME. >> I'M A LITTLE MORE ENTHUSIASTIC ABOUT THAT. IN THE RECENT TREC ELECTRONIC MEDICAL RECORDS TASK MANY PEOPLE TRIED USING LUCENE AS A BASELINE, AND WE TRIED BASELINES USING METAMAP AS AN INDEXER OF MEDICAL DOCUMENTS. >> IF A CONCEPT IS MENTIONED WITHOUT SPELLING ERRORS, METAMAP WILL GET IT. THAT'S TRUE. BUT YOU OFTEN HAVE OTHER PROBLEMS. FOR THAT PARTICULAR TASK, YES, METAMAP DID QUITE WELL. >> THIS PARTLY GOES BACK TO SOMETHING CHRIS MANNING SAID EARLIER, AT THE BEGINNING OF THE DAY. THESE ARE REMARKABLE APPLICATIONS, SUCCESSFUL APPLICATIONS OF RICH KNOWLEDGE STRUCTURE, WITH A LOT OF STUFF CLOSE TO MY HEART, SELECTIONAL PREFERENCES AND SO FORTH. THERE ARE TWO THINGS THAT HAVE COME UP THAT I'M INTERESTED IN LEARNING MORE ABOUT YOUR THOUGHTS ON. ONE IS THE NOTION OF DOWNSTREAM INFERENCE: THERE IS A DESIRE TO KNOW HOW CONFIDENT YOU CAN BE IN YOUR INFERENCES. THE OTHER IS RESOLUTION OF AMBIGUITY, SINCE CHRIS MENTIONED EARLIER THAT THIS IS HOW WE GOT INTO STATISTICAL NLP IN THE FIRST PLACE. SO I'M SPECIFICALLY INTERESTED IN THOSE THINGS, BUT ALSO MORE GENERALLY: HOW HAVE YOU MANAGED TO AVOID BEING DRIVEN INTO THE ARMS OF STATISTICAL METHODS ON THOSE TWO ISSUES, UNDERSTANDING HOW CONFIDENT YOU CAN BE IN INFERENCES AND RESOLVING AMBIGUITY? AND DO YOU THINK THERE IS A DIRECTION GOING FORWARD WHERE YOU CAN IMAGINE BRINGING THOSE TOGETHER? >> THOSE ARE VERY GOOD QUESTIONS, PHILIP. ON CONFIDENCE, WE HAVEN'T COMPUTED IT, BUT THERE ARE TWO THINGS THAT WE COULD BUILD A FORMULA ON. 
ONE THING, AND WE HAVEN'T DONE IT, IS THAT WE COULD ASSIGN A CONFIDENCE VALUE TO THE PREDICATE. IF IT'S TREATS, WE DO BETTER THAN 90%. IF IT'S INHIBITS, A GENE INHIBITS A GENE, WE'RE PROBABLY WAY DOWN, SO WE COULD ASSIGN A CONFIDENCE LEVEL TO THAT. THE OTHER THING THAT WE CAN DO, WHICH SEEMS SIMPLISTIC BUT IS EFFECTIVE, AND A POSTDOC DID THIS SOME TIME AGO IN A PAPER IN BMC, IS TO ASSIGN A CONFIDENCE LEVEL, WE DID THIS FOR THE MICROBIOLOGY, BASED SIMPLY ON HOW CLOSE THE ARGUMENT WAS TO THE INDICATOR. RECALL GOES DOWN, BUT IF IT'S CLOSE WE CAN SAY WE'RE QUITE CERTAIN THIS IS TRUE. SO WE COULD ASSIGN A CONFIDENCE VALUE, AND PROBABLY AS IT BECOMES USED MORE WE SHALL DO THAT. WITH AMBIGUITY, ONE THING WE HAVE DONE IS MODIFY METAMAP A BIT. WE USE ITS DISAMBIGUATION, THE STANDARD ONE, WHICH IS BASED ON THE CO-OCCURRENCE OF SEMANTIC TYPES, BUT WE ALSO SIMPLY ELIMINATE A CLASS OF AMBIGUITY IN THE UMLS WHICH ARE DISONYMS, NOT SYNONYMOUS AT FACE VALUE, LIKE PROSTATE AND PROSTATE CANCER. IF ONE SIDE OF THE AMBIGUITY IS A STRING SUBSET OF THE OTHER, WE JUST SAY IT'S NOT AMBIGUOUS. THAT HAS IMPROVED OUR RESULTS BY A GREAT DEAL. IT IS STILL TRUE THAT PROBABLY 50% OF ERRORS ARE DUE TO WORD SENSE AMBIGUITY, SO THE WORK IN ALAN'S GROUP BY ANTONIO ON CREATING PROFILES FOR SOME THINGS IS AN EXCELLENT THING THAT WE NEED TO TAKE ADVANTAGE OF, AND MORE OF IT. I'M CONFIDENT THAT WE NEED HELP WITH WORD SENSE DISAMBIGUATION, BUT WE DON'T KNOW AHEAD OF TIME WHAT IS AMBIGUOUS. SOME OF THE BIG ONES WE CAN KNOW, BUT THERE'S AMBIGUITY THAT YOU WOULDN'T IMAGINE, BECAUSE THE UMLS WASN'T MEANT TO BE AN ONTOLOGY, IT WAS MEANT TO BE A VOCABULARY, SO THERE ARE THINGS THAT NO ONE WOULD THINK ARE AMBIGUOUS AS CONCEPTS BUT THEY ARE IN SOME TERMINOLOGY. IT WOULDN'T BE PRACTICAL TO CREATE A CLASSIFIER FOR EACH ONE OF THOSE, AND HOW WOULD WE KNOW AHEAD OF TIME? SO I THINK AMBIGUITY HAS TO BE SOLVED, BUT THE MAJOR METHOD WILL BE A KNOWLEDGE-BASED ONE. 
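THE SPEAKER SAYS NO SUCH FORMULA HAS BEEN COMPUTED, SO THE FOLLOWING PYTHON SKETCH IS PURELY HYPOTHETICAL. IT ONLY SHOWS ONE WAY THE TWO INGREDIENTS NAMED ABOVE, A PER-PREDICATE PRIOR AND THE DISTANCE BETWEEN AN ARGUMENT AND ITS INDICATOR, COULD BE COMBINED. THE 0.90 FOR TREATS ECHOES THE FIGURE MENTIONED IN THE ANSWER. THE VALUE FOR INHIBITS, THE DEFAULT PRIOR, THE DECAY CONSTANT AND THE SHAPE OF THE FORMULA ARE ALL MADE-UP ASSUMPTIONS.

```python
# Per-predicate prior. TREATS echoes the "better than 90%" remark in the talk;
# the INHIBITS value is a placeholder assumption, not a reported number.
PREDICATE_PRIOR = {"TREATS": 0.90, "INHIBITS": 0.50}


def confidence(predicate, argument_distance, default_prior=0.5, decay=0.1):
    """Combine a predicate prior with a penalty for distant arguments.

    argument_distance is the number of tokens between the indicator and
    its argument; larger distances lower the score. The formula is an
    illustrative choice, not one described by the speaker.
    """
    prior = PREDICATE_PRIOR.get(predicate, default_prior)
    return prior / (1.0 + decay * max(argument_distance, 0))


print(confidence("TREATS", 0))    # adjacent argument keeps the full prior
print(confidence("TREATS", 5))    # more distant argument scores lower
print(confidence("INHIBITS", 0))  # weaker predicate starts lower
```

A THRESHOLD ON THIS SCORE WOULD TRADE RECALL FOR PRECISION, WHICH MATCHES THE REMARK THAT RECALL GOES DOWN WHEN ONLY CLOSE ARGUMENTS ARE KEPT.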
>> IN OUR CASE WE'RE NOT TRYING TO AVOID BEING DRIVEN INTO THE ARMS OF MACHINE LEARNING, WE'RE EMBRACING IT. (INAUDIBLE) AT ARIZONA STATE SHOWED WITH OUR SYSTEM THAT HE COULD BEAT OUR MANUALLY CONSTRUCTED RULES BY LEARNING RULES, INCREASING HIS RECALL REMARKABLY OVER MANUALLY BUILT RULES. WE HAVE BEEN WORKING ON LEARNING RULES BASED ON DEPENDENCY TREES USING APPROXIMATE SUBGRAPH MATCHING. LEAVING THE MACHINE LEARNING OUT AND LOOKING AT THE CHARACTERISTICS OF THE LANGUAGE ITSELF, THE TABLE SHOWED THAT 15 OF 16 POSSIBLE ALTERNATIONS SHOWED UP FOR THE TWO ARGUMENTS, SOME MUCH MORE OFTEN THAN OTHERS. YOU CAN EXPLOIT THAT INFORMATION. IF YOU RUN INTO AMBIGUITY IN A KNOWLEDGE SOURCE LIKE THE UMLS, THEN YOU CAN FALL BACK ON THE STATISTICAL TENDENCIES AND LIKELIHOODS OF PARTICULAR ARGUMENTS IN PARTICULAR POSITIONS, SO WE DEFINITELY SEE AREAS FOR BOTH STATISTICAL AND RULE-BASED METHODS. >> A QUICK QUESTION. DO YOU THINK THE METHODS YOU HAVE DISCUSSED CAN BE INCORPORATED INTO NIH IT SYSTEMS FOR INDEXING GRANTS, CONTRACTS AND INTERAGENCY AGREEMENTS? >> I DIDN'T HAVE ENOUGH TIME, BUT YES, WE HAVE DONE THAT. WE ARE WORKING WITH THE PORTFOLIO ANALYSIS BRANCH, I THINK, OF DPCPSI. THEY HAVE GIVEN US, UNDER CONTRACT, ALL GRANTS FROM 2008 TO DATE, THE GRANT APPLICATIONS, THE ABSTRACTS AND SPECIFIC AIMS. AS I POINTED OUT, I HAD TO BREEZE OVER IT, BUT WE HAVE RUN THIS ON THEM. I CAN'T REMEMBER IF IT'S 3 MILLION PREDICATIONS OR SOMETHING LIKE THAT, AND AT THE SAME TIME, I DIDN'T HAVE TIME TO GO INTO DETAIL, WE'RE ALSO DOING DOMAIN MIGRATION INTO BIOLOGY RESEARCH TECHNOLOGY, WESTERN BLOTS AND SO ON, WHICH WE DIDN'T COVER VERY WELL. 
WE HAD A CONTRACTOR WORKING ON THAT TOO, AND WE ARE ABOUT TO PROVIDE IT TO GEORGE SANTANGELO'S GROUP. IT'S HONED FOR NIH GRANT APPLICATIONS AND PORTFOLIO MANAGEMENT. THERE ARE PEOPLE THERE WHO, I DON'T KNOW THE TECHNIQUES THAT YOU NEED TO DO THAT, BUT THEY WERE ENTHUSIASTIC ABOUT HAVING WHAT WE CALL SPA, FOR SEMANTIC PORTFOLIO ANALYSIS, AVAILABLE TO THEM. SO YES, ABSOLUTELY. >> EXCELLENT. GOOD TO HEAR. >> FIRST, TO MAKE ONE COMMENT: THIS IS A GREAT SHOWCASE OF APPLICATIONS IN THE BIOLITERATURE DOMAIN. BECAUSE THE FOCUS IS THE CLINICAL NARRATIVE, I WANTED TO MENTION THAT THERE ARE IN THE LAST YEARS PLENTY OF APPLICATIONS OF CLINICAL NLP MINING THE CLINICAL NARRATIVE IN THE EMR. THOSE APPLICATIONS COVER THE FULL RANGE, FROM PHARMACOGENOMICS TO LARGE-SCALE PATIENT COHORT IDENTIFICATION. THE VIOXX STUDY AND ASSOCIATED SIDE EFFECTS, THAT WAS DONE WITH CLINICAL NLP PROCESSING AND ACTUALLY LED TO THE WITHDRAWAL OF VIOXX, SO THERE ARE VERY LARGE-SCALE APPLICATIONS OF CLINICAL NLP AND MINING THE CLINICAL NARRATIVE. SO THAT'S ONE POINT THAT I WANT TO MAKE. THEN I HAVE A QUESTION FOR TOM. WHEN I WAS LOOKING AT YOUR RELATION EXTRACTION ALGORITHM, I WAS THINKING THAT THE FEATURES YOU'RE USING IN THE RULES SEEM TO BE VERY WELL CAPTURED BY FEATURE ENGINEERING AND CAN BE VERY EASILY FIT INTO A MACHINE LEARNING ALGORITHM. SO HAVE YOU THOUGHT OF COMPARING THE RULE-BASED APPROACH WITH A MACHINE LEARNING APPROACH? >> I HATE TO SAY IT, BUT I'M A STRONG RULE-BASED PERSON. I'M NOT SURE HOW EASY THEY ARE TO CAPTURE IN A MACHINE LEARNING SYSTEM, BECAUSE, I WASN'T ABLE TO GIVE THE DETAILS HERE, BUT ONE OF THE CORE ASPECTS OF WHAT I SEE AS THE CHARACTERISTICS OF NATURAL LANGUAGE IS ARGUMENT SHARING. THAT IS, A SINGLE NOUN PHRASE BEING USED AS AN ARGUMENT IN TWO PREDICATIONS CAN BE DONE ONLY UNDER VERY SPECIFIED CONDITIONS. ONE OF THEM IS IF IT'S THE HEAD OF A RELATIVE CLAUSE. 
ANOTHER IS IF THE PREDICATES ARE COORDINATED. IN "JOHN HUGGED AND KISSED MARY," JOHN IS BOTH AN ARGUMENT OF HUG AND OF THE PREDICATION ON KISS. SO I'M NOT SO SURE HOW EASY THAT WOULD BE TO -- IT COULD BE DONE. NOT SAYING IT COULDN'T. I GUESS BASICALLY I WOULD BE HAPPY TO DO COMPARISONS. JUST NOT INTERESTED IN DOING IT MYSELF. I WILL BE HAPPY TO SEE IT COMPARED. I'M POINTING OUT THE VALUE OF RULE-BASED METHODS, WHICH HAVE NOT BEEN DISCUSSED SORT OF ROBUSTLY IN THE RECENT RESEARCH LITERATURE. ONE THING IS, AS CHRIS POINTED OUT, YES, IF YOU USE A PHRASE STRUCTURE GRAMMAR YOU'RE GOING TO RUN INTO EXACTLY THE ISSUES THAT HE POINTED OUT: A ZILLION RULES, AND IT IS REALLY ALMOST HUMANLY IMPOSSIBLE TO DETERMINE -- GET ENOUGH RULES. THEY DO A PRETTY GOOD JOB BUT THEY'RE NOT 100% RIGHT. MY NOTION WAS, RATHER THAN UNTIE THE KNOT, JUST CUT IT AND JUST SAY I DON'T NEED MOST OF THESE RULES THAT TELL ME HOW TO INTERPRET SIX BLOOD RED OCTAGONAL SHAPED -- I JUST KNOW THIS IS A PILL. SO YOU JUST JUMP OVER THE TOP OF THE DETAIL AND YOU SAY HERE IS SOMETHING THAT NEEDS A SEMANTIC RELATION, IT SAYS ITS SUBJECT IS OF A SPECIFIED SEMANTIC TYPE. VERY EASY TO IMPLEMENT. AND I CAN'T SHARE THAT ARGUMENT IF IT'S ALREADY BEEN USED. A FEW THINGS LIKE THAT TAKE YOU A LONG WAY. WE HAVE PRECISION TO SHOW THAT IS TOTALLY COMPARABLE TO ANY STATISTICAL METHOD. PLUS IT IS INSIGHTFUL TO KNOW THAT RESTRICTIONS ON ARGUMENT SHARING ARE A CORE FEATURE OF NOT JUST ENGLISH BUT NATURAL LANGUAGE. WHAT THE STATISTICAL SYSTEMS WORK ON IS ESSENTIALLY FREQUENCY OF OCCURRENCE, BUT THAT'S AN EPIPHENOMENON OF MEANING, SO IT DOESN'T IN AND OF ITSELF GIVE YOU INSIGHT INTO WHERE TO GO FURTHER WITH THIS. A BIT OF PREACHING. >> RULES HAVE LOCAL CUES. YOU SEE SOME KIND OF LOCAL STRUCTURE AND YOU DETERMINE HOW THE STRUCTURE WOULD BE THE CORRECT ONE. SOMETIMES THE (INDISCERNIBLE) COMES FROM COMPLETELY DIFFERENT PARTS OF THE SENTENCES, FOR EXAMPLE IN (INAUDIBLE) DISAMBIGUATION OF GENES AND PROTEINS.
SOMETIMES, SOMEWHERE, SOME SPECIES NAME APPEARS. WE DON'T KNOW WHERE. IT MAY BE IN PREVIOUS SENTENCES THAT IT APPEARS. SO SOMETIMES SOME TYPE OF TECHNIQUE HAS TO INTEGRATE THE DIFFERENT CUES HOLISTICALLY, AND THE RULE-BASED SYSTEM IS ONLY LOOKING AT SOME KIND OF SMALL LOCAL CONTEXT AND DECIDING SOMETHING. SO MY INTUITION IS WE HAVE TO REALLY COMBINE THESE TWO MECHANISMS IN A PROPER WAY, OTHERWISE SYSTEMS BECOME NOT SCALABLE. I LIKE KNOWLEDGE-BASED SYSTEMS AND SO ON, BUT SOMETIMES THE METHODS LACK THE HOLISTIC NATURE OF LANGUAGE (INDISCERNIBLE). >> IN WORD SENSE DISAMBIGUATION, THE SYSTEM THAT METAMAP USES, AND WE USE, IS ESSENTIALLY A STATISTICAL SYSTEM. IT'S JUST NOT A CLASSIFIER. IT'S BASED ON WORK DONE BY A FORMER EMPLOYEE OF THE LIBRARY, AND THIS PAPER WAS PUBLISHED IN (INDISCERNIBLE). A LOT OF WHAT YOU SAY IS TRUE. IT SAYS OH, ALL THE WORDS HAVE A CULTURE TO THEM. DO I MEAN THE HUMAN BEHAVIORAL SORT OF THING OR DO I MEAN GROWING STUFF? SO NO, NO, I THINK WHEN I SAID -- IT'S NOT -- I CAN'T SEE HOW CLASSIFIERS THEMSELVES ARE USED IN A PRACTICAL WAY FOR IT, BUT SOME SORT OF FREQUENCY OF OCCURRENCE IS. STILL, THAT ISN'T REALLY GIVING YOU THE INSIGHT. IT ISN'T JUST THAT THIS IS FREQUENT, BUT IT'S BECAUSE IT REALLY IS ONTOLOGICAL. THAT'S A WHOLE 'NOTHER PHILOSOPHICAL ARGUMENT. FOR IMPLEMENTATION RIGHT NOW, YOU NEED STATISTICAL PROCESSING AT SOME POINT. >> MY TAKE ON IT WOULD BE THAT MUCH OF THE WORK THAT WE ARE SEEING IN THE BIO MACHINE LEARNING SYSTEMS IN MANY CASES IS RULES OVER DEPENDENCY PARSES, SO IT'S NOT THAT THERE'S A DISTINCTION, IT'S RULES BEING LEARNED. AND THE FACT THAT SO MANY SYSTEMS USE DEPENDENCIES ADDRESSES THE POINT THAT YOU BRING UP. SO ONE OF THE BIGGEST AMBIGUITIES WITH NOMINALIZATIONS IS DISTINGUISHING BETWEEN A CASE WHERE AN ARGUMENT IS COMPLETELY ABSENT VERSUS THE CASE WHERE THE ARGUMENT IS OUTSIDE OF THE NOUN PHRASE.
AND THIS IS THE SITUATION WHERE HAVING A DEPENDENCY PARSE IS USEFUL AND CAN HELP YOU DISAMBIGUATE BETWEEN THOSE TWO SITUATIONS. >> I THINK WE HAVE TIME FOR JUST A VERY SHORT QUESTION. >> I HAVE A SHORT ONE. ACTUALLY IT'S A COMMENT THAT OCCURRED TO ME SITTING HERE AT THE END OF THE DAY, AS WE HAVE HEARD SPEAKERS REFER TO CLINICAL TEXT. THERE ARE ACTUALLY A LOT OF VARIETIES OF CLINICAL TEXT, AND I DON'T HAVE MUCH FEELING FROM WHAT WAS SAID HERE WHETHER THOSE OF US WHO ARE NOT DOING THIS ARE SUPPOSED TO ASSUME THE PROBLEMS ARE ALL THE SAME FOR A DISCHARGE SUMMARY OR A SOAP NOTE, OR COULDN'T POSSIBLY BE. I WONDER IF ANYONE HAS GIVEN THOUGHT TO WHICH FORMS OF THE CLINICAL TEXT ARE PARTICULARLY RICH AND WORTHWHILE. THERE'S NO POINT DOING NATURAL LANGUAGE PROCESSING ON TEXT FROM WHICH NO VALUABLE INFORMATION IS GOING TO BE EXTRACTED FOR PURPOSES OF SCIENTIFIC DISCOVERY. ANY PARTICULARLY VALUABLE SPECIES HERE? MOUNTAIN GOATS OR SOMETHING OF THE SORT? ANYONE LOOKED AT THIS? >> IN PARTICIPATING IN THE MEDICAL RECORDS TRACK, THE SOURCE OF -- THERE WERE DIFFERENT KINDS OF DOCUMENTS USED IN THAT TRACK. AND I -- MY MEMORY ON IT IS UNCLEAR, BUT THERE ARE CERTAINLY SOME PARTS OF THE MEDICAL RECORD THAT ARE ALMOST AS RELIABLE AS THE LITERATURE AS FAR AS GRAMMAR IS CONCERNED, AND OTHER PARTS THAT ARE JUST FULL OF -- >> DISCHARGE SUMMARIES. THERE ARE OTHERS. >> DRAWING A BLANK. >> I THINK, FIRST OF ALL, YES, THERE'S A VERY WIDE VARIETY OF NOTE TYPES BUT ALSO DIFFERENT TYPES OF AUTHORS. ONLY RECENTLY HAVE PEOPLE STARTED LOOKING AT NURSING NOTES, AND I THINK IT DEPENDS ON THE TASK YOU ARE LOOKING AT. THE TYPE OF KNOWLEDGE YOU'RE GOING TO EXTRACT IS REALLY DIFFERENT FROM WHAT YOU GET IN THE DISCHARGE SUMMARY. SO ONE EXAMPLE OF A TASK WHICH I THINK IS VERY INTERESTING TO LOOK AT IS WORKFLOW AND HOW DOCUMENTATION REPRESENTS WORKFLOW, AND NURSING NOTES WOULD BE EXTREMELY USEFUL. FOR CLINICAL DISCOVERY, I'M NOT SURE IT WOULD BE. >> THANK YOU VERY MUCH. LET'S THANK THE PANEL.
[APPLAUSE] >> CAROL FRIEDMAN AND TOM WILL EXPLAIN EVERYTHING, THE ANSWER IS 42. WE'LL BE VERY BRIEF BECAUSE OUR BRAINS ARE VERY SLOW RIGHT NOW. I THINK THIS WAS A GREAT DAY. IT WAS A VERY EXCITING CONFERENCE. I LOVED HAVING THE GENERAL NLP PEOPLE COME WITH THE BIONLP PEOPLE, I THINK THAT WAS RIGHT. I GUESS THERE WERE SOME GENERAL THEMES. THE AVAILABILITY OF CORPORA FROM THE CLINICAL DOMAIN: WE HAVEN'T SOLVED IT BUT SAID IT WAS A PROBLEM, IT'S A BIG PROBLEM. SOME PEOPLE SUGGESTED BRINGING THE SOFTWARE TO INSIDE THE FIREWALL, WHICH SEEMS TO BE A TEMPORARY SOLUTION. OTHER THINGS DISCUSSED WERE HOW MACHINE LEARNING COULD BE -- HOW MACHINE LEARNING WAS VERY IMPORTANT TO BE SCALABLE, VERY ROBUST AND VERY USEFUL. I GUESS ALSO DISAMBIGUATION WAS DISCUSSED, ANOTHER BIG PROBLEM. >> BUILDING ON CAROL, IT SEEMS LIKE THE THEME IS THAT WE HAVE SEEN A RICHNESS OF DIVERSITY, ALL THE SERIOUS, VIABLE AND VERY COMPETENT RESEARCHERS IN THE AREA. I THINK THAT KIND OF MOSAIC AND COMPLEXITY HAS BEEN ONE OF THE LARGEST BENEFITS, BRINGING EVERYONE TOGETHER AND TALKING AND SEEING THAT YES, THIS IS A HUGE PROBLEM AND WILL REQUIRE MANY DIFFERENT APPROACHES, AND THAT PEOPLE ARE GOING TO HAVE TO COOPERATE. >> SO I THINK TOMORROW WOULD BE INTERESTING. FOR THOSE OF YOU WHO ARE STAYING FOR THE NLP PART, IT WOULD BE REALLY TO SIT DOWN AND WORK OUT SOME ISSUES ON A SMALL BASIS AND SPEND MORE TIME. I THINK THAT WOULD BE GREAT. >> THE IDEA WOULD BE IN THE BREAKOUT GROUPS THAT WE WOULD ACTUALLY IMPLEMENT WHAT WE'RE TRYING TO DO HERE JUST IN GENERAL, BUT WE WOULD REALLY ARTICULATE SOME OF THE DETAILS IN PERHAPS SOME KIND OF A COMPREHENSIVE FORM THAT COULD SERVE AS A WHITE PAPER FOR MOVING FORWARD. >> ALSO I WOULD LIKE TO THANK THE SPEAKERS TODAY. I THINK THEY WERE TERRIFIC, AND THANK THE AUDIENCE FOR COMING. ALSO TERRIFIC. >> I SECOND THAT. [APPLAUSE] >> ON BEHALF OF THE LIBRARY I THANK TOM AND CAROL AND ALL OF YOU FOR PARTICIPATING HERE. REMEMBER, ALL DECISION SUPPORT GROUPIES HERE AT EIGHT O'CLOCK TOMORROW.
NLP OVER AT NATCHER. WE'LL BE GETTING THERE AT 8:45.