WELCOME BACK. I HOPE Y'ALL HAD A NICE EVENING AND ARE READY FOR A GREAT DAY. MY NAME IS ROSE MARY MCKAIG. I'M AN EPIDEMIOLOGIST, PROGRAM OFFICER, AND PROJECT SCIENTIST IN THE EPIDEMIOLOGY BRANCH OF DAIDS, AND WE WELCOME YOU BACK TO THE SECOND DAY OF OUR BIG DATA MEETING. WHAT I WANT TO DO NOW IS INTRODUCE THE MODERATOR FOR THE MORNING SESSION. MANY OF YOU KNOW HIM VERY WELL: DR. VICTOR DEGRUTTOLA. DR. DEGRUTTOLA IS PROFESSOR OF BIOSTATISTICS AT THE HARVARD UNIVERSITY SCHOOL OF PUBLIC HEALTH. HE WROTE THAT HIS LONG CAREER, AND I PREFER TO SAY HIS BRILLIANT, PRODUCTIVE, SIGNIFICANT, AND MEANINGFUL CAREER, HAS FOCUSED ON THE DEVELOPMENT OF STATISTICAL METHODS FOR AN APPROPRIATE PUBLIC HEALTH RESPONSE TO THE AIDS EPIDEMIC, INCLUDING TRANSMISSION, NATURAL HISTORY, AND CLINICAL INTERVENTIONS. HIS WORK IS INVOLVED NOT ONLY IN STATISTICAL METHODOLOGY BUT ALSO IN PUBLIC HEALTH SURVEILLANCE, MEDICAL ISSUES, AND THE CONCERNS OF THE COMMUNITIES MOST AFFECTED BY HIV. HIS RESEARCH GOALS INCLUDE FORECASTING FUTURE AIDS INCIDENCE, DEVELOPING STRATEGIES FOR CLINICAL RESEARCH, AND EVALUATING THE PUBLIC HEALTH IMPACT OF ANTIRETROVIRAL TREATMENT. THE STATISTICAL ISSUES IN WHICH DR. DEGRUTTOLA HAS BEEN ENGAGED INCLUDE EVALUATING THE DEGREE TO WHICH THE TREATMENT RESPONSE OF MARKERS OF HIV INFECTION CONSTITUTES ADEQUATE EVIDENCE FOR CLINICAL EFFICACY. HE WORKED ON PROJECTIONS OF AIDS INCIDENCE USING DATA FROM THE NEW YORK CITY HEALTH DEPARTMENT, AND A SPECIAL FOCUS OF HIS WORK WAS ESTIMATING THE RISK THAT CHILDREN OF HIV-INFECTED MOTHERS DEVELOP AIDS IN THE FIRST TEN YEARS OF LIFE, USING DATA COMBINED FROM A VARIETY OF SOURCES. DR. DEGRUTTOLA RECEIVED HIS BACHELOR'S DEGREE IN PHYSICS FROM BROWN UNIVERSITY, AND A MASTER'S DEGREE IN BIOENGINEERING, A MASTER'S IN EPIDEMIOLOGY, AND HIS BIOSTATISTICS Ph.D. OR Sc.D. FROM HARVARD UNIVERSITY. WE THANK YOU, DR. DEGRUTTOLA, FOR MODERATING THIS MORNING AND FOR BEING HERE.
>> THANK YOU VERY MUCH, ROSE MARY, FOR THAT NICE INTRODUCTION.
I'M DELIGHTED TO HAVE THE OPPORTUNITY TO SERVE AS MODERATOR OF THIS SESSION. YESTERDAY WE HEARD A LOT ABOUT THE NEED TO COMBINE ACROSS THE DIFFERENT QUANTITATIVE SUBDISCIPLINES IN ORDER TO HARNESS BIG DATA TO STOP HIV, AND TODAY WE HAVE A GROUP OF OUTSTANDING SCIENTISTS WHOSE SKILLS AND ACHIEVEMENTS COMBINE ACROSS THE DISCIPLINES OF STATISTICS, COMPUTER SCIENCE, PHYSICS, PSYCHOLOGY, BIOINFORMATICS, NETWORK SCIENCE, EPIDEMIOLOGY, ESPECIALLY FROM THE EVOLUTIONARY PERSPECTIVE, AND MATHEMATICAL MODELING. I WAS ALSO DELIGHTED TO HEAR ABOUT THE INTEREST IN GRAPHICAL METHODS TO REPRESENT MODELS. IN THAT REGARD I HAVE TO SAY, CARLY, I LOVE YOUR LOGO AND I WANT THE T-SHIRT. WHEN I FIRST STARTED OUT DOING HIV RESEARCH, THE INFECTION WAS BASICALLY A BLACK BOX, AND MOST OF THE INFORMATION THAT WAS AVAILABLE FOR MODELING AT THAT TIME WAS JUST REPORTED AIDS INCIDENCE. I THINK AFTER MEASUREMENTS AND METHODS WERE DEVELOPED TO ELUCIDATE THE MECHANISMS OF INFECTION AND ALSO THE IMPACT OF TREATMENTS ON THEM, TREMENDOUS PROGRESS IN THE TREATMENT AGENDA WAS MADE. I BELIEVE THE COLLECTION OF BIG DATA FROM ACROSS A LARGE NUMBER OF DOMAINS, AND ALSO THE COMBINATION OF SKILLS REPRESENTED BY THE SCIENTISTS THIS MORNING, WILL HELP US ACHIEVE THE SAME ELUCIDATION OF MECHANISM FOR THE DIFFUSION OF HIV WITHIN HOST AND ACROSS POPULATIONS. PEOPLE'S BEHAVIOR CAN BE A LOT HARDER TO PREDICT THAN THE BEHAVIOR OF CELLS. THAT CREATES SPECIAL CHALLENGES FOR COLLECTION, ACCESS, INTEGRATION, AND ANALYSIS, AND I THINK THE SCIENTISTS REPRESENTED HERE IN TODAY'S SESSION ARE A GREAT STEP IN FACILITATING THAT ENTERPRISE. OUR FIRST SPEAKER THIS MORNING IS DR. JERRY REITER, WHO IS THE MRS. ALEXANDER HEHMEYER PROFESSOR OF STATISTICAL SCIENCE AT DUKE. DR. REITER'S METHODOLOGICAL RESEARCH FOCUSES MAINLY ON STATISTICAL METHODS FOR THE ALL-IMPORTANT TOPIC OF PROTECTING DATA CONFIDENTIALITY, AS WELL AS HANDLING MISSING DATA AND MODELING COMPLEX DATA, WHICH INCLUDES METHODS FOR CAUSAL INFERENCE. DR.
REITER IS DEPUTY DIRECTOR OF THE INFORMATION INITIATIVE AT DUKE, WHICH IS DEDICATED TO RESEARCH ON THE ANALYSIS OF LARGE SCALE DATA, AND HE'S PI OF THE TRIANGLE CENSUS RESEARCH NETWORK, DEDICATED TO IMPROVING THE PRACTICE OF DATA DISSEMINATION AMONG FEDERAL AGENCIES. IF THEY CAN ACHIEVE THAT, I'M SURE IT WILL BE A BIG HELP IN MAKING THE DATA AVAILABLE TO ALL OF US. THANK YOU, DR. REITER.
>> GOOD MORNING. THANK YOU VERY MUCH FOR THE INVITATION TO PRESENT. I WANTED TO DESCRIBE A VISION THAT I HAVE BEEN WORKING TOWARDS FOR ACCESSING CONFIDENTIAL DATA. THIS IS A VISION THAT GROWS OUT OF RESEARCH SUPPORTED BY NSF AND NIH AS WELL AS A LONG-STANDING COLLABORATION WITH THE U.S. BUREAU OF THE CENSUS. LET ME START OUT BY MAKING A STATEMENT THAT IS PERHAPS NOT SO CONTROVERSIAL IN THIS ROOM BUT PERHAPS CONTROVERSIAL ELSEWHERE: IT IS CRUCIAL THAT WE HAVE PUBLIC USE DATA. RECORD-LEVEL DATA ARE ENORMOUSLY BENEFICIAL FOR SOCIETY. THEY CAN HELP FACILITATE RESEARCH AND POLICY MAKING, OF COURSE, BUT THEY ALSO HAVE OTHER BENEFITS. GIVING DATA TO STUDENTS TO TRAIN ON IS CRUCIALLY IMPORTANT. FOR THE DEVELOPMENT OF NEW METHODS, WE NEED REAL DATA, OR REALISTIC DATA AT LEAST, IN ORDER TO BE ABLE TO DEVELOP NEW METHODS AND ASSESS THEIR QUALITY. AND THEY EVEN HELP CITIZENS UNDERSTAND THEIR COMMUNITIES. I THINK ACCESS TO THIS RECORD-LEVEL DATA IS IMPORTANT EVEN IN THIS WORLD WHERE DATA ARE NOT COMING TO THE ANALYSTS BUT THE ANALYSIS IS GOING TO THE DATA, WHERE WE'RE PROBABLY NOT PASSING AROUND BIG DATA SETS, BECAUSE IT HELPS YOU FOR ALL THESE REASONS, AS WELL AS JUST TO UNDERSTAND WHAT THE DATA LOOK LIKE SO YOU HAVE A GOOD IDEA WHAT ANALYSIS STRATEGIES YOU MIGHT USE. THE ISSUE IS CONFIDENTIALITY OF DATA. STEWARDS ARE ETHICALLY AND LEGALLY OBLIGATED TO PROTECT THE CONFIDENTIALITY OF DATA SUBJECTS' IDENTITIES AND SENSITIVE ATTRIBUTES. JUST RELEASING THE DATA AS THEY ARE IS INSUFFICIENT, AND EVEN STRIPPING THE DIRECT IDENTIFIERS LIKE NAMES AND IDs, ET CETERA, IS NEEDED BUT NOT SUFFICIENT.
I DON'T BELIEVE THAT IS PROTECTIVE, BECAUSE THERE'S AN OPPORTUNITY FOR INTRUDERS OR ILL-INTENTIONED PEOPLE TO LINK TO EXTERNAL SOURCES. ONE EXAMPLE RELEVANT TO HEALTH DATA: THIS IS WORK FROM LATANYA SWEENEY'S GROUP, WHO PURCHASED HIPAA DATA FROM THE STATE OF WASHINGTON, LOOKED AT THE DIAGNOSIS CODES THAT WERE AVAILABLE AND THE DATES AVAILABLE, WHICH WERE MONTHS AND YEARS, AND WAS ABLE TO GO TO NEWSPAPERS AND IDENTIFY ACCIDENT VICTIMS. SHE IDENTIFIED 40 OUT OF 83 ACCIDENT CODES, SO THAT'S AN EXAMPLE SHOWING THIS IS A REAL ISSUE AND THE DATA ARE NOT PROTECTED. WHAT ABOUT LARGE SCALE DATA? THE BIG DATA WE TALK ABOUT HERE ARE EVEN MORE DIFFICULT TO PROTECT, BECAUSE THEY OFTEN COME FROM ADMINISTRATIVE SOURCES OR SOCIAL MEDIA, SO BY DEFINITION THEY'RE AVAILABLE TO OTHERS BESIDES THE RESEARCH TEAM COLLECTING THE DATA. IF YOU HAVE A LARGE NUMBER OF VARIABLES AND RICH DATA SETS, THAT IS FANTASTIC FOR ANALYSIS, BUT IT ALSO CREATES MULTIPLE OPPORTUNITIES FOR ILL-INTENTIONED USERS TO MATCH, BECAUSE YOU HAVE MANY VARIABLES. WITH LARGE DATA FILES YOU CAN'T LEAN ON WHAT STATISTICAL AGENCIES LEAN ON, WHICH IS THE RANDOM SAMPLE, WHERE NOT REALLY KNOWING IF A PERSON IS IN THE DATABASE PROVIDES PROTECTION. I DON'T THINK THAT'S THE CASE WITH THESE ADMINISTRATIVE SOURCES. PARTICULARLY WITH LARGE NUMBERS OF VARIABLES, EVERYBODY IS A POPULATION UNIQUE, SO YOU CAN FIGURE OUT WHO THEY ARE. THIS IS NOT A NEW PROBLEM. STATISTICAL AGENCIES AND OTHER DATA STEWARDS HAVE BEEN THINKING ABOUT THIS FOR A LONG TIME, AND THERE ARE TECHNIQUES EMPLOYED. THE MOST TYPICAL ONES THAT GOVERNMENT AGENCIES USE ARE THESE FOUR. DATA AGGREGATION: COARSENING GEOGRAPHY (THE HIPAA SAFE HARBOR RULES SAY YOU CAN'T RELEASE GEOGRAPHY FOR AREAS OF FEWER THAN 20,000 PEOPLE), TOP-CODING AGES AT 90, COLLAPSING CATEGORIES, THINGS OF THIS NATURE. DATA SUPPRESSION: YOU DO NOT RELEASE THE VALUE OR THE VARIABLE. DATA SWAPPING IS A COMMON TECHNIQUE, PARTICULARLY IN THE EDUCATION REALM: YOU TAKE ONE RECORD'S VALUES AND SWITCH THEM WITH ANOTHER RECORD'S VALUES.
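AS AN ILLUSTRATION OF THE SWAPPING IDEA JUST DESCRIBED, HERE IS A MINIMAL PYTHON SKETCH. IT IS NOT ANY AGENCY'S ACTUAL PROCEDURE, WHICH THE TALK DOES NOT SPECIFY. THE FUNCTION NAME `swap_values`, THE `rate` PARAMETER, AND THE TOY RECORDS ARE ALL ASSUMPTIONS MADE FOR THE EXAMPLE.

```python
import random


def swap_values(records, field, rate, rng):
    """Return a copy of `records` in which roughly a fraction `rate`
    of the records exchange their `field` value with a random partner.

    Hypothetical helper for illustration only; real swapping schemes
    usually pick partners within strata (for example, the same region).
    """
    out = [dict(r) for r in records]          # never mutate the input
    n_pairs = int(len(out) * rate / 2)        # each swap touches two records
    chosen = rng.sample(range(len(out)), 2 * n_pairs)
    for i, j in zip(chosen[::2], chosen[1::2]):
        out[i][field], out[j][field] = out[j][field], out[i][field]
    return out


# Toy data: 100 made-up records with a distinct zip code each.
toy = [{"id": k, "zip": 27000 + k, "age": 20 + k % 50} for k in range(100)]
released = swap_values(toy, "zip", rate=0.10, rng=random.Random(0))
changed = sum(a["zip"] != b["zip"] for a, b in zip(toy, released))
print("records whose zip changed:", changed)
```

NOTE THAT THE SET OF VALUES IN THE SWAPPED FIELD IS UNCHANGED, SO MARGINAL COUNTS ARE PRESERVED; ONLY THE LINK BETWEEN THAT FIELD AND THE REST OF THE RECORD IS DISTURBED.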
THE IDEA IS TO DISCOURAGE ILL-INTENTIONED USERS FROM MATCHING, BECAUSE MAYBE THEY FOUND A MATCH THAT'S NOT REALLY A MATCH; MAYBE IT WAS A SWAP. THEN THERE IS ADDING NOISE TO VALUES, PARTICULARLY CONTINUOUS VALUES, WHERE YOU MIGHT PERTURB THEM SO PEOPLE CAN'T EXACTLY MATCH OR CAN'T LEARN THE EXACT VALUE OF THE SENSITIVE CONTINUOUS VARIABLES. THESE ARE COMMON TECHNIQUES, AND THEY ARE DONE AT LOW INTENSITY BECAUSE THERE IS A FEAR, A JUSTIFIABLE FEAR, THAT PERTURBING THE DATA AT HIGH INTENSITY COULD HURT THE DATA QUALITY. NOBODY ACTUALLY KNOWS THE RATE AT WHICH DATA IN THE DECENNIAL CENSUS WERE SWAPPED. NONE OF US KNOWS WHAT THE RATE IS, AND THE DIRECTOR OF THE CENSUS BUREAU DOESN'T KNOW THE SWAP RATE, SO HE OR SHE CAN PLAUSIBLY DENY KNOWING THE SWAP RATE IN THE DECENNIAL. WHY THEY WOULD BE ASKED FOR THE SWAP RATE BY CONGRESS I DON'T KNOW, BUT THEY DO HAVE THAT PROTECTION. SO WHAT ABOUT LARGE SCALE BIG DATA? I DON'T THINK THESE METHODS ARE GOING TO WORK. LOW INTENSITY PERTURBATIONS ARE NOT PROTECTIVE, IN THE SENSE THAT IF YOU HAVE A LOT OF VARIABLES, HOW DO YOU AGGREGATE UP? YOU HAVE TO AGGREGATE TONS OF VARIABLES IF THERE ARE TONS OF VARIABLES AVAILABLE FOR MATCHING. SO THAT'S A HIGH INTENSITY PERTURBATION THAT COULD HURT THE QUALITY OF THE DATA YOU HAVE LEFT OVER. TAKE JUST SWAPPING A SMALL PERCENTAGE. WE DON'T KNOW WHAT THE SWAP RATE IS; SAY IT'S 1% OR LESS IN THE DECENNIAL DATA. IF YOU ONLY CHANGE 1% OF VALUES, IS THAT GOING TO BE PROTECTIVE IN THIS BIG DATA WORLD? 99% OF THE RECORDS ARE NOT SWAPPED, SO PROBABLY NOT. I'M NOT OPTIMISTIC ABOUT THE TECHNIQUES EMPLOYED CURRENTLY BY STATISTICAL AGENCIES, WHICH HAVE BEEN APPROPRIATE, YOU CAN ARGUE THAT, BUT APPROPRIATE FOR RANDOM SAMPLES AND SMALL DATA. HERE IS A POTENTIAL PATH FORWARD. THIS IS WHAT I HAVE BEEN WORKING TOWARDS, WHICH IS AN INTEGRATED SYSTEM HAVING THREE COMPONENTS. THE FIRST COMPONENT IS HIGHLY REDACTED DATA SETS, WHAT WE CALL IN THE STATISTICAL LITERATURE SYNTHETIC DATA, AND STEPHEN EUBANK WILL ALSO HAVE SYNTHETIC DATA TO TALK ABOUT.
I'LL TALK MORE ABOUT EACH PIECE AS WE GO FORWARD, BUT WITH SYNTHETIC DATA, BASICALLY, ALL THE RECORDS ARE SIMULATED, GENERATED. YOU CAN'T LINK AND SAY THAT'S JERRY, BECAUSE IT'S NOT; IT'S A SIMULATED RECORD. MORE ABOUT THAT IN A SECOND. SO FIRST THERE IS HIGHLY REDACTED PUBLIC USE DATA; ANYBODY CAN ACCESS IT. THAT IS FOLLOWED BY A WAY FOR APPROVED RESEARCHERS TO ACCESS THE CONFIDENTIAL DATA VIA REMOTE ACCESS. THEY HAVE TO BE VETTED AND APPROVED, WITH SOME BACKING THAT THE DATA STEWARD TRUSTS. PERHAPS THE KEY INNOVATION IS THE BOTTOM COMPONENT, THE VERIFICATION SERVER. IT SITS IN THE MIDDLE BETWEEN THE REMOTE ACCESS SOLUTION AND THE SYNTHETIC DATA SOLUTION, AND THE IDEA IS THAT THE USER OF THE SYNTHETIC DATA CAN QUERY THE VERIFICATION SERVER FOR OUTPUT ON THE QUALITY OF THE ANALYSIS DONE WITH THE SYNTHETIC DATA. IF THE QUERY COMES BACK SAYING THIS IS HIGH QUALITY, THEN MAYBE THEY'RE SATISFIED AND THEY CAN GO OFF AND DO SCIENCE AND POLICY WITHOUT EVER HAVING TO GO THROUGH THE HOOPS OF TRYING TO GET TO THE REMOTE ACCESS SOLUTION. BUT IF NOT, THEY KNOW, AND THEY CAN ACTUALLY GO APPLY FOR ACCESS TO THE CONFIDENTIAL DATA AND GO THROUGH ALL THOSE HOOPS AND COSTS THAT HAVE TO BE THERE. LET ME DESCRIBE EACH OF THESE THREE PIECES A BIT MORE. THE FIRST PIECE IS THE REDACTED DATA PIECE, SYNTHETIC DATA. THIS IS AN IDEA THAT GOES BACK IN THE LITERATURE TO RUBIN IN 1993, AND I'M SURE FURTHER, LIKE EVERYTHING ELSE. THE BASIC IDEA IS TO FIT A STATISTICAL MODEL THAT DESCRIBES THE RELATIONSHIPS IN THE DATA AND SIMULATE NEW RECORDS FROM THAT MODEL FOR PUBLIC RELEASE. THESE STATISTICAL MODELS ARE ESTIMATED WITH THE ORIGINAL DATA AND WHATEVER OTHER EXPERTISE THE AGENCY, THE MODELER, HAS. THE GOAL ESSENTIALLY IS TO PRESERVE A LOT OF THE GLOBAL STRUCTURE, A LOT OF THE BIG IMPORTANT RELATIONSHIPS, AT THE EXPENSE OF SACRIFICING SOME OF THE FINER STRUCTURE.
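THE FIT-A-MODEL-THEN-SIMULATE IDEA CAN BE SKETCHED IN A FEW LINES OF PYTHON. THIS IS ONLY A TOY UNDER STATED ASSUMPTIONS: IT USES A BIVARIATE NORMAL MODEL FOR TWO CONTINUOUS VARIABLES, WHICH IS FAR SIMPLER THAN THE FLEXIBLE SYNTHESIZERS THE SPEAKER REFERS TO, AND THE NAMES `fit_bivariate_normal` AND `simulate` AND THE MADE-UP AGE AND INCOME DATA ARE HYPOTHETICAL.

```python
import math
import random


def fit_bivariate_normal(records):
    """Estimate means, standard deviations, and correlation from (x, y) pairs."""
    n = len(records)
    mx = sum(x for x, _ in records) / n
    my = sum(y for _, y in records) / n
    sxx = sum((x - mx) ** 2 for x, _ in records) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in records) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in records) / (n - 1)
    return {"mx": mx, "my": my, "sx": math.sqrt(sxx), "sy": math.sqrt(syy),
            "rho": sxy / math.sqrt(sxx * syy)}


def simulate(model, n, rng):
    """Draw n new (x, y) records from the fitted model.

    Every released record is simulated; none is copied from the input.
    """
    rho = model["rho"]
    resid = math.sqrt(max(0.0, 1.0 - rho * rho))
    out = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = model["mx"] + model["sx"] * z1
        y = model["my"] + model["sy"] * (rho * z1 + resid * z2)
        out.append((x, y))
    return out


# Made-up "confidential" data: age and income with a positive relationship.
rng = random.Random(1)
ages = [rng.uniform(20, 70) for _ in range(500)]
confidential = [(a, 20000 + 800 * a + rng.gauss(0, 8000)) for a in ages]

model = fit_bivariate_normal(confidential)
synthetic = simulate(model, 2000, rng)
print("original correlation :", round(model["rho"], 3))
print("synthetic correlation:", round(fit_bivariate_normal(synthetic)["rho"], 3))
```

THE SKETCH PRESERVES THE MEANS, SPREADS, AND OVERALL CORRELATION, WHICH IS THE GLOBAL STRUCTURE, WHILE ANYTHING THE MODEL DOES NOT CAPTURE, SUCH AS THE BOUNDED AGE RANGE, IS LOST.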
SOMETHING WILL BE SACRIFICED. IT IS IMPOSSIBLE TO PRESERVE ALL THE INFORMATION; OTHERWISE YOU HAVE THE ORIGINAL DATA AGAIN AND YOU ARE BACK TO THE SAME PROBLEM WITH POTENTIAL CONFIDENTIALITY BREACHES. SO WHAT IF, AND I'LL GRANT IT'S A BIG IF, WE CAN GENERATE DATA THAT CAPTURE THE GLOBAL RELATIONSHIPS IN THE ORIGINAL DATA? AS MENTIONED BEFORE, IT'S LOW RISK, BECAUSE YOU CAN'T MATCH ANY MORE, WHICH IS THE KEY ATTACK THAT MOST OF THESE EXAMPLES RELY ON. I SAY LOW RISK BECAUSE IT'S NOT ZERO RISK: YOU CAN IMAGINE SCENARIOS WHERE THE ONLY WAY A PARTICULAR SYNTHETIC SET OF VALUES CAN BE RELEASED IS IF THERE WAS AN OBSERVATION IN THE REAL DATA WITH A CERTAIN SET OF VALUES. WE HAVE SHOWN THAT. THOSE ARE QUANTIFIABLE RISKS, AND WE CAN ALWAYS ALTER THE SYNTHESIS MODEL TO REDUCE THOSE RISKS, SO IT'S LOW RISK BUT NOT ZERO. WE HAVE THE POTENTIAL TO PRESERVE ASSOCIATIONS, WITH NO TOP-CODING, NO CHOPPING AT 90 AND ABOVE. AND FOR SMALL AREA ESTIMATION, WE CAN DO IT AT A FINE SCALE BECAUSE THE DATA ARE SIMULATED, SO I DON'T HAVE TO THINK ABOUT A 20,000 POPULATION SIZE RULE. WHAT IS ALSO NICE, WITH COMPLEX SURVEYS, IS THAT I DON'T HAVE TO RELEASE THE COMPLEX SURVEY; I CAN GENERATE A RANDOM SAMPLE. AS THE EPIDEMIOLOGISTS IN THE ROOM KNOW, ANALYZING COMPLEX SURVEYS IS A DISASTER, AND NOBODY REALLY KNOWS HOW TO DO IT, GETTING STANDARD ERRORS RIGHT, THOUGH WE ALL DO IT. BUT IF WE CAN GET RANDOM SAMPLES, THOSE WE ALL KNOW HOW TO ANALYZE. OR IF IT'S A BIG ADMINISTRATIVE DATABASE, I CAN RELEASE A SIMPLE RANDOM SAMPLE, A SMALLER SAMPLE FOR COMPUTATIONAL CONVENIENCE. SO THIS HAS A LOT OF DESIRABLE PROPERTIES, AND IT HAS BEEN DONE. THERE ARE PRODUCTS OUT THERE THAT ARE ESSENTIALLY SYNTHETIC DATA. ONE THAT I SUPERVISE IS THE SYNTHETIC LONGITUDINAL BUSINESS DATABASE. THIS HAS ONE RECORD FOR EVERY BUSINESS ESTABLISHMENT THAT'S BEEN IN THE U.S. SINCE 1975. IT'S A LONGITUDINAL FILE THAT INCLUDES WHEN THEY COME INTO THE DATA, WHEN THEY ARE BORN, IF YOU WILL, AND WHEN THEY EXIT, IF THEY DO EXIT BEFORE THE LAST YEAR.
IT HAS THINGS LIKE THE PAYROLL AND NUMBER OF EMPLOYEES, AND WE'RE NOW ADDING FIRM STRUCTURE AND OTHER RICH VARIABLES. THIS IS SYNTHETIC BECAUSE, THOUGH YOU MIGHT THINK IT'S JUST BUSINESS DATA, IT TURNS OUT IT COMES FROM THE IRS, AND ANYTHING THAT COMES FROM THE IRS, AS YOU MIGHT IMAGINE, IS SUPER PROTECTED. IN FACT, WHETHER AN ESTABLISHMENT FILED TAXES CANNOT BE REVEALED, SO THE ONLY WAY TO RELEASE A PUBLIC USE MICRODATA FILE IS TO CREATE A SYNTHETIC VERSION. YOU CAN DOWNLOAD IT FROM THE CENSUS BUREAU WEBSITE NOW AND USE IT. ANOTHER INTERESTING PRODUCT IS THE SYNTHETIC SURVEY OF INCOME AND PROGRAM PARTICIPATION, THE SIPP. THE SIPP IS THE LARGEST SURVEY OF PEOPLE ON PUBLIC ASSISTANCE IN THE UNITED STATES. THIS WAS CREATED BY SOME ECONOMISTS AT CORNELL. THE SIPP HAD A PUBLIC USE FILE RELEASED WITH THE STANDARD PROTECTION METHODS THAT I TALKED ABOUT, BUT THE ECONOMISTS WANTED TO LINK IN EARNINGS HISTORIES AND SOCIAL SECURITY DISABILITY HISTORIES. YOU CAN IMAGINE THAT GETTING THIS LONGITUDINAL HISTORY OF EARNINGS AND SOCIAL SECURITY PAYMENTS WOULD BE REALLY USEFUL FOR THINKING ABOUT PEOPLE ON PUBLIC ASSISTANCE: WHEN THEY GO ON, WHEN THEY GO OFF, AND HOW THAT TIES TO EARNINGS HISTORY AND TRENDS. SOCIAL SECURITY AND THE IRS SAID NO, WE WILL NOT ALLOW YOU TO RELEASE A PUBLIC USE FILE OF EARNINGS HISTORIES, WHICH IS UNDERSTANDABLE. THE ONLY WAY THEY AGREED TO DO IT WAS TO CREATE A SYNTHETIC VERSION, SO THE SIPP HAS 600 VARIABLES SYNTHESIZED AND A QUARTER MILLION PEOPLE. IT'S A GOOD SIZE PROJECT FOR SURE. YOU CAN DOWNLOAD THAT FROM THE CENSUS BUREAU WEBSITE. THE KEY TO THE SUCCESS OF THESE PROJECTS IS TO HAVE FLEXIBLE SYNTHESIZERS, AND ONE THING WE DO IS ADAPT TECHNIQUES FROM MACHINE LEARNING TO TURN THEM INTO DATA GENERATORS. WE HOPE TO MAKE THINGS FLEXIBLE AND CAPTURE AS MUCH STRUCTURE IN THE DATA AS POSSIBLE. A NICE FEATURE OF THIS IS THAT, BECAUSE WE ARE GENERATING DATA, WE DON'T WORRY ABOUT INTERPRETATION, WHICH IS ONE OF THE DIFFICULTIES OF THE MACHINE LEARNING TECHNIQUES.
THAT'S THE FIRST PIECE. NEXT IS THE REMOTE ACCESS SOLUTION. THERE ARE A LOT OF THESE AROUND, AND THEY ARE CERTAINLY NOT NEW, BUT THE BASIC IDEA IS THAT THE DATA ARE STORED ON THE STEWARD'S COMPUTE SERVER, NOT LOCALLY ON THE ANALYST'S SERVER. SOME USERS ARE APPROVED, AND HOW THAT GETS DONE IS SOMETHING I'M NOT GOING TO TOUCH ON; THERE ARE CERTAINLY LOTS OF MODELS FOR THAT. ALL THE COMPUTATION IS DONE ON THE REMOTE SERVER. NO DATA ARE ALLOWED TO BE DOWNLOADED TO THE USERS' MACHINES, PRINT SCREEN IS DISABLED, ALL THIS STUFF THAT ELIMINATES ACCIDENTAL DISCLOSURE. IT CAN'T ELIMINATE PURPOSEFUL DISCLOSURES; THAT PROBABLY NEEDS SOME DISINCENTIVE, PENALTIES IF SOMEONE IS FOUND TO DO SOMETHING ILL WITH THE DATA. THE IDEA IS THAT YOU TRY TO PREVENT ACCIDENTAL DISCLOSURE IN SOME SENSE. THERE ARE LOTS OF THESE: THE NATIONAL OPINION RESEARCH CENTER DATA ENCLAVE, AND VARIOUS UNIVERSITIES DEVELOPING THEIR OWN NETWORKS. WE HAVE ONE AT DUKE THAT WE USE. A KEY THING TO RECOGNIZE HERE IS THAT THESE SERVERS ARE NOT GOING TO BE FREE. THEY WILL COST THE DATA STEWARD TO SET UP AND MAINTAIN, AND HENCE WILL COST USERS TO ACCESS AND UTILIZE. THAT IS IMPORTANT BECAUSE IT SUGGESTS THE USERS OF THESE REMOTE ACCESS SOLUTIONS NEED TO BE AS EFFICIENT AS POSSIBLE. THOSE ARE THE FIRST TWO BUILDING BLOCKS: PUBLIC USE REDACTED DATA, AND REMOTE ACCESS TO THE CONFIDENTIAL DATA, WHICH IS WHAT YOU WANT IN SOME SENSE. AND THEN THERE IS THE PIECE THAT GLUES THEM TOGETHER, A VERIFICATION SERVER. THIS IS A SEPARATE SYSTEM THAT HAS THE REAL DATA AND THE REDACTED DATA, SYNTHETIC OR HOWEVER ELSE IT IS REDACTED. THE USER SUBMITS A QUERY TO THE SYSTEM FOR VERIFICATION OF A PARTICULAR ANALYSIS. SO THE USER MIGHT SAY, I WANT TO DO A REGRESSION OF VARIABLE TWO ON VARIABLES SIX, EIGHT, AND TWELVE, AND THE SERVER TAKES THE QUERY, RUNS THE REGRESSION ON THE REAL DATA AND ON THE REDACTED DATA, AND COMPUTES SOME MEASURE OF SIMILARITY. HOW MUCH DO THE RESULTS OVERLAP? HOW FAR APART ARE THE POINT ESTIMATES? DOES SIGNIFICANCE CHANGE? YOU CAN IMAGINE VARIOUS FIDELITY MEASURES OF HOW SIMILAR THE ANALYSES ARE.
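ONE POSSIBLE SIMILARITY MEASURE OF THE KIND JUST DESCRIBED IS THE OVERLAP BETWEEN THE CONFIDENCE INTERVALS OBTAINED FROM THE REAL AND THE REDACTED DATA. THE PYTHON SKETCH BELOW USES A ONE-PREDICTOR REGRESSION TO KEEP IT SHORT. THE TALK DOES NOT SAY WHICH MEASURE THE PLANNED SERVER WOULD USE, SO THE CHOICE OF INTERVAL OVERLAP, THE NORMAL-APPROXIMATION 95% INTERVAL, AND THE NAMES `slope_interval`, `interval_overlap`, AND `verify` ARE ASSUMPTIONS. THE SKETCH ALSO IGNORES THE INFORMATION-LEAKAGE PROBLEM THE SPEAKER RAISES.

```python
import math


def slope_interval(pairs, z=1.96):
    """Approximate 95% interval for the slope of y on x (simple least squares)."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    slope = sxy / sxx
    intercept = my - slope * mx
    rss = sum((y - intercept - slope * x) ** 2 for x, y in pairs)
    se = math.sqrt(rss / (n - 2) / sxx)
    return (slope - z * se, slope + z * se)


def interval_overlap(a, b):
    """Average share of each interval covered by their intersection.

    1.0 means the intervals coincide; 0.0 means they do not overlap.
    """
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    if hi <= lo:
        return 0.0
    width_a, width_b = a[1] - a[0], b[1] - b[0]
    return 0.5 * ((hi - lo) / width_a + (hi - lo) / width_b)


def verify(real_pairs, redacted_pairs):
    """Run the same regression on both data sets and report one fidelity score."""
    return interval_overlap(slope_interval(real_pairs),
                            slope_interval(redacted_pairs))


# Only the single score would go back to the user, never the real estimates.
real = [(i, 2 * i + (1 if i % 2 else -1)) for i in range(20)]
redacted = [(i, 2 * i + (1.5 if i % 3 else -1.5)) for i in range(20)]
print("fidelity score:", round(verify(real, redacted), 3))
```

A USER WHO SEES A SCORE NEAR 1 MIGHT TRUST THE REDACTED-DATA RESULT; A SCORE NEAR 0 WOULD SUGGEST APPLYING FOR REMOTE ACCESS INSTEAD.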
AND THE SYSTEM REPORTS BACK THE MEASURE OF SIMILARITY, SO THE USER CAN DECIDE IF THE SYNTHETIC DATA ARE HIGH ENOUGH QUALITY FOR THEIR PARTICULAR ANALYSIS. THAT'S THE THING: WITH SYNTHETIC DATA YOU NEVER KNOW WHETHER YOUR PARTICULAR ANALYSIS WILL BE PRESERVED OR NOT. WE CHECK THE QUALITY OF THE DATA BEFORE RELEASING BY RUNNING TYPICAL ANALYSES, BUT THAT'S NOT EXHAUSTIVE, OBVIOUSLY. SHE CAN DECIDE IF SHE WANTS TO PUBLISH OR NOT, IN THE BROAD SENSE. BUT QUALITY MEASURES LEAK INFORMATION ABOUT THE REAL DATA. IMAGINE SOMEONE CLEVER SUBMITTING QUERIES IN WAYS THAT ALLOW THEM TO HOME IN ON SENSITIVE VALUES. THAT'S WHY THE SYSTEM IS NOT BUILT YET, AND WHY WE ARE BUILDING IT. THE WHOLE KEY IS TO CREATE QUALITY MEASURES THAT LIMIT THAT LEAKAGE. I HAVE NOT GENERATED SYNTHETIC DATA FOR HIV, OR ACTUALLY EVEN FOR HEALTH DATA, TO BE HONEST. WE HAVE THE SYNTHESIZERS BUT NO VERIFICATION PIECE AT THE MOMENT, THOUGH WE ARE DEVELOPING MEASURES, WHICH TURNS OUT TO BE A CHALLENGING PROBLEM. WHAT ARE THE SYNERGIES OF THIS TYPE OF SYSTEM? THE SYNTHETIC DATA ARE PUBLICLY AVAILABLE. THEY'RE USEFUL FOR TRAINING. YOU CAN DEVELOP CODE AND DO EXPLORATORY DATA ANALYSIS: DO I NEED TO THINK ABOUT WHAT TRANSFORMATIONS TO USE, WHAT RECODINGS I MIGHT WANT TO DO? ARE THERE ENOUGH PEOPLE IN THE PARTICULAR DEMOGRAPHIC I'M INTERESTED IN LEARNING ABOUT? ALL THESE EXPLORATORY QUESTIONS, WHICH HELP DETERMINE WHAT ARE THE RIGHT QUESTIONS TO ASK, CAN BE DONE ON THE SYNTHETIC DATA. FOR SOME ANALYSES THE SYNTHETIC DATA WILL PRESERVE THE ANALYSIS, AND THE VERIFICATION SERVER WILL HELP DECIDE THAT. IF SO, GREAT, SHE IS DONE, AND SHE DOESN'T HAVE TO DEAL WITH GOING THROUGH THE APPROVAL PROCESS AND PAYING THE COST FOR ACCESS TO THE REMOTE SYSTEM. IF NOT, SHE CAN APPLY FOR ACCESS TO THE REMOTE SYSTEM, BUT SHE HAS NOT WASTED TIME BY WORKING ON THE SYNTHETIC DATA. SHE IS MORE EFFICIENT WHEN SHE GETS REMOTE ACCESS; SHE WON'T SPEND ALL THAT TIME DOING THIS EXPLORATORY ANALYSIS BECAUSE SHE'S DONE A LOT OFFLINE.
IT SAVES COSTS FOR HER AND FREES CYCLES FOR THE INTENSE DATA ANALYSES, THE STATISTICAL ANALYSES THAT WILL BE DONE BY OTHER PEOPLE, SO IT'S BENEFICIAL FOR THE SYSTEM TO MOVE SOME OF THE EXPLORATORY WORK, WHICH CAN CHEW UP TIME IN BIG DATA SETS, OFFLINE. THAT IS THE IDEA, THE SYNERGIES. WE ARE LOOKING TO BUILD THE SYSTEM. I HAVE A GRANT FROM NSF UNDER THE DATA INFRASTRUCTURE BUILDING BLOCKS PROGRAM WHERE WE'RE TRYING TO BUILD THIS THING, BUILD A PROTOTYPE, AND SEE IF WE CAN DO IT, AND HOPEFULLY START MAKING TOOLS AVAILABLE TO OTHERS. IF YOU'RE INTERESTED IN MORE OF THE WORK THAT MY GROUP IS DOING, THIS IS THE WEBSITE FOR THE TRIANGLE CENSUS RESEARCH NETWORK. AND I WILL PITCH THE NCRN NETWORK AS A WHOLE. THIS IS AN NSF NETWORK THAT WAS ESTABLISHED FOUR YEARS AGO TO BRING TOGETHER DIFFERENT RESEARCH TEAMS ACROSS THE COUNTRY TO WORK ON DATA DISSEMINATION, AND WE'RE ONE OF THE EIGHT NODES. THANK YOU FOR YOUR ATTENTION. I THINK I FINISHED RIGHT AT 20 MINUTES. [APPLAUSE]
>> DR. REITER HAS GIVEN EXCELLENT PRACTICAL SUGGESTIONS AND APPROACHES TO DATA SHARING, AND OUR NEXT SPEAKER, DR. J.C. SMART, WILL CONTINUE WITH A DISCUSSION ABOUT APPROACHES TO SHARING DATA AND INFORMATION. DR. SMART IS A COMPUTER SCIENTIST AND RESEARCH PROFESSOR AT GEORGETOWN UNIVERSITY, WHERE HE SERVES AS CHIEF SCIENTIST OF THE INITIATIVE FOR THE OFFICE OF THE SENIOR VICE PRESIDENT FOR RESEARCH. IN THIS ROLE DR. SMART IS RESPONSIBLE FOR TECHNICAL LEADERSHIP AND STRATEGIC OVERSIGHT OF GEORGETOWN'S MULTI-DISCIPLINARY INTEGRATIVE SCIENCE ACTIVITY TO BUILD AN EXTREME SCALE KNOWLEDGE REPRESENTATION OF THE PLANET TO ADDRESS ISSUES OF SUSTAINABILITY, GLOBAL HEALTH, AND GLOBAL SECURITY. SO MUCH FOR THE IVORY TOWER OF ACADEMIA. DR. SMART WAS PREVIOUSLY AT RAYTHEON, WHERE HE WAS CHIEF TECHNICAL OFFICER FOR THE INTELLIGENCE AND INFORMATION SYSTEMS BUSINESS. HE WAS ALSO SENIOR DIRECTOR OF THE NATIONAL SECURITY OPERATIONS CENTER AT THE NSA, AND AMONG HIS CONTRIBUTIONS IN COMPUTER SCIENCE THEORY ARE ORIGINAL APPROACHES TO KNOWLEDGE REPRESENTATION. DR.
J.C. SMART.
>> THANK YOU. I APPRECIATE THE OPPORTUNITY TO SHARE ONGOING WORK AT GEORGETOWN WITH A REALLY EXCITING GROUP OF COLLABORATORS. WE HAVE HEARD THE TERM BLACK BOX, SO I WILL REUSE IT, PROBABLY IN A DIFFERENT WAY, AS IT APPLIES TO INFORMATION SHARING AND ANALYSIS. YESTERDAY YOU HEARD ABOUT THE FOUR Vs AND THE FOUR Rs. BEAR WITH ME, I'VE GOT THE FOUR Ps TODAY. NOT AS SEXY A LETTER AS SOME, BUT WE SPENT A GREAT DEAL OF TIME OVER A LARGE NUMBER OF YEARS TRYING TO UNDERSTAND THIS CONNECTING OF DOTS AND ANALYSIS: WHY IS THAT SO HARD? MY SUMMARY BOILS DOWN TO FOUR MAJOR AREAS. PLUMBING: HOOKING THESE SYSTEMS TOGETHER COMPUTATIONALLY IS A LOT OF WORK, TIME, AND ENERGY, AND YOU HEARD YESTERDAY ABOUT SOLUTIONS TO ADDRESS THAT. HOOKING MORE SYSTEMS TOGETHER BRINGS CYBER SECURITY TYPE CHALLENGES; YOU'D LIKE TO NOT MAKE THOSE WORSE BY PUTTING MORE TOGETHER. A BIG ONE IS PATTERN MATCHING: HOW DO YOU FIND THE NEEDLE IN THE HAYSTACK, OR THE NEEDLE IN THE HAYSTACK OF NEEDLES, AND SO FORTH. THE LAST ONE, THOUGH, HAS PROBABLY BEEN THE MOST PERPLEXING AND MOST CHALLENGING FROM MY PERSPECTIVE: PRIVACY. HOW DO YOU DO THIS AT GLOBAL SCALE AND PRESERVE INDIVIDUAL CIVIL LIBERTIES? THERE IS A PERCEIVED TENSION BETWEEN SECURITY AND LIBERTY THAT YOU HEAR ABOUT IN MANY DIFFERENT AREAS. WE TOOK A DIFFERENT APPROACH, ONE THAT I BELIEVE EASES THE TENSION. THAT'S WHAT WE CALL THE BLACK BOX TECHNIQUE, FOR LACK OF A BETTER NAME. SO, A QUICK OVERVIEW OF THE TECHNIQUE. IN THIS DIAGRAM YOU WILL SEE A RECTANGLE HERE, BUT IT IS A CONTAINER, SO FOR THE MOMENT ASSUME THE EXISTENCE OF A CONTAINER. A NUMBER OF ORGANIZATIONS ACROSS THE BOTTOM ARE PARTICIPATING. EACH HAS THE AUTHORITY, THE LEGAL RIGHT, TO OWN ITS PARTICULAR DATA AND TO BE CUSTODIAN OF IT. THE DIFFICULTY COMES WHEN THEY WANT TO SHARE, EXCHANGE, AND ANALYZE THAT DATA; THAT IS WHERE THE RISK IS INTRODUCED.
IF YOU THINK ABOUT THAT COMPUTATIONALLY, IT'S EASIER TO BUILD A DATABASE NO ONE CAN QUERY THAN A DATABASE SOME CAN QUERY SOME OF THE TIME FOR SOME THINGS. SO ESSENTIALLY THIS IS BUILDING A CONTAINER: DATA IN, NO DATA OUT, WITH A COUPLE OF KEY EXCEPTIONS. THE EXCEPTION HERE IS ON THE LEFT. WE DON'T SPECIFY WHO IN A GENERAL SENSE, BUT THINK OF CIVIL LIBERTARIANS, CONSUMER RIGHTS ADVOCATES, INDIVIDUALS, OR REPRESENTATIVES OF THE GENERAL COUNSEL FOR THE DIFFERENT ORGANIZATIONS. THOSE FOLKS DECIDE WHAT IS LEGAL AND ETHICAL. IF THERE IS A QUESTION, A PATTERN, TO POSE TO THAT CONTAINER, BECAUSE SOMEONE WOULD LIKE TO GET THE ANSWER TO THAT QUESTION, THEY HAVE TO AGREE THAT THAT PATTERN IS LAWFUL AND ETHICAL. THE WAY THIS PROCESS WORKS, THE PATTERN GOES IN, AND THE ONLY THING TO COME OUT OF THIS BOX IS THE ANSWER TO THAT PATTERN. INFORMATION ABOUT INDIVIDUALS IS NOT OUTPUT. WE'LL GIVE A VERY SPECIFIC EXAMPLE OF HOW THIS WORKS. THE ONLY THING THAT COMES OUT IS THE RESULT OF THAT MATCHING, WHICH AGAIN ANSWERS THE ANALYTIC QUESTION BUT DOESN'T REVEAL INDIVIDUAL PRIVATE INFORMATION. THE FIRST APPLICATION IS A PILOT STUDY, AND REPRESENTATIVES FROM EACH OF THE THREE JURISDICTIONS ARE IN THE ROOM. I SEE COLLIN AND DEE ANN, AND DEREK WAS HERE YESTERDAY. THESE ARE THE HEALTH DEPARTMENTS HERE: VIRGINIA, THE DISTRICT OF COLUMBIA, AND MARYLAND. ONE OF THE CHALLENGES IN HIV CARE ACROSS LARGE METROPOLITAN AREAS IS INDIVIDUALS DROPPING OUT OF CARE. WHERE AN INDIVIDUAL LIVES, WHERE THEY WORK, AND WHERE THEY RECEIVE CARE, THE HOSPITALS, TREATMENTS, AND DOCTORS, CAN ALL BE IN DIFFERENT JURISDICTIONS, SO IT IS A DIFFICULT PROBLEM TRACKING THAT INFORMATION AND CORRELATING AND ALIGNING THOSE DIFFERENCES. SIMILARLY, WITH ALL THAT COMPLEX OVERLAP, HOW DO YOU MINIMIZE THE RESOURCES? FUNDING AND TIME ARE PRECIOUS RESOURCES. SO TO LOOK AT THIS PROBLEM WE STARTED WITH A REAL FIRST STEP. LET'S START WITH WHICH PEOPLE IN VIRGINIA MATCH WITH PEOPLE IN THE DISTRICT OF COLUMBIA AND WHICH PEOPLE MATCH WITH MARYLAND.
A STARTING POINT: YOU TELL ME WHICH PEOPLE YOU HAVE IN YOUR DATABASE, AND I TELL YOU WHO WE HAVE IN OUR DATABASE. NOT AN EASY PROCESS. EACH JURISDICTION HAS DIFFERENT POLICIES AND DIFFERENT PROCEDURES, AND THERE ARE DIFFERENT DATA HANDLING TECHNIQUES, STANDARDS, AND SO FORTH. SO THAT WAS THE FIRST PILOT, AND WE WILL SHOW YOU HOW THAT WORKED. I MENTIONED A BLACK BOX; THAT'S AN IMPENETRABLE BLACK BOX. SAYING SOMETHING IS PERFECTLY IMPENETRABLE IS A RECKLESS STATEMENT GIVEN EVERYTHING YOU HEAR IN THE NEWSPAPER, SO IT IS MORE APPROPRIATE TO PUT IT IN TERMS OF ASSUMPTIONS: IMPENETRABLE UNDER THIS SET OF ASSUMPTIONS. AT GEORGETOWN WE WERE DEFINING ASSURANCE LEVELS, FOUR DIFFERENT LEVELS, ONE BEING HIGHEST AND FOUR BEING LOWEST. TYPE 1 WE CALL EXTREMELY LOW RISK OF COMPROMISE: PROTECTING AGAINST JAMES BONDS OR WELL FUNDED NATION STATE KINDS OF THINGS. THERE IS A COST ASSOCIATED WITH THAT, AND IT IS PROBABLY NOT NECESSARILY A COST-EFFECTIVE LEVEL. TYPE 2 IS A RELAXATION OF SOME OF THOSE ASSURANCE PROTECTIONS. COMPUTATIONALLY, WHAT THIS MEANS IN IMPLEMENTATION IS THAT WE USE FORMAL METHODS TO MATHEMATICALLY PROVE THE CONTAINERS. TYPE 3, AND THE BOX IS IN THE WRONG SPOT, IS THE PILOT STUDY. IN TYPE 3 WE'RE USING BEST COMMERCIAL PRACTICE WITH PROCESS WRAPPED AROUND IT, AROUND THE BLACK BOX, SO IT IS FAR MORE RIGOROUS THAN JUST A REASONABLE OFF-THE-SHELF COMMERCIAL TECHNIQUE. YOU WILL SEE SOME OF THE RIGOR HERE. IT IS LOW RISK, BUT AGAIN IT IS A PILOT STUDY WITH A FAIR AMOUNT OF RESEARCH. TO GIVE YOU AN IDEA, IN THE PILOT STUDY WE HAVE PHYSICAL SECURITY: THERE'S A COMPUTER CENTER THAT HAS ALL SORTS OF PROTECTION, AND I WON'T GO INTO THE ISSUES THERE. THERE IS A WHOLE HOST OF NETWORK SECURITY, MUCH OF WHICH YOU WON'T HEAR ABOUT, PROTECTING THIS PARTICULAR PACKET IN THIS PARTICULAR IMPLEMENTATION. ASIDE FROM THAT, HERE IS THE BLACK BOX. HERE IS THE PHOTOGRAPH, BUT MORE IMPORTANTLY I BROUGHT THE REAL ONE ALONG. I WILL RAISE IT UP A LITTLE BIT. THIS ONE IS ACTUALLY DECOMMISSIONED, YOU WOULD HOPE.
WHAT DECOMMISSIONED MEANS IS THAT, WITH A WITNESS FROM ONE OF THE JURISDICTIONS PRESENT, THE DISK DRIVE IN THE COMPUTER INSIDE OF IT HAD TO BE REMOVED AND DEGAUSSED THREE TIMES TO MAKE SURE THERE'S NO DATA ON IT. WE'LL SHOW IN A MOMENT HOW THIS PARTICULAR BOX IS RUN. IT WAS INSIDE THAT DATA CENTER, AND VERY SENSITIVE DATA FROM EACH OF THE THREE JURISDICTIONS WAS LOADED INTO THIS PACKAGE. THE PACKAGE ITSELF WAS TRIPLE LOCKED. I CAN GO THROUGH A LONG LIST OF THINGS: TRIPLE LOCKED, NO WIRELESS, NO KEYBOARD. IF IT'S USEFUL IN COMPUTING, IT WAS TURNED OFF. WE HAD APPLE, WHO HELPED WITH THE COMPUTER CONFIGURATION, AND ANYTHING THAT WAS USEFUL ON THIS THING WAS DISABLED, EXCEPT ONE ASPECT. THE WAY YOU CAN INTERACT WITH THIS IS THAT EACH JURISDICTION SENDS A DATA FILE VIA SECURE FILE TRANSFER. THE DATA FILE GOES THROUGH A SALLY PORT OPERATION AND IS PUT INTO A DIRECTORY, AND INSIDE THE BOX THERE'S AN ALGORITHM THAT IS LOOKING AT THE DATA SETS, EXECUTING THE BLACK BOX PATTERN, IF YOU WILL, AND OUTPUTTING RESULTS. WRONG BUTTON. IT OUTPUTS RESULTS TO DIRECTORIES EACH RESPECTIVE JURISDICTION CAN READ. SO ALL THAT EACH JURISDICTION CAN DO IS MOVE DATA FILES TO THE THREE INPUT DIRECTORIES AND EXAMINE THE THREE OUTPUT DIRECTORIES TO GET RESULTS. EVERYTHING ELSE IS TURNED OFF, WITH A RIGOROUS REVIEW TO MAKE SURE ALL THE OTHER TYPES OF EXPLOITS THAT ARE WORRIES AND CONCERNS ARE DISABLED. AS FOR THE ALGORITHM, THE PATTERN ITSELF IS AN ALGORITHM, A PROGRAM THAT WAS REVIEWED WITH EACH JURISDICTION. THAT PROGRAM WAS MANUALLY INSTALLED. THAT PROGRAM LOOKED AT THE INPUT DIRECTORIES, VALIDATED THAT THE FILES WERE PROPER, THAT THEY WEREN'T BOGUS DATA FILES, DID THE COMPUTATION, AND PRODUCED RESULTS. THE KEY HERE IS THAT NO PEOPLE WERE INVOLVED IN THIS PROCESS. WHEN I SAY BLACK BOX, I'M SERIOUS. NO ONE CAN LOOK INSIDE THAT CONTAINER, WITHIN THE LIMITS OF THE ASSUMPTIONS FOR TYPE 3 SECURITY. NO ADMIN, LITERALLY NO PEOPLE INVOLVED IN THAT PROCESS.
THE ANALYTIC IS VERY RICH. IT CAN CORRELATE NOT ANONYMOUS BUT VERY SPECIFIC INDIVIDUALS ACROSS JURISDICTIONS, AS IT HAS TO DO, AND AS IS NEEDED BY THE JURISDICTIONS THAT NEED THAT VERY SPECIFIC INFORMATION. AS AN EXAMPLE OF HOW THIS BOX WORKS, HERE IS A SAMPLE. IT'S SYNTHETIC; I DON'T HAVE REAL DATA. WE AT GEORGETOWN NEVER SAW ANYTHING ABOUT THE DATA GOING IN. THE ONLY THING WE GOT WAS FROM THE JURISDICTIONS. WE COULDN'T EVEN KNOW IT WAS RUNNING: UNLESS WE MONITORED THE POWER, WHICH WE DIDN'T DO, WE WOULDN'T KNOW WHETHER THE BOX WAS RUNNING OR NOT. THAT IS A CHALLENGE COMPUTATIONALLY: HOW DO YOU BUILD SOMETHING THAT HAS TO BE EXTREMELY RELIABLE WHEN WE HAVE NO WAY OF EVEN KNOWING IF IT CRASHES, ESSENTIALLY? SO IT TAKES A ROBUST COMPUTATIONAL PROCESS, AT WHICH WE SUCCEEDED. HERE IS A SAMPLE OF THE KIND OF VERY SENSITIVE DATA ABOUT INDIVIDUALS DEALING WITH HIV THAT CAME FROM THE JURISDICTIONS. THESE FIELDS MEAN SOMETHING TO THE JURISDICTIONS AND HAVE TO DO WITH RACE, GENDER, SOCIAL SECURITY NUMBERS, DATE OF BIRTH, AND SO FORTH. IT IS VERY SENSITIVE INFORMATION BECAUSE IT HAS TO DO WITH HIV. EACH JURISDICTION SENT DATA FILES WITH INDIVIDUALS. AS FOR HOW TO DEBUG ALL OF THIS, WE HOSTED AND ENGINEERED IT AND LOCKED IT SO NOBODY COULD LOOK AT IT, AND THE ONLY WAY TO DEBUG IS THROUGH A LOG FILE PROCESS, IN WHICH EACH OF THE JURISDICTIONS CAN LOOK AT WHAT THE ALGORITHM UPDATES SO THEY CAN TELL WHAT'S GOING ON INSIDE THE BOX. THAT'S ESSENTIALLY A WAY TO HELP DIAGNOSE IF THERE ARE CHALLENGES. THEN THE OUTPUT. HERE IS AN EXAMPLE OF THE OUTPUT. THE PATTERN SUBMITTED AND AGREED UPON BY THE JURISDICTIONS WAS: TAKE THE DATA AND LOOK FOR INDIVIDUALS WHO ARE IN BOTH DATABASES, WHERE TWO JURISDICTIONS HAVE THE SAME INDIVIDUAL. AND THAT IS A FUZZY MATCH, BECAUSE THE NAMES MAY NOT MATCH EXACTLY, THE DATE OF BIRTH MAY HAVE WILD CARDS, AND THERE ARE DIFFERENT DATA CONVENTIONS ACROSS THE JURISDICTIONS, BUT THE ALGORITHM HAS TO DO THAT TYPE OF MATCHING.
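TO MAKE THE FUZZY-MATCH PATTERN CONCRETE, HERE IS A SMALL PYTHON SKETCH. IT IS NOT THE GEORGETOWN ALGORITHM, WHICH THE TALK DOES NOT DESCRIBE IN DETAIL. THE FIELD NAMES, THE 60/40 WEIGHTING OF NAME AND DATE OF BIRTH, THE `*` WILD CARD CONVENTION, THE 0.85 THRESHOLD, AND THE MADE-UP RECORDS ARE ALL ASSUMPTIONS. THE ONE PROPERTY TAKEN FROM THE TALK IS THAT THE OUTPUT CONTAINS ONLY THE TWO JURISDICTIONS' OWN IDENTIFIERS AND A SCORE, NEVER THE NAMES OR DATES THEMSELVES.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """Similarity in [0, 1] between two names, ignoring case and outer spaces."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def dob_similarity(a, b):
    """Compare 'YYYY-MM-DD' dates component by component.

    A component given as '*' is unknown and earns half credit.
    """
    parts = list(zip(a.split("-"), b.split("-")))
    score = 0.0
    for p, q in parts:
        if "*" in (p, q):
            score += 0.5
        elif p == q:
            score += 1.0
    return score / len(parts)


def match(records_a, records_b, threshold=0.85):
    """Compare every record in A with every record in B.

    Returns (id_a, id_b, score) tuples only; no private fields leave the function.
    """
    results = []
    for ra in records_a:
        for rb in records_b:
            score = (0.6 * name_similarity(ra["name"], rb["name"])
                     + 0.4 * dob_similarity(ra["dob"], rb["dob"]))
            if score >= threshold:
                results.append((ra["id"], rb["id"], round(score, 3)))
    return results


# Entirely fictional records for two jurisdictions.
dc = [{"id": "DC-1", "name": "Jonathan Smith", "dob": "1980-02-14"},
      {"id": "DC-2", "name": "Maria Lopez", "dob": "1975-11-30"}]
md = [{"id": "MD-9", "name": "Jonathon Smith", "dob": "1980-02-14"},
      {"id": "MD-7", "name": "Peter Chang", "dob": "1990-01-01"}]
for row in match(dc, md):
    print(row)
```

THE NESTED LOOP COMPARES EVERY PAIR OF RECORDS, SO THE WORK GROWS WITH THE PRODUCT OF THE FILE SIZES; A PRODUCTION SYSTEM WOULD NORMALLY REDUCE THAT WITH SOME FORM OF BLOCKING, WHICH THIS SKETCH OMITS.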
IN THIS PARTICULAR CASE, THAT EQUATED TO 30 BILLION COMPUTATIONS, 30 BILLION CORRELATIONS THAT IT HAD TO DO. IT DIDN'T HAVE ANYTHING ELSE TO DO BUT THE 30 BILLION COMPUTATIONS. THAT COULDN'T HAVE BEEN DONE MANUALLY BY THE JURISDICTIONS; IT WOULD HAVE TAKEN MANY YEARS, AND THEY PROBABLY COULDN'T HAVE DONE IT BECAUSE THE DATA ITSELF CHANGES OVER TIME. THIS WAS FULLY AUTOMATED, WITH NO PEOPLE INVOLVED. HERE IS AN EXAMPLE OF WHAT WAS OUTPUT: AN IDENTIFIER THAT DIDN'T MEAN ANYTHING TO ANYONE EXCEPT D.C., AN IDENTIFIER THAT DIDN'T MEAN ANYTHING TO ANYONE EXCEPT MARYLAND, AND A SCORE FOR HOW WELL THEY MATCHED. ALL THIS DATA IS MATCHED BY THE ALGORITHM, AND EACH JURISDICTION GETS BACK A FILE TELLING IT WHAT IT HAS MATCHED WITH ANOTHER JURISDICTION, ALL DONE WITH THIS PRIVATE INFORMATION NOT EXPOSED TO ANY INDIVIDUALS. JUST TO GIVE YOU A SUMMARY, THE ONLY REASON WE HAVE THIS DATA HERE IS THE COURTESY AND KINDNESS OF THE THREE JURISDICTIONS IN SHARING IT WITH US, BECAUSE NONE OF US EVER SAW IT; WE COULDN'T. THE ALGORITHM HAS VARIOUS SCORING CATEGORIES: EXACT MATCH, VERY HIGH, HIGH, DOWN TO VERY LOW. EXACT MEANS THE ALGORITHM HOLDS THAT THESE TWO INDIVIDUALS ARE THE SAME INDIVIDUAL, DOWN TO VERY LOW, MEANING THESE MIGHT BE THE SAME, WITH SOME CORRELATION. I DON'T WANT TO SPEAK FOR THE JURISDICTIONS; THEY'RE HERE AND THEY CAN SHARE THEIR EXPERIENCES LATER. BUT AS I UNDERSTAND IT, FOR THE CATEGORIES OF EXACT AND VERY HIGH, THEY WENT THROUGH A VALIDATION PROCESS, AND THE ALGORITHM ESSENTIALLY PERFORMED AT 90 TO 100% VALID. MY UNDERSTANDING IS THAT FOR THE EXACT SCORE CATEGORY IT WAS 100%. WHAT I MEAN IS THAT IN THIS ALGORITHM REPORTING YOU CAN SEE THE LARGE NUMBER, ROUGHLY 9,000 EXACT MATCHES ACROSS THE JURISDICTIONS, AND WHEN THE ALGORITHM REPORTED THAT THIS IS AN EXACT MATCH, IT TRULY WAS AN EXACT MATCH. I DON'T KNOW THE EXACT NUMBER, BUT IT WAS 95 TO 100% FOR VERY HIGH AS WELL. A HIGH MATCH ALMOST CERTAINLY WAS A MATCH.
THAT'S A PROCESS THAT IS AGAIN FULLY AUTOMATED AND, I UNDERSTAND, OF GREAT VALUE IN EXPEDITING TREATMENT AND CARE, ASSURING INDIVIDUALS STAY IN CARE AND DON'T DROP OUT OF CARE, AND A WHOLE PAPER IS BEING GENERATED WITH THE THREE JURISDICTIONS THAT WILL REPORT OUT THE IMPACT ON PUBLIC HEALTH AND SO FORTH. SO THAT'S AN OVERVIEW OF THE FIRST PILOT. SO THE QUESTION: THAT WORKED. THAT'S THE GOOD NEWS, THIS PROCESS WORKED. PRIVACY IS PRESERVED, AND WE GOT ANSWERS TO A REAL MEANINGFUL QUESTION WITH A REAL, POTENTIALLY SIGNIFICANT IMPACT. WHAT IS NEXT? THERE ARE THREE DIMENSIONS WHERE WE CAN GO WITH THIS PROCESS. THE FIRST IS SCALE. DO IT AT NATIONAL SCALE. THERE'S PROBABLY HUGE OVERLAP ACROSS THE 60 JURISDICTIONS, THE HEALTH DEPARTMENTS WORKING ON HIV, SO IT'S A SCALING ISSUE. COMPUTER SCIENTISTS KNOW SCALE: WE GET THE ENGINEERING SLIDE RULE OUT AND TELL YOU HOW TO DO THAT. THE OTHER QUESTION IS FIDELITY. IN THIS PARTICULAR PILOT THE PATTERN WAS SIMPLE. THERE WAS A LOT OF SOPHISTICATION IN SOME OF THE MATCHING, BUT IT WAS A RELATIVELY MODEST ALGORITHM. YOU CAN DO SOMETHING MORE SOPHISTICATED. SOME OF THE QUESTIONS ARE SPECIFICALLY ABOUT LIFESTYLE AND MAKING MONEY, OTHER TYPES OF DATA SOURCES THAT ARE MORE SENSITIVE IN THIS PROCESS, SO WE CAN GET FIDELITY BY ADDING RICHER TYPES OF PATTERN QUESTIONS. AND THE NEXT DIMENSION, THE THIRD DIMENSION, IS HOW MUCH PRIVACY IS GOOD ENOUGH. SO DO WE GO FROM A TYPE 3 TO TYPE 2 TO TYPE 1. TYPE 1 IS JAMES BOND, PERHAPS NOT A GOOD USE OF FUNDS, DEPENDING ON RISK AND SO FORTH. THERE'S A BALANCE BETWEEN FIDELITY AND SCALE. TWO AREAS SCREEN BOTH OR ALL THREE REALLY. WORKING ON SCALE, THIS IS THINKING TOWARD A NATIONAL-CLASS OPERATIONAL REVIEW, 60 JURISDICTIONS OR WHATEVER. LIKEWISE, WORKING ON FIDELITY, TACKLING MORE COMPLEX QUERIES TO HELP PEOPLE, TO HELP WITH PREVENTIVE MEASURES TO KEEP THEM IN CARE, AND AGAIN PREVENTION OF DISEASE AND SO FORTH. 
AND ON THIS CURVE IT'S A RELATIVELY SMALL NUMBER OF INDIVIDUALS, I DON'T KNOW IF WE CAN SHARE THE NUMBER, BUT ON THE ORDER OF 100,000 INDIVIDUALS, WHERE HERE YOU'RE TALKING LOW NUMBERS OF MILLIONS OF INDIVIDUALS. IT'S REASONABLE TO CRANK UP THE PRIVACY. WITH WHAT WE CALL TYPE 2 HERE, ESSENTIALLY YOU ELIMINATE ALL THE PROFESSIONAL HACKERS AND AMATEURS IN THE PROCESS. OPM IS TYPE 4 TECHNOLOGY, I DON'T KNOW ABOUT THE EXPLOIT. WHEN WE SAY TYPE 2, THIS IS SERIOUS. THIS IS COMPUTATIONAL PROOF OF CORRECTNESS, MATHEMATICAL. TYPE 1, PROBABLY, THAT'S EXTREME, BUT TYPE 2 IS SERIOUS, AND ESSENTIALLY INVOLVES CUSTOM HARDWARE TO REALLY MAKE SURE OF THE INFORMATION. WE SAY ANY AND ALL PIECES OF THE PROTOTYPES ARE GETTING FAIRLY WELL UNDERSTOOD NOW. SO THE OTHER THING WE DID, THE FINAL WORDS HERE: THIS PARTICULAR PILOT RAN ON THIS MODEST APPLE COMPUTER, PERFECTLY FINE FOR WHAT WE NEEDED TO DO, AND IT COULD PROBABLY DO SIGNIFICANT SCALING IF WE DON'T INCREASE FIDELITY. WE WOULD ARGUE THAT'S ONE OF THE MOST MODEST COMPUTERS AVAILABLE. WE ALSO RAN THE EXACT SAME CODE ON PROBABLY ONE OF THE MOST POWERFUL COMPUTERS IN THE WORLD. WE RAN THE CODE ALL ACROSS THE SPECTRUM TO REALLY UNDERSTAND HOW TO MIX AND MATCH BALANCE POINTS DEPENDING ON WHERE WE WANT TO GO IN THE NEXT STEPS. THE WAY THIS CURRENTLY WORKS, ALL THREE FOLKS, INDIVIDUAL REPRESENTATIVES FROM THE JURISDICTIONS, SHOWED UP AND MANUALLY INSPECTED EVERYTHING. A LITTLE TEDIOUS AS A PROCESS, BUT A VERY RIGOROUS PILOT. BUT WE WANT TO KEEP THE TRAVEL DOWN AND NOT HAVE PEOPLE TRAVEL FROM RICHMOND AND BALTIMORE. SO THE NEXT STEP IS AUTOMATING OR SEMIAUTOMATING WITHOUT LOSING THAT CONTROL, BEING ABLE TO GIVE THE BOX MORE PATTERNS WITHOUT CIRCUMVENTING THAT POLICY OVERSIGHT PROCESS, SO WE KNOW THE BOX IS ONLY EXECUTING AND OUTPUTTING PATTERNS THAT THE POLICY BODY HAS AGREED ARE COMPLIANT AND DON'T JEOPARDIZE INDIVIDUALS OR RAISE PRIVACY ISSUES OR CONCERNS. 
AND A QUICK ACKNOWLEDGMENT. THIS WAS A PILOT STUDY. THE FORMATTING IS BROKEN ON THE LAST LINE, BUT THE SPECIAL THANKS, THIS IS WOEFULLY INADEQUATE. THIS IS A PARTNERSHIP, A COLLABORATION WITH THE THREE JURISDICTIONS TO PILOT THIS WHOLE PROCESS. THANK YOU. [APPLAUSE] >> THANK YOU, DR. SMART. GREAT TO SEE THE INNOVATION BEING DEVELOPED IN THIS SCIENCE, GIVEN THE IMPORTANCE OF DATA SHARING. FOR THE NEXT TALK WE WILL SWITCH TO THE TOPIC OF PSYCHOLOGY. OUR SPEAKER, DR. DOLORES ALBARRACIN, RECEIVED DOCTORAL DEGREES IN SOCIAL AND CLINICAL PSYCHOLOGY AND IS PROFESSOR OF PSYCHOLOGY AT UNIVERSITY OF FLORIDA. SHE'S A FELLOW OF MANY IMPORTANT SOCIETIES IN HER DISCIPLINE INCLUDING THE SOCIETY FOR EXPERIMENTAL SOCIAL PSYCHOLOGY AND THE SOCIETY FOR PERSONALITY AND SOCIAL PSYCHOLOGY. DR. ALBARRACIN'S WORK FOCUSES ON ATTITUDES AND PERSUASION, THE INTENTION-BEHAVIOR RELATION, GOALS, PREDICTING GENERAL ACTIVITY PATTERNS, PREDICTING AND CHANGING RISK BEHAVIORS, AND REVIEWING THE EFFECT OF BEHAVIORAL AND CLINICAL TREATMENTS IN A VARIETY OF SETTINGS, ALL OF WHICH ARE PERTINENT TO THE GOALS OF THIS SESSION. DR. ALBARRACIN. >> (INDISCERNIBLE), I LIKE IT VERY MUCH THERE. THANK YOU FOR ORGANIZING THIS VERY EXCITING MEETING, IN A SECURED BUILDING. WHAT I WOULD LIKE TO DO IS PRESENT SOME OF OUR EXPLORATIONS OF SOME OF THE ISSUES THAT CONCERN THE TOPIC OF THIS MEETING, THOUGH NOT EXACTLY A MASSIVE PRESENTATION. WE HAVE BEEN, AS YOU WILL HEAR, LOOKING AT PATTERNS AND PREDICTIONS AND THEORIES THAT MAY BE APPLICABLE TO ANALYZING SOME OF THE ONLINE DATA THAT ARE RELEVANT TO HIV PREDICTION AND EVENTUALLY CHANGE AS WELL. WHAT I DO IS REALLY ATTITUDE AND BEHAVIOR CHANGE AND COMMUNICATION, BOTH SEMANTIC AND SYNTACTIC IMPACT AND HOW THOSE INFLUENCE BEHAVIOR. SO THAT'S A LITTLE BIT OF MY ANGLE, AS WELL AS PREDICTING BEHAVIOR AND LOOKING AT STRUCTURAL DETERMINANTS OF BEHAVIOR AND HOW THOSE DETERMINANTS IN TURN INFLUENCE THE INFLUENCES OF ATTITUDES ON BEHAVIOR. 
IN THIS CASE WHAT I SIMPLY LOOK AT IS TRYING TO PREDICT HIV TESTING, PREVALENCE OR TREATMENT FROM INFORMATION THAT IS READILY AVAILABLE AND QUITE EASY TO ACCESS NOWADAYS, SUCH AS STRUCTURAL VULNERABILITIES AND STRENGTHS IN DIFFERENT COMMUNITIES, IN THIS CASE AT THE COUNTY LEVEL, AND THE NEXT LEVEL, SOCIAL COMMUNICATION. SO THE STRUCTURE OF COURSE INFLUENCES HIV GREATLY. YOU ARE MUCH MORE LIKELY TO FIND HIGH RATES OF HIV IF THE COUNTY HAS HIGH REPRESENTATION OF MALES, PEOPLE OF COLOR, YOUNG INDIVIDUALS. BUT WE HAVE SOCIAL COMMUNICATION FLOWING THROUGH THE STRUCTURE, IN TEXTUAL AND NON-TEXTUAL FORMS, ONLINE AND IN REAL LIFE. THE ONLINE INFORMATION IS EASY TO ANALYZE, OR EASIER THAN RECORDING CONVERSATIONS GOING ON OUT THERE IN BARS OR PUBLIC PLACES, THOUGH WE WILL GET TO THAT AT SOME POINT, WITH RESTRICTIONS OF COURSE. SO THAT'S THE IDEA. COLLABORATION IS VERY IMPORTANT, VERY LARGE. IT HAS BEEN HIGHLIGHTED BY MANY OF YOU: PROBABLY A BILLION TERABYTES OF DIGITAL INFORMATION, A LOT OF WHICH IS ONLINE. THESE LARGE AMOUNTS OF TEXT ARE NOT TRIVIAL. IT'S PROBABLY THE MOST MEANINGFUL INFORMATION WE CAN GET ABOUT HUMANS. NATURAL LANGUAGE IS AFTER ALL THE EASIEST, MOST EFFECTIVE WAY WE HAVE TO ENCODE WHAT WE LEARN. IT'S COMMONLY ENCOUNTERED AND IT IS THE MOST EXPRESSIVE INFORMATION. EVEN IF YOU TRY TO CLASSIFY PHOTOS, WHAT YOU SEE IS THAT THE TEXT IS MATCHED WITH THE ACTUAL PHOTO TO PRODUCE CORRECT CLASSIFICATION. AND IT HAPPENS TO BE BIG DATA. IT HAPPENS TO BE PRODUCED BY HUMANS FOR CONSUMPTION BY OTHER HUMANS. BUT WE HAPPEN TO NEED SOFTWARE TO ANALYZE IT SIMPLY BECAUSE IT'S A LOT. IN THE PAST WE CODED IT IN A COMPLETELY HUMAN WAY AND ANALYZED IT IN FAIRLY SIMPLE MANNERS. NOWADAYS WE LOOK AT OTHER METHODS. 
IN THE AREA OF HIV THIS AVAILABILITY AFFORDS A LOT OF POSSIBILITIES. ONE IS PREDICTION OF HIV, AND POTENTIALLY ALSO, IN THE FUTURE, CHANGING HIV BY EITHER ENACTING LARGE NUMBERS OF SMALL CHANGES IN A NETWORK OR BY SIMPLY GOING TO A HOT SPOT THAT WE PREDICTED USING PUBLICLY AVAILABLE ONLINE INFORMATION. AND WE CAN TEST OUR THEORIES, WHICH ENDS UP BEING CONSEQUENTIAL FOR PUBLIC HEALTH AS WELL. DATA MINING: I HAVE SEVERAL PEOPLE IN THE TEAM WHO ARE LENDING THIS EXPERTISE, BUT ESSENTIALLY IT IS HAVING A LOT OF DATA, AS HAS BEEN SAID MANY TIMES IN THE MEETING, PLUS DATA SOFTWARE, TO PRODUCE ACTIONABLE KNOWLEDGE TO ENACT CHANGE IN THE REAL WORLD. AND THOSE CHANGES, WHICH IS THE SECOND BULLET, WOULD ALSO COME WITH THE ABILITY TO PREDICT SOMETHING THAT'S HAPPENING OUT THERE. SO YOU HAVE THE TEXT, YOU MINE THE TEXT, PRODUCE A PREDICTION, PREDICT VALUES THAT ARE MEANINGFUL IN THE REAL WORLD SUCH AS HIV INCIDENCE, AND ATTEMPT TO CHANGE WHAT'S GOING ON. THIS INVOLVES THE MINING OF TEXT IN POSSIBLE PHASES, NOT ALL OF WHICH HAVE BEEN DONE IN THIS DOMAIN: FIRST YOU PARSE THE TEXT INTO MEANINGFUL UNITS, THEN YOU ANALYZE IT SEMANTICALLY, SYNTACTICALLY AND PRAGMATICALLY, FIGURING OUT THE MEANING OF THE WORDS, THE STRUCTURE, THE GRAMMAR OF THE SENTENCE, AND THE PRAGMATICS, WHICH INVOLVES DETERMINING THE EFFECTS OF WORDS IN CONTEXT AND THE CONTEXT DEPENDENCE OF THE TEXT. WHY CARE ABOUT THE SEMANTIC AND THE SYNTACTIC? IT'S IMPORTANT LINGUISTICALLY AND PSYCHOLOGICALLY. BASICALLY THE SEMANTICS ALLOWS YOU TO KNOW THE WORD (INAUDIBLE) OR WE CAN SEE IN -- THE LANGUAGE MEANS EXPLANATION BUT IN ENGLISH MEANS SOMETHING DIFFERENT. SO THAT'S BY MAPPING THIS ON TO A LEXICON. SYNTAX IS A RULE-BASED PROCESS BY WHICH YOU KNOW THAT, YOU WILL GO, IS AN ASSERTION, BUT IF YOU INVERT THE ORDER OF THE VERB, THE AUXILIARY VERB IN THAT CASE, AND THE PRONOUN, YOU GET A QUESTION. THE PRAGMATICS ALLOWS YOU TO KNOW THAT FLIES IN THAT FIRST SENTENCE, TIME FLIES LIKE AN ARROW, IS A DIFFERENT FLIES THAN IN FRUIT FLIES LIKE A BANANA. AND THE PRAGMATIC EFFECTS THAT ARE MOST EXTREME ARE THOSE IN WHICH THE WORDS ACTUALLY CREATE A SITUATION. 
TO SWEAR TO TELL THE TRUTH, TO SAY I DO, THAT CREATES AN EVENT IN THE WORLD, SO IT IS ENACTING SOMETHING, NOT JUST MEANING. SO PSYCHOLOGICALLY THIS MATTERS A GREAT DEAL TOO. FOR INSTANCE THERE ARE SEMANTIC EFFECTS OF SIMPLE EXPOSURE TO WORDS ON BEHAVIOR, SOME OF WHICH HAVE BEEN CONTROVERSIAL IN RECENT YEARS, BUT SOME WE RECENTLY META-ANALYZED AND SOME ACTUALLY HOLD. ONE IS THE FACT THAT IF YOU EXPOSE INDIVIDUALS TO THE WORD HONESTY, EITHER SUBLIMINALLY OR NOT, YOU GET HONEST BEHAVIOR IN THE MOMENT. AND THAT'S A SMALL EFFECT THAT (INDISCERNIBLE) 3 AND VARIABLE, BUT IT HOLDS UP AND HAS RELATIVELY SMALL BIASES. AMONG THE SYNTACTIC EFFECTS ON BEHAVIOR, ONE FOR INSTANCE IS THAT BEING PRESENTED WITH THE TWO WORDS WILL I, WHICH IMPLY A QUESTION, IMPROVES PERFORMANCE ON SUBSEQUENT TASKS RELATIVE TO I WILL, WHICH IS AN ASSERTION, SUGGESTING THAT THAT IDEA OF I CAN, I CAN IS NOT A VERY GOOD IDEA. IT'S BETTER SOMETIMES TO ASK YOURSELF WHETHER YOU CAN DO SOMETHING, WHETHER YOU'RE CAPABLE AND SO ON. BUT THIS SIMPLY ILLUSTRATES THAT SYNTAX MATTERS AND CHANGES HOW WE BEHAVE, IN A VERY MINUTE LAB CONTEXT. NOW LOOKING AT SOCIAL MEDIA, WE HAVE SIMILAR PROBLEMS AND HAVE NOT GONE INTO GREATER SYNTAX COMPLEXITY, BUT WE HAVE BEEN COLLECTING DIFFERENT FORMS OF PRELIMINARY DATA, BASED ON -- AVAILABLE, FAIRLY GEOLOCATABLE, AND WE HAVE BEEN PULLING THIS FOR A NUMBER OF YEARS, LOOKING AT A RANDOM SAMPLE. THAT'S THE TEAM. AND WE HAVE THE RESULTS. ON ONE HAND WE HAVE BOTTOM UP, AND WE ALSO HAVE TOP DOWN, THEORY-BASED DATA, WHICH WE BELIEVE TO BE VERY IMPORTANT. IT TELLS US MORE ABOUT WHAT TO DO, WHAT WE CAN LEARN FOR NEW SITUATIONS, NOT JUST PATTERNS THAT HAPPEN TO BE PRESENT TODAY. HERE WE LOOK AT PREDICTING PREVALENCE AT THE COUNTY LEVEL BASED ON A COMPLETELY MACHINE LEARNING APPROACH, AND WHAT YOU SEE HERE IS THE AMOUNT OF VARIANCE EXPLAINED RELATIVE TO DEMOGRAPHIC AND SOCIOECONOMIC DATA. THE GREEN BAR IS SHOWING WHAT TWITTER IS DOING. FROM TWITTER, YOU CAN PREDICT UP TO 30% OF THE VARIABILITY IN PREVALENCE ACROSS COUNTIES. 
OF COURSE THAT VARIABILITY IS ALSO PREDICTED BY DEMOGRAPHIC AND SOCIOECONOMIC DATA. BUT IN A LOT OF CASES LANGUAGE DATA ADD OVER AND ABOVE PREDICTION BASED ON STRUCTURAL FACTORS. THIS YOU CAN LOOK AT THIS AS A PHRASE MAYBE NOT SO HIGH BUT IT IS AMAZING BECAUSE LOOKING FOR ARE NOT HEART DISEASE. THIS IS ALSO JUST A FIRST STEP, NOT EVEN LOOKING AT SYNTAX, NOT LOOKING AT NONLINEAR EFFECTS. WE'RE NOT EVEN LOOKING AT INTERACTIONS. FOR HIV IT IS A SIMILAR SITUATION, SO THE LANGUAGE DATA PREDICTS 30% OF THE VARIANCE IN COUNTY PREVALENCE, AND THAT'S A LITTLE OVER AND ABOVE DEMOGRAPHICS. WHAT ARE THESE WORDS? FOR HIV, IT'S VARIOUS GROUPS OF TOPICS, ONE LIKE THIS, THIS CLOUD OF WORDS THAT IS NOT COMING FROM NEW YORK CITY NECESSARILY BUT IMPLIES NEW YORK, FASHION, AND PERHAPS A LIFESTYLE IN AN URBAN SETTING. WORDS OF THIS TYPE. SO THIS IS ONE EXAMPLE OF A BOTTOM UP APPROACH. NOTE THAT THIS IS NOT PREDICTING PREVALENCE BASED ON STATEMENTS SUCH AS I'M HIV POSITIVE AND BEING TREATED. IT'S A LITTLE BIT DIFFERENT. WE CAN IMPROVE THESE MODELS. SO THIS IS SHOWING A DIFFERENT TYPE OF INDEX, A BASE CLASSIFIER SHOWING THE LEVEL OF PREDICTION OR ACCURACY ACHIEVED IN CLASSIFYING HIGH AND LOW LEVELS, DICHOTOMIZING HIGH AND LOW LEVEL PREVALENCE, AND WHEN YOU GET TO RED YOU'RE GETTING TO A SIGNAL. SO YOU HAVE LINEAR MODELS WITH EXTREME HIGH AND LOW CASES AND A LOT OF LANGUAGE FEATURES BEING IN THERE. YOU GET HIGHER PREDICTABILITY THAN IF YOU ARE IN THE BLUE AREA. THAT'S FOR BOTTOM UP. TOP DOWN CAN OFFER SOME ANSWERS FOR US. WE HAVE LOOKED AT TWO ISSUES. ONE IS WORDS RELEVANT TO ACTION, AND THE OTHER IS WORDS RELEVANT TO PLANNING, FUTURE ORIENTED. BOTH MATTER. ACTION WORDS CORRELATE WITH OR INDUCE ACTIVISM IN THE REAL WORLD, AND PLANNING FUTURE IMPLEMENTATION OF COURSE IMPLIES REDUCTION OF IMPULSIVITY, A COMMUNITY LOOKING AHEAD AND POTENTIALLY BEING MORE EFFICACIOUS FOR TREATMENT. IN THE LAB, PRESENTING ACTION WORDS PRODUCES GREATER COMMITMENT OF TIME TO MAKE PHONE CALLS FOR A SOCIAL CAUSE. 
DOES IT MATTER IF WE LOOK AT THIS IN THE TWITTER DATA? SO TO ANSWER THAT QUESTION, WE CREATED A TOP DOWN DICTIONARY USING A PROGRAM TO ANALYZE LANGUAGE THAT IS FAMILIAR TO LINGUISTS, AND WE ASSEMBLED A LEXICON OF WORDS RELATED TO MOTOR ACTION AND ACTIVITY AND TO PLANNING, TO SEE IF THAT ACTUALLY PREDICTS PREVALENCE. IT DOES. 20 PERCENT OF THE VARIANCE IN COUNTY PREVALENCE IS PREDICTED BY THESE WORDS, AND WHEN YOU ADD THOSE TO THE DEMOGRAPHICS YOU STILL CONTRIBUTE OVER AND ABOVE. YOU GET ALSO SIMILAR LEVELS, A SLIGHTLY SMALLER EFFECT BUT STILL SOME ABILITY OF THESE WORDS, IN THIS CASE IT'S A VERY SMALL SET OF WORDS, TO PREDICT PREVALENCE ACROSS COUNTIES IN COMBINATION, ONCE DEMOGRAPHICS AND DENSITY ARE ACCOUNTED FOR ALREADY. THAT TENDS TO VARY GEOGRAPHICALLY. IN THIS CASE FOR INSTANCE THE CORRELATION BETWEEN FUTURE ORIENTATION AND PREVALENCE IS STRONGER IN THE SOUTH THAN IN THE NORTH. WHAT THIS IS SHOWING IS LATITUDE, SO LOWER LATITUDE WOULD IMPLY A STRONGER EFFECT SIZE, AND WHAT YOU SEE IS A STRONGER ASSOCIATION BETWEEN FUTURE WORDS AND PREVALENCE IN THE SOUTH THAN IN THE NORTH, WHATEVER THAT MEANS. WE HAVE A THEORY. WE'RE CURRENTLY DOING A MORE PERSUASION AND BEHAVIOR CHANGE MODEL APPROACH, SIMILAR TO WHAT I HAVE DONE AT FAIRLY LARGE SCALE WITH META-ANALYSES OF BEHAVIOR CHANGE PROGRAMS. THAT IS WHAT WE'RE PURSUING. AND WE FEEL THINGS ARE LOOKING QUITE GOOD ACTUALLY. TWITTER PROVIDES INFORMATION FOR PREDICTION AND THEORY TESTING. REAL TIME DATA ARE OF COURSE OUT THERE, AND EASIER THAN EVER TO RETRIEVE. THE STRUCTURAL AVAILABILITY IS REALLY GREAT. A LOT OF DATA ARE COMING IN IN REAL TIME, SO PREDICTIVE CAPABILITIES WOULD ONLY IMPROVE MOVING FORWARD, BOTH BECAUSE OF THE AVAILABILITY OF DATA AND ALSO BECAUSE COLLEAGUES LIKE SOME OF THOSE WE HEARD FROM ARE IMPROVING METHODS FOR THIS. THE HOPE IS THAT THE RESULTING INFORMATION WILL CHALLENGE OUR METHODS FOR INDUCING CHANGE, BOTH VIA SOCIAL MEDIA AND ALSO BY SIMPLY GOING TO THE PLACES THAT NEED OUR ATTENTION. THANK YOU. 
[APPLAUSE] >> TO LEARN ABOUT HOW TWITTER DATABASES COULD BE USED FOR PREDICTION, AND IT WILL BE INTERESTING TO EXPLORE HOW MANY OTHER DATABASES, USING APPROACHES LIKE THOSE IN THE FIRST TWO TALKS, COULD BE USED AS WELL. WE NOW HAVE A 20-MINUTE BREAK. DURING THIS TIME, IT'S OKAY TO TALK ABOUT SUBJECTS OTHER THAN BIG DATA BUT YOU SHOULD REALLY ASK ROSE MARY FOR PERMISSION FIRST BEFORE YOU DO THAT. WE'LL RECONVENE AT 10:20. >> DEGREE IN PHYSICS AND HAS MODELED A VARIETY OF COMPLEX SYSTEMS INCLUDING FLUID DYNAMICS, FINANCIAL MARKETS, ECOLOGY, NATURAL LANGUAGE, TRANSPORTATION AND INFECTIOUS DISEASE EPIDEMIOLOGY. HE'S PI OF A RESEARCH GROUP IN THE MIDAS STUDY AND HAS BEEN SINCE ITS INCEPTION IN 2004. HIS CURRENT RESEARCH FOCUSES ON UNDERSTANDING HOW THE STRUCTURE OF VERY LARGE, REALISTIC SOCIAL CONTACT NETWORKS CONSTRAINS THE SPREAD OF CONTAGION. >> THANK YOU FOR THIS (INAUDIBLE). I WANT TO SAY A COUPLE OF WORDS BEFORE I START MY TALK, ABOUT -- (INAUDIBLE) THIS IS NOT THE ROYAL ME OR THE EDITORIAL ME, IT'S A COUPLE OF DOZEN PEOPLE WHOSE WORK I'M TALKING ABOUT HERE, TRYING TO SYNTHESIZE THEIR WORK AS WELL AS (INAUDIBLE). I WOULD LIKE TO THANK JERRY REITER FOR THE REALLY WONDERFUL INTRODUCTION HE GAVE TO SYNTHETIC INFORMATION. THERE'S A LOT OF STUFF I DON'T HAVE TO SAY, BUT WHAT I WOULD LIKE TO DO, I REALIZED IN LISTENING TO YESTERDAY'S SESSION, IS GIVE YOU A CONCRETE -- CONTEXT OF (INAUDIBLE). SO AS I TALK ABOUT THIS, WHAT I WOULD LIKE YOU TO HOLD IN YOUR MIND IS A SYNTHETIC POPULATION FOR THE UNITED STATES THAT WE HAVE CREATED IN MY LAB, IN WHICH WE HAVE REPRESENTATIONS OF 300 MILLION OR SO SYNTHETIC PEOPLE, ATTACHED TO DEMOGRAPHIC VARIABLES. IF YOU WERE TO TAKE A CENSUS OF THE SYNTHETIC POPULATION, IT LOOKS LIKE THE U.S. CENSUS AT THE BLOCK GROUP LEVEL. 
WE HAVE ATTACHED ACTIVITY PATTERNS TO THEM, THE KINDS OF ACTIVITIES YOU WILL FIND IN FEDERALLY MANDATED ACTIVITY SURVEYS. IF YOU WERE TO TAKE AN ACTIVITY SURVEY OF OUR POPULATION, IT LOOKS LIKE THE ACTIVITY SURVEYS, AND THE RIGHT PEOPLE WOULD BE DOING THE RIGHT THINGS. FURTHERMORE, THEY'RE DOING THEM AT LOCATIONS FROM VARIOUS DATA SETS, DATA SOURCES OF EMPLOYMENT BY LOCATION, NUMBER OF STUDENTS BY LOCATION, THINGS LIKE THAT. AND WITH ALL THIS INFORMATION WE HAVE BEEN ABLE TO SYNTHESIZE WHAT I THINK OF AS REALISTIC LARGE SOCIAL CONTACT NETWORKS. WE HAVE BEEN ABLE, OVER THE COURSE OF 15 YEARS NOW, TO CONVINCE PEOPLE THESE ARE SUFFICIENT FOR STUDYING THE SPREAD OF INFECTIOUS DISEASE, RESPIRATORY ILLNESS, ACROSS A POPULATION. THEY'RE NOT SUFFICIENT FOR STUDYING THE SPREAD OF HIV THROUGH A POPULATION, SO WHAT I'M THINKING ABOUT NOW AS I GIVE THIS TALK IS HOW WE'RE GOING TO USE BIG DATA TO IMPROVE THESE SYNTHETIC NETWORKS TO AN EXTENT THAT WE CAN USE THEM TO STUDY HIV. WITH THAT LONG-WINDED INTRODUCTION, I'M REALIZING THAT I FOLLOWED YOUR ADVICE AND MADE THE POWERPOINT LANDSCAPE, BUT THE BOTTOM IS GOING TO BE CUT OFF OF EVERY ONE OF THESE SLIDES. SO WHAT IS SYNTHETIC INFORMATION? IT'S A SYNTHESIS OF WHAT I CALL INCOMMENSURATE DATA, DATA THAT WAS CERTAINLY NEVER INTENDED TO BE USED TOGETHER. WHAT IS IN THE PICTURE HERE IS A SET OF ALL KINDS OF INFORMATION, BELOW THE SKIN AND IN CONTEXT, FOR INDIVIDUAL HUMANS. SYNTHETIC INFORMATION DOESN'T HAVE TO BE A POPULATION OF HUMAN BEINGS. IT CAN BE A REPRESENTATION OF ANY SET OF ELEMENTS IN A COMPLICATED INTERACTING SYSTEM, WHERE YOU REPRESENT THE STATE OF EACH INDIVIDUAL ELEMENT AND ITS INTERACTIONS, WHICH OTHER ELEMENTS IT INTERACTS WITH. SO YOU CAN MODEL CELLS, CYTOKINES AND ORGANS, PEOPLE, PLACES AND THINGS, ACROSS SCALES. THESE ARE ALL USEFUL. ONE WAY OF THINKING OF SYNTHETIC INFORMATION IS THAT IT PROVIDES YOU WITH A COORDINATE SYSTEM, LITERALLY A PLACE WHERE YOU CAN ATTACH ADDITIONAL FEATURES. 
SO IF I HAVE A NEW SET OF INFORMATION ABOUT MY POPULATION, NOT PRESENT IN THE FEATURES THAT I HAVE IN THERE, I CAN ADD IT AT THE LEVEL OF INDIVIDUAL PEOPLE. THE FIRST REMARK WE GOT IN REVIEW WAS, THIS IS NOT GOOD DATA, I DON'T KNOW WHAT TO MAKE OF IT. THIS IS NOT SO MUCH A PROBLEM BECAUSE THE REAL SYSTEM IS RANDOM, TIME VARYING, ESSENTIALLY UNOBSERVABLE. I REALLY HOPE THAT WE NEVER GET TO THE SITUATION WHERE SOMEBODY KNOWS EXACTLY WHAT EVERYONE IN THIS COUNTRY, OR THE WORLD, IS DOING SECOND BY SECOND. THAT IS A FRIGHTENING FUTURE TO ME. THE ALTERNATIVES TO USING SYNTHETIC INFORMATION, OR TO OBSERVING THE EXACT SYSTEM AT EVERY INSTANT, BUILD IN ERRORS. INSTEAD OF AN INTERACTING SET OF ELEMENTS YOU LET EVERY ELEMENT INTERACT WITH EVERY OTHER, SO YOU HAVE A BIG BAG YOU SHAKE TOGETHER TO SEE HOW THE DISEASE SPREADS THROUGH PEOPLE. YOU KNOW YOU'RE MAKING A MISTAKE AND YOU SWEEP UNDER THE RUG THE FACT YOU'RE MAKING THE MISTAKE. THE DATA SET IS NOT REAL, NOT THE EXACT FACTS OF THE SITUATION, BUT THAT'S NOT SO BAD BECAUSE I'M GOING TO BE DOING COUNTERFACTUAL ANALYSIS ANYWAY. WHAT I REALLY NEED TO KNOW ARE THE RELATIONSHIPS AMONG THE ENTITIES IN MY SYSTEM, PREFERABLY CAUSAL RELATIONSHIPS, SO IF I ADJUST MY HYPOTHESES ABOUT WHAT'S HAPPENING IN THE SYSTEM, I CAN SEE WHAT HAPPENS TO THE DYNAMICS AND THE FUTURE OF THE SYSTEM. I AM SURE THIS IS ARGUABLE, BUT THE SYNTHETIC INFORMATION SOURCES WE HAVE ARE AS GOOD AS THE ALTERNATIVES, AND BIG DATA IS KEY TO IMPROVING THEM. THE LAST BULLET: THIS CONFORMS TO BEST PRACTICES IN ENGINEERING, MODULARITY AND INCREMENTAL IMPROVEMENT. YOU DON'T LOSE THE GOOD STUFF YOU BUILT IN WHEN YOU ADD SOMETHING NEW. THE TITLE OF MY TALK IS EFFECTIVE USE OF BIG DATA WITH SYNTHETIC INFORMATION. THIS IS HOW I SEE THAT WE USE DATA TO UNDERSTAND SYSTEMS: WE SAMPLE SOME PROPERTY OF THE SYSTEM WE'RE TRYING TO STUDY, MODEL THE DISTRIBUTION, AND WE REASON ABOUT OUR MODEL DISTRIBUTION OF THE DATA. 
IF WE DON'T HAVE ENOUGH DATA WE HAVE TO MAKE ASSUMPTIONS ABOUT CONSTRAINTS TO HELP BUILD BETTER KNOWLEDGE ABOUT THE DATA, SO PRIOR KNOWLEDGE IS THROWN INTO THIS. IT MAY BE IN A BAYESIAN SENSE, WHAT WE HAVE DONE. BUT WE DON'T WANT TO JUST UNDERSTAND ONE SIMPLE PROPERTY OF THE SYSTEM. WE WANT TO UNDERSTAND THE INTERACTIONS AMONG LOTS OF PROPERTIES IN THE SYSTEM, SO WE HAVE TO SAMPLE LOTS OF PROPERTIES AND MODEL NOT A BUNCH OF MARGINALS BUT THE JOINT DISTRIBUTION OF THESE PROPERTIES. WHAT WE WANT TO REASON ABOUT ARE FEATURES OF THE JOINT DISTRIBUTION, ESPECIALLY THE CAUSAL DEPENDENCIES THAT STRUCTURE THE DISTRIBUTION THE WAY IT IS. MOVING FROM EFFECTIVE USE OF DATA TO EFFECTIVE USE OF BIG DATA. THE THING THAT OCCURS TO ME ABOUT BIG DATA IS THAT WE ARE NOT IN A DATA-POOR ENVIRONMENT, BY DEFINITION. WE DON'T HAVE TO SMOOTH DATA. WE DON'T HAVE TO COME WITH PRIOR KNOWLEDGE ABOUT CONSTRAINTS WE NEED THE DATA TO OBEY. WE CAN JUST USE THE DATA ITSELF, LET THE DATA SPEAK TO US DIRECTLY. THE FIRST IS SAYING THINGS IN DIFFERENT WAYS. WE DON'T HAVE TO BUILD IN ASSUMPTIONS THAT WE KNOW ARE WRONG BUT ARE THE BEST THINGS WE CAN DO UNDER THE CIRCUMSTANCES. WE CAN JUST NOT HAVE THE ASSUMPTIONS. THE THIRD BULLET IS AN ECHO OF ONE OF TUKEY'S FAMOUS REMARKS, THAT YOU SHOULDN'T TRY TO GIVE EXACT ANSWERS TO THE WRONG QUESTIONS. YOU SHOULD TRY TO GIVE APPROXIMATE ANSWERS TO THE RIGHT QUESTIONS. SYNTHETIC INFORMATION ALLOWS US TO DO THIS. THE FOURTH BULLET, WHICH ISN'T SHOWING UP, IS ONE OF THE MORE IMPORTANT ONES BECAUSE IT'S DISJOINT FROM THESE THREE: TO IGNORE, OR LET'S GET A LITTLE BIT BEYOND -- THESE ARE COMPLEX SYSTEMS TO MODEL AND SIMPLE MODELS ARE NOT NECESSARILY GOING TO BE ABLE TO MODEL THEM. SO THE NEXT TIME YOU SEE A PAPER THAT TALKS ABOUT A COMPLEX MODEL FOR A COMPLEX SYSTEM, DON'T REJECT IT OUT OF HAND. THINK WHETHER IT'S TRULY NECESSARY TO MAKE SUCH A COMPLEX MODEL. YOU MAY FIND THAT IT IS. 
I TALKED ABOUT THIS AT THE BEGINNING. THIS IS A CARTOON THAT I WILL USE FOR A WHILE TO DESCRIBE THE PROCESS ONGOING. SO WE INCLUDE A LOT OF VARIABLES, DEMOGRAPHIC VARIABLES, AND WE FIND YOU CAN GET THE JOINT DISTRIBUTION FOR THESE THINGS PRETTY WELL. THERE ARE TYPICALLY SURVEY AND CENSUS KINDS OF INFORMATION THAT INCLUDE ALL THESE VARIABLES AT THE SAME TIME, AND YOU CAN INFLATE THE DATA, FOR EXAMPLE FROM THE CENSUS, TO THE ENTIRE POPULATION. WE COMBINE THESE DATA SETS WITH THINGS THAT THEY WERE NEVER INTENDED TO BE USED WITH, SUCH AS ACTIVITY INFORMATION FOR, SAY, SETS OF PEOPLE WITHIN THE POPULATION. WE USE THAT TO BUILD A MODEL OF DISEASE STATUS FOR INFECTIOUS DISEASE, AND ALSO TO BUILD OTHER MODELS, LIKE THE LIKELIHOOD OF A PARTICULAR PERSON IN THE POPULATION BEING VACCINATED, WHICH WILL FEED INTO THE DISEASE STATUS MODEL. ONCE YOU HAVE A SYNTHETIC POPULATION, THE FIRST TEMPTATION YOU HAVE IS TO USE IT FOR MANY DIFFERENT PURPOSES. A LOT OF EFFORT GOES INTO THESE THINGS, SO LET'S SEE WHAT ELSE WE CAN USE IT FOR. SOME THINGS MAKE PLENTY OF SENSE: USE IT TO BUILD MODELS OF CELL PHONE COMMUNICATION, MOBILE DEMAND ON RESOURCES FOR BANDWIDTH. YOU HAVE THIS THING I CALLED A COORDINATE SYSTEM BEFORE. YOU CAN ADD NEW FEATURES TO THE COORDINATE SYSTEM, NEW DIMENSIONS TO IT. MAYBE POLITICAL AFFILIATION WOULD BE PART OF A MODEL OF WHETHER YOU GET VACCINATED OR NOT. I DON'T KNOW. YOU CAN LOOK AT OTHER KINDS OF PHENOMENA. WE LOOK AT THINGS RELATED TO OBESITY, SMOKING, TOBACCO USE, AND SEE WHETHER OTHER SORTS OF DATA YOU HADN'T ANTICIPATED MIGHT BE PART OF THESE MODELS. I WOULD RECOMMEND YOU HAVE SOME SORT OF FILTER TO KEEP YOU FROM PUTTING THE KITCHEN SINK IN, ESPECIALLY WHEN BUILDING COMPLEX MODELS. BUT THERE IS A LIMIT TO WHAT YOU CAN DO WITH THESE THINGS. 
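THE STEPS JUST DESCRIBED (INFLATE A SURVEY'S JOINT DISTRIBUTION TO A FULL POPULATION, THEN ATTACH A NEW FEATURE SUCH AS VACCINATION STATUS TO EACH SYNTHETIC PERSON) CAN BE SKETCHED IN A FEW LINES OF PYTHON. THIS IS ONLY A TOY UNDER STATED ASSUMPTIONS, NOT THE LAB'S PIPELINE: THE TWO DEMOGRAPHIC VARIABLES, THE CELL COUNTS IN `SURVEY_COUNTS`, THE CONDITIONAL PROBABILITIES IN `P_VACCINATED` AND THE FUNCTION NAME `synthesize` ARE ALL INVENTED FOR ILLUSTRATION.

```python
import random
from collections import Counter

# Hypothetical survey cell counts over (age group, income band); made-up numbers.
SURVEY_COUNTS = {
    ("18-34", "low"): 30, ("18-34", "high"): 10,
    ("35+", "low"): 20, ("35+", "high"): 40,
}

# Hypothetical conditional model P(vaccinated | age group); also made up.
P_VACCINATED = {"18-34": 0.4, "35+": 0.7}


def synthesize(counts, n, seed=0):
    """Draw n synthetic people from the joint distribution implied by counts,
    then attach a vaccination flag drawn from the conditional model."""
    rng = random.Random(seed)  # seeded so the synthetic population is reproducible
    cells = list(counts)
    weights = [counts[cell] for cell in cells]
    people = []
    for age, income in rng.choices(cells, weights=weights, k=n):
        vaccinated = rng.random() < P_VACCINATED[age]
        people.append({"age": age, "income": income, "vaccinated": vaccinated})
    return people


population = synthesize(SURVEY_COUNTS, n=20000)

# A "census" of the synthetic population should resemble the survey's joint shares.
shares = Counter((p["age"], p["income"]) for p in population)
for cell, count in sorted(shares.items()):
    print(cell, round(count / len(population), 3))
```

BECAUSE EACH PERSON IS A RECORD, A NEW FEATURE IS JUST ANOTHER KEY ON THAT RECORD, WHICH IS ONE WAY TO READ THE COORDINATE SYSTEM REMARK ABOVE.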
THERE ARE CERTAIN CONSTRAINTS ON INFERENCE. AN EXAMPLE AGAIN: SUPPOSE YOU HAVE THESE DIFFERENT FEATURES OF YOUR DATA, A, B, C, AND YOU COLLECTED DATA ON A BY ITSELF AND ON THE CONDITIONAL DISTRIBUTION OF B GIVEN A. THIS COULD BE A CAUSAL MODEL, SOMETHING ABOUT A THAT CAUSES SOMETHING IN B. AND YOU CAN CERTAINLY DETERMINE THE JOINT DISTRIBUTION HERE. SIMILARLY, IF YOU HAVE INFORMATION ABOUT C GIVEN A YOU CAN DETERMINE THAT, BUT YOU CAN'T GET THE JOINT DISTRIBUTION FOR ALL THREE VARIABLES WITHOUT MAKING MORE ASSUMPTIONS, COLLECTING MORE DATA, OR ASSUMING WHAT THE CONSTRAINTS ON THE FULL JOINT DISTRIBUTION MIGHT BE. IT IS WORSE THAN THAT, BECAUSE WHAT YOU GET IS SOME OTHER SURVEY CLAIMING TO BE MEASURING VARIABLE A USING A SLIGHTLY DIFFERENT DEFINITION, SO YOU GET THIS THING AND YOU WILL HAVE TO DO SOME WORK TO MAKE IT LOOK LIKE IT IS MEASURING A, B, C. I APOLOGIZE FOR THE FORMATTING, IT DIDN'T LOOK THAT WAY WHEN I MADE IT. AND THE BOTTOM POINT HERE IS THAT IN COMPLEX SYSTEMS, IF YOU WANT TO GET THE DETAILS RIGHT, YOU NEED TO UNDERSTAND PARTS OF THE JOINT DISTRIBUTION. AS JERRY MENTIONED, IF THERE'S ONE PART OF THE DISTRIBUTION WHERE THERE'S SOME HUGE SPIKE AND YOU HAVE TO AVERAGE OVER THAT, SMOOTH IT OUT TO PROVIDE PRIVACY OR SOMETHING, THAT COULD BE A PROBLEM. AND WE NEED TO THINK ABOUT HOW TO ADDRESS THAT. THE VERY LAST POINT HERE IS THE CURSE OF DIMENSIONALITY, WHICH IS THAT THE MORE OF THESE FEATURES I ADD, THE MORE DATA I NEED. IF THERE ARE ABOUT THE SAME NUMBER OF POSSIBLE VALUES FOR EACH FEATURE, YOU WILL FIND THE SAMPLE SIZE YOU NEED EXPANDS EXPONENTIALLY WITH THE NUMBER OF FEATURES YOU ADD. THIS IS A NICE WORD, EXPONENTIALLY, BECAUSE THE WAY PEOPLE DESCRIBE BIG DATA, IT'S GROWING EXPONENTIALLY FAST. SO IN AN ABSOLUTELY WONDERFUL MARKETING INSIGHT, THEY LEFT THE TERM BIG DATA FAIRLY OPEN, OPEN TO INTERPRETATION. WHAT I WOULD LIKE IS AN EXPONENTIAL AMOUNT OF DATA THAT GIVES ME LONGITUDINAL, CAREFULLY DESIGNED, COMPREHENSIVE SURVEYS THAT INCLUDE ALL THE DATA THAT I NEED. 
JUST MUCH, MUCH MORE OF IT, SO I CAN BUILD A SYNTHETIC POPULATION HOWEVER I LIKE. WHAT YOU GET IS MORE LIKE A BUNCH OF INCONSISTENT EXPERIMENTS, EXPERIMENTAL RESULTS, WITH PARTIAL INFORMATION IN EACH. SO TO GIVE AN EXAMPLE: MARBLES ARE NOT PEOPLE, I UNDERSTAND THAT, BUT THIS WILL STRIP AWAY ISSUES THAT ARE INVOLVED WITH DEALING WITH PEOPLE. LET'S THINK ABOUT MARBLES. SO THE FIRST THING IN MODELING IS TO MODEL THE JOINT DISTRIBUTION OF FEATURES BY PICKING A SAMPLE. HERE IS A SAMPLE OF MARBLES, AND SUPPOSE I WANT TO MODEL THE JOINT DISTRIBUTION HERE OF COLOR, DENSITY, SIZE, TRANSPARENCY. ASIDE FROM THE OBVIOUS FACT THAT I HAD BETTER GO BACK TO MY SAMPLING PROTOCOL IF I WANT TO MODEL THE DISTRIBUTION OF SIZE, THAT'S KIND OF A TYPICAL THING. YOU MIGHT THINK BIG DATA WILL MEAN LOTS OF MARBLES IN MY SAMPLE, SO I WILL BE ABLE TO GET THE DISTRIBUTIONS JUST RIGHT. THAT'S NOT WHAT BIG DATA IS. OR ARE. IT'S MORE LIKE WHAT WE HAVE DOWN HERE. SO SOMEBODY PROVIDED ME A VERY, VERY BIG SAMPLE WITH AN EXTREME BIAS IN IT. IT LOOKS LIKE GREEN MARBLES, BUT I DON'T KNOW WHAT COLOR THEY'RE USING BECAUSE SOME I WOULD NOT CALL GREEN, SOME ARE BLUE, SOME YELLOW. I DON'T KNOW WHAT'S IN THAT DISTRIBUTION. HERE IS SOMEBODY THAT COLLECTED AND ARRANGED THEM IN SOME FASHION. THERE IS A REASON BEHIND IT BUT I DON'T KNOW WHAT IT WAS, I CAN'T EXPLAIN IT. OVER HERE ARE MARBLES FROM A SET OF CHINESE CHECKERS, DESIGNED TO FALL INTO EXACTLY SIX CATEGORIES. THAT'S WHAT BIG DATA IS AND THAT'S WHAT WE HAVE TO FIGURE OUT HOW TO USE. THESE ARE HARD PROBLEMS TO SOLVE BUT NOT NEW, NOT TERRIBLY NEW IN GENERAL. THESE ARE THE KINDS OF ISSUES WE HAD TO DEAL WITH WHEN DEALING WITH NATURAL EXPERIMENTS VERSUS RANDOMIZED CONTROLLED CLINICAL TRIALS. IT IS WHAT WE DEAL WITH WHEN WE TRY TO GET AWAY FROM A FULL FACTORIAL DESIGN, NOT HAVING TO HAVE ALL THE INSTANCES YOU NEED IN EVERY CELL OF THE EXPERIMENT. WE HAVE A LOT OF EXPERIENCE IN DETECTING AND ACCOUNTING FOR CONFOUNDERS IN DATA. AND INTERESTINGLY, AS ONE OF MY STUDENTS POINTED OUT TO ME, THIS IS LIKE DOING META-ANALYSES OF EXPERIMENTS. 
THEY'RE ALL SLIGHTLY DIFFERENT, AND YOU HAVE TO FIGURE OUT A WAY TO PUT THEM TOGETHER. I'M NOT SAYING WE HAVE THE PERFECT SOLUTION TO ANY OF THESE THINGS, BUT PEOPLE WORK ON IT. PEOPLE HAVE BEEN WORKING ON IT FOR A LONG TIME, THEY WILL CONTINUE WORKING ON IT, AND IT'S NOT JUST IN THE HIV FIELD THAT WE NEED TO LOOK FOR SOLUTIONS, BUT TO EVERYONE CONFRONTING THESE PROBLEMS. SO WE BUILT THIS SYNTHETIC POPULATION OF THE U.S. ABOUT TEN YEARS AGO AND CURRENTLY WE'RE WORKING ON THE WORLD. WE HAVE BEEN DOING THESE RESPIRATORY ILLNESS SIMULATIONS FOR A LONG TIME. WHY HAVE WE DONE ALL THIS THAT YOU NEED SO FAR? THERE IS WORK, EFFORT AND DIRECTION THAT GOES INTO THIS. PEOPLE HAVE MENTIONED THE NEED TO BUILD THESE TRANSDISCIPLINARY TEAMS WITH EXPERTISE FROM EVERYBODY IN THE ROOM ON THEM. WE FOUND IT TAKES A WHILE TO SOCIALIZE THE METHODS, TO CONVINCE PEOPLE THAT YES, SYNTHETIC INFORMATION IS USEFUL, AND YES, YOU CAN DO THE EXPERIMENTS THAT YOU WANT TO DO WITH SYNTHETIC INFORMATION. AND UNFORTUNATELY IT'S NOT THE CASE THAT ONCE YOU HAVE DEMONSTRATED IT IN ONE DISCIPLINE YOU CAN CONVINCE EVERYBODY. YOU NEED TO DO THIS DOMAIN BY DOMAIN. THE COMPUTATIONAL ISSUES, THAT I HAVEN'T ADDRESSED AT ALL, ARE AS HARD AS THE STATISTICAL ISSUES. SOME OF THE TALKS THIS MORNING, I WOULD SAY, HAVE DONE A VERY GOOD JOB ILLUSTRATING WHAT SOME OF THE ISSUES ARE, MAINTAINING PRIVACY, SECURITY, BUT THERE'S ALSO A WHOLE HOST OF OTHER ISSUES. SCALABILITY IS PRIMARY AMONG THEM. IF YOU USE AN N-SQUARED ALGORITHM AND N IS 100 MILLION INSTEAD OF TEN, YOU FIND YOU DON'T HAVE TIME TO WAIT FOR ANSWERS. YOU NEED TO FIND NEW ALGORITHMS. YOU NEED TO INVEST SERIOUS EFFORT IN THIS. TRACKING DOWN THE DATA, UNDERSTANDING WHERE IT IS, FIGURING OUT HOW TO GET HOLD OF IT. I WILL SHOW A PICTURE ON THE NEXT SLIDE OF A SYSTEM THAT ALLOWS PEOPLE TO REALLY USE THE DATA AND REALIZE THE BENEFITS THAT ARE POSSIBLE FROM IT. PART OF THAT IS MAINTAINING PROVENANCE AT EACH STEP OF THE SYSTEM. 
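THE SCALABILITY REMARK ABOVE ABOUT N-SQUARED ALGORITHMS CAN BE MADE CONCRETE WITH SOME ARITHMETIC. THE PYTHON SKETCH BELOW COUNTS UNORDERED PAIRS FOR N = 10 AND N = 100 MILLION, THEN COUNTS THE PAIRS THAT REMAIN IF PEOPLE ARE ONLY COMPARED WITHIN GROUPS. THE GROUPING (100,000 GROUPS OF 1,000, FOR EXAMPLE PEOPLE SHARING A LOCATION) AND THE RATE OF ONE BILLION PAIR EVALUATIONS PER SECOND ARE ASSUMPTIONS CHOSEN ONLY TO SHOW THE ORDER OF MAGNITUDE. THEY DO NOT COME FROM THE TALK.

```python
def all_pairs(n):
    """Number of unordered pairs among n items: the work of a naive N-squared scan."""
    return n * (n - 1) // 2


def blocked_pairs(block_sizes):
    """Pairs left when items are only compared within their own block."""
    return sum(all_pairs(size) for size in block_sizes)


RATE = 1_000_000_000  # assumed pair evaluations per second (hypothetical machine)

small = all_pairs(10)
large = all_pairs(100_000_000)
grouped = blocked_pairs([1_000] * 100_000)  # same 100 million people, in groups

print("n = 10:", small, "pairs")
print("n = 100 million:", large, "pairs,", round(large / RATE / 86_400, 1), "days")
print("grouped:", grouped, "pairs,", round(grouped / RATE, 1), "seconds")
```

THE POINT OF THE COMPARISON IS THE ONE MADE IN THE TALK: AT THIS SIZE THE GAIN COMES FROM CHANGING THE ALGORITHM SO THAT MOST PAIRS ARE NEVER EXAMINED, NOT FROM A FASTER MACHINE.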
SO IF WE PROVIDE YOU, FOR EXAMPLE, WITH THE FINAL SOCIAL CONTACT NETWORK WE HAVE ESTIMATED, YOU SHOULD BE ASKING WHAT ARE ALL THE INDIVIDUAL PIECES OF DATA THAT WENT INTO IT. IN ADDITION TO THAT, WHAT ARE THE MODELS THAT YOU USED TO COMBINE THIS DATA? BECAUSE EVERY STEP OF COMBINING DATA INVOLVES SOME MODEL OR MODELING ASSUMPTION. WHAT ARE THE PARAMETERS THAT YOU USED IN THE MODEL TO COMBINE THE DATA? WHICH OF THESE HAVE BEEN PUBLISHED, HAVE BEEN PEER REVIEWED? WHAT ARE THE ALTERNATIVES TO EACH OF THESE? I DON'T THINK IT WILL EVER BE THE CASE THAT ANYONE IS GOING TO USE DATA IF THEY DON'T UNDERSTAND ALL THE STEPS IN ITS CREATION. I'M TRYING TO REMEMBER THE BULLET POINTS DOWN BELOW THE BOTTOM. HERE IS MY IMAGE OF AN IDEAL SYNTHETIC INFORMATION RESOURCE. WHAT DOES IT GET YOU? IT ENCOURAGES AND ALLOWS A TRANSDISCIPLINARY TEAM SCIENCE APPROACH TO YOUR PROBLEM. ONCE YOU CAST THINGS IN THE LANGUAGE OF SYNTHETIC INFORMATION, PEOPLE KNOW WHAT TO DO WITH IT. AS SCIENTISTS THAT'S WHAT WE'RE USED TO, WE'RE USED TO HAVING SOME DATA TO WORK WITH. I CAN GO TO A DATA SET AND SAY I DON'T KNOW ANYTHING ABOUT THE DISTRIBUTION OF HIV OR HOW IT IS ASSOCIATED WITH THESE OTHER FACTORS, BUT SOMEONE SITTING AROUND THE TABLE SAYS EVERYTHING YOU NEED TO KNOW IS EMBEDDED IN THIS INFORMATION RESOURCE ALREADY. ANALYZE THE INFORMATION RESOURCE AS IF IT WERE GIVING ANSWERS TO WHATEVER QUESTIONS YOU WOULD LIKE TO ASK OF THE POPULATION. IT'S EASIER FOR PEOPLE WITH DIFFERENT BACKGROUNDS AND EXPERTISE TO COOPERATE WITH EACH OTHER. AS I WAS SAYING BEFORE, MAINTAINING PROVENANCE IS NOT ONLY AN ISSUE OF HOW THE DATA WAS CONSTRUCTED BUT OF WHAT LIMITS THAT PLACES ON ITS APPROPRIATE USE. MAYBE THE MODELS I BUILT FOR INFECTIOUS RESPIRATORY DISEASE, FOR EXAMPLE, ARE NOT SUITABLE FOR HIV. SOMEWHERE THE RESOURCE ITSELF SHOULD ANSWER: CAN I USE THIS DATA SET TO DO WHAT I WOULD LIKE TO DO WITH IT, OR IS IT NOT APPROPRIATE FOR THAT USE? SO SOMETHING LIKE THE VERIFICATION SERVER DESCRIBED EARLIER THIS MORNING. 
BUT EARLIER IN THE PROCESS, SO BEFORE YOU HAVE A HYPOTHESIS ABOUT HOW THINGS ARE RELATED, YOU WANT TO SAY, CAN I DETECT SUCH A RELATIONSHIP IN THIS DATA? IT SHOULD BE POSSIBLE TO PLUG AND PLAY. IF YOU DON'T LIKE HOW WE HAVE BUILT A DISTRIBUTION OF AGE, INCOME AND OTHER VARIABLES, THEN IT SHOULD BE POSSIBLE FOR YOU TO SAY NO, I WOULD RATHER MAKE THESE ASSUMPTIONS ABOUT THAT PROCESS, AND ADD THEM TO THE DATA RESOURCE, AND FOR OTHERS TO BE ABLE TO USE THOSE ASSUMPTIONS. THIS POINT HERE: THE RESULTS AND THE MODELS ARE REPRODUCIBLE. WE ALL KNOW THE ISSUES RIGHT NOW IN THE LITERATURE ON REPRODUCIBLE SCIENCE. AND THE INFORMATION RESOURCE ITSELF SHOULD BE ATTRIBUTABLE AND EASILY CITABLE, SO IF I GO TO THE RESOURCE AND WRITE A PAPER BASED ON SOMETHING I DETECTED IN THIS DATA SET, I SHOULD BE ABLE TO PUNCH A BUTTON AND HAVE IT SAY, CITE AND ACKNOWLEDGE THE GRANTS THAT MADE IT POSSIBLE AND THE PEOPLE WHO CONTRIBUTED TO THE DATA SET. THAT'S ONE OF THE KEY ELEMENTS IN GETTING PEOPLE TO USE THIS, SO YOU DON'T FEEL YOUR WORK IS WASTED ONCE IT IS IN THE PUBLIC DOMAIN. INSTEAD OF BEING AVAILABLE TO EVERYONE TO USE WITHOUT CITATION, IT IS PUBLICLY AVAILABLE FOR EVERYBODY TO USE WITH CORRECT CITATION. SO I'M REALLY HAPPY THAT I HAVEN'T SEEN THIS CONCLUSION IN ANY OF THE TALKS SO FAR: AS YOU GET MORE DATA, USUALLY YOUR TRADITIONAL STATISTICAL TEST FINDS ANY DIFFERENCE AT ALL, SO YOU THINK MAYBE EVERYTHING IS PUBLISHABLE. THAT'S NOT WHAT I CAME TO TALK ABOUT. WHAT I WANT TO SAY IS I THINK BIG DATA PROVIDES WONDERFUL OPPORTUNITIES TO STUDY, AS OTHERS MENTIONED, THE SMALL AREA ESTIMATES AND THE SMALL SUBPOPULATION ESTIMATES. BOTH ARE NATURALLY DONE WITH SYNTHETIC POPULATIONS. THE GROWING AMOUNT OF DATA SHOULD BE SOMETHING THAT WE CAN MAP INTO STUDYING MORE COMPLEX INTERACTIONS AMONG FEATURES OF THE DATA SET, AND I THINK THAT SYNTHETIC INFORMATION SETS ARE WELL SUITED TO ACHIEVE THAT. THANK YOU. >> DR. EUBANK, AS A STATISTICIAN I WOULD LOVE TO DISCUSS THE ISSUE OF SAMPLING AND HOW TO TAKE IT INTO ACCOUNT. OUR NEXT SPEAKER, DR.
MARTINA MORRIS, IS AN EXPERT IN NETWORK RESEARCH METHODS. SHE HOLDS A JOINT APPOINTMENT AS PROFESSOR IN THE DEPARTMENTS OF SOCIOLOGY AND STATISTICS AT THE UNIVERSITY OF WASHINGTON, AND SHE'S FOUNDING DIRECTOR OF THE SOCIOBEHAVIORAL AND PREVENTION RESEARCH CORE AT THE UW CFAR. DR. MORRIS CO-LEADS AN INTERDISCIPLINARY GROUP OF STATISTICIANS, EPIDEMIOLOGISTS AND DEMOGRAPHERS WHO DEVELOP AND IMPLEMENT INNOVATIVE METHODOLOGY AND SOFTWARE FOR NETWORK MODELING. THEY HAVE RELEASED SOFTWARE IN THE R PACKAGE STATNET AND ALSO THE PACKAGE EPIMODEL. FOR ANY OF US WHO USE NETWORK MODELING IN THE AREA OF EPIDEMIOLOGY, I BELIEVE WE HAVE ALL FOUND THESE PACKAGES INCREDIBLY HELPFUL. I KNOW I HAVE. I WILL ALSO MENTION THAT EPIMODEL USES COMPARTMENTAL MODELS AND STOCHASTIC NETWORK MODELS IN ORDER TO HELP US MODEL INFECTIOUS DISEASE PROPAGATION. THESE MODELS CAN BE USEFUL FOR STUDYING THE PROPERTIES OF DIFFERENT DESIGNS INTENDED TO INVESTIGATE THE EFFECT OF INTERVENTIONS. >> THANK YOU, VICTOR, AND THANK YOU TO THE ORGANIZERS FOR INVITING ME TO THIS MEETING. I'M GOING TO TALK TODAY ABOUT WORK THAT'S REALLY THE WORK OF A LARGE GROUP OF PEOPLE, WHOM I ACKNOWLEDGE HERE, PRIMARILY THE STATNET DEVELOPMENT TEAM BUT ACTUALLY A LARGER GROUP THAN THAT. I WOULD LIKE TO START BY TALKING ABOUT WHAT MAKES NETWORKS, AND IN PARTICULAR INFECTIOUS DISEASE TRANSMISSION, DIFFERENT, BECAUSE THIS IS REALLY SUPPOSED TO BE ABOUT BIG DATA FOR HIV AND I THINK INFECTIOUS DISEASES ON NETWORKS HAVE UNIQUE PROPERTIES THAT NEED TO BE THOUGHT ABOUT IN THAT CONTEXT. I WILL BRIEFLY TALK ABOUT THE DIFFERENCE BETWEEN BEHAVIORAL NETWORK DATA AND PHYLOGENETIC DATA. CHRISTOPHE FRASER WILL SPEAK ABOUT THE LATTER SO I'LL LEAVE THAT TO HIM, AND I WILL SPEND MOST OF THE PRESENTATION REVIEWING KEY DEVELOPMENTS IN STATISTICAL NETWORK METHODOLOGY TO DO THE KIND OF MODELING THAT THE PREVIOUS SPEAKER WAS DISCUSSING, FOR HIV PREVENTION. I WILL HAVE AN EXAMPLE AT THE END. SO A SIMPLE WHAT IF.
IMAGINE TWO BLUE PEOPLE. EACH ONE HAS ONE PARTNER, BOTH THESE BLUE PEOPLE ARE IN CONCORDANT HIV-NEGATIVE PARTNERSHIPS, BOTH HAVE THE SAME CONDOM USE AND SEXUAL BEHAVIOR REPERTOIRE. YOU WOULD THINK BOTH ARE EQUALLY PROTECTED AGAINST HIV EXPOSURE. WHAT YOU REALLY NEED TO KNOW IS WHO THEIR PARTNERS ARE CONNECTED TO. EXPOSURE FOR THE BLUE NODES IS A FUNCTION OF THEIR PARTNERS' BEHAVIOR, NOT THEIR OWN BEHAVIOR, AND THE PARTNERS' EXPOSURE IS DETERMINED BY THEIR PARTNERS' PARTNERS' BEHAVIOR. BASICALLY THE NETWORK DETERMINES EXPOSURE. THAT'S ONE THING THAT MAKES INFECTIOUS DISEASE DIFFERENT FROM CHRONIC DISEASE, AND THAT HAS IMPLICATIONS FOR THE TYPE OF DATA WE NEED TO THINK ABOUT AND THE TYPE OF MODELS THAT ARE APPROPRIATE. BASICALLY WHAT THIS DOES IN INFECTIOUS DISEASE IS SEPARATE, OR BREAK, THE RELATIONSHIP BETWEEN INDIVIDUAL BEHAVIOR AND RISK. THAT'S A DIFFERENT PARADIGM FOR UNDERSTANDING EXPOSURE. THE SECOND THING THAT'S IMPORTANT, AND THIS IS WELL KNOWN TO ANYBODY WHO HAS DONE MODELING IN THIS FIELD, IS THAT THERE ARE THRESHOLDS IN EPIDEMICS. THE MOST FAMOUS IS THE REPRODUCTIVE THRESHOLD: BELOW THE REPRODUCTIVE THRESHOLD YOU DON'T GET AN EPIDEMIC, AND IF YOU'RE ABOVE IT YOU DO. A VERY SMALL CHANGE AT THE THRESHOLD MAKES A BIG DIFFERENCE. WE GET THE SAME THING IN NETWORK CONNECTIVITY. TO SEE WHAT THAT LOOKS LIKE, TAKE A DISTRIBUTION OF THE NUMBER OF SEXUAL PARTNERS ON A PARTICULAR DAY THAT LOOKS LIKE THIS, IGNORING THE PEOPLE WHO UNFORTUNATELY HAVE NO SEXUAL PARTNERS BECAUSE THEY'RE IRRELEVANT FOR OUR PURPOSES. FOR THE REST, MOST HAVE ONE OR TWO PARTNERS AND MAYBE 10 PERCENT HAVE THREE PARTNERS; THE MEAN IS 1.68 FOR THOSE WHO ARE ACTIVE. IF YOU GENERATE NETWORKS FROM A DEGREE DISTRIBUTION LIKE THIS, AND THESE ARE DIFFERENT EXAMPLES FROM DIFFERENT SIMULATIONS, THE LARGEST COMPONENTS IN GENERAL COMPRISE 2% OF THE ENTIRE POPULATION. THESE WERE TAKEN FROM 10,000-NODE NETWORKS. NOW A FEW MORE PEOPLE HAVE TWO PARTNERS, HARDLY ANY MORE HAVE THREE, AND YOU GENERATE LARGE COMPONENTS THAT COMPRISE 10% OF THE POPULATION.
AND THAT RED RING IN THERE IS WHAT WE CALL A BICOMPONENT: YOU HAVE TO REMOVE TWO NODES OR TWO LINKS TO ACTUALLY DISCONNECT THAT COMPONENT. THAT DOESN'T SEEM LIKE A MUCH BIGGER EFFORT, BUT IT'S DOUBLING THE INTERVENTION EFFORT. THAT'S THE IMPACT: ROBUSTNESS OF CONNECTIVITY, AND WE'RE STARTING TO SEE 1% OF THE POPULATION IN THE LARGEST BICOMPONENT. ADD ANOTHER .06 OF A PARTNER HERE AND 41% ON AVERAGE ARE CONNECTED, WITH 5% IN THE BICOMPONENT; ANOTHER .06 OF A PARTNER AND 64% ARE CONNECTED, WITH 15% IN THE BICOMPONENT. SO THIS DIFFERENCE FROM LEFT TO RIGHT IS .2 PARTNERS ON AVERAGE PER PERSON. MANY SURVEYS WOULDN'T BE ABLE TO DETECT THAT AS A STATISTICALLY SIGNIFICANT DIFFERENCE; THEY'RE NOT POWERED FOR THAT. IT DOESN'T SEEM LIKE A LARGE DIFFERENCE, AND ONLY 12% HAVE A CONCURRENT PARTNER, MORE THAN ONE PARTNER ON THE DATE OF THE INTERVIEW, SO A SMALL DIFFERENCE LEADS TO A HUGE CHANGE IN CONNECTIVITY. THAT HAPPENS AT THE THRESHOLD. IT DOESN'T HAPPEN IN OTHER PARTS OF THE DISTRIBUTION. YOU CAN GET THE SAME CHANGE IN OTHER PARTS OF THE DISTRIBUTION AND YOU DON'T GET THE IMPACT, SO THIS THRESHOLD TURNS OUT TO BE VERY IMPORTANT. THERE ARE A NUMBER OF IMPLICATIONS FROM THESE TWO THINGS. THE FIRST IS THAT INDIVIDUAL OUTCOMES ARE DEPENDENT IN THESE PROCESSES, WHICH MEANS BOTH DATA MINING AND SIMPLE STATISTICAL METHODS ARE NOT DESIGNED FOR THAT PROCESS. BIG DATA? IN SOME WAYS THIS IS A BIG DATA ISSUE: THE FUNDAMENTAL DRIVER IS NETWORK CONNECTIVITY AND THAT'S NOT MEASURABLE AT THE INDIVIDUAL LEVEL. CONNECTIVITY IS A NONLINEAR FUNCTION OF BEHAVIOR, SO THE LINEAR INTUITION OFTEN FAILS. THAT'S A PROBLEM, BUT THE THRESHOLDS TURN OUT TO BE NICE TARGETS FOR INTERVENTION, SO THAT'S A GOOD THING. WHAT DOES THIS MEAN FOR BIG DATA? THIS IS THE OBLIGATORY SLIDE. IN NETWORKS THERE IS AN INFORMATION CASCADE WHEN YOU GET DATA ON NETWORKS. FOR N NODES YOU HAVE N CHOOSE TWO DYADS; THAT'S THE NUMBER OF POSSIBLE LINKS FOR AN UNDIRECTED NETWORK, AND THAT WILL BE N TIMES N MINUS ONE OVER TWO.
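A MINIMAL SKETCH OF THE CONNECTIVITY THRESHOLD DESCRIBED ABOVE. IT IS NOT THE SIMULATION FROM THE TALK: AS A SIMPLIFYING ASSUMPTION, LINKS ARE PLACED UNIFORMLY AT RANDOM INSTEAD OF BEING DRAWN FROM THE DEGREE DISTRIBUTION ON THE SLIDE, SO THE THRESHOLD FALLS AT A DIFFERENT MEAN DEGREE (NEAR ONE LINK PER NODE). THE FUNCTION NAME `largest_component_fraction` AND ITS PARAMETERS ARE HYPOTHETICAL.

```python
import random


def largest_component_fraction(n, mean_degree, seed=0):
    """Fraction of nodes in the largest connected component.

    Assumption: links are placed uniformly at random between node pairs.
    Self-pairs and repeated pairs can occur; they add no connectivity,
    so the realized mean degree is very slightly below `mean_degree`.
    """
    rng = random.Random(seed)
    parent = list(range(n))  # union-find forest over the nodes
    size = [1] * n           # component size, valid at each root

    def find(x):
        # Follow parent pointers to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    n_links = int(round(mean_degree * n / 2))
    for _ in range(n_links):
        ra, rb = find(rng.randrange(n)), find(rng.randrange(n))
        if ra != rb:
            if size[ra] < size[rb]:
                ra, rb = rb, ra
            parent[rb] = ra          # merge the smaller component into the larger
            size[ra] += size[rb]
    return max(size[find(v)] for v in range(n)) / n


if __name__ == "__main__":
    for d in (0.8, 1.0, 1.2, 1.4):
        print(d, largest_component_fraction(10_000, d))
```

UNDER THESE ASSUMPTIONS THE LARGEST COMPONENT STAYS TINY WELL BELOW THE THRESHOLD AND GROWS QUICKLY ONCE THE MEAN NUMBER OF LINKS PER NODE PASSES IT, WHICH IS THE QUALITATIVE PATTERN SHOWN ON THE SLIDES.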
THAT NUMBER OF DYADS DOESN'T GET SO LARGE SO QUICKLY, THOUGH IT'S ORDER N SQUARED SO IT CAN GET LARGE, BUT THE REAL ISSUE IS YOU HAVE TWO TO THE N CHOOSE TWO POSSIBLE NETWORKS ON ANY SET OF NODES. SO FOR TEN NODES, A TINY NETWORK, THERE ARE 45 DYADS BUT 3.5 TIMES TEN TO THE 13 POSSIBLE NETWORKS ON TEN NODES. THAT'S NUTS. SO CAN WE OBSERVE THE NETWORK? THIS IS WHERE THINGS GET INTERESTING. ONE WAY TO OBSERVE A CONTACT NETWORK IS BEHAVIORAL SURVEYS. THIS IS SELF-REPORTED BEHAVIOR, WHICH GETS NO RESPECT IN THE LITERATURE. AS FOR EXAMPLES OF OBSERVED COMPLETE SEXUAL NETWORKS, I ONLY KNOW OF TWO ATTEMPTS MADE TO GET A COMPLETE SEXUAL NETWORK: ONE WAS COLORADO SPRINGS IN THE 1980s, THE OTHER LIKOMA ISLAND IN THE EARLY 2000s I THINK. I'M SURE PEOPLE HERE CAN REMIND ME OF THE YEAR, BUT THERE ARE CAVEATS ON THE NEXT PAGE THERE. AN ALTERNATIVE IS LOOKING AT HIV SEQUENCING DATA, WHICH CAPTURES THE TRANSMISSION NETWORK. THIS IS AN INTERESTING FIELD, CHRISTOPHE FRASER WILL TALK ABOUT IT, AND THIS IS A CLASSIC EXAMPLE OF BIG DATA. YOU SEQUENCE THE VIRUS FROM INDIVIDUALS INFECTED WITH HIV, AND THOSE SEQUENCES THEMSELVES ARE BIG DATA. WHEN YOU PULL THEM TOGETHER IT'S BIG DATA, AND YOU CAN CONSTRUCT PHYLOGENIES OR TREES THAT INFER THE PROCESS OF THE EVOLUTION. THAT'S BIG COMPUTING, SO THAT'S BIG DATA DOWN THE LINE. THERE IS A SENSE THAT THIS IS OBJECTIVE AND CAPTURES THE PROCESS, BUT KEEP IN MIND THE INFERRED TREE WE GET FROM THAT, EVEN THE CLUSTERS, PROVIDES NO INFORMATION ON WHO INFECTED WHOM. YOU HAVE TO HAVE SUPPLEMENTARY INFORMATION FROM EPIDEMIOLOGICAL OR CLINICAL DATA IN ORDER TO MAKE THOSE KINDS OF INFERENCES. SO THERE ARE SIMILARITIES AND DIFFERENCES WORTH KEEPING IN MIND. IN PRINCIPLE A TRANSMISSION NETWORK IS A SUBSET OF THE CONTACT NETWORK. NOT ALL CONTACTS TRANSMIT, AND THOSE THAT DO ARE (INAUDIBLE) A TRANSMISSION NETWORK.
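THE COUNTS QUOTED FOR A TEN-NODE NETWORK CAN BE CHECKED DIRECTLY. THIS IS A SMALL SKETCH FOR UNDIRECTED NETWORKS WITHOUT SELF-LINKS; THE FUNCTION NAMES `n_dyads` AND `n_networks` ARE HYPOTHETICAL.

```python
from math import comb


def n_dyads(n):
    """Number of node pairs (possible links) in an undirected network on n nodes."""
    return comb(n, 2)  # equals n * (n - 1) // 2


def n_networks(n):
    """Number of distinct undirected networks on n labeled nodes.

    Each dyad is either linked or not, so there are 2 ** (n choose 2) networks.
    """
    return 2 ** n_dyads(n)


if __name__ == "__main__":
    print(n_dyads(10))     # 45
    print(n_networks(10))  # 35184372088832, about 3.5e13
```

THE NUMBER OF DYADS GROWS AS N SQUARED, BUT THE NUMBER OF NETWORKS DOUBLES WITH EVERY ADDITIONAL DYAD, WHICH IS WHY ANY SUM OVER ALL POSSIBLE NETWORKS BECOMES INFEASIBLE ALMOST IMMEDIATELY.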
IN PRACTICE THE ANALYTIC GOALS ARE TYPICALLY DIFFERENT FOR THESE TWO APPROACHES. WITH CONTACT NETWORK DATA WE DEVELOP MODELS FOR PROJECTING AN EPIDEMIC FORWARD IN TIME, MUCH THE WAY OUR PREVIOUS SPEAKER DESCRIBED, OFTEN TO EVALUATE INTERVENTION OPTIONS: WHAT HAPPENS IF I TRY THIS INTERVENTION OR SOME OTHER INTERVENTION? THE APPROACH EMPHASIZES A GENERAL UNDERSTANDING OF EPIDEMIC DYNAMICS UNDER VARIOUS CONDITIONS; WE TREAT IT AS A VIRTUAL LABORATORY. FOR TRANSMISSION NETWORK DATA, WHAT WE USE THOSE DATA FOR IS TO UNDERSTAND THE HISTORY OF A SPECIFIC EPIDEMIC UP TO THE CURRENT TIME. IT ISN'T DESIGNED FOR FORWARD PROJECTION. THE APPROACH EMPHASIZES UNDERSTANDING THE EPIDEMIC PROCESS IN A SPECIFIC POPULATION. HISTORICAL RECONSTRUCTION IS ENORMOUSLY IMPORTANT, YOU GET INFORMATION FROM IT, AND I THINK, AGAIN WE CAN DEBATE THIS, THAT THESE TWO APPROACHES ARE COMPLEMENTARY RATHER THAN COMPETITIVE. WORTH KEEPING IN MIND: IS THE COMPLETE NETWORK IN EITHER OF THESE CASES EVER OBSERVED? I WOULD SAY NO. IN BOTH CASES WE ONLY OBSERVE A FRACTION OF THE COMPLETE NETWORK. IN BEHAVIORAL NETWORKS, 50% OF THE NAMED PARTNERS WERE MISSING. IN COLORADO SPRINGS AN UNKNOWN FRACTION WAS MISSING; WE ESTIMATED 80% COVERAGE OF THE HIGHEST RISK POPULATION WAS THERE, BUT LOTS OF PEOPLE NAMED ANONYMOUS PARTNERS. THERE WAS LOTS OF NETWORK WE DIDN'T OBSERVE. THOSE ARE OUR ONLY TWO REAL ATTEMPTS TO GET A COMPLETE NETWORK. IF YOU GO TO THE ADD HEALTH FRIENDSHIP NETWORK, FRIENDSHIPS BEING SIGNIFICANTLY LESS SENSITIVE TO REPORT, THERE 40% OF THE FRIENDSHIP DYADS HAD SOME MISSING DATA, SO YOU'RE KIDDING YOURSELF IF YOU THINK YOU CAN GET A COMPLETE NETWORK FOR THE CONTACTS. BUT IN PHYLOGENETIC NETWORKS, WHICH ARE OFTEN THOUGHT TO HAVE COMPLETE DATA, WHAT YOU SEE IS THAT A TYPICAL ANALYSIS IS ABLE TO CONNECT 30 TO 50% OF THE CASES INTO CLUSTERS, WITHOUT THE DIRECTION OF TRANSMISSION. THE OTHER 50 TO 70%? EITHER THEY WERE INFECTED BY OUTSIDE CONTACTS, WHICH SEEMS UNLIKELY IN MOST POPULATIONS, OR YOU'RE MISSING THE CONNECTING CASES FROM THIS POPULATION.
SO THIS IS COMPLICATED BY SAMPLING ISSUES OVER TIME, BUT IN EITHER CASE YOU'RE NOT GOING TO GET A COMPLETE NETWORK. YOU NEED TO THINK ABOUT WHAT THE IMPLICATIONS OF MISSING DATA ARE IN NETWORKS, WHICH I THINK ARE DIFFERENT THAN FOR INDEPENDENT DATA. WITH INDEPENDENT DATA, IF YOU DON'T OBSERVE RESPONDENT A IT DOESN'T AFFECT YOUR OTHER OBSERVATIONS. IF A DOESN'T RESPOND TO THE SURVEY OR DOESN'T RESPOND TO AN ITEM ON THE SURVEY, IT DOESN'T AFFECT THESE OBSERVED OUTCOMES. WITH DEPENDENT DATA, IF YOU DON'T OBSERVE A, IT AFFECTS THE OBSERVED PROPERTIES OF OTHERS AND THE STRUCTURE OF THE NETWORK THAT YOU OBSERVE, SO WE GET THE DEGREE DISTRIBUTION WRONG AS WELL. THE GENERAL POINT HERE IS THAT BIG DATA OFTEN MEANS A LOT OF DATA, REITERATING WHAT I THINK WAS A VERY WELL PUT COMMENT IN THE PREVIOUS PRESENTATION. A LOT OF DATA IS NOT A CENSUS. THINK ABOUT TWITTER: NOT EVERYBODY TWEETS. WHAT IS MISSING FROM THAT? THINK ABOUT HIV SEQUENCE DATABASES; THOSE ARE ALSO NOT COMPLETE. THE PROPERTIES OF YOUR SAMPLE DETERMINE THE METHODS YOU NEED FOR VALID INFERENCE. IN BIG DATA THAT'S NO DIFFERENT THAN IN LITTLE DATA. YOU CAN'T IGNORE SAMPLING. SECONDLY, A LOT OF DATA DOESN'T MEAN THE RELEVANT DATA. IF WE'RE INTERESTED IN HIV TRANSMISSION SYSTEMS, HIV DOESN'T SPREAD BY RETWEETS, FACEBOOK FRIENDS OR EMAIL. THOUGH WE HAVE LOTS OF DATA, THOSE ARE NOT THE RELEVANT DATA. ADOPTION AND DIFFUSION OF PREVENTION INTERVENTIONS MIGHT OR MIGHT NOT SPREAD EFFECTIVELY ON THESE NETWORKS, SO THEY MIGHT BE USEFUL FOR THOSE KINDS OF QUESTIONS BUT NOT FOR UNDERSTANDING THE SPREAD OF INFECTION. YOU CAN START WITH A NETWORK CENSUS, DATA ON EVERY NODE AND LINK. EARLY NETWORK ANALYSIS METHODS REQUIRED THIS DATA, WHICH IS WHY NETWORK ANALYSIS WENT NOWHERE FOR YEARS. THEN WE DEVELOPED ADAPTIVE APPROACHES, VARIATIONS ON LINK-TRACING DESIGNS. MOST IN THIS AUDIENCE HAVE HEARD ABOUT RESPONDENT-DRIVEN SAMPLING, BUT SNOWBALL SAMPLING AND CONTACT TRACING ARE ALSO ADAPTIVE DESIGNS. THIS IS AN ACTIVE AREA OF RESEARCH, WITH WORK BEING DONE ON HOW TO MAKE INFERENCES FROM THESE SAMPLES.
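THE POINT ABOVE ABOUT MISSING NODES DISTORTING THE OBSERVED DEGREE DISTRIBUTION CAN BE ILLUSTRATED WITH A SMALL SKETCH. THE ASSUMPTIONS ARE LOUD ONES: THE "TRUE" NETWORK IS A UNIFORM RANDOM GRAPH, NODES GO MISSING COMPLETELY AT RANDOM, AND A LINK IS OBSERVED ONLY IF BOTH ENDPOINTS ARE OBSERVED. ALL NAMES AND NUMBERS (`observe_with_missing_nodes`, 2000 NODES, 30% MISSING) ARE HYPOTHETICAL AND NOT FROM THE TALK.

```python
import random


def mean_degree(links, nodes):
    """Average number of links per node, counting only the given links."""
    degree = {v: 0 for v in nodes}
    for a, b in links:
        degree[a] += 1
        degree[b] += 1
    return sum(degree.values()) / len(degree)


def observe_with_missing_nodes(n=2000, n_links=1700, p_missing=0.3, seed=1):
    """Return (true mean degree, mean degree seen among the observed nodes)."""
    rng = random.Random(seed)
    nodes = list(range(n))

    # Build the "true" network: distinct links between distinct nodes.
    links = set()
    while len(links) < n_links:
        a, b = rng.sample(nodes, 2)
        links.add((min(a, b), max(a, b)))

    # Each node is independently unobserved with probability p_missing.
    observed = {v for v in nodes if rng.random() > p_missing}

    # A link is seen only when both of its endpoints are observed.
    seen = [(a, b) for a, b in links if a in observed and b in observed]
    return mean_degree(links, nodes), mean_degree(seen, observed)


if __name__ == "__main__":
    true_md, observed_md = observe_with_missing_nodes()
    print(true_md, observed_md)
```

IN THIS SETUP THE OBSERVED NODES APPEAR TO HAVE FEWER PARTNERS THAN THEY REALLY DO, BECAUSE LINKS TO UNOBSERVED NODES VANISH. WITH INDEPENDENT DATA, DROPPING RESPONDENTS AT RANDOM WOULD NOT BIAS THE MEAN OF THE REMAINING ONES IN THIS WAY.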
WE ALSO HAVE EGOCENTRIC SAMPLES, WHERE WE SAMPLE RESPONDENTS AND ASK THEM TO REPORT ON THEIR PARTNERS. WE DON'T ENROLL PARTNERS IN THE STUDY; WE GET INFORMATION ON THE ATTRIBUTES OF PARTNERS AND RELATIONSHIPS. THIS IS KIND OF THE POOR COUNTRY COUSIN OF NETWORK SAMPLING, BUT IT TURNS OUT TO INHERIT NICE INFERENTIAL PROPERTIES FROM SURVEYS. HOW DO YOU GET FROM A SAMPLE THAT LOOKS LIKE THIS, WHICH IS CROSS-SECTIONAL AND EGOCENTRIC, TO THE UNDERLYING PROCESSES OF PARTNERSHIP FORMATION AND DISSOLUTION, NODE ENTRY AND EXIT, AND TRANSMISSION OF INFECTION? IT SEEMS LIKE YOU CAN'T GET THERE FROM HERE. IN FACT, YOU CAN. YOU ESTIMATE POPULATION PARAMETERS FROM THE DATA, AND THE FULLY SPECIFIED MODEL CAN IN TURN BE USED FOR SIMULATING NETWORKS. IN NETWORK ANALYSIS THE PRIMARY STATISTICAL FRAMEWORK WE HAVE IS EXPONENTIAL FAMILY RANDOM GRAPH MODELS, WHERE THE WORK HAS BEEN DONE OVER THE LAST 20 YEARS. I WILL GIVE YOU A LITTLE MATH, NOT TOO MUCH. THE BASIC IDEA FOR AN EXPONENTIAL RANDOM GRAPH MODEL IS, WHAT IS THE PROBABILITY OF A PARTICULAR NETWORK? IT'S A FUNCTION OF NETWORK STATISTICS THAT ARE MORE OR LESS LIKELY THAN RANDOM, AND A KIND OF WEIGHT, A COEFFICIENT THERE, THAT DETERMINES HOW MUCH MORE LIKELY THIS PARTICULAR TYPE OF STATISTIC IS. WE HAVE A VECTOR OF MODEL PARAMETERS, THE NETWORK STATISTICS, AND THEN THIS NUMERATOR SUMMED OVER ALL POSSIBLE NETWORKS ON N NODES. REMEMBER THE ALL POSSIBLE NETWORKS THING? THIS IS WHERE IT COMES BACK TO BITE YOU. IT HAS WELL UNDERSTOOD STATISTICAL PROPERTIES, AND IT IS VERY GENERAL, VERY FLEXIBLE. FOR THOSE NOT FAMILIAR WITH THIS NOTATION, ALL THIS MEANS IS YOU HAVE THETA ONE TIMES X ONE PLUS THETA TWO TIMES X TWO, ET CETERA, SO THIS NUMERATOR IS ESSENTIALLY LIKE A LINEAR MODEL. THINK OF IT AS A LOGISTIC REGRESSION, EXCEPT IT'S A LITTLE DIFFERENT BECAUSE THE OBSERVATIONS ARE DEPENDENT. FRAMED IN TERMS OF THE CONDITIONAL PROBABILITY OF A TIE, A TIE IS EITHER MORE OR LESS LIKELY IF IT CONSTRUCTS ONE OF THESE CONFIGURATIONS THAT IS MORE OR LESS LIKELY THAN WE EXPECT BY CHANCE. WHAT DOES THE SPECIFICATION OF A MODEL LIKE THIS INVOLVE?
YOU HAVE TO REPRESENT THE NETWORK STATISTICS, AND THE NETWORK STATISTICS ARE COUNTS OF NETWORK CONFIGURATIONS. THE SIMPLEST ONE IS THE NUMBER OF EDGES, AND THAT CONTROLS THE DENSITY OF THE NETWORK. WITHIN-GROUP TIES: THIS IS WHAT A NETWORK ANALYST CALLS HOMOPHILY. YOU CAN COUNT TWO-STARS; THIS INFLUENCES THE DEGREE DISTRIBUTION, OR WE CAN MODEL THE DEGREE DISTRIBUTION SPECIFICALLY. THREE-CYCLES MEASURE TRIADIC CLUSTERING, AND ALMOST ANYTHING YOU CAN THINK OF AS A NETWORK CONFIGURATION YOU CAN PUT INTO A MODEL LIKE THIS. THERE'S A KEY DISTINCTION IN THE TYPES OF TERMS: DYAD-INDEPENDENT TERMS, ONE AND TWO HERE, AND DYAD-DEPENDENT TERMS, THREE AND FOUR HERE. FOR THE LATTER, THE STATUS OF ONE LINK DETERMINES THE STATUS OF ANOTHER, SO THAT'S WHY THEY'RE CALLED DYAD DEPENDENT, AND THAT'S WHEN YOU GENERATE CASCADING PROCESSES THAT TURN INTO COMPLEX SYSTEMS. WHERE WERE WE? TEN YEARS AGO WE WERE DOING CROSS-SECTIONAL ANALYSIS WITH A SINGLE MEASURE OF A COMPLETE NETWORK, AND OUR TEST DATA SET FOR THIS STUFF WAS SOME DATA ON A FRIENDSHIP NETWORK. YOU POSE A MODEL, ESTIMATE THE COEFFICIENTS, AND THEN YOU CAN SIMULATE DATA FROM THIS MODEL. AND THE SIMULATIONS, DURING THAT ERA, WE WERE BASICALLY USING TO DO GOODNESS OF FIT, BECAUSE WE CAN OBSERVE HIGHER-ORDER STATISTICS. SO SAY WE ONLY PUT EDGES IN THE MODEL, SO ALL WE'RE MODELING IS AN ERDOS-RENYI GRAPH OR BERNOULLI RANDOM GRAPH; ALL WE'RE MODELING IS A HOMOGENEOUS PROBABILITY OF TIE FORMATION. HOW WELL DO WE CAPTURE THE ENTIRE DEGREE DISTRIBUTION? THIS BLACK LINE IS WHAT WE OBSERVE IN THE DATA FOR THE DEGREE DISTRIBUTION, AND HERE IS WHAT OUR MODEL PRODUCES OVER 100 SIMULATIONS. WE CAN ALSO COMPARE TO SOMETHING LIKE EDGEWISE SHARED PARTNERS, A MEASURE OF LOCAL TRIADIC CLOSURE, OR COMPARE TO THE DISTANCE, BASICALLY THE OVERALL REACHABILITY IN A NETWORK. SO THAT'S WHAT WE WERE USING SIMULATION FOR TEN YEARS AGO.
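AS A TOY ILLUSTRATION OF THE EXPONENTIAL RANDOM GRAPH MODEL DESCRIBED ABOVE, THE SKETCH BELOW COMPUTES THREE OF THE NETWORK STATISTICS MENTIONED (EDGES, TWO-STARS, TRIANGLES) AND THE PROBABILITY OF A NETWORK AS EXP(THETA DOT STATISTICS) DIVIDED BY THE SAME QUANTITY SUMMED OVER ALL POSSIBLE NETWORKS. THE SUM IS DONE BY BRUTE FORCE, WHICH ONLY WORKS FOR A HANDFUL OF NODES; REAL SOFTWARE DOES NOT ESTIMATE MODELS THIS WAY. THE HOMOPHILY TERM IS LEFT OUT FOR BREVITY, AND THE FUNCTION NAMES `stats` AND `ergm_probability` ARE HYPOTHETICAL.

```python
from itertools import combinations, product
from math import exp


def stats(edge_set, n):
    """Return (edges, two-stars, triangles) for an undirected network.

    `edge_set` is a set of (i, j) tuples with i < j on nodes 0..n-1.
    """
    degree = [0] * n
    for a, b in edge_set:
        degree[a] += 1
        degree[b] += 1
    # A two-star is a pair of links sharing a node: (degree choose 2) per node.
    two_stars = sum(d * (d - 1) // 2 for d in degree)
    # A triangle is a node triple with all three links present.
    triangles = sum(
        1
        for a, b, c in combinations(range(n), 3)
        if {(a, b), (a, c), (b, c)} <= edge_set
    )
    return (len(edge_set), two_stars, triangles)


def ergm_probability(edge_set, n, theta):
    """Probability of `edge_set` under coefficients `theta`, by full enumeration."""
    dyads = list(combinations(range(n), 2))

    def weight(es):
        # The "numerator": exp of the linear combination theta . g(y).
        return exp(sum(t * g for t, g in zip(theta, stats(es, n))))

    # The normalizing constant: the numerator summed over all 2**len(dyads) networks.
    total = 0.0
    for pattern in product([0, 1], repeat=len(dyads)):
        es = {d for d, on in zip(dyads, pattern) if on}
        total += weight(es)
    return weight(edge_set) / total


if __name__ == "__main__":
    triangle = {(0, 1), (0, 2), (1, 2)}
    print(stats(triangle, 4))
    print(ergm_probability(triangle, 4, (-1.0, 0.0, 0.5)))
```

WITH ALL COEFFICIENTS AT ZERO EVERY NETWORK IS EQUALLY LIKELY. AN EDGES-ONLY MODEL GIVES THE HOMOGENEOUS RANDOM GRAPH MENTIONED ABOVE, AND THE TWO-STAR AND TRIANGLE TERMS ARE THE DYAD-DEPENDENT ONES.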
AFTER THAT (INDISCERNIBLE) CAME UP WITH A TEMPORAL VERSION OF ERGMS FOR DYNAMIC NETWORK ANALYSIS. EFFECTIVELY THIS CREATED A MODEL FOR FORMATION OF PARTNERSHIPS AND A MODEL FOR DISSOLUTION OF PARTNERSHIPS THAT TAKES YOU FROM TIME T TO TIME T PLUS ONE, SO WE HAVE AN ACTUAL GENERATIVE MODEL FOR THESE UNDERLYING NETWORKS. THE FITTED MODEL CAN BE USED TO SIMULATE DYNAMIC NETWORK SERIES. THEN IN RAPID SUCCESSION WE ADDRESSED SEVERAL ISSUES IMPORTANT TO GET FROM AN EGOCENTRIC CROSS-SECTIONAL SAMPLE TO A TEMPORAL, DYNAMIC, EVOLVING NETWORK: ONE, SIZE-INVARIANT PARAMETERIZATIONS, WHICH PERMIT CHANGING SIZE AND NODAL COMPOSITION; ESTIMATING FROM CROSS-SECTIONAL DATA BY USING RETROSPECTIVE INFORMATION ON PARTNERSHIP DURATION OR AGE; AND FINALLY ESTIMATING FROM EGOCENTRICALLY SAMPLED DATA, WITH A FRAMEWORK FOR MODEL SELECTION AND INFERENCE. NOW WE HAVE A BRIDGE BETWEEN STATISTICAL NETWORK INFERENCE AND SIMULATION. WE CAN START WITH SMALL DATA, A SINGLE CROSS-SECTIONAL SAMPLED NETWORK, AND FIT A TEMPORAL RANDOM GRAPH MODEL TO THE DATA WITH MODEL SELECTION AND GOODNESS OF FIT. WE SIMULATE THE LONGITUDINAL DYNAMIC PROCESS OF FORMATION AND DISSOLUTION, WITH NODE DEMOGRAPHICS, THAT REPLICATES ALL THE OBSERVED CONTACT MODEL STATISTICS OVER TIME. THEN YOU LAYER ANY DIFFUSION PROCESS YOU LIKE ON THE RESULTING DYNAMIC NETWORK. THAT'S HUGE. THIS IS WHERE YOU THEN BRING IN EVERYTHING ELSE THAT MATTERS FOR HIV. FROM MY PERSPECTIVE THIS IS A CLASSIC STATISTICAL RABBIT OUT OF THE HAT: HOW THE HELL DID YOU DO THAT? IT SHIFTS THE BURDEN FROM BIG DATA TO BIG COMPUTATION. SO AN EXAMPLE. I WILL TALK BRIEFLY ABOUT UNDERSTANDING HIV DISPARITIES BY RACE. BECAUSE I HAVE LITTLE TIME LEFT: THESE ARE EVIDENT IN ALL RISK GROUPS BUT I WILL FOCUS ON HETEROSEXUALS HERE. HIV DISPARITIES IN 2010, THIS IS THE MOST RECENT DATA AVAILABLE ON NEWLY ACQUIRED HIV INFECTIONS: WE SEE A BLACK WOMAN'S RISK OF ACQUIRING INFECTION IS # 4 TIMES HIGHER THAN A WHITE HETEROSEXUAL MAN'S RISK. THAT'S INSANE. YOU MIGHT ASK YOURSELF, WHY IS THAT?
DISPARITIES HAVE A HISTORY IN THE U.S. FOR SYPHILIS, WITH STATISTICS BACK TO 1966, IT DOESN'T LOOK LIKE THAT BIG A DIFFERENCE, BUT NOTE THIS IS ON THE LOG SCALE, SO THESE ARE HUGE DIFFERENCES. IS THERE EVIDENCE THAT BLACK WOMEN'S BEHAVIOR IS RISKIER THAN THAT OF WHITE MEN OR WOMEN? EVERY REPRESENTATIVE STUDY FOR THE LAST 20 YEARS SAYS NO. THESE ARE RISK RATIOS, WITH LESS RISKY BEHAVIOR FOR BLACK WOMEN TO THE LEFT AND MORE RISKY TO THE RIGHT, RELATIVE TO WHITE WOMEN. WE DON'T SEE MORE RISKY BEHAVIOR FOR BLACK WOMEN. THE NETWORK EXPLANATION FOR BLACK WOMEN'S RISK IS THAT IT TAKES TWO STRUCTURAL FEATURES: SEGREGATION, WHICH IS ASSORTATIVE MIXING BY RACE, AND DIFFERENT RATES OF CONCURRENCY FOR THEIR MALE PARTNERS RELATIVE TO WHITE PARTNERS. CONCURRENCY SPREADS INFECTION IN THIS GROUP SO PREVALENCE RISES, AND THE ASSORTATIVE MIXING SEGREGATES THE NETWORK SO THAT A PREVALENCE DIFFERENTIAL CAN BE SUSTAINED OVER TIME. IS THERE EVIDENCE THAT NETWORK STRUCTURE IS RISKIER ON THESE MEASURES? YES, EVERY REPRESENTATIVE STUDY FOR 20 YEARS SAYS YES ON CONCURRENCY AND ASSORTATIVE MIXING. SO WE CAN TEST THE HYPOTHESIS WITH EMPIRICAL NETWORK DATA, AND I WILL USE THE NHSLS, ED LAUMANN'S STUDY, A REPRESENTATIVE STUDY OF 3,000 ADULTS 18 TO 59 YEARS OLD WITH EGOCENTRIC SEXUAL NETWORK DATA ON PARTNERSHIPS IN THE LAST YEAR. WE GET NO RESPECT FOR BEHAVIORAL DATA, BUT I WANT TO POINT OUT THAT HERE THERE IS NO SIGNIFICANT DIFFERENCE IN THE NUMBER OF PARTNERS REPORTED BY SEX. THIS IS FOR THE NUMBER REPORTED ON THE DAY OF THE INTERVIEW: FEMALES REPORT ABOUT 1300 AND MEN REPORT ABOUT 1300. SO LET'S MODEL THIS AND EVALUATE IT. FIRST WE WILL START WITH THE TRADITIONAL STATISTICAL APPROACH. WE WILL TEST THESE THREE DIFFERENT MODELS AND SEE HOW WELL THEY FIT THE DATA. THE FIRST MODEL IS A DEMOGRAPHIC MAIN EFFECTS MODEL THAT INCLUDES A TERM FOR SEX HOMOPHILY, WHICH WE EXPECT TO BE NEGATIVE, AND ALLOWS FOR VARIATION BY SEX AND RACE. THEN WE ADD ASSORTATIVE MIXING, AND FINALLY WE ADD CONCURRENCY, A MODEL DEPENDENCY FOR MONOGAMY.
THIS IS STANDARD STATISTICAL ANALYSIS. THIS IS NOT A BIG DEAL TO YOU GUYS, BUT WE NEVER HAD IT BEFORE, ESPECIALLY FOR EGOCENTRIC DATA. WE SEE A SIGNIFICANT SEX EFFECT BUT NO SIGNIFICANT DIFFERENCE IN AVERAGE DEGREE BY RACE. HOMOPHILY IS SIGNIFICANT BY RACE, PARTICULARLY STRONG AMONG BLACKS, WITH RELATIVE ODDS OF A WITHIN-GROUP PARTNERSHIP OF 170 TO 1. WE SHOW A STRONG MONOGAMY BIAS, A POSITIVE COEFFICIENT, BUT THERE ARE SIGNIFICANT DIFFERENCES BY RACE, WITH THE HIGHEST FOR WHITE MEN AND WOMEN, WHEREAS THE LOWEST IS FOR BLACK MEN, WITH ODDS OF 2.7. HOW DO THESE THREE MODELS FIT THE DEGREE DISTRIBUTION THAT WE OBSERVE IN THE DATA? THIS IS ONE THING WE OBSERVE, SO WE CAN LOOK AT IT. FOR MODEL ONE, HERE ARE THE SIMULATIONS, THESE BLACK DOTS. THERE IS VARIATION BUT NOT A LOT; THAT'S OUR SIMULATION. WE DON'T FIT THE DEGREE DISTRIBUTION AT ALL. WE ALSO DON'T DO A GOOD JOB IN MODEL TWO, BUT AS SOON AS WE PUT IN THE CONCURRENCY, THE MONOGAMY BIAS, WE FIT THE REST OF THE DEGREE DISTRIBUTION PERFECTLY. MODEL 3 IS BEST, AND THE PATTERNS ARE CONSISTENT WITH THE TWO KEY ELEMENTS OF THE NETWORK HYPOTHESIS. NOW WE SIMULATE FROM EACH MODEL, SO WE SIMULATE 100 DRAWS OF A 10,000-NODE NETWORK AND SEE WHAT EACH MODEL PREDICTS FOR NETWORK EXPOSURE. NOW WE GET A COMPLETE NETWORK FROM AN EGOCENTRIC DATA SET. HOW DO WE DEFINE NETWORK EXPOSURE? IT IS MEASURED BY THE PROBABILITY OF BEING IN A COMPONENT LARGER THAN SIZE 2, BECAUSE THAT'S WHEN YOUR BEHAVIOR NO LONGER DETERMINES YOUR RISK EXPOSURE. THAT'S NETWORK EXPOSURE. WE SEE WITH MODELS ONE AND TWO THERE IS ALMOST NO DIFFERENCE: ABOUT 40 TO 45%, 35 TO 40% PROBABILITY OF BEING IN COMPONENTS OF SIZE 3 OR LARGER, AND WE DON'T SEE THE KIND OF PATTERN WE WOULD EXPECT GIVEN THE PREVALENCE DISPARITIES. BUT FOR MODEL 3, WHEN WE ADD THE DIFFERENCES IN CONCURRENCY, WE SEE THE SAME PATTERN THAT WE SEE IN THE HIV DISPARITIES, SO THE HIGHEST NETWORK EXPOSURE IS FOR BLACK WOMEN, AND WHITE MEN AND WOMEN HAVE ALMOST NO NETWORK EXPOSURE. I'M NOT GOING TO HAVE TIME TO SHOW YOU THE WHOLE MODEL, SO I WILL SHOW YOU A LITTLE BIT.
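THE NETWORK EXPOSURE MEASURE DEFINED ABOVE, THE PROBABILITY OF BEING IN A COMPONENT LARGER THAN SIZE 2, CAN BE COMPUTED FOR ANY SINGLE SIMULATED NETWORK AS THE SHARE OF NODES IN SUCH COMPONENTS. THIS IS A MINIMAL SKETCH ON A LIST OF LINKS; THE FUNCTION NAME `network_exposure` IS HYPOTHETICAL, AND THE BREAKDOWN BY RACE AND SEX USED IN THE TALK IS NOT INCLUDED.

```python
from collections import Counter


def network_exposure(n, links):
    """Share of nodes 0..n-1 that sit in a connected component of size 3 or more."""
    parent = list(range(n))  # union-find forest over the nodes

    def find(x):
        # Follow parent pointers to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in links:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    component_size = Counter(find(v) for v in range(n))
    exposed = sum(1 for v in range(n) if component_size[find(v)] > 2)
    return exposed / n


if __name__ == "__main__":
    # One isolated pair, one chain of three, one person with no partner.
    print(network_exposure(6, [(0, 1), (2, 3), (3, 4)]))
```

IN A POPULATION MADE UP ONLY OF ISOLATED PAIRS THE MEASURE IS ZERO; A SINGLE CONCURRENT PARTNERSHIP THAT JOINS TWO PAIRS EXPOSES ALL FOUR PEOPLE INVOLVED, REGARDLESS OF THEIR OWN BEHAVIOR.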
WE'RE DOING THIS FROM A SINGLE CROSS-SECTIONAL, EGOCENTRICALLY SAMPLED NETWORK AND GAINING INTUITION ABOUT HOW THE STRUCTURE OF THE NETWORK INFLUENCES THE SPREAD OF INFECTION. BECAUSE I DON'T HAVE TIME I WON'T GO THROUGH THIS, BUT I'M HAPPY TO TALK WITH PEOPLE AFTERWARDS. THE KEY TAKE-HOME POINT: ONLY 5% OF TIES ARE CONCURRENT BUT THESE ACCOUNT FOR 50% OF THE FORWARD REACHABLE SET OF INFECTION. ALL THIS IS OPEN SOURCE. WE HAVE HAD A REAL COMMITMENT TO DEVELOPING OPEN SOURCE TOOLS. AS VICTOR SAID, IT'S THE STATNET LIBRARY OF SOFTWARE PACKAGES, AND THERE ARE TRAINING WORKSHOPS AND TUTORIALS AND OTHER MATERIAL ONLINE. I WOULD LIKE TO THANK NICHD AND NIDA FOR CONSISTENT YEARS OF SUPPORT ON THIS. IN SUMMARY, I WOULD LIKE TO SAY THAT IN THE ERA OF BIG DATA WE HAVE TO UNDERSTAND SAMPLING. EVEN IF YOU HAVE BIG DATA WE NEVER OBSERVE THE CENSUS. WITH SAMPLES AND APPROPRIATE STATISTICAL MODELS YOU CAN GET VALID, ROBUST INFERENCES. AS MIGUEL HERNAN MENTIONED YESTERDAY, WE DON'T NEED TO REINVENT METHODS, WE JUST CAN'T FORGET THEM. CONTACT NETWORK ANALYSIS REMAINS THE ONLY TOOL FOR PROJECTION AND INTERVENTION IN A VIRTUAL LAB. WE HAVE MADE SUBSTANTIAL PROGRESS IN DEVELOPING METHODS FOR SAMPLED DATA, AND THE BIG DATA CHALLENGES ARE IN COMPUTATION. I WOULD ALSO LIKE TO POINT OUT THAT THE LITTLE DETAILS ACTUALLY MATTER HERE FOR THE BIG DYNAMICS. THE SMALL DIFFERENCE IN CONCURRENCY HAS IMPLICATIONS FOR THESE REALLY LARGE DIFFERENCES AND DISPARITIES IN HIV. THAT'S WHY WE NEED ACCURATE CONTACT NETWORK MODELS FOR PROJECTION, NOT ASSUMPTIONS. THANK YOU VERY MUCH. [APPLAUSE] >> THANK YOU, MARTINA. YOU RAISED A LOT OF GREAT POINTS WE CAN FURTHER ADDRESS DURING THE PANEL DISCUSSION. OUR NEXT SPEAKER IS DR. STEVEN COLE, PROFESSOR OF EPIDEMIOLOGY AND DIRECTOR OF GRADUATE STUDIES AT THE UNC CHAPEL HILL SCHOOL OF GLOBAL PUBLIC HEALTH. HIS RESEARCH FOCUS INCLUDES INFECTIOUS DISEASE, PRIMARILY HIV, AND CANCER.
>> THANKS, VICTOR, AND THANKS TO THE ORGANIZERS FOR PUTTING ON THIS INTERESTING WORKSHOP. MY POWERPOINT'S BECOME SENTIENT AND JUMPED FORWARD. THANKS FOR PUTTING ON THIS GREAT SET OF HETEROGENEOUS TALKS. IT'S BEEN REALLY INTERESTING. THIS IS JOINT WORK WITH JESSIE EDWARDS, KATIE AND MICHAEL HUDGENS AND ALAN BROOKHART, AND THANKS TO JAMES ROBINS AND SANDER GREENLAND FOR EXPERT ADVICE. I'M SUPPORTED BY NIAID THROUGH THE MECHANISMS LISTED HERE. I WAS ORIGINALLY GIVEN A TITLE WHICH, IF I'M RECALLING CORRECTLY, WAS INFERENCE IN A WORLD OF MISSING DATA, WHICH I LET GO MISSING AND REPLACED WITH AN EVEN TOUGHER TITLE, FROM BIG DATA TO KNOWLEDGE. SO I THOUGHT I SHOULD PROBABLY START BY DEFINING MY TERMS, WHAT I MEAN BY BIG DATA AND WHAT I MEAN BY KNOWLEDGE. THAT COULD TAKE MUCH MORE THAN 20 MINUTES, SO WE WILL HAVE A WORKING DEFINITION IN A SECOND. BIG DATA: MY WORKING DEFINITION IS AN ENORMOUS NUMBER OF DISPARATELY SOURCED UNITS OR FEATURES, IF POSSIBLE MEASURED IN REAL TIME. WHAT I WANT TO INCLUDE IS THESE; WE HAVE SEEN A FOURTH I WASN'T AWARE OF. AND I WANT TO TALK ABOUT HOW BIG DATA IS ENCROACHING ON MY WORK AS AN EPIDEMIOLOGIST, AND HOW I SEE IT AS A WAY TO IMPROVE THE INFERENCES THAT WE CAN MAKE. WE'LL HAVE TO LAY A LITTLE BIT OF GROUNDWORK. ESSENTIALLY BIG DATA IS MASSIVE SAMPLES AND/OR HIGH DIMENSION. MASSIVE SAMPLES ALLOW US TO ANSWER QUESTIONS MORE COMPLEX THAN WE CAN WITH TYPICALLY SIZED DATA, WHICH IS A BOON FOR PRECISION MEDICINE AND OTHER AREAS AS WELL. HIGH DIMENSION CAN BE IN THE CONTEXT, THE POLICIES WE STUDY, THE OUTCOMES WE STUDY, OR THE MODIFIERS OF THE EFFECTS OF POLICIES ON OUTCOMES. WE'LL DEFINE THE OUTCOMES IN A SECOND. A WONDERFUL QUOTE FROM JUDEA PEARL'S 2011 AWARD LECTURE: COUNTERFACTUALS ARE THE BUILDING BLOCKS OF SCIENTIFIC THOUGHT, FREE WILL AND MORAL BEHAVIOR. I WILL SUBSCRIBE TO THE FIRST AND DROP FREE WILL AND MORAL BEHAVIOR, BUT I SUBSCRIBE TO THE FACT THAT COUNTERFACTUALS ARE BUILDING BLOCKS FOR SCIENTIFIC THOUGHT. WE WILL USE THESE BLOCKS. WE'RE INTERESTED IN HAVING A COUNTERFACTUAL DISTRIBUTION FUNCTION.
SO WE THINK ABOUT PROBABILITIES. I SLIPPED BY DEFINING KNOWLEDGE, BUT WE TYPICALLY DEFINE KNOWLEDGE AS DEGREE OF BELIEF MEASURED BY PROBABILITY. YOU MIGHT MAKE A STATEMENT. THAT STATEMENT YOU BELIEVE TO A CERTAIN EXTENT. AND WHEN WE HAVE SHARED KNOWLEDGE, WE HAVE A SHARED BELIEF IN A STATEMENT THAT MEETS A CERTAIN LEVEL OF VERACITY, AND IN SOCIETY THAT IS KNOWN AS FACT. IT OFTEN GETS SUPERSEDED. WE HAVE PROBABILITY THEORY, AND THE IDEA OF PROBABILITY IS FAIRLY WELL ENTRENCHED, BUT I'M TALKING ABOUT COUNTERFACTUAL PROBABILITY, PROBABILITY THEORY ON COUNTERFACTUALS. WHAT WE HAVE HERE IS THE PROBABILITY THAT THE COUNTERFACTUAL VARIABLES (INAUDIBLE), AND THAT'S WHAT I WANT TO KNOW ABOUT. THAT'S THE STATEMENT: I WANT TO KNOW WHAT THE PROBABILITY, OR THE RISK, THE COUNTERFACTUAL DISTRIBUTION FUNCTION, IS. SO A IS A FINITE SET OF, SAY, K TIME-FIXED POLICIES OR TREATMENTS. TYPICALLY IN MY WORLD K HAS BEEN ON THE ORDER OF THREE OR FIVE OR 30 OR 50, BUT IT CAN BE 5,000, 10,000, 20,000, 50,000, 100,000 IN THEORY. IN PRACTICE I HAVEN'T SEEN IT YET; IT COULD BE 20,000. SO A LARGE NUMBER OF POLICIES. AND WE COULD GENERALIZE TIME-FIXED POLICIES TO TIME-VARYING POLICIES. AS MIGUEL WAS TALKING ABOUT YESTERDAY, TO ASK AND ANSWER IMPORTANT QUESTIONS WE OFTEN NEED TO CONSIDER THEM TO BE TIME-VARYING, BUT FOR TODAY I WANT TO LIMIT US TO TIME-FIXED POLICIES, BECAUSE WE CAN LEARN ALL THE PIECES THAT I'M GOING TO MENTION TODAY IN THE TIME-FIXED SETTING AND IT KEEPS EVERYTHING MORE TRACTABLE. SO Y IS A FINITE SET OF Q ARBITRARILY DISTRIBUTED OUTCOME VARIABLES. WE CAN THINK OF THE OUTCOME HERE, THE OBSERVED OUTCOME, AS A COMPLEX OF CAUSE-SPECIFIC MORTALITY AS WELL AS INCIDENCE OF A PARTICULAR DISEASE AS WELL AS SETS OF BIOMARKERS THAT REPRESENT THE IMMUNOLOGY AND VIROLOGY. IT'S ALL THOSE TOGETHER IN A MULTIVARIATE SET. YOU CAN KEEP THINKING THAT WAY ALL THE WAY THROUGH THE TALK, OR YOU CAN REMOVE ALL THAT COMPLEXITY AND THINK OF Y AS FIVE-YEAR ALL-CAUSE MORTALITY, TO NAIL IT DOWN.
Y SUPERSCRIPT A IS THE COUNTERFACTUAL; IT'S A SET OF K TIMES Q POSSIBLY RANDOM POTENTIAL OUTCOMES. COUNTERFACTUAL CONSISTENCY STATES THAT THE OBSERVED OUTCOME FOR YOU IS YOUR POTENTIAL OUTCOME UNDER POLICY A WHEN WE SEE YOU TO HAVE POLICY A. THERE WAS A NICE DISCUSSION OF THIS IN EPIDEMIOLOGY IN 2010. SO WE HAVE COUNTERFACTUALS, AND WE WANT TO ESTIMATE THE RISK UNDER A WHAT-MAY-BE SCENARIO. (LOST AUDIO) CONDITIONS: YOU EITHER SUBSCRIBE TO THESE CONDITIONS OR REPLACE THEM WITH ONE OF THESE ALTERNATIVE SETS WHENEVER YOU BELIEVE ANYTHING. WHENEVER YOU SUBSCRIBE TO A PIECE OF KNOWLEDGE, YOU HAVE PROCESSED IT IN THIS WAY OR AN EXCHANGEABLE WAY. SO NO MEASUREMENT ERROR; NO INTERFERENCE, AND THERE WAS TALK IN THE LAST TALK, AND THERE WILL BE IN THE NEXT TALK, ABOUT INTERFERENCE; POLICY VERSION IRRELEVANCE; AND EXCHANGEABILITY. AT THE END OF THE TALK I HAVE A SET OF REFERENCES THAT GIVE YOU MORE DETAIL ON THESE TOPICS. THE ONE THAT'S EASIEST TO UNDERSTAND IS MEASUREMENT ERROR. THESE ARE CONDITIONS THAT NEED TO BE IN PLACE IN ORDER TO IDENTIFY AND ESTIMATE AN AVERAGE EFFECT. IF WE HAVE EXCHANGEABILITY CONDITIONAL ON CONTEXT WE NEED POSITIVITY, AND IF WE HAVE TO CONDITION ON THAT CONTEXT WITH SOME KIND OF PARAMETRIC MODEL WE NEED TO HAVE THE CORRECT MODEL FORM. SO THIS IS OUR WORKING MODEL THAT COMBINES WITH COUNTERFACTUALS. SO THAT'S BACKGROUND, BUT COUNTERFACTUALS ALONE, THAT SET OF CONDITIONS, DON'T PROVIDE WHAT WE NEED TO MAKE INFERENCE AND LEARN ABOUT THE WORLD. WE NEED TO BELIEVE THOSE CONDITIONS HOLD, BUT HOW DO WE CONVINCE OURSELVES THE CONDITIONS HOLD? THE CLOSEST THING TO MAGIC IN THIS TALK IS THE RANDOMIZED EXPERIMENT. THIS IS A GREAT QUOTE: EXPERIMENT IS THE SOLE JUDGE OF SCIENTIFIC TRUTH. SO WHAT I WANT YOU TO DO IS IMAGINE A RANDOMIZED EXPERIMENT. YOU HAVE RANDOM SAMPLING OF LITTLE M UNITS FROM AN ARBITRARILY LARGE POPULATION OF CAPITAL N, YOU HAVE RANDOM ALLOCATION OF THESE LITTLE M UNITS TO ONE OF THE K POLICIES, AND YOU HAVE RANDOM OBSERVATION OF LITTLE N UNITS FROM THE LITTLE M UNITS.
WE ARE IMAGINING THIS. THIS IS A MENTAL TOOL, LIKE A TRIANGLE OR A REAL NUMBER, THAT WE USE TO THINK ABOUT INFERENCE. WE WILL IGNORE THIS THIRD RANDOMIZATION. WE NEED IT AS PART OF THE FULL TOOL BUT FOR TODAY WE CAN IGNORE IT. THIS IS THE TOOL WE NEED TO UNDERSTAND HOW TO HANDLE MISSING DATA IN AN OBSERVATIONAL STUDY. IN THIS HYPOTHETICAL MULTIPLY RANDOMIZED TRIAL, WITH M BEING ARBITRARILY LARGE, WE CAN IDENTIFY THE CAUSAL EFFECT. BY IDENTIFICATION I DON'T MEAN WHAT WE TALKED ABOUT YESTERDAY WHEN WE SAID IDENTIFICATION; WE USE WORDS IN DIFFERENT WAYS. I MEAN THAT IF YOU CAN GET RID OF RANDOM ERROR IN YOUR SYSTEM YOU CAN LEARN THE VALUE, A SINGLE NUMBER. AS YOU MAKE M ARBITRARILY LARGE, YOU CAN ACTUALLY LEARN THE VALUE YOU WANT FOR THAT COUNTERFACTUAL PROBABILITY DISTRIBUTION. WHAT IS INTERESTING HERE IS THAT THIS G-FORMULA, WHICH WE WILL COME BACK TO IN A FEW MORE SLIDES, THIS COUNTERFACTUAL PROBABILITY THAT WE WANT, EQUALS AND SIMPLIFIES, REDUCES TO, A CONDITIONAL PROBABILITY. CORRELATION EQUALS CAUSATION IN THIS RANDOMIZED THOUGHT EXPERIMENT, AND WE CAN LEARN ABOUT THE COUNTERFACTUAL WITH THE FACTUAL: WE USE FACTUAL OBSERVATIONS ABOUT THE WORLD TO LEARN ABOUT WHAT MAY BE. THAT'S AN AMAZING GIFT FISHER GAVE US ALMOST 100 YEARS AGO. FISHER WAS THINKING NOT ABSTRACTLY ENOUGH. FISHER WAS THINKING CONCRETELY ABOUT THE EXPERIMENT, ABOUT THE TRIALS THAT HE WAS DOING IN THE FIELDS. THINKING MORE ABSTRACTLY ABOUT THE RANDOMIZED TRIAL, WE CAN MAKE IT A MENTAL TOOL TO KEEP IN THE BACKGROUND REGARDLESS OF WHETHER WE ARE RUNNING A RANDOMIZED TRIAL OR NOT. IF WE CAN IDENTIFY THE CREATURE WE WANT TO ESTIMATE, THEN THE NEXT THING WE HAVE TO DO IS LOOK FOR AN ESTIMATOR. THAT'S EASY IN THIS EXPERIMENT; IT'S OFTEN A HARD PART OF OBSERVATIONAL SCIENCE. GIVEN IDENTIFICATION, WE CAN GET A CONSISTENT ESTIMATOR OF THE COUNTERFACTUAL RISK USING THE NONPARAMETRIC EMPIRICAL DISTRIBUTION FUNCTION. PUT ONE OVER N ON EACH OBSERVATION AND CALCULATE THAT; IT'S STRAIGHTFORWARD EVEN IF WE HAVE 10,000 DIFFERENT POLICIES AND WE HAVE A COMPLEX OUTCOME.
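THE REDUCTION OF THE COUNTERFACTUAL PROBABILITY TO A CONDITIONAL PROBABILITY DESCRIBED ABOVE CAN BE WRITTEN OUT AS FOLLOWS. THIS IS A SKETCH FOR A SINGLE TIME-FIXED POLICY, NOT THE SPEAKER'S SLIDE. AS NOTATIONAL ASSUMPTIONS, $A$ DENOTES THE POLICY ACTUALLY ASSIGNED, $a$ ONE OF THE $K$ POLICIES WITH $P(A = a) > 0$, $Y^{a}$ THE POTENTIAL OUTCOME UNDER $a$, AND $Y$ THE OBSERVED OUTCOME, WITH NO MEASUREMENT ERROR AND NO INTERFERENCE.

```latex
\begin{align}
F_{Y^{a}}(y) \equiv P\left(Y^{a} \le y\right)
  &= P\left(Y^{a} \le y \mid A = a\right)
  && \text{exchangeability under randomization: } Y^{a} \perp\!\!\!\perp A \\
  &= P\left(Y \le y \mid A = a\right)
  && \text{consistency: } Y = Y^{a} \text{ when } A = a
\end{align}
```

THE LEFT-HAND SIDE CONCERNS WHAT MAY BE UNDER POLICY $a$; THE RIGHT-HAND SIDE INVOLVES ONLY OBSERVED QUANTITIES, WHICH IS THE SENSE IN WHICH THE FACTUAL INFORMS THE COUNTERFACTUAL IN THE RANDOMIZED THOUGHT EXPERIMENT.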
WE'LL COME BACK TO ONE OTHER ISSUE WITH THAT IN A SECOND, BUT WITH CENSORING WE CAN USE THE NON-PARAMETRIC AALEN-JOHANSEN ESTIMATOR AND BE MODEL FREE. TO THE EXTENT WE CAN BE MODEL FREE IN OUR LEARNING WE WANT TO BE MODEL FREE, BECAUSE EVERY TIME WE HAVE A MODEL WE HAVE A CONSTRAINT, AND WE MAY OR MAY NOT BE RIGHT ABOUT THAT CONSTRAINT. IF WE'RE RIGHT, GREAT. IF NOT, WE GO OFF ON A TANGENT. TO THE EXTENT WE ARE MODEL FREE WE'RE IN A SAFER POSITION. WE CAN NOT ONLY ESTIMATE THE COUNTERFACTUAL RISK BUT ALSO ITS VARIABILITY NON-PARAMETRICALLY, USING EFRON'S BOOTSTRAP, ANOTHER GIFT TO SCIENCE. SO WE HAVE THE COMBINATION OF THOUGHT EXPERIMENT AND EXPERIMENTS, AND WE HAVE THE COUNTERFACTUALS. THERE IS ONE ASIDE HERE, AND IT WILL BE A CLUE TO WHAT'S LATER IN THE TALK: IF WE HAVE THIS IDEAL EXPERIMENT BUT WE HAVE A LARGE NUMBER OF POLICIES, OR WE HAVE A COMPLEX MULTIVARIATE OUTCOME, WE WILL BE SUBJECT TO FINITE SAMPLE BIAS IF WE WERE TO ACTUALLY FIELD THIS EXPERIMENT. WE HAVE FINITE NUMBERS IN THE WORLD, SO WE ARE SUBJECT TO FINITE SAMPLE BIAS. AND EVEN IF WE CAN DO ONE OF THESE TRIPLY RANDOMIZED EXPERIMENTS WE WANT TO DO SOMETHING LIKE A BAYES, NON-BAYES COMPROMISE TO STABILIZE THE FUNCTIONALS OF THE COUNTERFACTUAL RISK, LIKE THE RISK RATIO OR THE RISK DIFFERENCE. WITH WELL CHOSEN PRIORS YOU CAN SHOW WE HAVE REDUCED MEAN SQUARED ERROR WHEN WE PENALIZE OR STABILIZE THESE RISK DIFFERENCES OR RISK RATIOS. THE TRICK THERE IS WE DON'T KNOW EXACTLY WHAT WELL CHOSEN MEANS YET. I FORGOT WHAT THE NEXT SLIDE IS. I HAD THIS NICE THOUGHT: THOUGHTS ARE SO BEAUTIFUL IF WE CAN GET THEM CLEAN. OUR THOUGHTS ARE SO BEAUTIFUL WHEN WE CAN ARTICULATE THEM TO EACH OTHER AND PING BACK AND FORTH AND MAKE SURE THAT WE HAVE GOT ONE OF THOSE THICK NETWORKS ON THE RIGHT HAND SIDE OF SLIDE 4 OF THE PRIOR SPEAKER. I DON'T KNOW THAT I'M THERE YET. THAT'S THE DEEP COMMUNICATION WHERE WE HAVE REDUNDANCIES WHEN TALKING. EVEN IF WE FIND WE CAN HAVE A MULTIPLY RANDOMIZED EXPERIMENT, ANY EXPERIMENT IS A BROKEN EXPERIMENT.
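AS AN ILLUSTRATION OF THE TWO NON-PARAMETRIC STEPS DESCRIBED ABOVE, THE FOLLOWING PYTHON SKETCH ESTIMATES THE RISK IN EACH RANDOMIZED ARM AS A SIMPLE PROPORTION (EQUAL WEIGHT ON EACH OBSERVATION) AND THEN USES A PERCENTILE BOOTSTRAP FOR THE VARIABILITY OF THE RISK DIFFERENCE. THE FUNCTION NAMES, THE TOY DATA, THE 2,000 RESAMPLES AND THE CHOICE OF A PERCENTILE INTERVAL ARE ASSUMPTIONS MADE FOR ILLUSTRATION; THE TALK DOES NOT SPECIFY AN IMPLEMENTATION.

```python
import random


def empirical_risk(outcomes):
    """Proportion with the outcome: equal mass 1/n on each observation."""
    return sum(outcomes) / len(outcomes)


def bootstrap_interval(y_a, y_b, n_boot=2000, seed=0):
    """Percentile bootstrap interval for the risk difference between two arms."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        # Resample each arm with replacement, keeping the arm sizes fixed.
        boot_a = [rng.choice(y_a) for _ in y_a]
        boot_b = [rng.choice(y_b) for _ in y_b]
        diffs.append(empirical_risk(boot_a) - empirical_risk(boot_b))
    diffs.sort()
    # 2.5th and 97.5th percentiles of the resampled risk differences.
    return diffs[int(0.025 * n_boot)], diffs[int(0.975 * n_boot) - 1]


# Hypothetical randomized arms: 30 events in 100 units under policy a,
# 10 events in 100 units under policy b.
arm_a = [1] * 30 + [0] * 70
arm_b = [1] * 10 + [0] * 90

risk_difference = empirical_risk(arm_a) - empirical_risk(arm_b)
lower, upper = bootstrap_interval(arm_a, arm_b)
print(f"risk difference {risk_difference:.2f}, interval ({lower:.2f}, {upper:.2f})")
```

THE SKETCH ASSUMES COMPLETE FOLLOW-UP. WITH CENSORING, THE SIMPLE PROPORTION WOULD HAVE TO BE REPLACED BY A NON-PARAMETRIC ESTIMATOR THAT HANDLES CENSORED OBSERVATIONS.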
EVERY EXPERIMENT IS A BROKEN EXPERIMENT. SO WE NEED TO APPROACH RANDOMIZED CLINICAL TRIALS FROM THIS PERSPECTIVE. IT'S GOOD TO DO THAT, BUT AS WE SPOKE ABOUT YESTERDAY WE ALSO NEED TO APPROACH OBSERVATIONAL STUDIES FROM THE PERSPECTIVE THAT THEY ARE JUST RANDOMIZED EXPERIMENTS BROKEN BY DESIGN. THESE ARE NOT TWO CREATURES. RANDOMIZED TRIALS AND OBSERVATIONAL STUDIES ARE NOT TWO CREATURES, THEY LIVE ON A CONTINUUM. EVEN THOUGHT EXPERIMENTS CAN BE THOUGHT ABOUT THIS WAY. THESE ARE CALLED OBSERVATIONAL STUDIES OR NON-EXPERIMENTS. CALL THEM PSEUDO EXPERIMENTS. I THINK MIGUEL MAY HAVE CALLED THEM MOCK TRIALS. RANDOMIZATION CAN BE CONDITIONAL ON A SET OF FACTORS, BUT IF WE DON'T HAVE PHYSICAL RANDOMIZATION OR A CENSUS ON EACH OF UNIT SAMPLING, POLICY ALLOCATION AND OBSERVATION, EACH OF THOSE PLACES I MENTIONED WHERE WE WANT TO HAVE RANDOM SELECTION OR ASSIGNMENT, THEN WE DON'T HAVE POINT IDENTIFICATION. I CAN'T GIVE A NUMBER EVEN IF I HAD AN INFINITE SAMPLE SIZE. I CAN JUST GIVE AN INTERVAL, A SET; NOT A CONFIDENCE INTERVAL, BUT THE POINT ESTIMATE BECOMES A SET. WE CAN SHARPEN THOSE BOUNDS, SHARPEN THAT TO A POINT, BUT WE HAVE TO INVOKE ADDITIONAL UNTESTABLE ASSUMPTIONS TO DO SO. ONE SET OF ASSUMPTIONS IS THE STANDARD MODEL. WHAT WE HAVE DONE IN EPIDEMIOLOGY AND TO A LARGE EXTENT IN BIOSTATISTICS IS WE HAVE RUN RIGHT PAST THE BOUNDS TO POINT ESTIMATION. TO WHAT EXTENT, IF ANY, WE SHARPEN THE BOUNDS IS OPEN FOR DISCUSSION. I HAVE BEEN JUST AS GUILTY AS EVERYONE ELSE, AND ONE OF THE RATIONALES OTHERS HAVE USED, AND I HAVE USED, IS THAT THE BOUNDS ARE SO UNINFORMATIVE, SO WIDE, THAT WE NEED TO SHARPEN THE INFERENCE. BUT THE FACT THAT THE BOUNDS ARE VERY WIDE IS ITSELF VERY INFORMATIVE. THE FACT THEY'RE NON-INFORMATIVE IS VERY INFORMATIVE. WHAT IS LOGICALLY POSSIBLE UNDER WHAT WE SEE IS GIVEN BY THOSE BOUNDS, AND THAT'S PERHAPS THE PLACE TO START RATHER THAN MOVING OVER THEM TO POINT ESTIMATES. SO I DON'T HAVE A TIME INDICATOR.
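ONE SIMPLE CASE OF THE BOUNDS DESCRIBED ABOVE CAN BE SKETCHED IN PYTHON. IT ASSUMES A BINARY OUTCOME THAT IS UNOBSERVED FOR SOME UNITS AND MAKES NO ASSUMPTION ABOUT WHY IT IS UNOBSERVED, SO THE RISK IS ONLY KNOWN TO LIE IN AN INTERVAL WHOSE WIDTH IS THE FRACTION UNOBSERVED. THIS COVERS ONLY THE OBSERVATION STEP, NOT SAMPLING OR POLICY ALLOCATION, AND THE FUNCTION NAME AND THE DATA ARE HYPOTHETICAL.

```python
def worst_case_risk_bounds(outcomes):
    """Bounds on a risk when some binary outcomes are unobserved (None)."""
    n = len(outcomes)
    events = sum(1 for y in outcomes if y == 1)
    missing = sum(1 for y in outcomes if y is None)
    lower = events / n              # as if every unobserved unit had no event
    upper = (events + missing) / n  # as if every unobserved unit had the event
    return lower, upper


# Hypothetical data: 10 units, 3 observed events, 2 outcomes unobserved.
observed = [1, 0, None, 1, None, 0, 0, 0, 1, 0]
lower, upper = worst_case_risk_bounds(observed)
print(lower, upper)
```

COLLECTING MORE UNITS DOES NOT NARROW THIS INTERVAL IF THE FRACTION UNOBSERVED STAYS THE SAME; ONLY ADDITIONAL ASSUMPTIONS OR ADDITIONAL DATA ABOUT THE UNOBSERVED UNITS DO.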
WE HAVE A BROKEN EXPERIMENT: THE RANDOMIZED TRIAL IS BROKEN, SO YOU DIDN'T GET BALANCE IN RANDOM ALLOCATION, YOU HAVE NON-COMPLIANCE, LOSS TO FOLLOW-UP, OTHER MISSING DATA, AND IT'S NOT GENERALIZABLE TO THE DATA WHERE YOU'RE TRYING TO MAKE INFERENCE. IF YOU HAVE A BROKEN EXPERIMENT, OR A PSEUDO EXPERIMENT BROKEN BY DESIGN, AND YOU ADD THE CONDITIONAL EXCHANGEABILITY ASSUMPTIONS FROM THE STANDARD MODEL, WE STILL HAVE TO RELY ON ADDITIONAL DATA. THE ASSUMPTIONS ALONE WON'T GIVE A POINT ESTIMATE, BUT ASSUMPTIONS AND ADDITIONAL DATA CAN. SAMPLING, (INDISCERNIBLE), OUTCOME AND OBSERVATION: COMMON CAUSES. I WILL DENOTE THAT BY CONTEXT. THE CONTEXT WILL BE DIFFERENT FOR EACH ONE OF THOSE THINGS BUT I'LL DENOTE IT BY W. SO THE G FORMULA GETS EXTENDED. YOU DON'T NEED TO FOLLOW THE G FORMULA THAT MUCH, BUT IT'S EXTENDED. IT WAS SIMPLE BEFORE, IT JUST HAD THIS PIECE IN IT. NOW WE EXTEND IT FOR THE CONTEXT: WE HAVE THE CONTEXT AND WE INTEGRATE OVER THE CONTEXT TO LEARN THE SITUATION. WE NEED THE CONTEXT THERE TO ACCOUNT FOR NON-RANDOM POLICY ALLOCATION. IT WASN'T A RANDOMIZED CLINICAL TRIAL, IT WAS AN OBSERVATIONAL STUDY, SO WHETHER I TOOK TREATMENT A DEPENDS ON CONTEXT. CONTEXT IS WHERE BIG DATA CAN REALLY COME IN AND CHANGE THE WAY WE DO EPIDEMIOLOGY IN A WAY THAT WILL BE PERHAPS REVOLUTIONARY. IT MIGHT NOT BE ONLY THAT SECOND RANDOMIZATION THAT WE DON'T HAVE. I ONLY FOUND TWO PAPERS IN PUBMED WITH DOUBLE RANDOMIZATION, RANDOMLY SAMPLED FROM A POPULATION AND RANDOMLY ALLOCATED TREATMENTS. BOTH FAILED FAIRLY MISERABLY, AND THOSE WERE THE ONLY TWO IN THE LAST 30 YEARS. SO WE DON'T HAVE THE RANDOM SAMPLE, AND WE WANT TO ACCOUNT FOR NON-RANDOM SAMPLING. WE CAN DO THAT WITH TRANSPORT, USING THE CONTEXT NOT IN THE STUDY BUT THE CONTEXT WHERE YOU WANT TO MAKE INFERENCE. OKAY. ONE MORE EXTENSION, THEN I'LL TALK ABOUT WAYS TO MOVE FORWARD. IN PRECISION OR PERSONALIZED MEDICINE OR PRECISION PUBLIC HEALTH WE'LL CONSIDER A HIGH DEFINITION VERSION OF THE TRANSPORT G FORMULA, SO WE EXTEND WHAT WE KNOW TO BE CONDITIONAL ON A SET OF FACTORS. AGAIN THIS IS JUST A COMPONENT, A SET OF COMPONENTS FROM W.
THESE ARE A SET OF FEATURES THAT TAILOR THE TREATMENT OR HELP MODIFY THE POLICY, MODIFIERS OF THE POLICY EFFECT USED TO TAILOR THE TREATMENT. USING THE TRANSPORT G FORMULA IN THIS WAY COULD LEAD TO HIGH DEFINITION KNOWLEDGE OR PERSONALIZED MEDICINE. SO WE HAVE HEARD A COUPLE OF TIMES ABOUT DIMENSIONALITY. IF THE CONTEXT, THE SET OF COMMON CAUSES, THE NUMBER OF POLICIES, THE OUTCOME OR THE SET OF MODIFIERS BECOMES HIGH DIMENSIONAL, IN A SITUATION WHERE YOU HAVE BIG DATA RELATIVE TO THE AMOUNT OF INFORMATION YOU HAVE, THEN WE SUFFER FROM BELLMAN'S CURSE OF DIMENSIONALITY OR COMBINATORIAL EXPLOSION. THIS IS A SERIOUS POINT THAT A SPEAKER A COUPLE OF SPEAKERS AGO MADE. THIS IS A SERIOUS POINT WE HAVEN'T TALKED ABOUT TOO MUCH. HERE IS A CONJECTURE: A COMBINATION OF FLEXIBLE SEMIPARAMETRIC MODELS, PERSONALIZATION AND UTILITY FUNCTIONS, WHICH I WON'T GO INTO DETAIL ON, PROVIDES A WAY FORWARD, ALBEIT AN IMPERFECT ONE. TO THE BEST OF MY ABILITY I CANNOT FIND A SOLUTION TO THE CURSE OF DIMENSIONALITY. NOT THAT I WOULD FIND THE SOLUTION, BUT I CAN'T FIND ANY OTHER HUMAN BRAIN THAT HAS FOUND THE SOLUTION. BUT WE HAVE A WORKING MODEL TO GET US -- WE CAN LEARN AT A RATE THAT IS BETTER THAN WE ARE DOING NOW USING A COMBINATION OF FEATURES. SO BROKEN OR PSEUDO EXPERIMENTS TYPICALLY REQUIRE US TO ESTIMATE AT LEAST PART OF THE G FORMULA PARAMETRICALLY. IT MAY BE THAT WE HAVE TO ESTIMATE AN OUTCOME MODEL OR A POLICY ALLOCATION MODEL. WHEN WE DO THAT WE CAN ESTIMATE PIECES OF IT AND REMAIN SEMIPARAMETRIC, LETTING THE OTHER PIECES OF THE G FORMULA AND THE SYSTEM WE'RE INTERESTED IN REMAIN NON-PARAMETRIC. THAT REDUCES OUR CHANCES OF HAVING MODEL MISSPECIFICATION. SECOND GENERATION SEMIPARAMETRIC METHODS ARE HARDER TO IMPLEMENT BUT MORE PRECISE. BOTH FIRST GENERATION METHODS AND SECOND GENERATION METHODS PROVIDE CONSISTENT ESTIMATORS. ONE WILL BEAT THE OTHER TO THE FINISH LINE, BUT THEY WILL BOTH GET THERE. SOME EXAMPLES OF SECOND GENERATION METHODS ARE LISTED HERE.
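THE G FORMULA EXTENDED FOR CONTEXT, AND ITS TRANSPORT VERSION, CAN BE SKETCHED NON-PARAMETRICALLY IN PYTHON FOR A SINGLE DISCRETE CONTEXT VARIABLE. THE RISK UNDER A POLICY IS THE CONTEXT-SPECIFIC RISK AMONG UNITS THAT RECEIVED THAT POLICY, AVERAGED OVER THE DISTRIBUTION OF CONTEXT, EITHER IN THE STUDY OR IN A TARGET POPULATION. THE FUNCTION NAME, THE RECORD LAYOUT AND THE TOY DATA ARE HYPOTHETICAL, AND THE SKETCH ASSUMES CONDITIONAL EXCHANGEABILITY GIVEN CONTEXT AND POSITIVITY (EVERY CONTEXT STRATUM BEING AVERAGED OVER CONTAINS UNITS ON THE POLICY).

```python
from collections import Counter, defaultdict


def g_formula_risk(records, policy, target_contexts=None):
    """Standardized risk under `policy`.

    records: list of (context, policy, outcome) triples from the study.
    target_contexts: optional list of context values from the population
        where inference is wanted (transport); defaults to the study itself.
    """
    events = defaultdict(int)
    counts = defaultdict(int)
    for w, a, y in records:
        if a == policy:
            counts[w] += 1
            events[w] += y
    if target_contexts is None:
        target_contexts = [w for w, _, _ in records]
    weights = Counter(target_contexts)
    total = sum(weights.values())
    # Sum over contexts of P(Y=1 | A=policy, W=w) * P(W=w).
    # Raises ZeroDivisionError if positivity fails in some stratum.
    return sum(weights[w] / total * events[w] / counts[w] for w in weights)


# Hypothetical study: context 0 or 1, policy 0 or 1, binary outcome.
records = [
    (0, 1, 1), (0, 1, 0), (0, 0, 0), (0, 0, 0),
    (1, 1, 1), (1, 1, 1), (1, 0, 1), (1, 0, 0),
]
risk_treated = g_formula_risk(records, policy=1)
risk_untreated = g_formula_risk(records, policy=0)
# Transport to a hypothetical target population where context 1 is more common.
risk_transported = g_formula_risk(records, policy=1, target_contexts=[1, 1, 1, 0])
print(risk_treated, risk_untreated, risk_transported)
```

WITH MANY CONTEXT VARIABLES THE STRATA BECOME SPARSE AND THIS FULLY NON-PARAMETRIC VERSION BREAKS DOWN, WHICH IS THE CURSE OF DIMENSIONALITY DISCUSSED ABOVE AND THE REASON PART OF THE FORMULA IS USUALLY MODELED.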
SO JUST LIKE WITH THE IDEALIZED EXPERIMENT, WITH BROKEN EXPERIMENTS WE WILL BE SUBJECT TO FINITE SAMPLE BIAS. WE WILL REQUIRE MORE MODELS IN BROKEN EXPERIMENTS: WE MIGHT HAVE TO MODEL THE SAMPLING PROCESS, THE POLICY ALLOCATION PROCESS, THE OBSERVATION OR MISSING DATA PROCESS, AS WELL AS THE OUTCOME MODELS. SO WE HAVE MORE CHANCES FOR FINITE SAMPLE BIAS TO DESTROY INFERENCE. BUT AS IN THE IDEAL EXPERIMENT WE CAN EMPLOY A BAYES, NON-BAYES COMPROMISE AND USE STABILIZING PRIORS FOR ANY OR ALL MODELS. THESE PRIORS DISAPPEAR WHEN THE DATA BECOME RICH ENOUGH TO INFORM. WHEN THE DATA ARE NOT RICH, THEY SETTLE BACK TO AN IGNORANCE STATE, RATHER THAN WHAT HAPPENS IN THE USE OF OUR MODELS CURRENTLY, WHICH IS THAT THE MODELS EXPLODE AND GIVE NON-USEFUL INFORMATION AS YOU PULL THE DATA AWAY. IN CONCLUSION, HIGH DIMENSIONAL CONTEXT, POLICIES, OUTCOMES AND MODIFIERS CAN BE ACCOMMODATED BY A SEMIBAYESIAN PARAMETRIC APPROACH TO CAUSAL INFERENCE. THIS IS CONJECTURE, AND WE NEED EXPERIMENTS, EXAMPLES AND CODE TO PUSH THAT FORWARD. AND INTERESTINGLY, YESTERDAY SOMEBODY SAID KNOWLEDGE IS POWER. SOMEBODY MUST HAVE SAID SCIENCE IS KNOWLEDGE. AND BY TRANSITIVITY, SCIENCE IS POWER. MAYBE BIG SCIENCE IS BIG POWER. BUT LEST WE LEAVE ON HUBRIS, I'M GOING TO INSTEAD LEAVE WITH A QUOTE THAT SCARES ME TO DEATH: THE RATIONALITY OF INTELLECTUAL VIOLENCE AGAINST WHICH THE PACIFISM OF RATIONALITY MAY OR MAY NOT BE AN ADEQUATE WEAPON. I WILL LEAVE IT THERE. [APPLAUSE] >> THANKS, STEVEN. YOU HELPED ME FEEL MORE EMPOWERED AND DISEMPOWERED AT THE SAME TIME. I THINK THAT'S PROBABLY A GOOD THING. BUT YOU RAISED MANY IMPORTANT ISSUES THAT AGAIN I'M HOPING WE CAN AT LEAST TOUCH ON DURING OUR DISCUSSION. OUR FINAL SPEAKER IS DR. CHRISTOPHE FRASER, PROFESSOR OF THEORETICAL EPIDEMIOLOGY IN THE DEPARTMENT OF MEDICINE, SCHOOL OF PUBLIC HEALTH, IMPERIAL COLLEGE. HE'S ALSO DEPUTY DIRECTOR OF THE MRC CENTER FOR OUTBREAK ANALYSIS AND MODELING. LIKE STEVE EUBANK, DR.
FRASER STARTED IN THEORETICAL PHYSICS. I STARTED IN PHYSICS BUT LEFT AFTER A BACHELOR'S DEGREE SOME TIME AGO. IT WAS LIKE LEAVING A COLLEGE; I FELT DISEMPOWERED FOR YEARS, BUT SEEING STEVEN AND CHRISTOPHE FRASER GIVES ME RETROSPECTIVE PERMISSION FOR HAVING LEFT PHYSICS, SO I'M GRATEFUL FOR THAT. DR. FRASER HAS CHAIRED THE EVOLUTIONARY EPIDEMIOLOGY GROUP SINCE 2009. THEY WORK IN THEORY, ON INTEGRATING DATA AND DEVELOPING APPLICATIONS FOR PUBLIC HEALTH. >> WHEN DO YOU WANT ME TO FINISH MIC IN THIS CASE >> >> THANKS A LOT. SO HERE I'M GOING TO CHANGE GEAR A LITTLE BIT. THANKS FOR INVITING ME. I WANT TO TRY TO BE VERY CONCRETE AND ACTUALLY SHOW YOU SOME OUTPUTS OF WHAT I THINK OF AS BIG DATA ANALYSIS. BASED ON THE DEFINITIONS SO FAR, IT SEEMS THE DEFINITION I WILL USE IS THAT I WILL CALL IT THAT BECAUSE I WANT TO. AND THE EMPHASIS IS ON LINKING CLOSELY RELATED DATA SOURCES FROM THE SAME PATIENTS. SO I'LL HIGHLIGHT BRIEFLY SOME GENERAL PRINCIPLES AND GO STRAIGHT INTO THE CASE STUDY, WHICH IS HIV IN THE NETHERLANDS, THE BIG DATA SOURCE I HAVE BEEN WORKING WITH MOST CLOSELY, THEN SHOW YOU HOW WE GENERALIZE THAT TO THE HPTN STUDY IN ZAMBIA AND TO BRINGING INSIGHTS ACROSS MULTIPLE COHORTS. THE ESSENCE OF WHAT WE'RE DOING IS TRYING TO LOOK AT THE PROPERTIES, THE RULES OF TRANSMISSION, AND THE IMPORTANT POINT I THINK TO UNDERSTAND IS WE ONLY DO THIS IN A PROBABILISTIC SENSE. SO THE CAVEATS ABOUT THIS ANALYSIS ARE THOSE THAT MARTINA MORRIS EMPHASIZED EARLIER, BUT FROM AN ANALYSIS POINT OF VIEW. SO WE DON'T NEED TO PROVE TRANSMISSION. WE'RE NOT IN THE BUSINESS OF LOCAL, WE'RE IN THE BUSINESS OF SAYING THAT EVEN BEING ABLE TO PUT SOME KIND OF PROBABILITY ON TRANSMISSION BETWEEN TWO INDIVIDUALS IS ENOUGH TO MAKE A LOT OF INFERENCE ABOUT WHAT THE PATTERNS OF TRANSMISSION ARE AND WHAT DRIVES TRANSMISSION: WHAT ARE THE GENERAL PROPERTIES OR STAGE OF AN INDIVIDUAL THAT MAKE THEM MORE LIKELY TO TRANSMIT THAN ANOTHER.
SO IN THE PHYLOGENY EACH PERSON IS TYPICALLY REPRESENTED BY ONE TIP, AND THE DISTANCE ALONG THE TREE IS HOW SIMILAR TWO VIRUSES ARE. THESE ARE VERY DISSIMILAR, DISTANTLY RELATED; TO FIND A COMMON ANCESTOR YOU HAVE TO GET BACK TO THE ROOT OF THE PHYLOGENY, TO THE ROOT OF THE EPIDEMIC. SO BECAUSE THE COMMON ANCESTOR OF THESE TWO VIRUSES IS SO FAR BACK IN THE PAST, YOU CAN RULE OUT TRANSMISSION. FOR THESE TWO VIRUSES HERE FROM THESE TWO DIFFERENT INDIVIDUALS, THE COMMON ANCESTOR, THE SINGLE VIRUS WHICH GAVE RISE TO THESE TWO VIRUSES WHICH WERE SAMPLED FROM TWO INDIVIDUALS, IS ONLY MAYBE A FEW YEARS OLD. IT'S POSSIBLE THAT TRANSMISSION OCCURRED LINKING THESE TWO INDIVIDUALS, AND THEY ARE ALMOST CERTAINLY LINKED BY A SHORT CHAIN OF TRANSMISSION, AND WE CAN REFINE THOSE PROBABILITIES BY LOOKING AT CLINICAL DATA. AND THEN THE QUESTION IS, WHAT ARE THE PROPERTIES OF INDIVIDUALS WHO ARE LINKED TOGETHER BY SHORT CHAINS OF TRANSMISSION? BECAUSE THIS IS BIG DATA I HAVE A FEW SLIDES ABOUT THE CHALLENGES, TO GIVE YOU AN IDEA OF THE KIND OF CHALLENGES WE FACE. THERE ARE ALL KINDS OF CHALLENGES: ETHICAL, LOGISTIC, MULTI-DISCIPLINARY (THESE ARE VERY MULTI-DISCIPLINARY ANALYSES). THESE HAVE BEEN DISCUSSED, BUT I'M TALKING ABOUT WHAT TAKES A LOT OF TIME IN THE ANALYSIS. WHAT TAKES TIME IS ACTUALLY THE ANALYSIS ITSELF. BRINGING TOGETHER THE TEAMS AND THE SAMPLES TAKES TIME, BUT TYPICALLY THERE'S GOOD WILL. PEOPLE ARE INTERESTED TO SEE WHAT COMES OUT, AND IT TAKES TIME BUT IT GETS DONE. THE NUTS AND BOLTS OF GENERATING THE DATA ARE QUICK. THE TECHNOLOGY IS CRITICAL, BOTH ON THE COMPUTING AND SEQUENCING SIDE. BUT ACTUALLY DESIGNING THE ANALYSIS TAKES A LOT OF TIME. THE AMOUNT OF DATA IS A LITTLE BIT OVERWHELMING. WHEN WE CHARACTERIZE THE PHYLOGENY OF VIRUSES, WE TYPICALLY SEE SMALL BITS OF VIRUSES CHARACTERIZED FROM ONE INDIVIDUAL, AND ONE INDIVIDUAL CONTAINS A SWARM OF CLOSELY RELATED VIRUSES.
SO WE SEE LOTS OF TYPICALLY TENS OF MILLIONS OF SMALL CHUNKS SHOTGUN PIECES, THOSE VIRUSES, FOR 176 SAMPLES, SMALL AMOUNT OF INFORMATION WHICH IS VOICES IN THAT SAMPLE. THE INSIGHTS COME WHEN THIS IS COMBINED WITH HIGH RESOLUTION LONG TERM COHORTS WITH CLINICAL DEMOGRAPHIC BEHAVIORAL AND IN LATER ITERATIONS GEOGRAPHICAL DATA. AND THE ENDS ARE QUITE STRAIGHT FORWARD TO UNDERSTAND DRIVERS OF TRANSMISSION TO IMPROVE INTERVENTIONS. MATHEMATICS AN COMPUTATIONAL ANALYSIS OF DATA, THAT LAST POINT IS THE MOST TIME CONSUMING PART OF THE PROCESS. WE DON'T USE EXPENSIVE MACHINES IS RELATIVELY INEXPENSIVE BUT IT'S IMPORTANT IN TERMS OF THINKING OF THESE THINGS, THAT SHOULDN'T MEAN CHEAP, IT'S IMPORTANT TO INVEST ANYTIME, INVEST IN IT THAT MATTERS TAKE TIME TO DO SO. SO LET US LOOK AT THE NETHERLANDS SO THE REASON I WORKED CLOSELY IN NETHERLANDS FOR YEARS IS NATIONAL PRESENCE OF NATIONAL COHORT WITH OPT OUT CONSENT SO EVERYBODY IN THE NETHERLANDS IS PART OF THIS NATIONAL COHORT. THE EPIDEMIC SO FAR BEING WELL CHARACTERIZED THE EPIDEMIC IS NOT THE TYPICAL MANY EPIDEMICS IN CERTAINLY IN EUROPE. AMONG MEN WHO HAVE SEX WITH MEN AND AMONG PEOPLE WHO MIGRATEED FROM AREAS OF HIGH HIV PREVALENCE. ALL THE DATA COMES FROM A LIMBED NUMBER OF REGISTERED TREATMENT CENTERS WHO ALSO SUPPLY DATA CENTRALLY TO THE HIV MONITORING FOUNDATION, FROM THESE 26 HIV TREATMENT CENTERS AND ALL LINKED AND AN NONMIZED DATA SETS FOR ANALYSIS. THE TREATMENT CASCADE IS TYPICAL OF MANY OTHER EPIDEMICS. WITH 58% INFECTED INDIVIDUALS HAVING VIRAL SUPPRESSION, 82% OF THOSE INCURRED HAVING VIRAL SUPPRESSION, WHAT'S ATYPICAL IS THE FREQUENCY OF FOLLOW-UP ON THE COVERAGE OF DATA. THE EPIDEMIC HAS BEEN INCREASING WE DOCUMENTED THAT IN MODELING WORK. SO PHYLOGENETIC WE ESTIMATE THE INCIDENCE OF INFECTION FROM THE INCIDENCE OF NEW DIAGNOSIS, ALL GOING OUT CONTRIBUTED PREDOCUMENT INNOCENTLY TO INCREASES IN RISK BEHAVIOR AMONG UNDIAGNOSED INDIVIDUALS PEOPLE WHO DON'T KNOW, PEOPLE WHO ARE NOT YET SUPPRESSED. 
STARTING REALLY FROM THE YEAR 2000 ONWARD. AND CROSS VALIDATED THIS WITH LONG TERM BEHAVIORAL GOING ON OVER THE SAME PERIOD. BLACK DOTS SURVEY FROM THE AMSTERDAM PROPORTION OF MEN WHO REPORT INTERCOURSE, BLACK ON THE RIGHT AND THE LINES, THE INFERENCE FROM MATHEMATICAL MODELS, REPLICATES. SO THAT IS A PICTURE OF THE WHOLE EPIDEMIC AND WE WANT TO ZOOM IN MORE CLOSELY INTO SOME THAT MAKES WE BREAK DOWN THE CLUSTERS AND CLUSTERS DEFINE ON PHYLOGENIES OF VIRUSES, THIS IS THE OVERALL PHYLOGENY OF THE WHOLE EPIDEMIC, THE TREE WITH EACH PERSON RELEASE TO UNDERSTAND TREE AND WE LOOK AT INDIVIDUAL SUBEPIDEMICS WHICH ARE EPIDEMICS MORE CLOSELY RELATED VIRUSES. AND WE TRY TO CHARACTERIZE WHAT HAPPENS THINKING THE WHOLE EPIDEMIC AND BREAKING THIS THERE ARE MANY INDIVIDUALS WHO DON'T CLUSTER BUT WE FIND 106 LARGE CLUSTERS CONTAINING TEN OR MORE INDIVIDUALS, AND OF THOSE 91 SHOWN HERE, PREDOCUMENT INNOCENTLY AMONG MEN WHO HAVE -- PREDOCUMENT INNOCENTLY IN MEN WHO HAVE PREDOMINANTLY WHO DECLARE THEMSELVES HETEROSEXUAL WHO APPEAR IN THESE CLUSTERS AS SHOWN IN GREEN. THE DOTS ARE THE TIME OF DIAGNOSIS OF THESE INDIVIDUALS. AND THERE ARE A COUPLE OF FEATURES STRIKING ABOUT THIS PICTURE. THE FIRST IS A LOT OF CLUSTERS ARE PRETTY OLD. THEY DATE BACK TO EARLY DAYS OF THE EPIDEMIC. STUDYING RATES OF APPEARANCE OF NEW CLUSTERS. THEY NEVER SEEM TO STOP, OF ALL CLUSTERS WHICH STARTED AND PRE-DATED BACK TO THE BEGINNING OF THE EPIDEMIC, 1980s, NONE STOPPED. SO WE SAY WE STOPPED EPIDEMIC BUT IF WE DISAGGREGATE INTO SMALLER EPIDEMICS WE HAVEN'T STOPPED OR SLOWED DOWN ANY SMALL SUBEPIDEMICS. THE SECOND THING IS A LOT OF INDIVIDUALS GET INFECTED AS PART OF THESE OLD CLUSTERS. HERE ARE HETEROSEXUAL CLUSTERS, LARGELY RELATE AND ONE LARGE CLUSTER WHICH CONTAINS 66% IV DRUG USERS IN THE STUDY AND MOST CO-INFECTED INDIVIDUALS. THESE SHOWS THE INDIVIDUALS IN SMALL CLUSTER ANSWERED THE INDIVIDUALS WITH NO CLUSTER AND WAS RECORDED. 
IF WE ASK PEOPLE, ONE PIECE OF DATA WE HAVE IS THAT ALL INDIVIDUALS TELL US WHERE THEY THINK THEY WERE INFECTED, AND THOSE WHO SELF-REPORT HAVING LIKELY BEEN INFECTED ABROAD ARE MUCH MORE LIKELY TO NOT CLUSTER OR TO FALL INTO SMALLER CLUSTERS, WHICH VALIDATES THE IDEA OF CLUSTERS BEING LOCAL TRANSMISSION. IF THEY CLUSTER WITH A SEQUENCE FROM ANOTHER COUNTRY THEY'RE MORE LIKELY TO HAVE BEEN INFECTED ABROAD. NOW IF THERE IS GOOD NEWS IN HERE, THE CLUSTERS ARE AGING QUITE FAST. THE MEAN AGE OF NEW DIAGNOSIS WITHIN A CLUSTER INCREASES AT 0.5 YEARS OF AGE. THE PROBLEM IS THE NEWEST CLUSTERS ARE AMONG YOUNGER INDIVIDUALS, SO THE NEW CLUSTERS WHICH ARE APPARENT. WE CAN ESTIMATE WITH MATHEMATICAL MODELING TECHNIQUES THE REPRODUCTION NUMBER FOR EACH CLUSTER, SO EACH LINE REPRESENTS ONE CLUSTER AND WE ESTIMATE ITS REPRODUCTION NUMBER. THERE IS A STRIKING PATTERN: THEY ARE RELATIVELY SIMILAR TO EACH OTHER, SELF-SUSTAINING. WHEN EACH PERSON INFECTS ONE OTHER PERSON THE EPIDEMIC, THE SUBEPIDEMIC, THE CLUSTER IS SELF-SUSTAINING. NEW CLUSTERS AMONG YOUNGER PEOPLE ARE GROWING MOST QUICKLY IN THIS EPIDEMIC, AND MORE QUICKLY THAN ANY OTHER CLUSTERS WHICH APPEARED AT ANY POINT IN THE EPIDEMIC. THEN WE DECIDED TO ZOOM IN MORE AND LOOK AT INDIVIDUAL POTENTIAL TRANSMISSION EVENTS. WE CAN NEVER PROVE TRANSMISSION, BUT WHEN WE HAVE AN INDIVIDUAL AND WE KNOW THEY WERE HIV NEGATIVE UP TO A CERTAIN POINT AND THEN BECAME HIV POSITIVE, WE MIGHT FIND PROBABLE TRANSMITTERS, INDIVIDUALS WITH A CLOSELY RELATED VIRUS, AND WE CAN ASSIGN A PROBABILITY. ONE BIG THING WE DID IN THE ANALYSIS WAS A TRANSFORMATION OF THE DATA.
IT'S COMPLICATED, BUT WE LOOK AT ALL THESE TIME WINDOWS. FOR PROBABLE TRANSMITTERS IT'S NOT PARTICULARLY LIKELY THE TRANSMISSION OCCURRED IN A GIVEN WINDOW, BUT IT HAS OCCURRED AT SOME POINT FROM SOMEBODY, AND WE CAN LOOK AT PROBABILITIES AND CORRELATE THESE PROBABILITIES WITH CLINICAL DATA. SO THE PROBABILITIES ARE DEFINED PHYLOGENETICALLY, AND WE LOOK AT THE CLINICAL DATA AND ASK WHAT ARE THE CLINICAL CHARACTERISTICS OF BEING VERY COMPATIBLE WITH A TRANSMISSION THAT MUST HAVE OCCURRED TO THIS INDIVIDUAL AT THIS POINT IN TIME. THIS IS ANALYSIS -- SO WE GO THROUGH THE DATA AND FIND PROBABLE DIRECT TRANSMITTERS, AND WE FOUND 883 PROBABLE DIRECT TRANSMITTERS TO 601 RECIPIENT MSM, WHICH IS A BIT BIGGER THAN THE PREVIOUS STUDY DONE ON THIS, WHICH FOUND 41 MSM. SO WE LOOK AT THESE PROBABLE TRANSMITTERS AND TRANSMISSION TIMES AND WE CLASSIFIED THEM INTO 13 POSSIBLE STAGES: PRE-DIAGNOSIS, DIAGNOSED, ART STARTED, AND ART STARTED WITH SOME DEGREE OF VIRAL SUPPRESSION. WE ADDED UP ALL THESE PROBABILITIES, THE TRANSMISSIBILITY OF ALL THESE DIFFERENT STAGES, WE ADD THEM UP TO ONE, AND ASKED WHAT PROPORTION, AND THE ANSWER IS -- AND MOST OF THOSE OCCUR IN EARLY HIV INFECTION. VERY, VERY FEW TRANSMISSIONS OCCUR FROM PEOPLE ON ART, EVEN THOSE WITHOUT CLEAR EVIDENCE OF SUPPRESSION. EARLIER DIAGNOSIS DOES OFFER POTENTIAL TO REDUCE TRANSMISSION, BUT NOT THAT MUCH. YOU NEED TO GET THESE UNDIAGNOSED INDIVIDUALS. SO THIS IS THE RELATIVE TRANSMISSIBILITY FOR THESE 13 STAGES, AND THIS IS THE TOTAL CONTRIBUTION, UP TO 100%, OF THE DIFFERENT STAGES. THEN WE DID COUNTERFACTUAL MODELING ON THE PHYLOGENY AND THE CLINICAL CHARACTERISTICS OF THESE INDIVIDUALS. THIS IS DIFFERENT FROM POPULATION MODELING: IF WE HAD CHANGED THE CHARACTERISTICS OF THE INDIVIDUAL, WOULD THE TRANSMISSION HAVE OCCURRED, AND COULD THE TRANSMISSIONS WE FOUND IN THE PHYLOGENY BE REDUCED WITH DIFFERENT CHANGES? WHAT WOULD HAVE HAPPENED IF EVERYBODY STARTED TREATMENT AS SOON AS THEY WERE DIAGNOSED? YOU CAN REDUCE TRANSMISSION BY ABOUT 20 TO 30%.
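THE STEP OF ADDING UP TRANSMISSION PROBABILITIES BY STAGE CAN BE SKETCHED IN PYTHON AS FOLLOWS. THIS IS NOT THE STUDY'S ACTUAL METHOD: IT ASSUMES EACH RECIPIENT HAS A LIST OF PROBABLE TRANSMITTERS WITH A PHYLOGENETICALLY DERIVED WEIGHT AND A STAGE LABEL, RESCALES THE WEIGHTS SO EACH RECIPIENT CONTRIBUTES EXACTLY ONE TRANSMISSION, AND REPORTS THE SHARE OF TRANSMISSIONS ATTRIBUTED TO EACH STAGE. THE FUNCTION NAME, THE STAGE LABELS, THE WEIGHTS AND THE PER-RECIPIENT RESCALING ARE ALL ILLUSTRATIVE ASSUMPTIONS.

```python
from collections import defaultdict


def stage_contributions(candidate_pairs):
    """Share of transmissions attributed to each transmitter stage.

    candidate_pairs: list of (recipient_id, transmitter_stage, weight).
    """
    # Total weight over all probable transmitters of each recipient.
    totals = defaultdict(float)
    for recipient, _, weight in candidate_pairs:
        totals[recipient] += weight

    # Rescale so each recipient's weights sum to one, then add up by stage.
    by_stage = defaultdict(float)
    for recipient, stage, weight in candidate_pairs:
        by_stage[stage] += weight / totals[recipient]

    n_recipients = len(totals)
    return {stage: value / n_recipients for stage, value in by_stage.items()}


# Hypothetical example: two recipients, each with two probable transmitters.
pairs = [
    ("r1", "undiagnosed", 3.0),
    ("r1", "art_started", 1.0),
    ("r2", "undiagnosed", 1.0),
    ("r2", "diagnosed", 1.0),
]
shares = stage_contributions(pairs)
print(shares)
```

THE SHARES SUM TO ONE BY CONSTRUCTION, SO THEY CAN BE READ AS THE TOTAL CONTRIBUTION OF EACH STAGE, UP TO 100%.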
WHAT WE FOUND IS TREATMENT AND PREVENTION IN THIS POPULATION WOULD HAVE A LIMITED IMPACT BECAUSE IT HAS ALREADY BEEN ACHIEVED. IN FACT WHAT WE FOUND BECAUSE WE HAD EVIDENCE AND DATA NOT JUST ON PEOPLE HIV POSITIVE BUT ALSO ON THEIR PREVIOUS NEGATIVE TEST BEFORE THEY BECAME POSITIVE WE FOUND PEOPLE WHO POTENTIAL TRANSMITTERS ACTUALLY TEST LESS THAN THE REST OF THE POPULATION. WHEN THEY DO COME IN FOR TESTING WE ASK WHAT WOULD HAPPEN IF THERE HAD BEEN -- POLICY WE CAN TEST, OFFERING PEOPLE TO PREP ON THE BASIS THEY TEST FORD HIV. DEPENDING ON THE TAKEUP AND THE ADHERENCE, THIS IS THE POLICY WHICH LARGE PRODUCTIONS IN TRANSMISSION IN THIS POPULATION ACCORDING -- BASED ON A BUNCH OF ASSUMPTIONS. NOW QUITE QUICKLY TO DESCRIBE A FEW OTHER STUDIES, THIS IS QUITE WELL WORKED OUT BUT A FEW OTHER STUDIES WHICH GIVE AN IDEA OF WHERE WE'RE GOING. WE'RE NOW EXPANDING ACROSS EUROPE AND USING DEEP SEQUENCING A STUDY THAT I LEAD THE SAME PERSON TOGETHER INVOLVING NUMBER OF EUROPEAN SERO CONVERTERS REALLY AIMED AT -- EPICHEMOLOGICAL AND BIOLOGICAL OUTCOMES SO WE'RE LOOKING AT SERO CONVERTERS TO BE ABLE TO LOOK WHAT EXTENT VIRAL GENOTYPE PREDICTS VIRAL LOAD, CD40 DECLINES IN CLINICAL OUTCOMES. SO THAT'S WHY THE FOCUS IS ON THE VIRAL GENOTYPE. AND HERE WE'RE USING WHOLE VIRAL GENOMES AND DEEP SEQUENCING HUGE AMOUNTS OF DATA, AND REALLY TRYING TO SOLVE THE PROBLEM WHICH HASN'T BEEN SOLVED, SHORT TRACKS OF DNA MOST WHICH ARE HIV AND SOME WHICH ARE ALL OTHER THINGS WHICH ARE IN THE SAMPLE, THE PROBLEM IS THEY'RE NOT IDENTICAL BECAUSE SEQUENCES FROM DIFFERENT VIRAL PARTICLES WHICH ARE NOT ALL IDENTICAL. HOW DO YOU -- IT'S LIKE HAVING A MILLION PIECES FROM A THOUSAND JIG SAW PUZZLES ARE CLOSELY RELATED FOR WHICH YOU DON'T HAVE A TEMPLATE. THAT'S ONE REQUIREMENT WE HAVE BEEN TRYING TO SOLVE AND MAKING GOOD PROGRESS AND WE HAVE SOME DRAFTS AND HERE IS PHYLOGENY BASED ON DRAFT. DRAFT. 
YOU GET HIGHER RESOLUTION PHYLOGENYIES, THIS IS THE FIRST DRAFT COMPARING SEQUENCES WE OBTAINED FROM THE UK AND NETHERLANDS WITH THIS MUCH HIGHER RESOLUTION PHYLOGENY. YOU CAN FIND THE EPIDEMIC IS ENTERMIXED. WE DIDN'T SEE THAT WITH POLL SEQUENCING BECAUSE THE TIPS OF PHYLOGENY, WHEN WE LOOK DEEP TORE PHYLOGENY THERE WERE LOTS OF CHANGES OF THE EPIDEMIC FROM THE UK TO THE NETHERLANDS. REALLY, IT IS THE SUBONLY DETERMINICS ARE LOCAL BUT THERE'S A CHANCE THEY MOVE FROM ONE COUNTRY TO THE NEXT. THIS IS SHOWING YOU THE INCREASE IN RESOLUTION WHICH IS BIG BUT NOT MAGICAL BECAUSE THERE ARE REASONS FOR THAT WITH WHICH ARE BIOLOGICAL, ANOTHER THING WE DISCOVERED WITH SEQUENCING IS EXTENT OF WHICH IS PERHAPS SURPRISING, THE PHENOMENA WAS NOT SURPRISING IS EXTENT OF DUAL INFECTIONS, INDIVIDUALS ARE CARRY TWO DIFFERENT VIRUSES THEY MUST PRESUMABLY ACQUIRE FROM TWO DIFFERENT INDIVIDUALS WHICH ARE IN TWO PARTS OF PHYLOGENY SO THE SAME INDIVIDUAL HAS TWO DISTINCT VIRUS ONCE MORE COMMON THAN THE OTHER. AND IT'S OF THE ORDER OF 10% OF ALL 8 TO 10% OF ALL INDIVIDUALS IN THE POPULATION, REMEMBER THESE ARE SERO CONVERTERS RECENTLY INFECTED INDIVIDUALS. WE CAN LOOK AT LOST OF OTHER THINGS LIKE EVOLUTION OF VIRAL LOAD, ANOTHER STORY. AN INTERESTING STORY. WE SEE EVIDENCE OF VIRAL LOADS CHANGENING THIS POPULATION AND EVOLVEING ALONG LINEAGES AND WE WANT TO UNDERSTAND WHAT IT IS IN THE VIRAL GENOTYPE THAT DETERMINES THAT. THAT WAS THE KEY AIM OF THE BEEHIVE STUDY. IN ZAMBIA INVOLVED WITH THE VERY LARGE HPTN 701 POP-OUT TRIAL LED BY RICHARD HAYES AND SARA FIDDLER AND OTHER PLAYERS, (INDISCERNIBLE) IN ZAMBIA. AT THIS TRIAL INVOLVES 1.2 MILLION INDIVIDUALS IN 21 COMMUNITIES THAT ARE RANDOMIZED TO THREE ARMS. STANDARD OF CARE ARM, ENHANCED STANDARD OF CARE REALLY AND HOUSEHOLD TESTING WITH IMMEDIATE TREATMENT EITHER ACCORDING TO NATIONAL GUIDELINES OR UNIVERSALLY SOME POINT IN NATIONAL GUIDELINES LOOKING MORE LIKE UNIVERSAL TREATMENT AND CIRCUMCISION AND STANDARDS HIV PREVENTION PACKAGE. 
THE AIM WHICH TO FIND OUT HOW MUCH THESE INTERVENTIONS RELATIVE TO STANDARD OF CARE REDUCE INCIDENCE OF NEW INFECTIONS IN A COHORT. PHYLOGENETICS FUNDED NIH HERE, HAS AS PART OF THE TRIAL WILL ADD INSIGHT INTO WHAT HAPPENS DURING THE TRIAL AND WILL HELP US ESTIMATE THE SPECIFICS IN NETHERLANDS ROLE OF ACUTE INFECTION ESTIMATE DEMOGRAPHIC CORRELATES OF TRANSMISSION ESTIMATE AMOUNT OF TRANSMISSION FROM OUTSIDE THE COMMUNITY. SO WE'LL SEE THE PATTERNS OF TRANSMISSION. WE KNOW THIS WENT TO ELIMINATE 100% TRANSMISSION, WE WERE INTERESTED IN LEARNING AS MUCH AS POSSIBLE ABOUT WHERE RESIDUAL TRANSMISSION COMES FROM. ONE THING WE DID WHICH HAD BEEN DONE FOR PHYLOGENETICS BEFORE, PHYLOGENETICS IS DONE GO AHEAD AND DO IT, PLANNING A STUDY WAS TRYING TO DO POWER COOL QUEUELATIONS AND WE FOUND FOR OUR METHODS AND OTHER METHODS AVAILABLE THAT YOU NEED TO SAMPLE A LARGE PROPORTION OF THE POPULATION. AND CLUSTER RANDOMIZED TRIAL LIKE THIS, THE POPULATION SIZES ARE QUITE LARGE, SAMPLING A LARGE NUMBER OF VERY LARGE PROPORTION OF INDIVIDUALS PHYLOGENETICS. MUCH LARGER THAN IN OUR COHORTS. SO WE DEVELOPED A PROTOCOL WHICH INVOLVES RECRUITING PEOPLE IN CLINICS AND COMMUNITIES FOR THIS PHYLOGENETIC STUDY. FINALLY, 071 STUDY IS PART OF A CONSORTIUM BRINGING TOGETHER DIFFERENT COHORTS AND DIFFERENT STUDIES AND DIFFERENT TRIALS IN IN CONSORTIUM, THE AGE OF THE CONSORTIUM IS REALLY ONGOING AND OVER THE NEXT 12 TO 18 MONTHS TO STUDY AND MAKE AVAILABLE 20,000 HIV WHOLE GENOMES FROM ACROSS DIFFERENT SITES AT SIX DIFFERENT SITES, UGANDA, SOUTH AFRICA ZAMBIA, ONE THING WE HAVE BEEN DOING THE PHYLOGENETIC METHODS ARE AS I INTIMATED COMPLICATED AND MATHEMATICALLY ANALYTICALLY COMPUTATIONALLY DIFFICULT AND IN DEVELOPING THESE METHODS AND WE HAVE FOR OTHER REASONS DEVELOP POPULATIONS. 
STEVEN EUBANK MENTION AND WE CAN SIMULATE PHYLOGENY AND ASK FOR SIMULATED DATA HOW TO GET A JOB WE KNOW PHYLOGENIES FILTER THROUGH LOTS OF UNCERTAINTY HOW A JOB DO THEY DO RECONSTRUCTING WHAT HAPPENED IN THE SIMULATED EPIDEMIC. NEXT DEADLINE FOR SUBMISSION FIRST OF AUGUST. SO SUMMARY PHYLOGENIES PROVIDE INSIGHT INTO INDIVIDUAL POPULATION DRIVERS OF EPIDEMIC SPREAD, USING TECHNOLOGIES PROVIDE IMPROVE RESOLUTION AND REDUCE COST AT LEAST UP FRONT, THEY MAKE THE ANALYSIS CHALLENGING. AND EXCITING. IT'S AN AREA OF INTENSE METHODOLOGICAL DEVELOPMENT, EXPECT NEW INSIGHTS INTO GENERALIZED HIV EPIDEMICS FROM THESE DIFFERENT STUDIES AND COANCHOR. AND PHYLOGENETICS I THINK SHOULD QUESTION IN DIFFERENT SETTINGS ASSESS TARGETING OF PREVENTION. THESE ARE THE MULTI-DISCIPLINARY CONSORTIA INVOLVING MANY DIFFERENT PEOPLE, THE BEEHIVE CONSORTIA, HVTN 071 ITSELF FUNDED BY A LARGE CONSORTIUM AND A CONSORTIUM. [APPLAUSE] >> THANK YOU, CHRISTOPHE. YOU HEARD A NUMBER OF TALKS FROM DIFFERENT PERSPECTIVES SOMETIMES OVERLAPPING. COMMENTARY. AND I THINK THERE ARE MANY ISSUES TO RESOLVE SO I WOULD LIKE TO INVITE ALL SPEAKERS TO COME UP AND PARTICIPATE IN THE PANEL DISCUSSION. THE ORGANIZERS TOLD ME WE HAVE UNTIL 12:30. WE HAVE FORTUNATELY A GENEROUS LUNCH THAT WE CAN CUT INTO. THE FORMAT SUGGESTED TO ME ALL THE PANELISTS MAKE A BRIEF COMMENTARY ABOUT THEIR TAKE AWAY MESSAGE FROM THIS MORNING'S SESSION PUTTING THEIR TALKS IN THE CONTEXT OF THE GENERAL SESSION. OPEN THINGS UP FOR QUESTIONS FROM THE FLOOR. IF THERE'S NO OBJECTION WE MIGHT START IN REVERSE ORDER WITH CHRISTOPHE. UNLESS YOU PREFER WE START FROM THE OTHER ORDER CHRISTOPHE. >> ALL RIGHT. WE WILL GIVE YOU TIME TO SYNTHESIZE. WITH A LITTLE CHANGE OF TACTIC. I WOULD SAY THAT THE TAKE AWAY MESSAGE FOR ME IS THERE ARE TREMENDOUS OPPORTUNITIES, ALSO TREMENDOUS COMPLEXITIES MANY TERMS OF MAKING STATISTICALLY VALID ROBUST INFERENCES FROM THESE DATA. 
I THINK WE HAVE DISCUSSED A LOT AND SEEN A LOT OF TEMPLATES FOR BRINGING TOGETHER CONSORTIA AND BRINGING TOGETHER THE DATA. BUT THIS IDEA OF THINKING ABOUT CONCEPTUAL TRIALS WITHIN DATA, REALLY FORMULATING QUESTIONS AND FORMULATING ANALYSES THAT ARE NOT ENTIRELY CONFOUNDED BY THE AD HOC NATURE IN WHICH DIFFERENT DATA SETS HAVE BEEN PUT TOGETHER, IS SOMETHING VERY EXCITING AND NECESSARY. >> I WOULD ASK YOU ONE FURTHER QUESTION, CHRISTOPHE, AND MAYBE MENTION THIS TO THE PANEL. WHAT DO YOU THINK ARE THE BIGGEST OBSTACLES RIGHT NOW? ARE THEY DEVELOPMENT OF METHODS, ACCESS TO DATA, ALL OF THE ABOVE? >> I DON'T THINK IT'S A PRACTICAL BARRIER BUT THE DEVELOPMENT OF METHODS AND SYNTHESIZING AVAILABLE METHODS. VARIOUS SPEAKERS MADE THE POINT, ESSENTIALLY, I HOPE YOU READ OUR PAPERS AND USE OUR METHODS FOR THE ANALYSIS. WHEN WE PUT TOGETHER ALL THE ANALYSES TO EXTRACT AS MUCH AS POSSIBLE FROM THE DATA, THAT'S A SUBSTANTIAL CHALLENGE. WE HAVE TO LEARN TO DO THAT MOST EFFICIENTLY. >> I'M SORRY. STEVEN. >> ONCE SOMEBODY MISTOOK ME FOR MIGUEL. I DECIDED TO GIVE THEM HIS EMAIL ADDRESS ON THE SPOT AND SAID GO AHEAD AND SEND ME ANYTHING YOU WANT. SO I WOULD SAY MAYBE TWO POINTS. ONE IS TO ECHO WHAT CHRISTOPHE WAS SAYING: IN THIS CLEARLY BIG OPPORTUNITY THAT MANY OF YOU SEE, A PRINCIPLED WAY FORWARD, I THINK, WOULD BE IMPORTANT, AND THE OTHER IS THE ABILITY TO COMMUNICATE ACROSS DISCIPLINES TO TAKE ADVANTAGE OF THE OPPORTUNITY. THE OPPORTUNITY EITHER LIVES OR DIES WITH OUR ABILITY TO TIE TOGETHER ANALYTIC, COMPUTATIONAL AND INTELLECTUAL CONTRIBUTIONS THAT ARE FAIRLY DISPARATE. THAT OPPORTUNITY IS I THINK OURS FOR THE SEIZING, BUT HOW TO DO THAT IS A LITTLE DIFFICULT. >> SO, WHAT THEY SAID. I HAVE BEEN VERY IMPRESSED, ESPECIALLY IN THIS MORNING'S SESSION, AT THE LACK OF HYPE. A LOT OF THE BIG DATA STUFF WE HEAR ABOUT IS THAT DATA WILL SPEAK FOR ITSELF, AND ALL THE PANELISTS HERE HAVE BEEN REALLY THOUGHTFUL ABOUT IT, ADDRESSING SOME OF THE ISSUES.
THE METHODS THAT WE NEED TO PULL TOGETHER, I AGREE, ARE COMPLEX, THEY ARE EVOLVING. SO PART OF THE PROBLEM IS THE USUAL SILOING OF DISCIPLINES, BUT THE OTHER PART OF THE PROBLEM IS THAT IT'S A MASSIVE UNDERTAKING TO UNDERSTAND EACH OF THESE NEW TYPES OF METHODOLOGY. SO PUTTING THEM TOGETHER IS CHALLENGING. BUT IT IS AN INTERESTING OPPORTUNITY, AND EVEN A CONFERENCE LIKE THIS HELPS A GREAT DEAL. >> FIRST TWO OR THREE FACTORS ACCOUNT FOR PROBABLY -- I WOULD ADD TO THE PREVIOUS REMARKS: NOT ONLY DO WE NEED TO BE AWARE OF THE METHODS BEING DEVELOPED IN OTHER AREAS AND APPLY THEM HERE, BUT WE ALSO NEED TO REALIZE THAT THE PROBLEMS THAT WE'RE ENCOUNTERING ARE BEING ENCOUNTERED ACROSS THE BOARD AT NIH, NSF, ANYWHERE, IN ALL DOMAINS, AS THEY TRY TO UNDERSTAND WHAT TO DO WITH BIG DATA. SO WE SHOULD MAKE SURE NOT TO UNINTENTIONALLY BUILD A SILO OF METHODOLOGY FOR HIV. >> TO SOME EXTENT, A SIMILAR POINT. PART OF THE CHALLENGE SEEMS TO BE IN COMMUNICATION ACROSS DISCIPLINES WITH DIFFERENT METHODS, BOTH METHOD AND THEORY. ON THE COMPUTER SCIENCE SIDE AS WELL, I ASK THEM TO GO DO THE WORK AND THEY COME BACK WITH SOMETHING THAT WOULD NOT MAKE SENSE TO THE OTHER END, BUT THEN ALSO SOME OF THE COMPUTER SCIENCE METHODS, OR METHODS COMING FROM PHYSICS AND SO ON, BECOME VERY, VERY DIFFICULT FOR PEOPLE WHO ARE NOT TRAINED IN THAT FIELD. SO WE NEED METHODS, BUT WE ALSO NEED THEM IN A TURNKEY FORMAT, AT LEAST AT SOME POINT; THAT'S SOMETHING THAT WOULD BE QUITE BENEFICIAL. THE OTHER ASPECT IS THAT FROM YESTERDAY I WALKED OUT WITH THE IMPRESSION THAT THE ETHICAL CHALLENGES ARE VERY ACUTE AND LARGELY UNRESOLVED, AND I THINK TODAY WE SAW A LOT OF POTENTIAL RESOLUTIONS. IT WOULD BE NICE TO HEAR MORE OF THAT DURING THE UPCOMING SESSION, TO NOT WALK OUT WITH THE IDEA THAT THIS SHOULDN'T BE DONE BUT WITH THE IDEA OF WHAT CAN BE DONE AND FIND WAYS TO DO IT.
>> I RESONATE WITH THE SPECTRUM OF OPPORTUNITIES, SPECIFICALLY THE OPPORTUNITY OF INTEGRATING VERY MULTI-DISCIPLINARY DATA, WHETHER IT'S PRIVATE, SYNTHETIC, WHETHER VERY PERM, REAL. ON THE LATTER COMMENT ABOUT BARRIERS, WE'RE AT A POINT WHERE WE ARE TECHNOLOGICALLY ABLE TO OVERCOME THOSE. SO THAT EXPOSES A VERY VAST HORIZON OF ANALYTIC OPPORTUNITIES THAT EVEN A FEW YEARS BACK WERE NOT POSSIBLE. VERY EXCITING. >> FIRST AND LAST. SO I THINK ALL THE OTHER PANELISTS HAVE MADE REALLY IMPORTANT POINTS AND I WILL TAKE A SLIGHTLY DIFFERENT SPIN ON THE QUESTION, WHICH IN SOME WAYS IS NOT THAT DIFFERENT. BUT THERE WERE -- I HEARD INTERESTING OPPORTUNITIES FOR OTHER TYPES OF RESEARCH. FROM A SELFISH SYNTHETIC DATA PERSPECTIVE, THE TYPES OF DATA THAT ARE BEING LOOKED AT AND UTILIZED ARE NOT ONES THAT I HAVE HAD MUCH EXPERIENCE GENERATING PUBLIC USE FILES FOR. AND THERE ARE TECHNIQUES USED TO GENERATE SYNTHETIC PHYLOGENIES AND NETWORKS, SYNTHETIC POPULATIONS. I DON'T KNOW IF THERE'S SYNTHETIC TWITTER, BUT TEXT IN GENERAL, HOW DO YOU GENERATE SYNTHETIC TEXT, TO THINK ABOUT THE FRAMEWORK. SO THERE ARE INTERESTING OPPORTUNITIES THERE. ANOTHER THEME I HEARD A LOT WAS COMBINING INFORMATION FROM MULTIPLE DATA SOURCES, AND THAT BRINGS UP ISSUES OF LINKING RECORDS. AND LINKING RECORDS IS A VERY CHALLENGING THING TO DO, PARTICULARLY WHEN YOU DON'T HAVE EXACT IDENTIFIERS, ONLY NOISY AND COARSE IDENTIFIERS. AND THINKING ABOUT HOW THE UNCERTAINTY THAT COMES FROM INEXACT MATCHING PLAYS INTO ALL THESE KINDS OF INFERENCES THAT ARE BEING DONE, AND PLAYS INTO THE REALLY COOL STUFF, THAT'S A REALLY INTERESTING OPPORTUNITY. >> NOW I WOULD LIKE TO OPEN THE SESSION UP FOR QUESTIONS FROM THE FLOOR, IF YOU WANT TO COME TO THE MIC PLEASE. >> HELEN NISSENBAUM. THIS QUESTION IS ORTHOGONAL, ONE I HAVE REALLY BEEN WANTING TO ASK THE WHOLE PERIOD OF TIME OF THE CONFERENCE. 
THAT IS THE DATA ITSELF, ACCESS TO THE DATA ITSELF. I'M WONDERING HOW DIFFICULT IT IS TO GET YOUR HANDS ON THE DATA YOU NEED IN ORDER TO APPLY THE METHODS AND THE METHODOLOGIES, AND DOES THE DATA MAINLY COME FROM GOVERNMENT SUPPORTED OR, LIKE THE NIH, GOVERNMENTAL INSTITUTIONS? IS THERE A DESIRE, OR DO YOU THINK THERE IS SO MUCH FANTASTIC DATA WE CAN'T GET OUR HANDS ON, AT PHARMACEUTICAL COMPANIES OR MEDICAL INSURANCE COMPANIES, FOR VARIOUS REASONS? I'M INTERESTED IN THIS ARRAY OF QUESTIONS ABOUT ACCESS TO DATA. >> WE RELY ON GOVERNMENT SOURCES BECAUSE THEY'RE MOST LIKELY FREELY AVAILABLE AND WELL MAINTAINED, AND THERE WHEN YOU NEED THEM. WE HAVE USED PROPRIETARY SETS FROM PLACES LIKE DUN & BRADSTREET. MY PERCEPTION IS THAT IN THE PAST LITTLE REPOSITORIES OF DATA AROUND THE COUNTRY, FOR EXAMPLE IN METROPOLITAN PLANNING OFFICES, WERE AVAILABLE IF YOU COULD PROVIDE IN TURN ANALYSIS OF DATA THAT HELPED OUT THOSE ORGANIZATIONS, SO IT'S KIND OF A WE'LL SCRATCH YOUR BACK IF YOU'LL SCRATCH OURS. BUT RECENTLY, ESPECIALLY WITH THE EMERGENCE OF SOCIAL MEDIA, COMPANIES UNDERSTAND THE VALUE OF THE DATA THEY'RE COLLECTING, AND IN MANY CASES THE WHOLE PURPOSE OF THE COMPANY, ITS WHOLE EXISTENCE, IS TO COLLECT THIS DATA. I SEE MANY DATA SETS BECOMING CLOSED OFF TO US BECAUSE THEY'RE TOO EXPENSIVE. AND WE CAN'T PUBLISH ON THEM. IF THE FIRST THING YOU HAVE TO DO TO REPRODUCE RESULTS IS SPEND A MILLION DOLLARS FOR A DATA SET, THAT RESTRICTS PUBLICATION POSSIBILITIES. 
>> A LOT OF THE DATA WE USE IN THESE STUDIES AND IN OTHER RELATED STUDIES OF INFECTION ACROSS THE DEPARTMENT IS TYPICALLY COMBINING RESEARCH DATA, WHICH IS TYPICALLY GOVERNMENT FUNDED RESEARCH, OR FUNDERS LIKE THE GATES FOUNDATION AND WELLCOME TRUST, WHICH TYPICALLY HAVE MANDATES THAT DATA HAS TO BE MADE AVAILABLE, TYPICALLY 12 MONTHS AFTER THE GRANT FINISHES OR WHEN THE GRANT FINISHES OR AS PRODUCED. SEQUENCE DATA, FOR EXAMPLE, FUNDED BY THE WELLCOME TRUST HAS TO BE MADE AVAILABLE ON SERVERS THAT THEY SEQUENCED, AND THE CHALLENGE IS BRINGING IT TOGETHER, MEANING ENSURING COMPLIANCE ON THE PART OF RESEARCHERS AND ALSO DEALING WITH ISSUES OF PATIENT CONFIDENTIALITY AND PRIVACY. THAT IS TYPICALLY DEALT WITH BY COMMITTEE, WHERE YOU SIGN UP AND YOU HAVE A TEMPLATE TO SAY WHAT YOU'RE GOING TO DO AND THAT YOU WILL RESPECT, SO TYPICALLY BEHIND THE FIREWALL, THAT'S THE ONLY PROTECTION. YOU HAVE TO SAY BASICALLY ANONYMIZE THE DATA, AND IN SOME CASES IT MIGHT BE POSSIBLE TO DO SO. THEN GOVERNMENT DATA, INTERGOVERNMENTAL AGENCIES AS WELL, THAT'S TYPICALLY NOT SHARED FREELY, IT'S GIVEN TO RESEARCH GROUPS. WHETHER IT SHOULD BE IS A BIG DEBATE, BUT YOU BUILD UP A RELATIONSHIP WITH GOVERNMENT AND INTERGOVERNMENTAL AGENCIES, DISCUSSIONS THAT TAKE SEVERAL YEARS, INFORMAL DISCUSSION BUILDING UP TRUST, WITH FORMAL LEGAL MATERIAL TRANSFER AGREEMENTS THAT COVER THE DATA AS MUCH AS THEY DO ANY BIOLOGICAL SAMPLE. THOSE ARE LENGTHY PROCESSES EVEN WHEN THERE'S GOOD WILL. >> PART OF OUR EFFORT CHALLENGED THE ASSUMPTION THAT AS A RESEARCHER YOU NEED THE DATA. SOME DATA SHOULDN'T BE GIVEN OUT, BUT THAT DOESN'T NECESSARILY MEAN YOU CAN'T DO ANALYSIS, CAN'T DO TESTING. THAT'S WHAT MOTIVATED OUR PROCESS. THERE IS DATA THAT'S PRIVATE AND YOU NEED TO ANSWER THESE QUESTIONS. IF YOU DON'T VET, IT'S EXTREMELY DIFFICULT TO ANSWER QUESTIONS THAT OPERATIONALLY FOLKS NEED TO ADDRESS VERY IMMEDIATE PUBLIC HEALTH CONCERNS. 
THAT IS A PARADIGM SHIFT WHERE A LOT OF, NOT TO BE FACETIOUS HERE, BUT ANALYTIC WORK STARTED WITH ANALYTIC DUMPSTER DIVING: WE DIVE IN AND SEE WHAT WE GET, SWIM THROUGH THE DATA. THAT'S POWERFUL BECAUSE WE SEE DOTS AND CONNECT THEM, TO GENERATE THEORIES AS WE GO. IN THE PRIVACY WORLD, THINK OF STARTING WITH THE THEORY AND TESTING IT. IN SOME SENSE IT BECOMES ALMOST THE SCIENTIFIC METHOD: WE ESSENTIALLY START WITH OUR THEORY AND HAVE TO GIVE IT TO THE CONTAINER. WE DON'T GET TO LOOK AT THE DATA; IT TESTS OUR THEORY, AS WOULD AN EXPERIMENT. AND THAT PARADIGM SHIFT, BASED ON THE TYPES OF SENSITIVITIES THAT WE SEE AND ALL THE ISSUES WITH PRIVACY, I TEND TO THINK WILL BE COMPOUNDING AND IS PROBABLY A SIGN OF THINGS TO COME. >> MY QUESTION IS BASICALLY, WHAT DO I TAKE BACK TO THE PERSON (INAUDIBLE), SOMETHING SIMPLE SO THAT I CAN SIMPLY SAY THIS IS WHAT HAPPENED WITH THIS, THIS IS THE IMPACT IT WILL HAVE ON YOU. >> ONE THING YOU MIGHT BE ABLE TO TAKE AWAY IS THAT THERE'S POTENTIAL FOR ALL THESE SOURCES OF RICH DATA INCREASINGLY BEING COLLECTED TO HELP US UNDERSTAND WHAT OPPORTUNITIES THERE ARE. WE MIGHT BE ABLE TO GIVE BETTER INFORMATION ABOUT RISK OF TRANSMISSION, RISK OF FAILURE OF THERAPIES AND RISK OF MORBIDITY AND MORTALITY. WE MAY SHARPEN THAT WITH MORE DETAILED INFORMATION. >> I HAVE ONE SPECIFIC QUESTION FOR CHRISTOPHE. I REALLY LIKED YOUR TALK. I WAS INTERESTED, THOUGH, IN THE GENETIC DATA YOU ARE PUTTING TOGETHER AND EVALUATING TRANSMISSION. ARE YOU THEN LOOKING AT HETEROSEXUAL TRANSMISSION IN AFRICA, SEEING IF YOU'RE ABLE TO GET THE PHYLOGENETICS GOING FROM WOMAN TO MAN TO WOMAN? >> ABSOLUTELY, THAT WILL HELP US WORK OUT TRANSMISSION TRENDS. OF COURSE IT'S POSSIBLE THAT -- AND PROBABLY LIKELY, IF THERE IS TRANSMISSION AMONG MEN WHO HAVE SEX WITH MEN, IT'S PROBABLY LESS OPENLY DECLARED. WE SAW SELF-DECLARED HETEROSEXUALS WHO CLEARLY ARE IN MSM TRANSMISSION CLUSTERS IN THE NETHERLANDS, AND IF THERE ARE MSM TRANSMISSION CLUSTERS IN SETTINGS WHERE IT IS LESS ACCEPTABLE -- 
>> BECAUSE IT -- IT ALSO MIGHT INDICATE HOW MUCH TRANSMISSION IS COMING IN FROM OUTSIDE, IF YOU ARE SORT OF SEEING MEN INFECTED BUT YOU CAN'T FIND ANY STRAIN WITHIN THE AREA. >> ONE QUESTION, IN ALMOST ALL SUB-SAHARAN AFRICAN COUNTRIES YOU GET ALMOST 30% HIGHER PREVALENCE IN WOMEN. WOMEN ARE CLEARLY INVOLVED, MAYBE AT THE END OF THE CHAINS. >> ONE QUESTION, FOR EXAMPLE, IS ABOUT YOUNG WOMEN. A LARGE AMOUNT OF MONEY HAS BEEN PUT INTO CAMPAIGNS ON THE SUGAR DADDY, THAT THAT MIGHT BE AN IMPORTANT TRANSMISSION ROUTE. THAT IS CLEARLY A HYPOTHESIS TESTABLE BY PHYLOGENETICS. TO THE CENTERS SUGGEST IN THAT POPULATION, YOUNGER WOMEN WHO HAVE HIGH INCIDENCE ARE GETTING INFECTED FROM YOUNGER MEN, WHICH IS A USEFUL INSIGHT BECAUSE IT WOULD RULE OUT THE SUGAR DADDY. >> I HAVE ONE GENERAL QUESTION. IN THE WORKSHOP PEOPLE ARE TALKING ABOUT BIG DATA. THERE ARE BIG DIFFERENCES BETWEEN DATA ON HIV IN, SAY, THE U.S. AND EUROPE AND DATA IN AFRICA, AND ON ISSUES ABOUT CONFIDENTIALITY OR ACCESS TO THE DATA THERE ARE BIG DIFFERENCES, SO I WONDER IF THE PANEL HAD COMMENTS ON DIFFERENCES WITH BIG DATA IN THE RESOURCE RICH AND THE RESOURCE CONSTRAINED. >> IN TERMS OF SUB-SAHARAN AFRICA, THERE ARE WELL FUNDED LONG TERM COHORTS. SO FROM A DATA PERSPECTIVE THERE ARE LITTLE POCKETS OF RESOURCE RICHNESS IN AN AREA WHERE THERE ISN'T A LOT OF DATA. RECENTLY OUR TEAM HAS BEEN WORKING ON EBOLA AND WE HAVE SEEN THAT IN AN EXTREMELY RESOURCE POOR AND CHALLENGING ENVIRONMENT, IT IS POSSIBLE TO GENERATE BIG DATA SETS. THERE'S NO REASON -- IT WOULDN'T BE AN EXPENSIVE PROPOSITION FOR SOME BASIC DATA ON EVERY SINGLE NEWLY DIAGNOSED INDIVIDUAL TO BE UPLOADED FROM EVERY SINGLE CLINIC EVERY DAY. WE KNOW HOW TO DO THAT. SO WHY WE RELY ON LIMITED SURVEILLANCE FROM A LIMITED SELECTION OF ANTENATAL CLINICS TO GENERATE A PICTURE OF WHAT THE HIV EPIDEMIC WAS THREE YEARS AGO, ONCE A YEAR, THAT'S A LOGISTICAL QUESTION, NOT REALLY A QUESTION OF RESOURCES, I DON'T THINK. WE JUST NEED TO SHARE THE VALUE AND IMPORTANCE OF HAVING THAT UP TO DATE. >> THIS IS VERY LOUD, MAKES ME WANT TO SING. 
I'M SAM GARNER, I DO RESEARCH ETHICS IN THIS BUILDING. LAST NIGHT I ASKED A QUESTION ABOUT RISK AND I WANT TO ASK ABOUT RISK AGAIN BUT WITH A DIFFERENT PANEL. I'M WONDERING IF THE PANEL OR ANY OTHER INVESTIGATORS IN THE ROOM WHO HAVE WORKED WITH LARGE DATA SETS OR BIG DATA SETS, PARTICULARLY TYPES OF DATA COLLECTED WITHOUT CONSENT, COULD TALK ABOUT PRIVACY, RISK, AND OTHER ETHICAL CONCERNS THAT COME UP IN STUDIES THAT THEY HAVE DONE, OR KINDS OF THINGS THEY ENVISION HAPPENING, AND THE WAYS THEY HAVE GONE ABOUT MANAGING THEM FROM AN ETHICS PERSPECTIVE. I WANT A SENSE OF WHAT INVESTIGATORS DEALING WITH THE FIELD ARE GOING TO BE DEALING WITH, THINKING CONCRETELY ABOUT THINGS WE'RE TRYING TO MANAGE, LIKE HOW THESE TYPES OF STUDIES MAY IMPACT INDIVIDUAL PEOPLE, POTENTIALLY NEGATIVELY, DUE TO DATA BREACHES OR OTHER PRIVACY CONCERNS. AND WAIT UNTIL I SIT DOWN BECAUSE I HAVE TO TAKE NOTES. THANK YOU. >> I WON'T WAIT. FOR A LONG TIME WE REGARDED THE POPULATION SYNTHESIS AS EXTREMELY SAFE, LOW RISK FOR IDENTIFYING ANYBODY, BECAUSE WE START -- WE MAKE UP THIS POPULATION THAT IS STATISTICALLY SIMILAR TO THE REAL POPULATION. BUT THE MORE FEATURES WE ADD INTO THIS AND THE BETTER OUR DATA SOURCES AND MODELS BECOME, THE MORE LIKELY IT IS THAT YOU WILL BE ABLE TO RECONSTRUCT INDIVIDUAL PEOPLE WHOSE RECORDS MUST HAVE BEEN PRESENT IN THE DATA SET WE USED TO START WITH IN ORDER TO OBTAIN THE RESULTS. MUCH LIKE JERRY MENTIONED THIS MORNING, IF WE HAVE ENOUGH FEATURES, EVERYBODY IS A POPULATION UNIQUE. WE REGARD OURSELVES AS UNIQUE PEOPLE WITH REASON. SO WHEN I HEARD SOLON'S TALK I WAS THINKING THESE ARE STATISTICAL MODELS, I'M NOT TERRIBLY WORRIED THAT I'M IDENTIFIED BASED ON THE OUTCOME OF A STATISTICAL MODEL. I MAY HAVE TO RETHINK THAT A LITTLE BIT. WE STARTED WITH THE POPULATION SYNTHESIS FOR ONE REASON, SPECIFICALLY BECAUSE OF PRIVACY CONCERNS. WE WERE WORKING AT LOS ALAMOS NATIONAL LAB. IT'S ONE THING FOR THE NIH TO SAY WE WOULD LIKE TO SEE YOUR DATA, WE'RE FROM A GOVERNMENT AGENCY, WE'RE HERE TO HELP YOU. 
ANOTHER ONE TO GO AROUND AND SAY WE'RE FROM YOUR FRIENDLY NEIGHBORHOOD NUCLEAR WEAPONS LAB, WE WOULD LIKE TO KNOW WHAT YOU BOUGHT YESTERDAY. SO WE HAVE BEEN AWARE OF THESE ISSUES FOR A LONG TIME. I THOUGHT THE SOLUTION WAS SYNTHESIZING AND I'M STARTING TO REALIZE THAT -- THERE MAY BE MORE TO IT THAN THAT. >> MIKE KOHN. I HAVE A PROVOCATIVE QUESTION. THAT IS, PEOPLE DO EXPERIMENTAL RANDOMIZED CONTROLLED TRIALS. IN A SENSE ONE PURPOSE OF THE MEETING IS TO COMPARE THE OPPORTUNITIES IN RCTs VERSUS OBSERVATIONAL DATA, BUT PEOPLE WHO DO RCTs ARE ASKED AS THEY DO THEM: CAN THE RESULT OF THAT TRIAL CHANGE PUBLIC HEALTH POLICY OR MEDICAL TREATMENT? YOU'RE TRYING TO FIND A BIG ENOUGH QUESTION TO JUSTIFY ALL THE TIME AND EXPENSE. I SUSPECT THE GOAL FOR BIG DATA AND FOR OUR MEETING IS IN FACT TO TRY TO CREATE AN ENVIRONMENT WHERE SOMEHOW THE DATA AGGREGATED IS USED TO CHANGE POLICY. BUT THE PROBLEM IS, FOR EXAMPLE, LOOKING AT THE NEXT MEETING COMING UP, THERE MUST BE AT LEAST TEN GIANT DATA SETS SAYING WE CAN START -- STARTING ANTIVIRAL TREATMENT IMMEDIATELY IS INCREDIBLY BENEFICIAL TO AN INDIVIDUAL'S HEALTH. SOME PEOPLE ON STAGE HAVE WRITTEN PAPERS ABOUT THAT. YET THEY HAD LITTLE TRACTION FROM WHO UNTIL A MONTH AGO, WHEN A RANDOMIZED CONTROLLED TRIAL, AND IT WAS A SMALL TRIAL, STOPPED EARLY BECAUSE IT DEMONSTRATED A CHANGE. CHRISTOPHE STARTED GETTING AT THIS IDEA OF USING THE DATA GENERATED IN THE NETHERLANDS TO CREATE A CHANGE IN PUBLIC HEALTH POLICY, TRYING TO HAVE AN ALGORITHM THAT WOULD OFFER TO SPECIFIC GROUP OF PEOPLE. THE QUESTION IS HOW DO THE PEOPLE WHO DO SUCH BIG DATA GET THE TRACTION NECESSARY TO MAKE ALL THE EFFORT WORTHWHILE? >> SO I THINK IT'S FIRST CONVINCING YOURSELF THAT THE ANSWERS ARE AS CLOSE TO BEING CORRECT AS AN RCT IN A SITUATION WHERE YOU HAVE RCT DATA. THERE ARE BIASES IN OBSERVATIONAL DATA AND THERE ARE WAYS TO GET AROUND THAT, BUT THE RCTs GET RID OF THOSE BIASES. SO OBSERVATIONAL DATA IS A GRADE BELOW. 
ON THE OTHER HAND OBSERVATIONAL DATA HAS THE BENEFIT OF SCOPE. SO IF YOU LOOK AT, TALKING ABOUT PEPFAR, THEY WANT BIG DATA BECAUSE THEY WANT TO KNOW WHERE TO SPEND THEIR DOLLARS, AND WANT TO SPREAD EVENLY OVER THE COUNTRY. THAT'S NOT ANSWERABLE BY RCT. IT'S A VERY DIRECT IMPLICATION OF BIG DATA, ESSENTIALLY REDIRECTING MONEY FROM ONE AREA TO ANOTHER. AND IT ADDRESSES THE DIRECT QUESTION OF HOW TO GET MORE PEOPLE HAVING TREATMENT IN THE RIGHT PLACE. >> SO I THINK THERE ACTUALLY HAS TO BE A CULTURAL CHANGE IN THE WAY WE THINK ABOUT OUR EVIDENCE. WE OFTEN TAKE AN RCT RESULT AS A TRUTH, AS A FACT. AND IF WE TAKE A STEP BACK AND THINK ABOUT POTENTIAL WAYS IN WHICH IT'S BROKEN, AND WE DON'T HAVE A POINT ESTIMATE ANY MORE, WE HAVE THESE BOUNDS EVEN FROM THE WELL CONDUCTED RCT, THEN IT WILL START TO MUDDY THE DISTINCTION BETWEEN THAT AND A WELL CONDUCTED OBSERVATIONAL STUDY. SO WE START TO PUT THE INFORMATION WE GET FROM TRIALS OR BROKEN TRIALS ON THE SAME FOOTING, AND WHAT WE SEE CLEARLY WHEN WE DO THAT IS THAT OBSERVATIONAL STUDIES ARE WEAK, GENERALLY THEY ARE WEAKER THAN TRIALS, SO OUR PREFERENCE FOR TRIALS MAKES SENSE LOGICALLY. BUT WE TEND TO CLASSIFY, IT'S SORT OF A HUMAN TRAIT TO CLASSIFY, WE TEND TO CLASSIFY TRIALS AS GOOD, OBSERVATIONAL STUDIES AS QUESTIONABLE, SOMETIMES GOOD, SOMETIMES NOT. IT'S MORE NUANCED THAN THAT. I THINK THAT CULTURAL CHANGE IN THINKING ABOUT THE EVIDENCE WILL HELP IN THAT DIRECTION. >> WE'RE RUNNING OUT OF TIME. WE DO NEED TO TAKE LUNCH. I WANT TO ASK A REALLY PRACTICAL QUESTION. I WAS SURPRISED YESTERDAY WITH THE USAGE OF TWITTER, THAT TWITTER IS SUCH A VERY, VERY YOUNG POPULATION. SO I DID MATH IN THE CAR, IF THEY WERE 15 TO 20 AND THAT OTHER 25% IS LIKE GOVERNMENT AGENCIES, MOVIE STARS AND THE KARDASHIANS AND ROBOTS. SO WHAT OTHER TEXTUAL DATA DO WE HAVE FROM THE REST OF THE COMMUNITY? IF WE WANT TO LOOK AT HOW TRANSMISSION FOR MSM MAY HAVE THE SAME GENERATIONAL ISSUE OF OLDER MEN AND YOUNGER MEN, HOW DO WE GET THE 20 TO 30-YEAR-OLD POPULATION, WHAT OTHER SOURCES ARE OUT THERE? 
>> SOCIAL MEDIA IN GENERAL ARE POPULATED BY YOUNGER PEOPLE. I THINK THE YOUNGER SEGMENT IS (INDISCERNIBLE) FOR HIV. THE ACCOUNT, WHOSE PROFILE THIS IS, IS NOT THE ONLY FACTOR, BECAUSE WE CAN ASSUME WHAT'S BEING PROJECTED OUT FROM SOCIAL MEDIA REFLECTS SOMETHING ABOUT SOCIAL REPRESENTATIONS IN THE POPULATION, AND THAT MIGHT CONTAIN PEOPLE BEYOND THE TYPICAL 24-YEAR-OLD. OTHERWISE YOU WOULD NOT SEE ANY ASSOCIATIONS WITH WHAT'S BEING TWEETED, IF WE ASSUME IT ONLY MATTERS FOR THOSE ON TWITTER. THAT WILL BE ONE RESPONSE. THERE ARE OTHER SOURCES, LIKE DIFFERENT COMMUNICATIONS, THAT MIGHT HAVE OTHER GROUPS BUT TEND TO BE (INAUDIBLE). THE ADVANTAGE OF TWITTER IS -- THE MASSIVE DATA. IF YOU'RE ONLY DEALING WITH 100 POSTINGS ON A FORUM IT'S JUST NOT THE SAME, IF YOU'RE TRYING TO DO GEOGRAPHIC ANALYSIS. IT DOESN'T HAVE THE SAME FLEXIBILITY. THOSE WILL BE THE PROS AND CONS. THE NEXT STEP IS TO USE RECORDINGS OF REAL LIFE COMMUNICATION IN PUBLIC SETTINGS. IMAGINE CREATING A SOUP OF ALL THE CONVERSATIONS HAPPENING IN THIS BUILDING AT THIS POINT IN TIME. THIS IS DONE IN THOSE FIELDS, ANALYZING THOSE KINDS OF (INAUDIBLE). >> I THINK THE OTHER PLACE YOU FIND TEXT IN A VERY IMMEDIATELY RELEVANT CONTEXT IS SITES LIKE GRINDR WHICH ARE PRIVATE. SO THERE ARE A LOT OF PRIVATE SEXUAL MATCH UPS. THOSE DATA ARE NOT ALWAYS MADE AVAILABLE FOR THE KINDS OF REASONS EVEN MENTIONED. AND THAT'S -- BUT THERE'S BOTH TEXT AND OTHER INFORMATION THERE WHICH IS DIRECTLY RELEVANT TO HIV TRANSMISSION. AND I THINK THAT WOULD BE AN INTERESTING PLACE. I KNOW I HAVE SEEN VARIOUS NIH PROPOSALS COME IN TO ANALYZE DATA LIKE THAT. >> GRINDR YOU CAN'T EVEN GET A RESPONSE FROM. IT'S AN UNKNOWN NUMBER WE HAVE AN EMAIL ADDRESS, PROFILES THAT ARE PUBLIC THAT YOU CAN ANALYZE THE 8 IS SHOWN PREDOMINANT. [APPLAUSE] >> THANK YOU TO THE PANELISTS. THIS WILL BE THE CONCLUSION OF THE VIDEOCAST PORTION OF THE MEETING. WE WANT TO SAY THANK YOU TO THE ONLINE PARTICIPANTS, WHO HAVE NUMBERED AROUND 100. 
WE APPRECIATE YOUR PARTICIPATION AND THERE WILL BE A PARTICIPANT LIST AND MEETING SUMMARY DISTRIBUTED TO THE REGISTRANTS. FOR THE IN PERSON ATTENDEES, WE'RE GOING TO BREAK FOR LUNCH AND COME BACK TO THIS ROOM TO RECEIVE THE CHARGE TO THE BREAK OUT GROUPS AT 1:30. THANK YOU SO MUCH.