Classify website visitor queries into measurable categories

Chart and respond to trends in information seeking, by classifying your web visitor queries to a taxonomy and an ontology, using completely anonymized data

Search analytics data for large web sites is often too verbose and too inconsistent to analyze. One "portal" site studied receives around 150,000 "clicks" per month from search-engine results screens, and around 100,000 queries per month from internal site search. Examining the visitor queries reveals many variations on the same conceptual ideas, which makes the content difficult to analyze and summarize. For this reason, many web managers do not look for meaning in the search terms their site visitors use.

We should put our completely anonymized search queries into "buckets" of broader topics, so that subject matter experts (SMEs) can see how customers currently seek information within their topic areas. With this view, SMEs can examine whether existing content should be modified to improve its findability, and whether new content should be added to fill gaps in customer needs. Health and medical analytics managers can use the Unified Medical Language System (UMLS) Semantic Network and Semantic Groups to build these buckets and serve customer information needs better.
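The bucketing idea can be sketched in a few lines of Python. This is a minimal illustration, not the project's code: the `TERM_TAGS` table, the `tag_queries` function, and the sample log are hypothetical. The semantic type and group names are real UMLS labels, but the term assignments are only examples.

```python
from collections import Counter

# Hypothetical tagging table: search term -> (semantic type, semantic group).
# Type and group names are UMLS labels; the assignments are illustrative.
TERM_TAGS = {
    "ibuprofen": ("Pharmacologic Substance", "Chemicals & Drugs"),
    "aspirin": ("Pharmacologic Substance", "Chemicals & Drugs"),
    "diabetes": ("Disease or Syndrome", "Disorders"),
    "asthma": ("Disease or Syndrome", "Disorders"),
}


def tag_queries(rows):
    """Sum search counts per semantic group.

    rows: iterable of (query, count) pairs from an aggregated search log.
    Queries missing from TERM_TAGS are counted under "Unassigned".
    """
    totals = Counter()
    for query, count in rows:
        key = query.strip().lower()  # normalize case and stray whitespace
        _, group = TERM_TAGS.get(key, (None, "Unassigned"))
        totals[group] += count
    return totals


# Made-up sample of an aggregated log: one row per distinct query.
sample_log = [("Ibuprofen", 120), ("diabetes", 300), ("aspirin ", 80), ("parking map", 15)]
group_totals = tag_queries(sample_log)
for group, count in group_totals.most_common():
    print(f"{group}: {count}")
```

An exact-match lookup like this only catches queries that are already in the table. Misspellings and multi-concept queries need the fuzzier tools listed under Tools.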

Search represents a direct expression of our customers’ intent. We should use this data to improve our staff’s awareness of what our customers need from us.

Use cases

  1. A web analyst could say to a product owner, "Did you know that last month, 30 percent of your home page searches were in some way about drugs? Should we take action on this? How might we improve task completion and reduce time on task for this type of information need?"
  2. We should cluster and analyze trends we know about. For multi-faceted topics that directly relate to our mission, we should create customized analyses to collect the disparate keywords people might search for into a single bucket. How can we create a better match between user interest and the content we manage for this topic? Where might we improve our site structure and navigation?
  3. We should focus staff work on new trends, as the trends emerge. When something new starts to happen that can be matched to our mission statement, we should deploy social media posts on the new topic immediately, and start new content projects to address the emerging information need.
  4. On a longer time scale, anyone publishing to the web might want to ask: how are we preparing to support voice search? Understanding how people search for information will help us understand how to adapt to this possible next-generation technology.
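Use cases 1 and 3 both come down to comparing each category's share of searches from one month to the next. The sketch below shows one hypothetical way to do that. The function names, the `min_gain` threshold, the "Custom: vaping" bucket, and all counts are invented for illustration.

```python
def category_shares(counts):
    """Convert per-category search counts into fractions of the total."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {category: n / total for category, n in counts.items()}


def rising_categories(previous, current, min_gain=0.03):
    """List categories whose share grew by at least min_gain, largest gain first.

    A category absent from the previous month is treated as having share 0.
    """
    prev_shares = category_shares(previous)
    cur_shares = category_shares(current)
    gains = {c: s - prev_shares.get(c, 0.0) for c, s in cur_shares.items()}
    rising = [c for c, gain in gains.items() if gain >= min_gain]
    return sorted(rising, key=lambda c: gains[c], reverse=True)


# Invented monthly totals per bucket (not pilot data).
september = {"Chemicals & Drugs": 250, "Disorders": 600, "Procedures": 150}
october = {"Chemicals & Drugs": 300, "Disorders": 500, "Procedures": 100, "Custom: vaping": 100}

october_shares = category_shares(october)
rising = rising_categories(september, october)
print(f"Drug-related share in October: {october_shares['Chemicals & Drugs']:.0%}")
print("Rising categories:", rising)
```

A share-based comparison keeps months with different total traffic comparable. A real threshold would need tuning against historical data.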

Pilot project results

After multiple iterations that updated the tagging files, 72% of search volume for October 2019 is tagged with broader-topic names within 3 minutes. This was 205,633 of 282,387 searches. Because the logs are already aggregated, this corresponds to 30,604 of 89,476 rows (34%). The untagged terms are those searched, on average, less than once per day over the month; they are often low-frequency searches that combine multiple concepts.
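The gap between the two percentages follows from the aggregation: each row is one distinct query plus its count, and the tagged rows are the high-frequency ones. The short calculation below uses only the four figures reported above. The variable names are ours.

```python
# Figures reported for October 2019.
searches_tagged, searches_total = 205_633, 282_387
rows_tagged, rows_total = 30_604, 89_476

# Coverage measured by search volume and by distinct query rows.
volume_coverage = searches_tagged / searches_total
row_coverage = rows_tagged / rows_total

# Average number of searches behind each tagged and each untagged row.
avg_per_tagged_row = searches_tagged / rows_tagged
avg_per_untagged_row = (searches_total - searches_tagged) / (rows_total - rows_tagged)

print(f"Volume coverage: {volume_coverage:.1%}")
print(f"Row coverage: {row_coverage:.1%}")
print(f"Searches per tagged row: {avg_per_tagged_row:.2f}")
print(f"Searches per untagged row: {avg_per_untagged_row:.2f}")
```

Tagged rows average between six and seven searches each, while untagged rows average between one and two. That is why roughly a third of the rows covers most of the volume.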

During the pilot we did not create supplemental files for the MetaMapLite or CSpell tools. Creating them would likely improve results.

Screenshots

[Screenshot: screen to upload file]

[Screenshot: UMLS Semantic Types categories]

Workflow

Only partially implemented during this Codeathon.

[Figure: workflow diagram]

Dependencies

Tools

Yet to be integrated; may be useful:

  • Medical language abbreviations
  • Scikit-Learn multi-class classifier

Additional output

[Figure: search strings input used for MetaMap and FuzzyWuzzy]

Future work

  • Implement a tagging interface that provides suggestions for untagged queries above a frequency threshold, to facilitate manual tagging.
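As a rough sketch of what the suggestion step could look like, the code below proposes existing tags for frequent untagged queries. It uses the standard library's `difflib.get_close_matches` as a stand-in for FuzzyWuzzy, so that it runs without extra installs. The `suggest_tags` function, the threshold of 30 searches per month, the similarity cutoff, and the sample data are all assumptions.

```python
import difflib


def suggest_tags(untagged_rows, tagged_terms, min_count=30, cutoff=0.8):
    """Propose existing tags for untagged queries at or above min_count.

    untagged_rows: iterable of (query, monthly_count) pairs.
    tagged_terms: dict mapping an already-tagged term to its category.
    Returns {query: [(similar_term, category), ...]}; an empty list means
    no close match was found and the query needs a fresh manual decision.
    """
    candidates = list(tagged_terms)
    suggestions = {}
    for query, count in untagged_rows:
        if count < min_count:
            continue  # below the (assumed) frequency threshold
        matches = difflib.get_close_matches(query.lower(), candidates, n=3, cutoff=cutoff)
        suggestions[query] = [(term, tagged_terms[term]) for term in matches]
    return suggestions


# Made-up examples of tagged terms and untagged log rows.
tagged_terms = {"ibuprofen": "Chemicals & Drugs", "diabetes": "Disorders"}
untagged = [("ibuprophen", 45), ("diabetis", 31), ("parking map", 40), ("asprin", 2)]

suggestions = suggest_tags(untagged, tagged_terms)
for query, options in suggestions.items():
    print(query, "->", options or "no suggestion")
```

A person would still confirm or reject each suggestion. The point is only to shorten the list of queries that need manual tagging.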

Influences and thanks

People

  • Dan Wendling, team lead, NLM/LO/PSD
  • Dmitry Revoe, NLM/NCBI/MGV
  • Victor Cid, NLM/LHC/CgSB
  • Laritza Rodriguez, NLM/LHC/CSB
  • Wenya Rowe, NLM/NCBI/CBB
  • Rachit Bhatia, NLM/OCCS/STB

Past work

https://github.com/NCBI-Hackathons/Semantic-search-log-analysis-pipeline

Issues

Post 'classify searches' MVP

Need an uncomplicated version of this code that people can start using, an MVP that covers GA to two Tableau workbooks: SME_drill-down and Trend_Analysis.

Leave out what's not done.