sannpeterson / semantic-search-log-analysis-pipeline

This project forked from ncbi-hackathons/semantic-search-log-analysis-pipeline

Classify web visitor queries so you can chart, and respond to, trends in information seeking

License: MIT License

Languages: Python 4.18%, Jupyter Notebook 2.29%, JavaScript 57.89%, CSS 14.44%, HTML 14.13%, ColdFusion 6.10%, PHP 0.97%
Topics: bioinformatics, data-science, ncbi, hackathon, biomedical, ontology

Semantic Search Log Analysis Pipeline (SSLAP)

Chart and respond to trends in information seeking by classifying your web visitor queries to an ontology

(Code is under revision. Functional through phase 03.)

The logs of internal search for large biomedical web sites can be too verbose and too inconsistent to make sense of. Logs for one NIH site contain more than 200,000 queries per month, with many variations on the same conceptual ideas. Aggregating log entries such as "ObamaCare," "ACA," and "Affordable Care Act," for example, is far too time-consuming for a human to do by hand. This leads to several missed opportunities in communication management.

Product managers and others could be using this data to understand the environment and their customers better and to improve their web sites, but without automation the human effort required has not been worth the return on investment. If there were a way to automatically unite queries that are similar but not identical under broader topics that could be aggregated and compared over time, we could more easily explore patterns in the vast amount of data generated and begin to interpret their meaning.

Goals/Scope

The current work addresses several benefits of analyzing site search; specifically, we will be able to:

  1. Cluster and analyze trends that we know about. For multi-faceted topics that directly relate to our mission, we can create customized analyses using Python that collect the disparate keywords people might search for into a single "bucket." Where in the site is there interest in various facets of this subject? Analyzing a trend can show us new constellations of resources that we may not be treating as related. If we selected a constellation topic such as "opioids" as a subject of study, our bucket might include terms around Fentanyl, heroin, drug treatment, overdose, emergency medicine, etc., and we could then look at where each of these visitors should be in our site and change the site to help them get there.
  2. Focus staff work onto new trends, as the trends emerge. When something new starts to happen that can be matched to our mission statement, we can start new content projects to address the emerging need.
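
The "bucket" idea in item 1 can be sketched in a few lines of Python. The keyword list and function name below are illustrative stand-ins, not the project's actual buckets:

```python
# Sketch: collect disparate query variants under a single topic "bucket".
# These keyword lists are illustrative, not the project's actual buckets.
BUCKETS = {
    "opioids": {"fentanyl", "heroin", "drug treatment", "overdose",
                "emergency medicine", "naloxone"},
}

def buckets_for(query, buckets=BUCKETS):
    """Return the names of all buckets whose keywords appear in the query."""
    q = query.lower()
    return [name for name, keywords in buckets.items()
            if any(kw in q for kw in keywords)]

print(buckets_for("fentanyl overdose statistics"))  # ['opioids']
```

In practice the buckets would be curated per topic of study, and counts per bucket charted over time.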

Reporting at three levels of specificity / granularity

This project standardizes multiple customer versions of a search concept into one concept that can be accurately aggregated.

The Unified Medical Language System (UMLS) API supplies a preferred term for what site visitors typed, when possible. We also include "fuzzy matching" against data from a web site spidering, because many product and service names, proper names, and other entities are not covered by UMLS. Here, for example, is a before-and-after study of how search behavior changed after a home page redesign.

(Figure not rendered here; contact Dan for assistance.)
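
A minimal sketch of this two-step normalization, assuming you have a UMLS API key and a list of page titles from a site spidering. The endpoint is the UTS REST search service; the helper names are ours, and the UMLS call is defined but not executed here:

```python
# Step 1: ask UMLS for a preferred term. Step 2: if UMLS has nothing,
# fuzzy-match against titles collected by spidering the site.
import difflib
import json
import urllib.parse
import urllib.request

UMLS_SEARCH = "https://uts-ws.nlm.nih.gov/rest/search/current"

def umls_preferred_term(query, api_key):
    """Ask the UMLS search API for the best-matching concept name, or None."""
    url = UMLS_SEARCH + "?" + urllib.parse.urlencode(
        {"string": query, "apiKey": api_key})
    with urllib.request.urlopen(url) as resp:
        results = json.load(resp)["result"]["results"]
    return results[0]["name"] if results else None

def fuzzy_site_match(query, site_titles, cutoff=0.8):
    """Fallback: fuzzy-match the query against spidered page titles."""
    hits = difflib.get_close_matches(query, site_titles, n=1, cutoff=cutoff)
    return hits[0] if hits else None

titles = ["Affordable Care Act", "Clinical Trials", "MedlinePlus"]
print(fuzzy_site_match("Affordable Care Act", titles))  # closest title, if any
```

The cutoff value is a tuning knob: higher values return fewer, more exact matches.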

Given a preferred term, the UMLS API can provide one or two (perhaps more) broader grouping categories called Semantic Types, of which there are around 130. (This hierarchical report is still under revision.)

(Figure not rendered here; contact Dan for assistance.)
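
The rollup from Semantic Types to Semantic Groups is a simple lookup. The mapping below is a small illustrative subset; NLM distributes the full table as SemGroups.txt:

```python
# Illustrative subset of the Semantic Type -> Semantic Group rollup; the full
# mapping (about 130 types into 15 groups) is published by NLM (SemGroups.txt).
SEMANTIC_GROUP = {
    "Disease or Syndrome": "Disorders",
    "Sign or Symptom": "Disorders",
    "Pharmacologic Substance": "Chemicals & Drugs",
    "Therapeutic or Preventive Procedure": "Procedures",
    "Body Part, Organ, or Organ Component": "Anatomy",
}

def rollup(semantic_types):
    """Map a concept's Semantic Types to their broader Semantic Groups."""
    return sorted({SEMANTIC_GROUP[t] for t in semantic_types
                   if t in SEMANTIC_GROUP})

print(rollup(["Disease or Syndrome", "Sign or Symptom"]))  # ['Disorders']
```

Because the groups are mutually exclusive, reports at this level can be aggregated without double counting.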

At the highest level, there are 15 Semantic Groups that cover all of health and medicine and much of the life sciences in mutually exclusive categories. Here, for example, is a small sample dataset covering only 7 days.

(Figure not rendered here; contact Dan for assistance.)

How to Install Required Packages

Hunspell

Hunspell is built on C libraries, so before it can be called from Python the libraries must be available on the local machine. To begin, clone the Hunspell repo, cd into the directory, and then follow the instructions for your operating system.

If you have a Mac and would rather not use Homebrew to install the required packages: find the non-Homebrew instructions for installing autoconf, automake, and libtool here. Find something similar to install boost and gettext. Then follow the ReadMe instructions beginning at the "Then run the following commands" step of the "Compiling on GNU/Linux and Unixes" section (assuming you have a Mac or Linux).

After that is finished, install pyhunspell using pip or, if you prefer conda, use conda install -c conda-forge hunspell

Using the Hunspell Spellchecker

The Hunspell spellchecker is required for checking the spelling of search terms that are not accepted by the UMLS API on the first pass.

The repo includes all the tools needed to make new dictionaries whenever the SPECIALIST LEXICON is updated, specifically the affixcompress tool in the src/tools folder. To use the tool, gather a list of all the words you want to use, sort them, and then run the tool on the list. Note that this should only be done one language at a time, because the dictionary creator derives affix rules from the words. You can specify more than one set of dictionaries to be used at a time, if you wish.
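
A sketch of preparing that word list in Python, under the assumption that affixcompress takes one sorted, de-duplicated word per line (the lowercasing here is a simplification):

```python
# Sketch: build the sorted, de-duplicated word list that affixcompress takes
# as input. Do this one language at a time, as noted above.
def prepare_wordlist(words):
    """Strip whitespace, lowercase, de-duplicate, and sort the words."""
    return sorted({w.strip().lower() for w in words if w.strip()})

print(prepare_wordlist(["Aspirin", "aspirin", "ibuprofen", "  naproxen  ", ""]))
# ['aspirin', 'ibuprofen', 'naproxen']
```

The resulting list would then be written to a file and passed to affixcompress.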

In the 02_Run_API.py file, list the .aff and .dic files you create under the #Initialize spellchecker comment. More than one set of dictionaries can be listed, I believe.

How to Analyze Your Own Data

If you have direct access to your own search logs, start with the 01 Python script. If you are exporting from Google Analytics > Behavior > Site Search, open the 01 output file listOfUniqueUnassignedAfterPhase1.xlsx, make your columns match it, and start with the 02 Python script. You will need to remove all procedures for dataframes whose names start with "log."

Recommendation for Google Analytics users who have configured site search: create a custom report with two dimensions, Start Page and Search Term, and one metric, Total Unique Searches. Set a filter on Start Page URL to restrict to one domain, if needed.
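
A sketch of reshaping such a Google Analytics export to match the pipeline's input columns. The target column names here are placeholders; copy the real ones from listOfUniqueUnassignedAfterPhase1.xlsx:

```python
# Sketch: remap Google Analytics site-search export columns to the pipeline's
# expected columns. The right-hand names below are placeholders.
import csv
import io

COLUMN_MAP = {  # GA column -> pipeline column (placeholder names)
    "Search Term": "Query",
    "Total Unique Searches": "TotalUniqueSearches",
}

def remap_rows(ga_csv_text):
    """Read a GA CSV export and rename/keep only the mapped columns."""
    reader = csv.DictReader(io.StringIO(ga_csv_text))
    return [{COLUMN_MAP[k]: v for k, v in row.items() if k in COLUMN_MAP}
            for row in reader]

sample = "Start Page,Search Term,Total Unique Searches\n/home,aca,42\n"
print(remap_rows(sample))  # [{'Query': 'aca', 'TotalUniqueSearches': '42'}]
```

The same idea works in pandas with DataFrame.rename if you prefer to stay in the pipeline's dataframe workflow.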

Workflow

Whole-project view. We do not have a deployable software package at this time; this repo contains scripts that can be run together or separately.

9/13/2018: Incomplete or not-started elements have a dotted border.

(Figure not rendered here; contact Dan for assistance.)

Future Directions

  1. Collapse the SPECIALIST LEXICON into a dictionary that can be ingested by Hunspell, to create better spellchecking matches for search terms.
  2. For matching the web content that won't be findable in UMLS, spider the whole web site and cluster by body text, instead of only matching the titles/metadata headings. R&D switching to BeautifulSoup and using a tf-idf vectorizer; create a word bank for each topic.
  3. Other NLP toolset integrations:
  • Google's search algorithm has a very good suggestion system, backed by billions of search queries, to correct for misspellings; using an available API could be worth looking into instead of reinventing the wheel.
  • Stanford has a toolset for named entity extraction that could be used on web site pages.
  4. R&D whether to start pulling down relationships to assist with queries such as "yoga nutrition." https://www.nlm.nih.gov/research/umls/META3_current_relations.html
  5. Consider capturing IDs for preferred terms for later use, so people can use the Wikidata, SPARQL, etc. connections.
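
The tf-idf idea in item 2 can be sketched without any dependencies; a real implementation would more likely use scikit-learn's TfidfVectorizer on page bodies extracted with BeautifulSoup:

```python
# Minimal tf-idf sketch for building a "word bank" per topic: terms that are
# frequent in one document but rare across the site get the highest weights.
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: tf-idf weight} dict per whitespace-tokenized doc."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc.split()))
    weights = []
    for doc in docs:
        tf = Counter(doc.split())
        total = sum(tf.values())
        weights.append({t: (c / total) * math.log(n / df[t])
                        for t, c in tf.items()})
    return weights

docs = ["opioid overdose treatment", "yoga nutrition", "opioid addiction"]
w = tfidf(docs)
```

Terms appearing in every document get weight zero, which is exactly the behavior wanted for filtering out site-wide boilerplate words.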

Contributors

wendlingd, njohnso6, streety, austinkim-mbs, marskar, allissadillman

