The dataset available shows the customer reviews of hotels and resorts for the period of Aug 2018 - Aug 2019. The total number of rows is 6448 and the column consists of 3 information i.e. Review, Date, and Location. The dataset is not clean i.e. Inconsistent location- structured formatting required, inconsistent date format, Null and blank values identified.
The Review column is of qualitative nature and cannot be measured but I tried exacting common review Keywords using SQL βLIKEβ function in BigQuery which helped me to extract quantitative data.
- Amazing
- Great
- Good
- Ok
- Average
- Terrible
- Bad
- Worst
After applying these keywords, I identified 3905 reviews that are classified as:
- Good
- Average
- Bad
Total Population (Row) | 6448 |
---|---|
Time Period | Aug 18- Aug 19 |
Quantitative review data sample set acquired | 3905 out of 6448 |
Quantitative review data sample set:
Good | 2561 |
---|---|
Average | 1039 |
Bad | 305 |
I observed we had 6448 end-users of our services. By aggregating the dataset I was able to measure month-on-month customer acquisition which was at a rate of 7.6% per month.
We had the highest customer acquisition in the month of Dec 2018 i.e. 9.7%. This may be because of the long Christmas holidays.
We see a fall in customer acquisition in the year 2019 by 0.9% as compared to the previous year.
I identified NULL and Blank values in the location column. They constitute 73.5% of the total population and need to be addressed swiftly.
The location column is unstructured and difficult to draw any admissible insights. After an initial level of cleaning, the following insights were found
Top 5 locations
- New York
- California
- Florida
- Puerto Rico
- Michigan
Top 10 locations in the world.
- United States of America
- United kingdom
- Canada
- France
- Italy
- New Zealand
- Australia
- Japan
- South Africa
- Germany
- Monthly customer acquisition is at 7.6%.
- December 2018, had the highest customer acquisition of 9.7%.
- Customer reviews: Good- 65.58%, Average- 26.6%, Bad- 7.8%.
- Top 5 locations in us- New York, California, Florida, Puerto Rico, and Michigan. Top 5 Locations in the World- USA, UK, Canada, France, and Italy.
- Location includes 73.5% of null and blank cells.
- Introduce structured formats to better capture data.
- Introducing Geo location tracker to capture exact location automatically or using pin codes to eliminate NULL and Blanks.
- Introduce a Quantitative scale(1-10) to better understand the customer ratings
- Improve services at low-performing locations.
- Introduce attractive offers during the holidays to increase customer acquisition.
Hope you find this insightful.
I would also like to learn how to clean this dataset to make it much more usable such as
- Extract keywords to separate column and categorize them accordingly.
- Cleaning the location to a universally acceptable format
By doing so, I will be able to provide more accurate customer responses and plot performance categorized areas based on location.