Synopsis

Figure: Performance of different algorithms using different numbers of features

Synopsis:
The City of Boston regularly inspects restaurants to monitor whether they are following food safety and public health rules. It records health violations for all restaurants at three levels: * (one star) "minor", ** (two stars) "major", and *** (three stars) "severe". Currently the health inspections are random, which wastes time and effort on inspecting clean restaurants that have been following the rules closely, and misses opportunities to improve health and hygiene at places with more serious food safety issues.

Our goal is to predict the number of health violations at each of the three levels for an inspection of a given restaurant in the City of Boston on a specific date, using Yelp data. Such a predictive model would help the City of Boston make inspections more efficient by using Yelp review data to predict possible violations, so that inspections are conducted only at restaurants where they are necessary. This task is important because it could substantially improve the City's inspection efforts and change the way inspections are organized.

We retrieved the data from DrivenData, which hosts social-impact challenges in the field of data science. We used Yelp reviews to predict violations: we processed the review text written before each inspection date and applied TF-IDF to build a feature matrix of word weights. These features were the input for predicting violation counts. We used root mean squared logarithmic error (RMSLE) as the evaluation metric and an ordinary least squares (OLS) regression model as our benchmark (RMSLE = 1.1386).
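For reference, RMSLE can be computed as in the minimal Python sketch below. It uses the standard definition: the square root of the mean squared difference between log(1 + predicted) and log(1 + actual). The function name `rmsle` and the sample counts are illustrative assumptions, not taken from the project.

```python
import math


def rmsle(predicted, actual):
    """Root mean squared logarithmic error of two equal-length sequences."""
    if len(predicted) != len(actual) or len(predicted) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_log_diffs = [
        (math.log1p(p) - math.log1p(a)) ** 2
        for p, a in zip(predicted, actual)
    ]
    return math.sqrt(sum(squared_log_diffs) / len(squared_log_diffs))


# Made-up violation counts for three inspections (illustration only).
example_predicted = [2, 0, 1]
example_actual = [3, 0, 0]
example_score = rmsle(example_predicted, example_actual)
```

Because the error is measured on a log scale, predicting 1 violation where there were 0 is penalized more heavily than predicting 2 where there were 3.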

We applied ridge regression, naive Bayes, nearest neighbors, SVM, generalized boosted trees, and random forest, among others. We used 90% of the data for training and 10% for validation. For testing, we used the test data provided by DrivenData and submitted our results on its website.
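The 90/10 split and one of the models listed above can be sketched as follows, assuming scikit-learn (the library actually used is not stated here). The review texts, violation counts, `alpha` value, and variable names are all made up for illustration. Predictions are clipped at zero because RMSLE is undefined for values below -1.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Made-up review text, one string per inspection (illustration only).
review_texts = [
    "dirty tables and a strange smell near the kitchen",
    "fresh food, spotless counters, friendly staff",
    "found a hair in my soup and the floor was sticky",
    "great burgers and very clean restrooms",
    "saw a mouse run across the dining room",
    "lovely patio and quick service",
    "undercooked chicken made me sick",
    "the best pizza in town, always fresh",
    "flies around the buffet and lukewarm food",
    "clean, bright and welcoming cafe",
]
# Made-up counts per inspection; columns are the *, ** and *** levels.
violation_counts = np.array([
    [4, 1, 1], [0, 0, 0], [5, 0, 2], [1, 0, 0], [6, 2, 3],
    [0, 0, 0], [3, 1, 2], [1, 0, 0], [4, 1, 1], [0, 0, 0],
])

# Hold out 10% of the inspections for validation.
text_train, text_val, y_train, y_val = train_test_split(
    review_texts, violation_counts, test_size=0.1, random_state=0
)

# Fit TF-IDF on the training text only, then reuse it for validation.
vectorizer = TfidfVectorizer(max_features=1000, stop_words="english")
X_train = vectorizer.fit_transform(text_train)
X_val = vectorizer.transform(text_val)

model = Ridge(alpha=1.0)  # alpha is an arbitrary illustrative value
model.fit(X_train, y_train)

# Clip negative predictions before taking logs.
val_pred = np.clip(model.predict(X_val), 0, None)
val_rmsle = float(np.sqrt(np.mean((np.log1p(val_pred) - np.log1p(y_val)) ** 2)))
```

Fitting the vectorizer on the training portion alone keeps validation vocabulary statistics out of the features, so the validation score is a fairer estimate.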

Key Results:
The graph shows the performance of different algorithms across different numbers of features. The best performance was achieved by the random forest model with 1000 features, which gave an RMSLE of 0.9992, compared with 1.1386 for the benchmark. We used 500 estimators and a maximum-features setting of 5 as the random forest parameters.
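A random forest with these settings might be configured as in the sketch below. It assumes scikit-learn's `RandomForestRegressor`, with "500 estimators" mapped to `n_estimators=500` and "5 maximum features" mapped to `max_features=5`. That mapping is an assumption, and the random matrix only stands in for a 1000-column TF-IDF matrix.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Random stand-in for a TF-IDF matrix: 60 inspections x 1000 word features.
tfidf_like = rng.random((60, 1000))
# Made-up counts of *, ** and *** violations per inspection.
violation_counts = rng.poisson(lam=[3.0, 0.5, 1.0], size=(60, 3))

# 500 trees; each split considers 5 randomly chosen features.
forest = RandomForestRegressor(n_estimators=500, max_features=5, random_state=0)
forest.fit(tfidf_like, violation_counts)

# One row per inspection, one column per violation level.
predicted = forest.predict(tfidf_like[:5])
```

Considering only 5 of the 1000 features at each split makes the individual trees less correlated, which is the usual motivation for a small maximum-features value.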

Read the full report here.

Keeping it Fresh: Predict restaurant inspections
