Predicting Future Sales using Historical Data
Quick note: Thank you for checking out my post. Today I’ll go over a Sales Forecasting analysis completed using Facebook Prophet. Two files were provided; sales records and store level data. Preprocessing, exploratory data analysis and prophet modeling yielded a 30 day sales projection with an average 7.8 mean absolute percent error. For a detailed breakdown of the analysis including code, a presentation and more, please visit my Github.
The contents of the analysis can be found on Github under user miguelangelsantana.
Obtain | Data
Following the OSEMN framework, our analysis began with two data files. The first consisted of 1,017,209 sales data records and the second consisted of 1,115 store level records. Both tables had a unique store identifier and as such would be joined post scrubbing.
Scrub | Data Processing
The data consisted of two main promotions. The first was a standard promotion held by the company, the second was a special extended promotion that only applied to certain stores. When it did, additional features were recorded such as duration and period of promotion two. During the scrubbing phase it was determined that null values across all promotion two categories would be treated as not participating in the promotion. As such, the respective features were filled. The remaining 3 total null values were filled with the average value per respective feature the null value corresponded to. A drop would have been just as effective with 3 value points over such a large dataset but I felt it best to keep the store records in.
Next, individual stores are identified as opened or closed by the dataset. To fairly assess sales performance, stores that were identified as closed were dropped from the dataset. With a date column converted using pandas datetime functionality, additional features were created for analysis (day, month, year).
Lastly, the two tables were merged using an inner join on the unique store identifier column.
Exploratory Data Analysis
Using a few group by methods sales trends and promotional participation were observed.
Average Sales By Month
Average Sales By Day of Month
Average Sales By Day of Week
Average Sales Performance | Top 10 Stores
Promotion 2 Participation | Top 10 Stores
Store 817 has the highest average sales. To our surprise, only 1 of the top 10 stores participates in promotion two. In addition, the top 10 stores are substantially closer to competitors when compared to the average distance of competition. Another important note is that the average distance of a competitor from a store is 5,457.84 while the average competitor distance to a top 10 store is 727.
Preparing for Facebook Prophet
It’s important to note that Facebook Prophet takes in data in a very specific format. The date column must be identified by the column name ‘ds’ and the target variable (in this case sales) must be identified by the target ‘y’. Once the school and state holidays were extracted and converted into a unique data frame, it was time to prepare the data.
The analysis would be focused on our top performer in order to gain insight into possible trends to help non performing stores. A convenience function was created to extract store 817’s data and convert it into a usable format for our model. The data was selected, reformatted, holidays were included and we let it run.
Validating the Model
Fortunately, Facebook Prophet has a built in cross validation function that we used in our analysis. The cross validation function requires an initial, period and horizon feature which allows for the model to go through and validate its projections against the actual target data points.
Our cross validation function used an initial period of 400 days, a folding period of 400 days and a horizon period of 45 days. The average root mean squared error over the 45 points was 2025.914. Great score when compared to the average target value of 21,757. The averaged mean absolute percent error is 7.8%.
Plotting RMSE against Horizon
Plotting MAPE against Horizon
While there aren’t many direct features that can be used to infer effects on our target variable with respect to business growth, it’s important to note that the top performing stores are located near competitors. In addition, almost all of the top stores did not participate in promotion two.
The model was able to predict sales for store 817 with an average percentage error of 7.8 percent. Our predictions show a decline in sales over the next 30 days but end short of the expected uptick in traffic that occurs over the holiday season (November and December).
The analysis is limited by the anonymity of the data. Per the source, the company is based in Europe and many external factors may influence sales including customer cultural preferences. Additionally, product knowledge is not available and features of the product itself may be influencing the analysis.
While our analysis is limited, its worth noting that the top store is near competitors and is not participating in the second-extended promotion that is made available to approximately half of the stores in our dataset. With limited domain knowledge, a recommendation would include removing promotion two from bottom performing stores so they may focus on advertising traditional company wide promotions. In addition, the existence of competition with respect to the customer and product sold should be considered when opening new stores.
Future work should include additional store level data such as employees in store, online sales, product information and geographic information to consider additional external factors through feature engineering.
For more information, the analysis can be found on Github under user miguelangelsantana.
Definitely a fun project to explore but ultimately a great experience with the simple and effective Facebook Prophet modeling algorithm. As always, thank you for stopping by and happy analyzing!