Day Three in Machine Learning: web scraping and multivariate future predictions with Prophet.

This article is about machine learning prediction using data collected from a website and from Google Trends. It is a toy example that assumes a correlation between Google searches for the keyword “coffee” and the historical price of coffee. The example shows how to combine different data sources in a library called Prophet to make multivariate future predictions (Figure 1 shows the final output of the learning exercise). I used Selenium to get the price data from the website and pytrends to access the information from Google Trends. The full code implementation can be found here on my GitHub account.

Figure 1. Prediction of the coffee price up to 2022.

Selenium is a browser automation tool with Python bindings that makes web scraping relatively simple by performing in code the same actions a user would make while visiting the website. With a few lines of code, I got the historical price of coffee from the website, stored it in a table, and plotted it by date (pink line in Figure 2). Getting the data from Google Trends is also simple, as the pytrends library is well documented (yellow line in Figure 3). After getting the data, I needed to combine the two datasets. For most dates, however, one table contained a value while the other was empty. Prophet understands data indexed by date and does most of the preprocessing for you, but it does not like empty values. The solution was to fill each missing value with the average of the previous and next available values (command 17 of the code).

Figure 2. The data.
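To make the combine-and-fill step more concrete, here is a minimal sketch in plain Python. It is not the notebook's command 17: the function names `merge_by_date` and `fill_gaps_with_neighbor_mean` and the sample numbers are my own illustrative assumptions. It joins the two sources on the union of their dates, then replaces each gap with the mean of the nearest known value before it and the nearest known value after it.

```python
from datetime import date


def merge_by_date(price_by_date, trend_by_date):
    """Outer-join two {date: value} mappings into (date, price, trend) rows.

    A value that is missing from one source shows up as None.
    """
    all_dates = sorted(set(price_by_date) | set(trend_by_date))
    return [(d, price_by_date.get(d), trend_by_date.get(d)) for d in all_dates]


def fill_gaps_with_neighbor_mean(values):
    """Replace each None with the mean of the previous and next known values.

    Gaps at the very start or end have only one neighbor and are left as None.
    """
    n = len(values)
    before = [None] * n  # nearest known value at or before each position
    after = [None] * n   # nearest known value at or after each position

    last_seen = None
    for i in range(n):
        if values[i] is not None:
            last_seen = values[i]
        before[i] = last_seen

    next_seen = None
    for i in reversed(range(n)):
        if values[i] is not None:
            next_seen = values[i]
        after[i] = next_seen

    filled = []
    for value, prev, nxt in zip(values, before, after):
        if value is None and prev is not None and nxt is not None:
            value = (prev + nxt) / 2
        filled.append(value)
    return filled


# Hypothetical sample values, only to show the shape of the data.
prices = {date(2020, 1, 6): 120.0, date(2020, 1, 8): 124.0}
trend = {date(2020, 1, 5): 80, date(2020, 1, 12): 90}

rows = merge_by_date(prices, trend)
price_column = fill_gaps_with_neighbor_mean([price for _, price, _ in rows])
trend_column = fill_gaps_with_neighbor_mean([t for _, _, t in rows])
```

If the data already lives in a pandas Series `s` with NaN for the gaps, `(s.ffill() + s.bfill()) / 2` should give the same result in one line. Gaps at the edges still need a separate decision, for example dropping those rows.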

The idea was to use both trend lines (Figure 2) to predict the future price of coffee. To validate the accuracy of the model, I used the data up to the end of 2019 for training and the 2020 data for testing. The vertical dotted line in Figure 3 marks the separation between the training and testing sets. The graph also shows a dark pink line that represents the model's prediction. As we might have imagined, Google searches for the keyword “coffee” might not be a feature that effectively contributes to the actual price of coffee. Nevertheless, it is interesting to see how the Prophet library works with more than one input feature (multivariate).

Figure 3. Training and validation split.
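The split and the multivariate fit could look roughly like the sketch below. It assumes each row is a dict with Prophet's expected `ds` (date) and `y` (price) keys plus the search interest under a column I have named `coffee_trend`. The helper names are hypothetical too. Prophet and pandas are imported inside the function so the split helper can be used on its own. Recent releases are imported as `prophet`; older ones used `fbprophet`.

```python
from datetime import date


def split_by_cutoff(rows, cutoff):
    """Split rows into training (ds <= cutoff) and testing (ds > cutoff) sets."""
    train_rows = [row for row in rows if row["ds"] <= cutoff]
    test_rows = [row for row in rows if row["ds"] > cutoff]
    return train_rows, test_rows


def fit_and_predict(train_rows, test_rows, regressor="coffee_trend"):
    """Fit Prophet on the training rows with one extra regressor,
    then predict the price for the dates in the testing rows."""
    import pandas as pd
    from prophet import Prophet

    model = Prophet()
    # Must be called before fit(); the column has to exist in both the
    # training frame and the frame passed to predict(), with no gaps.
    model.add_regressor(regressor)
    model.fit(pd.DataFrame(train_rows)[["ds", "y", regressor]])

    future = pd.DataFrame(test_rows)[["ds", regressor]]
    # The result includes yhat, yhat_lower and yhat_upper for each date.
    return model.predict(future)


# Train on everything up to the end of 2019, test on 2020:
# train_rows, test_rows = split_by_cutoff(rows, date(2019, 12, 31))
# forecast = fit_and_predict(train_rows, test_rows)
```

Comparing `forecast["yhat"]` with the held-out 2020 prices then gives a rough idea of how much the extra feature helps.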

Figure 4 shows how Prophet would predict the price of coffee for 2020 and 2021 with the data provided. As we only had information up to the end of 2020, the additive contribution of the extra regressor, the feature “coffee trend on Google search”, goes flat for 2021. One consequence of this is the model's lower confidence in its predictions for 2021 (the blue region in the upper graph of Figure 4 shows the confidence range). One solution would be to split the problem into two steps. First, we would predict the Google Trends values for coffee in 2021; then we would use the predicted values to extend the data through 2021 and predict the price of coffee with the main model.

Figure 4. Component graphs of Prophet.
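The two-step idea was not implemented in this exercise, so the following is only a sketch under assumptions. The function names, the `coffee_trend` column name, and the weekly frequency are all hypothetical. Step one fits a separate univariate Prophet model to the search-interest series and forecasts it forward. Step two merges observed and predicted values into the future rows that the main price model needs.

```python
from datetime import date


def forecast_trend(observed, periods, freq="W"):
    """Step 1: forecast the search-interest series with its own Prophet model.

    observed maps a date to a value; returns {date: predicted value} for the
    `periods` steps after the last observed date.
    """
    import pandas as pd
    from prophet import Prophet

    frame = pd.DataFrame(
        {"ds": pd.to_datetime(list(observed)), "y": list(observed.values())}
    )
    model = Prophet()
    model.fit(frame)
    future = model.make_future_dataframe(
        periods=periods, freq=freq, include_history=False
    )
    forecast = model.predict(future)
    return {ts.date(): yhat for ts, yhat in zip(forecast["ds"], forecast["yhat"])}


def build_future_regressor(observed, predicted, name="coffee_trend"):
    """Step 2 input: one row per date, using the observed value when it
    exists and the predicted value otherwise."""
    all_dates = sorted(set(observed) | set(predicted))
    return [{"ds": d, name: observed.get(d, predicted.get(d))} for d in all_dates]


# Sketch of how the pieces would fit together:
# predicted = forecast_trend(observed_trend, periods=52)
# future_rows = build_future_regressor(observed_trend, predicted)
# The price model, fitted with add_regressor("coffee_trend"), would then be
# called as price_model.predict(pandas.DataFrame(future_rows)).
```

One caveat with this design is that the price forecast inherits the error of the trend forecast, so the reported confidence range for 2021 would likely understate the real uncertainty.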
