Few problems are truly global, and none has involved as many people as our environment. Though parts of the research community remain doubtful about mankind's actual influence (or at least its magnitude) on greenhouse gases and global warming, no one disputes that prolonged and localized exposure to the particles and gases emitted by motorized vehicles has a negative impact on one's health. The consequences range from respiratory difficulties to a shortened life span. It goes without saying that the economic burden on society, in the form of health-care expenditures, is steadily growing. The World Bank estimated that the loss of income directly attributable to air pollution reached 225 billion USD in 2013 alone. Cities around the world are plagued by intense traffic that, under certain meteorological conditions, makes the air unbreathable. Beijing may be the poster child of how bad things can get, but even smaller cities such as Stockholm are not exempt from air pollution and poor air quality. In fact, at peak hours, the air quality on some streets of the Swedish capital reaches levels that are hazardous to health. Public awareness is growing, and the public's questions and concerns need to be addressed.
Many organizations around the world have recognized the consequences of long-term exposure to carbon dioxide, ozone, and particles of different sizes, categorized into two groups, PM2.5 and PM10 (particles with a diameter of less than 2.5 and 10 micrometers, respectively). Though most cities provide hourly measurements of these gas and particle levels, and in some cases even regional predictions, few (if any) offer localized predictions. One reason is that predictions of this sort are notoriously hard to make, and the methods involved require both theoretical and advanced technical skills beyond the simple task of gathering data. They also require sufficient data on the net number of motorized vehicles within the area of interest.
Sopra Steria Analytics Sweden is based in Stockholm, and even though the particle emissions in this Scandinavian city cannot be compared to those of Shanghai or many major European cities, air quality is a recurring topic in the media and, as mentioned above, the public demands that this growing problem be dealt with. It has been estimated that around 300 individuals in Stockholm die prematurely each year due to pollution. One can argue that there may be confounders, but it is undeniable that gases and particles contribute to these deaths. We chose to put our knowledge and ingenuity to the test and decided to develop a model that would enable us to make predictions for three heavily trafficked streets in the center of the city. If we manage to make predictions at least 24 hours ahead, the general population would be able to avoid a particular street at a particular time of day.
Our first task was to determine what, apart from motorized traffic, influences the levels of particles in the air. A little research led us to the realization that weather phenomena such as temperature, humidity, precipitation and wind are strongly connected to particle levels. As in all predictive modelling, history is a very good predictor of the future, so we needed to gather as much data as possible on past events from the Swedish Meteorological and Hydrological Institute (SMHI), but also from SLB-analys (Stockholm Air and Noise Analysis). SLB-analys gathers hourly data on levels of NOx and ozone gases as well as PM10 and PM2.5 particles, and has been doing so since January 1, 2015. We shall come to the procedures we used to gather this data a little later, as we encountered some problems that deserve consideration. As we mentioned, traffic is the number one source of pollution in this context, so we obviously needed to quantify the net flow of vehicles in a confined area (the city). Luckily, though perhaps not for drivers, every single vehicle entering the city has to pay a so-called congestion fee. There is a total of 78 stations through which vehicles have to pass to enter or exit the center of Stockholm, and there are no other access ways.
Transportstyrelsen (the Swedish Transport Agency) photographs the registration number of each car, whether it enters or exits the city, between 6 AM and 6 PM, and has gathered this data since January 1, 2015. Transportstyrelsen has even made this data available to the general public through an Open Data API. We downloaded it using Python code and stored it in a SQL database. We share this code below:
import timeit
import pypyodbc
import json
import http.client, urllib.request, urllib.parse, urllib.error, base64

start = timeit.default_timer()

########### Python 3.2 #############
headers = {
    'Ocp-Apim-Subscription-Key': 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX',
}

sqlConnection = pypyodbc.connect('Driver={SQL Server};'
                                 'Server=YYYY;'
                                 'Database=Weather;')
cursor = sqlConnection.cursor()

sqlConnection2 = pypyodbc.connect('Driver={SQL Server};'
                                  'Server=YYYY;'
                                  'Database=QQQQ;')
cursor2 = sqlConnection2.cursor()

# Find the weekday daytime slots for which no traffic data has been stored yet
SQLCommand = '''SELECT dt.[date], [startTime], [stopTime]
FROM [Weather].[dbo].[DateTime] dt
left join (select distinct date, time from Weather.dbo.TrafficData) as td
    on td.date = dt.date and td.time between dt.startTime and dt.stopTime
where td.date is null
    --and dt.date in('2015-01-01','2017-02-15')
    and cast(startTime as time) between '06:00:00' and '19:00:00'
    and DATEPART(dw,dt.date) not in (1,7)
order by date, startTime'''

cursor.execute(SQLCommand)
results = cursor.fetchone()

while results:
    params = urllib.parse.urlencode({
        'Passagedatum': results[0],
        'PassageTidStart': results[1],
        'PassageTidStopp': results[2],
        'Betalstation': '',
        'Lan': '',
        'Kommun': '',
        'Fordonstyp': '',
        'Skatteobjekt': 'STH',
    })
    try:
        # Query the Open Data API for all passages in the current time slot
        apiConnection = http.client.HTTPSConnection('tsopendata.azure-api.net')
        apiConnection.request("GET", "/Passager/v0.1/ByBetalstation?%s" % params, "{body}", headers)
        response = apiConnection.getresponse()
        data = response.read()
        listObject = json.loads(data)
        if len(listObject) > 0:
            # Build one INSERT statement for all passages in this batch
            s = 'INSERT INTO TrafficData '
            for x in listObject:
                s = s + 'select \'' + x["Betalstation"] + '\', \'' + x["Fordonstyp"] + '\', \'' \
                      + x["Kommun"] + '\', \'' + x["Lan"] + '\', \'' + x["PassageDatum"][0:10] + '\', \'' \
                      + x["PassageTid"] + '\', \'' + x["Postnummer"] + '\', \'' + x["Riktning"] + '\', \'' \
                      + x["Skatteobjekt"] + '\', ' + str(x["Korfalt"]) + ' UNION ALL\n'
            s = s[:-10]  # drop the trailing 'UNION ALL'
            cursor2.execute(s)
            sqlConnection2.commit()
        apiConnection.close()
    except Exception as e:
        print("[Errno {0}] {1}".format(e.errno, e.strerror))
    results = cursor.fetchone()

sqlConnection2.close()
sqlConnection.close()
stop = timeit.default_timer()
Apart from being a lengthy process, downloading this data is an easy task. We mentioned earlier that we also chose to record particle-level data from SLB-analys. At first, all the data was available as CSV files, but several weeks before we were due to present our solution we discovered that the organization had removed the option of downloading these files. What are your options when all the information you can see is what is displayed in graphs on a website? How do you retrieve that data? Luckily, R offers packages such as rvest that enable one to scrape the content of web pages. So the plan became to schedule hourly scraping of the measurements by determining the exact position of the tables used to build the graphs (http://slb.nu/slbanalys/luften-idag/). The advantage of the rvest package is that a few simple lines of code suffice to retrieve all the information needed:
library(rvest)
library(RODBC)

options(max.print = 10000)

# Read the SLB-analys page and keep the script blocks that hold the chart data
thepage = read_html("http://slb.nu/slbanalys/luften-idag/", encoding = "UTF-8")
a = thepage %>% html_nodes("script") %>% html_text()
a = gsub("\\\t", "", a)
a = gsub("\\\r\\\n", "", a)

startstr = "data = google.visualization.arrayToDataTable"
endstr = "]]"
keepers = grep(pattern = startstr, x = a)
test = a[keepers]

# Turn one of the Google Charts data arrays into a data frame
stringToDataframe = function(str){
  start = gregexpr(pattern = startstr, text = str)[[1]][1] + nchar(startstr) + 3
  end = gregexpr(pattern = endstr, text = str)[[1]][1] - 2
  str = substr(str, start = start, stop = end)
  aaa = as.list(strsplit(str, "\\],\\[")[[1]])
  for(i in 1:length(aaa)){
    aaa[[i]] = unlist(as.list(strsplit(aaa[[i]], ",")[[1]]))
  }
  m = matrix(unlist(aaa), ncol = length(aaa[[1]]), byrow = T)
  theNames = m[1, ]
  theNames = gsub(pattern = "'", replacement = "", x = theNames)
  colnames(m) = theNames
  m = m[-1, ]
  df <- data.frame(m)
  df[, 1] = gsub(pattern = "'", replacement = "", x = df[, 1])
  df = as.data.frame(apply(df, 2, function(x) gsub('\\s+', '', x)))
  df[] <- lapply(df, as.character)
  # The table spans yesterday and today; split it at midnight
  startToday = which(df[, 1] == "00:00")
  EndToday = dim(df)[1]
  dates = c(rep(Sys.Date() - 1, startToday), rep(Sys.Date(), EndToday - startToday))
  df$datum = dates
  return(df)
}

pm10 = stringToDataframe(test[1])
nono = stringToDataframe(test[3])
ozon = stringToDataframe(test[6])

outpm10 = data.frame(Year = as.numeric(substr(pm10$datum, 1, 4)),
                     Month = as.numeric(substr(pm10$datum, 6, 7)),
                     DagNumeric = as.numeric(substr(pm10$datum, 9, 10)),
                     Timme = as.numeric(substr(pm10$Tid, 1, 2)),
                     PM10 = as.numeric(pm10$Hornsgatan))
outnono = data.frame(Year = as.numeric(substr(nono$datum, 1, 4)),
                     Month = as.numeric(substr(nono$datum, 6, 7)),
                     DagNumeric = as.numeric(substr(nono$datum, 9, 10)),
                     Timme = as.numeric(substr(nono$Tid, 1, 2)),
                     NONO2 = as.numeric(nono$Hornsgatan))
outozon = data.frame(Year = as.numeric(substr(ozon$datum, 1, 4)),
                     Month = as.numeric(substr(ozon$datum, 6, 7)),
                     DagNumeric = as.numeric(substr(ozon$datum, 9, 10)),
                     Timme = as.numeric(substr(ozon$Tid, 1, 2)),
                     Ozon = as.numeric(ozon$Hornsgatan))
The next step is simply to store this information in your SQL database using the RODBC package in R.
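For completeness, here is a minimal sketch of what that storage step could look like. The connection string and the table names are assumptions to adapt to your own setup; only the sqlSave mechanism itself is the RODBC way of appending a data frame to a table.

library(RODBC)

# Hypothetical connection string and table names; replace with your own
channel = odbcDriverConnect("Driver={SQL Server};Server=YYYY;Database=Weather;")
sqlSave(channel, outpm10, tablename = "ParticleData_PM10", append = TRUE, rownames = FALSE)
sqlSave(channel, outnono, tablename = "ParticleData_NOx",  append = TRUE, rownames = FALSE)
sqlSave(channel, outozon, tablename = "ParticleData_Ozon", append = TRUE, rownames = FALSE)
odbcClose(channel)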
Apart from demanding flexibility and ingenuity, retrieving the relevant data is not the really fun part of the process. Choosing an adequate model depends mostly on the type of data you have to deal with and the output you wish to produce. The transport data (passages through toll stations) would normally suggest time series methods, which are easily dealt with in R, but it contains a lot of missing values, and even though it exhibits some seasonal variation it also contains a lot of noise, which is not ideal for predicting on an hourly basis. Furthermore, we discovered that our meteorological data was also missing values at various moments, and the same was true of the historical particle measurements due to sensor malfunctions. These facts quickly eliminate many predictive algorithms.
Given all the above, it seemed reasonable to use a random forest predictive model. A random decision forest is an ensemble learning method for classification, regression and a wide range of other tasks. It operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees. Random decision forests correct for decision trees' habit of overfitting to their training set. For a thorough mathematical description, we invite the reader to download Analysis of a Random Forests Model by Gérard Biau. What we chose to do is to create a model for each hour to predict, that is, to specialize 24 random forest models, each predicting one hour of the coming day. The randomForest package is both efficient and easy to use, provided the data has been properly prepared. Given that every single hour of the coming day has to be predicted, it does not suffice to feed the model the data as it is. The data needs to expose the fact that time is flowing and that all the inputs change over time. We chose to slide the data forward and backward 24 hours using the DataCombine package:
library(DataCombine)

for(i in 1:24){
  # Create lagged columns ozon1h, ..., ozon24h by shifting the ozon variable back i hours
  Dataset = slide(Dataset, Var = "ozon",
                  NewVar = paste0("ozon", i, "h"),
                  TimeVar = "Timme", slideBy = -i)
}
As with many statistical methods, there are no clear rules on how to proceed with such a model. How large should your training set be? How many trees should you grow, and how many variables should you consider when training your model? Our recommendation is to test your way forward and determine what gives the best result in relation to the time the task takes. Often one observes that the marginal benefit of an additional variable cannot justify the extra training time.
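One simple way to explore these choices (a sketch, not our exact procedure) is to let the randomForest package's tuneRF search for a reasonable mtry and to compare out-of-bag errors for a few forest sizes. The snippet below assumes a prepared data frame d with a TARGET column, as in the training loop that follows.

library(randomForest)

# Assumed: a complete (no missing values) data frame `d` with the response in TARGET
x = d[, setdiff(names(d), "TARGET")]
y = d$TARGET

# Double mtry from its starting value until the OOB error stops improving by at least 1 %
tuneRF(x, y, mtryStart = 5, ntreeTry = 300, stepFactor = 2, improve = 0.01)

# Compare out-of-bag error for a few forest sizes
for (nt in c(300, 600, 900)) {
  fit = randomForest(TARGET ~ ., data = d, ntree = nt)
  cat("ntree =", nt, " OOB MSE =", tail(fit$mse, 1), "\n")
}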
modelList = list()

for(i in 1:24){
  # Shift pm10 forward i hours (within each year) to create the target for the i-hour-ahead model
  d = slide(dat, Var = "pm10", NewVar = "TARGET", GroupVar = "Year",
            TimeVar = "Timme", slideBy = i)
  # Drop the last rows, which have no target value after the shift
  d = d[-seq(dim(d)[1] - i, dim(d)[1]), ]

  # 90/10 split into training and test sets
  smp_size = floor(0.90 * nrow(d))
  train_ind = sample(seq_len(nrow(d)), size = smp_size)
  train = d[train_ind, ]
  test = d[-train_ind, ]

  fit = randomForest(TARGET ~ ., data = train, ntree = 900, mtry = 10)

  # Evaluate on the held-out set
  true.Target = test$TARGET
  testInput = test
  testInput$TARGET = NULL
  modelOutput = predict(object = fit, newdata = testInput)

  modelList[[i]] = fit
}

save(modelList, file = "F:/Partikelkollen/SCRIPT/ModelList.RData")
In some cases, parallelizing your script can be an efficient way to reduce the time needed to build the models. R allows you to do so using the foreach package:
library(doParallel)            # provides a parallel backend for %dopar%
registerDoParallel(cores = 4)  # example core count, adjust to your machine

# Grow 20 forests of 1000 trees each in parallel and combine them into one
fit = foreach(ntree = rep(1000, 20), .combine = combine, .multicombine = TRUE,
              .packages = 'randomForest') %dopar% {
  randomForest(TARGET ~ ., train, ntree = ntree)
}
Once the models have been built, they can be used to predict future events. As we wish to make these predictions every hour of the day, the models need to be applied to new data every hour. However, provided no special events occur, the models need not be retrained every hour. In fact, if the new incoming data on particle levels, weather and traffic follows the same patterns as the historical data, it may be enough to update the models once a month; they will still give accurate predictions. A good way to determine whether the models are still a good fit is to keep track of model performance, for instance by monitoring the mean error of the predictions.
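As an illustration of that kind of monitoring (a sketch under assumptions, not our production script), suppose the hourly predictions and the later observed values are collected in a hypothetical data frame history with columns predicted and actual; the mean and mean absolute errors are then one line each, and a threshold can trigger retraining.

# Hypothetical data frame `history` with one row per predicted hour
meanError    = mean(history$predicted - history$actual, na.rm = TRUE)       # bias
meanAbsError = mean(abs(history$predicted - history$actual), na.rm = TRUE)  # average miss

# If the absolute error drifts above an agreed threshold, retrain the 24 models
if (meanAbsError > 10) {  # threshold in micrograms per cubic meter, chosen only for illustration
  retrainModels()         # hypothetical function wrapping the training loop shown above
}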
Our main goal was to give the general public the opportunity to decide whether it is safe to spend time at specific locations in Stockholm. It was therefore important to design a way of presenting the results in an easily understandable manner. Delivering the results of our predictions in standard tables was out of the question, as people generally need more graphic alternatives. Though R has over the years gained packages adding many features (e.g. ggplot2), R graphics are notoriously unappealing to non-statisticians and laymen. Many tools are available on the market, but they tend to work badly with constant updates and demand that the data be fully prepared. There is, however, one tool that is easy to use, allows user interaction and is aesthetically appealing, namely Power BI. Its direct connection to the SQL database, the ability to schedule updates, the user interaction features and its many visualization packages (among which R scripting is one) made it the superior candidate.
As an example, we show a snapshot of the 24-hour PM10 prediction for one of the most polluted streets in Stockholm, Hornsgatan:
The graph shows the particle concentration (micrograms per cubic meter) for the coming 24-hour period, colored according to the limits set by environmental quality norms. The average daily values for two other streets, Norrlandsgatan and Sveavägen, are also given for comparison. As mentioned previously, it is important, even for users, to be able to see how reliable the model is. We have therefore chosen to also include a graph overlaying historical values with the predicted values, and a second graph visualizing the difference between the predicted and actual values. To make the visualization interactive, we added a time slider that lets the user choose the period over which the model and the actual values are compared.
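For readers who want to prototype a similar predicted-versus-actual comparison directly in R before moving to Power BI, a few lines of ggplot2 suffice; the sketch below assumes a hypothetical data frame comparison with hourly Time, Predicted and Actual columns.

library(ggplot2)

# Hypothetical data frame `comparison`: one row per hour with predicted and observed PM10
ggplot(comparison, aes(x = Time)) +
  geom_line(aes(y = Actual, colour = "Actual")) +
  geom_line(aes(y = Predicted, colour = "Predicted")) +
  labs(y = "PM10 (micrograms per cubic meter)", colour = "")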
Serge