The timetable says one time, the real-time system shows another and the reality something completely different. Many of us have at some point experienced the frustration of waiting for a bus to arrive at a given time, only to still be waiting 10 minutes after the given time. Can’t trust the time shown on the real-time system? How hard can it be?

Is there a pattern?
Is it possible to determine how long it will take a bus to arrive at a given point (a bus stop), given that we know where it is at a given point in time? GPS units on buses that report their position in real time make it possible to use machine learning to identify the patterns in bus traffic. We are lucky here: we have access to GPS data for most buses in traffic, at least for local bus operators. But before we throw ourselves into the world of machine learning, we should reason a little about the difficulties of the task. Before using real-world data, we will simulate a bus trip to see whether we can anticipate the challenges involved.
Simulate a bus ride
What are the difficulties in calculating the arrival time of a bus? What pattern does a bus in regular service follow? To understand this, we need to know which rules govern the bus’s journey. An obvious example of such a rule is the speed. What controls the speed at which the bus travels? Speed limits, queues and passengers obviously affect how long the journey takes. But there are more factors at play, and it is difficult to predict anything that depends on too many factors beyond one’s control.

Create data for a bus trip
We have generated an imaginary bus line that runs between Nynäshamn and Västerhaninge, south of Stockholm. The line follows both smaller and larger roads and has stops at varying distances from each other. One issue that immediately arose was which speed limit should apply on the roads. We decided to let a single speed apply over the entire route. The choice of speed does not significantly affect our machine learning model; patterns in traffic are shaped by other factors, such as surrounding traffic and passenger load. What is it that controls the pattern? Human behavior? Random events? Below are the rules for the bus’s behavior that we chose to stick to:
- Same speed throughout the entire distance.
- A timetable is created for the bus to follow.
- The timetable is generated at a slightly lower speed, which gives the bus slack to keep to it.
- If the bus is early, it waits for the timetabled departure. This applies at every stop.
- A GPS point is generated every second.
- The GPS latitude and longitude are disturbed to simulate the varying accuracy of a GPS unit.
- Both high- and low-resolution data are generated for comparison.
The bus moves slower when the timetable is generated. A normally distributed speed deviation is then added to simulate real conditions during the ride. We use the normal distribution because it is the one that most closely resembles human behavior. In general, we have used this distribution for all events that result from human actions.
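The two kinds of noise described above, a normally distributed speed deviation and a disturbed GPS position, can be sketched as follows. This is a minimal illustration, not the simulator we actually used: the function names, the noise levels (2 m and 15 m) and the relative speed deviation are assumptions chosen for the example.

```python
import math
import random

METERS_PER_DEGREE_LAT = 111_320.0  # rough average, good enough for a sketch


def noisy_speed(base_speed_mps, rel_sigma, rng):
    """Draw a speed from a normal distribution centred on the base speed."""
    speed = rng.gauss(base_speed_mps, rel_sigma * base_speed_mps)
    return max(speed, 0.0)  # the simulated bus never drives backwards


def jitter_position(lat, lon, sigma_m, rng):
    """Disturb a position with Gaussian noise of sigma_m meters per axis."""
    dlat = rng.gauss(0.0, sigma_m) / METERS_PER_DEGREE_LAT
    # One degree of longitude shrinks with the cosine of the latitude.
    dlon = rng.gauss(0.0, sigma_m) / (
        METERS_PER_DEGREE_LAT * math.cos(math.radians(lat))
    )
    return lat + dlat, lon + dlon


rng = random.Random(42)
# Roughly the position of Nynäshamn; the noise levels are assumed values.
high_res = [jitter_position(58.903, 17.948, 2.0, rng) for _ in range(5)]
low_res = [jitter_position(58.903, 17.948, 15.0, rng) for _ in range(5)]
```

Generating the same stationary position with two noise levels gives the "high" and "low" resolution variants for comparison.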
Then we come to the random external events that affect the bus.
- Delay while driving (queues between stops and hold-ups at single locations).
  - The likelihood of queuing the entire distance between two stops is set to 2%.
  - The queue speed is set to 30% of normal speed.
- Delay due to an obstacle (a stationary car that must be overtaken and the like).
  - Probability <1% per measurement point (1 per second).
  - The speed at the measurement point is set to 50% of normal speed.
- Delay at a bus stop (many passengers getting on and off, etc.).
  - The probability of a delay at a stop is 5%.
  - 30 seconds are added to the time the bus spends at the stop.
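The random events listed above can be sketched as a one-second simulation loop. This is only an illustration under assumptions: the percentages for speed are read as fractions of normal speed, the obstacle probability is set to 0.5% because the text only says "less than 1%", and the function and parameter names are hypothetical.

```python
import random

QUEUE_SPEED_FACTOR = 0.30     # speed in a queue, as a fraction of normal speed
OBSTACLE_SPEED_FACTOR = 0.50  # speed while passing an obstacle
STOP_DELAY_S = 30             # extra dwell time when a stop is delayed


def drive_segment(length_m, base_speed_mps, rng, p_queue=0.02, p_obstacle=0.005):
    """Return the seconds needed to drive one segment between two stops."""
    queue = rng.random() < p_queue  # decided once for the whole segment
    travelled, seconds = 0.0, 0
    while travelled < length_m:
        speed = base_speed_mps
        if queue:
            speed *= QUEUE_SPEED_FACTOR
        if rng.random() < p_obstacle:  # decided for every one-second step
            speed *= OBSTACLE_SPEED_FACTOR
        travelled += speed  # distance covered during this second
        seconds += 1
    return seconds


def dwell_time(base_dwell_s, rng, p_delay=0.05):
    """Return the time spent at a stop, with an occasional 30-second delay."""
    extra = STOP_DELAY_S if rng.random() < p_delay else 0
    return base_dwell_s + extra


rng = random.Random(7)
trip_seconds = sum(drive_segment(1500.0, 12.0, rng) + dwell_time(20, rng)
                   for _ in range(10))
```

Passing the probabilities as parameters makes it easy to switch individual events off and compare runs with and without them.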

The picture above shows how the position of the bus, drawn as dots with a small deviation generated by the rules above, follows the line in a realistic way. The dots are more sparsely distributed just after the bus leaves the stop (upwards in the image). This is a side effect of the given rules: the bus always leaves the stop at even seconds. The stationary bus at the stop appears as a cluster of dots caused by the deliberately generated error of the GPS unit. The position “jumps” around by a few meters every second.
Training a model
The choice of machine learning model fell on Random Forest Regression. The training environment was Microsoft Machine Learning Studio, where the model is called Decision Forest Regression. After loading the data, the first thing we did was to calculate more features from the GPS data to help train the model, for example whether the bus is at a stop (geofencing) and whether the date is a holiday. We then used the Filter Based Feature Selection tool, which gives a value for the correlation between each of our features and the value sought: the time until the bus’s arrival at the stop. Below is a list of the results from Filter Based Feature Selection.
Feature | Weight | Info |
seconds_to_stopid_search | 1.000 | Seconds to the stop (for all remaining stops). The value sought. |
lat | 0.498 | Latitude |
time_from_start* | 0.497 | Seconds the bus has been driving since the first stop |
stopid* | 0.490 | Previous stop |
stopid_search* | 0.483 | Remaining stops |
speed | 0.126 | Speed |
bear | 0.096 | Bearing (compass direction) |
lon | 0.085 | Longitude |
at_stop* | 0.082 | At a stop? Boolean |
seconds_at_stop* | 0.058 | Seconds the bus has been at the stop |
arbetsdag* | 0.042 | Working day? Boolean |
minute | 0.040 | Minute part of the time |
specialdag* | 0.026 | Weekend? Boolean |
hour | 0.023 | Hour part of the time |
weekday* | 0.012 | Day of the week |
year | 0.010 | Year part of the date |
unix_time | 0.010 | Date and time in seconds since 1970-01-01 |
time | 0.010 | Date and time |
month | 0.009 | Month part of the date |
second | 0.000 | Second part of the time |
Note that no feature reaches a correlation above 0.5 and that the values quickly drop toward zero; some features have very little influence. Interestingly, latitude matters more than longitude. The reason is simply that the bus line extends mainly from south to north.
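A ranking of this kind can be approximated by scoring each feature with the absolute value of its correlation with the target. The sketch below assumes Pearson correlation as the scoring metric; we do not state here which metric the tool was configured with, and the toy values are made up purely for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear information
    return cov / math.sqrt(var_x * var_y)


def rank_features(columns, target_name):
    """Sort all columns by absolute correlation with the target column."""
    target = columns[target_name]
    scores = {name: abs(pearson(values, target))
              for name, values in columns.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Made-up toy rows: the bus drives north while the remaining time counts down.
columns = {
    "seconds_to_stopid_search": [300, 240, 180, 120, 60],
    "lat": [58.90, 58.92, 58.93, 58.96, 58.98],
    "second": [7, 7, 7, 7, 7],
}
ranking = rank_features(columns, "seconds_to_stopid_search")
```

As in the table, the target correlates perfectly with itself and therefore always comes first, which is a useful sanity check on the ranking.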
Results of the simulation
The following diagram shows the trained model’s calculation of the remaining time between stops on the line. The model calculates the remaining time to the next stop as soon as the bus stops at one.

The black line shows the actual time, the red line the calculated time, and the green line the difference between the two. The lines show how much time remains until the next stop: the x-axis shows the bus’s total driving time and the y-axis the time left to the next stop. When the bus arrives at a stop, the time on the y-axis is zero, and the countdown to the next stop then starts. Each drive between two stops forms the pattern of a right triangle; the larger the triangle, the longer the time between the stops.
At the beginning of each countdown the bus stands still while passengers get off and on. In the best of worlds, a GPS unit would send the same position every time until the bus starts rolling again. But because reception quality varies, the calculated position suffers and different positions are reported even though the bus is stationary. This error causes the model to register that the bus is on its way, and then that it returns in the opposite direction at the next position fix. In the picture, this is most clearly seen as jagged spikes at the beginning of each countdown, when the bus is stationary at the stop.
Every error in the GPS position gives rise to deviations in the model. We tested both high and low accuracy on the GPS unit. With low accuracy, the deviations in the picture above become larger and the model performs worse. The model is not affected in the same way if the refresh rate changes, however. The result is nearly as good even if the number of positions is reduced: an update rate of one position per ten seconds produces almost the same result as one per second.
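The countdown described above, which is the value the model is trained to predict, can be derived from the arrival times at the stops. The following is a minimal sketch; the function name and the convention that the countdown restarts at the moment of arrival are assumptions made for the example.

```python
import bisect


def seconds_to_next_stop(timestamps, arrival_times):
    """For each timestamp, return the seconds left until the next arrival.

    arrival_times must be sorted. Timestamps at or after the last arrival
    have no next stop and yield None.
    """
    remaining = []
    for t in timestamps:
        # Index of the first arrival strictly later than t.
        i = bisect.bisect_right(arrival_times, t)
        remaining.append(arrival_times[i] - t if i < len(arrival_times) else None)
    return remaining


# The bus starts at t=0 and reaches the following stops at t=120 and t=300.
arrivals = [0, 120, 300]
countdown = seconds_to_next_stop([0, 60, 119, 120, 299, 300], arrivals)
```

Plotted against driving time, these values form one right triangle per drive between two stops, with a taller triangle for the longer second leg.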
The result is satisfactory and shows that we are on the right path. The model follows our simulated driving reasonably well. Lessons learned from this test are as follows.
- Data quality has a great impact on the model’s precision.
- The accuracy of the GPS unit greatly affects the model.
- Sensor refresh rate has less impact than expected.
- A longer history is better than many points.
- Difficult to simulate reality in a realistic way.
- Difficult to model pure random events.
Real world data
As previously mentioned, we have had access to a bus operator’s traffic data (GPS) for one of its bus lines. Position information was reported in real time at one measurement point per second. The quality of the positions was very good, with a margin of error of a couple of meters. We have historical data with several departures per week over five months, more than 800 departures in total. As with the simulated bus trips, more features were calculated from the GPS information: time from start, last stop, and whether or not the bus is at a stop. We used the same model as before, in Microsoft Machine Learning Studio.
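Deriving features such as time from start, last stop and the at-stop flag from raw GPS points could look roughly like the sketch below. It is an illustration under assumptions, not our actual pipeline: the 25-meter geofence radius, the tuple layout and the function names are chosen for the example, while the feature names follow the table above.

```python
import math

EARTH_RADIUS_M = 6_371_000.0


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two positions."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def derive_features(points, stops, radius_m=25.0):
    """points: (unix_time, lat, lon) sorted by time; stops: (stopid, lat, lon)."""
    start = points[0][0]
    last_stop = None
    rows = []
    for t, lat, lon in points:
        at_stop = False
        for stopid, slat, slon in stops:
            # Geofence: the bus counts as "at the stop" inside the radius.
            if haversine_m(lat, lon, slat, slon) <= radius_m:
                at_stop, last_stop = True, stopid
                break
        rows.append({"time_from_start": t - start,
                     "stopid": last_stop,
                     "at_stop": at_stop})
    return rows


stops = [(1, 58.9000, 17.9500), (2, 58.9100, 17.9500)]
points = [(1000, 58.9000, 17.9500),
          (1010, 58.9050, 17.9500),
          (1020, 58.9100, 17.9500)]
rows = derive_features(points, stops)
```

The radius is a trade-off: too small and GPS jitter flips the flag while the bus is standing still, too large and the bus appears to be at the stop while still approaching it.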

We split the departures into training, validation and test datasets and started the training. What we then discovered was that there is a limit to how long a project in Microsoft Machine Learning Studio may run: after two days the project was shut down abruptly. We had to reduce the number of departures, the number of GPS updates per departure, or both. Because variation among the departures was important for a stable model, our first measure was to look at the number of GPS points needed for training. Initially, the frequency was one point per second. The experience from the simulation suggested that one point per ten seconds was fully sufficient for a reliable model, which some tests on the real data confirmed. However, we still had to reduce the number of departures included in the training so that the run would finish within the two-day limit. We eventually used a couple of hundred departures as training data and slightly fewer for validation and testing.
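The two reductions described above, thinning the GPS points to one per ten seconds and splitting whole departures between the datasets, can be sketched as follows. The function names and the split fractions are assumptions for the example; splitting by departure rather than by individual point keeps all points of one trip in the same dataset.

```python
import random


def downsample(points, interval_s=10):
    """Keep a point only if interval_s seconds have passed since the last kept one.

    points are tuples whose first element is a timestamp, sorted by time.
    """
    kept, next_time = [], None
    for point in points:
        t = point[0]
        if next_time is None or t >= next_time:
            kept.append(point)
            next_time = t + interval_s
    return kept


def split_departures(departure_ids, train=0.4, validation=0.3, seed=0):
    """Shuffle the departure ids and split them into three disjoint lists."""
    ids = sorted(set(departure_ids))
    random.Random(seed).shuffle(ids)
    n_train = int(len(ids) * train)
    n_val = int(len(ids) * validation)
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]


one_minute = [(t,) for t in range(60)]        # one point per second
thinned = downsample(one_minute)              # one point per ten seconds
train_ids, val_ids, test_ids = split_departures(range(100))
```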

The figure above shows, on the y-axis, the average deviation between the actual and the calculated remaining time to arrival, depending on how far away in time the bus stop in question is (x-axis). If the bus is half an hour (1800 seconds) away, the average deviation is just over 80 seconds.
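An average deviation per time horizon, as plotted in the figure, can be computed by grouping the predictions into bins of actual remaining time. This is a minimal sketch: the use of the mean absolute deviation, the 300-second bin width and the sample numbers are assumptions for illustration, not our measured results.

```python
from collections import defaultdict


def mean_abs_error_by_horizon(actual, predicted, bin_s=300):
    """Mean absolute deviation, grouped by the actual remaining time.

    Returns a dict that maps the start of each bin (in seconds) to the
    average of |predicted - actual| for the samples in that bin.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for a, p in zip(actual, predicted):
        bin_start = int(a // bin_s) * bin_s
        sums[bin_start] += abs(p - a)
        counts[bin_start] += 1
    return {b: sums[b] / counts[b] for b in sorted(sums)}


# Made-up sample values, in seconds.
actual = [100, 200, 1800, 1900]
predicted = [110, 180, 1880, 1820]
errors = mean_abs_error_by_horizon(actual, predicted)
```

With these sample values, the bin starting at 0 seconds averages 15 seconds of deviation and the bin starting at 1800 seconds averages 80 seconds.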

The figure above shows how this can look for a specific departure. In a section between 1275 and 1540 seconds of driving time from the start (x-axis), we see how the model (red line) calculates the remaining time (y-axis). The way the calculated line moves up and down relative to the real one (black) shows how jerkiness in traffic or errors from the GPS unit affect the result.
Experiences
As we discovered already during the simulation, the quality of the GPS information affects the model’s stability. A high frequency of measurement points, however, is less important during training. The same applies in operation: it is enough to update the position at a longer interval, which saves data traffic costs for the communication between the bus and the reporting center. The interval must still be short enough to detect, for example, that the bus is at a stop.
The result shows that it is entirely possible to calculate the remaining time until a bus arrives using machine learning. It is also possible to extend the use of real-time information from the bus to other analyses.
- Stress on vehicles: Together with engine information (CAN data), stressful stretches of road can be identified. It is possible to visualize on a map how the engine works on the lines, both on average and in real time.
- Customer satisfaction: Measurements on the bus, combined with the bus’s geographical location, make it possible to identify areas where efforts are required, or specific departures where customer satisfaction has been poor.
- Real-time visualization: At the stop: Show how far away the bus is (in kilometers) along with the estimated time left. In the app: Show the position of the bus on a map.
- Driver analysis: The driver’s driving style can be linked to engine information to identify the best driving style for minimal wear. Coupled with customer satisfaction, the perfect driving style can be analyzed. What is most important: arriving on time, or safe and secure driving?
- Passenger Count: Is the traffic network overloaded? Is there any area where we need to expand? In the app: Will I get a seat when the bus arrives?