At the stop, waiting for the bus

The timetable says one thing, the real-time system shows another, and reality turns out to be something completely different. Many of us have waited for a bus that was due at a given time, only to still be waiting 10 minutes later. Why can’t we trust the time shown on the real-time system? How hard can it be?

Figure 1: Is it possible to predict, with the help of machine learning, when the bus arrives?

Is there a pattern?

Is it possible to determine how long it will take before a bus arrives at a given point (a bus stop), given that we know where it is at a given point in time? GPS units on buses that report their position in real time make it possible to use machine learning to identify the patterns in bus traffic. We are lucky in this respect: we have access to GPS data for most buses in traffic, at least for local bus operators. But before we throw ourselves into the world of machine learning, we should reason a little about the difficulties of the task. Before using real-world data, we will simulate a bus trip to see whether we can anticipate the challenges involved.

Simulate a bus ride

What are the difficulties in calculating the arrival time of a bus? What pattern does a bus in regular service follow? To understand this, we need to know which rules govern the bus’s journey. An obvious example is speed. What controls the speed at which the bus travels? Speed limits, queues and passengers obviously affect how long the journey takes. But there are more factors at play, and it is difficult to predict anything that depends on many factors that one cannot control.

Figure 2: An imaginary bus route between Nynäshamn and Västerhaninge.


Create data for a bus trip

We generated an imaginary bus line between Nynäshamn and Västerhaninge, south of Stockholm. The line follows both smaller and larger roads, and its stops are spaced at varying distances. One issue arose immediately: which speed limit should apply on the roads? We decided to use a single speed for the entire route. The choice of speed does not significantly affect our machine learning model, since patterns in traffic are driven by other factors such as surrounding traffic and passenger load. What is it that controls the pattern? Human behavior? Random events? Below are the rules for the bus’s behavior that we chose to stick to:

  • Same speed throughout the entire distance.
  • A timetable is created for the bus to relate to.
    • The timetable is generated at a slightly lower speed, to give the bus room to keep time.
  • If the bus is early, it waits for the departure time in the timetable.
    • This applies at every stop.
  • A GPS point is generated every second.
  • The GPS latitude and longitude are disturbed to simulate the varying accuracy of a GPS unit.
    • Both high and low resolution are generated for comparison.

The bus moves slower when the timetable is generated. A normally distributed speed deviation is then added to simulate real conditions during the ride. We use the normal distribution because it is the distribution we consider closest to human behavior, and we have generally used it for all events that result from human actions.
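
The timetable and driving rules can be sketched roughly as follows. This is a minimal illustration, not the simulator used for this post: the function names, the 10% timetable slack, the stop distances and the speed values are all assumptions made for the example.

```python
import random

def build_timetable(stops_m, cruise_mps, slack=0.9):
    """Scheduled time (seconds from start) at each stop.

    The timetable is generated at a reduced speed (slack is an assumed
    factor) so that the simulated bus has room to keep time.
    """
    timetable_mps = cruise_mps * slack
    return [pos / timetable_mps for pos in stops_m]

def simulate_trip(stops_m, timetable_s, cruise_mps, speed_sd_mps, rng):
    """Return (arrival, departure) in seconds for every stop.

    The bus advances in one-second steps with a normally distributed
    speed and, when early, waits at the stop for the timetable time.
    """
    t = timetable_s[0]
    position = stops_m[0]
    log = [(t, t)]
    for next_stop, scheduled in zip(stops_m[1:], timetable_s[1:]):
        while position < next_stop:
            # Speed for this second: cruise speed plus a normal deviation.
            speed = max(0.0, rng.gauss(cruise_mps, speed_sd_mps))
            position = min(next_stop, position + speed)
            t += 1.0
        arrival = t
        t = max(arrival, scheduled)  # wait if the bus is early
        log.append((arrival, t))
    return log

# Hypothetical route: stop positions in meters along the line.
stops = [0.0, 1200.0, 3000.0, 4500.0]
timetable = build_timetable(stops, cruise_mps=14.0)
trip = simulate_trip(stops, timetable, 14.0, 1.5, random.Random(42))
for (arrival, departure), scheduled in zip(trip, timetable):
    print(f"scheduled {scheduled:7.1f}  arrived {arrival:7.1f}  left {departure:7.1f}")
```

Because the timetable is built at a lower speed than the bus actually drives, the simulated bus usually arrives early and departs exactly on schedule, unless something delays it.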

Then we come to the random external events that affect the bus.

  • Delay while driving (queues between stops and halts at single locations).
    • The probability of a queue along the entire distance between two stops is set to 2%.
    • The queue speed is set to 30% of the normal speed.
  • Delay due to an obstacle (a stationary car that must be overtaken, and the like).
    • Probability below 1% per measurement point (one per second).
    • The speed at that measurement point is set to 50%.
  • Delay at a bus stop (many passengers getting on or off, etc.).
    • The probability of a delay at a stop is 5%.
    • 30 seconds are added to the time the bus spends at the stop.
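
The random events listed above could be expressed along these lines. This is a sketch under assumptions: the function names are hypothetical, and the 0.5% obstacle probability is one assumed value that satisfies "below 1%". The other constants are taken from the list.

```python
import random

QUEUE_PROB_PER_SEGMENT = 0.02     # queue along a whole segment between two stops
QUEUE_SPEED_FACTOR = 0.30         # queue speed as a share of normal speed
OBSTACLE_PROB_PER_SECOND = 0.005  # assumed value; the rule only says "below 1%"
OBSTACLE_SPEED_FACTOR = 0.50      # speed at the affected measurement point
STOP_DELAY_PROB = 0.05            # chance of an extra delay at a stop
STOP_DELAY_SECONDS = 30           # extra dwell time when delayed

def segment_travel_time(length_m, speed_mps, rng):
    """Seconds needed to drive one segment, with random queues and obstacles."""
    queued = rng.random() < QUEUE_PROB_PER_SEGMENT
    base_speed = speed_mps * (QUEUE_SPEED_FACTOR if queued else 1.0)
    travelled, seconds = 0.0, 0
    while travelled < length_m:
        # Each one-second measurement point may hit an obstacle.
        hit = rng.random() < OBSTACLE_PROB_PER_SECOND
        travelled += base_speed * (OBSTACLE_SPEED_FACTOR if hit else 1.0)
        seconds += 1
    return seconds

def extra_dwell_seconds(rng):
    """Extra time at a stop, for example when many passengers board."""
    return STOP_DELAY_SECONDS if rng.random() < STOP_DELAY_PROB else 0

rng = random.Random(0)
print(segment_travel_time(1500.0, 15.0, rng), extra_dwell_seconds(rng))
```
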
Figure 3: The points in the picture show the position of the bus’s journey along the red line. The different colors relate to different departures. All departures go in one direction, from the bottom up.


The picture above shows how the position of the bus, drawn as dots with a small deviation generated by the rules above, follows the line in a realistic way. The dots are more sparsely distributed just after the bus leaves a stop (upwards in the image). This is a side effect of the rules: the bus always leaves the stop at even seconds. The stationary bus at a stop appears as a cluster of dots caused by the deliberately generated GPS error: the position “jumps” around by a few meters every second.
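
The jumping position of a stationary bus can be reproduced with a small sketch like the one below. It assumes independent normal errors in the north and east directions and a flat-earth conversion from meters to degrees. The 3 meter standard deviation and the coordinates are illustrative assumptions, not values from the simulation.

```python
import math
import random

METERS_PER_DEGREE_LAT = 111_320.0  # approximate length of one degree of latitude

def jitter_position(lat, lon, sd_m, rng):
    """Add normally distributed GPS noise (standard deviation in meters)."""
    north_m = rng.gauss(0.0, sd_m)
    east_m = rng.gauss(0.0, sd_m)
    dlat = north_m / METERS_PER_DEGREE_LAT
    # A degree of longitude shrinks with the cosine of the latitude.
    dlon = east_m / (METERS_PER_DEGREE_LAT * math.cos(math.radians(lat)))
    return lat + dlat, lon + dlon

# A bus standing still for ten seconds, reported once per second.
rng = random.Random(7)
for _ in range(10):
    print(jitter_position(58.9, 17.95, 3.0, rng))
```

Raising `sd_m` corresponds to the low-accuracy GPS case, and lowering it to the high-accuracy case.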

Training a model

We chose Random Forest Regression as the machine learning model. The training environment was Microsoft Machine Learning Studio, where the model is called Decision Forest Regression. After loading the data, the first thing we did was to calculate more features from the GPS data to help train the model, for example whether the bus is at a stop (geofencing) and whether the date is a holiday. We then used the Filter Based Feature Selection tool, which scores the correlation between each of our features and the value sought: the time of the bus’s arrival at the stop. Below is the list of results from Filter Based Feature Selection.
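
As an illustration of the geofencing feature, a position can be tested against a circle around each stop. The sketch below uses the haversine distance. The 25 meter radius, the stop identifier and the coordinates are assumptions for the example, and the post does not say how the geofence was actually implemented.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two WGS84 positions."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def stop_at_position(lat, lon, stops, radius_m=25.0):
    """Return the id of the stop whose geofence contains the point, else None."""
    for stop_id, (stop_lat, stop_lon) in stops.items():
        if haversine_m(lat, lon, stop_lat, stop_lon) <= radius_m:
            return stop_id
    return None

# Hypothetical stop table: id -> (latitude, longitude).
stops = {"stop_1": (58.9000, 17.9500)}
print(stop_at_position(58.9001, 17.9500, stops))  # about 11 m north of the stop
print(stop_at_position(58.9100, 17.9500, stops))  # about 1.1 km north of the stop
```

The boolean `at_stop` feature is then simply whether the returned value is not `None`.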

Feature                    Weight  Info
seconds_to_stopid_search   1.000   Seconds until arrival at the searched stop (computed for all remaining stops). Target value.
lat                        0.498   Latitude
time_from_start*           0.497   Seconds the bus has been driving since the first stop
stopid*                    0.490   Previous stop
stopid_search*             0.483   Remaining stops
speed                      0.126   Speed
bear                       0.096   Bearing (compass direction)
lon                        0.085   Longitude
at_stop*                   0.082   Is the bus at a stop? Boolean
seconds_at_stop*           0.058   Seconds the bus has been at the stop
arbetsdag*                 0.042   Working day? Boolean
minute                     0.040   Minute part of the time
specialdag*                0.026   Weekend? Boolean
hour                       0.023   Hour part of the time
weekday*                   0.012   Day of the week
year                       0.010   Year part of the date
unix_time                  0.010   Date and time in seconds since 1970-01-01
time                       0.010   Date and time
month                      0.009   Month part of the date
second                     0.000   Second part of the time


Apart from the target value itself, no feature reaches a weight above 0.5, and the weights drop quickly towards zero: several features have very little influence. Interestingly, latitude has a much higher weight than longitude. The reason is simply that the bus line runs mainly from south to north.
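
A ranking of this kind can be approximated with a plain correlation score. The sketch below assumes the absolute Pearson correlation. The post does not say which scoring method was selected in the Filter Based Feature Selection tool, and the tiny data set is invented purely to show the mechanics.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 when either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def rank_features(columns, target_name):
    """Sort feature names by absolute correlation with the target column."""
    target = columns[target_name]
    scores = {name: abs(pearson(values, target))
              for name, values in columns.items() if name != target_name}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Invented example: on a south-north route, latitude tracks progress
# along the line closely, while longitude barely changes.
columns = {
    "seconds_to_stop": [300, 240, 180, 120, 60],
    "lat": [58.90, 58.92, 58.94, 58.96, 58.98],
    "lon": [17.950, 17.951, 17.949, 17.951, 17.950],
    "second": [0, 0, 0, 0, 0],
}
for name, score in rank_features(columns, "seconds_to_stop"):
    print(f"{name:8s} {score:.3f}")
```
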

Results of the simulation

The following diagram shows the trained model’s calculation of the remaining time between stops on the line. The model calculates the remaining time to the next stop as soon as the bus stops at one.

Figure 4: Countdown of the time to the next stop. The y-axis shows the time in seconds until the next stop and the x-axis shows the total running time. The black line shows the actual time, the red line the calculated time and the green line the difference. Note how the “jumping” GPS position produces a shaky calculated line.

The lines show how much time remains until the next stop: black is the actual time, red the calculated time, and green the difference between the two. The x-axis shows the bus’s total driving time and the y-axis the time left to the next stop. When the bus arrives at a stop, the time on the y-axis is zero, and the countdown to the next stop then starts. Each drive between two stops forms a right triangle; the larger the triangle, the longer the time between the stops.

At the beginning of each countdown the bus stands still while passengers get off and on. In the best of worlds, a GPS unit would send the same position every time until the bus starts rolling again. But reception quality varies, so the calculated position suffers and different positions are reported even though the bus is stationary. This error makes the model register that the bus is on its way, and then that it returns in the opposite direction at the next position fix. In the picture, this is most clearly seen as spikes at the beginning of each countdown, when the bus is stationary at the stop.

Every error in the GPS position gives rise to deviations in the model. We tested both high and low GPS accuracy. With low accuracy the deviations in the picture above become larger and the model performs worse. The model is not affected in the same way when the refresh rate changes, however. The result remains about as good even if the number of positions is reduced slightly: an update rate of one position per ten seconds produces almost the same result as one per second.
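
The actual countdown (the black sawtooth line) can be derived from the arrival times alone. The following is a minimal sketch with a hypothetical function name and made-up arrival times; it is not the labeling code used for this post.

```python
from bisect import bisect_left

def seconds_to_next_stop(timestamps, arrival_times):
    """Actual remaining time to the next stop for each timestamp.

    arrival_times must be sorted. The value is zero at the moment of
    arrival and jumps to the full travel time to the following stop one
    sample later. After the last stop there is no target, so None.
    """
    labels = []
    for t in timestamps:
        i = bisect_left(arrival_times, t)  # first arrival at or after t
        labels.append(arrival_times[i] - t if i < len(arrival_times) else None)
    return labels

# Made-up arrivals at 0, 100 and 250 seconds of driving time.
print(seconds_to_next_stop([0, 1, 50, 100, 101, 250, 251], [0, 100, 250]))
# prints [0, 99, 50, 0, 149, 0, None]
```
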

The result is satisfactory and shows that we are on the right path. The model follows our simulated driving reasonably well. Lessons learned from this test are as follows.

  • Data quality has a great impact on the model’s precision.
    • The accuracy of the GPS unit greatly affects the model.
  • Sensor refresh rate has less impact than expected.
    • Better with longer history than many points.
  • Difficult to simulate reality in a realistic way.
    • Difficult to model pure random events.

Real world data

As previously mentioned, we have had access to a bus operator’s GPS traffic data for one of its bus lines. Positions were reported in real time at a rate of one measurement point per second. The quality of the positions was very good, with a margin of error of a couple of meters. We have historical data covering several departures per week over five months, a total of more than 800 departures. In the same way as for the simulated bus trips, more features were calculated from the GPS information: time from start, last stop, and whether the bus is at a stop or not. The same model as before was used for machine learning in Microsoft Machine Learning Studio.

Figure 5: View of the training flow in the Microsoft Machine Learning Studio.

We split the departures into training, validation and test datasets and started the training. We then discovered that there is a limit to how long a project in Microsoft Machine Learning Studio may run: after two days the project was shut down abruptly. We had to reduce the number of departures, the number of GPS updates per departure, or both. Since variation among the departures was important for a stable model, our first measure was to look at how many GPS points were needed for training. Initially, the frequency was one point per second. Experience from the simulation suggested that one point per ten seconds was fully sufficient for a reliable model, and some tests on the real data confirmed this. However, we still had to reduce the number of departures included in the training so that the run would complete within two days. We eventually used a couple of hundred departures as training data and slightly fewer for validation and testing.
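
The two reductions, thinning the GPS points and splitting by whole departures, might look like this in outline. The function names and the 50/25/25 split are assumptions for illustration. The post only states that a couple of hundred departures were used for training and slightly fewer for validation and testing.

```python
import random

def downsample(points, every_n=10):
    """Keep every n-th GPS point, e.g. 1 Hz data becomes one point per 10 s."""
    return points[::every_n]

def split_departures(departure_ids, rng, train=0.5, validation=0.25):
    """Shuffle departures and split them into train, validation and test sets.

    Splitting by departure keeps all points of one trip in the same set,
    so the model is never tested on a trip it has partly seen.
    """
    ids = list(departure_ids)
    rng.shuffle(ids)
    n_train = int(len(ids) * train)
    n_validation = int(len(ids) * validation)
    return (ids[:n_train],
            ids[n_train:n_train + n_validation],
            ids[n_train + n_validation:])

train_ids, validation_ids, test_ids = split_departures(range(800), random.Random(3))
print(len(train_ids), len(validation_ids), len(test_ids))
print(downsample(list(range(0, 60)), every_n=10))
```
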

Figure 6: The image shows the average difference (y-axis) in seconds between actual and calculated remaining time (x-axis) to the selected stop. Ten minutes before arrival (600 seconds), the average error is 40 seconds.


In the figure above, the y-axis shows the average deviation between the actual and the calculated remaining time to arrival, as a function of how far away in time the searched bus stop is (x-axis). If the bus is half an hour (1800 seconds) away, the average deviation is just over 80 seconds.
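
A curve of this kind can be produced by grouping predictions by the actual remaining time and averaging the error in each group. The sketch below assumes that "average deviation" means the mean absolute error and uses 60-second bins; both are assumptions, and the numbers in the example are made up rather than taken from our results.

```python
from collections import defaultdict

def mean_abs_error_by_horizon(actual_s, predicted_s, bin_s=60):
    """Mean absolute error per bin of actual remaining time.

    Returns a dict mapping the lower edge of each bin (seconds) to the
    average absolute difference between actual and predicted time.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for actual, predicted in zip(actual_s, predicted_s):
        bin_start = int(actual // bin_s) * bin_s
        sums[bin_start] += abs(actual - predicted)
        counts[bin_start] += 1
    return {b: sums[b] / counts[b] for b in sorted(sums)}

# Made-up values: two samples close to arrival, two about ten minutes away.
print(mean_abs_error_by_horizon([30, 45, 600, 630], [20, 50, 640, 590]))
# prints {0: 7.5, 600: 40.0}
```
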

Figure 7: Another way to visualize the result is to look at a specific run. The figure shows a section of the bus’s driving in time. The y-axis shows the remaining time, both real (black line) and calculated (red line).

The figure above shows what this can look like for a specific departure. In the section between 1275 and 1540 seconds of driving time from the start (x-axis), we see how the model (red line) calculates the remaining time (y-axis). The way the calculated line moves up and down relative to the real one (black) shows how jerky traffic or errors from the GPS unit affect the result.


As we discovered during the simulation, the quality of the GPS information affects the model’s stability. A high sampling frequency, on the other hand, is not as important during training. The same applies in operation: it is enough to update the position at a longer interval, which saves data traffic costs for the communication between the bus and the reporting center. The interval must still be short enough to detect, for example, that the bus is at a stop.

The result shows that it is entirely possible to calculate, with machine learning, the remaining time until a bus arrives. It is also possible to extend the use of real-time information from the bus to other analyses.

  • Stress on vehicles: Together with engine information (CAN data), stressful stretches can be identified. It is possible to visualize on a map how the engine works along the lines, both on average and in real time.
  • Customer satisfaction: Measurements on the bus, combined with the bus’s geographical location, make it possible to identify areas where efforts are required, or specific departures where customer satisfaction has been poor.
  • Real-time visualization: At the stop: Show how far away the bus is (in kilometers) along with the estimated time left. In the app: Show the position of the bus on a map.
  • Driver Analysis: A driver’s driving style can be linked to engine information to identify the best driving style for minimal wear. Coupled with customer satisfaction, the perfect driving style can be analyzed. What is most important: arriving on time, or safe and secure driving?
  • Passenger Count: Is the traffic network overloaded? Is there any area where we need to expand? In the app: Will I get a seat when the bus arrives?

