Top-Down Distribution of Quality Measurement Requirements using Table Functions in Python


Since August 2015 I have been working part-time for ETC Transport Consultants1 in Switzerland as a Senior Data Analyst and Developer. My job is to develop a complete software solution for Travel Planning & Resource Optimization in Python (a popular high-level programming language).

ETC Transport Consultants (ETC) has a mandate from the Swiss Federal Office of Transport (FOT) to measure the quality of public transport vehicles (bus, train, tram, etc.) in all cantons of Switzerland. The required sample quantities, that is, how many times the quality of a certain transport line or station must be measured (and evaluated) within a year, are provided by the FOT as formal quality measurement requirements.

The test customers recruited by ETC make roughly 6-hour round trips on various public transport lines in order to measure more than 60 attributes of each transport line and vehicle. Each measurement is recorded in a special application on their mobile phones, which finally transmits the measurement data to a central web-based database named QDABA, located in Berlin. These measurement data are then used for generating various quality reports for the FOT, for the cantons of Switzerland, and for the transport companies.

The complex round-trips and measurement details of these test customers must be planned meticulously in order to fulfill the quality measurement criteria required by FOT. Currently, this planning is done in a dedicated MS Access tool with some semi-automated consistency checks in order to simplify the manual work.

As the first step (named tour planning), tours are planned including details like start time and station, lines and stations to be taken and measured during the tours, cross-overs to nearby stations at certain stations, and so on. As the second step (named assignment planning), the planned tours are allocated to available days and test customers.

As you can imagine, this manual planning process is quite expensive in terms of labor and time, especially if there are many lines and stations to be measured. Currently, there are 88 different transport companies spanning more than 1100 transport lines, with a total sample of 40’000 measurements per year. My job (as data analyst & developer) is to automate the whole planning process in order to save time, labor and money.

The complete planning process can be broken down into three sub-processes:

  1. Top-down distribution of measurement requirements: Distribution with table functions, minimum-loss rounding, +5Mod12 distribution with frequency
  2. Tour planning: Time-dependent dynamic network, graph algorithms based on time-table data, search criteria for finding all possible paths, find most valuable (i.e. minimum cost) tours to cover measurement requirements
  3. Assignment Planning: Overall integration, graph algorithms and constrained optimization, allocating best-value tours optimally to available days and test customers

As of today (3. July 2016), all three sub-processes are almost completely implemented, apart from some real-life complications and final tests. I estimate that the remaining detail work can be completed within about three months, by the end of September 2016.

This is good news for ETC because this software, process and operation know-how, namely Complex Travel Planning & Resource Optimization, can be deployed in many other fields like field research (survey) or activity campaign organization, where limited resources must be allocated to tasks in different locations, for which travels and transports must be planned to the finest detail.

The rest of this article is about the first sub-process mentioned above, top-down distribution of measurement requirements, which I implemented using the table functions that I had developed several years ago.

Top-Down Distribution of Measurement Requirements

The measurement requirements are delivered by the FOT at a highly aggregated level, expressed in terms of years and line bundles:

Line-specific statistics (tour-km and person-km) for the transport lines of line bundle AS BahnS 1:

In order to begin with the monthly tour and assignment planning, these high-level requirements (i.e. sample numbers) must be broken down to detail fields such as month, transport line, time window (hours of day) and weekday, according to the following specifications:

Samples(LineBundle,Year) --> Samples(Line,Month,WeekdayGroup,TimeWindow)

  1. The total number of samples after distribution (at detail level) must exactly match the total number of samples before distribution (at high level). And all numbers after distribution must of course be whole numbers; one can’t measure a transport line 4.5 times within a month.
  2. Distribution from year to months: In most cases, yearly measurement quantities (samples) must be distributed evenly to the 12 months of the year. But it must also be possible to distribute the samples to a given set of months of a year (including fractional months like 1/3 of December), excluding one or several months.
  3. Distribution from a line bundle to individual lines of the bundle: A given percentage like 40% (fixed ratio) of samples is distributed equally to all lines. Another percentage like 60% (variable ratio) of samples is distributed in proportion to the average of two line-specific statistics, tour-km and person-km. A fixed default proxy value is used for calculations if any statistic is missing in the data set provided by the FOT.
  4. Distribution from total to time windows: Samples are distributed to time windows (i.e. hour blocks like 6:00-10:59) according to a given ratio (key) for each time window. For example, 4% for time window 0:00-05:59, 34% for time window 06:00-10:59, and so on.
  5. Distribution from total to weekday groups: Samples are distributed to weekday groups (i.e. Monday-Friday, Saturday, Sunday) according to a given fixed ratio (key) for each weekday group. For example, 10% for Saturday, 80% for Monday-Friday.
  6. All the input data from the FOT related to the quality measurement requirements (number of samples per year and line bundle, line-specific statistics, etc.) must be read from multiple Excel files and sheets, and the results (i.e. samples after distribution) must be exported to an Excel file for practical reporting with pivot tables. There are more than 200 line bundles, and accordingly more than 200 Excel files to be read as input data.
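Specifications 4 and 5 amount to splitting a total by fixed ratio keys. The following minimal sketch illustrates that step; the function name is mine, and the ratios for the last two time windows are made-up illustrations (only the 4% and 34% values appear in point 4 above):

```python
# Minimal sketch: split a sample total by fixed ratio keys, e.g. per
# time window. Function name and the last two ratios are illustrative.
def distribute_by_ratios(total, ratios):
    """Split `total` proportionally according to a dict of ratio keys.

    The ratios are normalized first, so they need not sum exactly to 1.
    """
    ratio_sum = sum(ratios.values())
    return {key: total * r / ratio_sum for key, r in ratios.items()}

time_window_keys = {"00:00-05:59": 0.04, "06:00-10:59": 0.34,
                    "11:00-15:59": 0.30, "16:00-23:59": 0.32}
shares = distribute_by_ratios(1200, time_window_keys)
# the fractional shares still sum exactly to the original total
```

The results are still fractional at this point; making them whole numbers is the job of the rounding step described later in the article.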

The whole distribution process can be summarized as follows:

  1. Read all the input data from the Excel files provided by the FOT (high-level measurement requirements, line-specific public transport statistics).
  2. Calculate all the distributions as specified above, to obtain samples at detail level: samples per month, transport line, weekday group and time window.
  3. Export the results, that is, samples after distribution, back to an Excel file for practical reporting using pivot tables.

You might think this distribution problem alone is a small software project that would take a small team of developers, experienced in similar quantitative distribution problems, several months to implement.

It took me only three workdays to implement the distribution, including tests with input data, using the table-valued functions that I had developed myself several years ago.

Implementation of the Distribution Problem with Table Functions

You may find below the core section of the code which solves the distribution problem using table functions in Python:

# calculate tour-km proportion of each transport line within a line bundle
TourRatio = Tours.GetProportions(['LineBundle'])
# calculate person-km proportion of each transport line within a line bundle
PkmRatio = PKm.GetProportions(['LineBundle'])

def EffektiveRatioFunc(x, y):
	"""Calculate effective ratios that are used as key values
	for distributing #samples from line bundles to individual lines."""
	return (x + y) / 2.0 	# average of ratios

# calculate effective ratio
EffRatio = Relation.RelationArithmetic(TourRatio, PkmRatio, EffektiveRatioFunc)
# LineBundle-LineID mapping with a constant value (1)
LB_Line = 1 + (0 * Umlaufe)
LB_Line_Ratio = LB_Line.GetProportions(['LineBundle'])
# distribute #samples to lines and cantons (before rounding)
DistributedSamplesPerYear = \
	((FixPart * SamplesLB * LB_Line_Ratio) + (VariablePart * SamplesLB * EffRatio)) \
	* TimeWindow * Weekday
# minimum-loss rounding before distributing to individual months
DistributedSamplesPerYear_rounded = \
	DistributedSamplesPerYear.ApplyListOperationToSubRelation(['LineBundle'], \
	MinLossRound)
# distribute #samples from year to individual months
DistributedSamplesPerMonth = \
	DistributedSamplesPerYear_rounded.ApplyListOperationToSubRelation( \
	['LineBundle','Year','TU','TU-Nr','TimeWindow','Line-ID','TimeWindowNr','Weekday'], \
	DistributeToMonths)	# vector function for the +5 mod 12 distribution (name assumed)
# convert multiple key figures to attributes
DistributedSamplesPerMonth2 = DistributedSamplesPerMonth.KeyFiguresToAttribute('MonthNr')
# add attributes with additional line information
DistributedSamplesPerMonth2 = (DistributedSamplesPerMonth2 + LineInfo) + Month
# export results to excel file
DistributedSamplesPerMonth2.ExportToExcel(SaveFileAs = ExportExcelFile, \
	KeyFigureFields = None)

I can hear you asking “how can the code be so short and simple?”

Remember, analytical table functions are general-purpose and versatile high-level functions that enable you to formulate apparently complex specifications in a short time.

Yes, that is the primary advantage of table functions: quick and clean code, provided that the developer has the skills to use high-level analytical functions. Generally, developers with some experience in database applications and math software like Matlab, SAS or R should have no difficulty in understanding table functions.

In addition to the efficient use of table functions, there were two key ingredients in solving this particular distribution problem:

1) Minimum-Loss Rounding: This is not the standard minimum-loss rounding algorithm; I developed a new rounding algorithm specifically for this distribution problem. It is a vector function which rounds all the fractional numbers in the vector either up or down, using a method that minimizes the total proportionate rounding error (or deviation). It took me about two workdays to develop this rounding algorithm.
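The minimum-loss algorithm itself is not listed in this article. As a hedged illustration of what such a sum-preserving vector rounding can look like, here is the classic largest-remainder method; this is my own sketch, not the actual MinLossRound:

```python
# Sketch of a sum-preserving vector rounding (largest-remainder method).
# This is NOT the author's minimum-loss algorithm, only an illustration
# of a vector function that rounds while keeping the total exact.
import math

def round_preserving_sum(values):
    """Round each value down, then give the leftover units to the entries
    with the largest fractional parts, so that the rounded total equals
    the rounded sum of the input."""
    floors = [math.floor(v) for v in values]
    leftover = round(sum(values)) - sum(floors)
    # indices sorted by descending fractional part
    order = sorted(range(len(values)),
                   key=lambda i: values[i] - floors[i], reverse=True)
    for i in order[:leftover]:
        floors[i] += 1
    return floors

print(round_preserving_sum([4.5, 3.2, 2.3]))  # the total of 10 is preserved
```

Any such vector function can then be applied per line bundle, per line, or at any other level, which is exactly what the table function below does.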

This vector function is passed to a table function as an argument, in order to round numbers at any attribute combination (or level) you want:

# minimum-loss rounding before distributing to individual months
DistributedSamplesPerYear_rounded = \
	DistributedSamplesPerYear.ApplyListOperationToSubRelation(['LineBundle'], \
	MinLossRound)

MinLossRound is the vector function in the code above, which is passed as an argument to the table function named ApplyListOperationToSubRelation. The table function applies the given vector function to the set of values belonging to each attribute combination.

If you want to apply another rounding algorithm, you just pass another vector function for rounding to the table function as argument. This underlines the flexibility of table functions.
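This swap-the-vector-function flexibility can be sketched in plain Python. The following toy "apply per group" function is my illustration of the underlying pattern, not the author's actual implementation of ApplyListOperationToSubRelation:

```python
# Toy sketch of the idea behind ApplyListOperationToSubRelation: apply a
# vector function to the values of each attribute combination (group).
# Names and data are illustrative, not the author's actual API.
def apply_to_subgroups(rows, group_key, value_key, vector_func):
    """Group `rows` (a list of dicts) by `group_key`, apply `vector_func`
    to each group's list of values, and write the results back."""
    groups = {}
    for row in rows:
        groups.setdefault(row[group_key], []).append(row)
    for group_rows in groups.values():
        new_values = vector_func([r[value_key] for r in group_rows])
        for r, v in zip(group_rows, new_values):
            r[value_key] = v
    return rows

rows = [{"bundle": "A", "samples": 4.6}, {"bundle": "A", "samples": 3.4},
        {"bundle": "B", "samples": 2.6}, {"bundle": "B", "samples": 2.4}]
# any vector function can be plugged in, e.g. plain elementwise rounding:
apply_to_subgroups(rows, "bundle", "samples", lambda vs: [round(v) for v in vs])
```

Swapping the rounding strategy means swapping the lambda; the grouping machinery stays untouched.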

2) Plus 5 Modulo 12 Distribution: This is the algorithm, also a vector function like minimum-loss rounding, that I used for distributing the samples from year to months.

# distribute #samples from year to individual months
DistributedSamplesPerMonth = \
	DistributedSamplesPerYear_rounded.ApplyListOperationToSubRelation( \
	['LineBundle','Year','TU','TU-Nr','TimeWindow','Line-ID','Weekday'], \
	DistributeToMonths)	# vector function for the +5 mod 12 distribution (name assumed)

The initial requirement was to distribute samples evenly to all 12 months of a year. For this purpose, I developed a simple vector function based on the modulus operation (+5 % 12), which accomplished the task simply and reliably.
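A plausible sketch of this idea: give every month the integer base share, then step through the months in increments of 5 (mod 12) to place the leftover samples. Since gcd(5, 12) = 1, the steps visit every month before repeating, so the extras spread over the year instead of clustering. This is my reconstruction based on the "+5 % 12" hint, not the author's exact code:

```python
# Sketch of a "+5 mod 12" even distribution of an integer sample total
# over 12 months (my reconstruction, not the author's actual function).
def distribute_year_to_months(total):
    """Distribute `total` samples over the 12 months as evenly as possible."""
    base, leftover = divmod(total, 12)
    months = [base] * 12
    pos = 0
    for _ in range(leftover):
        months[pos] += 1
        pos = (pos + 5) % 12  # coprime step spreads the extras apart
    return months

print(distribute_year_to_months(40))  # 12 * 3 plus 4 extras, spread apart
```

Because the input totals have already been rounded by the minimum-loss step, the function only ever deals with whole numbers.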

Again, this vector function was passed to the proper table function (see the code above) in order to distribute samples at the desired attribute combination (or level).

After several weeks, the initial requirement was extended: The algorithm should be able to distribute samples to a given set of months, for example from April to December (to 9 months instead of 12).

Furthermore, distribution to a fraction of a month should be possible, like distributing samples from April to the first half of December.

I only had to adjust the original vector function, adding an inverse frequency parameter to the algorithm. It then became possible to distribute the total samples over any number of months, including fractional ones.
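A hedged sketch of the extended behaviour, with an explicit weight per month (1.0 for a full month, 0.5 for half of December); function and parameter names are mine, not the author's:

```python
# Hedged sketch of the extended requirement: distribute a total over an
# arbitrary set of months with fractional availability. Names are mine.
def distribute_to_weighted_months(total, weights):
    """`weights` maps month number -> available fraction of the month
    (1.0 = full month, 0.5 = half month; absent months are excluded)."""
    weight_sum = sum(weights.values())
    return {m: total * w / weight_sum for m, w in weights.items()}

# April..November fully available, plus the first half of December:
weights = {m: 1.0 for m in range(4, 12)}
weights[12] = 0.5
shares = distribute_to_weighted_months(170, weights)
```

With all weights equal to 1.0 this reduces to the original even distribution over the given months; the fractional shares would again be rounded by the minimum-loss step.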

The next step was easy: just pass the updated vector function with the additional frequency parameter to the table function mentioned above. This again illustrates the generality and flexibility of table functions.

Written by: Tunc Ali Kütükcüoglu, 3. June 2016

Official Reference: Statement of ETC Transport Consultants

The software implemented with table functions is currently in use. It helps us to handle the formal quality measurement requirements easily and very quickly. It was important to solve this problem fast, because the subsequent manual planning process required a lot of time. We look forward to using the new Travel Planning & Resource Optimization tool, as it will help us handle the whole process faster than today.
Mirjam Schwager, Deputy Head of the Olten branch, Switzerland

  1. ETC Transport Consultants GmbH, with offices in Berlin and Switzerland, offers IT solutions to transport authorities and transport companies, as well as mystery surveys for public transport systems. In order to provide fast and reliable data and information flows within the transportation sector, we offer dynamic, company-wide central data hubs (RBL / ITCS). These combine and connect individual components, such as traveler information systems, travel-time analysis, transfer connections and data management, in one system. The integration of VDV and other interfaces is supported.