Posted by: dresstosurvive | August 20, 2007

The Push versus Pull Model of Software Development

Shell pipelines, reading files, and streaming data: structure your code for maximum efficiency.

Have you ever considered why your code bottlenecks at certain points? It pays to distinguish between data that is pushed to its next destination and data that is pulled. Pulling is a great thing when the bottleneck is the user. Pulling data saves resources when there is little or no work to be done.

However, things change when you need to operate on a large data set or perform multiple transformations in sequence. Pulling data becomes ineffective because the slowest link in the chain becomes a choke point. This can be significantly alleviated by pushing data instead.

Pushing data works best if you can structure your chain of operations to perform the most expensive ones first. By doing this, the later operations are able to execute without starving for data. As each operation completes, it sends the transformed data or a subset of it to the next component.
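As a rough illustration of the idea above, here is a minimal Python sketch of a push chain, in which each stage hands its output straight to the next one. The stage names (make_square_stage, make_even_filter) and the squaring step standing in for the "expensive" operation are assumptions made for the example, not anything prescribed by this post.

```python
from typing import Callable, List

Sink = Callable[[int], None]

def make_square_stage(downstream: Sink) -> Sink:
    """Stand-in for the expensive stage: transform each item, then push it on."""
    def stage(item: int) -> None:
        downstream(item * item)
    return stage

def make_even_filter(downstream: Sink) -> Sink:
    """Cheaper stage: forward only a subset of what it receives."""
    def stage(item: int) -> None:
        if item % 2 == 0:
            downstream(item)
    return stage

results: List[int] = []
# Wire the chain from the end backwards: square -> even filter -> results.
pipeline = make_square_stage(make_even_filter(results.append))

for value in range(1, 7):
    pipeline(value)  # the producer drives the flow by pushing each item

print(results)  # [4, 16, 36]
```

Note that no stage ever asks for data here; each one simply reacts when the previous stage calls it.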

Take care not to place heavier operations later in the chain; otherwise you risk flooding a component with data. If a more expensive operation must be performed later, it is best to pull data into that component.
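One way to picture an expensive late stage pulling at its own pace is a lazy Python generator: the cheap source does no work until the slow consumer asks for the next item, so nothing piles up. This is only a sketch; cheap_source, expensive and the produced list are hypothetical names invented for the example.

```python
from itertools import islice

produced = []  # records which items the source was actually asked to create

def cheap_source():
    """Cheap upstream stage: yields one item each time the consumer asks."""
    n = 0
    while True:
        produced.append(n)
        yield n
        n += 1

def expensive(x: int) -> int:
    """Stand-in for a costly later operation."""
    return sum(i * i for i in range(x + 1))

source = cheap_source()
# The expensive stage pulls exactly three items, one at a time.
results = [expensive(x) for x in islice(source, 3)]

print(results)   # [0, 1, 5]
print(produced)  # [0, 1, 2]
```

The source produced only the three items that were requested, so the expensive stage was never flooded.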

Pulling data entails components completing their work as fast as possible while also serving requests for new data. A component requests more data from the previous component as needed. If nothing is available, it must sleep and poll again later, or employ a mechanism to be signaled when new data is available. Reading a file is a pull operation because the component must wait for the drive before the data is supplied.
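The "signaled when new data is available" variant can be sketched with Python's standard queue module, whose get() call blocks until another thread has put something in. The names buffer, DONE, producer and pull_all are assumptions for illustration only.

```python
import queue
import threading

buffer = queue.Queue()
DONE = object()  # sentinel marking the end of the data

def producer() -> None:
    """Previous component: makes data available as it becomes ready."""
    for chunk in ("alpha", "beta", "gamma"):
        buffer.put(chunk)
    buffer.put(DONE)

def pull_all() -> list:
    """Consumer: requests items one by one, waiting when none are ready."""
    received = []
    while True:
        item = buffer.get()  # blocks until the producer supplies something
        if item is DONE:
            break
        received.append(item)
    return received

worker = threading.Thread(target=producer)
worker.start()
received = pull_all()
worker.join()

print(received)  # ['alpha', 'beta', 'gamma']
```

Blocking on the queue avoids the sleep-and-poll loop: the consumer wakes only when there is something to read.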

Shell pipelines are an example of pushing data. Suppose a user wishes to find all of their movies that are stored in AVI containers. They might construct a shell pipeline like the one illustrated below. First, a listing of all possible movies is created. Then, the subset of those stored in AVI containers is extracted.

[Figure: Pushing Data]

This method of structuring data by progressively filtering results is at the heart of good data mining practice. Note that at each stage, you can take the new data set and perform various operations on it. This allows for cheap non-destructive updates provided you maintain a copy of the data set. As each subsequent operation is cheaper, filters can be added and removed at will.
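The movie search and the retained-copy idea can be sketched together in Python. This is not the shell pipeline from the original illustration; it is an assumed equivalent in which the file names, push_all, keep_copy and with_extension are all made up for the example.

```python
from typing import Callable, Iterable, List

Sink = Callable[[str], None]

def push_all(names: Iterable[str], downstream: Sink) -> None:
    """First stage: push every candidate file name downstream."""
    for name in names:
        downstream(name)

def keep_copy(store: List[str], downstream: Sink) -> Sink:
    """Record each item at this stage, then pass it along unchanged."""
    def stage(name: str) -> None:
        store.append(name)
        downstream(name)
    return stage

def with_extension(ext: str, downstream: Sink) -> Sink:
    """Filter stage: forward only names ending in the given extension."""
    def stage(name: str) -> None:
        if name.lower().endswith(ext):
            downstream(name)
    return stage

library = ["heat.avi", "alien.mkv", "Brazil.AVI", "notes.txt"]  # made-up names
all_files: List[str] = []
avi_files: List[str] = []
push_all(library, keep_copy(all_files, with_extension(".avi", avi_files.append)))

# A different filter can be run later against the retained copy.
mkv_files: List[str] = []
push_all(all_files, with_extension(".mkv", mkv_files.append))

print(avi_files)  # ['heat.avi', 'Brazil.AVI']
print(mkv_files)  # ['alien.mkv']
```

Because the listing was kept, swapping the AVI filter for another one does not require listing the files again.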

Pushing works well for things like streaming music. You don’t want to starve the final output. If you use a pull model which requests an update every few milliseconds, you might experience skipping or delays.

Pulling works well for things like AJAX interfaces. You don’t want to continuously do work, but you want data whenever the user asks for it. Ideally, on the server side, the code will be structured to do its processing in a push fashion and provide an interface for data to be pulled.
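A minimal sketch of that server-side arrangement, assuming a hypothetical RunningSummary class: incoming values are pushed in and folded into a summary as they arrive, while a request handler would pull the current summary only when a user asks.

```python
import threading

class RunningSummary:
    """Push on the processing side, pull on the request side."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._total = 0.0
        self._count = 0

    def push(self, value: float) -> None:
        """Processing side: fold each new value into the summary."""
        with self._lock:
            self._total += value
            self._count += 1

    def snapshot(self) -> dict:
        """Request side: return the current summary only when asked."""
        with self._lock:
            mean = self._total / self._count if self._count else 0.0
            return {"count": self._count, "mean": mean}

summary = RunningSummary()
for reading in (2.0, 4.0, 6.0):
    summary.push(reading)

print(summary.snapshot())  # {'count': 3, 'mean': 4.0}
```

The pull is cheap because the work was already done when the data was pushed in.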

In short, consider whether your operations are driven by user input or physical motion (such as in a disk drive), or whether they are limited by processing power. Also take into account whether the final output must be available on a time-critical or realtime basis. In a multiplayer online game, data becomes irrelevant once it is out of date and must be discarded. Other operations, such as analysing stock market activity, will not allow for data to be discarded at random.
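The difference between the two cases can be sketched with Python's collections.deque: a buffer with maxlen=1 silently drops stale updates, while an unbounded one retains everything. The variable names and the sample updates are assumptions for illustration.

```python
from collections import deque

latest_position = deque(maxlen=1)  # game state: only the newest update matters
trade_log = deque()                # market data: every item must be retained

for tick in range(5):
    latest_position.append(("player", tick))  # older entry is discarded
    trade_log.append(("trade", tick))         # nothing is ever dropped

print(list(latest_position))  # [('player', 4)]
print(len(trade_log))         # 5
```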
