
Issues in Concurrent Programming

Introduction

The term concurrent programming usually refers to the kind of programming required to support simultaneous activities in a software application.

Such a definition is somewhat broad, and indeed concurrent programming can refer to a variety of programming models: multi-processor systems, parallel systems, distributed systems, as well as uni-processor systems with simulated concurrency all support concurrent programming. In this article we focus in particular on uni- or multi-processor systems with actual or simulated concurrency. Many of the issues we handle here are general and apply to the other systems as well.

Concurrent programming, as we treat it here, is the set of techniques and tools used in programming concurrent activities that dynamically share information with one another. Note that dynamic information sharing is an important criterion. Otherwise, programming two different programs running on two different machines would qualify as concurrent programming under an indiscriminate definition.


Behind The Theory

The programmer new to concurrent programming usually comes across a multitude of terms like threads, semaphores, mutexes, monitors, condition variables, signals, delayed variables, critical sections, RPCs, rendezvous, remote evaluations, synchronous/asynchronous data transfer, message queues, shared memory, priorities, deadlocks and so on. He/she will also have been introduced to certain "sample" problems like the Dining Philosophers and Producer-Consumer problems, which usually fail to make much sense. What does it mean if five philosophers get together to eat spaghetti? Well, hardly anything. But the example serves as a classic case common to most concurrent programming issues, if only it is viewed together with practical scenarios. So it is usually best to introduce terms and concepts on a need-by-need basis.

But before we do that, let’s see when we really need concurrent programming.


When Do We Need Concurrent Programming?

Most of the time, in most PC applications, we don't. There are some applications that can benefit from concurrency. Then there are applications where concurrency is a requirement (especially in real-time systems). And finally, there are applications that are inherently distributed in nature, like e-mail.

We will touch briefly upon why e-mail is a concurrent system belonging to the last class mentioned above. The e-mail system is well designed: it has interacting components running on multiple machines, dedicated to routing and delivering your mail across the world. It has avoided dead-locks and many other problems. Let's say that an e-mail you sent to a friend bounced, and by the time it gets back to you, your e-mail account has been deleted. What is there that prevents orphaned e-mails from traveling the world back and forth through an endless number of servers? What happens if two e-mail delivery components try to send to each other at the same time? As we can see, there are many problems due to concurrent operations showing up here.

To take another example, consider a packet broadcast into a network (a general graph). What prevents it from circling round and round? There is logic active in these systems that works only with co-operative agreement and action. Such an agreement put to action is called a protocol. There is concurrent logic at work, and these are concurrent systems. Even your network card on the Ethernet is part of a concurrent system, because it follows a protocol for avoiding access contention and co-operates (indirectly) with other cards on your network. Only, the concurrency design was done in the Ethernet specifications, and the cards are simply following their assigned task.

Again, such systems have to be wary of live-locks. In a live-lock, the system is not frozen, but keeps doing something that is useless or wastes resources. Does that happen? Look at the following example.

Many people are familiar with the "finger" program on the Internet. (If you aren't, find out and come back.) A finger server (or daemon, in Unix terms) answers queries from a client who issues a finger command. Some sites do a "reverse finger" on the client's finger service to find out who is fingering them. Sounds like a reasonable requirement. Now, what would happen if the client's finger service is also of the "reverse fingering" type? Unless a special mechanism is in place, these two programs will either dead-lock or repeatedly finger each other, run out of resources and become useless, without extracting any information. That's a case of a live-lock.

We like to classify dead-locks and live-locks and many of the "effects" found in concurrent systems as emergent phenomena. A phenomenon is emergent when you do not explicitly make provisions for it, but it arises from the interactions between the components in your system. You don't code dead-locks. Neither do you code for live-locks or data corruption or system hangs. They come all of their own. What you have to make sure of is that they don't arise in your system. And this means controlling the functionality of your components in certain ways so that the collective effects are minimal. That is what concurrent programming techniques are mostly about. So basically, many concurrent programming techniques are ways and means of avoiding destructive emergent phenomena in concurrent systems.

Can we have a constructive emergent phenomenon? Yes, you can. But like most emergent phenomena, it is difficult to design the situation where it can turn up. Remember any time when your design seemed to solve certain problems without you actually putting the solution in there? Through a set of bells and whistles that were in place for other things, the problem got solved by itself. A constructive emergent phenomenon that you would like to have is self-organization. More on this later.

So, getting back: do we need concurrent programming? The answer is yes, if your application has inherent concurrency and requires either manifested concurrency or speed-up. It is yes, if your application has some inherent concurrency and there are substantial design simplifications to be had in using concurrency. But most of the time it is a plain NO. It can result in messy situations and applications that are exceedingly difficult to debug. Despite what certain Java tutorials might say about threads and concurrency, they are best avoided in vanilla programming tasks.

For an introduction to what threads are, take a look at Appendix A.


What Are the Dining Philosophers Doing?

Every student who learns operating systems or concurrent programming is introduced to the Dining Philosophers problem. Five philosophers are sitting around a circular table and there is a big bowl of spaghetti at the center of the table. There are five forks placed around the table in between the philosophers. When a philosopher, who is mostly in the thinking business, gets hungry, he grabs the two forks to his immediate left and right and, dead-set on getting a meal, gorges on the spaghetti with them. Once he is full, the forks are placed back and he goes into his mental world again. The problem usually omits an important fact: a philosopher never talks to another philosopher. The typically projected scenario is that if all the philosophers grab the fork on their left simultaneously (which can happen by chance, and what can go wrong will go wrong, as Murphy's law says), none of them will be able to grab the fork on their right. And with their one-track mind-set they will forever keep waiting for the fork on their right to come back on the table.

The basic idea behind the scenario is that if a concurrent activity always does what seems best or right for itself in a shared-resource scenario, the result can be chaos. Is there a solution to the Dining Philosophers problem? Actually, the scenario was posed not for a solution but to illustrate a basic problem that arises when the traditional programming approach is applied to concurrent systems. The same problem crops up in your own concurrent systems; your design decisions should be aware of it, and that is what you have to solve. Any set of concurrent programming techniques that you use is expected, at the basic level, to offer features that can be used to deal with the Dining Philosophers problem in some way.

An illustration:

Assume that you have the simple task of writing some important information into two files on the disk. But these files are shared by other programs as well. Therefore you use the following strategy to update the files:

    Lock fileA
    Lock fileB
    Write information to fileA and fileB
    Release the locks

This seemingly straightforward and obvious coding can result in deadlocks if other tasks are also writing to these files. For example, if another task locks fileB first and then locks fileA, and if both tasks try to do their job at the same time, a dead-lock occurs. My task would lock fileA, the other task would lock fileB, then my task would wait indefinitely to lock fileB while the other task waits indefinitely to lock fileA.

The above is a simple scenario, and easy to find. But you can have a somewhat more involved case where task A waits for a lock held by task B, which is waiting for a lock held by task C, which is waiting for a lock held by task A. A circular wait, and hence a dead-lock, results. This is a Dining Philosophers scenario.

In the above code fragment, one could resort to locking the files one at a time for modification. Then the problem would disappear. But there are times when requirements dictate that you have to lock more than one resource before updating them.


Solutions?

How do you solve such problems? Either you design your system so that circular waits are avoided, or you need some special logic for controlling the access. The former is easier to do if the system is well defined, i.e. the design is known to you, so that whenever you add or modify a component that makes such access, you make sure that it doesn't create circular waits.
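As a minimal sketch of that first approach, the following C fragment (assuming POSIX threads) uses two pthread mutexes to stand in for the locks on fileA and fileB. Every task that needs both locks acquires them in the same global order, so a circular wait cannot form; the function name and the file-writing step are placeholders.

  #include <pthread.h>

  pthread_mutex_t lockA = PTHREAD_MUTEX_INITIALIZER;   /* stands in for the lock on fileA */
  pthread_mutex_t lockB = PTHREAD_MUTEX_INITIALIZER;   /* stands in for the lock on fileB */

  void update_both_files(void)
  {
      /* Every task acquires A before B, so no circular wait can develop. */
      pthread_mutex_lock(&lockA);
      pthread_mutex_lock(&lockB);

      /* ... write information to fileA and fileB ... */

      pthread_mutex_unlock(&lockB);
      pthread_mutex_unlock(&lockA);
  }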

A resource access graph:

You can work out an access graph for the resources/components of your system that require mutually exclusive access. A simple access graph might look as follows:

[Figure: a resource access graph]

This helps identify the programs/processes and the mutually exclusive resources that they use. Access control must then be enforced on these programs/processes.

A call graph:

[Figure: a call graph]

Sometimes the resources could be programs/processes themselves. Process A makes a call to process B, which makes a call back to process A; the call-back blocks because A is still waiting for a reply from B. This can be extended to a set of processes, where a circular wait can develop. A call graph can be useful in such scenarios. Essentially, you look for a cycle in the call graph. Existence of a cycle means potential for dead-lock.

It should be easy to find a cycle in the above graph. The nodes in the above graph could be processes or mutually exclusive objects.

A client-server model

It is usually a lot easier and less painful to follow a client-server approach in the design to avoid dead-locks. Each resource is "served" by a server task to which a client sends a service request (a message). The server, after completing the task, replies with a message. Message queues can be used to send messages back and forth. Each task can have its own message queue where messages are received. The client need not wait for the server task to finish but may continue. It can look for the reply message on its own message queue at any time and process it.

In this case, all operations related to the resource will be managed by the server task. For example, in the file-locking example mentioned above, the real task is updating the files; locking them is only a means. Such activity is moved to the server task, and the client simply provides the data to be written. All modifications to the resource thus become serialized in the server task. A client task can send messages to more than one server task and process the reply messages.

[Figure: client-server model with per-task message queues]

In the above design, each task/program/process has a queue to itself. The queue is used for receiving messages for the task in this example. To send a message to another task, the message is deposited into the message queue of the intended task.

In the Dining Philosophers example, the table that holds the forks could act as a server that dispenses two forks to a requesting philosopher. If the philosophers always leave the room for thinking, then the room itself could be the server and control the entry of the philosophers. Or you might introduce a waiter who provides the forks on demand and dispenses two forks at a time. This is the way of centralized resource access control.
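One common way to realize the waiter idea, sketched here with POSIX threads, is to let a philosopher pick up forks only while holding a "waiter" mutex; two neighbours can then never end up each holding one fork while waiting for the other. The fork bookkeeping and the three-meal limit are purely illustrative.

  #include <pthread.h>

  #define N 5

  pthread_mutex_t fork_lock[N];                          /* one lock per fork */
  pthread_mutex_t waiter = PTHREAD_MUTEX_INITIALIZER;    /* the centralized arbitrator */

  void *philosopher(void *arg)
  {
      int i = *(int *)arg;
      int meals;

      for (meals = 0; meals < 3; meals++) {
          /* think ... */
          pthread_mutex_lock(&waiter);                   /* ask the waiter for forks */
          pthread_mutex_lock(&fork_lock[i]);             /* left fork */
          pthread_mutex_lock(&fork_lock[(i + 1) % N]);   /* right fork */
          pthread_mutex_unlock(&waiter);                 /* the waiter is free again */

          /* eat ... */

          pthread_mutex_unlock(&fork_lock[(i + 1) % N]);
          pthread_mutex_unlock(&fork_lock[i]);
      }
      return NULL;
  }

  int main(void)
  {
      pthread_t t[N];
      int id[N];
      int i;

      for (i = 0; i < N; i++)
          pthread_mutex_init(&fork_lock[i], NULL);
      for (i = 0; i < N; i++) {
          id[i] = i;
          pthread_create(&t[i], NULL, philosopher, &id[i]);
      }
      for (i = 0; i < N; i++)
          pthread_join(t[i], NULL);
      return 0;
  }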

Protocols

Another way is for the tasks to coordinate with each other, avoiding a centralized server. This requires communication (direct or indirect) between the tasks and a predefined agreement on how to proceed with access and what to do if contention arises. For example, passing an authorizing token around can help arbitrate access: without holding the token, a task does not proceed with an operation. Another method could be an election among the tasks that want to execute a mutually exclusive operation. The reader may refer to research papers on distributed mutual exclusion for more details (e.g. see IEEE Computer journals).

One method that works in many cases simply relinquishes control (backs off), waits a small, perhaps random, period of time and attempts access again. For example, in the file-locking case, if a process is holding a lock on fileA and its attempt to lock fileB blocks, the second operation can time out; the process may then relinquish the fileA lock, wait for some time and retry the whole operation.
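A minimal sketch of this back-off idea in C, again with pthread mutexes standing in for the file locks: pthread_mutex_trylock() plays the role of the lock attempt that gives up instead of blocking, and usleep() supplies the small random wait. The function name is a placeholder.

  #include <pthread.h>
  #include <stdlib.h>
  #include <unistd.h>

  extern pthread_mutex_t lockA, lockB;        /* the two "file" locks from before */

  void update_both_files_with_backoff(void)
  {
      for (;;) {
          pthread_mutex_lock(&lockA);
          if (pthread_mutex_trylock(&lockB) == 0)
              break;                           /* got both locks */
          pthread_mutex_unlock(&lockA);        /* back off ... */
          usleep(rand() % 1000);               /* ... wait a little, then retry */
      }

      /* ... write information to fileA and fileB ... */

      pthread_mutex_unlock(&lockB);
      pthread_mutex_unlock(&lockA);
  }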


A Busy-Waiting Problem And Event-Conversion

One of the important design criteria when working out a concurrent system is the avoidance of busy-waiting. Busy-waiting simply means that a task is in some kind of loop, repeatedly polling for an event or for some condition to come true. This takes up CPU cycles and the performance of the system comes down drastically. Usually just one busy-waiting task in a single/multi-processing system is enough to slow down the application considerably. You may see processor utilization graphs shoot up to more than 90%.

A usual problem that protocol converters or tasks that interface with different kinds of message delivery systems have to tackle is how to look for events from these subsystems and manage information flow between them. See an example problem below:

[Figure: a task interfacing a set of network sockets with a set of message queues]

Here a task has to juggle back and forth between a set of network communication sockets and a set of message queues. A message from the network may arrive on any of the sockets, get processed by the task, and be sent to one of the message queues, and vice versa. The operating system provides a system call for detecting the presence of a message in a message queue. Another system call is provided to detect the presence of a packet in a network socket.

Now one may try to code as follows:

In a loop do
    For each message queue
        If there is a message in the queue, process it
    For each socket
        If there is a packet available, process it
End of loop

But this is essentially busy-waiting. If there is nothing to process, the task keeps running through the loop and wastes CPU time. We need the task to “block” or relinquish control in such a case, and come back to activity when any input event occurs. How can we do this?

Most of the time you don’t have to break your head over it. The operating system usually provides you with a system call that lets you wait for messages on a set of message queues. And another which lets you wait for packets on a set of network sockets. Does that help?

Suppose we try to code as follows:

In a loop do
       Wait for a message in any of the message queues 
       Process the message
       Wait for a packet in any of the network sockets
       Process the packet
End of loop

This obviously won't help much. You can't process a network packet until a message is processed, and then vice versa. You might try introducing a time-out on the wait so that after a period of waiting the system call returns and the code proceeds to the next activity. Again, this introduces busy-waiting, though a little less expensive because of the periodic blocking until timeouts. But the blocking delays can also slow down processing. The problem here is that the system calls do not let you wait for events on BOTH the message queues AND the network sockets simultaneously. Unfortunately, this is how many operating systems behave. So what can you do?

[Figure: listener processes multiplexing socket and queue events onto the main task's message queue]

A simple re-structuring can avoid the issue. By introducing two listener processes which use the respective system calls for the message queues and the network sockets, each of them can block indefinitely until a message or packet has arrived. The notification of the arrival can be sent to a message queue used by the main task. The function of the main task is then:

In a loop do
    Wait for a message on the message queue
    If the message says data on a socket,
        read the socket and process the data
    If the message says data on a message queue,
        read the message queue and process the data
End of loop

Writing data to the sockets or message queues will still be done by the main task itself. Its message queue acts as a multiplexer for the two system calls. Obviously, this can be extended to multiplexing many blocking system calls. If you don't even have a system/API call that lets you look for events on a set of sockets (which is a rare case) or message queues, you can introduce one listener process per socket or queue in the above scenario. This is not recommended unless the processes are light-weight (threads); otherwise context-switching time can be a concern.
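The following C sketch, assuming POSIX threads, shows the shape of this restructuring: two listener threads each block on their own kind of event source and convert every arrival into a notification on a single internal queue, which the main task waits on without busy-waiting. The blocking waits on the real sockets and message queues are simulated here with sleep() calls, and a small mutex-and-condition-variable queue stands in for the main task's message queue.

  #include <pthread.h>
  #include <stdio.h>
  #include <unistd.h>

  enum event_type { SOCKET_EVENT, QUEUE_EVENT };

  #define MAXEVT 64

  static enum event_type events[MAXEVT];      /* the internal notification queue */
  static int head, tail, count;
  static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;
  static pthread_cond_t qnotempty = PTHREAD_COND_INITIALIZER;

  static void post_event(enum event_type e)   /* called by the listener threads */
  {
      pthread_mutex_lock(&qlock);
      if (count < MAXEVT) {                   /* drop the notification if the queue is full */
          events[tail] = e;
          tail = (tail + 1) % MAXEVT;
          count++;
          pthread_cond_signal(&qnotempty);    /* wake the main task */
      }
      pthread_mutex_unlock(&qlock);
  }

  static enum event_type get_event(void)      /* called by the main task */
  {
      enum event_type e;

      pthread_mutex_lock(&qlock);
      while (count == 0)                      /* block; no busy-waiting */
          pthread_cond_wait(&qnotempty, &qlock);
      e = events[head];
      head = (head + 1) % MAXEVT;
      count--;
      pthread_mutex_unlock(&qlock);
      return e;
  }

  static void *socket_listener(void *arg)     /* would block on the sockets */
  {
      for (;;) { sleep(1); post_event(SOCKET_EVENT); }   /* simulated arrivals */
      return NULL;
  }

  static void *queue_listener(void *arg)      /* would block on the message queues */
  {
      for (;;) { sleep(2); post_event(QUEUE_EVENT); }    /* simulated arrivals */
      return NULL;
  }

  int main(void)
  {
      pthread_t t1, t2;

      pthread_create(&t1, NULL, socket_listener, NULL);
      pthread_create(&t2, NULL, queue_listener, NULL);

      for (;;) {                              /* the main task's loop */
          if (get_event() == SOCKET_EVENT)
              printf("data on a socket: read it and process it\n");
          else
              printf("data on a message queue: read it and process it\n");
      }
  }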

The same busy-waiting issue can pop up when you deal with third-party toolkits that let you look for events of interest in that sub-system, but do not let you multiplex them with operating system events or other toolkit events. In such cases, introducing concurrency can solve the problem. What we essentially use in this technique is event conversion, for example into messages. A heterogeneous event system is converted into a homogeneous event system by increasing concurrency and adding additional intelligence into the events.


A Producer/Consumer Problem

The previous discussion assumes that the operating system or some toolkit provides a call which will block if there are no interesting events, and wake up when an event occurs. Let's take a simple message queue. You are blocked trying to read a message from the queue, which is empty. Now some other process puts a message there, and immediately your call returns with the message. If you are not reading messages, then when the message queue becomes full, a sender task attempting to insert a message should block until you have read at least one message. How is this done? Of course the operating system or the toolkit does that. But how? You may sometimes have to provide this feature yourself in some application. This is known as the Producer/Consumer problem.

Let's say you are writing a piece of software that gets data from the network and provides it to your client applications. You are providing a set of APIs or calls for the clients to access this data. Whenever the client calls "GetNextDataSet()", you return any stored data to the client. If you don't have any data, you don't let the call return but block the client until you get some data from the network. Now, if the client has not been reading any data for a while and you have been getting lots of data from the network, what will you do? Let's say you have an internal buffer that can store 128 KB. If the buffer becomes full, you send a "quench" packet to the sender so that it stops sending you stuff on the network. When the client starts reading data again, you dispense it from the buffer. When the buffer reaches a particular water-mark (which could be at the top), you send out a "release" packet to the sender, meaning it is okay to send more data.

It is obvious that your activities must run concurrently with that of your client(s). Thus concurrency is a requirement here. How can these activities be coordinated? This is the producer/consumer problem in functional layers. The above task is essentially a layer over the existing network layers to provide a set of APIs to your client. What tools can you use?

There are at least two discernible threads of activity. One, your layer, and then one or more clients calling your API “GetNextDataSet()”. Thus you have a minimum of two “threads” or processes.

Mutexes

If you are not going to use a client-server approach, then all the tasks will attempt simultaneous access to the buffer that holds the incoming network data. Obviously this access must be synchronized. Only one task should be allowed to modify the buffer data structures at a time, to avoid data corruption. In short, you need some mutual exclusion objects (mutexes) to coordinate the access. Mutexes are provided by operating systems. They are like locks: no two tasks can simultaneously "own" a mutex. So before a task accesses the buffer, it must acquire a mutex (let's call it the buffer-mutex). After its work is done it must release the buffer-mutex. The section of code thus protected from concurrent activity is known as a critical section.
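A minimal sketch of such a critical section, assuming POSIX threads; the actual buffer manipulation is left as a comment.

  #include <pthread.h>

  static pthread_mutex_t buffer_mutex = PTHREAD_MUTEX_INITIALIZER;

  void access_buffer(void)
  {
      pthread_mutex_lock(&buffer_mutex);      /* acquire the buffer-mutex */
      /* ... inspect or modify the shared buffer data structures ... */
      pthread_mutex_unlock(&buffer_mutex);    /* release the buffer-mutex */
  }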

Semaphores

Many times tasks have to signal each other while coordinating their work. Tasks may have to wait until a signal is received before starting certain jobs. For example, if a client calls "GetNextDataSet" and the code finds that no data exists in the buffer, it must go to sleep, waiting for a signal from your layer-task that it is okay to go ahead and look at the buffer again. This kind of signaling back and forth among tasks can be accomplished using a tool known as a semaphore. The name comes from flag signaling, used, for example, between a person on the ground and the pilot of an approaching or departing aircraft. Semaphores are used to signal and co-ordinate activities. They can also be used as mutexes (in fact, the most common use for semaphores is as mutexes).

[Figure: the Buffer semaphore and the Client semaphore]

In the above diagram, the Buffer semaphore is a mutual exclusion lock (or mutex) that can be acquired and released. It will have an integer value which specifies whether it is acquired (0) or available (1). To acquire the semaphore you issue a P(Buffer-semaphore) and to release it you issue a V(Buffer-semaphore). Similarly for the Client semaphore. A P() operation can succeed only if the value of the semaphore is 1 and it will immediately set it to 0. Otherwise the P() will wait until the value becomes 1. A V() operation sets the value to 1 and signals any pending P() operations on the semaphore to try and acquire it.

The Buffer semaphore must be acquired by any party wishing to modify the shared buffer. This is to avoid data corruption. So whenever the network layer thread wants to add data to the buffer, or the client wants to take data out of the buffer, a P(Buffer-semaphore) must be issued. If the semaphore is already grabbed, the call will block until a V(Buffer-semaphore) is issued. At the end of the desired operation a V(Buffer-semaphore) is used to release the buffer.

A simple logic for the “GetNextDataSet()” API or library call from the client would do the following:

GetNextDataSet, first version (part of the client application):

  1. P(Buffer-semaphore), i.e. wait until the buffer is acquired
  2. Take out data, if available, from the buffer
  3. V(Buffer-semaphore), i.e. release the mutex over the buffer
  4. If some data was taken out, return the data; end of call
  5. P(Client-semaphore), i.e. wait until the semaphore is released and grab it
  6. Go to step 1

Steps 1 through 4 should be obvious: they simply lock the buffer before taking the data out. Now, if there is nothing to take out, what should the client do? It should block, waiting for a signal to go back and inspect the buffer again. Who can provide this signal? The network thread, of course, when it inserts new data into the buffer. The Client-semaphore will initially have a value of 0 (meaning acquired). So when the client app calls P() on the semaphore, it will simply block. The V() operation will be done by the network layer thread whenever it inserts new data into the buffer.

The Network Layer Thread (with busy waiting!):

In a loop do
    If buffer not full, get next data from the network
    If data received from the network
        P(Buffer-semaphore)      i.e. acquire mutex
        Insert data into buffer
        V(Buffer-semaphore)      i.e. release mutex
        V(Client-semaphore)      i.e. signal any pending P() to proceed
    End if
End loop

Problem: in the above technique, the signaling is one-way. The network layer thread wakes up the client application through a V() operation when something is entered into the buffer. If the client is waiting on a P(), it will wake up and proceed to inspect the buffer. But what if the clients are not inspecting the buffer and the buffer is full? The above logic says that the network layer thread repeatedly keeps checking until the buffer has space inside. This is busy-waiting and must be avoided. If the buffer is full, the network layer thread in turn must go to sleep, to be awakened by the clients when they take out data. This is reverse signaling from the clients to the network thread. We can use another semaphore for this.

[Figure: a Network layer semaphore added for reverse signaling]

So what is additionally required of the client is simply this: whenever it takes data out of the buffer, it signals the network layer thread using a V(Network layer semaphore). What about the network layer thread? When it finds that the buffer is full, it must send out a "quench" packet to the sender and block on a P(Network layer semaphore). It will come out of the P() only when a client has taken data out of the buffer. It then sends a "release" packet on the network and continues operation.

Note that both the Client semaphore and the Network thread semaphore start off with value ‘0’.

GetNextDataSet (modified slightly):

  1. P(Buffer-semaphore), i.e. wait until the buffer is acquired
  2. If data is available:
        take out the data
        V(Network layer semaphore)
     End if
  3. V(Buffer-semaphore), i.e. release the mutex over the buffer
  4. If data was taken out, return the data; end of call
  5. P(Client-semaphore), i.e. wait until the semaphore is released and grab it
  6. Go to step 1

Network Layer Thread:

In a loop do
    If buffer full
        Send out "quench" packet
        P(Network thread semaphore)
        Send out "release" packet
    End if

    Receive data from the network

    P(Buffer-semaphore)      i.e. acquire mutex
    Insert data into buffer
    V(Buffer-semaphore)      i.e. release mutex
    V(Client-semaphore)      i.e. signal any pending P() by clients to proceed
End loop

In the algorithm above, the "release" packet is sent as soon as the buffer-full condition disappears rather than when a water-mark is reached; adding a water-mark is an improvement you can implement. You can also improve the algorithm so that not every client operation requires signaling.
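For concreteness, here is a C sketch of the logic above, assuming POSIX threads and semaphores: sem_wait() plays the role of P() and sem_post() the role of V(). The buffer is reduced to a simple array, recv_from_network(), send_quench() and send_release() are hypothetical stand-ins for the real network operations, and the sketch folds in the improvements just mentioned: the full-buffer check is re-made under the mutex before inserting, and the client signals the network thread only on the full-to-not-full transition.

  #include <pthread.h>
  #include <semaphore.h>

  #define BUFSIZE 128

  int  recv_from_network(void);              /* hypothetical: blocking network receive */
  void send_quench(void);                    /* hypothetical: ask the sender to pause */
  void send_release(void);                   /* hypothetical: ask the sender to resume */

  static int buffer[BUFSIZE];
  static int count;                          /* number of items currently buffered */

  static sem_t buffer_sem;                   /* the mutex: initialized to 1 */
  static sem_t client_sem;                   /* "new data inserted": initialized to 0 */
  static sem_t network_sem;                  /* "space available again": initialized to 0 */

  void init_semaphores(void)
  {
      sem_init(&buffer_sem, 0, 1);
      sem_init(&client_sem, 0, 0);
      sem_init(&network_sem, 0, 0);
  }

  int GetNextDataSet(void)                   /* called by a client */
  {
      int data;

      for (;;) {
          sem_wait(&buffer_sem);             /* P(Buffer-semaphore) */
          if (count > 0) {
              data = buffer[--count];        /* take out the data */
              if (count == BUFSIZE - 1)      /* the buffer was full: wake the */
                  sem_post(&network_sem);    /* network thread, V()           */
              sem_post(&buffer_sem);         /* V(Buffer-semaphore) */
              return data;
          }
          sem_post(&buffer_sem);             /* V(Buffer-semaphore) */
          sem_wait(&client_sem);             /* P(Client-semaphore): sleep until data arrives */
      }
  }

  void *network_layer_thread(void *arg)
  {
      int data;

      for (;;) {
          data = recv_from_network();        /* blocking receive */

          sem_wait(&buffer_sem);             /* P(Buffer-semaphore) */
          while (count == BUFSIZE) {         /* no space: quench the sender and sleep */
              sem_post(&buffer_sem);
              send_quench();
              sem_wait(&network_sem);        /* P(Network thread semaphore) */
              send_release();
              sem_wait(&buffer_sem);
          }
          buffer[count++] = data;            /* insert data into the buffer */
          sem_post(&buffer_sem);             /* V(Buffer-semaphore) */
          sem_post(&client_sem);             /* V(Client-semaphore) */
      }
      return NULL;
  }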

Supporting many clients

Typically your network layer thread would have to support more than one client. So what extensions can we propose to the above scheme? One way would be to use a set of client semaphores, one for each client. Whenever new data is entered into the buffer, the network layer thread can do a V() operation on all the registered client semaphores. All the blocked clients would then be activated to go and inspect the buffer; their inspections get serialized through the Buffer-semaphore.

Another way is to use a counting semaphore. This is a semaphore that can carry values other than 0 and 1. If there are fifteen entries in the buffer, the value of the semaphore is set to 15. This means that 15 P() operations will succeed, bringing the value down to 0; further P()s will block. Basically, if you set the value of the semaphore to the number of entries in the buffer, only that many P()s will succeed. A V() operation increments the value of a counting semaphore instead of setting it to 1. In short, the network layer thread does a V() operation on the counting semaphore whenever it inserts new data, and the clients execute a P() on the counting semaphore before inspecting the buffer. You don't need separate semaphores for each client.
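A small sketch of the counting-semaphore variant with POSIX semaphores: the semaphore's value tracks the number of entries, so a client blocks in its P() exactly when the buffer is empty. The buffer operations themselves are left as comments.

  #include <pthread.h>
  #include <semaphore.h>

  static sem_t items;                        /* counts the entries in the buffer */
  static pthread_mutex_t buffer_mutex = PTHREAD_MUTEX_INITIALIZER;

  void init(void)
  {
      sem_init(&items, 0, 0);                /* the buffer starts out empty */
  }

  void producer_insert(void)                 /* the network layer thread */
  {
      pthread_mutex_lock(&buffer_mutex);
      /* ... insert one entry into the shared buffer ... */
      pthread_mutex_unlock(&buffer_mutex);
      sem_post(&items);                      /* V(): one more entry available */
  }

  void consumer_take(void)                   /* a client */
  {
      sem_wait(&items);                      /* P(): blocks while there are no entries */
      pthread_mutex_lock(&buffer_mutex);
      /* ... take one entry out of the shared buffer ... */
      pthread_mutex_unlock(&buffer_mutex);
  }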

However, there is one problem here. Typically your clients won't take just any data that happens to be in the buffer: GetNextDataSet() should only return data that is meant for that particular client. Therefore, even if a client gets access to the buffer through the counting semaphore (decrementing the count), it may not take any data out. It will then have to increment the count back using a V(). Does that sound acceptable? Well, it isn't. Suppose none of the data in the buffer is for the clients that have called GetNextDataSet() at this time. Since no data is taken out, and the semaphore value thus remains non-zero, the P() operations will always succeed, and all these clients will be busy-waiting, repeatedly inspecting the buffer even though they are using semaphores! This is the trouble with textbooks that teach counting semaphores using the Producer-Consumer problem: they assume that the consumer will always take what the producer has to offer.

In such scenarios, it may be better to use a set of binary semaphores, one for each registered client. You can improve the logic and avoid V() operations on all the client semaphores if the network layer thread simply identifies the destination of the data it is inserting into the buffer and does a V() operation only on the semaphore of that particular client. You may also want to have separate queues for each client instead of a common shared buffer, so that the network layer thread inserts data into the appropriate queue. Here the queues are still shared, but only between a single client and the network layer thread. What can you do if some queues become full and others don't? You can define a system-wide limit on the size of all the queues put together to make the "block-when-full" condition easier to program. By this definition, there is no specific limit on individual queue size. Any queue may grow until the system-wide limit that you define has been reached. That is the point where your network layer thread blocks on its semaphore.

Some variations to think about: how can you support broadcast to all the clients? How can you support a "peek" option, i.e. inspect but do not remove the data?

Summary:

We have seen that semaphores can be used as mutual exclusion locks (mutexes), for signaling between tasks for controlled access as well as allow up to a specified number of tasks to do a certain activity (counting semaphores). The producer-consumer(s) problem is a good example where semaphores can be used to signal back and forth between tasks.


Monitors And Synchronized Objects

Programmers usually have trouble remembering semaphore operations and how to use them. Nowadays, languages that support concurrency usually provide language extensions or support libraries/classes that help the programmer concentrate on the end rather than the means. One of the common extensions provides an implementation of a concept known as a "monitor".

A monitor is basically a set of procedures (or functions) grouped as an entity (like a class in C++). Many tasks (threads) may call the procedures “inside” the monitor. Their calls are automatically synchronized. That is, the monitor will allow only one thread/task to enter it at a time. When one thread/task is executing a procedure, others will be blocked and queued up, waiting to enter and execute some procedure. The operations are transparent to the tasks. They simply call the procedures as normally done. No need to bother with semaphores or mutexes.

But what if a thread has to block inside a procedure and let another thread in? For this, the monitor provides two operations: wait and signal. A thread that has entered a monitor can block and allow another thread in by using a 'wait variable-name' operation. Another thread entering the monitor can later issue 'signal variable-name' and exit the monitor. The signal will cause the blocked thread to come out of the 'wait' and continue.
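In a language like C, which has no monitor construct, the same pattern is commonly approximated with a mutex acting as the monitor lock and a condition variable providing wait/signal. A minimal sketch, with the shared-data handling left as comments:

  #include <pthread.h>

  static pthread_mutex_t monitor_lock = PTHREAD_MUTEX_INITIALIZER;
  static pthread_cond_t condition = PTHREAD_COND_INITIALIZER;
  static int data_ready;                      /* the condition being waited on */

  void monitor_consume(void)
  {
      pthread_mutex_lock(&monitor_lock);      /* enter the monitor */
      while (!data_ready)                     /* 'wait condition': releases the lock and blocks */
          pthread_cond_wait(&condition, &monitor_lock);
      /* ... use the shared data ... */
      data_ready = 0;
      pthread_mutex_unlock(&monitor_lock);    /* leave the monitor */
  }

  void monitor_produce(void)
  {
      pthread_mutex_lock(&monitor_lock);      /* enter the monitor */
      /* ... put the shared data in place ... */
      data_ready = 1;
      pthread_cond_signal(&condition);        /* 'signal condition': wakes a waiter */
      pthread_mutex_unlock(&monitor_lock);    /* leave the monitor */
  }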

Monitors are best used in object-oriented languages, to control access to shared objects. A synchronized object can thus make sure that all accesses to its internal data are serialized. This synchronization can work at the class level or the object level, with an additional possibility of method-level synchronization for each. If working at the class level, concurrent calls to any of the methods in the class, regardless of the object instance, will be serialized. This uses a single lock for all the object instances of a class. If working at the object level, concurrent calls to any of the methods of a particular object instance will be serialized. This allows different tasks to invoke the same method in different objects of the same class. If working at the method level, the programmer has to specify which methods (procedures/functions) need to be synchronized, and serialization applies only when tasks invoke those methods. This is more efficient but needs more involvement in the logistics. Class/object-level synchronization applies to all methods and is easy to use, but all methods are protected, giving rise to overheads in method invocation whether you want them or not.

As an example, Java provides a 'synchronized' method option that functions as a monitor. Java maintains a lock for each object instance that contains one or more synchronized methods. Whenever a thread invokes a synchronized method, this lock needs to be acquired. When the thread leaves the method, the lock is released. Java also allows one synchronized method to call another synchronized method in the same object without problems. It also provides the methods wait, notify and notifyAll for the corresponding monitor functions.

Problems with monitors

One problem with monitors is applicability. If your tasks do not share code and data space, you cannot use monitors. Let's say that the tasks are not threads, but separate processes. They can set up some shared memory using whatever shared-memory creation features are available under the operating system. They can share semaphores, since semaphores are usually kernel (operating system) resources. But they cannot normally share their objects (code plus data) with one another. One process cannot directly invoke a method of an object in another process without some special support from the operating system or some toolkit libraries. Therefore monitors are usually restricted to multi-threaded applications.

Again, monitors via synchronized objects can result in restricted concurrency and unnecessary overheads. Suppose that a synchronized object A contains objects B, D and E where object B contains object C.

[Figure: a synchronized object A containing objects B (which contains C), D and E]

Since object A is synchronized, no access to objects B, C, D or E can be concurrent. It may be that only B needs to be under access control. Objects B, D and E may not have any common data and may have the potential to be invoked concurrently. But the monitor serializes the access. Therefore the way you design your classes with monitors is of significance. Also, all method invocations may result in locking and unlocking operations whether they require them or not (if not using method-level synchronization). This means additional context-switching per method invocation, which can slow down your calls by anywhere in the order of 2 to 10 times or more per invocation.


Message-Passing VS. Shared Memory

There are two fundamental approaches to concurrent programming: the use of messages and the use of shared memory. Messages are usually used when processes are distributed, or when it makes sense to follow a client-server or peer-to-peer approach in the design of the application. The shared memory model is used in most multi-threaded applications, where all the threads have access to the code and global data of a process and pools of shared data have to be maintained. The shared memory model has to deal with critical sections and possibilities of data corruption, and generally provides an everybody-help-yourself-from-the-table approach to sharing information.

Messages are usually sent to message queues (there are cases where messages are like remote invocation of methods). And message queues can get filled up. When this happens dead-locks may result. For example, processes A and B may be exchanging messages as follows:

[Figure: processes A and B exchanging messages through queues Q1 and Q2, with B also fed by interface I]

Now assume that Q1 is full. Process A, attempting to insert a message into Q1, blocks. Now assume that process B is suddenly flooded with messages from the interface I that it has to send to process A. In doing so, B fills up Q2 and blocks. It cannot be unblocked until process A takes a message from Q2. But A is blocked waiting for B to process a message from Q1. This is a scenario that can occur in high-volume traffic cases where the code that does the message processing in B is not properly written. Most of the time the code will work, until stressed. The above scenario can be avoided if Q1 is processed even when there is a flood of messages from I. The difference in code can be as simple as relocating an "else" statement.

Another problem is traffic congestion. The system, unless properly designed, will allow inputs to keep coming even though it can't process them at that rate. The message queues start growing and fill up, and then the system starts accepting inputs at a much slower rate. (This is true for any queued system.) Proper shutdown of the system must also be coordinated, so that either all queues are drained or a proper termination point is designed.


Synchronous And Asynchronous Messages

A common issue with message passing between tasks revolves around the very nature of the transmission and receipt. There are two fundamental ways of doing this, over which many other abstractions have been built.

Synchronous Messages

Synchronous essentially implies that the sender of the message waits until the message is actually accepted by the recipient. Unless and until the recipient reads in the message, the sender remains blocked.

This is an effective technique to synchronize two tasks. Once the message has been read, both tasks can proceed. In short, the first one to get to the message-exchange point waits for the other one to catch up. Rendezvous is a mechanism in Ada which uses a similar principle, but at the language level. Synchronous message sending is not very common in most languages, but there are many variants of it at the language level, the most famous being RPC (Remote Procedure Call). RPC essentially allows one task to invoke a procedure in another task. The caller waits for the call to finish and gets the reply before proceeding. Here synchronization is implicit, but it need not be implemented using synchronous messaging primitives; request/reply semantics will do.

Asynchronous Messages

Asynchronous messages are those where the sender continues immediately, or at least after making sure that the message has reached the other task’s message buffer (not necessarily read). This allows a program to use network bandwidth more effectively, and also overlap computation with message transmission. On the other hand, if any replies are involved, the sender now has to code additionally for collecting replies from all its clients.

Asynchronous messages can also be limited by buffers. That is, a sender can't send too many messages to a client which hasn't read anything yet. A quenching mechanism acts back to block the sender when it attempts to send. This is how transport protocols like TCP/IP usually work.

Sending Synchronous Messages using Asynchronous Primitives

If your programming system supports only asynchronous messages and you need synchronous messages, then all you have to do is create a request/reply mechanism. Every message sent out must be acknowledged by the recipient before another is sent to the same recipient. This simply means writing a set of library routines that will be used by both senders and receivers.
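A sketch of such a request/acknowledge wrapper in C. Here async_send() and async_receive() are hypothetical asynchronous messaging primitives, and msg_t and MSG_ACK are likewise made up for the illustration; the sketch also assumes that the next message received from a peer after a send is its acknowledgement.

  /* Hypothetical message type and primitives, invented for this sketch. */
  typedef struct { int type; int payload; } msg_t;

  enum { MSG_DATA, MSG_ACK };

  void async_send(int dest, const msg_t *m);     /* returns immediately */
  void async_receive(int from, msg_t *m);        /* blocks until a message arrives */

  void sync_send(int dest, const msg_t *m)
  {
      msg_t reply;

      async_send(dest, m);                 /* send, then ...                      */
      async_receive(dest, &reply);         /* ... block until the acknowledgement */
  }

  void sync_receive(int from, msg_t *m)
  {
      msg_t ack;

      async_receive(from, m);              /* read the message ...       */
      ack.type = MSG_ACK;
      ack.payload = 0;
      async_send(from, &ack);              /* ... and unblock the sender */
  }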

Sending Asynchronous Messages using Synchronous Primitives

This is the trickier direction. One has to resort to using threads for it, and there are optimizations one can make to reduce the thread overhead.

Essentially, the sender has to create a thread that will send the message to the recipient. To avoid creating threads every time, a thread pool may be created initially and a currently free thread assigned the task of transmitting the data synchronously. It may not be necessary to create as many threads as there are recipients: not all clients need data simultaneously. Even if they occasionally do and all the threads are busy, the resulting delay is usually negligible, since any data waiting to be sent will go out as soon as a thread becomes free. Maintaining a thread pool is the most effective solution here.
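A sketch of the simplest thread-per-message version in C with POSIX threads; as noted above, a production design would draw the worker from a thread pool instead of creating a thread for every send. sync_send_to() is a hypothetical blocking (synchronous) send primitive.

  #include <pthread.h>
  #include <stdlib.h>

  void sync_send_to(int dest, const char *data);   /* hypothetical blocking send */

  struct send_job { int dest; const char *data; };

  static void *send_worker(void *arg)
  {
      struct send_job *job = arg;

      sync_send_to(job->dest, job->data);   /* may block; the caller is unaffected */
      free(job);
      return NULL;
  }

  void async_send_to(int dest, const char *data)
  {
      pthread_t t;
      struct send_job *job = malloc(sizeof *job);

      if (job == NULL)
          return;
      job->dest = dest;
      job->data = data;
      pthread_create(&t, NULL, send_worker, job);
      pthread_detach(t);                    /* the caller continues immediately */
  }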


Emergent Phenomena

This is a rarely touched-upon "feature" in concurrent programming. As mentioned earlier, most of the troubles arising in concurrent programming are emergent. That is, you don't code for them, but they happen due to complex interactions between your components. And sometimes they can be pretty tricky to solve, and may show up only in special circumstances. An example:

A multi-threaded program works fine with all the debug statements (e.g. printf statements). When they are removed, the program generates a memory fault. Obviously tough to debug! The problem turned out to be this: the programmer was allocating memory frequently, and at one point was using a few bytes more than what was allocated. Thus he/she was writing into unallocated memory. But most of the time this "unallocated" memory IS part of allocated memory, since allocators like malloc() give you a chunk that may be larger than what you request. So sometimes you can write beyond your allocation without problems. It all depends on the allocation/free sequences your program makes and how the internal free list is managed. Removing the debugging statements changed the context-switching pattern of the threads, resulting in a drastically different allocation sequence between the threads. At some point, a malloc() returned exactly the size requested for the buggy line of code and caused a memory fault!

There are also constructive emergent effects one can make use of in programming. These are tough to design, and don't suit many applications. Sometimes, each component or thread following a simple set of rules can result in a complex behavior for the whole application. For example, consider a simulation of a flock of birds flying in 3D space, avoiding obstacles. It sounds like a tough job to keep the birds together in a flock, split the flock when obstacles appear, control their speed and formulate a flight route. But it turns out that very simple rules followed by each "bird" can give you all that. All that each "bird" needs to do is: a) try to keep close to the nearest bird, b) try to maintain the same velocity as the nearest bird, and c) avoid obstacles by turning at some angle. Such a simple set of rules can generate a complex scenario of bird-flocking behavior. Similar techniques can be used in "growing" trees and branching them out to generate a forest. Simple rules applied to the leaves and trunks for growth result in very complex branching patterns. If each tree is a component, a forest's appearance will depend on complex interactions between individual trees. All a result of emergent effects, known as self-organization.

CONCLUSION

We have taken a little tour through the landscape of concurrent programming, and looked at certain issues particular to multi-threaded/multi-processing systems. These are not specific to any particular platform but can be applied to a variety of programming systems out there. There are further connotations for other systems like distributed systems, parallel-processing systems and also real-time systems which will be explored in another forthcoming article.


Appendix A - Threads

This section is meant as an introduction to threads and what they usually "look like". Threads are also known as "light-weight processes".

What are threads?

Threads are basically units of execution. Operating systems usually structure applications into processes. A process is a program in execution. A process has code and data and has a thread of execution that begins at some entry point in the code and works its way through various sections of the code.

Sometimes it is desirable to have concurrent activities going on inside a process. For example, the process might want to do some background processing while waiting for more user input. Instead of writing code that looks for keyboard or mouse events all over the place during this processing, it is a lot easier to separate the activities into two executions, or threads. Every thread has an entry point where it starts execution. One thread will wait in a user-input call and then process any results, while the other thread does the background processing of the application. The threads share the same code and global data inside the process. However, most thread implementations do not share local data (local variables). Why?

A thread implementation can usually be defined as:

    <thread> = <CPU register contents> + <execution stack> + <code>

Suppose that two threads are executing the following function:

 void print_many_times(int count)
 {
     int i;
     for (i = 0; i < count; i++)
         printf("Hello World %d!\n", i);    // assuming that printf() is re-entrant
 }

Both the threads should produce output with numbers 0 to count-1, and both the threads should be able to specify separate values for count, even if they are running at the same time. The variables ‘count’ and ‘i’ must be separate instances for each thread.

In block-nested languages like Pascal, C and C++, the local variables and function parameters are allocated on the process’s stack. Each function invocation results in the stack growing, with all local variables for that invocation residing in a stack frame.

[Figure: two examples of stack growth]

The above figure shows two examples of stack growth. In the one on the left, function A has called function B which has in turn called function C. The stack contains the local variables and parameters of A, B and C in the respective stack frames. Note that the stack frames are of different sizes since the number of local variables and parameters can differ in each of A, B and C. The figure on the right is a similar stack frame when function X calls function Y which in turn calls function Z. When the call to Z returns, the stack frame labeled Z will be popped off the stack.

Note that processes usually have only one stack per execution. The stack also contains the return addresses so that execution can resume the instruction subsequent to a call. In short, a stack contains almost all you need to record an execution and come back to it (except for the CPU registers).

If a process can maintain multiple stacks that represent multiple execution points and then switch between them using a timer, then we essentially have pseudo-concurrent executions. These are known as threads. In the earlier days, such switching was done at the process level, by libraries that used timer events and stack and context substitution to switch back and forth. But now this is no longer required and most operating systems provide kernel support for threads. User-level context-switching with libraries is not required. Kernel level threads are more powerful in that all the threads can use system calls, library calls or APIs concurrently. User-level threads have the limitation that they must serialize library calls or APIs since the libraries or APIs are not written for re-entrance from the same process. For example, memory allocation calls like malloc() in C are not re-entrant. Operating systems do provide synchronized APIs or library versions in such cases, and the calls are then said to be thread-safe.

Finally getting back to the program above, by the use of separate stacks each thread will print out the “Hello World xxx!” messages with integers ranging from 0 to count-1. The count value for each thread may be different. The stack stores the values of the variables for each thread.
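As a sketch, this is how two such threads might be created with POSIX threads, each with its own count (and therefore its own stack copies of count and i):

  #include <pthread.h>

  void print_many_times(int count);          /* the function shown above */

  static void *thread_body(void *arg)
  {
      print_many_times(*(int *)arg);         /* each thread runs with its own count */
      return NULL;
  }

  int main(void)
  {
      pthread_t t1, t2;
      int count1 = 5, count2 = 10;           /* separate counts for the two threads */

      pthread_create(&t1, NULL, thread_body, &count1);
      pthread_create(&t2, NULL, thread_body, &count2);

      pthread_join(t1, NULL);
      pthread_join(t2, NULL);
      return 0;
  }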

This also applies to objects. If a method inside a class uses local (non-static) variables, then two threads using the same global object will still have separate instances of those local variables! The global data, however, is not kept on the stack and can be seen by all the threads.

Preemption

Another interesting property of a thread implementation is how the scheduling is done. If threads are supported at user-level using libraries that latch onto timers and switch stacks and context, these can have certain problems.

User-level thread implementations: the problem is that the kernel does not know anything about the existence of the threads. So if one thread in your application makes a blocking call to the operating system, the whole process goes to sleep. That means all threads in your application go to sleep and context switching stops altogether for a while. But while the process is active, the context switching between the threads can be controlled. The switching may occur at explicit library or API calls that the thread makes (known as non-preemptive scheduling), or the switching may interrupt a thread at any time (pre-emptive scheduling). With the former, you don't have to worry about critical sections, provided that no context-switching calls are made in the code area that should be kept safe. But this kind of switching limits the scheduling: since control is explicitly released only through certain APIs, a thread can cause all other threads to hang by refusing to give up the CPU.

Kernel-level thread implementations: if the kernel supports threads, then threads are indeed like processes in their own right. The operating system may refer to them as threads or processes, and they almost always use pre-emptive scheduling. Since the kernel provides for them, every thread may issue operating system calls without blocking its sibling threads. However, other application-level APIs or calls may not be designed to be re-entrant even though the kernel is. Therefore, unless such APIs are documented as thread-safe, their concurrent use may cause problems.

Priorities

Whenever there are multiple threads, the operating system or the library usually provides for some pecking order among them, by assigning a priority to each thread. The priority is usually a number. The way priorities are used in scheduling can be tricky. Certain thread packages let the programmer decide on the priority scheme to use. In general, there are a number of different policies on thread scheduling. Most of them will say:

"The highest priority thread always runs."

Sounds good. But then you have questions. When does it block? What if it never executes a blocking call for a long time? What if there is more than one thread with the same "highest" priority? When the statement says "always", does it really mean always? How about starvation of lower priority threads? Is there any degradation or raising of priority? Answers to these questions vary between thread implementations and will affect the way you use priorities. The most common solution is to avoid priorities and use the same priority for all threads. But priorities are essential in some situations, especially in real-time applications. Here events need to be handled on a priority basis, and certain events must be processed within specified time limits. If that is not done, the processing is considered wasted.

Therefore real-time operating systems usually have the policy: "The highest priority thread/process always runs until it blocks or runs to completion." The runs-to-completion part is to ensure that the thread/process can finish its task without losing the CPU to other tasks. If there are higher-priority events, assign them to higher-priority threads/processes. Among threads at the same priority level, round-robin scheduling is usually followed. But the moment a higher priority thread is ready to run, the currently running thread will be switched out.

In this scheme, you might think that the highest priority thread/process need not worry about using critical sections when modifying shared data, since you can code it such that the critical section areas always run to completion. Wrong. Even though the thread does not get interrupted, it might have interrupted another lower priority thread which was executing in the critical section area. And corruption results.
