Chapter 4 Flashcards

1
Q

A thread is a basic unit of CPU utilization

A

it comprises a thread ID, a program counter (PC), a register set, and a stack. It shares with other threads belonging to the same process its code section, data section, and other operating-system resources, such as open files and signals. A traditional process has a single thread of control. If a process has multiple threads of control, it can perform more than one task at a time. Figure 4.1 illustrates the difference between a traditional single-threaded process and a multithreaded process.

2
Q

Most operating system kernels are also typically multithreaded

A

multithreaded. As an example, during system boot time on Linux systems, several kernel threads are created. Each thread performs a specific task, such as managing devices, memory management, or interrupt handling. The command ps -ef can be used to display the kernel threads on a running Linux system. Examining the output of this command will show the kernel thread kthreadd (with pid = 2), which serves as the parent of all other kernel threads.
Many applications can also take advantage of multiple threads, including basic sorting, trees, and graph algorithms. In addition, programmers who must solve contemporary CPU-intensive problems in data mining, graphics, and artificial intelligence can leverage the power of modern multicore systems by designing solutions that run in parallel.

3
Q

Benefits of multithreading

A
  1. Responsiveness. Multithreading an interactive application may allow a program to continue running even if part of it is blocked or is performing a lengthy operation, thereby increasing responsiveness to the user.
    This quality is especially useful in designing user interfaces. For instance, consider what happens when a user clicks a button that results in the performance of a time-consuming operation. A single-threaded application would be unresponsive to the user until the operation had been completed. In contrast, if the time-consuming operation is performed in a separate, asynchronous thread, the application remains responsive to the user (see the sketch after this list).
  2. Resource sharing. Processes can share resources only through techniques such as shared memory and message passing. Such techniques must be explicitly arranged by the programmer. However, threads share the memory and the resources of the process to which they belong by default.
    The benefit of sharing code and data is that it allows an application to have several different threads of activity within the same address space.
  3. Economy. Allocating memory and resources for process creation is costly. Because threads share the resources of the process to which they belong, it is more economical to create and context-switch threads.
    Empirically gauging the difference in overhead can be difficult, but in general thread creation consumes less time and memory than process creation. Additionally, context switching is typically faster between threads than between processes.
  4. Scalability. The benefits of multithreading can be even greater in a multiprocessor architecture, where threads may be running in parallel on different processing cores. A single-threaded process can run on only one processor, regardless of how many are available. We explore this issue further in the following section.
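A minimal Java sketch of the responsiveness pattern from item 1; the class name and the sleep standing in for the lengthy operation are illustrative assumptions, not from the text:

public class ResponsiveDemo {
    public static void main(String[] args) {
        // Perform the time-consuming operation in a separate, asynchronous thread
        Thread worker = new Thread(() -> {
            try {
                Thread.sleep(3000); // stand-in for a lengthy operation
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
            }
            System.out.println("lengthy operation finished");
        });
        worker.start();

        // The main thread stays free to respond to the user in the meantime
        System.out.println("still responsive to the user");
    }
}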
4
Q

Multicore Programming

A

A later, yet similar, trend in system design is to place multiple computing cores on a single processing chip where each core appears as a separate CPU to the operating system (Section 1.3.2). We refer to such systems as multicore, and multithreaded programming provides a mechanism for more efficient use of these multiple computing cores and improved concurrency. Consider an application with four threads. On a system with a single computing core, concurrency merely means that the execution of the threads will be interleaved over time (Figure 4.3), because the processing core is capable of executing only one thread at a time. On a system with multiple cores, however, concurrency means that some threads can run in parallel, because the system can assign a separate thread to each core (Figure 4.4).
Notice the distinction between concurrency and parallelism in this discussion. A concurrent system supports more than one task by allowing all the tasks to make progress. In contrast, a parallel system can perform more than one task simultaneously. Thus, it is possible to have concurrency without parallelism.
Before the advent of multiprocessor and multicore architectures, most computer systems had only a single processor, and CPU schedulers were designed to provide the illusion of parallelism by rapidly switching between processes, thereby allowing each process to make progress. Such processes were running concurrently, but not in parallel.

5
Q

Concurrency vs Parallelism

A

Concurrency is about multiple tasks that start, run, and complete in overlapping time periods, in no specific order. Parallelism is about multiple tasks, or subtasks of the same task, that literally run at the same time on hardware with multiple computing resources, such as a multicore processor. As you can see, concurrency and parallelism are similar but not identical.

6
Q

In general, five areas present challenges in programming for multicore systems:

A
  1. Identifying tasks. This involves examining applications to find areas that can be divided into separate, concurrent tasks. Ideally, tasks are independent of one another and thus can run in parallel on individual cores.
  2. Balance. While identifying tasks that can run in parallel, programmers must also ensure that the tasks perform equal work of equal value. In some instances, a certain task may not contribute as much value to the overall process as other tasks. Using a separate execution core to run that task may not be worth the cost.
  3. Data splitting. Just as applications are divided into separate tasks, the data accessed and manipulated by the tasks must be divided to run on separate cores.
  4. Data dependency. The data accessed by the tasks must be examined for dependencies between two or more tasks. When one task depends on data from another, programmers must ensure that the execution of the tasks is synchronized to accommodate the data dependency. We examine such strategies in Chapter 6.
  5. Testing and debugging. When a program is running in parallel on multiple cores, many different execution paths are possible. Testing and debugging such concurrent programs is inherently more difficult than testing and debugging single-threaded applications.
7
Q

AMDAHL’S LAW

A

Amdahl's Law is a formula that identifies potential performance gains from adding additional computing cores to an application that has both serial (nonparallel) and parallel components. If S is the portion of the application that must be performed serially on a system with N processing cores, the law gives the bound

speedup ≤ 1 / (S + (1 − S)/N)

As an example, assume we have an application that is 75 percent parallel and 25 percent serial. If we run this application on a system with two processing cores, we can get a speedup of 1.6 times. If we add two additional cores (for a total of four), the speedup is 2.28 times.
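The 1.6 and 2.28 figures can be checked with a few lines of Java; a minimal sketch (the method name is an assumption):

public class Amdahl {
    // Upper bound on speedup with serial fraction s on n cores: 1 / (s + (1 - s)/n)
    static double speedup(double s, int n) {
        return 1.0 / (s + (1.0 - s) / n);
    }

    public static void main(String[] args) {
        System.out.println(speedup(0.25, 2)); // 1.6
        System.out.println(speedup(0.25, 4)); // 2.2857... (about 2.28)
    }
}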

8
Q

Types of Parallelism

A

In general, there are two types of parallelism: data parallelism and task parallelism. Data parallelism focuses on distributing subsets of the same data across multiple computing cores and performing the same operation on each core. Consider, for example, summing the contents of an array of size N. On a single-core system, one thread would simply sum the elements [0] … [N − 1].
On a dual-core system, however, thread A, running on core 0, could sum the elements [0] …[N∕2 − 1] while thread B, running on core 1, could sum the elements [N∕2] …[N − 1]. The two threads would be running in parallel on separate computing cores.
Task parallelism involves distributing not data but tasks (threads) across multiple computing cores. Each thread is performing a unique operation. Different threads may be operating on the same data, or they may be operating on different data. Consider again our example above. In contrast to that situation, an example of task parallelism might involve two threads, each performing a unique statistical operation on the array of elements. The threads again are operating in parallel on separate computing cores, but each is performing a unique operation.
Fundamentally, data parallelism involves the distribution of data across multiple cores, and task parallelism involves the distribution of tasks across multiple cores, as shown in Figure 4.5. However, data and task parallelism are not mutually exclusive, and an application may in fact use a hybrid of these two strategies.
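A minimal Java sketch of the two-thread data-parallel sum described above (the array contents and thread names are illustrative assumptions):

public class ParallelSum {
    public static void main(String[] args) throws InterruptedException {
        int[] data = new int[1000];
        for (int i = 0; i < data.length; i++)
            data[i] = i;

        long[] partial = new long[2]; // one slot per thread, so no shared writes

        // Thread A sums elements [0] ... [N/2 - 1]
        Thread a = new Thread(() -> {
            for (int i = 0; i < data.length / 2; i++)
                partial[0] += data[i];
        });
        // Thread B sums elements [N/2] ... [N - 1]
        Thread b = new Thread(() -> {
            for (int i = data.length / 2; i < data.length; i++)
                partial[1] += data[i];
        });

        a.start(); b.start();
        a.join(); b.join(); // join guarantees the partial sums are visible here
        System.out.println("sum = " + (partial[0] + partial[1]));
    }
}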

9
Q

Kernel vs user threads

A

Support for threads may be provided either at the user level, for user threads, or by the kernel, for kernel threads. User threads are supported above the kernel and are managed without kernel support, whereas kernel threads are supported and managed directly by the operating system. Virtually all contemporary operating systems—including Windows, Linux, and macOS—support kernel threads.

10
Q

Many-to-One Model

A

The many-to-one model (Figure 4.7) maps many user-level threads to one kernel thread. Thread management is done by the thread library in user space, so it is efficient (we discuss thread libraries in Section 4.4). However, the entire process will block if a thread makes a blocking system call. Also, because only one thread can access the kernel at a time, multiple threads are unable to run in parallel on multicore systems. Green threads—a thread library available for Solaris systems and adopted in early versions of Java—used the many-to-one model. However, very few systems continue to use the model because of its inability to take advantage of multiple processing cores, which have now become standard on most computer systems.

11
Q

One-to-One Model

A

The one-to-one model (Figure 4.8) maps each user thread to a kernel thread. It provides more concurrency than the many-to-one model by allowing another thread to run when a thread makes a blocking system call. It also allows multiple threads to run in parallel on multiprocessors. The only drawback to this model is that creating a user thread requires creating the corresponding kernel thread, and a large number of kernel threads may burden the performance of a system. Linux, along with the family of Windows operating systems, implements the one-to-one model.

12
Q

Many-to-Many Model

A

The many-to-many model (Figure 4.9) multiplexes many user-level threads to a smaller or equal number of kernel threads. The number of kernel threads may be specific to either a particular application or a particular machine (an application may be allocated more kernel threads on a system with eight processing cores than a system with four cores). Let’s consider the effect of this design on concurrency. Whereas the many-to-one model allows the developer to create as many user threads as she wishes, it does not result in parallelism, because the kernel can schedule only one kernel thread at a time. The one-to-one model allows greater concurrency, but the developer has to be careful not to create too many threads within an application. (In fact, on some systems, she may be limited in the number of threads she can create.) The many-to-many model suffers from neither of these shortcomings: developers can create as many user threads as necessary, and the corresponding kernel threads can run in parallel on a multiprocessor. Also, when a thread performs a blocking system call, the kernel can schedule another thread for execution.
One variation on the many-to-many model still multiplexes many user-level threads to a smaller or equal number of kernel threads but also allows a user-level thread to be bound to a kernel thread. This variation is sometimes referred to as the two-level model (Figure 4.10).
Although the many-to-many model appears to be the most flexible of the models discussed, in practice it is difficult to implement. In addition, with an increasing number of processing cores appearing on most systems, limiting the number of kernel threads has become less important. As a result, most operating systems now use the one-to-one model. However, as we shall see in Section 4.5, some contemporary concurrency libraries have developers identify tasks that are then mapped to threads using the many-to-many model.

13
Q

Thread Libraries

A

A thread library provides the programmer with an API for creating and managing threads. There are two primary ways of implementing a thread library.
The first approach is to provide a library entirely in user space with no kernel support. All code and data structures for the library exist in user space. This means that invoking a function in the library results in a local function call in user space and not a system call.
The second approach is to implement a kernel-level library supported directly by the operating system. In this case, code and data structures for the library exist in kernel space. Invoking a function in the API for the library typically results in a system call to the kernel.

14
Q

Three main thread libraries are in use today

A

POSIX Pthreads, Windows, and Java. Pthreads, the threads extension of the POSIX standard, may be provided as either a user-level or a kernel-level library. The Windows thread library is a kernel-level library available on Windows systems. The Java thread API allows threads to be created and managed directly in Java programs. However, because in most instances the JVM is running on top of a host operating system, the Java thread API is generally implemented using a thread library available on the host system. This means that on Windows systems, Java threads are typically implemented using the Windows API; UNIX, Linux, and macOS systems typically use Pthreads.

15
Q

two general strategies for creating multiple threads: asynchronous threading and synchronous threading.

A

asynchronous threading and synchronous threading. With asynchronous threading, once the parent creates a child thread, the parent resumes its execution, so that the parent and child execute concurrently and independently of one another. Because the threads are independent, there is typically little data sharing between them. Asynchronous threading is the strategy used in the multithreaded server illustrated in Figure 4.2 and is also commonly used for designing responsive user interfaces.
Synchronous threading occurs when the parent thread creates one or more children and then must wait for all of its children to terminate before it resumes.
Here, the threads created by the parent perform work concurrently, but the parent cannot continue until this work has been completed. Once each thread has finished its work, it terminates and joins with its parent. Only after all of the children have joined can the parent resume execution. Typically, synchronous threading involves significant data sharing among threads. For example, the parent thread may combine the results calculated by its various children. All of the following examples use synchronous threading.
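A minimal Java sketch contrasting the two strategies (the class name and task body are illustrative assumptions):

public class CreateAndJoin {
    public static void main(String[] args) {
        Runnable work = () -> System.out.println("child working");

        // Asynchronous: the parent starts the child and continues independently
        Thread async = new Thread(work);
        async.start();
        // ... the parent keeps executing here without calling join() ...

        // Synchronous: the parent starts the child, then waits for it to terminate
        Thread sync = new Thread(work);
        sync.start();
        try {
            sync.join(); // the parent blocks until the child joins with it
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
        }
        // only now does the parent resume, typically to combine results
    }
}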

16
Q

Pthreads

A

Pthreads refers to the POSIX standard (IEEE 1003.1c) defining an API for thread creation and synchronization. This is a specification for thread behavior, not an implementation. Operating-system designers may implement the specification in any way they wish. Numerous systems implement the Pthreads specification; most are UNIX-type systems, including Linux and macOS. Although Windows doesn't support Pthreads natively, some third-party implementations for Windows are available.

17
Q

#define NUM_THREADS 10

/* an array of threads to be joined upon */
pthread_t workers[NUM_THREADS];

for (int i = 0; i < NUM_THREADS; i++)
    pthread_join(workers[i], NULL);

A

The C program shown in Figure 4.11 demonstrates the basic Pthreads API for constructing a multithreaded program that calculates the summation of a non-negative integer in a separate thread. In a Pthreads program, separate threads begin execution in a specified function. In Figure 4.11, this is the runner() function. When this program begins, a single thread of control begins in main(). After some initialization, main() creates a second thread that begins control in the runner() function. Both threads share the global data sum.
Let's look more closely at this program. All Pthreads programs must include the pthread.h header file. The statement pthread_t tid declares the identifier for the thread we will create. Each thread has a set of attributes, including stack size and scheduling information. The pthread_attr_t attr declaration represents the attributes for the thread. We set the attributes in the function call pthread_attr_init(&attr). Because we did not explicitly set any attributes, we use the default attributes provided. (In Chapter 5, we discuss some of the scheduling attributes provided by the Pthreads API.) A separate thread is created with the pthread_create() function call. In addition to passing the thread identifier and the attributes for the thread, we also pass the name of the function where the new thread will begin execution—in this case, the runner() function. Last, we pass the integer parameter that was provided on the command line, argv[1].
At this point, the program has two threads: the initial (or parent) thread in main() and the summation (or child) thread performing the summation operation in the runner() function. This program follows the thread create/join strategy, whereby after creating the summation thread, the parent thread will wait for it to terminate by calling the pthread_join() function. The summation thread will terminate when it calls the function pthread_exit(). Once the summation thread has returned, the parent thread will output the value of the shared data sum.
This example program creates only a single thread. With the growing dominance of multicore systems, writing programs containing several threads has become increasingly common. A simple method for waiting on several threads using the pthread_join() function is to enclose the operation within a simple for loop. For example, you can join on ten threads using the Pthread code shown in Figure 4.12.

18
Q

Java Threads

A

Threads are the fundamental model of program execution in a Java program, and the Java language and its API provide a rich set of features for the creation and management of threads. All Java programs comprise at least a single thread of control—even a simple Java program consisting of only a main() method runs as a single thread in the JVM. Java threads are available on any system that provides a JVM, including Windows, Linux, and macOS. The Java thread API is available for Android applications as well.
There are two techniques for explicitly creating threads in a Java program.
One approach is to create a new class that is derived from the Thread class and to override its run() method. An alternative—and more commonly used—technique is to define a class that implements the Runnable interface. This interface defines a single abstract method with the signature public void run(). The code in the run() method of a class that implements Runnable is what executes in a separate thread. An example is shown below:

class Task implements Runnable {
    public void run() {
        System.out.println("I am a thread.");
    }
}

19
Q

LAMBDA EXPRESSIONS IN JAVA

A

Beginning with Version 1.8 of the language, Java introduced Lambda expressions, which allow a much cleaner syntax for creating threads. Rather than defining a separate class that implements Runnable, a Lambda expression can be used instead:

Runnable task = () -> {
    System.out.println("I am a thread.");
};
Thread worker = new Thread(task);
worker.start();
Lambda expressions—as well as similar functions known as closures—are a prominent feature of functional programming languages and have been available in several nonfunctional languages as well, including Python, C++, and C#. As we shall see in later examples in this chapter, Lambda expressions often provide a simple syntax for developing parallel applications.

20
Q

Thread creation in Java involves

A

creating a Thread object and passing it an instance of a class that implements Runnable, followed by invoking the start() method on the Thread object. This appears in the following example:
Thread worker = new Thread(new Task());
worker.start();
Invoking the start() method for the new Thread object does two things:
1. It allocates memory and initializes a new thread in the JVM.
2. It calls the run() method, making the thread eligible to be run by the JVM.
(Note again that we never call the run() method directly. Rather, we call the start() method, and it calls the run() method on our behalf.) Recall that the parent threads in the Pthreads and Windows libraries use pthread_join() and WaitForSingleObject() (respectively) to wait for the summation threads to finish before proceeding. The join() method in Java provides similar functionality. (Notice that join() can throw an InterruptedException, which we choose to ignore.)

try {
    worker.join();
}
catch (InterruptedException ie) { }

If the parent must wait for several threads to finish, then the join() method can be enclosed in a for loop similar to that shown for Pthreads.

21
Q

Java Executor Framework

A

Java has supported thread creation using the approach we have described thus far since its origins. However, beginning with Version 1.5 and its API, Java introduced several new concurrency features that provide developers with much greater control over thread creation and communication. These tools are available in the java.util.concurrent package.
Rather than explicitly creating Thread objects, thread creation is instead organized around the Executor interface:

public interface Executor {
    void execute(Runnable command);
}

Classes implementing this interface must define the execute() method, which is passed a Runnable object. For Java developers, this means using the Executor rather than creating a separate Thread object and invoking its start() method. The Executor is used as follows (here obtaining an implementation from the Executors factory class, since Executor itself is an interface):

Executor service = Executors.newSingleThreadExecutor();
service.execute(new Task());
The Executor framework is based on the producer-consumer model; tasks implementing the Runnable interface are produced, and the threads that execute these tasks consume them. The advantage of this approach is that it not only divides thread creation from execution but also provides a mechanism for communication between concurrent tasks.
Data sharing between threads belonging to the same process occurs easily in Windows and Pthreads, since shared data are simply declared globally. As a pure object-oriented language, Java has no such notion of global data. We can pass parameters to a class that implements Runnable, but Java threads cannot return results. To address this need, the java.util.concurrent package additionally defines the Callable interface, which behaves similarly to Runnable except that a result can be returned. Results returned from Callable tasks are known as Future objects. A result can be retrieved from the get() method defined in the Future interface. The program shown in Figure 4.14 illustrates the summation program using these Java features.
The Summation class implements the Callable interface, which specifies the method V call()—it is the code in this call() method that is executed in a separate thread. To execute this code, we create a newSingleThreadExecutor object (provided as a static method in the Executors class), which is of type ExecutorService, and pass it a Callable task using its submit() method. (The primary difference between the execute() and submit() methods is that the former returns no result, whereas the latter returns a result as a Future.) Once we submit the Callable task to the thread, we wait for its result by calling the get() method of the Future object it returns.
It is quite easy to notice at first that this model of thread creation appears more complicated than simply creating a thread and joining on its termination.
However, incurring this modest degree of complication confers benefits. As we have seen, using Callable and Future allows for threads to return results.
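Figure 4.14 is not reproduced in these cards; the following is a minimal sketch of the summation program as described above. The Summation and Driver class names and the bound of 10 are illustrative assumptions:

import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class Summation implements Callable<Integer> {
    private final int upper;
    Summation(int upper) { this.upper = upper; }

    // call() runs in a separate thread and, unlike run(), returns a result
    public Integer call() {
        int sum = 0;
        for (int i = 1; i <= upper; i++)
            sum += i;
        return sum;
    }
}

public class Driver {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        Future<Integer> result = pool.submit(new Summation(10));
        System.out.println("sum = " + result.get()); // get() blocks until call() completes
        pool.shutdown();
    }
}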

22
Q

THE JVM AND THE HOST OPERATING SYSTEM

A

The JVM is typically implemented on top of a host operating system (see Figure 18.10). This setup allows the JVM to hide the implementation details of the underlying operating system and to provide a consistent, abstract environment that allows Java programs to operate on any platform that supports a JVM. The specification for the JVM does not indicate how Java threads are to be mapped to the underlying operating system, instead leaving that decision to the particular implementation of the JVM. For example, the Windows operating system uses the one-to-one model; therefore, each Java thread for a JVM running on Windows maps to a kernel thread. In addition, there may be a relationship between the Java thread library and the thread library on the host operating system. For example, implementations of a JVM for the Windows family of operating systems might use the Windows API when creating Java threads; Linux and macOS systems might use the Pthreads API.

23
Q

Implicit Threading

A

With the continued growth of multicore processing, applications containing hundreds—or even thousands—of threads are looming on the horizon.
Designing such applications is not a trivial undertaking: programmers must address not only the challenges outlined in Section 4.2 but additional difficulties as well. These difficulties, which relate to program correctness, are covered in Chapter 6 and Chapter 8.
One way to address these difficulties and better support the design of concurrent and parallel applications is to transfer the creation and management of threading from application developers to compilers and run-time libraries.
This strategy, termed implicit threading, is an increasingly popular trend. In this section, we explore four alternative approaches to designing applications that can take advantage of multicore processors through implicit threading.
As we shall see, these strategies generally require application developers to identify tasks—not threads—that can run in parallel. A task is usually written as a function, which the run-time library then maps to a separate thread, typically using the many-to-many model (Section 4.3.3). The advantage of this approach is that developers only need to identify parallel tasks, and the libraries determine the specific details of thread creation and management.

24
Q

ANDROID THREAD POOLS

A

In Section 3.8.2.1, we covered RPCs in the Android operating system. You may recall from that section that Android uses the Android Interface Definition Language (AIDL), a tool that specifies the remote interface that clients interact with on the server. AIDL also provides a thread pool. A remote service using the thread pool can handle multiple concurrent requests, servicing each request using a separate thread from the pool.

25
Q

Multithread problems

A

In Section 4.1, we described a multithreaded web server. In this situation, whenever the server receives a request, it creates a separate thread to service the request. Whereas creating a separate thread is certainly superior to creating a separate process, a multithreaded server nonetheless has potential problems.
The first issue concerns the amount of time required to create the thread, together with the fact that the thread will be discarded once it has completed its work. The second issue is more troublesome. If we allow each concurrent request to be serviced in a new thread, we have not placed a bound on the number of threads concurrently active in the system.

26
Q

Thread Pools

A

The general idea behind a thread pool is to create a number of threads at start-up and place them into a pool, where they sit and wait for work. When a server receives a request, rather than creating a thread, it instead submits the request to the thread pool and resumes waiting for additional requests. If there is an available thread in the pool, it is awakened, and the request is serviced immediately. If the pool contains no available thread, the task is queued until one becomes free. Once a thread completes its service, it returns to the pool and awaits more work. Thread pools work well when the tasks submitted to the pool can be executed asynchronously.
Thread pools offer these benefits:
1. Servicing a request with an existing thread is often faster than waiting to create a thread.
2. A thread pool limits the number of threads that exist at any one point. This is particularly important on systems that cannot support a large number of concurrent threads.
3. Separating the task to be performed from the mechanics of creating the task allows us to use different strategies for running the task. For example, the task could be scheduled to execute after a time delay or to execute periodically.

The number of threads in the pool can be set heuristically based on factors such as the number of CPUs in the system, the amount of physical memory, and the expected number of concurrent client requests. More sophisticated thread-pool architectures can dynamically adjust the number of threads in the pool according to usage patterns. Such architectures provide the further benefit of having a smaller pool—thereby consuming less memory—when the load on the system is low. We discuss one such architecture, Apple’s Grand Central Dispatch, later in this section.
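The number-of-CPUs heuristic mentioned above can be expressed directly in Java; a minimal sketch (the request loop is an illustrative assumption):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class SizedPool {
    public static void main(String[] args) {
        // Heuristic: one worker thread per CPU reported by the runtime
        int nCpus = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(nCpus);

        for (int i = 0; i < 100; i++) {
            final int request = i;
            // If all workers are busy, the task waits in the pool's queue
            pool.execute(() -> System.out.println("servicing request " + request));
        }
        pool.shutdown(); // finish queued work, then retire the threads
    }
}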

27
Q

Java Thread Pools

A

The java.util.concurrent package includes an API for several varieties of thread-pool architectures. Here, we focus on the following three models:
1. Single thread executor—newSingleThreadExecutor()—creates a pool of size 1.
2. Fixed thread executor—newFixedThreadPool(int size)—creates a thread pool with a specified number of threads.
3. Cached thread executor—newCachedThreadPool()—creates an unbounded thread pool, reusing threads in many instances.
We have, in fact, already seen the use of a Java thread pool in Section 4.4.3, where we created a newSingleThreadExecutor in the program example shown in Figure 4.14. In that section, we noted that the Java executor framework can be used to construct more robust threading tools. We now describe how it can be used to create thread pools.
A thread pool is created using one of the factory methods in the Executors class:
• static ExecutorService newSingleThreadExecutor()
• static ExecutorService newFixedThreadPool(int size)
• static ExecutorService newCachedThreadPool()
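A brief sketch of using all three factory methods; the task body and pool size of 4 are illustrative assumptions:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PoolKinds {
    public static void main(String[] args) {
        Runnable task = () -> System.out.println(Thread.currentThread().getName());

        ExecutorService single = Executors.newSingleThreadExecutor(); // pool of size 1
        ExecutorService fixed  = Executors.newFixedThreadPool(4);     // exactly 4 threads
        ExecutorService cached = Executors.newCachedThreadPool();     // grows and reuses threads

        single.execute(task);
        fixed.execute(task);
        cached.execute(task);

        single.shutdown();
        fixed.shutdown();
        cached.shutdown();
    }
}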

28
Q

Fork Join

A

The strategy for thread creation covered in Section 4.4 is often known as the fork-join model. Recall that with this method, the main parent thread creates (forks) one or more child threads and then waits for the children to terminate and join with it, at which point it can retrieve and combine their results. This synchronous model is often characterized as explicit thread creation, but it is also an excellent candidate for implicit threading. In the latter situation, threads are not constructed directly during the fork stage; rather, parallel tasks are designated. This model is illustrated in Figure 4.16. A library manages the number of threads that are created and is also responsible for assigning tasks to threads. In some ways, this fork-join model is a synchronous version of thread pools in which a library determines the actual number of threads to create—for example, by using the heuristics described in Section 4.5.1.

29
Q

Fork Join in Java

A

Java introduced a fork-join library in Version 1.7 of the API that is designed to be used with recursive divide-and-conquer algorithms such as Quicksort and Mergesort. When implementing divide-and-conquer algorithms using this
library, separate tasks are forked during the divide step and assigned smaller subsets of the original problem. Algorithms must be designed so that these separate tasks can execute concurrently. At some point, the size of the problem assigned to a task is small enough that it can be solved directly and requires creating no additional tasks. The general recursive algorithm behind Java’s fork-join model is shown below:
Task(problem)
    if problem is small enough
        solve the problem directly
    else
        subtask1 = fork(new Task(subset of problem))
        subtask2 = fork(new Task(subset of problem))
        result1 = join(subtask1)
        result2 = join(subtask2)
        return combined results

Figure 4.17 depicts the model graphically.
We now illustrate Java's fork-join strategy by designing a divide-and-conquer algorithm that sums all elements in an array of integers. In Version 1.7 of the API, Java introduced a new thread pool—the ForkJoinPool—that can be assigned tasks that inherit the abstract base class ForkJoinTask (which for now we will assume is the SumTask class). The following creates a ForkJoinPool object and submits the initial task via its invoke() method:

ForkJoinPool pool = new ForkJoinPool();

// array contains the integers to be summed
int[] array = new int[SIZE];

SumTask task = new SumTask(0, SIZE - 1, array);
int sum = pool.invoke(task);

Upon completion, the initial call to invoke() returns the summation of array.
The class SumTask—shown in Figure 4.18—implements a divide-and-conquer algorithm that sums the contents of the array using fork-join. New tasks are created using the fork() method, and the compute() method specifies the computation that is performed by each task. The method compute() is invoked until it can directly calculate the sum of the subset it is assigned. The call to join() blocks until the task completes, upon which join() returns the results calculated in compute().
Notice that SumTask in Figure 4.18 extends RecursiveTask. The Java fork-join strategy is organized around the abstract base class ForkJoinTask, and the RecursiveTask and RecursiveAction classes extend this class. The fundamental difference between these two classes is that RecursiveTask returns a result (via the return value specified in compute()), and RecursiveAction does not return a result. The relationship between the three classes is illustrated in the UML class diagram in Figure 4.19.
An important issue to consider is determining when the problem is “small enough” to be solved directly and no longer requires creating additional tasks.
In SumTask, this occurs when the number of elements being summed is less than the value THRESHOLD, which in Figure 4.18 we have arbitrarily set to 1,000.
In practice, determining when a problem can be solved directly requires careful timing trials, as the value can vary according to implementation.
What is interesting in Java's fork-join model is the management of tasks wherein the library constructs a pool of worker threads and balances the load of tasks among the available workers. In some situations, there are thousands of tasks, yet only a handful of threads performing the work (for example, a separate thread for each CPU). Additionally, each thread in a ForkJoinPool maintains a queue of tasks that it has forked, and if a thread's queue is empty, it can steal a task from another thread's queue using a work stealing algorithm, thus balancing the workload of tasks among all threads.
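Figure 4.18 is not reproduced in these cards; the following is a minimal sketch of what a SumTask along the lines described above could look like. The field names and the main() driver are assumptions based on the description, with THRESHOLD set to 1,000 as stated:

import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

class SumTask extends RecursiveTask<Integer> {
    static final int THRESHOLD = 1000;

    private final int begin, end;
    private final int[] array;

    SumTask(int begin, int end, int[] array) {
        this.begin = begin;
        this.end = end;
        this.array = array;
    }

    protected Integer compute() {
        if (end - begin < THRESHOLD) {
            // Small enough: calculate the sum directly
            int sum = 0;
            for (int i = begin; i <= end; i++)
                sum += array[i];
            return sum;
        }
        // Otherwise divide the range in half and fork two subtasks
        int mid = (begin + end) / 2;
        SumTask left = new SumTask(begin, mid, array);
        SumTask right = new SumTask(mid + 1, end, array);
        left.fork();
        right.fork();
        return left.join() + right.join(); // join blocks until each subtask completes
    }

    public static void main(String[] args) {
        int[] array = new int[10000];
        java.util.Arrays.fill(array, 1);
        int sum = new ForkJoinPool().invoke(new SumTask(0, array.length - 1, array));
        System.out.println("sum = " + sum); // 10000
    }
}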

30
Q

OpenMP

A

OpenMP is a set of compiler directives as well as an API for programs written in C, C++, or FORTRAN that provides support for parallel programming in shared-memory environments. OpenMP identifies parallel regions as blocks of code that may run in parallel. Application developers insert compiler directives into their code at parallel regions, and these directives instruct the OpenMP run-time library to execute the region in parallel. The following C program illustrates a compiler directive above the parallel region containing the printf() statement:

#include <omp.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    /* sequential code */

    #pragma omp parallel
    {
        printf("I am a parallel region.");
    }

    /* sequential code */

    return 0;
}

When OpenMP encounters the directive

#pragma omp parallel

it creates as many threads as there are processing cores in the system. Thus, for a dual-core system, two threads are created; for a quad-core system, four are created; and so forth. All the threads then simultaneously execute the parallel region. As each thread exits the parallel region, it is terminated.
OpenMP provides several additional directives for running code regions in parallel, including parallelizing loops. For example, assume we have two arrays, a and b, of size N. We wish to sum their contents and place the results in array c. We can have this task run in parallel by using the following code segment, which contains the compiler directive for parallelizing for loops:

#pragma omp parallel for
for (i = 0; i < N; i++) {
    c[i] = a[i] + b[i];
}

OpenMP divides the work contained in the for loop among the threads it has created in response to the directive

#pragma omp parallel for

In addition to providing directives for parallelization, OpenMP allows developers to choose among several levels of parallelism. For example, they can set the number of threads manually. It also allows developers to identify whether data are shared between threads or are private to a thread. OpenMP is available on several open-source and commercial compilers for Linux, Windows, and macOS systems. We encourage readers interested in learning more about OpenMP to consult the bibliography at the end of the chapter.

31
Q

Grand Central Dispatch

A

Grand Central Dispatch (GCD) is a technology developed by Apple for its macOS and iOS operating systems. It is a combination of a run-time library, an API, and language extensions that allow developers to identify sections of code (tasks) to run in parallel. Like OpenMP, GCD manages most of the details of threading.
GCD schedules tasks for run-time execution by placing them on a dispatch queue. When it removes a task from a queue, it assigns the task to an available thread from a pool of threads that it manages. GCD identifies two types of dispatch queues: serial and concurrent.
Tasks placed on a serial queue are removed in FIFO order. Once a task has been removed from the queue, it must complete execution before another task is removed. Each process has its own serial queue (known as its main queue), and developers can create additional serial queues that are local to a particular process. (This is why serial queues are also known as private dispatch queues.) Serial queues are useful for ensuring the sequential execution of several tasks.
Tasks placed on a concurrent queue are also removed in FIFO order, but several tasks may be removed at a time, thus allowing multiple tasks to execute in parallel. There are several system-wide concurrent queues (also known as global dispatch queues), which are divided into four primary quality-of-service classes:
• QOS_CLASS_USER_INTERACTIVE—The user-interactive class represents tasks that interact with the user, such as the user interface and event handling, to ensure a responsive user interface. Completing a task belonging to this class should require only a small amount of work.
• QOS_CLASS_USER_INITIATED—The user-initiated class is similar to the user-interactive class in that tasks are associated with a responsive user interface; however, user-initiated tasks may require longer processing times. Opening a file or a URL is a user-initiated task, for example. Tasks belonging to this class must be completed for the user to continue interacting with the system, but they do not need to be serviced as quickly as tasks in the user-interactive queue.
• QOS_CLASS_UTILITY—The utility class represents tasks that require a longer time to complete but do not demand immediate results. This class includes work such as importing data.
• QOS_CLASS_BACKGROUND—Tasks belonging to the background class are not visible to the user and are not time sensitive. Examples include indexing a mailbox system and performing backups.
Tasks submitted to dispatch queues may be expressed in one of two different ways:
1. For the C, C++, and Objective-C languages, GCD identifies a language extension known as a block, which is simply a self-contained unit of work. A block is specified by a caret ^ inserted in front of a pair of braces { }. Code within the braces identifies the unit of work to be performed. A simple example of a block is shown below:

^{ printf("I am a block"); }

2. For the Swift programming language, a task is defined using a closure, which is similar to a block in that it expresses a self-contained unit of functionality. Syntactically, a Swift closure is written in the same way as a block, minus the leading caret.
The following Swift code segment illustrates obtaining a concurrent queue for the user-initiated class and submitting a task to the queue using the dispatch_async() function:

let queue = dispatch_get_global_queue(QOS_CLASS_USER_INITIATED, 0)
dispatch_async(queue, { print("I am a closure.") })

Internally, GCD's thread pool is composed of POSIX threads. GCD actively manages the pool, allowing the number of threads to grow and shrink according to application demand and system capacity. GCD is implemented by the libdispatch library, which Apple has released under the Apache License. It has since been ported to the FreeBSD operating system.

32
Q

Intel Thread Building Blocks

A

Intel threading building blocks (TBB) is a template library that supports designing parallel applications in C++. As this is a library, it requires no special compiler or language support. Developers specify tasks that can run in parallel, and the TBB task scheduler maps these tasks onto underlying threads.
Furthermore, the task scheduler provides load balancing and is cache aware, meaning that it will give precedence to tasks that likely have their data stored in cache memory and thus will execute more quickly. TBB provides a rich set of features, including templates for parallel loop structures, atomic operations, and mutual exclusion locking. In addition, it provides concurrent data structures, including a hash map, queue, and vector, which can serve as equivalent thread-safe versions of the C++ standard template library data structures.
Let's use parallel for loops as an example. Initially, assume there is a function named apply(float value) that performs an operation on the parameter value. If we had an array v of size n containing float values, we could use the following serial for loop to pass each value in v to the apply() function:

for (int i = 0; i < n; i++) {
    apply(v[i]);
}

A developer could manually apply data parallelism (Section 4.2.2) on a multicore system by assigning different regions of the array v to each processing core; however, this ties the technique for achieving parallelism closely to the physical hardware, and the algorithm would have to be modified and recompiled for the number of processing cores on each specific architecture.
Alternatively, a developer could use TBB, which provides a parallel_for template that expects two values:

parallel_for(range, body)

where range refers to the range of elements that will be iterated (known as the iteration space) and body specifies an operation that will be performed on a subrange of elements.
We can now rewrite the above serial for loop using the TBB parallel_for template as follows:

parallel_for(size_t(0), n, [=](size_t i) { apply(v[i]); });

The first two parameters specify that the iteration space is from 0 to n − 1 (which corresponds to the number of elements in the array v). The third parameter is a C++ lambda function that requires a bit of explanation. The expression [=](size_t i) declares the parameter i, which assumes each of the values over the iteration space (in this case from 0 to n − 1). Each value of i is used to identify which array element in v is to be passed as a parameter to the apply(v[i]) function.
The TBB library will divide the loop iterations into separate "chunks" and create a number of tasks that operate on those chunks. (The parallel_for function allows developers to manually specify the size of the chunks if they wish to.) TBB will also create a number of threads and assign tasks to available threads. This is quite similar to the fork-join library in Java. The advantage of this approach is that it requires only that developers identify what operations can run in parallel (by specifying a parallel_for loop), and the library manages the details involved in dividing the work into separate tasks that run in parallel. Intel TBB has both commercial and open-source versions that run on Windows, Linux, and macOS. Refer to the bibliography for further details on how to develop parallel applications using TBB.

33
Q

The fork() and exec() System Calls

A

If one thread in a program calls fork(), does the new process duplicate all threads, or is the new process single-threaded? Some UNIX systems have chosen to have two versions of fork(), one that duplicates all threads and another that duplicates only the thread that invoked the fork() system call.
The exec() system call typically works in the same way as described in Chapter 3. That is, if a thread invokes the exec() system call, the program specified in the parameter to exec() will replace the entire process—including all threads.
Which of the two versions of fork() to use depends on the application.
If exec() is called immediately after forking, then duplicating all threads is unnecessary, as the program specified in the parameters to exec() will replace the process. In this instance, duplicating only the calling thread is appropriate. If, however, the separate process does not call exec() after forking, the separate process should duplicate all threads.

34
Q

A signal

A

is used in UNIX systems to notify a process that a particular event has occurred. A signal may be received either synchronously or asynchronously, depending on the source of and the reason for the event being signaled. All signals, whether synchronous or asynchronous, follow the same pattern:
1. A signal is generated by the occurrence of a particular event.
2. The signal is delivered to a process.
3. Once delivered, the signal must be handled.
Examples of synchronous signals include illegal memory access and division by 0. If a running program performs either of these actions, a signal is generated. Synchronous signals are delivered to the same process that performed the operation that caused the signal (that is the reason they are considered synchronous).
When a signal is generated by an event external to a running process, that process receives the signal asynchronously. Examples of such signals include terminating a process with specific keystrokes (such as <control><C>) and having a timer expire. Typically, an asynchronous signal is sent to another process.

35
Q

A signal may be handled by one of two possible handlers:
1. A default signal handler
2. A user-defined signal handler

A

Every signal has a default signal handler that the kernel runs when handling that signal. This default action can be overridden by a user-defined signal handler that is called to handle the signal. Signals are handled in different ways. Some signals may be ignored, while others (for example, an illegal memory access) are handled by terminating the program.
Handling signals in single-threaded programs is straightforward: signals are always delivered to a process. However, delivering signals is more complicated in multithreaded programs, where a process may have several threads.
Where, then, should a signal be delivered?
In general, the following options exist:
1. Deliver the signal to the thread to which the signal applies.
2. Deliver the signal to every thread in the process.
3. Deliver the signal to certain threads in the process.
4. Assign a specific thread to receive all signals for the process.
The method for delivering a signal depends on the type of signal generated.
For example, synchronous signals need to be delivered to the thread causing the signal and not to other threads in the process. However, the situation with asynchronous signals is not as clear. Some asynchronous signals—such as a signal that terminates a process (<control><C>, for example)—should be sent to all threads.
The standard UNIX function for delivering a signal is

kill(pid_t pid, int signal)

This function specifies the process (pid) to which a particular signal (signal) is to be delivered. Most multithreaded versions of UNIX allow a thread to specify which signals it will accept and which it will block. Therefore, in some cases, an asynchronous signal may be delivered only to those threads that are not blocking it. However, because signals need to be handled only once, a signal is typically delivered only to the first thread found that is not blocking it. POSIX Pthreads provides the following function, which allows a signal to be delivered to a specified thread (tid):

pthread_kill(pthread_t tid, int signal)

Although Windows does not explicitly provide support for signals, it allows us to emulate them using asynchronous procedure calls (APCs). The APC facility enables a user thread to specify a function that is to be called when the user thread receives notification of a particular event. As indicated by its name, an APC is roughly equivalent to an asynchronous signal in UNIX.
However, whereas UNIX must contend with how to deal with signals in a multithreaded environment, the APC facility is more straightforward, since an APC is delivered to a particular thread rather than a process.

36
Q

Thread Cancellation

A

Thread cancellation involves terminating a thread before it has completed. For example, if multiple threads are concurrently searching through a database and one thread returns the result, the remaining threads might be canceled.
Another situation might occur when a user presses a button on a web browser that stops a web page from loading any further. Often, a web page loads using several threads—each image is loaded in a separate thread. When a user presses the stop button on the browser, all threads loading the page are canceled.

37
Q

Cancellation of a target thread may occur in two different scenarios:

A
  1. Asynchronous cancellation. One thread immediately terminates the target thread.
  2. Deferred cancellation. The target thread periodically checks whether it should terminate, allowing it an opportunity to terminate itself in an orderly fashion.
38
Q

The difficulty with cancellation occurs in

A

situations where resources have been allocated to a canceled thread or where a thread is canceled while in the midst of updating data it is sharing with other threads. This becomes especially troublesome with asynchronous cancellation. Often, the operating system will reclaim system resources from a canceled thread but will not reclaim all resources. Therefore, canceling a thread asynchronously may not free a necessary system-wide resource.
With deferred cancellation, in contrast, one thread indicates that a target thread is to be canceled, but cancellation occurs only after the target thread has checked a flag to determine whether or not it should be canceled. The thread can perform this check at a point at which it can be canceled safely.

39
Q

The default cancellation type is deferred cancellation.

A

However, cancellation occurs only when a thread reaches a cancellation point. Most of the blocking system calls in the POSIX and standard C library are defined as cancellation points, and these are listed when invoking the command man pthreads on a Linux system. For example, the read() system call is a cancellation point that allows cancelling a thread that is blocked while awaiting input from read().
One technique for establishing a cancellation point is to invoke the pthread_testcancel() function. If a cancellation request is found to be pending, the call to pthread_testcancel() will not return, and the thread will terminate; otherwise, the call to the function will return, and the thread will continue to run. Additionally, Pthreads allows a function known as a cleanup handler to be invoked if a thread is canceled. This function allows any resources a thread may have acquired to be released before the thread is terminated.
The following code illustrates how a thread may respond to a cancellation request using deferred cancellation:

while (1) {
    /* do some work for awhile */
    . . .

    /* check if there is a cancellation request */
    pthread_testcancel();
}

Because of the issues described earlier, asynchronous cancellation is not recommended in Pthreads documentation. Thus, we do not cover it here.
An interesting note is that on Linux systems, thread cancellation using the Pthreads API is handled through signals (Section 4.6.2).
Thread cancellation in Java uses a policy similar to deferred cancellation in Pthreads. To cancel a Java thread, you invoke the interrupt() method, which sets the interruption status of the target thread to true:

Thread worker;
. . .

/* set the interruption status of the thread */
worker.interrupt();

A thread can check its interruption status by invoking the isInterrupted() method, which returns a boolean value of a thread's interruption status:

while (!Thread.currentThread().isInterrupted()) {
    . . .
}

40
Q

Thread-Local Storage

A

Threads belonging to a process share the data of the process. Indeed, this data sharing provides one of the benefits of multithreaded programming.
However, in some circumstances, each thread might need its own copy of certain data. We will call such data thread-local storage (or TLS). For example, in a transaction-processing system, we might service each transaction in a separate thread. Furthermore, each transaction might be assigned a unique identifier. To associate each thread with its unique transaction identifier, we could use thread-local storage.
It is easy to confuse TLS with local variables. However, local variables are visible only during a single function invocation, whereas TLS data are visible across function invocations. Additionally, when the developer has no control over the thread creation process—for example, when using an implicit technique such as a thread pool—then an alternative approach is necessary.
In some ways, TLS is similar to static data; the difference is that TLS data are unique to each thread. (In fact, TLS is usually declared as static.) Most thread libraries and compilers provide support for TLS. For example, Java provides a ThreadLocal<T> class with set() and get() methods for ThreadLocal<T> objects. Pthreads includes the type pthread_key_t, which provides a key that is specific to each thread. This key can then be used to access TLS data. Microsoft's C# language simply requires adding the storage attribute [ThreadStatic] to declare thread-local data. The gcc compiler provides the storage class keyword __thread for declaring TLS data. For example, if we wished to assign a unique identifier for each thread, we would declare it as follows:

static __thread int threadID;
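A minimal Java sketch of the ThreadLocal<T> usage mentioned above; the per-thread ID scheme and class name are illustrative assumptions:

import java.util.concurrent.atomic.AtomicInteger;

public class ThreadId {
    private static final AtomicInteger next = new AtomicInteger(0);

    // Each thread gets its own copy, initialized on that thread's first access
    private static final ThreadLocal<Integer> threadId =
        ThreadLocal.withInitial(next::getAndIncrement);

    public static void main(String[] args) {
        Runnable show = () ->
            System.out.println(Thread.currentThread().getName()
                               + " has id " + threadId.get());
        new Thread(show).start();
        new Thread(show).start();
    }
}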

41
Q

Scheduler Activations

A

One scheme for communication between the user-thread library and the kernel is known as scheduler activation. It works as follows: The kernel provides an application with a set of virtual processors (LWPs), and the application can schedule user threads onto an available virtual processor. Furthermore, the kernel must inform an application about certain events. This procedure is known as an upcall. Upcalls are handled by the thread library with an upcall handler, and upcall handlers must run on a virtual processor.
One event that triggers an upcall occurs when an application thread is about to block. In this scenario, the kernel makes an upcall to the application informing it that a thread is about to block and identifying the specific thread.
The kernel then allocates a new virtual processor to the application. The application runs an upcall handler on this new virtual processor, which saves the state of the blocking thread and relinquishes the virtual processor on which the blocking thread is running. The upcall handler then schedules another thread that is eligible to run on the new virtual processor. When the event that the blocking thread was waiting for occurs, the kernel makes another upcall to the thread library informing it that the previously blocked thread is now eligible to run. The upcall handler for this event also requires a virtual processor, and the kernel may allocate a new virtual processor or preempt one of the user threads and run the upcall handler on its virtual processor. After marking the unblocked thread as eligible to run, the application schedules an eligible thread to run on an available virtual processor.

42
Q

lightweight process, or LWP

A

Many systems implementing either the many-to-many or the two-level model place an intermediate data structure between the user and kernel threads. This data structure—typically known as a lightweight process, or LWP—is shown in Figure 4.20. To the user-thread library, the LWP appears to be a virtual processor on which the application can schedule a user thread to run. Each LWP is attached to a kernel thread, and it is kernel threads that the operating system schedules to run on physical processors. If a kernel thread blocks (such as while waiting for an I/O operation to complete), the LWP blocks as well. Up the chain, the user-level thread attached to the LWP also blocks.
An application may require any number of LWPs to run efficiently. Consider a CPU-bound application running on a single processor. In this scenario, only one thread can run at a time, so one LWP is sufficient. An application that is I/O-intensive may require multiple LWPs to execute, however. Typically, an LWP is required for each concurrent blocking system call. Suppose, for example, that five different file-read requests occur simultaneously. Five LWPs are needed, because all could be waiting for I/O completion in the kernel. If a process has only four LWPs, then the fifth request must wait for one of the LWPs to return from the kernel.

43
Q

Windows Threads

A

A Windows application runs as a separate process, and each process may contain one or more threads. The Windows API for creating threads is covered in Section 4.4.2. Additionally, Windows uses the one-to-one mapping described in Section 4.3.2, where each user-level thread maps to an associated kernel thread.
The general components of a thread include:
• A thread ID uniquely identifying the thread
• A register set representing the status of the processor
• A program counter
• A user stack, employed when the thread is running in user mode, and a kernel stack, employed when the thread is running in kernel mode
• A private storage area used by various run-time libraries and dynamic link libraries (DLLs)
The register set, stacks, and private storage area are known as the context of the thread.
The primary data structures of a thread include:
• ETHREAD—executive thread block
• KTHREAD—kernel thread block
• TEB—thread environment block
The key components of the ETHREAD include a pointer to the process to which the thread belongs and the address of the routine in which the thread starts control. The ETHREAD also contains a pointer to the corresponding KTHREAD.
The KTHREAD includes scheduling and synchronization information for the thread. In addition, the KTHREAD includes the kernel stack (used when the thread is running in kernel mode) and a pointer to the TEB.
The ETHREAD and the KTHREAD exist entirely in kernel space; this means that only the kernel can access them. The TEB is a user-space data structure that is accessed when the thread is running in user mode. Among other fields, the TEB contains the thread identifier, a user-mode stack, and an array for thread-local storage. The structure of a Windows thread is illustrated in Figure 4.21.

44
Q

Linux Threads

A

Linux provides the fork() system call with the traditional functionality of duplicating a process, as described in Chapter 3. Linux also provides the ability to create threads using the clone() system call. However, Linux does not distinguish between processes and threads. In fact, Linux uses the term task—rather than process or thread—when referring to a flow of control within a program.
When clone() is invoked, it is passed a set of flags that determine how much sharing is to take place between the parent and child tasks. Some of these flags are listed in Figure 4.22. For example, suppose that clone() is passed the flags CLONE_FS, CLONE_VM, CLONE_SIGHAND, and CLONE_FILES. The parent and child tasks will then share the same file-system information (such as the current working directory), the same memory space, the same signal handlers, and the same set of open files. Using clone() in this fashion is equivalent to creating a thread as described in this chapter, since the parent task shares most of its resources with its child task. However, if none of these flags is set when clone() is invoked, no sharing takes place, resulting in functionality similar to that provided by the fork() system call.
The varying level of sharing is possible because of the way a task is represented in the Linux kernel. A unique kernel data structure (specifically, struct task_struct) exists for each task in the system. This data structure, instead of storing data for the task, contains pointers to other data structures where these data are stored—for example, data structures that represent the list of open files, signal-handling information, and virtual memory. When fork() is invoked, a new task is created, along with a copy of all the associated data structures of the parent process. A new task is also created when the clone() system call is made. However, rather than copying all data structures, the new task points to the data structures of the parent task, depending on the set of flags passed to clone().
Finally, the flexibility of the clone() system call can be extended to the concept of containers, a virtualization topic which was introduced in Chapter 1. Recall from that chapter that a container is a virtualization technique provided by the operating system that allows creating multiple Linux systems (containers) under a single Linux kernel that run in isolation to one another.
Just as certain flags passed to clone() can distinguish between creating a task that behaves more like a process or a thread based upon the amount of sharing between the parent and child tasks, there are other flags that can be passed to clone() that allow a Linux container to be created. Containers will be covered more fully in Chapter 18.