Multicore Computing

created : Wed, 02 Jun 2021 17:11:08 +0900
modified : Thu, 03 Jun 2021 09:21:24 +0900

Introduction to Multicore Computing

Multicore Processor

Manycore processor (GPU)

What is Parallel Computing?

Parallelism vs Concurrency

Parallel Programming Techniques

Parallel Processing Systems

Parallel Computing vs. Distributed Computing

Cluster Computing vs. Grid Computing

Cloud Computing

Good Parallel Program

Moore’s Law

Computer Hardware Trend

Examples of Parallel Computers

Generic SMP

Summary

Principles of Parallel Computing

Overhead of Parallelism

Locality and Parallelism

Load Imbalance

Performance of Parallel Programs

Flynn’s Taxonomy of Parallel Computers

SISD (Single Instruction, Single Data)

SIMD (Single Instruction, Multiple Data)

MISD (Multiple Instruction, Single Data)

MIMD (Multiple Instruction, Multiple Data)

Creating a Parallel Program

  1. Decomposition
  2. Assignment
  3. Orchestration/Mapping

Decomposition

Domain Decomposition

Functional Decomposition

Assignment

Orchestration

Mapping

Performance of Parallel Programs

Coverage (Amdahl’s Law)
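
A quick statement of the law (standard formula): if f is the fraction of the work that can be run in parallel and p is the number of processors, the speedup is

  S(p) = 1 / ((1 - f) + f / p)

so even as p grows without bound the speedup is capped at 1 / (1 - f); for example, f = 0.9 limits the speedup to 10x no matter how many cores are used.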

Performance Scalability

Granularity

Fine vs Coarse Granularity

Load Balancing

General Load Balancing Problem

Load Balancing Problem

Static Load Balancing

Dynamic Load Balancing

Granularity and Performance Tradeoffs

Communication

Factors to consider for communication

MPI: Message Passing Library

Point-To-Point
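
A minimal point-to-point sketch in C (assuming an MPI installation; compile with mpicc and run with at least two ranks): rank 0 sends one integer to rank 1 with a blocking send/receive pair.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
  int rank, value;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  if (rank == 0) {
    value = 42;
    // blocking send of one int to rank 1, tag 0
    MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
  } else if (rank == 1) {
    // blocking receive of one int from rank 0, tag 0
    MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    printf("rank 1 received %d\n", value);
  }

  MPI_Finalize();
  return 0;
}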

Synchronous vs Asynchronous Messages

Blocking vs. Non-Blocking Messages

Broadcast

A[n] = {...}
B[n] = {...}
Broadcast(B[1..n])
for (i = 1 to n)
  // distribute A to the m processors in round-robin order
  Send(A[i % m])

Reduction
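
A small sketch of a global sum with MPI_Reduce (each rank contributes one value; the combined result lands on rank 0):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
  int rank, local, global;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  local = rank + 1;  // each rank's private contribution
  // combine all local values with MPI_SUM onto rank 0
  MPI_Reduce(&local, &global, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

  if (rank == 0)
    printf("sum = %d\n", global);

  MPI_Finalize();
  return 0;
}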

Synchronization

Locality

Memory Access Latency in Shared Memory Architectures

Cache Coherence

Shared Memory Architecture

Distributed Memory Architecture

Hybrid Architecture

JAVA Thread Programming

Process

Unix process

#include <iostream>
#include <unistd.h>
using namespace std;

int main()
{
  pid_t pid;
  cout << "just one process so far" << endl;
  pid = fork();  // create a child process
  if (pid == 0)
    cout << "i'm the child" << endl;
  else if (pid > 0)
    cout << "i'm the parent" << endl;
  else
    cout << "fork failed" << endl;
  return 0;
}

Threads

Multi-process vs Multi-thread

Programming JAVA Threads

Java Threading Models

Creating Threads: method 1

class MyThread extends Thread {
  public void run() {
    // work to do
  }
}

/*
 MyThread t = new MyThread();
 t.start();
 */

Thread Names

Creating Threads: method 2

public interface Runnable {
  public abstract void run();
}
class MyRun implements Runnable {
  public void run() {
    // do something
  }
}
/*
 Thread t = new Thread(new MyRun());
 t.start();
 */

Thread Life-Cycle

Alive States

Thread Priority

yield

Thread identity

Thread sleep, suspend, resume

Thread Waiting & Status check

Thread Synchronization

Synchronized JAVA methods

Synchronized Lock Object

synchronized(anObject) {
  // execute code while holding an Object's lock
}

Condition Variables

wait() and notify()

Producer-Consumer Problem

Potential Concurrency Problems

Important Concepts in Concurrent Programming

Divide-and-Conquer Approach for Parallelization

Pthread Programming

Thread Properties

pthread

Pthreads API

Thread Management

Thread Creation
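
A minimal creation/join sketch (the worker function and the thread count are made up for illustration; compile with -pthread):

#include <pthread.h>
#include <stdio.h>

// hypothetical worker: prints the id passed in as its argument
void *worker(void *arg) {
  long id = (long) arg;
  printf("thread %ld running\n", id);
  return NULL;  // same effect as pthread_exit(NULL)
}

int main(void) {
  pthread_t threads[4];
  for (long i = 0; i < 4; i++)
    pthread_create(&threads[i], NULL, worker, (void *) i);
  for (int i = 0; i < 4; i++)
    pthread_join(threads[i], NULL);  // wait for each thread to terminate
  return 0;
}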

Thread Termination

Thread Cancellation

Joining

Mutexes

Mutex Routines

Locking/Unlocking Mutexes
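
A sketch of protecting a shared counter with a mutex (the counter and the iteration count are illustrative):

#include <pthread.h>
#include <stdio.h>

pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
long counter = 0;  // shared data protected by lock

void *increment(void *arg) {
  for (int i = 0; i < 100000; i++) {
    pthread_mutex_lock(&lock);    // enter critical section
    counter++;
    pthread_mutex_unlock(&lock);  // leave critical section
  }
  return NULL;
}

int main(void) {
  pthread_t t1, t2;
  pthread_create(&t1, NULL, increment, NULL);
  pthread_create(&t2, NULL, increment, NULL);
  pthread_join(t1, NULL);
  pthread_join(t2, NULL);
  printf("counter = %ld\n", counter);  // always 200000 with the mutex
  return 0;
}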

User’s Responsibility for Using Mutex

Condition Variables

Condition Variables Routines
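
A sketch of the usual wait/signal pattern (the ready flag is illustrative); note that the waiter re-checks the predicate in a loop to guard against spurious wakeups:

#include <pthread.h>

pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t cv = PTHREAD_COND_INITIALIZER;
int ready = 0;  // predicate protected by m

void *waiter(void *arg) {
  pthread_mutex_lock(&m);
  while (!ready)                 // loop guards against spurious wakeups
    pthread_cond_wait(&cv, &m);  // atomically releases m while waiting
  // ... consume the data ...
  pthread_mutex_unlock(&m);
  return NULL;
}

void *producer(void *arg) {
  pthread_mutex_lock(&m);
  ready = 1;
  pthread_cond_signal(&cv);      // wake one waiting thread
  pthread_mutex_unlock(&m);
  return NULL;
}

int main(void) {
  pthread_t w, p;
  pthread_create(&w, NULL, waiter, NULL);
  pthread_create(&p, NULL, producer, NULL);
  pthread_join(w, NULL);
  pthread_join(p, NULL);
  return 0;
}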

OpenMP

Shared Memory Model

Example - Matrix times vector

#pragma omp parallel for default(none) \
            private(i, j, sum) shared(m, n, a, b, c)
for (i = 0; i < m; i ++)
{
  sum = 0.0;
  for (j = 0; j < n; j ++)
    sum += b[i][j] * c[j];
  a[i] = sum;
}

When to consider using OpenMP

About OpenMP

Terminology

Components of OpenMP

About OpenMP clauses

The if/private/shared clauses

About storage association

The firstprivate/lastprivate clauses

The default clause

The reduction clause
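
A small sketch: each thread accumulates its own private partial sum, and the partials are combined into sum when the loop ends (the dot-product data here is illustrative).

#include <stdio.h>

int main(void) {
  const int n = 1000;
  double a[1000], b[1000], sum = 0.0;
  for (int i = 0; i < n; i++) { a[i] = 1.0; b[i] = 2.0; }

  // private partial sums per thread, added into sum at the end of the loop
  #pragma omp parallel for reduction(+:sum)
  for (int i = 0; i < n; i++)
    sum += a[i] * b[i];

  printf("dot product = %f\n", sum);  // 2000.0
  return 0;
}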

The nowait clause

The parallel region
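
A minimal sketch: the block below is executed once by every thread in the team.

#include <stdio.h>
#include <omp.h>

int main(void) {
  #pragma omp parallel  // fork a team of threads for this block
  {
    printf("hello from thread %d of %d\n",
           omp_get_thread_num(), omp_get_num_threads());
  }
  return 0;
}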

Work-sharing constructs

The omp for/do directive

#pragma omp for [clause[[,] clause] ...]
  <original for-loop>

Load balancing

The schedule clause
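
A sketch contrasting static and dynamic scheduling for iterations of uneven cost (work() and n are hypothetical; only the clause syntax matters here):

// static: iterations are split into equal contiguous chunks up front
#pragma omp parallel for schedule(static)
for (int i = 0; i < n; i++)
  work(i);

// dynamic: idle threads grab chunks of 4 iterations at a time,
// which balances the load when iteration costs vary
#pragma omp parallel for schedule(dynamic, 4)
for (int i = 0; i < n; i++)
  work(i);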

The SECTIONS directive

#pragma omp sections [clause(s)]
{
#pragma omp section
  <code block1>
#pragma omp section
  <code block2>
#pragma omp section
...
}

Orphaning

Synchronization Controls

Barrier
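
A sketch (the phase functions are hypothetical): no thread continues past the barrier until every thread in the team has reached it.

#pragma omp parallel
{
  phase_one();         // each thread does its share of phase one
  #pragma omp barrier  // wait for all threads
  phase_two();         // safe: phase one is complete everywhere
}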

Critical region

#pragma omp critical [(name)]
{ <code-block> }
#pragma omp atomic
<statement>

Single processor region

SINGLE and MASTER constructs

#pragma omp single [clause[[,] clause] ...]
{
  <code-block>
}
#pragma omp master
{ <code-block> }

More synchronization directives

OpenMP Environment Variables

OpenMP and Global data

The threadprivate construct

The copyin clause
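
A sketch tying the two together (counter is an illustrative name): threadprivate gives each thread its own persistent copy of a global, and copyin initializes those copies from the master thread's value when the parallel region starts.

int counter = 10;
#pragma omp threadprivate(counter)

void example(void) {  // hypothetical function
  #pragma omp parallel copyin(counter)
  {
    counter++;        // each thread updates its own private copy, starting from 10
  }
}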

OpenMP Runtime Functions

OpenMP runtime library
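
A sketch of commonly used runtime calls: set the thread count, query it inside a region, and time a section of code.

#include <stdio.h>
#include <omp.h>

int main(void) {
  omp_set_num_threads(4);       // request 4 threads for subsequent regions
  double t0 = omp_get_wtime();  // wall-clock timer

  #pragma omp parallel
  {
    #pragma omp single          // print once, from whichever thread gets here first
    printf("running with %d threads\n", omp_get_num_threads());
  }

  printf("elapsed: %f s\n", omp_get_wtime() - t0);
  return 0;
}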

OpenMP locking routines
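
A sketch of the explicit lock API (the shared total and the update helper are illustrative):

#include <omp.h>

omp_lock_t lock;

void add_to(int *shared, int value) {  // hypothetical helper
  omp_set_lock(&lock);    // blocks until the lock is acquired
  *shared += value;
  omp_unset_lock(&lock);
}

int main(void) {
  int total = 0;
  omp_init_lock(&lock);
  #pragma omp parallel for
  for (int i = 0; i < 100; i++)
    add_to(&total, i);
  omp_destroy_lock(&lock);
  return 0;
}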

Nested locking

Manycore GPU Programming with CUDA

The Need for Multicore Architectures

Many-core GPUs

Processor: Multicore vs Many-core

GPU

Applications

CPU vs GPU

GPU Architecture

GPU chip design

Popularity of GPUs

Why more parallelism?

CUDA (Compute Unified Device Architecture)

Compute Capability

CUDA - Main Features

CUDA device and threads

CUDA Hello World

#include <stdio.h>
__global__ void hello_world(void) {
  printf("Hello World\n");
}

int main (void) {
  hello_world<<<1, 5>>>();
  cudaDeviceSynchronize();
  return 0;
}

C Language Extension

Simple Processing Flow

  1. Copy input data from CPU memory to GPU memory
  2. Load GPU program and execute, caching data on chip for performance
  3. Copy results from GPU memory to CPU memory

Hello World! with Device Code

Memory Management

__global__ void add(int *a, int *b, int *c) {
  *c = *a + *b;
}
int main (void) {
  int a, b, c;
  int *d_a, *d_b, *d_c;
  int size = sizeof(int);
  
  cudaMalloc((void **) &d_a, size);
  cudaMalloc((void **) &d_b, size);
  cudaMalloc((void **) &d_c, size);
  
  a = 2;
  b = 7;
  cudaMemcpy(d_a, &a, size, cudaMemcpyHostToDevice);
  cudaMemcpy(d_b, &b, size, cudaMemcpyHostToDevice);
  
  add<<<1,1>>>(d_a, d_b, d_c);
  
  cudaMemcpy(&c, d_c, size, cudaMemcpyDeviceToHost);
  
  cudaFree(d_a);
  cudaFree(d_b);
  cudaFree(d_c);
  return 0;
}

Running in Parallel

Moving to Parallel

Vector Addition on the Device

__global__ void add(int *a, int *b, int *c) {
  c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}
// add<<<N, 1>>>(...);

CUDA Threads

Combining Blocks and Threads

__global__ void add(int *a, int *b, int *c, int n) {
  int index = threadIdx.x + blockIdx.x * blockDim.x;
  if (index < n)
    c[index] = a[index] + b[index];
}
// add<<<(N + M - 1) / M, M >>>(d_a, d_b, d_c, N);

1D Stencil

Implementing Within a block

Sharing Data Between Threads

__global__ void stencil_1d(int *in, int *out) {
  __shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
  int gindex = threadIdx.x + blockIdx.x * blockDim.x;
  int lindex = threadIdx.x + RADIUS;

  temp[lindex] = in[gindex];
  if (threadIdx.x < RADIUS) {
    temp[lindex - RADIUS] = in[gindex - RADIUS];
    temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
  }

  __syncthreads();
  int result = 0;
  for (int offset = -RADIUS; offset <= RADIUS ; offset ++)
    result += temp[lindex + offset];
  out[gindex] = result;
}

Coordinating Host & Device

Reporting Errors
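
A sketch of the usual checking pattern (the kernel and its launch configuration are placeholders): a kernel launch returns no status directly, so check cudaGetLastError() right after the launch, and check the value returned by the next synchronizing call for errors raised while the kernel ran.

kernel<<<blocks, threads>>>(...);      // hypothetical launch
cudaError_t err = cudaGetLastError();  // launch-time errors (bad configuration, etc.)
if (err != cudaSuccess)
  printf("launch failed: %s\n", cudaGetErrorString(err));

err = cudaDeviceSynchronize();         // errors raised during kernel execution
if (err != cudaSuccess)
  printf("kernel failed: %s\n", cudaGetErrorString(err));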

Device Management