
Power management in XX systems XX = real-time | embedded |


Academic year: 2022

(1)

Power management in XX systems
XX = real-time | embedded | high performance | wireless | …

By: Alexandre Ferreira Daniel Mossé

http://www.cs.pitt.edu/PARTS

Collaborators (faculty): Rami Melhem, Bruce Childers

PhD students: Hakan Aydin, Dakai Zhu, Cosmin Rusu, Nevine AbouGhazaleh, Ruibin Xu

(2)

Green Computing

Gartner's top strategic technology for 2008:

Green IT

“This one is taking on a bigger role for many reasons, including an increased awareness of environmental danger; concern about power bills; regulatory requirements; government procurement rules; and a sense that corporations should embrace social responsibility.”

From Green500.org about supercomputers

For decades now, performance has been synonymous with "speed" (as measured in FLOPS). This has led to the emergence of supercomputers that consume egregious amounts of electrical power and produce so much heat that they require extravagant cooling facilities.

Other performance metrics have been largely ignored, e.g., reliability, availability, and usability.

As a consequence, all of the above has led to an extraordinary increase in the total cost of ownership (TCO) of a supercomputer.

(3)

Green Computing

From [Gunaratne et al, IJNM 2005]

In the USA, the IT equipment comprising the Internet uses about US$6 Billion of electricity per year.

(4)

Outline and where are we?

Motivation

Power consumed by servers, devices, etc

Thermal limitations

Cost of energy

Examples of hw platforms and their E consumptions

Basic techniques

Models

Hardware

Software

HPC + power

Products

Research projects

Research topics (open problems)

(5)

Power Management

Why?

Cost! More performance means more energy, which means more $$

Heat: HPC servers (multiprocessors), high-performance CPUs

Dual metrics: maintain QoS, reduce energy

Battery operated: satellites

System design: the highest component speed does not imply the best system performance

(6)

Basics of Energy and Power

Power (P) consumption increases with the voltage fed into the component (quadratically in the supply voltage, as the later models show)

Some systems have a power cap (power limit, cannot operate above that threshold)

The energy consumed doing a task (E) is the integral in time (T) of the power function.

A simple approximation: E = P x T, if P is the average power

Some systems have an energy budget (e.g., systems on batteries or pay-per-use systems with a $ budget)

Thermal dissipation is proportional to the Power (the more power consumed, the more heat is generated)

Some systems have a thermal cap

More heat generated means more cooling is needed
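The E = P × T approximation above can be sketched in a couple of lines; the wattage and duration below are illustrative values, not measurements:

```python
def energy_joules(avg_power_w, duration_s):
    """E = P * T: energy as average power times elapsed time."""
    return avg_power_w * duration_s

# A 2 W embedded CPU running for 10 minutes consumes 1200 J.
e = energy_joules(2.0, 600)
```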

(7)

Introduction

• CPU (including on-chip caches) and memory are the major energy consumers

Servers [IBM]: memory 42%, others 30%, CPU 28%

Portable devices [Celebican’04]: others 51%, CPU 26%, memory 23%

• Energy is a prime resource in computing systems

• Power management is subject to performance degradation

(8)

Computing Performance

Small Systems – (10s Gigaflops)

x86, Power, Itanium, etc.

Specialized CPUs – (1 Teraflops)

Cell, GPUs (Nvidia Tesla, ATI)

SMPs – (10s Teraflops)

x86, Power, Itanium

Shared memory model with up to 1000 cores, 32/64 more common

Cluster – (1 Petaflops)

x86, Power, Cell

Fastest machines available (Blue Gene and Roadrunner) up to 128K cores

(9)

Power vs Performance Efficiency

Generic CPUs

Fast but power inefficient

10-100W per core (Athlon 64, Intel Core 2, Power 5/6)

New high-performance chips are all multicore

New versions of x86, Power, SPARC.

Power efficient designs

Cell (100s Gigaflops)

1 embedded PowerPC and 8 SPEs (fast but limited processors)

GPUs (1 Teraflops)

Up to 240 dedicated CPUs in a 1.4B-transistor chip

Created for graphics processing, used for parallel applications

Examples: Nvidia Tesla and AMD Firestream

Floating point accelerators (1 Teraflops)

(10)

Power vs Performance Efficiency

Intel ATOM

5-6x slower than latest Core 2 processors

But uses from 0.5 to 2W instead of 30-90W

Whole system design is important

Design solved CPU problem: CPU consumes 2W

Chipset became the new problem, consuming 20W

Bottleneck changed. Will it change again? For sure!

(11)

High Performance Computing

 Blue Gene/L (~0.3 Gflops/W)

Uses an embedded processor design

PowerPC 440 derived design with 2 cores per chip

Design allows 65K CPUs (128K Cores)

Fastest machine up to Jun/2008

 Roadrunner (~0.4 Gflops/W)

Uses a hybrid design

7K Opterons (2 cores) (used for I/O processing)

12K Cell processors (FP computation)

Jun/2008: achieved 1 Petaflops

(12)

Power in High Performance Computing

Most systems get less than 0.03 Gflops/W

A Roadrunner-equivalent machine at that efficiency would consume 39MW

Datacenters have an efficiency factor ranging from ¼ (very efficient) to 2 (inefficient) Watts of cooling per computing Watt [HP]

(13)

System Power Management

Objective: Achieve overall energy savings with minimum performance degradation

Previous work in the literature targets a single component: CPU, caches, main memory, or disks

These achieve local energy savings

How about overall system energy? Consider the effect of saving power in one component on the energy of the other system components.

(14)

Memory Power

DDRx DRAM

DDR (2.5V), DDR2 (1.8V), DDR3 (1.5V)

Bigger memory chips use less power/bit (for the same technology)

Higher bus speed increases consumption

Increased bandwidth (faster data transfer)

No significant reduction in access time for a single word (the bottleneck is in the DRAM array)

~1W to 4W per DIMM

Trade-off: more DIMMs or fewer DIMMs for the same total memory

More DIMMs: parallelism masks some latency

Fewer DIMMs: reduced memory power

(15)

Memory Power (Kingston memory)

Memory Type    # Chips/DIMM  Power (W)
400MHz/512MB   8             1.30
400MHz/1GB     16            2.30
400MHz/2GB     16            2.10
533MHz/512MB   8             1.44
533MHz/1GB     16            2.49
533MHz/2GB     16            3.23
800MHz/512MB   8             1.30
800MHz/1GB     16            1.87
800MHz/2GB     16            1.98

Newer technology favors low power designs (lower Vdd)

DDR (2.5V), DDR2 (1.8V) and DDR3 (1.5V)

Denser memory is less power hungry (more memory in chip)

Higher bus speed implies higher consumption for same technology

(16)

CPU and Memory Energy

We can reduce power in processors by reducing speed (using Dynamic voltage scaling DVS)

Memory stays in standby for longer periods, so memory energy increases.

[Figure: energy vs. CPU frequency between fmin and fmax: Ecpu falls as frequency is reduced, Emem rises, and Etot has an interior minimum]
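The CPU/memory trade-off can be sketched numerically. The constants below (switching-energy coefficient `a`, memory standby power `p_mem`, the work size) are made-up illustrative values, not measurements:

```python
def total_energy(f, work_cycles=1e9, a=1e-27, p_mem=0.5):
    """E_cpu grows as f^2 per cycle (dynamic power with Vdd ∝ f),
    while E_mem = standby power x runtime grows as the run stretches."""
    t = work_cycles / f                  # execution time at frequency f
    e_cpu = a * f**2 * work_cycles       # CPU dynamic energy
    e_mem = p_mem * t                    # memory standby energy
    return e_cpu + e_mem

freqs = [i * 1e8 for i in range(2, 21)]  # 200 MHz .. 2 GHz
f_best = min(freqs, key=total_energy)    # interior minimum, not fmin or fmax
```

With these constants the total is minimized at an intermediate frequency: running slower than that makes the memory standby term dominate, running faster makes the CPU term dominate.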

(17)

Other metrics to consider

Maintain QoS while reducing energy

Quality of service can be a deadline, a certain level of throughput, or a specified average response time

Achieve the best QoS possible within a certain energy budget

Battery-operated devices/systems, such as satellites. When batteries are exhausted, they need to be

Replaced (physical replacement is expensive!)

Recharged (given constraints, such as solar power)

Or the entire device is replaced (if battery replacement is not possible)

(18)

Application Characterizations

Different applications will consume different amounts of power

Weather forecasting

Medical applications (radiology)

Military games (we have no information about it)

One application will have different phases throughout execution

Use some units more than others

More memory intensive (e.g., initialization)

More CPU intensive

More cache intensive

(19)

High Performance Computing

Weather Prediction

 1.5Km scale globally requires 10 Petaflops

 No existing machine can run this model

 Conventional CPUs (1.8M cores)

US$ 1B to build

200MW to operate (US$ 100M per year)

 Using embedded processors (20M Cores)

US$ 75M to build

 4MW to operate (US$ 2M per year)

“Towards Ultra-High Resolution Models of Climate and Weather”, by Michael Wehner, Leonid Oliker and John Shalf, International Journal of High Performance Computing Applications 2008; 22: 149.

(20)

Tomography (Image reconstruction)

 4 Nvidia 9800 GX2 (2 GPUs per card) in a PC

 Faster than a 256-core AMD Opteron 2.4GHz cluster (CalcUA)

 €4000 cost with 1500W consumption

 University of Antwerp

(21)

Application Characterizations

 Can we characterize usage of units within processor?

 How much cache vs DRAM vs CPU vs…

Applications have different behaviors

The next slides show these behaviors for:

 the 3 units mentioned: L2 cache, CPU, and DRAM, and for

 4 applications from the SPEC benchmark: art, gcc, equake, and parser

(22)

CPI: cycles per instruction; L2PI: L2 accesses per instruction; MPI: memory accesses per instruction

(23)

Workload Variations


(24)


(25)

Workload Variations


(26)

Power Management

Why?

Cost! More performance, more energy, more $$

Heating: in HPC Servers (multiprocessors)

Dual metrics: maintain QoS, reduce energy

Battery operated: satellites

System design

How and why?

Power off unused parts: disks, (parts of) memory, (parts of) CPUs, etc

Gracefully reduce the performance

Voltage and frequency scaling

Rate scaling

Transmit power scaling

(27)

Outline and where are we?

Motivation

Power consumed by servers, devices, etc

Thermal limitations

Cost of energy

Examples of hw platforms and their E consumptions

Basic techniques

Models

Hardware

Software

HPC + power

Products

Research projects

Research topics (open problems)

(28)

Hardware Models

Component            Sleep Power  Static Power  Dynamic Power  Transition Overhead  Power States available
CPUs or Cores        Medium       Highest       Highest        Medium               Sleep, standby, DVS
Memory/Caches        Low          Medium        High           Medium               Sleep, Standby, Active
Communication Links  Low          Medium        High           High                 Sleep, Active

(29)

Need to organize the next few slides on the Power Model

(30)

Power Models

Dynamic voltage scaling (DVS) is an effective power/energy management technique

Gracefully reduce performance to save energy

CPU dynamic power: Pd = Cef · Vdd² · f

Cef: switched capacitance

Vdd: supply voltage

f: processor frequency, linearly related to Vdd

Since f is linearly related to Vdd, power p ∝ f³
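The cubic law can be checked with the Pd formula directly; the capacitance and voltage values below are illustrative:

```python
def dynamic_power(c_ef, vdd, f):
    """Dynamic CPU power: P_d = C_ef * Vdd^2 * f."""
    return c_ef * vdd**2 * f

# Halve both Vdd and f (values are illustrative, Vdd ∝ f):
p_full = dynamic_power(1e-9, 1.2, 1e9)
p_half = dynamic_power(1e-9, 0.6, 5e8)
# Power drops by 8x (the cubic law); the task takes twice as long,
# so energy still drops by 4x.
```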

(31)

CPU energy model

Power components:

Dynamic: Pd = k · α · C · V² · f

Leakage: Pl = l · (V − Vt) · V

[Figure: speed vs. time under DVS]

Dynamic Voltage Scaling (DVS): reducing voltage/frequency linearly reduces energy quadratically

(32)

DVS Power and Energy Models

A more accurate power model is:

P = Ps + h (Pind + Pd)

where:

Ps represents sleep power (consumed when system is in low power mode)

h is 0 when system is sleeping and 1 when active

Pind represents activity independent power

Memory static power, chipset, communications links, etc.

Pd represents activity dependent power

Pd = Cef · Vdd² · f (the CPU dynamic power)
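One consequence of the Pind term in the model above is a "critical speed": below it, stretching the run time makes the activity-independent power dominate. A sketch in normalized units (all constants illustrative):

```python
def task_energy(f, p_ind=0.1, c_ef=1.0, cycles=1.0):
    """E = (P_ind + C_ef * f^3) * (cycles / f) in normalized units:
    frequency-independent power makes very low speeds inefficient."""
    return (p_ind + c_ef * f**3) * (cycles / f)

# Sweeping normalized frequencies shows an interior minimum, i.e.
# slowing down is only worthwhile down to a critical speed:
fs = [i / 100 for i in range(10, 101)]
f_crit = min(fs, key=task_energy)
```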

(33)

DRAM memory energy model

Dynamic and static Power components

Dynamic power management

Transition the DRAM chips to lower power state

Ex: Rambus {active, standby, nap, powerdn}

Trade-off: energy savings vs. transition power and delay

[Figure: memory requests over time and power-state transitions P0 → P1 → P2]

(34)

Memory Power Model

Memory states: Active, Standby

Active: Pa = Ps + (M · Eac) / Tac

Standby: Pst

Pa = total power in active mode
Ps = static power in active mode
Tac = measurement time
M = number of memory accesses over period Tac
Eac = energy per memory access
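The active-mode formula above is a one-liner; the wattages and counts below are illustrative:

```python
def active_memory_power(p_static, accesses, e_per_access, window_s):
    """P_a = P_s + (M * E_ac) / T_ac: static power plus the average
    power contributed by M accesses over the measurement window."""
    return p_static + accesses * e_per_access / window_s

# 1 W static power plus a million 10 nJ accesses in one second:
p_a = active_memory_power(1.0, 1e6, 1e-8, 1.0)
```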

(35)

Cache Energy Model

Cacti 3.0 power and time models

Access latency and energy are fn. of:

Cache size, associativity and block size

Longer bit/word lines ==> higher latency

(36)

Outline and where are we?

Motivation

Power consumed by servers, devices, etc

Thermal limitations

Cost of energy

Examples of hw platforms and their E consumptions

Basic techniques

Models

Hardware

Software

HPC + power

Products

Research projects

Research topics (open problems)

(37)

Hardware capabilities

Communication links: transmit (sending) power scaling, link rates (encoding), timeout, on-off

Memory and/or caches: timeout, redundancy (use or add), on-off

CPU, CPUs or Cores: performance DVS, multi-clock-domain processors, timeout, redundancy (use or add)

Biggest problem is still who manages it, at what granularity (physical and power), and how often

(38)

Problems and solutions

Solutions:

Use DVS, on/off and clock gating, at hardware level

Complex software solutions migrating to hardware

Biggest problem is still

who manages it (hardware, software, user, application, hybrid)

at what granularity (physical and power)

how often (triggered by events or periodically)

whether it is synchronized among devices or not

(manage power for many devices in coordinated way)

if not synchronous, then at what hierarchy/order?

(39)

Who should manage power?

Compiler (knows the future better): static analysis of application source code

OS (knows the past better): run-time information

Options:

 scheduling at the HW level

 scheduling at the OS level

 compiler analysis (extracted)

 application (user/programmer furnished)

 multi-layer and cross-layer

(40)

Preprocessing (compiler or user):

(1) Collect timing information

(2) Set interrupt service routine

(3) Insert PMHs in application code

Run-time (OS):

(4) Compute worst-case remaining cycles

(41)

Metrics to optimize

• Minimize energy consumption, subject to time constraints (deadlines or rates)

• Maximize system utility (reward), subject to energy constraints

Increased reward with increased execution

Determine the most rewarding subset of tasks to execute

Determine the appropriate versions to execute

• Most importantly, etc!

(42)

Start with hardware solutions

“Doing hardware”, use:

Lower power components

Lower frequency of operation

P ~ f², so bigger power gains come with smaller performance loss

Clock gating

Turning off circuits whenever they are not needed:

Functional units, cache lines, etc.

Used in any large CPU today

Throttling

Reduce utilization by holding requests

DVS

Change supply voltage to reduce power

(43)

Start with memory (no real reason, should we come up with one that sounds good??  )

(44)

Memory energy savings

Low power DRAM system (AMD systems)

Uses DDR2-DRAM memory

(45)

Memory energy savings

Uses FB-DRAM memory (Intel Systems)

(46)

Memory energy savings

Different systems will consume different amounts of power in memory

Need to characterize bottleneck

Deal with memory savings whenever needed!

Timeout mechanisms:

Turn memory off when idle

Turn some parts of memory off when idle

Power states, depending on how much of the chip is turned off

Self-refresh is the most “famous” or well-known

Who manages timeouts?

(47)

Memory energy savings

Exploiting existing redundancy

Distribute workload evenly

On-off (or in general Distribute workload unevenly)

Adding hardware

Caching near memory

Flash or other non-volatile memory

More caching at L1, L2, L3

Exploit power states (in increasing order of power)

Powerdown (all components are off)

Nap (self-refresh)

Standby

Active

(48)

Near-memory Caching for

Improved Energy Consumption

(49)

Near-CPU vs. Near-memory caches

Thesis: need to balance the allocation of the two for better delay and energy.

[Figure: CPU with its caches, and a cache next to main memory]

 Caching masks memory delays

 Where? In main memory? In the CPU?

 Which is more power and performance efficient?

(50)

Cached-DRAM (CDRAM)

On-memory SRAM cache

[Hsu’93, Koganti’97]

Accessing the fast SRAM cache improves performance

High internal bandwidth → use large block sizes

How about energy?

(51)

Near-memory caching:

Cached-DRAM (CDRAM)

On-memory SRAM cache

Accessing the fast SRAM cache improves performance

High internal bandwidth → use large block sizes

CDRAM improves performance but consumes more energy

(52)

Power-Aware CDRAM (PA-CDRAM)

Power management in DRAM-core

Use moderate sized SRAM cache

Turn the DRAM core to low power state

Use immediate shutdown

Power management in near-memory caches

Use distributed near-memory caches

Choose adequate block size

(53)

Power-aware Near-memory Cache vs. DRAM energy

PA-CDRAM energy: Etot = Ecache + EDRAM

[Figure: E_DRAM, E_cache and E_tot (mJ) vs. block size (64 to 2048 bytes), for a CPU-intensive workload (bzip) and a memory-intensive one (mcf)]

(54)

PA-CDRAM design

(55)

PA-CDRAM new Protocol

Need new commands:

To check whether data is in cache

Return data from cache, if so

Read data from DRAM core, if not

(56)

Power-aware CDRAM

Power management in near- memory caches

Distributed near-memory caches

Choose adequate cache

configuration to reduce miss rate & energy per access.

Power management in DRAM-core

Use moderate sized SRAM cache

Turn the DRAM core to “low power state”

Use immediate shutdown

Near-memory versus DRAM energy trade-off – cache block size

[Figure: E_DRAM, E_cache and E_tot (mJ) vs. block size (64 to 2048 bytes)]

(57)

Exploiting memory power management

Carla Ellis [ISLPED 2001]: memory states...

Kang Shin [Usenix 2003] and Ellis [ASPLOS 2000]: smart page allocation, turn off pages that are unused.

(58)

Integrated CPU+XX DVS

(59)

Independent DVS : Basic Idea

DVS sets the frequency of a domain based on its workload: speed ∝ workload

Workload is measured by performance counters:

# instructions for the CPU core

# L2 accesses for the L2 cache

# L2 misses for main memory

Control loop: periodically measure the workload to set the speed

[Figure: per-domain voltage controller tracking workload and voltage over time]

(60)

Problem: Positive Feedback Intervals

[Figure: positive feedback between the CPU and L2 domains. A lower CPU frequency slows instruction fetch, causing fewer L2 accesses, which lowers the cache frequency; the resulting higher L2 latency causes more CPU stalls, which lowers the CPU frequency again, and the cycle repeats]

(61)

Problem

Set the voltage and frequency for each domain to reduce the combined energy consumption with little impact on delay

Solution: Online Integrated Dynamic

Voltage Scaling in CPU-core and L2 cache.

Heuristic based online IDVS (online-IDVS)

Machine learning based IDVS (ML-IDVS)

(62)

Integrated DVS (IDVS)

Consider two domain chips CPU and L2 cache

Voltage Controller (VC) sets the configuration using observed activity, applied via DVS

This also works for different types of domains (e.g., CPU and memory)

[Figure: processor chip with a CPU domain (L1 cache, FUs) and an L2 domain managed by the VC, connected to main memory]

(63)

IDVS Policy

Rules for setting voltages

Break positive feedback in rules 1 & 5

(64)

Positive Feedback Scenarios

• Workloads increase in both domains (rule 1)

– Indicate a start of memory bound program phase

Preemptively reduce core speed

Increasing core speed will exacerbate load in both domains.

Decrease core speed rather to save core energy

• Workloads decrease in both domains (rule 5)

– Longer core stalls are a result of local core activity

Increasing or decreasing the core speed may not eliminate the source of these stalls.

Maintain the core speed unchanged

[Table: rules indexed by L2 accesses and IPC, setting V$ (cache voltage) and Vc (core voltage); rules 1 and 5 highlighted]

(65)

Machine Learning (ML-IDVS): general idea

Training phase: an automatic policy generator (learning engine) produces the integrated DVS policy

Runtime: the policy determines frequencies & voltages

(66)

Learning approach I: Obtain training data

Application execution is divided into intervals called states

State: <operating frequencies, features>, e.g., <fcpu, fL2, CPI, L2PI, MPI>

Example training samples: <1, 0.5, 4, 0.05, 0.0002>, <1, 0.5, 2, 0.15, 0.006>

(67)

Learning approach

Machine learning to generate policy rules:

(1) Obtain training data

(2) State-table construction

(3) Generation of rules

State table indexed by <fCPU, fL2, CPI, L2PI, MPI>; stores the best frequency combination <fcpu, fL2, CPI, L2PI, MPI, delay, energy>, chosen by minimum energy·delay among the candidate settings (fcpu, f$) ∈ {1, 0.5} × {1, 0.5}:

<1, 1, 4, 0.05, 0.001, 5, 0.4>
<1, 0.5, 3.5, 0.05, 0.004, 4.2, 0.45>
<0.5, 1, 2.5, 0.05, 0.0003, 4.4, 0.55>
<0.5, 0.5, 2.4, 0.05, 0.001, 4.5, 0.6>

Example generated rules:

If (L2PI > 0.32) and (CPI < 2) and (MPI > 0.0003) then fL2 = 1GHz else fL2 = 0.5GHz

If (MPI > 0.002) and (CPI < 3) and (MPI < 0.001) then fcpu = 0.5GHz else fcpu = 1GHz

(68)

Power scheduling

(69)

Power-Aware scheduling

Main idea: slow down the CPU, thus extending the computation

Compute the maximum delay that can be allowed

Consider:

Preemptions, higher-priority tasks

Overhead of switching speeds within a task, or when switching speeds on task resume

The model dictates as constant (static) a speed as possible (quadratic relationship of voltage and frequency)

(70)

Voltage and frequency relationship

Hidden assumptions:

Known execution times

Quadratic relationship of V and f

Continuous speeds

Linux has several “governors” that regulate voltage and frequency

OnDemand is the most famous, implemented in the kernel, based on the CPU utilization.

Userspace and Powersave are the other two, and can be used by applications

(71)

Linux OnDemand

For each CPU

If util > threshold --> increase freq to MAX

If util < threshold --> decrease freq by 20%

Parameters:

ignore_nice_load, sampling_rate, sampling_rate_min, sampling_rate_max, up_threshold

Down threshold was 20%, constant

Newer algo keeps CPU at 80% util
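The ondemand decision rule described above can be sketched as follows; `up_threshold` mirrors the sysfs tunable of the same name, the rest is simplified and illustrative:

```python
def ondemand_next_freq(util, cur_freq, f_min, f_max, up_threshold=0.80):
    """Sketch of the classic ondemand policy: jump straight to the
    maximum frequency under load, otherwise step down by 20%."""
    if util > up_threshold:
        return f_max                   # high load: go to MAX immediately
    return max(cur_freq * 0.8, f_min)  # light load: decrease freq by 20%
```

The asymmetry (jump to max, decay slowly) biases the governor toward performance: a utilization spike is served at full speed right away, while savings accumulate only during sustained idleness.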

(72)

(cont)

Static Power Management (SPM)

Speed adjustment based on remaining execution time of the task

A task very rarely consumes its estimated worst case execution time (WCET)

Being able to detect whether tasks finish ahead of (predicted) time is beneficial

Knowledge of the application is also good

Knowing characteristics of application and system allows for optimal power and energy savings

(73)

Power Aware Scheduling for Real-time

Static Power Management (SPM) with static slack: uniformly slow down all tasks

Gets more interesting for multiprocessors

[Figure: tasks T1 and T2 run at fmax and finish before deadline D, leaving static slack and idle time at energy E; running both at fmax/2 fills the deadline and consumes E/4]
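The uniform slowdown can be sketched as below; task lengths and the deadline are illustrative, and the energy claim assumes the P ∝ f³ model from earlier slides:

```python
def spm_speed(wcets, deadline, f_max=1.0):
    """Static slack reclamation: uniformly slow all tasks so their total
    worst-case demand (measured at f_max) exactly fills the deadline."""
    demand = sum(wcets)                # total worst-case time at f_max
    return f_max * demand / deadline   # assumes demand <= deadline

# Two 1s tasks and a 4s deadline -> run at fmax/2; with P ∝ f^3 and
# runtime ∝ 1/f, energy scales as f^2, i.e. E/4, matching the slide.
f = spm_speed([1.0, 1.0], 4.0)
```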

(74)

Basic DVS Schemes

[Figure: how slack is allotted under each scheme]

Proportional Scheme

Greedy Scheme

Statistical Scheme

(75)

Dynamic Speed adjustment

[Figure: worst-case execution time (WCET) vs. actual execution time (ACET)]

(76)

Dynamic Speed adjustment

[Figure: remaining WCET vs. remaining time]

Speed adjustment based on the remaining WCET

(77)

Dynamic Speed adjustment

[Figure: remaining WCET vs. remaining time]

Speed adjustment based on the remaining WCET

(78)


Speed adjustment based on remaining WCET
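The remaining-WCET rule sketched in these slides reduces to one division; the cycle counts and deadline below are illustrative:

```python
def dynamic_speed(rem_wcet_cycles, time_to_deadline, f_max):
    """Dynamic slack reclamation (a sketch): run just fast enough to
    finish the remaining worst case by the deadline. Early completions
    shrink rem_wcet_cycles faster than the clock, so speed keeps dropping."""
    f_needed = rem_wcet_cycles / time_to_deadline
    return min(f_needed, f_max)
```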

(79)

More assumptions

Even for real-time systems, another (huge?) assumption is knowledge of the execution times of the tasks, which can be obtained through:

Compiler assistance

User-supplied information

Profiling

Without exact information about task execution times, one cannot determine a constant optimal speed

Use distribution information about the times

What distribution, what mean, what average, etc., makes a difference

(80)

Schemes with Stochastic Information

The intuition: compute the probability of a task finishing early, and reduce the speed based on that.

The expected energy consumption is thus minimized by gradually increasing the speed as the task progresses, an approach first named Processor Acceleration to Conserve Energy (PACE) in 2001.

[Figure: cdf(x) rising from 0 to 1]

(81)

Schemes with Stochastic Information

Existing schemes, such as PACE and GRACE, differ in the way they compute the speed (based on how they define the problem).

Many schemes have the following shortcomings:

use well-defined functions to approximate the actual power function

solve the continuous version of the problem before rounding the speeds to the available discrete speeds

do not consider the idle power

assume speed change overhead is zero

(82)

Practical PACE

To make PACE practical, algorithms must

Use directly discrete speeds

Make no restriction on the form of power functions

Take the idle power into consideration

Consider the speed change overhead

Quantize: continuous → discrete speeds

Cannot look at all combinations of speeds: too computationally expensive

Carry out optimizations to reduce the complexity

(83)

Problem Formulation

Minimize    Σ_{0 ≤ i ≤ r} s_i · F_i · e(f_i)

Subject to  Σ_{0 ≤ i ≤ r} s_i / f_i ≤ D

where

s_i = b_{i+1} − b_i

F_i = Σ_{b_i ≤ j < b_{i+1}} (1 − cdf(j − 1))

[Figure: cdf(x) over cycle counts, with breakpoints b0 = 1, b1, b2, b3, …, br, b_{r+1} = WC + 1 and a frequency f0, f1, f2, …, fr assigned to each segment]

(84)

Graphical Representation of the Problem

(85)

Graphical Representation of the Problem

The problem is reduced to finding a path from v0 to vr+1 such that the energy of the path is minimized while the time of the path is no greater than the desired completion time D

(86)

A Naive Approach

[Figure: path graph v0 → v1 → v2 → v3]

|LABEL(0)| = 1, |LABEL(1)| = M, |LABEL(2)| = M², |LABEL(3)| = M³

Exponential growth!

(87)

Elimination of Labels

The key idea is to reduce and limit the # of labels in the nodes (after each iteration)

There are two types of eliminations

optimality preserving eliminations

eliminations that may affect optimality but still allow for a performance guarantee

(88)

Optimality Preserving Eliminations

• For any label, if the deadline will be missed even if the maximum frequency is used after this point, we eliminate it.

For any two labels, if one is dominated by the other, we eliminate the one which is being dominated

If (e,t) ≤ (e’, t’), we eliminate (e’, t’)

Eliminate all labels whose energy is greater than an upper bound (e.g., the solution where every frequency is rounded up to the closest discrete frequency)

(89)

Optimality Preserving Eliminations

[Figure: path graph v0 → v1 → v2 → v3 with |LABEL(1)| << M, |LABEL(2)| << M², |LABEL(3)| << M³ — hopefully!]

After the first type of eliminations, the size of LABEL(i) decreases substantially. However, the running time of the algorithm still has no polynomial-time bound guarantee.

(90)

Eliminations Affecting Optimality

For each node, we sort the labels by energy and eliminate some of them so that the energies of any adjacent remaining labels differ by at most δ

We choose δ carefully: the solution is guaranteed to be within a factor of 1 + ε of the optimal

PPACE is a so-called Fully Polynomial Time Approximation Scheme (FPTAS): it obtains ε-optimal solutions, within a factor of 1 + ε of the optimal, in time polynomial in 1/ε
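The dominance elimination described above (drop any (energy, time) label beaten in both coordinates) can be sketched as a Pareto-front sweep; the label values in the usage note are made up:

```python
def prune_dominated(labels):
    """Optimality-preserving elimination: keep only Pareto-optimal
    (energy, time) labels. A label worse in both coordinates than
    another can never extend to a better schedule, so it is dropped."""
    kept, best_time = [], float("inf")
    for e, t in sorted(labels):      # ascending energy
        if t < best_time:            # strictly better time than all cheaper labels
            kept.append((e, t))
            best_time = t
    return kept
```

For example, pruning {(3, 4), (1, 5), (4, 2), (2, 3)} discards (3, 4), which (2, 3) dominates.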

(91)

Stochastic power management

[Figure: slack within deadline D split into β1·D and (1 − β1)·D]

(92)

[Figure: tasks T1–T4 compared under thresholds β1 = xx%, β2 = xx%, β3 = xx%, β4 = 100%]

(93)

An alternate point of view

[Figure: worst-case (WCE) vs. actual execution times over time, with the average (AV) marked; slack is reclaimed after early completions or stolen in advance]

(94)

Run-time information

Who should manage?

Compiler (knows the future, can steal slack): static analysis of application source code

OS (knows the past, can reclaim slack)

PMHs: power management hints

PMPs: power management points

[Figure: timeline with PMHs inserted in the code and interrupts for executing PMPs]

(95)

Following slides are on the compiler’s ability to insert code, etc.

(96)

Inter-task vs. Intra-task DVS

[Figure: speed vs. time, with intra-task speed changes before the deadline]

Trade-off: # of speed changes vs. energy consumption

Return to the question of where to manage: how many changes, and where?

(97)

Collaborative CPU power management

OS-directed vs. compiler- directed intra-task DVS.

Ex: Gruian’01, Shin et al.’01

The cross-layer scheme uses path-dependent information AND runtime information

The compiler provides hints (PMHs) about the application's progress to the OS, which schedules the speed.

(98)

Compiler and OS collaboration

Compiler: inserts PMHs in the code to compute timing information

OS: periodically computes and sets the new speed

[Figure: timeline with PMHs inserted by the compiler and OS interrupts invoked for executing PMPs, spaced by the interrupt interval]

(99)

OS role

Periodic execution of an ISR; the interrupt interval controls the number of speed changes

ct = read current time
f_curr = read current freq
f_new = calc_Speed(ct, d, WCR)
if (f_new ≠ f_curr)
    lookup power level(f_new)
    set_speed(f_new)
interval_time = interval_cycles / f_new
rti

Trade-off: # of PMPs vs. energy consumption

(100)

Compiler role

Profiling phase to

determine interrupt interval

PMH placement and WCR computation at branches, loops, procedure bodies, etc.

[Figure: control-flow graph with PMH 1–PMH 5 placed around branches and a call to P()]

(101)

CAVEAT: non-linear code

• Remaining WCET is based on the longest path

• Remaining average case execution time is based on the branching probabilities (from trace information).

[Figure: branching control flow with probabilities p1, p2, p3; min, average, and max path lengths]

(102)

Possible optimizations

(103)

Maximizing system’s utility

(as opposed to minimizing energy consumption)

Time constraints (deadlines or rates)

Energy constraints

System utility (reward): increased reward with increased execution

Determine the most rewarding subset of tasks to execute

Determine the appropriate versions to execute

(104)

Continuous frequencies and continuous reward functions → optimal solutions

Discrete operating frequencies, no reward for partial execution → heuristics

Version programming – an alternative to the IRIS (IC) QoS model

EXAMPLE: for homogeneous power functions, maximum reward is achieved when power is allocated equally to all tasks.

[Figure: heuristic loop – add a task; if a constraint is violated, repair the schedule]

(105)

Satellite systems

(additional constraints on energy and power)

Example: a solar panel (needs light) recharges a battery; the available power over time is split between storing (recharging) and consumption, and merged back when needed

Tasks are continuously executed (frame-based system)

Keep the battery level above a threshold at all times

Can apply any of the three dynamic policies (e.g., greedy, speculative, and proportional) while keeping the system schedulable

(106)

MultiCPU systems

(107)

Multi-CPU Systems

Multicore (currently more common, several cores per chip, private L1, shared L2)

Multiprocessor (private L1 and L2, but shared memory)

Clusters (private L1, L2, memory)

Return to question: Who should manage?

Centralized vs distributed

Front-End vs any node?

(108)

Partition Tasks to Processors

[Figure: processors P P P P served from a global queue]

• Partitioned: each processor applies PM individually (distributed)

• Global queue: global management (shared memory)

 Which is better??? Let's vote

(109)

Energy-efficient Clusters

 Applied to web servers (application)

Use on-off and DVS to save energy and power

Designing for peak use is inefficient

 Predictive schemes

Predict what will happen (the load)

Design algorithms based on that prediction (either static or dynamic)

Change the frequency (and voltage) according to the load

 Adaptive (adjustable) schemes

Monitor the load regularly

Adjust the frequency (and voltage) according to the load

(110)

Cluster Model

(111)

Typical Application

E-commerce or other web clusters

Multiple on-line browser sessions

Dynamic page generation and database access

Authentication through SSL (more work for security)

“Real-time” properties (firm or desired)

(112)

Global PM example – on/off

Server clusters typically underutilized

Turn on/off servers, as required by load

[Figure: requests/min over a 3-day trace and the corresponding number of servers powered on]
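The on/off sizing decision can be sketched as a capacity calculation; the per-server capacity and the 70% headroom target are illustrative assumptions, not values from the trace:

```python
import math

def servers_needed(load_rps, per_server_rps, headroom=0.7):
    """On/off cluster sizing (a sketch): power on just enough servers
    to serve the load while keeping each at ~70% of its capacity."""
    if load_rps <= 0:
        return 1                  # keep one server on for availability
    return math.ceil(load_rps / (per_server_rps * headroom))
```

Servers above this count can be turned off, which is where the savings over a peak-provisioned cluster come from.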

(113)

Heterogeneous Apache cluster

Experimental setup: a front end performs request distribution; all machines are connected through a Gbps switch

Traces collected from www.cs.pitt.edu

4-hour traces, up to 3186 reqs/sec (1.6Gbps)

(114)

Offline power/QoS measurements

[Figure: power vs. load for a server, from idle (no requests) to max load and over-utilization, with the QoS requirement marked]

(115)

Measured power functions

Load is defined as CPU load

QoS - meet 95% of deadlines, no dropped requests

50ms deadline for static pages

200ms deadline for dynamic pages

[Figure: measured power functions]

Offline measurement overhead is 2 hours per machine type

5% load corresponds to 54 requests/sec on average

(116)

Global PM – server clusters

[Figure: clients send requests to a front end, which distributes them to a server pool and receives results and load feedback]

Front end: request distribution, load balancing, global power management

Servers: process requests and return results; local power management; monitor load and feed it back

(117)

Homogeneous clusters - LAOVS

Load-aware on/off, with DVS locally

Load estimated online

Offline-computed table of number of servers versus load

PPC750 cluster (2-node) implementation

4x savings over no PM

[Figure: power over time with no PM vs. with PM]
