Power management in XX systems (XX = real-time | embedded | high performance | wireless | …)
By: Alexandre Ferreira and Daniel Mossé
http://www.cs.pitt.edu/PARTS
Collaborators (faculty): Rami Melhem, Bruce Childers
PhD students: Hakan Aydin, Dakai Zhu, Cosmin Rusu, Nevine AbouGhazaleh, Ruibin Xu
Green Computing
Gartner's top strategic technology for 2008:
Green IT
“This one is taking on a bigger role for many reasons, including an increased awareness of environmental danger; concern about power bills; regulatory requirements; government procurement rules; and a sense that corporations should embrace social responsibility.”
From Green500.org about supercomputers
For decades, performance has been synonymous with "speed" (as measured in FLOPS). This has led to the emergence of supercomputers that consume egregious amounts of electrical power and produce so much heat that they require extravagant cooling facilities.
Other performance metrics have been largely ignored, e.g., reliability, availability, and usability.
As a consequence, all of the above has led to an extraordinary increase in the total cost of ownership (TCO) of a supercomputer.
Green Computing
From [Gunaratne et al, IJNM 2005]
In the USA, the IT equipment comprising the Internet uses about US$6 Billion of electricity per year.
Outline and where are we?
Motivation
Power consumed by servers, devices, etc
Thermal limitations
Cost of energy
Examples of hw platforms and their E consumptions
Basic techniques
Models
Hardware
Software
HPC + power
Products
Research projects
Research topics (open problems)
Power Management
Why?
Cost! More performance, more energy, more $$
Heat: HPC Servers (multiprocessors), High performance CPUs
Dual metrics: maintain QoS, reduce energy
Battery operated: satellites
System design: highest component speed does not imply best system performance
Basics of Energy and Power
Power (P) consumption grows with the voltage fed into the component (dynamic power scales with the square of the supply voltage)
Some systems have a power cap (power limit, cannot operate above that threshold)
The energy consumed doing a task (E) is the integral over time (T) of the power function.
A simple approximation: E = P x T, where P is the average power
Some systems have an energy budget (e.g., systems on batteries or pay-per-use systems with a $ budget)
Thermal dissipation is proportional to the Power (the more power consumed, the more heat is generated)
Some systems have a thermal cap
More heat generated, more cooling needed
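As a worked illustration of E as the time integral of P (a minimal sketch; the power trace and sampling interval below are assumed values), compare numerical integration against the E = P x T approximation:

```c
#include <stdio.h>

int main(void) {
    /* Power samples in watts, taken every 0.5 s (assumed trace). */
    double p[] = {10.0, 12.0, 15.0, 11.0, 9.0};
    double dt = 0.5;                       /* sampling interval (s) */
    int n = sizeof p / sizeof p[0];

    double e_int = 0.0, sum = 0.0;
    for (int i = 1; i < n; i++)            /* trapezoidal integration */
        e_int += 0.5 * (p[i - 1] + p[i]) * dt;
    for (int i = 0; i < n; i++)
        sum += p[i];

    double t = (n - 1) * dt;               /* total measured time */
    printf("E (integral of P dt) = %.2f J\n", e_int);
    printf("E = Pavg * T         = %.2f J\n", (sum / n) * t);
    return 0;
}
```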
Introduction
Energy breakdown (from the pie charts):
  Servers [IBM]: CPU 28%, memory 42%, others 30%
  Portable devices [Celebican’04]: CPU 26%, memory 23%, others 51%
• CPU (including on-chip caches) and memory are the major energy consumers
• Energy is a prime resource in computing systems
• Power management is subject to performance degradation
Computing Performance
Small Systems – (10s Gigaflops)
X86, Power, Itanium, etc.
Specialized CPUs – (1 Teraflops)
Cell, GPUs (Nvidia Tesla, ATI)
SMPs – (10s Teraflops)
x86, Power, Itanium
Shared memory model with up to 1000 cores, 32/64 more common
Cluster – (1 Petaflops)
x86, Power, Cell
Fastest machines available (Blue Gene and Roadrunner) up to 128K cores
Power vs Performance Efficiency
Generic CPUs
Fast but power inefficient
10-100W per core (Athlon 64, Intel Core 2, Power 5/6)
New high-performance chips are all multicore
New versions of x86, Power, SPARC.
Power efficient designs
Cell (100s Gigaflops)
1 embedded PowerPC and 8 SPEs (fast but limited processors)
GPUs (1 Teraflops)
Up to 240 dedicated CPUs in a 1.4B-transistor chip
Created for graphics processing, used for parallel applications
Examples: Nvidia Tesla and AMD Firestream
Floating point accelerators (1 Teraflops)
Power vs Performance Efficiency
Intel ATOM
5-6x slower than latest Core 2 processors
But uses from 0.5 to 2W instead of 30-90W
Whole system design is important
Design solved CPU problem: CPU consumes 2W
Chipset became the new problem, consuming 20W
Bottleneck changed. Will it change again? For sure!
High Performance Computing
Blue Gene/L (~0.3 Gflops/W)
Uses an embedded processor design
PowerPC 440 derived design with 2 cores per chip
Design allows 65K CPUs (128K Cores)
Fastest machine up to Jun/2008
Roadrunner (~0.4 Gflops/W)
Uses a hybrid design
7K Opterons (2 cores) (used for I/O processing)
12K Cell processors (FP computation)
Jun/2008: achieved 1 Petaflops
Power in High Performance Computing
Most systems get less than 0.03 Gflops/W
At that efficiency, a Roadrunner-equivalent machine would consume 39MW
Datacenters have an efficiency factor ranging from ¼ (very efficient) to 2 (inefficient) cooling Watts per computing Watt [HP]
System Power Management
Objective: Achieve overall energy savings with minimum performance degradation
Previous work in the literature targets a single component: CPU, caches, main memory, or disks, achieving local energy savings
How about overall system energy?
Consider the effect of saving power in one component on the energy of the other system components.
Memory Power
DDRx DRAM
DDR (2.5V), DDR2 (1.8V), DDR3 (1.5V)
Bigger memory chips use less power/bit (same technology)
Higher bus speed increases consumption
  Increases bandwidth (faster data transfer)
  No significant reduction in access time for a single word (the bottleneck is the DRAM array)
~1W to 4W per DIMM.
More DIMMs or fewer DIMMs for same size (Total memory)
More DIMMs implies parallelism masking some latency
Fewer DIMMs reduce memory power
Memory Type    # Chips/DIMM   Power (W)
400MHz/512MB   8              1.30
400MHz/1GB     16             2.30
400MHz/2GB     16             2.10
533MHz/512MB   8              1.44
533MHz/1GB     16             2.49
533MHz/2GB     16             3.23
800MHz/512MB   8              1.30
800MHz/1GB     16             1.87
800MHz/2GB     16             1.98
(Source: Kingston memory.)
Memory Power
Newer technology favors low power designs (lower Vdd)
DDR (2.5V), DDR2 (1.8V) and DDR3 (1.5V)
Denser memory is less power hungry (more memory in chip)
Higher bus speed implies higher consumption for same technology
CPU and Memory Energy
We can reduce CPU power by reducing speed (using dynamic voltage scaling, DVS)
Memory is then in standby for longer periods, so its energy increases
(Figure: energy vs. CPU frequency, from fmax down to fmin: Ecpu falls, Emem rises, and their sum Etot has a minimum in between.)
Other metrics to consider
Maintain QoS while reducing energy
Quality of service can be a deadline, a certain level of throughput, or a specified average response time
Achieve the best QoS possible within a certain Energy budget
Battery-operated devices/systems, such as satellites
When batteries are exhausted, they need to be:
  Replaced (physical replacement is expensive!)
  Recharged (given constraints, such as solar power)
  Or the entire device is replaced (if battery replacement is not possible)
Application Characterizations
Different applications will consume different amounts of power
Weather forecasting
Medical applications (radiology)
Military games (we have no information about them)
One application will have different phases throughout execution
Use some units more than others
More memory intensive (e.g., initialization)
More CPU intensive
More cache intensive
High Performance Computing
Weather Prediction
1.5km scale globally requires 10 Petaflops
No existing machine can run this model
Conventional CPUs (1.8M cores)
US$ 1B to build
200MW to operate (US$ 100M per year)
Using embedded processors (20M Cores)
US$ 75M to build
4MW to operate (US$ 2M per year)
“Towards Ultra-High Resolution Models of Climate and Weather”, by Michael Wehner, Leonid Oliker and John Shalf, International Journal of High Performance Computing Applications, 2008; 22:149.
Tomography (Image reconstruction)
4 Nvidia 9800 GX2 (2 GPUs per card) in a PC
Faster than a 256-core AMD Opteron 2.4GHz cluster (CalcUA)
€4000 cost with 1500W consumption
University of Antwerp
Application Characterizations
Can we characterize the usage of units within the processor?
How much cache vs. DRAM vs. CPU vs. …
Applications have different behaviors
The next slides show these behaviors for:
  the 3 metrics mentioned: L2 cache, CPU, and DRAM
  4 applications from the SPEC benchmark: art, gcc, equake, and parser
CPI: cycles per instruction; L2PI: L2 accesses per instruction; MPI: memory accesses per instruction
Workload Variations
(Figures: CPI, L2PI, and MPI over time for the four applications.)
Power Management
Why?
Cost! More performance, more energy, more $$
Heating: in HPC Servers (multiprocessors)
Dual metrics: maintain QoS, reduce energy
Battery operated: satellites
System design
How and why?
Power off unused parts: disks, (parts of) memory, (parts of) CPUs, etc
Gracefully reduce the performance
Voltage and frequency scaling
Rate scaling
Transmit power scaling
Outline and where are we?
Motivation
Power consumed by servers, devices, etc
Thermal limitations
Cost of energy
Examples of hw platforms and their E consumptions
Basic techniques
Models
Hardware
Software
HPC + power
Products
Research projects
Research topics (open problems)
Hardware Models
                      CPUs or Cores        Memory/Caches     Communication Links
Sleep Power           Medium               Low               Low
Static Power          Highest              Medium            Medium
Dynamic Power         Highest              High              High
Transition Overhead   Medium               Medium            High
Power States          Sleep, Standby, DVS  Sleep, Standby,   Sleep, Active
                                           Active
Power Models
Dynamic voltage scaling (DVS) is an effective power/energy management technique
Gracefully reduce the performance to save energy
CPU dynamic power: Pd = Cef · Vdd² · f
  Cef: switched capacitance
  Vdd: supply voltage
  f: processor frequency, linearly related to Vdd
Hence power p ∝ f³
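A minimal sketch of this model (Cef, Vmax, and fmax below are assumed illustration values), scaling Vdd linearly with f so that Pd follows the cubic relation:

```c
#include <stdio.h>

int main(void) {
    double cef  = 1e-9;          /* switched capacitance (F), assumed */
    double vmax = 1.2;           /* supply voltage at fmax (V), assumed */
    double fmax = 2e9;           /* maximum frequency (Hz), assumed */

    for (double s = 1.0; s >= 0.49; s -= 0.25) {
        double f   = fmax * s;
        double vdd = vmax * s;                /* Vdd linear in f */
        double pd  = cef * vdd * vdd * f;     /* Pd = Cef * Vdd^2 * f */
        printf("f = %.2f GHz  Pd = %.3f W  (%.0f%% of max: f^3 scaling)\n",
               f / 1e9, pd, 100.0 * s * s * s);
    }
    return 0;
}
```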
CPU energy model
Power components:
  Dynamic: Pd = k · α · C · V² · f
  Leakage: Pl = l · (V − Vt) · V, where l is a leakage coefficient and Vt the threshold voltage
(Figure: speed vs. time.)
Dynamic Voltage Scaling (DVS): reduce voltage/frequency linearly, and energy is reduced quadratically
DVS Power and Energy Models
A more accurate power model is:
  P = Ps + h · (Pind + Pd)
where:
  Ps: sleep power (consumed when the system is in a low-power mode)
  h: 0 when the system is sleeping, 1 when active
  Pind: activity-independent power (memory static power, chipset, communication links, etc.)
  Pd: activity-dependent power (CPU power), Pd = Cef · Vdd² · f
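A minimal sketch of what this model implies (all constants below are assumed): since Ps and Pind are paid for as long as the task runs, slowing down stretches that cost, and total energy is not monotonic in frequency:

```c
#include <stdio.h>

int main(void) {
    double ps   = 0.1;    /* sleep power (W), assumed */
    double pind = 2.0;    /* frequency-independent active power (W), assumed */
    double cef  = 1e-9, vmax = 1.2, fmax = 2e9;   /* assumed, as before */
    double cycles = 4e9;  /* task length in CPU cycles, assumed */

    for (double s = 1.0; s >= 0.24; s -= 0.25) {
        double f   = fmax * s, vdd = vmax * s;
        double pd  = cef * vdd * vdd * f;   /* activity-dependent power */
        double t   = cycles / f;            /* execution time at this f */
        double e   = (ps + pind + pd) * t;  /* h = 1 while running */
        printf("f = %.2f GHz  t = %.2f s  E = %.2f J\n", f / 1e9, t, e);
    }
    return 0;
}
```

Running this shows the minimum-energy frequency lies strictly between fmin and fmax, matching the Ecpu/Emem/Etot picture earlier.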
DRAM memory energy model
Dynamic and static Power components
Dynamic power management
Transition the DRAM chips to a lower power state
  Ex: Rambus {active, standby, nap, powerdn}
Tradeoff: energy savings vs. transition power and delay
(Figure: power states P0, P1, P2 and the stream of memory requests over time.)
Memory Power Model
Active:  Pa = Ps + (M · Eac) / Tac
Standby: Pst
where:
  Pa  = total power in active mode
  Ps  = static power in active mode
  Tac = measurement time
  M   = number of memory accesses over the period Tac
  Eac = energy per memory access
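A minimal numeric sketch of this model (all values below are assumed for illustration):

```c
#include <stdio.h>

int main(void) {
    double ps  = 0.8;     /* static power in active mode (W), assumed */
    double eac = 20e-9;   /* energy per memory access (J), assumed */
    double tac = 1.0;     /* measurement window (s) */
    double m   = 50e6;    /* accesses observed in the window, assumed */

    double pa = ps + m * eac / tac;   /* Pa = Ps + M*Eac/Tac */
    printf("Pa = %.2f W\n", pa);      /* 0.8 + 1.0 = 1.80 W */
    return 0;
}
```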
Cache Energy Model
Cacti 3.0 power and time models
Access latency and energy are functions of:
Cache size, associativity and block size
Longer bit/word lines ==> higher latency
Outline and where are we?
Motivation
Power consumed by servers, devices, etc
Thermal limitations
Cost of energy
Examples of hw platforms and their E consumptions
Basic techniques
Models
Hardware
Software
HPC + power
Products
Research projects
Research topics (open problems)
Hardware capabilities
Per-component techniques:
  CPU, CPUs or Cores: DVS, multi-clock domain processors, real-time performance settings, timeout, on-off
  Memory and/or caches: timeout, redundancy (use or add), on-off
  Communication links: sending power, link rates (encoding), timeout, redundancy (use or add), on-off
The biggest problem is still who manages it, at what granularity (physical and power), and how often
Problems and solutions
Solutions:
  Use DVS, on/off, and clock gating at the hardware level
  Complex software solutions migrating to hardware
The biggest problems are still:
  who manages it (hardware, software, user, application, hybrid)
  at what granularity (physical and power)
  how often (triggered by events or periodically)
  whether it is synchronized among devices or not (managing power for many devices in a coordinated way)
  if not synchronous, then at what hierarchy/order?
Who should manage power?
(Figure: spectrum from static analysis of the application source code to run-time information; the compiler knows the future better, while the OS knows the past better.)
Options:
  scheduling at the HW level
  scheduling at the OS level
  compiler analysis (extracted)
  application (user/programmer furnished)
  multi-layer and cross-layer
Preprocessing (compiler or user):
  (1) Collect timing information
  (2) Set interrupt service routine
  (3) Insert PMHs in application code
Run-time (OS):
  (4) Compute worst-case remaining cycles
Metrics to optimize
• Minimize energy consumption
• Maximize system utility
• Most importantly, etc.!
Constraints and objectives:
  Time constraints (deadlines or rates)
  Energy constraints
  System utility (reward): increased reward with increased execution
Determine the most rewarding subset of tasks to execute
Determine appropriate versions to execute
Start with hardware solutions
“Doing hardware”, use:
Lower power components
Lower frequency of operation
Power grows superlinearly with f (p ∝ f³ under DVS), so bigger power gains come with smaller performance loss
Clock gating
Turning off circuits whenever they are not needed:
Functional units, cache lines, etc.
Used in any large CPU today
Throttling
Reduce utilization by holding requests
DVS
Change supply voltage to reduce power
Start with memory (no real reason)
Memory energy savings
Low power DRAM system (AMD systems)
Uses DDR2-DRAM memory
Memory energy savings
Uses FB-DIMM memory (Intel systems)
Memory energy savings
Different systems will consume different amounts of power in memory
Need to characterize bottleneck
Deal with memory savings whenever needed!
Timeout mechanisms:
Turn memory off when idle
Turn some parts of memory off when idle
Power states, depending on how much of the chip is turned off
Self-refresh is the most “famous” or well-known
Who manages timeouts?
Memory energy savings
Exploiting existing redundancy
Distribute workload evenly
On-off (or in general Distribute workload unevenly)
Adding hardware
Caching near memory
Flash or other non-volatile memory
More caching at L1, L2, L3
Exploit power states (in increasing order of power)
Powerdown (all components are off)
Nap (self-refresh)
Standby
Active
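A minimal sketch of a timeout mechanism over these states (the idle-time thresholds below are assumed, not taken from any particular product): the longer a chip stays idle, the deeper the state it is demoted to; any access returns it to active:

```c
#include <stdio.h>

enum state { ACTIVE, STANDBY, NAP, POWERDOWN };   /* decreasing power */

/* Pick a power state from how long the chip has been idle (ms). */
static enum state on_tick(enum state s, int idle_ms) {
    if (idle_ms == 0)   return ACTIVE;      /* access observed: wake up */
    if (idle_ms > 1000) return POWERDOWN;   /* thresholds are assumed */
    if (idle_ms > 100)  return NAP;         /* nap = self-refresh */
    if (idle_ms > 10)   return STANDBY;
    return s;
}

int main(void) {
    const char *names[] = {"active", "standby", "nap", "powerdown"};
    int idle[] = {0, 5, 15, 150, 1500};
    enum state s = ACTIVE;
    for (int i = 0; i < 5; i++) {
        s = on_tick(s, idle[i]);
        printf("idle %4d ms -> %s\n", idle[i], names[s]);
    }
    return 0;
}
```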
Near-memory Caching for Improved Energy Consumption
Near-CPU vs. near-memory caches
Thesis: the allocation of the two needs to be balanced for better delay and energy
(Figure: CPU with a near-CPU cache; main memory with a near-memory cache.)
Caching to mask memory delays: where?
  In main memory?
  In the CPU?
  Which is more power and performance efficient?
Cached-DRAM (CDRAM)
On-memory SRAM cache [Hsu’93, Koganti’97]
Accessing the fast SRAM cache improves performance
High internal bandwidth allows large block sizes
How about energy?
CDRAM improves performance but consumes more energy
(Figures: performance and energy vs. on-memory block size.)
Power-Aware CDRAM (PA-CDRAM)
Power management in DRAM-core
Use moderate sized SRAM cache
Turn the DRAM core to low power state
Use immediate shutdown
Power management in near-memory caches
Use distributed near-memory caches
Choose adequate block size
Power-aware Near-memory Cache vs. DRAM energy
PA-CDRAM energy: Etot = Ecache + EDRAM
(Figures: energy (mJ) vs. block size (64-2048 bytes) for a CPU-intensive workload (bzip) and a memory-intensive workload (mcf), showing E_DRAM, E_cache, and E_tot.)
PA-CDRAM design
PA-CDRAM: new protocol
Needs new commands:
  Check whether the data is in the cache
  Return data from the cache, if so
  Read data from the DRAM core, if not
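A minimal sketch of that access path (all names below are hypothetical stand-ins, with the DRAM core modeled by print statements): a hit never leaves the low-power state; a miss wakes the core, reads, and immediately shuts it down again:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-ins for the on-memory SRAM cache and DRAM core. */
static uint64_t cached_addr = UINT64_MAX;            /* one-block "cache" */
static bool nm_cache_lookup(uint64_t a) { return a == cached_addr; }
static void nm_cache_fill(uint64_t a)   { cached_addr = a; }
static void dram_wakeup(void)   { puts("  DRAM core: wake up"); }
static void dram_read(uint64_t a) {
    printf("  DRAM core: read block %llu\n", (unsigned long long)a);
}
static void dram_shutdown(void) { puts("  DRAM core: immediate shutdown"); }

/* PA-CDRAM read: a hit keeps the DRAM core in its low-power state;
 * a miss wakes it, reads, fills the cache, and shuts it down again. */
static void pa_cdram_read(uint64_t addr) {
    printf("read %llu\n", (unsigned long long)addr);
    if (nm_cache_lookup(addr)) { puts("  near-memory cache hit"); return; }
    dram_wakeup();
    dram_read(addr);
    nm_cache_fill(addr);
    dram_shutdown();
}

int main(void) {
    pa_cdram_read(7);   /* miss: wake, read, shutdown */
    pa_cdram_read(7);   /* hit: DRAM core stays powered down */
    return 0;
}
```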
Power-aware CDRAM
Power management in near-memory caches:
  Distributed near-memory caches
  Choose an adequate cache configuration to reduce miss rate and energy per access
Power management in the DRAM core:
  Use a moderately sized SRAM cache
  Turn the DRAM core to a “low power state”
  Use immediate shutdown
Near-memory versus DRAM energy tradeoff – cache block size
(Figure: energy (mJ) vs. block size (64-2048 bytes), showing E_DRAM, E_cache, and E_tot.)
Exploiting memory power management
Carla Ellis [ISLPED 2001]: memory states...
Kang Shin [Usenix 2003] and Ellis [ASPLOS 2000]: smart page allocation, turn off pages that are unused.
Integrated CPU+XX DVS
Independent DVS: Basic Idea
DVS sets the frequency of a domain based on its workload: speed ∝ workload
Workload is measured by performance counters:
  # instructions for the CPU core
  # L2 accesses for the L2 cache
  # L2 misses for main memory
Control loop: periodically measure the workload to set the speed
(Figure: per-domain voltage controller; workload and voltage setting over time.)
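A minimal sketch of one tick of this control loop (read_counter and set_freq are hypothetical platform hooks; the frequency range and workloads are assumed):

```c
#include <stdint.h>
#include <stdio.h>

#define FMIN 0.5   /* GHz, assumed range */
#define FMAX 2.0

/* Hypothetical stand-ins for a per-domain performance counter and DVS knob. */
static uint64_t counter;
static uint64_t read_counter(void) { return counter; }
static void set_freq(double ghz)   { printf("set freq %.2f GHz\n", ghz); }

/* One control-loop tick: speed proportional to the workload observed in
 * the last interval, clamped to the feasible range. */
static void dvs_tick(uint64_t max_work_per_interval) {
    static uint64_t last;
    uint64_t now = read_counter();
    double load = (double)(now - last) / (double)max_work_per_interval;
    last = now;
    if (load > 1.0) load = 1.0;
    set_freq(FMIN + (FMAX - FMIN) * load);
}

int main(void) {
    uint64_t work[] = {200, 900, 400};   /* e.g., L2 accesses per interval */
    for (int i = 0; i < 3; i++) {
        counter += work[i];              /* simulate the domain's activity */
        dvs_tick(1000);                  /* capacity: 1000 units/interval */
    }
    return 0;
}
```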
Problem: Positive Feedback Intervals
(Figure: feedback between the CPU and L2 domains. Lowering the L2 frequency raises L2 latency, which causes more CPU stalls; the measured CPU workload drops, so the CPU frequency is lowered; instruction fetch slows, producing fewer L2 accesses; the measured L2 workload drops, lowering the L2 frequency further. The accompanying chart shows the frequency of such positive feedbacks.)
Problem
Set the voltage and frequency for each domain to reduce the combined energy consumption with little impact on delay
Previous solutions control the voltage of each domain in isolation
Solution: online integrated dynamic voltage scaling (IDVS) in the CPU core and L2 cache
  Heuristic-based online IDVS (online-IDVS)
  Machine-learning-based IDVS (ML-IDVS)
Integrated DVS (IDVS)
Consider a two-domain chip: CPU and L2 cache
A voltage controller (VC) sets the configuration using the observed activity (configured using DVS)
This also works for different types of domains (e.g., CPU and memory)
(Figure: processor chip with a CPU domain (L1 cache, FUs), an L2 domain (L2 cache), a VC, and main memory.)
IDVS Policy
Rules for setting voltages; break the positive feedback in rules 1 & 5
(Table: rules indexed by changes in L2 accesses and IPC, giving the new Vc and V$ settings.)
Positive feedback scenarios:
• Workloads increase in both domains (rule 1)
  – Indicates the start of a memory-bound program phase
  – Preemptively reduce the core speed: increasing it would exacerbate the load in both domains
  – Decreasing the core speed instead saves core energy
• Workloads decrease in both domains (rule 5)
  – Longer core stalls are a result of local core activity
  – Increasing or decreasing the core speed may not eliminate the source of these stalls
  – Keep the core speed unchanged
Machine Learning (ML-IDVS): general idea
Training phase: a learning engine and an automatic policy generator produce the integrated DVS policy
Runtime: the generated policy determines the frequencies and voltages
Learning approach I: Obtain training data
Application execution is divided into intervals called states
State: <operating frequencies, features>, e.g., <fcpu, fL2, CPI, L2PI, MPI>
Example states: <1, 0.5, 4, 0.05, 0.0002> and <1, 0.5, 2, 0.15, 0.006>
Learning approach
State table indexed by <fCPU, fL2, CPI, L2PI, MPI>; stores the best frequency combination for each state, chosen by minimum energy·delay (E·D)
Entries are <fcpu, fL2, CPI, L2PI, MPI, delay, energy>:
  fcpu=1,   f$=1:   <1, 1, 4, 0.05, 0.001, 5, 0.4>
  fcpu=1,   f$=0.5: <1, 0.5, 3.5, 0.05, 0.004, 4.2, 0.45>
  fcpu=0.5, f$=1:   <0.5, 1, 2.5, 0.05, 0.0003, 4.4, 0.55>
  fcpu=0.5, f$=0.5: <0.5, 0.5, 2.4, 0.05, 0.001, 4.5, 0.6>
Example generated rules:
  If (L2PI > 0.32) and (CPI < 2) and (MPI > 0.0003) then fL2 = 1GHz, else fL2 = 0.5GHz
  If (L2PI > 0.002) and (CPI < 3) and (MPI < 0.001) then fcpu = 0.5GHz, else fcpu = 1GHz
Machine learning to generate policy rules
(1) Obtain training data
(2) State-table construction
(3) Generation of rules
Power scheduling
Power-Aware scheduling
Main idea: slow down the CPU, thus extending the computation
Compute the maximum delay that can be allowed
Consider:
  Preemptions, high-priority tasks
  Overhead of switching speeds within a task, or when switching speeds on task resume
The model dictates as static (constant) a speed as possible (quadratic relationship of voltage and frequency)
Voltage and frequency relationship
Hidden assumptions:
Known execution times
Quadratic relationship of V and f
Continuous speeds
Linux has several “governors” that regulate voltage and frequency
OnDemand is the most famous, implemented in the kernel, based on CPU utilization
Userspace and Powersave are two others, and can be used by applications
Linux OnDemand
For each CPU:
  If util > up threshold --> increase freq to MAX
  If util < down threshold --> decrease freq by 20%
Parameters: ignore_nice_load, sampling_rate, sampling_rate_min, sampling_rate_max, up_threshold
The down threshold was 20%, constant
The newer algorithm keeps the CPU at 80% utilization
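A minimal sketch of this policy as described above (not the actual kernel code; the frequency range follows assumed values, the thresholds follow the slide's description):

```c
#include <stdio.h>

#define FMAX           2000   /* MHz, assumed */
#define FMIN            500   /* MHz, assumed */
#define UP_THRESHOLD     80   /* percent */
#define DOWN_THRESHOLD   20   /* percent, constant in the older algorithm */

/* One sampling period: jump to MAX on high utilization, step down 20%
 * on low utilization, otherwise keep the current frequency. */
static int ondemand_tick(int freq, int util) {
    if (util > UP_THRESHOLD)
        return FMAX;                 /* increase straight to MAX */
    if (util < DOWN_THRESHOLD) {
        freq -= freq / 5;            /* decrease by 20% */
        if (freq < FMIN) freq = FMIN;
    }
    return freq;
}

int main(void) {
    int f = FMAX;
    int util[] = {90, 10, 10, 50, 85};
    for (int i = 0; i < 5; i++) {
        f = ondemand_tick(f, util[i]);
        printf("util = %2d%% -> f = %d MHz\n", util[i], f);
    }
    return 0;
}
```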
Static Power Management (SPM)
Speed adjustment based on remaining execution time of the task
A task very rarely consumes its estimated worst case execution time (WCET)
Being able to detect whether tasks finish ahead of (predicted) time is beneficial
Knowledge of the application is also good
Knowing characteristics of application and system allows for optimal power and energy savings
Power Aware Scheduling for Real-time
(Figure: task T1 runs at fmax and finishes well before the deadline D, leaving static slack and idle time.)
Static Power Management (SPM)
Static slack: uniformly slow down all tasks
Gets more interesting for multiprocessors
(Figure: energy example with tasks T1 and T2. At fmax each consumes E and the CPU idles; stretched over the static slack at a reduced frequency each consumes 0.6E; at fmax/2 each consumes E/4.)
Basic DVS Schemes
(Figures: how each scheme allocates the slack.)
  Proportional Scheme
  Greedy Scheme
  Statistical Scheme
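A minimal sketch of the proportional and greedy allocations (worst cases and slack are in cycles at fmax; the values below are assumed; the statistical scheme additionally needs the execution-time distribution, discussed later):

```c
#include <stdio.h>

/* New speed for task i (as a fraction of fmax): run its worst-case
 * cycles within (wc[i] + its share of the slack) time units. */
static double proportional_speed(const double wc[], int i, int n,
                                 double slack) {
    double total = 0.0;
    for (int j = i; j < n; j++) total += wc[j];
    double share = slack * wc[i] / total;   /* slack spread proportionally */
    return wc[i] / (wc[i] + share);
}

static double greedy_speed(const double wc[], int i, double slack) {
    return wc[i] / (wc[i] + slack);         /* next task takes all slack */
}

int main(void) {
    double wc[] = {4e8, 2e8, 2e8};          /* worst-case cycles, assumed */
    double slack = 2e8;                     /* slack in cycles at fmax */
    printf("proportional: %.2f fmax\n", proportional_speed(wc, 0, 3, slack));
    printf("greedy:       %.2f fmax\n", greedy_speed(wc, 0, slack));
    return 0;
}
```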
Dynamic Speed adjustment
(Figures: a task's actual execution time (ACET) is much shorter than its WCET; comparing the remaining WCET with the remaining time exposes slack.)
Speed adjustment based on the remaining WCET
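A minimal sketch of this adjustment with discrete speeds (the frequency set is assumed): pick the lowest frequency that still finishes the remaining worst case by the deadline:

```c
#include <stdio.h>

/* Lowest discrete frequency (Hz) that can run the remaining worst-case
 * cycles within the time left to the deadline. Frequency set is assumed. */
static double pick_speed(double remaining_wc_cycles, double time_left) {
    static const double freqs[] = {0.5e9, 1.0e9, 1.5e9, 2.0e9};
    double need = remaining_wc_cycles / time_left;
    for (int i = 0; i < 4; i++)
        if (freqs[i] >= need) return freqs[i];
    return freqs[3];                 /* saturate at fmax */
}

int main(void) {
    /* Early phases finished quickly: 0.6e9 worst-case cycles remain,
       with 1 second to the deadline -> 1 GHz suffices. */
    printf("f = %.1f GHz\n", pick_speed(0.6e9, 1.0) / 1e9);
    return 0;
}
```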
More assumptions
Even for real-time systems, another (huge?) assumption is knowledge of the execution times of the tasks, which can be obtained through:
  Compiler assistance
  User information
  Profiling
Without exact information about the execution times of the tasks, we cannot determine a constant optimal speed
Instead, use distribution information about the execution times
Which distribution, which mean, which average, etc., makes a difference
Schemes with Stochastic Information
The intuition: compute the probability of a task finishing early, and reduce the speed based on that
The expected energy consumption is thus minimized by gradually increasing the speed as the task progresses; this was first named Processor Acceleration to Conserve Energy (PACE) in 2001
(Figure: cdf(x) of the task's execution time, rising from 0 to 1.)
Schemes with Stochastic Information
Existing schemes, such as PACE and GRACE, differ in the way they compute the speed (based on how they define the problem)
Many schemes have the following shortcomings:
  use well-defined functions to approximate the actual power function
  solve the continuous version of the problem before rounding the speeds to the available discrete speeds
  do not consider the idle power
  assume the speed-change overhead is zero
Practical PACE
To make PACE practical, algorithms must:
  Use discrete speeds directly
  Make no restriction on the form of the power functions
  Take the idle power into consideration
  Consider the speed-change overhead
Quantize spaces: continuous → discrete
Cannot look at all combinations of speeds: too computationally expensive
Carry out optimizations to reduce the complexity
Problem Formulation
Minimize    Σ_{0≤i≤r} s_i · F_i · e(f_i)
Subject to  Σ_{0≤i≤r} s_i / f_i ≤ D
where
  s_i = b_{i+1} − b_i
  F_i = Σ_{b_i ≤ j < b_{i+1}} (1 − cdf(j − 1))
(Figure: cdf(x) rising from 0 to 1 over the cycles, partitioned at breakpoints b_0 = 1, b_1, b_2, b_3, …, b_r, b_{r+1} = WC + 1; frequency f_i is used for the cycles in [b_i, b_{i+1}).)
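A minimal sketch that evaluates the objective and constraint exactly as stated above, for one candidate set of breakpoints and frequencies (the cdf, the energy function e(f), and all values are assumed):

```c
#include <stdio.h>

#define WC 8   /* worst case, in cycle chunks (assumed) */

/* Assumed uniform distribution: P(task needs <= x chunks). */
static double cdf(int x) {
    return x <= 0 ? 0.0 : (x >= WC ? 1.0 : (double)x / WC);
}
/* Assumed per-chunk energy function e(f), quadratic in f. */
static double e_of(double f) { return f * f; }

int main(void) {
    int    b[] = {1, 4, 7, WC + 1};   /* breakpoints b0..b_{r+1}, assumed */
    double f[] = {0.5, 0.75, 1.0};    /* frequency per segment, assumed */
    double D = 12.0, energy = 0.0, time = 0.0;

    for (int i = 0; i < 3; i++) {
        double s = b[i + 1] - b[i];   /* s_i = b_{i+1} - b_i */
        double F = 0.0;               /* F_i = sum of (1 - cdf(j-1)) */
        for (int j = b[i]; j < b[i + 1]; j++)
            F += 1.0 - cdf(j - 1);
        energy += s * F * e_of(f[i]); /* objective term, as on the slide */
        time   += s / f[i];           /* worst-case time term */
    }
    printf("expected energy = %.2f, worst-case time = %.2f (D = %.1f, %s)\n",
           energy, time, D, time <= D ? "feasible" : "infeasible");
    return 0;
}
```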
Graphical Representation of the Problem
The problem is reduced to finding a path from v0 to vr+1 such that the energy of the path is minimized while the time of the path is no greater than the desired completion time D
A Naive Approach
(Figure: nodes v0, v1, v2, v3 with |LABEL(0)| = 1, |LABEL(1)| = M, |LABEL(2)| = M², |LABEL(3)| = M³.)
Exponential growth!
Elimination of Labels
The key idea is to reduce and limit the number of labels in the nodes (after each iteration)
There are two types of eliminations:
  optimality-preserving eliminations
  eliminations that may affect optimality but still allow for a performance guarantee
Optimality Preserving Eliminations
• For any label, if the deadline would be missed even if the maximum frequency were used from this point on, we eliminate it
• For any two labels, if one is dominated by the other, we eliminate the dominated one: if (e, t) ≤ (e’, t’), we eliminate (e’, t’)
• Eliminate all labels whose energy is greater than an upper bound (e.g., the solution where each frequency is rounded up to the closest discrete frequency)
Optimality Preserving Eliminations
(Figure: nodes v0..v3; after these eliminations, hopefully |LABEL(1)| << M, |LABEL(2)| << M², |LABEL(3)| << M³.)
After the first type of eliminations, the size of LABEL(i) decreases substantially. However, the running time of the algorithm still has no polynomial bound.
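A minimal sketch of the dominance elimination (assuming a node's labels are already sorted by energy): a label survives only if its time is strictly below that of every cheaper label:

```c
#include <stdio.h>

struct label { double e, t; };   /* (energy, time) */

/* Keep only Pareto-optimal labels. With labels sorted by energy, a label
 * is dominated iff some earlier (cheaper) label also has time <= its time,
 * i.e., it survives only by strictly improving on the best time so far. */
static int prune(struct label v[], int n) {
    int kept = 0;
    double best_t = 1e300;
    for (int i = 0; i < n; i++) {
        if (v[i].t < best_t) {
            v[kept++] = v[i];
            best_t = v[i].t;
        }   /* else: dominated, eliminate */
    }
    return kept;
}

int main(void) {
    struct label v[] = {{1, 9}, {2, 9.5}, {3, 7}, {4, 8}, {5, 6}};
    int n = prune(v, 5);   /* keeps (1,9), (3,7), (5,6) */
    for (int i = 0; i < n; i++)
        printf("(e = %.0f, t = %.1f)\n", v[i].e, v[i].t);
    return 0;
}
```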
Eliminations Affecting Optimality
For a node, we sort the labels by energy and eliminate some of them so that the energies of any two adjacent remaining labels differ by at most δ
We choose δ carefully: the solution is guaranteed to be within a factor of 1 + ε of the optimal
PPACE is a so-called Fully Polynomial Time Approximation Scheme (FPTAS): it obtains ε-optimal solutions, within a factor of 1 + ε of the optimal solution, in time polynomial in 1/ε
Stochastic power management
(Figure: frame of length D shared by tasks T1-T4; T1 is allotted β1·D and the remaining (1 − β1)·D goes to the rest, comparing per-task ratios β1 = xx%, β2 = xx%, β3 = xx%, β4 = 100%.)
An alternate point of view
(Figure: each task is provisioned for its WCET, but the actual execution time (ACET) is shorter; the difference is slack that can be reclaimed after early completion or stolen in advance.)
Who should manage?
  Compiler (static analysis of the application source code; knows the future, can steal slack)
  OS (run-time information; knows the past, can reclaim slack)
PMHs: power management hints
PMPs: power management points
(Figure: timeline with PMHs inserted in the application code and periodic interrupts executing PMPs.)
The following slides are on the compiler’s ability to insert code, etc.
Inter-task vs. Intra-task DVS
(Figure: speed vs. time up to the deadline; intra-task speed changes trade the number of speed changes against energy consumption.)
Return to the question: where to manage? How many speed changes, and where?
Collaborative CPU power management
OS-directed vs. compiler-directed intra-task DVS (e.g., Gruian’01, Shin et al.’01)
A cross-layer scheme uses path-dependent information AND runtime information
The compiler provides hints (PMHs) about the application's progress to the OS, which schedules the speed
Compiler and OS collaboration
Compiler: inserts PMHs in the code to compute timing information
OS: periodically computes and sets the new speed
(Figure: timeline with PMHs inserted by the compiler and OS interrupts invoked for executing PMPs, once per interrupt interval.)
OS role
Periodic execution of the ISR; the interrupt interval controls the number of speed changes:
  ct = read current time
  f_curr = read current frequency
  f_new = calc_speed(ct, d, WCR)
  if (f_new ≠ f_curr):
      look up power level(f_new)
      set_speed(f_new)
  interval_time = interval_cycles / f_new
  rti
Tradeoff: number of PMPs vs. energy consumption
Compiler role
Profiling phase to determine the interrupt interval
PMH placement and WCR computation: at branches, loops, procedure bodies, …, etc.
(Figure: control-flow graph with PMH 1 through PMH 5 placed around branches, loops, and a call to P().)
CAVEAT: non-linear code
• Remaining WCET is based on the longest path
• Remaining average-case execution time is based on the branching probabilities (from trace information)
(Figure: at a branch point, paths of lengths p, 2p, and 3p give different min/average/max remaining times.)
Possible optimizations
Maximizing system’s utility
(as opposed to minimizing energy consumption)
  Time constraints (deadlines or rates)
  Energy constraints
  System utility (reward): increased reward with increased execution
Determine the most rewarding subset of tasks to execute
Determine appropriate versions to execute
• Continuous frequencies, continuous reward functions
• Discrete operating frequencies, no reward for partial execution
• Version programming – an alternative to the IRIS (IC) QoS model
Optimal solutions vs. heuristics
EXAMPLE
For homogeneous power functions, maximum reward is achieved when power is allocated equally to all tasks
(Flowchart: add a task; if a constraint is violated, repair the schedule; otherwise continue.)
Satellite systems
(additional constraints on energy and power)
(Figure: available power over time from the solar panel (needs light); surplus power recharges the battery, and the battery is consumed when demand exceeds supply; power intervals can be split and merged while keeping the system schedulable.)
Tasks are continuously executed
Keep the battery level above a threshold at all times
Frame-based system
Can apply any of the three dynamic policies (e.g., greedy, speculative, and proportional)
Example: multi-CPU systems
Multi-CPU Systems
Multicore (currently more common, several cores per chip, private L1, shared L2)
Multiprocessor (private L1 and L2, but shared memory)
Clusters (private L1, L2, memory)
Return to question: Who should manage?
Centralized vs distributed
Front-End vs any node?
Partition Tasks to Processors vs. Global Queue
(Figure: four processors, each with its own queue, vs. a single global queue.)
• Partitioned: each processor applies PM individually (distributed)
• Global queue: global management (shared memory)
Which is better??? Let's vote
Energy-efficient Clusters
Applied to web servers (application)
Use on-off and DVS to save energy and power
Designing for peak use is inefficient
Predictive schemes:
  Predict what will happen (the load)
  Design algorithms based on that prediction (either static or dynamic)
  Change the frequency (and voltage) according to the load
Adaptive (adjustable) schemes:
  Monitor the load regularly
  Adjust the frequency (and voltage) according to the load
Cluster Model
Typical Application
E-commerce or other web clusters
Multiple on-line browser sessions
Dynamic page generation and database access
Authentication through SSL (more work for security)
“Real-time” properties (firm or desired)
Global PM example – on/off
Server clusters are typically underutilized
Turn servers on/off, as required by the load
(Figure: requests/min over a 3-day trace and the resulting number of servers kept on.)
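A minimal sketch of such a load-aware on/off policy (per-server capacity, headroom, and the load trace below are assumed values): keep just enough servers on for the current load:

```c
#include <math.h>
#include <stdio.h>

#define PER_SERVER_CAP 1000.0   /* requests/sec one server handles, assumed */
#define HEADROOM       0.8      /* run servers at <= 80% capacity, assumed */
#define NSERVERS       8        /* cluster size, assumed */

/* Number of servers to keep on for the observed request rate. */
static int servers_needed(double req_per_sec) {
    int n = (int)ceil(req_per_sec / (PER_SERVER_CAP * HEADROOM));
    if (n < 1) n = 1;                     /* always keep one server on */
    return n > NSERVERS ? NSERVERS : n;
}

int main(void) {
    double trace[] = {120, 950, 2400, 3100, 400};   /* reqs/sec, assumed */
    for (int i = 0; i < 5; i++)
        printf("%6.0f req/s -> %d server(s) on\n",
               trace[i], servers_needed(trace[i]));
    return 0;
}
```

(Compile with -lm for ceil().)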
Heterogeneous Apache cluster
Experimental setup: request distribution
All machines connected through a Gbps switch
Traces collected from www.cs.pitt.edu
4-hour traces, up to 3186 reqs/sec (1.6Gbps)
Offline power/QoS measurements
(Figure: measured power vs. load, from an idle server (no requests) up to max load; beyond max load the server is over-utilized and the QoS requirement is violated.)
Measured power functions
Load is defined as CPU load
QoS - meet 95% of deadlines, no dropped requests
50ms deadline for static pages
200ms deadline for dynamic pages
Measured power functions
Offline measurement overhead is 2 hours per machine type
5% load corresponds to 54 requests/sec on average
(Figure: clients send requests to the front end, which distributes them across the server pool; servers return results and feed their load back.)
Servers:
  – Process requests, return results
  – Local power management
  – Monitor load, provide feedback
Front end:
  – Request distribution
  – Load balancing
  – Global power management
Global PM – server clusters
Homogeneous clusters - LAOVS
  Load-aware on/off, with DVS locally
  Load estimated online
  Offline-computed table of number of servers versus load
PPC750 cluster (2 nodes) implementation: 4x savings over no PM
(Figure: energy with PM vs. no PM.)