### Optimal Portfolio Liquidation in Dark Pool

### Ranjan Mishra

### Optimal Portfolio Liquidation in Dark Pool

DISSERTATION SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

Master of Technology in

Computer Science by

### Ranjan Mishra

[ Roll No: CS-1519 ]

under the guidance of

### Dr. Diganta Mukherjee

Associate Professor

Sampling and Official Statistics Unit

Indian Statistical Institute Kolkata-700108, India

July 2017

### To everyone in the world

### CERTIFICATE

This is to certify that the dissertation entitled “Optimal Portfolio Liquidation in Dark Pool” submitted by Ranjan Mishra to Indian Statistical Institute, Kolkata, in partial fulfillment of the requirements for the award of the degree of Master of Technology in Computer Science is a bona fide record of work carried out by him under my supervision and guidance. The dissertation has fulfilled all the requirements as per the regulations of this institute and, in my opinion, has reached the standard needed for submission.

Diganta Mukherjee Associate Professor,

Sampling and Official Statistics Unit, Indian Statistical Institute,

Kolkata-700108, INDIA.

### Acknowledgments

I would like to show my highest gratitude to my advisor, Dr. Diganta Mukherjee, Sampling and Official Statistics Unit, Indian Statistical Institute, Kolkata, for his guidance, continuous support and encouragement.

My deepest thanks to all the teachers of Indian Statistical Institute, for their valuable suggestions and discussions which added an important dimension to my research work.

Finally, I am very thankful to my parents and family for their everlasting support.

Last but not least, I would like to thank all of my friends for their help and support. I thank all those whom I have missed in the above list.

Ranjan Mishra Indian Statistical Institute Kolkata - 700108 , India.

### Abstract

When a large investor decides to liquidate his portfolio within a finite time period, he can place market orders to do so. However, large market orders may adversely impact prices, which in turn may lower the liquidation return. Dark pools are a new kind of market in which not all information is made public after trade execution.

We have developed a dynamic strategy for placing market orders as well as dark pool orders so that the total expected liquidation return is maximized. The problem is formulated as a Markov decision process in finite horizon and solved using an approximate dynamic programming technique.

Keywords: Portfolio Liquidation, Dark Pool, Dynamic Programming, Markov Decision Process


## Contents

1 Introduction

1.1 Introduction

1.2 Our Contribution

1.3 Thesis Outline

2 Markov Decision Process

2.1 Introduction

2.2 Dynamic Programming

2.3 Markov Decision Process

2.3.1 Value Function and Q Function

2.3.2 Value Iteration

3 Problem Formulation and Proposed Solution

3.1 Introduction

3.2 Dark Pool

3.3 Problem Formulation

3.4 Backward Dynamic Programming Approach

3.5 Approximate Dynamic Programming

3.5.1 Actor Critic Method

3.6 Actor Critic method for Dark Pool Problem

4 Simulation and Analysis

5 Future Work and Conclusion

## List of Figures

4.1 Sample Dynamics of Market Price

4.2 Convergence of Temporal Difference Error

4.3 Value Function For Different Initial Prices

4.4 Comparison for Market Order and Dark Pool Order


### Chapter 1

## Introduction

### 1.1 Introduction

When a large investor decides to liquidate his portfolio, he can place the whole portfolio as a single market order. But placing a large order might move prices adversely. The impact of order size on prices has been studied, and several models have been proposed for it, such as [1, 12]. The impact may reduce the return on the portfolio. One simple solution is to place market orders of equal blocks through time, but given the dynamics of prices, that might not produce the best return.

A new kind of market has emerged, called dark pools [5, 9]; contrary to conventional markets, dark pools do not make trade details public after trade execution. One has to submit orders in a dark pool to learn anything about the depth of the market. This lack of trade information makes these markets more suitable for large investors, since orders there have less impact on prices. The execution prices in dark pools are mostly derived from an exchange, such as the NBBO mid price [5]. However, orders in dark pools are not guaranteed to be executed, either in whole or in part; it often happens that orders get cancelled. Since the large investor wants to liquidate his portfolio in a limited period of time, it may not be best to place orders only in dark pools.

We propose a solution that places orders in both the conventional market (market orders) and dark pools so that the return from portfolio liquidation is maximized, with the dynamics of the price governed by the size of the orders placed in the conventional market.

We formulate the problem as a sequential decision problem, which naturally falls in the realm of Markov decision processes; since the liquidation has to happen in a finite amount of time, this leads to a finite horizon Markov decision process. The optimal solution of a Markov decision problem with small action and state spaces can be obtained using backward dynamic programming, which stores the value function of each state in a table and recurses backward in time to solve the Bellman equation. However, when the state space is large, storing values in a table is not computationally feasible, and a large action space makes the problem even more demanding. We use techniques from approximate dynamic programming, also called reinforcement learning in the computer science community, to solve our problem.


### 1.2 Our Contribution

Our contributions are summarized as follows.

• As the problem is to make decisions through time based on the dynamics of the market, we have formulated it as a Markov decision process.

• Considering the problem as finite horizon, we provide a solution as a non-stationary policy based on time and the dynamics of the market, using the actor critic method.

• We show empirically that, under our proposed solution, the TD error converges as the number of iterations increases.

• We show empirically that the policy obtained with our approach produces a better return than the naive policy of equal block market orders.

### 1.3 Thesis Outline

The rest of the thesis is organized as follows. In Chapter 2 we briefly discuss the Markov decision process. In Chapter 3, we describe the formulation of our problem and the solution scheme in detail. In Chapter 4, we provide an analysis of the empirical results obtained with the proposed scheme. In the concluding Chapter 5, we summarize the work done and future directions related to our work.

### Chapter 2

## Markov Decision Process

A sequential decision making problem where the state of the system depends on the actions taken can be modelled as a Markov decision process. Below we discuss the deterministic dynamic optimization problem and then gradually move towards the stochastic optimization problem.

### 2.1 Introduction

Consider a discrete time dynamic system with the following dynamics [4],

$$x_{t+1} = f_t(x_t, u_t)$$

where

• $x_t$ is the current state of the system, e.g. the position and/or velocity of a particle, or the price of securities in the stock market.

• $u_t$ is the control applied to the system, e.g. the force applied to a particle, or the number of stocks to buy/sell.

and a reward $r_t(x_t, u_t)$ is generated, such as the capital gain from selling securities. We take $t \in \{0, 1, 2, \ldots, T\}$ as the set of discrete time steps. The total reward accumulated can be written as

$$\sum_{t=0}^{T-1} r_t(x_t, u_t) + r_T(x_T) \tag{2.1}$$

where $x_T$ is the terminal state and $r_T(x_T)$ is the terminal reward.

The objective is to maximize the overall reward generated, e.g. maximize the overall capital gain from selling the securities. We can write this as the optimization problem

$$\max_{u_0, u_1, \ldots, u_{T-1}} \; \sum_{t=0}^{T-1} r_t(x_t, u_t) + r_T(x_T) \tag{2.2}$$

If we assume rewards are discounted through time, we can add a discount factor $\gamma$ to the above equation and write it as


$$\max_{u_0, u_1, \ldots, u_{T-1}} \; \sum_{t=0}^{T-1} \gamma^t r_t(x_t, u_t) + \gamma^T r_T(x_T) \tag{2.3}$$

Given the initial state $x_0$, our goal is to find the control sequence $(u_0, u_1, \ldots, u_{T-1})$. This problem could be solved by enumerating all action sequences and searching exhaustively, but that is computationally intractable.

### 2.2 Dynamic Programming

Dynamic programming [4], popularized by Richard Bellman, is a technique of breaking a large problem into smaller problems that are easy to solve. Consider equation (2.3), which we can rewrite as

$$\begin{aligned}
V_0(x_0) &= \max_{u_0, \ldots, u_{T-1}} \sum_{t=0}^{T-1} \gamma^t r_t(x_t, u_t) + \gamma^T r_T(x_T) \\
&= \max_{u_0} \Big[ r_0(x_0, u_0) + \max_{u_1, \ldots, u_{T-1}} \sum_{t=1}^{T-1} \gamma^t r_t(x_t, u_t) + \gamma^T r_T(x_T) \Big] \\
&= \max_{u_0} \big[ r_0(x_0, u_0) + \gamma V_1(x_1) \big]
\end{aligned} \tag{2.4}$$

In general we can write

$$V_t(x_t) = \max_{u_t} \big[ r_t(x_t, u_t) + \gamma V_{t+1}(x_{t+1}) \big] \tag{2.5}$$

This is called the Bellman equation [4], which we can solve by backward substitution in the following way:

• Step 1: Initialize $V_T(x_T) = r_T(x_T)$ for all terminal states $x_T$, and set $t = T-1$.

• Step 2: Calculate $V_t(x_t) = \max_{u_t} \big[ r_t(x_t, u_t) + \gamma V_{t+1}(x_{t+1}) \big]$ for all $x_t \in S$, where $x_{t+1} = f_t(x_t, u_t)$.

• Step 3: If $t > 0$, set $t = t - 1$ and return to Step 2; else stop.

This technique is called backward dynamic programming [4, 16, 8].
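The three steps above can be sketched on a toy liquidation instance (the numbers here, including the linear impact model and base price, are our illustrative assumptions, not the thesis's model): four shares must be sold over three steps plus a terminal dump, and selling $u$ shares in one step fetches an impact-adjusted price per share.

```python
# Backward dynamic programming on a toy finite-horizon liquidation:
# T = 3 trading steps plus a forced terminal dump at t = T.
T = 3
X0 = 4          # shares to liquidate
P = 10.0        # hypothetical base price
IMPACT = 1.0    # hypothetical linear impact per share sold in one step

def reward(u):
    # proceeds from selling u shares in one step at the impacted price
    return u * (P - IMPACT * u)

# V[t][x] = best reward-to-go with x shares left at time t
V = [[0.0] * (X0 + 1) for _ in range(T + 1)]
policy = [[0] * (X0 + 1) for _ in range(T)]

# Step 1: terminal condition -- leftover shares are dumped at time T
for x in range(X0 + 1):
    V[T][x] = reward(x)

# Steps 2-3: recurse backward, solving the Bellman equation at each (t, x)
for t in range(T - 1, -1, -1):
    for x in range(X0 + 1):
        best_val, best_u = float("-inf"), 0
        for u in range(x + 1):          # sell u of the remaining x shares
            val = reward(u) + V[t + 1][x - u]
            if val > best_val:
                best_val, best_u = val, u
        V[t][x] = best_val
        policy[t][x] = best_u
```

With these numbers the recursion spreads the sales out: `V[0][4]` works out to 36.0 (sell one share per period), versus 24.0 for dumping all four shares at once.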

### 2.3 Markov Decision Process

Consider the following, similar to the deterministic discrete time dynamic optimization problem but with a stochastic component added; the dynamics of the system are now governed by equation (2.6)

$$x_{t+1} = f_t(x_t, u_t, \epsilon_t) \tag{2.6}$$

where $\epsilon_t$ is random noise.

In this setting we can rewrite the optimization as finding an optimal policy $\pi = (\pi_0, \pi_1, \ldots, \pi_{T-1}) \in \Pi$ such that

$$V_0(x_0) = \max_{\pi \in \Pi} \, \mathbb{E}_{\epsilon_0, \epsilon_1, \ldots, \epsilon_{T-1}} \Big[ \sum_{t=0}^{T-1} \gamma^t r_t(x_t, \pi_t(x_t)) + \gamma^T r_T(x_T) \Big] \tag{2.7}$$

Similar to the observation in the deterministic case, we can establish a relation and obtain a backward substitution method to find the optimal sequence of actions:

$$\begin{aligned}
V_0(x_0) &= \max_{\pi \in \Pi} \, \mathbb{E}_{\epsilon_0, \epsilon_1, \ldots, \epsilon_{T-1}} \Big[ \sum_{t=0}^{T-1} \gamma^t r_t(x_t, \pi_t(x_t)) + \gamma^T r_T(x_T) \Big] \\
&= \max_{\pi \in \Pi} \sum_{x_1} P(x_1 \mid x_0, \pi_0(x_0)) \Big[ r_0(x_0, \pi_0(x_0)) + \mathbb{E}_{\epsilon_1, \ldots, \epsilon_{T-1}} \Big( \sum_{t=1}^{T-1} \gamma^t r_t(x_t, \pi_t(x_t)) + \gamma^T r_T(x_T) \Big) \Big] \\
&= \max_{\pi_0} \sum_{x_1} P(x_1 \mid x_0, \pi_0(x_0)) \big[ r_0(x_0, \pi_0(x_0)) + \gamma V_1(x_1) \big]
\end{aligned} \tag{2.8}$$

Hence the above can be written in general as

$$\begin{aligned}
V_t(x_t) &= \max_{\pi_t} \sum_{x_{t+1}} P(x_{t+1} \mid x_t, \pi_t(x_t)) \big[ r(x_t, \pi_t(x_t)) + \gamma V_{t+1}(x_{t+1}) \big] \\
&= \max_{\pi_t} \Big[ r(x_t, \pi_t(x_t)) + \gamma \sum_{x_{t+1}} P(x_{t+1} \mid x_t, \pi_t(x_t)) \, V_{t+1}(x_{t+1}) \Big] \\
&= \max_{\pi_t} \big[ r(x_t, \pi_t(x_t)) + \gamma \, \mathbb{E} V_{t+1}(x_{t+1}) \big]
\end{aligned} \tag{2.9}$$

where the expectation is based on the one step transition, i.e. the distribution of $\epsilon_t$. We can utilize the backward dynamic programming method to generate the optimal control sequence or optimal policy.

Moreover, a Markov decision process can be formally defined [7, 16, 6] as a 5-tuple $(S, A, P_\cdot(\cdot, \cdot), R(\cdot, \cdot), \gamma)$ where

• $S$ is the set of possible states of the system.

• $A$ is the set of possible actions that can be taken.

• $P_a(s, s') = \Pr(s_{t+1} = s' \mid s_t = s, a_t = a)$ is the probability of the system being in state $s'$ at time $t+1$ given that it was in state $s$ at time $t$ and the action taken was $a$.

• $R(s, a)$ is the reward generated when action $a$ is taken in state $s$.

• $\gamma \in [0, 1]$ is the discount factor.

The problem in MDPs is to find a policy that maps states to actions, i.e. a function $\pi : S \to A$ that decides the action $\pi(s)$ given that the state is $s$. It can be written as finding a policy $\pi = (\pi_0, \pi_1, \ldots, \pi_T)$ such that the accumulated reward is maximized, i.e.

$$\max_{\pi \in \Pi} \mathbb{E} \Big[ \sum_{t=0}^{T} \gamma^t r_t(s_t, \pi_t(s_t)) \Big] \tag{2.10}$$


where the expectation is taken over $P_\pi(\cdot, \cdot)$. The problem discussed above is the finite horizon Markov decision problem; if the time horizon is not limited, the problem is said to be an infinite horizon problem, and it can be written as

$$\max_{\pi \in \Pi} \mathbb{E} \Big[ \sum_{t=0}^{\infty} \gamma^t r_t(s_t, \pi_t(s_t)) \Big] \tag{2.11}$$

2.3.1 Value Function and Q Function

For some policy $\pi$, a value function [15, 16, 7] $V : S \to \mathbb{R}$ is defined as

$$V_t^{\pi}(x_t) = r_t(x_t, \pi_t(x_t)) + \mathbb{E}\big[ V_{t+1}^{\pi}(x_{t+1}) \mid x_t, \pi_t(x_t) \big] \tag{2.12}$$

Hence our objective is to find a policy $\pi^*$ such that

$$V_t^{\pi^*}(x_t) \geq V_t^{\pi}(x_t) \qquad \forall \pi \in \Pi$$

where $\Pi$ is the set of all feasible policies. Similarly, a Q function [15, 16, 6] $Q : S \times A \to \mathbb{R}$ is defined as the value of taking action $a$ and following the optimal policy thereafter:

$$Q_t(x_t, a_t) = r(x_t, a_t) + \gamma \, \mathbb{E} V_{t+1}^{\pi^*}(x_{t+1}) \tag{2.13}$$

As can be seen, this involves an expectation calculation, which is difficult even when the distribution is known. Using the Q function, the optimal action is chosen so that the Q function is maximized:

$$a_t^* = \operatorname{argmax}_{a_t \in A_t} Q(s_t, a_t), \qquad t \in \{0, 1, \ldots, T-1\}$$

2.3.2 Value Iteration

Consider an infinite horizon control problem. Backward dynamic programming techniques can not be used, as we do not have a horizon and terminal reward function defined. But we can still use the optimality equation in the same way as in backward case, following is the value iteration [16, 15] algorithm for infinite horizon

As can be observed this is uses same optimally equation used in backward dynamic programming, but it starts with an some initial values of all states and then using optimality equation it calculates it approximates the value function. For small state and action space it may produce optimal value too. This is example of forward dynamic programming where we start with initial value function and update the value function based on optimality equation.


Algorithm 1 Value Iteration

1: procedure valueIteration
2:   Initialization: set $V^0(x) = 0$ for all $x \in S$; set a tolerance parameter $\eta > 0$; set $n = 0$
3:   repeat:
4:     for $x \in S$ do
5:       $V^{n+1}(x) = \max_{a \in A} \big[ r(x, a) + \gamma \sum_{x' \in S} \Pr(x' \mid x, a) \, V^n(x') \big]$
6:     let $a^{n+1}$ be the policy vector that attains the maximum above
7:     if $\| V^{n+1} - V^n \| < \eta (1 - \gamma) / 2\gamma$ then
8:       set $a^\eta = a^{n+1}$ and $V^\eta = V^{n+1}$; return $a^\eta, V^\eta$
9:     else
10:      set $n = n + 1$ and go to repeat

### Chapter 3

## Problem Formulation and Proposed Solution

### 3.1 Introduction

An order book is the list of orders (manual or electronic) that a trading venue (in particular a stock exchange) uses to record the interest of buyers and sellers in a particular financial instrument. A matching engine uses the book to determine which orders can be fulfilled, i.e. what trades can be made [21]. Assume a large investor who owns a significantly large amount of stock of a particular security and wants to liquidate his portfolio, i.e. sell off his whole holding. We consider a perfect and complete market. When he places a market order, an entry in the market order book is made. The information in the order book is public to every investor/broker. This may lead to an adverse impact on the price. There have already been a large number of studies on the price impact of market orders, such as [1]. Considering the price impact, the following can be thought of as the market dynamics

$$p_{t+1} = f_t(p_t, x_t, \epsilon_t) \tag{3.1}$$

where $p_t$ is the price of the security, $x_t$ is the amount of the sell order and $\epsilon_t$ is random noise.

More conveniently, we can write the impact as

$$\begin{aligned}
p_{t+1} &= f_t(p_t, x_t) + g_t(\epsilon_t) \\
p_{t+1} &= p_t^* + g_t(\epsilon_t)
\end{aligned} \tag{3.2}$$

where $f_t$ can be regarded as the order size impact of the trader and $g_t$ as the impact due to noise traders; $p_t^*$ is the impacted price at which the trade takes place. Taking equation (3.2) as the dynamics of the market, liquidation using market orders can be formulated as the optimization problem below

$$\max_{x_0, x_1, \ldots, x_T} \; \mathbb{E} \sum_{t=0}^{T} p_t^* x_t \qquad \text{s.t.} \qquad x_0 + x_1 + x_2 + \cdots + x_T = X_0 \tag{3.3}$$

where $X_0$ is the initial amount available to liquidate. Equation (3.3) is a dynamic optimization problem with the price as the state variable and the amount of stock to trade as the action



variable, but with a constraint; i.e. equation (3.3) is a Markov decision problem with a constraint. We can convert it into an unconstrained optimization problem by introducing a two dimensional state $s_t = \{p_t, X_t\}$ and an upper bound $X_t$ on the action $x_t$, as described below

$$\max_{x_0, x_1, \ldots, x_T} \; \mathbb{E} \sum_{t=0}^{T} c_t(s_t, x_t) \tag{3.4}$$

with $X_{t+1} = X_t - x_t$, where the action space is now bounded as $x_t \in \{0, 1, \ldots, X_t\}$ for all $t \in \{0, 1, \ldots, T\}$.

### 3.2 Dark Pool

As discussed in the previous section, the impact of the trade size may adversely affect the price of the security; hence, one has to place orders in such a way that the impact is minimal. Since portfolio liquidation is a process an investor wants to complete in a limited period of time, he may have to compromise with a lower return. Dark pools are exchange markets working privately that allow traders to place their orders, but not all information about orders or trades is made public; hence the name dark pool [3, 5]. The only way to know the market depth and other information in a dark pool is to place orders in it.

Even for an investor, placing an order in a dark pool reveals only a small amount of information [3]. There are no publicly available order books as in the case of market orders. Suppose an investor, say a seller, places an order of size $z_t$ at time $t$, and an amount $y_t$ is available to buy from the buyer side. After trade execution, say an amount $d_t$ got executed, where $d_t \leq y_t$; since the investor can only observe $d_t$, the information available to him is only $\min(z_t, y_t)$. The only thing he can learn is the fraction of his order that was executed. There have been studies on optimal order placement in dark pools so that the execution of trades is maximized, such as [9]. To the best of our knowledge, the study closest to our work is [14], where no partial execution in the dark pool is considered and the problem is formulated with the whole history of orders making an impact on prices. In this thesis we assume only the Markov property, i.e. the price depends only on the previous order size and the previous price, along with a random component.

Since prices in a dark pool are mostly derived from a conventional exchange price, such as the NBBO mid price [3], it can be said that the impact of order size on the price in the dark pool is close to negligible. It may therefore seem that the investor should place all his orders in the dark pool rather than as market orders. But one thing to note about dark pools is that there is no guarantee that orders will be executed there. Hence it is suitable to place orders both in the dark pool and as market orders on the conventional exchange. In the next section we formulate the problem as a Markov decision process and then move on to finding a way to solve it.


### 3.3 Problem Formulation

Let us assume

$$\begin{aligned}
X_t &= \text{stock available at time } t \text{ to liquidate} \\
x_t &= \text{amount of the order placed in the primary exchange at time } t \\
y_t &= \text{amount of the order placed in the dark pool at time } t \\
z_t &= \text{number of stocks executed in the dark pool at time } t \\
d_t &= \text{price of the stock in the dark pool at time } t \\
p_t &= \text{price of the stock in the primary market at time } t
\end{aligned} \tag{3.5}$$

By selling his stocks in the primary exchange and the dark pool, the investor obtains liquidity as below

$$r(p_t, d_t, x_t, y_t) = p_t^* \, x_t + d_t^* \, z_t \tag{3.6}$$

Considering the above as the reward function, we can formulate the optimization problem

$$\max_{x_t, y_t} \; \mathbb{E} \sum_{t=0}^{T} r(p_t, d_t, x_t, y_t) \tag{3.7}$$

subject to the following dynamics of the market

$$\begin{aligned}
p_{t+1} &= f(p_t, x_t, \epsilon_t) = h(p_t, x_t) + e(p_t, \epsilon_t) = p_t^* + e(p_t, \epsilon_t) \\
d_{t+1} &= g(d_t, p_t, y_t, \eta_t) = o(d_t, p_t, y_t) + m(d_t, \eta_t) = d_t^* + m(d_t, \eta_t)
\end{aligned} \tag{3.8}$$

and the constraint

$$\sum_{t=0}^{T} (x_t + z_t) = X_0 \tag{3.9}$$

where $\epsilon_t$ and $\eta_t$ are random variables, and we have the relationship $X_{t+1} = X_t - (x_t + z_t)$.

By a method similar to that of Section 3.1, we can turn this into an unconstrained problem.

### 3.4 Backward Dynamic Programming Approach

Let us consider the case of market orders only. As discussed, we take the state of the system as the vector $s_t = \{p_t, X_t\}$, i.e. the price and the available liquidation amount of stock at time $t$. The backward dynamic programming algorithm for this is given below; as can be seen, it contains nested for loops as well as an expectation calculation.


Let us assume that we have some fixed maximum price $P$, that a discrete probability transition is given, and that for the terminal case

$$\begin{aligned}
r_T(s_T) &= p_T^* X_T \\
s_T &= \{p_T, X_T\} \\
a_T &= \{X_T\}
\end{aligned} \tag{3.10}$$

Algorithm 2 Backward DP

1: procedure backDP
2:   Calculate the terminal liquidity and initialize the terminal action:
3:     $V_T(s_T) = r_T(s_T)$ for all $s_T$
4:     $A_T(s_T) = X_T$ for all $s_T$
5:   for $t = T-1$ down to $0$ do
6:     for $p_t \in \{0, 1, \ldots, P\}$ do
7:       for $X_t \in \{0, 1, \ldots, X_0\}$ do
8:         $s_t = (p_t, X_t)$; $maximum = 0$; $action = 0$
9:         for $a_t \in \{0, 1, \ldots, X_t\}$ do
10:          $value = r_t(s_t, a_t) + \mathbb{E} V_{t+1}(s_{t+1})$
11:          if $value > maximum$ then
12:            $maximum = value$; $action = a_t$
13:        $V_t(s_t) = maximum$; $A_t(s_t) = action$

Based on this, we can see that the algorithm is $O(T P^2 X_0^2)$ in the worst case, with space complexity $O(T P X_0)$. Consider, for example, an investor who initially has 10000 stocks, a maximum price of \$100, and 10 periods in which to liquidate: we then have a time complexity of $O(10^{13})$ and a space complexity of $O(10^{7})$, which is not feasible. One way to reduce this is to note that, since we are considering a large investor, he will place orders in blocks, and the possible price transitions are limited. Assume he places orders in blocks of 100 and only 10 price transitions are possible; then we have time complexity $O(10^{8})$ and space complexity $O(10^{5})$, which seems feasible. But we can observe from this discussion that the backward dynamic programming technique is good only for small problems. This is termed the curse of dimensionality in dynamic programming. For the dark pool problem, where the dimensionality of the state and action vectors is even higher, it will not be feasible to apply this. Moreover, we have assumed the prices to be discrete, which is not realistic.

In the next section we will look into approximation based dynamic programming algorithms and apply them to the dark pool problem.


### 3.5 Approximate Dynamic Programming

We have seen that the backward dynamic programming approach is not feasible for problems with large state and action spaces. The obvious way to proceed is to step forward through time. But to step forward in time we need the value function for future times, which is not available. We can instead consider an approximation of the value function. Let us assume the approximation is given as:

$$\hat{V}_t^n(s_t) = \text{approximate value function at time } t \text{ for state } s_t \text{ at iteration } n$$

This value function can be used to update the value function at iteration $n+1$ with the help of information gathered at iteration $n$. To find the optimal action at iteration $n$ we use the Bellman equation

$$a_t^n = \operatorname{argmax}_{a \in A_t} \big\{ r_t(s_t, a) + \mathbb{E} \hat{V}_{t+1}^{n-1}(s_{t+1}) \big\} \tag{3.11}$$

where the expectation is over the transition probability of $s_t$.

As calculation of the expectation can be intractable, we can generate samples and calculate a sample based expectation. Below is a generic forward dynamic programming algorithm due to [15, 4].

Algorithm 3 Forward DP

1: procedure forwardDP
2:   for all $t$, initialize the value function $\hat{V}_t^0(s_t)$ for all $s \in S$; set the iteration limit $N$
3:   $n \leftarrow 1$
4:   repeat
5:     for $t = 0$ to $T-1$ do
6:       generate a sample $\Omega_t$
7:       solve: $a_t^n = \operatorname{argmax}_{a \in A_t} \big\{ r_t(s_t, a) + \sum_{\omega_t \in \Omega_t} \Pr(\omega_t) \, \hat{V}_{t+1}^{n-1}(f_t(s_t, a, \omega_t)) \big\}$
8:       choose $\omega_t \in \Omega_t$
9:       compute $s_{t+1} = f(s_t, a_t, \omega_t)$
10:      update the approximation $\hat{V}_t^n(s_t)$
11:    $n = n + 1$
12:    if $n < N$ then go to repeat else stop

There are many methods that can be used to update the estimate of the value function. One way is to compute

$$\hat{v}_t^n = \Big\{ r_t(s_t, a) + \sum_{\omega_t \in \Omega_t} \Pr(\omega_t) \, \hat{V}_{t+1}^{n-1}(f_t(s_t, a, \omega_t)) \Big\} \tag{3.12}$$

which is an estimate of the value function at state $s_t$ and obviously depends on our approximation at iteration $n-1$. Since the estimate $\hat{v}_t^n$ may be noisy, we can smooth the value function by

$$\hat{V}_t^n(s_t) = (1 - \alpha^n) \hat{V}_t^{n-1}(s_t) + \alpha^n \hat{v}_t^n \tag{3.13}$$
where $\alpha$ is called the step size parameter and takes values between 0 and 1. We can see that the above algorithm assumes that the state space and the action space are finite. When we do not have a finite space, e.g. when the market price cannot be regarded as discrete, we cannot use the above algorithm directly. What we can do instead is use some kind of function approximation, such as a linear function with some parameters, or a neural network [6, 15, 16]. This lets us represent the value function with a limited number of parameters and evaluate it for any state. For example, below is a linear function approximation based on the state of the system

$$\hat{V}_t(s_t) = \theta_1 \phi_1(s_t) + \theta_2 \phi_2(s_t) + \cdots + \theta_m \phi_m(s_t) \tag{3.14}$$

where the $\theta_i$ are parameters and the $\phi_i(s_t)$ are features derived from the state vector $s_t$. Now all we need is a way to update the parameters $\theta_1, \theta_2, \ldots, \theta_m$; rather than storing the value function for each state, we need to store only the parameters.

3.5.1 Actor Critic Method

We can group the majority of approximate dynamic programming methods, also called reinforcement learning in the computer science community, into actor only, critic only and actor-critic models [13]. In real life cases, where the action as well as the state space is large, it is not feasible to store the value function or the policy itself for each state or state-action pair; hence an approximation function has to be used for the value, as described in the section above. We describe the critic only and actor only methods in the following paragraphs and then move to the actor critic method.

Critic only methods, e.g. Q-learning [23] and SARSA [18], use function approximators for the state-action function, i.e. the Q function described in the previous chapter. There is no explicit function used for the policy. A deterministic policy $\pi = \{\pi_0, \ldots, \pi_{T-1}\}$ is calculated from the approximated Q function by solving the optimization problem over the value function

$$\pi_t(s_t) = \operatorname{argmax}_{a \in A_t} Q(s_t, a) \tag{3.15}$$

In the online learning setting there are examples for simple MDPs, such as [19, 10, 2], showing that methods like Q-learning and SARSA do not converge with specific function approximators. This divergence was further analysed by [20], who showed that convergence can be assured for approximators linear in the parameters if the samples are drawn accordingly.

Actor only methods do not store a value function or a parametric value function; instead, the policy itself is parametric, and they optimize the reward based on the Bellman equation in the space of the policy function's parameters. Examples of actor only methods are the SRV algorithm [11] and the REINFORCE algorithm [22]. The main advantage of actor only methods is their convergence property: convergence is obtained if the estimated gradients are unbiased and the step size or learning rate satisfies the Robbins-Monro conditions [17]

$$\alpha_k > 0 \quad \forall k \tag{3.16}$$

and

$$\sum_{k=0}^{\infty} \alpha_k = \infty, \qquad \sum_{k=0}^{\infty} \alpha_k^2 < \infty \tag{3.17}$$
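These conditions can be illustrated numerically for the classic schedule $\alpha_k = 1/(k+1)$: the partial sums of $\alpha_k$ grow without bound (harmonic series, roughly $\log N$), while the partial sums of $\alpha_k^2$ stay bounded, converging to $\pi^2/6$.

```python
import math

# Check the Robbins-Monro conditions (3.16)-(3.17) numerically
# for the step-size schedule alpha_k = 1/(k+1).
N = 100_000
sum_alpha = sum(1.0 / (k + 1) for k in range(N))        # diverges ~ log N
sum_alpha_sq = sum(1.0 / (k + 1) ** 2 for k in range(N))  # -> pi^2 / 6
```

With $N = 10^5$ the first sum already exceeds 12 and keeps growing with $N$, while the second sum is within about $10^{-5}$ of $\pi^2/6 \approx 1.6449$.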

A drawback of actor only methods is that they may have a large variance in the estimated gradient.

The actor critic method combines the benefits of both the actor only and critic only methods: it is able to produce continuous actions, and hence is useful for large action spaces, while the large variance in the gradient estimate is reduced by the presence of the critic. Consider the following setting, where we approximate the value function and the policies as

$$\hat{V}_t(s_t) = \theta_t^T \phi(s_t) \tag{3.18}$$

and

$$\pi_t(s_t) = \omega_t^T \psi(s_t) \tag{3.19}$$

where

$$\phi(s_t) = \{\phi_{1t}, \phi_{2t}, \ldots, \phi_{mt}\}, \qquad \psi(s_t) = \{\psi_{1t}, \psi_{2t}, \ldots, \psi_{mt}\}$$

are vectors of features based on the state $s_t$; the parameters, also considered to be time dependent, are

$$\omega_t = \{\omega_{1t}, \omega_{2t}, \ldots, \omega_{mt}\}, \qquad \theta_t = \{\theta_{1t}, \theta_{2t}, \ldots, \theta_{mt}\}$$
Consider the Bellman equation (2.12)

$$V_t^{\pi}(x_t) = r_t(x_t, \pi_t(x_t)) + \mathbb{E}\big[ V_{t+1}^{\pi}(x_{t+1}) \mid x_t, \pi_t(x_t) \big]$$

With the approximation, under a policy $\pi$, we can rewrite the above equation as

$$\hat{V}_t^{\pi}(s_t) = \big\{ r_t(s_t, \pi_t(s_t)) + \mathbb{E} \hat{V}_{t+1}^{\pi}(s_{t+1}) \big\} \tag{3.20}$$
The difference between the right hand and left hand sides is known as the temporal difference (TD) or Bellman error. This error can be used to obtain the gradient in the following way

$$\begin{aligned}
\delta_t &= \big\{ r_t(s_t, \pi_t(s_t)) + \mathbb{E} \hat{V}_{t+1}^{\pi}(s_{t+1}) \big\} - \hat{V}_t^{\pi}(s_t) \\
&= r_t(s_t, \omega_t^T \psi_t(s_t)) + \mathbb{E} \, \theta_{t+1}^T \phi(s_{t+1}) - \theta_t^T \phi(s_t)
\end{aligned} \tag{3.21}$$

From the above Bellman error we can easily calculate the gradient of the mean square error, and in the linear case the update of the value function parameters can be obtained as

$$\theta_t \leftarrow \theta_t + \alpha_\theta \, \delta_t \, \phi_t(s_t) \tag{3.22}$$


Similarly, we can obtain the update for the policy parameters as

$$\omega_t \leftarrow \omega_t + \alpha_\omega \, \delta_t \, \psi_t(s_t) \tag{3.23}$$

Eligibility traces are a temporary record of the occurrence of an event, i.e. of the system states visited along a trajectory, and can be defined as

$$e_t(s) = \begin{cases} \lambda e_{t-1}(s) & \text{if } s \neq s_t \\ \lambda e_{t-1}(s) + 1 & \text{if } s = s_t \end{cases} \tag{3.24}$$

where $\lambda$ is a discounting parameter for the history of the state. The above eligibility traces can be generalized for continuous spaces as

$$z_t = \begin{cases} 0 & \text{if } t = 0 \\ 1 & \text{if } t = T-1 \\ \phi_t(s_t) + \lambda z_{t-1} & \text{otherwise} \end{cases} \tag{3.25}$$

Observe that $z$ is now a vector of the same dimension as the features of the value function. Combining all of this together, we can write the actor-critic algorithm.
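Before the full algorithm, the critic's piece of the machinery — the trace accumulation of (3.25) followed by the parameter update (3.22) — can be sketched in isolation on a made-up episode. Two simplifications are ours: a single parameter vector is shared across time instead of the time-dependent $\theta_t$, and the expectation in (3.21) is replaced by one sampled transition.

```python
import numpy as np

# A minimal TD(lambda)-style critic update with linear features.
n_feat, T, lam, alpha = 3, 5, 0.9, 0.1
theta = np.zeros(n_feat)

def phi(s):
    return np.array([1.0, s, s * s])       # hypothetical feature map

states = [0.1 * t for t in range(T + 1)]   # one fixed sample trajectory
rewards = [1.0] * T                        # made-up per-step rewards

z = np.zeros(n_feat)                       # eligibility trace vector
for t in range(T):
    v = theta @ phi(states[t])
    v_next = 0.0 if t == T - 1 else theta @ phi(states[t + 1])
    delta = rewards[t] + v_next - v        # sampled TD (Bellman) error
    z = lam * z + phi(states[t])           # trace accumulation, as in (3.25)
    theta = theta + alpha * delta * z      # critic update (3.22)
```

Because the rewards are positive and the values start at zero, the TD errors are positive and the learned values grow, which is the behaviour the trace-weighted update should produce here.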

### 3.6 Actor Critic method for Dark Pool Problem

We make the following assumptions:

• The price impact function is known for the conventional exchange.

• Prices in the dark pool are derived from primary exchange prices.

• The impact of trade size in the dark pool is negligible.

• No other fees are charged for order placement in either the dark pool or the conventional exchange.

In addition to the above assumptions, we normalize prices to be real numbers in [0, 1]; we also take the stock available as a real number in [0, 1], and we trade fractions of the available amount in the dark pool as well as in the conventional market. The dynamics of the market are as follows:

• $p_t = f(p_{t-1}, x_{t-1}) + \epsilon_t$, where we assume $\epsilon_t \overset{iid}{\sim} \mathcal{N}(\mu = 0, \sigma^2 = 0.001)$

• $d_t^* = p_t + \eta_t$, i.e. no impact of order size in the dark pool; however, a random component $\eta_t \overset{iid}{\sim} \mathcal{N}(\mu = 0, \sigma^2 = 0.001)$ is added

• $y_t \sim \mathrm{Gumbel}(\mu = y_{t-1}, \beta = 0.02)$ or $y_t = 0$ with equal probability, where $y_t$ is the fraction of the dark pool order executed and the initial value $y_0$ is assumed to be 0

• the state of the system is given as $s_t = \{p_t, X_t, y_{t-1}\}$, where $p_t$ is the price in the conventional market, $X_t$ is the amount left to liquidate, and $y_{t-1}$ is the fraction of the previous dark pool order that was executed


Algorithm 4 Actor-Critic Algorithm

1: procedure actorCritic
2:     initialize θ_t, ω_t ∀ t ∈ {0, 1, ..., T−1}, N
3:     n ← 1
4:     repeat
5:         generate initial state s_0 randomly
6:         initialize eligibility trace z = 0
7:         for t = 0 to T−1 do
8:             generate sample Ω_t
9:             select action based on current parameters: a_t = ω_t^T ψ_t(s_t)
10:            generate random exploration variable e_t
11:            a_t ← a_t + e_t
12:            if t = T−1 then
13:                δ_t = r_t(s_t, a_t) + Σ_{o_t ∈ Ω_t} Pr(o_t) r_T(f_T(s_t, a_t, o_t)) − θ_t^T φ_t(s_t)
14:                z = 1
15:            else
16:                δ_t = r_t(s_t, a_t) + Σ_{o_t ∈ Ω_t} Pr(o_t) θ_{t+1}^T φ_{t+1}(f_t(s_t, a_t, o_t)) − θ_t^T φ_t(s_t)
17:                z = λz + φ_t(s_t)
18:            θ_t ← θ_t + α_θ δ_t z
19:            ω_t ← ω_t + e_t α_ω δ_t ψ_t(s_t)
20:            compute s_{t+1} = f_t(s_t, a_t, o_t)
21:        n ← n + 1
22:    until n ≥ N
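The per-period update inside the double loop of Algorithm 4 can be sketched as a single function. The learning rates, λ, and the expectation estimate over Ω_t are supplied by the caller; the default values here are purely illustrative.

```python
import numpy as np

def actor_critic_step(theta_t, theta_next, omega_t, phi_t, phi_next_exp,
                      psi_t, r_t, e_t, z, lam=0.9, a_th=0.01, a_om=0.01,
                      terminal=False, r_terminal_exp=0.0):
    """One inner-loop iteration of Algorithm 4 at period t.

    theta_t, theta_next : critic weights for periods t and t+1
    omega_t             : actor weights for period t
    phi_t, psi_t        : value and policy features at s_t
    phi_next_exp        : estimate of E_{o_t}[phi_{t+1}(f_t(s_t, a_t, o_t))]
                          computed from the sample set Omega_t
    r_t, r_terminal_exp : stage reward and expected terminal reward
    e_t                 : exploration noise added to the action
    z                   : eligibility trace
    """
    if terminal:
        # last period: bootstrap against the expected terminal reward
        delta = r_t + r_terminal_exp - theta_t @ phi_t
        z = np.ones_like(z)
    else:
        delta = r_t + theta_next @ phi_next_exp - theta_t @ phi_t
        z = lam * z + phi_t
    theta_t = theta_t + a_th * delta * z          # critic update
    omega_t = omega_t + e_t * a_om * delta * psi_t  # actor update
    return theta_t, omega_t, z, delta
```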

We consider the following set of features, based on the state vector, for value function approximation:

φ(s_t) = {X_t p_t, y_{t−1} p_t, X_t², p_t², y_{t−1}}    (3.26)

Similarly, the feature vectors for the dark pool policy and the conventional market policy are:

ψ_d(s_t) = {√(p_t X_t), y_{t−1}, X_t}
ψ_m(s_t) = {√(p_t X_t), p_t, X_t²}    (3.27)

With the above features, the policy and value function parameters are:

θ_t = {θ_{t1}, θ_{t2}, θ_{t3}, θ_{t4}, θ_{t5}}
ω_t^d = {ω_{t1}^d, ω_{t2}^d, ω_{t3}^d}
ω_t^m = {ω_{t1}^m, ω_{t2}^m, ω_{t3}^m}    (3.28)
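Transcribed directly, the feature maps (3.26)–(3.27) can be written as:

```python
import numpy as np

# State is s_t = (p_t, X_t, y_{t-1}): price, amount left, last executed fraction.

def phi(s):
    """Value-function features of eq. (3.26)."""
    p, X, y = s
    return np.array([X * p, y * p, X**2, p**2, y])

def psi_d(s):
    """Dark pool policy features of eq. (3.27)."""
    p, X, y = s
    return np.array([np.sqrt(p * X), y, X])

def psi_m(s):
    """Conventional market policy features of eq. (3.27)."""
    p, X, y = s
    return np.array([np.sqrt(p * X), p, X**2])
```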


The reward function is given as

r_t(s_t, x_t, z_t) = p*_t x_t + d*_t z_t y_t
r_T(s_T) = p*_T X_T    (3.29)

The policy functions are taken to be sigmoid functions, so that they output the fraction of the currently available liquidation amount to place as the next order. Using the above dynamics and assumptions, we apply the actor-critic learning method described in the previous section.
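A minimal sketch of the sigmoid policies and the reward (3.29), assuming the linear features above; the weight vectors ω^d, ω^m are the policy parameters, and the zero weights used in any example call are purely illustrative:

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def order_fractions(s, omega_d, omega_m):
    """Sigmoid policies: fractions of the remaining amount X_t to place as
    the market order (x_t) and the dark pool order (z_t)."""
    p, X, y = s
    psi_d = np.array([np.sqrt(p * X), y, X])       # eq. (3.27) features
    psi_m = np.array([np.sqrt(p * X), p, X**2])
    return sigmoid(omega_m @ psi_m), sigmoid(omega_d @ psi_d)

def reward(p_star, d_star, x_t, z_t, y_t):
    """Per-period liquidation return of eq. (3.29)."""
    return p_star * x_t + d_star * z_t * y_t
```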

### Chapter 4

## Simulation and Analysis

In this chapter we analyze the results obtained by applying the actor-critic method to the liquidation problem. First we discuss the simulation of the market price; then we show empirical evidence of convergence of the temporal difference error. The value function and the policy obtained are discussed. Finally, we show empirically that the policy obtained with our approach is better than the naive policy of equal block market orders in each time period.

[Plot: p_t (price) versus x_t (market order size)]

Figure 4.1: Sample Dynamics of Market Price

We consider a finite horizon of 20 time periods. Based on the assumptions and market dynamics discussed in section 3.6, the market price dynamics shown in figure 4.1 follow the equation below:

p_{t+1} = p_t (1 − k √(x_t / p_t)) + ε_t    (4.1)

where k is a constant and ε_t ~iid N(µ = 0, σ² = 0.001).
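A sample path from eq. (4.1) can be generated as follows; the values of k, the initial price p0, and the constant order size x are illustrative assumptions, and the price is clipped to stay in the normalized range.

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_prices(p0=0.9, k=0.05, T=20, x=0.05):
    """Sample a price path from eq. (4.1) with a constant order size x."""
    prices = [p0]
    for _ in range(T):
        p = prices[-1]
        # p_{t+1} = p_t (1 - k sqrt(x_t / p_t)) + eps_t, eps_t ~ N(0, 0.001)
        p_next = p * (1 - k * np.sqrt(x / p)) + rng.normal(0.0, np.sqrt(0.001))
        prices.append(float(min(max(p_next, 1e-6), 1.0)))
    return prices

path = simulate_prices()
```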



[Plot: δ_t (TD error) versus number of iterations (×10⁴), for times t = 0, 2, 7, 12, 15, 19]

Figure 4.2: Convergence of Temporal Difference Error

We have used an on-policy actor-critic method, which learns from the reward generated by each interaction with the environment. In our case, for each time period it updates the parameters of the policy and value function based on the liquidation amount received.

As discussed in section 3.5.1, the temporal difference error is used to compute the gradient, and convergence of the TD error is required for learning the optimal policy. Figure 4.2 shows the convergence of the temporal difference error δ. Initially the error increases; then, as the number of iterations grows, the TD error starts to converge.

Figure 4.3 shows the value function under the policy obtained after TD error convergence, as discussed above. Samples of the value function corresponding to different initial market prices have been drawn. The following observations can be made from the figure:

• The value function is the sum of liquidation returns still to be obtained, so it should decrease as time proceeds; indeed, the value function is monotonically decreasing.

• For a higher initial price of the security, the liquidation return should be higher. The graph shows the value function taking higher values for higher initial prices.

• The value function near the horizon is close to zero, indicating that as time proceeds fewer stocks are left to liquidate.

Now we discuss the policies obtained for dark pool and market orders, shown in figure 4.4. The market order size is larger than the dark pool order size most of the time, since execution in the dark pool is not guaranteed. As time proceeds towards the horizon, the order sizes increase: the policy trades more aggressively so that the whole portfolio is liquidated before the horizon.


[Plot: value function V_t versus time t, for initial prices 0.8837, 0.1225, 0.8509, 0.2125, 0.9442]

Figure 4.3: Value Function For Different Initial Prices

[Plot: order size (as a fraction) versus time t, for dark pool orders and market orders]

Figure 4.4: Comparison for Market Order and Dark Pool Order

We now show empirically that the liquidation return obtained using the actor-critic method is better than the return from equal block market orders. Under the market dynamics discussed above, we generated sample returns both for equal-block-size market orders and for the policy obtained using the actor-critic method, and computed the difference in liquidation return between the two approaches. The null hypothesis is that the mean difference between the two approaches is 0. Results of the one-sample t-test for different sample sizes are shown in table 4.1.

Sample Size   Mean Difference   t-statistic   p-value
100           0.0281            0.834         0.405
1000          0.0204            2.093         0.0365
5000          0.0220            4.912         9.14e-07
10000         0.0219            6.956         3.61e-12

Table 4.1: Results of the one-sample t-test

For sample sizes of 1000 and above, the p-value is small (< 0.05), so the null hypothesis can be rejected. The positive mean differences show that the return obtained using the actor-critic method is better than the naive approach of splitting the portfolio into equal block market orders. Hence our approach produces better liquidation returns.
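As a sketch of how the statistics in table 4.1 are computed, the one-sample t-statistic can be obtained directly from the return differences; the differences generated below are purely synthetic placeholders, not the thesis data, and the p-value would follow from the Student t distribution with n−1 degrees of freedom.

```python
import numpy as np

def one_sample_ttest(diffs):
    """One-sample t-test statistic for H0: mean difference = 0."""
    d = np.asarray(diffs, dtype=float)
    n = d.size
    mean = d.mean()
    se = d.std(ddof=1) / np.sqrt(n)   # standard error of the mean
    return mean, mean / se

# purely synthetic differences for illustration, not the thesis data
rng = np.random.default_rng(0)
mean_diff, t_stat = one_sample_ttest(rng.normal(0.02, 0.3, size=5000))
```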

### Chapter 5

## Future Work and Conclusion

A finite horizon control problem poses additional complexity: one has to find a non-stationary policy, as both the state and the action space change with time. Finding stationary policies for infinite horizon problems is comparatively easy, and convergence proofs exist for many methods with stationary policies.

We have applied a finite horizon actor-critic learning method here. We explicitly keep separate parameters for each time period, so that non-stationary policies can be obtained. We have shown convergence of the TD error empirically, and compared splitting into equal block market orders against our approach, showing empirically that our approach produces a higher liquidation return.

Because of the non-stationarity of policies in finite horizon problems, non-convergence of such algorithms has been shown for some problems; hence a formal proof of convergence is required. Since function approximation has been used, an error bound on the policies obtained should also be derived theoretically. These are some future directions, both for this problem and for finite horizon control problems as a whole.


## Bibliography

[1] Almgren, R., Chriss, N.: Optimal execution of portfolio transactions. Journal of Risk, pp. 5–39 (2001)

[2] Baird, L.: Residual algorithms: Reinforcement learning with function approximation. In: Proceedings of the Twelfth International Conference on Machine Learning, pp. 30–37. Morgan Kaufmann (1995)

[3] Banks, E.: Introduction to Dark Pools, pp. 3–32. Palgrave Macmillan UK, London (2014), http://dx.doi.org/10.1057/9781137449573_1

[4] Bellman, R.: Dynamic Programming. Princeton University Press (1957)

[5] Buti, S., Rindi, B., Werner, I.M.: Diving into dark pools. Working paper series, Ohio State University, Charles A. Dice Center for Research in Financial Economics (2010), http://EconPapers.repec.org/RePEc:ecl:ohidic:2010-10

[6] Bertsekas, D.P., Tsitsiklis, J.N.: Neuro-Dynamic Programming. Optimization and Neural Computation Series, Athena Scientific, 1st edn. (1996)

[7] Bertsekas, D.P.: Dynamic Programming and Optimal Control, vol. 2. Athena Scientific (1995)

[8] Bertsekas, D.P. (ed.): Dynamic Programming and Stochastic Control. Mathematics in Science and Engineering 125, Academic Press (1976)

[9] Ganchev, K., Nevmyvaka, Y., Kearns, M., Vaughan, J.W.: Censored exploration and the dark pool problem. Communications of the ACM 53(5) (2010)

[10] Gordon, G.J.: Stable function approximation in dynamic programming. Tech. rep., Carnegie Mellon University, Pittsburgh, PA, USA (1995)

[11] Gullapalli, V.: A stochastic algorithm for learning real-valued functions via reinforcement. Tech. rep., University of Massachusetts, Amherst, MA, USA (1988)

[12] Keim, D.B., Madhavan, A.: The Upstairs Market for Large-Block Transactions: Analysis and Measurement of Price Effects (Revised: 10-94). Rodney L. White Center for Financial Research Working Papers 21-92, Wharton School (undated), https://ideas.repec.org/p/fth/pennfi/21-92.html

[13] Konda, V.R., Tsitsiklis, J.N.: Convergence rate of linear two-time-scale stochastic approximation. The Annals of Applied Probability 14 (2004)

[14] Kratz, P., Schöneborn, T.: Portfolio liquidation in dark pools in continuous time. Mathematical Finance 25(3), 496–544 (2015), http://EconPapers.repec.org/RePEc:bla:mathfi:v:25:y:2015:i:3:p:496-544

[15] Powell, W.B.: Approximate Dynamic Programming: Solving the Curses of Dimensionality. Wiley Series in Probability and Statistics, Wiley-Interscience, 1st edn. (2007)

[16] Sutton, R.S., Barto, A.G.: Reinforcement Learning. Adaptive Computation and Machine Learning, The MIT Press (1998)

[17] Robbins, H., Monro, S.: A stochastic approximation method. The Annals of Mathematical Statistics 22(3), 400–407 (1951), http://www.jstor.org/stable/2236626

[18] Sutton, R.S.: Generalization in reinforcement learning: Successful examples using sparse coarse coding. In: Advances in Neural Information Processing Systems 8, pp. 1038–1044. MIT Press (1996)

[19] Tsitsiklis, J.N., Van Roy, B.: An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control 42(5), 674–690 (1997)

[20] Tsitsiklis, J.N., Van Roy, B.: Average cost temporal-difference learning. In: Proceedings of the 36th IEEE Conference on Decision and Control, vol. 1, pp. 498–502 (1997)

[21] Wikipedia: Order book (trading) — Wikipedia, the free encyclopedia (2016), https://en.wikipedia.org/w/index.php?title=Order_book_(trading)&oldid=725209547 [Online; accessed 5-July-2017]

[22] Williams, R.J.: Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8(3), 229–256 (1992), http://dx.doi.org/10.1007/BF00992696

[23] Wyatt, J.: Exploration and Inference in Learning from Reinforcement. Ph.D. thesis, Department of Artificial Intelligence, University of Edinburgh (1997)