1 Introduction
While the aftershocks of the latest global financial crisis are still being felt, many governments struggle to implement public policy because of budget deficits or lagging tax revenues (Bayer et al., 2015). The latter problem arises as a result of reduced economic activity, or when there is a strong sense among taxpayers that the expected personal benefit from tax evasion surpasses the corresponding social benefit of paying taxes (Alm and Beck, 1990; Bornstein and Rosenhead, 1990). This, in the absence of a properly designed tax system and enforcement mechanism, leads to tax evasion, a serious crime that saps the State of revenue and undermines the sense of social justice, as dishonest taxpayers seem to enjoy the same public goods as honest ones do. The resulting “shadow economy” also has a strong adverse impact on credit ratings and lending costs (Markellos et al., 2016), welfare programs, fiscal policies and unemployment (Fleming et al., 2000).
Of course, tax systems typically contain various safeguards to discourage tax evasion (defined here as the deliberate failure to declare all or some of one’s income to tax authorities). In practice, however, tax systems are rather complex policy structures which are difficult to make “airtight” in terms of tax evasion, for reasons having to do with i) occasional ambiguity in tax regulations, which hinders tax compliance and enforcement (Andreoni et al., 1998), and ii) the heterogeneous behaviors of the various taxpayer entities, based on their individual risk preferences (Hokamp and Pickhardt, 2010).
This paper is concerned with the development of a rigorous computational framework which can describe and predict the behavior of tax evaders, assuming that they are self-interested and work to maximize the utility of their own revenues, balancing potential gains from tax evasion against the risk of getting caught. In particular, we are interested in i) estimating State revenues for any given set of tax parameters (e.g., tax rates and penalties), ii) testing whether specific tax regulations are helpful or not, and iii) predicting how taxpayers (and tax compliance) respond to parameter changes. This last item is linked to taxpayers’ risk aversion, knowledge of which would help the State determine the effects of, for example, an increase in tax penalties or audit rates.
The issues raised above are essential to the State if it is to know the extent to which its tax policies are working or to rank alternative policies and take steps towards maximizing revenue. In this work, we propose to explore them using a combination of deep neural networks and Q-learning for determining the tax evasion behavior of a risk-averse taxpayer (we will use the term “firm” henceforth because we will be interested mainly in business entities). We will develop and test our approach in a context that builds on the work of Goumagias et al. (2012) (where only the case of risk neutrality was analyzed) and involves a close-to-real-world tax system, with many of the usual trappings such as tax rates, random audits, penalties and occasional tax amnesties, as well as taxpayer heterogeneity. As we shall see, the introduction of risk aversion into the model and the resulting nonlinearity of the firm’s utility function combine with the firm’s dynamics and lead to a significant increase in complexity. This puts the problem of finding the firm’s optimal behavior well beyond the reach of analytical methods and requires powerful approximation techniques to be brought to bear.
The main contributions of this work are i) the use of deep reinforcement learning techniques to obtain computational solutions for the firm’s optimal behavior based on the Markov dynamical model of Goumagias et al. (2012) and ii) a computational framework for exploring the behavior expected of self-interested, risk-averse firms that may choose to engage in tax evasion in order to maximize their own utility. In addition, and on more practical grounds, we estimate the risk aversion coefficient of the “average” firm (or group of firms) given empirical data on its tax compliance, and evaluate sample tax policies in terms of their benefit for the State (or, equivalently, the level of tax evasion they result in). To our knowledge, ours is the first work to apply deep learning in the context of taxation and tax evasion, and the first to obtain solutions that reveal the behavior of a risk-averse firm at a “fine” time-scale, i.e., on a year-to-year basis, based on its evolving status in the “eyes” of tax authorities. We view our approach as particularly relevant both in light of the growing interest in deep learning applications and for the opportunities that our model affords to regulators in the design of effective policies that make entities behave more honestly.
The remainder of this paper is structured as follows. In Section 2 we review the relevant literature and discuss how our approach is situated relative to previous work. Section 3 begins with a brief description of the tax system in which the firm operates, and explains the main parameters. In the same Section we describe a Markov-based model of the firm’s evolution through the tax system and pose the main optimization problem we are interested in solving and the computational challenges involved. Our solution approach, combining Q-learning and Deep Neural Networks, is detailed in Section 4. Finally, Section 5 discusses the results we obtained (using the Greek tax system as a case study for the sake of concreteness) and their relevance to the questions posed above regarding the firm’s expected behavior, incentives for reporting profits, degree of risk aversion, and policy implications.
2 Related work
Prior work related to optimal taxation and tax-evasion modeling can be grouped into two main categories: i) analytic (macroeconomic and principal-agent based), and ii) computational (agent-based, simulation-based). The seminal work in the first category was that of Allingham and Sandmo (1972), who introduced a model of optimal taxation posed as a portfolio allocation problem. Several scholars built on that model by also introducing labor supply (Yitzhaki, 1974; Baldry, 1979) and public goods offered (Cowell, 1981). The complexity of the phenomenon was highlighted early on by Clotfelter (1983) and Crane and Nourzad (1986), who challenged the monotonic relationship between tax rates and tax evasion. One of the drawbacks of the analytical approaches was that they often implied less behavioral heterogeneity on behalf of taxpayers than what was suggested by empirical evidence (Andreoni et al., 1998), and, in order to remain tractable, they could not fully capture the dynamics of tax evasion (Martinez-Vazquez and Rider, 2005).
In particular, beyond the issue of accounting for heterogeneity (e.g., in taxpayers’ risk-aversion), there exists much interesting structure in taxpayers’ behavior if one considers “fine-grained” models of their evolution through the tax system. In that setting, one must reckon with the various random transitions the taxpayer may undergo from year to year, such as being audited, or offered the chance to participate in a tax amnesty program (we will provide details on such options shortly), or changing preferences via interaction with others. Such considerations have led to a number of recent computational approaches in the form of automaton-based (Garrido and Mittone, 2012) and agent-based (Gao and Xu, 2009) models. Computational approaches may allow for more realism, by having, for example, a large number of agents interact with each other based on predetermined characteristics related to the taxation parameters and intrinsic utility functions (Pickhardt and Seibold, 2014). Their advantage is that they can offer empirically grounded and theoretically informed policy implications, but they often suffer from limited analytical tractability of the solutions they suggest.
An attempt to overcome these limitations while modeling the year-to-year behavior of the firm was made by Goumagias et al. (2012), who introduced a parametric Markov-based model describing the evolution of a rational firm within the Greek tax system. The firm’s goal was to maximize a discounted sum of its yearly after-tax revenues, possibly by engaging in tax evasion. That work showed that the firm would attempt to evade taxation as much as possible under the system currently in place, and produced “maps” showing which combinations of tax parameters lead firms to behave honestly and which do not. A severe limitation of Goumagias et al. (2012) was the fact that it applied only to the special case of risk-neutral entities. That assumption kept the firm’s state and decision spaces conveniently small (it implied, for example, that the firm’s optimal decision is to either be completely honest or to conceal as much profit as possible, eliminating “intermediate” options), making the problem of optimizing taxpayer behavior solvable via Dynamic Programming (DP) (Bertsekas, 1995). Of course, most taxpayer entities are not likely to be risk neutral; thus, it becomes necessary to incorporate risk-aversion into the analysis in order to be able to predict the behavior of a broad spectrum of taxpayers and explore the effectiveness of tax policies in a more realistic setting.
As we will discuss in Sec. 3.3, risk-aversion introduces nonlinearity in the firm’s objective function, making analytical or DP methods ineffective, and we will require some way of circumventing the curse of dimensionality in that context. Among the various alternatives, iterative dynamic programming can potentially allow for tractable solutions (Jaakkola et al., 1994); however, that method’s applicability is limited when faced with multiple sources of uncertainty, as is the case here. Computational solutions, including artificial intelligence methods and neural networks for cost-to-go function approximation (Tsitsiklis and Van Roy, 1996; Wheeler and Narendra, 1986; Watkins, 1989), will prove to be more promising in our setting. Reinforcement learning-based methods, in particular, approximate the cost-to-go function via simulation and perform function approximation via regression or neural networks (Gosavi, 2004). This approach includes algorithms such as R-learning (Singh, 1994; Tadepalli and Ok, 1996) and Q-learning (Sutton and Barto, 1998; Tsitsiklis, 1994). One advantage of reinforcement learning which will be useful to us is that, unlike DP, the process can be set to update the value of the cost-to-go function for the states that are most often visited (Tsitsiklis, 1994).
Recently proposed deep learning algorithms have greatly broadened the scope of applicability of artificial intelligence and machine learning, beyond “classical” problems of pattern recognition (LeCun et al., 2015), and have shown great promise in approximating complex nonlinear cost-to-go functions (Schmidhuber, 2015). To date, deep learning has been applied to challenging problems in areas including image recognition and processing (Krizhevsky et al., 2012), speech recognition (Mikolov et al., 2011), biology (Leung et al., 2014), analysis of financial trading (Krauss et al., 2017), social networks (Perozzi et al., 2014) and human behavior (Ronao and Cho, 2016). Here, we will make use of recent developments in deep reinforcement learning in order to obtain computational solutions for the firm’s optimal behavior, with all of our model’s complexities. This opens the door to more informed policy decisions by providing a computational platform for comparing tax policies (e.g., those with tax amnesty vs. those without), estimating the firms’ degree of risk aversion from empirical data, predicting the expected tax revenue for the government, or calculating the effects of a change in any tax parameter on revenues.
3 Model description
We proceed with a brief discussion of the tax system within which the firm evolves, to be followed by the corresponding mathematical model. That model will be parametric, with many of the tax “features” commonly encountered, including random audits and penalties. Of course, when it comes time to make computations, we will have to select parameter values (tax rates, etc.) for a specific locale. We will focus on Greece in particular, for the sake of concreteness and because, with tax evasion being a significant and longstanding problem there, one can draw interesting and practical conclusions. However, the basic tax provisions we consider appear in most tax systems, and our model could be adjusted to describe matters in other countries as well.
3.1 A basic taxation system with occasional optional amnesties
The basic components of our taxation system will include, as is the case in most countries, a tax rate on profits, random audits for identifying tax evaders, and monetary penalties for underreporting income. Those penalties, added to the original tax due on any unreported income discovered during an audit, will be proportional to the amount of unreported income and the time elapsed since the offense took place. We will also allow any penalty to be discounted somewhat for prompt payment. The tax authority will audit a small fraction of cases each year but will retain the right to audit a firm’s tax returns for a number of years in the past. Any tax-evasion activity beyond that horizon will be considered to be beyond the statute of limitations.
Our model will also include an optional tax amnesty in which the government may occasionally allow taxpayer entities to pay a fee in exchange for which past tax declarations are closed to any audits. This “closure” fee will be paid separately for each tax year a firm would like to exempt from a possible audit. It is worth noting that the appeal of tax amnesties as revenue collecting mechanisms is typically reinforced during and after long recessions (Ross and Buckwalter, 2013; Bayer et al., 2015). Amnesties are more commonly used than one might expect. For instance, in the US alone there were 104 cases of some form of tax amnesty between 1982 and 2011 (Ross and Buckwalter, 2013). Other examples include India (DasGupta and Mookherjee, 1995) and Russia (Alm and Rath, 1998). In Greece, the closure option mentioned above was being offered roughly every 4–5 years during 1998–2006 (e.g., Hellenic Ministry of Finance (2004) and Hellenic Ministry of Finance (2008)). More recently, it was reintroduced in the Greek parliament with a new round being under consideration (Hellenic Ministry of Finance, 2015). The irregular usage of tax amnesties as tax revenue collection mechanisms increases the complexity of decision making both on behalf of the government and the taxpayer. The use of tax amnesties by firms essentially shrinks the audit pool. Thus, if in some year the government offers the closure option but a firm refuses to use it, that firm is more likely to be audited. For a more detailed explanation of the mechanics of closure, see Goumagias et al. (2012). In practical terms, one question we would like to answer is whether such a measure (although it provides some immediate tax revenue) actually hurts long-term revenues because it might act as a counter-incentive to paying the proper tax (Bayer et al., 2015).
3.2 The behavior of risk-averse firms with optional closure
The work in Goumagias et al. (2012) codified the firm’s time evolution through the tax system described above, in a compact Markov-based model which includes all of the basic features described in Sec. 3.1, including tax rates, penalties, a five-year statute of limitations for audits of past tax statements, and occasional tax amnesty (closure). We will revisit it here briefly, in as compact a form as possible, and extend it for our purposes.
For a tax system with a five-year statute of limitations on auditing past tax statements, the firm’s evolution can be described by the linear state equation (Goumagias et al., 2012)
(1) 
where is given, and , , are as in Appendix A.
The firm’s state at discrete time is given by the triple . Here, is a 15-element set (in the discussion that follows, it will be convenient to use ), containing the firm’s possible tax statuses (see Goumagias et al. (2012) for a graphical explanation): the first five elements of correspond to the firm currently being audited, with 1–5 years since its last audit (any tax declarations “older” than 5 years are beyond the statute of limitations); elements 6–10 correspond to the firm using the closure option with 1–5 years having passed since its last audit or closure; and states 11–15 correspond to the firm being unaudited for 1–5 years (not being currently audited, nor using closure). Of the remaining state elements, is a two-level variable denoting whether the government has made the closure option available at time , and contains the time history of the firm’s past 5 decisions with respect to tax evasion, with elements in ranging from 0 (full disclosure) to 1 (the firm hides as much of its income as possible).
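As a reading aid, the 15-element status numbering described above can be decoded with a few lines of Python. This is only a minimal sketch: the function name `decode_status` and the category labels are hypothetical, and only the grouping of indices 1–5, 6–10 and 11–15 comes from the text.

```python
def decode_status(index):
    """Map a tax-status index (1..15) to a (category, years) pair.

    Indices 1-5: currently audited; 6-10: using closure;
    11-15: neither audited nor using closure. `years` is the 1-5
    count that the text associates with each group.
    """
    if not 1 <= index <= 15:
        raise ValueError("status index must be in 1..15")
    categories = ("audited", "closure", "unaudited")  # hypothetical labels
    group, offset = divmod(index - 1, 5)
    return categories[group], offset + 1
```

For example, `decode_status(8)` yields the closure group with a year count of 3.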
In Eq. 1, is a 2-element vector containing the firm’s actions in year ; the first element, denotes the fraction of profits that the firm decides to conceal, while the second, is a binary decision on whether or not to use the closure option, if it is available. In the term , determines the first element of the “next” state vector, i.e., the firm’s status in the tax system (e.g., being audited or not, or removing itself from this year’s audit pool by making use of the closure option), according to a Markov decision process whose transition probabilities depend on the current state and the firm’s decision to use closure (see Goumagias et al. (2012), also given in Appendix B to facilitate review). The are Bernoulli-like, taking on the value 2 when the government offers the closure option (this is assumed to occur with some probability ), or 1 otherwise.
The firm “weighs” its rewards (profit, plus any taxes it is able to save by declaring less of it) according to a constant relative risk aversion utility function
(2) 
with being the associated risk-aversion coefficient, and being the reward the firm receives when in state and taking an action . Based on the earlier description of the rules of the tax system, is given by
(3) 
where denotes the firm’s annual revenues, is the tax rate, the closure cost (paid if the firm decides to take advantage of that option in the event it is offered), the tax penalty and is the discount factor for prompt payment. In Eq. 3, the top term corresponds to the firm’s reward if it is not audited, so that depending on , it may pay all to none of the tax due. In the middle term, the firm is using the closure option, so that it pays for as many years as it has gone unaudited, up to a maximum of five. Finally, the bottom term in Eq. 3 corresponds to the firm being audited, so that it pays any back taxes due (based on its historical behavior) and the corresponding penalties, as per our earlier description.
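To make the role of the risk-aversion coefficient concrete, the following sketch evaluates a constant relative risk aversion (CRRA) utility. It assumes the standard textbook form, u(r) = r^(1-λ)/(1-λ) with u(r) = log r at λ = 1; the exact normalization used in Eq. 2 may differ, and the names `crra_utility`, `reward` and `risk_aversion` are hypothetical.

```python
import math


def crra_utility(reward, risk_aversion):
    """Textbook CRRA utility (an assumed form, not necessarily Eq. 2 verbatim).

    risk_aversion == 0 gives the risk-neutral case u(r) = r;
    risk_aversion == 1 is the logarithmic limit.
    """
    if reward <= 0:
        raise ValueError("CRRA utility is defined for positive rewards")
    if risk_aversion == 1:
        return math.log(reward)
    return reward ** (1 - risk_aversion) / (1 - risk_aversion)
```

Under this form, a coefficient above 1 makes the utility negative and sends it to minus infinity as the reward approaches zero, which is the numerically delicate regime for training.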
The firm is assumed to act in a self-interested way and thus chooses its policy so as to maximize the discounted expected reward:
(4) 
where denotes the discount factor.
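As an illustration of the objective, the sketch below computes the discounted sum of utilities along one realized reward sequence; averaging this quantity over many simulated trajectories would give a Monte Carlo estimate of the expectation. The function and argument names are hypothetical, and the utility is passed in as a callable rather than fixed to any particular form.

```python
def discounted_utility(rewards, utility, discount):
    """Sum of discount**k * utility(r_k) over one realized trajectory."""
    total = 0.0
    for k, reward in enumerate(rewards):
        total += (discount ** k) * utility(reward)
    return total
```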
3.3 Challenges in solving for the firm’s expected strategy
There is a significant difficulty when it comes to solving Eq. 5 for the optimal firm reward (and the associated tax-evasion policy), stemming from the continuity of certain elements in the state and control vectors. As we have already mentioned, the first element, , of the control vector denotes tax evasion as a fraction of the firm’s annual revenues. This implies that as well as are continuous because the firm’s last five tax-evasion decisions are always incorporated into the state. This makes Eq. 5 difficult to compute.
One may attempt to circumvent this problem by discretizing the variables in question to render both the state and the control vector discrete. For example, we may instead consider , and assume that tax evasion takes place in increments of 1%, which seems like a reasonable level of coarseness. However, after thus discretizing the control and state spaces, the number of state-control pairs, (), remains large. Specifically, we are left with potential pairs (the number of the elements of the state vector including all possible combinations of control for the past five years, times the number of possible controls in ). Such a number of states is too large for DP to be effective in solving the stationary Bellman equation via value iteration, for example, because: i) “visiting” every state in order to update the value function associated with Eq. 5 becomes infeasible and ii) it is difficult to even store the function (the value of applying decision while at state , as a precursor to computing the maximum in the above equation) in tabular form, as one would have to do if Eq. 5 were to be solved via value iteration.
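The scale of the discretized problem can be checked with a short calculation. The sketch below assumes that 1% increments give 101 evasion levels (0%, 1%, ..., 100%), that the state combines 15 tax statuses, a two-level closure-availability flag and a five-year evasion history, and that the control pairs an evasion level with a binary closure decision. This bookkeeping is our assumption and may be organized differently from the count in the text; all names are hypothetical.

```python
def count_state_control_pairs(levels=101, history=5, statuses=15,
                              closure_flags=2, closure_choices=2):
    """Return (number of states, number of controls, number of pairs)."""
    # Each of the `history` past decisions can take any of `levels` values.
    states = statuses * closure_flags * levels ** history
    controls = levels * closure_choices
    return states, controls, states * controls


states, controls, pairs = count_state_control_pairs()
print(f"{states:,} states x {controls} controls = {pairs:,} pairs")
```

Under these assumptions the table of state-control values would have on the order of 10^13 entries, which illustrates why tabular storage and exhaustive sweeps are impractical.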
The work in Goumagias et al. (2012) circumvented these difficulties by assuming risk-neutrality () on behalf of the firm (and thus linearity of the reward function) and successfully applied DP after determining that should only take a “bang-bang” form (conceal as much revenue as possible or none at all), leading to a significant reduction in the number of state-control pairs. In our case, however, the cost-to-go function (Eq. 3) is nonlinear, so that we must consider the full range of control values, and it is thus computationally difficult to apply DP.
One way to go forward is to combine: i) an approximation method to estimate the value function and ii) an approximate way of storing the optimal values of , based on the optimal policy. To address the former we will use reinforcement learning – specifically Q-learning, as described in Sutton and Barto (1998), where will play the role of the Q-function , while for the latter, a deep Artificial Neural Network will be used, as we will discuss shortly.
4 Constructing an approximator: Deep Q-Learning
We experimented with various choices of learning algorithms and neural network architectures for the purposes of learning and storing the optimal value function given in the previous Section. In the following we describe our solution, combining Q-learning and a Deep Neural Network, and discuss some of the difficulties involved and how they can be overcome.
4.1 Q-learning
Q-learning is a model-free reinforcement learning method (Sutton and Barto, 1998) that is used to find an optimal action-selection policy for any given finite MDP. In the “language” of Sutton and Barto (1998), an agent (in our case the firm) observes the current state at each discrete time step , chooses an action according to a possibly stochastic policy , mapping states to actions, observes the reward signal , and transitions to a new state . The objective is to maximize an expectation over the discounted return, as in Eq. 4.
Briefly, Q-learning involves sequentially updating an approximation of the action-value function, i.e., the function that produces the expected utility of taking a given action at a given state and following the optimal policy thereafter. The so-called Q-function of a policy is , where
(6) 
and the state evolution proceeds under the policy . Finally, the optimal action-value function to which the learning process is to converge, obeys the Bellman Eq. 4.
For our purposes, in the notation of Sec. 3, the function we are seeking (Eq. 5) is simply the Q-function, after having maximized over . Common choices for modeling the Q-function are lookup tables and linear approximators, among others. However, these models suffer from poor performance and scalability problems, and cannot possibly handle the high-dimensional state space involved in our case, as we discussed in Sec. 3.3. An efficient alternative to the aforementioned models are neural networks.
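For reference, the lookup-table variant mentioned above can be sketched as a single update step. This is the standard tabular Q-learning rule as given in Sutton and Barto (1998), shown for illustration only; it is not the approximator used in this work, and the names `q_update`, `alpha` and `gamma` are hypothetical.

```python
from collections import defaultdict


def q_update(q, state, action, reward, next_state, actions, alpha, gamma):
    """One tabular Q-learning step.

    Moves Q(state, action) towards reward + gamma * max_a Q(next_state, a)
    by a fraction `alpha` (the learning rate). `q` is a dict-like table.
    """
    best_next = max(q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]


# Unvisited state-action pairs default to a value of zero.
q_table = defaultdict(float)
```

The table needs one entry per state-action pair, which is exactly what becomes infeasible for the state space discussed in Sec. 3.3.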
4.2 Deep Q-Networks (DQN)
Deep Q-learning (DQN) was introduced by Mnih et al. (2015), and uses neural networks parametrized by to represent , where the Q-function is augmented with a parameter vector , usually consisting of the weights and biases of the multiple layers of the network. Neural networks, viewed as general function approximators, are trained “end-to-end”, and can efficiently handle high-dimensionality problems. Recently, a DQN surpassed human performance in 49 different Atari games (Mnih et al., 2015). For our purposes, the DQN will receive as input the firm’s state and will have to produce the optimal decision, . Because the network will be trained to capture the optimal firm policy, we will sometimes refer to it as the “policy network”.
DQNs are trained iteratively using stochastic gradient descent, until convergence. This is done by minimizing, at each iteration , a loss function of the network’s parameters, , which is expressed as
(7)  
(8)  
(9) 
and is an “older” copy of the network’s parameters, as we explain next. Function approximation using neural networks can be unstable, and we observed such behavior in our numerical experiments, particularly after we introduce a second source of uncertainty in the form of closure availability. Following Mnih et al. (2015), to stabilize the process we use a so-called “target network”, i.e., a copy of our original DQN which has the same architecture but a different set of parameters, . The parameters of the target network represent an older version of the policy network and are updated at a slower rate. Thus, while the policy network acts to produce inputs that will steer the firm to its next state, the slowly updated target network is used to compute which, in turn, is used to improve the parameters of the policy network via gradient descent:
(10) 
While training the DQN, we must choose an action to drive the state at each iteration. That action is chosen using an ε-greedy policy that selects the action that maximizes the Q-function with probability 1 − ε, or a random action with probability ε. Additionally, our DQN uses so-called “experience replay” (Lin, 1993). During learning, we maintain a set of episodic experiences (tuples that include the state, the action taken, the resulting state and reward received). The DQN is then trained by sampling mini-batches of those experiences. This has the effect of stabilizing the learning process and avoids overfitting. Experience replay was used very successfully by Mnih et al. (2015) and it is often motivated as a technique for reducing sample correlation, while also enabling re-use of past experiences for learning. Furthermore, it is a valuable tool for improving sample efficiency and can also improve performance by a significant margin, as it did in our case.
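The two ingredients just described, greedy-with-exploration action selection and experience replay, can be sketched in a few lines. This is a minimal illustration under our own assumptions (a fixed-capacity buffer that discards the oldest experiences, uniform sampling); the class and function names are hypothetical and are not taken from the released source code.

```python
import random
from collections import deque


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action index with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)


class ReplayBuffer:
    """Fixed-capacity store of (state, action, reward, next_state) tuples."""

    def __init__(self, capacity):
        # A bounded deque silently drops the oldest experience when full.
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size, rng=random):
        # Uniform sampling without replacement breaks temporal correlation.
        return rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)
```

Sampling mini-batches uniformly from the buffer, rather than training on consecutive transitions, is what reduces the correlation between samples mentioned above.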
A final but important modification was the use of Double Q-learning, a technique introduced very recently by van Hasselt et al. (2016a). Double Q-learning for DQN (DDQN) reduces overestimation of the Q-values by decomposing the max operation in the target network into action selection and action evaluation. Thus, instead of using the target network’s maximum Q-value estimate in Eq. 9, we use the target network’s Q-value of the current network’s best action. The DDQN update equations are the same as for DQN, after replacing the target in Eq. 9 with
(11) 
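The difference between the two targets can be seen in a small numerical sketch. It follows the verbal description above (select the action with the current network, evaluate it with the target network); the function names are hypothetical, and the Q-values are plain lists standing in for network outputs at the next state.

```python
def dqn_target(reward, gamma, target_q_next):
    """Standard DQN target: the target network both selects and evaluates."""
    return reward + gamma * max(target_q_next)


def double_dqn_target(reward, gamma, online_q_next, target_q_next):
    """Double DQN target: the online network selects, the target network evaluates."""
    best_action = max(range(len(online_q_next)), key=online_q_next.__getitem__)
    return reward + gamma * target_q_next[best_action]
```

With online values `[1.0, 2.0]` and target values `[5.0, 3.0]`, the standard target uses the (possibly overestimated) 5.0, whereas the double target uses 3.0, the target network's value for the action the online network prefers.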
The entire Double DQN training loop is given in pseudocode in Algorithm 1 below.
4.3 DQN architecture
Our network architecture was inspired by the model of Mnih et al. (2015). The action space described in Sec. 3 consists of two action elements and . The firm’s tax-evasion level is determined by , discretized in intervals of resulting in a set of actions. This convention is commonly used to take advantage of the off-policy stability of Q-learning compared to on-policy, actor-critic or policy gradient approaches. The firm’s use of the closure option is , and if closure is not available then .
Our approximator (see Fig. 1) is a 4-layer multi-layer perceptron (MLP) and takes as input the current state . The first three layers consist of neurons, followed by two parallel linear layers of and neurons, for computing and , respectively. The network makes use of the rectified linear unit (ReLU) transformation function between layers.
Finally, our setting requires the DQN to produce two action elements . To improve the scalability of our approximator, and after numerical experimentation, we opted to use independent Q-learning to learn two different Q-functions (one for each component of the firm’s decision, and ), as in Narasimhan et al. (2015) and Foerster et al. (2016). In this case, the DQN loss is expressed as
(12)  
(13)  
(14) 
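The shape of such a network, a shared trunk of three ReLU layers feeding two parallel linear output heads, can be sketched in plain Python. All layer sizes, the input dimension and the initialization below are hypothetical placeholders chosen for illustration; they are not the values used in this work, and a practical implementation would use a deep learning library rather than nested lists.

```python
import random


def dense(x, weights, biases):
    """Affine map: one output per (weight row, bias) pair."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def relu(x):
    return [max(0.0, v) for v in x]


def init_layer(n_in, n_out, rng):
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def build_network(n_inputs, n_hidden, n_evasion_actions, n_closure_actions, seed=0):
    """Three hidden layers plus two parallel linear heads (sizes are assumptions)."""
    rng = random.Random(seed)
    sizes = [n_inputs, n_hidden, n_hidden, n_hidden]
    trunk = [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]
    heads = (init_layer(n_hidden, n_evasion_actions, rng),
             init_layer(n_hidden, n_closure_actions, rng))
    return trunk, heads


def forward(network, state):
    """Return one list of Q-values per head: (evasion level, closure decision)."""
    trunk, heads = network
    hidden = state
    for weights, biases in trunk:
        hidden = relu(dense(hidden, weights, biases))
    return tuple(dense(hidden, weights, biases) for weights, biases in heads)
```

Keeping one head per action component means the number of outputs grows with the sum, rather than the product, of the two action-set sizes, which is the scalability benefit of learning the two Q-functions independently.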
5 Evaluating the model: results and discussion
As we have mentioned in the Introduction, we are generally interested in being able to evaluate the firm’s decisions (assuming that it acts in a self-interested way) and maximum expected utility under various degrees of risk-aversion, thereby producing a tool that could be used to predict firm behavior, compute tax revenue, and gauge the reaction of the firm to tax policy scenarios under consideration by the government. We are also interested in characterizing the firm’s strategy by determining, for example, whether the firm would be expected to use a constant degree of tax evasion () in every state (as in Goumagias et al. (2012)), finding the firm’s coefficient of risk-aversion given empirical estimates of the degree of tax evasion, and examining whether it is beneficial for the government to offer the closure option in any of the settings discussed in the Introduction.
5.1 Model parameters and Training setup
The various tax parameters present in our model were selected using Greece as a case study for the sake of concreteness, to facilitate comparisons with prior work (Goumagias et al., 2012), and because that country presents an interesting case as it is plagued by widespread tax evasion (we will discuss estimates in Sec. 5.4.1). Specifically, the tax and audit rates were and , respectively; the statute of limitations for auditing past tax statements was 5 years; the penalty for underreported profit was (24% annually); potential tax penalties were discounted by 40% if paid immediately (); and, finally, the cost for the firm to use the closure option (if available) was .
Training our DQN-based model to optimize the firm’s behavior for any one set of parameters (risk-aversion coefficient, closure probability and cost, audit probability, penalty coefficient) required about 2 days on an Intel^{®} Xeon^{®} X5690 CPU with 72GB of RAM. Our source code is freely available under an open-source license at https://github.com/iassael/taxevasiondqn. The network was trained on episodes of the firm’s evolution, each lasting time steps. The network’s performance was evaluated every episodes as the average discounted reward of those episodes. We followed the training methodology proposed by Mnih et al. (2015), using Double Q-Learning (van Hasselt et al., 2016a). Because , the inputs to the network were “shifted” by subtracting from all elements of the state . Shifting the inputs to be evenly spread around resulted in faster convergence^{1}.
^{1} A simple example where this type of shifting improves learning is the case of one-hot encoded inputs , where both the weights , and biases , of the network can be “learned” even when the original inputs are zero, i.e., , whereas without shifting, only would be learned when .
As usual, the network’s training objective was to minimize the mean squared temporal difference error. Thus, the backpropagated gradients described above were significantly affected by the scale of the rewards. Looking at the form of the risk-averse utility function in Eq. 2, this becomes problematic for input values close to , where dives to . To stabilize the training process numerically, the values returned by were clipped below, so that they always lie in . That is, if the argument of was less than , where , the argument was replaced by . Our empirical evaluation showed that reward clipping was crucial to deal with the steep nonlinear scale of rewards. The particular value of -1 was not critical: more negative values work just as well, as long as they are “far” from the utility values the firm usually operates around, but not so negative as to end up in extremely steep parts of near zero.
Our ε-greedy exploration policy used which linearly decreased to in the first episodes. This resulted in a highly explorative policy in the beginning which rapidly converged to a more exploitative one. The training process took advantage of past experiences, as we explained above (experience replay with mini-batches of size ), and the target network described in Sec. 4.2 was updated every episodes. The networks’ parameters were optimized using Adam (Kingma and Ba, 2014) with a learning rate of .
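Two of the training details above, the linearly decaying exploration rate and the clipping of utilities from below, lend themselves to a short sketch. The schedule endpoints, the decay length and the clipping floor are left as arguments because their values are not restated here; the function names are hypothetical. Clipping the utility value at a floor is equivalent to replacing the argument by the reward at which the utility equals that floor, provided the utility is increasing in the reward.

```python
def linear_epsilon(episode, start, end, decay_episodes):
    """Exploration rate that moves linearly from `start` to `end`, then stays at `end`."""
    if episode >= decay_episodes:
        return end
    return start + (end - start) * episode / decay_episodes


def clip_below(utility_value, floor):
    """Bound a utility value from below to keep temporal-difference errors well scaled."""
    return max(utility_value, floor)
```

For instance, `linear_epsilon(50, 1.0, 0.1, 100)` lies halfway between the two endpoints, and `clip_below(-250.0, -1.0)` returns the floor.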
We proceed by first evaluating our model in the case of a risk-neutral firm, for the purposes of comparison with prior work. Following that, we will discuss the case of a risk-averse firm and will explore its behavior.
5.2 Risk-neutral firms: comparison with known optimum
Before attempting to compute a risk-averse firm’s expected behavior, we validated our approach against the known optimal solution for risk-neutral firms from Goumagias et al. (2012). Tab. 1 shows the firm’s total discounted rewards in four cases which are of interest, according to how often the closure option is offered each year: a) never, b) with probability 0.2, c) always, and d) periodically, every 5 years.
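Table 1: Total discounted rewards of the risk-neutral firm, computed via Dynamic Programming (Goumagias et al., 2012) and via DQN.
Closure option       | Dynamic Programming | DQN
Never                | 3254.6              | 3270.66
With probability 0.2 | 3307.9              | 3316.76
Always               | 3358.3              | 3357.01
5-periodic           | 3319.7              | 3335.75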
Our DQN approach is inherently an approximate one. We note, however, that the firm revenues we computed differ by less than 0.5% from the “true” values computed via DP. Besides the optimal firm revenues, the optimal firm policies were identical to those found in Goumagias et al. (2012) in each of the four cases examined, i.e., it was always optimal for the firm to conceal as much profit as possible and to make use of the closure option whenever available.
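The size of the gap between the two methods can be checked directly from the entries of Tab. 1. The short sketch below does so; it assumes that the unlabeled second row of the table corresponds to the scenario in which closure is offered with probability 0.2, following the order in which the four cases are listed in the text.

```python
# Total discounted rewards from Tab. 1: (dynamic programming, DQN).
table_1 = {
    "never": (3254.6, 3270.66),
    "probability 0.2": (3307.9, 3316.76),  # row label assumed from the text
    "always": (3358.3, 3357.01),
    "5-periodic": (3319.7, 3335.75),
}


def relative_gap(dp_value, dqn_value):
    """Absolute difference between the two values, as a fraction of the DP value."""
    return abs(dqn_value - dp_value) / dp_value


gaps = {name: relative_gap(dp, dqn) for name, (dp, dqn) in table_1.items()}
for name, gap in gaps.items():
    print(f"{name:>16}: {100 * gap:.2f}%")
```

The largest gaps occur in the "never" and "5-periodic" scenarios, and all four remain under the 0.5% figure quoted above.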
5.3 The behavior of risk-averse firms: ranking sample tax policies
We performed a series of runs designed to explore the effect of risk aversion on the behavior of the firm, by keeping the tax parameters fixed to the values mentioned in Sec. 5.1, and varying the firm’s risk aversion coefficient, from 0 to 7 in steps of 1, for each of the four scenarios of interest with respect to the availability of closure (never, 20% of the time, always, every 5 years).
The first notable difference from the risk-neutral case (Goumagias et al., 2012) is that the optimal degree of tax evasion for a risk-averse firm was not constant. That is, in every case, our DQN-based approach converged to a state-dependent (static) policy which achieved a higher average utility than would have been possible using any constant degree of tax evasion (meaning that the same value would be used regardless of which state we were in). See Tab. 2 for a comparison at one particular value of the risk-aversion coefficient (we have chosen this value because it will be of special interest in Sec. 5.4.1; similar results hold for other values).
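Table 2: Maximum discounted utility of the risk-averse firm under the state-dependent DQN policy (average level of tax evasion in parentheses) and under the best constant level of tax evasion (constant level in parentheses).
Closure option       | Max. discounted utility (average tax evasion) | Max. discounted utility with constant tax evasion
Never                | -1.91474 (0.29)                               | -1.98007 (0.21)
With probability 0.2 | -1.87780 (0.40)                               | -1.94671 (0.31)
Always               | -1.40147 (1)                                  | -1.40147 (1)
5-periodic           | -1.86345 (0.43)                               | -1.89893 (0.37)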
In terms of the four tax policies under consideration, we observe from Tab. 2 that, as in the risk-neutral case, the firm obtains a higher maximum discounted utility when the closure option is offered more frequently or more predictably. This implies that, from the point of view of the government, the tax revenue collected is highest when the closure option is never offered at all. We will have more to say about this in Sec. 5.6.
Regarding the use of closure by the firm, we found that, for the tax parameters currently in use, if the closure option is always offered then the firm should always take advantage of it (so that it is never audited). If the option is offered stochastically or every five years, then it is optimal for the firm to use it unless the firm has just been audited (this being a departure from the optimal risk-neutral policy). With respect to the level of tax evasion, the fact that the optimal policy is not constant makes it difficult to characterize it in a “compact” way, especially when closure is offered stochastically or periodically. We will discuss ways of exploring the structure of the optimal policy later in this section.
5.4 The effect of risk aversion on tax evasion
To gain insight into the firm’s behavior we plotted the average level of tax evasion over the course of the firm’s lifetime against the firm’s risk-aversion coefficient. Fig. 2 shows the rate at which the average level of tax evasion declines as the firm becomes more risk-averse, for each of the four scenarios regarding the availability of closure, where each value of the coefficient was evaluated over a batch of episodes of equal length.
Figure 2: Average level of tax evasion against the firm’s risk-aversion coefficient, for each of the four closure scenarios; variability is shown as one standard deviation.
The approximate nature of our approach comes through in the fact that in the case where closure is never offered (Fig. 2, top left), there are times where the average level of tax evasion increases as the firm’s risk aversion increases, although we expect the opposite to occur. There is, however, a clear downward trend in the vast majority of cases, showing that as the firm becomes more risk averse it becomes more “honest” on average. It is also worth mentioning that it is not trivial to obtain high numerical precision with an approximation method such as ours when the utility function is highly nonlinear (i.e., in our case, very steep near zero, where the firm would find itself if it had to pay a penalty at audit time, and relatively “flat” for values of income associated with non-audit states). One possible solution for learning value functions over different reward “scales” is offered in van Hasselt et al. (2016b); however, the implementation is complex, hence we opted for reward clipping as discussed in Sec. 5.1.
5.4.1 Calculating the risk aversion of Greek firms
In Fig. 2 we included data points for one additional value of the risk-aversion coefficient on the horizontal axis. That value is significant because (see Fig. 2, top right) it leads to a 40% average level of tax evasion by the firm. It was identified by numerical experimentation, essentially using bisection on the coefficient to bring the average level of tax evasion to 40%. As we have mentioned before, the 40% level is reported in the literature as the estimated tax evasion level in Greece (Artavanis et al., 2016), and so our approach allows us to estimate the risk aversion coefficient of the average Greek firm (or to re-estimate it for all or a subset of firms, as newer empirical data becomes available).
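The bisection step can be sketched as follows. In the actual procedure each evaluation of the average level of tax evasion requires training and simulating a DQN; here that expensive step is replaced by `toy_avg_evasion`, a hypothetical smooth, decreasing function chosen only so that the example runs. The search interval of 0 to 7 mirrors the range of coefficients explored in Sec. 5.3.

```python
def bisect_risk_aversion(avg_evasion, target, lo, hi, tol=1e-3, max_iter=60):
    """Find a coefficient in [lo, hi] at which avg_evasion is close to `target`.

    Assumes avg_evasion is (approximately) decreasing in the coefficient and
    that avg_evasion(lo) >= target >= avg_evasion(hi).
    """
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if avg_evasion(mid) > target:
            lo = mid  # still evading too much: the firm must be more risk averse
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)


def toy_avg_evasion(lam):
    """Hypothetical stand-in for 'train a DQN and average its evasion level'."""
    return 1.0 / (1.0 + 0.5 * lam)


lam_hat = bisect_risk_aversion(toy_avg_evasion, target=0.40, lo=0.0, hi=7.0)
print(round(lam_hat, 2), round(toy_avg_evasion(lam_hat), 3))
```

Because the learned curve is only approximately monotone (as noted in Sec. 5.4), a bisection of this kind should be read as a guide for numerical experimentation rather than as an exact root finder.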
5.5 Exploring the optimal policy for a representative firm
As we have seen, the firm’s optimal policy is not constant in three of the four closure availability scenarios (the exception is the case where closure is always available, where it is always best to conceal all profit). Because of the complexity of the problem and the large number of states, it is difficult to represent or even visualize the optimal policy in a compact form. We have thus attempted to gain insight by examining the statistics of the firm’s decisions and by using decision trees, as well as various projections of the state-to-decision mapping encoded in the DQN that are of practical relevance because they reveal how the tax evasion level is related to i) the tax status of the firm (i.e., how many years since its last audit or closure), and ii) the amounts that the firm has previously concealed but are still within the statute of limitations in the event of an audit.
Fig. 3 shows the frequency histograms of the firm’s optimal level of tax evasion over 25000 state-decision pair samples (obtained from our trained DQN, over 100 episodes where the firm was allowed to evolve for 250 steps, as previously mentioned). We observe that there is no variability in the case where closure is always available (the firm always uses the closure option and conceals as much profit as possible). In the cases where the option is offered stochastically or periodically there is more significant variability in the optimal level of tax evasion (Fig. 3, top row, and bottom right), although we observe that the set of values used by the DQN is sparse.
To gain insight into just how the values observed in the histograms depend on the firm’s state, we used decision-tree classifiers. Fitting a decision tree to the outputs of the network is a commonly used approach for discovering patterns in the learned policy. We fitted a shallow decision tree (depth = 3) to the same 25000 outputs, with a high threshold for splitting. We kept the tree classifier “naive” in order to be able to gain high-level intuition on the decision policy’s structure.
Fig. 4 illustrates the trees obtained for the cases where closure is never offered (left) and where it is offered stochastically (right). In the tree nodes, binary indicator variables stand for the firm’s tax status in terms of the i-th element of S (see the state space description following Eq. 1); for example, a node may test that the firm’s tax status is not the fifth element of S, so that the firm is not being audited for its last five tax years. Other node variables denote an element of the firm’s tax history vector, i.e., the amount of profit it concealed a given number of years ago, and whether closure is available to the firm or not. Each node also reports the number of samples (out of the 25000 total) to which each case applied.
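The idea of summarizing a learned policy with a shallow tree can be sketched in a few lines of plain Python. The example below is not the classifier used to produce Fig. 4: the two binary features, the three evasion levels and the splitting threshold `min_gain` are hypothetical, chosen only to show how a depth-limited tree with a high splitting threshold recovers a simple rule from state-decision samples.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a non-empty list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels):
    """Return (gain, feature, threshold) of the best binary split, or None."""
    n, parent = len(rows), gini(labels)
    best = None
    for f in range(len(rows[0])):
        # Every value except the largest is a candidate threshold,
        # so both sides of the split are guaranteed to be non-empty.
        for t in sorted({r[f] for r in rows})[:-1]:
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            gain = parent - (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or gain > best[0]:
                best = (gain, f, t)
    return best


def fit_tree(rows, labels, max_depth=3, min_gain=0.05):
    """Fit a shallow classification tree; leaves store the majority label."""
    split = best_split(rows, labels) if max_depth > 0 else None
    if split is None or split[0] < min_gain:
        return Counter(labels).most_common(1)[0][0]
    _, f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return {
        "feature": f,
        "threshold": t,
        "left": fit_tree([r for r, _ in left], [y for _, y in left], max_depth - 1, min_gain),
        "right": fit_tree([r for r, _ in right], [y for _, y in right], max_depth - 1, min_gain),
    }


def predict(tree, row):
    """Follow the splits until a leaf (a plain label) is reached."""
    while isinstance(tree, dict):
        tree = tree["left"] if row[tree["feature"]] <= tree["threshold"] else tree["right"]
    return tree


# Hypothetical samples: features are (just_audited, closure_available).
rows = [(a, c) for a in (0, 1) for c in (0, 1) for _ in range(10)]
labels = [1.0 if a else (0.6 if c else 0.3) for a, c in rows]
tree = fit_tree(rows, labels)
print(tree)
```

In practice one would more likely rely on an off-the-shelf implementation, for example scikit-learn's `DecisionTreeClassifier` with `max_depth=3`; the hand-written version is used here only to keep the sketch free of external dependencies.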
The decision tree in Fig. 4 (left) indicates that if closure is never offered, the firm opts for very high tax evasion (top-right leaf of the tree) only immediately after audits that “cover” the last 5 years. The remainder of the time, the firm almost always conceals 27% of its profit (bottom-left leaf of the decision tree), and any deviations from that value depend mainly on its history of tax evasion (e.g., whether 3 years ago it concealed more or less than 38% of its profit; see the left “child” of the tree’s root node).
When closure is offered stochastically (Fig. 4, right), the firm again uses a high level of tax evasion immediately after (rare) audits; for the majority of its time it uses one of two tax evasion levels, depending on whether closure is or is not available. For the 5-periodic closure scenario, the classifier (not shown) indicated that when the firm is 3 or 4 years away from the next closure, it uses a near-average level of tax evasion. If the closure option is less than 3 years away, and the firm has been recently audited, then its level of tax evasion goes up.
To glean additional information on the structure of the DQN policy, we looked for patterns in the tax evasion decisions based on i) the tax status of the firm (i.e., whether it is being audited, using the closure option, or left to evolve with 1-5 years since its last audit or closure, as detailed in Sec. 3.2), and ii) the cumulative tax evasion “stored” in the firm’s history within the 5-year statute of limitations, this representing a kind of “amount at risk” that the firm would be liable for if it were to be audited.
Fig. 5 shows histograms of the firm’s decisions according to the level of tax evasion and the tax status of the firm (shown as an integer between 1 and 15 representing states in S, as per Sec. 3.2). In the left histogram, where closure is never available, we observe that the firm spends most of its time in tax status 15 (which corresponds to the firm having been unaudited for 5 or more years) and its level of tax evasion is near 0.28 (this matches the decision tree analysis above). Also noteworthy is the fact that the firm consistently uses a very high level of tax evasion when its tax status is 5 (the firm being audited for its last 5 tax filings).
In the right histogram of Fig. 5, the closure option is available with probability 0.2, and if we were to sum over the tax status axis we would obtain the top-right histogram of Fig. 3. The firm generally uses a higher level of tax evasion. The broader spread of the samples over the tax status axis compared with the previous case (closure never available) indicates that the firm uses the option when it can, thereby “erasing” any tax evasion history, and thus finds itself more often in a tax status of 5-10 (corresponding to closure being used for the last 1-5 tax filings of the firm) or 11-15 (the firm being unaudited for 1-5 years in the past).
Besides grouping the firm’s decisions by tax status, we examined how the firm behaves based on the part of its state which contains its past tax evasion decisions (up to five) that are still within the statute of limitations (see Sec. 3.2). Because the level of tax evasion is quantized, and because of the structure of this history as the firm evolves via Eq. 1, it is difficult to visualize the firm’s policy over that entire set. It is instructive, however, to consider the sum of the elements of the history (which is proportional to the total amount the firm has failed to disclose) as a proxy variable for the amount at risk if the firm were to be audited, and to examine how it affects tax evasion by the firm. We expect that a “good” policy would reduce tax evasion when that sum increases, which is precisely what happens. Fig. 6 shows the histograms of the firm’s level of tax evasion and the sum of its past decisions (up to five, or up to the last time it was audited or used the closure option, whichever is smaller). In the left histogram, where closure is never available, we observe that although the firm conceals approximately 30% of its profit most of the time, it sometimes decides to be completely dishonest, with a level of tax evasion of 1, when the amount it is potentially “on the hook” for is small (between 0 and 1.2), but becomes more honest (with a level of 0 or 0.2) when that amount is larger.
In the right histogram of Fig. 6, the closure option is available with probability 0.2, and there are occurrences of very high tax evasion throughout the range of values of that sum. This is explained by the fact that the usage of closure allows the firm to “wipe the slate clean”, so that it is less deterred by the fact that it has accumulated a history of tax evasion. The downward trend present in the bars arises because, by using the closure option whenever possible (thereby zeroing out its history), the firm is more likely to find itself with a lower value of the sum.
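The “amount at risk” proxy used in Fig. 6 is simple to compute from sampled state-decision pairs. The sketch below groups decisions by the sum of the firm's tax evasion history and reports the mean decision in each group; the five samples and the bin width are hypothetical and are not taken from the trained DQN.

```python
from collections import defaultdict


def amount_at_risk(history):
    """Sum of the past evasion levels still within the statute of limitations."""
    return sum(history)


def mean_evasion_by_risk(samples, bin_width=1.0):
    """Average evasion decision per bin of the amount-at-risk proxy.

    `samples` is an iterable of (history, decision) pairs, where `history`
    holds up to five past evasion levels and `decision` is the current one.
    """
    totals = defaultdict(lambda: [0.0, 0])
    for history, decision in samples:
        b = int(amount_at_risk(history) // bin_width)
        totals[b][0] += decision
        totals[b][1] += 1
    return {b: s / n for b, (s, n) in sorted(totals.items())}


# Hypothetical samples, for illustration only.
samples = [
    ((0.0, 0.0, 0.0, 0.0, 0.0), 1.0),
    ((0.3, 0.3, 0.0, 0.0, 0.0), 1.0),
    ((0.3, 0.3, 0.3, 0.3, 0.0), 0.3),
    ((1.0, 0.3, 0.3, 0.3, 0.3), 0.2),
    ((1.0, 1.0, 0.3, 0.3, 0.3), 0.0),
]
result = mean_evasion_by_risk(samples)
print(result)
```

A policy of the kind described for the left histogram of Fig. 6 would show a mean decision that falls as the bin index grows, as it does for these made-up samples.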
5.6 Tax policy implications
With an eye towards making policy recommendations for the “canonical” firm we observe that, based on the results of Sec. 5.3, the more frequently the closure option is offered by the government, the higher the firm’s expected utility (see left column in Tab. 2) and, correspondingly, the lower the amount of tax revenue collected. Thus, it appears that the government should avoid using this type of tax amnesty because it encourages tax evasion, and should instead reinforce auditing mechanisms.
Also, the analysis of the DQN policy in Sec. 5.5 suggests ways in which the tax authority could reallocate auditing resources towards firms which are in states associated with the highest tax evasion. In particular, under the current regime, most auditing resources are devoted to firms which have not been audited for five years and thus have past tax filings which are about to pass beyond the statute of limitations. The histograms and decision tree analysis of the firm’s policy show that tax evasion is high immediately after an audit, suggesting that the audit probabilities should be distributed more “evenly” over the firm’s possible tax statuses, to improve the chance of catching tax evaders that were audited just one year ago.
Finally, Fig. 2 gives guidance for the expected reduction in tax evasion as the firm’s risk aversion increases. Of course, it is not easy to directly influence firms’ attitudes to make them more risk averse. However, the relationship between the average level of tax evasion and the risk-aversion coefficient provides an opportunity for optimizing the allocation of auditing resources among various categories of firms (grouped, for example, by size or sector of economic activity), with fewer audits for the very risk-averse and more for those who are less so, once each group’s risk aversion coefficient is estimated (this can be done empirically by examining tax audits to measure tax evasion within each group, and estimating that group’s coefficient as we did in Sec. 5.4.1).
6 Conclusions
This work is part of a research program whose aim is to provide governments with quantitative tools which can be used to combat tax evasion and guide tax policy. A prerequisite for the design of effective policies is to be able to understand, in quantitative terms, the behavior of tax evaders. Towards that end, we addressed the problem of determining the behavior expected of a self-interested, risk-averse firm which aims to maximize its long-term revenues, in a tax system whose features include tax rates, random audits, penalties for tax evasion and occasional tax amnesties. The practical importance of the problem is significant: solving it allows one to estimate tax revenues, to identify measures and parameter values that make self-interested entities behave more honestly, and to gauge the effectiveness of current or planned tax policies.
The dynamics of the firm’s (stochastic) evolution, combined with the rules of the tax system and the nonlinearity of the firm’s reward function (owing to the fact that the firm is generally risk-averse), give rise to a stochastic optimal decision problem in which the associated Bellman equation is difficult to solve using exact methods. To address that challenge, we made use of recent developments in function approximation and neural networks and constructed a Deep Q-learning Network (DQN) which “learns” the optimal firm policy. The neural network was trained to “store” the firm’s optimal long-term revenues, given a starting state and decision. DQN was used to efficiently “learn” the optimal firm decisions through simulations of the firm’s state evolution.
The DQN approach was first validated by setting our model to the special case of risk neutrality and comparing the results thus obtained (optimal policy and long-term firm revenues) to the exact solution computed via DP (Goumagias et al., 2012). We subsequently demonstrated that we can compute the firm’s optimal policy and corresponding tax revenues for the government in the “full” model which includes both risk aversion (i.e., nonlinearity in the reward function) and the tax amnesty (“closure”) option. We note that, in our particular case, Deep Learning was successful in approximating the firm’s reward function and finding its optimal decisions where other approximation methods failed to converge (we experimented extensively with Approximate Dynamic Programming, various implementations of Q-learning and SARSA algorithms, and neural network architectures which served as function approximators).
One of the contributions made possible by our approach is that it can be used to infer the risk aversion coefficient of a typical taxpayer from empirical data, and thus subsequently evaluate that taxpayer’s reactions under various scenarios of tax amnesty availability, or other parameter changes (e.g., an increase in the audit rates or penalties). Using Greece as a case study, we estimated the risk aversion coefficient of the average firm (Sec. 5.4.1), based on empirical evidence that puts the level of the Greek “hidden” economy at approximately 40% (Artavanis et al., 2016). We also compared tax revenues for a series of policies used there; our results provide evidence against the use of tax amnesties as tax revenue collection tools, even within economies with persistent and endemic tax evasion, as there is a negative relationship between the predictability (or indeed existence) of tax amnesties and tax revenue. Although we have used Greece as a case study here, in part for the sake of concreteness, the proposed approach is adaptable to different taxation schemes and can easily be “tuned” to reflect the values of various tax parameters, such as audit rates, which are known to the government.
Opportunities for further work include the use of the very recent sample-efficient actor-critic algorithm with experience replay (Wang et al., 2017), which could enable stable learning in continuous action spaces (without having to discretize the firm’s decisions); efficient reward scaling, to handle reward values across many orders of magnitude, similarly to van Hasselt et al. (2016b); and the use of Recurrent Q-Learning to possibly reduce some state features, e.g., the firm’s behavior in the past five-year window.
An interesting (and massive) computational study which has now been made possible in light of the present work involves recording the effects of altering the various tax parameters on the behavior of the firm, so that one could compute the “degree of honesty” of the firm as a function of the parameters, in the spirit of the maps given in Goumagias et al. (2012).
Finally, we also envision extensions of this work with learning models that generalize over different values of the tax rate or the risk aversion coefficient (instead of having to be trained separately for particular values), or that also optimize selected model parameters simultaneously with the firm’s decisions. Although some parameters are generally considered exogenous in forming the firms’ risk preferences, optimizing others, especially the tax rate and penalty factor, would be of particular interest for the purposes of maximizing tax revenue.
7 References
 Allingham and Sandmo (1972) Allingham, M. G. and Sandmo, A. (1972). Income tax evasion: a theoretical analysis. J. Public Economics, 1(3-4):323–338.
 Alm and Beck (1990) Alm, J. and Beck, W. (1990). Tax amnesties and tax revenues. Pub. Fin. Rev., 18(4):433–453.
 Alm and Rath (1998) Alm, J. and Rath, D. M. (1998). Tax policy analysis: the introduction of a Russian tax amnesty. Technical report, Georgia State University, Andrew Young School of Policy Studies.
 Andreoni et al. (1998) Andreoni, J., Erard, B., and Feinstein, J. (1998). Tax compliance. J. Econ. Lit., 36(2):818–860.
 Artavanis et al. (2016) Artavanis, N., Morse, A., and Tsoutsoura, M. (2016). Measuring income tax evasion using bank credit: Evidence from Greece. The Quarterly Journal of Economics, 131(2):739–798.
 Baldry (1979) Baldry, J. C. (1979). Tax evasion and labour supply. Economics Letters, 3(1):53–56.
 Bayer et al. (2015) Bayer, R. C., Oberhofer, H., and Winner, H. (2015). The occurrence of tax amnesties. J. Public Economics, 125:70–82.
 Bertsekas (1995) Bertsekas, D. P. (1995). Dynamic programming and optimal control, volume 1. Athena Scientific Belmont, MA.
 Bornstein and Rosenhead (1990) Bornstein, C. T. and Rosenhead, J. (1990). The role of operational research in less developed countries: A critical approach. European Journal of Operational Research, 49(2):156–178.
 Clotfelter (1983) Clotfelter, C. T. (1983). Tax evasion and tax rates: An analysis of individual returns. The Review of Economics and Statistics, pages 363–373.
 Cowell (1981) Cowell, F. A. (1981). Taxation and labour supply with risky activities. Economica, 48(192):365–379.
 Crane and Nourzad (1986) Crane, S. E. and Nourzad, F. (1986). Inflation and tax evasion: An empirical analysis. The Review of Economics and Statistics, pages 217–223.
 DasGupta and Mookherjee (1995) DasGupta, A. and Mookherjee, D. (1995). Tax amnesties in India: an empirical evaluation. Boston University, Institute for Economic Development.
 Fleming et al. (2000) Fleming, M. H., Roman, J., and Farrell, G. (2000). The shadow economy. Journal of International Affairs, 53(2):387–409.
 Foerster et al. (2016) Foerster, J. N., Assael, Y. M., de Freitas, N., and Whiteson, S. (2016). Learning to communicate with deep multi-agent reinforcement learning. In NIPS.
 Gao and Xu (2009) Gao, S. and Xu, D. (2009). Conceptual modeling and development of an intelligent agent-assisted decision support system for anti-money laundering. Expert Systems with Applications, 36(2):1493–1504.
 Garrido and Mittone (2012) Garrido, N. and Mittone, L. (2012). Tax evasion behavior using finite automata: Experiments in Chile and Italy. Expert Systems with Applications, 39(5):5584–5592.
 Gosavi (2004) Gosavi, A. (2004). Reinforcement learning for long-run average cost. European Journal of Operational Research, 155(3):654–674.
 Goumagias et al. (2012) Goumagias, N., Hristu-Varsakelis, D., and Saraidaris, A. (2012). A decision support model for tax revenue collection in Greece. Decision Support Systems, 53(1):76–96.
 Hellenic Ministry of Finance (2004) Hellenic Ministry of Finance (2004). Law N.3259/2004 (POL.1034/2005) (in Greek).
 Hellenic Ministry of Finance (2008) Hellenic Ministry of Finance (2008). Law N.3697/2008 (POL.1130/2008) (in Greek).
 Hellenic Ministry of Finance (2015) Hellenic Ministry of Finance (2015). Article 7, Par. 2,4, Law N.4337/2015 (POL.4337/2015) (in Greek).
 Hokamp and Pickhardt (2010) Hokamp, S. and Pickhardt, M. (2010). Income tax evasion in a society of heterogeneous agents – evidence from an agent-based model. International Economic Journal, 24(4):541–553.
 Jaakkola et al. (1994) Jaakkola, T., Jordan, M. I., and Singh, S. P. (1994). On the convergence of stochastic iterative dynamic programming algorithms. Neural computation, 6(6):1185–1201.
 Kingma and Ba (2014) Kingma, D. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
 Krauss et al. (2017) Krauss, C., Do, X. A., and Huck, N. (2017). Deep neural networks, gradient-boosted trees, random forests: Statistical arbitrage on the S&P 500. European Journal of Operational Research, 259(2):689–702.
 Krizhevsky et al. (2012) Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105.
 LeCun et al. (2015) LeCun, Y., Bengio, Y., and Hinton, G. (2015). Deep learning. Nature, 521(7553):436–444.
 Leung et al. (2014) Leung, M. K., Xiong, H. Y., Lee, L. J., and Frey, B. J. (2014). Deep learning of the tissue-regulated splicing code. Bioinformatics, 30(12):i121–i129.
 Lin (1993) Lin, L. (1993). Reinforcement Learning for Robots Using Neural Networks. PhD thesis, Carnegie Mellon University, Pittsburgh.
 Markellos et al. (2016) Markellos, R. N., Psychoyios, D., and Schneider, F. (2016). Sovereign debt markets in light of the shadow economy. European Journal of Operational Research, 252(1):220–231.
 Martinez-Vazquez and Rider (2005) Martinez-Vazquez, J. and Rider, M. (2005). Multiple modes of tax evasion: theory and evidence. National Tax Journal, pages 51–76.
 Mikolov et al. (2011) Mikolov, T., Deoras, A., Povey, D., Burget, L., and Černockỳ, J. (2011). Strategies for training large scale neural network language models. In Automatic Speech Recognition and Understanding (ASRU), 2011 IEEE Workshop on, pages 196–201. IEEE.
 Mnih et al. (2015) Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540):529–533.
 Narasimhan et al. (2015) Narasimhan, K., Kulkarni, T., and Barzilay, R. (2015). Language understanding for text-based games using deep reinforcement learning. arXiv preprint arXiv:1506.08941.
 Perozzi et al. (2014) Perozzi, B., Al-Rfou, R., and Skiena, S. (2014). Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on knowledge discovery and data mining, pages 701–710. ACM.
 Pickhardt and Seibold (2014) Pickhardt, M. and Seibold, G. (2014). Income tax evasion dynamics: Evidence from an agent-based econophysics model. J. Economic Psychology, 40:147–160.
 Ronao and Cho (2016) Ronao, C. A. and Cho, S.-B. (2016). Human activity recognition with smartphone sensors using deep learning neural networks. Expert Systems with Applications, 59:235–244.
 Ross and Buckwalter (2013) Ross, J. M. and Buckwalter, N. D. (2013). Strategic tax planning for state tax amnesties: evidence from eligibility period restrictions. Public Finance Review, 41(3):275–301.
 Schmidhuber (2015) Schmidhuber, J. (2015). Deep learning in neural networks: An overview. Neural networks, 61:85–117.
 Singh (1994) Singh, S. P. (1994). Reinforcement learning algorithms for average-payoff Markovian decision processes. In AAAI, volume 94, pages 700–705.
 Sutton and Barto (1998) Sutton, R. S. and Barto, A. G. (1998). Reinforcement learning: An introduction, volume 1. MIT press Cambridge.
 Tadepalli and Ok (1996) Tadepalli, P. and Ok, D. (1996). Scaling up average reward reinforcement learning by approximating the domain models and the value function. In ICML, pages 471–479.
 Tsitsiklis (1994) Tsitsiklis, J. N. (1994). Asynchronous stochastic approximation and Q-learning. Machine Learning, 16(3):185–202.
 Tsitsiklis and Van Roy (1996) Tsitsiklis, J. N. and Van Roy, B. (1996). Feature-based methods for large scale dynamic programming. Machine Learning, 22(1-3):59–94.
 van Hasselt et al. (2016a) van Hasselt, H., Guez, A., and Silver, D. (2016a). Deep reinforcement learning with double Q-learning. In AAAI.
 van Hasselt et al. (2016b) van Hasselt, H. P., Guez, A., Hessel, M., Mnih, V., and Silver, D. (2016b). Learning values across many orders of magnitude. In NIPS, pages 4287–4295.
 Wang et al. (2017) Wang, Z., Bapst, V., Heess, N., Mnih, V., Munos, R., Kavukcuoglu, K., and de Freitas, N. (2017). Sample efficient actor-critic with experience replay. In ICLR.
 Watkins (1989) Watkins, C. J. C. H. (1989). Learning from delayed rewards.
 Wheeler and Narendra (1986) Wheeler, R. and Narendra, K. (1986). Decentralized learning in finite Markov chains. IEEE Transactions on Automatic Control, 31(6):519–526.
 Yitzhaki (1974) Yitzhaki, S. (1974). Income tax evasion: A theoretical analysis. J. Pub. Econ., 3(2):201–202.
Appendix A State dynamics
The parameters of the basic state equation 1 are as in Goumagias et al. (2012) but are also given here for the purposes of review:
(15) 
The scalar corresponds to the government determining whether to offer the closure option; this occurs with some fixed probability, , each year so that:
(16) 
The scalar is a random variable corresponding to the transitions that the firm undergoes in S (e.g., tax audits), which depend on its current state and its decision to accept or reject the closure option (if offered):
(17)
where we use integer labels for the states in S. For a fixed closure offer and a fixed closure decision, the transition probabilities form a Markov matrix governing the firm’s transitions in S:
(18)
and the remaining quantities are as in Goumagias et al. (2012) but are also given in Appendix B for the purposes of review.
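The two stochastic ingredients described in this appendix, a yearly Bernoulli draw for the availability of closure and a Markov transition of the firm's tax status that depends on whether closure is used, can be sketched as follows. The three-status chain, its probabilities and all names below are hypothetical placeholders; the paper's tax-status set S has 15 elements, and its transition probabilities are those of Goumagias et al. (2012).

```python
import random


def closure_offered(p_closure, rng):
    """Bernoulli draw: does the government offer closure this year?"""
    return rng.random() < p_closure


def next_status(status, uses_closure, matrices, rng):
    """Sample the next tax status from the row of the relevant Markov matrix."""
    row = matrices[uses_closure][status]
    return rng.choices(range(len(row)), weights=row, k=1)[0]


# Hypothetical 3-status chain: 0 = audited, 1 = closure used, 2 = unaudited.
# One row-stochastic matrix per closure decision; values are placeholders.
MATRICES = {
    False: [[0.1, 0.0, 0.9]] * 3,  # closure not used: audited with probability 0.1
    True: [[0.0, 1.0, 0.0]] * 3,   # closure used: the firm is not audited
}

rng = random.Random(0)
status = 2
trajectory = []
for year in range(5):
    offered = closure_offered(0.2, rng)
    # In this sketch the firm accepts closure whenever it is offered.
    status = next_status(status, offered, MATRICES, rng)
    trajectory.append(status)
print(trajectory)
```

Keeping one matrix per closure decision mirrors the statement above that the transition probabilities form a Markov matrix once the closure offer and the firm's decision are fixed.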