Introduction

The framework for estimating the probability of informed trading (\(\operatorname{PIN}\)) was first established by Easley et al. (1996) (EKOP) and extended by Easley, Hvidkjaer, and O’Hara (2002) (EHO). Both models assume constant arrival rates for buys and sells as well as constant probabilities for the trading days’ conditions. In the EKOP and EHO setting, trading days can reside in three different states: no-news, good-news and bad-news. The probability of informed trading is estimated using daily aggregates of buy and sell orders, where buys and sells are assumed to follow independent latent Poisson point processes. The static models distinguish between uninformed and informed trading intensities. In the EKOP setting uninformed buyers and sellers participate in the market with identical intensities. The EHO setup relaxes this assumption and allows the expected numbers of uninformed buys and sells per day to differ. This model structure means that the EKOP model is nested in the EHO model.

\(\operatorname{PIN}\) is a widely used measure in many empirical applications. Henry (2006) investigates the relationship between short selling and information-based trading. The connection between investor protection, adverse selection and \(\operatorname{PIN}\) is analysed by Brockman and Chung (2008). How the probability of informed trading influences herding is studied in the work by Zhou and Lai (2009). Aslan et al. (2011) employ \(\operatorname{PIN}\) to investigate the linkage of microstructure, accounting, and asset pricing and aim to identify firms with high information risk. Seasonality of \(\operatorname{PIN}\) estimates is examined in the work by Kang (2010). Additionally, various papers link the probability of informed trading to illiquidity measures, e.g. Duarte and Young (2009) and Li et al. (2009), and bid-ask spreads, e.g. Lei and Wu (2005) and Chung and Li (2003).

Due to the widespread usage of the \(\operatorname{PIN}\) measure in the literature, many researchers have analysed its (technical) properties in detail. Recently, several papers were published proposing improvements in the estimation of the model parameters and the probability of informed trading. The original factorizations of the (log) likelihood functions in static \(\operatorname{PIN}\) models are very inefficient in terms of stability and execution time; with them, the probability of informed trading can only be estimated for old trading data or very infrequently traded stocks. Easley, Hvidkjaer, and O’Hara (2010) present a more robust formulation of the likelihood function which reduces the occurrence of over- and underflow errors for moderately traded equities. The most recent likelihood factorization for the \(\operatorname{PIN}\) framework assuming static arrival rates, by Lin and Ke (2011), can even handle daily buys and sells data of very heavily traded stocks and increases the speed and accuracy of function evaluations. In addition, Lin and Ke (2011) showed by simulation that the factorization by Easley, Hvidkjaer, and O’Hara (2010) is biased if used with high numbers of daily buys and sells. Hence, all publications incorporating this formulation of the model’s likelihood function may exhibit biased estimates of the probability of informed trading.

Yan and Zhang (2012), Gan, Chun, and Johnstone (2015) and Ersan and Alıcı (2016) study the generation of appropriate initial values for the optimization routine in static \(\operatorname{PIN}\) models. A brute force grid search technique which delivers several sets of starting values is established by Yan and Zhang (2012). Despite its simplicity this method is very time-consuming. The proposed methodologies by Gan, Chun, and Johnstone (2015) and Ersan and Alıcı (2016) harness hierarchical agglomerative clustering (HAC) to determine initial choices for the model parameters.

The pinbasic package ships utilities for fast and stable estimation of the probability of informed trading in the static \(\operatorname{PIN}\) framework. The function design is chosen to fit the extended EHO model setup but can also be applied to the simpler EKOP model by equating the intensities of uninformed buys and sells. The state-of-the-art factorization of the model likelihood function as well as the most recent algorithms for generating initial values for the optimization routines are implemented. Likelihood functions are evaluated with pin_ll and sets of starting values are returned by initial_vals. The probability of informed trading can be estimated for daily buys and sells data of arbitrary length with pin_est, which is a wrapper around the workhorse function pin_est_core. No information about the time span of the underlying data is required to perform optimizations with pin_est. However, the recommendation given in the literature is to use data covering at least 60 trading days to ensure convergence of the likelihood maximization (e.g. see Easley et al. 1996, 1416). Quarterly estimates are returned by qpin and can be visualized with ggplot. Datasets of daily aggregated numbers of buys and sells can be simulated with simulateBS. Calculation of confidence intervals for the probability of informed trading can be enabled via the confint argument of the optimization routines (pin_est_core, pin_est and qpin) or by calling pin_confint directly. Additionally, posterior probabilities for the conditions of trading days can be computed with posterior and plotted with ggplot.

The remainder of this work is structured as follows:
The second section examines the general framework of models for the probability of informed trading in more detail. Properties of the extended \(\operatorname{PIN}\) model by Easley, Hvidkjaer, and O’Hara (2002) are discussed in the third section. Stable factorizations of the likelihood function and algorithms for generating reliable sets of initial values are presented in the fourth and fifth sections. Some examples of the pinbasic functionalities are given in the last section.

General PIN Framework

In the sequential microstructure models for estimating the probability of informed trading, the exchange of equities takes place over \(d= 1, \dots, D\) pairwise independent trading days. All market activities involve a risk-neutral and competitive market maker, who determines and updates the bid and ask prices using the information he has gathered so far on a trading day. Trading with the market maker is possible at every timestamp \(t\) during regular market hours starting at \(t_{0,m}\) and ending at \(T_m\), i.e. \(t \in \left[t_{0,m},T_m\right]\) with finite \(T_m\). The beginning of official trading may vary depending on the chosen bourse \(m\); e.g. the New York Stock Exchange starts regular trading at 9:30 am, whereas the German electronic marketplace XETRA opens earlier at 9:00 am. Likewise, the upper bound \(T_m\) of the official trading interval may also vary according to the marketplace under consideration. Each trading day can reside in one of three possible states of the set \(Q = \{\mathcal{N}, \mathcal{G}, \mathcal{B}\}\). The elements of the set \(Q\), which represent the conditions of trading days, are no-news (\(\mathcal{N}\)), good-news (\(\mathcal{G}\)) and bad-news (\(\mathcal{B}\)). Trading days on which private information influences the market activities are called information events.

Market participants can be split into two disjoint groups, informed and uninformed traders. Traders holding private information are solely active on information events. In addition, they are assumed to be risk-neutral and competitive. They buy (sell) if positive (negative) signals hit the market, which is the case on good-news (bad-news) trading days. The second group, the uninformed market participants, is active on every trading day for various reasons (diversification, liquidity needs, \(\dots\)).

Distribution of transactions on trading days with respect to the different days’ conditions.

In general, the probability of informed trading \(\operatorname{PIN}\) can be defined as the relation of the expected number of transactions due to private information to the expected total number of trades, \[ \begin{align} \label{eq:pingeneral} \operatorname{PIN}= \dfrac{\text{Expected number of information-based transactions}}{\text{Expected total number of transactions}} \end{align} \]

EHO Model

This section describes the static model for estimating the probability of informed trading developed by Easley, Hvidkjaer, and O’Hara (2002). Arrival rates of buys and sells as well as the probability parameters are assumed to be constant over the whole range spanned by the data. Therefore this model does not allow estimating the probability of informed trading on a daily basis.

The sequence of trading days is assumed to be discrete and independent, whereas time during a trading day is continuous. The conditions of trading days are not observable and are determined by nature before the market opening. Information events, days on which private, price-relevant information enters the market, occur with probability \(\alpha\). Conditional on an information event, the news is positive with probability \(1 - \delta\) and negative with probability \(\delta\). Hence, the probabilities of no-news, good-news and bad-news days are given by: \[ \begin{align} \operatorname{Pr}(\mathcal{N}) &= 1 - \alpha\\ \operatorname{Pr}(\mathcal{G}) &= \alpha( 1 - \delta) \\ \operatorname{Pr}(\mathcal{B}) &= \alpha\delta \end{align} \] Furthermore, buys and sells are supposed to follow latent independent Poisson processes with constant intensities. A Poisson process is a point process which is often defined on the positive real line. According to Daley and Vere-Jones (2003),

we shall understand by a point process some method of randomly allocating points to the real line.

However, in terms of arrivals of buys and sells, we assume the Poisson processes in the static \(\operatorname{PIN}\) models to be defined on the positive half-line.1 Waiting times or interarrival times between two consecutive buys or sells are exponentially distributed.2 On information events, the intensity of the Poisson process for either buys or sells is increased by a positive constant parameter, depending on the direction of the information. The arrivals of transactions, which are indeed observable, can be interpreted as a merging of both latent point processes.3 The observable arrivals of transactions \(N_{O}\) are the outcome of a race between the latent Poisson processes for buys and sells, \(N_{B}\) and \(N_{S}\), to be the first to arrive. The waiting times of the latent Poisson processes determine the direction of the next trade. Assuming that the current waiting time of the buys’ point process is less than the sells’ interarrival time, the observed transaction will be buyer-initiated and \(N_{B}\) increases by 1, as does \(N_{O}\), whereas \(N_{S}\) remains unchanged.4 After a transaction is observed, the waiting times of both latent point processes are reset and the race between the buys and sells processes begins anew.

For each of the Poisson processes of uninformed buys and uninformed sells a unique intensity is assumed. The expected number of uninformed buys equals \(\epsilon_b\), whereas the expected amount of uninformed sells is \(\epsilon_s\). Informed buys and sells appear with rate \(\mu\).5 According to equation \ref{eq:pingeneral} the probability of informed trading in the EHO model can be calculated as \[ \begin{align}\label{eq:pineho} \operatorname{PIN}&= \dfrac{\alpha\delta\mu+ \alpha(1 - \delta) \mu} {(1 - \alpha)(\epsilon_b+ \epsilon_s) + \alpha\delta(\epsilon_b+ \epsilon_s) + \alpha(1 - \delta) (\epsilon_b+ \epsilon_s) + \alpha\delta\mu+ \alpha(1 - \delta) \mu} \notag \\ &= \dfrac{\alpha\mu}{\epsilon_b+ \epsilon_s+ \alpha\mu}, \end{align} \] where \(\operatorname{PIN}\) and the model parameters are constant over the whole range of the underlying data.

On information events the probability for an arrival of buys or sells increases depending on the direction of the private information. Informed buyers (sellers) enter the market if they receive a positive (negative) signal. The intensity of the point process for buys (sells) increases by the positive parameter \(\mu\) whereas the rate for sells (buys) remains on the level for no-news days.

The following scenario tree illustrates the probabilities for the potential states a trading day in the \(\operatorname{PIN}\) framework can reside in. In addition, the mapping of the sets of arrival rates for buys and sells to the different trading days’ conditions can be read directly from the graph.

EHO Scenario Tree

For deriving the (log) likelihood function in the EHO setting we can utilize standard theory for homogeneous Poisson processes. A Poisson process on the (positive) real line is completely defined by the following equation (see Daley and Vere-Jones 2003, 19), \[ \begin{align} \label{eq:poissonprocess} \operatorname{Pr}\left\{N_{(a_i, b_i]} = n_i, i = 1, \dots, k\right\} = \prod\limits_{i=1}^k \dfrac{\left[\lambda\left(b_i - a_i\right)\right]^{n_i}}{n_i!} \exp\left(-\lambda\left(b_i - a_i\right)\right), \end{align} \] where \(N_{(a_i, b_i]}\) represents the number of arrivals lying in the right-bounded, left-open interval \(\left(a_i, b_i\right]\) with \(a_i < b_i \leq a_{i+1}\). The term \(\lambda\left(b_i - a_i\right)\) can be interpreted as the expected number of arrivals happening in \(\left(a_i, b_i\right]\).

It is common practice in the \(\operatorname{PIN}\) literature to specify the intensities of the Poisson processes for buys and sells, \(\epsilon_b\), \(\epsilon_s\) and \(\mu\), as expected arrivals per trading day. Hence, the half-bounded interval \(\left(a_i, b_i\right]\) in equation \ref{eq:poissonprocess} simplifies to an interval of length 1 and we can adopt equation \ref{eq:poissonprocess} for usage in the EHO model for a trading day \(d\), \[ \begin{align} \operatorname{Pr}\left\{N_{d} = n\right\} = \dfrac{\lambda^{n}}{n!} \exp\left(-\lambda\right), \label{eq:poissonprocessekop} \end{align} \] with the buys’ or sells’ Poisson process \(N_{d}\), the number of buys or sells \(n \in \mathbb{N}_0\) on trading day \(d\), and \(\lambda\) denoting the arrival rate of buys or sells, respectively.
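
Equation \ref{eq:poissonprocessekop} is simply the Poisson probability mass function; for moderate counts it can be verified directly in R:

```r
lambda <- 30   # expected number of buys per day
n <- 31        # observed number of buys
exp(-lambda) * lambda^n / factorial(n)  # direct evaluation of the pmf
dpois(n, lambda = lambda)               # built-in equivalent, same value
```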

The different types of trading days must be taken into account. If on trading day \(d\) no private information hits the market, it is free from information-based traders. According to equation \ref{eq:poissonprocessekop} the notation of the separate probabilities of observing \(B_{d}\) buys and \(S_{d}\) sells on a no-news day \(d\) is straightforward. Hence, due to the independence of the latent Poisson processes, the probability of observing a tuple of \(B_{d}\) buys and \(S_{d}\) sells can be written as the product of the single probabilities, \[ \begin{align} \label{eq:probseqno} \underbrace{\exp\left(-\epsilon_b\right)\dfrac{\epsilon_b^{B_{d}}}{B_{d}!}}_{\text{Buys}} \underbrace{\exp\left(-\epsilon_s\right)\dfrac{\epsilon_s^{S_{d}}}{S_{d}!}}_{\text{Sells}}. \end{align} \] On a good-news day \(d\) informed traders are active and buy equities, which leads to an increase in the arrivals of buyer-initiated transactions captured by the parameter \(\mu\), \[ \begin{align} \label{eq:probseqgo} \underbrace{\exp\left(-(\epsilon_b+ \mu) \right)\dfrac{(\epsilon_b+ \mu)^{B_{d}}}{B_{d}!}}_{\text{Buys}} \underbrace{\exp\left(-\epsilon_s\right)\dfrac{\epsilon_s^{S_{d}}}{S_{d}!}}_{\text{Sells}}. \end{align} \] Analogously to good-news days, the intensity of seller-initiated trades increases on a bad-news day \(d\), \[ \begin{align} \label{eq:probseqbad} \underbrace{\exp\left(-\epsilon_b\right)\dfrac{\epsilon_b^{B_{d}}}{B_{d}!}}_{\text{Buys}} \underbrace{\exp\left(-(\epsilon_s+ \mu) \right)\dfrac{(\epsilon_s+ \mu)^{S_{d}}}{S_{d}!}}_{\text{Sells}}. \end{align} \] The likelihood of observing a sequence of \(B_{d}\) buys and \(S_{d}\) sells on a trading day \(d\) can now be formulated, using equations \ref{eq:probseqno} - \ref{eq:probseqbad}, as a weighted sum of these condition-specific probabilities. Thus, the joint density of observing \(B_{d}\) buys and \(S_{d}\) sells on a trading day \(d\) can be formulated as \[ \begin{align} \label{eq:dailylikelihoodEKOP} \mathcal{L}\left( \theta = \left(\alpha, \delta,\epsilon_b, \epsilon_s, \mu\right)\mid (B_{d},S_{d}) \right) &= \left(1-\alpha\right)\left(\exp\left(-\epsilon_b\right)\dfrac{\epsilon_b^{B_{d}}}{B_{d}!} \exp\left(-\epsilon_s\right)\dfrac{\epsilon_s^{S_{d}}}{S_{d}!}\right) \notag \\ & + \alpha\left(1-\delta\right) \left(\exp\left(-(\epsilon_b+ \mu) \right)\dfrac{(\epsilon_b+ \mu)^{B_{d}}}{B_{d}!} \exp\left(-\epsilon_s\right)\dfrac{\epsilon_s^{S_{d}}}{S_{d}!}\right) \notag \\ & + \alpha\delta\left(\exp\left(-\epsilon_b\right)\dfrac{\epsilon_b^{B_{d}}}{B_{d}!} \exp\left(-(\epsilon_s+ \mu) \right)\dfrac{(\epsilon_s+ \mu)^{S_{d}}}{S_{d}!}\right). \end{align} \] Utilizing the independence of trading days, the probability of observing \(\mathcal{M}= \left(B_d, S_d\right)_{d = 1}^{D}\) for \(d= 1, \dots, D\) trading days can be written as the product of the daily likelihoods, \[ \begin{align} \label{eq:likelihoodEKOP} \mathcal{L}\left(\theta\mid \mathcal{M}\right) = \prod\limits_{d=1}^{D} \mathcal{L}\left(\theta\mid (B_d,S_d)\right). 
\end{align} \] Hence, the log likelihood function for a total of \(D\) trading days can be formulated as \[ \begin{align} \label{eq:loglikelihoodEKOP} \log \mathcal{L}\left(\theta\mid \mathcal{M}\right) = \sum\limits_{d=1}^{D} \log \mathcal{L}\left(\theta\mid (B_{d},S_{d})\right), \end{align} \] which yields the concrete notation, \[ \begin{align} \log \mathcal{L}\left(\theta\mid \mathcal{M}\right) = &\sum\limits_{d=1}^{D} \log \Biggl( \left(1-\alpha\right) \left(\exp\left(-\epsilon_b\right)\dfrac{\epsilon_b^{B_{d}}}{B_{d}!} \exp\left(-\epsilon_s\right)\dfrac{\epsilon_s^{S_{d}}}{S_{d}!}\right) \notag \\ & + \alpha\left(1-\delta\right) \left(\exp\left(-(\epsilon_b+ \mu) \right)\dfrac{(\epsilon_b+ \mu)^{B_{d}}}{B_{d}!} \exp\left(-\epsilon_s\right)\dfrac{\epsilon_s^{S_{d}}}{S_{d}!}\right) \notag \\ & + \alpha\delta\left(\exp\left(-\epsilon_b\right)\dfrac{\epsilon_b^{B_{d}}}{B_{d}!} \exp\left(-(\epsilon_s+ \mu) \right)\dfrac{(\epsilon_s+ \mu)^{S_{d}}}{S_{d}!}\right) \Biggr). \label{eq:loglikelihoodEHOconcrete} \end{align} \]

The formulation shown in equation \ref{eq:loglikelihoodEHOconcrete} is very inefficient in terms of computation time and often raises overflow errors.6 In computations of equation \ref{eq:loglikelihoodEHOconcrete}, factorials of daily buys and sells need to be evaluated. Since the numbers of daily buys and sells can easily exceed several hundreds or thousands, calculations quickly become infeasible. Even if the number of daily buys or sells is small enough to yield computable factorial terms, finite values of the likelihood are not ensured. Additionally, the terms \(\epsilon_b^{B_{d}}\), \(\epsilon_s^{S_{d}}\), \((\epsilon_b+ \mu)^{B_{d}}\) and \((\epsilon_s+ \mu)^{S_{d}}\) are potential sources of overflow errors. Furthermore, single terms may be finite but products of them may not. In contrast to the overflow errors induced by large values for daily buys and sells, the exponential terms in the likelihood function, i.e. \(\exp\left(-\epsilon_b\right)\), \(\exp\left(-\epsilon_s\right)\), \(\exp\left(-(\epsilon_b+ \mu) \right)\) and \(\exp\left(-(\epsilon_s+ \mu) \right)\), may introduce underflow errors.7
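
These truncation problems are easy to reproduce in R:

```r
factorial(200)   # Inf: the factorial term overflows
400^500          # Inf: a power term such as epsilon_b^B_d overflows
exp(-800)        # 0: the exponential term underflows
1e200 * 1e200    # Inf: finite factors whose product overflows
```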

Since the EKOP model is nested in the EHO setting for equal intensities of uninformed buys and sells (\(\epsilon_b= \epsilon_s\)), we forgo explaining the simpler model structure.

Likelihood Factorizations

Computation of equation \ref{eq:loglikelihoodEHOconcrete} often fails even for infrequently traded stocks. For historical data originating decades ago, when the numbers of daily buys and sells were very small, these formulations of the likelihood function may still return reasonable results (i.e. finite and non-NaN). However, for any recent data, with daily buys and sells often exceeding 1000 transactions, more stable implementations are essential to achieve finite function values.

The \(\operatorname{PIN}\) literature provides two widely used likelihood factorizations which aim to minimize over- and underflow errors.

EHO Factorization

Easley, Hvidkjaer, and O’Hara (2010) reformulated the likelihood function in the static model with different intensities for uninformed buys and uninformed sells. The authors rearranged the likelihood function and dropped the constant term \(-\log\left(B_{d}!S_{d}!\right)\) so we can maximize the algebraically equivalent but more stable and robust factorization

\[ \begin{align} \log \mathcal{L}\left( \theta\mid \mathcal{M}\right) = &\sum\limits_{d=1}^{D} \Biggl( -\epsilon_b- \epsilon_s+ M_{d} \left(\log x_b + \log x_s\right) + B_{d} \log \left(\mu+ \epsilon_b\right) + S_{d} \log \left(\mu+ \epsilon_s\right) \Biggr) \notag \\ & + \sum\limits_{d=1}^{D} \log \Biggl( \left(1-\alpha\right) x_s^{S_{d} - M_{d}} x_b^{B_{d} - M_{d}} + \alpha\left(1-\delta\right) \exp\left(-\mu\right) x_s^{S_{d} - M_{d}} x_b^{-M_{d}} \notag \\ & + \alpha\delta\exp\left(-\mu\right) x_b^{B_{d} - M_{d}} x_s^{-M_{d}} \Biggr), \label{eq:ehofactr5par} \end{align} \] where \(M_{d} = \min \left(B_{d}, S_{d} \right) + \dfrac{\max \left(B_{d}, S_{d} \right)}{2}\), \(x_s = \dfrac{\epsilon_s}{\epsilon_s+ \mu}\) and \(x_b = \dfrac{\epsilon_b}{\epsilon_b+ \mu}\).

According to Easley, Hvidkjaer, and O’Hara (2010), the computation of the probability of informed trading benefits from the reformulation in two ways: computing efficiency is increased and truncation errors (over- and underflow) are reduced. No evaluation of factorials is needed; additionally, \(x_b\) and \(x_s\) are always weakly smaller than 1, which leads to more stable calculations of the terms involving power operations. However, if the number of buyer- or seller-initiated transactions is very high for a trading day, evaluations of the terms \(x_b^{-M_d}\) and \(x_s^{-M_d}\) can be problematic and may result in infinite values. Diminishing the frequency of over- and underflow errors further is therefore essential for calculating \(\operatorname{PIN}\) for (very) frequently traded stocks.

Lin and Ke (2011) state that the \(\operatorname{PIN}\) computation is downward-biased if the EHO likelihood formulation is used for stocks with a large transaction number. In the same work an accurate likelihood factorization is presented which we will discuss in the next section.

Lin and Ke Factorization

An even more stable and accurate formulation of the likelihood function is presented in the work of Lin and Ke (2011). This factorization is applicable even for heavily traded stocks. Its effectiveness and stability rest on two principles (see Lin and Ke 2011, 629):

  • In computing \(\exp(x)\exp(y)\) (or \(x\exp(y)\)), the expression of \(\exp(x + y)\) (or \(\operatorname{sgn}(x)\exp(\log(|x|) + y)\)) is more stable than that of \(\exp(x)\exp(y)\) (or \(x\exp(y)\)).
  • In the computer arithmetic process, the absolute computing error of a function \(f(x)\) increases with the absolute value of its first-order derivative.

To fortify the usefulness of these two principles the authors give the following example. Say one intends to compute \(\log\left(\exp(x)\exp(y) + \exp(z)\right)\) with \(x = 800\), \(y = -400\) and \(z = 900\). In double-precision arithmetic, the threshold for inputs to the exponential function lies at about 710; any larger value causes the exponential function to overflow and return an infinite result.

At first glance, \(\exp(x)\exp(y)\) would lead to an overflow error because one input to the exponential exceeds the threshold (800 > 710). Taking the first principle into account, we can compute this expression as \(\exp(x + y)\), which gives \(5.2214697 \times 10^{173}\). However, the expression \(\exp(z)\) would still produce an infinite value and therefore the expression \(\log\left(\exp(x + y) + \exp(z)\right)\) is still not computable. The second principle states that one should avoid large input values for the exponential and small positive input values for the logarithmic function. Hence, \(m + \log\left(\exp(x + y - m) + \exp(z - m)\right)\) with \(m = \max(x + y, z) = 900\) is a more stable and accurate expression. Irrespective of the specified values for \(x,y\) and \(z\), one of the terms \(x + y - m\) and \(z - m\) is always zero, whereas the remaining term is always less than or equal to 0. This yields small input values for the exponential functions, which in turn always sum up to a value greater than or equal to 1. Thus, we have no small positive input values for the natural logarithm. Using \(x,y\) and \(z\) as specified before, we can compute the expression \(\log\left(\exp(x)\exp(y) + \exp(z)\right)\), incorporating the R function log1p, as \[ \begin{align} & m + \log\left(\exp(x + y - m) + \exp(z - m)\right) \notag \\ &= 900 + \log\left(\exp(800 - 400 - 900) + \exp(900 - 900)\right) \notag \\ &= 900 + \log\left(\exp(-500) + 1 \right), \notag \\ & \text{with} \quad \log\left(\exp(-500) + 1 \right) \approx \exp(-500) = 7.1245764\times 10^{-218}. \notag \end{align} \] Following the two principles mentioned by Lin and Ke (2011), a stable and efficient formulation of the likelihood is given by \[ \begin{align} \label{eq:linkefactr5par} \log \mathcal{L}\left(\theta\mid \mathcal{M}\right) = &\sum\limits_{d=1}^{D} \Biggl( -\epsilon_b- \epsilon_s+ B_{d} \log \left(\mu+ \epsilon_b\right) + S_{d} \log \left(\mu+ \epsilon_s\right) + e_{\max, d}\Biggr) \notag \\ & + \sum\limits_{d=1}^{D} \log \Biggl( \left(1-\alpha\right) \exp\left(e_{1,d} - e_{\max, d} \right) + \alpha\left(1-\delta\right) \exp\left(e_{2,d} - e_{\max, d} \right) \notag \\ & + \alpha\delta\exp\left(e_{3,d} - e_{\max, d} \right) \Biggr), \end{align} \] where \(e_{1,d} = -B_{d}\log\left(1+\dfrac{\mu}{\epsilon_b}\right)-S_{d}\log\left(1+\dfrac{\mu}{\epsilon_s}\right)\), \(e_{2,d} = -\mu- S_{d}\log\left(1 + \dfrac{\mu}{\epsilon_s} \right)\), \(e_{3,d} = -\mu- B_{d}\log\left(1 + \dfrac{\mu}{\epsilon_b} \right)\) and \(e_{\max, d} = \max\left(e_{1,d}, e_{2,d}, e_{3,d} \right)\). Again, the constant term \(-\log(B!S!)\) is dropped.
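
The numerical example above can be carried out in base R; the following is a minimal sketch of the two principles built around log1p:

```r
x <- 800; y <- -400; z <- 900

log(exp(x) * exp(y) + exp(z))       # naive evaluation: Inf (overflow)

m <- max(x + y, z)                  # factor out the largest exponent
m + log1p(exp(min(x + y, z) - m))   # stable: 900 + log(1 + exp(-500))
```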

With the Lin-Ke formulation we have a stable methodology to estimate \(\operatorname{PIN}\) even for heavily traded stocks. Besides its stability, the Lin-Ke factorization also speeds up the likelihood computation. Since the previously presented factorization by Easley, Hvidkjaer, and O’Hara (2010) returns infeasible function values for very frequently traded stocks and the EKOP model is nested in the EHO model, we strongly recommend using the likelihood formulation by Lin and Ke (2011) in combination with the extended EHO setup in optimization routines.

Package Interface

pin_ll computes likelihood functions in static \(\operatorname{PIN}\) models, either incorporating the factorization by Easley, Hvidkjaer, and O’Hara (2010) or Lin and Ke (2011). The function is designed for the extended EHO model, but can also be applied to the simple model structure for equal values of \(\epsilon_b\) and \(\epsilon_s\).

Trading data is passed via the arguments numbuys and numsells, which take numeric vectors of equal length holding daily buys and sells. Model parameters can be specified via the param argument. The numeric vector passed to param is pre-checked: it is verified that its length equals 5, otherwise an error is thrown and the computation is aborted. Either param has to be a named vector (accepted names are: "alpha", "delta", "epsilon_b", "epsilon_s" and "mu") or its entries need to be sorted properly. If names are not set, or one or more names do not match the valid choices, param is silently renamed with the strings "alpha", "delta", "epsilon_b", "epsilon_s" and "mu" (in this order). For the factorization argument the user can choose between the strings "Lin_Ke" and "EHO".
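
A minimal call on simulated toy data could look as follows; the parameter values are arbitrary and serve purely as an illustration:

```r
library(pinbasic)

# toy data: 60 trading days of daily buys and sells
set.seed(123)
buys  <- rpois(60, lambda = 400)
sells <- rpois(60, lambda = 420)

# named parameter vector using the accepted names listed above
par_ex <- c(alpha = 0.2, delta = 0.5,
            epsilon_b = 350, epsilon_s = 370, mu = 150)

pin_ll(param = par_ex, numbuys = buys, numsells = sells,
       factorization = "Lin_Ke")
```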

Initial Values

The previous sections presented options to evaluate the likelihood functions in the EKOP and EHO models in a stable and efficient way. However, we did not have to worry about initial values for these evaluations because we knew the data-generating process.

There is no single rule or algorithm delivering the best starting values, i.e. values which ensure that the maximization ends in the global maximum on every optimization run. One possibility would be to perform an appropriate number of maximization runs with different sets of random starting values. This procedure can be cumbersome because it is unclear how many runs are really needed to reach the global maximum instead of landing in one of possibly several local maxima. Furthermore, running several hundred or even thousands of optimizations can be very time-consuming.

The next sections discuss three methods which aim to solve the problem of suitable initial values. Earlier, we demonstrated that there is practically no difference in terms of execution time between the simple and extended static model. Hence, for the remainder of this chapter we will concentrate on the EHO setup.

Grid Search Algorithm

Yan and Zhang (2012) present a methodology to generate initial values by grid search technique. We need five starting values for the EHO model. The two probability parameters \(\alpha\) and \(\delta\) can take on any values in the range of 0 to 1. The remaining parameters \(\left(\epsilon_b, \epsilon_s, \mu\right)\) have no upper bound and can be any positive real number.

To determine initial values for the non-probability parameters Yan and Zhang (2012) make use of the marginal distributions of buys and sells.8 The marginal distributions for buys and sells in the EHO model can be written as \[ \begin{align} \label{eq:margdensehobuys} \operatorname{Pr}\left(\# \text{Buys} = B\right) &= \left(1-\alpha\right)\left(\exp\left(-\epsilon_b\right)\dfrac{\epsilon_b^{B}}{B!} \right) \notag \\ & + \alpha\left(1-\delta\right) \left(\exp\left(-(\epsilon_b+ \mu) \right)\dfrac{\left(\epsilon_b+ \mu \right)^{B}}{B!} \right) \notag \\ & + \alpha\delta\left(\exp\left(-\epsilon_b\right)\dfrac{\epsilon_b^{B}}{B!} \right) \end{align} \] and \[ \begin{align} \label{eq:margdensehosells} \operatorname{Pr}\left(\# \text{Sells} = S\right) &= \left(1-\alpha\right)\left(\exp\left(-\epsilon_s\right) \dfrac{\epsilon_s^{S}}{S!} \right) \notag \\ & + \alpha\left(1-\delta\right) \left(\exp\left(-\epsilon_s\right) \dfrac{\epsilon_s^{S}}{S!} \right) \notag \\ & + \alpha\delta\left(\exp\left(-(\epsilon_s+ \mu) \right) \dfrac{\left(\epsilon_s+ \mu\right)^{S}}{S!} \right), \end{align} \] with \(B, S \in \mathbb{N}_0\). Both marginal densities are weighted sums of Poisson probability mass functions. We can utilize the linearity of the expectation operator and write the expected values of the marginal distributions in the EHO model as \[ \begin{align} \label{eq:expectedbuyseho} \mathbb{E}\left(B\right) = \alpha(1 - \delta) \mu+ \epsilon_b\\ \label{eq:expectedsellseho} \mathbb{E}\left(S\right) = \alpha\delta\mu+ \epsilon_s. \end{align} \] These moment conditions for the expected values enable us to set initial values for all five parameters.

The first step is to get initial values for the probabilities \(\alpha\) and \(\delta\). To prevent the initial guesses for these two parameters from lying on the boundaries, we follow Yan and Zhang (2012) and take a sub-interval of \(\left[0, 1\right]\). Starting values for \(\alpha\) and \(\delta\) are limited to a series of equidistant real-valued numbers in the range of \(0.1\) to \(0.9\), always beginning and ending with the minimum and maximum, respectively.9 In the next step, the sample averages \(\overline{B}\) and \(\overline{S}\) of the series of daily buys and sells replace the expectations \(\mathbb{E}(B)\) and \(\mathbb{E}(S)\). Since the term \(\alpha(1 - \delta) \mu\) in equation \ref{eq:expectedbuyseho} is always positive, \(\epsilon_b\) needs to be smaller than \(\overline{B}\). The average daily number of buys \(\overline{B}\) is then multiplied by the same equally spaced series of values chosen as initial values for \(\alpha\) and \(\delta\) to generate starting values for the intensity of uninformed buys \(\epsilon_b\). The last step is to obtain initial values for the intensity of sells initiated by noise traders, \(\epsilon_s\), and the intensity of transactions fulfilled by informed traders, \(\mu\). To this end, equations \ref{eq:expectedbuyseho} and \ref{eq:expectedsellseho} are solved simultaneously.

A set of initial values for the EHO model, \(\theta^0= \left(\alpha^0, \delta^0, \epsilon_b^0, \epsilon_s^0, \mu^0 \right)\), can then be calculated as, \[ \begin{align} \alpha^0 &= \alpha_i, \notag \\ \delta^0 &= \delta_j, \notag \\ \epsilon_b^0 &= \gamma_k \overline{B}, \notag \\ \mu^0 &= \dfrac{\overline{B} - \epsilon_b^0}{\alpha^0 \left(1- \delta^0\right)}, \notag \\ \epsilon_s^0 &= \overline{S} - \alpha^0 \delta^0 \mu^0, \notag \end{align} \] where each of the three parameters \(\alpha_i\), \(\delta_j\) and \(\gamma_k\) takes on equally spaced values between 0.1 and 0.9, one at a time. Yan and Zhang (2012) choose the length of the series of initial values for \(\alpha_i\), \(\delta_j\) and \(\gamma_k\) to be five. Hence, the starting values for each of the three parameters are 0.1, 0.3, 0.5, 0.7 and 0.9. This results in a total of \(5^3 = 125\) potential sets of initial values.

However, not all combinations are feasible due to negative values for the intensity of uninformed sells \(\epsilon_s^0\). In addition, Ersan and Alıcı (2016) recommend excluding sets of starting values with irrelevant values for \(\mu^0\), which is the case if \(\mu^0\) exceeds the maximum number of daily buys or sells (\(\mu^0 > \max(B_{d}, S_{d}), \:\: d = 1, \dots, D\)).
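
The whole procedure, including the feasibility filters just described, can be condensed into a few lines of R. The following is a minimal sketch, not the package's internal implementation:

```r
# Sketch of the Yan-Zhang grid search for EHO starting values.
grid_initials <- function(numbuys, numsells, grid_length = 5) {
  grid  <- seq(0.1, 0.9, length.out = grid_length)
  b_bar <- mean(numbuys)
  s_bar <- mean(numsells)

  res <- expand.grid(alpha = grid, delta = grid, gamma = grid)
  res$epsilon_b <- res$gamma * b_bar
  res$mu        <- (b_bar - res$epsilon_b) / (res$alpha * (1 - res$delta))
  res$epsilon_s <- s_bar - res$alpha * res$delta * res$mu

  # drop infeasible sets: negative epsilon_s or irrelevant mu (Ersan/Alici)
  keep <- res$epsilon_s >= 0 & res$mu <= max(numbuys, numsells)
  res[keep, c("alpha", "delta", "epsilon_b", "epsilon_s", "mu")]
}
```

With the default grid_length = 5 this generates the \(5^3 = 125\) candidate sets before filtering.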

HAC Algorithm

Another methodology, which utilizes hierarchical agglomerative clustering (HAC) to generate starting values, is proposed by Gan, Chun, and Johnstone (2015). The daily order imbalance \(\operatorname{OI}_d, d= 1, \dots, D\), serves as the criterion to assign the trading days to three clusters representing no-news, good-news and bad-news trading days. HAC is a bottom-up clustering technique in which, at the beginning of the algorithm, each order imbalance \(\operatorname{OI}_d\) forms a cluster of its own; e.g. if trading data for one quarter of a year is used to estimate the probability of informed trading, roughly 60 clusters exist when the algorithm is initialized.

Gan, Chun, and Johnstone (2015) use complete-linkage clustering to sequentially merge the small clusters into bigger ones. The two clusters with the shortest distance are combined in each step. The definition of shortest distance is what distinguishes the available agglomerative clustering methods.10 In complete-linkage clustering, or farthest-neighbour clustering, the distance between two clusters is calculated as the distance between those two of their elements that are farthest away from each other. In each step, the two clusters with the minimal computed distance are merged.

To be precise, in complete-linkage clustering, the distance \(\operatorname{D}(X, Y)\) between two clusters \(X\) and \(Y\) can be written as \[ \begin{align} \operatorname{D}(X,Y) = \underset{x \in X, y \in Y}{\max} d(x, y), \end{align} \] where \(d(x, y)\) is the distance between the cluster elements \(x \in X\) and \(y \in Y\). Gan, Chun, and Johnstone (2015) use the Euclidean norm as the measure for \(d(x,y)\).

The following is a step-by-step instruction on how to use the clustering algorithm to generate initial values for the parameters in the EHO model (see Gan, Chun, and Johnstone 2015, 1809); a compact R sketch follows the list.

  1. Calculate a series of daily order imbalances, \(\operatorname{OI}_d= B_{d} - S_{d}\) with \(d= 1, \dots, D\), and use the daily order imbalances, buys and sells as inputs for the following steps.
  2. Perform HAC on the daily order imbalances using the complete-linkage clustering.11 Stop the algorithm when there are three clusters left.
  3. The means of the clusters serve as a criterion for assigning them to no-news, good-news and bad-news. The cluster with the highest mean is assumed to consist of trading days with positive private information. Likewise, the cluster with the lowest mean gathers trading days with negative private information. The remaining cluster is then defined as the no-news cluster.
  4. Compute the average daily buys \(\bar{B}_c\) and sells \(\bar{S}_c\) for \(c \in \{\mathcal{N}, \mathcal{G}, \mathcal{B}\}\). Then, assign each cluster a weight \(w_c\) which is calculated as the proportion this cluster occupies of the total number of trading days \(D\). Hence, the cluster weights sum up to 1.
  5. With the help of the classification of trading days and the cluster weights from the third and fourth steps, we are able to compute initial values for the intensities of uninformed buys and sells as weighted sums of the average buys \(\bar{B}_c\) and sells \(\bar{S}_c\), respectively. \[ \begin{align} \epsilon_b^0 &= \dfrac{w_{\mathcal{B}}}{w_{\mathcal{B}} + w_{\mathcal{N}}} \bar{B}_{\mathcal{B}} + \dfrac{w_{\mathcal{N}}}{w_{\mathcal{B}} + w_{\mathcal{N}}} \bar{B}_{\mathcal{N}} \notag \\ \epsilon_s^0 &= \dfrac{w_{\mathcal{G}}}{w_{\mathcal{G}} + w_{\mathcal{N}}} \bar{S}_{\mathcal{G}} + \dfrac{w_{\mathcal{N}}}{w_{\mathcal{G}} + w_{\mathcal{N}}} \bar{S}_{\mathcal{N}} \notag \end{align} \]
  6. The intensity of informed trading is then calculated as the weighted sum of the intensities of informed buys \(\mu_b^0\) and sells \(\mu_s^0\).1213 \[ \begin{align} \mu_b^0 &= \bar{B}_{\mathcal{G}} - \epsilon_b^0 \notag \\ \mu_s^0 &= \bar{S}_{\mathcal{B}} - \epsilon_s^0 \notag \\ \mu^0 &= \dfrac{w_{\mathcal{G}}}{w_{\mathcal{G}} + w_{\mathcal{B}}} \mu_b^0 + \dfrac{w_{\mathcal{B}}}{w_{\mathcal{G}} + w_{\mathcal{B}}} \mu_s^0 \notag \end{align} \]
  7. Cluster sizes are utilized to compute starting values for the probability of an information event \(\alpha\) and the probability of bad news given that private information enter the market \(\delta\). \[ \begin{align} \alpha^0 &= w_{\mathcal{G}} + w_{\mathcal{B}} \notag \notag \\ \delta^0 &= \dfrac{w_{\mathcal{B}}}{\alpha^0} \notag \end{align} \]
  8. An initial estimate of the probability of informed trading is given by \[ \begin{align} \operatorname{PIN}= \dfrac{\alpha^0 \mu^0}{% \epsilon_b^0 + \epsilon_s^0 + \alpha^0 \mu^0} \notag \end{align} \] In contrast to the brute force grid search method discussed in the previous section, the HAC algorithm returns only a single vector of initial values. Furthermore, no computation time is spent in generating infeasible sets of values which are then immediately discarded.
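
A compact base-R sketch of these eight steps, using hclust as in footnote 11, might look as follows; it is an illustration, not the package's internal (unexported) implementation:

```r
hac_initials <- function(numbuys, numsells) {
  oi <- numbuys - numsells                                   # step 1
  cl <- cutree(hclust(dist(oi), method = "complete"), k = 3) # step 2
  # step 3: lowest cluster mean = bad news, highest = good news
  state <- c("B", "N", "G")[rank(tapply(oi, cl, mean))][cl]
  # step 4: cluster weights and average buys/sells per cluster
  w  <- prop.table(table(factor(state, levels = c("N", "G", "B"))))
  Bc <- tapply(numbuys,  state, mean)
  Sc <- tapply(numsells, state, mean)
  # step 5: uninformed intensities
  eps_b <- unname((w["B"] * Bc["B"] + w["N"] * Bc["N"]) / (w["B"] + w["N"]))
  eps_s <- unname((w["G"] * Sc["G"] + w["N"] * Sc["N"]) / (w["G"] + w["N"]))
  # step 6: informed intensity (negative values truncated, see footnote 13)
  mu_b <- max(Bc[["G"]] - eps_b, 0)
  mu_s <- max(Sc[["B"]] - eps_s, 0)
  mu   <- unname((w["G"] * mu_b + w["B"] * mu_s) / (w["G"] + w["B"]))
  # step 7: probability parameters
  alpha <- unname(w["G"] + w["B"])
  delta <- unname(w["B"]) / alpha
  c(alpha = alpha, delta = delta, epsilon_b = eps_b,
    epsilon_s = eps_s, mu = mu)
}
```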

Refined HAC Algorithm

A third option for generating initial values for the maximization of the likelihood function in the EHO model is presented by Ersan and Alıcı (2016). The authors claim that their method in combination with the Lin-Ke likelihood factorization yields unbiased estimates of the probability of informed trading and the five model parameters.14

Similar to the HAC algorithm, hierarchical agglomerative clustering is utilized for generating starting values. However, instead of stopping the clustering once three clusters are left, the number of groups is not predetermined per se. The refined HAC algorithm is stopped when \(j + 1\) clusters are left, where \(j\) can be any positive integer, limited only by the available number of trading days (\(j \leq D- 1\)).15

In contrast to the HAC algorithm by Gan, Chun, and Johnstone (2015), the daily absolute order imbalance is used to assign trading days to one of the \(j + 1\) groups. The clusters are then ordered by their average absolute order imbalance and distributed to a no-event and an event cluster. To obtain multiple vectors of starting values instead of a single one, the two types of groups are built as follows, \[ \begin{align} CL_i^{\mathcal{N}} &= \bigcup\limits_{k = 1}^{i} CL_k, \notag \\ CL_i^{\mathcal{E}} &= \bigcup\limits_{k = i + 1}^{j + 1} CL_k, \notag \end{align} \] where \(i = 1, \dots, j\), \(CL_i^{\mathcal{N}}\) represents the no-event cluster and \(CL_i^{\mathcal{E}}\) the event cluster. At this point we are able to obtain \(\alpha^0\) and \(\mu^0\). The initial guess for the probability of an information event is calculated, very similarly to the HAC algorithm by Gan, Chun, and Johnstone (2015), as the proportion the event cluster \(CL_i^{\mathcal{E}}\) occupies of the total number of trading days \(D\). The initial intensity of informed trading equals the difference between the average absolute order imbalances of the event and the no-event group.16

In the next step, \(CL_i^{\mathcal{E}}\) is split into a good-news and a bad-news group according to the signs of the average order imbalances of its members. The remaining starting values for the two intensities \(\epsilon_b\) and \(\epsilon_s\) and the probability \(\delta\) are calculated according to the HAC algorithm.

Hence, we get a total of \(j\) vectors of initial values. Maximum likelihood estimation (MLE) is performed for each of the \(j\) vectors and the best result among all maximization runs is kept while the rest are discarded. According to the results in the work of Ersan and Alıcı (2016), a value of \(j = 5\) is a good compromise between accuracy and speed.
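
The following R sketch illustrates the refined algorithm under a simplifying assumption: event days are split by the sign of their own order imbalance, whereas Ersan and Alıcı (2016) split clusters by the sign of their average order imbalance; edge cases such as empty good-news or bad-news groups are ignored. It is an illustration, not the package's internal implementation:

```r
hac_ref_initials <- function(numbuys, numsells, j = 5) {
  oi  <- numbuys - numsells
  aoi <- abs(oi)
  cl  <- cutree(hclust(dist(aoi), method = "complete"), k = j + 1)
  # rank of each day's cluster by average absolute order imbalance
  rk  <- match(cl, order(tapply(aoi, cl, mean)))

  out <- matrix(NA_real_, nrow = j, ncol = 5,
                dimnames = list(NULL, c("alpha", "delta", "epsilon_b",
                                        "epsilon_s", "mu")))
  for (i in seq_len(j)) {
    event <- rk > i                     # clusters i+1, ..., j+1: event days
    alpha <- mean(event)
    mu    <- mean(aoi[event]) - mean(aoi[!event])
    good  <- event & oi > 0
    bad   <- event & oi <= 0
    delta <- sum(bad) / sum(event)
    wN <- mean(!event); wG <- mean(good); wB <- mean(bad)
    eps_b <- (wB * mean(numbuys[bad]) + wN * mean(numbuys[!event])) /
             (wB + wN)
    eps_s <- (wG * mean(numsells[good]) + wN * mean(numsells[!event])) /
             (wG + wN)
    out[i, ] <- c(alpha, delta, eps_b, eps_s, mu)
  }
  out
}
```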

Package Interface

initial_vals generates set(s) of starting values which can be used in optimization routines for estimating the probability of informed trading. It is a wrapper around the specialized functions init_grid_search, init_hac and init_hac_ref.17 The arguments numbuys and numsells take vectors of daily buys and sells. The algorithm for calculating initial values can be specified via the method argument, by which the user can choose one of the previously discussed methods. The brute force grid search algorithm is selected via "Grid"; for the HAC or refined HAC algorithm, method needs to equal "HAC" or "HAC_Ref", respectively.

In addition, there are the method-specific arguments length, num_clust and details. The length argument is relevant only for grid search, where it determines the number of equally spaced grid points into which the interval \(\left[0.1, 0.9\right]\) is divided. This influences the number of possible initial values for the probability parameters \(\alpha\) and \(\delta\) as well as for \(\gamma\). If details is set to TRUE and method = "Grid", a list is returned whose elements are a matrix with sets of starting values, the number of sets removed due to negative values for the intensity of uninformed sells, and the number of guesses for the intensity of informed trading larger than the highest observed number of buys or sells in the data. Otherwise, solely a matrix of initial values is returned. The argument num_clust determines the number of clusters the trading data is grouped into if method = "HAC_Ref".
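
Exemplary calls for the three methods on simulated toy data could look as follows:

```r
library(pinbasic)
set.seed(123)
buys  <- rpois(60, lambda = 400)
sells <- rpois(60, lambda = 420)

# brute force grid search: matrix of feasible sets of starting values
initial_vals(numbuys = buys, numsells = sells, method = "Grid")

# HAC: a single vector of starting values
initial_vals(numbuys = buys, numsells = sells, method = "HAC")

# refined HAC; num_clust controls the number of clusters
initial_vals(numbuys = buys, numsells = sells,
             method = "HAC_Ref", num_clust = 5)
```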

Posterior Probabilities for Trading Days’ States

Due to constant model parameters, the static EHO model does not allow computing \(\operatorname{PIN}\) on a daily basis. Nevertheless, we can harness Bayes’ theorem and construct formulas for posterior probabilities of the trading days’ conditions (e.g. see Lee (2012)).

Incorporating the independence of the buys’ and sells’ Poisson processes, we can write the probability that a trading day \(d\) resides in the no-news state, given that we have observed \(B_{d}\) buys and \(S_{d}\) sells, as

\[ \begin{align} \operatorname{Pr}\left(\mathcal{N}\mid (B_{d}, S_{d}) \right) &= \dfrac{\operatorname{Pr}\left( B_{d} \mid \mathcal{N}\right) \operatorname{Pr}\left( S_{d} \mid \mathcal{N}\right) \operatorname{Pr}\left(\mathcal{N}\right)} {\operatorname{Pr}\left(B_{d}, S_{d}\right)} \notag \\ &= \dfrac{1 - \alpha}{(1 - \alpha) + \exp(-\mu) \left[\alpha(1 - \delta) \left(1 + \dfrac{\mu}{\epsilon_b}\right)^{B_{d}} + \alpha\delta\left(1 + \dfrac{\mu}{\epsilon_s}\right)^{S_{d}}\right]}. \label{eq:postno} \end{align} \]

Likewise, posterior probabilities for a good-news and bad-news trading day are given by

\[ \begin{align} \operatorname{Pr}\left(\mathcal{G}\mid (B_{d}, S_{d}) \right) &= \dfrac{\operatorname{Pr}\left( B_{d} \mid \mathcal{G}\right) \operatorname{Pr}\left( S_{d} \mid \mathcal{G}\right) \operatorname{Pr}\left(\mathcal{G}\right)} {\operatorname{Pr}\left(B_{d}, S_{d}\right)} \notag \\ &= \dfrac{\alpha(1 - \delta) \exp(-\mu) \left(1 + \dfrac{\mu}{\epsilon_b}\right)^{B_{d}}}{(1 - \alpha) + \exp(-\mu) \left[\alpha(1 - \delta) \left(1 + \dfrac{\mu}{\epsilon_b}\right)^{B_{d}} + \alpha\delta\left(1 + \dfrac{\mu}{\epsilon_s}\right)^{S_{d}}\right]} \label{eq:postgood} \end{align} \]

and

\[ \begin{align} \operatorname{Pr}\left(\mathcal{B}\mid (B_{d}, S_{d}) \right) &= \dfrac{\operatorname{Pr}\left( B_{d} \mid \mathcal{B}\right) \operatorname{Pr}\left( S_{d} \mid \mathcal{B}\right) \operatorname{Pr}\left(\mathcal{B}\right)} {\operatorname{Pr}\left(B_{d}, S_{d}\right)} \notag \\ &= \dfrac{\alpha\delta\exp(-\mu) \left(1 + \dfrac{\mu}{\epsilon_s}\right)^{S_{d}}}{(1 - \alpha) + \exp(-\mu) \left[\alpha(1 - \delta) \left(1 + \dfrac{\mu}{\epsilon_b}\right)^{B_{d}} + \alpha\delta\left(1 + \dfrac{\mu}{\epsilon_s}\right)^{S_{d}}\right]}. \label{eq:postbad} \end{align} \]

To the best of our knowledge, Bayes’ posterior probabilities for the conditions of trading days have not been used before in the field of the probability of informed trading. While we can calculate the number of trading days the static models predict to reside in each of the three possible states, posteriors allow us to assign each trading day probabilities for the no-news, good-news and bad-news conditions.

For example, assuming low values for the probability parameters \(\alpha\) and \(\delta\), we can interpret this as few trading days in the sample period on which insiders, triggered by positive private information, enter the market. However, we are not able to relate information events to specific trading days in the data source. Utilizing equations \ref{eq:postno} - \ref{eq:postbad}, we can identify good-news days according to the magnitude of \(\operatorname{Pr}\left(\mathcal{G}\mid (B_{d}, S_{d}) \right)\) for each trading day. Hence, posterior probabilities deliver useful additional insights into the classification of trading days and help to improve analyses of the EHO model.
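
A direct, numerically naive R implementation of equations \ref{eq:postno} - \ref{eq:postbad} can serve as a cross-check; note that the power terms may overflow for very heavily traded stocks:

```r
post_states <- function(param, numbuys, numsells) {
  a  <- param[["alpha"]];     d  <- param[["delta"]]
  eb <- param[["epsilon_b"]]; es <- param[["epsilon_s"]]
  mu <- param[["mu"]]
  good <- a * (1 - d) * exp(-mu) * (1 + mu / eb)^numbuys
  bad  <- a * d       * exp(-mu) * (1 + mu / es)^numsells
  denom <- (1 - a) + good + bad
  cbind(no_news = (1 - a) / denom, good_news = good / denom,
        bad_news = bad / denom)
}
```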

Examples

The pinbasic package is equipped with four synthetic datasets of daily buys and sells: BSinfrequent, BSfrequent, BSfrequent2015 and BSheavy. BSinfrequent represents an infrequently traded equity, BSfrequent and BSfrequent2015 frequently traded equities, and BSheavy a heavily traded one. The datasets BSinfrequent, BSfrequent and BSheavy cover 60 trading days, whereas BSfrequent2015 contains simulated daily buys and sells for the business days in 2015. The datasets can be loaded with the data function.

The probability of informed trading \(\operatorname{PIN}\) can be estimated with the pin_est function. As an example, \(\operatorname{PIN}\) is estimated for the BSheavy dataset.
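
A call could look as follows; the column names "Buys" and "Sells" of the bundled datasets are assumed here:

```r
library(pinbasic)
data("BSheavy")

heavy_est <- pin_est(numbuys  = BSheavy[, "Buys"],
                     numsells = BSheavy[, "Sells"])
heavy_est
```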

If model parameter estimates, and therefore estimates of \(\operatorname{PIN}\), are required on a quarterly basis, the qpin function, which takes care of automatically splitting the dataset, is an appropriate choice.
The BSfrequent2015 dataset covers four quarters and a total of 261 trading days. The dates of the trading days are stored in its rownames, so they can be passed to the dates argument of qpin.
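
Under the same column name assumption:

```r
data("BSfrequent2015")

qpin_freq <- qpin(numbuys  = BSfrequent2015[, "Buys"],
                  numsells = BSfrequent2015[, "Sells"],
                  dates    = as.Date(rownames(BSfrequent2015)))
qpin_freq
```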

Results returned by qpin can be visualized with ggplot.
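
Assuming, as stated above, that objects returned by qpin can be passed to ggplot:

```r
library(ggplot2)
ggplot(qpin_freq)
```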

Datasets of daily buys and sells can be simulated with the simulateBS function, which offers three arguments. Values of the model parameters can be set via the param argument; to ensure reproducibility, seed should be specified; and ndays determines the number of trading days which are simulated. The probability parameters \(\alpha\) and \(\delta\) are used to sample the trading days’ conditions. Once the sequence of states is computed, the numbers of buys and sells for each trading day are drawn from Poisson distributions with intensities according to the scenario tree presented in the EHO model section.

We use the estimated parameters stored in heavy_est to simulate data for 100 trading days.
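
In the following sketch, heavy_par is assumed to be the named vector of the five parameter estimates (alpha, delta, epsilon_b, epsilon_s, mu) extracted from heavy_est:

```r
# 'heavy_par' is a hypothetical name for the extracted parameter vector
sim_heavy <- simulateBS(param = heavy_par, seed = 42, ndays = 100)
head(sim_heavy)
```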

Computation of confidence intervals for \(\operatorname{PIN}\) can either be enabled via the confint argument of the optimization routines (confint = TRUE in pin_est_core, pin_est and qpin) or performed directly with pin_confint, which incorporates simulateBS to simulate n datasets with the given parameter vector param. MLE is performed for each simulated dataset to receive a total of n \(\operatorname{PIN}\) estimates. Quantiles of this series, induced by the level argument, are calculated with the quantile function from the stats package.

We use the simulated sim_heavy data together with the corresponding parameter estimates to calculate a confidence interval for the probability of informed trading. In addition, we compare the execution times of single-core and parallel computation.18 The higher the number of simulation runs n, the more the computation benefits from parallel execution.
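
A sketch of the comparison whose timings are shown below; the column names of sim_heavy, the number of simulation runs n and the ncores argument for parallel execution are assumptions for this illustration:

```r
# single-core run (first timing below)
system.time(
  ci_single <- pin_confint(param = heavy_par,
                           numbuys = sim_heavy[, "Buys"],
                           numsells = sim_heavy[, "Sells"],
                           n = 1000, level = 0.95, ncores = 1)
)
# parallel run on four physical cores (second timing below, see footnote 18)
system.time(
  ci_par <- pin_confint(param = heavy_par,
                        numbuys = sim_heavy[, "Buys"],
                        numsells = sim_heavy[, "Sells"],
                        n = 1000, level = 0.95, ncores = 4)
)
```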

#>    user  system elapsed 
#>  11.417   0.013  11.439
#>    user  system elapsed 
#>   3.897   0.103   7.876

Posterior probabilities of the trading days’ conditions are returned by posterior and can be displayed with ggplot. As an example, we compute posteriors for the BSheavy dataset using the corresponding parameter estimates stored in heavy_est.
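
Again assuming heavy_par holds the five estimated parameters extracted from heavy_est and the column names "Buys" and "Sells":

```r
post_heavy <- posterior(param = heavy_par,
                        numbuys = BSheavy[, "Buys"],
                        numsells = BSheavy[, "Sells"])

library(ggplot2)
ggplot(post_heavy)
```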

If the x axis should show dates, the names of numbuys and numsells need to be in either "%Y-%m-%d" or "%Y/%m/%d" format, which can be achieved with as.Date. The following code chunk shows how posterior probabilities for BSfrequent2015 in the third quarter can be visualized.
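
In this sketch, freq_par_q3 is a hypothetical name for the third-quarter parameter estimates, e.g. extracted from the qpin results above:

```r
data("BSfrequent2015")
dates_2015 <- as.Date(rownames(BSfrequent2015))
q3 <- dates_2015 >= as.Date("2015-07-01") & dates_2015 <= as.Date("2015-09-30")

# keep the date strings as names so the x axis shows dates
buys_q3  <- setNames(BSfrequent2015[q3, "Buys"],  rownames(BSfrequent2015)[q3])
sells_q3 <- setNames(BSfrequent2015[q3, "Sells"], rownames(BSfrequent2015)[q3])

post_q3 <- posterior(param = freq_par_q3,
                     numbuys = buys_q3, numsells = sells_q3)

library(ggplot2)
ggplot(post_q3)
```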

References

Aslan, Hadiye, David Easley, Soeren Hvidkjaer, and Maureen O’Hara. 2011. “The Characteristics of Informed Trading: Implications for Asset Pricing.” Journal of Empirical Finance 18 (5). https://doi.org/10.1016/j.jempfin.2011.08.001.

Brockman, Paul, and Dennis Y. Chung. 2008. “Investor Protection, Adverse Selection, and the Probability of Informed Trading.” Review of Quantitative Finance and Accounting 30 (2). https://doi.org/10.1007/s11156-007-0049-4.

Chung, Kee H., and Mingsheng Li. 2003. “Adverse-Selection Costs and the Probability of Information-Based Trading.” Financial Review 38 (2). https://doi.org/10.1111/1540-6288.00045.

Daley, D. J., and D. Vere-Jones. 2003. An Introduction to the Theory of Point Processes Volume I: Elementary Theory and Methods. Second. Springer.

Duarte, Jefferson, and Lance Young. 2009. “Why Is PIN Priced?” Journal of Financial Economics 91 (2). https://doi.org/10.1016/j.jfineco.2007.10.008.

Easley, David, Soeren Hvidkjaer, and Maureen O’Hara. 2002. “Is Information Risk a Determinant of Asset Returns?” The Journal of Finance 57 (5): 2185–2221. https://doi.org/10.1111/1540-6261.00493.

———. 2010. “Factoring Information into Returns.” Journal of Financial and Quantitative Analysis 45 (2): 293–309. https://doi.org/10.1017/S0022109010000074.

Easley, David, Soeren Hvidkjaer, Maureen O’Hara, and Joseph Paperman. 1996. “Liquidity, Information, and Infrequently Traded Stocks.” The Journal of Finance 51 (4): 1405–36. https://doi.org/10.1111/j.1540-6261.1996.tb04074.x.

Ersan, Oguz, and Aslı Alıcı. 2016. “An Unbiased Computation Methodology for Estimating the Probability of Informed Trading (Pin).” Journal of International Financial Markets, Institutions and Money 43: 74–94. https://doi.org/10.1016/j.intfin.2016.04.001.

Everitt, Brian S., Sabine Landau, Morven Leese, and Daniel Stahl. 2011. Cluster Analysis. Fifth. Wiley.

Gan, Quan, Wei Wang Chun, and David Johnstone. 2015. “A Faster Estimation Method for the Probability of Informed Trading Using Hierarchical Agglomerative Clustering.” Quantitative Finance 15 (11): 1805–21. https://doi.org/10.1080/14697688.2015.1023336.

Henry, Tyler R. 2006. “Short Selling, Informed Trading, and Stock Returns.” Working Paper, University of Georgia.

Kang, Moonsoo. 2010. “Probability of Information-Based Trading and the January Effect.” Journal of Banking & Finance 34 (12). https://doi.org/10.1016/j.jbankfin.2010.07.007.

Lee, Charles M. C., and Mark J. Ready. 1991. “Inferring Trade Direction from Intraday Data.” The Journal of Finance 46 (2). https://doi.org/10.1111/j.1540-6261.1991.tb02683.x.

Lee, Peter M. 2012. Bayesian Statistics: An Introduction. Wiley.

Lei, Qin, and Guojun Wu. 2005. “Time-Varying Informed and Uninformed Trading Activities.” Journal of Financial Markets 8 (2). https://doi.org/10.1016/j.finmar.2004.09.002.

Li, Haitao, Junbo Wang, Chunchi Wu, and Yan He. 2009. “Are Liquidity and Information Risks Priced in the Treasury Bond Market?” The Journal of Finance 64 (1). https://doi.org/10.1111/j.1540-6261.2008.01439.x.

Lin, Hsiou-Wei William, and Wen-Chyan Ke. 2011. “A Computing Bias in Estimating the Probability of Informed Trading.” Journal of Financial Markets 14 (4): 625–40. https://doi.org/10.1016/j.finmar.2011.03.001.

Müllner, Daniel. 2013. “fastcluster: Fast Hierarchical, Agglomerative Clustering Routines for R and Python.” Journal of Statistical Software 53 (9). http://www.jstatsoft.org/v53/i09/.

Tijms, Henk C. 2003. A First Course in Stochastic Models. New York (N. Y.): Wiley. https://doi.org/10.1002/047001363X.

Yan, Yuxing, and Shaojun Zhang. 2012. “An Improved Estimation Method and Empirical Properties of the Probability of Informed Trading.” Journal of Banking & Finance 36 (2): 454–67. https://doi.org/10.1016/j.jbankfin.2011.08.003.

Zhou, Rhea Tingyu, and Rose Neng Lai. 2009. “Herding and Information Based Trading.” Journal of Empirical Finance 16 (3). https://doi.org/10.1016/j.jempfin.2009.01.004.


  1. Poisson processes are directly related to the corresponding discrete Poisson distribution, which is defined by the following cumulative distribution function (cdf) and probability mass function (pmf), \[ \begin{align} \text{cdf}: \:\: F\left(\lambda;n\right) = \exp\left(-\lambda\right) \sum\limits_{k=0}^n \dfrac{\lambda^k}{k!} \quad \text{and} \quad \text{pmf}: \:\: f\left(\lambda;n\right) = \exp\left(-\lambda\right) \dfrac{\lambda^n}{n!}, \notag \end{align} \] with the number of arrivals \(n \in \mathbb{N}_0\) and the intensity parameter \(\lambda \in \mathbb{R}^+\). The mean and variance of the Poisson distribution are equal and given by the parameter \(\lambda\).

  2. The exponential distribution is a continuous distribution which has cdf and probability density function (pdf) given below, \[ \begin{align} \text{cdf}: \:\: F\left(\lambda;x\right) = 1 - \exp\left(-\lambda x\right) \quad \text{and} \quad \text{pdf}: \:\: f\left(\lambda;x\right) = \lambda \exp\left(-\lambda x\right) \notag; \:\: \text{with}\:\: x \in \mathbb{R}^+ \end{align} \] with rate parameter \(\lambda\).

  3. The merging of two homogeneous independent Poisson processes again yields a Poisson process, e.g. see Tijms (2003).

  4. For most exchanges no data is available about the direction of transactions. Due to this reason algorithms like Lee & Ready are used to try to detect if a transaction is buyer- or seller-initiated (see Lee and Ready 1991).

  5. All model parameters representing trading intensities are assumed to be positive real numbers.

  6. Overflow errors occur if the calculated number is too big in magnitude, so that it can no longer be represented by the machine/software.

  7. Underflow errors occur if the number to represent is too small and vanishes to zero.

  8. More details are given in the appendix of Yan and Zhang (2012).

  9. These bounds are chosen according to the work by Yan and Zhang (2012). In principle, any value reasonably greater than 0 can be chosen as lower bound and any value reasonably smaller than 1 as upper bound.

  10. Gan, Chun, and Johnstone (2015) mention that they tested different agglomerative clustering methods but that the complete-linkage method performed marginally better than the others. For a description of the other available methods, e.g. single-linkage or centroid-linkage, see Everitt et al. (2011).

  11. According to Gan, Chun, and Johnstone (2015) we use the R function hclust to perform this task (see Müllner 2013).

  12. A splitting of the intensity of informed trading is not present in the EHO model. However, one could extend the existing model with this feature and would already have a suitable technique for generating initial values.

  13. It is not ensured that \(\mu_b^0\) and \(\mu_s^0\) are positive. Hence, if \(\mu_b^0\) or \(\mu_s^0\) are negative we set them to 0.

  14. The term unbiased refers to computing bias and not the statistical understanding.

  15. In the work by Ersan and Alıcı (2016) \(j\) is chosen to be an integer in the range from 1 to 10.

  16. In the work by Ersan and Alıcı (2016) this relation is not directly mentioned. However, at this point in the algorithm, we solely have information about the two groups of trading days and the absolute order imbalance at hand. Since the EHO model does not separate the informed buy rate from the informed sell rate, the difference in the averages of the absolute order imbalances of the no-event and event groups is appropriate to capture the intensity of trading due to private information.

  17. The specialized functions for generating initial values are not exported.

  18. All computations were done on an Intel Core i5-4590 with four physical cores.