I'm currently working with a hydrologist, and he raised a question that occurs quite frequently with real data: what do you do when the data look like they need a log transformation, but there are zero values?
I asked the question on stats.stackexchange.com and received some useful suggestions. What follows is a summary based on these answers, my own experience, plus a few papers I discovered that deal with the topic. In general, the most appropriate course of action depends on the model and the context. Zeros can arise for several different reasons, each of which may have to be treated differently.
Box-Cox (BC) transformations
There is a two-parameter version of the Box-Cox transformation that allows a shift before transformation:$$g(y;\lambda_{1}, \lambda_{2}) = \begin{cases}\frac{(y+\lambda_{2})^{\lambda_{1}} - 1}{\lambda_{1}} & \mbox{when } \lambda_{1} \neq 0 \\ \log(y + \lambda_{2}) & \mbox{when } \lambda_{1} = 0\end{cases}.$$The usual Box-Cox transformation sets $\lambda_2=0$. One common choice with the two-parameter version is $\lambda_1=0$ and $\lambda_2=1$, which has the neat property of mapping zero to zero. There is even an R function for this: log1p(). More generally, both parameters can be estimated. In R, the boxcox.fit() function in package geoR will fit the parameters.
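To make the definition concrete, here is a minimal Python sketch of the two-parameter transformation applied to a single value. The function name box_cox2 and its argument names are illustrative only and do not come from any package; the sketch evaluates the transformation for given parameters and does not estimate them.

```python
import math


def box_cox2(y, lam1, lam2=0.0):
    """Two-parameter Box-Cox transform of one value (hypothetical helper).

    lam1 is the power parameter and lam2 the shift applied before transforming.
    """
    shifted = y + lam2
    # The transform is undefined for negative shifted values, and a zero
    # shifted value only works when the power lam1 is strictly positive.
    if shifted < 0 or (shifted == 0 and lam1 <= 0):
        raise ValueError("y + lam2 is outside the domain for this lam1")
    if lam1 == 0:
        return math.log(shifted)
    return (shifted ** lam1 - 1.0) / lam1


# With lam1 = 0 and lam2 = 1 the transform is log(1 + y), so zero maps to zero.
zero_mapped = box_cox2(0.0, 0.0, 1.0)
same_as_log1p = math.isclose(box_cox2(3.0, 0.0, 1.0), math.log1p(3.0))
```

For very small values of y, a dedicated log1p routine is usually more accurate than computing log(1 + y) directly, which is one reason such a function exists.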
Alternatively, when $\lambda_1=0$, it has been suggested that $\lambda_2$ should be approximately one half of the smallest non-zero value. Another suggestion is that $\lambda_2$ should be the square of the first quartile divided by the third quartile (Stahel, 2002).
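Both rules of thumb are easy to compute. The sketch below uses only the Python standard library; the function names are made up for illustration. It makes two assumptions that the text above does not settle: the quartiles are taken over the positive observations only (otherwise a sample with many zeros would give a first quartile of zero), and the quartiles use the default method of statistics.quantiles.

```python
import statistics


def half_min_positive(values):
    """Shift equal to half of the smallest non-zero (positive) observation."""
    positives = [v for v in values if v > 0]
    return min(positives) / 2.0


def quartile_ratio_shift(values):
    """Shift equal to (first quartile)^2 / (third quartile).

    Assumption: quartiles are computed from the positive values only.
    """
    positives = [v for v in values if v > 0]
    q1, _median, q3 = statistics.quantiles(positives, n=4)
    return q1 ** 2 / q3


# Hypothetical daily totals with two zeros.
sample = [0.0, 0.0, 0.2, 0.4, 1.0, 2.5, 8.0]
shift_a = half_min_positive(sample)
shift_b = quartile_ratio_shift(sample)
```

Either value can then be used as $\lambda_2$ in $\log(y+\lambda_2)$; the two rules can give quite different shifts, so it is worth checking how sensitive the results are to the choice.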
I've used functions like this several times, including in Hyndman & Grunwald (2000) where we used $\log(y+\lambda_2)$ applied to daily rainfall data.
One simple special case is the square root, where $\lambda_2=0$ and $\lambda_1=0.5$. This works fine with zeros (although not with negative values). However, often the square root is not a strong enough transformation to deal with the high levels of skewness seen in real data.
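Substituting these values into the two-parameter definition shows why this is "the square root": the result is a linear rescaling of $\sqrt{y}$, which is finite at $y=0$.

$$g(y; 0.5, 0) = \frac{y^{0.5} - 1}{0.5} = 2\left(\sqrt{y} - 1\right), \qquad g(0; 0.5, 0) = -2.$$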
Inverse hyperbolic sine (IHS) transformation
An alternative transformation family was proposed by Johnson (1949) and is defined by$$f(y,\theta) = \text{sinh}^{-1}(\theta y)/\theta = \log\left(\theta y + (\theta^2 y^2 + 1)^{1/2}\right)/\theta,$$where $\theta > 0$. For any value of $\theta$, zero maps to zero. There is also a two-parameter version allowing a shift, just as with the two-parameter BC transformation. Burbidge, Magee and Robb (1988) also discuss the IHS transformation, including estimation of $\theta$.
The IHS transformation works with data defined on the whole real line, including negative values and zeros. For large values of $y$ it behaves like a log transformation, regardless of the value of $\theta$ (except 0). As $\theta\rightarrow0$, $f(y,\theta)\rightarrow y$.
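These properties can be checked numerically. The following is a small Python sketch using math.asinh from the standard library; the function name ihs is illustrative, and the value of theta is simply supplied by the caller rather than estimated.

```python
import math


def ihs(y, theta=1.0):
    """Inverse hyperbolic sine transform asinh(theta * y) / theta, theta > 0."""
    if theta <= 0:
        raise ValueError("theta must be positive")
    return math.asinh(theta * y) / theta


# Zero maps to zero for any theta, and negative values are allowed.
at_zero = ihs(0.0, theta=2.0)
at_negative = ihs(-3.0, theta=0.5)

# For large y the transform is close to log(2 * theta * y) / theta,
# that is, a shifted and scaled log.
large_y_gap = abs(ihs(1e6, theta=1.0) - math.log(2e6))

# For very small theta the transform is close to the identity.
small_theta_gap = abs(ihs(5.0, theta=1e-6) - 5.0)
```

The large-value comparison follows from $\sinh^{-1}(x) = \log(x + \sqrt{x^2+1}) \approx \log(2x)$ when $x$ is large.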
Mixture models
For continuous data, there can be a discrete spike at zero which can be associated with the sensitivity of the measurements. For example, in wind energy, wind below 2 m/s is often recorded as zero, and the distribution of wind energy produced is continuous with a spike at zero.
With rainfall data, there is a spike at zero for a different reason: it didn't rain. These are genuine zeros (rather than undetectably small values).
With insurance data, a similar phenomenon occurs: the distribution of claims is continuous with a large spike at zero.
A fourth example might be income data: zero if someone is not in paid work, but a continuous positive value otherwise.
In each of these cases, a mixture model is probably the most appropriate approach: one part of the model determines the probability of a zero, and the other part determines the distribution of the data when they are positive. We also used something like this in Hyndman and Grunwald (2000).
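As a rough illustration of the two-part idea, the Python sketch below estimates the probability of a zero from the sample proportion and fits a distribution to the positive values separately. The log-normal choice for the positive part is purely an assumption made for the example, as are the function names; it is not the model used in Hyndman and Grunwald (2000), and a real application would usually let both parts depend on covariates.

```python
import math
import statistics


def fit_two_part(values):
    """Fit a simple zero / positive mixture (hypothetical helper).

    Part 1: probability of a zero, estimated by the sample proportion.
    Part 2: log-normal distribution for the positive values (an assumption).
    """
    positives = [v for v in values if v > 0]
    if len(positives) < 2:
        raise ValueError("need at least two positive values")
    logs = [math.log(v) for v in positives]
    return {
        "p_zero": 1.0 - len(positives) / len(values),
        "mu": statistics.fmean(logs),      # mean of log(y) given y > 0
        "sigma": statistics.stdev(logs),   # sd of log(y) given y > 0
    }


def mean_two_part(fit):
    """Overall mean implied by the fit: P(y > 0) times the log-normal mean."""
    positive_mean = math.exp(fit["mu"] + fit["sigma"] ** 2 / 2.0)
    return (1.0 - fit["p_zero"]) * positive_mean


# Hypothetical data: two zeros and three positive values.
fit = fit_two_part([0.0, 0.0, 1.0, math.e, math.e ** 2])
implied_mean = mean_two_part(fit)
```

Because the zeros are modelled separately, no shift parameter is needed: the log is only ever taken of strictly positive values.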