In the previous post, I connected the dots between the different time series, which made it possible to recover the full time order and to get a unique identifier for each asset.
In this part, I will use external data to disambiguate the assets and understand the magic quantities.
The data and details of the challenge are still available here.
The first thing we can do to learn what the two quantities md and bc are is to look at their distributions and their evolution over time.
Neither of these two quantities looks like a usual distribution (Normal, Poisson, Cauchy, …). At least, we know the range of the values.
For md, on the left, the first hypothesis was that it is the log of the true asset price relative to Bitcoin. Because no cryptoasset is more expensive than Bitcoin, the price ratio would lie in \((0, 1]\), leading to log values in \((-\infty, 0]\).
For bc, the values are mainly in [0, 1], but part of the distribution leaks below 0, so this might not be a rate in %.
Also, the values are weirdly capped at 1, which doesn’t look natural given that negative values are possible.
bc might be a correlation coefficient.
As we know the relative time of each asset, we can see how these two quantities evolve over time.
For md we get:
And for bc:
We can see that major events affect the md and bc time series globally.
This is very clear for md, where the lines seem parallel to each other, with a jump near week 20 and a progressive drop followed by a recovery between weeks 160 and 180.
The behavior of bc is clearly different.
It is less stable, with short-term events of large amplitude.
The dates of the events in md and bc don’t seem to be related.
We can look at the history of each asset, to see how it evolves on the long term.
For md, because the top items are so close to each other, we expect them to belong to a single asset.
This is confirmed by the following figure:
We can study bc the same way, but here the linkage is less clear:
The results are noisier and more difficult to understand.
While bc is less stable, most of the “top” or longest series correspond to the largest values (there are very few values close to 0).
Selecting one asset at random, we get the following reconstructed signal:
The corresponding md values are:
The relationship between the asset and the md value is not straightforward.
Given that the asset variations are quite large, one way to study the relationship is to move to the log scale.
Superposing the two curves (with different scaling), we get the following:
Here, we can see how similar the two curves are. The shapes are similar, with short-term events occurring at the same time, but the curves are far from overlapping.
Looking at other assets, we get similar results:
In the challenge description, we were told that returns are relative to the Bitcoin price. Given how much the two curves differ, the gap cannot be due to the missing 24th point.
A hypothesis would be that md is the value relative to a stable fiat currency (USD or EUR).
Here, we used the natural log to convert the asset time series. As a reminder, given the previous results, the value of an asset at time \(t\) is:
\[X_t = X_0 \prod_{i=1}^t (1 + R_i)\]
where \(X_0\) is the value of the asset at time \(t_0\), and \(R_i\) is the return at time \(t_i\), i.e. \(R_i = \frac{X_i}{X_{i-1}} - 1\). Because we don’t know \(X_0\), we set it to \(1\) to study the behavior of \(\mathbf{X}\).
Moving to the log makes everything simpler:
\[\log(X_t) = \log(X_0) + \sum_{i=1}^t \log(1 + R_i)\]
To convert a log from one base to another, we have a simple formula:
\[\log_b(a) = \frac{\log(a)}{\log(b)}\]
The log in base \(b\) is simply the natural log of the value divided by the natural log of the base.
Here, we observed that the natural log gives the best match: \(md \approx A + \sum_{i=1}^t \log(1 + R_i)\).
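As a quick illustration of the identities above, here is a minimal NumPy sketch. The returns array `R` is made up for the example (it is not challenge data), and \(X_0\) is set to 1 as in the text.

```python
import numpy as np

# Hypothetical hourly returns, for illustration only.
R = np.array([0.02, -0.01, 0.03, -0.05, 0.01])

# Price path with X_0 = 1: X_t = prod_{i<=t} (1 + R_i)
X = np.cumprod(1.0 + R)

# Same path on the log scale: log(X_t) = sum_{i<=t} log(1 + R_i)
log_X = np.cumsum(np.log1p(R))

# Changing the base only rescales the curve by 1 / log(b).
log10_X = log_X / np.log(10.0)
```

A wrong log base would therefore show up as a wrong amplitude of the curve, not as a vertical shift.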
Centering the asset log series on the md mean, we can see that the scale is similar:
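The centering step can be sketched as follows. The helper name `center_on` and the numbers are hypothetical; the only assumption is that centering means shifting the log series so that its mean equals the md mean, without touching its amplitude.

```python
import numpy as np

def center_on(log_series, md_values):
    """Shift log_series so that its mean equals the mean of md_values."""
    log_series = np.asarray(log_series, dtype=float)
    md_values = np.asarray(md_values, dtype=float)
    return log_series - log_series.mean() + md_values.mean()

# Made-up numbers, only to show the call.
log_series = np.array([0.0, 0.1, -0.2, 0.3])
md_values = np.array([-3.1, -3.0, -3.4, -2.9])
centered = center_on(log_series, md_values)
```

Because only a constant is added, the point-to-point variations are unchanged, so the two curves can be compared on the same axis.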
Because cryptocurrencies are “open systems”, prices are well recorded and available in multiple databases for free.
To confirm our hypothesis about the interpretation of md, we wanted to compare its values to true historical asset prices.
We found Coinmetrics, which provides a day-by-day history for many assets.
For each asset, we have the price relative to Bitcoin, and the price relative to USD or Euro.
There is a lot of other information, but we won’t exploit it here.
The only issue with this dataset is the sampling: Napoleon’s dataset is hourly, while the Coinmetrics database gives daily prices. We need to adjust for this by averaging our returns over one day.
With this dataset in hand, we can quickly see whether our hypothesis is valid. The most valuable crypto asset after Bitcoin is Ethereum. If we look at its price relative to Bitcoin, we get:
And if we move to the log scale, we get this signal.
Averaging over one-week chunks and moving to the log scale, we get:
where the signal exactly overlaps the md values of the best asset.
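The weekly averaging can be sketched like this. The function name `weekly_log_price` and the sample prices are hypothetical, and the sketch assumes the average is taken on the prices first and the log afterwards (log of the mean, not mean of the logs).

```python
import numpy as np

def weekly_log_price(daily_price, start=0, window=7):
    """Log of the mean price over consecutive chunks of `window` days."""
    p = np.asarray(daily_price, dtype=float)[start:]
    n_chunks = len(p) // window
    # Drop the incomplete trailing chunk, then average each chunk.
    chunks = p[: n_chunks * window].reshape(n_chunks, window)
    return np.log(chunks.mean(axis=1))

# Made-up daily prices relative to Bitcoin: two full weeks plus 3 extra days.
daily_price = np.array([0.05] * 7 + [0.10] * 7 + [0.07] * 3)
weekly = weekly_log_price(daily_price)
```

The `start` argument lets the first chunk begin on any day, which matters when the first day of a week in the dataset is unknown.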
We can now put a date on it:
We are quite happy, but we needed to make a small adjustment: we had to shift the time series up by 1.8 points, and there is no clear explanation for it.
When we use the wrong log base, it impacts the amplitude of the time-series movement. However, the amplitude was correct, and didn’t need to be adjusted.
It means that:
\[MD(t) = A + \log\big(\text{ETH/BTC price}(t)\big) = \log\big(\exp(A) \times \text{ETH/BTC price}(t)\big)\]
It is not clear why we have this factor.
Additionally, the factor is not the same for all crypto-assets.
For XLM, we needed to readjust by 7 points, and for DOGE by 8.8 points.
This is certainly not linked to the log base. It might be a way to obfuscate the dataset.
The factor doesn’t seem to be related to the initial asset value.
Fortunately, we found that Ethereum is the top one, but dash, for instance, which is pricey, has a negative coefficient of -0.69.
We did not find a general law to explain these coefficients.
It seems that for each asset, a random factor has been selected to transform the true price so we cannot recover it trivially without external knowledge.
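One simple way to measure such a per-asset shift is to look at the difference between md and the log price on matching dates. This is only a sketch: the helper `estimate_offset` and the numbers are hypothetical, and the 1.8 shift is reused purely as an example value.

```python
import numpy as np

def estimate_offset(md_values, log_price):
    """Return the mean shift A = mean(md - log_price) and the residual spread."""
    diff = np.asarray(md_values, dtype=float) - np.asarray(log_price, dtype=float)
    return diff.mean(), diff.std()

# Made-up weekly log prices, and md built as the same curve shifted by 1.8.
log_price = np.log([0.05, 0.06, 0.04, 0.07])
md_values = log_price + 1.8
A, spread = estimate_offset(md_values, log_price)
```

A spread close to zero means the two curves differ only by a constant, which is what a correct log base with an unexplained additive factor would look like.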
To identify which dates best fit our dataset, we studied the cross-correlation.
To find the recording date of md, we tested averaging the asset’s values over one, two and three weeks.
We tested different starting days of the week, because it is possible that day 1 is not a Monday.
Additionally, in some countries, the first day of the week is Sunday, so we checked for that too.
The best averaging window was 1 week, and the best starting date was 2017-08-02 (i.e. averaging between this date and 2017-08-09).
The 2nd of August is a Wednesday. This date matches md for the earliest record.
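The search over windows and starting days can be sketched as a small grid search. This is not the exact procedure used here: it scores each candidate with a plain Pearson correlation, it assumes one md value every 7 days, and the function names and synthetic data are hypothetical.

```python
import numpy as np

def averaged_log(daily_price, offset, window, n_points, stride=7):
    """Log of the mean price over `window` days, one point every `stride` days."""
    starts = offset + stride * np.arange(n_points)
    return np.log([daily_price[s:s + window].mean() for s in starts])

def best_alignment(daily_price, md_values, windows=(7, 14, 21), max_offset=14):
    """Try every (window, starting day) pair and keep the best correlation."""
    daily_price = np.asarray(daily_price, dtype=float)
    md_values = np.asarray(md_values, dtype=float)
    best = (-np.inf, None, None)
    for window in windows:
        for offset in range(max_offset):
            candidate = averaged_log(daily_price, offset, window, len(md_values))
            corr = np.corrcoef(candidate, md_values)[0, 1]
            if corr > best[0]:
                best = (corr, offset, window)
    return best

# Synthetic check: build md from a known alignment, then try to recover it.
rng = np.random.default_rng(0)
daily_price = np.exp(np.cumsum(rng.normal(0.0, 0.05, size=400)))
md_values = averaged_log(daily_price, offset=3, window=7, n_points=50) + 1.8
corr, offset, window = best_alignment(daily_price, md_values)
```

Since the Pearson correlation ignores constant shifts, the unknown additive factor on md does not disturb the search.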
To find the recording date of the asset value, we integrated our series and under-sampled it once every day:
(1 + R).cumprod()[::24]
We did not pay attention to the possible hour lag, as it is of limited interest.
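For context, here is a slightly expanded version of that one-liner on synthetic data. The returns are randomly generated for the example, and the daily mean is shown only as a possible alternative to taking one snapshot per day.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic hourly returns for 21 days (not challenge data).
R = rng.normal(0.0, 0.01, size=24 * 21)

# Integrate the returns into a value series, with X_0 = 1.
hourly_value = (1.0 + R).cumprod()

# One snapshot every 24 hours, as in the one-liner above.
daily_snapshot = hourly_value[::24]

# Alternative: average the 24 hourly values of each day.
daily_mean = hourly_value.reshape(-1, 24).mean(axis=1)
```

Both series have one point per day, so either can be compared to a daily external price history.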
By searching for the best cross-correlation over different chunks, we found that the best starting day is 2017-07-19.
Knowing that 216 weeks are recorded, the recording ends on 2021-09-07.
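The date arithmetic can be checked with the standard library. The variable names are illustrative; the dates and the number of weeks are the ones quoted in this post.

```python
from datetime import date, timedelta

start = date(2017, 7, 19)   # best starting day of the asset values
n_weeks = 216               # number of recorded weeks

# Last recorded day: 216 full weeks after the start, minus one day.
last_day = start + timedelta(weeks=n_weeks) - timedelta(days=1)

# First day of the 3rd week of the first cluster.
third_week_start = start + timedelta(weeks=2)

print(last_day)                     # 2021-09-07
print(third_week_start)             # 2017-08-02
print(third_week_start.weekday())   # 2, i.e. Wednesday (Monday is 0)
```

The third week of the first cluster starts on 2017-08-02, which is the best starting date found for md.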
In other words, md is the log of the asset’s average value over the 3rd week of a cluster.
Using an external dataset, we were able to find out what md is.
For this part, we did not make any new submission: as the external dataset is daily, we got large adjustment errors (more than 5%).
Plus, this is not really “fair”.
In the next part, we explain how we build a good solution using only dataset information.
Go to the 3rd part.