I wanted to get my hands on financial time series. I found one dataset dealing with cryptocurrencies, which you can find on Challenge Data together with the details of the challenge.
Prices are recorded every hour. You get chunks of 21 days, with the 24th-hour price missing.

Additionally, time series are grouped into “clusters”. You need to predict the 24th-hour price for the cluster (not for the individual time series): given the previous hour points, predict the average change at the 24th hour.
The description of the dataset is obfuscated: the quantities `md` and `bc` are not explained.

I found some regularities in the dataset, and I was able to:

- reconstruct the full history of the time series;
- learn more about `bc` and `md`.
I will detail in this article and the following ones how I did it. This one is dedicated to reconstructing the full history.

Let us start with the secret quantities `md` and `bc`. Their names are not standard acronyms (e.g., mean/std/var), so it is useless to guess their meaning without analysis.
The first idea is to look at histograms to study the distribution.
For `bc`, the maximal value is 1, while the minimal value is not clearly visible (there is no strong bottom line). It might be a rate (1 == 100%), but negative rates are often not allowed when capping at 100%. It might also be a correlation coefficient. For `md`, because of the negative values, it seems to be log values.

The second thing we can do is to look at the values over one cluster.
Here, the values are sorted by asset ID. As we have 21 chunks per cluster, we have 21 values per asset (the dataset is recorded week by week).

Looking at the following plot, we see that the asset value is unchanged over the cluster lifetime.
It seems that asset IDs are assigned by decreasing value of `md`. This is true for most of the clusters, but there are a few exceptions.
Since `md` and `bc` are constant over the 21 days, I wanted to see how similar two time series with the same `md` or `bc` look.
First, I studied how unique the `md` and `bc` values are:

- 23,607 unique (cluster, asset) pairs (given the IDs);
- 17,864 unique `md` values;
- 17,870 unique `bc` values.

Therefore, we can find 5,750 pairs of time series with exactly the same `bc` or the same `md`.
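As a sanity check, these counts are easy to reproduce with pandas. A minimal sketch, assuming a dataframe with columns `cluster`, `asset`, `md`, and `bc` (the real file and column names in the challenge data may differ):

```python
import pandas as pd

df = pd.read_csv("X_train.csv")  # hypothetical file name

# One row per (cluster, asset) pair: md and bc are constant within a pair.
pairs = df.drop_duplicates(subset=["cluster", "asset"])

print(len(pairs))             # 23,607 unique (cluster, asset) pairs
print(pairs["md"].nunique())  # 17,864 unique md values
print(pairs["bc"].nunique())  # 17,870 unique bc values

# md values shared by several (cluster, asset) pairs -> candidate duplicates
shared = pairs.groupby("md").size()
print((shared > 1).sum())
```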
So, we picked such `md` and `bc` pairs at random, and we observed that, if you measure the correlation between the two time series, you get a perfect 1.
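A check of this kind can be sketched as follows (same dataframe assumptions as above, plus hypothetical `time` and `value` columns holding the hourly records):

```python
import numpy as np

def series_of(df, cluster, asset):
    """Hourly values of one asset in one cluster, in time order."""
    sub = df[(df["cluster"] == cluster) & (df["asset"] == asset)]
    return sub.sort_values("time")["value"].to_numpy()

# Two assets, in two different clusters, sharing the same md value
# (the IDs below are placeholders):
x = series_of(df, 12, 0)
y = series_of(df, 845, 3)
print(np.corrcoef(x, y)[0, 1])  # -> 1.0 for true duplicates
```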
Knowing that two assets in two different clusters are exactly the same, it is possible to “align” clusters, i.e., finding clusters which occur during the same period.
The 1,464 train clusters were grouped into 311 time-related groups.
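The grouping itself can be done with a small union-find over cluster IDs, joining two clusters whenever they contain assets with the same `md` (or `bc`) value. A sketch, reusing the `pairs` dataframe from above:

```python
# Union-find over cluster IDs.
parent = {c: c for c in pairs["cluster"].unique()}

def find(c):
    while parent[c] != c:
        parent[c] = parent[parent[c]]  # path halving
        c = parent[c]
    return c

def union(a, b):
    parent[find(a)] = find(b)

# Join clusters sharing an identical md value (same idea for bc).
for _, grp in pairs.groupby("md"):
    clusters = grp["cluster"].unique()
    for c in clusters[1:]:
        union(clusters[0], c)

print(len({find(c) for c in parent}))  # -> 311 time-related groups
```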
The following plot shows the size of the obtained groups, in terms of number of clusters:

Of the 1,464 clusters, only \(\approx 100\) are alone. Therefore, we can learn plenty of stuff from the other 90%.
In the previous part, we were able to find which clusters occur at the same time.
If we can group clusters that way, it might be possible to find overlaps between time periods, and therefore which group occurs after another one.

Because assets are crypto-assets, they are unlikely to be independent of each other, and they are subject to the same market variations. If we look at several clusters occurring at the same date, we can see the market impact. Globally, the market is increasing. We see two major events at times \(\approx 150\) and \(\approx 250\).
Note: even if two clusters share only one asset, the correlation between their means is very high.
In the dataset, assuming that clusters overlap, we would find assets \(X_1\) and \(X_2\) where \(X_1(t+k) = X_2(t)\). There are two ways to identify these matches:
When working at the asset level, we will try to find which asset matches a given asset. Because we have about \(18,000\) unique asset time series, an exhaustive search (without optimization) is very expensive (\(\approx 18000^2/2\) comparisons).
When working at the cluster level, we take advantage of the market correlation.
Next, after finding which cluster follows another one, we would have to find which assets match.
In terms of complexity, this is much faster, as we have only 311 cluster groups.
The easiest way to find matching time series is by looking at their cross-correlation coefficient (CC). The CC is simply the correlation coefficient of the two considered segments, separated by a lag \(dt\):
\[CC(X, Y; dt) = \frac{\sum_{i=1}^{n-dt}(x_i - \mu_X)(y_{i+dt}-\mu_Y)}{\sqrt{\sum_{i=1}^{n-dt}(x_i - \mu_X)^2 \times \sum_{i=1}^{n-dt}(y_{i+dt} - \mu_Y)^2}}\]

The interpretation is the same as for the Pearson correlation coefficient:
- 1: signals are correlated;
- 0: signals are not correlated;
- -1: signals are anti-correlated (unlikely to occur under the market hypothesis).

To compute the cross-correlation, we take the mean return rate of a cluster (i.e., the non-integrated signals). Compared to the integrated signal, the resulting correlation is much more discriminative (trends can fool the results).
\[\hat{R}_C(t) = \frac{1}{|C|}\sum_{A \in C} X_A(t)\]

The following image shows the cross-correlation with lag = 0 for all cluster pairs.
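In code, the lagged coefficient is just the Pearson correlation of two shifted slices. A numpy sketch, with a helper computing the mean return of a cluster (I assume the `value` column holds the hourly returns):

```python
import numpy as np

def cross_corr(x, y, dt=0):
    """CC(X, Y; dt): Pearson correlation between x and y shifted by dt."""
    if dt > 0:
        x, y = x[:-dt], y[dt:]
    return np.corrcoef(x, y)[0, 1]

def cluster_mean_return(df, cluster):
    """Mean (non-integrated) return rate of a cluster, hour by hour."""
    sub = df[df["cluster"] == cluster]
    return sub.groupby("time")["value"].mean().to_numpy()

# Lag-0 CC between the mean returns of two clusters (placeholder IDs):
cc = cross_corr(cluster_mean_return(df, 136), cluster_mean_return(df, 956))
```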
Matching pairs should have a CC close to 1 (hopefully). Setting an acceptance threshold at 0.5, we find 122 matching pairs, which moves our number of meta-groups from 311 down to 226.
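Merging groups above the threshold is one more pass of the union-find. A sketch, where `cc_matrix` is the hypothetical lag-0 CC matrix shown above, indexed by group representatives:

```python
THRESHOLD = 0.5

reps = sorted({find(c) for c in parent})  # one representative per group
# cc_matrix[i, j]: lag-0 CC between mean returns of groups reps[i], reps[j]
for i in range(len(reps)):
    for j in range(i + 1, len(reps)):
        if cc_matrix[i, j] >= THRESHOLD:
            union(reps[i], reps[j])  # union-find from the previous section

print(len({find(c) for c in parent}))  # -> 226 meta-groups (was 311)
```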
We verified that the cluster pairs match correctly by plotting their average integrated return rates:
Looking at the size of the meta groups, we can see that many of the singleton clusters are now assigned to a larger group.
Looking at the groups where we could not measure the cross-correlation (white lines and columns in the previous matrix), we can see two things. In other words, the lag between clusters is likely to be exactly 1 week, which greatly reduces the size of our search space.
We repeated the same operation as in the previous sub-section, but with a lag of 1 week (we also used a lag of 2 weeks for the unmatched clusters), as sketched below.
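With hourly data, a one-week lag is 7 × 24 = 168 points, so the same `cross_corr` helper can be reused as is (placeholder cluster IDs):

```python
WEEK = 7 * 24  # one week of hourly points

r_a = cluster_mean_return(df, 136)
r_b = cluster_mean_return(df, 857)
cc_1w = cross_corr(r_a, r_b, dt=WEEK)      # does r_b start one week after r_a?
cc_2w = cross_corr(r_a, r_b, dt=2 * WEEK)  # fallback for unmatched clusters
```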
For validation, we checked that we could find at least one asset in both groups separated by one week.
With some graph analysis, we recovered the full order, where elements with a missing neighbor were either the start or the end of the dataset.
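The graph analysis is nothing fancy: each group points to the group found one week after it, and we walk the chains from the nodes without a predecessor. A sketch, where `next_of` is the hypothetical dict of matches found above:

```python
# next_of[g]: group occurring one week after group g, from the lagged CC.
successors = set(next_of.values())
starts = [g for g in next_of if g not in successors]  # dataset beginnings

order = []
for g in starts:
    while g is not None:
        order.append(g)
        g = next_of.get(g)  # None once we reach the end of a chain
```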
In total, we have 217 full weeks of consecutive data.
You can check for yourself with the clusters of the first 10 weeks:
```
// (weekNbr: [cluster_ID])
{
  0: [136, 956],
  1: [1169, 1172],
  2: [512, 811, 951, 867, 639, 576, 93, 994],
  3: [857, 1123, 92, 1129, 653, 1165],
  4: [568, 1265, 515, 427, 156, 14, 1216, 542],
  5: [752, 583, 430, 924, 492, 499, 55],
  6: [268, 844, 77, 455, 237],
  7: [648, 1448, 242, 614, 518, 325, 756],
  8: [234, 842, 1378, 46, 1196, 691],
  9: [1393, 538, 1055, 883, 1358, 859, 190],
  10: [768, 732, 181, 1384, 511, 1115, 302, 1226],
}
```
Given this history, we created a “Frankenstein” time series: the average price obtained by connecting all clusters together.
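A minimal sketch of the stitching, under the same assumptions as before, and ignoring the multi-week overlap between consecutive clusters (I only keep the first week of each one). `weeks` is the `{weekNbr: [cluster_ID]}` mapping shown above:

```python
import numpy as np

WEEK = 7 * 24

# One representative mean-return segment per week (first cluster of the week).
weekly = [cluster_mean_return(df, ids[0])[:WEEK] for ids in weeks.values()]

franken = [0.0]
for seg in weekly:
    franken.extend(franken[-1] + np.cumsum(seg))  # integrate the returns
franken = np.asarray(franken)
```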
Now that we have the ordered sequence, we can study the testing set. It is very likely that the testing set overlaps the training part. We proceed just as before:

- match identical `md` and `bc` values;
- compute the cross-correlation between cluster mean returns.

After computing the CC, we can visually check that the results at the cluster level are coherent:
For all test clusters, we were able to align them with a train group.
With this information, we can stop there and propose a submission, without a lot of intelligence: for a given day, find the train clusters aligned with it, and use the average return of their targets (`ytrain`) as the prediction, as sketched below.
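A minimal sketch of this baseline, with hypothetical names: `alignment` maps each test cluster to the train clusters recorded during the same period, and `y_train` holds the known targets:

```python
import numpy as np

def predict(test_cluster, alignment, y_train):
    """Average the targets of the train clusters aligned with this test cluster."""
    twins = alignment[test_cluster]  # train clusters, same time period
    return np.mean([y_train[c] for c in twins])
```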
With this approach, I got rank 7:
Rank 7 | Sept. 2, 2022, 3:18 p.m. | japoneris | MSE 0.0061
This is not the end of the analysis. Let’s finish with asset matching.
Now that we know which clusters are recorded during the same period, we can search for matching assets.
When looking at `md` values, we can see that some assets are present in both the training and the testing set.
Previously, we found how clusters follow each other. The goal now is to reconstruct, for one asset, its full price sequence over time. With a longer history, it is easier to study the regularities and behavior.
We selected one test cluster at random, cluster 1472, identified at week 34.

If we compute the cross-correlation with the train assets of this week, we get the following correlation matrix, where only asset 11 shows a strong correlation (yellow color):
Now, we can look at past and future events.
If we look at assets starting the week before (week 33), we get many more matching candidates. For all but assets 9 and 10, we could find their relatives:
And looking one week after (week 35), each asset has an analog:
With this information, we can put all the pieces together and obtain a perfect overlap. If we align the segments correctly by changing the starting value of each one, we get the plots below.
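Aligning by changing the starting values amounts to offsetting each integrated segment so that it starts at the level where its predecessor ends (a sketch, ignoring the overlaps):

```python
import numpy as np

def align(segments):
    """Offset each integrated segment to continue the previous one."""
    out = [np.asarray(segments[0], dtype=float)]
    for seg in segments[1:]:
        seg = np.asarray(seg, dtype=float)
        out.append(seg - seg[0] + out[-1][-1])  # shift the starting value
    return out
```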
We have identified 1,262 unique time series. A few assets cover the whole dataset, but many of them are singletons (408 occur only once).
The following plot shows the length of each asset versus its rank:

The longest signal we have covers the whole dataset, and is very similar to the “Frankenstein” signal we plotted before:
We analyzed the Napoleon X dataset, and we were able to recover the relative time of all clusters. We were also able to recreate asset signals that were cut into pieces; the longest one covers the whole dataset.

Without any sophisticated algorithm, we were able to get a good MSE score.
We still don’t know what `md` and `bc` are; this will be the topic of the next post in the series.
You can read the next article on the topic here
>> You can subscribe to my mailing list here for a monthly update. <<