SPDCA:

Having trouble sleeping, so I'll write a bit about the latest adventures of this algorithm.

Amazingly, I had the opportunity to use this at work. I needed to design a tool that could consume log messages from another data source. These logs were sequentially numbered (like idkfa posts), arrived in monotonically increasing order, and came through very, very fast (say, 80MB in a few seconds).

My log consumer needed to be robust and fast, and in addition to the statistical features it would gain from scanning the log messages, it would need to "remember" whether it had already seen a given message, in case a message was a) accidentally or programmatically repeated, or b) explicitly sent again by an administrator. That meant I needed a way to accurately track the messages I had seen that would be 1) quick to access and 2) light on memory. I decided to take the method I had for idkfa and implement it in Perl.

The "quick" part took about a week and a half to figure out, as I was having to continuously rewrite my code to try to keep up with the data stream. Performing a linear search was far, far too slow, even in the beginning, and would be unusable after only a short period of normal usage. By the time I was done, I was using two Red-Black trees as lower and upper bound indexes (similar to the database index I use on idkfa), as well as a few other tricks made possible by having data structures stored in memory rather than a database.

Unfortunately, memory usage became a killer issue. While my algorithm could keep up and scale well as it "saw" more data, there was never any guarantee that the "gaps" in the log messages would be filled. I was collecting such a huge amount of data that even if I was missing only 1 in 1000 messages, within hours my process would run out of memory from storing all of the gaps.

It was at that point I had to make a compromise. There was no way I was going to be able to build more complexity into the tool just to track messages, and the chance that I would ever see a gap filled grew smaller and smaller as time went on.

My decision was this: I would store only 1000 bounds. When it came time to add another bound, I would take the "oldest" two bounds (those I was least likely to see again) and merge them. This meant that although I was introducing inaccuracy into the system, I was introducing it at the point where I was least likely to need accuracy. If I can still be 99% accurate over my configured time period, then I think this is a reasonable compromise.
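
Sketched against the @bounds array above (the 1000 cap is from the post; the rest is illustrative, not the production code):

    # Cap the number of stored bounds. When the cap is exceeded, collapse
    # the two lowest-numbered (i.e. oldest) bounds into one, which silently
    # marks the gap between them as "seen"; that is the deliberate inaccuracy.
    my $MAX_BOUNDS = 1000;

    sub enforce_limit {
        return unless @bounds > $MAX_BOUNDS;
        my $merged = [ $bounds[0][0], $bounds[1][1] ];
        splice @bounds, 0, 2, $merged;
    }

    # After every message:
    #   mark_seen($id);
    #   enforce_limit();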

If idkfa's seen/unseen system bogs down under the sheer quantity of posts we're adding, this may become a viable option. In that case, as you read, you would slowly "auto-read" the oldest posts whenever you read new posts that couldn't be merged into the currently stored boundaries. The system would be less accurate, but only for older posts that you probably weren't going to read anyway.

I don't see that happening any time soon. More just making a note for the future.

#3983, posted at 2012-01-17 03:31:45 in idkfa