Mark Gritter (markgritter) wrote,
Mark Gritter

Don't trust published algorithms, the sequel

So I'm trying to implement an algorithm for extracting message types from log files. The author created several published versions, of which this one appears to be the preferred reference: (His TR omits one algorithm altogether, unfortunately.)

I can't wrap my head around what he meant by the following heuristic. P is a partition of log messages, which has already been tokenized and restricted to messages of a particular token length (i.e., count_token). We are trying to pick two columns within P for a subsequent partitioning step. Other published descriptions said that the heuristic was to pick the modal (most frequently occurring) token count per position, but the detailed version is:

1: Determine token count of P as count_token
2: if count_token = 2 then
3:    Set P1 to first token position.
4:    Set P2 to second token position.
5: else if { count_token is > 2 }
6: if P went through step 2 then
7:    Determine cardinality of each token position.
8:    Determine the token count value with the highest frequency 
      other than value 1 as freq_card.
9:    If there is a tie for highest frequency value then
10:      Select lower token value as freq_card
11:   end if
12:   if the frequency of freq_card > 1 then
13:      Set P1 to first token position with cardinality freq_card.
14:      Set P2 to second token position with cardinality freq_card.
15:   else { the frequency of freq_card = 1 }
16:      Set P1 to first token position with cardinality freq_card.
17:      Set P2 to first token position with the next most frequent 
         cardinality other than value 1.
18:   end if
19: else // other stuff, don't care

So, is freq_card allowed to be 1 or not? What if there is no mode, because all the positions have different cardinality?

I think what he's saying in a situation like

error: aaa bbb
error: aaa ccc
error: ddd eee

where the columns have cardinality 1, 2, and 3, is that we tiebreak among the "highest frequency value" by picking the lower cardinality: 1. But there is no "next most frequent cardinality" so I guess the only thing we could do is pick between 2 and 3, and the tiebreaker says to pick 2.

But if we had 1, 1, 2, 2 then we should pick 2 instead because it's not 1, in precedence to the tiebreaker in lines 8-9? Or if it's 1, 1, 1, 2, 2 we should pick 2 as well?

The rationale for picking the mode is that we're looking for the constant part of the messages. The messages include fixed phrases as well as variable components, so if there are N different messages in our partition, we should be able to find at least two columns that all have N different tokens in them. From that point of view, I guess if we can't find a modal token count then the heuristic simply shouldn't apply (but the algorithm spends quite a bit of effort when the relationship between the two columns we pick is not 1:1, so evidently he thought it was important to catch some other cases too.)
Tags: algorithm, programming
  • Post a new comment


    default userpic

    Your reply will be screened

    Your IP address will be recorded 

    When you submit the form an invisible reCAPTCHA check will be performed.
    You must follow the Privacy Policy and Google Terms of use.
  • 1 comment