Commits

Nadav Rotem committed 1ea703c548a
[Compression] Implement a new strategy for finding frequently used substrings. The code-book compression (compression using a dictionary) shortens words by encoding references to a string table. This commit changes the code that constructs the string table. Before we were scanning all of the substrings in the input upto a certain length and we sorted them by frequency. The disadvantage of that approach was that we encoded parts of substrings multiple times. For example, the word "Collection" and the substring "ollec" had the same frequency. Attempts to prune the list was too compute intensive and not very effective (we checked if "ollec" is a substring of "Collection" and if they had a similar frequency). This commit implements a completely different approach. We now partition the long words into tokens. For example, the string "ArrayType10Collection" is split to "Array" + "Type" + "10Collection. This method is very effective and with the updated tables we can now reduce the size of the string table by 57%! This change also reduces the size of the string table by 1/3. With this change (the auto-generated h files, that are not included in this commit) the size of the swift dylib on x86 is reduced from 4.4MB to 3.6MB.