Searching a string using a hash

Searching a string can be reduced from O(nm) to O(n) using hashed values. This is done by ensuring the hash for the next substring in the searched text (MagicHashFunction(searchedtext.Substring(j+1, m))) can be calculated from the previous hash (MagicHashFunction(searchedtext.Substring(j, m)))substring in O(1).

The code below generates the hash by multiplying each character by a power of the size of the 'alphabet' being used. The power used is the depth in from the right edge of the string (starting from the right just means changes to the beginning of the string have a bigger impact - most people consider the change from AAAA to BAAA to be more significant than a change from AAAA to AAAB)

m-1
 Ʃ αm-i-1 * si+j
i=0

where:
   α = the number of allowed values for si
   m = length of the search term
   j = index into the string being searched
   s = the string being hashed (si+j is the character being hashed)

We can then calculate Hash for the substring starting and finish one character further into the string using an O(1) function:

α(p - αm-1 * sj-1) + sj+m

where:
   p = the previous hash value
   α = the number of allowed values for sn
   m = length of the search term
   j = index into the string being searched
   s = the string being hashed (si+j being the character being hashed)

For this to work in the real world, we need to keep the hash values down to a reasonable value (e.g. a modulus of the number). This would increase the collisions, and we should check the actual strings for character sameness. It should be noted that the hash used here isn't much of a hash. If you know m you can actually retrieve the original string.

The code below uses implements the above hash in order to perform a sort. It does not implement modulus but not collision handling.

/// <summary>Size of 'alphabet'</summary>
public const int charRange = 256;
 
/// <summary>Suitably large prime used to keep hash size down (else the 
/// comparision of the two hash could be as expensive as the comparision of 
/// the original strings</summary>
public const long prime = 4000037;
 
/// <summary>Simple hash for a string. Hash is the sum of each character's integer value
/// multiplied by the charRange to the power of n, where n is the position in from the right of
/// string. This isn't realistic hash function as the hash gets big quick.</summary>
/// <example>AB has the has value 256**1*64 + 256**0*65 = 16449</example>
/// <param name="text">Text to be hashed</param>
/// <returns>Hash</returns>
public static long HashIt(string text)
{
    long hash = 0;
    for (int i = 0; i < text.Length; i++)
    {
        double temp = Math.Pow(charRange, text.Length-i-1) * text[i];
        hash += (long) temp;
    }
    return hash % prime;
}
 
/// <summary>Calculates the hash for the string one character to the right of the 
/// passed hash</summary>
/// <param name="text">Text containing the substring to be hashed</param>
/// <param name="j">Start of the text to be hashed (must be one character to the right of <paramref name="previousHash"/></param>
/// <param name="m">Length of the string to hashed (must be the same as the length of the string hashed to create <paramref name="previousHash"/></param>
/// <param name="previousHash">Hash for text.Substring(j-1, m)</param>
/// <returns>Hash for the string</returns>
public static long HashIt(string text, double previousHash, int j, int m)
{
    // This would be constant for the search term expression and not calculated each time
    double amm1 = Math.Pow(charRange, m - 1);
 
    double hash = charRange * (previousHash - amm1*text[j-1]) + text[j+m-1];
    hash %= prime;
    if (hash < 0) hash += prime;
    return (long)(hash);
}
for (int i=0; i<searchedText.Length - searchTerm.Length; i++)
{
    if (i == 0) currentHash = Hash.HashIt(searchedText.Substring(0, searchTerm.Length));
    else currentHash = Hash.HashIt(searchedText, currentHash, i, searchTerm.Length);
    if (currentHash == searchHash)
    {
        Console.WriteLine("Candidate match found at index {0}", i);
    }
}