khmer, a simple k-mer counting library

khmer is a simple C++ library for counting k-mers in DNA sequences. It has a complete Python wrapping and should be pretty darned fast; it's intended for genome-scale k-mer counting.

The current version is 0.2. I haven't used it for much myself, but the test code runs and it should work as advertised.

khmer operates by building a 'ktable', a table of 4**k counters. It then maps each k-mer into this table with a simple (and reversible) hash function.
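
To make the idea concrete, here is a minimal pure-Python sketch of a reversible k-mer hash. It is an illustration only: the 2-bit base encoding (A=0, C=1, G=2, T=3) and the standalone function signatures are assumptions, and khmer's own hash may order the bases differently. Unlike the ktable methods, this reverse_hash needs k passed in explicitly, because there is no table object to remember it.

```python
# Hypothetical 2-bit encoding; khmer's actual base ordering may differ.
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"

def forward_hash(kmer):
    """Read the k-mer as a base-4 number, giving a value in [0, 4**k - 1]."""
    h = 0
    for base in kmer:
        h = h * 4 + CODES[base]
    return h

def reverse_hash(hashval, k):
    """Peel off base-4 digits to recover the k-mer of length k."""
    bases = []
    for _ in range(k):
        bases.append(BASES[hashval % 4])
        hashval //= 4
    return "".join(reversed(bases))

# A table of 4**k counters, indexed directly by the hash value.
k = 6
table = [0] * 4**k
table[forward_hash("ATGGAC")] += 1
```

Since every k-mer gets its own slot, there are no collisions to resolve; the price is that the table grows as 4**k.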

Right now, only the Python interface is documented here. The C++ interface is essentially identical; if you need to use it and want it documented, drop me a line.

Counting Speed and Memory Usage

On the 5 Mb Shewanella oneidensis genome, khmer takes less than a second to count all k-mers, for any k between 6 and 12. At k=13 it craps out because the table goes over my default stack size limit.

Approximate memory usage can be calculated by finding the size of a long long on your machine and then multiplying that by 4**k. For a 12 bp word size, the table has 4**12 = 16384*1024 entries; on an Intel-based processor running Linux, long long is 8 bytes, so memory usage is approximately 128 MB.
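
The same arithmetic can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: ktable_bytes is a hypothetical helper, not part of khmer, and it counts the counters alone, ignoring any other overhead.

```python
import ctypes

def ktable_bytes(k, counter_bytes=ctypes.sizeof(ctypes.c_longlong)):
    """Approximate table size: one long long counter per possible k-mer."""
    return counter_bytes * 4**k

for k in (6, 12, 13):
    print("k=%d: %.2f MB" % (k, ktable_bytes(k) / (1024.0 * 1024.0)))
```

With 8-byte counters, each step up in k multiplies the table size by four, so k=13 already needs 512 MB.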

Python interface

Essentially everything requires a ktable.

import khmer
ktable = khmer.new_ktable(L)

These commands will create a new ktable of size 4**L, suitable for counting L-mers.

Each ktable object has a few accessor functions:

  • ktable.ksize() will return L.
  • ktable.max_hash() will return the max hash value in the table, 4**L - 1.
  • ktable.n_entries() will return the number of table entries, 4**L.

The forward and reverse hashing functions are directly accessible:

  • hashval = ktable.forward_hash(kmer) will return the hash value of the given kmer.
  • kmer = ktable.reverse_hash(hashval) will return the kmer that hashes to the given hashval.

There are also some counting functions:

  • ktable.count(kmer) will increment the count associated with the given kmer by one.
  • ktable.consume(sequence) will run through the sequence and count each kmer present.
  • n = ktable.get(kmer|hashval) will return the count associated with the given kmer string or the given hashval, whichever is passed in.
  • ktable.set(kmer|hashval, count) will set the count for the given kmer string or hashval.

In all of the cases above, 'kmer' is an L-length string, 'hashval' is a non-negative integer, and 'sequence' is a DNA sequence containing ONLY A/C/G/T.
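
As a rough picture of what consume does, the sketch below slides a window of length k along the sequence and counts every k-mer it sees. It is not khmer's implementation: it uses a plain dict keyed by k-mer string instead of a ktable, and it assumes that overlapping windows are counted, one per starting position.

```python
def consume(counts, sequence, k):
    """Count every overlapping k-mer in sequence into the dict counts."""
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        counts[kmer] = counts.get(kmer, 0) + 1
    return counts

counts = consume({}, "ATATAT", 2)
```

A sequence of length n contributes n - k + 1 k-mers, so the six-base string above yields five 2-mers in total.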

Note: 'N' is not a legal DNA character as far as khmer is concerned!
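
Real sequence data often contains N or other ambiguity codes, so one way to cope is to cut the sequence into clean stretches before counting. The helper below is a hypothetical sketch, not part of khmer; it also assumes that lowercase bases should be treated as uppercase.

```python
import re

def acgt_runs(sequence, k):
    """Return the maximal A/C/G/T-only stretches that are at least k long."""
    # Assumption: lowercase input is acceptable once uppercased.
    runs = re.split(r"[^ACGT]+", sequence.upper())
    return [run for run in runs if len(run) >= k]

runs = acgt_runs("ACGTNNACGTACN", 5)
```

Each returned stretch can then be passed to ktable.consume on its own. Stretches shorter than k are dropped because they contain no complete k-mer.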

And, finally, there are some set operations:

  • ktable.clear() empties the ktable.
  • ktable.update(other) adds all of the entries in other into ktable. The wordsize must be the same for both ktables.
  • intersection = ktable.intersect(other) returns a ktable where only nonzero entries in both ktables are kept. The count for each entry is the sum of the counts in ktable and other.
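
The semantics of update and intersect can be sketched on plain Python lists of counts, one slot per hash value. These functions are illustrations of the behavior described above, not khmer's code; the size check in intersect is an assumption, since the text only states the same-wordsize requirement for update.

```python
def update(counts, other):
    """Add every count in other into counts, in place."""
    if len(counts) != len(other):
        raise ValueError("tables must have the same word size")
    for i, n in enumerate(other):
        counts[i] += n

def intersect(counts, other):
    """Keep only slots that are nonzero in both tables; sum their counts."""
    if len(counts) != len(other):  # assumed requirement, see lead-in
        raise ValueError("tables must have the same word size")
    return [a + b if a and b else 0 for a, b in zip(counts, other)]

a = [1, 0, 2, 0]
b = [3, 4, 0, 0]
both = intersect(a, b)  # only slot 0 is nonzero in both tables
update(a, b)            # a now holds the element-wise sums
```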

An Example

This short code example will count all 6-mers present in the given DNA sequence, and then print them all out along with their prevalence.

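import khmer

# make a new ktable, L=6
ktable = khmer.new_ktable(6)

# count all k-mers in the given string
# (this sequence is only a placeholder; substitute your own A/C/G/T string)
sequence = "ATGAGAGACACAGGGAGAGACCCAATTAGAGAATTGGACC"
ktable.consume(sequence)

# run through all entries. if they have nonzero presence, print.
for i in range(0, ktable.n_entries()):
    n = ktable.get(i)
    if n:
        print("%s is present %d times." % (ktable.reverse_hash(i), n))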

And that's all, folks... Let me know if there's other functionality that you think is important.

CTB, 3/2005