cows: Collections for wildcard strings¶
cows (collections for wildcard strings) is a Python library that provides efficient collection implementations where equality checking allows for wildcards in both the search string and the strings already in the collection.
Motivation¶
cows was developed for a common problem in bioinformatics: given a set of DNA
sequences with the alphabet A
, T
, C
, G
, along with a wildcard
N
(indicating that the base is unknown), find the unique sequences and
perform some operation on them. Examples of the operation are: counting how
many times each unique sequence occurs and generate a consensus sequence for
each unique sequence.
For a simple example, for counting unique sequences consider the following input and desired output:
input output
----- ------
ATNG ATNG 2 # Comprised of ATNG and ATCN
ATCN ANNT 1
ANNT GTTC 1
GTTC
Notice this task requires comparing strings with wilcards not just in one
string, but in both. For example, matching ATCN
to ATNG
requires that
the third and fourth characters both be considered wildcards.
Naively one could pairwise compare the sequences, ignore the positions where
either contains an N
, and check if all other positions match. However,
this quickly becomes intractable as it scales with the square of the number of
sequences.
cows uses a modified implementation of atrie (cows.trie
) to reduce
this complexity to scale linearly with the number of sequences.
Provided Data Structures¶
Below are examples for the data structures included with cows. Please see the documentation in Data Structure Reference for detailed API information.
cows.List
¶
A cows.list
is a simple list implementation where insertion functions
similarly to the builtin list
data structure, but accessor methods take
into account ambiguity. For example:
l = cows.List(['ABCD', 'ABC*', '****', 'DEFG'])
print(l.index('D***'))
The print statement outputs 2
since the first match for D***
is at
position 2 (with a value of ****
).
cows.Set
¶
A cows.set
stores unique strings similar to the builtin set
data
structure. Instead of using hashes for equality checks, the underlying
cows.trie
is used to check if the pattern being inserted matches any
existing member of the set, taking into account wildcards in both. For
example:
import cows
s = cows.Set(wildcard='*')
s.add('ABCD')
s.add('*EFG')
s.add('T')
s.add('ABC*') # Matches ABCD, so not added
s.add('HEF*') # Matches *EFG, so not added
print(s)
Produces:
cows.Set(['*EFG', 'ABCD', 'T'])
cows.Dict
¶
cows dictionaries are similar to the builtin dict
type insofar as they are
key/value stores. They have a few key differences, however.
First, when setting a value, if there is an existing (potentially ambiguous)
match already in the dictionary, you can set an updater
function to update
the existing value rather than simply overwrite it. Further, when inserting a
key/value pair, multiple existing keys may match the new key due to ambiguity.
Specifying a selector
function at instantiation lets you define to which of
the matches the updater
should be applied.
See cows.dictionary
for more detailed information.
import cows
def increment(match, old_value, new_value):
return old_value + new_value
my_dict = cows.Dict(updater=increment)
my_dict['ABC'] = 1
my_dict['DEF'] = 2
my_dict['AB*'] = 10
for k, v in sorted(my_dict.items()):
print('{} --> {}'.format(k, v))
Produces:
ABC --> 11
DEF --> 2
cows.Trie¶
Note
Generally the cows.trie
data structure shouldn’t be used
directly. Consider using one of its abstractions.
All other cows data structures are based on the cows.trie
class. It
allows for ambiguous queries taking into account wildcards both in the query
string and elements in the trie.
An example of it’s use:
import cows
t = cows.Trie()
t['ABCD'] = 1
t['DE*G'] = 5
print('Matches for ABC* {}'.format(list(t.get_matches("ABC*"))))
print('Matches for D*FG {}'.format(list(t.get_matches("D*FG"))))
Outputs:
Matches for ABC* [('ABCD', cows.Trie(D, 1))]
Matches for D*FG [('DE*G', cows.Trie(G, 5))]
Performance¶
cows is performant, requiring O(n) time for insertions and lookups with an input size of n strings. The naive approach which is currently quite common involves pairwise comparing the sequences in a collection resulting in O(n2), quickly becoming intractable.