Earth Hour

Sunday, November 8, 2009

An Effective way of removing duplicates in a file

I remember just a few days back I was asked to design a program to remove duplicates from a file. Well the problem seems quite straight forward and simple to approach, but the real hidden question here lies in, how effective can anyone come up with an algorithm to solve this problem. Initially thinking on the tracks of using hashtables, maps, sets etc. I finally landed up designing a custom 'tree' implementation for getting almost the optimal solution which would remove duplicates from a file in a single iteration! (or in O(n) time, for my geek friends! )

Well while surfing internet, I later found that the tree concept I used to solve this problem is pretty much similar to a standard implementation called 'Suffix tree'. Here I am not going to explain you about suffix trees, you can find plenty of matter on this once you google it, rather, I would be explaining how my solution to this problem worked. Well, I don't claim it to be the most optimal implementation, I am sure, someone out there may come up with a solution better than mine. But this one solves the purpose just fine.

The logic here is to keep building the custom tree (which now we know is similar to suffix tree) as we go through the text. Here is the basic algorithm:
1) Iterate through the text and do the following steps
1.1) As the beginning of a new word is encountered, start matching it starting from the root node to its children, successively. along with buffering the current word as we proceed.
1.2) If a match is found, try to match the next character in the word with its children
1.3) If a match is not found, add another node representing the current character as a child to the current node.
1.4) When the word ends, and the last character is matched, check its 'final' tag.
1.4.1) if its true, it means the current word is the repeated one, so don't output the current buffer word
1.4.2) if its false, it means the current word is encountered for the first time, so output the current buffer and flush the buffer.

Eg. let the text be "CAT CAN BAT". The tree for this sentence will be -


ROOT
/\
C B
/ \
A A
/ \ \
T* N* T*

Note: the (*) mark represents the final state.

Here goes my C/C++ implementation

No comments:

Post a Comment

You can drop me a message here. I will try to get back to you as soon as possible