Friday, July 18, 2014

Find first non repeating character in stream of characters

Problem Statement

There is a stream of characters, we need to tell the first non repeating character at any given point of time.


If we have a predefined set of characters and we need to figure out first non repeating character, then it is very simple. Create a hash table keyed on character and as and when we encounter a character in given set, we increase the count. Once we have processed all characters in set, we can go through the set again and return the first character which has count of occurrence of character is 1.

However, this approach cannot work when we have a continuous stream of character. Now let's look for some hints. We need to find the first non repeating character till a given point of time. If we we query at time t0 and next instance of character comes at t1 where t1> t0, we will return the character at t0. Once we have hit the second instance of the character, we will discard that character. But what will be the first non repeating character at t1?

So we need to store all the characters which are non repeating till t0 and when queried for first non repeating character, we will return the first character in the list. Got some hint? There is a pattern of characters coming in and leaving out and that is First come first out. Which data structure do we need to implement that? Yes, Queues. Done we finalized one thing : To get non repeating character at any time we need to implement queue and store each character which is seen only once in that queue.

Now how about remove a character when character occurs again. We need to remove that node from the queue. This node can be at any place in the queue. What is the best structure where to implement queue when we have to delete node from any place, from start, end or middle ? That is doubly linked list.  So till now we have figured out how to delete a character entry from queue and how to get the first non repeating character.
However, how to find the exact node to be deleted in queue implemented using doubly linked list. Either we traverse DLL each time, or we can store a mapping. Mapping will be between character and node pointer in DLL.

Now to check if this is character is seen two or more times, we can use a hash where we make true to character entry when we see character second time. So next time on wards we will not do any processing on that character.

Great! We got data structures. Now let's summarize algorithm.
  1. Take the input character. Check in 'visited' array if this character is seen more than two time.
  2. If yes, don't do anything, process next character.
  3. If it not seen two times yet, then we need check if it is seen first time.
  4. If seen first time, DDL node entry corresponding to that character will be null. Add node in DLL and update the hash table.
  5. If character is already seen once, we will have a node entry in hash table. We will remove that node from queue and marked 'visited' entry as visited.
Let's work out an example:
Stream of characters is : a,t,r,e,r,a,p,p,p,a,u,i,b,t,.............Continues
At first all our containers are false and NULL, queue is empty. 
Initial state

Character 'a' comes : Looking at visited array,we see that it is definitely not seen more than twice, else the place would have been marked true. Also looking at node map array we can see that this character is still not in queue, it means we have seen it first time. So we add this to queue and store pointer in node map array.
Data structure condition :

Now comes 't', same above conditions are true, hence 't' is also added to queue.
Similarly for r and e too.

When first instance of a character is seen

Twist comes when 'r' is seen again. This time we have a node pointer stored in node map array corresponding to 'r', but visited flag is false. This means we have seen 'r' second time. We till remove node 'r' from queue and mark visited['r'] as true, hence forth character will not considered as first non repeated character. So our data structures at this point look like this:
When second instance of character is seen
After this if query for first non repeated character, output will be 'a'.

We process next character 'a'. Again same fate as above 'r'. We will remove node from the queue and marked visited['a'] as true.

After this if query for first non repeated character, output will be 't'.

This will continue is similar fashion. Please comment if further explanation is required on example.


Complexity Analysis

Complexity of above code will be O(1) to find first non repeating character at a given point of time. We would use NO_OF_CHARACTERS extra space. 

Blogger Tricks