Sunday, December 14, 2014

Find N most frequently occurring words in a file

Let's solve a problem which involves two concepts: hash maps and heaps.

Problem statement

Given a dictionary-like text file, find the top N occurring words in it, i.e. the N words whose counts are the highest. In other words, find the most frequently occurring words.
For example, if given file is like follows:
one, two, two, three, three, three, four, four, four, four, five, five, five, five, five, six, six, six, six, six, six
and we are looking for the three most frequent words, the output should be: six, five, four.

Analysis

The first step in solving this problem is to read the file. We can create a file object, open the file, and read it word by word. In this code, we are using C++ streams to read the file; C code to read a file can be found here.

Second, we need to count the frequency of each word: whenever a word is read, we increment the count corresponding to that word. Which data structure is best suited for this kind of operation? A hash map, of course. A hash map gives us (average) constant time to look up a string and increment its count.
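As a minimal sketch of these two steps, assuming the words in the file are separated by whitespace and using std::unordered_map as the hash (the implementation notes below mention map, which works the same way here):

#include <fstream>
#include <string>
#include <unordered_map>

// Read whitespace-separated words from a file and count how often each occurs.
std::unordered_map<std::string, int> countWords(const std::string &fileName) {
    std::unordered_map<std::string, int> counts;
    std::ifstream in(fileName);
    std::string word;
    while (in >> word) {
        ++counts[word];   // operator[] inserts the word with count 0 on first sight
    }
    return counts;
}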

Now that we have the list of words and the corresponding frequency of each word, we need to figure out the top N most frequently appearing words.
The rule of thumb says that whenever we have to pick the top or the least from a collection of items, heaps are the best bet. To find the top N, we will go with a min heap.
This is how it works:
  • Take the first N words appearing in the map along with their counts and create a min heap with these N elements.
  • Now, read each remaining word from the hash one by one and check whether its frequency is more than the minimum in the heap, i.e. the top of the heap.
  • If yes, remove the top and add this word to the heap. If not, continue with the next word.
  • When done with all words in the hash, the min heap will contain the N most frequently occurring words in the file (a small sketch of this loop follows the list).
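Here is a minimal sketch of this selection loop, assuming the counts map built in the earlier sketch. For brevity it pairs each count with its word and relies on std::greater to get a min heap, rather than the custom object and comparator described under the implementation notes below; the name topNWords is illustrative.

#include <queue>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Pick the n most frequent words from the counts map using a min heap
// keyed on the count, so the smallest count sits on top.
std::vector<std::string> topNWords(
        const std::unordered_map<std::string, int> &counts, size_t n) {
    using Entry = std::pair<int, std::string>;   // (count, word)
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> minHeap;

    for (const auto &kv : counts) {
        if (minHeap.size() < n) {
            minHeap.push({kv.second, kv.first});     // fill the first n slots
        } else if (kv.second > minHeap.top().first) {
            minHeap.pop();                           // evict the current minimum
            minHeap.push({kv.second, kv.first});
        }
    }

    std::vector<std::string> result;
    while (!minHeap.empty()) {                       // least frequent comes out first
        result.push_back(minHeap.top().second);
        minHeap.pop();
    }
    return result;
}

With the sample file above and n = 3, this returns four, five, six (least frequent first).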

Finding the K smallest elements using a min heap is also explained here: Heaps : Find K smallest element from a given array

Implementation notes:

The problem has been solved using the STL in C++, with the map, vector and priority_queue containers.
A hash map is created to map each string to an object which contains the string and its frequency, to be used for the min heap.
Also, the default priority_queue in the STL gives a max heap; to use it as a min heap, a new comparator is written (a sketch follows).
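A minimal sketch of such a comparator, assuming a small WordFrequency struct standing in for the string-plus-frequency object; the struct and names are illustrative, not taken from the original code:

#include <queue>
#include <string>
#include <vector>

// Illustrative pair of a word and its frequency.
struct WordFrequency {
    std::string word;
    int count;
};

// Comparator that puts the *smallest* count on top, turning the default
// max-heap behaviour of priority_queue into a min heap.
struct CompareByCount {
    bool operator()(const WordFrequency &a, const WordFrequency &b) const {
        return a.count > b.count;   // "greater than" comparison => min heap
    }
};

using MinHeap =
    std::priority_queue<WordFrequency, std::vector<WordFrequency>, CompareByCount>;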

Code

Complexity analysis

Reading and counting M words from the file takes O(M) average time, whereas creating the initial N-element heap takes O(N).
Scanning through the remaining words takes O((M-N) log N), since each heap removal and insertion costs O(log N).
Hence the overall complexity is O(M log N).

The same problem can be asked in another way:
Given a file with a million points (x, y), where x and y are the coordinates of a point, find the 100 points closest to the origin.
Just update the hash to store each point and its distance. Then create a max heap with the first 100 points, read the remaining points one by one, and update the max heap accordingly: whenever a point is closer than the farthest one on top of the heap, evict the top and add the new point. At the end, the max heap will contain the 100 points closest to the origin.
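A minimal sketch of that variant, assuming the points are already read into a vector; it compares squared distances (the square root is not needed for ordering) and keeps the farthest candidate on top of the heap. The Point struct and function names are illustrative.

#include <queue>
#include <utility>
#include <vector>

struct Point {
    double x, y;
};

// Squared distance from the origin; good enough for comparing points.
double squaredDistance(const Point &p) {
    return p.x * p.x + p.y * p.y;
}

// Keep the k points closest to the origin using a max heap keyed on distance:
// the farthest current candidate sits on top and is evicted when a closer
// point arrives.
std::vector<Point> closestPoints(const std::vector<Point> &points, size_t k) {
    using Entry = std::pair<double, Point>;   // (distance, point)
    auto farther = [](const Entry &a, const Entry &b) { return a.first < b.first; };
    std::priority_queue<Entry, std::vector<Entry>, decltype(farther)> maxHeap(farther);

    for (const Point &p : points) {
        double d = squaredDistance(p);
        if (maxHeap.size() < k) {
            maxHeap.push({d, p});
        } else if (d < maxHeap.top().first) {
            maxHeap.pop();                     // drop the current farthest point
            maxHeap.push({d, p});
        }
    }

    std::vector<Point> result;
    while (!maxHeap.empty()) {                 // farthest of the kept points first
        result.push_back(maxHeap.top().second);
        maxHeap.pop();
    }
    return result;
}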