Monday 17 November 2014

Display top 10 trending words

Problem:
There is a big file of words which is dynamically changing. We are continuously adding some words into it. How would you keep track of top 10 trending words at each moment?

Solution:
To solve this problem efficiently we need to look at two aspects:
a. We  must be able to detect if the word is a new word or it occurred previously. This can be done using data structure like Binary Search tree, Hash map, tries etc.
b. At any moment we must be able to retrieve top 10 words. This can be achieved using data structures like linked list, min heap or sorted array list.

The code mentioned below use hashmap and linked list. The linked list contain one node for every unique word and also maintain frequency of that word. The list is sorted at any given point of time, so first 10 nodes will be top 10 trending words.

To maintain mapping of nodes in the linked list and word, we use a hash map.

Adding new word : o(words)

 import java.util.HashMap;  
 import java.util.Scanner;  
 public class Top10Words {  
   public static void main(String[] args) {  
     Scanner s = new Scanner(System.in);  
           while(true) {  
                String word = s.next();  
                if(word.equals("-1")) {  
                     break;  
                } else if (word.equals("-2")) {  
                  printTopTen();  
                }else {  
                  addWord(word);  
                }  
           }  
           s.close();  
   }  
   public static HashMap<String,Node> nodeMap = new HashMap<String,Node>();  
   public static Node head = null;  
   public static Node tail = null;  
   public static void addWord(String word) {  
     if(nodeMap.containsKey(word)) {  
       Node p = nodeMap.get(word);  
       p.count++;  
       // move this node up the list if needed  
       while(p.prev != null && p.count > p.prev.count) {  
               Node t = p.prev;  
                  if(t.prev != null) {  
                   t.prev.next = p;  
                  }  
                  if(p.next != null) {  
                   p.next.prev = t;  
                  }  
                  t.next = p.next;  
                  p.prev = t.prev;  
                  p.next = t;  
                  t.prev = p;  
                  if(tail == p) {  
                   tail = t;  
                  }  
                  if(head == t) {  
                   head = p;  
                  }  
             }  
     } else {  
                // add new node to list and make entry in node map  
                Node newNode = new Node(word);  
                if(tail == null) {  
                     tail = newNode;  
                     head = newNode;  
                } else {  
                     tail.next = newNode;  
                     newNode.prev = tail;  
                     tail = newNode;  
       }  
       nodeMap.put(word,newNode);  
     }  
   }  
   public static void printTopTen() {  
     if(head == null) {  
         System.out.println("No Words till now.");  
        } else {  
          Node iter = head;  
             int i=0;  
             while(iter != null && i < 10) {  
              i++;  
                 System.out.println("Word: " + iter.word + " freq: " + iter.count);  
                 iter = iter.next;  
             }  
       }  
   }   
 }  
 class Node {  
   int count;  
   String word;  
   Node next;  
   Node prev;  
   public Node(String word) {  
    this.word = word;  
    this.count = 0;  
    this.next = null;  
    this.prev = null;  
   }   
 }  

As a improvement to above solution, we can modify the information that is stored in the hash map and linked list.

In the hashmap, store following information along with the word as key.
a. Frequency of the word
b. Reference to the node in the list. This will be null if the word is not currently stored in the list.

The linked list (doubly linked list) will have 10 nodes. Also maintain pointer to the head and tail of node.

Algorithm to add a new word:
a. If word does not exist in the hash map then add word to the hash map with frequency as 1. Also check if word should be added to the list (comparing with tail of the list). If word is added to the list, set node reference to the node in the list otherwise null.
b. If word exist in both hash map and list , increment the frequency by 1 in hash map and linked list.Update the linked list to maintain sorted order.
c. If word exist in hash map but not in linked list. Increment the frequency by 1 in hash map. Compare the tail value with new word frequency and check if word should be added to list by replacing the tail node. Readjust the linked list if required.
Note: When a node is deleted from linked list, set the node reference of corresponding hash map entry as null.

There are other possible solutions as well using other data structures. Trie can be used in place of hash map. Sometime it is not good practice to store lots of word in hash map. In such situations trie can be used.

In place of linked list we can also use min heap. It will reduce the complexity of adding new word from O(n) to O(log n). It will be more beneficial if we have to keep track of more words.

Solution using Trie and min Heap

Trie will be used to search for the word. It will also maintain frequency of the word.

Min Heap will be used to maintain top 10 words at any moment.

Algorithm:
  • Search the word in the Trie. If word exists, increment the frequency count.
    • Call min heapify function from index at which current word is stored in heap. If any node is pushed down/up the heap, keep updating reference index in the trie that indicate position of node in heap.
  • If Word does not exist, add the word and make frequency count 1
    • Compare the frequency of min element in heap with the new word. 
    • If the frequency is less than current word frequency or the size of the heap is less than maximum size of heap
      • Set the reference of min heap element to -1 indicate that it is no longer part of heap.
      • Add the node to the heap at index 1 and call min heapify function from index 1. If any node is pushed down the heap, keep updating reference index in the trie that indicate position of node in heap.

At any moment element in min heap represent top 10 trending words.


 import java.util.Scanner;  
 public class DisplayTop10TrendingWord {  
      public static void main(String[] args) {  
           Scanner s = new Scanner(System.in);  
           Heap heap = new Heap(4);  
           while (s.hasNext()) {  
                String word = s.next();  
                TrieNode node = Trie.getWordNode(word);  
                node.frequency++;  
                if (node.reference != -1) {  
                     heap.minHeapify(node.reference);  
                } else {  
                     heap.addNodeToHeap(node);  
                }  
                for (int i = 1; i <= heap.heapsize; i++) {  
                     System.out.print(heap.heap[i].word + "("  
                               + heap.heap[i].frequency + ") ");  
                }  
                System.out.println();  
           }  
           s.close();  
      }  
 }  
 class TrieNode {  
      int frequency;  
      TrieNode[] childs;  
      int reference;  
      String word;  
      public TrieNode() {  
           this.frequency = 0;  
           this.childs = new TrieNode[26];  
           this.reference = -1;  
           this.word = null;  
      }  
 }  
 class Trie {  
      private static final TrieNode root = new TrieNode();  
      public static TrieNode getWordNode(String word) {  
           TrieNode node = root;  
           for (int i = 0; i < word.length(); i++) {  
                char ch = word.charAt(i);  
                if (node.childs[ch - 'a'] == null) {  
                     TrieNode child = new TrieNode();  
                     node.childs[ch - 'a'] = child;  
                }  
                node = node.childs[ch - 'a'];  
           }  
           node.word = word;  
           return node;  
      }  
 }  
 class Heap {  
      TrieNode[] heap;  
      int maxSize;  
      int heapsize;  
      public Heap(int maxSize) {  
           this.maxSize = maxSize;  
           this.heapsize = 0;  
           heap = new TrieNode[maxSize + 1];  
      }  
      public static int parent(int index) {  
           return index / 2;  
      }  
      public static int left(int index) {  
           return index * 2;  
      }  
      public static int right(int index) {  
           return index * 2 + 1;  
      }  
      public void minHeapify(int index) {  
           int minIndex = index;  
           int left = left(index);  
           int right = right(index);  
           if (left <= heapsize && heap[left].frequency < heap[minIndex].frequency) {  
                minIndex = left;  
           }  
           if (right <= heapsize  
                     && heap[right].frequency < heap[minIndex].frequency) {  
                minIndex = right;  
           }  
           if (minIndex != index) {  
                TrieNode temp = heap[index];  
                heap[index] = heap[minIndex];  
                heap[index].reference = index;  
                heap[minIndex] = temp;  
                heap[minIndex].reference = minIndex;  
                minHeapify(minIndex);  
           }  
      }  
      public void addNodeToHeap(TrieNode node) {  
           if (heapsize >= maxSize && heap[1].frequency >= node.frequency) {  
                node.reference = -1;  
           } else if (heapsize < maxSize) {  
                heap[++heapsize] = node;  
                int index = heapsize;  
                node.reference = index;  
                while (index > 1  
                          && heap[parent(index)].frequency > heap[index].frequency) {  
                     TrieNode temp = heap[parent(index)];  
                     heap[parent(index)] = heap[index];  
                     heap[parent(index)].reference = parent(index);  
                     heap[index] = temp;  
                     heap[index].reference = index;  
                     index = parent(index);  
                }  
           } else {  
                int index = 1;  
                heap[index].reference = -1;  
                heap[index] = node;  
                node.reference = index;  
                minHeapify(index);  
           }  
      }  
 }  

No comments:

Post a Comment