Top K words in a document
- Basic solution
a. use a Hash table to record all words' frequency while traverse the whole word sequence. In this phase, the key is "word" and the value is "word-frequency". This takes O(n) time.
b. sort the (word, word-frequency) pair; and the key is "word-frequency". This takes O(n*lg(n)) time with normal sorting algorithm.
c. After sorting, we just take the first K words. This takes O(K) time.
the total time is O(n+nlg(n)+K)
- use heap instead of complete sort
b'. build a heap of (word, word-frequency) pair with "word-frequency" as key. It takes O(n) time to build a heap;
c' extract top K words from the heap. Each extraction is O(lg(n)). So, total time is O(k*lg(n)).
To summarize, this solution cost time O(n+k*lg(n)).
- use bucket sort to reduce the time complexity
- large number of words: Have n map workers count frequencies on 1/nth of the text each, and for each word, send it to one of m reducer workers calculated based on the hash of the word. The reducers then sum the counts. Merge sort over the reducers' outputs will give you the most popular words in order of popularity
- use trie and heap
ref: http://www.geeksforgeeks.org/find-the-k-most-frequent-words-from-a-file/
- Optimized: use quick select O(n) time and O(k) space
ref: http://www.zrzahid.com/top-k-or-k-most-frequent-words-in-a-document/
public String[] topKWordsSelect(final String stream, final int k) { final Map