Building a Word Frequency Analyzer in Go: A Step-by-Step Guide
In this blog post, we will walk through a Go program that reads text input, counts the frequency of each unique word, and sorts and displays these words by their frequency of occurrence. This program is a great example of how to efficiently process text using Go’s standard library, especially the bufio and sort packages.
Introduction
Text processing is a common task in programming, whether it’s for analyzing logs, parsing documents, or performing natural language processing. One of the fundamental tasks in text processing is counting how many times each word appears in a given input. In this tutorial, we’ll write a Go program that reads text input, counts the occurrences of each word, and then sorts the words by their frequency.
The Program
Here’s the complete Go program we’ll be discussing:
Let’s break down the code step by step to understand how it works.
1. Setting Up the Scanner
The program begins by setting up a `Scanner` to read input from standard input (os.Stdin). The bufio package provides a convenient way to read input efficiently:
```go
scan := bufio.NewScanner(os.Stdin)
```
The Scanner is a powerful tool for reading text, as it can be easily configured to split the input into words, lines, or custom tokens.
2. Configuring the Scanner to Split by Words
By default, the `Scanner` reads input line by line. However, since we want to count individual words, we need to configure the scanner to split input by words using bufio.ScanWords:
```go
scan.Split(bufio.ScanWords)
```
The `bufio.ScanWords` function tells the scanner to treat each word as a separate token, which makes it easy to process each word individually.
3. Counting Word Frequencies
Next, we use a map to keep track of how many times each word appears in the input. Maps in Go are a key-value data structure, making them ideal for this purpose:
```go
words := make(map[string]int)
for scan.Scan() {
	words[scan.Text()]++
}
```
- `scan.Scan()`: Advances the scanner to the next word, returning false when the input is exhausted.
- `scan.Text()`: Retrieves the current word as a string.
- `words[scan.Text()]++`: Increments the count for the word in the map. If the word isn’t already in the map, indexing returns the zero value 0, so the first increment stores 1.
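The increment relies on a handy Go property: indexing a map with a missing key yields the value type’s zero value (0 for `int`), so no explicit existence check is needed. A minimal illustration:

```go
package main

import "fmt"

func main() {
	words := make(map[string]int)
	words["hello"]++ // missing key reads as 0, so this stores 1
	words["hello"]++
	fmt.Println(words["hello"]) // 2
	// Merely reading a missing key does not insert it.
	fmt.Println(words["absent"], len(words)) // 0 1
}
```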
4. Displaying the Number of Unique Words
Once all the input has been processed, we can determine how many unique words were found by checking the length of the map:
```go
fmt.Println(len(words), "unique words")
```
The built-in `len` function returns the number of keys in the map, which corresponds to the number of unique words.
5. Preparing Data for Sorting
To sort the words by their frequency, we need to move the word-count pairs from the map into a slice of structs. We define a simple data struct to hold each word and its count:
```go
type data struct {
	k string
	v int
}

var s []data
for k, v := range words {
	s = append(s, data{k, v})
}
```
Here, we iterate over the map, appending each word and its count to the slice s.
6. Sorting the Words by Frequency
Go’s sort package allows us to sort slices in a flexible manner. We use sort.Slice to sort our slice of data structs in descending order based on the word count:
```go
sort.Slice(s, func(i, j int) bool {
	return s[i].v > s[j].v
})
```
This sorting function compares the counts (v) of two elements and orders them so that the words with higher counts come first.
7. Displaying the Sorted Results
Finally, we iterate over the sorted slice and print each word along with its frequency:
```go
for _, d := range s {
	fmt.Println(d.k, "appears", d.v, "times")
}
```
This loop prints each word followed by the number of times it appeared in the input.
Conclusion
In this tutorial, we’ve walked through a Go program that reads text input, counts the frequency of each unique word, and sorts the words by their frequency. This program is an excellent demonstration of how to use Go’s bufio and sort packages for efficient text processing.