Find substring

Find substring

February 15, 2023

Moch Lutfi
Name
Moch Lutfi
Twitter
@kaptenupi

Problem

You have two arrays of strings, words and parts. Return an array that contains the strings from words, modified so that any occurrences of the substrings from parts are surrounded by square brackets [], following these guidelines:

If several parts substrings occur in one string in words, choose the longest one. If there is still more than one such part, then choose the one that appears first in the string.

Example

For words = ["Apple", "Melon", "Orange", "Watermelon"] and parts = ["a", "mel", "lon", "el", "An"], the output should be
solution(words, parts) = ["Apple", "Me[lon]", "Or[a]nge", "Water[mel]on"].

While "Watermelon" contains three substrings from the parts array, "a", "mel", and "lon", "mel" is the longest substring that appears first in the string.

Naive solution

We will tackle this problem in a straightforward manner by iterating through each word in the list of words and checking for the presence of substrings from the list of parts in each word. We will record the longest substring that appears earliest in the word.

naive.go
failed

func solution(words []string, parts []string) []string {
result := []string{}
for _, word := range words {
index, size := -1, 0
for _, p := range parts {
if i := strings.Index(word, p); i >= 0 && len(p) > size {
index, size = i, len(p)
}
}
if index >= 0 {
result = append(result, word[:index]+"["+word[index:index+size]+"]"+word[index+size:])
} else {
result = append(result, word)
}
}
return result
}

Let's analyze the performance of the code. With W as the number of words and P as the number of parts, the running time would be at least O(WP) due to the nested loop. In addition, the function calls if p in w and w.index(p) within the loop will have a running time of O(len(W)) as they scan the word w to find p. Although the performance can be improved by calling w.index(p) once and saving the result, the running time would still be O(WPN), where N is the length of a typical word in w.

According to the given constraints, W is at most 5000, N is at most 30 and P is at most 5000. This leads to 750 million loops, which means each loop would have to run in about 5 picoseconds to be completed within 4 seconds. This is a tough requirement, especially considering that the worst case scenario has been taken into account.

However, it's important to note that in real-world scenarios, the worst case scenario may not always be encountered. In the context of job interviews, companies are looking for candidates who can quickly solve the problems of today, rather than waiting for the perfect solution.

It's good to show the interviewer that you understand the complexity of the algorithm and that it's a combination of three multiplicative factors (length of a typical word, number of words, and number of substrings). You can also mention that the code passes the tests, but may not handle the worst case in a reasonable amount of time.

If you want to impress the interviewer, you can mention that there are better algorithms that can be used if there is a valid business case for investing more time in the solution. However, it's important to keep in mind that premature optimization is not advisable.

Using Trie

The main function solution(words []string, parts []string) []string will focus on constructing the trie and bringing together all the words at the end.

Here's the plan for solution:

  1. Initialize the variable lenLongSubstr to 0.
  2. Iterate through each character in the word, using it as the starting point for finding matching substrings in parts.
  3. For each character, utilize the trie to locate the longest substring in parts that begins with that character. If it's longer than the previous longest one, record the current position in the word and the length of the substring.
  4. Use the length of the longestSeenSubstring and its starting position to properly place the square brackets.

You'll notice that going through the word in order and only updating the position when we encounter a longer substring automatically satisfies the tie-breaking rule. The trickiest part is finding the longest substring from parts that starts at a particular location, as we need to keep track of the final terminal node that we passed before failing to match in the trie. It would be much simpler if we only had to keep track of the last node we were on before the match failed.

trie.go

// TrieNode is a struct that represents a node in a Trie
type TrieNode struct {
letter rune
terminal bool
children map[rune]*TrieNode
}
func newTrieNode(letter rune) *TrieNode {
return &TrieNode{letter: letter, children: make(map[rune]*TrieNode)}
}
func addFragmentToTrie(root *TrieNode, fragment string) {
currentNode := root
for _, letter := range fragment {
if _, ok := currentNode.children[letter]; !ok {
currentNode.children[letter] = newTrieNode(letter)
}
currentNode = currentNode.children[letter]
}
currentNode.terminal = true
}
func findSubstringInWord(w string, root *TrieNode) string {
var lenLongSubstr, longestPos int
for startPos := range w {
currNode := root
for position := startPos; position < len(w); position++ {
letter := rune(w[position])
if _, ok := currNode.children[letter]; !ok {
break
}
currNode = currNode.children[letter]
length := position - startPos + 1
if currNode.terminal && length > lenLongSubstr {
lenLongSubstr = length
longestPos = startPos
}
}
}
if lenLongSubstr == 0 {
return w
}
end := longestPos + lenLongSubstr
return w[:longestPos] + "[" + w[longestPos:end] + "]" + w[end:]
}
func solution(words []string, parts []string) []string {
root := newTrieNode(' ')
for _, p := range parts {
addFragmentToTrie(root, p)
}
result := make([]string, len(words))
for i, w := range words {
result[i] = findSubstringInWord(w, root)
}
return result
}

Our approach continues to loop through each word, but there is one limitation that hasn't been utilized yet - the length of each element in parts. In the worst-case scenario, we may need to check up to 5 levels at each position in each word. To be precise:

  • The number of words, W = 5000;
  • The number of letters in a word, N = 30;
  • The number of levels to check in the trie, P = 5.

Compared to the initial naive implementation, this is 1000 times more efficient.