Published on

Find substring

Authors

Problem

You have two arrays of strings, words and parts. Return an array that contains the strings from words, modified so that any occurrences of the substrings from parts are surrounded by square brackets [], following these guidelines:

If several parts substrings occur in one string in words, choose the longest one. If there is still more than one such part, then choose the one that appears first in the string.

Example

For words = ["Apple", "Melon", "Orange", "Watermelon"] and parts = ["a", "mel", "lon", "el", "An"], the output should be
solution(words, parts) = ["Apple", "Me[lon]", "Or[a]nge", "Water[mel]on"].

While "Watermelon" contains three substrings from the parts array, "a", "mel", and "lon", "mel" is the longest substring that appears first in the string.

Naive solution

We will tackle this problem in a straightforward manner by iterating through each word in the list of words and checking for the presence of substrings from the list of parts in each word. We will record the longest substring that appears earliest in the word.

naive.go
failed

_17
func solution(words []string, parts []string) []string {
_17
result := []string{}
_17
for _, word := range words {
_17
index, size := -1, 0
_17
for _, p := range parts {
_17
if i := strings.Index(word, p); i >= 0 && len(p) > size {
_17
index, size = i, len(p)
_17
}
_17
}
_17
if index >= 0 {
_17
result = append(result, word[:index]+"["+word[index:index+size]+"]"+word[index+size:])
_17
} else {
_17
result = append(result, word)
_17
}
_17
}
_17
return result
_17
}

Let's analyze the performance of the code. With W as the number of words and P as the number of parts, the running time would be at least O(WP) due to the nested loop. In addition, the function calls if p in w and w.index(p) within the loop will have a running time of O(len(W)) as they scan the word w to find p. Although the performance can be improved by calling w.index(p) once and saving the result, the running time would still be O(WPN), where N is the length of a typical word in w.

According to the given constraints, W is at most 5000, N is at most 30 and P is at most 5000. This leads to 750 million loops, which means each loop would have to run in about 5 picoseconds to be completed within 4 seconds. This is a tough requirement, especially considering that the worst case scenario has been taken into account.

However, it's important to note that in real-world scenarios, the worst case scenario may not always be encountered. In the context of job interviews, companies are looking for candidates who can quickly solve the problems of today, rather than waiting for the perfect solution.

It's good to show the interviewer that you understand the complexity of the algorithm and that it's a combination of three multiplicative factors (length of a typical word, number of words, and number of substrings). You can also mention that the code passes the tests, but may not handle the worst case in a reasonable amount of time.

If you want to impress the interviewer, you can mention that there are better algorithms that can be used if there is a valid business case for investing more time in the solution. However, it's important to keep in mind that premature optimization is not advisable.

Using Trie

The main function solution(words []string, parts []string) []string will focus on constructing the trie and bringing together all the words at the end.

Here's the plan for solution:

  1. Initialize the variable lenLongSubstr to 0.
  2. Iterate through each character in the word, using it as the starting point for finding matching substrings in parts.
  3. For each character, utilize the trie to locate the longest substring in parts that begins with that character. If it's longer than the previous longest one, record the current position in the word and the length of the substring.
  4. Use the length of the longestSeenSubstring and its starting position to properly place the square brackets.

You'll notice that going through the word in order and only updating the position when we encounter a longer substring automatically satisfies the tie-breaking rule. The trickiest part is finding the longest substring from parts that starts at a particular location, as we need to keep track of the final terminal node that we passed before failing to match in the trie. It would be much simpler if we only had to keep track of the last node we were on before the match failed.

trie.go

_60
_60
// TrieNode is a struct that represents a node in a Trie
_60
type TrieNode struct {
_60
letter rune
_60
terminal bool
_60
children map[rune]*TrieNode
_60
}
_60
_60
func newTrieNode(letter rune) *TrieNode {
_60
return &TrieNode{letter: letter, children: make(map[rune]*TrieNode)}
_60
}
_60
_60
func addFragmentToTrie(root *TrieNode, fragment string) {
_60
currentNode := root
_60
for _, letter := range fragment {
_60
if _, ok := currentNode.children[letter]; !ok {
_60
currentNode.children[letter] = newTrieNode(letter)
_60
}
_60
currentNode = currentNode.children[letter]
_60
}
_60
currentNode.terminal = true
_60
}
_60
_60
func findSubstringInWord(w string, root *TrieNode) string {
_60
var lenLongSubstr, longestPos int
_60
_60
for startPos := range w {
_60
currNode := root
_60
for position := startPos; position < len(w); position++ {
_60
letter := rune(w[position])
_60
if _, ok := currNode.children[letter]; !ok {
_60
break
_60
}
_60
currNode = currNode.children[letter]
_60
length := position - startPos + 1
_60
if currNode.terminal && length > lenLongSubstr {
_60
lenLongSubstr = length
_60
longestPos = startPos
_60
}
_60
}
_60
}
_60
_60
if lenLongSubstr == 0 {
_60
return w
_60
}
_60
end := longestPos + lenLongSubstr
_60
return w[:longestPos] + "[" + w[longestPos:end] + "]" + w[end:]
_60
}
_60
_60
func solution(words []string, parts []string) []string {
_60
root := newTrieNode(' ')
_60
for _, p := range parts {
_60
addFragmentToTrie(root, p)
_60
}
_60
result := make([]string, len(words))
_60
for i, w := range words {
_60
result[i] = findSubstringInWord(w, root)
_60
}
_60
return result
_60
}

Our approach continues to loop through each word, but there is one limitation that hasn't been utilized yet - the length of each element in parts. In the worst-case scenario, we may need to check up to 5 levels at each position in each word. To be precise:

  • The number of words, W = 5000;
  • The number of letters in a word, N = 30;
  • The number of levels to check in the trie, P = 5.

Compared to the initial naive implementation, this is 1000 times more efficient.