Find substring

# Find substring

February 15, 2023

Name
Moch Lutfi
@kaptenupi

## Problem

You have two arrays of strings, `words` and `parts`. Return an array that contains the strings from `words`, modified so that any occurrences of the substrings from `parts` are surrounded by square brackets `[]`, following these guidelines:

If several `parts` substrings occur in one string in `words`, choose the longest one. If there is still more than one such part, then choose the one that appears first in the string.

Example

For `words = ["Apple", "Melon", "Orange", "Watermelon"]` and `parts = ["a", "mel", "lon", "el", "An"]`, the output should be
`solution(words, parts) = ["Apple", "Me[lon]", "Or[a]nge", "Water[mel]on"]`.

While `"Watermelon"` contains three substrings from the `parts` array, `"a"`, `"mel"`, and `"lon"`, `"mel"` is the longest substring that appears first in the string.

## Naive solution

We will tackle this problem in a straightforward manner by iterating through each word in the list of words and checking for the presence of substrings from the list of parts in each word. We will record the longest substring that appears earliest in the word.

naive.go
failed
`func solution(words []string, parts []string) []string { result := []string{} for _, word := range words { index, size := -1, 0 for _, p := range parts { if i := strings.Index(word, p); i >= 0 && len(p) > size { index, size = i, len(p) } } if index >= 0 { result = append(result, word[:index]+"["+word[index:index+size]+"]"+word[index+size:]) } else { result = append(result, word) } } return result}`

Let's analyze the performance of the code. With `W` as the number of words and `P` as the number of parts, the running time would be at least O(WP) due to the nested loop. In addition, the function calls `if p in w` and `w.index(p)` within the loop will have a running time of O(len(W)) as they scan the word w to find p. Although the performance can be improved by calling `w.index(p)` once and saving the result, the running time would still be O(WPN), where N is the length of a typical word in w.

According to the given constraints, W is at most 5000, N is at most 30 and P is at most 5000. This leads to 750 million loops, which means each loop would have to run in about 5 picoseconds to be completed within 4 seconds. This is a tough requirement, especially considering that the worst case scenario has been taken into account.

However, it's important to note that in real-world scenarios, the worst case scenario may not always be encountered. In the context of job interviews, companies are looking for candidates who can quickly solve the problems of today, rather than waiting for the perfect solution.

It's good to show the interviewer that you understand the complexity of the algorithm and that it's a combination of three multiplicative factors (length of a typical word, number of words, and number of substrings). You can also mention that the code passes the tests, but may not handle the worst case in a reasonable amount of time.

If you want to impress the interviewer, you can mention that there are better algorithms that can be used if there is a valid business case for investing more time in the solution. However, it's important to keep in mind that premature optimization is not advisable.

## Using Trie

The main function `solution(words []string, parts []string) []string ` will focus on constructing the trie and bringing together all the words at the end.

Here's the plan for `solution`:

1. Initialize the variable `lenLongSubstr` to 0.
2. Iterate through each character in the word, using it as the starting point for finding matching substrings in `parts`.
3. For each character, utilize the trie to locate the longest substring in `parts` that begins with that character. If it's longer than the previous longest one, record the current position in the word and the length of the substring.
4. Use the length of the `longestSeenSubstring` and its starting position to properly place the square brackets.

You'll notice that going through the word in order and only updating the position when we encounter a longer substring automatically satisfies the tie-breaking rule. The trickiest part is finding the longest substring from `parts` that starts at a particular location, as we need to keep track of the final terminal node that we passed before failing to match in the trie. It would be much simpler if we only had to keep track of the last node we were on before the match failed.

trie.go
`// TrieNode is a struct that represents a node in a Trietype TrieNode struct { letter rune terminal bool children map[rune]*TrieNode}func newTrieNode(letter rune) *TrieNode { return &TrieNode{letter: letter, children: make(map[rune]*TrieNode)}}func addFragmentToTrie(root *TrieNode, fragment string) { currentNode := root for _, letter := range fragment { if _, ok := currentNode.children[letter]; !ok { currentNode.children[letter] = newTrieNode(letter) } currentNode = currentNode.children[letter] } currentNode.terminal = true}func findSubstringInWord(w string, root *TrieNode) string { var lenLongSubstr, longestPos int for startPos := range w { currNode := root for position := startPos; position < len(w); position++ { letter := rune(w[position]) if _, ok := currNode.children[letter]; !ok { break } currNode = currNode.children[letter] length := position - startPos + 1 if currNode.terminal && length > lenLongSubstr { lenLongSubstr = length longestPos = startPos } } } if lenLongSubstr == 0 { return w } end := longestPos + lenLongSubstr return w[:longestPos] + "[" + w[longestPos:end] + "]" + w[end:]}func solution(words []string, parts []string) []string { root := newTrieNode(' ') for _, p := range parts { addFragmentToTrie(root, p) } result := make([]string, len(words)) for i, w := range words { result[i] = findSubstringInWord(w, root) } return result}`

Our approach continues to loop through each word, but there is one limitation that hasn't been utilized yet - the length of each element in `parts`. In the worst-case scenario, we may need to check up to 5 levels at each position in each word. To be precise:

• The number of words, W = 5000;
• The number of letters in a word, N = 30;
• The number of levels to check in the trie, P = 5.

Compared to the initial naive implementation, this is 1000 times more efficient.