CSInterviewing
Problems
Register
Login
Problems
Register
Login
menu
Best Matching Substring (Frequency Analysis)
string
slidingwindow
array
# Best Matching Substring (Frequency Analysis) ## Difficulty Medium ## Tags `string`, `sliding window`, `array` ## Problem Statement You are given a text string `text` consisting of lowercase English letters and an integer `k`. You are also provided with a `target_distribution` representing the expected frequency of each character 'a' through 'z'. Find the substring of length `k` in `text` that most closely matches the `target_distribution`. The "closeness" is measured by the **Total Variation Distance**: the sum of absolute differences between the observed frequency of each character in the substring and the expected frequency. `Distance = Sum(|observed_count[c] - expected_count[c]|)` for all c in 'a'...'z'. Where: - `observed_count[c]` is the count of character `c` in the substring. - `expected_count[c]` is `target_distribution[c] * k`. Return the **starting index** of the best matching substring. If there are ties, return the smallest index. ## Constraints - 1 ≤ k ≤ text.length ≤ 10⁵ - `text` consists of lowercase English letters. - `target_distribution` is a list of 26 floats summing to 1.0. ## Signatures ### Python ```python def find_best_match(text: str, k: int, target_distribution: List[float]) -> int: pass ``` ### Java ```java public class Solution { public int find_best_match(String text, int k, List<double> target_distribution) { return null; } } ``` ### Csharp ```csharp public class Solution { public int find_best_match(string text, int k, IList<double> target_distribution) { throw new NotImplementedException(); } } ``` ### Cpp ```cpp class Solution { public: int find_best_match(string text, int k, vector<double> target_distribution) { } }; ``` ### C ```c int find_best_match(char* text, int k, vector<double> target_distribution) { } ``` ### Ruby ```ruby def find_best_match(text, k, target_distribution) end ``` ### Perl ```perl sub find_best_match { my ($self, @args) = @_; } ``` ### Scala ```scala object Solution { def find_best_match(text: String, k: Int, target_distribution: List[Double]): Int = { } } ``` ### Haskell ```haskell find_best_match :: ... -> ... find_best_match args = undefined ``` ### Clojure ```clojure (defn find_best_match [text k target_distribution] ) ``` ### Scheme ```scheme (define (find_best_match text k target_distribution) ) ``` ## Examples ### Example 1 **Input:** ``` text = "aaabbbccc" k = 3 target_distribution = [1.0, 0.0, ..., 0.0] (100% 'a') ``` **Output:** ``` 0 ``` **Explanation:** - Substring at 0: "aaa". Counts: {'a': 3}. Expected: {'a': 3}. Distance: 0. - Substring at 3: "bbb". Counts: {'b': 3}. Expected: {'a': 3}. Distance large. - Best match is at index 0. ### Example 2 **Input:** ``` text = "abacaba" k = 3 target_distribution = [0.33, 0.33, 0.33, ...] (Equal a, b, c) ``` **Output:** ``` 1 ``` **Explanation:** - Window "bac" (index 1) has 1 'a', 1 'b', 1 'c'. - Expected counts for uniform distribution (approx): 1 'a', 1 'b', 1 'c'. - This matches perfectly (or very closely). ## Hints 1. **Sliding Window**: Maintain the counts of the current window of length `k`. 2. **Initialization**: Calculate counts for the first window (0 to k-1). Calculate initial distance. 3. **Update**: Slide window one step to the right. - Decrement count of character leaving the window. - Increment count of character entering the window. - Update the distance metric efficiently. 4. Note that updating the total distance sum might require checking the specific characters that changed count, rather than recomputing the sum over 26 chars every time (though 26 is small constant, so recomputing is fine).
Login required
Please log in to view the solution editor and submit code.
Login
Register