Best Matching Substring (Frequency Analysis)

string

slidingwindow

array

# Best Matching Substring (Frequency Analysis)

## Difficulty
Medium

## Tags
`string`, `sliding window`, `array`

## Problem Statement
You are given a text string `text` consisting of lowercase English letters and an integer `k`. You are also provided with a `target_distribution` representing the expected frequency of each character 'a' through 'z'.

Find the substring of length `k` in `text` that most closely matches the `target_distribution`.

The "closeness" is measured by the **Total Variation Distance**: the sum of absolute differences between the observed frequency of each character in the substring and the expected frequency.

`Distance = Sum(|observed_count[c] - expected_count[c]|)` for all c in 'a'...'z'.

Where:
- `observed_count[c]` is the count of character `c` in the substring.
- `expected_count[c]` is `target_distribution[c] * k`.

Return the **starting index** of the best matching substring. If there are ties, return the smallest index.

## Constraints
- 1 ≤ k ≤ text.length ≤ 10⁵
- `text` consists of lowercase English letters.
- `target_distribution` is a list of 26 floats summing to 1.0.

## Signatures
### Python
```python
def find_best_match(text: str, k: int, target_distribution: List[float]) -> int:
    pass
```

### Java
```java
public class Solution {
    public int find_best_match(String text, int k, List<double> target_distribution) {
        return null;
    }
}
```

### Csharp
```csharp
public class Solution {
    public int find_best_match(string text, int k, IList<double> target_distribution) {
        throw new NotImplementedException();
    }
}
```

### Cpp
```cpp
class Solution {
public:
    int find_best_match(string text, int k, vector<double> target_distribution) {
        
    }
};
```

### C
```c
int find_best_match(char* text, int k, vector<double> target_distribution) {
    
}
```

### Ruby
```ruby
def find_best_match(text, k, target_distribution)
    
end
```

### Perl
```perl
sub find_best_match {
    my ($self, @args) = @_;
    
}
```

### Scala
```scala
object Solution {
    def find_best_match(text: String, k: Int, target_distribution: List[Double]): Int = {
        
    }
}
```

### Haskell
```haskell
find_best_match :: ... -> ...
find_best_match args = undefined
```

### Clojure
```clojure
(defn find_best_match [text k target_distribution]
  )
```

### Scheme
```scheme
(define (find_best_match text k target_distribution)
  )
```

## Examples
### Example 1
**Input:**
```
text = "aaabbbccc"
k = 3
target_distribution = [1.0, 0.0, ..., 0.0] (100% 'a')
```
**Output:**
```
0
```
**Explanation:**
- Substring at 0: "aaa". Counts: {'a': 3}. Expected: {'a': 3}. Distance: 0.
- Substring at 3: "bbb". Counts: {'b': 3}. Expected: {'a': 3}. Distance large.
- Best match is at index 0.

### Example 2
**Input:**
```
text = "abacaba"
k = 3
target_distribution = [0.33, 0.33, 0.33, ...] (Equal a, b, c)
```
**Output:**
```
1
```
**Explanation:**
- Window "bac" (index 1) has 1 'a', 1 'b', 1 'c'.
- Expected counts for uniform distribution (approx): 1 'a', 1 'b', 1 'c'.
- This matches perfectly (or very closely).

## Hints
1. **Sliding Window**: Maintain the counts of the current window of length `k`.
2. **Initialization**: Calculate counts for the first window (0 to k-1). Calculate initial distance.
3. **Update**: Slide window one step to the right.
   - Decrement count of character leaving the window.
   - Increment count of character entering the window.
   - Update the distance metric efficiently.
4. Note that updating the total distance sum might require checking the specific characters that changed count, rather than recomputing the sum over 26 chars every time (though 26 is small constant, so recomputing is fine).

Please log in to view the solution editor and submit code.