","

# Design and Architecture

This document lists the design and architectural decisions taken
during the development of Correct Markdown. It follows
the [Architecture Decisions](https://cognitect.com/blog/2011/11/15/documenting-architecture-decisions.html) format.

## StringView

**Date: (2024-10-01)**


### Context

The goal of the project is to correct pieces of text in a markdown document. This cannot be done
with a straightforward find and replace because the markdown document contains more than
plain text, namely markdown tags and, possibly, HTML tags.

We need a data structure that isolates markup information from plain-text information.

### Decision

Create the StringView data structure. It maintains an index and a view object so that the
subsets of a string partition can be edited independently of each other.

To initialize a StringView, we pass a partition of the string. A partition is a collection of
disjoint sets whose union gives the original string. A partition is described by a dictionary
in which each key is a set name (view name) and each value is the list of strings
that the set is made of.
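To make the partition format concrete, here is a minimal sketch of how the original string could be rebuilt from such a dictionary. The helper name `join_partition` is hypothetical, and the round-robin order (one segment from each view in turn, following the dictionary order) is an assumption inferred from the example later in this section, not a confirmed detail of StringView.

```python
from itertools import chain, zip_longest
from typing import Mapping, Sequence


def join_partition(segments: Mapping[str, Sequence[str]]) -> str:
    """Rebuild the full string from a partition (hypothetical helper).

    Assumption: segments are interleaved round-robin, taking one segment
    from each view in dictionary order. Shorter views are padded with "".
    """
    rounds = zip_longest(*segments.values(), fillvalue="")
    return "".join(chain.from_iterable(rounds))


segments = {
    "text": ["", "October journal", "\n\n", "October first", "\n\n", "Today it rained."],
    "markup": ["<h1>", "</h1>", "<h2>", "</h2>", "", ""],
}
print(join_partition(segments))
```

Under this assumption, the empty strings in the partition act as placeholders that keep the two views aligned.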

During initialization we build the index and view objects. The index is a list of `ViewSegmentItem` objects:

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class ViewSegmentItem:
    view_name: Any
    master_index: int
    segment_index: int
```
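The sketch below shows one way such an index could be built. It is an illustration under assumptions, not the project's code: `build_index` is a hypothetical name, `master_index` is assumed to be the position of the segment in the interleaved (full) sequence, and `segment_index` is assumed to be its position inside its own view.

```python
from dataclasses import dataclass
from typing import Any, List, Mapping, Sequence


@dataclass
class ViewSegmentItem:
    view_name: Any
    master_index: int
    segment_index: int


def build_index(segments: Mapping[Any, Sequence[str]]) -> List[ViewSegmentItem]:
    """Build the index by walking the views round-robin (assumed order)."""
    index: List[ViewSegmentItem] = []
    rounds = max((len(v) for v in segments.values()), default=0)
    for segment_index in range(rounds):
        for view_name, view_segments in segments.items():
            if segment_index < len(view_segments):
                # Assumption: master_index is the position in the full sequence.
                index.append(ViewSegmentItem(view_name, len(index), segment_index))
    return index


segments = {"text": ["", "Hello"], "markup": ["<p>", "</p>"]}
index = build_index(segments)
# Following the index in order rebuilds the full content.
print("".join(segments[i.view_name][i.segment_index] for i in index))
```

With an index like this, each view can be edited through its own segment list while the full content is still recovered by following the index.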

After initialization, one can update a view without affecting the others.

```python
>>> segments = {
...  "text": ["","October journal","\n\n","October first","\n\n", "Today it rained."],
...  "markup": ["<h1>","</h1>","<h2>","</h2>","",""]
... }
>>> from danoan.correct_markdown.core.string_view import StringView
>>> SV = StringView(segments)

>>> SV["text"]
'October journal\n\nOctober first\n\nToday it rained.'

>>> SV["markup"]
'<h1></h1><h2></h2>'

>>> SV.get_content()
'<h1>October journal</h1>\n\n<h2>October first</h2>\n\nToday it rained.'


>>> s = SV["text"].find("October first")
>>> SV["text"] = s, "Monday, October first\n\nToday was sunny!"
>>> SV["text"]
'October journal\n\nMonday, October first\n\nToday was sunny!'


>>> SV.get_content()
'<h1>October journal</h1>\n\n<h2>Monday, October first</h2>\n\nToday was sunny!'

```
Notice that, to replace part of a view, we index the StringView with the view name in brackets
and assign it a tuple made of an integer and a string. The integer is the position in the view
at which the reassignment starts, and the string is the new content.
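The bracket assignment works because Python passes `position, content` to `__setitem__` as a single tuple. The toy class below illustrates only that mechanism. `TinyView` is a hypothetical stand-in that stores each view as one flat string, and it assumes that everything from the given position onwards is replaced, which is consistent with the example above but is not confirmed to be how StringView distributes the new content over its segments.

```python
from typing import Dict, Tuple


class TinyView:
    """Hypothetical stand-in for illustration; not the StringView implementation."""

    def __init__(self, views: Dict[str, str]) -> None:
        self._views = dict(views)

    def __getitem__(self, view_name: str) -> str:
        return self._views[view_name]

    def __setitem__(self, view_name: str, value: Tuple[int, str]) -> None:
        # `obj[name] = start, new_content` arrives here as the tuple (start, new_content).
        start, new_content = value
        current = self._views[view_name]
        # Assumption: the content from `start` to the end of the view is replaced.
        self._views[view_name] = current[:start] + new_content


tv = TinyView({"text": "October journal\n\nOctober first\n\nToday it rained."})
s = tv["text"].find("October first")
tv["text"] = s, "Monday, October first\n\nToday was sunny!"
print(repr(tv["text"]))
```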

### Status

Implemented.

### Consequences

Refactor strikethrough-items to use StringView. A helper data structure will likely be needed
to address the special requirements of the markdown correction feature.

## MarkdownView: Remove HTML tags, but preserve markdown markup tags

**Date: (2024-10-05)**

### Context

Initially, the diff items were collected from two plain-text views. The first was
derived from the original markdown file and the second from its correction.

However, this approach led to undesirable edits in the corrected markdown.
For example:

```text
**touffus** qui parfois on a
Diff Item: touffus qui -> touffus que

Enhanced MD:
**touffus que** parfois on a
```

In this example the replacement crosses the closing `**`, so the bold span wrongly grows to cover `que`.

Therefore, we now process the pure markdown view instead, that is, a view in which markdown markup
is kept but HTML tags are removed. The MarkdownView class uses the pure markdown view as its base
text and instantiates the StringView with Html and NoHtml segments.
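As an illustration, the Html and NoHtml segments could be produced along the following lines. This is a minimal sketch, not the MarkdownView code: the function name `split_html` and the tag regex are assumptions, and the regex is deliberately simplistic (it would also treat markdown autolinks such as `<https://example.com>` as tags).

```python
import re
from typing import Dict, List

# Simplistic tag pattern (assumption): anything between '<' and the next '>'.
HTML_TAG = re.compile(r"(<[^>]+>)")


def split_html(markdown: str) -> Dict[str, List[str]]:
    """Split a markdown string into aligned NoHtml and Html segment lists."""
    # With a capturing group, re.split alternates: text, tag, text, tag, ..., text.
    parts = HTML_TAG.split(markdown)
    no_html = parts[0::2]
    html = parts[1::2] + [""]  # pad so both lists have the same length
    return {"NoHtml": no_html, "Html": html}


segments = split_html("<b>**touffus**</b> qui parfois on a")
print("".join(segments["NoHtml"]))  # the pure markdown view
```

The markdown markers (`**`) stay in the NoHtml view, so a correction computed on that view sees where the bold span starts and ends.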

### Decision

Accepted.

### Status

Implemented.

### Consequences

We use an LLM to correct the original text. An LLM is likely to generate more reliable
output for plain-text input, but we now need to edit the prompt so that it also accepts pure markdown input.

## Diff Items: Diff by character and diff by word

**Date: (2024-10-07)**

### Context

Our goal is to correct a given markdown file (original) and generate another (corrected)
that is as similar as possible to the original one. That is, we only apply corrections to
words, not formatting. The formatting should be kept the same as the original and that includes
white spaces and new lines.

The diff items can be computed in either the `word-basis` or the `character-basis` mode. In the
first, trailing spaces and new lines are ignored, while in the second they are not.

From an application point of view, we are interested in a word-basis diff to identify the parts
of the text that differ between the original and the corrected version. The diff items generated
by the word-basis mode are used to generate the explanations of the corrections. It is not helpful
to pass formatting differences to the explanation prompt.

On the other hand, to create the StringView segments, we need to use the `character-basis` mode;
otherwise, we cannot reconstruct the original string correctly.

### Decision

Accepted.

### Status

Implemented.

### Consequences

The joiner string is a space in `word-basis` mode and the empty string in `character-basis` mode.
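
The two modes and their joiner strings can be sketched with the standard library `difflib` module. This is an illustration under assumptions: the function name `diff_items`, its `mode` parameter, and the use of `SequenceMatcher` are choices made for this sketch, not necessarily what the project does.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def diff_items(original: str, corrected: str, mode: str = "word") -> List[Tuple[str, str]]:
    """Return (old, new) pairs for every region that differs."""
    if mode == "word":
        # str.split() drops white spaces and new lines, so they cannot be recovered.
        a, b, joiner = original.split(), corrected.split(), " "
    else:
        # Character mode keeps every character, including spaces and new lines.
        a, b, joiner = list(original), list(corrected), ""
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (joiner.join(a[i1:i2]), joiner.join(b[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


original = "touffus qui  parfois\non a"
corrected = "touffus que parfois on a"
print(diff_items(original, corrected, mode="word"))  # [('qui', 'que')]
print(diff_items(original, corrected, mode="character"))
```

The word-basis result contains only the correction itself, which is what the explanation prompt needs, while the character-basis result also reports the spacing and new-line differences needed to rebuild the exact string.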


## MarkdownView: Ignore white spaces and new lines during find operation

**Date: (2024-10-09)**

### Context

The diff items are computed in `word-basis` mode, and this mode ignores white spaces and new lines.
Therefore, we cannot reconstruct the exact string present in the text from the diff items: new lines
and white spaces might be missing. That is why, during the find operation, we use a regex that ignores
trailing white spaces and new lines.

### Decision

Accepted.

### Status

Implemented.

### Consequences

Use `[\s\n]*` between the words of the search_value in the find method of MarkdownView.
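
A minimal sketch of this idea is shown below. The function names `build_pattern` and `find_ignoring_whitespace` are hypothetical and the real find method of MarkdownView may differ. Each word is escaped so that markdown characters such as `*` are matched literally.

```python
import re
from typing import Optional, Tuple


def build_pattern(search_value: str) -> re.Pattern:
    """Join the escaped words of search_value with the whitespace-tolerant separator."""
    words = search_value.split()
    return re.compile(r"[\s\n]*".join(re.escape(word) for word in words))


def find_ignoring_whitespace(text: str, search_value: str) -> Optional[Tuple[int, int]]:
    """Return the (start, end) span of the first match, or None if there is none."""
    match = build_pattern(search_value).search(text)
    return match.span() if match else None


text = "touffus que\n\n  parfois on a"
print(find_ignoring_whitespace(text, "que parfois"))  # (8, 22)
```

Note that `*` also allows zero separators, so under this pattern `on a` would match `ona` as well.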