Design and Architecture

This document lists the design and architectural decisions made during the development of Correct Markdown. It follows the Architecture Decision Record (ADR) format.

StringView

Date: (2024-10-01)

Context

The project goal is to correct pieces of text in a markdown document. This cannot be done with a straightforward find and replace because the markdown document contains information other than plain text, namely markdown tags and, possibly, HTML tags.

We need a data structure to isolate markup information from plain-text information.

Decision

Create the StringView data structure. It maintains an index and a view object so that the subsets of a string partition can be edited independently of each other.

To initialize a StringView, we pass a partition of a string. A partition is a collection of disjoint sets whose union gives the original string. It is described by a dictionary in which each key is a set name (view name) and each value is the list of strings that make up that set.

During initialization we build the index and view objects. The index is a list of ViewSegmentItem:
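
As a minimal sketch of what such a partition encodes, the snippet below rebuilds the original string by taking one segment from each view in turn. The helper name join_partition is hypothetical and not part of the library, and the round-robin order and the equal number of segments per view are assumptions inferred from the journal example in this section.

```python
from itertools import chain


def join_partition(segments):
    """Rebuild the original string by taking one segment from each view in turn."""
    lengths = {len(parts) for parts in segments.values()}
    if len(lengths) > 1:
        # Assumption: every view lists the same number of segments.
        raise ValueError("every view must list the same number of segments")
    # zip(*...) yields (view1[0], view2[0]), (view1[1], view2[1]), ...
    return "".join(chain.from_iterable(zip(*segments.values())))


segments = {
    "text": ["", "October journal", "\n\n", "October first", "\n\n", "Today it rained."],
    "markup": ["<h1>", "</h1>", "<h2>", "</h2>", "", ""],
}
content = join_partition(segments)
```

Empty strings act as placeholders so that two consecutive segments of the same view can still be expressed.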

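from dataclasses import dataclass
from typing import Any


@dataclass
class ViewSegmentItem:
    view_name: Any  # key of the view in the partition dictionary
    master_index: int
    segment_index: int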
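
One way such an index could be built is sketched below. This is an illustration only: the function build_index is hypothetical, and the reading of the fields is an assumption (master_index taken as the position of the segment in the interleaved order, segment_index as its position inside its own view's list). The actual implementation may define them differently.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class ViewSegmentItem:
    view_name: Any
    master_index: int
    segment_index: int


def build_index(segments):
    """Hypothetical index builder: one entry per segment, in interleaved order."""
    names = list(segments)
    rounds = min((len(segments[name]) for name in names), default=0)
    index = []
    for segment_index in range(rounds):
        for name in names:
            # Assumption: master_index is the position in the interleaved order.
            index.append(ViewSegmentItem(name, len(index), segment_index))
    return index


segments = {"text": ["", "Hello"], "markup": ["<b>", "</b>"]}
index = build_index(segments)
# Walking the index in order recovers the full content.
content = "".join(segments[item.view_name][item.segment_index] for item in index)
```

Under these assumptions, each entry says which view owns a segment and where to find it, so one view's segments can change without touching the others.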

After initialization, one can update a view without affecting the others.

>>> segments = {
...  "text": ["","October journal","\n\n","October first","\n\n", "Today it rained."],
...  "markup": ["<h1>","</h1>","<h2>","</h2>","",""]
... }
>>> from danoan.correct_markdown.core.string_view import StringView
>>> SV = StringView(segments)

>>> SV["text"]
'October journal\n\nOctober first\n\nToday it rained.'

>>> SV["markup"]
'<h1></h1><h2></h2>'

>>> SV.get_content()
'<h1>October journal</h1>\n\n<h2>October first</h2>\n\nToday it rained.'


>>> s = SV["text"].find("October first")
>>> SV["text"] = s, "Monday, October first\n\nToday was sunny!"
>>> SV["text"]
'October journal\n\nMonday, October first\n\nToday was sunny!'


>>> SV.get_content()
'<h1>October journal</h1>\n\n<h2>Monday, October first</h2>\n\nToday was sunny!'

Notice that, to replace part of a view, we index the StringView with the view name and assign it a pair made of an integer and a string. The integer is the position in the view at which the reassignment starts, and the string is the new content.

Status

Implemented.

Consequences

Refactor strikethrough-items to use StringView. A helper data structure will likely be needed to address special requirements of the markdown correction feature.

MarkdownView: Remove HTML tags, but preserve markdown markup tags

Date: (2024-10-05)

Context

Initially, the diff items were collected from two plain-text views. The first was derived from the original markdown file and the second from its corrected version.

However, this approach caused undesirable edits in the corrected markdown. For example, when a diff item spans a markup boundary, the replacement ends up inside the bold markers:

Original MD:
**touffus** qui parfois on a
Diff Item: touffus qui -> touffus que

Enhanced MD:
**touffus que** parfois on a

Therefore, we now process the pure markdown view instead, that is, markdown markup is kept but HTML tags are not. In the MarkdownView class, the base text is the pure markdown view, and the StringView is instantiated with Html and NoHtml segments.
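
The split into Html and NoHtml segments could look like the sketch below. It is an assumption-laden illustration, not the MarkdownView implementation: the function name split_html and the regular expression are hypothetical, and the pattern is deliberately naive (it would also treat markdown autolinks such as <https://example.com> as HTML).

```python
import re

# Naive tag matcher, assumed for illustration only.
HTML_TAG = re.compile(r"<[^>]+>")


def split_html(markdown_text):
    """Split text into alternating NoHtml and Html segments of equal count."""
    no_html, html = [], []
    cursor = 0
    for match in HTML_TAG.finditer(markdown_text):
        no_html.append(markdown_text[cursor:match.start()])  # text before the tag
        html.append(match.group())                           # the tag itself
        cursor = match.end()
    no_html.append(markdown_text[cursor:])  # trailing text after the last tag
    html.append("")                         # placeholder keeps both lists aligned
    return {"NoHtml": no_html, "Html": html}


segments = split_html("<u>**touffus** qui</u> parfois on a")
pure_markdown = "".join(segments["NoHtml"])
```

Joining the NoHtml segments yields the pure markdown view, in which the bold markers are still present and can be taken into account when computing the diff.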

Decision

Accepted.

Status

Implemented.

Consequences

We use an LLM to correct the original text. The LLM is likely to produce more reliable output for plain-text input, so the prompt must be edited to also accept pure markdown input.

Diff Items: Diff by character and diff by word

Date: (2024-10-07)

Context

Our goal is to correct a given markdown file (original) and generate another (corrected) that is as similar as possible to the original one. That is, we only apply corrections to words, not to formatting. The formatting, including white spaces and new lines, should be kept the same as in the original.

The diff items can be computed in either word-basis or character-basis mode. In the former, trailing spaces and new lines are ignored; in the latter, they are not.

From the application's point of view, we want a word-basis diff to identify the parts of the text that differ between the original and corrected versions. The diff items generated in word-basis mode are used to generate the explanations of the corrections; passing formatting differences to the explanation prompt would not be helpful.

On the other hand, to create the StringView segments we need character-basis mode; otherwise we cannot reconstruct the original string correctly.

Decision

Accepted.

Status

Implemented.

Consequences

The joiner string is a space in word-basis mode and the empty string in character-basis mode.
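
The difference between the two modes can be illustrated with the standard library's difflib. This is a sketch under assumptions: the function diff_items and its mode argument are hypothetical names, and the project may compute its diff items with a different algorithm.

```python
import difflib


def diff_items(original, corrected, mode="word"):
    """Return (old, new) pairs for every non-equal span between the two texts."""
    if mode == "word":
        # split() drops spaces and new lines, so they cannot be recovered later.
        a, b, joiner = original.split(), corrected.split(), " "
    else:
        # Character tokens keep every space and new line.
        a, b, joiner = list(original), list(corrected), ""
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return [
        (joiner.join(a[i1:i2]), joiner.join(b[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


original = "touffus qui\n\nparfois on a"
corrected = "touffus que\n\nparfois on a"
word_items = diff_items(original, corrected, mode="word")
char_items = diff_items(original, corrected, mode="character")
```

Joining the word tokens with a space does not give back the original string because the blank line is lost, whereas joining the character tokens with the empty string does.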

MarkdownView: Ignore white spaces and new lines during find operation

Date: (2024-10-09)

Context

The diff items are computed in word-basis mode, which ignores white spaces and new lines. Therefore, we cannot reconstruct from the diff items the exact string present in the text: new lines and white spaces might be missing. That’s why, during the find operation, we use a regex that ignores white spaces and new lines between words.

Decision

Accepted.

Status

Implemented.

Consequences

Use [\s\n]* between the words of the search_value in the find method of MarkdownView. Since \s already matches new lines in Python regular expressions, this is equivalent to \s*.
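
A minimal sketch of such a search is shown below. The function name find_ignoring_whitespace and its return value (a start/end span) are assumptions made for illustration; the actual find method of MarkdownView may have a different signature.

```python
import re


def find_ignoring_whitespace(text, search_value):
    """Locate search_value in text, allowing any white space between its words."""
    words = search_value.split()
    if not words:
        return None
    # re.escape protects characters such as '*' that are common in markdown.
    pattern = r"[\s\n]*".join(re.escape(word) for word in words)
    match = re.search(pattern, text)
    return match.span() if match else None


text = "October first\n\nToday it rained."
span = find_ignoring_whitespace(text, "first Today")
```

Here the diff item "first Today" is found even though the text separates the two words with a blank line rather than a space.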