# Diff Poetry
#poem #code #fish

![[Diff poem from Against Prisons: "What criminals?" I?," / "What / "What other?"|../img/screenshots/pdiff/criminals_against-prisons.gif]]

![[Diff poem from Now: That's It's "something else" background."|../img/screenshots/pdiff/something-else_now.gif]]

![[Diff poem from If We Go, We Go On Fire: "together", / "alone"|../img/screenshots/pdiff/alone_if-we-go.gif]]

![[Diff poem from Why Riot?: we're / "cultural capital" / "undesirable" / "enormously--consistent" |../img/screenshots/pdiff/capital_why-riot.gif]]

![[Diff poem from Now: "Humanity" we're / "young people", "uncontrolled ones,"|../img/screenshots/pdiff/humanity_now.gif]]

![[Diff poem from Now: 'The Party', / don't / don't / 'send cheque'|../img/screenshots/pdiff/party_now.gif]]

![[Diff poem from Now: it's / It's / "everyday," / "everyday" / It's|../img/screenshots/pdiff/everyday_now.gif]]

![[Diff poem from I Don't Bash Back I Shoot First: don't shit--unless "Friendship, Contempt--these following"|../img/screenshots/pdiff/shit_i-dont-bash-back.gif]]

---

These are the results of a way of making algorithmic blackout poetry I discovered by accident.[^accident]

I noticed that the Prairieland Zines were littered with garbage characters in NCSA Mosaic, which would *not* do. The quotation marks, dashes, etc. were all garbled, thanks to fancy Unicode punctuation (“‘curly’ quotes” — emdashes — ellipses …) that Mosaic can't display.[^compat]

For the most part, this isn't a problem on pnppl.cc. While I do use UTF-8, I convert common characters to an ASCII equivalent in the HTML (then back again with CSS for display on modern systems).

But the zines don't get fed through the program that generates the rest of my site. They just... sit there, ready to go. So, they never get put through my make-everything-work-with-Mosaic filter, hence the problem.

To solve this, I ran the zines through a lovely program called [[konwert|https://packages.debian.org/trixie/konwert]].
```fish
function klean --argument-names html
	konwert utf8-iso1/html $html |
		konwert iso1-utf8/html |
		konwert html-htmldec
end
```

This converts everything to [[Latin-1|https://en.wikipedia.org/wiki/ISO/IEC_8859-1]], an extended ASCII[^extended] that adds most of the Western European characters plain ASCII lacks, then back into UTF-8. In effect, it produces a UTF-8 document reduced to the Latin-1 character subset.[^ascii] It then encodes everything that needs to be encoded (`&<>"` and non-ASCII) into character entities. It sounds nuts, but it works. Mosaic supports Latin-1 (or something akin to it, like the [[Windows-1252 code page|https://en.wikipedia.org/wiki/Windows-1252]]) and the extended characters display just fine.
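To make the entity-encoding step concrete, here is a rough sketch of the idea in plain fish. It is not how konwert does it: `decimal-entities` is a made-up helper that works on a single line of plain text, whereas a real HTML filter has to leave the markup itself alone.

```fish
# Hypothetical helper, not part of konwert: rewrite the characters HTML needs
# escaped, plus anything outside ASCII, as decimal character entities.
# Assumes a single line of plain text (no newlines, no markup to preserve).
function decimal-entities --argument-names text
	set --local out
	for char in (string split '' -- $text)
		# a leading quote makes printf report the character's code point
		set --local code (printf '%d' "'$char")
		if test $code -gt 127; or contains -- $char '&' '<' '>' '"'
			set --append out "&#$code;"
		else
			set --append out $char
		end
	end
	string join '' -- $out
end

decimal-entities 'café & <b>'
```

Only decimal entities are produced because those are the form Mosaic understands.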

So, I mostly just changed a bunch of punctuation. I needed to check the result and make sure I didn't screw anything up,[^restore] so I `diff`ed the converted file against the original. But that produced a line-based diff. Maybe those are OK for code, but for English they're terrible. You have to hunt for the character that actually changed. It's aggravating.

Well, we also have something called `wdiff`, which diffs word by word.[^git] It still doesn't show the *character* that changed (the way Forgejo's wonderful system does), but it's good enough. It seems pretty useless at first because it doesn't highlight anything in color, but it accepts arguments for strings to print before and after each insertion or deletion, so you can roll your own by printing color control codes.
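As a minimal sketch of that roll-your-own highlighting, something like the following should print deletions in red and insertions in green. The name `cdiff` and the color choices are mine, and it assumes `tput` knows your terminal.

```fish
# Hypothetical helper: word diff with colored changes instead of [-...-] {+...+}.
function cdiff --argument-names old new
	# setaf 1 is red, setaf 2 is green, sgr0 resets all attributes
	wdiff \
		--start-delete=(tput setaf 1) \
		--end-delete=(tput sgr0) \
		--start-insert=(tput setaf 2) \
		--end-insert=(tput sgr0) \
		$old \
		$new
end
```

Pipe the result through `less -r` if you want to page it with the colors intact.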

Once I started skimming through the results, there was a striking rhythm that reminded me of something.

**Blackout poetry** is when you take a book and sharpie over everything except the words you want to assemble. I realized that I could use the wdiff highlight feature to make the parts of the text that *didn't* change invisible — just set the foreground and background color to black.

```fish
# poetry diff
function pdiff --argument-names old new --description "turn pair of files into blackout poetry"
	set highlight insert
#	set highlight delete

	set foreground setaf
	set background setab
	set black 0
	set white 15

	# we need to black out the contents before the first highlight
	tput $foreground $black
	tput $background $black
	wdiff \
		--no-deleted \
		--start-$highlight=(tput $foreground $black; tput $background $white) \
		--end-$highlight=(tput $foreground $black; tput $background $black) \
		$old \
		$new
end
```
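One caveat if you switch to the `delete` line: `--no-deleted` tells wdiff to drop deleted words entirely, so the words you meant to reveal would vanish. The matching flag for that side is `--no-inserted`. Here is a sketch of a variant that takes the side as an argument and picks the right flag. `pdiff-side` and its `side` argument are made-up names, and the final `tput sgr0` is my addition to hand the terminal back in its default colors.

```fish
# Hypothetical pdiff variant: reveal either the inserted or the deleted words.
function pdiff-side --argument-names side old new
	switch $side
		case insert
			set hide --no-deleted
		case delete
			set hide --no-inserted
		case '*'
			echo "usage: pdiff-side insert|delete OLD NEW" >&2
			return 2
	end

	# black on black until the first highlight
	tput setaf 0
	tput setab 0
	wdiff \
		$hide \
		--start-$side=(tput setaf 0; tput setab 15) \
		--end-$side=(tput setaf 0; tput setab 0) \
		$old \
		$new
	# reset colors once the diff has been printed
	tput sgr0
end
```

With `delete`, the poem is built from the original punctuation rather than the converted one.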

All said and done, we can generate our diff poems like so:

```fish
function latin-poetry --argument-names html_in
	# psub lets stdout masquerade as a file
	pdiff $html_in (klean $html_in | psub) | less -r
end

latin-poetry index.html
```

Because of how I converted the files, it's mostly words containing an apostrophe, quote, or emdash that get highlighted. That emphasizes the most interesting *and* least interesting elements of the text. Quotes get you words that are being spoken *about*, or that the writer is distancing themself from, or that begin or end a sentence. Emdashes get you pairs of words that sentences pivot on with a particular rhythm.

Apostrophes get you people's names (as in `{Name}'s`), but mostly they get you contractions that are widely used in English. Many potential poems are ruined by too much repetition of words like it's, that's, can't, isn't, etc. However, this phenomenon also contributes to the ones that *do* work. We get both the more interesting words and some helper words to connect them.

I'm sure there are ways to generate the diff that will produce better results. Just replace `klean` with your own function. If you find any good ones, please email me and share them.
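For instance, here is one untested-as-poetry idea, sketched under the assumption that `pdiff` is defined as above. `decapitalize` and `poetry-with` are names I made up. Lowercasing the whole file means every word containing a capital letter (sentence openers, proper nouns, acronyms) differs from the original and gets revealed.

```fish
# Hypothetical stand-in for klean: lowercase everything, so any word
# containing a capital letter differs from the original.
function decapitalize --argument-names file
	string lower < $file
end

# Generic wrapper: the first argument names the filter function to run.
# Relies on pdiff from earlier in this post.
function poetry-with --argument-names filter html_in
	pdiff $html_in ($filter $html_in | psub) | less -r
end

# usage: poetry-with decapitalize index.html
```

Whether the results are any good is another question, but swapping filters this way makes it cheap to try.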


[^accident]: It's sort of like found art. Is all generative art found art since you have to search through noise for it?

[^compat]: The other source of incompatibility is the way HTML character entities are encoded. (If you're not sure what these are: you've probably seen `&amp;` before, which is how you write `&`, the **amp**ersand, in HTML.) You can write them as names (`&amp;`) if one has been assigned, or as a number in decimal (`&#38;`) or hexadecimal (`&#x0026;`). Mosaic doesn't understand most of the names and doesn't support hexadecimal entities, but it *does* support decimal entities, so I output everything in that form.

[^extended]: ASCII only uses 7 of the 8 bits that make up a byte, so it can only encode 128 characters. If you turn on the high bit, you get another 128 for a total of 256 possible characters. UTF-8 is also sort of an extended ASCII, in that anything that's allowed by ASCII is also allowed by UTF-8. With Unicode, you use more bytes per character in exchange for more than a million possibilities, which enables you to use the same encoding for all languages.

[^ascii]: Why not encode into ASCII? Because there are lots of characters like `áéóç` used in some zines. We could throw them away, but Mosaic displays them fine, and it would constitute a much more significant degradation of the text than replacing curly quotes with straight ones.

[^restore]: Like I said, the documents are ultimately served as UTF-8, and while I accept the loss of unnecessary features like stylistic quotes, I don't want to wipe out stuff like Arabic, Hebrew, Chinese, etc. Those get put back in. It's fine if they garble on old systems, which would have no way of displaying them anyway.

[^git]: In a git repo, you can do `git diff --word-diff`.