LLMs have the wrong ontology for scholarship

I have read suggestions that LLMs might help with the routine and tedious parts of writing, like a literature review. That hope is undermined by their failure to distinguish the literature (which is to be reviewed) from discourse in general.

Recent work by Dirk Spennemann suggests that ChatGPT’s training set includes lots of secondary sources (like Wikipedia) but fewer primary sources. Yet the training set includes lots of passages where literature is cited and claims are attributed to sources, so the algorithm generates things that have the grammatical form of citations and attributions.

When you train a computer to play chess, you don’t use a language model trained on conversations about chess. Instead, you build it around a representation of board positions and possible moves. The structure of a game of chess is the right ontology. The structure of the program should reflect the structure of the things that the program is supposed to work on. That is, the data ontology should reflect the ontology of the target domain.1
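To make the contrast vivid, here is a toy sketch in Python, with made-up names and nothing like the sophistication of a real engine, of the kind of representation a chess program is built around. The details don't matter; what matters is that a move is a different type from a square, and neither is a string of text about chess.

```python
from dataclasses import dataclass
from typing import Dict, List

# Toy sketch only: the program's data types mirror the structure of the game
# (squares, pieces, positions, moves), not conversations about the game.

@dataclass(frozen=True)
class Square:
    file: int  # 0..7, i.e. files a-h
    rank: int  # 0..7, i.e. ranks 1-8

@dataclass(frozen=True)
class Piece:
    kind: str   # "K", "Q", "R", "B", "N", or "P"
    color: str  # "white" or "black"

@dataclass(frozen=True)
class Move:
    origin: Square
    target: Square

@dataclass
class Position:
    placement: Dict[Square, Piece]  # which piece stands on which square
    to_move: str                    # "white" or "black"

    def legal_moves(self) -> List[Move]:
        """Enumerate moves allowed from this position.
        Only knight moves are implemented here, as an illustration."""
        jumps = [(1, 2), (2, 1), (2, -1), (1, -2), (-1, -2), (-2, -1), (-2, 1), (-1, 2)]
        moves: List[Move] = []
        for square, piece in self.placement.items():
            if piece.color != self.to_move or piece.kind != "N":
                continue
            for df, dr in jumps:
                dest = Square(square.file + df, square.rank + dr)
                if not (0 <= dest.file <= 7 and 0 <= dest.rank <= 7):
                    continue
                occupant = self.placement.get(dest)
                if occupant is None or occupant.color != piece.color:
                    moves.append(Move(square, dest))
        return moves
```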

Citing and quoting sources requires attending to what is written in those sources and where. This information needs to be treated as different in kind from glib turns of phrase and common forms of expression. But an LLM just represents all the discourse it is trained on as the same kind of thing, only nudged by the push and pull of learned association.
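By way of contrast, here is a hypothetical sketch, not a description of any real system, of what a scholarship-shaped data ontology would have to keep distinct: a source, a location within that source, and the sentence of prose that leans on it. A model that represents all discourse as one kind of thing collapses exactly these distinctions.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical sketch: sources, locations in sources, and the prose that
# cites them are different kinds of things, not undifferentiated text.

@dataclass(frozen=True)
class Source:
    author: str
    title: str
    year: int

@dataclass(frozen=True)
class Citation:
    """A claim tied to where it appears in a particular source."""
    source: Source
    locator: str                     # e.g. a page or section reference
    quotation: Optional[str] = None  # verbatim text, if quoting rather than paraphrasing

@dataclass
class Sentence:
    """A unit of the review's own prose, with its attributions made explicit."""
    text: str
    supports: List[Citation] = field(default_factory=list)

def is_checkable(sentence: Sentence) -> bool:
    """Every citation attached to the sentence must point somewhere a reader
    could actually look; the grammatical form of a citation is not enough."""
    return all(c.locator for c in sentence.supports)
```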

[Image: A palm civet in a monk's cowl drinking coffee.]
  1. The general point is nicely argued by Dan Li, who writes that “we should design data ontology in such a way that is consistent with the knowledge that we have about the target phenomenon.”
