Discussion about this post

Nathan Witkin

Wait ... is the possibility of this kind of n-gram analysis not an absolutely massive deal? Does it work with other AI-generated text? If so, then I imagine it'd be of extreme interest to Pangram and also totally transform the conversation surrounding AI-generated text.

If LLMs routinely use verbatim snippets from their training data, that would presumably make AI writing even less desirable / more low-status than it is now, and would strengthen the case that it counts as plagiarism.

Alex Taylor

I'd be interested in seeing this tool used on a piece of writing from 2020. Super interesting findings here but none of the examples you've identified are actually that rare or strange, especially not 'Something coiled inside her' or 'swallows a shout'. The ones identified in the app are even more benign most of the time. I'm certainly open to the Granta piece being AI, but even most of the 10 and 11 token n-grams in the demo are phrases I've read/heard before. Not sure this works as a source of authority with the current explanation/demo.

'the air sweet with cane and forgetting' is a reference to sugarcane, which is commonly eaten in Trinidad and Tobago. While the writing may not be to your taste, I don't think a flowery description of something instantly identifies it as a 'brilliant example of bad plagiarism'.
