With xtool, you typically get two modes of deduplication:

LLM datasets often contain paraphrased versions of the same fact:

: It identifies identical data blocks across large inputs, which is particularly useful for modern games, which often exceed 60 GB and may contain duplicate assets.
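As a rough illustration of what block-level deduplication means, the sketch below splits a byte string into fixed-size blocks and stores each distinct block only once, keyed by its SHA-256 digest. The function names, the fixed 4096-byte block size, and the choice of hash are assumptions made for this example; it is not a description of how xtool itself works internally.

```python
import hashlib

def dedup_blocks(data: bytes, block_size: int = 4096):
    """Split data into fixed-size blocks and keep each distinct block once.

    Hypothetical helper for illustration only.
    """
    store = {}   # digest -> block bytes (each unique block stored once)
    recipe = []  # ordered digests needed to rebuild the original input
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        digest = hashlib.sha256(block).digest()
        store.setdefault(digest, block)
        recipe.append(digest)
    return store, recipe

def rebuild(store, recipe) -> bytes:
    """Reassemble the original bytes from the block store and the recipe."""
    return b"".join(store[digest] for digest in recipe)

# Toy input: the first and third blocks are identical.
data = b"A" * 4096 + b"B" * 4096 + b"A" * 4096
store, recipe = dedup_blocks(data)
restored = rebuild(store, recipe)
```

Here three input blocks collapse to two stored blocks, and the recipe is enough to restore the input byte for byte. Fixed-size blocks are the simplest choice; they miss duplicates that are shifted by a few bytes, which is why real tools may use other chunking strategies.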

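Always deduplicate before tokenization. Removing duplicates at the raw-text level is far more effective than removing them after the text has been split into subwords, and it avoids spending tokenization time on documents that will be discarded anyway.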
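A minimal sketch of that ordering, assuming exact-match deduplication after light normalization (lowercasing and collapsing whitespace). The helper names, the toy corpus, and the whitespace "tokenizer" are placeholders for illustration; a real pipeline would call its own subword tokenizer at the last step.

```python
import hashlib

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())

def dedup_raw(texts):
    """Keep the first occurrence of each normalized text, preserving order."""
    seen = set()
    unique = []
    for text in texts:
        key = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(text)
    return unique

def toy_tokenize(text: str):
    """Stand-in for a real subword tokenizer (assumption for this sketch)."""
    return text.split()

# Made-up corpus: the second entry differs only in case and spacing.
corpus = ["The sky is blue.", "the  sky is blue.", "Water boils at 100 C."]

unique_docs = dedup_raw(corpus)                         # step 1: dedup raw text
tokenized = [toy_tokenize(doc) for doc in unique_docs]  # step 2: tokenize survivors
```

Hashing the normalized text keeps memory use bounded to one digest per distinct document, and only the surviving documents ever reach the tokenizer.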

Your raw dataset has the same row repeated 5 times:
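A hypothetical version of that situation can be reproduced in a few lines. The rows below are invented for the example; the sketch counts how often each row occurs and then keeps only the first occurrence of each.

```python
from collections import Counter

# Made-up dataset: one row appears 5 times, another appears once.
rows = [("alice", "alice@example.com")] * 5 + [("bob", "bob@example.com")]

# Report which rows are repeated and how often.
counts = Counter(rows)
duplicates = {row: n for row, n in counts.items() if n > 1}

# Drop repeats while preserving first-seen order (dicts keep insertion order).
unique_rows = list(dict.fromkeys(rows))
```

Counting first makes the scale of the duplication visible before anything is removed. This works because the rows are hashable tuples; rows stored as lists or dicts would need to be converted to a hashable form first.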