Xtool Dedup Parameter

With xtool, you typically get two modes of deduplication:

LLM datasets often contain paraphrased versions of the same fact:
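To illustrate why reworded text slips past exact matching, here is a minimal Python sketch that compares sentences by word overlap (Jaccard similarity). It is not xtool's own method: the function names and example sentences are made up for illustration, and word overlap only catches light rewording, not deeper paraphrases.

```python
import re

def word_set(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of two word sets (1.0 = same words, 0.0 = disjoint)."""
    wa, wb = word_set(a), word_set(b)
    if not wa and not wb:
        return 1.0
    return len(wa & wb) / len(wa | wb)

# Hypothetical example sentences, not taken from any real dataset.
a = "The Eiffel Tower is located in Paris."
b = "In Paris, the Eiffel Tower is located."
c = "Bananas are rich in potassium."

print(a == b)         # exact string comparison misses the reordered sentence
print(jaccard(a, b))  # same word set, so the score is 1.0
print(jaccard(a, c))  # unrelated sentence, low score
```

A pair would be treated as near-duplicates when the score exceeds a threshold chosen for the dataset. That threshold is a tuning decision, not a value prescribed by any tool.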

Deduplication identifies identical data blocks across large inputs. This is particularly useful for modern games, which often exceed 60 GB and may contain duplicate assets.
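The idea of spotting identical blocks can be sketched in Python as follows. This is an assumption-laden illustration, not xtool's actual algorithm: the fixed 4096-byte block size, the SHA-256 hash, and the names `dedup_blocks` and `restore` are all hypothetical choices, and real tools may chunk data differently.

```python
import hashlib

def dedup_blocks(data: bytes, block_size: int = 4096):
    """Split data into fixed-size blocks and store each distinct block once.

    Returns (unique_blocks, refs), where refs[i] is the index in
    unique_blocks of the i-th block of the input.
    """
    index_by_digest = {}
    unique_blocks = []
    refs = []
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        digest = hashlib.sha256(block).digest()
        if digest not in index_by_digest:
            index_by_digest[digest] = len(unique_blocks)
            unique_blocks.append(block)
        refs.append(index_by_digest[digest])
    return unique_blocks, refs

def restore(unique_blocks, refs) -> bytes:
    """Rebuild the original byte stream from the stored blocks and references."""
    return b"".join(unique_blocks[i] for i in refs)

# Toy input: the same 4096-byte "asset" appears twice around a block of zeros.
asset = bytes(range(256)) * 16
data = asset + b"\x00" * 4096 + asset

unique_blocks, refs = dedup_blocks(data)
print(len(refs), len(unique_blocks))  # 3 blocks read, 2 stored
print(restore(unique_blocks, refs) == data)
```

Only the first copy of each block is stored, and later copies become small references, which is where the space saving on duplicate assets comes from.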

Always deduplicate before tokenization. Removing duplicates at the raw-text level is far more effective than removing them after the text has been split into subwords.
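A minimal sketch of that ordering in Python, assuming a pipeline that drops exact duplicates first and tokenizes only what remains. The names `dedup_raw` and `toy_tokenize` are hypothetical, and the whitespace splitter merely stands in for a real subword tokenizer.

```python
import hashlib

def dedup_raw(texts):
    """Keep the first occurrence of each text, keyed by SHA-256 of the stripped string."""
    seen = set()
    kept = []
    for text in texts:
        key = hashlib.sha256(text.strip().encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(text)
    return kept

def toy_tokenize(text):
    """Stand-in for a real subword tokenizer: splits on whitespace only."""
    return text.split()

# Hypothetical corpus; the last entry differs only by trailing whitespace.
corpus = ["the cat sat", "a dog ran", "the cat sat", "the cat sat "]

unique = dedup_raw(corpus)                  # step 1: deduplicate raw text
tokens = [toy_tokenize(t) for t in unique]  # step 2: tokenize survivors only

print(len(corpus), len(unique))  # 4 inputs, 2 texts actually tokenized
```

Running the steps in this order means duplicate texts are never tokenized at all.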

Your raw dataset has the same row repeated 5 times:
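A hypothetical version of that situation, sketched in plain Python rather than with xtool itself. The row contents are invented for illustration.

```python
from collections import Counter

# Hypothetical dataset: one row repeated 5 times plus one distinct row.
rows = ["id=1,text=hello world"] * 5 + ["id=2,text=goodbye"]

# Count how often each row occurs to see the duplication.
counts = Counter(rows)
print(counts["id=1,text=hello world"])  # 5

# dict.fromkeys keeps the first occurrence of each row and preserves order.
unique_rows = list(dict.fromkeys(rows))
print(unique_rows)
```

After exact deduplication the repeated row survives once, so the six input rows collapse to two.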