This article discusses the need for better methods of describing the differences between the large language datasets used in machine learning. Researchers from the Allen Institute for AI, the University of Washington, and the University of California have proposed WIMBD (What’s In My Big Data), a collection of tools that helps practitioners rapidly examine the content of massive text corpora. The article also discusses the advantages and disadvantages of language models, as well as the need for further work analyzing multiple datasets along the same dimensions.