I’m betting the truth is somewhere in between. Models are only as good as their training data, so if they prune out the bad training examples over time to raise overall quality and accuracy, it should, in theory, vastly improve every model. But the datasets they’re using now are enormous: 1 trillion+ tokens for the larger models. Microsoft (ugh, I know) is experimenting with the “Phi-2” model, which uses significantly less data to train and focuses primarily on the quality of the dataset itself, so that a 2.7B-parameter model can compete with a 7B-parameter model.
https://www.microsoft.com/en-us/research/blog/phi-2-the-surprising-power-of-small-language-models/
This is likely where these models are heading: pruning out superfluous and outright incorrect training data.
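To make "pruning" a bit more concrete, here is a toy sketch of what a filtering pass could look like. It is only an illustration, not how Microsoft or anyone else actually curates data: the `quality_score` heuristic, the thresholds, and the sample documents are all made up for the example. It also only handles the "superfluous" side (duplicates, fragments, noisy text). Catching factually incorrect data is a much harder problem that a heuristic like this does not address.

```python
import hashlib


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash the same."""
    return " ".join(text.lower().split())


def quality_score(text: str) -> float:
    """Hypothetical heuristic: share of characters that are letters or spaces."""
    if not text:
        return 0.0
    good = sum(ch.isalpha() or ch.isspace() for ch in text)
    return good / len(text)


def prune(docs, min_score=0.8, min_words=5):
    """Drop duplicates, very short fragments, and symbol-heavy noise.

    The thresholds are arbitrary assumptions chosen for this example.
    """
    seen = set()
    kept = []
    for doc in docs:
        norm = normalize(doc)
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # duplicate after normalization
        seen.add(digest)
        if len(norm.split()) < min_words:
            continue  # too short to be useful
        if quality_score(norm) < min_score:
            continue  # mostly digits or symbols
        kept.append(doc)
    return kept


docs = [
    "The quick brown fox jumps over the lazy dog.",
    "the quick  brown fox jumps over the lazy dog.",  # near-identical copy
    "$$$ 1234 #### 5678 @@@@ 9999 !!!! 0000",         # noise
    "Too short.",
    "Small models can do well when the training text is clean.",
]
kept = prune(docs)
print(kept)
```

On this sample only the first and last documents survive. Real pipelines tend to use fuzzier deduplication and learned quality classifiers, but the overall shape (score each document, drop what falls below a bar) is similar.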