There’s actually a parameter-for-parameter perplexity improvement for BitNet-1.58, and it grows as the model scales up.
So yes, post-training quantization shows clear perplexity degradation, but if you train with the quantization in place from the start it can actually beat FP.
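For concreteness, here's a minimal sketch of the absmean ternary quantization the BitNet b1.58 paper describes: weights are scaled by their mean absolute value, then rounded and clipped to {-1, 0, +1}. The NumPy implementation below is my own illustration, not code from the paper; in actual training the forward pass uses the ternary weights while gradients update latent full-precision weights via a straight-through estimator.

```python
import numpy as np

def absmean_ternary(w: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Quantize a weight matrix to {-1, 0, +1} using absmean scaling:
    divide by the mean absolute weight, then round and clip."""
    gamma = np.abs(w).mean()
    return np.clip(np.round(w / (gamma + eps)), -1, 1)

# The latent weights stay full precision; only the forward pass
# sees the ternary projection (training loop itself not shown).
w_latent = np.random.randn(4, 4).astype(np.float32)
w_q = absmean_ternary(w_latent)
```

Because the quantizer is in the loop from step one, the optimizer can route around the lost precision instead of being handed a lossy snapshot after the fact.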
Which makes sense through the lens of the superposition hypothesis, where the weights actually represent a hyperdimensional virtual vector space. If the weights have too much precision, competing features might settle for fuzzier shared representations instead of restructuring the virtual network into better-matching nodes.
Looking at the data so far, constrained weight precision is probably going to be the future of pretraining within a generation or two.