Hi,
I have to interface with systems that use iso-8859-x encoding (not by choice…), and I’m surprised that the following doesn’t throw an error:
>>> str(bytes(range(256)), encoding="iso-8859-1", errors="strict")
'\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0¡¢£¤¥¦§¨©ª«¬\xad®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ'
Bytes in the 0x80—0x9f range are not valid iso-8859-1, and I was expecting the above to raise a DecodeError of some sort; instead it looks like those are passed through.
I’m perfectly happy with this behaviour, I would like to make sure I can depend on it. Can I take an arbitrary byte buffer, decode as ISO-8859-1, and never get any error? Is it guaranteed to be lossless ?
I was curious, so I did some searches on this topic for you and found these pages:
The second link in particular notes:
Whether depending on this is actually correct or not is beyond me, but it seems like people have actually been using that pass-through behavior in practice and put it into things like Python2 -> 3 migration guides.
The first link suggests that the seemingly undefined ranges are valid as C0 and C1 control codes which may be why it doesn’t throw errors.
Thank you for the pointers @[email protected]
My use case certainly fall into that described by ESR, I only really need to understand markup that falls in the ASCII range and pass the rest unmodified
Maybe I’m tripping here but this kinda also explains why the human genome contains lots of noncoding DNA.
I think it has more to do with the fact Mother Nature is really inefficient and allocated much more DNA storage than necessary.