unicode - UTF-8 coding in Python
I have a UTF-8 character encoded with '_' in between the bytes, e.g., '_ea_b4_80'. I'm trying to convert it to the UTF-8 character using the replace method, but I can't get the correct encoding.
This is the code example:
    import sys
    reload(sys)
    sys.setdefaultencoding('utf8')

    r = '_ea_b4_80'
    r2 = '\xea\xb4\x80'
    r = r.replace('_', '\\x')
    print r
    print r.encode("utf-8")
    print r2
In this example, r is not the same as r2; this is the output:
    \xea\xb4\x80
    \xea\xb4\x80
    관 <-- correctly shown
What might be wrong?
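\x is only meaningful in string literals, so you can't use replace to add it. The parser handles the escape when it reads the source code; a backslash and an x put together at run time are just two ordinary characters.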
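The difference is easy to see by comparing lengths. This is a minimal Python 3 sketch; the variable names `literal` and `built` are only illustrative and do not come from the question:

```python
# An escape written in a literal vs. one assembled at run time (Python 3).

# In a literal, the parser turns each \xNN into a single byte.
literal = b'\xea\xb4\x80'

# Built with replace, each "\x" stays as two ordinary characters
# (a backslash and an 'x') followed by two hex digits.
built = '_ea_b4_80'.replace('_', '\\x')

print(len(literal))             # 3
print(len(built))               # 12
print(built)                    # \xea\xb4\x80
print(literal.decode('utf-8'))  # 관
```

The two values only look alike when printed; one is three bytes, the other is twelve characters of text.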
To get your desired result, strip the underscores, convert the hex digits to bytes, then decode:
    import binascii

    r = '_ea_b4_80'
    rhexonly = r.replace('_', '')           # returns 'eab480'
    rbytes = binascii.unhexlify(rhexonly)   # returns b'\xea\xb4\x80'
    rtext = rbytes.decode('utf-8')          # returns '관' (unicode if Py2, str if Py3)
    print(rtext)
Which should print 관 as you desire.
If you're using modern Py3, you can avoid the import (assuming r is in fact a str; bytes.fromhex, unlike binascii.unhexlify, only takes str inputs, not bytes inputs) by using the bytes.fromhex class method in place of binascii.unhexlify:
    rbytes = bytes.fromhex(rhexonly)  # returns b'\xea\xb4\x80'
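If the escaped bytes are embedded in other text, the same steps can be applied to each run of '_XX' groups. The following is only a sketch under that assumption: the helper name `decode_underscore_hex` and the sample string 'name-_ea_b4_80.txt' are hypothetical and do not appear in the question.

```python
import re

# One or more consecutive groups of '_' followed by two hex digits.
_HEX_RUN = re.compile(r'(?:_[0-9a-fA-F]{2})+')


def decode_underscore_hex(s):
    """Replace each run of '_XX' groups in s with the UTF-8 text it encodes."""
    def _decode(match):
        # Same steps as above: drop the underscores, hex -> bytes, bytes -> str.
        hex_digits = match.group(0).replace('_', '')
        return bytes.fromhex(hex_digits).decode('utf-8')

    return _HEX_RUN.sub(_decode, s)


print(decode_underscore_hex('_ea_b4_80'))           # 관
print(decode_underscore_hex('name-_ea_b4_80.txt'))  # name-관.txt
```

One caveat with this approach: a literal underscore that happens to be followed by two hex digits (as in 'my_address') is also treated as an escape, and decoding it may raise UnicodeDecodeError, so it only suits input where '_' is reserved for escaping.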