unicode - UTF-8 coding in Python


I have UTF-8 characters whose bytes are written as hex with '_' in between, e.g., '_ea_b4_80'. I'm trying to convert this into the UTF-8 character using the replace method, but I can't get the correct encoding.

This is a code example:

    import sys
    reload(sys)
    sys.setdefaultencoding('utf8')

    r = '_ea_b4_80'
    r2 = '\xea\xb4\x80'

    r = r.replace('_', '\\x')
    print r
    print r.encode("utf-8")
    print r2

In this example, r is not the same as r2; this is the output:

    \xea\xb4\x80
    \xea\xb4\x80
    관  <-- correctly shown

What might be wrong?

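\x is only meaningful in string literals, where the parser turns the escape into a single byte or character when it reads the source code. You can't use replace to add it afterwards: that just inserts a literal backslash and an x into the string.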
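A small illustration of the difference, written for Python 3 (the variable names replaced and literal are only for this sketch): the string built with replace holds twelve ordinary characters, while the literal holds three bytes.

```python
# replace() inserts the two characters "\" and "x"; it does not re-run
# the escape processing that Python applies to string literals.
replaced = '_ea_b4_80'.replace('_', '\\x')
literal = b'\xea\xb4\x80'  # here the parser converts each \xNN into one byte

print(len(replaced))  # 12 characters: \, x, e, a, \, x, b, 4, ...
print(len(literal))   # 3 bytes
print(replaced == '\\xea\\xb4\\x80')  # True: still just backslashes and hex digits
```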

To get the desired result, strip the underscores, convert the hex digits to bytes, then decode:

    import binascii

    r = '_ea_b4_80'

    rhexonly = r.replace('_', '')          # returns 'eab480'
    rbytes = binascii.unhexlify(rhexonly)  # returns b'\xea\xb4\x80'
    rtext = rbytes.decode('utf-8')         # returns '관' (unicode on Py2, str on Py3)
    print(rtext)

Which should give you the result you desire.

If you're using modern Py3, you can avoid the import by using the bytes.fromhex class method in place of binascii.unhexlify (assuming r is in fact a str; bytes.fromhex, unlike binascii.unhexlify, only takes str inputs, not bytes inputs):

    rbytes = bytes.fromhex(rhexonly)  # returns b'\xea\xb4\x80'
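Putting the steps together, a minimal Python 3 sketch might look like the following. The function name decode_underscore_hex and its encoding parameter are hypothetical, chosen only for this example; it assumes the input is a str made of underscore-separated hex byte pairs.

```python
def decode_underscore_hex(s, encoding='utf-8'):
    """Hypothetical helper: turn a string like '_ea_b4_80' into the text it encodes."""
    hex_only = s.replace('_', '')    # '_ea_b4_80' -> 'eab480'
    raw = bytes.fromhex(hex_only)    # Python 3 only; raises ValueError on invalid hex
    return raw.decode(encoding)      # raises UnicodeDecodeError on invalid byte sequences


print(decode_underscore_hex('_ea_b4_80'))  # 관
```

Because the underscores are removed before conversion, the same call also handles inputs that contain several characters in a row, as long as every byte is written as two hex digits.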
