A problem with encoding in HBase and Python

New Contributor

Hi, we have an encoding problem with the data saved in the HBase instance on our Cloudera cluster. The data in HBase is stored as UTF-8. An external Python process is invoked to read that data, and it seems that json.dumps changes the JSON into a Unicode one; that is how it appears in a web browser and in the application that receives the JSON. We are not passing any argument to json.dumps, because we understood that this uses the default encoding, UTF-8.
When I specify a UTF-8 encoding in the call to json.dumps, I see a set of weird characters in the web browser.

This is part of the Python code. I am building a list of "movimientos" (movements) read from HBase. The data loaded from HBase is in UTF-8, but if I call print json.dumps(movimientos) without any extra parameter, I see the data in Unicode. If I run print movimientos instead, I see the data in UTF-8!

This is an entry from a row in HBase:

mov:ec72f5c5-961c-4ff5-bdaf-f958b1687acb   timestamp=1447937758934, value=type:"CARDS"|date:2015-07-21|amount:99.0|categoryId:6|categoryDescription:"ELECTR\xC3\x93NICADE CONSUMO-ELECTRODOM\xC3\x89STICO-MEN.HOGAR"|customCategoryId:|customCategoryDescription:|cardNumber:4511635231378008|clie
                                           ntId:101|cardType:"Credit"|transactionType:Compra|entityCode:7010|merchantCode:I5722|accountingDate:2015-07-21|amountOtherCu
                                           rrency:0.0|currency:"EUR"|channel:Tpv|movementDescription:"ACHICA.ES"|uuid:"ec72f5c5-961c-4ff5-bdaf-f958b1687acb"|textDescri
                                           ption:"ELECTRODOM\xC3\x89STICOS  EQUIPOS EL\xC3\x89CTRICOS  M\xC3\x81QUINAS ESPECIALES  ANTENAS PARAB\xC3\x93LICAS  AIRE ACO
                                           NDICIONADO Y RECEPCI\xC3\x93N CANALES TV."

You can see the UTF-8 encoded characters (for example \xC3\x93)...

...
movimientos = []
...
# Collect the movements for each group of uids from HBase.
for uid in uids_cards:
    movimiento = getRowsHBase(ip, port, table, rowKey, column_family + uid)
    movimientos.append(movimiento)
for uid in uids_incomes:
    movimiento = getRowsHBase(ip, port, table, rowKey, column_family + uid)
    movimientos.append(movimiento)
for uid in uids_receipts:
    movimiento = getRowsHBase(ip, port, table, rowKey, column_family + uid)
    movimientos.append(movimiento)

# Nothing found: report the error as JSON and stop.
if len(movimientos) == 0:
    output['error'] = 1
    # Message text: "No data for the query"
    output['mensaje'] = 'No hay datos para la consulta'
    print json.dumps(output)
    sys.exit()
print json.dumps(movimientos)
...
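
The behaviour described above can probably be reproduced without HBase. The following is only a sketch, written to run on both Python 2.7 and Python 3: the byte string `raw` is a made-up stand-in for one field returned by getRowsHBase, and it assumes the values arrive as UTF-8 encoded bytes. By default json.dumps uses ensure_ascii=True, which writes every non-ASCII character as a \uXXXX escape; passing ensure_ascii=False keeps the characters as they are.

```python
import json

# Hypothetical stand-in for one field read from HBase: UTF-8 encoded bytes.
raw = b'ELECTR\xc3\x93NICA'

# Decode the bytes into a text (unicode) string before serializing.
text = raw.decode('utf-8')

# Default (ensure_ascii=True): non-ASCII characters become \uXXXX escapes.
escaped = json.dumps({'categoryDescription': text})

# ensure_ascii=False keeps the characters themselves in the output.
readable = json.dumps({'categoryDescription': text}, ensure_ascii=False)

# Encode explicitly to UTF-8 bytes before writing to a socket or stdout.
payload = readable.encode('utf-8')
```

Both `escaped` and `readable` are valid JSON and decode to the same string, so a JSON parser on the receiving side should not care which one it gets. If the unescaped form shows up as weird characters in a browser, one possible cause (an assumption, not something confirmed in this thread) is that the response is not declared as UTF-8.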

1 ACCEPTED SOLUTION

Mentor
I am not quite sure I follow what the problem is. Could you post the differing outputs or a screenshot thereof?

This may not be the issue, but note that printing the representation (repr) of a string in Python does not print the Unicode characters themselves; it prints hex escapes instead.
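
As a rough illustration of that point, here is a small sketch that should run on both Python 2.7 and Python 3. The value `raw` is invented for the example and assumes the data is held as UTF-8 bytes. Converting a list to a string uses the repr of each element, so the bytes show up as hex escapes even though the underlying data is unchanged.

```python
# Hypothetical UTF-8 encoded value, similar to the HBase row shown earlier.
raw = b'ELECTR\xc3\x93NICA'
items = [raw]

# repr() shows the non-ASCII bytes as \x.. hex escapes.
as_repr = repr(raw)

# str() of a list uses repr() for each element, so the escapes appear here too.
as_list = str(items)

# Decoding gives the actual text: 12 bytes become 11 characters.
as_text = raw.decode('utf-8')

print(as_repr)
print(as_list)
```

The escapes are only a display form; the stored bytes are the same in every case.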
