Support Questions

vinicius_leal · ‎06-23-2018

Hi guys,

I'd like load Json file in Pig but the output format in Hive isn't good.

Step to produce:

I'm use in my enviroment:

Sandbox HDP 2.6.4

Pig 0.16.0

Hive 1.2.1000

Jsonviewer (jsonviewer.png)

1 - Updload Json File in HDFS for Ambari:

Json File: http://api.pgi.gov.br/api/1/serie/1678.json

2 - Run script Pig:

A = LOAD '/pig_scripts/1678.json' Using JsonLoader('url: chararray, id :chararray,nome :chararray,nome_estendido:chararray,descricao:chararray,inicio:chararray,final:chararray, formatacao:chararray, data_atualizacao:chararray, aditividade:chararray,url_origem: chararray,tempo_aditividade:chararray,portal_dados_abertos: chararray, disponibilizacao:tuple(disponibilizacao:chararray,dias:chararray),estado:tuple(estado:chararray),fonte_gestora:tuple(fonte_gestora_url:chararray,fonte_gestora_id:chararray,fonte_gestora_nome:chararray,fonte_gestora_descricao: chararray, fonte_gestora_tipo:chararray,orgao_primeiro_escalao:tuple(fonte_gestora_orgao_nome:chararray,fonte_gestora_orgao_descricao:chararray)), fonte_provedora:tuple(fonte_provedora_url:chararray,fonte_provedora_id:chararray,fonte_provedora_nome:chararray,fonte_provedora_descricao: chararray, fonte_provedora_tipo:chararray,orgao_primeiro_escalao:tuple(fonte_provedora_orgao_nome:chararray,fonte_provedora_orgao_descricao:chararray)), grupo_informacao:tuple(grupo_informacao_url:chararray,grupo_informacao_id:chararray,grupo_informacao_nome:chararray,grupo_informacao_palavras_chave:{(chararray)}), base_territorial:tuple(base_territorial:chararray),periodicidade:tuple(periodicidade: chararray),multiplicador:tuple(multiplicador_id:chararray,multiplicador_nome:chararray ), produto:tuple(produto_nome:chararray),publicacao:tuple(status_publicacao:chararray), unidade_medida:tuple(unidade_medida_url:chararray,unidade_medida_id:chararray,unidade_medida_nome:chararray,unidade_medida:chararray),orgao_primeiro_escalao:tuple(orgao_primeiro_escalao_nome:chararray,orgao_primeiro_escalao_descricao:chararray),valores:{(valor: chararray,municipio_ibge : chararray,ano : chararray)}'); C = FOREACH A GENERATE valores.valor,valores.municipio_ibge,valores.ano; STORE C INTO 'dados_pig/pdde_teste' USING PigStorage('\t');

3 - Create External Table Hive:

CREATE EXTERNAL TABLE `tb_pdde_teste`( `valor` string, `municipio_ibge` string, `ano` string) COMMENT 'Dados do Programa Dinheiro Direto na Escola' ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' LOCATION 'hdfs://sandbox-hdp.hortonworks.com:8020/user/admin/dados_pig/pdde_teste' TBLPROPERTIES ( 'numFiles'='1', 'totalSize'='1453488', 'transient_lastDdlTime'='1529764110')

4 - Select in External Table:

SELECT * FROM tb_pdde_teste

5 - Visualize Output Hive (output-hive.png)

I'd like output format like this:

tb_pdde_teste.valor	tb_pdde_teste.municipio_ibge	tb_pdde_teste.ano
40200	120001	2003
17500	120005	2003
44900	120010	2003
23900	120013	2003
...	...	...

Any suggestion?

vmurakami · ‎06-24-2018

Hey @Vinicius Leal!
So we're namesake hum! Cool name hehe 🙂 Guess you're from Brazil?
Regarding your issue, you can try to use the LATERAL VIEW for Array typo on Hive.
So I took the liberty to make a test, here's:
1 ) Create the table in Hive

CREATE EXTERNAL TABLE `tb_pdde_teste`( 
url STRING
,id STRING
,nome STRING
,nome_estendido STRING
,descricao STRING
,inicio STRING
,final STRING
,formatacao STRING
,data_atualizacao STRING
,aditividade STRING
,url_origem STRING
,tempo_aditividade STRING
,portal_dados_abertos STRING
,disponibilizacao struct<disponibilizacao:STRING, dias:STRING>
,estado struct<estado:STRING>
,fonte_gestora struct<fonte_gestora_url:STRING,fonte_gestora_id:STRING,fonte_gestora_nome:STRING,fonte_gestora_descricao:STRING
,fonte_gestora_tipo:STRING,orgao_primeiro_escalao:struct<fonte_gestora_orgao_nome:STRING,fonte_gestora_orgao_descricao:STRING>>
,fonte_provedora struct<fonte_provedora_url:STRING,fonte_provedora_id:STRING,fonte_provedora_nome:STRING,fonte_provedora_descricao:STRING,fonte_provedora_tipo:STRING, orgao_primeiro_escalao:struct<fonte_provedora_orgao_nome:STRING,fonte_provedora_orgao_descricao:STRING>>
,grupo_informacao struct<grupo_informacao_url:STRING, grupo_informacao_id:STRING, grupo_informacao_nome:STRING, grupo_informacao_palavras_chave:array<STRING>>
,base_territorial struct<base_territorial:STRING>
,periodicidade struct<periodicidade:STRING>
,multiplicador struct<multiplicador_id:STRING,multiplicador_nome:STRING>
,produto struct<produto_nome:STRING>
,publicacao struct<status_publicacao:STRING>
,unidade_medida struct<unidade_medida_url:STRING,unidade_medida_id:STRING,unidade_medida_nome:STRING,unidade_medida:STRING>
,orgao_primeiro_escalao struct<orgao_primeiro_escalao_nome:STRING,orgao_primeiro_escalao_descricao:STRING>
,valores array<struct<valor:STRING, municipio_ibge:STRING,ano:STRING>>
) 
COMMENT 'Dados do Programa Dinheiro Direto na Escola' 
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/user/hive/dados_pig/pdde_teste';

2) Get the JSON and feed HDFS

[hive@c1123-node3 ~]$ curl -X GET http://api.pgi.gov.br/api/1/serie/1678.json > 1678.json
[root@c1123-node3 ~]# su - hdfs
[hdfs@c1123-node3 ~]$ hdfs dfs -mkdir /pig_scripts/
[hdfs@c1123-node3 ~]$ hdfs dfs -chown -R pig:hadoop /pig_scripts/
[hdfs@c1123-node3 ~]$ hdfs dfs -chmod 777 /pig_scripts
[hdfs@c1123-node3 ~]$ exit
[root@c1123-node3 ~]# su - hive
[hive@c1123-node3 ~]$ hdfs dfs -put 1678.json /pig_scripts

3) Run the Pig job using Hcatalog to throw the values onto the hive table

[hive@c1123-node3 ~]$ pig -useHCatalog 
grunt> A = LOAD '/pig_scripts/1678.json' Using JsonLoader('url: chararray, id :chararray,nome :chararray,nome_estendido:chararray,descricao:chararray,inicio:chararray,final:chararray, formatacao:chararray, data_atualizacao:chararray, aditividade:chararray,url_origem: chararray,tempo_aditividade:chararray,portal_dados_abertos: chararray, disponibilizacao:tuple(disponibilizacao:chararray,dias:chararray),estado:tuple(estado:chararray),fonte_gestora:tuple(fonte_gestora_url:chararray,fonte_gestora_id:chararray,fonte_gestora_nome:chararray,fonte_gestora_descricao: chararray, fonte_gestora_tipo:chararray,orgao_primeiro_escalao:tuple(fonte_gestora_orgao_nome:chararray,fonte_gestora_orgao_descricao:chararray)), fonte_provedora:tuple(fonte_provedora_url:chararray,fonte_provedora_id:chararray,fonte_provedora_nome:chararray,fonte_provedora_descricao: chararray, fonte_provedora_tipo:chararray,orgao_primeiro_escalao:tuple(fonte_provedora_orgao_nome:chararray,fonte_provedora_orgao_descricao:chararray)), grupo_informacao:tuple(grupo_informacao_url:chararray,grupo_informacao_id:chararray,grupo_informacao_nome:chararray,grupo_informacao_palavras_chave:{(chararray)}), base_territorial:tuple(base_territorial:chararray),periodicidade:tuple(periodicidade: chararray),multiplicador:tuple(multiplicador_id:chararray,multiplicador_nome:chararray ), produto:tuple(produto_nome:chararray),publicacao:tuple(status_publicacao:chararray), unidade_medida:tuple(unidade_medida_url:chararray,unidade_medida_id:chararray,unidade_medida_nome:chararray,unidade_medida:chararray),orgao_primeiro_escalao:tuple(orgao_primeiro_escalao_nome:chararray,orgao_primeiro_escalao_descricao:chararray),valores:{tuple(valor: chararray,municipio_ibge : chararray,ano : chararray)}'); 
STORE A INTO 'tb_pdde_teste' USING org.apache.hive.hcatalog.pig.HCatStorer();

4) Break the values from valores attrib on the query

hive> select x from tb_pdde_teste lateral view explode(valores) tb_pdde_teste as x limit 1; 
OK
{"valor":"40200","municipio_ibge":"120001","ano":"2003"}
Time taken: 0.162 seconds, Fetched: 1 row(s)
hive> select x.valor, x.municipio_ibge from tb_pdde_teste lateral view explode(valores) tb_pdde_teste as x limit 1; 
OK
40200	120001
Time taken: 0.11 seconds, Fetched: 1 row(s)

PS: I made some changes compared to your code, like:
- Added to valores attrib a tuple after the {} array declaration.
- Added the HCatStorer to save the result from Pig directly onto Hive
- Matched all fields from the JSON file and created the full DDL on Hive
- Used the concept of LATERAL VIEW coz we're using a single position in the Array typo with a lot of Struct values inside of the data.

Hope this helps!

View solution in original post

vmurakami · ‎06-24-2018