Support Questions

vinicius_leal · ‎06-23-2018

Hi guys,

I'd like load Json file in Pig but the output format in Hive isn't good.

Step to produce:

I'm use in my enviroment:

Sandbox HDP 2.6.4

Pig 0.16.0

Hive 1.2.1000

Jsonviewer (jsonviewer.png)

1 - Updload Json File in HDFS for Ambari:

Json File: http://api.pgi.gov.br/api/1/serie/1678.json

2 - Run script Pig:

A = LOAD '/pig_scripts/1678.json' Using JsonLoader('url: chararray, id :chararray,nome :chararray,nome_estendido:chararray,descricao:chararray,inicio:chararray,final:chararray, formatacao:chararray, data_atualizacao:chararray, aditividade:chararray,url_origem: chararray,tempo_aditividade:chararray,portal_dados_abertos: chararray, disponibilizacao:tuple(disponibilizacao:chararray,dias:chararray),estado:tuple(estado:chararray),fonte_gestora:tuple(fonte_gestora_url:chararray,fonte_gestora_id:chararray,fonte_gestora_nome:chararray,fonte_gestora_descricao: chararray, fonte_gestora_tipo:chararray,orgao_primeiro_escalao:tuple(fonte_gestora_orgao_nome:chararray,fonte_gestora_orgao_descricao:chararray)), fonte_provedora:tuple(fonte_provedora_url:chararray,fonte_provedora_id:chararray,fonte_provedora_nome:chararray,fonte_provedora_descricao: chararray, fonte_provedora_tipo:chararray,orgao_primeiro_escalao:tuple(fonte_provedora_orgao_nome:chararray,fonte_provedora_orgao_descricao:chararray)), grupo_informacao:tuple(grupo_informacao_url:chararray,grupo_informacao_id:chararray,grupo_informacao_nome:chararray,grupo_informacao_palavras_chave:{(chararray)}), base_territorial:tuple(base_territorial:chararray),periodicidade:tuple(periodicidade: chararray),multiplicador:tuple(multiplicador_id:chararray,multiplicador_nome:chararray ), produto:tuple(produto_nome:chararray),publicacao:tuple(status_publicacao:chararray), unidade_medida:tuple(unidade_medida_url:chararray,unidade_medida_id:chararray,unidade_medida_nome:chararray,unidade_medida:chararray),orgao_primeiro_escalao:tuple(orgao_primeiro_escalao_nome:chararray,orgao_primeiro_escalao_descricao:chararray),valores:{(valor: chararray,municipio_ibge : chararray,ano : chararray)}'); C = FOREACH A GENERATE valores.valor,valores.municipio_ibge,valores.ano; STORE C INTO 'dados_pig/pdde_teste' USING PigStorage('\t');

3 - Create External Table Hive:

CREATE EXTERNAL TABLE `tb_pdde_teste`( `valor` string, `municipio_ibge` string, `ano` string) COMMENT 'Dados do Programa Dinheiro Direto na Escola' ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' LOCATION 'hdfs://sandbox-hdp.hortonworks.com:8020/user/admin/dados_pig/pdde_teste' TBLPROPERTIES ( 'numFiles'='1', 'totalSize'='1453488', 'transient_lastDdlTime'='1529764110')

4 - Select in External Table:

SELECT * FROM tb_pdde_teste

5 - Visualize Output Hive (output-hive.png)

I'd like output format like this:

tb_pdde_teste.valor	tb_pdde_teste.municipio_ibge	tb_pdde_teste.ano
40200	120001	2003
17500	120005	2003
44900	120010	2003
23900	120013	2003
...	...	...

Any suggestion?

vmurakami · ‎06-24-2018

Hey @Vinicius Leal!
So we're namesake hum! Cool name hehe 🙂 Guess you're from Brazil?
Regarding your issue, you can try to use the LATERAL VIEW for Array typo on Hive.
So I took the liberty to make a test, here's:
1 ) Create the table in Hive

CREATE EXTERNAL TABLE `tb_pdde_teste`( 
url STRING
,id STRING
,nome STRING
,nome_estendido STRING
,descricao STRING
,inicio STRING
,final STRING
,formatacao STRING
,data_atualizacao STRING
,aditividade STRING
,url_origem STRING
,tempo_aditividade STRING
,portal_dados_abertos STRING
,disponibilizacao struct<disponibilizacao:STRING, dias:STRING>
,estado struct<estado:STRING>
,fonte_gestora struct<fonte_gestora_url:STRING,fonte_gestora_id:STRING,fonte_gestora_nome:STRING,fonte_gestora_descricao:STRING
,fonte_gestora_tipo:STRING,orgao_primeiro_escalao:struct<fonte_gestora_orgao_nome:STRING,fonte_gestora_orgao_descricao:STRING>>
,fonte_provedora struct<fonte_provedora_url:STRING,fonte_provedora_id:STRING,fonte_provedora_nome:STRING,fonte_provedora_descricao:STRING,fonte_provedora_tipo:STRING, orgao_primeiro_escalao:struct<fonte_provedora_orgao_nome:STRING,fonte_provedora_orgao_descricao:STRING>>
,grupo_informacao struct<grupo_informacao_url:STRING, grupo_informacao_id:STRING, grupo_informacao_nome:STRING, grupo_informacao_palavras_chave:array<STRING>>
,base_territorial struct<base_territorial:STRING>
,periodicidade struct<periodicidade:STRING>
,multiplicador struct<multiplicador_id:STRING,multiplicador_nome:STRING>
,produto struct<produto_nome:STRING>
,publicacao struct<status_publicacao:STRING>
,unidade_medida struct<unidade_medida_url:STRING,unidade_medida_id:STRING,unidade_medida_nome:STRING,unidade_medida:STRING>
,orgao_primeiro_escalao struct<orgao_primeiro_escalao_nome:STRING,orgao_primeiro_escalao_descricao:STRING>
,valores array<struct<valor:STRING, municipio_ibge:STRING,ano:STRING>>
) 
COMMENT 'Dados do Programa Dinheiro Direto na Escola' 
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/user/hive/dados_pig/pdde_teste';

2) Get the JSON and feed HDFS

[hive@c1123-node3 ~]$ curl -X GET http://api.pgi.gov.br/api/1/serie/1678.json > 1678.json
[root@c1123-node3 ~]# su - hdfs
[hdfs@c1123-node3 ~]$ hdfs dfs -mkdir /pig_scripts/
[hdfs@c1123-node3 ~]$ hdfs dfs -chown -R pig:hadoop /pig_scripts/
[hdfs@c1123-node3 ~]$ hdfs dfs -chmod 777 /pig_scripts
[hdfs@c1123-node3 ~]$ exit
[root@c1123-node3 ~]# su - hive
[hive@c1123-node3 ~]$ hdfs dfs -put 1678.json /pig_scripts

3) Run the Pig job using Hcatalog to throw the values onto the hive table

[hive@c1123-node3 ~]$ pig -useHCatalog 
grunt> A = LOAD '/pig_scripts/1678.json' Using JsonLoader('url: chararray, id :chararray,nome :chararray,nome_estendido:chararray,descricao:chararray,inicio:chararray,final:chararray, formatacao:chararray, data_atualizacao:chararray, aditividade:chararray,url_origem: chararray,tempo_aditividade:chararray,portal_dados_abertos: chararray, disponibilizacao:tuple(disponibilizacao:chararray,dias:chararray),estado:tuple(estado:chararray),fonte_gestora:tuple(fonte_gestora_url:chararray,fonte_gestora_id:chararray,fonte_gestora_nome:chararray,fonte_gestora_descricao: chararray, fonte_gestora_tipo:chararray,orgao_primeiro_escalao:tuple(fonte_gestora_orgao_nome:chararray,fonte_gestora_orgao_descricao:chararray)), fonte_provedora:tuple(fonte_provedora_url:chararray,fonte_provedora_id:chararray,fonte_provedora_nome:chararray,fonte_provedora_descricao: chararray, fonte_provedora_tipo:chararray,orgao_primeiro_escalao:tuple(fonte_provedora_orgao_nome:chararray,fonte_provedora_orgao_descricao:chararray)), grupo_informacao:tuple(grupo_informacao_url:chararray,grupo_informacao_id:chararray,grupo_informacao_nome:chararray,grupo_informacao_palavras_chave:{(chararray)}), base_territorial:tuple(base_territorial:chararray),periodicidade:tuple(periodicidade: chararray),multiplicador:tuple(multiplicador_id:chararray,multiplicador_nome:chararray ), produto:tuple(produto_nome:chararray),publicacao:tuple(status_publicacao:chararray), unidade_medida:tuple(unidade_medida_url:chararray,unidade_medida_id:chararray,unidade_medida_nome:chararray,unidade_medida:chararray),orgao_primeiro_escalao:tuple(orgao_primeiro_escalao_nome:chararray,orgao_primeiro_escalao_descricao:chararray),valores:{tuple(valor: chararray,municipio_ibge : chararray,ano : chararray)}'); 
STORE A INTO 'tb_pdde_teste' USING org.apache.hive.hcatalog.pig.HCatStorer();

4) Break the values from valores attrib on the query

hive> select x from tb_pdde_teste lateral view explode(valores) tb_pdde_teste as x limit 1; 
OK
{"valor":"40200","municipio_ibge":"120001","ano":"2003"}
Time taken: 0.162 seconds, Fetched: 1 row(s)
hive> select x.valor, x.municipio_ibge from tb_pdde_teste lateral view explode(valores) tb_pdde_teste as x limit 1; 
OK
40200	120001
Time taken: 0.11 seconds, Fetched: 1 row(s)

PS: I made some changes compared to your code, like:
- Added to valores attrib a tuple after the {} array declaration.
- Added the HCatStorer to save the result from Pig directly onto Hive
- Matched all fields from the JSON file and created the full DDL on Hive
- Used the concept of LATERAL VIEW coz we're using a single position in the Array typo with a lot of Struct values inside of the data.

Hope this helps!

View solution in original post

vmurakami · ‎06-24-2018

Hey @Vinicius Leal!
So we're namesake hum! Cool name hehe 🙂 Guess you're from Brazil?
Regarding your issue, you can try to use the LATERAL VIEW for Array typo on Hive.
So I took the liberty to make a test, here's:
1 ) Create the table in Hive

CREATE EXTERNAL TABLE `tb_pdde_teste`( 
url STRING
,id STRING
,nome STRING
,nome_estendido STRING
,descricao STRING
,inicio STRING
,final STRING
,formatacao STRING
,data_atualizacao STRING
,aditividade STRING
,url_origem STRING
,tempo_aditividade STRING
,portal_dados_abertos STRING
,disponibilizacao struct<disponibilizacao:STRING, dias:STRING>
,estado struct<estado:STRING>
,fonte_gestora struct<fonte_gestora_url:STRING,fonte_gestora_id:STRING,fonte_gestora_nome:STRING,fonte_gestora_descricao:STRING
,fonte_gestora_tipo:STRING,orgao_primeiro_escalao:struct<fonte_gestora_orgao_nome:STRING,fonte_gestora_orgao_descricao:STRING>>
,fonte_provedora struct<fonte_provedora_url:STRING,fonte_provedora_id:STRING,fonte_provedora_nome:STRING,fonte_provedora_descricao:STRING,fonte_provedora_tipo:STRING, orgao_primeiro_escalao:struct<fonte_provedora_orgao_nome:STRING,fonte_provedora_orgao_descricao:STRING>>
,grupo_informacao struct<grupo_informacao_url:STRING, grupo_informacao_id:STRING, grupo_informacao_nome:STRING, grupo_informacao_palavras_chave:array<STRING>>
,base_territorial struct<base_territorial:STRING>
,periodicidade struct<periodicidade:STRING>
,multiplicador struct<multiplicador_id:STRING,multiplicador_nome:STRING>
,produto struct<produto_nome:STRING>
,publicacao struct<status_publicacao:STRING>
,unidade_medida struct<unidade_medida_url:STRING,unidade_medida_id:STRING,unidade_medida_nome:STRING,unidade_medida:STRING>
,orgao_primeiro_escalao struct<orgao_primeiro_escalao_nome:STRING,orgao_primeiro_escalao_descricao:STRING>
,valores array<struct<valor:STRING, municipio_ibge:STRING,ano:STRING>>
) 
COMMENT 'Dados do Programa Dinheiro Direto na Escola' 
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/user/hive/dados_pig/pdde_teste';

2) Get the JSON and feed HDFS

[hive@c1123-node3 ~]$ curl -X GET http://api.pgi.gov.br/api/1/serie/1678.json > 1678.json
[root@c1123-node3 ~]# su - hdfs
[hdfs@c1123-node3 ~]$ hdfs dfs -mkdir /pig_scripts/
[hdfs@c1123-node3 ~]$ hdfs dfs -chown -R pig:hadoop /pig_scripts/
[hdfs@c1123-node3 ~]$ hdfs dfs -chmod 777 /pig_scripts
[hdfs@c1123-node3 ~]$ exit
[root@c1123-node3 ~]# su - hive
[hive@c1123-node3 ~]$ hdfs dfs -put 1678.json /pig_scripts

3) Run the Pig job using Hcatalog to throw the values onto the hive table

[hive@c1123-node3 ~]$ pig -useHCatalog 
grunt> A = LOAD '/pig_scripts/1678.json' Using JsonLoader('url: chararray, id :chararray,nome :chararray,nome_estendido:chararray,descricao:chararray,inicio:chararray,final:chararray, formatacao:chararray, data_atualizacao:chararray, aditividade:chararray,url_origem: chararray,tempo_aditividade:chararray,portal_dados_abertos: chararray, disponibilizacao:tuple(disponibilizacao:chararray,dias:chararray),estado:tuple(estado:chararray),fonte_gestora:tuple(fonte_gestora_url:chararray,fonte_gestora_id:chararray,fonte_gestora_nome:chararray,fonte_gestora_descricao: chararray, fonte_gestora_tipo:chararray,orgao_primeiro_escalao:tuple(fonte_gestora_orgao_nome:chararray,fonte_gestora_orgao_descricao:chararray)), fonte_provedora:tuple(fonte_provedora_url:chararray,fonte_provedora_id:chararray,fonte_provedora_nome:chararray,fonte_provedora_descricao: chararray, fonte_provedora_tipo:chararray,orgao_primeiro_escalao:tuple(fonte_provedora_orgao_nome:chararray,fonte_provedora_orgao_descricao:chararray)), grupo_informacao:tuple(grupo_informacao_url:chararray,grupo_informacao_id:chararray,grupo_informacao_nome:chararray,grupo_informacao_palavras_chave:{(chararray)}), base_territorial:tuple(base_territorial:chararray),periodicidade:tuple(periodicidade: chararray),multiplicador:tuple(multiplicador_id:chararray,multiplicador_nome:chararray ), produto:tuple(produto_nome:chararray),publicacao:tuple(status_publicacao:chararray), unidade_medida:tuple(unidade_medida_url:chararray,unidade_medida_id:chararray,unidade_medida_nome:chararray,unidade_medida:chararray),orgao_primeiro_escalao:tuple(orgao_primeiro_escalao_nome:chararray,orgao_primeiro_escalao_descricao:chararray),valores:{tuple(valor: chararray,municipio_ibge : chararray,ano : chararray)}'); 
STORE A INTO 'tb_pdde_teste' USING org.apache.hive.hcatalog.pig.HCatStorer();

4) Break the values from valores attrib on the query

hive> select x from tb_pdde_teste lateral view explode(valores) tb_pdde_teste as x limit 1; 
OK
{"valor":"40200","municipio_ibge":"120001","ano":"2003"}
Time taken: 0.162 seconds, Fetched: 1 row(s)
hive> select x.valor, x.municipio_ibge from tb_pdde_teste lateral view explode(valores) tb_pdde_teste as x limit 1; 
OK
40200	120001
Time taken: 0.11 seconds, Fetched: 1 row(s)

PS: I made some changes compared to your code, like:
- Added to valores attrib a tuple after the {} array declaration.
- Added the HCatStorer to save the result from Pig directly onto Hive
- Matched all fields from the JSON file and created the full DDL on Hive
- Used the concept of LATERAL VIEW coz we're using a single position in the Array typo with a lot of Struct values inside of the data.

Hope this helps!

Cloudera Community

Support Questions

Load Json Nested Data in Pig

Converting Nested JSON to Flat JSON using JOLT

Replacing characters within a nested json

getting null from JoltTransformJson in nifi while ...

Importing and Querying JSON data in Hive

Parse nested json using Spark RDD

QueryRecord processor issue with nested JSON

Pig data load problem

Pig - Load data from two different path

Loading data into HBase using Pig script

Decompressing nested ZIP files in NiFi