Sqoop to write JSON in HDFS

Contributor

Can I import data in JSON format from Oracle tables using Sqoop?

 

I am using the version below:

Sqoop 1.4.5-cdh5.3.3

 

Thanks,

Vinod

6 REPLIES

Rising Star

@Vinod369

Sqoop doesn't directly support the JSON file format. The only way I can think of to import data as JSON is to use the HCatalog import function. This can be done as follows:

 

1) Create a table in HCatalog with a JSON SerDe. Below is just an example:

 

hcat -e "create table json_test (id int, value string) row format serde 'org.apache.hive.hcatalog.data.JsonSerDe'"

2) Use Sqoop to import data into that table:

 

 

sqoop import \
--connect 'jdbc:mysql://nightly57-unsecure-1.gce.cloudera.com:3306/sqoop1' \
--username sqoop1 --password cloudera --table test -m 1 \
--hcatalog-database default --hcatalog-table json_test
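
Since the original question is about Oracle, the same pattern works with an Oracle JDBC URL in place of the MySQL one. A rough sketch, with a made-up host, service name, user and table:

sqoop import \
--connect 'jdbc:oracle:thin:@//oracle-host:1521/ORCL' \
--username myuser --password mypassword --table MYSCHEMA.MYTABLE -m 1 \
--hcatalog-database default --hcatalog-table json_test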

3) Check the imported data:

 

 

[root@nightly57-unsecure-1 ~]# hadoop fs -ls /user/hive/warehouse/json_test
Found 1 items
-rw-rw-rw-   3 root hive         46 2016-07-14 18:35 /user/hive/warehouse/json_test/part-m-00000
[root@nightly57-unsecure-1 ~]# hadoop fs -cat /user/hive/warehouse/json_test/part-m-00000
{"id":1,"value":"abcd"}
{"id":2,"value":"cd"}

 

The downside of this method is that you have to create a Hive/HCatalog table before the import, and you will have to clean it up if it is not needed in the future.
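
If the table was only created for the import, it can be dropped afterwards with the same hcat CLI used above (for a managed table this also removes the files under the warehouse directory). A minimal sketch:

hcat -e "drop table if exists json_test"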

 

Please let me know if you have any further questions.

 

Explorer

Data can be read directly from the Oracle DB and written out in JSON format using Spark; there is no need for Sqoop in between.

 

val DF1 = sqlContext.read.format("jdbc")
  .option("url", "<connection string>")
  .option("dbtable", "<table name>")
  .option("user", "<user name>")
  .option("password", "<password>")
  .load()

 

DF1.write.format("org.apache.spark.sql.json").save("<path>")
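
Note that this assumes the Oracle JDBC driver is already on the Spark classpath, for example by passing it when starting the shell (the jar name depends on the driver version you have):

spark-shell --jars /path/to/ojdbc7.jar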

 

This post is in the wrong section, but I wanted to show how data can be loaded directly from the Oracle DB and saved in JSON format without Sqoop.

Rising Star

@Aditya

This Spark code looks awesome!

 

I have some questions:

1) Does Spark read data from Oracle in the driver or in the executors?

2) If Spark reads from Oracle in the executors, how is the source data split among the different executors? That is, how does each executor know which part of the data it should read?

 

Sqoop does a lot of optimization in such areas. Its Oracle connectors, especially the direct connector, can read metadata from Oracle and use that metadata to split the import among different mappers, and its performance is quite good. Does Spark have a similar optimization?

Mentor
> 1) Does Spark read data from Oracle in the driver or in the executors?

It would be read from within the executor-run task. The read is done via an RDD.

> 2) If Spark reads from Oracle in the executors, how is the source data split among the different executors? That is, how does each executor know which part of the data it should read?

There's a partitioning mechanism; if it is left unspecified, Spark reads the whole query/table inside a single task, which is what would happen in the trivial example provided above. If you do specify the full partition attributes in the read function, as documented at http://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases, then Spark generates appropriate WHERE clauses and performs the read from multiple tasks in parallel, with the partition column and its bounds used to divide the work properly among the tasks.
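
For illustration, a partitioned read along the lines of that documentation could look roughly like the sketch below; the column name and bounds are placeholders and have to match a numeric column in the source table:

val partitionedDF = sqlContext.read.format("jdbc")
  .option("url", "<connection string>")
  .option("dbtable", "<table name>")
  .option("user", "<user name>")
  .option("password", "<password>")
  .option("partitionColumn", "ID")  // numeric column used to split the reads
  .option("lowerBound", "1")        // lowest value expected in that column
  .option("upperBound", "1000000")  // highest value expected in that column
  .option("numPartitions", "8")     // number of parallel read tasks
  .load()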

Agreed that Sqoop specialises the DB reads overall much more than Spark does at this point (especially so when specialised connectors are involved), but for very simple ad-hoc usage where code tie-in is required, Spark JDBC may be used. In all other use-cases it seems like the wrong tool to use for the sole purpose of a DB import or export given the lack of specialisations.

Rising Star

Thanks for the insight, @Harsh J!

 

Having had a quick look at the document you linked, it looks like Spark needs extra information (the low/high bounds) to do the partitioning, while Sqoop can figure this out by itself. Yes, I agree that Spark is very convenient if only an ad-hoc import is needed in a coding environment, but other than that, Sqoop is a better tool for importing/exporting data to and from an RDBMS.
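
For comparison, in Sqoop the split column can be given explicitly with --split-by (otherwise it falls back to the table's primary key), and Sqoop derives the bounds itself with a MIN/MAX boundary query on that column. A hypothetical example with made-up connection details:

sqoop import \
--connect 'jdbc:oracle:thin:@//oracle-host:1521/ORCL' \
--username myuser --password mypassword \
--table MYSCHEMA.MYTABLE --split-by ID -m 8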

New Contributor

Great Suggestion!
Thanks for that!

 

But I would like to ask one question: if I want to have struct or array fields in the target, how should I transform the MySQL data so that it fits the HCatalog schema? The need here is to have nested data from the other collection instead of a foreign-key representation. Currently we are using Sqoop import only, and we are trying to modify the query so that it is accepted by the HCatalog schema.


Thanks & Regards,
Mahendra