Created 06-25-2021 04:35 AM
Hello Team,
can you please advise if there is an equivalent of the function on Kudu below to extract address and location from a table with XML data type?
select TMP_ACCOUNT_CODE_N,decode(EXTRACTVALUE (address_x, '//ADDRESS_DTLS/@STREET_DESC'),'.',null,EXTRACTVALUE (address_x, '//ADDRESS_DTLS/@STREET_DESC'))||' '||EXTRACTVALUE (address_x, '//ADDRESS_DTLS/@SUB_LOCALITY_DESC')
||' '||EXTRACTVALUE (address_x, '//ADDRESS_DTLS/@CITY_DESC')
New_installation_address from tmp_address_xml@cbsstandby
where address_type_n = 4
Regards,
Roshan
Created 07-08-2021 02:24 PM
Hi @roshanbi ,
You can simply put your sqoop command in a shell file and then have Cron run that script according to your desired schedule (depending on how frequently you expect the data to be updated in the Oracle source table).
Alternatively, you can use Oozie to orchestrate the running of the sqoop job.
Regards,
Alex
Created on 06-25-2021 06:53 AM - edited 06-25-2021 07:21 AM
Which methods do you think would be most appropriate to use? I was thinking of using beehive to read the table/run query in real time and save it in parquet. The load from parqet to Kudu?
or using Sqoop to read Oracle tables into HDFS. Then from HDFS to Kudu.
Created 06-26-2021 07:03 AM
Hi @roshanbi,
There is no equivalent of oracle's EXTRACTVALUE in Kudu or Impala SQL. XML is not one of the supported formats.
I think your approach of running the SQL with EXTRACTVALUE as part of the sqoop job is a good one. That way you are writing the fields you need into target Hive table and from there insert into Kudu. This would be the fastest path and would work if you are not planning to get more fields from your source XML table in the future.
An alternative approach is to use Hive with a 3rd party SerDe for XML. Then let Hive do the conversion from XML format to something like Avro, and then create your Impala table based on that Avro data. This way is more involved, but gives you flexibility in terms of the schema.
Hope that helps,
Alex
Created 06-26-2021 07:18 AM
Hi @aakulov
thanks for the update.
Can you please advise how can I schedule this scoop job so that it will update the hive table with incremental changes(CDC)? for example, suppose the XML fields are updated on Oracle, how can I schedule scoop job to replicate the incremental changes on Hive and Kudu?
Regards,
Roshan
Created 07-08-2021 02:24 PM
Hi @roshanbi ,
You can simply put your sqoop command in a shell file and then have Cron run that script according to your desired schedule (depending on how frequently you expect the data to be updated in the Oracle source table).
Alternatively, you can use Oozie to orchestrate the running of the sqoop job.
Regards,
Alex