Member since
02-01-2017
42
Posts
1
Kudos Received
0
Solutions
09-27-2019
12:10 PM
Hi Eric, my table is partitioned. I was expecting that after I run a REFRESH on the table I would see the most recent data. However, sometimes there is a lag between when the refresh completes and when I see the most recent data. I think INVALIDATE METADATA would fix this issue, but it would be costly to run on a large table. Thanks
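For reference, a minimal sketch of the commands in question (host and table names are placeholders): a per-table or per-partition REFRESH is cheap, and a table-scoped INVALIDATE METADATA is much cheaper than the global, argument-less form.

```shell
# REFRESH reloads file/block metadata for a table Impala already knows about.
# It is the cheap option after new files land in existing partitions.
impala-shell -i "${IMPALA_HOST}" -q "REFRESH mydb.mytable"

# Refreshing a single partition is cheaper still (partition key is a placeholder):
impala-shell -i "${IMPALA_HOST}" -q "REFRESH mydb.mytable PARTITION (dt='2019-09-27')"

# INVALIDATE METADATA scoped to one table: heavier than REFRESH, since the
# metadata is discarded and reloaded on next access, but far cheaper than
# running INVALIDATE METADATA with no table name at all.
impala-shell -i "${IMPALA_HOST}" -q "INVALIDATE METADATA mydb.mytable"
```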
09-25-2019
12:01 PM
I have an Oozie workflow where a Spark job loads some data into a table, I refresh the table in Impala, and then an Impala query exports the most recent data from this table to a CSV file. My problem is that even after the Impala REFRESH I do not get the most recent data, only the data from the previous load. For example: the process starts at 1pm, the Spark job finishes at 1:15pm, the Impala REFRESH executes at 1:20pm, and at 1:25pm my export query runs, but it only shows the data from the previous workflow run at 12pm, not the data from the run at 1pm. I am using Oozie and CDH 5.15.1. Sample warning message: Read 972.32 MB of data across network that was expected to be local. Block locality metadata for table '..' may be stale. Consider running "INVALIDATE METADATA ... Thanks
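One hypothetical sketch of a workaround, assuming the statements currently go through a load balancer: pin both the REFRESH and the export to the same coordinator, so the export cannot land on a different impalad whose catalog cache has not caught up yet. Host, database, and path names below are placeholders.

```shell
#!/bin/sh
# Sketch: run the REFRESH and the export through the SAME coordinator.
# Behind a load balancer, the REFRESH may land on one impalad and the
# SELECT on another daemon with stale metadata.
IMPALAD="coordinator-host:21000"   # placeholder: one daemon, not the load balancer

impala-shell -i "$IMPALAD" -q "REFRESH belltv_expl.bxdb_sproc_cataloguereport"

# The export runs only after the REFRESH above has returned on that daemon.
impala-shell -i "$IMPALAD" -B --output_delimiter=',' \
    -q "SELECT * FROM belltv_expl.bxdb_sproc_cataloguereport" \
    -o /tmp/export.csv
```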
Labels:
- Apache Impala
- Apache Oozie
- Apache Spark
02-23-2018
12:23 PM
Is it possible to save the output of an Impala query to HDFS? Sample query: impala-shell --ssl -i "${load_balancer}" -f "${2}" -o "${3}"
I would like to have it saved to HDFS rather than to the local file system. Thanks
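One possible approach, sketched under the assumption that streaming through the client is acceptable: impala-shell's -o option only writes locally, but its stdout can be piped into HDFS (the destination path is a placeholder).

```shell
# -B emits delimited text instead of the pretty-printed table; the pipe
# streams it into HDFS, since `hdfs dfs -put` accepts "-" to read stdin.
impala-shell --ssl -i "${load_balancer}" -f "${2}" -B --output_delimiter=',' \
  | hdfs dfs -put -f - "/user/myuser/output/result.csv"
```

Another option is to avoid the client entirely: INSERT OVERWRITE into a text-format table whose LOCATION is the target HDFS directory, so the data is written by the Impala daemons themselves.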
Labels:
- Apache Impala
- HDFS
10-10-2017
11:01 AM
I am doing something like this: create table test2 stored as parquet as select * from t1; and I would like to make sure that only, say, 2 Parquet files are created. Is this possible somehow? As far as I know there is no predictable threshold for how many files will be created. Thanks
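There is no direct "number of output files" option for CTAS, but the file count tracks the number of writing fragments, so it can be influenced. A sketch of the single-file case (host name is a placeholder; exact behavior depends on version and data volume):

```shell
# NUM_NODES=1 forces the whole statement to run on the coordinator, which
# typically yields a single output file. It also serializes the work, so
# it is only practical for modest data volumes.
impala-shell -i "${IMPALA_HOST}" -q "
  SET NUM_NODES=1;
  CREATE TABLE test2 STORED AS PARQUET AS SELECT * FROM t1;"
```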
Labels:
- Apache Impala
08-22-2017
08:10 AM
In Impala 5.7, can I do compute incremental stats on dynamic partitions, like compute incremental stats table partition(id>1 and id<10), or with a where clause somewhere? I receive an error: requires =, identifier not allowed: >. Is there a way to compute stats for specific partitions and not others? Right now I can only do compute incremental stats table partition(id=1). Thanks
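If the PARTITION clause in this version only accepts equality predicates, a range can still be covered by looping from the shell. A sketch, with host, table, and id range as placeholders:

```shell
# Work around the equality-only PARTITION spec by issuing one
# COMPUTE INCREMENTAL STATS per partition value in the desired range.
for id in 2 3 4 5 6 7 8 9; do
  impala-shell -i "${IMPALA_HOST}" \
    -q "COMPUTE INCREMENTAL STATS mytable PARTITION (id=${id})"
done
```

Note that later Impala releases relax the PARTITION spec, so it is worth checking the documentation for the exact version in use.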
Labels:
- Apache Impala
07-28-2017
03:38 PM
How do I add comments to specific columns of an Impala table after creating it? Thanks
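In Impala this is done with ALTER TABLE ... CHANGE, restating the column's name and type. A sketch with placeholder table, column, and type names:

```shell
# Re-declare the column with the same name and type, adding the COMMENT.
impala-shell -i "${IMPALA_HOST}" \
  -q "ALTER TABLE mytable CHANGE col1 col1 STRING COMMENT 'description of col1'"
```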
Labels:
- Apache Impala
04-02-2017
08:23 PM
Properties file
# Environment settings
queueName = default
kerberos_realm = A
jobTracker = B:8032
nameNode = hdfs://nameservice1
hive2_server = C
hive2_port = 10000
impala_server = D:21000
edge_server = E
jobTracker = yarnrm
# Project specific paths
projectPath = /user/${user.name}/oozie/mediaroom-logs
keyTabLocation = /user/${user.name}/keytabs
# job path
oozie.wf.application.path = ${projectPath}/BXDB/wf
# Project specific jars and other libraries
oozie.libpath = ${projectPath}/lib,${projectPath}/util
# Standard useful properties
oozie.use.system.libpath = true
oozie.wf.rerun.failnodes = true
# Keytab specifics
keyTabName = A.keytab
keyTabUsername = A
focusNodeLoginIng = A
focusNodeLogin = A
# Email notification list
emailList = B

xml file
<workflow-app xmlns="uri:oozie:workflow:0.4" name="bxdb">
<global>
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
</global>
<credentials>
<credential name="hive2_credentials" type="hive2">
<property>
<name>hive2.jdbc.url</name>
<value>jdbc:hive2://${hive2_server}:${hive2_port}/default</value>
</property>
<property>
<name>hive2.server.principal</name>
<value>hive/${hive2_server}@${kerberos_realm}</value>
</property>
</credential>
</credentials>
<start to="sshFileTransfer"/>
<action name="sshFileTransfer">
<ssh xmlns="uri:oozie:ssh-action:0.1">
<host>${focusNodeLoginIng}</host>
<!-- Change the name of the script -->
<command>/A/B/EsdToHDFS.sh</command>
<args>A</args>
<args> B</args>
<args> C</args>
<capture-output />
</ssh>
<ok to="process-bxdb"/>
<error to="sendEmailDQ_SRC"/>
</action>
<!-- Move from landing zone on HDFS to processing -->
<!-- Emit whether data is complete or partial, together with timestamp -->
<!-- Spark job to process the snapshots and cdr data -->
<action name="process-bxdb">
<spark xmlns="uri:oozie:spark-action:0.2">
<master>yarn</master>
<mode>cluster</mode>
<name>Process BXDB</name>
<class>IngestBXDB</class>
<jar>bxdb_sproc_cataloguereport-1.0-SNAPSHOT.jar</jar>
<spark-opts>--num-executors 8 --executor-cores 2 --executor-memory 4G --driver-memory 4g --driver-cores 2</spark-opts>
<arg>${nameNode}/user/hive/warehouse/belltv_lnd.db/bxdb_sproc_cataloguereport</arg>
<arg>Hello</arg>
<arg>World</arg>
</spark>
<ok to="impala-refresh-iis"/>
<error to="sendEmailDQ_SRC"/>
</action>
<!-- Impala invalidate/refresh metadata -->
<action name="impala-refresh-iis">
<shell xmlns="uri:oozie:shell-action:0.3">
<exec>impala-command.sh</exec>
<argument>${keyTabName}</argument>
<argument>${keyTabUsername}</argument>
<argument>${impala_server}</argument>
<argument>refresh belltv_expl.bxdb_sproc_cataloguereport</argument>
<file>${nameNode}/${keyTabLocation}/${keyTabName}</file>
</shell>
<ok to="end"/>
<error to="fail"/>
</action>
<action name="sendEmailDQ_SRC">
<email xmlns="uri:oozie:email-action:0.1">
<to>${emailList}</to>
<subject>Error in the workflow please verify</subject>
<body>BXDB project returned an error please verify</body>
</email>
<ok to="fail"/>
<error to="fail"/>
</action>
<kill name="fail">
<message>"BXDB ingestion failure"</message>
</kill>
<end name='end'/>
</workflow-app>

command to run: oozie job -config abc.properties -run
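The shell action above calls an impala-command.sh that is not shown. A hypothetical sketch of such a wrapper, assuming it authenticates with the shipped keytab and then runs the statement (argument order follows the <argument> elements in the workflow):

```shell
#!/bin/sh
# Hypothetical sketch of the impala-command.sh wrapper used by the
# "impala-refresh-iis" shell action. Arguments:
#   $1 keytab file name, $2 keytab user, $3 impalad host:port, $4 SQL statement
KEYTAB="$1"
PRINCIPAL="$2"
IMPALAD="$3"
STATEMENT="$4"

# The <file> element ships the keytab into the container's working directory.
kinit -kt "$KEYTAB" "$PRINCIPAL" || exit 1

# -k enables Kerberos authentication in impala-shell.
impala-shell -k -i "$IMPALAD" -q "$STATEMENT"
```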
03-31-2017
12:43 PM
I set up my workflow, put it into HDFS, and I try to run it with the properties file in the conf directory using this syntax. I am really not sure why it is not working, whether I have a typo in my workflow.xml or job.properties, or whether I need to modify some config setting. Error message: here is the link, https://ibb.co/dkHnJv Thanks
Labels:
- Apache Oozie
02-15-2017
06:54 AM
1 Kudo
I am not sure what you mean when you say the metadata tab, as I see no tab named metadata after clicking on the job. Thanks
02-15-2017
06:46 AM
Thanks for the response, really good and detailed. Could you give a slightly lower-level answer as well, for example: how would I efficiently add data from a Spark DataFrame to a Hive table? The goal is to improve speed by using Spark instead of Hive or Impala for DB insertions. Thanks