Member since: 02-01-2017
Posts: 42
Kudos Received: 0
Solutions: 0
09-27-2019
12:10 PM
Hi Eric, My table is partitioned, and I was expecting that after I refresh the table I would see the most recent data in it. However, sometimes there is a lag between when the refresh completes and when I see the most recent data. I think INVALIDATE METADATA would fix this, but it would be costly to run on a large table. Thanks
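One idea I want to try: since the table is partitioned, refresh only the partition the load touched instead of invalidating everything (the table and partition column names below are made up, and partition-level REFRESH needs a reasonably recent Impala):

    impala-shell -i "${load_balancer}" -q "REFRESH my_db.my_table PARTITION (load_date='2019-09-27')"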
09-25-2019
12:01 PM
I have an Oozie workflow where a Spark job loads some data into a table, I refresh the table in Impala, and then an Impala query exports the most recent data in this table to a CSV file. My problem is that even after doing the Impala refresh I do not get the most recent data, only the data from the previous load. For example, the process starts running at 1pm, the Spark job finishes at 1:15pm, the Impala refresh is executed at 1:20pm, and then at 1:25pm my query to export the data runs, but it only shows the data from the previous workflow, which ran at 12pm, and not the data from the workflow that ran at 1pm. I am using Oozie and CDH 5.15.1. Sample warning message: Read 972.32 MB of data across network that was expected to be local. Block locality metadata for table '..' may be stale. Consider running "INVALIDATE METADATA ... Thanks
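One thing I plan to test: because the refresh goes through the load balancer, it may run on a different coordinator than the export query, so the export can see stale metadata. Setting SYNC_DDL should make the refresh return only once the change is visible on all coordinators (a sketch, using the host variable from my workflow):

    impala-shell -i "${impala_server}" -q "SET SYNC_DDL=1; REFRESH belltv_expl.bxdb_sproc_cataloguereport"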
Labels:
- Apache Impala
- Apache Oozie
- Apache Spark
03-19-2018
01:50 PM
I want to export an Impala query that concats many columns into one, without a delimiter and without the extra quotes that come with it. Here is my command: impala-shell --ssl -B -i ${load_balancer} -f ${sql_file_local_subscriber} -o ${file_path_edge_subscriber} --print_header; Basically it gives " ""A"",""B"",""C"" " but I want "A","B","C". How can I accomplish this through Impala? Thank you
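The workaround I am experimenting with: build the whole output line inside concat() so impala-shell writes the string as-is in delimited mode (column and table names below are made up):

    impala-shell --ssl -B -i ${load_balancer} -o ${file_path_edge_subscriber} -q "SELECT concat('\"', a, '\",\"', b, '\",\"', c, '\"') FROM my_table"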
Labels:
- Apache Impala
- HDFS
02-23-2018
12:23 PM
Is it possible to save the output of an Impala query to HDFS? Sample command: impala-shell --ssl -i "${load_balancer}" -f "${2}" -o "${3}"
I would like to have it saved not locally but to HDFS. Thanks
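One approach I am considering, since -o only writes to the local filesystem: drop -o and stream stdout straight into HDFS (assuming our version of hdfs dfs -put accepts - for stdin):

    impala-shell --ssl -B -i "${load_balancer}" -f "${2}" | hdfs dfs -put - "${3}"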
Labels:
- Apache Impala
- HDFS
12-20-2017
02:39 PM
I am looking for

    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.10</artifactId>
      <version>1.6.0</version>
    </dependency>

but the Cloudera version. How do I find it, while keeping the Spark version the same for Spark Streaming?

    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming_2.10</artifactId>
      <version>1.6.0</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming-kafka_2.10</artifactId>
      <version>1.6.1</version>
    </dependency>

The same question applies for different Spark versions. I found this page but I am still a bit lost: https://www.cloudera.com/documentation/enterprise/release-notes/topics/cdh_vd_cdh5_maven_repo_57x.html Thanks
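From what I can tell, the CDH builds keep the upstream version and append a -cdh suffix, resolved from the Cloudera Maven repository. A sketch of what I think the pom should contain (the exact -cdh5.7.0 suffix below is a guess, to be verified against the linked page for your release):

    <repositories>
      <repository>
        <id>cloudera</id>
        <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
      </repository>
    </repositories>

    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.10</artifactId>
      <version>1.6.0-cdh5.7.0</version>
    </dependency>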
Labels:
- Apache Hive
- Apache Spark
- Quickstart VM
10-10-2017
11:01 AM
I am doing something like this: create table test2 stored as parquet as select * from t1; and I would like to make sure that only, say, 2 Parquet files are created. Is this possible somehow? As far as I know there is no predictable way to control how many files will be created. Thanks
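The closest workaround I have found so far is to restrict the CTAS to a single node so it writes a single file; I do not think an exact count of 2 can be requested directly (a sketch):

    impala-shell -q "SET NUM_NODES=1; CREATE TABLE test2 STORED AS PARQUET AS SELECT * FROM t1"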
Labels:
- Apache Impala
09-01-2017
10:16 AM
To get a Hive table to appear in Impala, I can do INVALIDATE METADATA on everything, but that is very memory intensive. Is there a way to invalidate metadata for just one database? Say I have 100 schemas and I create a new one in Hive called Ab. Can I do something like invalidate metadata Ab? I know this can be done on a table, but what about a schema? Thanks
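What I am doing in the meantime is invalidating the new schema's tables one by one, pulling the list from Hive since Impala does not know the schema yet (a sketch, using the Ab schema from above):

    for t in $(hive -S -e "USE Ab; SHOW TABLES;"); do
      impala-shell -q "INVALIDATE METADATA Ab.${t}"
    done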
Labels:
- Apache Hive
- Apache Impala
08-22-2017
08:10 AM
In Impala on CDH 5.7, can I do COMPUTE INCREMENTAL STATS on dynamic partitions, like compute incremental stats table partition(id>1 and id<10), or with a WHERE clause somewhere? I receive an error saying it requires =; the > operator is not allowed. Is there a way to compute stats for specific partitions and not others? Right now I can only do compute incremental stats table partition(id=1). Thanks
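The workaround I am using for now is to loop over the partition values from the shell, since the PARTITION clause only takes equality predicates (the table name and range below are made up):

    for i in $(seq 2 9); do
      impala-shell -q "COMPUTE INCREMENTAL STATS my_table PARTITION (id=${i})"
    done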
Labels:
- Apache Impala
07-28-2017
03:38 PM
How do I add comments to specific columns in an Impala table after creating it? Thanks
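The route I am trying: apply the comment through Hive's CHANGE clause, keeping the column name and type the same, then invalidate the table in Impala (the names and the STRING type below are made up):

    hive -e "ALTER TABLE my_table CHANGE my_col my_col STRING COMMENT 'description of the column'"
    impala-shell -q "INVALIDATE METADATA my_table"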
Labels:
- Apache Impala
04-02-2017
08:23 PM
Properties file
# Environment settings
queueName = default
kerberos_realm = A
jobTracker = B:8032
nameNode = hdfs://nameservice1
hive2_server = C
hive2_port = 10000
impala_server = D:21000
edge_server = E
jobTracker = yarnrm
# Project specific paths
projectPath = /user/${user.name}/oozie/mediaroom-logs
keyTabLocation = /user/${user.name}/keytabs
# job path
oozie.wf.application.path = ${projectPath}/BXDB/wf
# Project specific jars and other libraries
oozie.libpath = ${projectPath}/lib,${projectPath}/util
# Standard useful properties
oozie.use.system.libpath = true
oozie.wf.rerun.failnodes = true
# Keytab specifics
keyTabName = A.keytab
keyTabUsername = A
focusNodeLoginIng = A
focusNodeLogin = A
# Email notification list
emailList = B

XML file:
<workflow-app xmlns="uri:oozie:workflow:0.4" name="bxdb">
  <global>
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
  </global>
  <credentials>
    <credential name="hive2_credentials" type="hive2">
      <property>
        <name>hive2.jdbc.url</name>
        <value>jdbc:hive2://${hive2_server}:${hive2_port}/default</value>
      </property>
      <property>
        <name>hive2.server.principal</name>
        <value>hive/${hive2_server}@${kerberos_realm}</value>
      </property>
    </credential>
  </credentials>
  <start to="sshFileTransfer"/>
  <action name="sshFileTransfer">
    <ssh xmlns="uri:oozie:ssh-action:0.1">
      <host>${focusNodeLoginIng}</host>
      <!-- Change the name of the script -->
      <command>/A/B/EsdToHDFS.sh</command>
      <args>A</args>
      <args>B</args>
      <args>C</args>
      <capture-output/>
    </ssh>
    <ok to="process-bxdb"/>
    <error to="sendEmailDQ_SRC"/>
  </action>
  <!-- Move from landing zone on HDFS to processing -->
  <!-- Emit whether data is complete or partial, together with timestamp -->
  <!-- Spark job to process the snapshots and cdr data -->
  <action name="process-bxdb">
    <spark xmlns="uri:oozie:spark-action:0.2">
      <master>yarn</master>
      <mode>cluster</mode>
      <name>Process BXDB</name>
      <class>IngestBXDB</class>
      <jar>bxdb_sproc_cataloguereport-1.0-SNAPSHOT.jar</jar>
      <spark-opts>--num-executors 8 --executor-cores 2 --executor-memory 4G --driver-memory 4g --driver-cores 2</spark-opts>
      <arg>${nameNode}/user/hive/warehouse/belltv_lnd.db/bxdb_sproc_cataloguereport</arg>
      <arg>Hello</arg>
      <arg>World</arg>
    </spark>
    <ok to="impala-refresh-iis"/>
    <error to="sendEmailDQ_SRC"/>
  </action>
  <!-- Impala invalidate/refresh metadata -->
  <action name="impala-refresh-iis">
    <shell xmlns="uri:oozie:shell-action:0.3">
      <exec>impala-command.sh</exec>
      <argument>${keyTabName}</argument>
      <argument>${keyTabUsername}</argument>
      <argument>${impala_server}</argument>
      <argument>refresh belltv_expl.bxdb_sproc_cataloguereport</argument>
      <file>${nameNode}/${keyTabLocation}/${keyTabName}</file>
    </shell>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <action name="sendEmailDQ_SRC">
    <email xmlns="uri:oozie:email-action:0.1">
      <to>${emailList}</to>
      <subject>Error in the workflow please verify</subject>
      <body>BXDB project returned an error please verify</body>
    </email>
    <ok to="fail"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>"BXDB ingestion failure"</message>
  </kill>
  <end name="end"/>
</workflow-app>

Command to run:

    oozie job -abc.properties -run
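Side note on that last line: the standard CLI form passes the properties file with -config and needs the server URL, so I believe the intended invocation is (the server host below is a placeholder):

    export OOZIE_URL=http://oozie-host.example.com:11000/oozie
    oozie job -config abc.properties -run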
03-31-2017
12:43 PM
I set up my workflow, put it into HDFS as well, and I try to run it with the properties file in the conf directory using this syntax. I am really not sure why it is not working, whether I have a typo in my workflow.xml or job.properties, or whether I need to modify some config setting. Here is a link to the error message: https://ibb.co/dkHnJv Thanks
Labels:
- Apache Oozie
03-06-2017
12:15 PM
My fix was that in IntelliJ I needed to modify the VM options to: -Dspark.master=local -Dspark.driver-memory=4g -Dspark.executor-memory=4g -XX:MaxPermSize=2g -Dhive.metastore.uris=thrift://127.0.0.1:9083 Then everything worked.
03-06-2017
11:43 AM
I receive the error "database does not exist" with this code in an IntelliJ Spark Maven project, and on all new projects. What is weird is that the same code works in an older project. Here is the code, and below it the error log. I have hive-site.xml in both the Hive and Spark conf directories and set all the /user/hive/warehouse directories to 777; what else could cause the error? I am using the Cloudera QuickStart VM with CDH 5.7.2, Spark 1.6, Scala 2.10.5. Thanks
Labels:
- Apache Hive
- Apache Spark
03-02-2017
10:36 AM
I saw https://issues.cloudera.org/browse/IMPALA-1832 and I was wondering if there is any update on it. For example, the estimated memory is x but the actual memory used per node is 3x or 4x. I am talking about queries that use gigabytes of RAM. Thanks
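In the meantime I compare the two numbers by hand: EXPLAIN prints the per-host estimate before the query runs, and the SUMMARY command in impala-shell shows the peak memory actually used afterwards (the query itself is a placeholder):

    [impalad:21000] > EXPLAIN SELECT ...;
    [impalad:21000] > SELECT ...;
    [impalad:21000] > SUMMARY;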
Labels:
- Apache Impala
- Cloudera Manager
02-28-2017
07:26 AM
When I search in Hue for running or completed jobs, say "abc", it only shows me the 10 or 50 most recent jobs on the page. My problem is that it does not show the 10 or 50 that match the search, only the 10 or 50 most recent, so I need to click Next many times if I want to find a job more than a day old. My question: is there a way to see more results on one page, like 500, or to make the search go through all jobs ever run? What I get now is a bunch of empty pages, and I have to click Next many times before I find the job I am looking for. Thank you
Labels:
- Apache Oozie
- Cloudera Hue
02-27-2017
02:18 PM
I am now getting this error when trying to execute this command: from cm_api.api_client import ApiResource

    Type "help", "copyright", "credits" or "license" for more information.
    >>> from cm_api.api_client import ApiResource
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "cm_api.py", line 31, in <module>
        hdlr = logging.FileHandler('/var/tmp/cm_api.log')
      File "/usr/lib64/python2.6/logging/__init__.py", line 835, in __init__
        StreamHandler.__init__(self, self._open())
      File "/usr/lib64/python2.6/logging/__init__.py", line 854, in _open
        stream = open(self.baseFilename, self.mode)
    IOError: [Errno 13] Permission denied: '/var/tmp/cm_api.log'
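My current guess at a fix, to be confirmed: the module cannot open a log file that an earlier run (perhaps under sudo) already created, so removing it or loosening its permissions should clear the import:

    sudo rm /var/tmp/cm_api.log
    # or keep the file and make it writable for everyone:
    sudo chmod 666 /var/tmp/cm_api.log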
02-27-2017
01:52 PM
I tried to pip install the Cloudera Manager API after installing pip. I executed the command pip install -vvv cmd-api, and tried it with sudo as well, and received the same error. Here is a screenshot of the error. I tried modifying my certificates, thinking it might be a proxy-related error, but I am not sure. I am using Python 2.6.6; any suggestions are greatly appreciated. It still seems to have not installed. Thanks
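One thing I want to rule out before blaming the proxy: the Python client seems to be published on PyPI as cm-api, so the package name above may simply be wrong. Planned retries (the proxy host below is a placeholder):

    pip install cm-api
    # if it turns out to be the network after all:
    pip install --proxy http://proxy.example.com:8080 cm-api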
Labels:
- Cloudera Manager
02-23-2017
07:30 AM
I found it under port 8888; 127.0.0.1:8888 gives access to Hue.
02-22-2017
12:13 PM
I wanted to know if Hue is installed on the Cloudera QuickStart VM. I found Cloudera Manager, which I was able to access at 127.0.0.1:7180, but I wanted to know if it is possible to access Hue as well. I think we are using CDH 5.7.2 on the VM. Thanks
Labels:
- Cloudera Hue
- Manual Installation
02-15-2017
06:54 AM
I am not sure what you mean by the metadata tab, as I see no tab named Metadata after clicking on the job. Thanks
02-15-2017
06:46 AM
Thanks for the response, really good and detailed. Could you give a bit of a lower-level response as well, say, how would I efficiently add data from a DataFrame in Spark to a table in Hive? The goal is to improve speed by using Spark instead of Hive or Impala for DB insertions. Thanks.
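A sketch of the kind of thing I mean (Spark 1.6; my_db.my_table is made up and assumed to already exist in Hive with a schema matching the DataFrame):

    // append the DataFrame's rows into the existing Hive table
    df.write.mode("append").insertInto("my_db.my_table")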
02-15-2017
06:25 AM
Thanks for the response. I did not see the part about it not running on a cluster, as I will be using a cluster. One more question: why would it not work on a cluster? Does it have something to do with it being distributed, in general?
02-14-2017
08:24 AM
I wanted to know how to do an SFTP transfer to HDFS in Spark 1.6, loading data mainly in CSV format, mid-size files from a few GB up to maybe 50 GB per workflow. Is this recommended to do in Spark, or better from a script? I found the library https://github.com/springml/spark-sftp and wanted to know if this is a recommended way of doing things. One of my problems with this library is how I would handle touch files when I need to read data from a specific date to a specific date. Thanks. I am using Spark 1.6 and Scala, with a Cloudera Manager version around 5.7.2, I think; it is routinely upgraded and might be around 5.9.
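For comparison, the script alternative I have in mind is just a staged copy (the host and paths below are made up):

    # pull one day's file from the SFTP source, then push it into HDFS
    sftp user@source.example.com:/exports/data_2017-02-14.csv /tmp/staging/
    hdfs dfs -put /tmp/staging/data_2017-02-14.csv /landing/csv/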
Labels:
- Apache Spark
- HDFS
02-14-2017
08:13 AM
I am ingesting data into HDFS and I would like to convert the Hive SQL script to Spark SQL to improve speed. I am looking for docs or a general solution to a problem of this sort. Any feedback is greatly appreciated. The Spark code would be written in Scala.
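To make the question concrete, a minimal sketch of what I picture (Spark 1.6; the SQL is a placeholder standing in for the statements the Hive script runs today):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    val sc = new SparkContext(new SparkConf().setAppName("hive-to-sparksql"))
    val hc = new HiveContext(sc)
    // the same statements the Hive script runs, submitted through Spark SQL instead
    hc.sql("INSERT OVERWRITE TABLE my_db.target SELECT ... FROM my_db.source")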
Labels:
- Apache Hive
- Apache Impala
- Apache Spark
02-13-2017
02:06 PM
Hi, thanks for the docs. I actually needed to start it from Cloudera Manager; I could not do it from the command line, which must have something to do with my setup.
02-10-2017
06:51 AM
Thanks for the response; this did not work for me, unfortunately. This is what I tried: first I checked the status, and it was not running; then I started the service with sudo service hadoop-hdfs-datanode start; then I tried hadoop fs -ls /. This gave me the same error as before. Do I also need to start a namenode or something? I am thinking I should not, because I am not in control of the namenodes, and on my coworkers' computers it just works. Any suggestions are appreciated.
02-09-2017
01:30 PM
Did you resolve the issue? I am facing the same one when trying to execute a command, even after starting the service and having the status say okay.