Member since
09-23-2015
800
Posts
898
Kudos Received
185
Solutions
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 7358 | 08-12-2016 01:02 PM |
| | 2708 | 08-08-2016 10:00 AM |
| | 3674 | 08-03-2016 04:44 PM |
| | 7214 | 08-03-2016 02:53 PM |
| | 1864 | 08-01-2016 02:38 PM |
03-20-2016
11:09 PM
1 Kudo
You need to add the jar to the Hive process. There are two (or more) ways to do this:

a) Create a folder auxlib under your Hive directory ( /usr/hdp/&lt;version&gt;/hive/auxlib ), put the jar in there, and restart HiveServer2 or start the Hive console. Any job that runs will pick up the jars in that folder.

b) Copy the jar somewhere on the HiveServer and run the ADD JAR command. This is good for temporary testing.

c) There is also an aux libs folder variable you can use if you want to use a different folder.

I prefer option a); it seems to be the solution that always works. (Hive can be a bit tricky, since multiple processes are involved in a query: the server, the client, and the Tez jobs that are kicked off.)
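A minimal sketch of options a) and b), assuming an HDP layout and a hypothetical jar name (substitute your actual version and jar):

```shell
# Option a) put the jar in Hive's auxlib folder, then restart HiveServer2.
# Paths are assumptions for a typical HDP install; my-udf.jar is hypothetical.
mkdir -p /usr/hdp/current/hive-server2/auxlib
cp /tmp/my-udf.jar /usr/hdp/current/hive-server2/auxlib/
# ...then restart HiveServer2 (e.g. via Ambari)

# Option b) add the jar for the current session only,
# from the hive or beeline prompt:
#   ADD JAR /tmp/my-udf.jar;
```

Option a) survives restarts and applies to every session; option b) lasts only for the session that ran ADD JAR, which is why it suits temporary testing.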
03-20-2016
02:25 PM
1 Kudo
Ah cool didn't see that!
03-19-2016
10:32 PM
1 Kudo
@Artem Ervits @gopal As said, from looking at the code I am pretty sure it is. They check for the Hive input format class, but at some point it was refactored to become an interface, so the check doesn't work anymore.
03-19-2016
08:22 PM
3 Kudos
Good that you figured it out. You weren't using special characters in the original question. So yes, the parameter needs to be in quotes if the values contain anything other than normal letters. You might also have to escape things sometimes, e.g. --hivevar "day=2016/3/01" --hivevar "regex=.*\\|" (if you want the regex .*\| ). And if you use it in shell scripts you sometimes have to escape even more. I once needed 32 backslashes in Oozie to end up with one backslash in a Pig script.
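Putting the quoting and escaping together, a hedged sketch of a full invocation (host, credentials, and script name are placeholders):

```shell
# Quote each --hivevar so the shell does not interpret |, spaces, or slashes.
# Inside double quotes, \\ reaches Hive as a single backslash,
# so the script receives the regex .*\|
beeline -u jdbc:hive2://hostname:10000 -n xxxx -p xxxx \
  -f /path/to/script.hql \
  --hivevar "day=2016/3/01" \
  --hivevar "regex=.*\\|"
```

Each layer that parses the string (shell, Oozie, Pig, Hive) consumes one level of escaping, which is why deeply nested setups can need surprisingly many backslashes.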
03-19-2016
06:19 PM
3 Kudos
The correct way has changed a bit. You shouldn't use hiveconf as shown in the link, since it sets configuration parameters and is restricted when you use something like SQLStdAuth. The correct way to do it is:

beeline -u jdbc:hive2://hostname:10000 -n xxxx -p xxxx -f /home/hdfs/scripts/hive/store_wordcount.hql --hivevar day=20160301

You can then use this variable as ${day} in the SQL script. hivevars are only variables, while hiveconf values can also be settings like mapreduce.map.memory.mb.
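To illustrate how ${day} is consumed on the script side, a minimal sketch (the table and column names here are hypothetical, not from the original script):

```sql
-- store_wordcount.hql (hypothetical contents)
-- ${day} is substituted before execution by: --hivevar day=20160301
INSERT INTO TABLE wordcounts_daily
SELECT word, count(*) AS cnt
FROM wordcounts
WHERE ds = '${day}'
GROUP BY word;
```

Substitution is purely textual, so quote the variable where a string literal is expected.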
03-19-2016
10:20 AM
6 Kudos
They are actually quite different. Partitioning divides a table into subfolders that are skipped by the optimizer based on the WHERE conditions of the query; it has a direct impact on how much data is read. The influence of bucketing is more nuanced: it essentially determines how many files are in each folder and affects a variety of Hive operations. I tried to describe it here: https://community.hortonworks.com/questions/23103/hive-deciding-the-number-of-buckets.html There is also a concept called predicate pushdown, which allows Hive ORC readers to skip parts of an ORC file based on an index in the file; it sometimes plays together with bucketing. A good overview of this is here: http://www.slideshare.net/BenjaminLeonhardi/hive-loading-data Finally, Hive has a JIRA to implement bucket pruning. This means bucket files could be ignored during split generation without actually having to open the files (more like partitioning), but this is in the future. At the moment bucketing has pretty specialized use cases.
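A small DDL sketch showing both concepts side by side (the table and columns are hypothetical):

```sql
-- Partitioning creates one subfolder per sale_date value;
-- bucketing fixes the number of files inside each such folder.
CREATE TABLE sales (
  id          BIGINT,
  customer_id INT,
  amount      DECIMAL(10,2)
)
PARTITIONED BY (sale_date STRING)            -- pruned via WHERE conditions
CLUSTERED BY (customer_id) INTO 32 BUCKETS   -- 32 files per partition folder
STORED AS ORC;

-- Only the 2016-03-01 subfolder is read; other partitions are skipped:
SELECT sum(amount) FROM sales WHERE sale_date = '2016-03-01';
```

The WHERE clause prunes partitions directly, while the bucket count mainly shapes file layout (and enables features like bucketed joins) rather than reducing the data scanned.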
03-18-2016
12:19 AM
2 Kudos
Hmmm, good question. Can you tell me how you start the program, i.e. what is in the path variable? Are you sure the part files are not simply old and left over from before? I do not see any other function in your code that would write the file. Is the content changing?
03-17-2016
08:50 PM
4 Kudos
Hello Hoda, so I think I know the problem. foreachRDD executes your function on each RDD of the DStream, and you save them all to the same file, so they overwrite each other's data and the first or last writer wins. There are save functions available on the DStream, so you could transform the data with mapPartitions instead of foreachRDD and then save it with DStream.saveAsTextFiles. Or, the easiest way, you save each batch to a file with a unique name:

wDataFrame.rdd().coalesce(1, true, null).saveAsTextFile(path + time.milliseconds.toString)

I think the time variable already comes in automatically with foreachRDD, but you might have to instantiate a current date yourself if not. Now this is not very elegant, since you could have the same timestamp twice, but that is actually how the Spark Streaming guys do it if you look into the DStream.saveAsTextFiles method. You could make this even more unique by adding a random number that is large enough to never run into duplicates, or by finding a way to get the executor id. I would prefer the latter; if you find a way to get it I would be thankful :-).
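A sketch of the unique-name approach in Scala (the stream, outputPath, and the DataFrame conversion are assumptions, not from the original code):

```scala
// Assumed: a DStream[String] called stream and an output directory outputPath.
// Each micro-batch writes to its own directory keyed by the batch timestamp,
// so batches no longer overwrite each other.
import org.apache.spark.streaming.Time

stream.foreachRDD { (rdd, time: Time) =>
  if (!rdd.isEmpty()) {
    // time.milliseconds is the batch time Spark Streaming hands to foreachRDD
    rdd.coalesce(1).saveAsTextFile(outputPath + "/" + time.milliseconds.toString)
  }
}
```

The two-argument foreachRDD variant receives the batch Time directly, which avoids having to construct a current date inside the function.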
03-17-2016
02:33 PM
1 Kudo
The output tells you where to find additional information. Look at the end:

/usr/hdp/current/hadoop-client/sbin/hadoop-daemon.sh --config /usr/hdp/current/hadoop-client/conf start zkfc'' returned 1. starting zkfc, logging to /var/log/hadoop/hdfs/hadoop-hdfs-zkfc-sachadooptst2.corp.mirrorplus.com.out

So can you look into "/var/log/hadoop/hdfs/hadoop-hdfs-zkfc-sachadooptst2.corp.mirrorplus.com.out" to get more information?
03-17-2016
12:18 PM
Can you have a look into that folder? You specified /hadoop as the namenode directory, so either it doesn't exist, you don't have access to it, or it got corrupted. Also, formatting the namenode is dangerous, since it deletes all files in the cluster; jobs may then stop working because libraries are missing, etc. (You can find what to set up in the manual HDP installation guide.)