Member since
09-23-2015
800
Posts
898
Kudos Received
185
Solutions
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 7358 | 08-12-2016 01:02 PM |
| | 2708 | 08-08-2016 10:00 AM |
| | 3674 | 08-03-2016 04:44 PM |
| | 7214 | 08-03-2016 02:53 PM |
| | 1864 | 08-01-2016 02:38 PM |
03-20-2016
11:09 PM
1 Kudo
You need to add the jar to the Hive process. There are two (or more) ways to do this:

a) Create a folder auxlib under your Hive directory ( /usr/hdp/&lt;version&gt;/hive/auxlib ), put the jar in there, and restart HiveServer2 or start the Hive console. Any job that runs will pick up the jars in that folder.

b) Copy the jar somewhere on the HiveServer and run the ADD JAR command. This is good for temporary testing.

c) There is also an aux libs folder variable you can use if you want to use a different folder.

I prefer option a); it seems to be the solution that always works. (Hive can be a bit tricky, since multiple processes are involved in a query: the server, the client, and the Tez jobs that are kicked off.)
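A minimal sketch of options a) and b), assuming an HDP layout and a hypothetical jar name (substitute your actual version and jar):

```shell
# Option a) put the jar in Hive's auxlib folder, then restart HiveServer2.
# Paths are assumptions for a typical HDP install; my-udf.jar is hypothetical.
mkdir -p /usr/hdp/current/hive-server2/auxlib
cp /tmp/my-udf.jar /usr/hdp/current/hive-server2/auxlib/
# ...then restart HiveServer2 (e.g. via Ambari)

# Option b) add the jar for the current session only,
# from the hive or beeline prompt:
#   ADD JAR /tmp/my-udf.jar;
```

Option a) survives restarts and applies to every session; option b) lasts only for the session that ran ADD JAR, which is why it suits temporary testing.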
03-20-2016
02:25 PM
1 Kudo
Ah cool didn't see that!
03-19-2016
10:32 PM
1 Kudo
@Artem Ervits @gopal As said, from looking at the code I am pretty sure it is. They check for the Hive input format class, but at some point it was refactored to become an interface, so the check doesn't work anymore.
03-19-2016
08:22 PM
3 Kudos
Good that you figured it out. You weren't using special characters in the original question. So yes, the parameter needs to be in quotes if the values contain anything other than normal letters. You might also have to escape things sometimes, e.g. --hivevar "day=2016/3/01" --hivevar "regex=.*\\|" (if you want the regex .*\| ). And if you use it in shell scripts you sometimes have to escape even more. I once needed 32 backslashes in Oozie to end up with one backslash in a Pig script.
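Putting the quoting and escaping together, a hedged sketch of a full invocation (host, credentials, and script name are placeholders):

```shell
# Quote each --hivevar so the shell does not interpret |, spaces, or slashes.
# Inside double quotes, \\ reaches Hive as a single backslash,
# so the script receives the regex .*\|
beeline -u jdbc:hive2://hostname:10000 -n xxxx -p xxxx \
  -f /path/to/script.hql \
  --hivevar "day=2016/3/01" \
  --hivevar "regex=.*\\|"
```

Each layer that parses the string (shell, Oozie, Pig, Hive) consumes one level of escaping, which is why deeply nested setups can need surprisingly many backslashes.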
03-19-2016
06:19 PM
3 Kudos
The correct way has changed a bit. You shouldn't use hiveconf as shown in the link, since it sets configuration parameters and is restricted when you use something like SQLStdAuth. The correct way to do it is:

beeline -u jdbc:hive2://hostname:10000 -n xxxx -p xxxx -f /home/hdfs/scripts/hive/store_wordcount.hql --hivevar day=20160301

You can then use this variable as ${day} in the SQL script. hivevars are only variables, while hiveconf values can also be settings like mapreduce.map.memory.mb.
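To illustrate how ${day} is consumed on the script side, a minimal sketch (the table and column names here are hypothetical, not from the original script):

```sql
-- store_wordcount.hql (hypothetical contents)
-- ${day} is substituted before execution by: --hivevar day=20160301
INSERT INTO TABLE wordcounts_daily
SELECT word, count(*) AS cnt
FROM wordcounts
WHERE ds = '${day}'
GROUP BY word;
```

Substitution is purely textual, so quote the variable where a string literal is expected.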
03-19-2016
10:20 AM
6 Kudos
They are actually quite different. Partitioning divides a table into subfolders that are skipped by the optimizer based on the WHERE conditions of the query; it has a direct impact on how much data is read. The influence of bucketing is more nuanced: it essentially determines how many files are in each folder and affects a variety of Hive operations. I tried to describe it here: https://community.hortonworks.com/questions/23103/hive-deciding-the-number-of-buckets.html There is also a concept called predicate pushdown, which allows Hive ORC readers to skip parts of an ORC file based on an index in the file; it sometimes plays together with bucketing. A good overview of this is here: http://www.slideshare.net/BenjaminLeonhardi/hive-loading-data Finally, Hive has a JIRA to implement bucket pruning. This means bucket files could be ignored during split generation without actually having to open the files (more like partitioning), but this is in the future. At the moment bucketing has pretty specialized use cases.
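A small DDL sketch showing both concepts side by side (the table and columns are hypothetical):

```sql
-- Partitioning creates one subfolder per sale_date value;
-- bucketing fixes the number of files inside each such folder.
CREATE TABLE sales (
  id          BIGINT,
  customer_id INT,
  amount      DECIMAL(10,2)
)
PARTITIONED BY (sale_date STRING)            -- pruned via WHERE conditions
CLUSTERED BY (customer_id) INTO 32 BUCKETS   -- 32 files per partition folder
STORED AS ORC;

-- Only the 2016-03-01 subfolder is read; other partitions are skipped:
SELECT sum(amount) FROM sales WHERE sale_date = '2016-03-01';
```

The WHERE clause prunes partitions directly, while the bucket count mainly shapes file layout (and enables features like bucketed joins) rather than reducing the data scanned.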
03-18-2016
12:19 AM
2 Kudos
Hmmm, good question. Can you tell me how you start the program, i.e. what is in the path variable? Are you sure the part files are not simply old and left over from before? I do not see any other function in your code that would write the file. Is the content changing?
03-17-2016
08:50 PM
4 Kudos
Hello Hoda, so I think I know the problem. foreachRDD executes your function on each RDD of the DStream, and you save them all to the same file, so they overwrite each other's data and the first or last writer wins. There are save functions available on the DStream, so you could transform the data with mapPartitions instead of foreachRDD and then save it with DStream.saveAsTextFiles. Or, the easiest way, you save each batch to a file with a unique name:

wDataFrame.rdd().coalesce(1, true, null).saveAsTextFile(path + time.milliseconds.toString)

I think the time variable already comes in automatically with foreachRDD, but you might have to instantiate a current date yourself if not. Now this is not very elegant, since you could have the same timestamp twice, but that is actually how the Spark Streaming guys do it if you look into the DStream.saveAsTextFiles method. You could make this even more unique by adding a random number that is large enough to never run into duplicates, or by finding a way to get the executor id. I would prefer the latter; if you find a way to get it I would be thankful :-).
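A sketch of the unique-name approach in Scala (the stream, outputPath, and the DataFrame conversion are assumptions, not from the original code):

```scala
// Assumed: a DStream[String] called stream and an output directory outputPath.
// Each micro-batch writes to its own directory keyed by the batch timestamp,
// so batches no longer overwrite each other.
import org.apache.spark.streaming.Time

stream.foreachRDD { (rdd, time: Time) =>
  if (!rdd.isEmpty()) {
    // time.milliseconds is the batch time Spark Streaming hands to foreachRDD
    rdd.coalesce(1).saveAsTextFile(outputPath + "/" + time.milliseconds.toString)
  }
}
```

The two-argument foreachRDD variant receives the batch Time directly, which avoids having to construct a current date inside the function.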
03-17-2016
02:33 PM
1 Kudo
The output tells you where to find additional information. Look at the end:

/usr/hdp/current/hadoop-client/sbin/hadoop-daemon.sh --config /usr/hdp/current/hadoop-client/conf start zkfc'' returned 1. starting zkfc, logging to /var/log/hadoop/hdfs/hadoop-hdfs-zkfc-sachadooptst2.corp.mirrorplus.com.out

So can you look into "/var/log/hadoop/hdfs/hadoop-hdfs-zkfc-sachadooptst2.corp.mirrorplus.com.out" to get more information?
03-17-2016
12:18 PM
Can you have a look into that folder? You specified /hadoop as the namenode directory, so either it doesn't exist, you don't have access to it, or it got corrupted. Also, formatting the namenode is dangerous, since it deletes all files in the cluster; jobs may then stop working because libraries are missing, etc. (You can find what to set up in the manual HDP installation guide.)