Member since: 09-23-2015
Posts: 800
Kudos Received: 897
Solutions: 185
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 2208 | 08-12-2016 01:02 PM
 | 1283 | 08-08-2016 10:00 AM
 | 1247 | 08-03-2016 04:44 PM
 | 2641 | 08-03-2016 02:53 PM
 | 736 | 08-01-2016 02:38 PM
07-27-2016
12:58 PM
The main consequences are for running jobs, some of which may depend on ATS (too late for that), and for any investigation of the performance of old jobs (which are now gone). Apart from that, nothing I would know about. I would be interested to know who set the retention period to 8 years 🙂 That doesn't make any sense at all. You could also simply have changed that setting; it would then have cleaned up the old data soon enough on its own. Hope that works.
07-27-2016
10:59 AM
Damn, not fast enough; I was about to write this. You get the column counts, types, and some statistics out of it. You will have to invent the column names, though.
[root@sandbox ~]# hadoop fs -ls /apps/hive/warehouse/torc
Found 2 items
-rwxrwxrwx 3 root hdfs 16653 2016-03-14 15:35 /apps/hive/warehouse/torc/000000_0
-rwxrwxrwx 3 root hdfs 16653 2016-03-14 15:35 /apps/hive/warehouse/torc/000000_0_copy_1
[root@sandbox ~]# hive --orcfiledump /apps/hive/warehouse/torc/000000_0
WARNING: Use "yarn jar" to launch YARN applications.
Processing data file /apps/hive/warehouse/torc/000000_0 [length: 16653]
Structure for /apps/hive/warehouse/torc/000000_0
File Version: 0.12 with HIVE_8732
16/07/27 10:57:36 INFO orc.ReaderImpl: Reading ORC rows from /apps/hive/warehouse/torc/000000_0 with {include: null, offset: 0, length: 9223372036854775807}
16/07/27 10:57:36 INFO orc.RecordReaderFactory: Schema is not specified on read. Using file schema.
Rows: 823
Compression: ZLIB
Compression size: 262144
Type: struct<_col0:string,_col1:string,_col2:int,_col3:int>
Stripe Statistics:
Stripe 1:
Column 0: count: 823 hasNull: false
Column 1: count: 823 hasNull: false min: 00-0000 max: 53-7199 sum: 5761
Column 2: count: 823 hasNull: false min: Accountants and auditors max: Zoologists and wildlife biologists sum: 28550
Column 3: count: 823 hasNull: false min: 340 max: 134354250 sum: 403062800
Column 4: count: 819 hasNull: true min: 16700 max: 192780 sum: 39282210
07-25-2016
04:52 PM
You still will not have HIVE_HOME, because the scripts set it dynamically. You need to replace that placeholder with /usr/hdp/<your version, look it up on the Linux box>/hive.
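Something like this, for example (the version directory below is made up; check what actually sits under /usr/hdp on your node):
# look up the installed HDP version directory first (example output: 2.4.2.0-258)
ls /usr/hdp/
# then substitute the concrete path for the HIVE_HOME placeholder
export HIVE_HOME=/usr/hdp/2.4.2.0-258/hive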
07-25-2016
04:51 PM
So in Ambari go to Hosts, select the host you want, and press the big Add+ button.
07-25-2016
04:35 PM
Are you using HDP? Then you would install them through Ambari: Hosts -> Add Client.
07-25-2016
03:42 PM
You need to replace HIVE_HOME with the actual path: /usr/hdp/<version>/hive/. Also, on the node where you run it, you need a Hive client installed.
07-25-2016
01:14 PM
1 Kudo
There are literally a dozen different options here:
a) Did you enable SQL optimization in SPSS (requires the Modeler Server licence)? After that it can push tasks into the Hive data source. I am not sure whether Hive is a supported data source, but I would assume so; you can check the documentation. https://www.ibm.com/support/knowledgecenter/SS3RA7_15.0.0/com.ibm.spss.modeler.help/sql_overview.htm
b) SPSS also supports a set of UDFs for in-database scoring, but that is not what you want.
c) Finally, there is the SPSS Analytic Server, which can essentially run most functions as a MapReduce job on the cluster. ftp://public.dhe.ibm.com/software/analytics/spss/documentation/analyticserver/1.0/English/IBM_SPSS_Analytic_Server_1_Users_Guide.pdf
Unfortunately, if you have neither the Modeler Server licence nor Analytic Server, there is not much you can do besides manually pushing pre-filters into the Hive database or optimizing your SPSS jobs further.
07-25-2016
01:02 PM
2 Kudos
@Kartik Vashishta I think you should read an HDFS book :-). Replication is not tied to specific discs. HDFS will put the 3 replicas of each block on different nodes; it does not have to choose a specific disc or node. The only rules are:
- All three replicas will be on different nodes.
- If you have rack topology enabled, the second and third copies will be on a different rack from the first copy.
It does not have to be a specific drive or node; HDFS will look for free space on any node that fits the requirements. The only issue I could imagine would be one huge node and some very small nodes that cannot match the size of the big node in total. (I have seen this with physical nodes mixed with VM nodes.)
07-25-2016
12:08 PM
@Kartik Vashishta Again, you don't understand HDFS. There is no limiting factor apart from the total disc capacity of the cluster. HDFS will put blocks (simply files on the local file system) on the discs of your cluster and fill them up, and it will make sure that you have 3 copies of each block. There is no limit but the total amount of space. Now, it's not very smart to have differently sized discs in your cluster, because it means not all spindles will be utilized equally: the small drives will fill up and then all write activity will happen on the bigger drives. So equal drive sizes are recommended, but not required; the other discs will not stay empty. It is also not a requirement to have the same number of drives in each node, but you need to configure each node with the correct number of drives using config groups, as Sagar said.
07-25-2016
10:44 AM
What Sagar says. It's not like RAID in the sense that whole discs are mirrored across nodes; blocks are put on different nodes and HDFS will try to fill up the available space. It's pretty flexible.
07-25-2016
10:42 AM
1 Kudo
If you use a distribution like HDP you cannot individually upgrade a component; the components are tested and supported together. So if you want a newer version of Hive, you would need to upgrade the whole distribution. HDP 2.5 will have a technical preview of Hive 2.0, for example. If you don't care about support, then good luck: you can try to install Hive manually, but the problem is that you will need to upgrade Tez manually as well. The hive.apache.org website has instructions on getting it to run, under "builds".
07-25-2016
09:37 AM
@Sunile Manjee Short answer: theoretically ORC ALWAYS makes sense, just less so than when you read only a subset of columns (then it's no question):
- It is stored in a protobuf-based format, so parsing the data is much faster than deserializing strings.
- It enables vectorization: if you aggregate on a column, ORC can read 10000 rows at a time and aggregate them all in one go, which is much better than parsing one row at a time.
- And it has features like predicate pushdown if you have WHERE conditions.
Once you read all columns there is no magic anymore; it will take some time. I would focus on the query analysis and try to identify any bottlenecks, but my guess would be that ORC is still your best bet.
07-22-2016
01:17 PM
In your other question it looked like there was simply a bug in the Phoenix part of the HBase installation. Sometimes that happens; a support ticket would log that. But I am 99% sure that normally Phoenix gets installed without any yum commands, in 2.3 and 2.4.
07-22-2016
01:14 PM
3 Kudos
Phoenix is installed by default in HDP 2.3, at least in my version; it's just a set of libraries in HBase that are always installed, unless I am completely mistaken. Or do you mean the Phoenix Query Server, which gets installed as a client on the nodes? (When you install, you can select it in the window where you also select DataNodes, NodeManagers, etc. If you forgot to do that, you can install the PQS later on a host using Ambari on the host page.) But the Phoenix libraries should be installed with HBase by default.
07-22-2016
01:06 PM
5 Kudos
1+2) It's simply the way Hadoop works. MapReduce guarantees that the input to the reducers is sorted. There are two reasons for this:
a) By definition, a reducer is an operation on a key and ALL values that belong to that key, regardless of which mapper they come from. A reducer could simply read the full input set and build a big hashmap in memory, but that would be ridiculously costly, so the other option is to sort the input dataset. It then simply reads all values for key1, and the moment it sees key2 it knows that there will be no more values for key1. So we have to sort the reducer input to enable the reducer function.
b) Sorting the keys gives a lot of nice benefits, like the ability to do a global sort more or less for free.
3) Reducers only merge-sort the input of the different mappers so that they have a globally sorted input list. This is low effort, since the input sets are already sorted.
4) "I have seen like nearly 3 times we are doing sorting and Sorting is too costly operation." No, you only sort once. The output of the mappers is sorted and the reducers merge-sort the inputs from the mappers; it is a single global sort operation. The mappers sort their output locally and the reducer merges these parts together. And as explained above, you HAVE to sort the reducer input for the reducer to work.
07-22-2016
12:14 PM
@Arunkumar Dhanakumar You can simply compress text files before you upload them. Common codecs include gzip, snappy, and LZO; HDFS does not care. All MapReduce/Hive/Pig jobs support these standard codecs and identify them by their file extension. If you use gzip, you just need to make sure that each file is not too big, since gzip is not splittable, i.e. each gzip file will result in one mapper. You can also compress the output of jobs: you could run a Pig job that reads the text files and writes them out again, and I think you simply need to add the .gz suffix to the output, for example. Again, you need to understand that each part file is then gzipped and will run in one mapper later. LZO and snappy, on the other hand, are splittable but do not provide as good a compression ratio. http://stackoverflow.com/questions/4968843/how-do-i-store-gzipped-files-using-pigstorage-in-apache-pig
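For the upload part, a minimal sketch (the file and target directory names are just placeholders):
# gzip is not splittable, so keep individual files modest; each .gz file becomes one mapper
gzip access_log_2016-07-01.txt
hadoop fs -put access_log_2016-07-01.txt.gz /data/raw/
The jobs reading that directory then pick the codec from the .gz extension on their own.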
07-22-2016
10:10 AM
4 Kudos
It is a Tez application. They stay around for a while to wait for new DAGs (execution graphs); otherwise you would need to create a new session for every query, which adds around 20s to your query time. It is configured here (normally a couple of minutes): tez.session.am.dag.submit.timeout.secs
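If you want to see what your cluster currently uses, something like this should work on an HDP node (assuming the usual /etc/tez/conf location for the Tez client config):
# value is in seconds; the idle session AM exits once this timeout passes without a new DAG
grep -A1 tez.session.am.dag.submit.timeout.secs /etc/tez/conf/tez-site.xml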
07-22-2016
09:56 AM
5 Kudos
1. On what basis does the ApplicationMaster decide that it needs more containers? That depends completely on the application master. For example, in Pig/Hive it computes the number of mappers based on input splits (blocks), so if you have a 1GB file with 10 blocks it will ask for 10 map containers. If you then specified 5 reducers, the application master will ask for 5 containers for the reducers. This calculation is different for each type of YARN workload. One example of "dynamic" requests is Slider, a framework you can "flex" up or down from the command line; but again, in the end the user tells Slider to request more. There is no magic scaling inherent to YARN; it depends on your application.
2. Will each mapper have a separate container? In classic MapReduce, one map = one container. (Tez, on the other hand, has container reuse, so a container it asked for for a "map" task can then be used for a reducer, for example.) And finally we will soon have LLAP, which can run multiple map/reduce tasks in the same container as threads, similar to a Spark executor. So all is possible.
3. Let's say one mapper launched in a container and completed 20% of the work; if it needs more resources to complete the remaining 80%, how are those resources allocated, and by whom? If distribution happens between containers, how does it happen? Again, it depends. MapReduce is stupid: it knows in advance how much work there is (number of blocks), asks for the number of mappers/containers it needs, and distributes the work between them. For the reducers, Hive/Tez for example can compute the number of reducers it needs based on the output size of the mappers, but once the reducer stage is started it does not change that anymore. So your question is not really correct.
Summary: you assume YARN would automatically scale containers on demand, but that is not what happens. What really happens is that the different workloads in YARN predict how many containers they need based on file sizes, map output, etc. and then ask for the correct number of containers for a stage. There is normally no scaling up/down within a single task. What is dynamic is YARN providing containers: if an application master asks for 1000 containers and there are only 200 slots, some occupied by other tasks, YARN can hand them over piece by piece. Some application masters, like MapReduce, are fine with that; others, like Spark, will not start processing until all the containers they requested are running at the same time. Again, it depends on the application master. There is nothing prohibiting an application master from scaling on demand if it wanted to, but that is not what happens in reality for most workloads like Tez/MapReduce/Spark. The only dynamic scaling I am aware of is in Pig/Hive between stages, where the application master predicts how many containers it needs for the reducer stage based on the size of the map output.
07-22-2016
09:45 AM
2 Kudos
First, regarding ORC: it is a column-store format, so it only reads the columns you need. So yes, fewer columns good, more columns bad. However, it's still better than a flat file, which reads all columns all the time and is not stored as efficiently (protobuf metadata, vectorized access, ...). But it's not magic. So the question, if you see big performance hits, is whether the join order is correct. Normally the CBO already does a decent job of figuring that out if you have statistics, as Constantin says, so that is the first step. The second is to analyze the explain plan and see if it makes sense. Worst case, you could break up the query into multiple pieces with temp tables / WITH statements to see if a different order results in better performance. I am also a fan of checking the execution with hive.tez.exec.print.summary to see if there is a stage that takes a long time and doesn't have enough reducers/mappers, i.e. a bottleneck.
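For that last point, a small example of how to run it (the query file name is made up):
# prints a per-vertex summary after the query so you can spot a stage that runs
# long or with too few mappers/reducers
hive --hiveconf hive.tez.exec.print.summary=true -f my_join_query.sql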
07-22-2016
09:15 AM
1 Kudo
Auxlib works; it's the only thing that works consistently for me. Are you using the Hive command line or beeline? Depending on that, you need to put the jars into the auxlib directory of the Hive server or of the Hive client. You also need to restart the Hive server.
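Roughly like this for the beeline/HiveServer2 case; the paths assume an HDP-style layout under /usr/hdp/current and the jar name is just a placeholder:
# put the UDF jar where HiveServer2 looks for auxiliary libraries (create the dir if needed)
mkdir -p /usr/hdp/current/hive-server2/auxlib
cp my-udf.jar /usr/hdp/current/hive-server2/auxlib/
# then restart HiveServer2 (for example through Ambari) so it picks up the jar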
07-21-2016
11:19 AM
@Josh Elser Sorry about that :-). Protobuf?
07-21-2016
11:13 AM
1 Kudo
Honestly, if I knew I would have mentioned them :-). I have set up a cluster before with a simple shell script ssh-all.sh: for i in server1 server2 server3; do ssh $i "$1"; done, and created users manually on a small cluster (we only had ~10 users, so it didn't seem worth it to set up LDAP). I never bothered about uids and never ran into problems, but we used standard stuff: Oozie, Hive, etc. Other people have told me that some components don't take this well; honestly, I am not sure which ones. I am sure that a NameNode HA setup with NFS does not work, because NFS depends on the same UID, but I have trouble thinking of another component that would need the same uids in a Hadoop environment. HDFS does not care about uids; it cares about usernames.
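For reference, a slightly cleaned-up version of that helper; the hostnames are placeholders and it simply runs whatever command you pass it on each node:
#!/bin/bash
# ssh-all.sh: run the given command on every node in the list
for host in server1 server2 server3; do
  ssh "$host" "$@"
done
# usage example: ./ssh-all.sh useradd analyst1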
07-21-2016
11:01 AM
3 Kudos
The YARN timeline store should clean up old values. The parameters are:
- cleanup cycle (when it deletes): yarn.timeline-service.leveldb-timeline-store.ttl-interval-ms
- time to live (what to delete): yarn.timeline-service.ttl-ms
- and finally, enable the age-off: yarn.timeline-service.ttl-enable
https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.2/bk_yarn_resource_mgt/content/ref-e54bc3f2-f1bc-4bc6-b6cb-e6337589feb6.1.html
I would check those. If you want to clean things up, you can set them to low values and restart. Alternatively, you should be able to simply delete the database if you want to; it's just log information after all. Finally, if the parameters are correct and still don't work, you might want to open a support ticket.
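To see what is currently in effect, something like this should do on an HDP node (assuming the usual /etc/hadoop/conf location):
# check the ATS age-off settings; ttl-enable must be true for the other two to matter
grep -A1 -E 'yarn.timeline-service.(ttl-enable|ttl-ms|leveldb-timeline-store.ttl-interval-ms)' /etc/hadoop/conf/yarn-site.xml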
07-21-2016
09:36 AM
4 Kudos
HDFS: You need (with the default 3x replication) 30GB on the datanodes. On the namenode the space is negligible: you have 40 blocks x 3 = 120 block replicas, which is roughly 12 KB of RAM on the namenode (you need around 100 bytes of RAM for every block in namenode memory). You also need a bit of space on disc, but that's even less in the fsimage (files, but not blocks, are stored on disc; namenodes do need a bit more since they also store edits and multiple versions of the fsimage, but it is still very small).
HBase: a more complicated question. In HBase it depends on the way you store the data. Every field in your HBase table is stored in HFiles together with the key, the field name, the timestamp, and so on. So if you store everything in a single field per row, your storage is much smaller than if you had hundreds of 2-byte columns. On the other hand, you can also enable compression in HBase, which reduces space.
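Just to make the namenode arithmetic explicit (numbers taken from the answer above, roughly 100 bytes per block object):
# 40 blocks x 3 replicas = 120 block objects, at ~100 bytes of namenode heap each
echo $((40 * 3 * 100)) bytes   # -> 12000 bytes, i.e. roughly 12 KB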
07-21-2016
09:20 AM
3 Kudos
On big clusters people normally set up an LDAP server; IPA, for example, is free and simple. Look on GitHub for the security workshops of Ali Bajwa. Or, as said below, use an ssh script, Ansible, or a parallel shell tool to run commands on all nodes. Note that some more esoteric components of the stack require that usernames have the same uid on all nodes of the cluster. https://github.com/abajwa-hw
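A minimal sketch of the ssh-script route (hostnames, uid, and username are made up; run it as a user allowed to create accounts on the nodes). Pinning the uid keeps it identical everywhere for the components that care:
for host in node1 node2 node3; do
  ssh "$host" "useradd -u 1050 -m analyst1"
done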
07-20-2016
04:29 PM
1 Kudo
I heard there is some group caching in HDFS, but it should refresh after 5 minutes (hadoop.security.groups.cache.secs). Any chance to restart HDFS/YARN to make sure that's not the problem?
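If a restart is too disruptive, the cached user-to-group mappings can also be refreshed explicitly with the standard admin commands:
# force the NameNode and ResourceManager to re-read group memberships
hdfs dfsadmin -refreshUserToGroupsMappings
yarn rmadmin -refreshUserToGroupsMappings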
07-20-2016
02:49 PM
I think you should look at the Spark RDD programming introduction. What you get is an RDD of integers; you can then use Spark functions like map/foreach etc. to do things with it. So the question is what you actually want to do: why do you want a List? You can do rdd.collect to get it all in a big Array on your driver, but that is most likely not what you actually want. http://spark.apache.org/docs/latest/programming-guide.html I.e. clusterPoints.collect() will give you an array of points in your local driver; however, it downloads all results to the driver and does not run in parallel anymore. If that works with your data volumes, great, but normally you should use functions like map etc. so Spark can do the computations in parallel. Below is a scoring example that scores point by point, so you could do other things in this function as well; whatever you want to do with the information, essentially. http://blog.sequenceiq.com/blog/2014/07/31/spark-mllib/
val clusters: KMeansModel = KMeans.train(data, K, maxIteration, runs)
val vectorsAndClusterIdx = data.map { point =>
  val prediction = clusters.predict(point)
  (point.toString, prediction)
}
07-20-2016
02:35 PM
If you want to know, for your input points, which point belongs to which cluster, you need to use the predict method.
07-20-2016
02:34 PM
The class provides the method clusterCenters: public Vector[] clusterCenters(). Each Vector is a point, namely a cluster center. Or, as said, export it to PMML.
07-20-2016
01:00 PM
2 Kudos
The data is a Java class that contains the cluster information: cluster centers, statistics, and so on. If you want to work with that, you either need to use the Spark MLlib library to do extraction/scoring etc., OR you can export many of these models as PMML, an XML-based standard for clustering (and other) models that is understood by a lot of data mining tools. It can be exported for a lot of the models: kmeansModel.toPMML("/path/to/kmeans.xml") https://databricks.com/blog/2015/07/02/pmml-support-in-apache-spark-mllib.html Not all MLlib models support PMML, though.