Member since: 09-23-2015
Posts: 800
Kudos Received: 898
Solutions: 185
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 5165 | 08-12-2016 01:02 PM |
| | 2145 | 08-08-2016 10:00 AM |
| | 2517 | 08-03-2016 04:44 PM |
| | 5337 | 08-03-2016 02:53 PM |
| | 1367 | 08-01-2016 02:38 PM |
07-27-2016
12:58 PM
The main consequences are for running jobs, some of which may depend on ATS (too late for that), and for any investigation of the performance of old jobs (which are now gone); apart from that, nothing I would know about. I would be interested to know who set the retention period to 8 years 🙂 That doesn't make any sense at all. You could simply have changed that setting as well, and it would then have cleaned up the old data on its own soon enough. Hope that works.
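For reference, ATS retention is normally controlled by two properties in yarn-site.xml; the values below are just a sketch of the usual 7-day default, not what your cluster had:

<property>
  <name>yarn.timeline-service.ttl-enable</name>
  <value>true</value>
</property>
<property>
  <name>yarn.timeline-service.ttl-ms</name>
  <value>604800000</value> <!-- 7 days in milliseconds -->
</property>

Once the TTL is lowered, the timeline store should age out the old entries on its own.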
07-27-2016
10:59 AM
Damn, not fast enough; I was about to write this. You get the column counts, types, and some statistics out of it; you will have to invent the column names, though.
[root@sandbox ~]# hadoop fs -ls /apps/hive/warehouse/torc
Found 2 items
-rwxrwxrwx 3 root hdfs 16653 2016-03-14 15:35 /apps/hive/warehouse/torc/000000_0
-rwxrwxrwx 3 root hdfs 16653 2016-03-14 15:35 /apps/hive/warehouse/torc/000000_0_copy_1
[root@sandbox ~]# hive --orcfiledump /apps/hive/warehouse/torc/000000_0
WARNING: Use "yarn jar" to launch YARN applications.
Processing data file /apps/hive/warehouse/torc/000000_0 [length: 16653]
Structure for /apps/hive/warehouse/torc/000000_0
File Version: 0.12 with HIVE_8732
16/07/27 10:57:36 INFO orc.ReaderImpl: Reading ORC rows from /apps/hive/warehouse/torc/000000_0 with {include: null, offset: 0, length: 9223372036854775807}
16/07/27 10:57:36 INFO orc.RecordReaderFactory: Schema is not specified on read. Using file schema.
Rows: 823
Compression: ZLIB
Compression size: 262144
Type: struct<_col0:string,_col1:string,_col2:int,_col3:int>
Stripe Statistics:
Stripe 1:
Column 0: count: 823 hasNull: false
Column 1: count: 823 hasNull: false min: 00-0000 max: 53-7199 sum: 5761
Column 2: count: 823 hasNull: false min: Accountants and auditors max: Zoologists and wildlife biologists sum: 28550
Column 3: count: 823 hasNull: false min: 340 max: 134354250 sum: 403062800
Column 4: count: 819 hasNull: true min: 16700 max: 192780 sum: 39282210
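To make the "invent the column names" part concrete, here is a minimal sketch of how you could put a table back on top of such a file once the dump has told you the types; the names occ_code, occ_title, total_emp and salary are pure guesses for illustration, only the two strings and two ints come from the dump:

[root@sandbox ~]# hive -e "CREATE EXTERNAL TABLE torc_recovered (
  occ_code STRING,
  occ_title STRING,
  total_emp INT,
  salary INT)
STORED AS ORC
LOCATION '/apps/hive/warehouse/torc';"

Hive should then map the file columns _col0 ... _col3 to the new names by position.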
07-25-2016
04:52 PM
You still will not have HIVE_HOME, because the scripts set it dynamically. You need to replace that placeholder with /usr/hdp/<yourversionlookitupinlinux>/hive.
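For example, to find the version directory to put into that path (the actual version string differs per install, so treat this as a sketch):

[root@sandbox ~]# ls /usr/hdp/            # the installed version directory lives here
[root@sandbox ~]# export HIVE_HOME=/usr/hdp/<version-from-ls>/hive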
07-25-2016
04:51 PM
So in Ambari, go to Hosts, select the host you want, and press the big Add+ button.
07-25-2016
04:35 PM
Are you using HDP? Then you would install them through Ambari: Host -> Add Client.
07-25-2016
03:42 PM
You need to replace HIVE_HOME with the actual path: /usr/hdp/<version>/hive/. Also, on the node where you run it, you need a Hive client installed.
07-25-2016
01:14 PM
1 Kudo
There are literally a dozen different options here:

a) Did you enable SQL optimization in SPSS (requires the Modeler Server licence)? After that it can push tasks into the Hive datasource. I am not sure if Hive is a supported datasource, but I would assume so. You can look into the documentation: https://www.ibm.com/support/knowledgecenter/SS3RA7_15.0.0/com.ibm.spss.modeler.help/sql_overview.htm

b) SPSS also supports a set of UDFs for in-database scoring, but that is not what you want.

c) Finally, there is the SPSS Analytic Server, which can essentially run most functions as a MapReduce job on the cluster: ftp://public.dhe.ibm.com/software/analytics/spss/documentation/analyticserver/1.0/English/IBM_SPSS_Analytic_Server_1_Users_Guide.pdf

Unfortunately, if you have neither the Modeler Server licence nor Analytic Server, there is not much you can do besides manually pushing pre-filters into the Hive database or optimizing your SPSS jobs more.
07-25-2016
01:02 PM
2 Kudos
@Kartik Vashishta I think you should read an HDFS book :-). Replication is not tied to specific discs. HDFS will put the three replicas of each block on different nodes; you don't have to choose a specific disc or node. The only rules are:

- All three replicas will be on different nodes.
- If you have rack topology enabled, the second and third copies will be on a different rack from the first copy.

It does not have to be a specific drive or node; HDFS will search for free space on any node that fits the requirements. The only issue I could imagine would be one huge node and some very small nodes that cannot match the size of the big node in total. (I have seen this with physical nodes and VM nodes.)
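If you want to see this for yourself, fsck will show you where the replicas of a file actually landed (the path below is just a placeholder):

[root@sandbox ~]# hdfs fsck /tmp/somefile -files -blocks -locations -racks

-locations prints the datanodes holding each replica of every block, and -racks adds the rack of each datanode, so on a rack-aware cluster you can check the second rule directly.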
07-25-2016
12:08 PM
@Kartik Vashishta Again, you don't understand HDFS. There is no limiting factor apart from the total disc capacity of the cluster. HDFS will put blocks (simply files on the local file system) on the discs of your cluster and fill them up. It will also make sure that you have three copies of each block. There is no limit but the total amount of space. Now, it is not very smart to have differently sized discs in your cluster, because not all spindles will be utilized equally: the small drives will fill up first and then all write activity will happen on the bigger drives. So equally sized discs are recommended, but they are not a requirement, and the other discs will not be empty. It is also not a requirement to have the same number of drives in each node, but you need to configure each node with the correct set of drives using config groups, as Sagar said.
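To watch the effect of the total capacity and of unevenly sized discs, the dfsadmin report is usually enough:

[root@sandbox ~]# hdfs dfsadmin -report

It prints configured capacity, DFS used and DFS remaining for the cluster as a whole and then per datanode, so you can see the smaller nodes filling up faster than the bigger ones.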
07-25-2016
10:44 AM
What Sagar says. It's not like RAID, in the sense that whole discs are mirrored across nodes; instead, blocks are put on different nodes and HDFS will try to fill up the available space. It's pretty flexible.