Member since: 09-23-2015
Posts: 800
Kudos Received: 898
Solutions: 185
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 5165 | 08-12-2016 01:02 PM |
| | 2145 | 08-08-2016 10:00 AM |
| | 2517 | 08-03-2016 04:44 PM |
| | 5337 | 08-03-2016 02:53 PM |
| | 1367 | 08-01-2016 02:38 PM |
07-27-2016
12:58 PM
The main consequences are for running jobs, some of which may depend on ATS (too late for that), and for any investigation of the performance of old jobs (which are now gone); apart from that, nothing I would know about. I would be interested to know who set the retention period to 8 years 🙂 That doesn't make any sense at all. You could simply have changed that setting as well, and it would then have cleaned up the old data on its own soon enough. Hope that works.
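For reference, ATS retention is normally controlled by two properties in yarn-site.xml; the values below are just a sketch of the usual 7-day default, not what your cluster had:

<property>
  <name>yarn.timeline-service.ttl-enable</name>
  <value>true</value>
</property>
<property>
  <name>yarn.timeline-service.ttl-ms</name>
  <value>604800000</value> <!-- 7 days in milliseconds -->
</property>

Once the TTL is lowered, the timeline store should age out the old entries on its own.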
07-27-2016
10:59 AM
Damn, not fast enough; I was about to write this. You get the column counts, types, and some statistics out of it; you will have to invent the column names, though.
[root@sandbox ~]# hadoop fs -ls /apps/hive/warehouse/torc
Found 2 items
-rwxrwxrwx 3 root hdfs 16653 2016-03-14 15:35 /apps/hive/warehouse/torc/000000_0
-rwxrwxrwx 3 root hdfs 16653 2016-03-14 15:35 /apps/hive/warehouse/torc/000000_0_copy_1
[root@sandbox ~]# hive --orcfiledump /apps/hive/warehouse/torc/000000_0
WARNING: Use "yarn jar" to launch YARN applications.
Processing data file /apps/hive/warehouse/torc/000000_0 [length: 16653]
Structure for /apps/hive/warehouse/torc/000000_0
File Version: 0.12 with HIVE_8732
16/07/27 10:57:36 INFO orc.ReaderImpl: Reading ORC rows from /apps/hive/warehouse/torc/000000_0 with {include: null, offset: 0, length: 9223372036854775807}
16/07/27 10:57:36 INFO orc.RecordReaderFactory: Schema is not specified on read. Using file schema.
Rows: 823
Compression: ZLIB
Compression size: 262144
Type: struct<_col0:string,_col1:string,_col2:int,_col3:int>
Stripe Statistics:
Stripe 1:
Column 0: count: 823 hasNull: false
Column 1: count: 823 hasNull: false min: 00-0000 max: 53-7199 sum: 5761
Column 2: count: 823 hasNull: false min: Accountants and auditors max: Zoologists and wildlife biologists sum: 28550
Column 3: count: 823 hasNull: false min: 340 max: 134354250 sum: 403062800
Column 4: count: 819 hasNull: true min: 16700 max: 192780 sum: 39282210
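To make the "invent the column names" part concrete, here is a minimal sketch of how you could put a table back on top of such a file once the dump has told you the types; the names occ_code, occ_title, total_emp and salary are pure guesses for illustration, only the two strings and two ints come from the dump:

[root@sandbox ~]# hive -e "CREATE EXTERNAL TABLE torc_recovered (
  occ_code STRING,
  occ_title STRING,
  total_emp INT,
  salary INT)
STORED AS ORC
LOCATION '/apps/hive/warehouse/torc';"

Hive should then map the file columns _col0 ... _col3 to the new names by position.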
07-25-2016
04:52 PM
You still will not have HIVE_HOME, because the scripts set it dynamically. You need to replace that placeholder with /usr/hdp/<yourversionlookitupinlinux>/hive.
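For example, to find the version directory to put into that path (the actual version string differs per install, so treat this as a sketch):

[root@sandbox ~]# ls /usr/hdp/            # the installed version directory lives here
[root@sandbox ~]# export HIVE_HOME=/usr/hdp/<version-from-ls>/hive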
07-25-2016
04:51 PM
So in Ambari, go to Hosts, select the host you want, and press the big Add+ button.
07-25-2016
04:35 PM
Are you using HDP? Then you would install them through Ambari: Host -> Add Client.
07-25-2016
03:42 PM
You need to replace HIVE_HOME with the actual path: /usr/hdp/<version>/hive/. Also, on the node where you run it, you need a Hive client installed.
07-25-2016
01:14 PM
1 Kudo
There are literally a dozen different options here:

a) Did you enable SQL optimization in SPSS (requires the Modeler Server licence)? After that it can push tasks into the Hive datasource. I am not sure if Hive is a supported datasource, but I would assume so. You can look into the documentation: https://www.ibm.com/support/knowledgecenter/SS3RA7_15.0.0/com.ibm.spss.modeler.help/sql_overview.htm

b) SPSS also supports a set of UDFs for in-database scoring, but that is not what you want.

c) Finally, there is the SPSS Analytic Server, which can essentially run most functions as a MapReduce job on the cluster: ftp://public.dhe.ibm.com/software/analytics/spss/documentation/analyticserver/1.0/English/IBM_SPSS_Analytic_Server_1_Users_Guide.pdf

Unfortunately, if you have neither the Modeler Server licence nor Analytic Server, there is not much you can do besides manually pushing pre-filters into the Hive database or optimizing your SPSS jobs more.
07-25-2016
01:02 PM
2 Kudos
@Kartik Vashishta I think you should read an HDFS book :-). Replication is not tied to specific discs. HDFS will put the three replicas of each block on different nodes; you don't have to choose a specific disc or node. The only rules are:

- All three replicas will be on different nodes.
- If you have rack topology enabled, the second and third copies will be on a different rack from the first copy.

It does not have to be a specific drive or node; HDFS will search for free space on any node that fits the requirements. The only issue I could imagine would be one huge node and some very small nodes that cannot match the size of the big node in total. (I have seen this with physical nodes and VM nodes.)
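If you want to see this for yourself, fsck will show you where the replicas of a file actually landed (the path below is just a placeholder):

[root@sandbox ~]# hdfs fsck /tmp/somefile -files -blocks -locations -racks

-locations prints the datanodes holding each replica of every block, and -racks adds the rack of each datanode, so on a rack-aware cluster you can check the second rule directly.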
07-25-2016
12:08 PM
@Kartik Vashishta Again, you don't understand HDFS. There is no limiting factor apart from the total disc capacity of the cluster. HDFS will put blocks (simply files on the local file system) on the discs of your cluster and fill them up. It will also make sure that you have three copies of each block. There is no limit but the total amount of space. Now, it is not very smart to have differently sized discs in your cluster, because not all spindles will be utilized equally: the small drives will fill up first and then all write activity will happen on the bigger drives. So equally sized discs are recommended, but they are not a requirement, and the other discs will not be empty. It is also not a requirement to have the same number of drives in each node, but you need to configure each node with the correct set of drives using config groups, as Sagar said.
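To watch the effect of the total capacity and of unevenly sized discs, the dfsadmin report is usually enough:

[root@sandbox ~]# hdfs dfsadmin -report

It prints configured capacity, DFS used and DFS remaining for the cluster as a whole and then per datanode, so you can see the smaller nodes filling up faster than the bigger ones.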
07-25-2016
10:44 AM
What Sagar says. It's not like RAID, in the sense that whole discs are mirrored across nodes; instead, blocks are put on different nodes and HDFS will try to fill up the available space. It's pretty flexible.