About kaliyugantagoni

kaliyugantagoni · ‎05-26-2016

So if you distribute by smapiname_ver you would have all values with the same smapiname in the same output file. Also if you distribute with the partition key you can make sure that each reducer only writes to a single output file. Does this mean that I have to explicitly set the no. of reducers on the Hive prompt ? Is it mandatory for the CORRECT insertion of data ? set mapred.reduce.tasks=; (what value shall I provide?) But again if you don't understand it you might be better off with buckets sorted. and the optimized sorted load. Does this mean something like this : set hive.enforce.sorting=true; set hive.enforce.bucketing=true; set optimize.sort.dynamic.partitioning=true; set hive.exec.dynamic.partition.mode=nonstrict; set hive.exec.max.dynamic.partitions.pernode=3000; AND CREATE EXTERNAL TABLE IF NOT EXISTS FactSampleValue_ORC (`Snapshot_Id` int ,`ECU_Id` int ,.OTHER COLUMNS.)PARTITIONED BY (`SmapiName_ver` varchar(30)) CLUSTERED BY SmapiName_ver INTO 256 BUCKETS ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS ORC LOCATION '/hiveorc'; Hmmm no, you do not filter by the snapshot_id, you still need all of them so the predicate pushdown doesn't help you thee. + Make sure that the partition key is part of the distribute by and any other key you want to add where conditions on. Unfortunately, there is only one where condition(where smapiname_ver ='dist_1'), so I am left only with one column on which partitioning is already considered. Does this mean something like this : INSERT INTO FactSampleValue_ORC PARTITION (SmapiName_ver) SELECT * FROM FactSampleValue DISTRIBUTE BY SmapiName_ver SORT BY ?;

kaliyugantagoni · ‎05-26-2016

Stack : Installed HDP-2.3.2.0-2950 using Ambari 2.1 Nodes : 1 NN(8 X 1TB hdd, 16 X 2.53 GHz core processor,48GB RAM, RHEL 6.5) + 8 DN(8 X 600GB hdd, 16 X 2.53 GHz core processor, 75GB RAM, RHEL 6.5). Nodes are connected by a 10-gig network The background thread for detailed information. Tables information : FactSampleValue : 24 Billion rows DimSnapshot : 8 million Currently, I have imported these tables in text format onto HDFS and have created plain/staging Hive external tables Once the final table strategy is decided, I will create another set of FINAL Hive external tables and populate them with insert into FINAL.table select * from staging.table The seemingly innocuous query : select f.*from factsamplevalue f join DimSnapshot t on t.snapshot_id = f.snapshot_id where smapiname_ver ='dist_1'; To check the cardinality and the skew , I executed the following queries on the vanilla/staging Hive table(as expected, it took ages 😛 ) select count(distinct Snapshot_Id) cardin_Snapshot_Id, count(distinct SmapiName_ver) cardin_SmapiName_ver, count(distinct SmapiColName) cardin_SmapiColName FROM FactSampleValue; cardin_Snapshot_Id cardin_SmapiName_ver cardin_SmapiColName 7967285 2717 45207 I created an empty external orc table as follows : CREATE EXTERNAL TABLE IF NOT EXISTS FactSampleValue ( `Snapshot_Id` int , `ECU_Id` int , . OTHER COLUMNS . ) PARTITIONED BY (`SmapiName_ver` varchar(30)) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS ORC LOCATION '/hiveorc'; My thoughts on the above table : Since the only where condition is having SmapiName_ver , PARTITION BY SmapiName_ver Following is the INSERT query that I have thought about : INSERT INTO odp_dw_may2016_orc.FactSampleValue PARTITION (SmapiName_ver) SELECT * FROM odp_dw_may2016.FactSampleValue DISTRIBUTE BY ? SORT BY Snapshot_Id DESC My thoughts : As per my understanding and the community inputs, SORT BY is required to speed up the search and DISTRIBUTE BY to create less no. of output files(1-10 or more?, I don't really understand this concept), thereby, speeding up the search The JOIN is on the integer column Snapshot_Id, hence, SORT BY Snapshot_Id My brain fails to decide a value DISTRIBUTE BY, I have the following queries : Is it necessary that the DISTRIBUTE BY column has LOW CARDINALITY Is it crucial for the READ performance that the query has in the where/join condition the DISTRIBUTE BY column The only OTHER column with low cardinality BUT NOT USED in the query is ECU_Id(int) which has 44 DISTINCT VALUES How shall I load the data in the final ORC table

kaliyugantagoni · ‎05-24-2016

Stack : Installed HDP-2.3.2.0-2950 using Ambari 2.1 The cluster is having 1NN + 8 DN = 9 nodes. Some business sensitive data has been loaded onto HDFS via Sqoop. While access to the Ambari URL at http://NN:8080 is acceptable, the access to the http://NN:50070/ and further utilities like 'Browsing the file system' should be restricted to only 2-3 selected users. Right now, anyone can browse the hdfs contents via the browser. How to do it, preferably via Ambari ? Note : The access to different components(Hive, HDFS) etc. role wise is a later part, right now, just hiding the data is the concern

kaliyugantagoni · ‎05-24-2016

'partitioned by ( hash(SmapiName_ver) % 64 ) ' and then then sort by the name itself to at least skip some values This is a good idea, there are only two facts that worry me : The SmapiName_ver column is varchar, I am unsure how evenly/oddly will the SmapiName_ver keys be hashed into sets ! Is there a way in Hive to figure out what hash a key will go if you have the distinct key values with you ? As mentioned in my original question, the SmapiName_ver is having a heavy skew(file skew-smapiname-ver-fsv.txt has the desc order row count for each key). For example 'dist_1' key is present in 2 billion rows, 'ct_1' in 98 million, 'tm_1' in 8 million and so on. Now if 'dist_1' and 'ct_1' land in the same hash/partition, then instead of scanning 2billion in case of partitioned by SmapiName_ver, it will be now 2 billion + 98 million rows to scan - please correct if my understanding is incorrect Make sure to use distribute by OR bucketing to create a couple of files ( 1-10 blocks is a good size ) and distribute bucket by something OTHER than the smapiname Which column to use for distribute by is a question as : Snapshot_Id(int) is having a high cardinality(it is the PK in the other smaller table) i.e 8 million distinct values. This is THE JOIN COLUMN, does it make sense to use distribute by on this ? Is it a mandate to have DISTRIBUTE BY on a column which will be used in the query(either WHERE or JOIN) ? If you partition by it directly sorting doesnt help you much because you dont have a second where condition I didn't quite understand this statement SORT BY : How exactly SORT BY helps to skip records Does SORT BY sorts data within each file created by DISTRIBUTED BY or sorts data within EACH PARTITION Given the query(select f.* from factsamplevalue f join DimSnapshot t on t.snapshot_id = f.snapshot_id where smapiname_ver = 'dist_1'), is it necessary that the SORT BY should be either snapshot_id or smapiname_ver

kaliyugantagoni · ‎05-24-2016

In the bigger table, yes, there is a skew, please check the original question, I have also uploaded the column wise(SmapiName_ver)count of rows. As for the snapshot_id, it is a linear, incremental int field. Which(columns) max and min values would assist you ?

kaliyugantagoni · ‎05-23-2016

I read the article and now I have some grip on how to proceed, however, some questions too : How to I begin with the partitioning - there are 2719 distinct values for the column SmapiName_ver ( Can you check the attached file which has the distribution of the smapiname_ver ? ) Regarding use of partitions, there was a point 'No range partitioning/No continuous values'. In my case, the column smapiname_ver is a varchar column. How does it impact ? Given the 'INSERT INTO ORC_TABLE SELECT * FROM STAGING_TABLE', I didn't get what the 'Loading data using distribution' section Couldn't find the optimize.sort.dynamic.partitioning setting, is it same as : Is 'DISTRIBUTE BY' clause relevant(even in my example) when it comes to querying the large tables ? Isn't it's role limited to the data loading(ORC table creation) phase ?If not, how do I use it ? 'For PPD, use SORT BY' meaning we don't enforce a total order, also, this means a field like date or some integer. In my case, I guess I need to use the snapshot_id (latest entry will have the highest snapshot_id). Correct me if I am mistaken.

kaliyugantagoni · ‎05-23-2016

The source schema is a 'STAR schema' and the larger table is actually a fact table, hence, all the t.snapshot_id will be present in the f.snapshot_id column

kaliyugantagoni · ‎05-23-2016

Stack : Installed HDP-2.3.2.0-2950 using Ambari 2.1 Nodes : 1 NN(8 X 1TB hdd, 16 X 2.53 GHz core processor,48GB RAM, RHEL 6.5) + 8 DN(8 X 600GB hdd, 16 X 2.53 GHz core processor, 75GB RAM, RHEL 6.5). Nodes are connected by a 10-gig network Tables information : FactSampleValue : 24 Billion rows DimSnapshot : 8 million Currently, I have imported these tables in text format onto HDFS and have created plain/staging Hive external tables Once the final table strategy is decided, I will create another set of FINAL Hive external tables and populate them with insert into FINAL.table select * from staging.table The seemingly innocuous query : select f.* from factsamplevalue f join DimSnapshot t on t.snapshot_id = f.snapshot_id where smapiname_ver = 'dist_1'; Probably, a silly question to ask at this stage - does the infrastructure seem enough ? The obvious Hive strategies + existing forum threads lead me to believe : Partitioning + Skew To check the cardinality and the skew , I executed the following queries on the vanilla/staging Hive table(as expected, it took ages 😛 ) select count(distinct Snapshot_Id) cardin_Snapshot_Id, count(distinct SmapiName_ver) cardin_SmapiName_ver, count(distinct SmapiColName) cardin_SmapiColName FROM FactSampleValue; cardin_Snapshot_Id cardin_SmapiName_ver cardin_SmapiColName 7967285 2717 45207 If SmapiName_ver column is used, it will create 2717 partitions which I am unsure is a sane no. of partitions, however, the skew considerations below may help to reduce the partitions in some way(list bucketing or ??). I need to use dynamic partitioning. I also suspected a skew in the SmapiName_ver column, hence, executed the following query : select SmapiName_ver, count(1) SmapiName_ver_distrib from FactSampleValue group by SmapiName_ver; I have not provided the actual string values in SmapiName_ver but attached is the distribution in descending order skew-smapiname-ver-fsv.txt Bucketing(normal or 'List Bucketing') The smaller table(8 million) is having an integer snapshot id as PK in the source DB, hence, I guess I can use buckets to split the snapshot ids in some uniform manner but how many buckets ? Unsure how to use for the bigger table. I guess based on the 'skew' information, can list bucketing be used ? If yes, how ? Format and compression I have decided to proceed with the ORC format but given the massive size of the dataset, shall I go with default compression(I guess it will be zlib will be used by default) settings or ?? Indexes Not considering this at the moment - is it a sensible decision ? Misc. params I had a set of queries based on my limited knowledge of configuration : If I use bucketing on any of the tables, during data load in the ORC tables, I need to specify no. reducers = no. of buckets. Is this true even during execution of Hive queries i.e can I simply rely on Tez without having to wonder what reducer count should be set From the existing threads, I got several inputs like : The tables should be well sorted - does this mean that while inserting data into final tables, I need to use some 'order by/sort by' clause While flipping through the Hive doc., I came across the following statement : Cartesian products are not executed in parallel, and they are not optimized in any way using MapReduce Should I consider this in the given simple query? If yes, what I need to do ? I am currently not focusing on the cluster tweaks like 'Have set the task memory settings correctly so you fully utilize the cluster during queries' suggested by the community. I will first fix the Hive side

kaliyugantagoni · ‎05-16-2016

Stack : Installed HDP-2.3.2.0-2950 using Ambari 2.1. Tables from a SQL Server schema have been imported on HDFS using Sqoop. The Hive config in Ambari is as follows(Tez is the execution engine) : There is a huge table with 24187713416 rows but it is just an ordinary external table with no partitioning etc., when I execute a select query, following is the output(I couldn't figure out if Tez is in action) : hive (odp_dw_may2016)> > > select * from FactSampleValue where Snapshot_Id = 643067 and SmapiName_ver = 'Abnormal_ShutDown_ct_1' and SmapiColName = 'Abnormal_ShutDown_ct_1_000'; Query ID = hive_20160516092942_4b1c90ab-77bd-4ea8-bf3e-9af611a63143 Total jobs = 1 Launching Job 1 out of 1 Tez session was closed. Reopening... Session re-established. Status: Running (Executing on YARN cluster with App id application_1446726117927_0076) -------------------------------------------------------------------------------- VERTICES STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED -------------------------------------------------------------------------------- Map 1 .......... SUCCEEDED 3098 3098 0 0 0 0 -------------------------------------------------------------------------------- VERTICES: 01/01 [==========================>>] 100% ELAPSED TIME: 1341.80 s -------------------------------------------------------------------------------- OK factsamplevalue.chassino factsamplevalue.snapshot_id factsamplevalue.ecu_id factsamplevalue.smapiname_ver factsamplevalue.smapicolname factsamplevalue.indexet factsamplevalue.value factsamplevalue.samplefrequency factsamplevalue.vectorperc factsamplevalue.smapiname factsamplevalue.diffvalue factsamplevalue.commvalue 1872663 643067 28 Abnormal_ShutDown_ct_1 Abnormal_ShutDown_ct_1_000 0 1.0 NULL 100.0 Abnormal_ShutDown_ct NULL NULL Time taken: 1350.861 seconds, Fetched: 1 row(s) I have several questions pertaining only to the usage of Tez : Can I safely assume that Tez is the execution engine right now ? Does Tez maintain its logs where the Hive query execution is logged ? I read in the Hortonworks doc. about enabling Tez, it is necessary even when Ambari is used ? Can the change in the Hive config via Ambari seamlessly switch between Tez and MapReduce ? I checked the Apache Tez doc. but I am unclear about the utility of the Tez UI ?

kaliyugantagoni · ‎05-13-2016

Some confusion, can you help ? Does the split-by column wrong choice cause the mismatch in no. of rows in source schema and hdfs Does the --num-mappers wrong choice cause the mismatch in no. of rows in source schema and hdfs Can a combination of above two cause the mismatch in no. of rows in source schema and hdfs

Online	Offline
Last Visited	‎03-18-2020 10:21 AM

Member Since	‎04-11-2016 02:31 PM
Last Visited	‎03-18-2020 10:21 AM
Posts	174
Kudos received	29

Cloudera Community

Re: NiFi custom processor custom log, logging in t...

Re: Separate log file for custom processor

Re: Sqoop imported more records than source

Re: Sqoop import to HCatalog/Hive - table not visi...

Re: HDFS Space not reclaimed

Re: Part-2 : Join involving 24 billion X 1 to 8 mi...

Part-2 : Join involving 24 billion X 1 to 8 millio...

Quickly secure the access to the cluster via http

Re: Part-1 : Join involving 24 billion X 1 to 8 mi...

Re: Part-1 : Join involving 24 billion X 1 to 8 mi...

Re: Part-1 : Join involving 24 billion X 1 to 8 mi...

Re: Part-1 : Join involving 24 billion X 1 to 8 mi...

Part-1 : Join involving 24 billion X 1 to 8 millio...

Enable/disable Tez and verify

Re: Sqoop import : composite primary key and textu...