Member since: 06-08-2017
Posts: 1049
Kudos Received: 518
Solutions: 312

My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 11227 | 04-15-2020 05:01 PM |
| | 7131 | 10-15-2019 08:12 PM |
| | 3114 | 10-12-2019 08:29 PM |
| | 11496 | 09-21-2019 10:04 AM |
| | 4343 | 09-19-2019 07:11 AM |
01-06-2019
02:42 PM
@Basil Paul
1. To count rows where the 2gusage column holds a SQL NULL, try the query below:
select count(*) from demo where 2gusage is NULL;
2. If the literal string "NULL" is in your data for the 2gusage column, then use the query below:
select count(*) from demo where 2gusage = "NULL";
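A minimal sketch of the difference, using a throwaway table (demo_null_test and its values are made-up placeholders) where one row holds a real NULL and another holds the literal string "NULL":

create table demo_null_test (usage_col string); -- placeholder table/column names
insert into table demo_null_test values (null), ("NULL"), ("5GB");
select count(*) from demo_null_test where usage_col is null; -- matches only the real NULL row: 1
select count(*) from demo_null_test where usage_col = "NULL"; -- matches only the literal string row: 1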
01-05-2019
11:28 PM
1 Kudo
@PP Include an over() clause in your select query. Try the query below:
select row_number() over(),* from testDB.testTable;
Example:
select row_number() over() as rn,* from (
select stack(2,1,"foo",2,"bar") as (id,name)
)t;
+-----+-------+---------+--+
| rn | t.id | t.name |
+-----+-------+---------+--+
| 1 | 1 | foo |
| 2 | 2 | bar |
+-----+-------+---------+--+
"rn" is the row number column added in the result above.
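One caveat: with an empty over() clause, row_number() assigns numbers in whatever order the rows happen to be processed, so the numbering is not guaranteed to be stable across runs. A variant of the same example that orders explicitly inside the over() clause (the order by column is our choice here):

select row_number() over(order by id) as rn,* from (
select stack(2,1,"foo",2,"bar") as (id,name)
)t; -- rn now follows ascending id deterministically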
01-03-2019
01:51 AM
@Jack Here are the properties:

hive.compactor.delta.pct.threshold
Default: 0.1
Scope: Metastore
Percentage (fractional) size of the delta files relative to the base that will trigger a major compaction. 1 = 100%, so the default 0.1 = 10%.

hive.compactor.abortedtxn.threshold
Default: 1000
Scope: Metastore
Number of aborted transactions involving a given table or partition that will trigger a major compaction.

Setting compaction properties via TBLPROPERTIES:

CREATE TABLE table_name (
  id int,
  name string
)
CLUSTERED BY (id) INTO 2 BUCKETS STORED AS ORC
TBLPROPERTIES (
  "transactional"="true",
  "compactor.mapreduce.map.memory.mb"="2048",                  -- specify compaction map job properties
  "compactorthreshold.hive.compactor.delta.num.threshold"="4", -- trigger minor compaction if there are more than 4 delta directories
  "compactorthreshold.hive.compactor.delta.pct.threshold"="0.5" -- trigger major compaction if the ratio of size of delta files to
                                                                -- size of base files is greater than 50%
);

ALTER TABLE table_name COMPACT 'minor'
WITH OVERWRITE TBLPROPERTIES ("compactor.mapreduce.map.memory.mb"="3072"); -- specify compaction map job properties

ALTER TABLE table_name COMPACT 'major'
WITH OVERWRITE TBLPROPERTIES ("tblprops.orc.compress.size"="8192"); -- change any other Hive table properties

We can trigger major compactions with the command below:
alter table <table-name> partition(<partition-name>,<nested-partition-name>,..) compact 'major';
More details on this page: https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions
01-02-2019
03:30 AM
@Jack The parameters below control when compactions are triggered.

| Configuration Parameter | Description |
|---|---|
| hive.compactor.delta.num.threshold | Specifies the number of delta directories in a partition that triggers an automatic minor compaction. The default value is 10. |
| hive.compactor.delta.pct.threshold | Specifies the percentage size of delta files relative to the corresponding base files that triggers an automatic major compaction. The default value is 0.1, which is 10 percent. |
| hive.compactor.abortedtxn.threshold | Specifies the number of aborted transactions on a single partition that triggers an automatic major compaction. |

For all the Hive compaction parameters, refer to the link below: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.5/bk_data-access/content/understanding-administering-compactions.html
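These thresholds can also be overridden for a single table. A minimal sketch, assuming an existing ACID table (my_acid_table is a placeholder name) and using the per-table compactorthreshold.* TBLPROPERTIES prefix from the Hive transactions documentation:

alter table my_acid_table set tblproperties (
"compactorthreshold.hive.compactor.delta.num.threshold"="5", -- minor compaction after 5 delta directories
"compactorthreshold.hive.compactor.delta.pct.threshold"="0.25" -- major compaction at a 25% delta-to-base size ratio
);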
12-28-2018
07:23 PM
@GEORGE NASIS When calculating an average of averages, we are not going to get the same value. Example: Hive table data:
select * from i;
+---------+------+--------+-------+--+
| gender | age | total | name |
+---------+------+--------+-------+--+
| M | 10 | 12 | st |
| F | 10 | 8 | st |
| M | 12 | 15 | st |
+---------+------+--------+-------+--+
Calculate avg on one field:
select avg(total) from i group by name;
+---------------------+--+
| _c0 |
+---------------------+--+
| 11.666666666666666 |
+---------------------+--+
Calculate avg using a subquery:
select avg(tt) from (
select name,avg(total)tt from i group by gender,name)t group by name;
+--------+--+
| _c0 |
+--------+--+
| 10.75 |
+--------+--+
Because the inner query groups by 2 dimensions (gender, name) while the outer query groups by 1 dimension (name), the outer query averages only 2 group rows instead of the original 3 rows, so the averages differ: the M group averages to (12+15)/2 = 13.5 and the F group to 8, and (13.5 + 8)/2 = 10.75 rather than (12+8+15)/3 = 11.67. The correct subquery carries a count column and divides sum by count:
select sum(tt)/sum(cnt) from(
select name,sum(total)tt,count(*)cnt,avg(total) av from i group by gender,name -- add cnt, sum and avg columns
)t group by name;
+---------------------+--+
| _c0 |
+---------------------+--+
| 11.666666666666666 |
+---------------------+--+
You can try this approach (adding a count column, then dividing sum(total) by sum(count) to get correct avg results) and cache the most appropriate dataframe; the Spark optimizer will then choose the most efficient plan to run these tasks.
12-28-2018
03:43 AM
1 Kudo
@GEORGE NASIS Since you are using the df3 dataframe for all the groupBy clauses, cache "df3"; Spark will then not recompute the data from the file all 16 times, but will use the cached dataframe (df3) instead, which significantly increases performance.
val df3 = df2.withColumn("Age", currentDate($"BirthDate"))
df3.cache() //caching the dataframe into memory
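// note: cache() is lazy; the data is materialized by the first action that runs over df3,
// after which the 16 groupBy aggregations reuse the in-memory copy instead of recomputing it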
12-28-2018
03:35 AM
@Jack There are two types of compactions that happen in ACID tables:
1. Minor compaction: a 'minor' compaction takes all the delta files and rewrites them into a single delta file. This compaction doesn't consume many resources.
2. Major compaction: a 'major' compaction takes one or more delta files (as in a minor compaction) plus the base file for the bucket, and rewrites them into a new base file per bucket.
Delta files are cleared out when a minor/major compaction happens, and all these tasks are initiated by Hive in the background based on the hive-site.xml configs; refer to this link for more details. Take a look at this thread to understand how to initialize Hive compactions manually.
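To initialize a compaction manually, a minimal sketch (my_acid_table is a placeholder name):

alter table my_acid_table compact 'minor'; -- queue a minor compaction by hand
alter table my_acid_table compact 'major'; -- or rewrite the base file entirely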
12-26-2018
03:27 AM
@CarlosHS Even when we replace columns in a Hive text table, the underlying data is not changed, i.e. the wikipedia_link data is still present in the HDFS file. So when we access the table, Hive reads the data as ","-delimited and returns the wikipedia_link data in place of the keywords column.
Steps to drop the wikipedia_link column along with its data:
hive> set hive.support.quoted.identifiers=none;
hive> create table countries_temp row format delimited fields terminated by ',' stored as textfile as select `(wikipedia_link)?+.+` from countries; -- create a temp table without the wikipedia_link column
hive> drop table countries; -- drop the countries table
hive> alter table countries_temp rename to countries; -- rename the temp table to countries
(or) Another way would be to create a view without the wikipedia_link column:
hive> create view vw_countries as select id,code,name,continent,keywords from countries;
then access the data from the view instead of the countries table.
12-13-2018
01:33 PM
1 Kudo
@Jose Luis Moreno Use the DetectDuplicate processor: if you have the same Hive query coming through, this processor will detect it and transfer those flowfiles to the duplicate connection, and you can also use the Age Off Duration property to age off cached flowfiles. Refer to this link for more details regarding DetectDuplicate processor usage.
Another way would be to use the ControlRate processor, but this processor releases a flowfile that has been waiting in the queue only after the Time Duration has elapsed. Refer to this link for more details regarding the ControlRate processor.
12-13-2018
01:27 PM
1 Kudo
@Tejash Tarun Use an UpdateAttribute processor and add new properties in it; these properties will be added as flowfile attributes, so you can use them in the PublishJMS processor.