Member since: 06-08-2017
Posts: 1049
Kudos Received: 518
Solutions: 312

My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 11227 | 04-15-2020 05:01 PM |
| | 7131 | 10-15-2019 08:12 PM |
| | 3114 | 10-12-2019 08:29 PM |
| | 11496 | 09-21-2019 10:04 AM |
| | 4343 | 09-19-2019 07:11 AM |
01-06-2019
02:42 PM
@Basil Paul
1. To count rows where the 2gusage column holds a SQL NULL, try the query below:
select count(*) from demo where 2gusage is NULL;
2. If the literal string "NULL" is in your data for the 2gusage column, then use the query below:
select count(*) from demo where 2gusage = "NULL";
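A minimal sketch of the difference, using a throwaway table (demo_null_test and its values are made-up placeholders) where one row holds a real NULL and another holds the literal string "NULL":

create table demo_null_test (usage_col string); -- placeholder table/column names
insert into table demo_null_test values (null), ("NULL"), ("5GB");
select count(*) from demo_null_test where usage_col is null; -- matches only the real NULL row: 1
select count(*) from demo_null_test where usage_col = "NULL"; -- matches only the literal string row: 1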
01-05-2019
11:28 PM
1 Kudo
@PP Include an over() clause in your select query. Try the query below:
select row_number() over(),* from testDB.testTable;
Example:
select row_number() over() as rn,* from (
select stack(2,1,"foo",2,"bar") as (id,name)
)t;
+-----+-------+---------+--+
| rn | t.id | t.name |
+-----+-------+---------+--+
| 1 | 1 | foo |
| 2 | 2 | bar |
+-----+-------+---------+--+
"rn" is the row number column added in the result above.
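One caveat: with an empty over() clause, row_number() assigns numbers in whatever order the rows happen to be processed, so the numbering is not guaranteed to be stable across runs. A variant of the same example that orders explicitly inside the over() clause (the order by column is our choice here):

select row_number() over(order by id) as rn,* from (
select stack(2,1,"foo",2,"bar") as (id,name)
)t; -- rn now follows ascending id deterministically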
01-03-2019
01:51 AM
@Jack Here are the properties:

hive.compactor.delta.pct.threshold
Default: 0.1
Scope: Metastore
Percentage (fractional) size of the delta files relative to the base that will trigger a major compaction. 1 = 100%, so the default 0.1 = 10%.

hive.compactor.abortedtxn.threshold
Default: 1000
Scope: Metastore
Number of aborted transactions involving a given table or partition that will trigger a major compaction.

Setting compaction properties via TBLPROPERTIES:

CREATE TABLE table_name (
  id int,
  name string
)
CLUSTERED BY (id) INTO 2 BUCKETS STORED AS ORC
TBLPROPERTIES (
  "transactional"="true",
  "compactor.mapreduce.map.memory.mb"="2048",                  -- specify compaction map job properties
  "compactorthreshold.hive.compactor.delta.num.threshold"="4", -- trigger minor compaction if there are more than 4 delta directories
  "compactorthreshold.hive.compactor.delta.pct.threshold"="0.5" -- trigger major compaction if the ratio of size of delta files to
                                                                -- size of base files is greater than 50%
);

ALTER TABLE table_name COMPACT 'minor'
WITH OVERWRITE TBLPROPERTIES ("compactor.mapreduce.map.memory.mb"="3072"); -- specify compaction map job properties

ALTER TABLE table_name COMPACT 'major'
WITH OVERWRITE TBLPROPERTIES ("tblprops.orc.compress.size"="8192"); -- change any other Hive table properties

We can trigger major compactions with the command below:
alter table <table-name> partition(<partition-name>,<nested-partition-name>,..) compact 'major';
More details on this page: https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions
01-02-2019
03:30 AM
@Jack The parameters below control when compactions are triggered.

| Configuration Parameter | Description |
|---|---|
| hive.compactor.delta.num.threshold | Specifies the number of delta directories in a partition that triggers an automatic minor compaction. The default value is 10. |
| hive.compactor.delta.pct.threshold | Specifies the percentage size of delta files relative to the corresponding base files that triggers an automatic major compaction. The default value is 0.1, which is 10 percent. |
| hive.compactor.abortedtxn.threshold | Specifies the number of aborted transactions on a single partition that triggers an automatic major compaction. |

For all the Hive compaction parameters, refer to the link below: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.5/bk_data-access/content/understanding-administering-compactions.html
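These thresholds can also be overridden for a single table. A minimal sketch, assuming an existing ACID table (my_acid_table is a placeholder name) and using the per-table compactorthreshold.* TBLPROPERTIES prefix from the Hive transactions documentation:

alter table my_acid_table set tblproperties (
"compactorthreshold.hive.compactor.delta.num.threshold"="5", -- minor compaction after 5 delta directories
"compactorthreshold.hive.compactor.delta.pct.threshold"="0.25" -- major compaction at a 25% delta-to-base size ratio
);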
12-28-2018
07:23 PM
@GEORGE NASIS When calculating an average of averages, we are not going to get the same value. Example: Hive table data:
select * from i;
+---------+------+--------+-------+--+
| gender | age | total | name |
+---------+------+--------+-------+--+
| M | 10 | 12 | st |
| F | 10 | 8 | st |
| M | 12 | 15 | st |
+---------+------+--------+-------+--+
Calculate avg on one field:
select avg(total) from i group by name;
+---------------------+--+
| _c0 |
+---------------------+--+
| 11.666666666666666 |
+---------------------+--+
Calculate avg using a subquery:
select avg(tt) from (
select name,avg(total)tt from i group by gender,name)t group by name;
+--------+--+
| _c0 |
+--------+--+
| 10.75 |
+--------+--+
Because the inner query groups by 2 dimensions (gender, name) while the outer query groups by 1 dimension (name), the outer query averages only 2 group rows instead of the original 3 rows, so the averages differ: the M group averages to (12+15)/2 = 13.5 and the F group to 8, and (13.5 + 8)/2 = 10.75 rather than (12+8+15)/3 = 11.67. The correct subquery carries a count column and divides sum by count:
select sum(tt)/sum(cnt) from(
select name,sum(total)tt,count(*)cnt,avg(total) av from i group by gender,name -- add cnt, sum and avg columns
)t group by name;
+---------------------+--+
| _c0 |
+---------------------+--+
| 11.666666666666666 |
+---------------------+--+
You can try this approach (adding a count column, then dividing sum(total) by sum(count) to get correct avg results) and cache the most appropriate dataframe; the Spark optimizer will then choose the most efficient plan to run these tasks.
12-28-2018
03:43 AM
1 Kudo
@GEORGE NASIS Since you are using the df3 dataframe for all the groupBy clauses, cache "df3"; Spark will then not recompute the data from the file all 16 times, but will use the cached dataframe (df3) instead, which significantly increases performance.
val df3 = df2.withColumn("Age", currentDate($"BirthDate"))
df3.cache() //caching the dataframe into memory
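// note: cache() is lazy; the data is materialized by the first action that runs over df3,
// after which the 16 groupBy aggregations reuse the in-memory copy instead of recomputing it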
12-28-2018
03:35 AM
@Jack There are two types of compactions that happen in ACID tables:
1. Minor compaction: a 'minor' compaction takes all the delta files and rewrites them into a single delta file. This compaction doesn't consume many resources.
2. Major compaction: a 'major' compaction takes one or more delta files (as in a minor compaction) plus the base file for the bucket, and rewrites them into a new base file per bucket.
Delta files are cleared out when a minor/major compaction happens, and all these tasks are initiated by Hive in the background based on the hive-site.xml configs; refer to this link for more details. Take a look at this thread to understand how to initialize Hive compactions manually.
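To initialize a compaction manually, a minimal sketch (my_acid_table is a placeholder name):

alter table my_acid_table compact 'minor'; -- queue a minor compaction by hand
alter table my_acid_table compact 'major'; -- or rewrite the base file entirely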
12-26-2018
03:27 AM
@CarlosHS Even when we replace columns in a Hive text table, the underlying data is not changed, i.e. the wikipedia_link data is still present in the HDFS file. So when we access the table, Hive reads the data as ","-delimited and returns the wikipedia_link data in place of the keywords column.
Steps to drop the wikipedia_link column along with its data:
hive> set hive.support.quoted.identifiers=none;
hive> create table countries_temp row format delimited fields terminated by ',' stored as textfile as select `(wikipedia_link)?+.+` from countries; -- create a temp table without the wikipedia_link column
hive> drop table countries; -- drop the countries table
hive> alter table countries_temp rename to countries; -- rename the temp table to countries
(or) Another way would be to create a view without the wikipedia_link column:
hive> create view vw_countries as select id,code,name,continent,keywords from countries;
then access the data from the view instead of the countries table.
12-13-2018
01:33 PM
1 Kudo
@Jose Luis Moreno Use the DetectDuplicate processor: if you have the same Hive query coming through, this processor will detect it and transfer those flowfiles to the duplicate connection, and you can also use the Age Off Duration property to age off cached flowfiles. Refer to this link for more details regarding DetectDuplicate processor usage.
Another way would be to use the ControlRate processor, but this processor releases a flowfile that has been waiting in the queue only after the Time Duration has elapsed. Refer to this link for more details regarding the ControlRate processor.
12-13-2018
01:27 PM
1 Kudo
@Tejash Tarun Use an UpdateAttribute processor and add new properties in it; these properties will be added as flowfile attributes, so you can use them in the PublishJMS processor.