Member since: 04-04-2016
Posts: 166
Kudos Received: 168
Solutions: 29

My Accepted Solutions

Views | Posted
---|---
3454 | 01-04-2018 01:37 PM
5875 | 08-01-2017 05:06 PM
1997 | 07-26-2017 01:04 AM
9620 | 07-21-2017 08:59 PM
3023 | 07-20-2017 08:59 PM
10-21-2016
02:07 PM
@Alena Melnikova Good to hear that you are happy with the results :) Answers:
1. You can go as low as 1k (for the ORC row index stride). Choose a balanced value based on the average number of rows your queries touch.
2. Wrapping the column in the to_date function will, I believe, cause the ORC index to stop working (I haven't tested that). Google "why function based index?" for the background.
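To illustrate point 2, a hedged sketch (the events table and time_stamp column are hypothetical names, not from this thread): the ORC row index keeps min/max values of the raw column per stride, so a predicate on the raw column can skip strides, while a predicate on a function of the column generally cannot.

```sql
-- Index-friendly: the literal range is compared directly against the stored
-- column, so per-stride min/max statistics can skip data.
SELECT count(*) FROM events
WHERE time_stamp >= '2015-03-01 00:00:00'
  AND time_stamp <  '2015-03-02 00:00:00';

-- Likely index-unfriendly: the column is wrapped in a function, so every row
-- has to be evaluated before the filter can be applied.
SELECT count(*) FROM events
WHERE to_date(time_stamp) = '2015-03-01';
```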
10-18-2016
02:04 PM
@Alena Melnikova You got it, there is no difference apart from a very subtle one: in approach 1 we still keep a bit of dependency on partition pruning, while approach 2 relies entirely on the ordering of the data via the ORC index.
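To make the contrast concrete, a hedged sketch of the two query shapes (table and column names are placeholders, not from this thread):

```sql
-- Approach 1: yearly partitions plus the ORC index; the query still names the
-- partition column, so partition pruning removes whole partitions first.
SELECT * FROM events_by_year
WHERE year = '2015'
  AND time_stamp BETWEEN '2015-03-01 00:00:00' AND '2015-03-02 00:00:00';

-- Approach 2: a single unpartitioned table inserted in timestamp order;
-- only the ORC row index (min/max per stride) skips data.
SELECT * FROM events_ordered
WHERE time_stamp BETWEEN '2015-03-01 00:00:00' AND '2015-03-02 00:00:00';
```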
10-17-2016
07:02 PM
2 Kudos
Although this is a simple solution, there is not much reference to this exact problem on the web. We have seen people doing multiple projections, flattening, and group-by operations to get the data into shape for storing into the final table with Pig after the aggregation and group operations.

Source Table:

CREATE EXTERNAL TABLE IF NOT EXISTS source(
  pk1 string,
  pk2 string,
  agg1 INT,
  agg2 INT
)
STORED AS ORC tblproperties("orc.compress"="ZLIB");
Data:

Target Table:

CREATE EXTERNAL TABLE IF NOT EXISTS target_aggregated(
  pk1 string,
  pk2 string,
  sum_agg1 BIGINT,
  sum_agg2 BIGINT
)
STORED AS ORC tblproperties("orc.compress"="ZLIB");
Pig Script:

-- Load the data into a Pig relation
staging = LOAD 'DEFAULT.SOURCE' USING org.apache.hive.hcatalog.pig.HCatLoader();
-- Group the data
group_staging = GROUP staging BY (pk1, pk2);
-- Flatten the grouped data and generate aggregates with the same attribute names as the target table
calculate_group_staging = FOREACH group_staging GENERATE FLATTEN(group) AS (pk1, pk2), SUM(staging.agg1) AS sum_agg1, SUM(staging.agg2) AS sum_agg2;
-- Order the data if required
calculate_group_staging_ordered = ORDER calculate_group_staging BY pk1, pk2;
-- Store the data using HCatStorer; the target schema is resolved automatically from the HCatalog metastore
STORE calculate_group_staging_ordered INTO 'DEFAULT.TARGET_AGGREGATED' USING org.apache.hive.hcatalog.pig.HCatStorer();
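As an optional, hedged sanity check (not part of the original workflow): the same aggregation can be run directly in Hive and compared against the table written by the Pig script; an empty result means the two agree.

```sql
-- Compare a direct Hive aggregation of "source" against "target_aggregated"
-- as written by the Pig script above; any row returned indicates a mismatch.
SELECT s.pk1, s.pk2, s.sum_agg1, s.sum_agg2
FROM (
  SELECT pk1, pk2, SUM(agg1) AS sum_agg1, SUM(agg2) AS sum_agg2
  FROM source
  GROUP BY pk1, pk2
) s
LEFT OUTER JOIN target_aggregated t
  ON s.pk1 = t.pk1 AND s.pk2 = t.pk2
WHERE t.pk1 IS NULL
   OR s.sum_agg1 <> t.sum_agg1
   OR s.sum_agg2 <> t.sum_agg2;
```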
Final Results:

Hope this helps. Thanks.
10-17-2016
06:54 PM
@Alena Melnikova Taking a second glance at my answer, I feel that you would not need any partitions at all, provided you can compact each year's data into one file, so 10 files in total. A few more steps, but it would work as gracefully as the yearly partition. So try out solution 1 and, when it works, try out solution 2, then pick one :) Thanks
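A minimal sketch of that compaction step, assuming hypothetical names (events_raw, events_compact, and a time_stamp string in standard yyyy-MM-dd format) that are not from this thread: a global ORDER BY in Hive runs through a single reducer, so each per-year insert produces roughly one ordered ORC file.

```sql
-- Hypothetical compaction of one year's data into a single ordered ORC file;
-- repeat per year (e.g. 2007..2016) to end up with ~10 files in total.
INSERT INTO TABLE events_compact
SELECT *
FROM events_raw
WHERE year(time_stamp) = 2007
ORDER BY time_stamp;   -- global ORDER BY uses one reducer -> roughly one file
```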
10-17-2016
06:48 PM
2 Kudos
Hi @Alena Melnikova

Disclaimer: this is without any knowledge of the data.

STEPS:
1. For such a small data set I would partition by YEAR.
2. I would insert the data ordered by timestamp (use Pig if Hive is taking more time).
3. Table structure:
....
PARTITIONED BY ( year STRING COMMENT '')
STORED AS ORC tblproperties("orc.compress"="ZLIB", "orc.bloom.filter.columns"="time_stamp", "orc.create.index"="true", "orc.stripe.size"="268435456", "orc.row.index.stride"="12000", "orc.compress.size"="262144");
4. Collect statistics on the table.
5. Set a few config parameters in Hive:
set hive.optimize.index.filter=true;
set hive.exec.orc.skip.corrupt.data=true;
set hive.vectorized.execution.enabled=true;
set hive.exec.compress.output=true;
set hive.execution.engine=tez;
set tez.am.container.reuse.enabled=true;
set hive.compute.query.using.stats=true;
set hive.stats.reliable=true;
set hive.cbo.enable=true;
set hive.optimize.sort.dynamic.partition=true;
set hive.optimize.ppd=true;
set hive.optimize.ppd.storage=true;
set hive.merge.tezfiles=true;
set hive.hadoop.supports.splittable.combineinputformat=true;
set mapreduce.map.speculative=true;
6. Query with both the YEAR extracted from the timestamp (try the regexp_replace function in Hive) and the TIMESTAMP itself (a sketch putting these steps together is shown below).

Do not miss any of the steps above and post back about the awesome results you get :) Thanks
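Putting the steps together, a hedged end-to-end sketch; the table, column, and staging names (events, raw_events, time_stamp, val) are placeholders I am assuming, and the column list merely stands in for the "...." in step 3.

```sql
-- Steps 1 and 3: yearly-partitioned ORC table with the listed properties.
CREATE EXTERNAL TABLE IF NOT EXISTS events (
  time_stamp STRING,
  val        DOUBLE
)
PARTITIONED BY (year STRING COMMENT 'derived from time_stamp')
STORED AS ORC tblproperties(
  "orc.compress"="ZLIB",
  "orc.bloom.filter.columns"="time_stamp",
  "orc.create.index"="true",
  "orc.stripe.size"="268435456",
  "orc.row.index.stride"="12000",
  "orc.compress.size"="262144");

-- Step 2: insert ordered by timestamp so the ORC row index stays selective.
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE events PARTITION (year)
SELECT time_stamp, val, substr(time_stamp, 1, 4) AS year
FROM raw_events
ORDER BY time_stamp;

-- Step 4: collect statistics.
ANALYZE TABLE events PARTITION (year) COMPUTE STATISTICS;

-- Step 6: filter on both the partition column and the timestamp itself.
SELECT *
FROM events
WHERE year = '2015'
  AND time_stamp BETWEEN '2015-03-01 00:00:00' AND '2015-03-02 00:00:00';
```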
10-13-2016
06:35 PM
2 Kudos
Hi, for very large datasets in the PB range, does it help to create large ORC files? I understand they should be larger than the block size. So let's say I have a block size of 256 MB and am creating 1 GB ORC files for a Hive table with a total size of 3 TB. Would it help to create bigger files, say 2 GB? Keep in mind I will be using the ORC index to query only one file per partition, and the data output will be in the KB range. Thanks
Labels:
- Apache Hadoop
- Apache Hive
07-18-2016
04:40 PM
Hi @shubham chhabra I am guessing you are doing some move/delete/rename operation in your code which breaks the parallel flow, as in one mapper deleting a file still used by another mapper. If I were debugging, one quick way to start would be to comment out the lines performing the move/delete/rename operations and rerun the code to prove that this is the case, and then start editing. I hope this will work. Keep us posted. Thanks
07-18-2016
04:31 PM
Hi @Jaime I think it is the NameNode's total Java max heap size. Please go through this setting (you have to change it from the HDFS config): https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.2/bk_installing_manually_book/content/ref-80953924-1cbf-4655-9953-1e744290a6c3.1.html Please let us know if this fixes the problem. Thanks
07-18-2016
04:24 PM
2 Kudos
Hi @bigdata.neophyte Here is a detailed answer to your question: https://martin.atlassian.net/wiki/pages/viewpage.action?pageId=27885570 You can grant the superuser-group privileges that the hdfs user has to other groups. In short, if you change this parameter (under HDFS config): dfs.permissions.supergroup = mygroup and then add any user to mygroup, that user will have superuser privileges. Also refer to this: http://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-hdfs/HdfsPermissionsGuide.html#Configuration_Parameters But in my opinion superuser accounts should be very limited, if not just one. Thanks