Member since: 04-04-2016
Posts: 166
Kudos Received: 168
Solutions: 29

My Accepted Solutions

Views | Posted
---|---
3454 | 01-04-2018 01:37 PM
5875 | 08-01-2017 05:06 PM
1997 | 07-26-2017 01:04 AM
9620 | 07-21-2017 08:59 PM
3023 | 07-20-2017 08:59 PM
10-21-2016
02:07 PM
@Alena Melnikova Good to hear that you are happy with the results :) Answers:
1. You can go as low as 1k (for the ORC row index stride). Choose a balanced value based on the average number of rows your queries touch.
2. Wrapping the column in the to_date function will, I believe, cause the ORC index to stop working (I haven't tested that). Google "why function based index?" for the background.
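To illustrate point 2, a hedged sketch (the events table and time_stamp column are hypothetical names, not from this thread): the ORC row index keeps min/max values of the raw column per stride, so a predicate on the raw column can skip strides, while a predicate on a function of the column generally cannot.

```sql
-- Index-friendly: the literal range is compared directly against the stored
-- column, so per-stride min/max statistics can skip data.
SELECT count(*) FROM events
WHERE time_stamp >= '2015-03-01 00:00:00'
  AND time_stamp <  '2015-03-02 00:00:00';

-- Likely index-unfriendly: the column is wrapped in a function, so every row
-- has to be evaluated before the filter can be applied.
SELECT count(*) FROM events
WHERE to_date(time_stamp) = '2015-03-01';
```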
10-18-2016
02:04 PM
@Alena Melnikova You got it, there is no difference apart from a very subtle one: in approach 1 we still keep a bit of dependency on partition pruning, while approach 2 relies entirely on the ordering of the data via the ORC index.
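To make the contrast concrete, a hedged sketch of the two query shapes (table and column names are placeholders, not from this thread):

```sql
-- Approach 1: yearly partitions plus the ORC index; the query still names the
-- partition column, so partition pruning removes whole partitions first.
SELECT * FROM events_by_year
WHERE year = '2015'
  AND time_stamp BETWEEN '2015-03-01 00:00:00' AND '2015-03-02 00:00:00';

-- Approach 2: a single unpartitioned table inserted in timestamp order;
-- only the ORC row index (min/max per stride) skips data.
SELECT * FROM events_ordered
WHERE time_stamp BETWEEN '2015-03-01 00:00:00' AND '2015-03-02 00:00:00';
```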
10-17-2016
07:02 PM
2 Kudos
Although this is a simple solution, there is not much reference to this exact problem on the web. We have seen people doing multiple projections, flattening, and group-by operations to get the data into shape for storing into the final table with Pig after the aggregation and group operations.

Source Table:

CREATE EXTERNAL TABLE IF NOT EXISTS source(
  pk1 string,
  pk2 string,
  agg1 INT,
  agg2 INT
)
STORED AS ORC tblproperties("orc.compress"="ZLIB");
Data:

Target Table:

CREATE EXTERNAL TABLE IF NOT EXISTS target_aggregated(
  pk1 string,
  pk2 string,
  sum_agg1 BIGINT,
  sum_agg2 BIGINT
)
STORED AS ORC tblproperties("orc.compress"="ZLIB");
Pig Script:

-- Load the data into a Pig relation
staging = LOAD 'DEFAULT.SOURCE' USING org.apache.hive.hcatalog.pig.HCatLoader();
-- Group the data
group_staging = GROUP staging BY (pk1, pk2);
-- Flatten the grouped data and generate aggregates with the same attribute names as the target table
calculate_group_staging = FOREACH group_staging GENERATE FLATTEN(group) AS (pk1, pk2), SUM(staging.agg1) AS sum_agg1, SUM(staging.agg2) AS sum_agg2;
-- Order the data if required
calculate_group_staging_ordered = ORDER calculate_group_staging BY pk1, pk2;
-- Store the data using HCatStorer; the target schema is resolved automatically from the HCatalog metastore
STORE calculate_group_staging_ordered INTO 'DEFAULT.TARGET_AGGREGATED' USING org.apache.hive.hcatalog.pig.HCatStorer();
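As an optional, hedged sanity check (not part of the original workflow): the same aggregation can be run directly in Hive and compared against the table written by the Pig script; an empty result means the two agree.

```sql
-- Compare a direct Hive aggregation of "source" against "target_aggregated"
-- as written by the Pig script above; any row returned indicates a mismatch.
SELECT s.pk1, s.pk2, s.sum_agg1, s.sum_agg2
FROM (
  SELECT pk1, pk2, SUM(agg1) AS sum_agg1, SUM(agg2) AS sum_agg2
  FROM source
  GROUP BY pk1, pk2
) s
LEFT OUTER JOIN target_aggregated t
  ON s.pk1 = t.pk1 AND s.pk2 = t.pk2
WHERE t.pk1 IS NULL
   OR s.sum_agg1 <> t.sum_agg1
   OR s.sum_agg2 <> t.sum_agg2;
```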
Final Results:

Hope this helps. Thanks.
10-17-2016
06:54 PM
@Alena Melnikova Taking a second glance at my answer, I feel that you would not need any partitions at all, provided you can compact each year's data into one file, so 10 files in total. A few more steps, but it would work as gracefully as the yearly partition. So try out solution 1 and, when it works, try out solution 2, then pick one :) Thanks
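A minimal sketch of that compaction step, assuming hypothetical names (events_raw, events_compact, and a time_stamp string in standard yyyy-MM-dd format) that are not from this thread: a global ORDER BY in Hive runs through a single reducer, so each per-year insert produces roughly one ordered ORC file.

```sql
-- Hypothetical compaction of one year's data into a single ordered ORC file;
-- repeat per year (e.g. 2007..2016) to end up with ~10 files in total.
INSERT INTO TABLE events_compact
SELECT *
FROM events_raw
WHERE year(time_stamp) = 2007
ORDER BY time_stamp;   -- global ORDER BY uses one reducer -> roughly one file
```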
10-17-2016
06:48 PM
2 Kudos
Hi @Alena Melnikova

Disclaimer: this is without any knowledge of the data.

STEPS:
1. For such a small data set I would partition by YEAR.
2. I would insert the data ordered by timestamp (use Pig if Hive is taking more time).
3. Table structure:
....
PARTITIONED BY ( year STRING COMMENT '')
STORED AS ORC tblproperties("orc.compress"="ZLIB", "orc.bloom.filter.columns"="time_stamp", "orc.create.index"="true", "orc.stripe.size"="268435456", "orc.row.index.stride"="12000", "orc.compress.size"="262144");
4. Collect statistics on the table.
5. Set a few config parameters in Hive:
set hive.optimize.index.filter=true;
set hive.exec.orc.skip.corrupt.data=true;
set hive.vectorized.execution.enabled=true;
set hive.exec.compress.output=true;
set hive.execution.engine=tez;
set tez.am.container.reuse.enabled=true;
set hive.compute.query.using.stats=true;
set hive.stats.reliable=true;
set hive.cbo.enable=true;
set hive.optimize.sort.dynamic.partition=true;
set hive.optimize.ppd=true;
set hive.optimize.ppd.storage=true;
set hive.merge.tezfiles=true;
set hive.hadoop.supports.splittable.combineinputformat=true;
set mapreduce.map.speculative=true;
6. Query with both the YEAR extracted from the timestamp (try the regexp_replace function in Hive) and the TIMESTAMP itself (a sketch putting these steps together is shown below).

Do not miss any of the steps above and post back about the awesome results you get :) Thanks
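Putting the steps together, a hedged end-to-end sketch; the table, column, and staging names (events, raw_events, time_stamp, val) are placeholders I am assuming, and the column list merely stands in for the "...." in step 3.

```sql
-- Steps 1 and 3: yearly-partitioned ORC table with the listed properties.
CREATE EXTERNAL TABLE IF NOT EXISTS events (
  time_stamp STRING,
  val        DOUBLE
)
PARTITIONED BY (year STRING COMMENT 'derived from time_stamp')
STORED AS ORC tblproperties(
  "orc.compress"="ZLIB",
  "orc.bloom.filter.columns"="time_stamp",
  "orc.create.index"="true",
  "orc.stripe.size"="268435456",
  "orc.row.index.stride"="12000",
  "orc.compress.size"="262144");

-- Step 2: insert ordered by timestamp so the ORC row index stays selective.
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE events PARTITION (year)
SELECT time_stamp, val, substr(time_stamp, 1, 4) AS year
FROM raw_events
ORDER BY time_stamp;

-- Step 4: collect statistics.
ANALYZE TABLE events PARTITION (year) COMPUTE STATISTICS;

-- Step 6: filter on both the partition column and the timestamp itself.
SELECT *
FROM events
WHERE year = '2015'
  AND time_stamp BETWEEN '2015-03-01 00:00:00' AND '2015-03-02 00:00:00';
```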
10-13-2016
06:35 PM
2 Kudos
Hi, for very large datasets in the PB range, does it help to create large ORC files? I understand they should be larger than the block size. So let's say I have a block size of 256 MB and am creating 1 GB ORC files for a Hive table with a total size of 3 TB. Would it help to create bigger files, say 2 GB? Keep in mind I will be using the ORC index to query only one file per partition, and the data output will be in the KB range. Thanks
Labels:
- Apache Hadoop
- Apache Hive
07-18-2016
04:40 PM
Hi @shubham chhabra I am guessing you are doing some move/delete/rename operation in your code which breaks the parallel flow, as in one mapper deleting a file still used by another mapper. If I were debugging, one quick way to start would be to comment out the lines performing the move/delete/rename operations and rerun the code to prove that this is the case, and then start editing. I hope this will work. Keep us posted. Thanks
07-18-2016
04:31 PM
Hi @Jaime I think it is the NameNode's total Java max heap size. Please go through this setting (you have to change it from the HDFS config): https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.2/bk_installing_manually_book/content/ref-80953924-1cbf-4655-9953-1e744290a6c3.1.html Please let us know if this fixes the problem. Thanks
07-18-2016
04:24 PM
2 Kudos
Hi @bigdata.neophyte Here is a detailed answer to your question: https://martin.atlassian.net/wiki/pages/viewpage.action?pageId=27885570 You can grant the superuser-group privileges that the hdfs user has to other groups. In short, if you change this parameter (under HDFS config): dfs.permissions.supergroup = mygroup and then add any user to mygroup, that user will have superuser privileges. Also refer to this: http://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-hdfs/HdfsPermissionsGuide.html#Configuration_Parameters But in my opinion superuser accounts should be very limited, if not just one. Thanks