Member since: 11-07-2016
Posts: 70
Kudos Received: 40
Solutions: 16

My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 4103 | 02-22-2018 09:20 PM |
| | 7048 | 01-30-2018 03:41 PM |
| | 1281 | 10-25-2017 06:09 PM |
| | 10949 | 08-15-2017 10:54 PM |
| | 3424 | 06-26-2017 05:05 PM |
01-21-2017
11:54 PM
1 Kudo
Hi @Praveen PentaReddy,

1. If you want to back up data in an archive location as a table, create a table like the original one:

create external table MYTABLE_archive like MYTABLE location '<hdfs_archive_location>';

2. Copy the data from the original table to the archive one.

2.a. Using HQL (dynamic partitioning should be enabled):

INSERT INTO TABLE MYTABLE_archive partition(date) SELECT * from MYTABLE where date >= <start_date> and date <= <end_date>;

2.b. Using file transfer: identify the location of each partition to be copied, copy the directories to the location of the archive table, and then add the partitions:

hdfs dfs -cp <orig_partitions_loc> <archive_table_location>/

Alternatively, you can use distcp in case of huge volumes, but check that you have permission to execute it, and make sure you understand the difference between the two commands with regard to file attributes. Then add the new partitions in the archive:

MSCK REPAIR TABLE MYTABLE_archive;
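As a side note, the dynamic partition insert in step 2.a typically requires session settings like the following (standard Hive properties; shown only as a sketch, adjust to your environment):

-- allow dynamic partition insert for the INSERT ... partition(date) statement above
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;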
01-09-2017
08:30 PM
vi is great! 🙂 But you will also be able to use gedit. You don't have to create .hql or .pig scripts. When it says "script", it means you must write down each and every step you do in order to get your task done. You will also be given the file names in which you need to save your code/script, so you won't need to spend time thinking about naming conventions. 🙂 Good luck!
01-09-2017
08:11 PM
1 Kudo
Hi @Sami Ahmad,

1. Two hours should be sufficient for any questions you will be asked, but only if you know what to do and don't need to read too much documentation. I did the certification about a month ago. The only recommendation: do install the test VM on AWS as recommended and try to complete the practice exam prior to your actual exam.

2. You don't need to remember how to create the script, but you need to be familiar with its content and have hands-on skills in rack awareness configuration using Ambari.

Good luck!
01-05-2017
07:56 PM
@Venkat Ranganathan, Small correction: if you use HCatalog but your table is still in textfile format with "|" as the field delimiter, you'll still have the same issue. You probably mean to use an HCatalog import with an ORC-formatted table - that will definitely work.
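To illustrate (a minimal sketch; the table and column names are hypothetical), the idea is that the HCatalog import targets a table declared as ORC rather than a "|"-delimited textfile, e.g.:

-- hypothetical ORC target table for the Sqoop HCatalog import
create table mytable_orc (id int, descr string)
stored as orc;

The Sqoop job would then point at this table through its HCatalog options instead of writing delimited text.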
01-05-2017
07:46 PM
Hi @Kaliyug Antagonist, Let me answer inline:

1. For the fact_rtc_starschema, the no. of files are different for CTAS copy and the CONCAT on the original table, why so?

The files in the original table are smaller than the HDFS default block size (44 MB - 76 MB). Both CTAS and CONCAT are MapReduce-style jobs (run with either the M/R or TEZ engine), so in both cases your application will try to spawn containers (mappers) according to the tez/mr configuration you have defined. The number of resulting files can differ because CTAS and CONCAT use different sets of parameters that control the degree of parallelism.

2. For fact_rtc, irrespective of the CTAS or CONCAT, the no. of files is unchanged, why so?

In this table the original number of files is already 1009, and all files are bigger than the HDFS default block size, so CTAS and CONCAT won't behave differently; most probably the resulting files will even contain the same records as the original files.

3. Certainly, the number of columns has an impact on the size but does a larger size and aplenty columns cause a large number of files?

The number of columns does not affect the execution plan or the number of files. The number of files is determined by the input split size, the number of reducers (if the job has reducers), the number of files per job (if such a restriction is defined), and the combined-input-split configuration parameter (and related) values.

4. I am unsure if 'small no. of large files' should be taken far. Currently, there are 4 data nodes, soon there will be 8 and so on. Does this argument hold then?

Please refer to the article explaining the ORC creation strategy. The article also references some additional resources to consider while working with ORC.

5. I am confused about which table is behaving incorrectly. Or is it both are correct?

Both tables are behaving according to your job configurations. Correct or incorrect - I would change the configuration (especially on a cluster of 4-8 data nodes).

6. The problem started with a query differing drastically in response time on the two types of tables. The reduction of the files has improved the performance but added more confusions, hence, I will post those as a separate thread, later.

That makes sense. Fewer files (up to some number) means better performance. See the article I mentioned above.

Recommendations for your use case:

- If you append data to your tables over time, avoid using CONCAT. After upgrading to a new Hive/ORC version you can run into issues when CONCAT is applied to ORC files of different versions: the job won't fail, but you may lose some data.
- If you decide to use CONCAT, remember that it concatenates files at the stripe level. So if you have small files with small stripes, concatenation won't make the ORC files much more efficient, although it might reduce the number of resources to be allocated by the resource manager.
- Create ORC tables as mentioned in the article, with the following configuration:

set hive.hadoop.supports.splittable.combineinputformat=true;
set tez.grouping.min-size=1073741824; -- 1 GB - depends on your container heap size
set tez.grouping.max-size=2147483648; -- 2 GB - depends on your container heap size
set mapreduce.input.fileinputformat.split.minsize=1073741824; -- 1 GB - depends on your container heap size
set mapreduce.input.fileinputformat.split.maxsize=2147483648; -- 2 GB - depends on your container heap size
set hive.merge.smallfiles.avgsize=1073741824;
set hive.exec.reducers.bytes.per.reducer=2147483648;
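For reference, since CONCAT is discussed above but its syntax is not shown, here is a minimal sketch (the table and partition names are hypothetical); ALTER TABLE ... CONCATENATE merges the ORC files of a table or partition at the stripe level:

-- merge small ORC files of a single (hypothetical) partition at the stripe level
ALTER TABLE mytable PARTITION (load_date='2017-01-01') CONCATENATE;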
01-05-2017
05:42 PM
1 Kudo
@Krishna Srinivas Sqoop doesn't support multi-character delimiters; you have to use a single character. I would suggest using a character that is native to the Hive text file format - ^A (\x01):

#!/bin/bash
# ...
delim_char=$( printf "\x01" )
sqoop import ... --input-fields-terminated-by ${delim_char} ...
# ...
01-03-2017
05:59 PM
1 Kudo
@Praveen PentaReddy, As @Timothy Spann mentioned above - Hive MD5 should work. But since it is not built into your version of Hive, you still have a few more options to choose from:

1. Use the built-in "hash" function (32 bits, since it returns INT).
2. Create your own custom UDF and implement the checksum as you would in regular Java.
3. Use the reflect UDF: SELECT reflect('org.apache.commons.codec.digest.DigestUtils', '<your_method>', 'your_string') where <your_method> can be md5Hex, sha1Hex, etc.

Consider the trade-off between coding/development investment and performance.
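A minimal sketch of the reflect approach (the table and column names are hypothetical; it assumes commons-codec is on the Hive classpath, which it normally is):

-- MD5 checksum of a string column via reflect (hypothetical table/column)
SELECT reflect('org.apache.commons.codec.digest.DigestUtils', 'md5Hex', somecolumn) AS md5_checksum
FROM mytable;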
01-02-2017
10:42 PM
@petri koski, UDFs (whether for Hive or Pig) run during the map-reduce stage (it doesn't matter whether the execution engine is M/R or TEZ). In other words, your println happens during distributed computing; the code that prints your output is not running under your own shell (unless you are running in local mode). How can you see the printed lines? There are a couple of ways:

- Using the job tracker UI: find your job and click on its logs, one container at a time, until you find it (or look in each of them, if your code applies to each and every record of the processed data).
- Using YARN to get the aggregated logs: yarn logs -applicationId <applicationId>
12-30-2016
09:03 PM
21 Kudos
Synopsis.

ORC is a columnar storage format for Hive. This document explains how the way ORC data files are created can improve read/scan performance when querying the data. The TEZ execution engine provides different ways to optimize a query, but it does its best work with correctly created ORC files.

ORC Creation Strategy.

Example:

CREATE [EXTERNAL] TABLE OrcExampleTable (clientid int, name string, address string, age int)
stored as orc
TBLPROPERTIES (
"orc.compress"="ZLIB",
"orc.compress.size"="262144",
"orc.create.index"="true",
"orc.stripe.size"="268435456",
"orc.row.index.stride"="3000",
"orc.bloom.filter.columns"="clientid,age,name");

Ingesting data into Hive tables heavily depends on usage patterns. In order to make queries run efficiently, ORC files should be created to support those patterns:

- Identify the most important/frequent queries that will run against your data set (based on filter or JOIN conditions)
- Configure an optimal data file size
- Configure stripe and stride size
- Distribute and sort data during ingestion
- Run "analyze table" in order to keep statistics updated

Usage Patterns.

Filters are mainly used in the "WHERE" clause and in "JOIN … ON". Information about the fields used in filters should also drive the choice of strategy for ORC file creation. Example:

select * from orcexampletable
where clientid=100 and age between 25 and 45;
Does size matter?

As is well known, small files are a pain in HDFS, and ORC files are no different from others - if anything, worse. First of all, small files impact NameNode memory and performance. But more important is the response time of the query: if ingestion jobs generate small files, there will be a large number of files in total. When a query is submitted, TEZ needs information about the files in order to build an execution plan and allocate resources from YARN. So, before the TEZ engine starts a job:

- TEZ gets information from HCat about the table location and partition keys. Based on this information, TEZ has the exact list of directories (and subdirectories) where data files can be found.
- TEZ reads the ORC footers and stripe-level indices in each file in order to determine how many blocks of data it will need to process. This is where a large number of files impacts the job submission time.
- TEZ requests containers based on the number of input splits. Again, small files allow less flexibility in configuring the input split size, and as a result more containers need to be allocated.

Note: if the query submit stage times out, check the number of ORC files (also, see below how the ORC split strategy (ETL vs BI) can affect query submission time).

There is always a trade-off between ingestion and query performance. Keep the number of ORC files being created to a minimum, but satisfy an acceptable level of ingestion performance and data latency. For transactional data being ingested continuously during the day, set up a daily table/partition re-build process to optimize the number of files and data distribution.
Stripes and Strides.

ORC files are splittable at the stripe level. Stripe size is configurable and should depend on the average length (size) of records and on how many unique values the sorted fields can have: if the search-by field is unique (or almost unique), decrease the stripe size; if it is heavily repeated, increase it. While the default is 64 MB, keep the stripe size between ¼ of the block size and 4 block sizes (the default ORC block size is 256 MB). Along with that, you can play with the input split size per job to decrease the number of containers required for a job. Sometimes it is even worth reconsidering the HDFS block size (the default HDFS block size is 128 MB).

A stride is a set of records for which a range index (min/max and some additional stats) is created. Stride size (number of records, default 10K): for unique (or close to unique) value combinations of the fields in the bloom filter, go with 3-7K records; for non-unique ones, 7-15K records or even more. If the bloom filter contains unsorted fields, that also pushes you toward a smaller number of records per stride. A bloom filter can be used on a sorted field in combination with additional fields that can participate in the search-by clause.
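As a hedged illustration only (the table name is hypothetical and the values are a rough sketch that would need tuning for your own data and container sizes), a table whose search-by field is close to unique might use a smaller stripe and stride than in the example at the top of this article:

-- smaller stripe (64 MB) and stride (5000 rows) than the example above, for an almost-unique clientid
create table OrcClientLookup (clientid int, name string, address string, age int)
stored as orc
TBLPROPERTIES (
"orc.compress"="ZLIB",
"orc.stripe.size"="67108864",
"orc.row.index.stride"="5000",
"orc.bloom.filter.columns"="clientid");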
Sorting and Distribution.

The most important factor for efficient search within a data set is how the set is stored. Since TEZ uses ORC file-level information (min/max range index per field, bloom filter, etc.), it is important that those ranges give the best possible reference to the exact block of data holding the desired values. Here is an example: with unsorted data, TEZ will request 4 containers and do up to a full table scan, while with sorted data only a single container reads a single stripe.

The best result can be achieved by globally sorting the data in a table (or partition). Global sorting in Hive ("ORDER BY") enforces a single reducer to sort the final data set, which can be inefficient. That's where "DISTRIBUTE BY" comes to help. For example, let's say we have a daily partition with 200 GB and a field "clientid" that we would like to sort by. Assuming we have enough power (cores) to run 20 parallel reducers, we can:

1. Limit the number of reducers to 20 (mapred.reduce.tasks).

2. Distribute all the records across the 20 reducers equally:

insert into …
select … from …
distribute by floor(clientid / ((<max(clientid)> - <min(clientid)> + 1) / 20))
sort by clientid;

Note, this works well if the client ID values are distributed evenly between the min and max values. If that's not the case, find a better distribution function, but ensure that the ranges of values going to different reducers do not intersect.

3. Alternatively, use PIG to order by client id (with PARALLEL 20).

Usage.

There is a good article on query optimization: https://community.hortonworks.com/articles/68631/optimizing-hive-queries-for-orc-formatted-tables.html

I would only add the following items to consider when working with ORC:
- A proper input split size for a query job results in fewer resources (cores/memory/containers) being allocated.
- set hive.hadoop.supports.splittable.combineinputformat=true;
- set hive.exec.orc.split.strategy=ETL; -- this works only for specific-value scans; if a full table scan is required anyway, use the default (HYBRID) or BI.
- Check out other TEZ/ORC parameters on this page: https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties

Analyze table.

Once the data is ingested and ready, run:

analyze table t [partition p] compute statistics [for columns c,...];

Refer to https://cwiki.apache.org/confluence/display/Hive/Column+Statistics+in+Hive for more details. Note that the ANALYZE statement is time consuming: the more columns you define to be analyzed, the longer it takes to complete.

Let me know if you have more tips in this area!
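For instance, against the example table above, the statement might look like this (a sketch; pick the columns that actually appear in your filters):

-- gather column statistics for the fields used in filters and bloom filters
analyze table OrcExampleTable compute statistics for columns clientid, age, name;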
12-20-2016
02:11 AM
When using JDBC, local files will be created in local directories of the host running the HiveServer2 you are connecting to (if you have more than one HiveServer2 and connect through ZooKeeper, check all server hosts). So make sure you have write permissions on those hosts, not on the NiFi host. Note, the story is completely different when you are using the Hive CLI or Beeline in embedded mode.
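A minimal sketch of the kind of statement this applies to, assuming the local export is done with INSERT OVERWRITE LOCAL DIRECTORY (the directory and table name are hypothetical): when submitted over JDBC, the target directory is created on the HiveServer2 host, not on the client.

-- over JDBC this local directory ends up on the HiveServer2 host (hypothetical path/table)
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/hive_export'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT * FROM mytable;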