Member since: 02-01-2018
Posts: 37
Kudos Received: 2
Solutions: 0
03-11-2018
10:26 PM
Actually it is not Impala; I am running Hive 0.13. My table is partitioned by date. In Hive 0.13 I couldn't find any CBO property.
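For what it's worth, the only CBO switch I could find documented looks like the one below, and as far as I can tell it only appears from Hive 0.14 onward, so it is not something I can set in 0.13:

SET hive.cbo.enable=true;   -- documented for Hive 0.14+, not present in 0.13 as far as I can tell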
03-10-2018
09:51 PM
@csguna Yes, I have tried SET hive.compute.query.using.stats=true; and SET hive.stats.fetch.column.stats=true; Can you tell me which stats these are? Table A was in sequence file format. Yes, by default ZLIB is enabled in B and Snappy in A. What I care about is latency.
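If those settings refer to table and column statistics gathered with ANALYZE, a minimal sketch of what I assume would need to be run first is below (placeholder table and column names, not taken from this thread):

-- Gather basic table statistics (hypothetical table name)
ANALYZE TABLE my_table COMPUTE STATISTICS;

-- Column statistics are what hive.stats.fetch.column.stats reads;
-- in Hive 0.13 the columns are listed explicitly
ANALYZE TABLE my_table COMPUTE STATISTICS FOR COLUMNS age;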
03-08-2018
11:34 AM
@Binu Mathew Hey, in my case a lot of mappers are launched when I run a select query on the ORC table. Also, are there particular Hive settings that have to be turned on so that read operations on ORC use predicate pushdown (PPD)? I have tried a lot, but almost all my queries read roughly the full size of my ORC table, which means the reader is reading the whole ORC file. I run Hive 0.13.
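For reference, the settings I understand to be related to ORC predicate pushdown are the following; I am not certain this list is complete or that these are the right values for my case:

-- Enable predicate pushdown in the optimizer and in the storage layer
SET hive.optimize.ppd=true;
SET hive.optimize.ppd.storage=true;

-- Push the filter into the ORC reader so stripes/row groups can be skipped
SET hive.optimize.index.filter=true;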
03-08-2018
08:38 AM
@rtrivedi The query was a simple single-column select filtered on age. # Detailed Table Information
Database: a0m01lf
Owner: cbb_interactions
CreateTime: Thu Mar 01 01:20:27 PST 2018
LastAccessTime: UNKNOWN
Protect Mode: None
Retention: 0
Location: maprfs:/hive/a0m01lf.db/temp_segments
Table Type: MANAGED_TABLE
Table Parameters:
last_modified_by a0m01lf
last_modified_time 1519896138
transient_lastDdlTime 1519896138
# Storage Information
SerDe Library: org.apache.hadoop.hive.ql.io.orc.OrcSerde
InputFormat: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
serialization.format 1
03-08-2018
08:26 AM
Also, by mistake I wrote that the data read is the same; what I meant is that the data read from B is equal to the full size of B, i.e. no predicate pushdown is happening.
03-08-2018
08:08 AM
Hey @kgautam, I didn't understand what you said. I think row groups should be skipped based on the min and max values stored for them. I also tried selecting just one column; it didn't work either. Maybe my understanding is wrong.
03-08-2018
12:28 AM
I have a table A with a column age (string) in it (table size is 74 GB, Hive 0.13). I created a table B with the same data as A, but in ORC file format, and included a sort by age while creating B. Now when I run the query select count(id) from X where age=25; the data read from table B is the same as the size of B (I expected predicate pushdown), and the time taken on A is almost equal to B. Theoretically, because of predicate pushdown in ORC, a lot of data should have been skipped, saving a lot of time. I suspect the indexes created in the ORC file are not being read. I have tried almost everything, but nothing works. Please help with this.
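To make the setup concrete, something along these lines is what I mean (a simplified sketch with placeholder details, not my exact statements; note age is a string, so the literal is quoted here):

-- Hypothetical reconstruction of how the sorted ORC copy could be built
CREATE TABLE B STORED AS ORC AS
SELECT * FROM A ORDER BY age;   -- a global sort should keep stripe min/max ranges from overlapping

-- The query under test
SELECT count(id) FROM B WHERE age = '25';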
Labels:
- Apache Hive
03-07-2018
06:46 AM
I have a table A with an age column in it. I created a table B in ORC with the age column sorted. Now, running select * from A where age=60; and select * from B where age=60; both read the same amount of data, and no difference in time was observed. Please help me with this.
Labels:
- Apache Hadoop
- Apache Hive
03-05-2018
06:09 AM
Will usage of DISTRIBUTE BY or SORT BY be helpful?
03-05-2018
04:28 AM
My Hive table is in ORC format, and queries on it run fastest when the columns in the WHERE clause are sorted. In my case they currently are not. What is the syntax to sort a column just before querying?
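To illustrate what I am after, I imagine rewriting the data sorted on the filter column would look something like the sketch below (placeholder names; please correct me if this is not the right approach):

-- Hypothetical rewrite so that files come out physically sorted on age
INSERT OVERWRITE TABLE my_orc_table
SELECT * FROM my_orc_table
DISTRIBUTE BY age SORT BY age;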
Labels:
- Apache Hadoop
- Apache Hive
03-02-2018
06:39 PM
@owen My number of mappers and reducers is almost down to half with ORC for a query, and the bytes read from HDFS are also reduced significantly. But the time taken by the ORC query is still almost the same as the sequence file query.
03-02-2018
05:57 PM
This is what I get when I run any query with EXPLAIN on my table A (Hive 0.13). Below is a snapshot of the result for a simple select query.
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: temp_segments
Statistics: Num rows: ****** Data size: ****** Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: column1 (type: string)
outputColumnNames: _col0
Statistics: Num rows: ****** Data size: ****** Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: true
Statistics: Num rows: ****** Data size: ****** Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: -1
The table is definitely in ORC (verified). But here the plan shows the table as a text file. Any idea why this is?
Labels:
- Apache Hadoop
- Apache Hive
03-02-2018
05:38 PM
1 Kudo
Yup @Michael Young. Another way I found was running hadoop fs -text <file-location>. At the top of the output, INFO compress.CodecPool: Got brand-new decompressor [.snappy] is printed, which I think confirms that Snappy compression is applied.
03-01-2018
08:59 AM
The problem was resolved. The mistake was somewhere else.
02-27-2018
06:45 PM
I am using Hive 0.13. I haven't tried turning on vectorization yet. It was a sum over an entire column (of one partition in my table).
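For reference, I assume turning it on would just be the setting below; whether it actually helps here presumably depends on the file format and column types (in 0.13 the vectorized reader works on ORC):

-- Enable vectorized execution (rows processed in batches)
SET hive.vectorized.execution.enabled=true;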
02-27-2018
10:01 AM
My Hadoop jobs fail if I set org.apache.hadoop.io.compress.DefaultCodec as my mapred.output.compression.codec in Hive 0.13. I am not able to find the reason.
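For reference, the kind of configuration I mean is roughly the following; a hedged sketch, since I haven't reproduced my exact session settings here:

-- Compress the final job output with the codec in question
SET hive.exec.compress.output=true;
SET mapred.output.compress=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.DefaultCodec;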
Labels:
- Apache Hadoop
- Apache Hive
02-27-2018
07:57 AM
1 Kudo
So I compressed my table in Hive using Snappy compression, and it did get compressed; the size was reduced. But when I run hadoop fs -lsr /hive/user.db/table_name, I see no files with a .snappy extension. I want to know whether they really were Snappy compressed or not.
Labels:
- Apache Hadoop
- Apache Hive
02-26-2018
06:24 PM
Link: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC The page above has a section named serialization. Can somebody explain what serialization is and what it is used for?
Labels:
- Apache Hadoop
- Apache Hive
02-25-2018
07:36 PM
Are schema changes like adding, deleting, or renaming columns, or modifying a column's data type, permitted on ORC tables in Hive 0.13 without breaking anything?
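To make the question concrete, these are the kinds of statements I have in mind (placeholder names; shown only as a sketch, not as operations confirmed to be safe on ORC in 0.13):

-- Hypothetical examples of the schema changes in question
ALTER TABLE my_orc_table ADD COLUMNS (new_col STRING);
ALTER TABLE my_orc_table CHANGE COLUMN old_name new_name STRING;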
Labels:
- Apache Hadoop
- Apache Hive
02-23-2018
07:52 AM
Hey, I am pretty confused about which storage format is suited to which type of data. You said "Parquet is well suited for data warehouse kind of solutions where aggregations are required on certain column over a huge set of data.", but I think that is true for ORC too. And as @owen said, ORC contains indexes at 3 levels (2 levels in Parquet), so shouldn't ORC be faster than Parquet for aggregations?
02-22-2018
06:44 AM
Hey Owen, my file A was 130 GB in sequence file format and 78 GB in ORC+ZLIB format. Running a sum(columnA) query on the ORC+ZLIB version takes 11132 sec cumulative CPU time versus 10858 sec on the sequence file version. Theoretically ORC+ZLIB should have calculated the sum much faster than the sequence file. Is there a specific reason for this result?
02-20-2018
11:59 AM
Thanks for the reply @haraldberghoff. Correct me where I'm wrong:
a) The client reads the file (which we wish to compress, i.e. write into an ORC file).
b) This read happens in the normal way, by reaching out to the name node and getting block addresses.
c) The client then applies all the ORC logic to this data (creates stripes, indexes, the wrapper, etc.).
d) A typical write operation is then carried out, again reaching out to the name node for new block addresses.
e) The only part where ORC comes in is (c), which is independent of the name nodes and data nodes.
02-20-2018
11:27 AM
Thanks for the reply, Owen. I have a doubt: these minimum and maximum values are used for skipping files and stripes, right? But if the data is not sorted, not many stripes or files will be skipped. So how do reads become significantly faster in ORC?
02-20-2018
10:08 AM
One more doubt: when we create an ORC table and load data into it from an existing one, what is the flow of data? I mean, the data to be compressed sits on the data nodes, and it now has to be processed (striping and indexing) for ORC. Can you briefly explain how this works?
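To be explicit, the kind of operation I am asking about is the usual table-to-table copy below (placeholder names, just a sketch). My understanding is that the ORC striping and indexing would happen inside the map/reduce tasks that execute this statement rather than on the name node, but that is exactly what I would like confirmed:

-- Hypothetical conversion of an existing table into ORC
CREATE TABLE target_orc STORED AS ORC AS
SELECT * FROM source_table;

-- Or, into an existing ORC table:
-- INSERT OVERWRITE TABLE target_orc SELECT * FROM source_table;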
02-19-2018
10:44 AM
I am trying to understand the ORC file format. The documentation says a file is divided into fixed-size stripes. But from my basic HDFS understanding, wouldn't a stripe itself have to be split across the blocks stored on the data nodes? Am I correct?
Labels:
- Apache Hadoop
- Apache Hive
02-19-2018
06:34 AM
My table test_orc contains (for one partition):

col1  col2  part1
abc   def   1
ghi   jkl   1
mno   pqr   1
koi   hai   1
jo    pgl   1
hai   tre   1

hive --orcfiledump /hive/user.db/test_orc/part1=1/000000_0 gives this output:

Structure for /hive/a0m01lf.db/test_orc/part1=1/000000_0
2018-02-18 22:10:24 INFO: org.apache.hadoop.hive.ql.io.orc.ReaderImpl - Reading ORC rows from /hive/a0m01lf.db/test_orc/part1=1/000000_0 with {include: null, offset: 0, length: 9223372036854775807}
Rows: 6
Compression: ZLIB
Compression size: 262144
Type: struct<_col0:string,_col1:string>
Stripe Statistics:
  Stripe 1:
    Column 0: count: 6
    Column 1: count: 6 min: abc max: mno sum: 17
    Column 2: count: 6 min: def max: tre sum: 18
File Statistics:
  Column 0: count: 6
  Column 1: count: 6 min: abc max: mno sum: 17
  Column 2: count: 6 min: def max: tre sum: 18
Stripes:
  Stripe: offset: 3 data: 58 rows: 6 tail: 49 index: 67
    Stream: column 0 section ROW_INDEX start: 3 length 9
    Stream: column 1 section ROW_INDEX start: 12 length 29
    Stream: column 2 section ROW_INDEX start: 41 length 29
    Stream: column 1 section DATA start: 70 length 20
    Stream: column 1 section LENGTH start: 90 length 12
    Stream: column 2 section DATA start: 102 length 21
    Stream: column 2 section LENGTH start: 123 length 5
    Encoding column 0: DIRECT
    Encoding column 1: DIRECT_V2
    Encoding column 2: DIRECT_V2

I did not understand the Stripes part! And how do they calculate the sum of a column with string values?
Labels:
- Apache Hadoop
- Apache Hive
02-16-2018
06:36 AM
Even with columnar formats like Parquet, my files are turning out to be bigger than sequence files. I wanted to know whether columnar compression is a sure-shot way to reduce size, or whether there are kinds of data for which it fails.
Labels:
- Apache Hadoop
- Apache Hive