Member since: 04-24-2017
Posts: 106
Kudos Received: 13
Solutions: 7
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 1420 | 11-25-2019 12:49 AM
 | 2506 | 11-14-2018 10:45 AM
 | 2258 | 10-15-2018 03:44 PM
 | 2126 | 09-25-2018 01:54 PM
 | 1948 | 08-03-2018 09:47 AM
09-11-2018
02:06 PM
Here is the extended explain output of the DataFrame (LLAP is not enabled in our cluster):
df: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [importtime: timestamp, tester: string ... 21 more fields]
== Parsed Logical Plan ==
GlobalLimit 1
+- LocalLimit 1
+- Project [importtime#0, tester#1, testerhead#2, matchcode#3, revision#4, usertext#5, teststufe#6, temp#7, filedat#8, starttime#9, endtime#10, lotnrc#11, wafernr#12, testname#13, testnumber#14, testtype#15, unit#16, highl#17, lowl#18, highs#19, lows#20, valuelist#21, hashvalue#22]
+- SubqueryAlias mytable
+- Relation[importtime#0,tester#1,testerhead#2,matchcode#3,revision#4,usertext#5,teststufe#6,temp#7,filedat#8,starttime#9,endtime#10,lotnrc#11,wafernr#12,testname#13,testnumber#14,testtype#15,unit#16,highl#17,lowl#18,highs#19,lows#20,valuelist#21,hashvalue#22] orc
== Analyzed Logical Plan ==
importtime: timestamp, tester: string, testerhead: string, matchcode: string, revision: string, usertext: string, teststufe: string, temp: string, filedat: date, starttime: timestamp, endtime: timestamp, lotnrc: string, wafernr: string, testname: string, testnumber: string, testtype: string, unit: string, highl: string, lowl: string, highs: string, lows: string, valuelist: string, hashvalue: int
GlobalLimit 1
+- LocalLimit 1
+- Project [importtime#0, tester#1, testerhead#2, matchcode#3, revision#4, usertext#5, teststufe#6, temp#7, filedat#8, starttime#9, endtime#10, lotnrc#11, wafernr#12, testname#13, testnumber#14, testtype#15, unit#16, highl#17, lowl#18, highs#19, lows#20, valuelist#21, hashvalue#22]
+- SubqueryAlias mytable
+- Relation[importtime#0,tester#1,testerhead#2,matchcode#3,revision#4,usertext#5,teststufe#6,temp#7,filedat#8,starttime#9,endtime#10,lotnrc#11,wafernr#12,testname#13,testnumber#14,testtype#15,unit#16,highl#17,lowl#18,highs#19,lows#20,valuelist#21,hashvalue#22] orc
== Optimized Logical Plan ==
InMemoryRelation [importtime#0, tester#1, testerhead#2, matchcode#3, revision#4, usertext#5, teststufe#6, temp#7, filedat#8, starttime#9, endtime#10, lotnrc#11, wafernr#12, testname#13, testnumber#14, testtype#15, unit#16, highl#17, lowl#18, highs#19, lows#20, valuelist#21, hashvalue#22], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
+- CollectLimit 1
+- *FileScan orc mydb.mytable[importtime#0,tester#1,testerhead#2,matchcode#3,revision#4,usertext#5,teststufe#6,temp#7,filedat#8,starttime#9,endtime#10,lotnrc#11,wafernr#12,testname#13,testnumber#14,testtype#15,unit#16,highl#17,lowl#18,highs#19,lows#20,valuelist#21,hashvalue#22] Batched: false, Format: ORC, Location: CatalogFileIndex[hdfs://hdp-m-01:8020/apps/hive/warehouse/mydb.db/mytab..., PartitionCount: 1000, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<importtime:timestamp,tester:string,testerhead:string,matchcode:string,revision:string,user...
== Physical Plan ==
InMemoryTableScan [importtime#0, tester#1, testerhead#2, matchcode#3, revision#4, usertext#5, teststufe#6, temp#7, filedat#8, starttime#9, endtime#10, lotnrc#11, wafernr#12, testname#13, testnumber#14, testtype#15, unit#16, highl#17, lowl#18, highs#19, lows#20, valuelist#21, hashvalue#22]
+- InMemoryRelation [importtime#0, tester#1, testerhead#2, matchcode#3, revision#4, usertext#5, teststufe#6, temp#7, filedat#8, starttime#9, endtime#10, lotnrc#11, wafernr#12, testname#13, testnumber#14, testtype#15, unit#16, highl#17, lowl#18, highs#19, lows#20, valuelist#21, hashvalue#22], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
+- CollectLimit 1
+- *FileScan orc mydb.mytable[importtime#0,tester#1,testerhead#2,matchcode#3,revision#4,usertext#5,teststufe#6,temp#7,filedat#8,starttime#9,endtime#10,lotnrc#11,wafernr#12,testname#13,testnumber#14,testtype#15,unit#16,highl#17,lowl#18,highs#19,lows#20,valuelist#21,hashvalue#22] Batched: false, Format: ORC, Location: CatalogFileIndex[hdfs://hdp-m-01:8020/apps/hive/warehouse/mydb.db/mytab..., PartitionCount: 1000, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<importtime:timestamp,tester:string,testerhead:string,matchcode:string,revision:string,user...
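As a note on how I plan to double-check this (just a sketch; it assumes filedat is the partition column, which the plan above does not actually confirm): if partition pruning works, the FileScan line should show a non-empty PartitionFilters list instead of PartitionFilters: [].
%spark
// Hypothetical check: filedat is only assumed to be the partition column here.
// Compare the PartitionFilters entry in the physical plan with and without the
// predicate on the partition column.
val dfCheck = sqlContext.sql("SELECT * FROM mydb.mytable WHERE filedat = cast('2018-09-01' as date) LIMIT 1")
dfCheck.explain(true)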
09-11-2018
12:01 PM
I have a huge Hive table (ORC) and I want to select just a few of its rows (in Zeppelin):
%spark
sqlContext.setConf("spark.sql.orc.filterPushdown", "true")
val df1 = sqlContext.sql("SELECT * FROM mydb.myhugetable LIMIT 1") // Takes 10 mins
val df2 = sqlContext.sql("SELECT * FROM mydb.myhugetable").limit(1) // Takes 10 mins
Neither the LIMIT clause in the SQL statement nor the corresponding DataFrame method df.limit(1) helps; the query still takes far too long. It seems to read the whole table first and only then return the n rows. How can I get the limit applied while the query runs, so that it finishes faster? Shouldn't the filter pushdown help here? I can't see any difference between setting spark.sql.orc.filterPushdown to true or false.
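One workaround I am considering (only a sketch, untested; it assumes the table is partitioned and that filedat is the partition column, neither of which is confirmed above): add a predicate on the partition column so that Spark only lists and scans a single partition before applying the limit.
%spark
// Assumption: filedat is the partition column of mydb.myhugetable (hypothetical).
// With partition pruning, only that partition's ORC files are read, and the
// LIMIT then only touches a handful of rows.
val dfOnePartition = sqlContext.sql("SELECT * FROM mydb.myhugetable WHERE filedat = cast('2018-09-01' as date) LIMIT 1")
dfOnePartition.show()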
Thank you!
09-03-2018
07:15 AM
Thank you! Can you give me some details about this or do you have some helpful links?
08-15-2018
07:35 AM
I have a huge Hive table that I want to process in Spark. To avoid a full table scan for my test purposes, I first run this code:
%spark
val df1 = sqlContext.sql("select * from mydb.mytable limit 10")
df1.show()
That works very fast (1 sec). Now I added another query to "transform" the result a little bit, and now Hive seems to do a full table scan, as this execution takes more than 10 minutes!
%spark
val df2 = sqlContext.sql("SELECT concat_ws(\":\", lpad(cast((hashvalue % 200) as String), 3, \"0\"), concat_ws(\"_\", name, number, test, step, cast(starttime as STRING))) as key, hashvalue, valuelist from (select * from mydb.mytable limit 10) a")
df2.show()
1. Is this normal behavior?
2. Is there a way to avoid full table scans when using sub-queries?
Thank you!
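A workaround sketch I might try (untested; it assumes Spark 2.x, where createOrReplaceTempView is available - on Spark 1.6, registerTempTable would be the equivalent): materialize the 10-row sample first, then run the transformation against the cached result, so that the concat/lpad projection never touches the full table.
%spark
// Materialize the small sample once and cache it.
val sample = sqlContext.sql("select * from mydb.mytable limit 10").cache()
sample.count()  // forces the limit to execute and populates the cache
sample.createOrReplaceTempView("mytable_sample")  // hypothetical view name
// The transformation now only reads the 10 cached rows.
val df2b = sqlContext.sql("SELECT concat_ws(\":\", lpad(cast((hashvalue % 200) as String), 3, \"0\"), concat_ws(\"_\", name, number, test, step, cast(starttime as STRING))) as key, hashvalue, valuelist FROM mytable_sample")
df2b.show()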
Labels: Apache Hive, Apache Spark, Apache Zeppelin
08-08-2018
08:40 AM
1 Kudo
I created a Hive table that accesses an HBase table. Something like this:
create external table hbase_50mio_200regions (rowkey String, valuelist String)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:valueList")
TBLPROPERTIES ('hbase.table.name' = 'hbase_50mio_200regions')
The access works, and HiveQL queries are fast against small HBase tables. The problem is queries against huge HBase tables: even a simple count takes very long (more than an hour on a 50 million row HBase table with about 1300 GB of HFiles)! Is there a way to pass a start and stop key to my Hive SELECT or COUNT queries? As I read in this answer https://stackoverflow.com/a/40729818, WHERE conditions won't avoid a full table scan, because they only filter the rows after iterating over all rows of the HBase table. Is this right, and/or is there another way to avoid full table scans when accessing my HBase table through Hive?
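For comparison (not a Hive-level solution, just a sketch of what a bounded read against the same table looks like through the HBase Java API from Scala, e.g. in spark-shell; the key values are made up and the HBase client jars are assumed to be on the classpath): passing a start and stop row to a Scan avoids the full table scan entirely.
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Scan}
import org.apache.hadoop.hbase.util.Bytes
import scala.collection.JavaConverters._

val hbaseConf = HBaseConfiguration.create()
val conn = ConnectionFactory.createConnection(hbaseConf)
val table = conn.getTable(TableName.valueOf("hbase_50mio_200regions"))

val scan = new Scan()
scan.setStartRow(Bytes.toBytes("rowkey_start"))  // inclusive start key (example value)
scan.setStopRow(Bytes.toBytes("rowkey_stop"))    // exclusive stop key (example value)

val scanner = table.getScanner(scan)
try {
  var count = 0L
  scanner.asScala.foreach(_ => count += 1)
  println(s"rows in key range: $count")
} finally {
  scanner.close()
  table.close()
  conn.close()
}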
Labels: Apache HBase, Apache Hive
08-07-2018
02:14 PM
Is there a possibility to query a salted HBase table using the HBase Shell? E.g. my HBase table row keys look like this:
000:rowkey0
000:rowkey2
...
001:rowkey1
001:rowkey3
...
How can I use my HBase Shell to query this table without doing a full table scan, e.g. when I want to read out the row with rowkey1? Thank you!
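Not the shell, but the same idea against the HBase Java API from Scala (a sketch with made-up names): if the salt prefix cannot be recomputed on the client, issue one Get per possible salt bucket in a single multi-get instead of scanning the whole table; if the salt is something like hash % N, the single correct prefix can be computed directly.
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get}
import org.apache.hadoop.hbase.util.Bytes
import scala.collection.JavaConverters._

val conn = ConnectionFactory.createConnection(HBaseConfiguration.create())
val table = conn.getTable(TableName.valueOf("my_salted_table"))  // hypothetical table name

// One Get per possible salt bucket (example: prefixes 000..199), sent as one multi-get.
val gets = (0 until 200).map(b => new Get(Bytes.toBytes(f"$b%03d:rowkey1"))).asJava
val results = table.get(gets)  // Table.get(java.util.List[Get]): no full scan involved
results.filterNot(_.isEmpty).foreach(r => println(Bytes.toString(r.getRow)))

table.close()
conn.close()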
Labels: Apache HBase
08-06-2018
12:40 PM
@Josh Elser Thank you for that information. I changed my HBase table creation to the following command:
create 'hbase_1m_10r', {NAME => 'cf'}, {SPLITS => ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']}
When running the following query:
INSERT OVERWRITE TABLE dmueller.hbase_1m_10r SELECT concat_ws(":", cast((hashvalue % 10) as String), concat_ws("_", name, number, test, step, cast(starttime as STRING))) as k, valuelist from (select * from testdb.test_orc limit 1000000) a distribute by split(k, ":")[0] sort by k
I still get only 1 reducer... Any idea why?
08-03-2018
09:47 AM
Found the following article: http://www.openkb.info/2017/05/hive-on-tez-how-to-control-number-of.html It helped me. I now set the number of tasks to a fixed amount.
08-03-2018
07:34 AM
I found the following article about how to fill an HBase table with data from Hive: https://community.hortonworks.com/articles/2745/creating-hbase-hfiles-from-an-existing-hive-table.html I followed the steps myself, and it seems to work. The problem appears when I run the following HiveQL:
set hfile.family.path=/tmp/test_hbase/cf
set hive.hbase.generatehfiles=true
INSERT OVERWRITE TABLE testdb.test_hbase SELECT distinct concat_ws("_", name, number, test, step, cast(starttime as STRING)) as k, hashValue, valuelist from testdb.test_orc order by k, hashvalue limit 1000
I need to combine 4 columns to get a unique row key for my HBase table. Another complication is that my valueList column can contain huge strings, between 0 and 1 MB. When I run the query, Tez creates 100 containers for the map tasks. This takes a few minutes to complete, which is also slow for 1000 rows, but OK. After the map step, a reduce step follows, and this is probably the problem: there is only 1 reducer for this huge amount of data. That seems too few, as the job now takes hours (and still hasn't completed!). My questions here:
1. What are the map and reduce steps doing in this scenario?
2. Why is there only 1 reducer?
3. Can I somehow change this behavior (e.g. disable reducing or use more reducers)?
Thank you!
08-01-2018
08:12 PM
Good question. I'm just running the insert into ... select * from ... command, as I could do e.g. in Beeline or the Ambari Hive View (JDBC). Is this running in single-insert or batch mode?
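(To clarify for myself what "batch mode" would mean on the client side, here is a sketch of explicitly batched writes through the HBase Java API from Scala - all names are made up, and this is not necessarily what Hive does internally.)
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes

val conn = ConnectionFactory.createConnection(HBaseConfiguration.create())
// A BufferedMutator buffers Puts client-side and flushes them to the region
// servers in batches, instead of issuing one RPC per row ("single insert").
val mutator = conn.getBufferedMutator(TableName.valueOf("hbase_1m_10r"))
(1 to 1000).foreach { i =>
  val put = new Put(Bytes.toBytes(s"rowkey$i"))  // example key
  put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("valueList"), Bytes.toBytes(s"value$i"))
  mutator.mutate(put)
}
mutator.flush()  // sends any remaining buffered mutations
mutator.close()
conn.close()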