Member since: 04-24-2017
Posts: 106
Kudos Received: 13
Solutions: 7

My Accepted Solutions
Title | Views | Posted
---|---|---
| 1420 | 11-25-2019 12:49 AM
| 2508 | 11-14-2018 10:45 AM
| 2258 | 10-15-2018 03:44 PM
| 2126 | 09-25-2018 01:54 PM
| 1948 | 08-03-2018 09:47 AM
08-01-2018
07:05 PM
@Josh Elser Thank you for your support. I updated my question with some additional information to answer a few of the questions. Where can I find information in the logs? Can you tell me which file(s) on which server (HBase Master, HBase RegionServer) are helpful? I will do some more "benchmarking" and log searching tomorrow. Any hints or assumptions yet?
08-01-2018
02:13 PM
Formatting of code is not saved, sorry!
08-01-2018
02:11 PM
I'm playing around with HBase tables that are managed by Hive. For that I run the following commands in Zeppelin:

%hbase
# Create table with 1 CF, pre-split with 10 split points
create 'my_test', {NAME => 'cf1', VERSIONS => 3}, {SPLITS => ['00', '01', '02', '03', '04', '05', '06', '07', '08', '09']}

After creating the HBase table, I create an external Hive table with the HBaseStorageHandler:

%hive
-- CREATE AN EXTERNAL HIVE TABLE ON TOP OF THE HBASE TABLE
CREATE EXTERNAL TABLE dmueller.my_test(
  key String,
  hashvalue int,
  valuelist String
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:hashValue,cf1:valueList")
TBLPROPERTIES("hbase.table.name" = "my_test", "hbase.mapred.output.outputtable" = "my_test")

Then I fill the HBase table by reading data (1000 rows) from another Hive table into this external Hive table:

%hive
-- INSERT DATA INTO THE HBASE-BACKED TABLE
INSERT OVERWRITE TABLE dmueller.my_test
SELECT concat_ws("_", testname, lotnrc, testnumber, teststufe) as key, hashvalue, valuelist
FROM dmueller.hivetable
LIMIT 1000

The problem is that the INSERT statement is far too slow: it takes around 4 minutes to insert the 1000 rows (6 columns read per row). dmueller.hivetable is a partitioned ORC table with around 60 million rows and ~500 GB of ORC files in HDFS. How can this INSERT statement be made faster? What am I doing wrong?

Update - some more information:
- I'm using Hive with Tez.
- Writing less data takes less time, but even a single row already takes ~30 sec, so a large part of the runtime seems to be fixed overhead rather than proportional to the row count.
- Reading the same data (e.g. the 1000 rows) directly via the Zeppelin Hive (JDBC) interpreter takes ~2 sec.
- Re-running the same INSERT statement (e.g. with 1000 rows) always takes roughly the same time (+/- a few seconds, of course).
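For comparison, here is a rough sketch (an editorial addition, not part of the original question) that writes 1000 dummy rows straight through the HBase client API from a %spark paragraph, assuming the HBase client jars and hbase-site.xml are available to the interpreter; the row keys and values are made up. If this finishes in seconds, the 4 minutes are mostly Hive/Tez job overhead rather than the HBase write path itself:

%spark
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes
import scala.collection.JavaConverters._

// Connect using the cluster's hbase-site.xml
val hbaseConf = HBaseConfiguration.create()
val connection = ConnectionFactory.createConnection(hbaseConf)
val table = connection.getTable(TableName.valueOf("my_test"))

// 1000 dummy rows; the two-digit prefix matches the 00..09 split points above
val puts = (1 to 1000).map { i =>
  val put = new Put(Bytes.toBytes(f"${i % 10}%02d_testrow_$i"))
  put.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("hashValue"), Bytes.toBytes(i))
  put.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("valueList"), Bytes.toBytes(s"value_$i"))
  put
}
table.put(puts.asJava)   // single batched client-side write

table.close()
connection.close()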
Labels:
- Apache HBase
- Apache Hive
07-18-2018
08:59 AM
I created an ORC table in Hive (stored in the HDFS path /apps/hive/warehouse/mydb.db/mytable). As I sometimes need to add a few rows manually, I call the INSERT statement. This creates many small files in the table directory in HDFS, which is the expected behavior. Then I run the command ALTER TABLE mydb.mytable CONCATENATE; to merge these small files into bigger ones. What I'm observing is that sometimes all small files are merged into one big file (~80 MB), and sometimes the big file is left alongside some small files (a few KB each) that do not seem to get merged. Is this normal behavior of the CONCATENATE command? Is there a way to influence this behavior (to avoid being left with these small files after the CONCATENATE command)? Thank you!
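As a quick way to see what actually remains on disk (an editorial sketch, run e.g. from a %spark paragraph with the cluster's Hadoop configuration; the path is the one from the question), the table directory can be listed with file sizes, so the small files that survive the CONCATENATE are easy to spot:

%spark
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// List every file in the ORC table directory together with its size
val fs = FileSystem.get(new Configuration())
val tableDir = new Path("/apps/hive/warehouse/mydb.db/mytable")
fs.listStatus(tableDir)
  .filter(_.isFile)
  .foreach(s => println(f"${s.getLen}%12d bytes   ${s.getPath.getName}"))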
Labels:
- Apache Hive
07-10-2018
01:54 PM
Yes, I'm familiar with Spark. What I wondered about was the caching behavior. Spark really seems to know which HiveQL statement the cached data belongs to, and re-uses it automatically when the same query comes in:
val df1_1 = sqlContext.sql("SELECT a, b FROM db.table limit 1000000")
val df1_2 = df1_1.cache()
df1_2.count()
// This re-uses the cached object, as the request is the same as before => very fast!
val df2_1 = sqlContext.sql("SELECT a, b FROM db.table limit 1000000")
val df2_2= df2_1.cache()
df2_2.count()
// This caches the data, because the request is different (another limit clause) => takes some time!
val df3_1 = sqlContext.sql("SELECT a, b FROM db.table limit 10")
val df3_2= df3_1.cache()
df3_2.count()
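One way to confirm this re-use (an editorial note, not part of the original reply): when a matching cached plan exists, the physical plan of a newly created DataFrame contains an InMemoryTableScan over an InMemoryRelation instead of a fresh Hive table scan, which explain makes visible:

// Same query text as df2_1 above; the printed plan should contain InMemoryTableScan
val df2_explained = sqlContext.sql("SELECT a, b FROM db.table limit 1000000")
df2_explained.explain(true)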
Thanks for your help @Felix Albani
07-10-2018
01:04 PM
Thank you for the fast answer. The problem seems to be the cache method, as it always takes long in the first iteration and is faster in all following iterations. This behavior is independent of which paragraphs I run before... It's clear to me that the first action launches the application (and therefore starts the executors etc.). But it's not clear to me why Spark "knows" that it can re-use the cached dataframe from the previous iterations, because I overwrite the variables (or even use different ones).
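A minimal sketch of why the variable names don't matter (an editorial addition; the table and column names are only placeholders): the cache is keyed by the analyzed query plan, so any DataFrame that resolves to the same plan picks up the cached data, and dropping the cache through either reference affects both:

val first = sqlContext.sql("SELECT a, b FROM db.table limit 1000000").cache()
first.count()      // slow: materializes the cached blocks

val second = sqlContext.sql("SELECT a, b FROM db.table limit 1000000")
second.count()     // fast: identical plan, the cached data is re-used

first.unpersist()  // drop the cache via the first reference
second.count()     // slow again: the shared cache entry is gone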
07-10-2018
12:34 PM
I wrote a small Zeppelin paragraph that experiments with caching (reading Hive tables). Here is my code:

%spark
// Caching test #4: df.cache with 10 partitions
val input1 = sqlContext.sql("SELECT * FROM db.table").repartition(10)
val input2 = input1.cache()
input2.count()
The first time I run this paragraph it takes about 10 minutes to finish. When I run it a few more times, it always needs between 0.5 and 1 second. I added another paragraph to "initialize" the Spark interpreter after restarting it (I want to avoid these differing run-times for the paragraph above):

%spark
sc.version
sqlContext.sql("select * from db.table limit 1").show()
This paragraph also needs more time in the first iteration and less in the following ones. BUT: the main paragraph (with the caching test) still takes that long in its first run! Am I doing something wrong here? Is Zeppelin re-using the cached elements in the later iterations (even though I "overwrite" the objects by reading the whole table again before calling the cache method)? Is there a difference between the two transformations df.cache() and sqlContext.cacheTable(name)? Thanks for your help!
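On the last question, a hedged sketch (editorial addition, table name taken from the paragraph above): sqlContext.cacheTable registers the whole table by name in Spark SQL's shared cache, while df.cache() registers the result of one specific query plan; both are materialized by the first action that scans them, and both can be dropped explicitly:

%spark
// Cache the table by name: later queries that scan db.table read the cached data
sqlContext.cacheTable("db.table")
sqlContext.sql("SELECT * FROM db.table").count()   // materializes the cache
println(sqlContext.isCached("db.table"))           // true
sqlContext.uncacheTable("db.table")                // frees the cached blocks

// Cache one specific query result (here: the repartitioned full scan)
val df = sqlContext.sql("SELECT * FROM db.table").repartition(10).cache()
df.count()                                          // materializes this plan's cache
df.unpersist()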
04-26-2018
03:06 PM
Here is my workflow so far:
04-26-2018
03:04 PM
I also tried to "generate" the XML with the GenerateFlowFile processor, but I still get the same problem (I thought it might have something to do with the XML I read in, but that does not seem to be the case).
04-26-2018
02:51 PM
Thanks for your fast answer! I checked the settings and there are definitely no upper/lower case problems. I just saw that the NiFi version is 1.4, not 1.5. Is there a problem with this processor in that version?